A Practical Guide to Theoretical Probability

Journal of Risky Logic (2016) 1:119-129

Clark D. Carrington¹
1. Mixed Probability Calculations
For personal decisions, the evidential form of probability is more familiar. Every
time a decision is made on the basis of a particular assertion being “probably true,”
theoretical probability is involved. However, the probability of chance, which is
amenable to mathematical treatment and is the main form found in academic
discourse, is very familiar too. The relative importance of the two probabilities can
vary with the problem. Sometimes one or the other will dominate, while in other
instances both are important. Recreational betting games serve as an example:

- Roulette. Betting on a roulette wheel is purely a game of chance. The odds and a long-term expected return can be calculated very accurately.
- Horse Racing. In theory, some horses are faster than others – chance has very little to do with who wins. Sure, historical records are important, but that’s mainly because they indicate which horses are fast and which ones are not.
- Poker. The odds that a certain card or cards will turn up can be calculated, and the game of poker can be played simply as a game of chance. But good poker players also take the mannerisms of their opponents into account when they bet, which turns poker into a mixed probability game.
It’s the last category of problems that makes risk analysis interesting: giving the
statistical and theoretical probabilities the same epistemic standing becomes essential.
So, when Franklin (2001) correctly states that “the degree to which evidence supports
hypotheses in law or science is not usually quantified”, that’s a problem². Either the
mathematical probabilities of chance are going to have to be converted to words, or
the verbal determinations of theoretical probability are going to have to be put on a
scale of 0 to 1 – even if the numbers are no better than the words³. That doesn’t have
to be done formally or objectively; poker players do it all in their heads. Horse
theorists presumably use similar techniques when figuring the odds on a horse.
1. Division of Applied Epistemology, Institute for Risky Logic, Stanardsville, VA 22973.
2. Franklin J (2001). The Science of Conjecture. Johns Hopkins University Press, p. xi.
3. Fox J, Krauss P, Elvang-Gøransson M (1993). Argumentation as a General Framework for Uncertain Reasoning. In: Uncertainty in Artificial Intelligence: Proceedings of the Ninth Conference. Ed. by Heckerman D and Mamdani A. Morgan Kaufmann, San Mateo, CA, pp. 428-434.
If you are betting on a single instance (i.e. what to do now), then boiling down
theoretical probability and statistical probability into a single judgment or number is
essential. A simple equation will suffice to represent this notion:
pTotal = pTheory * pChance
If the roulette wheel is fair, then pTheory = 1, and pTotal is dominated by the
calculation of the probability of chance. If the fastest horse always wins, then
pChance is 1, and pTotal is dominated by pTheory. When gamblers bet on a horse, converting horse
theory to numerical odds is exactly what they do. Perhaps they use a probability tree
(see chapter 2) where each node is a different horse. Poker players have a tougher
calculation – not only do they have to know the odds of a card turning up, they also
have to assign a probability to the notion that bluffing will work, or that their
opponent is bluffing.
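The equation can be applied directly. A minimal sketch follows; the poker numbers – the flush-draw odds and the 0.9 judgment that a made flush wins – are hypothetical illustrations, not values from the text:

```python
def p_total(p_theory: float, p_chance: float) -> float:
    """Combine a theoretical probability with a probability of chance."""
    return p_theory * p_chance

# Roulette: a fair wheel means pTheory = 1, so pTotal is pure chance.
# Betting a single number on a double-zero (38-slot) wheel:
print(round(p_total(1.0, 1 / 38), 4))    # 0.0263

# Poker: chance of completing a flush draw on the river (9 outs in 46
# unseen cards), discounted by a judgment that a made flush wins 90%
# of the time (a theoretical probability, held in the player's head).
print(round(p_total(0.9, 9 / 46), 4))    # 0.1761
```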
But once the bet becomes about the long run, or about public health instead of an
individual, then the calculation is quite different. It’s a two-dimensional problem
where the primary goal is to predict the frequency of a result or of different results,
and there will also be uncertainty about the estimated frequencies. The probability
calculation isn’t the same any more. The probability of chance is often a statistical
frequency instead, and it may or may not be a theoretical frequency. For example,
there can be a range of statistical estimates which can be purely empirical, purely
theoretical, or an empirically supported theory. In the first category, a historical
record with a large number of observations may justify a frequency estimate with no
theoretical uncertainty. On the other hand, a smaller number of observations may
serve to support a statistical theory instead, which begets theoretical uncertainty.
The frequency calculation is now a function instead of a single number, so the
relationship between theoretical probability and the frequency of occurrence is now a
function that is potentially far more complicated:
p(Frequency) = pTheory(pChance)
Empirical observations may be used to disprove a statistical theory too, but it will
take many of them. For example, a large number of observations may show a
particular die to be unfair. Then again, there may only be enough data to favor one
theory over another without being able to conclusively decide that one is indubitably
correct. That means you are going to need a probability tree.
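To illustrate that situation with a sketch (the loaded-die model and the roll data are invented): a handful of rolls shifts weight toward one theory without conclusively ruling out the other, so both remain nodes on the tree.

```python
from math import prod

# Two competing statistical theories about a six-sided die
theories = {
    "fair":   [1 / 6] * 6,
    "loaded": [0.1, 0.1, 0.1, 0.1, 0.1, 0.5],  # hypothetical bias toward 6
}
rolls = [6, 6, 3, 6]  # too few observations to settle the question

# Likelihood of the data under each theory, then normalized onto a
# two-node probability tree (starting from equal weights)
likelihood = {name: prod(faces[r - 1] for r in rolls)
              for name, faces in theories.items()}
total = sum(likelihood.values())
p_theory = {name: lk / total for name, lk in likelihood.items()}

print(round(p_theory["loaded"], 2))   # 0.94 – favored, but not conclusive
```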
Purely statistical probability schemes often acknowledge theoretical probability (e.g.
as “systematic error”), but then usually go on to treat it as a done deal or as an
insoluble problem. On the other hand, Bayesian probability schemes typically treat
theoretical and statistical probabilities interchangeably⁴. If you are betting on a single
instance, that works reasonably well. Updating a theoretical prior with data can
gradually transform the probability into one of chance – the more data there are, the
less the theory matters – but that isn’t really very scientific. If the data were used to
discriminate among alternative theories, the data might be put to better use. That
problem is even more critical for the estimation of long run frequencies. Updating
the parameter estimates for a theory that is in the process of being proven to be wrong
doesn’t make much sense.
Since it really is more consistent with how scientific knowledge is developed,
explicitly assigning probabilities to theories is a better strategy for long-term issues
where knowledge may be expected to progress. Since theoretical probabilities are
inherently subjective, it is hard to improve upon convening a panel of experts to
weigh the scientific evidence. Even if the experts don’t get it quite right, or they
aren’t the right sort of experts, the process of assigning probabilities to competing
theories creates an occasion for scientific discussion. As long as no one thinks that
the numerical probabilities assigned to theories are anything more than a way of
getting through the decisions of the day, they won’t get in the way of that⁵.
Translating numerical characterizations of theoretical or total probability into legal
evidentiary standards (see chapter 1) is also possible. At least one translation is
straightforward: if one has a distribution of probable outcomes, the as-likely-as-not
standard corresponds to the middle of it, and it doesn’t matter whether the probability
is statistical, theoretical, objective, or subjective. It is tempting to equate “beyond a
reasonable doubt” with a probability of 0.95, but there really is no precedent for it.
When quantifying theoretical probability, it becomes important to remember that
uncertainty is not the same as frequency (see chapter 2), and therefore a theoretical or
total probability cannot be used to predict the frequency of an outcome. As a recent
example of the problem, Trasande et al (2015)⁶ provided an overview of the efforts to
characterize the theoretical probabilities for causal theories involving potential health
effects of Endocrine Disrupting Chemicals (EDCs):
We now describe the general methods used to attribute disease and disability
to EDCs, to weigh the probability of causation based upon the available
evidence, and to translate attributable disease burden into costs. During a
2-day workshop in April 2014, five expert panels identified conditions where
the evidence is strongest for causation and developed ranges for fractions of
disease burden that can be attributed to EDCs.

4. Howson C and Urbach P (1993). Chapter 13: Objective Probability. In: Scientific Reasoning: The Bayesian Approach. Open Court, Peru, IL, pp. 319-351.
5. This is a concern raised by Morgan GM (2014). Use (and abuse) of expert elicitation in support of decision making for public policy. PNAS 111:7176–7184.
6. Trasande L, Zoeller RT, Hass U, Kortenkamp A, Grandjean P, Myers JP, DiGangi J, Bellanger M, Hauser R, Legler J, Skakkebaek NE, and Heindel JJ (2015). Estimating Burden and Disease Costs of Exposure to Endocrine-Disrupting Chemicals in the European Union. J Clin Endocrinol Metab 100:1245–1255.
There are more than a few quibbles that could be raised about exactly what they
did, ranging from how the problems were characterized in the first place (i.e. by
presuming independent attributable risks), to the use of implausible dose-response
models, the lack of serious consideration of other (i.e. non-EDC) causal factors, and
the treatment of the relationship between association and causation as all-or-none.
Also, because the probability assignments are subjective, a two-day workshop of
experts with similar interests is perhaps insufficient for a decision involving the
economic impacts that are alleged. However, praise for the process itself is deserved;
even if the probabilities aren’t entirely reasonable, putting them out for criticism
starts a useful process. Nonetheless, as the present topic of discussion, there is one
error in how the theoretical probability was employed after it was arrived at that must
not go unnoticed:
Finally, recognizing that attributable cost estimates were accompanied by a
probability, we performed a series of Monte Carlo simulations to produce
ranges of probable costs across all the exposure-outcome relationships,
assuming independence of each probabilistic event. Separate random number
generation events were used to assign 1) causation or not causation, and 2)
cost given causation, using the base case estimate as well as the range of
sensitivity analytic inputs produced by the expert panel. To illustrate with an
example, for an exposure-outcome relationship with an 80% probability of
causation, random values between 0 and 1 in each simulation led to the first
step, which either assigned no costs (random value ≤ 0.2) and costs (random
value > 0.2).
If the problem required the combination of both theoretical and statistical
probabilities, the use of a probability tree in a Monte Carlo simulation would be
appropriate. However, there is a problem in implementation that arises from the fact
that a causal probability is NOT a frequency or a probability of chance: a theory is
either true all the time or false all the time, and the entire cost estimate is dependent
on the truth of the theory. So, no, you can’t assume independence, and using a causal
probability to calculate the frequency of an event is inappropriate. Instead, the logic
should go like this: since all of the causal probabilities are less than 95%, the lower
bound cost estimate for all of the endpoints should be zero (see table four). For those
endpoints with a causal probability of less than 50%, the central estimate should be
zero as well.
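To make the point concrete, here is a minimal simulation sketch. The causal probability, the number of endpoints, and the costs are all hypothetical, not the published values: sampling causation independently for every endpoint in every iteration almost never produces a zero total, while treating the causal theory as true-or-false once per iteration puts the lower bound at zero.

```python
import random

random.seed(1)
N = 100_000
p_causal = 0.8          # expert causal probability (hypothetical)
costs = [100.0] * 10    # 10 endpoints sharing one causal theory (hypothetical)

totals_indep, totals_shared = [], []
for _ in range(N):
    # Independence assumption: causation re-drawn for every endpoint
    totals_indep.append(sum(c for c in costs if random.random() < p_causal))
    # Theory-level logic: the theory is true or false once, for all endpoints
    theory_true = random.random() < p_causal
    totals_shared.append(sum(costs) if theory_true else 0.0)

def pct(xs, q):
    """q-th percentile by sorting (adequate for a demonstration)."""
    return sorted(xs)[int(q * len(xs))]

print(pct(totals_indep, 0.05))   # lower bound is almost never zero
print(pct(totals_shared, 0.05))  # lower bound is zero, as argued above
```

Both schemes have the same mean cost; only the tails differ, which is exactly where the lower-bound estimate lives.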
2. Probabilities vs Weights
So, when a decision is at stake, but not otherwise, giving a numerical value to
theoretical probabilities is desirable. If a formal risk assessment is being developed
as part of a debate over public policy, then having a formal process for doing that is
rather essential. A 2009 paper by Swaen and van Amelsvoort⁷ does that for what is
perhaps the best known theoretical probability issue, namely determining whether or
not a chemical causes cancer. The technique uses a version of the approach first
described by Carnap⁸, in which it is supposed that the probability of a hypothesis is
directly determined by the evidence for it. It also proceeds by using the commonly
accepted Hill criteria as a means of identifying and grading evidence, giving a
numerical value to each category. While the actual probability assignments are
reasonable enough, the presentation of the technique is worthy of several objections:

- The paper claims to provide an “empirical basis” for assigning probabilities to theories. This they obviously did not do. Yes, they use the available data, and they outlined a formal procedure for determining causal probabilities, but between the data and the final probability lies a plethora of subjective judgments.
- The evidential values assigned to each category are labeled as probabilities, where each category of evidence is considered to have a “probability of being true”. For most of the categories, that seems odd. For example, they use Relative Risk to determine the Strength of association; how does a relative risk get turned into a probability of truth?
- The heading for the column of evidential probabilities is “%”. That rather obviously implies that they are thinking of the evidence as being statistical, even though frequency of occurrence cannot be equated with a probability of truth. That obviously invites the sort of misinterpretation exhibited by Trasande.
- There is no mention of what the alternative theories might be. Implicitly, the probability of there not being a causal relationship is one minus the causal probability, but that is never alluded to.
All of those problems can be avoided by using the Keynesian strategy (see chapter 2)
of separating numerical evidential weights from the theoretical probability
assignments. A probability tree, where the sum of all alternatives under consideration
is 1, can then be used to generate the theory probabilities by making them
proportional to the weights. For example, the probabilities for a two node tree would
be:
p(H1) = weight1/(weight1 + weight2)

and

p(H2) = weight2/(weight1 + weight2)

The advantage of that method is that each alternative theory can be evaluated
independently, and a new alternative theory can be added to the tree without
invalidating the evaluation of the other theories⁹.

7. Swaen G and van Amelsvoort L (2009). A Weight of the Evidence Approach to Causal Inference. J Clin Epidemiol 62:270-277.
8. Carnap R (1950). The Logical Foundations of Probability. University of Chicago Press, Chicago.
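The same normalization extends to any number of nodes. A minimal sketch, where the theory labels and weight values are hypothetical:

```python
def tree_probabilities(weights):
    """Normalize evidential weights into theory probabilities summing to 1."""
    total = sum(weights.values())
    return {theory: w / total for theory, w in weights.items()}

# Hypothetical evidential weights for two competing theories
weights = {"H1: causal": 3.0, "H2: not causal": 1.0}
print(tree_probabilities(weights))   # {'H1: causal': 0.75, 'H2: not causal': 0.25}

# Adding a third theory re-scales the probabilities, but the weights
# already assigned to H1 and H2 do not have to be re-evaluated.
weights["H3: confounded"] = 1.0
print(tree_probabilities(weights))   # H1 drops to 0.6
```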
3. Weight of the Evidence
Since 1972, the most renowned agency tasked with conducting weight-of-the-evidence
evaluations is the United Nations sponsored International Agency for
Research on Cancer (IARC). IARC-sponsored workgroups have evaluated hundreds
of chemicals and substances with the goal of placing each in one of four evidential
categories¹⁰:

- Sufficient evidence of carcinogenicity: The Working Group considers that a causal relationship has been established between exposure to the agent and human cancer. That is, a positive relationship has been observed between the exposure and cancer in studies in which chance, bias and confounding could be ruled out with reasonable confidence. A statement that there is sufficient evidence is followed by a separate sentence that identifies the target organ(s) or tissue(s) where an increased risk of cancer was observed in humans. Identification of a specific target organ or tissue does not preclude the possibility that the agent may cause cancer at other sites.
- Limited evidence of carcinogenicity: A positive association has been observed between exposure to the agent and cancer for which a causal interpretation is considered by the Working Group to be credible, but chance, bias or confounding could not be ruled out with reasonable confidence.
- Inadequate evidence of carcinogenicity: The available studies are of insufficient quality, consistency or statistical power to permit a conclusion regarding the presence or absence of a causal association between exposure and cancer, or no data on cancer in humans are available.
- Evidence suggesting lack of carcinogenicity: There are several adequate studies covering the full range of levels of exposure that humans are known to encounter, which are mutually consistent in not showing a positive association between exposure to the agent and any studied cancer at any observed level of exposure. The results from these studies alone or combined should have narrow confidence intervals with an upper limit close to the null value (e.g. a relative risk of 1.0). Bias and confounding should be ruled out with reasonable confidence, and the studies should have an adequate length of follow-up. A conclusion of evidence suggesting lack of carcinogenicity is inevitably limited to the cancer sites, conditions and levels of exposure, and length of observation covered by the available studies. In addition, the possibility of a very small risk at the levels of exposure studied can never be excluded.

9. Then again, it might. For example, if a new theory provides a new explanation, it might alter the theoretical weights accorded to some of the other theories.
10. International Agency for Research on Cancer (2006). The Preamble to the IARC Monographs. http://monographs.iarc.fr/ENG/Preamble/CurrentPreamble.pdf
Although placement into categories is clearly concerned with causal interpretation of
epidemiological data, IARC evaluations do not formally use the Hill criteria.
However, the lines of reasoning used in IARC monographs are found on Hill’s list.
In particular, strength of association and theoretical arguments are both brought
forward in the evaluations.
Although IARC evaluations are primarily concerned with human carcinogenesis, they
generally consider the evidence for carcinogenicity in animals as well. In some
cases, the evidential categories for human and animal carcinogenesis are
distinguished. In particular, a category 2A designation indicates that there is limited
evidence for human carcinogenicity, while 2B indicates that there is limited evidence
of carcinogenicity in animals only. The IARC weight-of-the-evidence evaluations are
implicitly geared towards a two-node probability tree: either the chemical is
carcinogenic, or it is not.
While the 1986 EPA Cancer Risk Assessment Guidelines mention the importance of
weight-of-the-evidence evaluations, the issue is brought up in a brief discussion that
gives almost no guidance as to how they are to be done¹¹. It is stipulated that the
results of the evaluation will be sorted into categories similar to those used by IARC.
The presumption seemed to be that a statistical significance test would settle the
matter. Until the next set of “finalized” guidelines was issued in 2005, this proved to
be a matter of substantial debate. The main issue that arose was this: how a chemical
may cause cancer came to be regarded as being at least as important as the “Delaney”
question of whether or not it can cause cancer at all.
The 2005 Guidelines have a far more extensive discussion of Weight of the
Evidence¹². The guidelines adopt a slightly modified version of the Hill criteria to
structure the evaluations. In fact, they do this twice. The first time is for the same
purpose as the IARC or 1986 evaluations, where the goal is to determine whether the
chemical should be identified as a carcinogen or not. The second time is in the
context of a discussion of potential alternative modes of action. Tree-wise, the
most pertinent section of the guidelines is section 2.4.3.3, Consideration of the
Possibility of Other Modes of Action:

The possible involvement of more than one mode of action at the tumor site
should be considered. Pertinent observations that are not consistent with the
hypothesized mode of action can suggest the possibility of other modes of
action. Some pertinent observations can be consistent with more than one
mode of action. Furthermore, different modes of action can operate in
different dose ranges; for example, an agent can act predominantly through
cytotoxicity at high doses and through mutagenicity at lower doses where
cytotoxicity may not occur.

11. U.S. Environmental Protection Agency (1986). Guidelines for Carcinogen Risk Assessment. EPA/630/R-00/004; also at Federal Register 51(185):33992-34003.
12. U.S. Environmental Protection Agency (2005). Guidelines for Carcinogen Risk Assessment. EPA/630/P-03/001F.
This passage indicates that if there is evidence for more than one mode of action, each
should receive a separate analysis. There may be an uneven level of experimental
support for the different modes of action. Sometimes this can reflect disproportionate
resources spent on investigating one particular mode of action and not the validity or
relative importance of the other possible modes of action. Ultimately, however, the
information on all of the modes of action should be integrated to better understand
how and when each mode acts, and which mode(s) may be of interest for exposure
levels relevant to human exposures of interest.
You might think that the weight of the evidence narrative that follows such careful
examination would embrace the possibility of multiple plausible interpretations of the
body of evidence, where each potential mode becomes a node on the tree. For
example, one node may stipulate that there is no causal relationship, a second node
may suppose that genotoxicity is responsible, and a third node may suppose that
increased cell proliferation is responsible for the increased tumor incidence. But
that’s not what the guidelines recommend. Instead, they keep the IARC-like
categories that simply grade whether or not the chemical should be considered to be
carcinogenic. The mode of action evaluation is folded into the category of Not
Likely to Be Carcinogenic to Humans:
This descriptor is appropriate when the available data are considered robust
for deciding that there is no basis for human hazard concern. In some
instances, there can be positive results in experimental animals when there is
strong, consistent evidence that each mode of action in experimental animals
does not operate in humans. In other cases, there can be convincing evidence
in both humans and animals that the agent is not carcinogenic.
The problem with this tactic is this: In order to argue that the risk is “unlikely” one
must essentially conduct a risk assessment, so excluding regulatory scrutiny on the
basis of evidential grading is premature. Even if the conclusion is correct, a
probability tree would be much better at laying out the real argument. Otherwise, the
obvious question to ask is “How unlikely is it?”.
At the time the 2005 guidelines appeared, the phrase Weight-of-the-Evidence was
also used to connote evidential evaluations in many risk assessment contexts besides
cancer risk assessments, and in many different ways¹³. Since then, there has been
proliferation of guidelines for both generating and using weight of the evidence
(WoE) evaluations for a range of different purposes. Evidence-based Medicine uses a
WoE process to evaluate individual study quality. WoE guidelines issued by the EPA
in 2011 lay out a process for evaluating potential mechanisms of action involving
hormonal regulation¹⁴. A more interesting development is the use of WoE for
environmental assessments. At about the same time, Suter and Cormier described a
process that integrates epidemiology and risk assessment by treating them as
complementary processes¹⁵. They also differentiate between “weighting evidence” à
la evidence-based medicine vs “weighing evidence” where the strength of a
hypothesis is evaluated. They also discuss the possibility of enumerating evidential
weight:
However, it must be recognized that the weights do not have natural units and
cannot be naturally added or averaged. For example, if you assign a weight of
2 for strength of a piece of evidence and 4 for quality, those scores are
numerical but not quantitative. If you sum them to obtain an overall weight of
six for the evidence, it may impart a sense of rigor, but it is still a qualitative
score. Therefore, qualitative weights are better expressed by symbolic systems
such as the scale, +++, ++, +, 0,−, −,−−−, commonly used in causal
assessments.
The counterargument is this: precisely because the weights aren’t natural, they can
be construed or reverse-engineered as necessary. In other words, if summing
numerical weights doesn’t yield reasonable numerical probability assignments, then
perhaps the numerical assignments for the weights need to be changed so they do.
That is, in fact, exactly what Swaen and van Amelsvoort¹⁶ did.
Suter and Cormier also discuss the idea of “Building a Case” for environmental risk
assessments; they obviously have the evidential form of probability in mind. In the
same vein, Lutter et al recently suggested using “Hypothesis-based weight of the
evidence” for chemical risk assessments in general¹⁷. Both of these suggestions
embody the notion of competing theories. But neither of them conveys the main
advantage of a tree in assimilating scientific information: it is not necessary to prove
one hypothesis to the exclusion of all others. Instead, competing hypotheses can share
space on the same probability tree. Perhaps, at some time in the future, one of the
hypotheses will drive the others to extinction. Meanwhile, decisions can still be made
with the best information available at the time.

13. Weed DL (2005). Weight of evidence: a review of concept and methods. Risk Anal 25:1545-1557.
14. U.S. Environmental Protection Agency (2011). Weight-of-Evidence: Evaluating Results of EDSP Tier 1 Screening to Identify the Need for Tier 2 Testing.
15. Suter GW and Cormier SM (2011). Why and how to combine evidence in environmental assessments: Weighing evidence and building cases. Science of the Total Environment 409:1406–1417.
16. Swaen and van Amelsvoort (2009). Ibid.
4. Iterative WoE
According to the NRC Redbook¹⁸, a risk assessment commences with a hazard
identification step that is largely a subjective determination. The Hazard ID is then
followed by a more objective process that involves conducting an exposure
assessment and characterizing the dose-response relationship for the specific cause
and effect. Yet, when hazard identification itself is thought of as a formal
weight-of-the-evidence process, the iterative nature of risk assessment becomes apparent. This
is especially true when human epidemiological studies are pivotal in establishing that
there is a causal relationship: Hazard identification needs a weight of evidence
evaluation, and a weight of the evidence evaluation needs a dose-response evaluation.
In addition, working with epidemiological data often requires estimation of the dose
that the subjects in the cohorts received, which means the hazard ID may also require
exposure assessment. In other words, in order to conduct a proper hazard
identification, one must do a risk assessment.
However, it probably won’t be exactly the same risk assessment as the one needed for the
decision at hand. For one thing, the exposures will certainly be different for a cohort
that was selected precisely because they are known to have unusually high exposures.
There also may be differences in other causal influences on a particular disease or
health measure in a cohort that was studied and the population for which a risk
estimate is needed. Nonetheless, one would generally expect that a dose-response
analysis used to establish that there is causal relationship should resemble the one
used to estimate the risk.
So it would seem that the consideration of evidential weight and the dose-response
assessment cannot be treated as entirely separate processes. For example, Suter and
Cormier¹⁹ characterized environmental epidemiology as risk assessment in reverse,
where both epidemiology and risk assessment share a common need for weighing
evidence and building the case for a causal relationship. But the weighing of
evidence itself isn’t really reversed. Yes, one should look for alternative plausible
explanations when conducting a causal assessment, but the need to characterize
uncertainty should make that part of the risk assessment as well. For example, the
fact that an association may be plausibly explained by both causal and noncausal
relationships may be an important source of model uncertainty. Where epidemiology
does, or should, work in reverse is during study design. Since the expectation of how
the data are going to be used drives what data are to be collected, knowing how data
become evidence will determine what a study looks for. An epidemiology study
designed to support regulatory decision making should look like a risk assessment.

17. Lutter R, Abbott L, Becker R, Bradley A, Charnley G, Dudley S, Felsot A, Golden N, Gray G, Juberg D, Mitchell M, Rachman N, Rhomberg L, Solomon K, Sundlof S and Willett K (2015). Improving Weight of Evidence Approaches to Chemical Evaluations. Risk Anal 35:186-192.
18. National Research Council (1983). Risk Assessment in the Federal Government: Managing the Process. National Academy Press, Washington.
19. Suter and Cormier (2011). Ibid.
That is not where the convergence stops. Risk assessments tend to be tiered and
iterative. They start out subjectively and tend to get more formally objective when
the policy issues get warm. They start out simple and tend to get more complex²⁰.
That line of thinking should be applied to the WoE process as well. When all is said
and done, the Hill criteria are fairly rudimentary. If the literature is extensive, the
divisions between the evidence categories break apart. Whether or not there is a
dose-response trend may depend on what is considered plausible. Whether or not the
associations are consistent may depend on what anomalies can be explained.
Specificity may not matter if interactions with other factors are likely. So, all of those
criteria may coalesce into two: empirical support and theoretical support.
The other important thing to recognize is that the actual dose-response relationship
really does become more important as the evidence for a causal effect becomes
stronger. If the evidence is weak, then the decision will be dominated by pCausality,
since the apparent effect probably isn’t real anyway. On the other hand, if pCausality
gets into the upper ranges of more-likely-than-not, then the actual effects predicted
are naturally going to be important. If there are different theories that would explain
a causal relationship (e.g. genotoxic carcinogen vs. promoter – see chapter 4), then
that will become the main focus, and the Hill criteria will fade into oblivion.
However, the probability tree may still have some mix of both causal and acausal
explanations. Unless the causal relationship is completely certain, it should. May the
best theory win.
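As a sketch of that final state of affairs, consider a tree that mixes one acausal node with two causal theories; all node probabilities and predicted effects below are invented for illustration:

```python
# A hypothetical probability tree mixing causal and acausal explanations;
# node probabilities and predicted excess cases are invented for illustration.
tree = {
    "no causal relationship": {"p": 0.30, "excess_cases": 0},
    "genotoxic carcinogen":   {"p": 0.50, "excess_cases": 120},
    "promoter (threshold)":   {"p": 0.20, "excess_cases": 25},
}

p_causality = sum(n["p"] for name, n in tree.items()
                  if name != "no causal relationship")
expected = sum(n["p"] * n["excess_cases"] for n in tree.values())

# pCausality is in more-likely-than-not territory, so the predicted
# effects, not the causal question alone, drive the expected burden.
print(round(p_causality, 2))  # 0.7
print(round(expected, 1))     # 65.0
```

Because pCausality is below 0.95, the lower-bound estimate is still zero on the logic laid out earlier, even while the central estimate is dominated by the dose-response predictions.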
20. National Research Council (1994). Science and Judgment in Risk Assessment. National Academy Press, Washington, p. 202.