Journal of Risky Logic 1:119-129

A Practical Guide to Theoretical Probability

Clark D. Carrington[1]

1. Mixed Probability Calculations

For personal decisions, the evidential form of probability is the more familiar. Every time a decision is made on the basis of a particular assertion being "probably true", theoretical probability is involved. However, the probability of chance, which is amenable to a mathematical treatment and is the main form found in academic discourse, is very familiar too. The relative importance of the two probabilities varies with the problem. Sometimes one or the other dominates, while in other instances both are important. Recreational betting games serve as an example:

Roulette. Betting on a roulette wheel is purely a game of chance. The odds and a long-term expected return can be calculated very accurately.

Horse Racing. In theory, some horses are faster than others – chance has very little to do with who wins. Sure, historical records are important, but that's mainly because they indicate which horses are fast and which ones are not.

Poker. The odds that a certain card or cards will turn up can be calculated, and poker can be played simply as a game of chance. But good poker players also take the mannerisms of their opponents into account when they bet, which turns poker into a mixed probability game.

It's the last category of problems that makes risk analysis interesting: giving the statistical and theoretical probabilities the same epistemic standing becomes essential. So, when Franklin (2001) correctly states that "the degree to which evidence supports hypotheses in law or science is not usually quantified", that's a problem.[2] Either the mathematical probabilities of chance are going to have to be converted to words, or the verbal determinations of theoretical probability are going to have to be put on a scale of 0 to 1 – even if the numbers are no better than the words.[3]
That doesn't have to be done formally or objectively; poker players do it all in their heads. Horse theorists presumably use similar techniques when figuring the odds on a horse.

[1] Division of Applied Epistemology, Institute for Risky Logic, Stanardsville, VA 22973
[2] Franklin J (2001). The Science of Conjecture. Johns Hopkins University Press, p. xi.
[3] Fox J, Krauss P, Elvang-Gøransson M (1993). Argumentation as a General Framework for Uncertain Reasoning. In: Uncertainty in Artificial Intelligence: Proceedings of the Ninth Conference. Ed. by Heckerman D and Mamdani A. Morgan Kaufmann, San Mateo, CA, pp. 428-434.

If you are betting on a single instance (i.e. what to do now), then boiling down theoretical probability and statistical probability into a single judgment or number is essential. A simple equation will suffice to represent this notion:

pTotal = pTheory * pChance

If the roulette wheel is fair, then pTheory = 1, and pTotal is dominated by the probability of chance. If the fastest horse wins, then pChance = 1, and pTotal is dominated by pTheory. When gamblers bet on a horse, converting horse theory to numerical odds is exactly what they do. Perhaps they use a probability tree (see chapter 2) where each node is a different horse. Poker players have a tougher calculation – not only do they have to know the odds of a card turning up, they also have to assign a probability to the notion that bluffing will work, or that their opponent is bluffing.

But once the bet is about the long run, or about public health instead of an individual, the calculation is quite different. It is a two-dimensional problem: the primary goal is to predict the frequency of a result or of different results, and there will also be uncertainty about the estimated frequencies. The probability calculation is no longer the same. The probability of chance is often a statistical frequency instead.
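The single-instance calculation can be sketched numerically. The wheel, the horses, and all of the probability assignments below are hypothetical illustrations, not data from the text:

```python
# Sketch of pTotal = pTheory * pChance for single-instance bets.
# All numbers here are hypothetical.

# Roulette: a fair wheel means pTheory = 1, so chance dominates.
p_theory = 1.0
p_chance = 18 / 38                  # e.g. a bet on red on a double-zero wheel
p_total_roulette = p_theory * p_chance

# Horse racing: if the fastest horse always wins, pChance = 1 and theory
# dominates. A one-level probability tree with a node per horse:
horse_theory = {"Alpha": 0.5, "Beta": 0.3, "Gamma": 0.2}
assert abs(sum(horse_theory.values()) - 1.0) < 1e-9   # the tree must sum to 1
p_total_alpha = horse_theory["Alpha"] * 1.0

print(round(p_total_roulette, 3))   # 0.474
print(p_total_alpha)                # 0.5
```

A poker player would multiply further, e.g. the chance of the needed card by a subjective probability that the opponent is not bluffing.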
In fact, it may or may not be a theoretical frequency. There can be a range of statistical estimates: purely empirical, purely theoretical, or an empirically supported theory. In the first category, a historical record with a large number of observations may justify a frequency estimate with no theoretical uncertainty. On the other hand, a smaller number of observations may instead serve to support a statistical theory, which begets theoretical uncertainty. The frequency estimate is now a function instead of a single number, so the relationship between theoretical probability and the frequency of occurrence is potentially far more complicated:

p(Frequency) = pTheory(pChance)

Empirical observations may be used to disprove a statistical theory too, but it will take many of them. For example, a large number of observations may show a particular die to be unfair. Then again, there may only be enough data to favor one theory over another without being able to conclusively decide that one is indubitably correct. That means you are going to need a probability tree.

Purely statistical probability schemes often acknowledge theoretical probability (e.g. as "systematic error"), but then usually go on to treat it as a done deal or as an insoluble problem. On the other hand, Bayesian probability schemes typically treat theoretical and statistical probabilities interchangeably.[4] If you are betting on a single instance, that works reasonably well. Updating a theoretical prior with data can gradually transform the probability into one of chance – the more data there are, the less the theory matters – but that isn't really very scientific. If the data were used to discriminate among alternative theories, the data might be put to better use. That problem is even more critical for the estimation of long run frequencies.
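The shift from a single number to a function can be sketched with a two-theory tree for a suspect die; the 0.7/0.3 split between the theories is invented for illustration:

```python
# p(Frequency) = pTheory(pChance): each theory on the tree predicts its own
# long-run frequency, so the result is a distribution over frequencies
# rather than a single number. Theory probabilities here are hypothetical.
theories = [
    ("fair die",   0.7, 1 / 6),   # (name, pTheory, predicted frequency of a six)
    ("loaded die", 0.3, 1 / 3),
]
p_frequency = {freq: p for _, p, freq in theories}

# A single summary number is still available, but it hides the theoretical
# uncertainty that the distribution over frequencies makes explicit.
expected_frequency = sum(p * f for f, p in p_frequency.items())
print(round(expected_frequency, 4))   # 0.2167
```

More data shifts pTheory between the branches; only with enough observations does one branch take all the weight and the frequency collapse back to a single number.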
Updating the parameter estimates for a theory that is in the process of being proven wrong doesn't make much sense. Since it is more consistent with how scientific knowledge actually develops, explicitly assigning probabilities to theories is a better strategy for long-term issues where knowledge may be expected to progress.

Since theoretical probabilities are inherently subjective, it is hard to improve upon convening a panel of experts to weigh the scientific evidence. Even if the experts don't get it quite right, or they aren't the right sort of experts, the process of assigning probabilities to competing theories creates an occasion for scientific discussion. As long as no one thinks that the numerical probabilities assigned to theories are anything more than a way of getting through the decisions of the day, they won't get in the way of that.[5]

Translating numerical characterizations of theoretical or total probability to legal evidentiary standards (see chapter 1) is also possible. At least one translation is straightforward: if one has a distribution of probable outcomes, then as-likely-as-not corresponds to the middle of it, and it doesn't matter whether the probability is statistical, theoretical, objective, or subjective. It is tempting to equate "beyond a reasonable doubt" with a probability of 0.95, but there really is no precedent for it. When quantifying theoretical probability, it is important to remember that uncertainty is not the same as frequency (see chapter 2), and therefore a theoretical or total probability cannot be used to predict the frequency of an outcome.
As a recent example of the problem, Trasande et al (2015)[6] provided an overview of the efforts to characterize the theoretical probabilities for causal theories involving potential health effects of Endocrine Disrupting Chemicals (EDCs):

    We now describe the general methods used to attribute disease and disability to EDCs, to weigh the probability of causation based upon the available evidence, and to translate attributable disease burden into costs. During a 2-day workshop in April 2014, five expert panels identified conditions where the evidence is strongest for causation and developed ranges for fractions of disease burden that can be attributed to EDCs.

[4] Howson C and Urbach P (1993). Chapter 13: Objective Probability. In: Scientific Reasoning: The Bayesian Approach. Open Court, Chicago, pp. 319-351.
[5] This is a concern raised by Morgan GM (2014). Use (and abuse) of expert elicitation in support of decision making for public policy. PNAS 111:7176-7184.
[6] Trasande L, Zoeller RT, Hass U, Kortenkamp A, Grandjean P, Myers JP, DiGangi J, Bellanger M, Hauser R, Legler J, Skakkebaek NE, and Heindel JJ (2015). Estimating Burden and Disease Costs of Exposure to Endocrine-Disrupting Chemicals in the European Union. J Clin Endocrinol Metab 100:1245-1255.

There are more than a few quibbles that could be raised about exactly what they did, ranging from how the problems were characterized in the first place (i.e. by presuming independent attributable risks), to the use of implausible dose-response models, the lack of serious consideration of other (i.e. non-EDC) causal factors, and the treatment of the relationship between association and causation as all-or-none. Also, because the probability assignments are subjective, a two-day workshop of experts with similar interests is perhaps insufficient for a decision involving the economic impacts that are alleged.
However, praise for the process itself is deserved; even if the probabilities aren't entirely reasonable, putting them out for criticism starts a useful process. Nonetheless, as the present topic of discussion, there is one error in how the theoretical probability was employed after it was arrived at that must not go unnoticed:

    Finally, recognizing that attributable cost estimates were accompanied by a probability, we performed a series of Monte Carlo simulations to produce ranges of probable costs across all the exposure-outcome relationships, assuming independence of each probabilistic event. Separate random number generation events were used to assign 1) causation or not causation, and 2) cost given causation, using the base case estimate as well as the range of sensitivity analytic inputs produced by the expert panel. To illustrate with an example, for an exposure-outcome relationship with an 80% probability of causation, random values between 0 and 1 in each simulation led to the first step, which either assigned no costs (random value ≤ 0.2) and costs (random value > 0.2).

If the problem required the combination of both theoretical and statistical probabilities, the use of a probability tree in a Monte Carlo simulation would be appropriate. However, there is a problem in implementation that arises from the fact that a causal probability is NOT a frequency or a probability of chance: a theory is either true all the time or false all the time, and the entire cost estimate is dependent on the truth of the theory. So, no, you can't assume independence, and using a causal probability to calculate the frequency of an event is inappropriate. Instead, the logic should go like this: since all of the causal probabilities are less than 95%, the lower bound cost estimate for all of the endpoints should be zero (see table four). For those endpoints with a causal probability of less than 50%, the central estimate should be zero as well.
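The distinction can be made concrete with a small simulation. The 80% causal probability comes from the quoted example; the uniform cost range is invented as a stand-in for a cost-given-causation estimate:

```python
# Sketch of why a causal probability is not a frequency: the truth of the
# theory is drawn ONCE per simulated world, and every cost in that world is
# conditional on that single draw. The uniform(40, 60) cost model is a
# hypothetical stand-in for a panel's cost-given-causation estimate.
import random

random.seed(1)
p_causal = 0.8          # probability of causation, from the quoted example
n = 100_000

costs = []
for _ in range(n):
    theory_true = random.random() < p_causal      # one draw per world
    cost = random.uniform(40.0, 60.0) if theory_true else 0.0
    costs.append(cost)

costs.sort()
lower_95 = costs[int(0.05 * n)]   # 5th percentile of the cost distribution
central = costs[n // 2]           # median

# Because p_causal < 0.95, about 20% of simulated worlds have zero cost,
# so the lower bound is zero even though the median is not.
print(lower_95, round(central, 1))
```

The point mass at zero carries the theoretical uncertainty; resampling causation independently for each endpoint or iteration would smear it away and misstate the bounds.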
2. Probabilities vs Weights

So, when a decision is at stake, but not otherwise, giving a numerical value to theoretical probabilities is desirable. If a formal risk assessment is being developed as part of a debate over public policy, then having a formal process for doing so is rather essential. A 2009 paper by Swaen and van Amelsvoort[7] does that for what is perhaps the best known theoretical probability issue, namely determining whether or not a chemical causes cancer. The technique uses a version of the approach first described by Carnap[8], in which it is supposed that the probability of a hypothesis is directly determined by the evidence for it. It also proceeds by using the commonly accepted Hill criteria as a means of identifying and grading evidence, giving a numerical value to each category. While the actual probability assignments are reasonable enough, the presentation of the technique invites several objections:

The paper claims to provide an "empirical basis" for assigning probabilities to theories. This they obviously did not do. Yes, they used the available data, and they outlined a formal procedure for determining causal probabilities, but between the data and the final probability lies a plethora of subjective judgments.

The evidential values assigned to each category are labeled as probabilities, in which each category of evidence is considered to have a "probability of being true". For most of the categories, that seems odd. For example, they use Relative Risk to determine the Strength of association; how does a relative risk get turned into a probability of truth?

The heading for the column of evidential probabilities is "%". That rather obviously implies that they are thinking of the evidence as being statistical, even though frequency of occurrence cannot be equated with a probability of truth. That obviously invites the sort of misinterpretation exhibited by Trasande.
There is no mention of what the alternative theories might be. Implicitly, the probability of there not being a causal relationship is one minus the causal probability, but that is never alluded to.

All of those problems can be avoided by using the Keynesian strategy (see chapter 2) of separating numerical evidential weights from the theoretical probability assignments. A probability tree, where the probabilities of all alternatives under consideration sum to 1, can then be used to generate the theory probabilities by making them proportional to the weights. For example, the probabilities for a two-node tree would be:

p(H1) = weight1/(weight1 + weight2) and p(H2) = weight2/(weight1 + weight2)

The advantage of this method is that each alternative theory can be evaluated independently, and a new alternative theory can be added to the tree without invalidating the evaluation of the other theories.[9]

[7] Swaen G and van Amelsvoort L (2009). A Weight of the Evidence Approach to Causal Inference. J Clin Epidemiol 62:270-277.
[8] Carnap R (1950). The Logical Foundations of Probability. University of Chicago Press, Chicago.

3. Weight of the Evidence

Since 1972, the most renowned agency tasked with conducting weight-of-the-evidence evaluations has been the United Nations sponsored International Agency for Research on Cancer (IARC). IARC-sponsored workgroups have evaluated hundreds of chemicals and substances with the goal of placing each in one of four evidential categories[10]:

    Sufficient evidence of carcinogenicity: The Working Group considers that a causal relationship has been established between exposure to the agent and human cancer. That is, a positive relationship has been observed between the exposure and cancer in studies in which chance, bias and confounding could be ruled out with reasonable confidence.
    A statement that there is sufficient evidence is followed by a separate sentence that identifies the target organ(s) or tissue(s) where an increased risk of cancer was observed in humans. Identification of a specific target organ or tissue does not preclude the possibility that the agent may cause cancer at other sites.

    Limited evidence of carcinogenicity: A positive association has been observed between exposure to the agent and cancer for which a causal interpretation is considered by the Working Group to be credible, but chance, bias or confounding could not be ruled out with reasonable confidence.

    Inadequate evidence of carcinogenicity: The available studies are of insufficient quality, consistency or statistical power to permit a conclusion regarding the presence or absence of a causal association between exposure and cancer, or no data on cancer in humans are available.

    Evidence suggesting lack of carcinogenicity: There are several adequate studies covering the full range of levels of exposure that humans are known to encounter, which are mutually consistent in not showing a positive association between exposure to the agent and any studied cancer at any observed level of exposure. The results from these studies alone or combined should have narrow confidence intervals with an upper limit close to the null value (e.g. a relative risk of 1.0). Bias and confounding should be ruled out with reasonable confidence, and the studies should have an adequate length of follow-up.

[9] Then again, it might. For example, if a new theory provides a new explanation, it might alter the theoretical weights accorded to some of the other theories.
[10] International Agency for Research on Cancer (2006). The Preamble to the IARC Monographs. http://monographs.iarc.fr/ENG/Preamble/CurrentPreamble.pdf
    A conclusion of evidence suggesting lack of carcinogenicity is inevitably limited to the cancer sites, conditions and levels of exposure, and length of observation covered by the available studies. In addition, the possibility of a very small risk at the levels of exposure studied can never be excluded.

Although placement into categories is clearly concerned with causal interpretation of epidemiological data, IARC evaluations do not formally use the Hill criteria. However, the lines of reasoning used in IARC monographs are found on Hill's list. In particular, strength of association and theoretical arguments are both brought forward in the evaluations. Although IARC evaluations are primarily concerned with human carcinogenesis, they generally consider the evidence for carcinogenicity in animals as well. In some cases, the evidential categories for human and animal carcinogenesis are distinguished. In particular, a category 2A designation indicates that there is limited evidence for human carcinogenicity, while 2B indicates that there is limited evidence of carcinogenicity in animals only.

The IARC weight of the evidence evaluations are implicitly geared towards a two-node probability tree: either the chemical is carcinogenic, or it is not.

While the 1986 EPA Cancer Risk Assessment Guidelines mention the importance of weight of the evidence evaluations, the issue is brought up in a brief discussion that gives almost no guidance as to how such evaluations are to be done.[11] It is stipulated that results of the evaluation will be sorted into categories similar to those used by IARC. The presumption seemed to be that a statistical significance test would settle the matter. Until the next set of "finalized" guidelines was issued in 2005, this proved to be a matter of substantial debate.
The main issue that arose was this: how a chemical may cause cancer came to be regarded as at least as important as the "Delaney" question of whether or not it can cause cancer at all.

The 2005 Guidelines have a far more extensive discussion of weight of the evidence.[12] The guidelines adopt a slightly modified version of the Hill criteria to structure the evaluations. In fact, they do this twice. The first time is for the same purpose as the IARC or 1986 evaluations, where the goal is to determine whether the chemical should be identified as a carcinogen or not. The second time is in the context of a discussion of potential alternative modes of action.

[11] U.S. Environmental Protection Agency (1986). Guidelines for Carcinogen Risk Assessment. EPA/630/R-00/004, also at Federal Register 51(185):33992-34003.
[12] U.S. Environmental Protection Agency (2005). Guidelines for Carcinogen Risk Assessment. EPA/630/P-03/001F.

Tree-wise, the most pertinent section of the guidelines is section 2.4.3.3, Consideration of the Possibility of Other Modes of Action:

    The possible involvement of more than one mode of action at the tumor site should be considered. Pertinent observations that are not consistent with the hypothesized mode of action can suggest the possibility of other modes of action. Some pertinent observations can be consistent with more than one mode of action. Furthermore, different modes of action can operate in different dose ranges; for example, an agent can act predominantly through cytotoxicity at high doses and through mutagenicity at lower doses where cytotoxicity may not occur.

This passage indicates that if there is evidence for more than one mode of action, each should receive a separate analysis. There may be an uneven level of experimental support for the different modes of action.
Sometimes this can reflect disproportionate resources spent on investigating one particular mode of action, and not the validity or relative importance of the other possible modes of action. Ultimately, however, the information on all of the modes of action should be integrated to better understand how and when each mode acts, and which mode(s) may be of interest for exposure levels relevant to human exposures of interest.

You might think that the weight of the evidence narrative that follows such careful examination would embrace the possibility of multiple plausible interpretations of the body of evidence, where each potential mode becomes a node on the tree. For example, one node may stipulate that there is no causal relationship, a second may suppose that genotoxicity is responsible, and a third may suppose that increased cell proliferation is responsible for the increased tumor incidence. But that's not what the guidelines recommend. Instead, they keep the IARC-like categories that simply grade whether or not the chemical should be considered to be carcinogenic. The mode of action evaluation is folded into the category of Not Likely to Be Carcinogenic to Humans:

    This descriptor is appropriate when the available data are considered robust for deciding that there is no basis for human hazard concern. In some instances, there can be positive results in experimental animals when there is strong, consistent evidence that each mode of action in experimental animals does not operate in humans. In other cases, there can be convincing evidence in both humans and animals that the agent is not carcinogenic.

The problem with this tactic is this: in order to argue that the risk is "unlikely" one must essentially conduct a risk assessment, so excluding regulatory scrutiny on the basis of evidential grading is premature. Even if the conclusion is correct, a probability tree would be much better at laying out the real argument.
Otherwise, the obvious question to ask is "How unlikely is it?".

At the time the 2005 guidelines appeared, the phrase weight-of-the-evidence was also used to connote evidential evaluations in many risk assessment contexts besides cancer risk assessment, and in many different ways.[13] Since then, there has been a proliferation of guidelines for both generating and using weight of the evidence (WoE) evaluations for a range of different purposes. Evidence-based medicine uses a WoE process to evaluate individual study quality. WoE guidelines issued by the EPA in 2011 lay out a process for evaluating potential mechanisms of action involving hormonal regulation.[14]

A more interesting development is the use of WoE for environmental assessments. At about the same time, Suter and Cormier described a process that integrates epidemiology and risk assessment by treating them as complementary processes.[15] They also differentiate between "weighting evidence" a la evidence-based medicine and "weighing evidence", where the strength of a hypothesis is evaluated. They also discuss the possibility of enumerating evidential weight:

    However, it must be recognized that the weights do not have natural units and cannot be naturally added or averaged. For example, if you assign a weight of 2 for strength of a piece of evidence and 4 for quality, those scores are numerical but not quantitative. If you sum them to obtain an overall weight of six for the evidence, it may impart a sense of rigor, but it is still a qualitative score. Therefore, qualitative weights are better expressed by symbolic systems such as the scale +++, ++, +, 0, −, −−, −−− commonly used in causal assessments.

The counterargument is this: precisely because the weights aren't natural, they can be construed or reverse-engineered as necessary.
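Section 2's normalization rule makes the counterargument concrete: only the ratios of the weights matter, so the numerical scale can be tuned until the implied probabilities look reasonable. The hypothesis names and weights below (including a 2 + 4 = 6 score like Suter and Cormier's example) are purely illustrative:

```python
# Sketch: turning non-natural evidential weights into theory probabilities
# with the section 2 rule p(Hi) = weight_i / sum(weights). The hypothesis
# names and weights are hypothetical.
def tree_probabilities(weights):
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

weights = {"causal": 6.0, "not causal": 2.0}   # e.g. strength 2 + quality 4
print(tree_probabilities(weights))             # {'causal': 0.75, 'not causal': 0.25}

# Rescaling every weight by the same factor changes nothing, which is why
# the scale can be reverse-engineered without breaking the tree:
doubled = {h: 2 * w for h, w in weights.items()}
assert tree_probabilities(doubled) == tree_probabilities(weights)

# A new alternative can be added without re-evaluating the others:
weights["confounded"] = 2.0
print(tree_probabilities(weights))             # 'causal' drops to 0.6
```

The units are indeed arbitrary, as Suter and Cormier say; the normalization simply makes the arbitrariness harmless by using the weights only relative to one another.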
In other words, if summing numerical weights doesn't yield reasonable numerical probability assignments, then perhaps the numerical assignments for the weights need to be changed so they do. That is, in fact, exactly what Swaen and van Amelsvoort did.[16]

Suter and Cormier also discuss the idea of "building a case" for environmental risk assessments; they obviously have the evidential form of probability in mind. In the same vein, Lutter et al recently suggested using "hypothesis-based weight of the evidence" for chemical risk assessments in general.[17] Both of these suggestions embody the notion of competing theories. But neither of them conveys the main advantage of a tree in assimilating scientific information: it is not necessary to prove one hypothesis to the exclusion of all others. Instead, competing hypotheses can share space on the same probability tree. Perhaps, at some time in the future, one of the hypotheses will drive the others to extinction. Meanwhile, decisions can still be made with the best information available at the time.

[13] Weed DL (2005). Weight of evidence: a review of concept and methods. Risk Anal 25:1545-1557.
[14] U.S. Environmental Protection Agency (2011). Weight-of-Evidence: Evaluating Results of EDSP Tier 1 Screening to Identify the Need for Tier 2 Testing.
[15] Suter GW and Cormier SM (2011). Why and how to combine evidence in environmental assessments: Weighing evidence and building cases. Science of the Total Environment 409:1406-1417.
[16] Swaen and van Amelsvoort (2009). Ibid.

4. Iterative WoE

According to the NRC Redbook[18], a risk assessment commences with a hazard identification step that is largely a subjective determination. The hazard ID is then followed by a more objective process that involves conducting an exposure assessment and characterizing the dose-response relationship for the specific cause and effect.
Yet, when hazard identification itself is thought of as a formal weight-of-the-evidence process, the iterative nature of risk assessment becomes apparent. This is especially true when human epidemiological studies are pivotal in establishing that there is a causal relationship: hazard identification needs a weight of the evidence evaluation, and a weight of the evidence evaluation needs a dose-response evaluation. In addition, working with epidemiological data often requires estimation of the dose that the subjects in the cohorts received, which means the hazard ID may also require an exposure assessment. In other words, in order to conduct a proper hazard identification, one must do a risk assessment.

However, it probably won't be exactly the same risk assessment as the one needed for the decision at hand. For one thing, the exposures will certainly be different for a cohort that was selected precisely because its members are known to have unusually high exposures. There also may be differences in other causal influences on a particular disease or health measure between the cohort that was studied and the population for which a risk estimate is needed. Nonetheless, one would generally expect that a dose-response analysis used to establish that there is a causal relationship should resemble the one used to estimate the risk. So it would seem that the consideration of evidential weight and the dose-response assessment cannot be treated as entirely separate processes.

For example, Suter and Cormier[19] characterized environmental epidemiology as risk assessment in reverse, where both epidemiology and risk assessment share a common need for weighing evidence and building the case for a causal relationship. But the weighing of evidence itself isn't really reversed. Yes, one should look for alternative plausible explanations when conducting a causal assessment, but the need to characterize uncertainty should make that part of the risk assessment as well. For example, the fact that an association may be plausibly explained by both causal and noncausal relationships may be an important source of model uncertainty.

Where epidemiology does, or should, work in reverse is during study design. Since the expectation of how the data are going to be used drives what data are to be collected, knowing how data become evidence will determine what a study looks for. An epidemiology study designed to support regulatory decision making should look like a risk assessment.

That is not where the convergence stops. Risk assessments tend to be tiered and iterative. They start out subjectively and tend to get more formally objective when the policy issues get warm. They start out simple and tend to get more complex.[20] That line of thinking should be applied to the WoE process as well. When all is said and done, the Hill criteria are fairly rudimentary. If the literature is extensive, the divisions between the evidence categories break apart. Whether or not there is a dose-response trend may depend on what is considered plausible. Whether or not the associations are consistent may depend on what anomalies can be explained. Specificity may not matter if interactions with other factors are likely. So, all of those criteria may coalesce into two: empirical support and theoretical support.

The other important thing to recognize is that the actual dose-response relationship becomes more important as the evidence for a causal effect becomes stronger.

[17] Lutter R, Abbott L, Becker R, Bradley A, Charnley G, Dudley S, Felsot A, Golden N, Gray G, Juberg D, Mitchell M, Rachman N, Rhomberg L, Solomon K, Sundlof S and Willett K (2015). Improving Weight of Evidence Approaches to Chemical Evaluations. Risk Anal 35:186-192.
[18] National Research Council (1983). Risk Assessment in the Federal Government: Managing the Process. National Academy Press, Washington.
[19] Suter and Cormier (2011). Ibid.
If the evidence is weak, then the decision will be dominated by pCausality, since the apparent effect probably isn't real anyway. On the other hand, if pCausality gets into the upper ranges of more-likely-than-not, then the actual effects predicted are naturally going to be important. If there are different theories that would explain a causal relationship (e.g. genotoxic carcinogen vs. promoter – see chapter 4), then that will become the main focus, and the Hill criteria will fade into oblivion. However, the probability tree may still have some mix of both causal and acausal explanations. Unless the causal relationship is completely certain, it should. May the best theory win.

[20] National Research Council (1994). Science and Judgment in Risk Assessment. National Academy Press, Washington, p. 202.