Statistical Consulting

Math 6330: Statistical Consulting
Class 3
Tony Cox
University of Colorado at Denver
Course web site: http://cox-associates.com/6330/
Wrap-up from initial exercises
• Continual open-mindedness about
uncertainties + technical skills to
reduce them are essential for using
data to learn how to act more
effectively.
• A combination of humility
(willingness to learn from data) and
skepticism (insistence on learning
from data) about what is assumed
or considered known boosts
performance in forecasting,
decision-making, diagnosis,
research, and management
Assignment for next time (February 7)
• Evaluate evidence that PM2.5 causes elderly mortality in new
data set, Sample4.xlsx, at http://cox-associates.com/6330/.
Be prepared to present your thoughts in < 5-minute
presentation (just in case!)
• Read Russo & Schoemaker, 1989, Chapter 5 (improving
intelligence-gathering and estimation),
https://professional.sauder.ubc.ca/re_creditprogram/course_
resources/courses/content/499/russo.pdf
• Software: Download Netica, bring it next time
http://www.norsys.com/download.html
• (Optional) Youtube: Hans Rosling TED talk,
www.ted.com/talks/hans_rosling_shows_the_best_stats_you
_ve_ever_seen
• (Optional) Fair Coin problem
Introduction to descriptive analytics
(cont.) – Some high-value tools
• CART trees
• Bayesian Networks (BNs)
• Random Forests
• Partial dependence plots
• Visualization
Fair Coin Problem
• A box contains two coins: (a) A fair coin; and (b) A
coin with a head on each side. One coin is
selected at random (we don’t know which) and
tossed once. It comes up heads.
• Q1: What is the probability that the coin is the fair
coin?
• Q2: If the same coin is tossed again and shows
heads again, then what is the new (posterior)
probability that it is the fair coin?
Solve manually and/or using Netica.
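A manual solution can be checked by applying Bayes' rule directly. The sketch below assumes a 50/50 prior over the two coins (as stated in the problem) and independent tosses; the function name `posterior_fair` is a hypothetical helper, not part of Netica.

```python
from fractions import Fraction

def posterior_fair(num_heads, prior_fair=Fraction(1, 2)):
    """Posterior probability that the coin is fair after `num_heads` heads in a row."""
    like_fair = Fraction(1, 2) ** num_heads  # P(all heads | fair coin)
    like_two_headed = Fraction(1)            # P(all heads | two-headed coin)
    numerator = prior_fair * like_fair
    return numerator / (numerator + (1 - prior_fair) * like_two_headed)

q1 = posterior_fair(1)  # after one head
q2 = posterior_fair(2)  # after two heads with the same coin
print(q1, q2)           # prints: 1/3 1/5
```

Each additional head halves the likelihood under the fair coin while leaving the two-headed likelihood at 1, so the posterior for the fair coin keeps shrinking.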
Bayesian Networks (BNs) show
information relations among variables
• BNs provide a high-level roadmap
for descriptive analytics
• Each node has a conditional
probability table (CPT) (or
regression model, CART tree, etc.)
describing how the conditional
probabilities of its values depend
on other variables.
• If no arrow connects two variables,
then they are conditionally
independent of each other, given
an appropriate subset of the other
variables in the BN.
– Omitted variables can create
statistical dependencies
– Conditioning on variables can also
sometimes create dependencies
• Information principle for causality:
Causes are not conditionally
independent of their effects.
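The point that conditioning can create dependencies can be illustrated by exact enumeration on a tiny network. The sketch below assumes a hypothetical three-node BN, X -> Z <- Y, in which X and Y are independent fair binary variables and Z = X OR Y; none of these variables come from the course data.

```python
from fractions import Fraction
from itertools import product

# Joint distribution of the hypothetical network X -> Z <- Y with Z = X OR Y.
joint = {}
for x, y in product([0, 1], repeat=2):
    z = int(x or y)
    joint[(x, y, z)] = Fraction(1, 4)  # X and Y are independent fair coins

def prob(event, given=lambda x, y, z: True):
    """P(event | given), computed by summing over the joint distribution."""
    num = sum(p for s, p in joint.items() if event(*s) and given(*s))
    den = sum(p for s, p in joint.items() if given(*s))
    return num / den

# Marginally, learning Y tells us nothing about X.
p_x = prob(lambda x, y, z: x == 1)
p_x_given_y = prob(lambda x, y, z: x == 1, lambda x, y, z: y == 1)

# Once Z is observed, learning Y changes the probability of X.
p_x_given_z = prob(lambda x, y, z: x == 1, lambda x, y, z: z == 1)
p_x_given_zy = prob(lambda x, y, z: x == 1, lambda x, y, z: z == 1 and y == 1)

print(p_x, p_x_given_y)            # prints: 1/2 1/2
print(p_x_given_z, p_x_given_zy)   # prints: 2/3 1/2
```

X and Y have no arrow between them and are independent, yet they become dependent once their common effect Z is conditioned on.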
Interpreting BNs
• Nodes represent variables
– Influence diagrams: Chance, Choice, and Value nodes
• Links (arrows) represent statistical dependencies
• Absence of links reveals conditional independence relations
• Each node with inward-pointed arrows has a conditional
probability table (CPT) for its value, given the values of its
parents
• A BN can be used to propagate evidence by setting values of
some variables and computing updated distributions of the
rest using exact or Markov Chain Monte Carlo algorithms
• BNs can sometimes be learned from data
– Structure learning, Dirichlet priors for CPTs, constraint-based and
scoring algorithms, R package bnlearn
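As a rough illustration of learning a CPT from data with a Dirichlet prior, the sketch below estimates P(child | parent) as the posterior mean under a symmetric Dirichlet(alpha) prior, which amounts to adding alpha pseudo-counts to each cell. The function `estimate_cpt`, the variable names, and the records are all hypothetical; this is not how bnlearn or Netica is implemented.

```python
from collections import Counter

def estimate_cpt(records, child, parent, child_values, alpha=1.0):
    """Posterior-mean estimate of P(child | parent) with a symmetric Dirichlet(alpha) prior."""
    counts = Counter((r[parent], r[child]) for r in records)
    parent_values = sorted({r[parent] for r in records})
    cpt = {}
    for pv in parent_values:
        # Observed count for this parent value plus alpha pseudo-counts per child value.
        total = sum(counts[(pv, cv)] for cv in child_values) + alpha * len(child_values)
        cpt[pv] = {cv: (counts[(pv, cv)] + alpha) / total for cv in child_values}
    return cpt

# Hypothetical records: 8 observations of an exposure and an outcome.
records = (
    [{"exposure": "high", "outcome": "yes"}] * 3
    + [{"exposure": "high", "outcome": "no"}] * 1
    + [{"exposure": "low", "outcome": "no"}] * 4
)
cpt = estimate_cpt(records, "outcome", "exposure", ["no", "yes"])
print(cpt["high"]["yes"], cpt["low"]["yes"])
```

With alpha = 1, the "low" row assigns probability 1/6 rather than 0 to "yes" even though no such case was observed, which keeps later evidence propagation from dividing by zero.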
Algorithms for finding what has
changed
• Change-point analysis (CPA)
– Basic idea: Search for most likely explanation of
observed data
– Provides estimates of
• Times of changes
• Sizes of changes
• Confidence intervals and levels provided
• Recent breakthroughs in CPA algorithms
– Model-free CPA using order statistics
– Multiple time series
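The basic likelihood-based idea can be sketched for a single change in mean. The code below is a simplified illustration, not one of the recent algorithms mentioned above: it assumes Gaussian noise with a known standard deviation `sigma`, a uniform prior over the change time, and plugs in segment means (a profile likelihood), so the resulting "posterior" is approximate. The data series is made up.

```python
import math

def change_point_posterior(y, sigma=1.0):
    """Approximate posterior over the change time t (the first t points precede the change)."""
    n = len(y)
    log_like = []
    for t in range(1, n):
        left, right = y[:t], y[t:]
        m1 = sum(left) / len(left)
        m2 = sum(right) / len(right)
        # Sum of squared errors around each segment's own mean.
        sse = sum((v - m1) ** 2 for v in left) + sum((v - m2) ** 2 for v in right)
        log_like.append(-sse / (2 * sigma ** 2))
    mx = max(log_like)
    weights = [math.exp(l - mx) for l in log_like]  # subtract max for numerical stability
    total = sum(weights)
    return [w / total for w in weights]             # uniform prior, so just normalize

# Hypothetical series: mean near 0 for 10 periods, then near 3.
y = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1,
     2.9, 3.2, 3.0, 2.8, 3.1, 3.3, 2.9, 3.0, 3.1, 2.7]

posterior = change_point_posterior(y, sigma=0.5)
best_t = posterior.index(max(posterior)) + 1        # posterior[i] corresponds to t = i + 1
size = sum(y[best_t:]) / len(y[best_t:]) - sum(y[:best_t]) / best_t
print(best_t, round(size, 2))
```

The spread of `posterior` around `best_t` plays the role of the uncertainty about the time of the change, and `size` is the estimated size of the change.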
Example: How quickly can a change be
detected/used for planning?
http://jamia.oxfordjournals.org/content/19/6/1075
Example: Did time series
change? If so, when?
Surveillance time series showing a possible increase in hospitalization rates
Output from simple likelihood-based
CPA algorithm
Figure panels: data; posterior distribution for time of change.
(The algorithm also estimates the size of the change.)
Automatically noticing and describing
what matters
• Simple approach: Create binary indicator for
“this period” vs. “recent periods”
• Treat indicator as dependent variable, find
most parsimonious/best predictors in
multivariate data
– Shows what’s different now
– Highlights informative changes
• Embed key predictors in causal network model
to explain and predict changes
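The indicator approach can be sketched with a one-split (decision stump) search per variable, in the spirit of a CART tree's first split. Everything below is hypothetical: the variables, the data, and the helper names `gini` and `best_split`. A real analysis would more likely use a full tree or random forest from an ML library.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (impurity_reduction, threshold) for the best single split on one variable."""
    n = len(labels)
    base = gini(labels)
    best = (0.0, None)
    for thr in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        reduction = base - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if reduction > best[0]:
            best = (reduction, thr)
    return best

# Hypothetical daily data; the indicator is 1 for "this period", 0 for "recent periods".
data = {
    "temperature": [70, 72, 68, 71, 69, 73, 88, 90, 87, 91],
    "wind_speed": [5, 9, 7, 6, 8, 5, 7, 6, 9, 5],
}
is_current = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# Rank variables by how well one split on them predicts the indicator.
ranking = []
for name, values in data.items():
    reduction, threshold = best_split(values, is_current)
    ranking.append((reduction, name, threshold))
ranking.sort(key=lambda item: item[0], reverse=True)

for reduction, name, threshold in ranking:
    print(name, round(reduction, 3), threshold)
```

Variables at the top of the ranking are the ones that best distinguish the current period from recent ones, and so are candidates for inclusion in a causal network model.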
Example: Change analysis of years
2007-2010
• Hot, high humidity
days are more likely
to occur in more
recent years
Standard machine learning (ML) tools
for descriptive analytics
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Caveat for descriptive analytics:
Beware the data!
• Many claimed facts are
wrong
– Data misinterpreted
– Definitions not clear
– Incorrect generalization
– Results misunderstood
or misquoted
– Cherry-picking and
manipulation
www.amazon.com/Damned-Lies-Statistics-Untangling-Politicians/dp/0520219783
Example: Understanding America’s
hunger epidemic
• “According to Feeding America, 1 in 7 people in
the U.S. face hunger every year. The rates of
hunger in children are even higher, with about 1
in 5 lacking proper access to food at some point
during the year. … We have an epidemic of
hunger right here.”
http://mashable.com/2016/07/14/child-hungerunited-states/#YX5UBpiUTaqt
Digging into that 1 in 5 (or 6) children
struggling with hunger statistic
• Follow the definitions: “Food Insecurity – Low food security (old label = Food
insecurity without hunger): reports of reduced quality, variety, or desirability of
diet. Little or no indication of reduced food intake. … The CNSTAT panel also
recommended that USDA consider alternative labels to convey the severity of food
insecurity without using the word ‘hunger,’ since hunger is not adequately
assessed in the food security survey.”
www.ers.usda.gov/topics/food-nutritionassistance/food-security-in-the-us/definitions-of-food-security.aspx
– By this definition, a child who “struggles with hunger” according to Feeding America may
never actually experience hunger, but parents may occasionally buy foods on sale instead of
usual brands to get savings.
• A better statistic: “While children are usually shielded from the disrupted eating
patterns and reduced food intake that characterize very low food security, both
children and adults experienced instances of very low food security in 0.7 percent
of households with children (274,000 households) in 2015. The decline from 2014
(1.1 percent) was statistically significant.”
www.ers.usda.gov/webdocs/publications/err215/err215_summary.pdf?v=42636
• Follow the money: www.feedingamerica.org/about-us/about-feedingamerica/partners/food-and-fund-partners/visionary-partners/
– Albertson’s, CONAGRA, Food Lion, General Mills, PepsiCo, etc.
• Follow the opposing views:
www.forbes.com/sites/paulroderickgregory/2011/11/20/are-one-in-five-americanchildren-hungry/#1a5f09c7cdb2
Communicating results from
descriptive analytics:
How results are described can
change decisions
Which elicits stronger willingness-to-pay?
• A: “Purchase new equipment at airport that
will save 150 lives if there is an accident”
• B: “Purchase new equipment at airport that
will save 85% of 150 lives if there is an
accident”
Which leads to more patient releases?
• A: "20 out of every 100" similar patients will
commit an act of violence after release
• B: "20 percent" of similar patients will commit
an act of violence after release
http://onlinelibrary.wiley.com/doi/10.1111/risa.12105/abstract
Which leads to more patient releases?
• A: "20 out of every 100" similar patients will
commit an act of violence after release
• B: "20 percent" of similar patients will commit
an act of violence after release
• (Answer: Psychiatrists are about twice as
likely to keep a patient confined if A is used
instead of B)
http://onlinelibrary.wiley.com/doi/10.1111/risa.12105/abstract
Framing affects choice
• A: Surgery described as giving a “68% chance
of being alive” a year after surgery [44% prefer
it to radiation treatment]
• B: Same surgery described as giving a “32%
chance of dying” within a year after surgery
[18% prefer it to radiation treatment]
http://onlinelibrary.wiley.com/doi/10.1111/risa.12105/abstract
Ethical and practical implication
• If presenting the same statistical information
in different ways can change the decisions
clients make (and the perceptions they form),
then the statistical consultant must frame and
present the information in multiple ways to
avoid manipulating the outcome.
Descriptive analytics: Visualization
• Hans Rosling TED talk,
www.ted.com/talks/hans_rosling_shows_the_
best_stats_you_ve_ever_seen