Statistical Consulting - Cox Associates Consulting

Math 6330: Statistical Consulting
Class 2
Tony Cox
[email protected]
University of Colorado at Denver
Course web site: http://cox-associates.com/6330/
Student introductions
1. Name
2. Affiliation (academic, professional)
3. Technical interests
4. Any special expertise in data analysis areas
5. Any projects or data sets of interest
6. Thoughts on Assignment 1? (How does PM2.5
affect elderly mortality in this data set?)
7. Hopes and goals for course
Assignment 1
• Download data set Sample1.xlsx from
http://cox-associates.com/6330/
• Analyze the data to answer the following
client question: “Is there evidence that high
concentrations of fine particulate matter
(PM2.5) increase daily elderly mortality counts
(AllCause75)? If so, how large is the effect?”
• E-mail questions to [email protected]
Assignment 2 (Due January 31)
• Download data set Class2DataBenzene.xlsx from http://cox-associates.com/6330/
• Setting: 3 factories in China, many workers, some measured more
than once (“splits”).
• Some missing data. Detection limit for benzene in air is 0.2 ppm.
• Analyze the data to answer the following client question: “Is there
evidence that low concentrations of benzene in air (e.g., AB < 1)
produce disproportionately more toxic metabolites (PH, CA, HQ for
phenol, catechol, hydroquinone) or total urinary metabolites (UB)
than higher concentrations? What is the shape of the low-concentration
relation between AB and each metabolite?”
• Background: Who cares, and why?
http://retractionwatch.com/2013/04/08/environmental-scientists-call-for-retraction-of-oil-industry-funded-paper-on-benzene-exposure/
• E-mail questions to [email protected]
Reminder: Goals for student projects
• Extract good problems from available knowledge
and data
– “Good” = high value of analysis = large improvement
in decisions, results, etc.
• Apply high-value techniques to produce valuable
answers and insights
– Unexpected directions are ok!
• Present the results so that the potential value is
actually delivered
• If possible, document impact and next steps
Some high-value consulting tools –
Beyond clustering and regression
• Classification and regression trees (CART)
• Random Forest
• Bayesian networks
– Influence diagrams
• Predictive analytics
• Causal analytics
• State transition models
– Dynamic simulation modeling
– Markov Decision Processes (MDPs)
• Partially observable MDPs
• Simulation-optimization
Components of a successful project
1. Problem statement and motivation
2. Data
3. Analysis plan/narrative
4. Tools and software
5. Results: Reports and displays
6. Presentation: What did we learn?
7. Evaluation: What was the impact?
8. Proposed next steps
High-value statistical consulting
skills
Components of a consulting
engagement
1. Agreed-to problem statement or question
2. Understanding of why it matters, underlying
goals, decisions, or questions
3. Data that are relevant (maybe) for answering
the question
4. Methods: Tools, analyses, software
5. Results and interpretation. Caveats/limitations
6. Report to client (summarizes 1-5)
7. Proposed next steps (usually) – Builds on 1-6
Key steps in consulting
• Vision: Define and agree on success – goals
and measures
– What you measure is what you get
• Clarify objectives
• Generate alternatives
• Compare/evaluate alternatives
• Make recommendations, show why
• Evaluate performance
Toward higher-value analytics
Reorientation: From solving well-posed problems to
discovering how to act more effectively
1. Descriptive analytics: What’s happening?
2. Predictive analytics: What’s (probably) coming next?
3. Causal analytics: What can we do about it?
4. Prescriptive analytics: What should we do?
5. Evaluation analytics: How well is it working?
6. Learning analytics: How to do better?
7. Collaboration: How to do better together?
High-value statistical skills
• Describe current situation
• Predict what is likely to happen next if we do
not take action
• Predict what is likely to happen next if we take
different actions
• Optimize decisions about what to do
• Evaluate how well current policies are working
• Learn to improve current policies
Introduction to descriptive
analytics
Descriptive analytics:
What’s going on?
• What is the current situation?
– Attribution: How much harm/loss/opportunity cost is
being caused by X?
• Causes are often unobserved or uncertain
• What has changed recently? (Why?)
– Example: More extreme event reports caused by real
change or by media?
– Change-point analysis (CPA) algorithms
• What should we worry about?
– How is this year’s season shaping up?
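A minimal sketch of the change-point idea, assuming a single shift in the mean level of a series: try every possible break and keep the one that most reduces the within-segment sum of squared errors. The weekly counts and the names `sse` and `best_change_point` are hypothetical, and practical CPA algorithms add significance testing and handle multiple change points, which this sketch does not.

```python
from statistics import fmean

def sse(values):
    """Sum of squared deviations of `values` from their own mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)

def best_change_point(series, min_size=2):
    """Return (k, gain): the break index k that most reduces total SSE when
    the series is split into series[:k] and series[k:], and the reduction."""
    total = sse(series)
    best_k, best_gain = None, 0.0
    for k in range(min_size, len(series) - min_size + 1):
        gain = total - (sse(series[:k]) + sse(series[k:]))
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k, best_gain

# Hypothetical weekly event-report counts (illustrative only, not real data)
reports = [4, 5, 3, 4, 6, 5, 9, 10, 8, 11, 9, 10]
k, gain = best_change_point(reports)
print("estimated change point before index", k, "SSE reduction", round(gain, 2))
```

For these made-up counts the best break falls at index 6, where the level moves from about 4.5 to about 9.5. Whether such a shift reflects a real change or a reporting change is a separate question that the data alone may not answer.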
Air pollution example:
Classification tree descriptive analytics
• tmin, tmax, month, year,
MAXRH are potential
predictors of AllCause75
(elderly mortality)
• PM2.5 does not appear in
this tree
– AllCause75 is conditionally
independent of PM2.5 in
this analysis, given the
other variables in the tree
• Making year and month
into categorical variables
changes the tree but not
this conclusion.
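To make "conditionally independent, given the other variables" concrete, one can compare a raw association with the association inside strata of a third variable. The sketch below uses made-up daily records, not values from Sample1.xlsx; the fields `cold_day`, `high_pm`, and `deaths` and the helper `mean_diff` are hypothetical, and the data are deliberately constructed so that the within-stratum difference is exactly zero.

```python
from statistics import fmean

# Hypothetical records: (cold_day, high_pm, deaths). Illustrative only.
days = (
    [(True, True, d) for d in (14, 15, 13, 14, 15, 13)]    # cold, high PM2.5
    + [(True, False, d) for d in (14, 14)]                 # cold, low PM2.5
    + [(False, True, d) for d in (9, 9)]                   # warm, high PM2.5
    + [(False, False, d) for d in (9, 10, 8, 9, 10, 8)]    # warm, low PM2.5
)

def mean_diff(records):
    """Mean deaths on high-PM days minus mean deaths on low-PM days."""
    high = [d for _, pm, d in records if pm]
    low = [d for _, pm, d in records if not pm]
    return fmean(high) - fmean(low)

# Raw comparison, ignoring temperature
overall = mean_diff(days)

# Same comparison within each temperature stratum
by_stratum = {
    cold: mean_diff([r for r in days if r[0] == cold])
    for cold in (True, False)
}

print("overall difference:", overall)
print("difference within cold / warm days:", by_stratum)
```

In this contrived example, high-PM days show 2.5 more deaths on average overall, yet the difference vanishes within cold days and within warm days, because cold weather is associated with both higher PM2.5 and higher mortality. A tree that splits on temperature first would then find nothing left for PM2.5 to explain.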
How a CART tree works
• Basic idea: Always ask the most informative
question next, given answers so far.
– Questions are represented by splits in tree
– Leaf nodes show conditional means (or conditional
distributions) of dependent variable
– Internal nodes show significance level for split: how
significant are differences between conditional
distributions
• Reduces prediction error for dependent variable
• Stop this “recursive partitioning” when further
questions (splits in tree) do not significantly
improve prediction.
– Classification & Regression Tree (CART) algorithm
• Some refinements:
– Grow a large tree and prune back to minimize cross-validation error
– Fit multiple trees to random resamples of the data and let
them vote on predictions (“bagging”)
– Over-train on mis-predicted cases (“boosting”)
– Average predictions from many trees
(“RandomForest” ensemble prediction)
– Join prediction “patches” together smoothly (MARS)
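The "most informative question next" step can be sketched as a search over predictors and thresholds for the split that leaves the least prediction error in the two resulting groups. This is a minimal least-squares version for one split only; it is an assumption about how to illustrate the idea, not the significance-based splitting rule mentioned above, and the rows, the column names `tmin`, `pm25`, and `deaths`, and the function `best_split` are hypothetical.

```python
from statistics import fmean

def sse(ys):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not ys:
        return 0.0
    m = fmean(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, predictors, target):
    """Return (predictor, threshold, total_sse) for the single split whose
    two child groups have the smallest combined within-group SSE."""
    best = None
    for name in predictors:
        values = sorted({r[name] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[target] for r in rows if r[name] <= t]
            right = [r[target] for r in rows if r[name] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (name, t, score)
    return best

# Hypothetical daily records (illustrative only, not from the course data)
tmin = [20, 25, 30, 35, 60, 65, 70, 75]
pm25 = [12, 8, 15, 10, 9, 14, 11, 13]
deaths = [14, 15, 13, 14, 9, 10, 8, 9]
rows = [{"tmin": a, "pm25": b, "deaths": c} for a, b, c in zip(tmin, pm25, deaths)]

name, threshold, score = best_split(rows, ["tmin", "pm25"], "deaths")
print("first split:", name, "<=", threshold, "remaining SSE:", score)
```

A full tree would apply `best_split` again inside each child group and stop when further splits no longer improve prediction. With these made-up rows the first question is about `tmin`, and no split on `pm25` comes close, which mirrors how a predictor can be absent from a fitted tree.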
Bayesian Networks (BNs) show
information relations among variables
• BNs provide a high-level roadmap
for descriptive analytics
• Each node has a conditional
probability table (CPT) (or
regression model, CART tree, etc.)
describing how the conditional
probabilities of its values depend
on other variables.
• If no arrow connects two variables,
then they are conditionally
independent of each other, given
the other variables in the BN.
– Omitted variables can create
statistical dependencies
– Conditioning on variables can also
sometimes create dependencies
• Information principle for causality:
Causes are not conditionally
independent of their effects.
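As a small illustration of CPTs and of how conditioning can create dependencies, consider a hypothetical three-node network X → Z ← Y with binary variables. All probabilities below are invented for the example. X and Y have no arrow between them and are independent marginally, but once their common effect Z is observed, learning Y changes the probability of X.

```python
from itertools import product

# Hypothetical probability tables (illustrative numbers only)
p_x = {0: 0.7, 1: 0.3}                      # P(X)
p_y = {0: 0.6, 1: 0.4}                      # P(Y)
p_z1 = {(0, 0): 0.05, (0, 1): 0.8,          # CPT: P(Z=1 | X, Y)
        (1, 0): 0.8, (1, 1): 0.95}

def joint(x, y, z):
    """Joint probability from the BN factorization P(X) P(Y) P(Z | X, Y)."""
    pz = p_z1[(x, y)] if z == 1 else 1 - p_z1[(x, y)]
    return p_x[x] * p_y[y] * pz

def prob(**fixed):
    """Probability of the event where the named variables take the given values."""
    total = 0.0
    for x, y, z in product((0, 1), repeat=3):
        assignment = {"x": x, "y": y, "z": z}
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(x, y, z)
    return total

# Without conditioning on Z, Y carries no information about X
p_x1_given_y1 = prob(x=1, y=1) / prob(y=1)

# After observing Z=1, Y does carry information about X
p_x1_given_z1 = prob(x=1, z=1) / prob(z=1)
p_x1_given_y1_z1 = prob(x=1, y=1, z=1) / prob(y=1, z=1)

print(round(p_x1_given_y1, 3), round(p_x1_given_z1, 3), round(p_x1_given_y1_z1, 3))
```

Here P(X=1 | Y=1) equals the prior P(X=1) = 0.3, while P(X=1 | Z=1) is about 0.51 and drops to about 0.34 once Y=1 is also known. Y "explains away" the observed effect, so X and Y become dependent given Z.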