Statistical Consulting - Cox Associates Consulting

Math 6330: Statistical Consulting
Class 10
Tony Cox
[email protected]
University of Colorado at Denver
Course web site: http://cox-associates.com/6330/
Course schedule
• April 14: Draft of project/term paper due
• April 18, 25, May 2: In-class presentations
• May 2: Last class
• May 4: Final project/paper due by 8:00 PM
Newsvendor problem: Analysis
• How many papers to stock?
– Each costs $k
– Each sells for $r
– Number sold = min(stock, demand)
– Demand is uncertain, with CDF F
• Analytic solution:
– Profit if stock = a and demand = s is r*min(a, s) - a*k
– Optimization: Marginal benefit (revenue) = marginal cost
– r(1 - F(a)) = k, so F(a*) = 1 - k/r
• r(1 - F(a)) - k = expected additional (marginal) profit from buying one
more paper = 0 at optimum.
• F(a) = probability that the marginal (a-th) paper remains unsold
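The critical-fractile rule (writing k for the cost and r for the selling price, stock the smallest a with F(a) ≥ 1 − k/r) can be checked numerically. A minimal Python sketch; all numbers are invented for illustration (k = $0.60, r = $1.00, Poisson demand with mean 50):

```python
import math

# Hypothetical numbers (not from the slides): each paper costs k = $0.60,
# sells for r = $1.00, and demand is Poisson with mean 50.
k, r, mean_demand = 0.60, 1.00, 50

def poisson_cdf(a, lam):
    """F(a) = P(demand <= a) for Poisson(lam) demand."""
    return sum(math.exp(-lam) * lam**s / math.factorial(s) for s in range(a + 1))

# Critical-fractile rule: stock the smallest a with F(a) >= 1 - k/r.
target = 1 - k / r
a_star = 0
while poisson_cdf(a_star, mean_demand) < target:
    a_star += 1

print(a_star)  # optimal number of papers to stock
```

The loop stops at the first stock level whose demand CDF reaches the critical fractile 1 − k/r, which is exactly the marginal-profit-equals-zero condition above.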
Statistical estimation:
Proportions
Estimating a proportion from data
• Applications: Drug trials, A/B testing, etc.
• Point estimate and confidence intervals
– Maximum likelihood estimate of p = x/n
– Least squares estimate of p is also x/n
• Confidence intervals
– Normal approximation:
x/n ± z_{α/2}*{(1/n)(x/n)[1 – (x/n)]}^(1/2)
– Exact confidence interval
• Bayesian
– Beta posterior distribution
Exact confidence interval for a
proportion based on data
• Finds the largest and smallest values of p that are
compatible with observed evidence x/n.
• Generated by binom.test(x, n) (two-sided) or
binom.test(x, n, alternative = "greater") or
binom.test(x, n, alternative = "less") (one-sided)
– Two-sided test puts half the significance level (denoted
by α = 1 - confidence level) in each tail.
– One-sided puts all of α in one tail
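The search for the largest and smallest compatible values of p can be done by bisection on the binomial CDF. A hedged Python sketch using only the standard library (binom.test in R remains the authoritative implementation):

```python
import math

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def exact_ci(x, n, conf=0.95):
    """Two-sided Clopper-Pearson interval, found by bisection on p."""
    alpha = 1 - conf

    def solve(f):
        # f(p) is true for small p and false for large p; bisect the boundary.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: p at which P(X >= x | p) rises to alpha/2
    lower = 0.0 if x == 0 else solve(lambda p: 1 - binom_cdf(x - 1, n, p) < alpha / 2)
    # Upper bound: p at which P(X <= x | p) falls to alpha/2
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) >= alpha / 2)
    return lower, upper

lo, hi = exact_ci(5, 20)
print(lo, hi)  # close to R's binom.test(5, 20) interval [0.0866, 0.4910]
```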
Example
• Evidence: Suppose that 20 components are
randomly sampled and that 5 are observed to
be defective (non-conforming).
• What can we infer about the true proportion of
defects?
Maximum likelihood estimate (MLE): x/n maximizes
likelihood L = C*p^x*(1 - p)^(n-x)
• "Best" (MLE) point estimate of p based on x = 5
and n = 20 is p = 5/20 = 0.25
– Maximizes L = C(20,5)*p^5*(1-p)^15
• R code for plotting the likelihood (up to the constant) against p:
p = c(0:100)/100;
y = (p^5)*(1-p)^15;
plot(p,y)
> y[25]
[1] 1.297956e-05
> y[26]  # p[26] = 0.25
[1] 1.305025e-05
> y[27]
[1] 1.298203e-05
Exact confidence interval
• Intuition: Observed evidence x/n = 5/20 is
very unlikely if p is too large or too small
– p = c(0:100)/100; y = 1 - pbinom(4, 20, p)
– z = pbinom(5, 20, p); plot(p, y); plot(p, z)
– Plots show y = P(x ≥ 5 | p) and z = P(x ≤ 5 | p) against p
Exact confidence interval
• For an exact 95% confidence interval, find
– Upper bound: Largest value of p for which z =
pbinom(5, 20, p) = Pr(x ≤ 5 | p) = 0.05/2
– Lower bound: Smallest value of p for which y =
1 - pbinom(4, 20, p) = Pr(x ≥ 5 | p) = 0.05/2
– The interval between these is the exact 95%
confidence interval
• In R, can just use binom.test(x, n, p)
Exact confidence interval
calculation in R
• binom.test(5, 20, 0.25)
Exact binomial test
data: 5 and 20
number of successes = 5, number of trials = 20, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.25
95 percent confidence interval:
0.08657147 0.49104587
sample estimates:
probability of success
0.25
Example: Broken windows
• Randomly sample 5 windows in a large
housing development
• All 5 have a construction defect that allows
water penetration
• What can we infer about the true proportion
of defective windows?
Estimating a proportion using a
binomial distribution
• Data: 5 defects in 5 trials
• What can we infer about the true proportion of
defects, p?
• P(5 defects in 5 trials | Pr(defect) = p) = p^5.
• Observed outcome, 5/5, has no more than 0.05
probability of occurring by chance if p^5 < 0.05, i.e., if p <
(0.05)^(1/5) = 0.5493.
• So, 0.5493 is an exact 95% lower confidence limit on
the proportion of windows with defects
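The bound can be verified directly; a one-line Python check:

```python
# The cutoff is the p whose fifth power equals 0.05:
# below it, observing 5/5 defects has probability under 0.05.
p_lower = 0.05 ** (1 / 5)
print(round(p_lower, 4))  # 0.5493
```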
Exact confidence interval in R
• Data: 5 defects in 5 trials
• binom.test(5, 5) does not produce what we want,
but returns a 95% exact confidence interval of
[0.4782, 1]
• Reason: The default is a two-sided 95% confidence
interval
• To get a one-sided interval, use binom.test(5, 5,
alternative = "greater")
• This returns [0.5493, 1], as expected.
An alternative: Beta posteriors for
proportions
• If the prior distribution for unknown binomial
proportion is a beta distribution (e.g.,
uniform)…
• And data show x successes in n binomial
trials…
• Then posterior distribution, conditioned on
this evidence, is also a beta distribution
– Beta family provides “conjugate prior”
distributions for binomial sampling
Beta distribution,
pbeta(q, x, n-x)
• A distribution for proportions or probabilities
– Concentrated on interval from 0 to 1
– E(X) = x/n (think of x successes in n trials)
– Var(X) = x(n – x)/[n*n*(n+1)]
– Uniform: dbeta(p, 1, 1) = 1
• Example: P(X < 0.29 | n = 10, x = 4) =
pbeta(0.29, 4, 6) = 0.24781
• Example: P(X < 0.52 | n = 10, x = 4) =
pbeta(0.52, 4, 6) = 0.7838555
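For integer shape parameters, pbeta can be cross-checked through the standard identity linking the beta CDF to a binomial tail. A small Python illustration of the two examples above:

```python
import math

# For integer shape parameters, the beta CDF equals a binomial tail:
#   pbeta(q, a, b) = P(Binomial(a + b - 1, q) >= a)
def beta_cdf_int(q, a, b):
    n = a + b - 1
    return sum(math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(a, n + 1))

print(beta_cdf_int(0.29, 4, 6))  # close to pbeta(0.29, 4, 6) = 0.24781
print(beta_cdf_int(0.52, 4, 6))  # close to pbeta(0.52, 4, 6) = 0.78386
```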
Bayesian estimation of proportion
• Choose beta prior distribution for p
– Initial x and n reflect any prior evidence or belief
– Common choice is uniform prior, dbeta(p,1,1)
• Update with data (by conditioning via Bayes’ rule for
continuous variables)
• Result is a posterior beta distribution with updated n
and x
– If prior was uniform, posterior has x + 1, n + 2
– Posterior mean = (x + 1)/(n + 2) based on x successes in n
trials
Bayesian estimation of proportion:
Broken windows
• Given x = 5 defects in n = 5 trials, the Bayesian
posterior mean probability of defects = (x +
1)/(n + 2) = 6/7 = 0.857
• This is “the probability” that a randomly
selected unit will have a defect
– All probabilities are conditional
– Here, 0.857 is the posterior probability after
conditioning a uniform prior (U[0, 1] with mean
0.5) on the data x = n = 5.
Bayesian estimation of proportion:
Example with n = 20, x = 5
• Choose uniform prior
– dbeta(p, 1, 1); in the pbeta(q, x, n-x) notation, x = 1, n = 2
– x = c(0:100)/100
– y = dbeta(x, 1, 1)
• Result is a posterior beta distribution with
updated n = 22 and x = 6
– z = dbeta(x, 6, 16)
Bayesian estimation of proportion:
Example with n = 20, x = 5
• Prior and posterior pdfs, plotted against x (uniform prior density y and beta posterior density z)
Wrap-up on Bayesian estimation of
proportions
• Posterior distribution characterizes
uncertainty about the true proportion
– Depends on prior and on data
– For binomial sampling, posterior is beta if prior is
beta. (The beta family is conjugate for binomial sampling.)
• Provides much more information than a
confidence interval
• Leads to simple, intuitive formula for point
estimates: (x + 1)/(n + 2)
Hypothesis testing
Statistical inference: Hypothesis
testing using the sign test
• 20 participants in a taste test experiment try
each of two drinks in random order
– Assume a random sample
• 15 of them prefer A to B.
• Statistical hypothesis testing: What can we
justifiably infer?
– Can we confidently conclude from this evidence
that most people prefer A to B?
• Prescriptive analysis: What should we do?
Statistical inference: Sign test
• Null hypothesis: No difference.
– Random participant is equally likely to prefer
either.
• Simple null hypothesis: P(prefer A to B) = 0.5
• Composite null hypothesis: P(prefer A to B) ≤ 0.5
• Test of null hypothesis: How likely is it that 15
or more of 20 subjects would prefer A to B by
chance, if null hypothesis is true?
– One-sided test
Statistical inference: Sign test
• P(at least 15 out of 20 choose A over B if p =
0.5) = 1 - pbinom(14, 20, 0.5) = 0.02.
– Logic: Quantify P(≥ observed outcome | H0)
• So, we can be 98% confident that this event did
not happen by chance.
– Significance level = 0.02 = 1 – confidence level
• It is likely, based on this evidence, that most
people really do prefer A to B.
• (Decision analysis: But is it a good bet?)
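The tail probability 1 - pbinom(14, 20, 0.5) can be computed by hand from the binomial formula; a quick Python cross-check:

```python
import math

# Sign test tail probability: P(at least 15 of 20 prefer A | p = 0.5),
# the same quantity R gives via 1 - pbinom(14, 20, 0.5).
p_value = sum(math.comb(20, k) for k in range(15, 21)) * 0.5**20
print(round(p_value, 4))  # 0.0207
```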
Sign test using binom.test in R
• Even a two-sided test lets us reject the null
hypothesis (since 95% CI is above 0.5)
> binom.test(15, 20)
Exact binomial test
data: 15 and 20
number of successes = 15, number of trials = 20, p-value = 0.04139
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5089541 0.9134285
sample estimates:
probability of success
0.75
Sign test for time series
• Consider a time series of counts by quarter of
year (for car accidents, crimes, heart attack
deaths, etc.)
• 6, 4, 4, 7, *, 4, 2, 2, 5 (* = intervention time)
• Intervention to reduce counts is made at start
of period 5 (quarter 1 of year 2)
• Did it work? (Did counts drop significantly?)
Sign test for time series
• Time series of counts by quarter of year (for car
accidents, crimes, heart attack deaths, etc.):
6, 4, 4, 7, *, 4, 2, 2, 5 (* = intervention time)
• Intervention at start of period 5
• Did it work? (Did counts drop significantly?)
• Each value after intervention < lagged value a year
earlier, so p = (1/2)^4 = 1/16 = 0.0625
– Something probably worked (confidence 94%)
– Can’t tell from data alone whether it was intervention
• Need a control group
Summary on sign test
• Nonparametric statistical test (or decision) of
“no difference” hypothesis
– No specific statistical model for data values needs
to be known or assumed
• Depends only on comparing which member of
each pair is greater
– No need for precise measurement or large
differences
• Simple to calculate using binomial
Logic and vocabulary of statistical
hypothesis testing
• Formulate a null hypothesis to be tested, H0
– H0 is “What you are trying to reject”
– If true, H0 determines a probability distribution for
the test statistic (a function of the data)
• Choose α = significance level for test = P(reject
null hypothesis H0 | H0 is true)
• Decision rule: Reject H0 if and only if the test
statistic falls in a critical region of values that are
unlikely (p < α) if H0 is true.
Hypothesis testing picture
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_test_1.htm
Interpretation of hypothesis test
• Either something unlikely has happened
(having probability p < α, where p = P(test
statistic has observed or more extreme value |
H0 is correct)), or H0 is not true.
• It is conventional to choose a significance level
of α = 0.05, but other values may be chosen to
minimize the sum of costs of type 1 error
(falsely reject H0) and type 2 error (falsely fail
to reject H0).
Neyman-Pearson Lemma
• How to minimize Pr(type 2 error), given α?
• Answer: Reject H0 in favor of HA if and only if
P(data | HA)/P(data | H0) > k, for some
constant k
– The ratio LR = P(data | HA)/P(data | H0) is called
the likelihood ratio
– With independent samples, P(data | H) = product
of P(xi | H) values for all data points xi
– k is determined from α.
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_neyman_pearson.htm
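As a toy illustration of the lemma's likelihood ratio (all numbers invented, not from the slides):

```python
# Toy Neyman-Pearson computation: n = 20 Bernoulli trials,
# H0: p = 0.5 versus HA: p = 0.75, observed x = 15 successes.
# With independent observations, P(data | H) is a product of Bernoulli
# terms, and the binomial coefficient cancels from the ratio.
p0, pA, n, x = 0.5, 0.75, 20, 15
LR = (pA**x * (1 - pA)**(n - x)) / (p0**x * (1 - p0)**(n - x))
print(LR)  # about 13.7; reject H0 if this exceeds the k implied by alpha
```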
Statistical decision theory:
Key ideas
• Statistical inference from data can be formulated in
terms of decision problems
• Point estimation: Minimize expected loss from error,
given a loss function
– Implies using posterior mean if loss function is quadratic
(mean squared error)
– Implies using posterior median if loss function is absolute
value of error
• Hypothesis testing: Minimize total expected loss =
loss from false positives + loss from false negatives +
sampling costs
Updating normal distributions
Updating normal distributions
• Probability model: N(m, s^2); pnorm(x, m, s)
• Initial uncertainty about input m is modeled by
a normal prior with parameters m0, s0
– Prior N(x, m0, s0) has mean m0
• Observe data: x1 = sample mean of n1
independent observations
• Posterior uncertainty about m: N(m*, s*^2), with m* =
w*m0 + (1 - w)*x1 and s* = sqrt(w*s0^2)
• w = (s^2/n1)/(s^2/n1 + s0^2) = 1/(1 + n1*s0^2/s^2)
Bayesian updating of normal
distributions (Cont.)
• Posterior uncertainty about m: N(m*, s*^2), with m* =
w*m0 + (1 - w)*x1 and s* = sqrt(w*s0^2)
• w = (s^2/n1)/(s^2/n1 + s0^2) = 1/(1 + n1*s0^2/s^2)
• Let's define an "equivalent sample size," n0, for
the prior, as follows: s0^2 = s^2/n0.
• Then w = n0/(n0 + n1), and the posterior is N(m*, s*^2)
– m* = (n0*m0 + n1*x1)/(n0 + n1)
– s* = sqrt(s^2/(n0 + n1))
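The update above fits in a few lines of code. A Python sketch with illustrative numbers (not from the slides): prior N(3, 2^2), sample mean 5 from n1 = 4 observations with known s = 2:

```python
import math

# Normal-normal update for the mean, with known sampling sd s.
def update_normal(m0, s0, xbar, n1, s):
    """Posterior (m*, s*) given prior N(m0, s0^2) and a sample mean
    xbar of n1 independent observations with known sd s."""
    w = 1 / (1 + n1 * s0**2 / s**2)   # weight on the prior mean
    m_star = w * m0 + (1 - w) * xbar
    s_star = math.sqrt(w * s0**2)
    return m_star, s_star

# Illustrative numbers: prior N(3, 4), xbar = 5, n1 = 4, s = 2.
# Equivalent sample size n0 = s^2/s0^2 = 1, so w = 1/5.
m_star, s_star = update_normal(3, 2, 5, 4, 2)
print(m_star, s_star)  # m* = 4.6, s* = sqrt(0.8) ~ 0.894
```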
Predictive distributions
• How to predict probabilities when the inputs to
probability models (p for binom, m and s for
pnorm, etc.), are uncertain?
• Answer 1: Find posterior by Bayesian
conditioning of prior on data.
• Answer 2: Use simulation to sample from
distribution of inputs. Calculate conditional
probabilities from model, given sampled inputs.
Average them to get final probability.
Example: Predictive normal
distribution
• If the posterior distribution is N(m*, s*^2), then the
predictive distribution is N(m*, s^2 + s*^2)
• Mean is just posterior mean, m*
• Total uncertainty (variance) in prediction =
sum of variance around the (true but
uncertain) mean and variance of the mean
Example: Exact vs. simulated
predictive normal distributions
• Model: N(m, 1) with m ~ N(3, 4)
• Exact predictive dist.: N(m*, s^2 + s*^2) = N(3, 5)
• Simulated predictive dist.: mean 2.99, variance 5.077
> m = y = NULL; m = rnorm(10000, 3, 2); mean(m); sd(m)^2
[1] 3.000202
[1] 4.043804
> for (j in 1:10000) {y[j] = rnorm(1, m[j], 1)}; mean(y); sd(y)^2
[1] 2.993081
[1] 5.077026
Simulation: The main idea
• To quantify Pr(outcome), create a model for
Pr(outcome | inputs) and Pr(inputs).
– Pr(input) = joint probability distribution of inputs
• Sample values from Pr(input)
– Use R's random-number functions (rnorm, rbinom, etc.)
• Create indicator variable for outcome
– 1 if it occurs on a run, else 0
• Mean value of indicator variable = Pr(outcome)
Bayesian inference via simulation: Mary
revisited
• Pr(test is positive | disease) = 0.95
• Pr(test is negative | no disease) = 0.90
• Pr(disease) = 0.03
• Find P(disease | test is positive)
• Answer from Bayes' Rule: 0.2270916
• Answer by simulation (conditioned on Test_result = Positive, 100%):
Disease_state: Yes 22.7%, No 77.3%
# Initialize variables
disease_status = test_result = test_result_if_disease = test_result_if_no_disease = NULL
n = 100000
# Simulate disease state and test outcomes
disease_status = rbinom(n, 1, 0.03)
test_result_if_disease = rbinom(n, 1, 0.95)
test_result_if_no_disease = rbinom(n, 1, 0.10)
test_result = disease_status*test_result_if_disease + (1 - disease_status)*test_result_if_no_disease
# Calculate and report desired conditional probability
sum(disease_status*test_result)/sum(test_result)
[1] 0.2263892
Wrap-up on probability models
• Highly useful for estimating probabilities in
many standard situations
– Pr(0 arrivals in h hours) if mean arrival rate is
known
– Conservative estimates for proportions
• Useful for showing uncertainty about
probabilities using Bayes’ Rule
– Beta posterior distribution for proportions
Binomial models for statistical
quality control decisions:
Sequential and adaptive
hypothesis-testing
Quality control decisions
• Observe data, decide what to do
– Intervene in process, accept or reject lot
• P-chart for quality control of process
– For attributes (pass/fail, conform/not conform)
• Lot acceptance sampling
– Accept or reject lot based on sample
• Adaptive sampling
– Sequential probability ratio test (SPRT)
“Rule of 3”: Using the binomial model
to bound probabilities
• If no failures are observed in N binomial trials,
then how large might the failure probability
be?
• Answer: At most 3/N
– 95% upper confidence limit
– Derivation: If failure probability is p, then the
probability of 0 failures in N trials is (1 - p)^N.
– (1 - p)^N ≥ 0.05 ⇔ ln(1 - p) ≥ ln(0.05)/N = -2.9957/N
⇒ -p ≥ -3/N ⇒ p ≤ 3/N, using log(1 − x) ≈ −x for small x
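A quick Python comparison of the rule of 3 against the exact 95% upper limit (inverting (1 − p)^N = 0.05 gives p = 1 − 0.05^(1/N)):

```python
# The exact 95% upper limit after 0 failures in N trials is 1 - 0.05**(1/N);
# the rule of 3 replaces it by the slightly conservative 3/N.
for N in (10, 30, 100, 300):
    exact = 1 - 0.05 ** (1 / N)
    print(N, round(exact, 4), round(3 / N, 4))
```

The approximation is always on the safe (conservative) side, and the gap shrinks as N grows.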
P-chart: Pay attention if process exceeds
upper control limit (UCL)
http://www.centerspace.net/blog/nmath-stats/nmath-stats-tutorial/statistical-quality-control-charts/
Decision analysis: Set UCL to minimize average cost of type 1
(false reject) and type 2 (false accept) errors
Lot acceptance sampling
(by attributes, i.e., pass/fail inspections)
• Take sample of size n
• Count non-conforming (fail) items
• Accept if number is below threshold; reject if
it is above
• Optimize choice of n and threshold to
minimize expected total costs
– Total cost = cost of sampling + cost of erroneous
decisions
Lot acceptance sampling:
Inputs and outputs
http://www.minitab.com/en-US/training/tutorials/accessing-the-power.aspx?id=1688
Zero-based acceptance sampling plan
calculator
Squeglia Zero-Based Acceptance Sampling Plan Calculator
• Inputs:
– Batch/lot size (N): the number of items in the batch (lot)
– AQL: the Acceptable Quality Level. If no AQL is contractually
specified, an AQL of 1.0% is suggested
http://www.sqconline.com/squeglia-zero-based-acceptance-sampling-plan-calculator
Zero-based acceptance sampling plan
calculator
Squeglia Zero-Based Acceptance Sampling Plan (Results)
For a lot of 91 to 150 items and AQL = 10.0%, the Squeglia zero-based
acceptance sampling plan is:
• Sample 5 items.
• If the number of non-conforming items is 0, accept the lot;
if it is 1 or more, reject the lot.
This plan is based on DCMA (Defense Contract Management Agency)
recommendations.
http://www.sqconline.com/squeglia-zero-based-acceptance-sampling-plan-calculator
Multi-stage lot acceptance sampling
• Take sample of size n
• Count non-conforming (fail) items
• Accept if number is below threshold 1; reject
if it is above threshold 2; sample again if
between the thresholds
– For single-sample decisions, thresholds 1 and 2
are the same
• Optimize choice of n and thresholds to
minimize expected total costs
Decision rules for adaptive binomial sampling:
Sequential probability ratio test (SPRT)
Intuition: The expected slope
of the cumulative-defects line
is the average proportion of
defectives. This is just the
probability of defective (non-conforming) items in a
binomial sample.
Simulation-optimization (or
math) can identify optimal
slopes and intercepts to
minimize expected total cost
(of sampling + type 1 and type
2 errors)
http://www.stattools.net/SeqSPRT_Exp.php
Generalizations of SPRT
• Main ideas apply to many other (non-binomial) problems
• SPRT decision rule: Use data to compute the likelihood ratio LRt
= P(ct | HA)/P(ct | H0).
• If LRt > (1 – β)/α then stop and reject H0
• If LRt < β/(1 – α) then stop and accept H0
• Else continue sampling
– ct = number of adverse events by time t
– H0 = null hypothesis (process has acceptably small defect rate); HA =
alternative hypothesis
– α = false rejection rate for H0 (type 1 error rate)
– β = false acceptance rate for H0 (type 2 error rate)
http://www.tandfonline.com/doi/pdf/10.1080/07474946.2011.539924?noFrame=true
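The decision rule fits in a short loop. A minimal Python sketch for Bernoulli (defect / no defect) observations, with Wald's thresholds (1 − β)/α and β/(1 − α); all parameter values are invented for illustration:

```python
import math

# SPRT for Bernoulli observations: H0: p = p0 versus HA: p = pA.
def sprt(observations, p0=0.05, pA=0.20, alpha=0.05, beta=0.10):
    log_A = math.log((1 - beta) / alpha)   # reject H0 at or above this
    log_B = math.log(beta / (1 - alpha))   # accept H0 at or below this
    log_lr = 0.0
    for i, x in enumerate(observations, 1):
        # Per-observation log likelihood ratio for a Bernoulli outcome x
        if x:
            log_lr += math.log(pA / p0)
        else:
            log_lr += math.log((1 - pA) / (1 - p0))
        if log_lr >= log_A:
            return ("reject H0", i)
        if log_lr <= log_B:
            return ("accept H0", i)
    return ("continue sampling", len(observations))

print(sprt([1, 1, 1]))  # three defects in a row: reject quickly
print(sprt([0] * 40))   # a long defect-free run: accept H0
```

Note how small the sample sizes are at the stopping times, which is the practical appeal of the SPRT.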
Implementing the SPRT
• Optimal slopes and intercepts to achieve different
combinations of type 1 and type 2 errors are tabulated.
Example application: Testing
for mean time to failure
(MTTF) of electronic
components
Decision rules for adaptive binomial sampling:
Sequential probability ratio test (SPRT)
http://www.sciencedirect.com/science/article/pii/S0022474X05000056
Application: SPRT for deaths from hospital
heart operations
http://www.bmj.com/content/328/7436/375?ijkey=144017772645bb38936abd6f209cd96bfd1930c3&keytype
2=tf_ipsecsha&linkType=ABST&journalCode=bmj&resid=328/7436/375
SPRT can greatly reduce sample sizes (e.g., from
hundreds to 5, for construction defects)
http://www.sciencedirect.com/science/article/pii/S0022474X05000056
Nonlinear boundaries and truncated stopping
rules can refine the basic idea
http://www.sciencedirect.com/science/article/pii/S0022474X05000056
Wrap-up on SPRT
• Sequential and adaptive sampling can reduce total
decision costs (costs of sampling + costs of error)
• Computationally sophisticated (and challenging)
algorithms have been developed to approximately
optimize decision boundaries for statistical decision
rules
• Adaptive approaches are especially valuable for
decisions in uncertain and changing environments.
Multi-arm bandits and adaptive
learning
Multi-arm bandit (MAB) decision problem:
Comparing uncertain reward distributions
• Multi-arm bandit (MAB) decision problem: On each
turn, can select any of k actions
– Context-dependent bandit: Get to see a “context” (signal)
x before making decision
• Receive random reward with (initially unknown)
distribution that depends on the selected action
• Goal: Maximize sum (or discounted sum) of rewards;
minimize regret (= expected difference between best
(if distributions were known) and actually received
cumulative rewards)
http://jmlr.org/proceedings/papers/v32/gopalan14.pdf Gopalan et al., 2014
https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/
MAB applications
• Clinical trials: Compare old drug to new.
Which has higher success rate?
• Web advertising, A/B testing: Which version
of a web ad maximizes click-through,
purchases, etc.?
• Public policies: Which policy best achieves its
goals?
– Use evidence from early adopter locations to
inform subsequent choices
Upper confidence bound (UCB1)
algorithm for solving MAB
• Try each action once.
• For each action a, record average reward m(a) obtained
from it so far and how many times it has been tried, n(a).
• Let N = Σa n(a) = total number of actions so far.
• Choose next the action with the greatest upper confidence
bound (UCB): m(a) + sqrt(2*log(N)/n(a))
– Implements “Optimism in the face of uncertainty” principle
– UCB for a decreases quickly with n(a), increases slowly with N
– Achieves theoretical optimum: logarithmic growth in regret
• Same average increase in first 10 plays as in next 90, then next 900, and so on
– Requires updating after each round (not batch updating)
Auer et al., 2002 http://homes.dsi.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
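The steps above can be sketched directly. A minimal Python implementation of UCB1 for a Bernoulli bandit (the arm success probabilities are invented for illustration):

```python
import math, random

# UCB1 for a Bernoulli bandit: try each arm once, then always play the
# arm with the largest upper confidence bound m(a) + sqrt(2*log(N)/n(a)).
def ucb1(true_probs, rounds=5000, seed=0):
    rng = random.Random(seed)
    k = len(true_probs)
    n = [0] * k        # times each arm has been tried
    total = [0.0] * k  # total reward obtained from each arm
    for a in range(k):            # try each action once
        total[a] += rng.random() < true_probs[a]
        n[a] += 1
    for _ in range(rounds - k):
        N = sum(n)
        # Choose the arm with the greatest upper confidence bound
        a = max(range(k),
                key=lambda i: total[i] / n[i] + math.sqrt(2 * math.log(N) / n[i]))
        total[a] += rng.random() < true_probs[a]
        n[a] += 1
    return n

counts = ucb1([0.3, 0.5, 0.7])
print(counts)  # the 0.7 arm should receive most of the plays
```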
Thompson sampling and adaptive
Bayesian control: Bernoulli trials
• Basic idea: Choose each of the k actions
according to the probability that it is best
• Estimate the probability via Bayes’ rule
– It is the mean of the posterior distribution
– Use beta conjugate prior updating for “Bernoulli
bandit” (0-1 reward, fail/succeed)
http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf Agrawal and Goyal, 2012
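For the Bernoulli bandit, the beta-conjugate update makes Thompson sampling very short. A hedged Python sketch (uniform Beta(1, 1) priors; arm probabilities invented for illustration):

```python
import random

# Thompson sampling for a Bernoulli bandit with Beta(1, 1) (uniform) priors:
# sample each arm's success probability from its posterior, play the arm
# with the largest draw, then update that arm's beta parameters.
def thompson(true_probs, rounds=5000, seed=0):
    rng = random.Random(seed)
    k = len(true_probs)
    s = [1] * k        # beta "success" parameter (1 = uniform prior)
    f = [1] * k        # beta "failure" parameter
    counts = [0] * k
    for _ in range(rounds):
        draws = [rng.betavariate(s[a], f[a]) for a in range(k)]
        a = max(range(k), key=lambda i: draws[i])
        reward = rng.random() < true_probs[a]
        s[a] += reward       # success: Beta(s + 1, f)
        f[a] += 1 - reward   # failure: Beta(s, f + 1)
        counts[a] += 1
    return counts

counts = thompson([0.3, 0.5, 0.7])
print(counts)  # the 0.7 arm should be chosen most often
```

Because the arm choice stays random, the algorithm keeps exploring between updates, which is the property the batch-updating comparison below relies on.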
Thompson sampling: General stochastic
(random) rewards
• Second idea: Generalize from Bernoulli bandit
to arbitrary reward distribution (normalized to
the interval [0, 1]) by considering a trial a
“success” with probability equal to its reward
http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf Agrawal and Goyal, 2012
Thompson sampling with complex
online actions
• Applications include job
scheduling (assigning jobs to
machines) and web
advertising with reward
depending on subsets of ads
displayed
• Updating posteriors can be
done efficiently using a
sampling-based approach
(particle filtering)
Gopalan et al., 2014 http://jmlr.org/proceedings/papers/v32/gopalan14.pdf
Comparing methods
• In simulation experiments,
Thompson sampling works well
with batch updating, even with
slowly or occasionally changing
rewards and other realistic
complexities.
• Beats UCB1 in many but not all
comparisons
• More practical than UCB1 for
batch updating because it keeps
experimenting (trying actions
with some randomness) between
updates.
http://engineering.richrelevance.com/recommendations-thompson-sampling/