Math 6330: Statistical Consulting
Class 10
Tony Cox
[email protected]
University of Colorado at Denver
Course web site: http://cox-associates.com/6330/

Course schedule
• April 14: Draft of project/term paper due
• April 18, 25, May 2: In-class presentations
• May 2: Last class
• May 4: Final project/paper due by 8:00 PM

Newsvendor problem: Analysis
• How many papers to stock?
  – Each costs $k
  – Each sells for $p
  – Number sold = min(stock, demand)
  – Demand is uncertain, with CDF F
• Analytic solution:
  – Profit if stock = a and demand = s is p*min(a, s) - a*k
  – Optimization: Marginal benefit (revenue) = marginal cost
  – p*(1 - F(a)) = k, so the optimal stock a* satisfies F(a*) = 1 - k/p
• p*(1 - F(a)) - k = expected additional (marginal) profit from buying one more paper = 0 at the optimum
• F(a) = probability that the marginal paper remains unsold
www.old-picture.com/united-states-history-1900s---1930s/Evening-newsboy-buying-from.htm
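To make the critical-fractile result concrete, here is a minimal R sketch (not from the slides) that checks F(a*) = 1 - k/p by brute force. The cost k = 0.5, price p = 1.0, and Poisson(50) demand are illustrative assumptions.

# Illustrative newsvendor check (assumed values: k = 0.5, p = 1.0, demand ~ Poisson(50))
k <- 0.5          # unit cost (assumption)
p <- 1.0          # unit selling price (assumption)
lambda <- 50      # mean demand (assumption)

# Expected profit for a candidate stock level a
expected_profit <- function(a) {
  s <- 0:200                                  # demand values with non-negligible probability
  sum(dpois(s, lambda) * (p * pmin(a, s) - a * k))
}

a_grid <- 0:100
profits <- sapply(a_grid, expected_profit)
a_grid[which.max(profits)]   # brute-force optimum
qpois(1 - k/p, lambda)       # critical-fractile solution F(a*) = 1 - k/p

The two answers should agree (a* = 50 for these assumed numbers).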
Statistical estimation: Proportions

Estimating a proportion from data
• Applications: Drug trials, A/B testing, etc.
• Point estimate and confidence intervals
  – Maximum likelihood estimate of p = x/n
  – Least squares estimate of p is also x/n
• Confidence intervals
  – Normal approximation: x/n ± z*sqrt[(x/n)(1 - x/n)/n], where z is the normal quantile (e.g., 1.96 for 95%)
  – Exact confidence interval
• Bayesian
  – Beta posterior distribution

Exact confidence interval for a proportion based on data
• Finds the largest and smallest values of p that are compatible with the observed evidence x/n
• Generated by binom.test(x, n) (two-sided), or binom.test(x, n, alternative = "greater") or binom.test(x, n, alternative = "less") (one-sided)
  – A two-sided test puts half the significance level (denoted by α = 1 - confidence level) in each tail
  – A one-sided test puts all of α in one tail

Example
• Evidence: Suppose that 20 components are randomly sampled and that 5 are observed to be defective (non-conforming).
• What can we infer about the true proportion of defects?

Maximum likelihood estimate (MLE): x/n maximizes the likelihood L = C*p^x*(1 - p)^(n - x), where C is the binomial coefficient (constant in p)
• "Best" (MLE) point estimate of p based on x = 5 and n = 20 is p = 5/20 = 0.25
  – Maximizes L = C(20, 5)*p^5*(1 - p)^15
• p = c(0:100)/100; y = (p^5)*(1-p)^15; plot(p, y)
  > y[25]
  [1] 1.297956e-05
  > y[26]   # p[26] = 0.25
  [1] 1.305025e-05
  > y[27]
  [1] 1.298203e-05
[Figure: likelihood y = p^5(1 - p)^15 plotted against p, peaking at p = 0.25]

Exact confidence interval
• Intuition: The observed evidence x/n = 5/20 is very unlikely if p is too large or too small
  – p = c(0:100)/100; y = 1 - pbinom(4, 20, p); z = pbinom(5, 20, p); plot(p, y); plot(p, z)
  – y = P(x ≥ 5 | p); z = P(x ≤ 5 | p)
[Figure: y = P(x ≥ 5 | p) increases from 0 to 1 in p; z = P(x ≤ 5 | p) decreases from 1 to 0 in p]

Exact confidence interval
• For an exact 95% confidence interval, find
  – Upper bound: largest value of p for which z = pbinom(5, 20, p) = Pr(x ≤ 5 | p) = 0.05/2
  – Lower bound: smallest value of p for which y = 1 - pbinom(4, 20, p) = Pr(x ≥ 5 | p) = 0.05/2
  – The interval between these is the exact 95% confidence interval
• In R, can just use binom.test(x, n)

Exact confidence interval calculation in R
• binom.test(5, 20, 0.25)
  Exact binomial test
  data: 5 and 20
  number of successes = 5, number of trials = 20, p-value = 1
  alternative hypothesis: true probability of success is not equal to 0.25
  95 percent confidence interval:
   0.08657147 0.49104587
  sample estimates:
  probability of success
                    0.25

Example: Broken windows
• Randomly sample 5 windows in a large housing development
• All 5 have a construction defect that allows water penetration
• What can we infer about the true proportion of defective windows?

Estimating a proportion using a binomial distribution
• Data: 5 defects in 5 trials
• What can we infer about the true proportion of defects, p?
• P(5 defects in 5 trials | Pr(defect) = p) = p^5
• The observed outcome, 5/5, has no more than 0.05 probability of occurring by chance if p^5 < 0.05, i.e., if p < (0.05)^(1/5) = 0.5493
• So 0.5493 is an exact 95% lower confidence limit on the proportion of windows with defects

Exact confidence interval in R
• Data: 5 defects in 5 trials
• binom.test(5, 5) does not produce what we want: it returns a 95% exact confidence interval of [0.4782, 1]
• Reason: the default is a two-sided 95% confidence interval
• To get a one-sided interval, use binom.test(5, 5, alternative = "greater")
• This returns [0.5493, 1], as expected.
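A quick R check of the one-sided bound above; this just re-derives numbers already on the slide:

# Lower 95% confidence limit for p after observing 5 defects in 5 trials
0.05^(1/5)                                          # closed form: 0.5493...
binom.test(5, 5, alternative = "greater")$conf.int  # exact one-sided interval [0.5493, 1]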
An alternative: Beta posteriors for proportions
• If the prior distribution for an unknown binomial proportion is a beta distribution (e.g., uniform)…
• And data show x successes in n binomial trials…
• Then the posterior distribution, conditioned on this evidence, is also a beta distribution
  – The beta family provides "conjugate prior" distributions for binomial sampling

Beta distribution, pbeta(q, x, n-x)
• A distribution for proportions or probabilities
  – Concentrated on the interval from 0 to 1
  – E(X) = x/n (think of x successes in n trials)
  – Var(X) = x(n - x)/[n^2*(n + 1)]
  – Uniform: dbeta(p, 1, 1) = 1
• Example: P(X < 0.29 | n = 10, x = 4) = pbeta(0.29, 4, 6) = 0.24781
• Example: P(X < 0.52 | n = 10, x = 4) = pbeta(0.52, 4, 6) = 0.7838555

Bayesian estimation of proportion
• Choose a beta prior distribution for p
  – Initial x and n reflect any prior evidence or belief
  – A common choice is the uniform prior, dbeta(p, 1, 1)
• Update with data (by conditioning via Bayes' rule for continuous variables)
• Result is a posterior beta distribution with updated n and x
  – If the prior was uniform, the posterior has x + 1 successes in n + 2 trials
  – Posterior mean = (x + 1)/(n + 2), based on x successes in n trials

Bayesian estimation of proportion: Broken windows
• Given x = 5 defects in n = 5 trials, the Bayesian posterior mean probability of defects = (x + 1)/(n + 2) = 6/7 = 0.857
• This is "the probability" that a randomly selected unit will have a defect
  – All probabilities are conditional
  – Here, 0.857 is the posterior mean after conditioning a uniform prior (U[0, 1], with mean 0.5) on the data x = n = 5

Bayesian estimation of proportion: Example with n = 20, x = 5
• Choose the uniform prior: in the pbeta(q, x, n-x) notation, initial x = 1, n = 2
  – p = c(0:100)/100
  – y = dbeta(p, 1, 1)
• Result is a posterior beta distribution with updated n = 22 and x = 6
  – z = dbeta(p, 6, 16)

Bayesian estimation of proportion: Example with n = 20, x = 5
• Prior and posterior pdfs
[Figure: flat prior density dbeta(p, 1, 1) next to the posterior density dbeta(p, 6, 16), which peaks near p = 0.25]

Wrap-up on Bayesian estimation of proportions
• The posterior distribution characterizes uncertainty about the true proportion
  – Depends on the prior and on the data
  – For binomial sampling, the posterior is beta if the prior is beta (the beta family is conjugate for binomial sampling)
• Provides much more information than a confidence interval
• Leads to a simple, intuitive formula for point estimates: (x + 1)/(n + 2)
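A short R sketch (not on the slides) that reproduces the n = 20, x = 5 update and adds a 95% posterior credible interval via qbeta:

# Uniform prior Beta(1, 1) updated with x = 5 successes in n = 20 trials
x <- 5; n <- 20
a <- x + 1          # posterior "success" count
b <- n - x + 1      # posterior "failure" count

a / (a + b)                      # posterior mean (x + 1)/(n + 2) = 6/22 = 0.273
qbeta(c(0.025, 0.975), a, b)     # central 95% posterior credible interval

p <- c(0:100)/100
plot(p, dbeta(p, 1, 1), type = "l", ylim = c(0, 5), ylab = "density")  # flat prior
lines(p, dbeta(p, a, b), lty = 2)                                      # posterior Beta(6, 16)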
Hypothesis testing

Statistical inference: Hypothesis testing using the sign test
• 20 participants in a taste test experiment try each of two drinks in random order
  – Assume a random sample
• 15 of them prefer A to B
• Statistical hypothesis testing: What can we justifiably infer?
  – Can we confidently conclude from this evidence that most people prefer A to B?
• Prescriptive analysis: What should we do?

Statistical inference: Sign test
• Null hypothesis: No difference
  – A random participant is equally likely to prefer either
• Simple null hypothesis: P(prefer A to B) = 0.5
• Composite null hypothesis: P(prefer A to B) ≤ 0.5
• Test of the null hypothesis: How likely is it that 15 or more of 20 subjects would prefer A to B by chance, if the null hypothesis is true?
  – One-sided test

Statistical inference: Sign test
• P(at least 15 out of 20 choose A over B if p = 0.5) = 1 - pbinom(14, 20, 0.5) = 0.02
  – Logic: Quantify P(≥ observed outcome | H0)
• So, we can be 98% confident that this event did not happen by chance.
  – Significance level = 0.02 = 1 - confidence level
• It is likely, based on this evidence, that most people really do prefer A to B.
• (Decision analysis: But is it a good bet?)

Sign test using binom.test in R
• Even a two-sided test lets us reject the null hypothesis (since the 95% CI lies above 0.5)
> binom.test(15, 20)
  Exact binomial test
  data: 15 and 20
  number of successes = 15, number of trials = 20, p-value = 0.04139
  alternative hypothesis: true probability of success is not equal to 0.5
  95 percent confidence interval:
   0.5089541 0.9134285
  sample estimates:
  probability of success
                    0.75

Sign test for time series
• Consider a time series of counts by quarter of year (for car accidents, crimes, heart attack deaths, etc.)
• 6, 4, 4, 7, *, 4, 2, 2, 5 (* = intervention time)
• An intervention to reduce counts is made at the start of period 5 (quarter 1 of year 2)
• Did it work? (Did counts drop significantly?)

Sign test for time series
• Time series of counts by quarter of year: 6, 4, 4, 7, *, 4, 2, 2, 5 (* = intervention time)
• Intervention at the start of period 5
• Did it work? (Did counts drop significantly?)
• Each of the 4 values after the intervention is less than its lagged value a year earlier, so p = (1/2)^4 = 1/16 = 0.0625
  – Something probably worked (confidence about 94%)
  – Can't tell from the data alone whether it was the intervention
• Need a control group

Summary on sign test
• Nonparametric statistical test (or decision) of the "no difference" hypothesis
  – No specific statistical model for the data values needs to be known or assumed
• Depends only on comparing which member of each pair is greater
  – No need for precise measurement or large differences
• Simple to calculate using the binomial distribution

Logic and vocabulary of statistical hypothesis testing
• Formulate a null hypothesis to be tested, H0
  – H0 is what you are trying to reject
  – If true, H0 determines a probability distribution for the test statistic (a function of the data)
• Choose α = significance level for the test = P(reject null hypothesis H0 | H0 is true)
• Decision rule: Reject H0 if and only if the test statistic falls in a critical region of values that are unlikely (p < α) if H0 is true.

Hypothesis testing picture
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_test_1.htm

Interpretation of hypothesis test
• Either something unlikely has happened (having probability p < α, where p = P(test statistic has its observed or a more extreme value | H0 is correct)), or H0 is not true.
• It is conventional to choose a significance level of α = 0.05, but other values may be chosen to minimize the sum of the costs of type 1 errors (falsely rejecting H0) and type 2 errors (falsely failing to reject H0).

Neyman-Pearson Lemma
• How to minimize Pr(type 2 error), given α?
• Answer: Reject H0 in favor of HA if and only if P(data | HA)/P(data | H0) > k, for some constant k
  – The ratio LR = P(data | HA)/P(data | H0) is called the likelihood ratio
  – With independent samples, P(data | H) = product of the P(xi | H) values over all data points xi
  – k is determined from α.
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_neyman_pearson.htm
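To make the Neyman-Pearson rule concrete, here is a minimal R sketch (my illustration, not from the slides) for testing H0: p = 0.5 against HA: p = 0.75 with the taste-test data; the specific alternative p = 0.75 and the way k is backed out are illustrative assumptions:

# Likelihood-ratio test for a binomial proportion: H0: p = 0.5 vs HA: p = 0.75
x <- 15; n <- 20                        # observed data (15 of 20 prefer A)
LR <- dbinom(x, n, 0.75) / dbinom(x, n, 0.5)
LR                                      # Neyman-Pearson: reject H0 if LR > k

# Choose k to achieve significance level alpha = 0.05 under H0.
# Here LR is increasing in x (since 0.75 > 0.5), so the rejection region is {x >= c}:
c <- qbinom(0.95, n, 0.5) + 1           # smallest c with P(X >= c | H0) <= 0.05; c = 15
dbinom(c, n, 0.75) / dbinom(c, n, 0.5)  # the implied threshold k
1 - pbinom(c - 1, n, 0.5)               # attained size (< 0.05 because X is discrete)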
Statistical decision theory: Key ideas
• Statistical inference from data can be formulated in terms of decision problems
• Point estimation: Minimize expected loss from error, given a loss function
  – Implies using the posterior mean if the loss function is quadratic (mean squared error)
  – Implies using the posterior median if the loss function is the absolute value of the error
• Hypothesis testing: Minimize total expected loss = loss from false positives + loss from false negatives + sampling costs

Updating normal distributions

Updating normal distributions
• Probability model: N(m, s^2); pnorm(x, m, s)
• Initial uncertainty about the input m is modeled by a normal prior with parameters m0, s0
  – Prior dnorm(x, m0, s0) has mean m0
• Observe data: x1 = sample mean of n1 independent observations
• Posterior uncertainty about m: N(m*, s*^2), with m* = w*m0 + (1 - w)*x1 and s* = sqrt(w*s0^2)
• w = (s^2/n1)/(s^2/n1 + s0^2) = 1/(1 + n1*s0^2/s^2)

Bayesian updating of normal distributions (cont.)
• Posterior uncertainty about m: N(m*, s*^2), with m* = w*m0 + (1 - w)*x1 and s* = sqrt(w*s0^2)
• w = (s^2/n1)/(s^2/n1 + s0^2) = 1/(1 + n1*s0^2/s^2)
• Let's define an "equivalent sample size" n0 for the prior by s0^2 = s^2/n0.
• Then w = n0/(n0 + n1), and the posterior is N(m*, s*^2) with
  – m* = (n0*m0 + n1*x1)/(n0 + n1)
  – s* = sqrt(s^2/(n0 + n1))

Predictive distributions
• How do we predict probabilities when the inputs to probability models (p for binom, m and s for pnorm, etc.) are uncertain?
• Answer 1: Find the posterior by Bayesian conditioning of the prior on the data.
• Answer 2: Use simulation to sample from the distribution of inputs. Calculate conditional probabilities from the model, given the sampled inputs. Average them to get the final probability.

Example: Predictive normal distribution
• If the posterior distribution is N(m*, s*^2), then the predictive distribution is N(m*, s^2 + s*^2)
• The mean is just the posterior mean, m*
• Total uncertainty (variance) in the prediction = variance around the (true but uncertain) mean + variance of the mean

Example: Exact vs. simulated predictive normal distributions
• Model: N(m, 1) with m ~ N(3, 4)
• Exact predictive distribution: N(m*, s^2 + s*^2) = N(3, 5)
• Simulated predictive distribution: mean 2.99, variance 5.077
> m = y = NULL; m = rnorm(10000, 3, 2); mean(m); sd(m)^2
[1] 3.000202
[1] 4.043804
> for (j in 1:10000) {y[j] = rnorm(1, m[j], 1)}; mean(y); sd(y)^2
[1] 2.993081
[1] 5.077026
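A small R sketch (mine, with assumed numbers) implementing the normal-updating formulas above: prior N(m0 = 3, s0^2 = 4), known sampling sd s = 1, and a sample of n1 = 10 observations:

# Conjugate normal updating with known sampling sd s (assumed values throughout)
s  <- 1                             # known sd of individual observations
m0 <- 3; s0 <- 2                    # prior on m: N(m0, s0^2)
set.seed(1)
x  <- rnorm(10, mean = 4, sd = s)   # illustrative data from a process with true m = 4
n1 <- length(x); x1 <- mean(x)

w      <- (s^2/n1) / (s^2/n1 + s0^2)   # weight on the prior mean
m_star <- w*m0 + (1 - w)*x1            # posterior mean
s_star <- sqrt(w*s0^2)                 # posterior sd of m
c(m_star, s_star)

# Equivalent-sample-size form gives the same answer: n0 = s^2/s0^2
n0 <- s^2/s0^2
c((n0*m0 + n1*x1)/(n0 + n1), sqrt(s^2/(n0 + n1)))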
Simulation: The main idea
• To quantify Pr(outcome), create a model for Pr(outcome | inputs) and Pr(inputs)
  – Pr(inputs) = joint probability distribution of the inputs
• Sample values from Pr(inputs)
  – Use R's random-number functions (rbinom, rnorm, etc.)
• Create an indicator variable for the outcome
  – 1 if it occurs on a run, else 0
• Mean value of the indicator variable = Pr(outcome)

Bayesian inference via simulation: Mary revisited
• Pr(test is positive | disease) = 0.95
• Pr(test is negative | no disease) = 0.90
• Pr(disease) = 0.03
• Find P(disease | test is positive)
• Answer from Bayes' Rule: 0.2270916
• Answer by simulation (conditional percentages, given a positive test):
  Disease_state: Yes 22.7, No 77.3
  Test_result: Positive 100, Negative 0
# Initialize variables
disease_status = test_result = test_result_if_disease = test_result_if_no_disease = NULL
n = 100000
# Simulate disease state and test outcomes
disease_status = rbinom(n, 1, 0.03)
test_result_if_disease = rbinom(n, 1, 0.95)
test_result_if_no_disease = rbinom(n, 1, 0.10)
test_result = disease_status*test_result_if_disease + (1 - disease_status)*test_result_if_no_disease
# Calculate and report the desired conditional probability
sum(disease_status*test_result)/sum(test_result)
[1] 0.2263892

Wrap-up on probability models
• Highly useful for estimating probabilities in many standard situations
  – Pr(0 arrivals in h hours) if the mean arrival rate is known (Poisson model)
  – Conservative estimates for proportions
• Useful for showing uncertainty about probabilities using Bayes' Rule
  – Beta posterior distribution for proportions

Binomial models for statistical quality control decisions: Sequential and adaptive hypothesis-testing

Quality control decisions
• Observe data, decide what to do
  – Intervene in the process; accept or reject a lot
• P-chart for quality control of a process
  – For attributes (pass/fail, conform/not conform)
• Lot acceptance sampling
  – Accept or reject a lot based on a sample
• Adaptive sampling
  – Sequential probability ratio test (SPRT)

"Rule of 3": Using the binomial model to bound probabilities
• If no failures are observed in N binomial trials, then how large might the failure probability be?
• Answer: At most 3/N
  – This is a 95% upper confidence limit
  – Derivation: If the failure probability is p, then the probability of 0 failures in N trials is (1 - p)^N.
  – p stays in the 95% interval while (1 - p)^N > 0.05, i.e., N*ln(1 - p) > ln(0.05) = -2.9957
  – Using ln(1 - p) ≈ -p for small p: -N*p > -3, so p < 3/N

P-chart: Pay attention if the process exceeds the upper control limit (UCL)
http://www.centerspace.net/blog/nmath-stats/nmath-stats-tutorial/statistical-quality-control-charts/
Decision analysis: Set the UCL to minimize the average cost of type 1 (false reject) and type 2 (false accept) errors

Lot acceptance sampling (by attributes, i.e., pass/fail inspections)
• Take a sample of size n
• Count non-conforming (fail) items
• Accept if the number is below a threshold; reject if it is above
• Optimize the choice of n and threshold to minimize expected total costs
  – Total cost = cost of sampling + cost of erroneous decisions

Lot acceptance sampling: Inputs and outputs
http://www.minitab.com/en-US/training/tutorials/accessing-the-power.aspx?id=1688

Zero-based acceptance sampling plan calculator
Squeglia Zero-Based Acceptance Sampling Plan Calculator
Enter your process parameters:
• Batch/lot size (N): the number of items in the batch (lot)
• AQL: the Acceptable Quality Level. If no AQL is contractually specified, an AQL of 1.0% is suggested
http://www.sqconline.com/squeglia-zero-based-acceptance-sampling-plan-calculator
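Before looking at the calculator's output on the next slide, here is a short R sketch (my illustration) of the operating-characteristic (OC) curve for a generic zero-acceptance plan, P(accept lot | p) = (1 - p)^n; the n = 5 below matches the sample size the calculator returns:

# OC curve for a zero-acceptance sampling plan: accept the lot iff 0 of n sampled items fail
n <- 5                          # sample size (matches the calculator result below)
p <- seq(0, 0.6, by = 0.01)     # true lot fraction non-conforming
accept_prob <- (1 - p)^n        # P(accept | p) = P(0 failures in n trials)
plot(p, accept_prob, type = "l",
     xlab = "true fraction non-conforming", ylab = "P(accept lot)")
(1 - 0.10)^n                    # e.g., at p = AQL = 10%, the lot is accepted ~59% of the time
3/n                             # Rule-of-3 link: after accepting (0 failures), ~95% upper bound on p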
Zero-based acceptance sampling plan calculator
Squeglia Zero-Based Acceptance Sampling Plan (Results)
For a lot of 91 to 150 items and AQL = 10.0%, the Squeglia zero-based acceptance sampling plan is: Sample 5 items.
• If the number of non-conforming items is 0, accept the lot
• If it is 1 or more, reject the lot
This plan is based on DCMA (Defense Contract Management Agency) recommendations.
http://www.sqconline.com/squeglia-zero-based-acceptance-sampling-plan-calculator

Multi-stage lot acceptance sampling
• Take a sample of size n
• Count non-conforming (fail) items
• Accept if the number is below threshold 1; reject if it is above threshold 2; sample again if it is between the thresholds
  – For single-sample decisions, thresholds 1 and 2 are the same
• Optimize the choice of n and thresholds to minimize expected total costs

Decision rules for adaptive binomial sampling: Sequential probability ratio test (SPRT)
Intuition: The expected slope of the cumulative-defects line is the average proportion of defectives. This is just the probability of defective (non-conforming) items in a binomial sample. Simulation-optimization (or math) can identify optimal slopes and intercepts to minimize expected total cost (of sampling + type 1 and type 2 errors).
http://www.stattools.net/SeqSPRT_Exp.php

Generalizations of SPRT
• The main ideas apply to many other (non-binomial) problems
• SPRT decision rule: Use the data to compute the likelihood ratio LRt = P(ct | HA)/P(ct | H0)
• If LRt > (1 - β)/α, then stop and reject H0
• If LRt < β/(1 - α), then stop and accept H0
• Else continue sampling
  – ct = number of adverse events by time t
  – H0 = null hypothesis (process has an acceptably small defect rate); HA = alternative hypothesis
  – α = false rejection rate for H0 (type 1 error rate)
  – β = false acceptance rate for H0 (type 2 error rate)
http://www.tandfonline.com/doi/pdf/10.1080/07474946.2011.539924?noFrame=true

Implementing the SPRT
• Optimal slopes and intercepts to achieve different combinations of type 1 and type 2 errors are tabulated.
• Example application: Testing for mean time to failure (MTTF) of electronic components

Decision rules for adaptive binomial sampling: Sequential probability ratio test (SPRT)
http://www.sciencedirect.com/science/article/pii/S0022474X05000056

Application: SPRT for deaths from hospital heart operations
http://www.bmj.com/content/328/7436/375?ijkey=144017772645bb38936abd6f209cd96bfd1930c3&keytype2=tf_ipsecsha&linkType=ABST&journalCode=bmj&resid=328/7436/375

SPRT can greatly reduce sample sizes (e.g., from hundreds to 5, for construction defects)
http://www.sciencedirect.com/science/article/pii/S0022474X05000056

Nonlinear boundaries and truncated stopping rules can refine the basic idea
http://www.sciencedirect.com/science/article/pii/S0022474X05000056

Wrap-up on SPRT
• Sequential and adaptive sampling can reduce total decision costs (costs of sampling + costs of error)
• Computationally sophisticated (and challenging) algorithms have been developed to approximately optimize decision boundaries for statistical decision rules
• Adaptive approaches are especially valuable for decisions in uncertain and changing environments.
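To illustrate the Wald boundaries above, here is a minimal R sketch (my illustration; the defect rates p0 = 0.05 under H0 and p1 = 0.15 under HA are assumptions) of a binomial SPRT run on simulated item-by-item inspections:

# Binomial SPRT: H0: p = p0 (acceptable defect rate) vs HA: p = p1 (unacceptable)
set.seed(2)
p0 <- 0.05; p1 <- 0.15            # assumed defect rates under H0 and HA
alpha <- 0.05; beta <- 0.10       # target type 1 and type 2 error rates
A <- (1 - beta)/alpha             # upper boundary: stop and reject H0
B <- beta/(1 - alpha)             # lower boundary: stop and accept H0

logLR <- 0; t <- 0
while (logLR < log(A) && logLR > log(B)) {
  t <- t + 1
  xi <- rbinom(1, 1, p0)          # inspect one item from a process with true p = p0
  # accumulate the log likelihood ratio for this observation
  logLR <- logLR + dbinom(xi, 1, p1, log = TRUE) - dbinom(xi, 1, p0, log = TRUE)
}
c(items_inspected = t,
  decision = ifelse(logLR >= log(A), "reject H0", "accept H0"))

Since the simulated process truly has p = p0, the run usually stops at the lower boundary (accept H0), typically after far fewer items than a comparable fixed-sample plan.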
Multi-arm bandits and adaptive learning

Multi-arm bandit (MAB) decision problem: Comparing uncertain reward distributions
• Multi-arm bandit (MAB) decision problem: On each turn, can select any of k actions
  – Context-dependent bandit: Get to see a "context" (signal) x before making the decision
• Receive a random reward with an (initially unknown) distribution that depends on the selected action
• Goal: Maximize the sum (or discounted sum) of rewards; minimize regret (= expected difference between the best cumulative reward, if the distributions were known, and the cumulative reward actually received)
Gopalan et al., 2014 http://jmlr.org/proceedings/papers/v32/gopalan14.pdf
https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/

MAB applications
• Clinical trials: Compare an old drug to a new one. Which has the higher success rate?
• Web advertising, A/B testing: Which version of a web ad maximizes click-throughs, purchases, etc.?
• Public policies: Which policy best achieves its goals?
  – Use evidence from early-adopter locations to inform subsequent choices

Upper confidence bound (UCB1) algorithm for solving MAB
• Try each action once.
• For each action a, record the average reward m(a) obtained from it so far and the number of times it has been tried, n(a).
• Let N = sum over all actions a of n(a) = total number of actions taken so far.
• Choose next the action with the greatest upper confidence bound (UCB): m(a) + sqrt(2*log(N)/n(a))
  – Implements the "optimism in the face of uncertainty" principle
  – The UCB for a decreases quickly with n(a) and increases slowly with N
  – Achieves the theoretical optimum: logarithmic growth in regret
    • Same average increase in the first 10 plays as in the next 90, then the next 900, and so on
  – Requires updating every round (not batch updating)
Auer et al., 2002 http://homes.dsi.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf

Thompson sampling and adaptive Bayesian control: Bernoulli trials
• Basic idea: Choose each of the k actions according to the probability that it is best
• Estimate that probability via Bayes' rule
  – It is based on the posterior distribution of each action's success rate
  – Use beta conjugate prior updating for the "Bernoulli bandit" (0-1 reward, fail/succeed); S = success, F = failure
Agrawal and Goyal, 2012 http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf

Thompson sampling: General stochastic (random) rewards
• Second idea: Generalize from the Bernoulli bandit to an arbitrary reward distribution (normalized to the interval [0, 1]) by considering a trial a "success" with probability equal to its reward
Agrawal and Goyal, 2012 http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf

Thompson sampling with complex online actions
• Applications include job scheduling (assigning jobs to machines) and web advertising with reward depending on the subsets of ads displayed
• Updating posteriors can be done efficiently using a sampling-based approach (particle filtering)
Gopalan et al., 2014 http://jmlr.org/proceedings/papers/v32/gopalan14.pdf

Comparing methods
• In simulation experiments, Thompson sampling works well with batch updating, even with slowly or occasionally changing rewards and other realistic complexities.
• Beats UCB1 in many but not all comparisons
• More practical than UCB1 for batch updating because it keeps experimenting (trying actions with some randomness) between updates.
http://engineering.richrelevance.com/recommendations-thompson-sampling/
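As a wrap-up, here is a compact R sketch (my illustration; the true success rates are made up for the demo) of Thompson sampling for a Bernoulli bandit with uniform Beta(1, 1) priors, as described above:

# Thompson sampling for a k-armed Bernoulli bandit with Beta(1, 1) priors
set.seed(3)
true_p <- c(0.30, 0.50, 0.65)        # unknown true success rates (assumed for the demo)
k <- length(true_p)
succ <- rep(0, k); fail <- rep(0, k) # per-arm success (S) and failure (F) counts

n_rounds <- 5000
for (t in 1:n_rounds) {
  theta <- rbeta(k, succ + 1, fail + 1)  # one posterior draw per arm
  a <- which.max(theta)                  # play the arm whose draw is largest
  r <- rbinom(1, 1, true_p[a])           # observe a 0-1 reward
  if (r == 1) succ[a] <- succ[a] + 1 else fail[a] <- fail[a] + 1
}
succ + fail                          # plays per arm: concentrates on the best arm
(succ + 1)/(succ + fail + 2)         # posterior mean estimate of each arm's success rate

Because each arm is chosen with probability equal to its posterior probability of being best, exploration decays automatically as the posteriors sharpen, which is why Thompson sampling tolerates batch updating well.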