Making Decisions with Probability Distributions
[email protected]
BioPhia Consulting, Inc.
David LeBlond, PhD, 2/2017

Definitions
• Decision: an action taken to minimize risk.
• Probability distribution (PD): a tool for quantifying risk.
[Figure: a distribution over x with a shaded region where the "bad thing happens"]

Relationship to Quality
Knowing the quality means knowing the risk; knowing the risk means knowing the probability; knowing the probability means knowing the probability distribution.

Role of PDs in science & technology
• Probability tools: 350+ years old. Statistical tools: 110 years old.
• Statistics needs PDs for model building.
• Inference: estimate model parameters from data.
• Deduction: predict future values, given the estimates.

Some applications of PDs
• risk assessment (FMEA, ICH Q9)
• acceptance sampling
• sample size estimation
• Monte-Carlo predictions
• process capability
• statistical tests
• interval estimation
• Bayesian methods

Why calculate PDs in MS Excel?
Cons
• Manual
• Formulas must be 'verified'
• Many complex analyses cannot be done in Excel
Pros
• Availability
• No black box
• Empowering
• Provides templates for clients
• Complex formulas duplicated by 'cut and paste'
• 'Accessible' modeling environment
• Common distributions are relatively easy to model

Outline
• Why PDs? Why Excel?
• pdf, pmf, cdf, qf
• Excel formulas for 5 count and 10 continuous PDs
• Generating random draws from a PD
• Histograms in Excel
• Skewness and kurtosis
• Testing for normality
• Transformations to normality
• Bonus: matrix calculations in Excel
Approach: minimize algebra and Greek letters; include a little historical trivia.

PDs for continuous variates
• pdf: probability density function
• cdf: cumulative distribution function
• qf: quantile function (the inverse of the cdf)
[Figure: pdf and cdf of a continuous variate over x = -4 to 4]

PDs for count variates
• pmf: probability mass function
• cdf: cumulative distribution function
• qf: quantile function
[Figure: pmf and cdf of a count variate over x = 0 to 10]

Parameters of a PD
X is a random variate; x is a specific value of X.
• Continuous pdf: = NORMDIST(x, m, s, FALSE), with true mean m and true sigma s.
• Count pmf: = BINOMDIST(x, n, p, FALSE), with n repetitions and success rate p.
In the Excel formula, substitute the cell range for x, m, s, n, p, …

Relationships among distributions
[Figure: map of relationships among distributions — Bernoulli, Binomial, Negative Binomial, Poisson, Hypergeometric, Beta, Standard Uniform, Uniform, Normal, Standard Normal, Lognormal, Student t, LS Student t, Chisquare, F, Gamma, Exponential, Weibull]

Binomial
Jacob Bernoulli, 1713, Art of Conjecture; Swiss mathematician.
Story: A jar contains a very large (i.e., infinite) number of widgets. A known proportion, p, are defective. We randomly select n widgets and measure x, the number in this sample that are defective. The n = 1 case is the Bernoulli PD. Remember: the cdf is really a step function.
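The binomial story above can be cross-checked outside Excel. Here is a minimal Python sketch (an illustration added to this transcript, not part of the original deck) of the pmf, the step-function cdf, and a CRITBINOM-style qf:

```python
import math

def binom_pmf(x, n, p):
    """BINOMDIST(x, n, p, FALSE): P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x, n, p):
    """BINOMDIST(x, n, p, TRUE): P(X <= x); a step function in x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

def binom_qf(P, n, p):
    """CRITBINOM(n, p, P): the smallest x whose cdf reaches P."""
    x = 0
    while binom_cdf(x, n, p) < P:
        x += 1
    return x
```

For instance, binom_qf(0.5, 10, 0.5) returns 5, matching CRITBINOM(10, 0.5, 0.5).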
pmf: M = BINOMDIST(x, n, p, FALSE)
cdf: P = BINOMDIST(x, n, p, TRUE)
qf: x = CRITBINOM(n, p, P)
[Figure: binomial pmf and cdf for n=10 with p = 0.1, 0.5, 0.9, over x = 0 to 10]

Negative Binomial
Blaise Pascal, 1654, France; mathematician, theologian, gambler. AKA the "Pascal PD".
Story: A process has a known probability of success, p, on each run. We run the process until we make n good lots and measure x, the number of failed lots produced.
pmf: M = NEGBINOMDIST(x, n, p)
cdf: create the pmf for 0, …, x, then SUM(pmf range)
qf: create the cdf range for -1 to x, then LOOKUP(P, cdf range, quantile range) + 1
[Figure: negative binomial pmf and cdf for (n=2, p=0.25), (n=2, p=0.5), (n=5, p=0.5), over x = 0 to 18]

Poisson
Siméon Poisson, 1838, France; mathematician, astronomer.
Story: A white ointment is contaminated with randomly dispersed black specks. The true mean number of black specks per unit volume is m. We randomly sample a unit volume and measure x, the number of black specks in the sample.
pmf: M = POISSON(x, m, FALSE)
cdf: P = POISSON(x, m, TRUE)
qf: create the cdf range for -1 to x, then LOOKUP(P, cdf range, quantile range) + 1
[Figure: Poisson pmf and cdf for m = 1, 2, 4, 8, over x = 0 to 12]

Hypergeometric
Abraham de Moivre, 1711, France-England; his "Doctrine of Chances" was prized by gamblers.
Story: K of the N widgets in a small batch are known to be defective. We randomly select n widgets and measure x, the number in this sample that are defective. (Note: if n-x > N-K, or if x > K, or if x > n, the pmf must equal 0.)
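The support restriction in the hypergeometric story can be checked directly. A Python sketch (illustrative, not from the deck) that applies the same guard as the Excel IF(AND(...)) construction:

```python
import math

def hypgeom_pmf(x, n, K, N):
    """Probability of x defectives in a random sample of n from a batch
    of N containing K defectives, forced to 0 outside the support
    (mirroring the IF(AND(...)) guard around HYPGEOMDIST)."""
    if not (max(0, n + K - N) <= x <= min(K, n)):
        return 0.0  # impossible outcome, pmf must equal 0
    return math.comb(K, x) * math.comb(N - K, n - x) / math.comb(N, n)
```

For example, hypgeom_pmf(3, 2, 10, 20) is 0 because x > n, while the probabilities over the valid support still sum to 1.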
pmf: M = IF(AND(MAX(0,n+K-N)<=x, x<=MIN(K,n)), HYPGEOMDIST(x,n,K,N), 0)
cdf: create the pmf for 0, …, x, then SUM(pmf range)
qf: create the cdf range for -1 to x, then LOOKUP(P, cdf range, quantile range) + 1
[Figure: hypergeometric pmf and cdf for n = 1, 2, 4, 8 with K=10, N=20, over x = 0 to 12]

Normal
First described by Abraham de Moivre, 1711, France-England; rigorously justified by Carl Friedrich Gauss, 1809, Germany.
Story: x is the sum of the effects of many small, independent factors. No single factor has a dominant effect. The true mean and sigma of x are m and s.
pdf: D = NORMDIST(x, m, s, FALSE)
cdf: P = NORMDIST(x, m, s, TRUE)
qf: x = NORMINV(P, m, s) = m + s*NORMINV(P, 0, 1)
[Figure: normal pdf and cdf; roughly 68%, 95.5%, and 99.73% of the probability lies within 1, 2, and 3 sigma of the mean]

LogNormal
Sir Francis Galton, 1879; Darwin's cousin; developed regression and correlation.
Story: x is the multiplicative product of the effects of many small, independent factors. The natural log of x follows a normal distribution with true mean m and true standard deviation s.
pdf: D = EXP(-(LN(x)-m)^2/2/s^2)/SQRT(2*PI())/s/x
cdf: P = LOGNORMDIST(x, m, s)
qf: x = LOGINV(P, m, s)
[Figure: lognormal pdf and cdf for m=0 with s = 0.25, 0.5, 1, 2, over x = 0 to 3]

Exponential
Story: An electrical system experiences occasional random blackouts. The long-term mean blackout rate, b, is constant. Measure x, the time between blackouts.
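Because the exponential cdf has the closed form P = 1 - exp(-b*x), its quantile function can be written down directly. A Python sketch (illustrative, not from the deck; it assumes b is the rate, so the mean time between blackouts is 1/b):

```python
import math

def exp_cdf(x, b):
    """Exponential cdf with rate b: P = 1 - exp(-b*x)."""
    return 1.0 - math.exp(-b * x)

def exp_qf(P, b):
    """Closed-form quantile function: x = -LN(1-P)/b, the inverse of the cdf."""
    return -math.log(1.0 - P) / b
```

A quick round-trip check, exp_cdf(exp_qf(P, b), b) == P, confirms the two are inverses.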
Leonhard Euler, 1773, Swiss mathematician; the first to use the letter "e".
pdf: D = EXPONDIST(x, b, FALSE)
cdf: P = EXPONDIST(x, b, TRUE)
qf: x = -LN(1-P)/b
[Figure: exponential pdf and cdf for b = 0.05, 0.1, 0.2, 0.4, over x = 0 to 15]

Gamma
(My Gamma & Gampa, 1965, Battle Creek, MI)
Story: Same as the Exponential, except measure x, the time required for a ≥ 1 blackouts to occur.
pdf: D = GAMMADIST(x, a, 1/b, FALSE)
cdf: P = GAMMADIST(x, a, 1/b, TRUE)
qf: x = GAMMAINV(P, a, 1/b)
[Figure: gamma pdf and cdf for a = 1, 2, 4, 8 with b=0.4, over x = 0 to 15]

Weibull
Waloddi Weibull, 1939, Swedish engineer; used explosions to study the ocean floor.
Story: Same as the Exponential, except the long-term mean blackout rate b is not constant but increases (c > 1, "wear out") or decreases (c < 1, "infant mortality") over time. Measure x, the time between blackouts.
pdf: D = WEIBULL(x, c, 1/b, FALSE)
cdf: P = WEIBULL(x, c, 1/b, TRUE)
qf: x = (-LN(1-P))^(1/c)/b
[Figure: Weibull pdf and cdf for c = 1, 2, 4, 8 with b=0.4, over x = 0 to 5]

Beta
Pierre-Simon de Laplace, 1814, French mathematician; his "rule of succession" is based on the beta.
Story: A very large lot of widgets contains K defectives. Widgets are sampled randomly from the lot until a defectives have been found. Let b = K-a+1. Measure x, the fraction of the lot tested.
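The beta density is conveniently computed through log-gamma functions, just as the deck does with GAMMALN. A Python sketch (illustrative, not part of the deck):

```python
import math

def beta_pdf(x, a, b):
    """Beta density, mirroring
    EXP(GAMMALN(a+b)-GAMMALN(a)-GAMMALN(b))*x^(a-1)*(1-x)^(b-1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * x**(a - 1) * (1 - x)**(b - 1)
```

Working in logs first avoids overflow in the gamma functions for large a and b; with a = b = 1 the density collapses to the standard uniform, height 1 everywhere on (0, 1).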
pdf: D = EXP(GAMMALN(a+b)-GAMMALN(a)-GAMMALN(b))*x^(a-1)*(1-x)^(b-1)
cdf: P = BETADIST(x, a, b, 0, 1)
qf: x = BETAINV(P, a, b, 0, 1)
[Figure: beta pdf and cdf for (a=2, b=4), (a=4, b=2), (a=1, b=1), (a=0.5, b=0.5), over x = 0 to 1]

Uniform
Rev. Thomas Bayes, 1763, London; of "Bayes' Rule".
Story: All values of X between c and d are equally likely to occur.
pdf: D = 1/(d-c)
cdf: P = (x-c)/(d-c)
qf: x = c+P*(d-c)
The Standard Uniform has c=0 and d=1 (the same as a Beta with a=b=1).
[Figure: uniform pdf (constant height 1/(d-c) between c and d) and cdf (rising linearly from 0 at c to 1 at d)]

Student's t
William Sealy Gosset ("Student"), 1908, Dublin, Guinness Brewery.
Story: A sample of size v+1 is taken from a normal population with true mean m and true sigma s. The sample average and SD are calculated. Let x = (average-m)/(SD/SQRT(v+1)).
pdf: D = EXP(GAMMALN((v+1)/2)-GAMMALN(v/2))/SQRT(PI()*v)/(1+x^2/v)^((v+1)/2)
cdf: P = IF(x<0, TDIST(-x,v,1), 1-TDIST(x,v,1))
qf: x = IF(P<0.5, -TINV(2*P,v), TINV(2*(1-P),v))
x' = m+s*x follows the location-scale Student t (LSSt). Substitute (x'-m)/s for x, and D/s for D, to obtain the pdf, cdf, and qf of the LSSt distribution.
[Figure: Student t pdf and cdf for v = 1, 2, 4, 8, 16, 32, compared with the standard normal, over x = -6 to 6]

Chisquare
Discovered by Friedrich Helmert, 1875, German geodesist (a lunar crater bears his name); made famous by Karl Pearson, 1900, England, via the "chisquare test".
Story: A sample of size v+1 is taken from a normal random variable with known standard deviation, s. The sample variance (SD^2) is calculated. Measure x = v*(sample variance)/s^2.
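The chisquare(v) density is simply a gamma density with shape v/2 and scale 2, which is exactly how the deck's Excel formula computes it. A Python sketch of that identity (illustrative, not part of the deck):

```python
import math

def chisq_pdf(x, v):
    """Chisquare(v) density via the gamma density with shape v/2, scale 2,
    i.e. the Excel expression GAMMADIST(x, v/2, 2, FALSE)."""
    k = v / 2.0  # gamma shape parameter
    return x**(k - 1) * math.exp(-x / 2.0) / (2.0**k * math.gamma(k))
```

As a spot check, with v = 2 the density reduces to exp(-x/2)/2, an exponential with rate 1/2.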
pdf: D = GAMMADIST(x, v/2, 2, FALSE)
cdf: P = 1-CHIDIST(x, v)
qf: x = CHIINV(1-P, v)
[Figure: chisquare pdf and cdf for v = 4, 10, 20, over x = 0 to 35]

F
Sir Ronald Fisher, ~1925, England; invented ANOVA (and nearly everything else).
Story: A random sample of size v1+1 is taken from a normal population and the sample variance (V1) is calculated. A second, independent random sample of size v2+1 is taken from the same population and the sample variance (V2) is calculated. Measure x = V1/V2.
pdf: D = EXP(GAMMALN((v1+v2)/2)-GAMMALN(v1/2)-GAMMALN(v2/2))*v1^(v1/2)*v2^(v2/2)*x^(v1/2-1)/(v2+v1*x)^(v1/2+v2/2)
cdf: P = 1-FDIST(x, v1, v2)
qf: x = FINV(1-P, v1, v2)
[Figure: F pdf and cdf for (v1=4, v2=40), (v1=10, v2=40), (v1=40, v2=40), over x = 0 to 3]

Why Monte-Carlo simulation?
Consider N = some messy f(A, …, M) — a typical manufacturing process. We know (or guess) the PDs of A, …, M, and we know (or guess) f().
Goal: the distribution of N?
Solution:
• Generate random "draws" for the variates A, …, M.
• Calculate N = f(A, …, M) for each draw.
• The calculated N's will be draws from the target PD.

Simulating continuous variates
Inverse cdf method: RAND() gives a draw from the standard uniform pdf on (0, 1); feed it through the qf of the target distribution to get a pseudo-random draw from that distribution. In Excel, just substitute RAND() for P in the qf.
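The same inverse-cdf Monte-Carlo recipe can be run outside Excel. A Python sketch (the process f(A, B) here is a made-up example, not from the deck):

```python
import math
import random

random.seed(1)  # make the pseudo-random draws reproducible

def draw_uniform(c, d):
    """Inverse-cdf draw from Uniform(c, d): substitute a uniform random
    number for P in the qf x = c + P*(d-c)."""
    return c + random.random() * (d - c)

def draw_exponential(b):
    """Inverse-cdf draw from Exponential(rate b): qf x = -LN(1-P)/b."""
    return -math.log(1.0 - random.random()) / b

# Hypothetical "messy" process N = f(A, B): total downtime is a
# Uniform(1, 2) setup delay plus an Exponential(b = 0.5) waiting time.
draws = [draw_uniform(1.0, 2.0) + draw_exponential(0.5) for _ in range(100_000)]
mean_N = sum(draws) / len(draws)  # should land near E[A] + E[B] = 1.5 + 2.0 = 3.5
```

The list `draws` is itself a sample from the target PD of N, so any summary of N (mean, quantiles, a histogram) can be read straight off it.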
For example, x = NORMINV(RAND(), m, s) is a draw from a normal distribution with mean m and sigma s.

Simulating count variates
The same substitution works for count variates, using the LOOKUP-based qf.
[Figure: histogram of simulated counts over x = 0 to 10 — the observed proportions closely match the theoretical pmf]

Creating a histogram in Excel
[Figures: nine numbered screenshots walking through Excel's histogram tool]
Voilà! [Figure: the finished frequency histogram, bins from 160 to 400]

Distribution shape
• Skewness
• Kurtosis: platykurtic vs. leptokurtic (Student, 1927)

Summary statistics
Statistic | Estimates true… | Excel formula
n | | = COUNT(range)
xbar | mean | = AVERAGE(range)
SD | sigma | = STDEV(range)
g1 | skewness | = SKEW(range)
g2 | excess kurtosis | = KURT(range)
Pth quantile | qf | = PERCENTILE(range, P)

Normality tests
Shapiro-Wilk, Ryan-Joiner
• based on a variance ratio (W)
• powerful, but don't work well if the data are rounded
• not informative
Kolmogorov-Smirnov, Lilliefors
• based on the largest difference between the theoretical and empirical cdf
• not very powerful, not informative
Anderson-Darling, Cramér-von Mises
• based on the squared difference between the theoretical and empirical cdf
• very powerful (maybe too powerful), not informative
Normal probability (Q-Q) plot
• based on a comparison of the theoretical and empirical qf
• not a statistical test, but very informative
D'Agostino's K2, Jarque-Bera, skewness-kurtosis omnibus
• based on testing for skewness = 0, kurtosis = 3, or both
• powerful and informative
• available in Stata, PRISM, Distribution Analyzer

Beware Kolmogorov-Smirnov
Do these data look normal?
[Figure: a visibly non-normal histogram (bins 160 to 400) beside its normal probability plot; mean = 250.0, StDev = 41.44, N = 62, yet KS = 0.104 with p-value = 0.091 — KS fails to reject normality]

Beware Shapiro-Wilk
[Table: the same data raw (4.6487, 12.4022, 9.6743, 8.1590, 11.4409, 9.4845, 13.5148, …) versus rounded to integers (5, 12, 10, 8, 11, 9, 14, …). The Shapiro-Wilk W test gives W = 0.9827, Prob<W = 0.2126 on the raw data, but W = 0.9728, Prob<W = 0.0367* after rounding — rounding alone triggers a rejection]

A test for skewness
n = COUNT(range)
Skewness = SKEW(range)
A = (n-2)*Skewness/SQRT(n*(n-1))
B = A*SQRT((n+1)*(n+3)/6/(n-2))
C = 3*(n^2+27*n-70)*(n+1)*(n+3)/(n-2)/(n+5)/(n+7)/(n+9)
D = -1+SQRT(2*(C-1))
E = 1/SQRT(LN(SQRT(D)))
F = SQRT(2/(D-1))
Z1 = E*LN(B/F+SQRT((B/F)^2+1))
p-value = 2*(1-NORMDIST(ABS(Z1),0,1,TRUE))

A test for kurtosis
n = COUNT(range)
ExcessKurtosis = KURT(range)
G = (n-2)*(n-3)*ExcessKurtosis/(n+1)/(n-1)+3*(n-1)/(n+1)
H = 3*(n-1)/(n+1)
I = 24*n*(n-2)*(n-3)/(n+1)^2/(n+3)/(n+5)
J = (G-H)/SQRT(I)
K = SQRT(6*(n+3)*(n+5)/n/(n-2)/(n-3))*6*(n^2-5*n+2)/(n+7)/(n+9)
L = 6+(8/K)*(2/K+SQRT(1+4/K^2))
Z2 = (1-2/9/L-((1-2/L)/(1+J*SQRT(2/(L-4))))^(1/3))/SQRT(2/9/L)
p-value = 2*(1-NORMDIST(ABS(Z2),0,1,TRUE))

An omnibus test for both skewness and kurtosis
p-value = CHIDIST(Z1^2+Z2^2, 2)

Example: some non-normal data
Raw data: n = 18, xbar = 7.17, SD = 6.24, skewness = 1.62, excess kurtosis = 1.79.
p-value (skewness) = 0.006; p-value (kurtosis) = 0.117; p-value (omnibus) = 0.006.
[Figure: right-skewed histogram of the raw data, bins 0 to 24]

Example: transformation to normality
Transformed = -5.2 + 1.21*ASINH((Raw - 0.18)/0.14) (a Johnson "SU" transformation)
Transformed scale: n = 18, xbar = 0, SD = 1.03, skewness = -0.03, excess kurtosis = 0.24.
[Figure: near-normal histogram of the transformed data, bins -2.5 to 2.5]
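The skewness test spelled out above transcribes nearly line-for-line into Python (stdlib only; math.erf stands in for NORMDIST). This sketch is an addition to the transcript, not part of the original deck:

```python
import math

def excel_skew(data):
    """Excel's SKEW(range): n/((n-1)(n-2)) * sum(((x - xbar)/SD)^3)."""
    n = len(data)
    xbar = sum(data) / n
    sd = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / sd) ** 3 for x in data)

def skewness_test(data):
    """The slide's skewness test; returns (Z1, two-sided p-value)."""
    n = len(data)
    A = (n - 2) * excel_skew(data) / math.sqrt(n * (n - 1))
    B = A * math.sqrt((n + 1) * (n + 3) / 6 / (n - 2))
    C = 3 * (n**2 + 27 * n - 70) * (n + 1) * (n + 3) / (n - 2) / (n + 5) / (n + 7) / (n + 9)
    D = -1 + math.sqrt(2 * (C - 1))
    E = 1 / math.sqrt(math.log(math.sqrt(D)))
    F = math.sqrt(2 / (D - 1))
    Z1 = E * math.log(B / F + math.sqrt((B / F) ** 2 + 1))
    # 2*(1-NORMDIST(ABS(Z1),0,1,TRUE)), with the standard normal cdf via erf
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(Z1) / math.sqrt(2))))
    return Z1, p
```

On perfectly symmetric data Z1 is 0 and the p-value is 1; on strongly right-skewed data Z1 comes out large and positive.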
p-value (skewness) = 0.96; p-value (kurtosis) = 0.64; p-value (omnibus) = 0.90.

Identifying a transformation (Distribution Analyzer software*)
The software places each dataset on a skewness-kurtosis map covering 10 distributions (Normal, Uniform, Beta, Exponential, Weibull, Gamma, Lognormal, Johnson Su, …) plus an "impossible area".
• Original raw data (untransformed): skewness = 1.62, kurtosis = 4.79
• Transformed: skewness = 0.08, kurtosis = 3.12 — transformation identified
* Taylor Enterprises, www.variation.com

Bonus: Excel matrix functions
• = Arange + Brange — element-wise sum
• = TRANSPOSE(Arange) — transpose
• = MMULT(Arange, Brange) — matrix product
• = MINVERSE(Arange) — matrix inverse
Steps:
1. Name the ranges for the input matrices.
2. Determine the "shape" (rows & columns) of the result.
3. Select the cell range for the result matrix.
4. Type the matrix formula.
5. Press CTRL+SHIFT+ENTER.
6. If desired, name the new range for use in further matrix formulas.

Matrix example: simple linear regression
Goal: estimate the slope and intercept. The least-squares estimate is beta-hat = (X'X)^-1 * X'Y, or in Excel:
= MMULT(MMULT(MINVERSE(MMULT(TRANSPOSE(Xrange),Xrange)), TRANSPOSE(Xrange)),Yrange)

Key messages
• Understanding PDs is critical for quality decisions
• pdf, pmf, cdf, qf Excel formulas
• Histograms in Excel
• Monte-Carlo simulation in Excel
• Skewness-kurtosis normality test in Excel
• Matrix calculations in Excel
• Taylor Enterprises "Distribution Analyzer"
• PPT on ASQ FD&C web site
• Questions? [email protected]

References
1. Evans M, et al. (1993) Statistical Distributions, 2nd edn, John Wiley, NY.
2. D'Agostino RB, et al. (1990) A suggestion for using powerful and informative tests of normality. American Statistician 44(4), 316-321.
3.
Distribution Analyzer software and other good information at Wayne Taylor's web site: www.variation.com/da
4. LeBlond D (2008) Data, variation, uncertainty, and probability distributions. Journal of GxP Compliance 12(3), 30-41.
5. LeBlond D (2008) Using probability distributions to make decisions. Journal of Validation Technology, Spring 2008, 2-14.
6. LeBlond D (2008) Estimation: knowledge building with probability distributions. Journal of GxP Compliance 12(4), 42-59.
7. LeBlond D (2008) Estimation: knowledge building with probability distributions - Reader Q&A. Journal of Validation Technology 14(5), 50-64.
8. LeBlond D (2009) Hypothesis testing: examples in pharmaceutical process and analytical development. Journal of GxP Compliance, to be published.
9. Neter J, et al. (1996) Applied Linear Statistical Models, 4th edn, Irwin.
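As a closing appendix to the bonus matrix slides: the regression formula beta-hat = (X'X)^-1 X'Y transcribes into pure Python (an illustrative addition, not from the deck; the helper names are made up):

```python
# Mirrors MMULT(MMULT(MINVERSE(MMULT(TRANSPOSE(X),X)),TRANSPOSE(X)),Y)
# for simple linear regression: X has an intercept column and one x column.

def transpose(A):
    """TRANSPOSE(A)."""
    return [list(row) for row in zip(*A)]

def mmult(A, B):
    """MMULT(A, B): matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def minverse2(A):
    """MINVERSE(A) for the 2x2 case, which is all X'X needs here."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def ols(xs, ys):
    """Intercept and slope of the least-squares line through (xs, ys)."""
    X = [[1.0, float(x)] for x in xs]
    Y = [[float(y)] for y in ys]
    Xt = transpose(X)
    beta = mmult(mmult(minverse2(mmult(Xt, X)), Xt), Y)
    return beta[0][0], beta[1][0]
```

Fed points lying exactly on y = 2x + 1, ols recovers intercept 1 and slope 2.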