Applied Data Analysis
Spring 2017
Security retinue or harem?
Karen Albert
[email protected]
Thursdays, 4-5 PM (Hark 302)

Lecture outline
1. Finishing the last lecture
2. Discrete distributions
3. Bernoulli
4. Binomial

Z-scores
A Z-score counts how many standard deviations a particular point lies from the mean.
$$z = \frac{x - \mu}{\sigma}$$
The Z-score is the point of interest minus the mean, divided by the standard deviation.

The Z-score in action
[Figure: two density panels plotted against x. One shows X ∼ N(20, 10^2), drawn with dnorm(x, 20, 10); the other shows X ∼ N(0, 1), drawn with dnorm(x, 0, 1).]

Questions I can ask about the normal
1. What percentage of the data is less (greater) than a certain number?
2. What percentage of the data is between two points?
3. Given a certain percentage, what number corresponds to it?

Question 2
In poor countries, the growth of children can be an important indicator of general levels of nutrition and health. Data in the paper “The Osteological Paradox: Problems of Inferring Prehistoric Health from Skeletal Samples” suggest that a reasonable model for the population distribution of the height of 5-year-old children is a normal distribution with a mean of 100 cm and a standard deviation of 6 cm. What proportion of the population has heights between 94 cm and 112 cm?

    # Shade the area under the N(100, 6^2) density between 94 and 112
    cord.x <- c(94, seq(94, 112, 0.01), 112)
    cord.y <- c(0, dnorm(seq(94, 112, 0.01), 100, 6), 0)
    curve(dnorm(x, 100, 6), 76, 124)
    polygon(cord.x, cord.y, col = 'skyblue')

[Figure: density dnorm(x, 100, 6) against x, with the region from 94 to 112 shaded]

Question 2 solution
    pnorm(112, 100, 6, lower.tail = TRUE) - pnorm(94, 100, 6, lower.tail = TRUE)
    ## [1] 0.8185946

Check your answer
The interval runs from 1 sd below the mean (94) to 2 sds above it (112).
• We expect 68% within 1 sd of the mean.
• We expect 95% within 2 sds of the mean.
• The percentage we are looking for must be between these two.
• Is it? Yes. 81.86% is between 68% and 95%.

Question 3
The distribution of the length of time required for students to complete telephone registration is well approximated by a normal distribution with a mean of 12 minutes and a standard deviation of 2 minutes. The University would like to choose an automatic disconnect time such that only 1% of the students will be disconnected while they are still attempting to register. What time should be chosen?

Question 3 solution
What does the top 1% mean? Disconnecting those that take the longest.

    # Shade the upper tail of the N(12, 2^2) density, from about 17 minutes on
    cord.x <- c(17, seq(17, 20, 0.01), 20)
    cord.y <- c(0, dnorm(seq(17, 20, 0.01), 12, 2), 0)
    curve(dnorm(x, 12, 2), 4, 20)
    polygon(cord.x, cord.y, col = 'skyblue')

[Figure: density dnorm(x, 12, 2) against x, with the upper tail shaded]

    # Time exceeded by only 1% of students (upper-tail quantile)
    qnorm(0.01, 12, 2, lower.tail = FALSE)
    ## [1] 16.6527

Check your answer
• We expect 2.5% of the data to be above 16 minutes (2 sds above the mean).
• We expect 0.15% of the data to be above 18 minutes (3 sds above the mean).
• Our answer must be between these two.
• Is it? Yes. 16.65 is between 16 and 18.

Probability distributions
A probability distribution lists the possible values of a random variable and their probabilities.
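To make the definition concrete, here is a minimal sketch that is not part of the original slides. It assumes a hypothetical random variable, the number of heads in two fair coin flips, and the names `values`, `probs`, and `dist` are illustrative only.

```r
# Hypothetical example: number of heads in two fair coin flips.
# The possible values are 0, 1, and 2 heads.
values <- c(0, 1, 2)

# Probabilities: TT (0 heads), HT or TH (1 head), HH (2 heads)
probs <- c(0.25, 0.50, 0.25)

# A probability distribution pairs each value with its probability
dist <- data.frame(value = values, probability = probs)
print(dist)

# The probabilities across all possible values add up to 1
total <- sum(dist$probability)
print(total)
```

Listing the distribution as a two-column table mirrors the definition directly: one column for the possible values, one for their probabilities.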
Two kinds:
• Discrete (probability mass functions)
• Continuous (probability density functions)

Probability mass functions
A probability mass function assigns a probability to each possible value of a discrete variable.

Properties
• $0 \le \Pr(y) \le 1$ (each probability is between 0 and 1)
• $\sum_{\text{all } y} \Pr(y) = 1$ (the probabilities must sum to 1)

Bernoulli distribution
The Bernoulli distribution is the simplest of all discrete distributions. It describes random variables that have only two levels: success and failure.

Examples:
• Voted v. did not vote
• Female v. male
• Southern v. not Southern

Bernoulli distribution
For Bernoulli random variables, the outcomes are usually coded 1 for “success” and 0 for “failure.”

Ten trials: 0 1 0 1 1 0 1 0 0 0

What are the mean and the variance?

Mean and variance of the Bernoulli
    x <- c(0, 1, 0, 1, 1, 0, 1, 0, 0, 0)
    # Mean: each value weighted by its proportion (4 ones, 6 zeros)
    1 * .4 + 0 * .6
    ## [1] 0.4
    # Variance: squared deviations from the mean, weighted the same way
    (0 - mean(x))^2 * .6 + (1 - mean(x))^2 * .4
    ## [1] 0.24

Another way
    mean(x)
    ## [1] 0.4
    mean(x) * (1 - mean(x))
    ## [1] 0.24

Bernoulli distribution
$$f(x) = \begin{cases} p^{x} (1 - p)^{1 - x}, & x = 0, 1 \\ 0, & \text{otherwise} \end{cases}$$
Mean is $p$. Variance is $p(1 - p)$.

Binomial distribution
Flip a coin 10 times. Each “trial” is a Bernoulli random variable. The sum of n independent Bernoulli-distributed RVs with the same p is a Binomial RV.

Binomial distribution
What is the probability of getting 3 heads in 10 flips?

Let’s say the heads come on the first 3 flips.
$$p \cdot p \cdot p \cdot (1-p) \cdot (1-p) \cdot (1-p) \cdot (1-p) \cdot (1-p) \cdot (1-p) \cdot (1-p)$$
or
$$p^{3} (1 - p)^{7}$$
But what if the heads don’t come on the first 3 flips?

How many ways?
$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$$
$$\binom{4}{0} = 1, \quad \binom{4}{1} = 4, \quad \binom{4}{2} = 6, \quad \binom{4}{3} = 4, \quad \binom{4}{4} = 1$$

    choose(4, 2)
    ## [1] 6

Binomial distribution
$$\Pr(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n$$
Mean is $np$. Variance is $np(1 - p)$.

Simulating a Binomial distribution
    library(MASS)                    # provides truehist()
    x <- rbinom(1000, 10, 0.2)       # 1000 draws, n = 10 trials, p = 0.2
    truehist(x, prob = TRUE)

[Figure: density histogram of the simulated x for p = 0.2]

Simulating a Binomial distribution
    library(MASS)
    x <- rbinom(1000, 10, 0.7)       # same setup with p = 0.7
    truehist(x, prob = TRUE)

[Figure: density histogram of the simulated x for p = 0.7]

Practice
What is the probability of 2 heads in 4 flips of a coin?
$$\binom{n}{k} p^{k} (1 - p)^{n - k} = \binom{4}{2} \left(\frac{1}{2}\right)^{2} \left(\frac{1}{2}\right)^{4 - 2} = 6 \left(\frac{1}{2}\right)^{4} = \frac{6}{16} = 0.375$$

Another way
    dbinom(2, 4, .5)
    ## [1] 0.375

When the Binomial is appropriate
1. The trials are independent.
2. The number of trials, n, is fixed.
3. Each trial outcome can be classified as a success or failure.
4. The probability of a success, p, is the same for each trial.

Practice
Polls estimate that 60% of American voters disapprove of the Trump administration’s travel ban. What is the probability that at least 1 out of 10 randomly sampled American voters disapproves of the ban? What is the probability that at most 2 out of 10 randomly sampled American voters approve of the ban?

Solution, part 1
Here a success is a voter who disapproves, so p = 0.6.
$$\begin{aligned} \Pr(\text{at least 1}) &= 1 - \Pr(\text{none}) \\ &= 1 - \binom{10}{0} \cdot 0.6^{0} \cdot 0.4^{10} \\ &= 1 - 0.4^{10} \\ &\approx 1 \end{aligned}$$

Solution, part 2
Here X counts the voters who approve, so p = 0.4.
$$\begin{aligned} \Pr(\text{at most 2}) &= \Pr(X = 0) + \Pr(X = 1) + \Pr(X = 2) \\ &= \binom{10}{0} \cdot 0.4^{0} \cdot 0.6^{10} + \binom{10}{1} \cdot 0.4^{1} \cdot 0.6^{9} + \binom{10}{2} \cdot 0.4^{2} \cdot 0.6^{8} \\ &= 0.6^{10} + 10 \cdot 0.4 \cdot 0.6^{9} + 45 \cdot 0.16 \cdot 0.6^{8} \\ &= \ ? \end{aligned}$$

Using R
    # Part 1: at least 1 of 10 disapproves
    1 - dbinom(0, 10, .6)
    ## [1] 0.9998951
    # Part 2: at most 2 of 10 approve
    dbinom(0, 10, .4) + dbinom(1, 10, .4) + dbinom(2, 10, .4)
    ## [1] 0.1672898

What did we learn?
• Z-scores
• Normal practice
• Bernoulli distribution
• Binomial distribution
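As a recap, the practice answers above can be cross-checked in R. The following is a minimal sketch that is not part of the original slides: it redoes Question 2 through Z-scores, confirms that the Question 3 cutoff leaves 1% in the upper tail, recomputes the travel-ban answer with the cumulative function `pbinom`, and compares a simulated Binomial sample against np and np(1 − p). The variable names, the seed, and the simulation size are illustrative assumptions.

```r
# Question 2 via Z-scores: standardize the endpoints, then use N(0, 1)
z_lo <- (94 - 100) / 6     # 1 sd below the mean
z_hi <- (112 - 100) / 6    # 2 sds above the mean
q2 <- pnorm(z_hi) - pnorm(z_lo)   # should agree with 0.8185946 from the slides

# Question 3: the chosen cutoff should leave 1% of students above it
cutoff <- qnorm(0.01, 12, 2, lower.tail = FALSE)
tail_area <- pnorm(cutoff, 12, 2, lower.tail = FALSE)

# Travel-ban practice, part 2: X = number who approve, n = 10, p = 0.4.
# pbinom gives Pr(X <= 2) directly instead of summing three dbinom terms.
at_most_2 <- pbinom(2, 10, 0.4)   # should agree with 0.1672898 from the slides

# Simulation check of the Binomial mean np and variance np(1 - p)
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- rbinom(100000, 10, 0.2)
sim_mean <- mean(x)                # theory: 10 * 0.2 = 2
sim_var <- var(x)                  # theory: 10 * 0.2 * 0.8 = 1.6

print(c(q2 = q2, cutoff = cutoff, tail_area = tail_area, at_most_2 = at_most_2))
print(c(sim_mean = sim_mean, sim_var = sim_var))
```

Standardizing first and calling `pnorm` with its default mean 0 and sd 1 gives the same area as passing the mean and sd directly, which is the point of the Z-score. The simulated mean and variance will not match theory exactly, only approximately.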