Lecture 7

Applied Data Analysis
Spring 2017
Security retinue or harem?
Karen Albert
[email protected]
Thursdays, 4-5 PM (Hark 302)
Lecture outline
1. Finishing the last lecture
2. Discrete distributions
3. Bernoulli
4. Binomial
Z-scores
Z-scores count up how many standard deviations away from
the mean a particular point is.
$z = \frac{x - \mu}{\sigma}$
The Z-score is the point of interest minus the mean divided by
the standard deviation.
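As a quick illustration, the formula can be applied directly in R. This is only a sketch: the point x = 35 is a made-up value, and the mean and standard deviation are assumed to be 20 and 10.

```r
# Hypothetical values: a point x from a normal with mean 20 and sd 10
x     <- 35
mu    <- 20
sigma <- 10

# Z-score: distance from the mean in standard-deviation units
z <- (x - mu) / sigma
z   # 1.5

# The probability below x is the same on the original and standardized scales
pnorm(x, mu, sigma)
pnorm(z, 0, 1)
```

Standardizing does not change any probabilities; it only changes the units in which they are looked up.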
The Z-score in action
[Figure: two density curves. The first shows X ∼ N(20, 10^2), drawn with dnorm(x, 20, 10) for x from -20 to 60. The second shows the standardized X ∼ N(0, 1), drawn with dnorm(x, 0, 1) for x from -4 to 4.]
Questions I can ask about the normal distribution
1. What percentage of the data is less (greater) than a certain
number?
2. What percentage of the data is between two points?
3. Given a certain percentage, what number corresponds to
it?
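Each of these three questions maps onto a base R function. The sketch below uses a made-up normal distribution (mean 50, standard deviation 5) purely for illustration; the variable names are ours.

```r
# Hypothetical normal: mean 50, sd 5 (values chosen only for illustration)
mu    <- 50
sigma <- 5

# 1. Proportion below (or above) a number
below_45 <- pnorm(45, mu, sigma)                      # lower tail
above_45 <- pnorm(45, mu, sigma, lower.tail = FALSE)  # upper tail

# 2. Proportion between two points: subtract two lower tails
between_45_60 <- pnorm(60, mu, sigma) - pnorm(45, mu, sigma)

# 3. The number that cuts off a given proportion
cutoff_90 <- qnorm(0.90, mu, sigma)   # 90% of the data lies below this value
```

pnorm() goes from a value to a proportion, and qnorm() goes from a proportion back to a value.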
Question 2
In poor countries, the growth of children can be an important
indicator of general levels of nutrition and health. Data in the
paper “The Osteological Paradox: Problems of Inferring
Prehistoric Health from Skeletal Samples” suggests that a
reasonable model for the population distribution of the height of
5-year-old children is a normal distribution with mean 100 cm
and a standard deviation of 6 cm.
What proportion of the population has heights between 94 cm
and 112 cm?
Question 2
cord.x <- c(94,seq(94,112,0.01),112)
cord.y <- c(0,dnorm(seq(94,112,0.01),100,6),0)
curve(dnorm(x,100,6),76,124)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density of N(100, 6^2) for x from 76 to 124, with the area between 94 and 112 shaded.]
Question 2 solution
pnorm(112,100,6,lower.tail=TRUE)-pnorm(94,100,6,lower.tail=TRUE)
## [1] 0.8185946
Check your answer
• We expect 68% within 1 sd of the mean.
• We expect 95% within 2 sds of the mean.
• The percentage we are looking for must be between these
two.
• Is it? Yes. 81.86% is between 68% and 95%.
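The same sanity check can be run in R. This is a sketch; the variable names are ours, and the 68% and 95% figures are computed from the standard normal rather than taken from a table.

```r
# Proportion within 1 and 2 standard deviations of the mean (standard normal)
within_1sd <- pnorm(1) - pnorm(-1)
within_2sd <- pnorm(2) - pnorm(-2)

# The interval 94 to 112 runs from 1 sd below to 2 sds above the mean of 100
answer <- pnorm(112, 100, 6) - pnorm(94, 100, 6)

c(within_1sd, answer, within_2sd)
within_1sd < answer && answer < within_2sd   # TRUE
```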
Question 3
The distribution of the length of time required for students to
complete telephone registration is well approximated by a
normal distribution with a mean of 12 minutes and a standard
deviation of 2 minutes.
The University would like to choose an automatic disconnect
time such that only 1% of the students will be disconnected while
they are still attempting to register. What time should be
chosen?
Question 3 solution
What does the top 1% mean?
Disconnecting those that take the longest.
cord.x <- c(17,seq(17,20,0.01),20)
cord.y <- c(0,dnorm(seq(17,20,0.01),12,2),0)
curve(dnorm(x,12,2),4,20)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density of N(12, 2^2) for x from 4 to 20, with the upper tail from 17 to 20 shaded.]
Question 3 solution
qnorm(0.01,12,2,lower.tail=FALSE)
## [1] 16.6527
Check your answer
• We expect 2.5% of the data to be above 16 minutes.
• We expect 0.15% of the data to be above 18 minutes.
• Our answer must be between these two.
• Is it? Yes. 16.65 is between 16 and 18.
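This check can also be done in R by going in the opposite direction. The following is a sketch; cutoff is our own variable name.

```r
cutoff <- qnorm(0.01, 12, 2, lower.tail = FALSE)

# Upper-tail proportions at 2 and 3 sds above the mean
pnorm(16, 12, 2, lower.tail = FALSE)   # close to 2.5%
pnorm(18, 12, 2, lower.tail = FALSE)   # close to 0.15%

# Feeding the cutoff back into pnorm should return the 1% we started with
pnorm(cutoff, 12, 2, lower.tail = FALSE)
```

Because pnorm() and qnorm() are inverses, the last line is a direct confirmation that the chosen time leaves 1% in the upper tail.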
Probability distributions
A probability distribution lists the possible values of a random
variable and their probabilities.
Two kinds:
• Discrete (probability mass functions)
• Continuous (probability density functions)
Probability mass functions
A probability mass function assigns a probability to each
possible value of a discrete variable.
Properties
• $0 \le \Pr(y) \le 1$ (each prob. is between 0 and 1)
• $\sum_{\text{all } y} \Pr(y) = 1$ (the probs. must sum to 1)
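Both properties are easy to check numerically. The sketch below uses a fair six-sided die as the pmf; the die is our own example, not one from the lecture.

```r
# Hypothetical pmf: a fair six-sided die
y    <- 1:6
pr_y <- rep(1/6, 6)

# Property 1: each probability is between 0 and 1
all(pr_y >= 0 & pr_y <= 1)

# Property 2: the probabilities sum to 1 (all.equal allows for rounding error)
isTRUE(all.equal(sum(pr_y), 1))
```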
Bernoulli distribution
The Bernoulli distribution is the simplest of all discrete
distributions.
It describes random variables that only have two levels:
success and failure.
Examples:
• Voted v. did not vote
• Female v. male
• Southern v. not Southern
Bernoulli distribution
The outcomes of a Bernoulli random variable are usually
coded 1 for “success” and 0 for “failure.”
Ten trials: 0 1 0 1 1 0 1 0 0 0
What are the mean and the variance?
Mean and variance of the Bernoulli
x <- c(0,1,0,1,1,0,1,0,0,0)
1*.4+0*.6
## [1] 0.4
(0-mean(x))^2*.6+(1-mean(x))^2*.4
## [1] 0.24
Another way
mean(x)
## [1] 0.4
mean(x)*(1-mean(x))
## [1] 0.24
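One caution, offered as an aside: R's built-in var() divides by n − 1 rather than n, so it will not reproduce the 0.24 above exactly. A small sketch of the difference, using the same ten trials:

```r
x <- c(0,1,0,1,1,0,1,0,0,0)
n <- length(x)

mean((x - mean(x))^2)    # divides by n; matches mean(x)*(1-mean(x))
var(x)                   # divides by n - 1, so it is slightly larger
var(x) * (n - 1) / n     # rescaled to match the first line
```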
Bernoulli distribution
$f(x) = \begin{cases} p^{x} (1 - p)^{1 - x}, & x = 0, 1 \\ 0, & \text{otherwise} \end{cases}$
Mean is p.
Variance is p(1 − p).
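The pmf can be written as a short R function. This is a sketch: bernoulli_pmf is a hypothetical helper name of ours, and p = 0.4 is taken from the ten-trial example.

```r
# Hypothetical helper: returns p^x (1 - p)^(1 - x) for x in {0, 1}, else 0
bernoulli_pmf <- function(x, p) {
  ifelse(x %in% c(0, 1), p^x * (1 - p)^(1 - x), 0)
}

p <- 0.4
bernoulli_pmf(c(0, 1, 2), p)   # 0.6 0.4 0.0

# A Bernoulli is a Binomial with a single trial, so dbinom gives the same values
dbinom(c(0, 1, 2), size = 1, prob = p)
```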
Binomial distribution
Flip a coin 10 times.
Each “trial” is a Bernoulli random variable.
The sum of a bunch of Bernoulli-distributed RVs is a Binomial
RV.
Binomial distribution
What is the probability of getting 3 heads in 10 flips?
Let’s say the heads come on the first 3 flips.
$p \cdot p \cdot p \cdot (1-p) \cdot (1-p) \cdot (1-p) \cdot (1-p) \cdot (1-p) \cdot (1-p) \cdot (1-p)$
or
$p^{3} (1 - p)^{7}$
But what if the heads don’t come on the first 3 flips?
How many ways?
$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$
$\binom{4}{0} = 1, \quad \binom{4}{1} = 4, \quad \binom{4}{2} = 6, \quad \binom{4}{3} = 4, \quad \binom{4}{4} = 1$
choose(4,2)
## [1] 6
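For comparison, the factorial formula can be typed out directly. A brief sketch, with n and k as our own variable names:

```r
n <- 4
k <- 2

# The formula n! / (k! (n - k)!) written out with factorial()
factorial(n) / (factorial(k) * factorial(n - k))   # 6, same as choose(4, 2)

# choose() is vectorized over k, so one call gives the whole row
choose(4, 0:4)   # 1 4 6 4 1
```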
Binomial distribution
$\Pr(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n$
Mean is np.
Variance is np(1 − p).
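To connect the formula to R, the sketch below computes the probability of 3 heads in 10 flips by hand and with dbinom(), then recovers the mean and variance from the pmf. The fair coin (p = 0.5) is an assumption, and the variable names are ours.

```r
n <- 10
k <- 3
p <- 0.5   # assumption: a fair coin

# Number of orderings times the probability of any one ordering
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)
by_hand
dbinom(k, n, p)

# Mean and variance computed from the pmf itself
k_all  <- 0:n
pmf    <- dbinom(k_all, n, p)
mu     <- sum(k_all * pmf)             # equals n * p
sigma2 <- sum((k_all - mu)^2 * pmf)    # equals n * p * (1 - p)
```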
Simulating a Binomial distribution
library(MASS)
x <- rbinom(1000,10,0.2)
truehist(x,prob=TRUE)
[Figure: histogram of x on the density scale, with the x axis running from 0 to 6.]
Simulating a Binomial distribution
library(MASS)
x <- rbinom(1000,10,0.7)
truehist(x,prob=TRUE)
[Figure: histogram of x on the density scale, with the x axis running from 2 to 10.]
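The simulated draws can also be compared with the theoretical mean np and variance np(1 − p). This is a sketch: the seed is arbitrary and only there to make the draws reproducible, and the simulated values will be close to, not equal to, the theoretical ones.

```r
set.seed(1)   # arbitrary seed so the draws are reproducible
n <- 10
p <- 0.7
x <- rbinom(1000, n, p)

# Simulated versus theoretical mean and variance
c(simulated = mean(x), theory = n * p)
c(simulated = var(x),  theory = n * p * (1 - p))
```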
Practice
What is the probability of 2 heads in 4 flips of a coin?
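$\binom{n}{k} p^{k} (1 - p)^{n - k} = \binom{4}{2} \left(\frac{1}{2}\right)^{2} \left(\frac{1}{2}\right)^{(4 - 2)}$
$= 6 \left(\frac{1}{2}\right)^{4}$
$= \frac{6}{16}$
$= 0.375$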
Another way
dbinom(2,4,.5)
## [1] 0.375
When the Binomial is appropriate
1. The trials are independent.
2. The number of trials, n, is fixed.
3. Each trial outcome can be classified as a success or
failure.
4. The probability of a success, p, is the same for each trial.
Practice
Polls estimate that 60% of American voters disapprove of the
Trump administration’s travel ban.
What is the probability that at least 1 out of 10 randomly
sampled American voters disapproves of the ban?
What is the probability that at most 2 out of 10 randomly
sampled American voters approve of the ban?
Solution, part 1
Pr(at least 1) = 1 − Pr(none)
$= 1 - \binom{10}{0} \cdot 0.6^{0} \cdot 0.4^{10}$
$= 1 - 0.4^{10}$
≈ 1
Solution, part 2
Pr(at most 2) = Pr(X = 0) + Pr(X = 1) + Pr(X = 2)
$= \binom{10}{0} \cdot 0.4^{0} \cdot 0.6^{10} + \binom{10}{1} \cdot 0.4^{1} \cdot 0.6^{9} + \binom{10}{2} \cdot 0.4^{2} \cdot 0.6^{8}$
$= 0.6^{10} + 10 \cdot 0.4 \cdot 0.6^{9} + 45 \cdot 0.16 \cdot 0.6^{8}$
= ?
Using R
1-dbinom(0,10,.6)
## [1] 0.9998951
dbinom(0,10,.4)+dbinom(1,10,.4)+dbinom(2,10,.4)
## [1] 0.1672898
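The same two answers can be obtained from the cumulative function pbinom(), which avoids adding up individual dbinom() terms. A sketch of that alternative:

```r
# Part 1: at least 1 of 10 disapproves (p = 0.6)
pbinom(0, 10, 0.6, lower.tail = FALSE)   # Pr(X > 0)

# Part 2: at most 2 of 10 approve (p = 0.4)
pbinom(2, 10, 0.4)                       # Pr(X <= 2)
sum(dbinom(0:2, 10, 0.4))                # the same sum written with dbinom
```

Note that lower.tail = FALSE gives Pr(X > q), a strict inequality, which is why the first call uses 0 and not 1.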
What did we learn?
• Z-scores
• Normal practice
• Bernoulli distribution
• Binomial distribution