Unit 1: Probability

Statistics for Cyber Security
Wenyaw Chan
Division of Biostatistics
School of Public Health
The University of Texas Health Science Center at Houston
Module (a): Basic Properties
of Probability and Statistical
Inferences
Definition of Probability
Three ways of defining Probability:
1. Objective Probability
2. Deductive Logic Definition of Probability
3. Subjective Definition of Probability
Definition of Probability
• Objective Probability
– If E is an event in an experiment, the experiment is repeated a
very large number of times, say N, and the event E is observed
in n of these N trials, then Prob(E) ≈ n/N.
• Deductive Logic Definition of Probability
– The probability of an event is determined logically from
symmetry or geometric considerations associated with the
experiment.
• Subjective Definition of Probability
– The probability of an event is determined subjectively, reflecting
a person's “degree of belief” that the event will occur.
Examples
1. Toss a coin 10,000 times; if 5,001 heads are observed,
Pr(Head) = 5001/10000 = 0.5001 (objective)
2. Throw a dart at a circular target:
Pr(region A) = area(A)/total area (deductive/geometric)
3. What is the probability that John will
pass this course? The answer depends on who is
answering: John, the professor, or his friend (subjective).
Properties of probability
1. 0 ≤ P(E) ≤ 1
2. P(A or B occurs) = P(A) + P(B), if A and B
cannot happen at the same time
Mutually Exclusive
– Two events A and B are mutually exclusive if A ∩ B = ∅
– A1 = {1,2}, A2 = {3,4} and A3 = {5,6} are
mutually exclusive
Properties of probability II
3. P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
– If two events A and B are independent, then
P(A ∩ B) = P(A)P(B)
4. If events A and B are independent, then
P(A ∪ B) = P(A) + P(B) − P(A)P(B)
= P(A) + P(B)[1 − P(A)]
Conditional Probability
Conditional probability of B given A:
P(B|A) = P(A ∩ B)/P(A).
– If A and B are independent, then
P(B|A) = P(B) = P(B|Aᶜ).
– If A and B are not independent, then
P(B|A) ≠ P(B) ≠ P(B|Aᶜ).
Toss two dice: what is the probability that sum = 6,
given that the sum is even?
(5/36)/(1/2) = 5/18
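A quick check of this result in SAS (a minimal sketch; the data step simply enumerates all 36 equally likely outcomes):
* enumerate the 36 outcomes of two dice and form;
* Pr(sum = 6 | sum even) = Pr(sum = 6)/Pr(sum even);
data _null_;
  n_even = 0; n_six = 0;
  do blue = 1 to 6;
    do red = 1 to 6;
      s = blue + red;
      if mod(s, 2) = 0 then n_even + 1;
      if s = 6 then n_six + 1;
    end;
  end;
  cond_prob = (n_six/36) / (n_even/36);   * = (5/36)/(1/2) = 5/18;
  put cond_prob=;
run;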
Exhaustive Events
A set of events A1, A2, A3, …, Ak is exhaustive if at least
one of the events must occur,
i.e. A1 ∪ A2 ∪ A3 ∪ … ∪ Ak = sample space
Toss a die
A1 = {1,2,3}, A2 = {1,3,4} and A3 = {2,5,6} are exhaustive
Law of Total Probability
P(B) = Σᵢ P(B|Ai) P(Ai), if A1, A2, A3, …, Ak
are mutually exclusive and exhaustive
Toss a blue die and a red die:
Pr(red = even) =
Pr(red = even | sum=2) Pr(sum=2)
+ Pr(red = even | sum=3) Pr(sum=3)
+ Pr(red = even | sum=4) Pr(sum=4) + …
+ Pr(red = even | sum=12) Pr(sum=12)
Bayes’ Rule
Let A and B be two events. Then
P(B|A)
= [P(A|B)P(B)] / [P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ)]
Toss a blue die and a red die:
Pr(sum = even | red=2) =
Pr(red=2 | sum = even) Pr(sum = even) /
{Pr(red=2 | sum = even) Pr(sum = even) +
Pr(red=2 | sum = odd) Pr(sum = odd)}
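Worked out: given red = 2, the sum is even exactly when the blue die is even, so Pr(red=2 | sum even) = (1/12)/(1/2) = 1/6, and likewise Pr(red=2 | sum odd) = 1/6. Bayes' rule then gives Pr(sum even | red=2) = (1/6 · 1/2) / (1/6 · 1/2 + 1/6 · 1/2) = 1/2.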
Population and Samples
• Random Sample
– is a selection of some members of the population such that each
member is independently chosen and has a known nonzero
probability of being selected.
• Simple Random Sample
– is a random sample in which each group member has the same
probability of being selected.
• Cluster Sampling
– involves selecting a random sample of clusters and then looking
at all study units within the chosen clusters.
– (one-stage)
– In two-stage sampling, a random sample of clusters is selected
and then, within each cluster, a random sample of study units is
selected.
Random Variables and their
Distributions
Random Variable
• Random Variable:
– A numeric function that assigns probabilities to
different events in a sample space.
• Discrete Random Variable:
– A random variable that assumes only a finite or
denumerable number of values.
– The probability mass function of a discrete random
variable X that assumes values x1, x2,… is p(x1),
p(x2), …., where p(xi)=Pr[X= xi].
• Continuous Random Variable:
– A random variable whose possible values cannot be
enumerated.
Example: Flip a coin 3 times
• Random Variable
– X = # of heads in the 3 coin tosses
• Probability Mass Function
– P(X=3) = P{HHH} = 1/8
– P(X=2) = P{HHT, HTH, THH} = 3/8
– P(X=1) = P{HTT, THT, TTH} = 3/8
– P(X=0) = P{TTT} = 1/8
• X is a discrete random variable with probability
(mass) function
x        0     1     2     3
P(X=x)   1/8   3/8   3/8   1/8
Random Variable
Expected value of X:
E(X) = μ = Σ_{i=1}^{k} x_i · Pr(X = x_i)
Variance of X:
Var(X) = σ² = Σ_{i=1}^{k} (x_i − μ)² · Pr(X = x_i)
Standard Deviation of X:
σ = √Var(X)
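Worked example, using the coin-toss table above:
E(X) = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5
Var(X) = (0 − 1.5)²(1/8) + (1 − 1.5)²(3/8) + (2 − 1.5)²(3/8) + (3 − 1.5)²(1/8) = 0.75
σ = √0.75 ≈ 0.866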
Random Variable
• Note:
Var(X) = E(X − μ)² = E(X²) − [E(X)]²
• Cumulative Distribution Function
– of X: F(x) = Pr(X ≤ x)
Binomial Distribution
• Examples of the binomial distribution have a
common structure:
– n independent trials
– each trial has only two possible outcomes, called
“success” and “failure”.
– Pr (success) = p for all trials
Binomial Distribution
• If X= # of successful trials in these n trials,
then X has a binomial distribution.
n k
P X  k     p (1  p ) n  k
k 
• k=0,1,2,….,n
• where
n
n!
 k   (n  k )!k !
 
• Example: Flip a coin 10 times
Properties of Binomial
Distribution
• If X~ Binomial (n, p), then
E(X) = np
Var (X) = np(1-p)
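A minimal SAS sketch of these formulas (PDF('BINOMIAL', k, p, n) is the built-in binomial mass function), for n = 10 coin flips:
* tabulate the Binomial(n = 10, p = 0.5) mass function;
data binom;
  n = 10; p = 0.5;
  do k = 0 to n;
    prob = pdf('binomial', k, p, n);   * P(X = k);
    output;
  end;
  keep k prob;
run;
proc print data=binom; run;
By the properties above, E(X) = np = 5 and Var(X) = np(1 − p) = 2.5.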
Poisson Distribution
Pr X  k  
k=0,1,2,…..
e
 k

k!
If X~ Poisson (), then EX =  and VarX = 
Poisson Process
• Assumption 1:
– Pr{1 event occurs in a very small time interval [0, Δt)} ≈ λΔt
– Pr{0 events occur in a very small time interval [0, Δt)} ≈ 1 − λΔt
– Pr{more than one event occurs in a very small time interval
[0, Δt)} ≈ 0
• Assumption 2:
– The probability distribution of the number of events per unit time is the
same throughout the entire time interval
• Assumption 3:
– Pr{one event in [t1, t2) | one event in [t0, t1)}
= Pr{one event in [t1, t2)}
Poisson Distribution
• X = the number of events occurring in the time
period t for the above process with
parameter λ; then mean = λt and
Pr(X = k) = e^(−λt) (λt)^k / k!
where k = 0, 1, 2, …
and e ≈ 2.71828
E(X) = Var(X) = λt
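A hedged SAS sketch (the rate is a made-up value): if intrusion attempts arrive as a Poisson process at λ = 2 per hour, the count over t = 3 hours is Poisson with mean λt = 6:
data _null_;
  lambda = 2; t = 3; mu = lambda*t;    * assumed rate and time window;
  p5   = pdf('poisson', 5, mu);        * Pr(X = 5);
  ple5 = cdf('poisson', 5, mu);        * Pr(X <= 5);
  put p5= ple5=;
run;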
Poisson approximation to
Binomial
• If X~ Binomial (n, p), n is large and p is
small, then
P(X = k) ≈ e^(−np) (np)^k / k!
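A minimal sketch comparing the two, with illustrative values n = 100 and p = 0.01 (so np = 1):
data approx;
  n = 100; p = 0.01;
  do k = 0 to 5;
    binom = pdf('binomial', k, p, n);   * exact binomial probability;
    pois  = pdf('poisson', k, n*p);     * Poisson approximation;
    output;
  end;
  keep k binom pois;
run;
proc print data=approx; run;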
Continuous Probability
Distributions
• Probability density function (p.d.f.) (of a
random variable):
– a curve f(x) such that the area under the curve
between any two points a and b equals
Pr(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
figure: the shaded area under f(x) between a and b equals Pr(a ≤ X ≤ b)
Continuous Probability
Distributions
• Cumulative distribution function: F(a) = Pr(X ≤ a)
figure: F(a) is the area under the density to the left of a
Continuous Probability
Distributions
• The expected value of a continuous
random variable X is
E(X) = ∫ x f(x) dx, where f(x) is the p.d.f. of X.
• The definition of the variance of a
continuous random variable is the
same as that of a discrete random
variable, i.e.
Var(X) = E(X²) − (EX)² = ∫ (x − µ)² f(x) dx, where
µ = E(X).
The Normal Distribution
(The Gaussian distribution)
• The p.d.f. of a normal distribution:
f(x) = [1 / (√(2π) σ)] exp{ −(x − μ)² / (2σ²) }
where −∞ < x < ∞
The Normal Distribution
figure: a bell-shaped curve symmetric about μ, with points
of inflection at μ − σ and μ + σ
• Notation: X ~ N(μ, σ²)
μ : mean
σ² : variance
The Normal Distribution
• N(0,1) is the standard normal distribution
• If X~ N(0,1), then
Φ(x) = Pr(X ≤ x)
– ~ : “is distributed as”
– Φ : c.d.f. for the standard normal r.v.
• Note:
– A point of inflection is a point where the concavity
of the curve changes its direction.
Properties of the N(0,1)
• 1. Φ(−x) = 1 − Φ(x)
• 2.
– About 68% of the area under the standard
normal curve lies between –1 and 1.
– About 95% of the area under the standard
normal curve lies between –2 and 2.
– About 99% of the area under the standard
normal curve lies between –2.5 and 2.5.
Properties of the N(0,1)
• If X ~ N(0,1) and P(X < Z_u) = u, 0 ≤ u ≤ 1,
then Z_u is called the 100u-th percentile
of the standard normal distribution.
95th %tile = 1.645, 97.5th %tile = 1.96, 99th %tile = 2.33
figure: the area to the left of Z_u under the standard normal curve equals u
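A minimal SAS sketch: PROBIT( ) returns these percentiles and PROBNORM( ) the corresponding areas:
data _null_;
  z95  = probit(0.95);     * 1.645;
  z975 = probit(0.975);    * 1.960;
  z99  = probit(0.99);     * 2.326;
  u    = probnorm(1.96);   * Pr(X <= 1.96) = 0.975;
  put z95= z975= z99= u=;
run;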
Properties of the N(0,1)
• If X ~ N(μ, σ²), then
(X − μ)/σ ~ N(0, 1)
• This property allows us to calculate the
probability of a non-standard normal
random variable.
Pr(a ≤ X ≤ b) = Pr( (a − μ)/σ ≤ (X − μ)/σ ≤ (b − μ)/σ )
= Φ( (b − μ)/σ ) − Φ( (a − μ)/σ )
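A minimal sketch, with assumed values μ = 100 and σ = 15, computing Pr(90 ≤ X ≤ 120) this way:
data _null_;
  mu = 100; sigma = 15;    * assumed values;
  prob = probnorm((120 - mu)/sigma) - probnorm((90 - mu)/sigma);
  put prob=;               * about 0.656;
run;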
Other Distributions--t distribution
• Let X1, …, Xn be a random sample from a
normal population N(μ, σ²).
Then
(X̄ − μ) / (s/√n)
has a t distribution with n-1 degrees of
freedom (df).
Other Distributions--Chi-square distribution
• Let X1, ….Xn be a random sample from a
normal population N(0, 1).
Then
Σ_{i=1}^{n} Xᵢ²
has a chi-square distribution with n
degrees of freedom (df).
Other Distributions--F distribution
• Let U and V be independent random
variables and each has a chi-square
distribution with p and q degrees of
freedom respectively.
Then
(U/p) / (V/q)
has an F distribution with p and q degrees
of freedom (df).
Covariance and Correlation
• The covariance between two random
variables is defined by
Cov(X,Y)=E[(X-µX)(Y-µY)].
• The correlation coefficient between two
random variables is defined by
ρ = Corr(X,Y) = Cov(X,Y)/(σ_X σ_Y).
Variance of a Linear
Combination
• Var(c1X1 + c2X2)
= c1² Var(X1) + c2² Var(X2) + 2c1c2 Cov(X1, X2)
= c1² Var(X1) + c2² Var(X2) + 2c1c2 σ_X1 σ_X2 Corr(X1, X2)
Estimation
• Point Estimates
– A point estimate of a parameter θ is a single
number used as an estimate of the value of θ.
– e.g. A natural estimate to use for estimating the
population mean μ is the sample mean
X̄ = Σ_{i=1}^{n} Xᵢ / n
• Interval Estimation
– If a random interval I = (L, U) satisfies
Pr(L < θ < U) = 1 − α, the observed values of L and U for a
given sample are called a (1 − α) confidence interval
estimate for θ.
Which one is more accurate?
Which one is more precise?
Estimation
What to estimate?
• B(n, p) → proportion p
• Poisson(μ) → mean μ
• N(μ, σ²) → mean and/or variance
Estimation of the Mean of a
Distribution
• A point estimator of the population mean
is the sample mean.
• The sampling distribution of X̄
is the distribution of values of X̄ over all
possible samples of size n that could have
been selected from the reference population.
E(X̄) = μ
Estimation
• An estimator of a parameter is unbiased
estimator if its expectation is equal to the
parameter.
• Note: Unbiasedness is not sufficient to be used as
the only criterion for choosing an estimator.
• The unbiased estimator with the
minimum variance (MVUE) is preferred.
• If the population is normal, then X̄ is the MVUE of μ.
Sample Mean
• Standard error (of the mean)
= standard deviation of the sample mean
= √(σ²/n) = σ/√n
• The estimated standard error
= s/√n
where s: sample standard deviation.
Central Limit Theorem
• Let X1, …, Xn be a random sample from
some population with mean μ and
variance σ².
Then, for large n,
X̄ ≈ N(μ, σ²/n)
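A minimal simulation sketch (illustrative only, using a skewed exponential population): the means of n = 30 draws should look approximately normal with mean 1 and variance 1/30:
data clt;
  call streaminit(12345);
  do rep = 1 to 1000;                 * 1000 simulated samples;
    total = 0;
    do i = 1 to 30;
      total + rand('exponential');    * population mean 1, variance 1;
    end;
    xbar = total / 30;
    output;
  end;
  keep xbar;
run;
proc means data=clt mean var;         * expect mean near 1, var near 1/30;
  var xbar;
run;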
Interval Estimation
• Let X1, …, Xn be a random sample from a
normal population N(μ, σ²). If σ² is known,
a 95% confidence interval (C.I.) for μ is
( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n )
why? (next slide)
Interval Estimation
If X̄ ~ N(μ, σ²/n), then Pr( −1.96 ≤ (X̄ − μ)/(σ/√n) ≤ 1.96 ) = .95
i.e.
−1.96 σ/√n ≤ X̄ − μ ≤ 1.96 σ/√n
⇔ X̄ − 1.96 σ/√n ≤ μ ≤ X̄ + 1.96 σ/√n
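In practice σ is usually unknown; a minimal sketch with made-up data, using the t-based interval that PROC MEANS reports via the CLM option:
data scores;
  input x @@;                 * hypothetical observations;
  datalines;
102 98 110 95 101 99 104 97 105 100
;
proc means data=scores n mean stddev clm alpha=0.05;
  var x;
run;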
Interval Estimation
Interpretation of Confidence Interval
• Over the collection of 95% confidence
intervals that could be constructed from
repeated random samples of size n, 95%
of them will contain the parameter μ.
• It is wrong to say:
there is a 95% chance that the parameter
μ will fall within a particular 95%
confidence interval.
Interval Estimation
• Note:
1. When σ and n are fixed, a 99% C.I. is wider than a 95% C.I.
2. If the length of the C.I. is specified, the sample size can be
determined:
n = ( 2 × Z_{1−α/2} × σ / length )²
Hypothesis Testing
• Null hypothesis(H0): the statement to be
tested, usually reflecting the status quo.
• Alternative hypothesis (H1): the logical
complement of H0.
• Note: the null hypothesis is analogous to a
defendant in court. It is presumed to be true
unless the data argue overwhelmingly to the
contrary.
Hypothesis Testing
• Four possible outcomes of the decision:
                     Truth
Decision        H0               H1
Accept H0       OK               Type II error
Reject H0       Type I error     OK
• Notation:
α = Pr(Type I error) = level of significance
β = Pr(Type II error)
1 − β = power = Pr(reject H0 | H1 is true)
Hypothesis Testing
• Goal:
to make α and β both small
• Facts:
if α ↓, then β ↑
if α ↑, then β ↓
• General Strategy:
fix α, minimize β
Testing for the Population
Mean
• When the sample is from a normal population:
H0: μ = 120 vs H1: μ < 120
• The best test is based on X̄, which is called the
test statistic. The "best test" means that the test
has the highest power among all tests with a given
type I error.
Is there any bad test? Yes.
• Rejection Region:
– range of values of test statistic for which H0 is rejected.
One-tailed test
• Our rejection region is X̄ ≤ c
• Now, α = Pr(Type I error | H0 is true)
= Pr( X̄ ≤ c | X̄ ~ N(μ0, σ²/n) )
= Φ( (c − μ0) / (σ/√n) )
i.e.
(c − μ0) / (σ/√n) = Z_α, or c = μ0 + Z_α σ/√n
Result
• To test H0: μ = μ0 vs H1: μ < μ0, based
on a sample taken from a normal
population with mean μ and variance
unknown, the test statistic is
t = (x̄ − μ0) / (s/√n).
• Assume the level of significance is α. Then,
– if t < t_{n−1, α}, we reject H0.
– if t ≥ t_{n−1, α}, we do not reject H0.
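A minimal sketch with made-up data (PROC TTEST performs this test directly; H0= sets μ0 and SIDES=L requests the lower one-sided alternative):
data bp;
  input sbp @@;               * hypothetical measurements;
  datalines;
115 112 120 118 109 116 111 119 114 117
;
proc ttest data=bp h0=120 sides=L alpha=0.05;
  var sbp;
run;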
P-value
• The minimum α-level at
which we can reject H0
based on the sample.
• The p-value can also be
thought of as the
probability of obtaining
a test statistic as
extreme as or more
extreme than the actual
test statistic obtained
from the sample, given
that the null hypothesis
is true.
figure: the p-value is the tail area beyond the observed test statistic
Remarks
• Two different approaches to determining
statistical significance:
– Critical-value method
– P-value method
One-tailed test
• Testing H0: µ = µ0 vs H1: µ > µ0,
when σ² is unknown and the population is normal:
Test statistic: t = (x̄ − µ0) / (s/√n)
Rejection region: t > t_{n−1, 1−α}
p-value = 1 − F_{t,n−1}(t), where F_{t,n−1}( ) is the c.d.f. of the t distribution
with df = n − 1.
• Note: If σ² is known, the s in the test statistic is replaced by
σ, t_{n−1, 1−α} in the rejection region is replaced by Z_{1−α}, and F_{t,n−1}(t)
is replaced by Φ(t).
Testing For Two-Sided
Alternative
• Let X1, …, Xn be a random sample from the
population N(µ, σ²), where σ² is unknown.
• H0: µ = µ0 vs H1: µ ≠ µ0
– Test statistic: t = (x̄ − µ0) / (s/√n)
– Rejection region: |t| > t_{n−1, 1−α/2}
– p-value = 2·F_{t,n−1}(t), if t ≤ 0; (see figures on next slide)
= 2·[1 − F_{t,n−1}(t)], if t > 0.
• Warning: the exact p-value requires use of a
computer.
Testing For Two-Sided
Alternative
figure: when x̄ > µ0, the two-sided p-value is the area above x̄
plus the area below 2µ0 − x̄; when x̄ ≤ µ0, it is the area below x̄
plus the area above 2µ0 − x̄
The Power of A Test
• To test H0: µ = µ0 vs H1: µ < µ0 in a normal
population with known variance σ², the power is
Φ[ Z_α + (µ0 − µ1)√n / σ ].
• Review: Power = Pr[rejecting H0 | H0 is false]
• Factors affecting the power (see the PROC POWER sketch after this list):
1. α ↑ ⇒ Z_α ↑ ⇒ power ↑
2. |µ0 − µ1| ↑ ⇒ power ↑
3. σ ↓ ⇒ power ↑
4. n ↑ ⇒ power ↑
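A minimal sketch (all numbers are made-up inputs): PROC POWER computes the power of the one-sided, one-sample t test:
proc power;
  onesamplemeans test=t
    nullmean = 120        /* mu0 */
    mean     = 115        /* assumed true mean mu1 */
    stddev   = 10         /* assumed sigma */
    ntotal   = 30
    sides    = L
    alpha    = 0.05
    power    = .;         /* solve for power */
run;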
The Power of The 1-Sample T Test
• To test H0: µ = µ0 vs H1: µ < µ0 in a normal
population with unknown variance σ², the
power, for true mean µ1 and true s.d. σ, is
F(t_{n−1, 0.05}), where F( ) is the c.d.f. of the non-central t
distribution with df = n − 1 and non-centrality
δ = (µ1 − µ0)√n / σ.
Notes:
1. t_{n−1, 0.05} = −t_{n−1, 0.95}.
2. If X and Y are independent random variables
such that Y ~ N(δ, 1) and X ~ χ² with d.f. = m,
then Y/√(X/m) is said to have a non-central
t distribution with non-centrality δ.
Power Function For Two-Sided
Alternative
• To test H0: µ = µ0 vs H1: µ ≠ µ0 in a normal
population with known variance σ², the
power is
Φ[ −Z_{1−α/2} + (µ0 − µ1)√n / σ ] + Φ[ −Z_{1−α/2} + (µ1 − µ0)√n / σ ],
where µ1 is the true alternative mean.
Case of Unknown Variance
• For the same test with an unknown-variance
population, the power is F(−t_{n−1, 1−α/2}) + 1 − F(t_{n−1, 1−α/2}),
where F( ) is the c.d.f. of the non-central t distribution
with df = n − 1 and non-centrality δ = (µ1 − µ0)√n / σ.
Sample Size Determination
For example: H0: µ = µ0 vs H1: µ < µ0
power: Φ[ Z_α + (µ0 − µ1)√n / σ ] = 1 − β,
if µ = µ1 < µ0
Hence, Z_α + (µ0 − µ1)√n / σ = Z_{1−β}
√n = ( Z_{1−α} + Z_{1−β} ) σ / (µ0 − µ1)
n = ( Z_{1−α} + Z_{1−β} )² σ² / (µ0 − µ1)²
Factors Affecting Sample Size
1. σ² ↑ ⇒ n ↑
2. α ↓ ⇒ n ↑
3. 1 − β ↑ ⇒ n ↑
4. |µ0 − µ1| ↓ ⇒ n ↑
• To test H0: µ = µ0 vs H1: µ ≠ µ0 when σ² is
known, the sample size calculation is
n = ( Z_{1−α/2} + Z_{1−β} )² σ² / (µ0 − µ1)²
Relationship between
Hypothesis Testing and
Confidence Interval
• To test H0: µ = µ0 vs H1: µ ≠ µ0: H0 is
rejected with a two-sided level-α test if and
only if the two-sided 100(1 − α)%
confidence interval for µ does not contain
µ0.
One Sample Test for the
Variance of A Normal Population
One Sample Test for A
Proportion
Exact Method
• If p̂ < p0,
the p-value
= 2 × Pr[ X ≤ # of events observed | X ~ B(n, p0) ]
= 2 × Σ_{k=0}^{# of events} C(n, k) p0^k (1 − p0)^(n−k)
• If p̂ ≥ p0,
the p-value
= 2 × Pr[ X ≥ # of events observed | X ~ B(n, p0) ]
= 2 × Σ_{k=# of events}^{n} C(n, k) p0^k (1 − p0)^(n−k)
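A minimal sketch with made-up counts (PROBBNML(p, n, m) returns the binomial c.d.f. Pr(X ≤ m)):
data _null_;
  n = 20; p0 = 0.5; x = 2;                    * hypothetical: p-hat = 0.1 < p0;
  p_value = min(2 * probbnml(p0, n, x), 1);   * 2*Pr(X <= x), capped at 1;
  put p_value=;
run;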
Power and Sample size
One-Sample Inference for the
Poisson Distribution
• X ~ Poisson with mean μ
• To test H0 : µ=µ0 vs H1 : µ≠µ0 at α level of
significance,
– Obtain a two-sided 100(1- α)% C.I. for µ,
say (C1, C2)
– If µ0 ∈ (C1, C2), we accept H0; otherwise we
reject H0.
One-Sample Inference for the
Poisson Distribution
• The p-value (for the above two-sided test):
– If the observed X < µ0, then
p = min[ 2·F(x | µ0), 1 ]
– If the observed X > µ0, then
p = min[ 2·(1 − F(x − 1 | µ0)), 1 ]
where F(x | µ0) is the Poisson c.d.f. with mean =
µ0.
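A minimal sketch with made-up numbers (POISSON(mu, x) is the Poisson c.d.f. F(x | mu) in SAS):
data _null_;
  mu0 = 10; x = 4;                         * hypothetical: observed x < mu0;
  p_value = min(2 * poisson(mu0, x), 1);   * min[ 2*F(x|mu0), 1 ];
  put p_value=;
run;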
Large-Sample Test for Poisson
(for µ0 ≥ 10)
• To test H0 : µ=µ0 vs H1 : µ≠µ0 at α level of
significance,
– Test statistic:
X² = (x − µ0)² / µ0 ~ χ²₁ under H0
– Rejection region:
X² > χ²_{1, 1−α}
– p-value:
Pr( χ²₁ ≥ X² )
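A minimal sketch with made-up numbers (PROBCHI(x, df) is the chi-square c.d.f.):
data _null_;
  mu0 = 15; x = 24;                * hypothetical values, with mu0 >= 10;
  x2 = (x - mu0)**2 / mu0;
  p_value = 1 - probchi(x2, 1);    * Pr(chi-square with 1 df >= X squared);
  put x2= p_value=;
run;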
Introduction to SAS
• To run SAS you must create a file of SAS code, which
SAS will read and use to run the program.
• You can type your SAS code into the program editor and
create a SAS program file.
• A SAS program usually consists of two components:
Data steps and Procedure steps.
• In the SAS program, any statement terminates with a
semicolon.
• A single-line comment begins with * and ends with a
semicolon. For a comment of several lines, we use /* …. */
A Simple SAS Program
title "My First SAS Program";
Data temp;
Input id security_status $ years_of_using_PC;
Datalines;
1 yes 10
2 yes 9
3 no 15
4 yes 12
5 no 7
;
Proc print;
Var id security_status years_of_using_PC;
Run;
How to run a SAS program
• To run a SAS program, you click on the “running man” icon at the
top center of the main window.
• After you run the program, the log window will become active and
provide you with information that includes any error or warning
messages.
• The outputs can be found in the “Results Viewer - SAS Output”
window.
Probability Models Related to
Computer Security
• If a computer hacker has a probability p of
succeeding on each attempt, what is the
probability that he/she will succeed within N
attempts?
• The probability that there will be no success
after N attempts is (1 − p)^N.
• The probability that he/she will succeed
within N attempts is 1 − (1 − p)^N.
Probability Models Related to
Computer Security
• How many attempts are required for this
hacker to reach a success probability of x%?
• From the last question, we want
x% = 1 − (1 − p)^N
• i.e. log(1 − x%) = N·log(1 − p)
• So, N is the smallest integer ≥
log(1 − x%) / log(1 − p)
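A minimal sketch with made-up inputs, say p = 0.001 per attempt and a target of x = 90%:
data _null_;
  p = 0.001; x = 0.90;                    * assumed values;
  N = ceil( log(1 - x) / log(1 - p) );    * smallest integer meeting the target;
  put N=;                                 * about 2302 attempts;
run;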