A frequency distribution is a kind of probability distribution. It gives the
frequency or relative frequency at which given values have been observed
among the data collected. For example, for age,
Frequency table: Var2 (Spreadsheet1)

Age interval     Count   Cumulative Count    Percent   Cumulative Percent
 0 < x <= 10        75          75          58.59375        58.5938
10 < x <= 20        26         101          20.31250        78.9063
20 < x <= 30        13         114          10.15625        89.0625
30 < x <= 40         7         121           5.46875        94.5313
40 < x <= 50         5         126           3.90625        98.4375
50 < x <= 60         2         128           1.56250       100.0000
60 < x <= 70         0         128           0.00000       100.0000
Missing              0         128           0.00000       100.0000
From the above frequency distribution we can determine that:
The probability of a patient being between 10 and 20 years of age = 0.2031.
The probability of a patient being less than or equal to 30 years = 0.8906.
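These probabilities are just relative frequencies, and can be recomputed directly from the counts; a minimal Python sketch, using the counts from the table above:

# Recompute the relative frequencies in the age table above.
counts = [75, 26, 13, 7, 5, 2, 0, 0]   # counts per 10-year age bin
n = sum(counts)                        # 128 observations in total

# P(10 < age <= 20): relative frequency of the second bin
print(counts[1] / n)                   # 0.203125

# P(age <= 30): cumulative relative frequency of the first three bins
print(sum(counts[:3]) / n)             # 0.890625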
A random variable = a quantity that can assume a number of different values
such that any particular outcome is determined by chance.
Discrete random variables take on isolated values,
e.g. gender (male/female), or the number of bacteria on a slide or in a volume
of suspension.
Continuous random variables take on any value within a specified interval,
e.g. weight or height or serum cholesterol.
A probability distribution describes the behaviour of a random variable.
For discrete variables, it specifies all possible outcomes along with the
probability that each will occur.
For continuous variables, a probability distribution specifies the probability
associated with specified ranges of values.
Three theoretical probability distributions
1. The binomial distribution
Consider a random variable that can take on one of two possible values, where
one of the values can be viewed as a success and the other as a failure,
e.g., Y=the presence/absence of disease.
Y is a Bernoulli random variable.
Let X=the number of successes out of n trials, i.e., the number of patients who
are diseased out of n patients examined, then
The probability distribution for X is the Binomial distribution from which we can
determine the following expression for the probability that X takes on a
specific value, x,
 n x
n− x
P( X = x ) =   p (1 − p)
 x
where p=probability of disease.
[Figure: binomial probability distributions, n = 10, p = 0.71]
Application of Binomial probability distribution.
(from Pagano & Gauvreau)
Suppose we are interested in investigating the probability that a patient who has been stuck with a needle
infected with hepatitis B actually develops the disease.
Let Y = the disease status
=1 if individual develops hepatitis
=0 if not.
If 30% of the patients who are exposed to hepatitis B become infected, then
P(Y=1)=0.30 and P(Y=0)=1-0.3=0.70.
Suppose we select 5 individuals from the population of patients who have been stuck with a needle
infected with hepatitis B. Then X = the number of patients who develop the disease is a binomial
random variable with n=5 and p=0.30.
So we can calculate the probability that X assumes given values as follows:
 5
p( X = 2) =  (0.30) 2 (1 − 0.30) 5− 2 = 10 × 0.30 2 0.703 = 0.309
 2
In addition, the mean number of people who develop the disease in
repeated samples of size 5 is np = 5 × 0.30 = 1.5
and the standard deviation is
\sqrt{np(1 - p)} = \sqrt{5 \times 0.3 \times 0.7} = 1.02
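These numbers are easy to verify with scipy's binomial distribution; a minimal Python sketch (scipy is assumed to be available):

from scipy.stats import binom

n, p = 5, 0.30
# P(X = 2) for X ~ Binomial(n=5, p=0.30)
print(binom.pmf(2, n, p))    # 0.3087
# Mean np and standard deviation sqrt(np(1-p))
print(binom.mean(n, p))      # 1.5
print(binom.std(n, p))       # 1.0247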
2. The Poisson distribution:
-used to model discrete events that occur infrequently in time or space.
More specifically, let
X= the number of occurrences of some event of interest over a given interval.
Let λ= the average number of occurrences of the event in the interval.
Then
P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}
The Poisson distribution is used extensively in bacteriology.
In this case we are not counting the number of events in a fixed time
interval or fixed one-dimensional length, but rather
the number of bacteria in a two-dimensional microscopic slide of given
area, or
the number of bacteria in a fluid suspension of given volume.
Poisson probability distributions
Example: (From Armitage & Berry)
Distribution of counts of root nodule bacterium in a
Petroff-Hausser counting chamber:

No. of bacteria    Number of squares
per square        Observed    Expected
     0                34         32.8
     1                68         82.1
     2               112        102.6
     3                94         85.5
     4                55         53.4
     5                21         26.7
     6                12         11.1
     7                 4          5.7
                  --------    --------
                     400        399.9

The mean number of organisms per square
= (34x0 + 68x1 + 112x2 + 94x3 + 55x4 + 21x5 + 12x6 + 4x7)/400 = 2.50.

We then use the Poisson probability function with λ = 2.5 to calculate the
probability of a given number of bacteria per square, e.g.

P(X = 2) = \frac{e^{-2.5} \times 2.5^2}{2!} = 0.2565

To get the expected frequency, multiply this probability by the total number of
squares (400): 400 × 0.2565 = 102.6.

We note that the observed and expected frequencies correspond quite well,
indicating that the bacteria do indeed follow a Poisson distribution.
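The expected column can be reproduced in a few lines; a minimal Python sketch with scipy. One detail worth noting: the final expected frequency (5.7) matches the upper-tail probability P(X >= 7), which suggests the last row of the table groups counts of 7 or more.

from scipy.stats import poisson

observed = [34, 68, 112, 94, 55, 21, 12, 4]   # squares with 0..7 bacteria
total = sum(observed)                          # 400 squares

# Mean count per square = 999/400 = 2.4975, rounded to 2.5 as in the notes
lam = 2.5

# Expected frequencies: total * P(X = k) for k = 0..6; the final category
# is treated as the upper tail P(X >= 7), which reproduces the 5.7 above
for k in range(7):
    print(k, total * poisson.pmf(k, lam))
print(7, total * poisson.sf(6, lam))           # P(X >= 7) = 1 - P(X <= 6)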
3. The normal distribution
-To model continuous variables like height, weight, serum cholesterol level,
etc.
- Based on the Central Limit Theorem, which states that any variable that can
be viewed as the sum of a large number of small random increments will be
approximately Normally distributed.
-It is bell-shaped and symmetrical
-It is characterised by its mean (µ) and standard deviation (σ)

[Figure: density histogram of Cmax_S, 0 to 200, with fitted Normal curve]
A normal distribution with mean=0 and standard deviation=1 is called
the Standard Normal distribution
[Figure: probability density function of the standard Normal distribution,
y = normal(x, 0, 1), for x from -3 to 3]
We know that 95% of the values of a variable that follows the N(0;1)
distribution lie between -1.96 and +1.96.
We will be using these theoretical probability distributions to construct
confidence intervals for population parameters.
The sample mean \bar{x} \sim N(\mu; \sigma^2/n), so that

\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0; 1)

However, most of the time we do not know σ² and estimate it using s².
In that case,

\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}

where t_{n-1} is the Student's t-distribution with n-1 degrees of freedom.
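These sampling-distribution claims are easy to check by simulation; a minimal Python sketch, assuming numpy is available:

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 50.0, 10.0, 25

# Draw 10,000 samples of size n and standardise each sample mean
means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
z = (means - mu) / (sigma / np.sqrt(n))

# If xbar ~ N(mu, sigma^2/n), about 95% of z should fall in (-1.96, 1.96)
print(np.mean((z > -1.96) & (z < 1.96)))   # ~0.95

Replacing σ by the sample standard deviation s of each sample gives slightly heavier tails, which is exactly what the t-distribution describes.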
STATISTICAL INFERENCE
In statistics we wish to make statements about the true values and
relationships for a complete population.
However, it is not practical to measure the entire population.
So we take random samples that are representative of the population and
use the information contained in these random samples to estimate the
values or relationships for the population.
If we repeat our sampling and draw another random sample from the
population, we will come up with slightly different values of the measures
and relationships that we are estimating.
So our statistics are subject to uncertainty.
We can summarize the different values that our statistics can take on and
the frequency with which they are likely to take on these values with a
probability distribution.
Some of the more common statistics have known probability distributions.
Population parameter       Sample statistic   Probability distribution
Mean, µ                    x̄                  Normal
Proportion, π              p = n/N            Binomial
Variance, σ²               s²                 Chi-square
Ratio of two variances     s1²/s2²            F-distribution
Sample statistics are point estimates of the population parameters.
Every time we draw a different random sample, we’ll get a slightly
different point estimate.
Hence we change the point estimate to an interval estimate that will give
us two cut-off points between which we are 95% certain our true
population parameter will fall.
We call these interval estimates CONFIDENCE INTERVALS.
The general form of a confidence interval is

(statistic − percentile × std.err;  statistic + percentile × std.err)

where
percentile = the cutoff value that demarcates the required percentile of the
probability distribution; e.g., for the normal distribution, we know that 95% of
the values lie between -1.96 and +1.96, so the percentile for a 95% confidence
interval is 1.96.
std.err = an estimate of the variability of the statistic, i.e., if you had to
draw many random samples and calculate the sample statistic each time, how would
these statistics vary among one another.
Note, std.err = σ/√n, i.e., the standard deviation divided by the square root of
the sample size.
NOTE: the difference between a standard deviation and a standard error:
A standard deviation tells you how variable the data values are.
A standard error measures the variability of the statistic; it is a function of
the variability of the data and the sample size.

[Figure: density histogram of Cmax_A1, 0 to 200]
. ci Cmax_A1

    Variable |     Obs        Mean    Std. Err.    [95% Conf. Interval]
-------------+----------------------------------------------------------
     Cmax_A1 |     125    84.66538    2.616363     79.48686    89.84389

ci = mean ± t(n-1) × std.err = (84.665 − 1.979 × 2.616;  84.665 + 1.979 × 2.616)
where t(n-1) refers to the Student t-distribution with n-1 degrees of freedom.
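The same interval can be reproduced from the summary statistics alone; a minimal Python sketch using scipy's t quantile (numbers taken from the output above):

from scipy.stats import t

n, mean, std_err = 125, 84.66538, 2.616363

# 97.5th percentile of t with n-1 = 124 degrees of freedom (~1.979)
t_crit = t.ppf(0.975, n - 1)

# 95% CI: mean +/- t * std.err
print(mean - t_crit * std_err, mean + t_crit * std_err)
# approx. (79.487, 89.844), matching the Stata output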
Hypothesis testing (from Fisher & Van Belle, pg 106)
In estimation, we start with a sample statistic and make a statement about the population
parameter: a confidence interval makes a probabilistic statement about straddling the population
parameter.
In hypothesis testing, we start by assuming a value for a parameter and a probability statement is
made about the value of the corresponding statistic.
The basic strategy in hypothesis testing is to measure how far an observed statistic is from a
hypothesized value of the parameter,
or
how “likely” an observed value for a statistic is, given the hypothesized value of the parameter.
p-value = the probability under the null hypothesis of observing a value as
unlikely or more unlikely than the value of the test statistic.

[Figure: Student's t density, y = student(x, 127), marking the hypothesized
value of the parameter, the observed value of the test statistic, and the
p-value as the tail area beyond the observed value]
Hypothesis testing procedure:
• H0 = a null hypothesis that specifies a real value for a parameter
(usually a statement of equivalence, and what you wish to reject)
• HA = an alternative hypothesis that specifies a range of values for the
parameter, to be considered when the null hypothesis is rejected
• A test statistic derived under the null hypothesis
• A p-value that states the probability of observing such a value for the test
statistic given that the null hypothesis is true
• A decision to reject or not to reject the null hypothesis
Decision Rules and Type I and II errors

H0: µ = 30, with test statistic t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

. ttest weight=30

One-sample t test
------------------------------------------------------------------------------
Variable |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  weight |   128    29.71172    1.706777    19.30998    26.33432    33.08912
------------------------------------------------------------------------------
Degrees of freedom: 127

Ho: mean(weight) = 30

Ha: mean < 30          Ha: mean != 30          Ha: mean > 30
  t = -0.1689            t = -0.1689             t = -0.1689
  P < t = 0.4331         P > |t| = 0.8661        P > t = 0.5669
[Figure: three Student's t density plots, y = student(x, 1), for x from -3 to 3,
one for each alternative hypothesis above]
Always use a 2-sided HA unless there are good a priori reasons to motivate a 1-sided HA.
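The one-sample test above can be reproduced from its summary statistics; a minimal Python sketch with scipy (the mean, standard deviation and n are taken from the Stata output; the raw weight data are not shown in these notes):

from scipy.stats import t

n, mean, sd = 128, 29.71172, 19.30998
mu0 = 30.0

# t = (xbar - mu0) / (s / sqrt(n))
t_stat = (mean - mu0) / (sd / n**0.5)          # -0.1689

# Two-sided p-value: P(|T| > |t|) under t with n-1 df
p_two_sided = 2 * t.sf(abs(t_stat), n - 1)     # 0.8661
print(t_stat, p_two_sided)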
Comparing two means:
If the data follow a Normal distribution, use the t-test.
The t-test has different forms depending on whether the variances in the two groups are equal or not.
To compare the variances, use the F-test:
. sdtest Cmax_A1, by(trt)

Variance ratio test
------------------------------------------------------------------------------
   Group |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    trtA |    62    87.55352    3.776138    29.73334    80.00267    95.10437
    trtB |    63    81.82308    3.618681    28.72239    74.58944    89.05672
---------+--------------------------------------------------------------------
combined |   125    84.66538    2.616363    29.25182    79.48686    89.84389
------------------------------------------------------------------------------
Ho: sd(trtA) = sd(trtB)

F(61,62) observed   = F_obs           = 1.072
F(61,62) lower tail = F_L   = 1/F_obs = 0.933
F(61,62) upper tail = F_U   = F_obs   = 1.072

Ha: sd(trtA) < sd(trtB)    Ha: sd(trtA) != sd(trtB)     Ha: sd(trtA) > sd(trtB)
  P < F_obs = 0.6067         P < F_L + P > F_U = 0.7871   P > F_obs = 0.3933

The test statistic is

F_{n_1-1;\,n_2-1} = \frac{s_1^2}{s_2^2} = \frac{29.73334^2}{28.72239^2} = 1.072 < F_{61;62}^{0.05} = 1.525531

Since the observed ratio does not exceed the critical value (two-sided
p = 0.7871), we do not reject Ho of equal variances, and so we use the
equal-variance t-test:
. ttest Cmax_A1, by(trt)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    trtA |    62    87.55352    3.776138    29.73334    80.00267    95.10437
    trtB |    63    81.82308    3.618681    28.72239    74.58944    89.05672
---------+--------------------------------------------------------------------
combined |   125    84.66538    2.616363    29.25182    79.48686    89.84389
---------+--------------------------------------------------------------------
    diff |          5.730441    5.228654                -4.619358   16.08024
------------------------------------------------------------------------------
Degrees of freedom: 123

Ho: mean(trtA) - mean(trtB) = diff = 0

Ha: diff < 0           Ha: diff != 0           Ha: diff > 0
  t = 1.0960             t = 1.0960              t = 1.0960
  P < t = 0.8624         P > |t| = 0.2752        P > t = 0.1376

The test statistic is

t_{n_1+n_2-2} = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

where s is the pooled standard deviation.
If the variances were not equal, we would use the t-test for unequal variances:

. ttest Cmax_A1, by(trt) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    trtA |    62    87.55352    3.776138    29.73334    80.00267    95.10437
    trtB |    63    81.82308    3.618681    28.72239    74.58944    89.05672
---------+--------------------------------------------------------------------
combined |   125    84.66538    2.616363    29.25182    79.48686    89.84389
---------+--------------------------------------------------------------------
    diff |          5.730441    5.230112                -4.622509   16.08339
------------------------------------------------------------------------------
Satterthwaite's degrees of freedom: 122.685

Ho: mean(trtA) - mean(trtB) = diff = 0

Ha: diff < 0           Ha: diff != 0           Ha: diff > 0
  t = 1.0957             t = 1.0957              t = 1.0957
  P < t = 0.8623         P > |t| = 0.2754        P > t = 0.1377

The test statistic is

t = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}
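Both tests can be reproduced from the summary statistics alone; a minimal Python sketch with scipy (all numbers taken from the output above):

import numpy as np
from scipy.stats import t

# Summary statistics from the Stata output above
n1, m1, s1 = 62, 87.55352, 29.73334   # trtA
n2, m2, s2 = 63, 81.82308, 28.72239   # trtB

# Equal-variance test: pooled variance and standard error of the difference
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_eq = (m1 - m2) / se
print(t_eq, 2 * t.sf(abs(t_eq), n1 + n2 - 2))      # 1.0960, p = 0.2752

# Welch test: unpooled standard error and Satterthwaite degrees of freedom
se_w = np.sqrt(s1**2 / n1 + s2**2 / n2)
t_w = (m1 - m2) / se_w
df_w = se_w**4 / ((s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))
print(t_w, 2 * t.sf(abs(t_w), df_w))               # 1.0957, df ~ 122.7, p = 0.2754

With the raw data, scipy.stats.ttest_ind(x, y) and ttest_ind(x, y, equal_var=False) give the same two results directly.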
If the data do not follow a Normal distribution, use the non-parametric Mann-Whitney U-test:
. ranksum Cmax_A1, by(trt)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

         trt |   obs    rank sum    expected
-------------+-------------------------------
        trtA |    62        4115        3906
        trtB |    63        3760        3969
-------------+-------------------------------
    combined |   125        7875        7875

unadjusted variance     41013.00
adjustment for ties         0.00
                        ---------
adjusted variance       41013.00

Ho: Cmax_A1(trt==trtA) = Cmax_A1(trt==trtB)
           z =  1.032
  Prob > |z| = 0.3021
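In Python this is a single scipy call; a minimal sketch on simulated stand-in data, since the raw Cmax_A1 values are not reproduced in these notes:

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical stand-ins for Cmax_A1 in the two treatment groups
trtA = rng.lognormal(mean=4.45, sigma=0.35, size=62)
trtB = rng.lognormal(mean=4.38, sigma=0.35, size=63)

# Two-sided Mann-Whitney U / Wilcoxon rank-sum test
u_stat, p_value = mannwhitneyu(trtA, trtB, alternative="two-sided")
print(u_stat, p_value)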
Confidence intervals and Hypothesis testing:
Recall that to construct a confidence interval for the mean, we use

(\bar{x} - t_{n-1}^{\alpha/2} \times std.err;\ \bar{x} + t_{n-1}^{\alpha/2} \times std.err)

And to test whether the mean differs from zero, we use the test statistic

t = \frac{\bar{x} - 0}{s/\sqrt{n}} = \frac{\bar{x}}{std.err}

The two procedures are thus related, and in fact it can be shown that if the
(1 − α)100% confidence interval includes zero, then the hypothesis test will
not be significant at the α level of significance. On the other hand, if the
confidence interval lies entirely to the right or left of zero, the test does
give a significant result.
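A small numerical check of this duality, using the one-sample weight example above (with the hypothesized value 30 playing the role of zero; scipy assumed):

from scipy.stats import t

n, mean, se = 128, 29.71172, 1.706777   # from the weight output above
mu0 = 30.0

# 95% confidence interval
t_crit = t.ppf(0.975, n - 1)
lo, hi = mean - t_crit * se, mean + t_crit * se   # (26.33, 33.09): contains 30

# Two-sided test of H0: mu = 30
p = 2 * t.sf(abs((mean - mu0) / se), n - 1)       # 0.8661 > 0.05

# CI contains mu0  <=>  test not significant at the 5% level
print((lo < mu0 < hi) == (p > 0.05))              # True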