1. Basics of probabilities - Weizmann Institute of Science

1 Recommended references
Two simple but complete books:
• Statistics: A guide to the use of Statistical Methods in the Physical Sciences, by R. Barlow
Manchester Physics Series
ISBN 0471922951 (paperbound reprint, 1993, Wiley) or original ISBN 0471922943 (hardcover, 1989)
• Statistical Data Analysis, by G. Cowan
Oxford University Press
ISBN 0-19-850156-0 and 0-19-850-115-2
An excellent online handbook
can be found at:
http://www-library.desy.de/preparch/books/vstatmp_engl.pdf
An online version of Numerical Recipes
can be found at:
http://www.nrbook.com/a/bookcpdf.php
2 Administrativia
My email is: [email protected]
Lecture notes, homework, ... can be found on the course home page at:
http://webhome.weizmann.ac.il/home/fhlellou/course/data anal/home.html
3 Probabilities

3.1 Some definitions
• An experiment is any process that generates raw data.
• A set whose elements represent all possible outcomes of an experiment is called the sample
space.
• An element of the sample space is called a sample point.
• An event is a subset of a sample space.
In probability theory a weight is assigned to each sample point, normalized so that the total weight of the sample space is 1. These weights are the "intuitive" likelihoods of the occurrences of the sample points.
• The probability of an event is the sum of the weights assigned to its sample points.
• Two events are said to be mutually exclusive if their intersection is empty.
• Two events are said to be complementary if their union is the sample space and their intersection is empty.
Example: the experiment consists of tossing a coin three times. The sample space is

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

An event could be, for example, "2 times out of 3, a head was tossed"; it corresponds to 3 sample points, and the probability of such an event is therefore 3/8.
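This bookkeeping is easy to check by brute-force enumeration; a minimal Python sketch (the course's illustrations are in MATLAB, but the idea is the same):

```python
from itertools import product

# Sample space of three coin tosses: 8 equally likely sample points.
sample_space = list(product("HT", repeat=3))

# Event "2 times out of 3, a head was tossed": collect its sample points
# and sum their (equal) weights.
event = [s for s in sample_space if s.count("H") == 2]
p_two_heads = len(event) / len(sample_space)
print(p_two_heads)  # 3/8 = 0.375
```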
Exercise: Panem et Circenses! The 'Loto' game consists of drawing at random 5 balls out of 50, numbered from 1 to 50. Which event is more probable?
• The 5 balls are distributed one in each of the five decades (1 to 10, 11 to 20, ..., 41 to 50).
• All 5 balls belong to the same decade.
Should we conclude from this that it 'pays off' to gamble on the former? Is there nevertheless a strategy that improves your hope of winning?
3.2 Basic theorems
Notation: Ω is the sample space, A and B are events.
P (A or B) = P (A) + P (B) − P (A and B)
Denoting by P(A|B) the conditional probability of A given B,
P (A and B) = P (A|B) · P (B) = P (B|A) · P (A)
From the above we can deduce Bayes' theorem:
P (A|B) = P (B|A) · P (A)/P (B)
If A1, A2, · · ·, An are exhaustive and exclusive sets (which means that any sample point belongs to one and only one Ai), then the marginal probability of B can be written

P(B) = ∑_j P(Aj and B) = ∑_j P(B|Aj) · P(Aj)

and Bayes' theorem can be rewritten as

P(Ai|B) = P(B|Ai) · P(Ai) / ∑_j P(B|Aj) · P(Aj)
Independent events: A and B are independent if P(A|B) = P(A), i.e. P(A) does not depend on whether B is fulfilled or not. If so, P(A and B) = P(A) · P(B) and P(B|A) = P(B).
Warning: dependence between two events does not imply causality!
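As a numeric sanity check of these theorems, here is a small Python sketch on a hypothetical sample space (two fair dice, with A = "first die shows 6" and B = "the sum is 10"):

```python
from itertools import product
from fractions import Fraction

# Hypothetical illustration: two fair dice, equal weights 1/36 per sample point.
omega = list(product(range(1, 7), repeat=2))
w = Fraction(1, len(omega))

def prob(event):
    return sum((w for s in omega if event(s)), Fraction(0))

A = lambda s: s[0] == 6          # first die shows 6
B = lambda s: sum(s) == 10       # the sum is 10

p_a, p_b = prob(A), prob(B)
p_ab = prob(lambda s: A(s) and B(s))

p_a_given_b = p_ab / p_b         # P(A|B) = P(A and B) / P(B)
p_b_given_a = p_ab / p_a         # P(B|A) = P(A and B) / P(A)

# Product rule and Bayes' theorem, checked exactly with fractions:
assert p_a_given_b * p_b == p_b_given_a * p_a == p_ab
assert p_a_given_b == p_b_given_a * p_a / p_b
print(p_a_given_b)  # 1/3
```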
3.3 Digression: Bayesian and Frequentist interpretations of probabilities
Traditionally, two interpretations of probabilities are given:
• The "frequentist" approach, which views a probability as the limit of the ratio of successes when the test/measurement/experiment is carried out an infinite number of times.
• The subjective or "Bayesian" approach, which views the elements of the sample space as hypotheses or propositions, i.e. statements which are either true or false.
Let's call θi an hypothesis (for example "the value of this parameter is ...") and let's assume we have an exhaustive set of hypotheses (∑_i P(θi) = 1). P(θi) is called the prior probability of hypothesis θi and P(θi|X) is the posterior probability of θi (after the experiment yielded observation X). Bayes' theorem states that

P(θi|X) = P(X|θi) · P(θi) / ∑_j P(X|θj) · P(θj)

where P(X|θi) is the likelihood of observation X under the hypothesis θi.
If you don't know all the prior probabilities, or claim, as frequentists do, that they have no meaning, then the posterior probabilities cannot be evaluated. If you know some of the prior probabilities, you can at least evaluate the betting odds of θi against θj:

P(θi|X) / P(θj|X) = [P(X|θi) · P(θi)] / [P(X|θj) · P(θj)]
Bayes' POSTULATE, adopted by Bayesian statisticians, states that if the distribution of prior probabilities is completely unknown, one can take them all equal: P(θ1) = ... = P(θn) = P(θ), in which case the common factor cancels and

P(θi|X) = P(X|θi) / ∑_j P(X|θj)
In this sense, when quoting that a given theory is excluded by experimental data at a given 'confidence level' (e.g. 95%), frequentist statisticians will read it as '1 minus the probability that the data would look like this, were the theory correct'; Bayesian statisticians will view it as 'the probability that the theory is indeed correct, given the data', and will read Bayes' theorem as a way to update the probability that the theory is correct:

P(theory|data) = P(data|theory) · P(theory) / P(data)

P(theory) is called the prior probability, P(data|theory) the likelihood, and P(theory|data) the posterior probability.
3.4 Exercises
1. You are participating in 'Let's make a deal', the great TV show on Arutz Hatimtum! The rules of the game are the following: the entertainer shows 3 envelopes; two of them are empty, and one contains a $10,000,000 check. You choose one envelope and, to tempt you, the entertainer will systematically open one of the remaining two envelopes, show that it is empty, and ask you whether you want to exchange yours for the remaining third one. Is it worthwhile to change?
2. Three drawers contain 2 coins each. In the first one, there are 2 gold coins, in the second one,
2 silver coins, and in the last one, 1 gold + 1 silver coin. You open one drawer, pick one coin
and realize it is gold. What is the probability that the second coin in the drawer is gold as
well?
3. You are running a High Energy experiment, directing a beam of incoming particles onto a target. In order to identify the incoming particles, you use a 'Cerenkov counter' which can discriminate between π's and K's. The response probability of the Cerenkov to π's is ϵ = 95% and to K's a = 6%.
• Assuming that the beam composition is 90% π, 10% K, what are the probabilities that the incoming particle is a π if the Cerenkov did fire, and a K if it did not?
• Same question assuming the beam composition is 10% π, 90% K.
• Let's assume that there are many other particles in the incoming beam; the only thing we know is that there are twice as many K's as π's. What are the betting odds of π against K if the device did fire?
3.5 Simpson's paradox
Simpson’s paradox (or the Yule-Simpson effect) is a paradox often encountered in social-science and
medical-science statistics, and it occurs when frequency data are hastily given causal interpretations.
A historic example. One of the best known real-life examples of Simpson's paradox occurred when the University of California, Berkeley was sued for bias against women who had applied for admission to graduate schools there. Let's simplify the problem as follows: imagine a university with two faculties, Natural Sciences and Humanities; in both faculties, men and women have the same probability of passing the exam, while the combined results show that women fail much more often than men. How can this be? Let's assume that the probability of passing the exam in natural sciences (resp. humanities) is 90% (resp. 10%), while 30% of women and 90% of men go for science. The probability for a woman to succeed is 0.3 · 0.9 + 0.7 · 0.1 = 0.34; for a man it is 0.9 · 0.9 + 0.1 · 0.1 = 0.82.
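The arithmetic of this toy example is checked by a few lines of Python (the 90%/10% pass rates and the 30%/90% science fractions are the assumed numbers from the text):

```python
# Pass rates of the two (hypothetical) faculties, and the probability of
# success as a function of the fraction of candidates applying to science.
p_pass_sci, p_pass_hum = 0.90, 0.10

def p_success(frac_science):
    return frac_science * p_pass_sci + (1 - frac_science) * p_pass_hum

p_woman = p_success(0.30)   # 30% of women go for science
p_man = p_success(0.90)     # 90% of men go for science
print(p_woman, p_man)       # 0.34 and 0.82
```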
Exercise: if you were the judge, would you declare the University guilty of discrimination?
Simpson's paradox and decision making: here is another real-life example, from a medical study comparing the success rates of two treatments for kidney stones. The table shows the success rates and the numbers of treatments, separately for small and for large kidney stones:
               Treatment A      Treatment B
Small stones   93% (81/87)      87% (234/270)
Large stones   73% (192/263)    69% (55/80)
Both           78% (273/350)    83% (289/350)
If two doctors, one treating small stones and the other treating large stones, were to test the treatments, both would conclude that treatment A is superior to treatment B. However, combining the results shows that treatment B is superior! You are a physician treating a patient suffering from kidney stones. Based on the above, which treatment are you going to prescribe?
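A short Python sketch that recomputes the rates from the raw counts of the table and exhibits the reversal:

```python
# (successes, treatments) from the kidney-stone table.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(s, n):
    return s / n

# Per-stone-size comparison: A wins in both categories...
a_small, a_large = (rate(*data["A"][k]) for k in ("small", "large"))
b_small, b_large = (rate(*data["B"][k]) for k in ("small", "large"))

# ...but the aggregated rates reverse the ordering (Simpson's paradox).
a_both = rate(81 + 192, 87 + 263)
b_both = rate(234 + 55, 270 + 80)
print(a_both, b_both)  # 0.78 versus ~0.826
```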
4 Probability distributions - One continuous variable

4.1 Probability density function; moments
Notation: f(x), with the normalization

∫_{−∞}^{∞} f(x) dx = 1

Expectation of g(x), any function of x:

E[g(x)] = <g(x)> = ∫ f(x) g(x) dx

In particular, the mean is <x> = µ = ∫ f(x) x dx and the variance is v = σ^2 = <(x − µ)^2> = <x^2 + µ^2 − 2µx> = <x^2> − <x>^2.

Variance scaling: v(ax + b) = a^2 v(x)
Bienaymé-Tchebychev inequality: for any positive function g(x) > 0 and any constant c,

P(g(x) > c) ≤ E[g(x)] / c

The proof is simple: in the region Ω where g(x) > c,

∫_Ω g(x) f(x) dx ≥ c ∫_Ω f(x) dx

Thus

E[g(x)] ≥ ∫_Ω g(x) f(x) dx ≥ c · P(g(x) > c)
Tchebychev theorem: how many points are more than n standard deviations away? Just use the Bienaymé inequality with g(x) = (x − µ)^2:

P((x − µ)^2 > n^2 σ^2) ≤ E[(x − µ)^2] / (n^2 σ^2)

P(|x − µ| > nσ) ≤ 1/n^2

The cumulative probability density function

F(a) = ∫_{−∞}^{a} f(x) dx

expresses the probability that x ≤ a. It is a monotonically increasing function from 0 to 1.
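The Tchebychev bound is crude but universal; a quick Monte-Carlo check in Python on an (arbitrarily chosen) exponential distribution, for which µ = σ = 1:

```python
import random

# Empirical check of P(|x - mu| > n*sigma) <= 1/n**2 on a hypothetical
# sample: exponential distribution with mean = standard deviation = 1.
random.seed(1)
sample = [random.expovariate(1.0) for _ in range(100_000)]
mu, sigma = 1.0, 1.0

def tail_fraction(n):
    return sum(abs(x - mu) > n * sigma for x in sample) / len(sample)

for n in (2, 3, 4):
    print(n, tail_fraction(n), 1 / n**2)   # observed tail vs. Tchebychev bound
```

For this distribution the observed tail fractions are far below the 1/n^2 bound, which is the typical situation: the bound holds for any distribution, so it cannot be tight for most of them.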
Moments: the nth algebraic moment is defined as µ'_n = <x^n> and the nth central moment is defined as µ_n = <(x − µ)^n>. Other useful quantities are the skewness s = µ_3 / µ_2^{3/2}, which measures the asymmetry of the distribution, and the kurtosis c = µ_4 / µ_2^2 − 3 (sometimes called the excess), which compares the 'height of the peak' to that of a normal distribution. The −3 ensures that the kurtosis of the normal (Gaussian) distribution is 0. A positive kurtosis corresponds to a higher peak and wider wings than the Gaussian distribution.
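These definitions can be checked numerically; a Python sketch computing the central moments of the (hypothetical) density f(x) = e^{−x}, x > 0, with a plain trapezoidal rule — its skewness should come out near 2 and its kurtosis (excess) near 6:

```python
import math

# <g(x)> = integral of g(x) f(x) dx for f(x) = exp(-x), x > 0,
# by a composite trapezoidal rule (the tail beyond b = 40 is negligible).
def moment(g, a=0.0, b=40.0, steps=100_000):
    h = (b - a) / steps
    ys = [g(a + i * h) * math.exp(-(a + i * h)) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

mu = moment(lambda x: x)
mu2 = moment(lambda x: (x - mu) ** 2)
mu3 = moment(lambda x: (x - mu) ** 3)
mu4 = moment(lambda x: (x - mu) ** 4)

skewness = mu3 / mu2 ** 1.5
kurtosis = mu4 / mu2 ** 2 - 3
print(skewness, kurtosis)  # close to 2 and 6
```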
4.2 Exercise: another famous paradox
1. You are (again!) participating in a TV show, where the rule of the game is the following: you are presented with two envelopes, about which you know a priori that one contains a prize twice as large as the other. You choose one envelope and open it. It contains a check of, say, n NIS. You are asked whether you want to exchange it for the second one, which contains, with equal probabilities, either 2n or n/2 NIS. The expectation of the gain in changing is (1/2)(n − n/2) = n/4, which seems to mean that no matter what you chose, you'd better choose the other envelope. How do you solve this paradox?
2. Let's modify the rules as follows: the "high" envelope is limited to a maximum sum, i.e. its content is uniformly distributed between 0 and 1M NIS. What is now your gain expectation if you always swap the envelope? What is the best strategy (i.e. how to decide when to swap and when not)?
Sample MATLAB programs to check your answers
--> TwoEnvelopes.m
--> TwoEnvelopes2.m
4.3 Characteristic function of a distribution
Φ(t) = <e^{itx}> = ∫_{−∞}^{∞} e^{itx} f(x) dx

and the inverse transform is

f(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} Φ(t) dt
Expanding the characteristic function around t = 0 gives

Φ(t) = <e^{itx}> = <1 + itx + (it)^2 x^2 / 2! + . . .> = <∑_{k=0}^{∞} (1/k!) (it)^k x^k> = ∑_{k=0}^{∞} (1/k!) (it)^k <x^k>

Thus,

µ'_k = ∂^k Φ(t) / ∂(it)^k |_{t=0}
For the central moments, consider

Φ_µ(t) = <e^{it(x−µ)}> = ∑_{k=0}^{∞} (1/k!) (it)^k <(x − µ)^k> = ∑_{k=0}^{∞} (1/k!) (it)^k µ_k

thus,

µ_k = ∂^k Φ_µ(t) / ∂(it)^k |_{t=0}
The characteristic function is in fact the Fourier transform of the distribution, whose basic property
is that of the convolution:
Φ_{p1⊗p2}(t) = Φ_{p1}(t) · Φ_{p2}(t)
Proof:

(p1 ⊗ p2)(x) = ∫_{−∞}^{∞} p1(y) p2(x − y) dy

Thus,

Φ_{p1⊗p2}(t) = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} e^{itx} p1(y) p2(x − y) dx dy
             = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} e^{ity} p1(y) e^{it(x−y)} p2(x − y) dx dy
             = ∫_{y=−∞}^{∞} e^{ity} p1(y) [∫_{x=−∞}^{∞} e^{it(x−y)} p2(x − y) dx] dy
             = ∫_{y=−∞}^{∞} e^{ity} p1(y) [∫_{u=−∞}^{∞} e^{itu} p2(u) du] dy
             = ∫_{y=−∞}^{∞} e^{ity} p1(y) Φ_{p2}(t) dy

Φ_{p1⊗p2}(t) = Φ_{p1}(t) · Φ_{p2}(t)
To compute the distribution of the sum of two independent variables (which is NOT the sum
of the two distributions), it is enough to multiply the two characteristic functions in the ‘frequency
space’ and transform back the result to the ‘variable space’. This result can be used to show that
the variance of the sum of two independent distributions p1 and p2 is the sum of the variances v1
and v2 ; for simplicity, let’s assume that both distributions have a null average.
v(p1 ⊗ p2 ) =
=
=
=
=
since
∂ 2 Φp1 ⊗p2
|t=0
∂(it)2
∂ 2 Φp 1 · Φp 2
|t=0
∂(it)2
∂ 2 Φp1
∂ 2 Φp2
∂Φp1 ∂Φp2
Φp2
|
+
Φ
|
+
2
|t=0
t=0
p
t=0
1
∂(it)2
∂(it) ∂(it)
∂(it)2
∂ 2 Φp2
∂ 2 Φp 1
|
+
|t=0
t=0
∂(it)2
∂(it)2
v(p1 ) + v(p2 )
∂Φp1
∂Φp2
|t=0 =
|t=0 = µ1 = µ2 = 0
∂(it)
∂(it)
and
Φp1 (0) = Φp2 (0) = 1
4.4 Change of variable
x has the probability density function f(x). Let's perform a change of variable x → y = h(x). The p.d.f. of y is

g(y) = f(x) |dx/dy| = f(x) / |h'(x)|

In case several xi are solutions of the equation h(x) = y, then

g(y) = ∑_i f(xi) / |h'(xi)|
Application 1: how to generate a variable of a given p.d.f. Consider the change of variable x → y = h(x) = ∫_{−∞}^{x} f(t) dt, i.e. the variable is transformed into its cumulative distribution. Then the probability density function of y is:

g(y) = f(x) / |(∫_{−∞}^{x} f(t) dt)'| = f(x) / |f(x)| = 1
The distribution of the cumulative probability is uniform between 0 and 1. Therefore the algorithm
to generate x is:
1. generate a pseudo-random number r uniformly distributed between 0 and 1
2. Solve the equation in x: ∫_{−∞}^{x} f(t) dt = r
Exercise: how to generate a variable x distributed according to f(x) = e^{−x}?
Digression: what to do when the cumulative function can't be inverted analytically? Von Neumann first proposed to run an acceptance-rejection method, as follows: let's assume that f(x) is completely enclosed in the box xmin < x < xmax, 0 < f(x) < ymax. Then:
1. Generate a pseudo-random number r uniformly distributed between 0 and 1
2. Transform it to x = xmin + r(xmax − xmin )
3. Generate y, uniformly distributed between 0 and ymax
4. If y < f (x), accept. If not, reject
This is illustrated by the MATLAB program
--> expdis.m
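The four steps can be sketched in a few lines of Python (not a transcription of expdis.m; the target f(x) = e^{−x} on 0 < x < 10 and ymax = 1 are assumed for illustration):

```python
import math, random

# Von Neumann acceptance-rejection for a hypothetical target f(x) = exp(-x),
# enclosed in the box 0 < x < 10, 0 < y < ymax = 1.
random.seed(2)
xmin, xmax, ymax = 0.0, 10.0, 1.0

def sample():
    while True:
        x = xmin + random.random() * (xmax - xmin)  # steps 1-2: uniform x in the box
        y = random.random() * ymax                  # step 3: uniform y
        if y < math.exp(-x):                        # step 4: accept or reject
            return x

draws = [sample() for _ in range(50_000)]
mean = sum(draws) / len(draws)
print(mean)  # close to 1, the mean of the exponential law
```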
Such an algorithm can be very inefficient if the p.d.f. exhibits a high peak and needs to be bounded by a 'high' rectangle. It can be improved by enclosing f in any other distribution h which can be generated analytically, such that f(x) < A · h(x). Then:
1. Generate x according to the analytic method for the distribution h
2. Generate y, uniformly distributed between 0 and A · h(x)
3. Accept if y < f(x). If not, reject.
In the case where the normalization of the p.d.f. can't be found analytically, the famous Metropolis-Hastings algorithm can be used. It generates a time-independent Markov chain, which is a 'time' sequence of states x in which each xt+1 depends only on the previous xt. The algorithm uses a proposal density Q(x'; xt) to generate a new proposed sample x'. This proposal is accepted as the next value if a random variable r, uniformly distributed between 0 and 1, satisfies

r < min{ P(x') Q(xt; x') / [P(xt) Q(x'; xt)] , 1 }

If the proposal is accepted, then xt+1 = x'; if not, the current value of x is retained: xt+1 = xt. See the example MATLAB program:
--> Metropolis.m
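A minimal Python sketch of the same idea (not a transcription of Metropolis.m): a symmetric Gaussian proposal, so that the Q ratio cancels from the acceptance test, sampling the unnormalized density P(x) ∝ e^{−x^2/2}:

```python
import math, random

# Metropolis-Hastings with a symmetric proposal; the target is a hypothetical
# unnormalized density proportional to the standard normal.
random.seed(3)

def p_unnorm(x):
    return math.exp(-x * x / 2.0)        # normalization constant not needed

x = 0.0
chain = []
for _ in range(200_000):
    x_prop = x + random.gauss(0.0, 1.0)  # proposal Q(x'; xt), symmetric in x, x'
    if random.random() < min(p_unnorm(x_prop) / p_unnorm(x), 1.0):
        x = x_prop                       # accept: xt+1 = x'
    chain.append(x)                      # reject keeps xt+1 = xt

mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
print(mean, var)  # close to 0 and 1 for the standard normal
```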
Application 2: a 'visual check'. Does an observed set of variables actually come from an assumed parent distribution f(x)? A very simple way consists of filling the histogram of the new variable ∫_{−∞}^{x} f(t) dt and 'seeing' whether it is flat or not.
4.5 Exercise
r1 and r2 are two independent variables uniformly distributed between 0 and 1. What is the distribution of r1 · r2 ? Write a small MATLAB program to display the histogram of 100000 entries of
r1 · r2 and visually check the analytic result.
5 Probability distributions - One discrete variable

5.1 Modification of formulas
A discrete variable can take only non-negative integer values k = 0, 1, · · ·. The elementary probability is noted p_k and the normalization is

∑_{k=0}^{∞} p_k = 1

The mean is

µ = E(k) = ∑_{k=0}^{∞} k p_k

and the variance is

V(k) = ∑_{k=0}^{∞} [k − E(k)]^2 p_k = E(k^2) − [E(k)]^2
5.2 Probability generating function
The equivalent of the characteristic function in the case of the probability distribution of a discrete variable is called the probability generating function:

G(z) = ∑_{k=0}^{∞} z^k p_k

Note that G(0) = p_0 and G(1) = 1. It is often used to compute the mean and the variance of a discrete variable distribution, using the following trick:

G'(z) = ∑_{k=1}^{∞} k z^{k−1} p_k

G''(z) = ∑_{k=2}^{∞} k(k − 1) z^{k−2} p_k

Evaluated at z = 1, these derivatives yield:

G'(1) = ∑_{k=1}^{∞} k p_k = µ

G''(1) = ∑_{k=2}^{∞} (k^2 − k) p_k = E(k^2) − E(k)
9
Thus
V(k) = G''(1) + G'(1) − [G'(1)]^2
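The trick in action, in Python, for the (hypothetical) distribution of one fair die, p_k = 1/6 for k = 1 . . . 6:

```python
from fractions import Fraction

# Fair die: p_k = 1/6 for k = 1..6.  Exact arithmetic with fractions.
p = {k: Fraction(1, 6) for k in range(1, 7)}

g1 = sum(k * pk for k, pk in p.items())            # G'(1)  = sum k p_k
g2 = sum(k * (k - 1) * pk for k, pk in p.items())  # G''(1) = sum k(k-1) p_k

mean = g1
variance = g2 + g1 - g1 ** 2                       # V(k) = G''(1) + G'(1) - [G'(1)]^2
print(mean, variance)  # 7/2 and 35/12
```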
Finally, probability generating functions are multiplicative under convolution: let's consider two independent distributions p and p'. The distribution of the sum of p and p' is

(p ⊕ p')_k = ∑_{i=0}^{k} p_i p'_{k−i}

and its probability generating function factorizes:

G_{p⊕p'}(z) = G_p(z) G_{p'}(z) = (∑_{k=0}^{∞} z^k p_k) (∑_{k=0}^{∞} z^k p'_k)

Proof: the term in z^k in G_p(z) · G_{p'}(z) is

∑_{i=0}^{k} p_i p'_{k−i}

while the probability generating function of p ⊕ p' is

G_{p⊕p'}(z) = ∑_{k=0}^{∞} z^k ∑_{i=0}^{k} p_i p'_{k−i}

so the term in z^k in G_{p⊕p'}(z) is the same as the one in G_p(z) · G_{p'}(z).
One can use this result to show that the variance of the distribution of the sum of two independent variables is equal to the sum of the variances. The proof is very similar to that of the case of a continuous variable:

V(k1 + k2) = G''_{1+2}(1) + G'_{1+2}(1) − [G'_{1+2}(1)]^2

G'_{1+2}(1) = G_1(1) G'_2(1) + G'_1(1) G_2(1) = G'_1(1) + G'_2(1)

G''_{1+2}(1) = G_1(1) G''_2(1) + G''_1(1) G_2(1) + 2 G'_1(1) G'_2(1) = G''_2(1) + G''_1(1) + 2 G'_1(1) G'_2(1)

V(k1 + k2) = G''_2(1) + G''_1(1) + 2 G'_1(1) G'_2(1) − [G'_1(1)]^2 − [G'_2(1)]^2 − 2 G'_1(1) G'_2(1) + G'_1(1) + G'_2(1)

V(k1 + k2) = V(k1) + V(k2)

See an illustration in the MATLAB program
--> dices.m
6 Some important distributions

6.1 Binomial distribution
The binomial distribution applies to a discrete variable and describes the distribution of r, the number of successes out of a given number of trials n, given an elementary probability of success p:

P(r; n, p) = C(n, r) p^r (1 − p)^{n−r}

where

C(n, r) = n! / [r! (n − r)!]

is the number of combinations of r objects among n. Traditionally, 1 − p is denoted q. The probability generating function is

G(z) = ∑_{r=0}^{n} C(n, r) z^r p^r (1 − p)^{n−r} = (zp + q)^n
It follows that:
• Mean: G'(z)|_{z=1} = np (zp + q)^{n−1} |_{z=1} = np, as intuitively expected.
• Variance: σ^2 = G''(1) − [G'(1)]^2 + G'(1) = n(n−1) p^2 (zp + q)^{n−2} |_{z=1} − n^2 p^2 + np = np − np^2 = np(1 − p) = npq
• The binomial distribution is inductive, i.e. P(r; n, p) ⊕ P(r; n', p) = P(r; n + n', p), as intuitively expected: if one runs two consecutive experiments, the first one consisting of n trials and the second one of n' trials, the total number of successes is clearly the same as that of a single experiment consisting of n + n' trials.
• One could have used the above property to show directly that σ^2 = npq: consider the simple case of n = 1; in such a situation, µ = 0 · (1−p) + 1 · p = p and σ^2 = 0^2 · (1−p) + 1^2 · p − µ^2 = p(1 − p), and induction from 1 to n gives the result.
See the example MATLAB programs:
--> BINOM.M, BMOVIE.M, BMOVIE2.M, BMOVIE3.M, POLL.M, SPOLL.M
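A direct Python check of the mean and variance formulas by summation over the pmf, with assumed values n = 20, p = 0.3:

```python
from math import comb

# Binomial pmf for hypothetical n = 20, p = 0.3; mean np = 6, variance npq = 4.2.
n, p = 20, 0.3
q = 1 - p
pmf = [comb(n, r) * p**r * q**(n - r) for r in range(n + 1)]

mean = sum(r * w for r, w in enumerate(pmf))
variance = sum((r - mean) ** 2 * w for r, w in enumerate(pmf))
print(mean, variance)  # np = 6.0 and npq = 4.2
```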
Exercise: Most political polls published in Israeli newspapers are based on a population of ca. 500
persons. Assuming that x percent of the people answer ‘yes’ to the question ‘are you happy with
the government?’, what is the standard deviation of the measured rate of satisfaction? Plot it as a
function of x. Write a ‘Monte-Carlo’ program to check this with n = 500 and x = 85%. What is
the trick to reduce the computing time?
6.2 Poisson distribution
The Poisson distribution of a discrete variable deals with the probability that a rare event occurs r times in a given time interval ∆t, with an average rate of µ events per ∆t. Note that r is an integer number while µ need not be. Example: on the average, a fisherman catches µ = 2.5 fish per hour. What is the probability that he will catch r = 0, 1, · · · fish in the upcoming hour?
The Poisson distribution can be seen as the limit of the binomial distribution when the number of time slices (trials) goes to infinity. Let's split the time interval ∆t into n time slices; the elementary probability of success is p = µ/n, and the probability of having two events in the same time slice goes to zero as n goes to infinity. Thus
P(r; µ) = lim_{n→∞} C(n, r) p^r (1 − p)^{n−r} = lim_{n→∞} (n^r / r!) (µ^r / n^r) (1 − µ/n)^n = (µ^r / r!) e^{−µ}
The probability generating function is

G(z) = ∑_{r=0}^{∞} z^r (µ^r / r!) e^{−µ} = e^{zµ} e^{−µ} = e^{µ(z−1)}
It follows that:
• Mean: G'(z)|_{z=1} = µ e^{µ(z−1)} |_{z=1} = µ, as intuitively expected. That's why, by the way, the parameter of the Poisson distribution is traditionally called µ.
• Variance: G''(z) = µ^2 e^{µ(z−1)}, thus σ^2 = G''(1) − [G'(1)]^2 + G'(1) = µ^2 − µ^2 + µ = µ
• The Poisson distribution is inductive, i.e. P(r; µ) ⊕ P(r; µ') = P(r; µ + µ'). The distribution of the number of fish you'll catch in the next 3 hours plus that of the number of fish you'll catch in the following 4 hours is equal to the distribution of the number of fish you'll catch in the next 7 hours!
The symbolic program computing the basic properties of this distribution can be found in
--> PoissonTheo.m
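The inductive property can be checked term by term in Python: the discrete convolution of two Poisson pmfs, with assumed parameters µ = 2.5 and µ' = 4.0, reproduces the Poisson(6.5) pmf:

```python
import math

def poisson_pmf(r, mu):
    return mu**r * math.exp(-mu) / math.factorial(r)

mu1, mu2 = 2.5, 4.0  # hypothetical rates

def conv(k):
    # (P(.; mu1) ⊕ P(.; mu2))_k = sum_i pmf(i; mu1) pmf(k-i; mu2)
    return sum(poisson_pmf(i, mu1) * poisson_pmf(k - i, mu2) for i in range(k + 1))

for k in range(30):
    assert abs(conv(k) - poisson_pmf(k, mu1 + mu2)) < 1e-12
print("convolution of Poisson(2.5) and Poisson(4.0) matches Poisson(6.5)")
```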
Exercise: hitch hiking You are waiting for a lift on a small road where on the average one car
passes every minute. The probability for the driver to stop and pick you up is 10%. What is the
probability that you’ll still be waiting a. in 10 minutes and b. after 10 cars have passed along? Why
are these numbers different? On the same figure, plot the answer to question a. as a function of
the waiting time and the answer to question b. as a function of the number of cars which passed in
front of you. Same question if the probability to be taken is now 50%. Why is the difference bigger
in the latter case? See an illustration in the MATLAB program:
--> HITCH.m
Exercise: Poisson as the limit of the binomial distribution. Create a small movie of ≃ 20 frames showing the deformation of a binomial distribution of constant mean np = 10 for various values of n: ∞ (or 100), 80, 70, · · ·, 30, 25, 20, 18, 16, 15, 14, 13, 12. Notice the asymmetry of the Poisson distribution towards the wing of large r's, and the asymmetry of the binomial distribution towards the wing of small r's when n approaches 10.
6.3 Uniform distribution
The simplest distribution of a continuous variable is the uniform distribution of bounds a and b:

P(x) = 1 / (b − a)

Its mean is µ = (a + b)/2 and its variance is σ^2 = (1/12)(b − a)^2. Numerous physical distributions are of this form, but in practice the limited resolution of measuring devices ruins its nice properties.
As seen in 4.4, the uniform distribution can be very useful for the generation of random numbers distributed according to other probability density functions, after a change of variables. That's why all computer programs/compilers provide generators of sequences of pseudo-random numbers uniformly distributed between 0 and 1. Do not forget that:
• The sequence always starts at the same value, to allow deterministic (and not only statistical) comparisons and debugging.
• The 'seed' (1st value of the sequence) can be reset; for example, in the case of a massive simulation program, it is common practice to save the value of the seed at the end of a 'job' and restore it at the beginning of the next job.
• Usually, these pseudo-random number generators have a cycle length of 2^31 ≃ 2 · 10^9, which is not a large number for today's number crunchers. There exist sophisticated generators of 'infinite' (typically 2^128 ≃ 3.4 · 10^38) cycle length.
6.4 Exponential distribution
The exponential distribution of a continuous variable can be related to the Poisson distribution as the time distribution of the next event we are waiting for. It depends on one parameter t0 = 1/r, where r is the rate (number of events per unit of time). In this context, t0 has the dimension of a time.
Let's evaluate the probability that no event will occur between now and time t. The number of events between now and time t is distributed according to a Poisson law of mean µ = t/t0, and therefore the probability that zero events occur between now and t is

p(t) = (t/t0)^0 e^{−t/t0} / 0! = e^{−t/t0}
Therefore the probability that the next event will occur at time t is

P(t) = −dp(t)/dt = (1/t0) e^{−t/t0}
This is illustrated by the program
--> poisson.m
The characteristic function (written as a function of u, since now the variable is t!) is:

Φ(u) = (1/t0) ∫_{0}^{∞} e^{iut − t/t0} dt = 1 / [t0 (1/t0 − iu)] = 1 / (1 − iut0)
It follows from this that
• The mean of the exponential distribution is

µ = ∂Φ/∂(iu) |_{iu=0} = t0 / (1 − iut0)^2 |_{iu=0} = t0

• Its variance is

σ^2 = ∂^2 Φ/∂(iu)^2 |_{iu=0} − µ^2 = 2 t0^2 / (1 − iut0)^3 |_{iu=0} − t0^2 = 2 t0^2 − t0^2 = t0^2
• The exponential distribution is NOT inductive. Example: there is a sample of radioactive material in this room. If I enter the room, start my stopwatch and wait for the time of the next radioactive decay, I will observe an exponential law. If I look at the distribution of the time interval between two decays, I will still get an exponential law. BUT if I enter the room, wait for 2 decays, and look at my stopwatch, I will not get an exponential distribution. In fact, this distribution is called the Erlangian distribution, and its probability density function is:

P(t; n, t0) = (1/Γ(n)) t0^{−n} t^{n−1} e^{−t/t0} = (1/(Γ(n) t0)) x^{n−1} e^{−x}

where n is the number of events you are waiting for (n = 1 is the exponential distribution) and x = t/t0 is the dimensionless time. Γ(x) is the gamma function:

Γ(x) = ∫_{0}^{∞} t^{x−1} e^{−t} dt

For an integer n > 0, Γ(n) = (n − 1)!.
See the symbolic computation and illustration in the MATLAB programs
--> erlang.m, plerlang.m
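A quick Monte-Carlo check of the mean and variance results above, with an assumed t0 = 2.0:

```python
import random

# Exponential waiting times with hypothetical t0 = 2.0 (rate 1/t0);
# the sample mean should be near t0 and the sample variance near t0**2.
random.seed(4)
t0 = 2.0
times = [random.expovariate(1.0 / t0) for _ in range(100_000)]

mean = sum(times) / len(times)
variance = sum((t - mean) ** 2 for t in times) / len(times)
print(mean, variance)  # close to 2.0 and 4.0
```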
Exercise: waiting for a bus. Every 10 minutes (at hours :00, :10, · · ·, :50), a local bus leaves Tel Aviv for Haifa with several stops on the way. Because of traffic jams, these buses reach Hadera at random times. What is your average waiting time if you go to the bus station, assuming you ignore the time-table, in a. Tel Aviv and b. Hadera? Why are these two numbers different? Where did the 'missing buses' disappear? Why is this exercise unrealistic, the actual situation being much worse?
Exercise: dead time Any data acquisition system has dead time, which means that after an
event has been accepted for acquisition/processing, the system is blind for some time τp . Assuming
that the rate of incoming events is r = 1/τ0 , express as a function of ρ = τp /τ0 :
• The probability that an incoming event i was preceded by another event i−1 with ti −ti−1 < τp
• The fraction of events which are lost because of dead time
Why are these two numbers different? What is the very famous trick to reduce the fraction of lost
events?
6.5 Normal, or Gaussian, distribution
The Normal, or Gaussian, distribution of a continuous variable is the most important one because of the central limit theorem, which we will study later on. We will see that, under very general conditions, the convolution of several distributions quickly approaches the Gaussian law. As a consequence, the resolution function of almost every apparatus is itself a normal one.
The probability density function can be introduced in numerous ways. For example, let's study the behavior of a Poisson distribution when its parameter µ goes to infinity. We know that the mean is µ; let's perform a change of variable around the mean: x → t = x − µ, where t ≪ µ can now be viewed as a continuous variable. Stirling's formula says that

n! ≃ √(2πn) n^n e^{−n}
Therefore the Poisson law can be rewritten

P(x; µ) = e^{−µ} µ^{µ+t} / (µ + t)! = (1/√(2π(µ+t))) [e^{−µ} e^{µ+t} µ^{µ+t} (µ + t)^{−(µ+t)}]

A second order expansion of the logarithm of the term between brackets gives:

ln[e^{−µ} e^{µ+t} µ^{µ+t} (µ + t)^{−(µ+t)}] = t − (µ + t) ln(1 + t/µ) ≃ t − (µ + t)(t/µ − t^2/(2µ^2)) ≃ −t^2/(2µ)

Thus, going back to x,

e^{−µ} µ^x / x! ≃ (1/√(2πµ)) e^{−(x−µ)^2/(2µ)}
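The quality of this approximation can be checked numerically; a Python sketch comparing the exact Poisson probabilities with the Gaussian formula for an assumed µ = 100, over a window of ±2 standard deviations:

```python
import math

mu = 100.0  # assumed, large enough for the limiting behavior to be visible

def poisson(x):
    return mu**x * math.exp(-mu) / math.factorial(x)

def gauss_approx(x):
    return math.exp(-(x - mu)**2 / (2 * mu)) / math.sqrt(2 * math.pi * mu)

# Agreement is excellent at the mean and degrades slowly towards the tails.
max_rel_err = max(abs(gauss_approx(x) / poisson(x) - 1) for x in range(80, 121))
print(abs(gauss_approx(100) / poisson(100) - 1), max_rel_err)
```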
Noting that the variance of the Poisson law is σ^2 = µ, this can be rewritten

N(x; µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)^2/(2σ^2)}

Please note that the normalization factor is 1/(√(2π) σ) and not 1/√(2πσ), since the distribution has the dimension of 1/σ. Note that the normal distribution can be introduced as well, among others, as the limit of the Erlangian distribution when n goes to infinity: if I enter a room in which stands a sample of radioactive material, start my stopwatch and record the time of the 100th decay, I'll get a distribution very close to the normal one. The 'unit' normal distribution N(x; 0, 1) is called the standard normal distribution. The characteristic function of the normal distribution is

Φ(t) = E(e^{itx}) = ∫_{−∞}^{∞} (1/(√(2π) σ)) e^{itx − (x−µ)^2/(2σ^2)} dx = e^{itµ − t^2 σ^2 / 2}
and a direct integration of the even central moments yields:

µ_{2k} = ∫_{−∞}^{∞} (x − µ)^{2k} (1/(√(2π) σ)) e^{−(x−µ)^2/(2σ^2)} dx = · · · = ((2k)!/k!) (σ^2/2)^k
It follows from the above that
• The mean is µ, the variance is σ^2, the skewness (in fact any odd central moment) is 0 and the kurtosis is also 0.
• Any linear combination of independent normally distributed random variables ∑_i α_i x_i is itself normally distributed, with a mean µ = ∑_i α_i µ_i and a variance σ^2 = ∑_i α_i^2 σ_i^2. This is a direct consequence of the exponential properties of the characteristic function.
A unique and fundamental property of the normal distribution is the following: consider n independent normally distributed random variables of mean µ and variance σ^2. Because of the above, we know that the quantity x̄ = (1/n) ∑_i x_i is normally distributed. The quantity s^2 = (1/n) ∑_i (x_i − x̄)^2 has its own distribution (it is related to the χ^2 distribution), but the unique property of the normal distribution is that x̄ and s^2 are independent, which means that p(x̄, s^2) = p1(x̄) p2(s^2).
Exercise: Gaussian distribution as the limit of the Erlang distribution. Write a MATLAB program which creates a movie representing the distortion of the Erlang distribution when its parameter n runs from 1 to 25. The solution is given in the MATLAB program
--> GMOVIE.m
The cumulative probability density function F(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t^2/2} dt is closely related, but not equal, to the error functions ERF and ERFC provided by most computer compilers/programs:

ERF(x) = (2/√π) ∫_{0}^{x} e^{−t^2} dt

and

ERFC(x) = (2/√π) ∫_{x}^{∞} e^{−t^2} dt = 1 − ERF(x)

Thus,

F(x) = 1/2 + (1/2) ERF(x/√2)
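In Python the relation can be checked directly with math.erf:

```python
import math

# F(x) = 1/2 + (1/2) ERF(x / sqrt(2)), the standard normal cumulative
# distribution, checked against a few well-known values.
def F(x):
    return 0.5 + 0.5 * math.erf(x / math.sqrt(2.0))

print(F(0.0))            # 0.5 by symmetry
print(F(1.0))            # ~0.8413
print(F(1.0) - F(-1.0))  # ~0.6827: the probability within +-1 sigma
```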
Exercise: Using MATLAB: in a normal distribution, how many points are at least 2, 3, · · · 5 standard deviations away? How does this compare with Tchebychev theorem?