1. Basics of probabilities

1 Recommended references

Two simple but complete books:
• Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences, by R. Barlow, Manchester Physics Series, Wiley; paperbound reprint 1993, ISBN 0471922951, or original hardcover 1989, ISBN 0471922943.
• Statistical Data Analysis, by G. Cowan, Oxford University Press, ISBN 0-19-850156-0 and 0-19-850-115-2.

An excellent online handbook can be found at:
http://www-library.desy.de/preparch/books/vstatmp_engl.pdf
An online version of Numerical Recipes can be found at:
http://www.nrbook.com/a/bookcpdf.php

2 Administrativia

My email is: [email protected]
Lecture notes, homework, ... can be found on the course home page at:
http://webhome.weizmann.ac.il/home/fhlellou/course/data anal/home.html

3 Probabilities

3.1 Some definitions

• An experiment is any process that generates raw data.
• A set whose elements represent all possible outcomes of an experiment is called the sample space.
• An element of the sample space is called a sample point.
• An event is a subset of the sample space.

In probability theory a weight is assigned to each sample point, normalized so that the total weight of the sample space is 1. These weights are the "intuitive" likelihoods of the occurrences of the sample points.

• The probability of an event is the sum of the weights assigned to its sample points.
• Two events are said to be mutually exclusive if their intersection is empty.
• Two events are said to be complementary if their union is the sample space and their intersection is empty.

Example: the experiment consists of tossing a coin three times. The sample space is

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

An event could be, for example, "2 times out of 3, a head was tossed"; it corresponds to 3 sample points, and the probability of such an event is therefore 3/8.

Exercise: Panem et Circenses! The 'Loto' game consists of extracting at random 5 balls out of 50, numbered from 1 to 50.
Which event is more probable?
• The 5 balls are distributed one inside each of the five decades (1 to 10, 11 to 20, ..., 41 to 50).
• All 5 balls belong to the same decade.
Should we conclude from this that it 'pays off' to gamble on the former? Is there anyhow a strategy to hope to win?

3.2 Basic theorems

Notation: Ω is the sample space, A and B are events.

P(A or B) = P(A) + P(B) − P(A and B)

Noting P(A|B) the conditional probability of A given B,

P(A and B) = P(A|B) · P(B) = P(B|A) · P(A)

We can deduce from the above the Bayes theorem:

P(A|B) = P(B|A) · P(A) / P(B)

If A1, A2, ..., An are exhaustive and exclusive sets (which means that any event belongs to one and only one Ai), then the marginal probability of B can be written

P(B) = ∑_j P(Aj and B) = ∑_j P(B|Aj) · P(Aj)

and the Bayes theorem can be rewritten as

P(Ai|B) = P(B|Ai) · P(Ai) / ∑_j P(B|Aj) · P(Aj)

Independent events: A and B are independent if P(A|B) = P(A), i.e. P(A) does not depend on B being fulfilled or not. If so, P(A and B) = P(A) · P(B) and P(B|A) = P(B).

Warning: dependence between two events does not imply causality!

3.3 Digression: Bayesian and frequentist interpretations of probabilities

Traditionally two interpretations of probabilities are given:
• The "frequentist" approach, which views a probability as the limit of the ratio of successes when the test/measurement/experiment is carried out an infinite number of times.
• The subjective or "Bayesian" approach, which views the elements of the sample space as hypotheses or propositions, i.e. statements which are either true or false.

Let's call θi a hypothesis (for example "the value of this parameter is ...") and let's assume we have an exhaustive set of hypotheses (∑_i P(θi) = 1). P(θi) is called the prior probability of hypothesis θi and P(θi|X) is the posterior probability of θi (after the experiment yielded observation X).
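The marginalization and Bayes formulas of section 3.2 can be checked with a short numeric sketch (the three events and all probability values below are made-up illustration numbers, not from the course):

```python
# Hypothetical exhaustive and exclusive events A1, A2, A3 (illustration only)
priors = {"A1": 0.5, "A2": 0.3, "A3": 0.2}       # P(Aj), summing to 1
likelihoods = {"A1": 0.9, "A2": 0.5, "A3": 0.1}  # P(B|Aj)

# Marginal probability: P(B) = sum_j P(B|Aj) P(Aj)
p_b = sum(likelihoods[a] * priors[a] for a in priors)

# Bayes theorem: P(Aj|B) = P(B|Aj) P(Aj) / P(B)
posteriors = {a: likelihoods[a] * priors[a] / p_b for a in priors}

print(round(p_b, 2))                       # 0.62
print(round(sum(posteriors.values()), 6))  # 1.0 — posteriors are normalized
```

Note how the denominator P(B) is exactly what makes the posteriors sum to 1 over the exhaustive set of hypotheses.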
The Bayes theorem states that

P(θi|X) = P(X|θi) · P(θi) / ∑_j P(X|θj) · P(θj)

where P(X|θi) is the likelihood of observation X under the hypothesis θi.

If you don't know all the prior probabilities, or claim, as frequentists do, that they have no meaning, then the posterior probabilities cannot be evaluated. If you know some of the prior probabilities, you can at least evaluate the betting odds of θi against θj:

P(θi|X) / P(θj|X) = [P(X|θi) · P(θi)] / [P(X|θj) · P(θj)]

The Bayes POSTULATE, adopted by Bayesian statisticians, states that if the distribution of prior probabilities is completely unknown, one can take them all equal: P(θ1) = ... = P(θn) = P(θ), and then

P(θi|X) = P(X|θi) / ∑_j P(X|θj)

In this sense, when quoting that a given theory is excluded by experimental data at a given 'confidence level' (e.g. 95%), frequentist statisticians will see it as '1 − the probability that the data would be such, were the theory correct'; Bayesian statisticians will view it as 'the probability that the theory is indeed correct given the data', and will read the Bayes theorem as a way to update the probability that the theory is correct:

P(theory|data) = P(data|theory) · P(theory) / P(data)

P(theory) is called the prior probability, P(data|theory) the likelihood, and P(theory|data) the posterior probability.

3.4 Exercises

1. You are participating in 'Let's Make a Deal', the great TV show on Arutz Hatimtum! The rules of the game are the following: the entertainer shows 3 envelopes; two of them are empty, and one contains a $10,000,000 check. You choose one envelope, and to tempt you, the entertainer will systematically open one of the remaining two envelopes, show that it is empty, and ask you whether you want to change for the remaining third one. Is it worthwhile to change?

2. Three drawers contain 2 coins each. In the first one, there are 2 gold coins; in the second one, 2 silver coins; and in the last one, 1 gold + 1 silver coin.
You open one drawer, pick one coin and realize it is gold. What is the probability that the second coin in the drawer is gold as well?

3. You are running a High Energy Physics experiment, directing a beam of incoming particles onto a target. In order to identify the incoming particles, you use a 'Cerenkov counter' which can discriminate between π's and K's. The response probability of the Cerenkov to π's is ε = 95% and to K's a = 6%.
• Assuming that the beam composition is 90% π, 10% K, what are the probabilities of the incoming particle being a π if the Cerenkov did fire, and being a K if not?
• Same question assuming the beam composition: 10% π, 90% K.
• Let's assume that there are a lot of other particles in the incoming beam; the only thing we know is that there are twice as many K's as π's. What are the betting odds of π against K if the device did fire?

3.5 Simpson's paradox

Simpson's paradox (or the Yule-Simpson effect) is a paradox often encountered in social-science and medical-science statistics, and it occurs when frequency data are hastily given causal interpretations.

A historic example: One of the best known real-life examples of Simpson's paradox occurred when the University of California, Berkeley was sued for bias against women who had applied for admission to graduate schools there. Let's simplify the problem as follows: imagine a university with two faculties, Natural Sciences and Humanities; in both faculties, men and women have the same probability of passing the exam, while the combined results show that women fail much more often than men. How can this be?

Let's assume that the probability of passing the exam in natural sciences (resp. humanities) is 90% (resp. 10%), while 30% of women and 90% of men go for science. The probability for a woman to succeed is .3 · .9 + .7 · .1 = 0.34; for a man it is .9 · .9 + .1 · .1 = 0.82.

Exercise: were you the judge, would you declare the university guilty of segregation?
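The arithmetic of this toy model can be sketched in a few lines (faculty names and rates as in the text; the helper function is purely illustrative):

```python
# Pass probability per faculty, and fraction choosing science per group,
# using the toy numbers from the text.
p_pass = {"science": 0.9, "humanities": 0.1}
frac_science = {"women": 0.3, "men": 0.9}

def overall_pass(group):
    """Marginal pass probability: per-faculty rates weighted by
    the fraction of the group applying to each faculty."""
    s = frac_science[group]
    return s * p_pass["science"] + (1 - s) * p_pass["humanities"]

print(round(overall_pass("women"), 2))  # 0.34
print(round(overall_pass("men"), 2))    # 0.82
```

Within each faculty the pass rates are identical for the two groups, yet the marginal rates differ widely: the whole difference comes from the choice of faculty, not from the exams.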
Simpson's paradox and decision making: Here is another real-life example, from a medical study comparing the success rates of two treatments for kidney stones. The table shows the success rates and numbers of treatments for both small and large kidney stones:

              Small stones     Large stones     Both
Treatment A   93% (81/87)      73% (192/263)    78% (273/350)
Treatment B   87% (234/270)    69% (55/80)      83% (289/350)

If two doctors, one treating small stones, the other treating large stones, tested the treatments, both would conclude that treatment A is superior to treatment B. However, combining the results shows that treatment B is superior!

You are a physician treating a patient suffering from kidney stones. Based on the above, which treatment are you going to prescribe?

4 Probability distributions - One continuous variable

4.1 Probability density function; moments

Notation: f(x), with the normalization

∫_{−∞}^{∞} f(x) dx = 1

Expectation of g(x), any function of x:

E[g(x)] = <g(x)> = ∫ f(x) g(x) dx

In particular, the mean is <x> = µ = ∫ f(x) x dx and the variance is

v = σ² = <(x − µ)²> = <x² + µ² − 2µx> = <x²> − <x>²

Variance scaling: v(ax + b) = a² v(x)

Bienaymé-Tchebychev inequality: for any positive function g(x) > 0 and any constant c,

P(g(x) > c) ≤ E[g(x)] / c

The proof is simple: in the region Ω where g(x) > c,

∫_Ω g(x) f(x) dx ≥ c ∫_Ω f(x) dx

Thus

E[g(x)] ≥ ∫_Ω g(x) f(x) dx ≥ c · P(g(x) > c)

Tchebychev theorem: how many points are more than n standard deviations away? Just use the Bienaymé inequality with g(x) = (x − µ)²:

P((x − µ)² > n²σ²) ≤ E[(x − µ)²] / (n²σ²) = 1/n²

i.e.

P(|x − µ| > nσ) ≤ 1/n²

The cumulative probability density function expresses the probability that x ≤ a:

F(a) = ∫_{−∞}^{a} f(x) dx

It is a monotonically increasing function from 0 to 1.

Moments: the nth algebraic moment is defined as µ'_n = <x^n> and the nth central moment is defined as µ_n = <(x − µ)^n>. Other useful quantities are the skewness s = µ₃/µ₂^{3/2}, which measures the asymmetry of the distribution, and the kurtosis c = µ₄/µ₂² − 3 (sometimes called the excess), which compares the 'height of the peak' to that of a normal distribution. The −3 ensures that the kurtosis of the normal (Gaussian) distribution is 0. A positive kurtosis corresponds to a higher peak and wider wings than the Gaussian distribution.

4.2 Exercise: another famous paradox

1. You are (again!) participating in a TV show, where the rule of the game is the following: you are presented with two envelopes, for which you know a priori that one contains a prize twice as large as the other. You choose one envelope and open it. It contains a check of, say, n NIS. You are asked whether you want to exchange it for the second one, which contains, with equal probabilities, either 2n or n/2 NIS. The expectation of the gain in changing is (1/2)(2n − n) + (1/2)(n/2 − n) = n/4, which means that no matter what you chose, you'd better choose the other envelope. How to solve this paradox?

2. Let's modify the rules as follows: the "high" envelope is limited to a highest sum, i.e. it is uniformly distributed between 0 and 1M NIS. What is now your gain expectation if you always swap the envelope? What is the best strategy (i.e. how to decide when to swap and when not)?

Sample MATLAB programs to check your answers:
--> TwoEnvelopes.m
--> TwoEnvelopes2.m

4.3 Characteristic function of a distribution

Φ(t) = <e^{itx}> = ∫_{−∞}^{∞} e^{itx} f(x) dx

and the inverse transform is

f(x) = (1/2π) ∫_{−∞}^{∞} e^{−itx} Φ(t) dt

Expanding the characteristic function around t = 0 gives

Φ(t) = <e^{itx}> = <1 + itx + (it)² x²/2! + ...> = <∑_{k=0}^{∞} (it)^k x^k / k!> = ∑_{k=0}^{∞} (it)^k <x^k> / k!

Thus,

µ'_k = ∂^k Φ(t)/∂(it)^k |_{t=0}

For the central moments, consider

Φ_µ(t) = <e^{it(x−µ)}> = ∑_{k=0}^{∞} (it)^k <(x − µ)^k> / k! = ∑_{k=0}^{∞} (it)^k µ_k / k!

thus,

µ_k = ∂^k Φ_µ(t)/∂(it)^k |_{t=0}

The characteristic function is in fact the Fourier transform of the distribution, whose basic property is that of the convolution:

Φ_{p1⊗p2}(t) = Φ_{p1}(t) · Φ_{p2}(t)

Proof:

(p1 ⊗ p2)(x) = ∫_{−∞}^{∞} p1(y) p2(x − y) dy

Thus,

Φ_{p1⊗p2}(t) = ∫∫ e^{itx} p1(y) p2(x − y) dx dy
             = ∫∫ e^{ity} p1(y) e^{it(x−y)} p2(x − y) dx dy
             = ∫ e^{ity} p1(y) [∫ e^{it(x−y)} p2(x − y) dx] dy
             = ∫ e^{ity} p1(y) [∫ e^{itu} p2(u) du] dy
             = ∫ e^{ity} p1(y) Φ_{p2}(t) dy
             = Φ_{p1}(t) · Φ_{p2}(t)

To compute the distribution of the sum of two independent variables (which is NOT the sum of the two distributions), it is enough to multiply the two characteristic functions in the 'frequency space' and transform back the result to the 'variable space'.

This result can be used to show that the variance of the sum of two independent distributions p1 and p2 is the sum of the variances v1 and v2; for simplicity, let's assume that both distributions have a null average. Then

v(p1 ⊗ p2) = ∂²Φ_{p1⊗p2}/∂(it)² |_{t=0}
           = ∂²(Φ_{p1} · Φ_{p2})/∂(it)² |_{t=0}
           = [∂²Φ_{p1}/∂(it)²] Φ_{p2} |_{t=0} + Φ_{p1} [∂²Φ_{p2}/∂(it)²] |_{t=0} + 2 [∂Φ_{p1}/∂(it)] [∂Φ_{p2}/∂(it)] |_{t=0}
           = ∂²Φ_{p1}/∂(it)² |_{t=0} + ∂²Φ_{p2}/∂(it)² |_{t=0}
           = v(p1) + v(p2)

since

∂Φ_{p1}/∂(it) |_{t=0} = ∂Φ_{p2}/∂(it) |_{t=0} = µ1 = µ2 = 0

and Φ_{p1}(0) = Φ_{p2}(0) = 1.

4.4 Change of variable

x has the probability density function f(x). Let's perform a change of variable x → y = h(x). The p.d.f. of y is

g(y) = f(x) |dx/dy| = f(x) / |h'(x)|

In case several xi are solutions of the equation h(x) = y, then

g(y) = ∑_i f(xi) / |h'(xi)|

Application 1: how to generate a variable of a given p.d.f. Consider the change of variable x → y = h(x) = ∫_{−∞}^{x} f(t) dt, i.e. the variable is transformed into its cumulative distribution. Then the probability density function of y is:

g(y) = f(x) / |(∫_{−∞}^{x} f(t) dt)'| = f(x) / |f(x)| = 1

The distribution of the cumulative probability is uniform between 0 and 1.
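This flatness can be checked by simulation; a minimal sketch, where the p.d.f. f(x) = 2x on [0, 1] is an arbitrary illustrative choice (its cumulative is F(x) = x², so x = √r with r uniform):

```python
import random

random.seed(1)  # fixed seed, for reproducible runs

# Illustrative pdf f(x) = 2x on [0, 1]: its cumulative is F(x) = x^2,
# so x = sqrt(r), with r uniform on [0, 1], is distributed according to f.
xs = [random.random() ** 0.5 for _ in range(100_000)]

# Transform each sample back through its own cumulative: y = F(x) = x^2.
ys = [x * x for x in xs]

# If y is really uniform on [0, 1], its mean is 1/2 and its variance 1/12.
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
print(abs(mean - 0.5) < 0.01, abs(var - 1 / 12) < 0.01)  # True True
```

The same check works for any continuous p.d.f.: pushing samples through their own cumulative must always give a flat distribution.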
Therefore the algorithm to generate x is:
1. Generate a pseudo-random number r uniformly distributed between 0 and 1.
2. Solve the equation in x: ∫_{−∞}^{x} f(t) dt = r

Exercise: how to generate a variable x distributed according to f(x) = e^{−x}?

Digression: what to do when the cumulative function can't be inverted analytically? It was first proposed by von Neumann to run an acceptance-rejection method, as follows: let's assume that f(x) is completely surrounded by the box xmin < x < xmax and 0 < f(x) < ymax. Then:
1. Generate a pseudo-random number r uniformly distributed between 0 and 1.
2. Transform it to x = xmin + r(xmax − xmin).
3. Generate y, uniformly distributed between 0 and ymax.
4. If y < f(x), accept. If not, reject.

This is illustrated by the MATLAB program --> expdis.m

Such an algorithm can be very inefficient if the p.d.f. exhibits a high peak and needs to be bounded by a 'high' rectangle. It can be improved by enclosing f in any other distribution h which can be generated analytically, such that f(x) < A · h(x). Then:
1. Generate x according to the analytic method for the distribution h.
2. Generate y, uniformly distributed between 0 and A · h(x).
3. Accept if y < f(x). If not, reject.

In the case where the normalization of the p.d.f. can't be found analytically, the famous Metropolis-Hastings algorithm can be used. It generates a time-independent Markov chain, which is a 'time' sequence of states x in which each x_{t+1} depends only on the previous x_t. The algorithm uses a proposal density Q(x'; x_t) to generate a new proposed sample x'. This proposal is accepted as the next value if a random variable r, uniformly distributed between 0 and 1, satisfies

r < min{ [P(x') Q(x_t; x')] / [P(x_t) Q(x'; x_t)], 1 }

If the proposal is accepted, then x_{t+1} = x'; if not, the current value of x is retained: x_{t+1} = x_t.

See the example MATLAB program: --> Metropolis.m

Application 2: a 'visual check'.
Does an observed set of variables actually come from an assumed parent distribution f(x)? A very simple way consists of filling the histogram of the new variable y = ∫_{−∞}^{x} f(t) dt and 'seeing' whether it is flat or not.

4.5 Exercise

r1 and r2 are two independent variables uniformly distributed between 0 and 1. What is the distribution of r1 · r2? Write a small MATLAB program to display the histogram of 100000 entries of r1 · r2 and visually check the analytic result.

5 Probability distributions - One discrete variable

5.1 Modification of formulas

A discrete variable can take only non-negative integer values k = 0, 1, ... The elementary probability is noted pk and the normalization is

∑_{k=0}^{∞} pk = 1

The mean is

µ = E(k) = ∑_{k=0}^{∞} k pk

and the variance is

V(k) = ∑_{k=0}^{∞} [k − E(k)]² pk = E(k²) − [E(k)]²

5.2 Probability generating function

The equivalent of the characteristic function in the case of the probability distribution of a discrete variable is called the probability generating function:

G(z) = ∑_{k=0}^{∞} z^k pk

Note that G(0) = p0 and G(1) = 1. It is often used to compute the mean and the variance of a discrete variable distribution, using the following trick:

G'(z) = ∑_{k=1}^{∞} k z^{k−1} pk

G''(z) = ∑_{k=2}^{∞} k(k − 1) z^{k−2} pk

Evaluated at z = 1, these derivatives yield:

G'(1) = ∑_{k=1}^{∞} k pk = µ

G''(1) = ∑_{k=2}^{∞} (k² − k) pk = E(k²) − E(k)

Thus

V(k) = G''(1) + G'(1) − [G'(1)]²

Finally, probability generating functions are multiplicative under convolution: let's consider two independent distributions p and p'.
The distribution of the sum of p and p' is:

(p ⊕ p')_k = ∑_{i=0}^{k} p_i p'_{k−i}

and its probability generating function is:

G_{p⊕p'}(z) = G_p(z) · G_{p'}(z) = (∑_{k=0}^{∞} z^k p_k)(∑_{k=0}^{∞} z^k p'_k)

Proof: the term in z^k in G_p(z) · G_{p'}(z) is ∑_{i=0}^{k} p_i p'_{k−i}, while the probability generating function of p ⊕ p' is

G_{p⊕p'}(z) = ∑_{k=0}^{∞} z^k ∑_{i=0}^{k} p_i p'_{k−i}

so the term in z^k in G_{p⊕p'}(z) is the same as the one in G_p(z) · G_{p'}(z).

One can use this result to show that the variance of the distribution of the sum of two independent variables is equal to the sum of the variances. The proof is very similar to that of the case of a continuous variable:

V(k1 + k2) = G''_{1+2}(1) + G'_{1+2}(1) − [G'_{1+2}(1)]²
G'_{1+2}(1) = G1(1) G'2(1) + G'1(1) G2(1) = G'1(1) + G'2(1)
G''_{1+2}(1) = G1(1) G''2(1) + G''1(1) G2(1) + 2 G'1(1) G'2(1) = G''2(1) + G''1(1) + 2 G'1(1) G'2(1)
V(k1 + k2) = G''2(1) + G''1(1) + 2 G'1(1) G'2(1) − [G'1(1)]² − [G'2(1)]² − 2 G'1(1) G'2(1) + G'1(1) + G'2(1)
V(k1 + k2) = V(k1) + V(k2)

See an illustration in the MATLAB program --> dices.m

6 Some important distributions

6.1 Binomial distribution

The binomial distribution applies to a discrete variable and describes the distribution of r, the number of successes out of a given number of trials n, given an elementary probability of success p:

P(r; n, p) = C(n, r) p^r (1 − p)^{n−r}

where C(n, r) = n! / [r! (n − r)!] is the number of combinations. Traditionally, 1 − p is noted q.

The probability generating function is

G(z) = ∑_{r=0}^{n} z^r C(n, r) p^r (1 − p)^{n−r} = (zp + q)^n

It follows that:
• Mean: G'(z)|_{z=1} = np (zp + q)^{n−1}|_{z=1} = np, as intuitively expected.
• Variance: σ² = G''(1) − [G'(1)]² + G'(1) = n(n−1)p² (zp + q)^{n−2}|_{z=1} − n²p² + np = np − np² = np(1 − p) = npq
• The binomial distribution is inductive, i.e.
P(r; n, p) ⊕ P(r; n', p) = P(r; n + n', p), as intuitively expected: if one runs two consecutive experiments, the first one consisting of n trials and the second one of n' trials, the total number of successes is clearly the same as in a single experiment consisting of n + n' trials.
• One could have used the above property to show directly that σ² = npq: let's consider the simple case of n = 1. In such a situation, µ = 0 · (1 − p) + 1 · p = p and σ² = 0² · (1 − p) + 1² · p − µ² = p(1 − p), and induction from 1 to n gives the result.

See the example MATLAB programs: --> BINOM.M, BMOVIE.M, BMOVIE2.M, BMOVIE3.M, POLL.M, SPOLL.M

Exercise: Most political polls published in Israeli newspapers are based on a population of ca. 500 persons. Assuming that x percent of the people answer 'yes' to the question 'are you happy with the government?', what is the standard deviation of the measured rate of satisfaction? Plot it as a function of x. Write a 'Monte-Carlo' program to check this with n = 500 and x = 85%. What is the trick to reduce the computing time?

6.2 Poisson distribution

The Poisson distribution of a discrete variable deals with the probability that a rare event occurs r times in a given time interval ∆t, with an average rate of µ events per ∆t. Note that r is an integer number while µ need not be.

Example: on average, a fisherman catches µ = 2.5 fishes per hour. What is the probability that he will catch r = 0, 1, ... fishes in the upcoming hour?

The Poisson distribution can be seen as the limit of the binomial distribution when the number of time slices (trials) goes to infinity. Let's split the time interval ∆t into n time slices; the elementary probability of success is p = µ/n, and the probability of having two events in the same time slice goes to zero as n goes to infinity. Thus

P(r; µ) = lim_{n→∞} C(n, r) p^r (1 − p)^{n−r} = lim_{n→∞} (n^r / r!) (µ^r / n^r) (1 − µ/n)^n = (µ^r / r!) e^{−µ}

The probability generating function is

G(z) = ∑_{r=0}^{∞} z^r (µ^r / r!) e^{−µ} = e^{zµ} e^{−µ} = e^{µ(z−1)}
It follows that:
• Mean: G'(z)|_{z=1} = µ e^{µ(z−1)}|_{z=1} = µ, as intuitively expected. That's why, by the way, the parameter of the Poisson distribution is traditionally called µ.
• Variance: G''(z) = µ² e^{µ(z−1)}, thus σ² = G''(1) − [G'(1)]² + G'(1) = µ² − µ² + µ = µ.
• The Poisson distribution is inductive, i.e. P(r; µ) ⊕ P(r; µ') = P(r; µ + µ'). The distribution of the number of fishes you'll catch in the next 3 hours plus the number you'll catch in the following 4 hours is equal to the distribution of the number of fishes you'll catch in the next 7 hours!

The symbolic program computing the basic properties of this distribution can be found in --> PoissonTheo.m

Exercise: hitch-hiking. You are waiting for a lift on a small road where on average one car passes every minute. The probability for a driver to stop and pick you up is 10%. What is the probability that you'll still be waiting a. in 10 minutes and b. after 10 cars have passed by? Why are these numbers different? On the same figure, plot the answer to question a. as a function of the waiting time and the answer to question b. as a function of the number of cars which passed in front of you. Same question if the probability to be taken is now 50%. Why is the difference bigger in the latter case?

See an illustration in the MATLAB program: --> HITCH.m

Exercise: Poisson as the limit of the binomial distribution. Create a small movie of ≃ 20 frames showing the deformation of a binomial distribution of constant mean np = 10 for various values of n: ∞ (or 100), 80, 70, ..., 30, 25, 20, 18, 16, 15, 14, 13, 12. Notice the asymmetry of the Poisson distribution towards the wing of large r's, and the asymmetry of the binomial distribution towards the wing of small r's when n approaches 10.

6.3 Uniform distribution

The simplest distribution of a continuous variable is the uniform distribution of bounds a and b:

P(x) = 1/(b − a)

Its mean is µ = (a + b)/2 and its variance is σ² = (b − a)²/12.
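These two formulas are easy to check by simulation; a quick sketch (the bounds a = 3, b = 7 are an arbitrary illustrative choice):

```python
import random

random.seed(2)  # fixed seed, for reproducible runs

a, b = 3.0, 7.0  # arbitrary illustrative bounds
n = 200_000
xs = [a + (b - a) * random.random() for _ in range(n)]

# Sample mean and variance, to be compared with (a+b)/2 and (b-a)^2/12
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

print(round(mean, 1))  # 5.0 -> (a + b) / 2
print(round(var, 1))   # 1.3 -> (b - a)^2 / 12 = 16/12
```

With 200000 entries the statistical uncertainty on both estimates is far below the displayed rounding, so the printed values are stable from run to run.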
Numerous physical distributions are of this form, but in practice the limited resolutions of measuring devices ruin its nice properties.

As seen in 4.4, the uniform distribution can be very useful for the generation of random numbers distributed according to other probability density functions, after a change of variables. That's why all computer programs/compilers provide generators of sequences of pseudo-random numbers uniformly distributed between 0 and 1. Do not forget that:
• The sequence always starts at the same value, to allow deterministic (and not only statistical) comparisons and debugging.
• The 'seed' (1st value of the sequence) can be reset; for example, in the case of a massive simulation program, it is common practice to save the value of the seed at the end of a 'job' and restore it at the beginning of the next job.
• Usually, these pseudo-random number generators have a cycle length of 2^31 ≈ 2 · 10^9, which is not a large number for today's number crunchers. There exist sophisticated generators of 'infinite' (typically 2^128 ≈ 3.4 · 10^38) cycle length.

6.4 Exponential distribution

The exponential distribution of a continuous variable can be related to the Poisson distribution as the time distribution of the next event we are waiting for. It depends on one parameter t0 = 1/r, where r is the rate (number of events per unit of time). In this context, t0 has the dimension of a time.

Let's evaluate the probability that no event will occur between now and time t. The number of events between now and time t is distributed according to a Poisson law of mean µ = t/t0, and therefore the probability that zero events occur between now and t is

p(t) = (t/t0)^0 e^{−t/t0} / 0! = e^{−t/t0}

Therefore the probability density of the next event occurring at time t is

P(t) = −dp(t)/dt = (1/t0) e^{−t/t0}

This is illustrated by the program --> poisson.m

The characteristic function (written as a function of u, since now the variable is t!)
is:

Φ(u) = (1/t0) ∫_0^∞ e^{iut − t/t0} dt = (1/t0) · 1/(1/t0 − iu) = 1/(1 − iut0)

It follows from this that:
• The mean of the exponential distribution is

µ = ∂Φ/∂(iu)|_{iu=0} = t0/(1 − iut0)²|_{iu=0} = t0

• Its variance is

σ² = ∂²Φ/∂(iu)²|_{iu=0} − µ² = 2t0²/(1 − iut0)³|_{iu=0} − t0² = 2t0² − t0² = t0²

• The exponential distribution is NOT inductive.

Example: there is a sample of radioactive material in this room. If I enter the room, start my stopwatch and wait for the time of the next radioactive decay, I will observe an exponential law. If I look at the distribution of the time interval between two decays, I will still get an exponential law. BUT if I enter the room, wait for 2 decays, and look at my stopwatch, I will not get an exponential distribution. In fact, this distribution is called the Erlangian distribution and its probability density function is:

P(t; n, t0) = (1/Γ(n)) t0^{−n} t^{n−1} e^{−t/t0} = (1/(t0 Γ(n))) x^{n−1} e^{−x}

where n is the number of events you are waiting for (n = 1 is the exponential distribution) and x = t/t0 is the dimensionless time. Γ(z) is the gamma function:

Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt

For an integer n > 0, Γ(n) = (n − 1)!.

See the symbolic computation and illustration in the MATLAB programs --> erlang.m, plerlang.m

Exercise: waiting for a bus. Every 10 minutes (at hours :00, :10, ..., :50), a local bus leaves Tel Aviv for Haifa with several stops on the way. Because of traffic jams, these buses reach Hadera at random times. What is your average waiting time if you go to the bus station, assuming you ignore the time-table, in a. Tel Aviv and b. Hadera? Why are these two numbers different? Where did the 'missing buses' disappear? Why is this exercise unrealistic and the situation actually much worse?

Exercise: dead time. Any data acquisition system has dead time, which means that after an event has been accepted for acquisition/processing, the system is blind for some time τp.
Assuming that the rate of incoming events is r = 1/τ0, express as a function of ρ = τp/τ0:
• The probability that an incoming event i was preceded by another event i−1 with ti − ti−1 < τp
• The fraction of events which are lost because of dead time
Why are these two numbers different? What is the very famous trick to reduce the fraction of lost events?

6.5 Normal, or Gaussian, distribution

The Normal, or Gaussian, distribution of a continuous variable is the most important one because of the central limit theorem, which we will study later on. We will see that under very general conditions, the convolution of several distributions approaches quickly the Gaussian law. As a consequence, the resolution function of almost every apparatus is itself a normal one.

The probability density function can be introduced in numerous ways. For example, let's study the behavior of a Poisson distribution when its parameter µ goes to infinity. We know that the mean is µ; let's perform a change of variable around the mean: x → t = x − µ, where t ≪ µ can now be viewed as a continuous variable. The Stirling formula says that

n! ≃ √(2πn) n^n e^{−n}

Therefore the Poisson law can be rewritten

P(x; µ) = e^{−µ} µ^{µ+t} / (µ + t)! ≃ (1/√(2π(µ+t))) [e^t µ^{µ+t} (µ + t)^{−(µ+t)}]

A second-order expansion of the logarithm of the term between brackets gives:

ln[e^t µ^{µ+t} (µ + t)^{−(µ+t)}] = t − (µ + t) ln(1 + t/µ) ≃ t − (µ + t)(t/µ − t²/(2µ²)) = −t²/(2µ)

Thus, going back to x,

e^{−µ} µ^x / x! ≃ (1/√(2πµ)) e^{−(x−µ)²/(2µ)}

Noting that the variance of the Poisson law is σ² = µ, this can be rewritten

N(x; µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}

Please note that the normalization factor is 1/(√(2π) σ) and not 1/√(2πσ), since the distribution has the dimension of 1/σ.

Note that the normal distribution can be introduced as well, among others, as the limit of the Erlangian distribution when n goes to infinity: if I enter a room in which stands a sample of radioactive material, start my stopwatch and record the time of the 100th decay, I'll get a distribution very close to the normal one.

The 'unit' normal distribution N(x; 0, 1) is called the standard normal distribution.

The characteristic function of the normal distribution is

Φ(t) = E(e^{itx}) = ∫_{−∞}^{∞} (1/(√(2π) σ)) e^{itx − (x−µ)²/(2σ²)} dx = e^{itµ − t²σ²/2}

and a direct integration of the even central moments yields:

µ_{2k} = ∫_{−∞}^{∞} (x − µ)^{2k} (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx = ... = ((2k)!/k!) (σ²/2)^k

It follows from the above that:
• The mean is µ, the variance is σ², the skewness (in fact any odd central moment) is 0, and the kurtosis is also 0.
• Any linear combination of independent normally distributed random variables ∑_i αi xi is itself normally distributed, with mean µ = ∑_i αi µi and variance σ² = ∑_i αi² σi². This is a direct consequence of the exponential properties of the characteristic function.

A unique and fundamental property of the normal distribution is the following: consider n independent normally distributed random variables xi of mean µ and variance σ². Because of the above, we know that the quantity x̄ = (1/n) ∑_i xi is normally distributed. The quantity s² = (1/n) ∑_i (xi − x̄)² has its own distribution (it is called the χ² distribution), but the unique property of the normal distribution is that x̄ and s² are independent, which means that p(x̄, s²) = p1(x̄) p2(s²).

Exercise: Gaussian distribution as the limit of the Erlang distribution.
Write a MATLAB program which creates a movie representing the distortion of the Erlang distribution when its parameter n runs from 1 to 25. The solution is given in the MATLAB program --> GMOVIE.m

The cumulative probability density function F(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt is closely related, but not equal, to the error functions ERF and ERFC provided by most computer compilers/programs:

ERF(x) = (2/√π) ∫_0^x e^{−t²} dt

and

ERFC(x) = (2/√π) ∫_x^∞ e^{−t²} dt = 1 − ERF(x)

Thus,

F(x) = (1/2) [1 + ERF(x/√2)]

Exercise: Using MATLAB: in a normal distribution, how many points are at least 2, 3, ..., 5 standard deviations away? How does this compare with the Tchebychev theorem?
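The relation F(x) = (1/2)[1 + ERF(x/√2)] can be verified numerically; a sketch using Python's math.erf in place of the compiler-provided ERF, cross-checked against a brute-force integration of the standard normal density:

```python
import math

def normal_cdf(x):
    # F(x) = (1 + ERF(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_cdf_numeric(x, lo=-10.0, steps=100_000):
    """Trapezoidal integration of exp(-t^2/2)/sqrt(2*pi) from lo to x,
    used only as an independent cross-check of the ERF relation."""
    h = (x - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0  # trapezoidal end-point weights
        total += w * math.exp(-t * t / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

for x in (0.0, 1.0, 2.0):
    print(abs(normal_cdf(x) - normal_cdf_numeric(x)) < 1e-6)  # True each time
```

The lower cut-off at −10σ is harmless: the neglected tail probability is far below the 10⁻⁶ tolerance used in the comparison.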