Chapter 5. Introduction to Probability Theory This chapter introduces some elementary probability theory which is the base of statistical inference in the next chapter. The foundations of modern probability theory were laid by Andrey Nikolaevich Kolmogorov (1903-1987) in 1933. Related materials are Chapter 1-5 of Casella and Berger (2002) and Appendix B of Hansen (2016). 1 Foundations The set of all possible outcomes of an experiment is called the sample space for the experiment. Take the simple example of tossing a coin. There are two outcomes, heads and tails, so we can write = fH; T g. If two coins are tossed in sequence, we can write the four outcomes as = fHH; HT; T H; T T g. An event A is any collection of possible outcomes of an experiment. An event is a subset of , including itself and the null set ;. Continuing the two coin example, one event is A = fHH; HT g, the event that the …rst coin is heads. We say that A and B are disjoint or mutually exclusive if A \ B = ;. For example, the sets fHH; HT g and fT Hg are disjoint. Furthermore, if the sets A1 ; A2 ; are pairwise disjoint and [1 , then the collection A1 ; A2 ; is called a i=1 Ai = partition of . A probability function P assigns probabilities (numbers between 0 and 1) to events A in . This is straightforward when is countable; when is uncountable we must be somewhat more careful. A collection of sets, F, is called a sigma algebra (or -…eld) if ; 2 F, A 2 F implies Ac 2 F, and A1 ; A2 ; 2 F implies [1 i=1 Ai 2 F. We only de…ne probabilities for events contained in F. A simple example of F is f;; g which is known as the trivial -algebra. For any sample space , let F be the smallest sigma algebra which contains all of the open sets in . When is countable, F is simply the collection of all subsets of F, including ; and . When is the real line, then F is the collection of all open and closed intervals. We call such an F the Borel -algebra associated with , and a set A 2 F a Borel (measurable) set. We now can give the axiomatic de…nition of probability. Given and F, a probability function P satis…es P ( ) = 1, P (A) 0 for all A 2 F, and if A1 ; A2 ; 2 F are pairwise disjoint, then Pn 1 P ([i=1 Ai ) = i=1 P (Ai ). Some important properties of the probability function include the following 1. P (;) = 0 Email: [email protected] 1 2. P (A) 1 3. P (Ac ) = 1 P (A) 4. P (B \ Ac ) = P (B) P (A \ B) 5. P (A [ B) = P (A) + P (B) 6. If A B then P (A) P (A \ B) P (B) 7. Bonferroni’s Inequality: P (A \ B) 8. Boole’s Inequality: P (A [ B) P (A) + P (B) 1 P (A) + P (B) For some elementary probability models, it is useful to have simple rules to count the number of objects in a set. These counting rules are facilitated by using the binomial coe¢ cients which are de…ned for nonnegative integers n and r, n r, as n r ! = n! : r!(n r)! When counting the number of objects in a set, there are two important distinctions. Counting may be with replacement or without replacement. Counting may be ordered or unordered. For example, consider a lottery where you pick six numbers from the set 1; 2; :::; 49. This selection is without replacement if you are not allowed to select the same number twice, and is with replacement if this is allowed. Counting is ordered or not depending on whether the sequential order of the numbers is relevant to winning the lottery. Depending on these two distinctions, we have four expressions for the number of objects (possible arrangements) of size r from n objects. Without Replacement With Replacement n! Ordered nr (n r)! ! ! n n+r 1 Unordered r r In the lottery example,!if counting is unordered and without replacement, the number of potential 49 combinations is = 13; 983; 816. 6 If P (B) > 0 the conditional probability of the event A given the event B is P (AjB) = P (A \ B) : P (B) For any B, the conditional probability function is a valid probability function where replaced by B. Rearranging the de…nition, we can write P (A \ B) = P (AjB)P (B); 2 has been which is often quite useful. We can say that the occurrence of B has no information about the likelihood of event A when P (AjB) = P (A), in which case we …nd P (A \ B) = P (A)P (B): (1) We say that the events A and B are statistically independent when (1) holds. Furthermore, we say that the collection of events A1 ; :::; Ak are mutually independent when for any subset fAi : i 2 Ig ; ! Y \ P P (Ai ): Ai = i2I i2I Theorem 1 (Bayes’Rule or Theorem) For any set B and any partition A1 ; A2 ; ple space, then for each i = 1; 2; of the sam- P (BjAi )P (Ai ) : P (Ai jB) = P1 j=1 P (BjAj )P (Aj ) Example 1 (Monty Hall Problem) The Monty Hall problem is a brain teaser, …rst appearing on the American television game show Let’s Make a Deal and named after its original host, Monty Hall. Suppose you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice? Solution: Consider the events C1, C2 and C3 indicating the car is behind respectively door 1 2 or 3. All these 3 events have probability 1=3. The player initially choosing door 1 is described by the event X1. As the …rst choice of the player is independent of the position of the car, also the conditional probabilities are P (CijX1) = 1=3. The host opening door 3 is described by H3. For this event it holds: P (H3jC1; X1) = 1=2, P (H3jC2; X1) = 1 and P (H3jC3; X1) = 0: Then, if the player initially selects door 1, and the host opens door 3, the conditional probability of winning by switching is P (C2jH3; X1) = = = = P (H3jC2; X1) P (C2 \ X1) P (H3 \ X1) P (H3jC2; X1) P (C2 \ X1) P3 i=1 P (H3jCi; X1) P (Ci \ X1) P (H3jC2; X1) P3 i=1 P (H3jCi; X1) 1 2 = : 1=2 + 1 + 0 3 3 Figure 1: Tree Showing the Probability of Every Possible Outcome If the Player Initially Picks Door 1 Figure 1 shows the probability of every possible outcome if the player initially picks Door 1. 2 Random Variables A random variable (r.v.) X is a measurable function from a sample space into the real line, i.e., for any Borel set A in R, the inverse image of A is in F.1 This induces a new sample space - the real line - and a new probability function on the real line. Typically, we denote r.v.’s by uppercase letters such as X, and use lowercase letters such as x for potential values and realized values. For a r.v. X we de…ne its cumulative distribution function (cdf) as F (x) = P (X x): Sometimes we write this as FX (x) to denote that it is the cdf of X. A function F (x) is a cdf i¤ the following three properties hold: 1. lim F (x) = 0 and lim F (x) = 1 x! 1 x!1 2. F (x) is nondecreasing in x 3. F (x) is right-continuous 1 Measurability of X guarantees that for any Borel set A, P (X 2 A) P (! 2 jX (!) 2 A) is well de…ned. Since the Borel -algebra can be generated by f( 1; x]jx 2 Rg, we need only de…ne P (X x) for x 2 R to calculate all other probabilities. 4 The support of a r.v. X is the closure of the set of possible values that X can take, denoted as supp(X). We say that the r.v. X is discrete if F (x) is a step function. In this case, the support of X consists of a countable set of real numbers x1 ; ; xJ , where J can be 1. The probability function for X takes the form of the probability mass function (pmf) P (X = xj ) = pj , j = 1; ; J; (2) P where 0 pj 1 and Jj=1 pj = 1. A famous discrete r.v. is the Bernoulli r.v., where J = 2, x1 = 0, x2 = 1, p2 = p and p1 = 1 p. We say that the r.v. X is continuous if F (x) is continuous in x. In this case P (X = x) = 0 for all x 2 R so the representation (2) is unavailable. Instead, we represent the relative probabilities by the probability density function (pdf) f (x) = so that F (x) = Z d F (x) dx x f (u)du 1 and P (a X b) = Z b f (u)du: a These expressions make sense only if F (x) is di¤erentiable. While there are examples of continuous r.v.’s which do not possess a pdf, these cases are unusual and are typically ignored. A function R1 f (x) is a pdf i¤ f (x) 0 for all x 2 R and 1 f (x)dx = 1. The support of a continuous r.v. is fx 2 Rjf (x) > 0g. One famous continuous r.v. is the normal r.v. which will be discussed in Section 7. Another famous continuous r.v. is the uniform r.v. whose pdf is f (x) = 1 (0 x 1), i.e., X can occur only on [0; 1] and occurs uniformly. We denote X U [0; 1]. A generalization is the uniform r.v. on [a; b], a < b, whose pdf is f (x) = 1 (a x b) =(b a). We denote X U [a; b]. 3 Expectation For any measurable real function g, we de…ne the mean or expectation E [g(X)] as follows. If X is discrete, J X E [g(X)] = g(xj )pj ; j=1 and if X is continuous E [g(X)] = Z 1 g(x)f (x)dx: 1 The latter is well de…ned and …nite if Z 1 1 jg(x)j f (x)dx < 1: 5 (3) If (3) does not hold, evaluate I1 = Z g(x)f (x)dx; g(x)>0 I2 = Z g(x)f (x)dx: g(x)<0 If I1 = 1 and I2 < 1 then we de…ne E [g(X)] = 1. If I1 < 1 and I2 = 1 then we de…ne E [g(X)] = 1. If both I1 = 1 and I2 = 1 then E [g(X)] is unde…ned. To cover both the discrete and continuous case (and even more general cases), we often write R E [g(X)] as a Riemann-Stieltjes integral g(x)dF (x) (why intuitively correct? see Chapter 1). Since E [a + bX] = a + b E [X], we say that expectation is a linear operator. For m > 0, we de…ne the m0 th moment of X as E [X m ], the m0 th central moment as E [(X E[X])m ], the m0 th absolute moment of X as E [jXjm ], and the m0 th absolute central moment as E [jX E[X]jm ]. h i 2. Two special moments are the mean = E[X] and variance 2 = E (X )2 = E X 2 p 2 the standard deviation of X. We can also write 2 = V ar(X). For example, We call = this allows the convenient expression V ar(a + bX) = b2 V ar(X). The moment generating function (MGF) of X is M ( ) = E [exp ( X)] : The MGF does not necessarily exist. However, when it does and E [X m ] < 1 then E [X m ] = dm M( ) d m ; =0 which is why it is called the moment generating function. More generally, the characteristic function (CF) of X is C( ) = E [exp (i X)] ; where i = p 1 is the imaginary unit. The CF always exists, and when E [jXjm ] < 1, E [X m ] = 1 dm C( ) im d m which is called Moment Theorem. The Lp norm, p 1, of the r.v. X is kXkp = (E [jXjp ])1=p : 6 ; =0 4 Multivariate Random Variables A pair of bivariate r.v.’s (X; Y ) is a function from the sample space into R2 . The joint cdf of (X; Y ) is F (x; y) = P (X x; Y y) : If F is continuous, the joint probability density function is @2 F (x; y): @x@y f (x; y) = R2 , For a Borel measurable set A P ((X; Y ) 2 A) = Z Z f (x; y)dxdy: A For any measurable function g(x; y), E [g(X; Y )] = Z 1 1 Z 1 g(x; y)f (x; y)dxdy: 1 The marginal distribution of X is FX (x) = P (X x) = lim F (x; y) = y!1 Z x 1 Z 1 f (x; y)dydx; 1 so the marginal density of X is fX (x) = d FX (x) = dx Z 1 f (x; y)dy: 1 Similarly, the marginal density of Y is fY (y) = Z 1 f (x; y)dx: 1 The r.v.’s X and Y are de…ned to be independent if f (x; y) = fX (x)fY (y). Furthermore, X and Y are independent i¤ there exist functions g(x) and h(y) such that f (x; y) = g(x)h(y). (Exercise) If X and Y are independent, then E [g(X)h(Y )] = = = Z Z Z 1 Z 1 g(x)h(y)f (x; y)dxdy 1 1 1 Z 1 1 1 1 g(x)h(y)fX (x)fY (y)dxdy Z 1 g(x)fX (x)dx h(y)fY (y)dy 1 = E [g(X)] E [h(Y )] ; 7 1 (4) if the expectations exist. For example, if X and Y are independent then E [XY ] = E [X] E [Y ] : Another implication of (4) is that if X and Y are independent and Z = X + Y , then MZ ( ) = E [exp ( (X + Y ))] = E [exp ( X) exp ( Y )] = E [exp ( X)] E [exp ( Y )] = MX ( )MY ( ): The covariance between X and Y is Cov(X; Y ) = XY = E [(X E [X]) (Y E [Y ])] = E [XY ] E [X] E [Y ] : The correlation between X and Y is Corr(X; Y ) = XY = XY : X Y The Cauchy-Schwarz Inequality implies that j XY j 1: The correlation is a measure of linear dependence, free of units of measurement. If X and Y are independent, then XY = 0 and XY = 0. The reverse, however, is not true. For example, if E [X] = 0 and E X 3 = 0, then Cov(X; X 2 ) = 0. A useful fact is that V ar(X + Y ) = V ar(X) + V ar(Y ) + 2Cov(X; Y ): An implication is that if X and Y are independent, then V ar(X + Y ) = V ar(X) + V ar(Y ); the variance of the sum is the sum of the variances. A k 1 random vector X = (X1 ; ; Xk )0 is a function from to Rk . Let x = (x1 ; denote a vector in Rk . The vector X has the distribution and density functions F (x) = P (X x) ; k @ f (x) = F (x): @x1 @xk 8 ; xk )0 For a measurable function g : Rk ! Rs , we de…ne the expectation Z g(x)f (x)dx; E [g(X)] = Rk where the symbol dx denotes dx1 dxk . In particular, we have the k 1 multivariate mean = E [X] and k k covariance matrix = E (X )0 = E XX 0 ) (X If the elements of X are mutually independent, then V ar k X Xi i=1 5 ! k X = 0 : is a diagonal matrix and V ar (Xi ) : i=1 Conditional Distributions and Expectation The conditional density of Y given X = x is de…ned as fY jX (yjx) = f (x; y) fX (x) if fX (x) > 0. One way to derive this expression from the de…nition of conditional probability is fY jX (yjx) = = = = = @ lim P (Y yjx X x + ") @y "!0 @ P (fY yg \ fx X x + "g) lim @y "!0 P (x X x + ") @ F (x + "; y) F (x; y) lim @y "!0 FX (x + ") FX (x) @ F (x + "; y) @ lim @x @y "!0 fX (x + ") @ @y@x F (x; y) fX (x) = f (x; y) : fX (x) The conditional mean or conditional expectation is the function m(x) = E [Y jX = x] = Z 1 1 yfY jX (yjx)dy: The conditional mean m(x) is a function, meaning that when X equals x, then the expected value 9 of Y is m(x). Similarly, we de…ne the conditional variance of Y given X = x as 2 (x) = V ar (Y jX = x) i h = E (Y m(x))2 X = x = E Y2 X =x m(x)2 : Evaluated at x = X, the conditional mean m(X) and conditional variance 2 (X) are r.v.’s, functions of X. We write this as E [Y jX = x] = m(X) and V ar (Y jX) = 2 (X). For example, if E [Y jX = x] = + 0 x, then E [Y jX] = + 0 X, a transformation of X. ( 8xy; if 0 < x < y < 1; Example 2 Suppose fXY (x; y) = Are X and Y independent? What 0; otherwise. are the marginal densities of X and Y ? What is the conditional density of Y given X? What are E [Y ] and V ar (Y )? What are E [Y jX = x] and V ar (Y jX = x)? Solution: It is tempting to claim that X and Y are independent since it seems that f (x; y) = g(x)h(y) with g(x) = 8x and h(y) = y. However, such an argument neglects the support of (X; Y ) which is shown in the left panel of Figure 2. Note that the support of (X; Y ), supp(X; Y ) = f(x; y) j0 x y 1g, where the boundary is included. The marginal density of X is fX (x) = Z 1 8xydy = 4x(1 x2 ); 0 < x < 1 x and the marginal density of Y is fY (y) = Z y 8xydx = 4y 3 ; 0 < y < 1; 0 so E [Y ] = Z 0 1 4 4y ydy = ; E Y 2 = 5 3 Z 0 1 2 4y 3 y 2 dy = ; 3 and V ar (Y ) = E Y 2 E [Y ]2 = 2 3 4 5 2 = 2 : 75 Obviously, fXY (x; y) = 8xy 1 (0 < x < y < 1) 6= 4x(1 x2 ) 1 (0 < x < 1) 4y 3 1 (0 < y < 1) = fX (x)fY (y); so X and Y are not independent. 10 The conditional density of Y given X = x is f (x; y) fY jX (yjx) = = fX (x) ( 8xy 4x(1 x2 ) = 2y ; 1 x2 0; if 0 < x < y < 1; otherwise. So when 0 < x < 1, E Y 2 jX = x Z 1 2 1 + x + x2 2y dy = ; 1 x2 3 1+x x Z 1 2y 1 1 + x + x2 + x3 y2 = dy = ; 1 x2 2 1+x x E [Y jX = x] = y and V ar (Y jX = x) = E Y 2 jX = x = = E [Y jX = x]2 1 1 + x + x2 + x3 2 1+x (1 2 1 + x + x2 3 1+x 2 x)2 (1 + 4x + x2 ) : 18(1 + x)2 The right panel of Figure 2 shows E [Y jX = x] and V ar (Y jX = x) as a function of x. Obviously, E [Y jX = x] is increasing with E [Y jX = 0] = 2=3 and E [Y jX = 1] = 1, and V ar (Y jX = x) is decreasing with V ar (Y jX = 0) = 1=18 and V ar (Y jX = 1) = 0. Since E [Y jX = x] and V ar (Y jX = x) are not constant, X and Y are not independent (why?). Another observation is that Z 1 2 1 + x + x2 4 = 4x(1 x2 )dx E [Y ] = 5 3 1 + x 0 Z = E [Y jX = x] fX (x)dx: This is an example of the famous law of iterated expectations (LIE) as explained below. Theorem 2 (Simple Law of Iterated Expectations) E [E [Y jX]] = E [Y ] : Proof. E [E [Y jX]] = E [m(X)] Z 1 = m (x) fX (x)dx 1 Z 1Z 1 = yfY jX (yjx)fX (x)dydx 1 1 Z 1 Z 1 = yf (y; x)dydx 1 = E [Y ] : 1 11 Figure 2: supp(X; Y ), E [Y jX = x] and V ar (Y jX = x) Corollary 1 (Law of Iterated Expectations) E [E [Y jX; Z] jX] = E [Y jX] : Theorem 3 (Conditioning Theorem) For any function g(x), E [g(X)Y jX] = g(X)E [Y jX] : Proof. Let h(x) = E [g(X)Y jX = x] Z 1 = g(x)yfY jX (yjx)dy 1 Z 1 yfY jX (yjx)dy = g(x) 1 = g(x)m(x); where m(x) = E [Y jX = x]. Thus h(X) = g(X)m(X), which is the same as E [g(X)Y jX] = g(X)E [Y jX]. 6 Transformations Suppose that X 2 Rk with continuous distribution function FX (x) and density fX (x). Let Y = g(X) where g(x) : Rk ! Rk is one-to-one, di¤erentiable, and invertible. Let h(y) denote the inverse of g(x). Recall that the Jacobian of h is de…ned as J(y) = det (Dh(y)) ; 12 where for a matrix A, det(A) is the determinant of A. Consider the univariate case k = 1. If g(x) is an increasing function, then g(X) so the distribution function of Y is FY (y) = P (g(X) y) = P (X y i¤ X h(y), h(y)) = FX (h(y)): Taking the derivative, the density of Y is fY (y) = d d FY (y) = fX (h(y)) h(y): dy dy If g(x) is a decreasing function, then g(X) FY (y) = P (g(X) y i¤ X y) = P (X h(y), so h(y)) = 1 FX (h(y)); and the density of Y is fY (y) = d FY (y) = dy fX (h(y)) d h(y): dy We can write these two cases jointly as fY (y) = fX (h(y)) jJ(y)j : (5) This is known as the change-of-variables formula. This same formula (5) holds for k > 1, but its justi…cation requires deeper results from analysis. Example 3 Suppose X U [0; 1] and Y = log(X). Here, g(x) = log(x) and h(y) = exp( y) so the Jacobian is J(y) = exp( y). As the range of X is [0; 1], that for Y is [0; 1]. Since fX (x) = 1 for 0 x 1, (5) shows that fY (y) = exp( y), 0 y 1, an exponential density. 7 Normal and Related Distributions The standard normal density is 1 (x) = p exp 2 x2 2 ; 1 < x < 1: It is conventional to write X N (0; 1), and to denote the standard normal density function by (x) and its distribution function by (x). The latter has no closed-form solution. The normal density has all moments …nite. Since it is symmetric about zero all odd moments are zero. By iterated integration by parts, we can also show that E X 2 = 1 and E X 4 = 3. In fact, for any positive integer m, E X 2m = (2m 1)!! = (2m 1)(2m 3) 1. Thus E X 4 = 3, E X 6 = 15, E X 8 = 105, and E X 10 = 945. If Z is standard normal and X = + Z, then using the change-of-variables formula, X has 13 density 1 f (x) = p 2 )2 (x exp 2 2 ! ; 1 < x < 1: which is the univariate normal density. The mean and variance of the distribution are 2 , and it is conventional to write X N ( ; 2 ). For x 2 Rk , the multivariate normal density is f (x) = p 1 1=2 2 det ( ) (x exp 1 (x ) ) 2 and ; x 2 Rk : The mean and covariance matrix of the distribution are and . When X has the multivariate normal density, we call X follows the multivariate normal distribution and write X N ( ; ). 0 The MGF and CF of the multivariate normal are exp 0 + 0 =2 and exp i 0 =2 , respectively. Ifn X o2 Rk is multivariate normal and the elements of X are mutually uncorrelated, then = 2 diag j is a diagonal matrix. In this case the density function can be written as f (x) = = p 1 2 k Y j=1 p exp 1 k 1 2 exp 2 2 1) = 1 (x1 2 2 k) = k (xk 2 xj j 2 j 2 j ! 2 ! ; which is the product of marginal univariate normal densities. This shows that if X is multivariate normal with uncorrelated elements, then they are mutually independent. Theorem 4 If X N ( ; ) is a k-dimensional multivariate normal vector, and Y = a + BX, where B is an m k matrix with full row rank, then Y N (a + BX; B B0 ) is a m-dimensional multivariate normal vector. Theorem 5 Let X written as 2r . Theorem 6 If Z N (0; Ir ). Then Q = X 0 X is distributed chi-square with r degrees of freedom, N (0; A) with A > 0, q Theorem 7 Let Z N (0; 1) and Q student’s t with r degrees of freedom. Remark 1 T1 2 r q, then Z 0 A 1Z p be independent. Then Tr = Z= Q=r is distributed as N (0; 1). E[Tr ] = 0 for r > 1 and V ar (Tr ) = r=(r 2 and Q 2 be independent. Then F Theorem 8 Let P q;r = q r F distribution with q and r degrees of freedom. Remark 2 qFq;1 2. q 2. q 14 2) for r > 2. P=q Q=r is distributed as Fisher’s 0.4 0.35 0.3 Density 0.25 0.2 0.15 0.1 0.05 0 -5 -4 -3 -2 -1 0 1 2 3 4 5 Figure 3: Density of the t Distribution for r = 1; 2; 5; 1 3.5 3 Density 2.5 2 1.5 1 0.5 0 0 0.5 1 1.5 2 2.5 3 Figure 4: Density of the F Distribution for (q; r) = (1; 1) ; (2; 1) ; (5; 2) ; (10; 1) ; (100; 100) 15
© Copyright 2026 Paperzz