2. Conditional Expectations, Conditional Distributions, Regression Models

2.1. Definition of Conditional Expectations

Starting point: (X, Y), where Y is an R-valued random variable and X is an R^k-valued random variable, both defined on the same probability space (S, F, P(·)).

Regression problem: How much of the random fluctuation of Y can be explained by X? Find a function g : R^k → R that minimizes

    E[{Y − g(X)}^2]                                                        (1)

Definition 2.1. Each function g that minimizes (1) is called the conditional expectation of Y given X.

Usual notation: E[Y|X] = g(X), E[Y|X = x] = g(x). Note that E[Y|X] is a random variable, whereas E[Y|X = x] is a real number.

Spanos (1999): Conditioning can only be defined relative to an event; therefore the expression E[Y|X] does not make sense as it stands. However, the conditional expectation is well defined for each value X = x: it is the expectation of Y given that the event {s : X(s) = x} has occurred. He suggests turning the conditioning random variable into a set of events, namely σ(X), the σ-field generated by X, and writing E[Y|σ(X)]. This makes transparent that E[Y|σ(X)] is a function of X and hence a random variable relative to σ(X), i.e. E[Y|σ(X)] is F_X = σ(X)-measurable. Furthermore, we can generalize the conditional expectation and condition on a sub-σ-field D ⊆ F: E[Y|D]. Extreme choices for D are D_0 := {S, ∅}, the non-informative sub-σ-field, and D_Y := σ(Y), the all-informative (for Y) sub-σ-field, with E[Y|D_0] = E[Y] and E[Y|D_Y] = Y, respectively.

Remarks:
(1) The minimization problem (1) generalizes the definition of E[Y]: µ = E[Y] minimizes E[(Y − µ)^2].
(2) The conditional expectation minimizes the mean squared forecast error (MSFE), i.e. the CE provides the best forecast in the mean square sense.
(3) The expression E[Y|X = x] is problematic if P(X = x) = 0.

Existence of the conditional expectation is proved via the Radon-Nikodym theorem. (Almost sure) uniqueness is assured by

Theorem 2.2. For any two minimizers g_1, g_2 of (1) it holds that g_1(X) = g_2(X) a.s.

2.2. Conditional Distributions

Special Case 1: X, Y have a joint density f(x, y). Then

    E[{Y − g(X)}^2] = ∫∫ {y − g(x)}^2 f(x, y) dx dy = min!
    ⇔ ∫ {y − g(x)}^2 f(x, y) dy = min over g(x)!   for each x
    ⇒ g(x) = ∫ y f(x, y) dy / ∫ f(x, y) dy = ∫ y f(x, y) dy / f_X(x)
    ⇒ E[Y|X = x] = ∫ y f(x, y) dy / f_X(x)

(a numerical sketch of this formula follows at the end of this subsection). Hence E[Y|X = x] is the mean of the distribution F(y|X = x) = ∫_{-∞}^{y} f(x, u) du / f_X(x), which is called the conditional distribution of Y given X = x. Alternative notation for F(y|X = x): P^{Y|X=x}.

Accordingly, f(y|X = x) = f(y|x) = f(x, y)/f_X(x) = f(x, y)/∫ f(x, u) du is the conditional density of Y given X = x.

g(X) = E[Y|X] = ∫ y f(y|X) dy is a random variable. Hence the conditional distribution of Y given X, i.e. F(y|X) = ∫_{-∞}^{y} f(X, u) du / f_X(X), is a "random distribution" with density f(y|X) = f(X, y)/f_X(X). Alternative notation for F(y|X): P^{Y|X}.

General way of obtaining P^{Y|X}: choose P^{Y|X}(A) = h(X), where h minimizes E[{1_{[Y∈A]} − h(X)}^2].

Special Case 2: X, Y discrete.

    P^{Y|X=x}(A) = P^{Y,X}({x} × A) / P^X({x}) = P(X = x, Y ∈ A) / P(X = x)   if P(X = x) > 0.

Otherwise, define P^{Y|X=x}(A) as you like.

Special Case 3: X discrete, Y continuous. P^{Y|X=x} is defined as in Special Case 2, and

    P^{Y,X}(A × B) = ∫_A Σ_{x∈B} f(x, y) dy.

Special Case 4: X continuous, Y discrete.

    P^{Y,X}(A × B) = ∫_B Σ_{y∈A} f(x, y) dx,   and then
    P^{Y|X=x}(A) = Σ_{y∈A} f(x, y) / f_X(x) = Σ_{y∈A} f(x, y) / Σ_u f(x, u).

Intuitive interpretation: E[Y|X = x] is the mean in a stochastic model in which X is non-random and equal to x ∈ R^k.
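As announced above, the Special Case 1 formula can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes a bivariate standard normal with correlation ρ = 0.6 (so the closed form E[Y|X = x] = ρx is known) and approximates the two integrals by Riemann sums on a grid.

```python
# Numerical check of E[Y|X = x] = ∫ y f(x, y) dy / ∫ f(x, y) dy  (Special Case 1).
# Assumed example: bivariate standard normal with correlation rho, for which
# the closed form E[Y|X = x] = rho * x is known.
import numpy as np
from scipy.stats import multivariate_normal

rho, x = 0.6, 1.5                                  # correlation and conditioning value
y = np.linspace(-8.0, 8.0, 4001)                   # grid for the y-integration
dy = y[1] - y[0]

# joint density f(x, y) evaluated along the line X = x
f_xy = multivariate_normal(mean=[0.0, 0.0],
                           cov=[[1.0, rho], [rho, 1.0]]).pdf(
    np.column_stack([np.full_like(y, x), y]))

num = np.sum(y * f_xy) * dy                        # ≈ ∫ y f(x, y) dy
den = np.sum(f_xy) * dy                            # ≈ ∫ f(x, y) dy = f_X(x)
print(num / den, rho * x)                          # both ≈ 0.9
```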
2.3. Properties of Conditional Expectation

Theorem 2.3. g(X) = E[Y|X] ⇔ E[{Y − g(X)} h(X)] = 0 for all Borel measurable functions h.

Remarks:
(1) Theorem 2.3 provides the justification for interpreting regression models as an orthogonal decomposition of the dependent variable Y:

    Y = E[Y|X] + ε,   where ε = Y − E[Y|X] with E[ε h(X)] = 0.

Hence the error is orthogonal to all measurable functions of X.
(2) E[Y|X] is the orthogonal projection of Y onto F_2 = {functions of X}.
(3) Theorem 2.3 implies E[h(X) Y] = E[g(X) h(X)].

Example 1: E[Y 1_{I_1}] = E[g(X) 1_{I_1}] and E[Y 1_{I_2}] = E[g(X) 1_{I_2}], where 1_{I_1} and 1_{I_2} are the indicators of a partition of the outcome set.

Example 2: X discrete.

    E[Y 1_{(X=x)}] = E[g(x) 1_{(X=x)}] = P(X = x) g(x)
    ⇒ g(x) = E[Y 1_{(X=x)}] / P(X = x)

Proof:
"⇒": Suppose g minimizes E[{Y − g(X)}^2]. Then, for any h, the function a ↦ E[{Y − g(X) − a h(X)}^2] is minimized at a = 0, so

    ∂/∂a E[{Y − g(X) − a h(X)}^2] |_{a=0} = 0
    ⇒ ∂/∂a { E[{Y − g(X)}^2] − 2a E[{Y − g(X)} h(X)] + a^2 E[h^2(X)] } |_{a=0} = 0
    ⇒ E[{Y − g(X)} h(X)] = 0.

"⇐": Assume E[{Y − g(X)} h(X)] = 0 for all h, but that g does not minimize E[{Y − g(X)}^2]. Then there exists g* with E[{Y − g*(X)}^2] < E[{Y − g(X)}^2]

    ⇒ E[{Y − g*(X)}^2 − {Y − g(X)}^2] < 0
    ⇒ E[−2(g*(X) − g(X)) Y] + E[g*(X)^2 − g(X)^2] < 0.

Setting h(X) := g*(X) − g(X) and using E[Y h(X)] = E[g(X) h(X)] gives

    E[−2(g*(X) − g(X)) g(X)] + E[g*(X)^2] − E[g(X)^2] < 0
    ⇒ E[{g*(X) − g(X)}^2] < 0,

a contradiction, since the left-hand side is ≥ 0. Hence g is the minimizer.

Application of Theorem 2.3:

Theorem 2.4 (Iterated Expectations).
(i) E(E[Y|X]) = E[Y]
(ii) E[E[Y|X, Z]|Z] = E[Y|Z] a.s.
(iii) E[E[Y|X]|X, Z] = E[Y|X] a.s.
(i)-(iii) represent versions of the LIE: law of iterated expectations.
(iv) "Taking out what is known" property: E[Y f(X)|X] = f(X) E[Y|X] a.s.
(v) E[E[Y|X]|f(X)] = E[Y|f(X)] a.s.
(vi) E[Y|X, Z, X^2, X·Z] = E[Y|X, Z] a.s.

Examples:
ad (ii), (v): The model Y = m(X_1) + ε with E[ε|X_1, X_2] = 0 implies Y = m(X_1) + ε with E[ε|X_1] = 0, because E[E(ε|X_1, X_2)|X_1] = E[0|X_1] = 0 = E[ε|X_1], or equivalently E[E(ε|X)|f(X)] = E[0|f(X)] = 0 = E[ε|f(X)] with X = (X_1, X_2) and f(X) = X_1.
ad (vi): E[Wage|education, experience] = β_0 + β_1 educ + β_2 exper + β_3 educ·exper + β_4 educ^2 = E[Wage|educ, exper, educ^2, educ·exper]. Thus it is redundant to also condition on educ^2 and educ·exper.

Proof of (ii): Put g(X, Z) = E[Y|X, Z] and f(Z) = E[g(X, Z)|Z]. We want to show that f(Z) = E[Y|Z], which is equivalent to E[f(Z) h(Z)] = E[Y h(Z)] for all h. Now

    E[f(Z) h(Z)] = E[g(X, Z) h(Z)]   by Theorem 2.3
                 = E[Y h(Z)]         by Theorem 2.3.

Conditional expectations have properties similar to those of unconditional expectations.

Theorem 2.5.
(i) E[a_1 Y_1 + a_2 Y_2 |X] = a_1 E[Y_1|X] + a_2 E[Y_2|X] a.s.
(ii) Y_1 ≤ Y_2 ⇒ E[Y_1|X] ≤ E[Y_2|X] a.s.
(iii) (E[XY|Z])^2 ≤ E[X^2|Z] E[Y^2|Z] a.s. (Cauchy-Schwarz inequality)
(iv) ϕ : R → R convex, E(|ϕ(Y)|) < +∞ ⇒ ϕ(E[Y|X]) ≤ E[ϕ(Y)|X] (Jensen's inequality)
(v) 0 ≤ X_n ↑ X, E[X] < +∞ ⇒ E[X_n|Z] ↑ E[X|Z]
(vi) P(|Y| ≥ ε | X) ≤ E[|Y|^2/ε^2 | X] (Chebyshev's inequality)

Relation between independence and conditional distribution:

Theorem 2.6. X, Y independent ⇔ P^{Y|X} = P^Y a.s. ⇒ E[Y|X] = E[Y] a.s. ("mean independence")

Proof: X, Y independent
    ⇔ E[f(Y) h(X)] = E[f(Y)] E[h(X)] for all f, h
    ⇔ E[f(Y)|X] = E[f(Y)] a.s.   by Theorem 2.3
    ⇔ P^{Y|X} = P^Y.

2.4. Conditional Variance

    Var[Y|X] = E[(Y − E[Y|X])^2 |X]

Theorem 2.7 (Properties of conditional variances).
(i) Var[a(X) Y + b(X)|X] = a(X)^2 Var[Y|X]
(ii) Var(Y) = E[Var(Y|X)] + Var(E[Y|X])
(iii) E[Var(Y|X)] ≥ E[Var(Y|X, Z)]

Application of (iii): Given Y, X, Z, consider Model 1: Y = m(X) + ε and Model 2: Y = m*(X, Z) + η with m(X) = E[Y|X] and m*(X, Z) = E[Y|X, Z]. Then E[ε^2] ≥ E[η^2]. Note that E[Var(Y|X)] = E[E[ε^2|X]] = E[ε^2] and E[Var(Y|X, Z)] = E[E[η^2|X, Z]] = E[η^2]. Conditioning on further variables reduces the average conditional variance.
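Both the law of iterated expectations (Theorem 2.4(i)) and the variance decomposition in Theorem 2.7(ii) are easy to illustrate by simulation. The following is a minimal sketch under an assumed model that is not taken from the notes: Y = X^2 + ε with X ~ N(0, 1) and ε ~ N(0, 0.25) independent, so that E[Y|X] = X^2 and Var(Y|X) = 0.25 are known in closed form.

```python
# Simulation sketch of E[E[Y|X]] = E[Y] and Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).
# Assumed model: Y = X^2 + eps, X ~ N(0,1), eps ~ N(0, 0.5^2) independent,
# so E[Y|X] = X^2 and Var(Y|X) = 0.25 exactly.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.standard_normal(n)
eps = 0.5 * rng.standard_normal(n)
Y = X**2 + eps

cond_mean = X**2                 # E[Y|X], known in closed form here
cond_var = 0.25                  # Var(Y|X), constant in this model

print(Y.mean(), cond_mean.mean())               # LIE: both ≈ 1
print(Y.var(), cond_var + cond_mean.var())      # decomposition: both ≈ 2.25
```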
2.5. Regression Models

    Y = m_0(X) + ε

(a) E[ε|X] = 0 a.s. ⇔ m_0(X) = E[Y|X] a.s.
    ◦ Mean independence of X and ε implies E[ε] = E[E[ε|X]] = 0 (Theorem 2.4(i): LIE)
    ◦ Orthogonality of X and ε: E[ε m_0(X)] = 0 and E[εX] = 0 (Theorem 2.3)
    ◦ Conditional variance of ε: Var[ε|X] = E[ε^2|X] = E[(Y − m_0(X))^2|X] = Var(Y|X)
(b) P^{ε|X} = P^ε, i.e. X and ε independent ⇔ E[f(ε)|X] = E[f(ε)] for all functions f
(c) Independence implies Var[ε|X] = E[ε^2|X] = E[ε^2]: homoscedastic errors. (If Var[ε|X] = σ^2(X) ≠ const: "heteroscedastic errors", but then X and ε are not independent.)

Linear regression model: Y = X'θ_0 + ε, where θ_0 minimizes E[(Y − X'θ)^2], i.e.

    E[Y^2] − 2θ' E[XY] + θ' E[XX'] θ = min_θ !

Hence

    0 = ∂/∂θ (···) = −2 E[XY] + 2 E[XX'] θ_0
    ⇒ E[X(Y − X'θ_0)] = 0 ⇔ E[Xε] = 0,   with ε = Y − X'θ_0,
    ⇒ θ_0 = (E[XX'])^{-1} E[XY]   if the inverse exists.

Sample analogue: given observations (X_i, Y_i), i = 1, …, n,

    θ̂ = ( (1/n) Σ_{i=1}^{n} X_i X_i' )^{-1} (1/n) Σ_{i=1}^{n} X_i Y_i
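The sample analogue is straightforward to compute. The following is a minimal sketch; the data-generating process (true coefficients, standard normal regressors and errors) is assumed purely for illustration and is not part of the notes.

```python
# Sample analogue of theta_0 = (E[XX'])^{-1} E[XY]:
# theta_hat = ( (1/n) Σ X_i X_i' )^{-1} (1/n) Σ X_i Y_i
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
theta0 = np.array([1.0, 2.0, -0.5])          # assumed true coefficients

X = np.column_stack([np.ones(n),             # constant regressor
                     rng.standard_normal(n),
                     rng.standard_normal(n)])
eps = rng.standard_normal(n)                 # E[eps | X] = 0 by construction
Y = X @ theta0 + eps

Sxx = (X.T @ X) / n                          # (1/n) Σ X_i X_i'
Sxy = (X.T @ Y) / n                          # (1/n) Σ X_i Y_i
theta_hat = np.linalg.solve(Sxx, Sxy)        # solves Sxx θ = Sxy without forming the inverse
print(theta_hat)                             # ≈ [1.0, 2.0, -0.5]
```

Solving the linear system instead of inverting Sxx explicitly is numerically preferable; as in the population formula, the computation requires the sample second-moment matrix to be invertible, i.e. no perfect multicollinearity among the regressors.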