2. Conditional Expectations, Conditional Distributions, Regression Models
2.1. Definition of Conditional Expectations
Starting point: (X, Y)
Y an R-valued random variable,
X an R^k-valued random variable,
both defined on the same probability space (S, F, P(·)).
Regression problem: How much of the random fluctuations of Y can be explained by X?
Find a function g : R^k → R that minimizes
E[{Y − g(X)}^2]    (1)
Definition 2.1. Each function g that minimizes (1) is called the conditional expectation of Y
given X.
Usual notation:
E[Y |X] = g(X), E[Y |X = x] = g(x)
E[Y |X] is a random variable but E[Y |X = x] is a real number
Spanos (1999):
Conditioning can only be defined relative to an event. Therefore, the expression E[Y |X] does not make sense. However, the conditional expectation is well defined for each value X = x: the expectation of Y given that the event {s : X(s) = x} has occurred.
He suggests turning the conditioning random variable into a set of events: σ(X), the σ-field generated by X. Hence, we write E[Y |σ(X)]. This makes transparent that E[Y |σ(X)] is a function of X and hence a random variable relative to σ(X), i.e. E[Y |σ(X)] is F_X = σ(X) measurable.
Furthermore, we can generalize the conditional expectation and condition on a sub-σ-field D ⊆ F: E[Y |D]. Extreme choices for D are D_0 := {S, ∅}, the non-informative sub-σ-field, and D_Y := σ(Y), the all-informative (for Y) sub-σ-field, with E[Y |D_0] = E[Y] and E[Y |D_Y] = Y, respectively.
Remarks:
(1) Definition 2.1 generalizes the definition of E[Y]: µ = E[Y] minimizes E[(Y − µ)^2]
(2) The conditional expectation minimizes the mean square forecast error (MSFE), i.e. the CE provides the best mean square error forecast
(3) The statement E[Y |X = x] is problematic if P(X = x) = 0
The existence of the conditional expectation is proved by the so-called Radon-Nikodym Theorem. (Almost sure) uniqueness is assured by
Theorem 2.2. For two minimizers g_1, g_2 of (1) it holds that g_1(X) = g_2(X) a.s.
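The minimizing property in Definition 2.1 and Remark (2) can be illustrated numerically. The following Python sketch is not part of the notes; it assumes a hypothetical data-generating process Y = sin(X) + u with E[Y |X] = sin(X) by construction, and compares Monte Carlo estimates of E[{Y − h(X)}^2] for the conditional mean and two other candidate functions h.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(0.0, 2.0 * np.pi, size=n)
Y = np.sin(X) + rng.normal(0.0, 0.5, size=n)      # E[Y|X] = sin(X) by construction

def mse(h):
    """Monte Carlo estimate of E[{Y - h(X)}^2]."""
    return np.mean((Y - h(X)) ** 2)

print("conditional mean g(X) = sin(X):", mse(np.sin))                       # ≈ 0.25 = Var(u)
print("constant predictor h(X) = E[Y]:", mse(lambda x: np.full_like(x, Y.mean())))
print("a linear predictor h(X) = 1 - 0.3x:", mse(lambda x: 1.0 - 0.3 * x))

No other function of X attains a smaller mean square error than the conditional mean (up to simulation noise).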
2.2. Conditional Distributions
Special Case 1: X, Y have joint density f (x, y).
Then
E[{Y − g(X)}^2] = ∫∫ {y − g(x)}^2 f(x, y) dx dy = min!

⇔ ∫ {y − g(x)}^2 f(x, y) dy = min_{g(x)}!  for all x

⇒ g(x) = ∫ y f(x, y) dy / ∫ f(x, y) dy = ∫ y f(x, y) dy / f_x(x)

⇒ E[Y |X = x] = ∫ y f(x, y) dy / ∫ f(x, y) dy = ∫ y f(x, y) dy / f_x(x)

Hence E[Y |X = x] is the mean of the distribution F(y|X = x) = ∫_{−∞}^{y} f(x, u) du / f_x(x), which is called the conditional distribution of Y given X = x.
Alternative notation for F(y|X = x): P^{Y |X=x}
Accordingly, f(y|X = x) = f(y|x) = f(x, y) / ∫ f(x, u) du = f(x, y) / f_x(x) is the conditional density of Y given X = x.
g(X) = E[Y |X] = ∫ y f(y|X) dy is a random variable. Hence the conditional distribution of Y given X, i.e. F(y|X) = ∫_{−∞}^{y} f(X, u) du / f_x(X), is a "random distribution" with density f(y|X) = f(X, y) / f_x(X).
Alternative notation for F(y|X): P^{Y |X}
General way of obtaining P^{Y |X}:
Choose P^{Y |X}(A) = h(X), where h minimizes E[{1_{[Y ∈ A]} − h(X)}^2].
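For Special Case 1 the formula E[Y |X = x] = ∫ y f(x, y) dy / f_x(x) can be evaluated numerically. A minimal sketch, assuming the hypothetical joint density f(x, y) = x + y on [0, 1]^2 (chosen only because its conditional mean has the closed form (3x + 2)/(6x + 3)):

import numpy as np

def f(x, y):                                  # hypothetical joint density on [0,1]^2
    return x + y

y = np.linspace(0.0, 1.0, 100_001)
dy = y[1] - y[0]

for x in (0.2, 0.5, 0.9):
    fx = np.sum(f(x, y)) * dy                 # f_x(x) = ∫ f(x,y) dy
    num = np.sum(y * f(x, y)) * dy            # ∫ y f(x,y) dy
    print(x, num / fx, (3 * x + 2) / (6 * x + 3))   # Riemann-sum value vs closed form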
Special Case 2: X, Y discrete

P^{Y |X=x}(A) = P^{Y,X}(A × {x}) / P^X({x}) = P(X = x, Y ∈ A) / P(X = x)

if P(X = x) > 0. Otherwise, define P^{Y |X=x}(A) as you like.
Special Case 3: X discrete, Y continuous
P^{Y |X=x} defined as for Special Case 2.

P^{Y,X}(A × B) = ∫_A Σ_{x ∈ B} f(x, y) dy
Special Case 4: X continuous, Y discrete

P^{Y,X}(A × B) = ∫_B Σ_{y ∈ A} f(x, y) dx,    then

P^{Y |X=x}(A) = Σ_{y ∈ A} f(x, y) / f_x(x) = Σ_{y ∈ A} f(x, y) / Σ_u f(x, u)
Intuitive interpretation:
E[Y |X = x] is the mean in a stochastic model where X is nonrandom and equal to x ∈ R^k.
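A toy numerical sketch of Special Case 2 (the joint pmf below is hypothetical): the conditional distribution of a discrete Y given X = x is the row of the joint pmf at X = x, renormalized by P(X = x).

import numpy as np

x_vals = np.array([0, 1])
y_vals = np.array([0, 1, 2])
# joint pmf: p[i, j] = P(X = x_vals[i], Y = y_vals[j]); entries sum to 1
p = np.array([[0.10, 0.20, 0.10],
              [0.15, 0.25, 0.20]])

for i, x in enumerate(x_vals):
    p_x = p[i].sum()                          # P(X = x)
    cond = p[i] / p_x                         # conditional pmf P(Y = y | X = x)
    print(f"X = {x}: P(Y=y|X=x) = {cond}, E[Y|X=x] = {cond @ y_vals:.3f}")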
2.3. Properties of Conditional Expectation
Theorem 2.3. g(X) = E[Y |X] ⇔ E[{Y − g(X)}h(X)] = 0 for all Borel measurable functions h.
Remarks:
(1) Theorem 2.3 provides the justification to interpret regression models as an orthogonal decomposition of the dependent variable Y: Y = E[Y |X] + ε, where ε = Y − E[Y |X] with E[εh(X)] = 0. Hence, the error is orthogonal to all Borel measurable functions of X.
(2) E[Y |X] is the orthogonal projection of Y onto F_2 = {functions of X}
(3) Theorem 2.3 implies E[h(X)Y] = E[g(X)h(X)]
Example 1
E[Y 1_{I_1}] = E[g(X) 1_{I_1}] or E[Y 1_{I_2}] = E[g(X) 1_{I_2}], where 1_{I_1} and 1_{I_2} are indicators of sets I_1, I_2 that partition the outcome set.
Example 2
X discrete:
E[Y 1_{(X=x)}] = E[g(X) 1_{(X=x)}] = P(X = x) g(x)  ⇒  g(x) = E[Y 1_{(X=x)}] / P(X = x)
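Theorem 2.3 and Example 2 can be checked in a small simulation (the data-generating process is an assumption made for illustration): with discrete X, the sample analogue of g(x) = E[Y 1_{(X=x)}]/P(X = x) is the group mean of Y, and the in-sample residual Y − g(X) is orthogonal to any function h(X).

import numpy as np

rng = np.random.default_rng(1)
n = 100_000
X = rng.integers(0, 3, size=n)                       # X takes values 0, 1, 2
Y = 2.0 * X + rng.normal(size=n)

# Example 2: g(x) = E[Y 1_{X=x}] / P(X = x); sample analogue = group mean
g = np.array([Y[X == x].mean() for x in range(3)])
resid = Y - g[X]

for h in (lambda x: x, lambda x: x ** 2, np.cos):    # arbitrary measurable functions of X
    print("E[{Y - g(X)} h(X)] ≈", np.mean(resid * h(X)))   # ≈ 0 (Theorem 2.3)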
Proof: "⇒"
g minimizes E[{Y − g(X)}^2]

⇒ (∂/∂a) E[{Y − g(X) − a h(X)}^2] |_{a=0} = 0

⇒ (∂/∂a) { E[{Y − g(X)}^2] − 2a E[{Y − g(X)}h(X)] + a^2 E[h^2(X)] } |_{a=0} = 0

⇒ E[{Y − g(X)}h(X)] = 0
"⇐"
Assume E[{Y − g(X)}h(X)] = 0 ∀h and that g does not minimize E[{Y − g(X)}^2].
Then ∃ g* with

E[{Y − g*(X)}^2] < E[{Y − g(X)}^2]

⇒ E[{Y − g*(X)}^2 − {Y − g(X)}^2] < 0

⇒ E[−2(g*(X) − g(X)) Y] + E[g*(X)^2 − g(X)^2] < 0, with h(X) := g*(X) − g(X)

⇒ E[−2(g*(X) − g(X)) g(X)] + E[g*(X)^2] − E[g(X)^2] < 0  (since E[Y h(X)] = E[g(X)h(X)])

⇒ E[{g*(X) − g(X)}^2] < 0, although E[{g*(X) − g(X)}^2] ≥ 0.

Contradiction: hence, g is the minimizer.
Application of Theorem 2.3:
Theorem 2.4. Iterated Expectations
(i) E(E[Y |X]) = E[Y ]
(ii) E[E[Y |X, Z]|Z] = E[Y |Z] a.s.
(iii) E[E[Y |X]|X, Z] = E[Y |X] a.s.
(i)-(iii) represent versions of the LIE: the law of iterated expectations
(iv) "Taking out what is known" property: E[Y f(X)|X] = f(X) E[Y |X] a.s.
(v) E[E[Y |X]|f(X)] = E[Y |f(X)] a.s.
(vi) E[Y |X, Z, X^2, X · Z] = E[Y |X, Z] a.s.
Examples
ad (ii, v): Model Y = m(X_1) + ε with E[ε|X_1, X_2] = 0 ⇒ Y = m(X_1) + ε with E[ε|X_1] = 0, because E[E(ε|X_1, X_2)|X_1] = E[0|X_1] = 0 = E[ε|X_1], or E[E(ε|X)|f(X)] = E[0|f(X)] = 0 = E[ε|f(X)] with X = (X_1, X_2) and f(X) = X_1.
ad (vi): E[Wage|education, experience]
= β_0 + β_1 educ + β_2 exper + β_3 educ · exper + β_4 educ^2
= E[Wage|educ, exper, educ^2, educ · exper]
Thus, it is redundant to also condition on educ^2 and educ · exper.
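The law of iterated expectations can also be verified by simulation. A sketch under an assumed discrete design (X, Z ∈ {0, 1}); with sample cell means used as conditional expectations, (i) and (ii) hold exactly in the sample.

import numpy as np

rng = np.random.default_rng(2)
n = 50_000
X = rng.integers(0, 2, size=n)
Z = rng.integers(0, 2, size=n)
Y = 1.0 + X + 2.0 * Z + X * Z + rng.normal(size=n)

# inner expectation E[Y|X, Z], evaluated at each observation (cell means)
E_Y_given_XZ = np.zeros(n)
for x in (0, 1):
    for z in (0, 1):
        cell = (X == x) & (Z == z)
        E_Y_given_XZ[cell] = Y[cell].mean()

print(E_Y_given_XZ.mean(), Y.mean())            # (i)  E[E[Y|X,Z]] = E[Y]
for z in (0, 1):                                # (ii) E[E[Y|X,Z]|Z] = E[Y|Z]
    m = Z == z
    print(z, E_Y_given_XZ[m].mean(), Y[m].mean())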
Proof of (ii):
Put g(X, Z) = E[Y |X, Z] and f (Z) = E[g(X, Z)|Z]
Want to show: f (Z) = E[Y |Z]
This is equivalent to E[f (Z)h(Z)] = E[Y h(Z)] for all h
Now
E[f(Z)h(Z)] = E[g(X, Z)h(Z)]   by Theorem 2.3
            = E[Y h(Z)]        by Theorem 2.3
Conditional expectations have properties similar to those of unconditional expectations.
Theorem 2.5.
(i) E[a_1 Y_1 + a_2 Y_2 |X] = a_1 E[Y_1 |X] + a_2 E[Y_2 |X] a.s.
(ii) Y_1 ≤ Y_2 ⇒ E[Y_1 |X] ≤ E[Y_2 |X] a.s.
(iii) (E[XY |Z])^2 ≤ E[X^2 |Z] E[Y^2 |Z] a.s. (Cauchy-Schwarz inequality)
(iv) ϕ : R → R is convex, E(|ϕ(Y)|) < +∞ ⇒ ϕ(E[Y |X]) ≤ E[ϕ(Y)|X] (Jensen's inequality)
(v) 0 ≤ X_n ↑ X, E[X] < +∞ ⇒ E[X_n |Z] ↑ E[X|Z]
(vi) P(|Y| ≥ ε | X) ≤ E[ |Y|^2 / ε^2 | X ] (Chebyshev's inequality)
Relation of independence and conditional distribution
Theorem 2.6.
X, Y independent ⇔ P^{Y |X} = P^Y a.s. ⇒ E[Y |X] = E[Y] a.s. ("mean independence")
Proof: X, Y independent
⇔ E[f(Y)h(X)] = E[f(Y)] E[h(X)] ∀f, h
⇔ E[f(Y)|X] = E[f(Y)] a.s. by Theorem 2.3
⇔ P^{Y |X} = P^Y
2.4. Conditional Variance
Var[Y |X] = E[(Y − E[Y |X])^2 |X]
Theorem 2.7. Properties of conditional variances:
(i) Var[a(X)Y + b(X)|X] = a(X)^2 Var[Y |X]
(ii) Var(Y ) = E[Var(Y |X)] + Var(E[Y |X])
(iii) E[Var(Y |X)] ≥ E[Var(Y |X, Z)]
Application of (iii):
Given Y, X, Z,
Model 1: Y = m(X) + ε and Model 2: Y = m*(X, Z) + η
with m(X) = E[Y |X], m*(X, Z) = E[Y |X, Z], then E[ε^2] ≥ E[η^2].
Note that E[Var(Y |X)] = E[E[ε^2 |X]] = E[ε^2] and E[Var(Y |X, Z)] = E[E[η^2 |X, Z]] = E[η^2].
Conditioning on further variables reduces the average conditional variance.
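A numerical sketch of the variance decomposition in Theorem 2.7 (ii); the discrete data-generating process is an assumption for illustration. With group proportions and group means/variances as sample analogues, the decomposition holds exactly in the sample.

import numpy as np

rng = np.random.default_rng(3)
n = 200_000
X = rng.integers(0, 4, size=n)
Y = X ** 2 + (1.0 + X) * rng.normal(size=n)            # conditional variance grows with X

cond_mean = np.array([Y[X == x].mean() for x in range(4)])   # E[Y|X=x]
cond_var = np.array([Y[X == x].var() for x in range(4)])     # Var(Y|X=x)
p = np.array([(X == x).mean() for x in range(4)])            # P(X = x)

lhs = Y.var()                                                 # Var(Y)
rhs = p @ cond_var + p @ (cond_mean - p @ cond_mean) ** 2     # E[Var(Y|X)] + Var(E[Y|X])
print(lhs, rhs)                                               # identical up to rounding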
2.5. Regression Models
Y = m_0(X) + ε
(a) with E[ε|X] = 0 a.s. ⇔ m_0(X) = E[Y |X] a.s.
o Mean independence of X and ε implies E[ε] = E[E[ε|X]] = 0 (Theorem 2.4(i): LIE)
o Orthogonality of X and ε: E[ε m_0(X)] = E[εX] = 0 (Theorem 2.3)
o Conditional variance of ε: Var[ε|X] = E[ε^2 |X] = E[(Y − m_0(X))^2 |X] = Var(Y |X)
(b) with P^{ε|X} = P^ε; X and ε independent ⇔ E[f(ε)|X] = E[f(ε)] for all functions f
(c) Independence implies Var[ε|X] = E[ε^2 |X] = E[ε^2]: homoscedastic errors
(if Var[ε|X] = σ^2(X) ≠ const: "heteroscedastic errors", but then X and ε are not independent)
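A brief sketch contrasting (a) and (c) under an assumed design: with ε = X·u and u standard normal independent of X, mean independence E[ε|X] = 0 holds, but ε and X are not independent and Var[ε|X] = X^2 varies with X (heteroscedastic errors).

import numpy as np

rng = np.random.default_rng(4)
n = 200_000
X = rng.uniform(0.5, 2.0, size=n)
eps = X * rng.normal(size=n)                  # mean-independent of X, but heteroscedastic

for lo, hi in ((0.5, 1.0), (1.0, 1.5), (1.5, 2.0)):
    m = (X >= lo) & (X < hi)
    print(f"X in [{lo},{hi}): E[eps|X] ≈ {eps[m].mean():+.3f}, Var[eps|X] ≈ {eps[m].var():.3f}")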
Linear regression model
Y = X'θ_0 + ε, where E[(Y − X'θ_0)^2] = min_θ!
Then, E[Y^2] − 2θ_0' E[XY] + θ_0' E[XX']θ_0 = min_θ!
Hence, 0 = ∂/∂θ (···) = −2E[XY] + 2E[XX']θ_0

⇒ E[X(Y − X'θ_0)] = 0 ⇔ E[Xε] = 0, where ε = Y − X'θ_0

⇒ θ_0 = [E[XX']]^{−1} E[XY] if the inverse exists

Sample analogue based on x_i, y_i:

θ̂_0 = ( (1/n) Σ_{i=1}^{n} x_i x_i' )^{−1} (1/n) Σ_{i=1}^{n} x_i y_i