Joint Distributions

Lecture 5
0909.400.01 / 0909.400.02
Dr. P.’s Clinic Consultant Module in
Probability & Statistics in Engineering
Today in P&S
• Dealing with multiple random variables at the same time:
  – Jointly distributed random variables
  – Independent random variables
  – Expected values
  – Covariance and correlation of joint random variables
Jointly Distributed Random Variables
• Often, particularly in engineering, we are interested in the joint behavior of two or more random variables:
  – The effect of process technology and layout density on the defect rate of a chip
  – The effect of SAT scores, GPA, and curricular involvement in estimating potential success
  – The effect of temperature and humidity on a chemical reaction, etc.
  – Sometimes we are also interested in the behavior of the sum/difference or combined effect of random variables on the outcome of an experiment
• In such cases we need to know how these multiple variables behave jointly, as described by their joint probability mass/density function
  – These functions assign a probability value to pairs/triplets/quadruplets of random variables
  – The probability that a student randomly selected from a particular high school has an SAT score of 1400 and a GPA of 3.7 → you would expect a high pmf value
    · How about an SAT score of 1570 and a GPA of 2.3, or an SAT score of 1190 and a GPA of 3.97?
  – The probability that a chip manufactured with a 0.05 μ (new) technology at a density of 1000 devices/cm² (very low) is defective?
    · How about a 0.5 μ (old) technology at a density of 100,000 devices/cm² (very high)?
  – The probability that a patient will have a heart attack in the next 5 years given a HR of 110 bpm, a cholesterol level of 240, a BP of 140/100, and an age of 65 (four random variables)?
• All these and countless similar random events are governed by joint probability distributions.
Two Variables, Discrete Case
• Let X and Y be two discrete r.v.’s defined on the sample space of an experiment. The joint probability mass function p(x, y) is defined for each pair of numbers (x, y) by

$$p_{XY}(x, y) = p(x, y) = P(X = x \text{ and } Y = y)$$

• Let A be a set consisting of pairs of (x, y) values; then

$$P\left[(X, Y) \in A\right] = \sum_{(x, y) \in A} p(x, y)$$
• Ex: Computers bought on this campus during Fall 2006:
  – X = chip speed: 2.6 or 3.2 GHz; Y = RAM: 256, 512, or 1024 MB

                256 MB   512 MB   1 GB
    2.6 GHz      0.20     0.10    0.20
    3.2 GHz      0.05     0.15    0.30

  (X, Y) pairs: (2.6, 256), …, (3.2, 1024)
  P(2.6, 512) = 0.10;  P(Y ≥ 512) = 0.10 + 0.15 + 0.20 + 0.30 = 0.75 ← from the marginal distribution of Y
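As a quick sanity check, here is a minimal Python sketch (the dictionary below just re-encodes the table above) that recovers both probabilities:

```python
# Joint pmf of (chip speed in GHz, RAM in MB) from the table above.
pmf = {
    (2.6, 256): 0.20, (2.6, 512): 0.10, (2.6, 1024): 0.20,
    (3.2, 256): 0.05, (3.2, 512): 0.15, (3.2, 1024): 0.30,
}

# P(X = 2.6, Y = 512) is a single table lookup.
print(pmf[(2.6, 512)])                                  # 0.10

# P[(X, Y) in A] sums p(x, y) over the event A = {y >= 512}.
print(sum(p for (x, y), p in pmf.items() if y >= 512))  # 0.75
```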
Marginal Distributions
• The distribution of an individual r.v. X or Y from a joint distribution p(x, y) is its marginal probability mass function, denoted pX(x) or pY(y)
  – To obtain a marginal distribution, we remove the other variable by “summing it out”:

$$p_X(x) = \sum_{y} p(x, y) \qquad\qquad p_Y(y) = \sum_{x} p(x, y)$$
                256 MB   512 MB   1 GB     pX(x) ↓
    2.6 GHz      0.20     0.10    0.20      0.5
    3.2 GHz      0.05     0.15    0.30      0.5
    pY(y) →      0.25     0.25    0.50

$$p_X(x) = \begin{cases} 0.5 & x = 2.6,\ 3.2\ \text{GHz} \\ 0 & \text{otherwise} \end{cases} \qquad\qquad p_Y(y) = \begin{cases} 0.25 & y = 256,\ 512\ \text{MB} \\ 0.5 & y = 1\ \text{GB} \\ 0 & \text{otherwise} \end{cases}$$

Each pX(x) entry is a sum over Y values (e.g., 0.2 + 0.1 + 0.2 = 0.5); each pY(y) entry is a sum over X values (e.g., 0.2 + 0.3 = 0.5). As with any pmf,

$$p(x, y) \geq 0 \qquad\qquad \sum_{x} \sum_{y} p(x, y) = 1$$
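A short sketch of “summing it out” in Python, using the same pmf dictionary as before:

```python
from collections import defaultdict

pmf = {
    (2.6, 256): 0.20, (2.6, 512): 0.10, (2.6, 1024): 0.20,
    (3.2, 256): 0.05, (3.2, 512): 0.15, (3.2, 1024): 0.30,
}

# Marginals: sum the joint pmf over the variable being removed.
p_X, p_Y = defaultdict(float), defaultdict(float)
for (x, y), p in pmf.items():
    p_X[x] += p   # summing over y
    p_Y[y] += p   # summing over x

print(dict(p_X))  # {2.6: 0.5, 3.2: 0.5}   (up to float rounding)
print(dict(p_Y))  # {256: 0.25, 512: 0.25, 1024: 0.5}

# The joint pmf (and hence each marginal) must sum to 1.
assert abs(sum(pmf.values()) - 1.0) < 1e-9
```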
Conditional Distributions
• Recall the definition of the conditional probability of two events A and B:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

• This is easily adapted to r.v.’s X and Y:

$$p_{Y|x}(y) = \frac{p_{XY}(x, y)}{p_X(x)},\ \ p_X(x) > 0 \qquad\qquad p_{X|y}(x) = \frac{p_{XY}(x, y)}{p_Y(y)},\ \ p_Y(y) > 0$$

Here pY|x(y) = P(Y = y | X = x) is the probability distribution (mass function) of the possible values of Y, given that X = x: the joint distribution of X and Y divided by the marginal distribution of the conditioning variable.
Conditional Distributions (Cont.)
• Properties of r.v.’s extend naturally to conditional r.v.’s as well:
  – Conditional mean, or expected value, of Y given X = x:

$$\mu_{Y|x} = E\left[Y \mid x\right] = \sum_{y} y \cdot p_{Y|x}(y)$$

  – And the conditional variance of Y given X = x:

$$\sigma^2_{Y|x} = V\left[Y \mid x\right] = \sum_{y} \left(y - \mu_{Y|x}\right)^2 \cdot p_{Y|x}(y) = \sum_{y} y^2 \cdot p_{Y|x}(y) - \mu^2_{Y|x}$$
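Continuing the earlier sketch: the conditional pmf of Y given X = x is just the x-row of the table renormalized by pX(x), and its mean and variance follow directly:

```python
pmf = {
    (2.6, 256): 0.20, (2.6, 512): 0.10, (2.6, 1024): 0.20,
    (3.2, 256): 0.05, (3.2, 512): 0.15, (3.2, 1024): 0.30,
}

def conditional_Y_given_x(pmf, x):
    # Renormalize the x-row of the joint table by the marginal p_X(x).
    p_x = sum(p for (xx, _), p in pmf.items() if xx == x)
    return {y: p / p_x for (xx, y), p in pmf.items() if xx == x}

p_Y_26 = conditional_Y_given_x(pmf, 2.6)   # {256: 0.4, 512: 0.2, 1024: 0.4}

# Conditional mean and variance of Y given X = 2.6 GHz.
mu = sum(y * p for y, p in p_Y_26.items())                 # 614.4 MB
var = sum(y**2 * p for y, p in p_Y_26.items()) - mu**2
print(mu, var)
```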
Joint Probability Density Functions
• For continuous r.v.’s, the definition of the joint pdf follows directly from that of a single pdf: the probability that the pair (x, y) lies in a certain region is the double integral giving the volume under the two-dimensional density surface over that region. That is,
  – Let X and Y be continuous r.v.’s. Then f(x, y) is a joint probability density function for X and Y if, for any two-dimensional set A,

$$P\left[(X, Y) \in A\right] = \iint_{A} f(x, y)\, dx\, dy$$

  – If the region over which the pdf is to be evaluated is rectangular, that is, if A = {(x, y) : a ≤ x ≤ b, c ≤ y ≤ d}, then

$$P\left[(X, Y) \in A\right] = \int_{a}^{b} \int_{c}^{d} f(x, y)\, dy\, dx$$
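Numerically, such a rectangle probability is a routine double integral; here is a minimal SciPy sketch, with a stand-in density (the example pdf used later in this lecture):

```python
from scipy.integrate import dblquad

# Stand-in joint pdf: f(x, y) = (6/5)(x + y^2) on the unit square, 0 elsewhere.
f = lambda x, y: 1.2 * (x + y**2)

# P(a <= X <= b, c <= Y <= d); dblquad expects func(y, x), y innermost.
prob, err = dblquad(lambda y, x: f(x, y), 0, 0.25, lambda x: 0, lambda x: 0.25)
print(prob)   # about 0.0109
```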
Continuous Joint Density

[Figure: a density surface f(x, y) over the (x, y) plane; for a shaded rectangle A, P[(X, Y) ∈ A] is the volume under the density surface above A.]

As with any pdf,

$$f(x, y) \geq 0 \qquad\qquad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy = 1$$
Marginal Probability Density Functions
• The marginal probability density functions of X and Y, denoted fX(x) and fY(y), are given by

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy \quad \text{for } -\infty < x < \infty$$

$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx \quad \text{for } -\infty < y < \infty$$

[Figure: the joint density surface f(x, y), with the marginal densities fX(x) and fY(y) shown along the x and y axes.]
Independent Random Variables
• Often the observation of one r.v. provides some information about the probability of another r.v. Consider our previous example:

                256 MB   512 MB   1 GB
    2.6 GHz      0.20     0.10    0.20
    3.2 GHz      0.05     0.15    0.30

  – The marginal probability of chip speed at x = 2.6 GHz is 0.5, and so is that of x = 3.2 GHz. That is, a randomly selected student is equally likely to buy a 2.6 GHz machine as a 3.2 GHz machine.
  – However, if we are told that a student just bought a machine with 256 MB of RAM, then x = 2.6 GHz (0.20) is four times as likely as x = 3.2 GHz (0.05).
  – Conversely, if a machine with 1 GB of RAM is bought, then it is 50% more likely to be a 3.2 GHz machine (0.30) than a 2.6 GHz machine (0.20).
• Clearly these two variables are not independent!
Independent Random Variables (Cont.)
• Recall that two random events are defined as independent if their joint probability equals the product of their individual probabilities, that is, if P(A ∩ B) = P(A)·P(B). For example, the individual outcomes of two rolled dice are independent.
• A similar definition can be given for random variables: two random variables X and Y are said to be independent if for every pair of x and y values

$$p(x, y) = p_X(x) \cdot p_Y(y)$$

when X and Y are discrete, or

$$f(x, y) = f_X(x) \cdot f_Y(y)$$

when X and Y are continuous. If the conditions are not satisfied for all (x, y), then X and Y are dependent. That is, two r.v.’s are independent if their joint distribution equals the product of their marginal distributions.
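For a finite table, independence can be checked exhaustively; a sketch using the computer-purchase pmf from before:

```python
pmf = {
    (2.6, 256): 0.20, (2.6, 512): 0.10, (2.6, 1024): 0.20,
    (3.2, 256): 0.05, (3.2, 512): 0.15, (3.2, 1024): 0.30,
}
p_X = {x: sum(p for (xx, _), p in pmf.items() if xx == x) for x in (2.6, 3.2)}
p_Y = {y: sum(p for (_, yy), p in pmf.items() if yy == y) for y in (256, 512, 1024)}

# Independent  <=>  p(x, y) == p_X(x) * p_Y(y) for EVERY pair (x, y).
independent = all(abs(p - p_X[x] * p_Y[y]) < 1e-9 for (x, y), p in pmf.items())
print(independent)   # False: e.g. p(2.6, 256) = 0.20 but p_X(2.6)*p_Y(256) = 0.125
```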
How About More than Two Random Variables…?
• The concept of a joint distribution for two random variables extends easily to more than two random variables:
  – If X1, X2, …, Xn are all discrete random variables, the joint pmf of the variables is the function

$$p(x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n)$$

  – If the variables are continuous, the joint pdf is the function f(·) such that for any n intervals [a1, b1], …, [an, bn],

$$P(a_1 \leq X_1 \leq b_1, \ldots, a_n \leq X_n \leq b_n) = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} f(x_1, \ldots, x_n)\, dx_n \cdots dx_1$$
Multinomial Distribution
• Recall the binomial experiment: an experiment with only two possible outcomes, repeated n times, giving x successes, led to the binomial distribution.
• Consider the following scenario:
  – An experiment consists of n independent and identical trials, in which each trial can result in any one of r possible outcomes (instead of 2). Let pi = P(outcome i occurs on any given trial), and define Xi = the number of trials resulting in outcome i, i = 1, …, r. ⇒ This is a multinomial experiment, and the joint pmf of X1, X2, …, Xr is called the multinomial distribution:

$$p(x_1, \ldots, x_r) = \begin{cases} \dfrac{n!}{(x_1!)(x_2!) \cdots (x_r!)}\, p_1^{x_1} p_2^{x_2} \cdots p_r^{x_r} & x_i = 0, 1, 2, \ldots, n;\ \ x_1 + x_2 + \cdots + x_r = n \\ 0 & \text{otherwise} \end{cases}$$
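SciPy ships this distribution directly; a quick sketch with made-up numbers (n = 10 trials, r = 3 outcomes with probabilities 0.2, 0.3, 0.5), checked against the formula above:

```python
from math import factorial
from scipy.stats import multinomial

n, p = 10, [0.2, 0.3, 0.5]
x = [2, 3, 5]   # a particular count vector; x_1 + x_2 + x_3 = n

# pmf straight from the formula above...
by_hand = (factorial(n) * p[0]**x[0] * p[1]**x[1] * p[2]**x[2]
           / (factorial(x[0]) * factorial(x[1]) * factorial(x[2])))

# ...and from scipy.stats.multinomial; the two agree.
print(by_hand, multinomial.pmf(x, n=n, p=p))   # both about 0.085
```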
Conditional Distributions (Continuous Case)
• Consider:
  – Suppose X and Y denote related parameters of two components of a computer, say the core temperature and the microprocessor speed. If we know that the cooling fan is running slow (the temperature is high), does this give us any additional information about the microprocessor speed?
  – Or, let X be the length of a fish and Y be the type of fish (sea bass, salmon). Given that the length of a fish is observed as x cm, does that give us any additional information on what the fish may be?
• Questions like these and countless others can be answered by continuous conditional probability distributions.
• Let X and Y be two continuous r.v.’s with joint pdf f(x, y) and marginal pdf of X, fX(x). Then for any x value for which fX(x) > 0, the conditional probability density function of Y given that X = x is

$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}, \qquad -\infty < y < \infty$$

If X and Y are discrete, replacing pdf’s by pmf’s gives the conditional probability mass function of Y when X = x.
Statistics of Joint Distributions
• Expected values and variances of the individual random variables in a joint distribution are computed much as in the single-variable case. For discrete r.v.’s:

$$\mu_X = \sum_{x}\sum_{y} x \cdot p(x, y) = \sum_{x} x \cdot p_X(x) \qquad\qquad \mu_Y = \sum_{x}\sum_{y} y \cdot p(x, y) = \sum_{y} y \cdot p_Y(y)$$

$$\sigma_X^2 = E\left[(X - \mu_X)^2\right] = \sum_{x}\sum_{y} (x - \mu_X)^2 \cdot p(x, y) = \sum_{x} (x - \mu_X)^2 \cdot p_X(x)$$

$$\sigma_Y^2 = E\left[(Y - \mu_Y)^2\right] = \sum_{x}\sum_{y} (y - \mu_Y)^2 \cdot p(x, y) = \sum_{y} (y - \mu_Y)^2 \cdot p_Y(y)$$

And for continuous r.v.’s:

$$\mu_X = \int_{x}\int_{y} x \cdot f(x, y)\, dy\, dx = \int_{x} x \cdot f_X(x)\, dx \qquad\qquad \mu_Y = \int_{x}\int_{y} y \cdot f(x, y)\, dx\, dy = \int_{y} y \cdot f_Y(y)\, dy$$

$$\sigma_X^2 = E\left[(X - \mu_X)^2\right] = \int_{x}\int_{y} (x - \mu_X)^2 \cdot f(x, y)\, dy\, dx = \int_{x} (x - \mu_X)^2 \cdot f_X(x)\, dx$$

$$\sigma_Y^2 = E\left[(Y - \mu_Y)^2\right] = \int_{x}\int_{y} (y - \mu_Y)^2 \cdot f(x, y)\, dx\, dy = \int_{y} (y - \mu_Y)^2 \cdot f_Y(y)\, dy$$
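For the discrete case these are one-liners over the joint table; a sketch with the computer-purchase pmf (GHz and GB units so the numbers stay small):

```python
pmf = {
    (2.6, 0.256): 0.20, (2.6, 0.512): 0.10, (2.6, 1.024): 0.20,
    (3.2, 0.256): 0.05, (3.2, 0.512): 0.15, (3.2, 1.024): 0.30,
}

mu_x = sum(x * p for (x, y), p in pmf.items())                # 2.9 GHz
mu_y = sum(y * p for (x, y), p in pmf.items())                # 0.704 GB
var_x = sum((x - mu_x)**2 * p for (x, y), p in pmf.items())   # 0.09
var_y = sum((y - mu_y)**2 * p for (x, y), p in pmf.items())   # about 0.111
print(mu_x, mu_y, var_x, var_y)
```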
An Example
• A university offers both on-line and on-campus courses. In any given semester, let X = the proportion of on-line classes that are full and Y = the proportion of on-campus classes that are full. Suppose that the joint pdf is given as

$$f(x, y) = \frac{6}{5}\left(x + y^2\right), \qquad 0 \leq x \leq 1,\ 0 \leq y \leq 1$$

  – First, it is easy to verify that this is indeed a proper distribution:

$$\int_0^1 \int_0^1 \frac{6}{5}\left(x + y^2\right) dy\, dx = 1$$

  – For example, the probability that neither type of class is more than 25% full is

$$P\left(0 \leq X \leq \tfrac{1}{4},\ 0 \leq Y \leq \tfrac{1}{4}\right) = \int_0^{1/4} \int_0^{1/4} \frac{6}{5}\left(x + y^2\right) dy\, dx = 0.0109$$
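Both claims are quick to verify symbolically; a minimal SymPy sketch:

```python
from sympy import symbols, integrate, Rational

x, y = symbols('x y')
f = Rational(6, 5) * (x + y**2)

# Total probability over the unit square must be 1.
print(integrate(f, (y, 0, 1), (x, 0, 1)))        # 1

# P(0 <= X <= 1/4, 0 <= Y <= 1/4)
q = Rational(1, 4)
print(integrate(f, (y, 0, q), (x, 0, q)))        # 7/640, about 0.0109
```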
Example (Cont.)
• The marginal density of X, describing the proportion of on-line classes that are full regardless of the on-campus classes, is then:

$$f_X(x) = \int_0^1 \frac{6}{5}\left(x + y^2\right) dy = \frac{6}{5}x + \frac{2}{5}, \qquad 0 \leq x \leq 1$$

  – And the marginal density of Y:

$$f_Y(y) = \int_0^1 \frac{6}{5}\left(x + y^2\right) dx = \frac{6}{5}y^2 + \frac{3}{5}, \qquad 0 \leq y \leq 1$$

  – We can then calculate such quantities as, say, the probability that on-campus courses are between 25% and 75% full:

$$P\left(\tfrac{1}{4} \leq Y \leq \tfrac{3}{4}\right) = \int_{1/4}^{3/4} \left(\frac{6}{5}y^2 + \frac{3}{5}\right) dy = 0.4625$$
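Continuing the SymPy sketch, the marginals and the interval probability:

```python
from sympy import symbols, integrate, Rational

x, y = symbols('x y')
f = Rational(6, 5) * (x + y**2)

fX = integrate(f, (y, 0, 1))   # 6*x/5 + 2/5
fY = integrate(f, (x, 0, 1))   # 6*y**2/5 + 3/5
print(fX, fY)

# P(1/4 <= Y <= 3/4) from the marginal of Y.
print(integrate(fY, (y, Rational(1, 4), Rational(3, 4))))   # 37/80 = 0.4625
```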
Example (Cont.)
• How about conditional probabilities? What is the distribution of the proportion of on-campus courses that are full, given that on-line courses are 80% full?

$$f_{Y|X}(y \mid 0.8) = \frac{f(0.8, y)}{f_X(0.8)} = \frac{1.2\left(0.8 + y^2\right)}{1.2(0.8) + 0.4} = \frac{1}{34}\left(24 + 30y^2\right), \qquad 0 \leq y \leq 1$$

  – Then we can calculate, say, the probability that on-campus courses are at most half full, given that on-line courses are 80% full:

$$P\left(Y \leq 0.5 \mid X = 0.8\right) = \int_{-\infty}^{0.5} f_{Y|X}(y \mid 0.8)\, dy = \int_0^{0.5} \frac{1}{34}\left(24 + 30y^2\right) dy = 0.390$$

  – Or we can even calculate the expected proportion of on-campus classes that are full, given that on-line classes are 80% full:

$$E\left[Y \mid X = 0.8\right] = \int_{-\infty}^{\infty} y \cdot f_{Y|X}(y \mid 0.8)\, dy = \int_0^1 \frac{y}{34}\left(24 + 30y^2\right) dy = 0.574$$
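And the conditional computations, once more in SymPy:

```python
from sympy import symbols, integrate, Rational, simplify

x, y = symbols('x y')
f = Rational(6, 5) * (x + y**2)
fX = integrate(f, (y, 0, 1))                           # marginal of X

# Conditional density of Y given X = 0.8 = 4/5.
fY_given = simplify((f / fX).subs(x, Rational(4, 5)))  # (24 + 30*y**2)/34

print(integrate(fY_given, (y, 0, Rational(1, 2))))     # 53/136, about 0.390
print(integrate(y * fY_given, (y, 0, 1)))              # 39/68, about 0.574
```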
Expected Values of Functions of Joint PDFs
• Recall that a function h(X) of a r.v. X is itself a random variable; however, we do not need to compute its pdf/pmf to obtain its expected value. We can simply compute the expectation as a p(x)- or f(x)-weighted sum or integral. Similar expressions can be given for joint distributions:
• Let X and Y be jointly distributed r.v.’s with pmf p(x, y) or pdf f(x, y), according to whether the variables are discrete or continuous. Then the expected value of a function h(X, Y), denoted E[h(X, Y)] or μh(X,Y), is

$$E\left[h(X, Y)\right] = \begin{cases} \displaystyle\sum_{x}\sum_{y} h(x, y) \cdot p(x, y) & \text{discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x, y) \cdot f(x, y)\, dx\, dy & \text{continuous} \end{cases}$$
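A sketch for the discrete case with a hypothetical h: say h(x, y) is the price of a machine at $200 per GHz plus $300 per GB (made-up figures):

```python
pmf = {
    (2.6, 0.256): 0.20, (2.6, 0.512): 0.10, (2.6, 1.024): 0.20,
    (3.2, 0.256): 0.05, (3.2, 0.512): 0.15, (3.2, 1.024): 0.30,
}

# A hypothetical function of (X, Y): machine price in dollars.
h = lambda x, y: 200 * x + 300 * y

# E[h(X, Y)] is the pmf-weighted sum of h over all (x, y) pairs.
Eh = sum(h(x, y) * p for (x, y), p in pmf.items())
print(Eh)   # 791.2, i.e. 200*mu_X + 300*mu_Y, since this h is linear
```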
Covariance
• If two variables are NOT independent, then we would like to know how strongly they are related, a quantity given by the covariance:

$$\sigma_{XY} = \mathrm{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right] = E\left[XY\right] - \mu_X \mu_Y = \begin{cases} \displaystyle\sum_{x}\sum_{y} (x - \mu_X)(y - \mu_Y) \cdot p(x, y) & \text{discrete} \\[2ex] \displaystyle\int_{x}\int_{y} (x - \mu_X)(y - \mu_Y) \cdot f(x, y)\, dx\, dy & \text{continuous} \end{cases}$$

[Figures: three scatter plots illustrating the sign of the covariance.]
  – Both X and Y increase or decrease together ⇒ positive correlation, σXY > 0
  – As X increases, Y decreases and vice versa ⇒ negative correlation, σXY < 0
  – X and Y do not seem to be correlated with each other ⇒ near-zero correlation, σXY ≈ 0
Covariance Matrix
• Note that we can also define σxx = σx², σyy = σy², and σxy = σyx, all of which can be represented with a single matrix, the covariance matrix, denoted by Σ:

$$\Sigma = \begin{bmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{yy} \end{bmatrix}$$

  – If X and Y are statistically independent ⇒ σxy = 0
  – If σxy = 0 ⇒ the variables are said to be uncorrelated
  – Note that statistical independence is a stronger property than uncorrelatedness: statistical independence ⇒ uncorrelated, but not conversely.
Example
• Back to our computer example:

    p(x, y)     256 MB   512 MB   1 GB
    2.6 GHz      0.20     0.10    0.20
    3.2 GHz      0.05     0.15    0.30

    X:     2.6 GHz  3.2 GHz          Y:     256 MB  512 MB  1 GB
    pX(x):   0.5      0.5            pY(y):  0.25    0.25   0.5

$$\mu_X = \sum_{x} x \cdot p_X(x) = 2.6(0.5) + 3.2(0.5) = 2.9\ \text{GHz}$$

$$\mu_Y = \sum_{y} y \cdot p_Y(y) = 256(0.25) + 512(0.25) + 1024(0.5) = 704\ \text{MB} = 0.704\ \text{GB}$$

$$\mathrm{Cov}(X, Y) = \sigma_{XY} = \sum_{x}\sum_{y} (x - \mu_X)(y - \mu_Y)\, p(x, y)$$
$$= (2.6 - 2.9)(0.256 - 0.704)(0.2) + \cdots + (3.2 - 2.9)(1.024 - 0.704)(0.3) = \ldots$$
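A short sketch completing the arithmetic (GHz and GB units):

```python
pmf = {
    (2.6, 0.256): 0.20, (2.6, 0.512): 0.10, (2.6, 1.024): 0.20,
    (3.2, 0.256): 0.05, (3.2, 0.512): 0.15, (3.2, 1.024): 0.30,
}
mu_x = sum(x * p for (x, y), p in pmf.items())   # 2.9 GHz
mu_y = sum(y * p for (x, y), p in pmf.items())   # 0.704 GB

# Cov(X, Y) two equivalent ways: by definition, and as E[XY] - mu_x*mu_y.
cov1 = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in pmf.items())
cov2 = sum(x * y * p for (x, y), p in pmf.items()) - mu_x * mu_y
print(cov1, cov2)   # both about 0.027 GHz*GB: positive, as expected
```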
Correlation Coefficient
• The problem with the covariance is that it depends on the absolute magnitudes of the numbers, more specifically on the units. For example, if we used units of MB instead of GB in calculating the covariance, the number would be about 1000 times larger.
• Clearly, the correlation between two variables should not depend on the units used, but simply on how the variables are related to each other. This is achieved by normalizing (scaling) the covariance, which yields the correlation coefficient:

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \qquad -1 \leq \rho_{XY} \leq 1$$

  – If ρ = 1, the variables move together, i.e., they are linearly related: Y = aX + b, a > 0
  – If ρ = −1, the variables are negatively correlated; one decreases as the other increases at the same rate. Again they are linearly related, with a negative slope: Y = aX + b, a < 0
  – If ρ = 0 ⇒ the variables are uncorrelated: the variation of one has no (linear) effect on the other
    · Note, however, that uncorrelated does not necessarily mean independent
  – For all practical purposes, if |ρ| < 0.05 the variables can be considered uncorrelated.
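Scaling the covariance from the previous sketch by the two standard deviations:

```python
pmf = {
    (2.6, 0.256): 0.20, (2.6, 0.512): 0.10, (2.6, 1.024): 0.20,
    (3.2, 0.256): 0.05, (3.2, 0.512): 0.15, (3.2, 1.024): 0.30,
}
mu_x = sum(x * p for (x, y), p in pmf.items())
mu_y = sum(y * p for (x, y), p in pmf.items())
cov  = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in pmf.items())
sd_x = sum((x - mu_x)**2 * p for (x, y), p in pmf.items()) ** 0.5
sd_y = sum((y - mu_y)**2 * p for (x, y), p in pmf.items()) ** 0.5

# rho is unit-free: restating Y in MB rescales cov and sd_y by the
# same factor, leaving rho unchanged.
print(cov / (sd_x * sd_y))   # about 0.27, a weak positive correlation
```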
Homework
• Problems from Chapter 5: 10, 20, 34, 38, 48*, 58*
• Bonus: 32, 78, 86*
  *: Requires additional sections of Chapter 5 from your book.
© 2006 All Rights Reserved, Robi Polikar, Rowan University, Dept. of Electrical and Computer Engineering