Probability and Statistics I
FIAS Summer School on Theoretical Neuroscience
and Complex Systems
Frankfurt, August 2-24, 2008
Constantin Rothkopf
[Figure from Kersten and Yuille]
Part 1: Basic Probability
• sample space and events
• basic probability laws
• conditional probability
• independence
• conditional independence
• Bayes’ rule
Sample Space & Events
The sample space consists of all possible outcomes of an
experiment:
• Flip a coin (discrete, binary) S={H,T}
• Roll a die (discrete) S={1,2,3,4,5,6}
• Measure the lifetime of an animal (continuous) S=[0,∞)
One can define an event as a set of outcomes:
• Flip a coin and observe heads E={H}
• Roll a die and observe an even number E={2,4,6}
• Measure the lifetime of an animal and observe a lifetime
smaller than 1 day E=[0,24h)
Probability defined on events
Probability Axioms:
1. probability of event A: P(A) is between zero and one: 0 ≤ P(A) ≤ 1
2. probability of certain event S is one: P(S)=1
3. if events A and B are mutually exclusive, then P(A v B) = P(A)+P(B) (“A
or B occurred”)
• One interpretation of probability: relative frequency. Repeat the
experiment N times and count the number of times N_A that event A
occurred:
P(A) = lim_{N→∞} N_A / N
• The other interpretation is that of a degree of uncertainty. (Some
experiments can only be done once!)
Example:
• die roll, S={1,2,3,4,5,6}. P(D=1) = P(D=2) = … = 1/6 “fair die”
• shorthand notation: P(D=1) = P(1)
Example: rolling two dice
• The sample space consists of all possible outcomes of the
experiment
• Probabilities of outcomes: [table of all 36 outcome pairs, from Ross (1997)]
Example: rolling two dice
There are N=36 different outcomes.
Assume all outcomes o are distinct and
have the same probability
(fair dice):
P(o) = 1/N = 1/36.
Question: consider the event A: “sum is less than 6”, that is
composed of NA=10 outcomes. What is A’s probability P(A) ?
Answer: (using the third axiom)
P(A) = ∑_{o∈A} P(o) = N_A · P(o) = N_A / N = 10/36 = 0.2777...
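A quick enumeration in Python (an illustrative sketch) reproduces this count:

    from itertools import product

    # all 36 equally likely outcomes of rolling two fair dice
    outcomes = list(product(range(1, 7), repeat=2))

    # event A: "sum is less than 6"
    A = [o for o in outcomes if sum(o) < 6]

    print(len(A), len(A) / len(outcomes))   # 10 outcomes, P(A) = 0.2777...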
Venn diagrams
figure from Pitman, 1999
The rectangle corresponds to the universal set; other events are represented as
ellipses, with their probability given by their area.
Elementary Rules
“¬” = not
“v” = or
“^” = and, P(A^B) = P(A,B)
P(A) + P(¬A) = 1
P(A v B) = P(A) + P(B) – P(A,B)
P(A) = P(A,B) + P(A, ¬B)
Sum Rule
Let Ai be a partition of all possible outcomes, i.e. the Ai are mutually
exclusive and their union is S. Then the probability of any event B can
be written as:
P(B) = ∑_i P(B, A_i)
Illustration with Venn diagram: [event B overlapping the cells A_i of the partition]
Conditional Probability - Product Rule
Idea: interaction of events, how does knowledge that one event
occurred change our belief that the other event also has occurred?
Definition: conditional probability (“probability of A given B”):
P(A|B) =P(A,B)/P(B) [or equivalently: P(A,B) = P(A|B)P(B)]
Product rule: P(A,B) = P(A|B) P(B)
Illustration with Venn diagram: [events A and B with their intersection A^B]
Chain Rule
Repeatedly apply the product rule, which follows directly from the
definition of conditional probability.
Product rule: P(A,B) = P(A|B) P(B)
Now apply this to:
P(A,B,C) = P(A,B|C) P(C)
= P(A|B,C) P(B|C) P(C)
and in general this leads to the
Chain rule:
P(X1, X2, …, XN) = P(X1)P(X2|X1)P(X3|X2,X1)… P(XN|XN-1,…,X1)
= P(XN)P(XN-1|XN)P(XN-2|XN,XN-1)…P(X1|XN, XN-1,…, X2)
Conditional Probability example
Consider the roll of two dice and the following two events:
Event A: D1 = 2
Event B: D1+D2 < 6
Question: what is P(A | B) ? Answer: P(A,B)/P(B) =
(3/36)/(10/36) = 3/10 = 0.3
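The same numbers can be checked by enumeration in Python (an illustrative sketch):

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))

    B = [(d1, d2) for (d1, d2) in outcomes if d1 + d2 < 6]   # 10 outcomes
    AB = [(d1, d2) for (d1, d2) in B if d1 == 2]             # A and B: 3 outcomes

    print(len(AB) / len(B))   # (3/36) / (10/36) = 0.3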
Independence of two Events
Definition: two events A, B are independent if P(A,B) = P(A)P(B),
equivalently: P(A|B) = P(A), or: P(B|A) = P(B).
Quiz: are these two “dice world“
events independent?
A = “sum is even”, B = “D2>3”
P(A) = 0.5
P(B) = 0.5
Answer: Yes, since P(A,B) = 9/36 = 0.25 = P(A)P(B)
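A small enumeration in Python (an illustrative sketch) confirms the answer:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))
    N = len(outcomes)

    A = [o for o in outcomes if sum(o) % 2 == 0]     # "sum is even"
    B = [o for o in outcomes if o[1] > 3]            # "D2 > 3"
    AB = [o for o in A if o in B]

    print(len(AB) / N, (len(A) / N) * (len(B) / N))  # 0.25 and 0.25 -> independent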
Conditional Independence
Definition: two events A,B are conditionally independent given
a third event C if P(A,B|C) = P(A|C)P(B|C).
Note: P(A,B|C) = P(A,B,C)/P(C)
Illustration with Venn diagram: [events A, B, C with their intersection A^B^C]
Conditional Independence example
Example:
A: “D1=2”, B: “D2>3”, C: “D1+D2=7”
are A and B conditionally independent given C?
P(A,B|C) = 1/6
P(A|C) = 1/6
P(B|C) = 1/2
Thus: P(A,B|C) ≠ P(A|C) P(B|C)
(no conditional independence)
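The conditional probabilities can also be checked by enumeration (illustrative Python sketch):

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))

    C = [o for o in outcomes if sum(o) == 7]                                 # 6 outcomes
    p_A_given_C = len([o for o in C if o[0] == 2]) / len(C)                  # 1/6
    p_B_given_C = len([o for o in C if o[1] > 3]) / len(C)                   # 1/2
    p_AB_given_C = len([o for o in C if o[0] == 2 and o[1] > 3]) / len(C)    # 1/6

    print(p_AB_given_C, p_A_given_C * p_B_given_C)   # 0.167 vs 0.083 -> not conditionally independent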
Bayes’ rule
Bayes’ Theorem [Bayes, 1763]:
P(A|B) = P(B|A) P(A) / P(B)
[Portrait: Reverend Thomas Bayes]
Proof:
We can decompose P(A,B) in two ways using conditional probabilities:
• P(A,B) = P(A|B)P(B), but also
• P(A,B) = P(B|A)P(A).
Hence we have: P(A|B)P(B) = P(B|A)P(A), from which Bayes’ rule follows
immediately.
Terminology
Bayes’ Theorem:
P(A|B) = P(B|A) P(A) / P(B)
where P(A|B) is the posterior probability, P(B|A) the likelihood, P(A) the prior probability, and P(B) the evidence.
Note: likelihood and prior probability are often much easier to measure than
the posterior probability, so it makes sense to express the latter as a function of
the former. We can also express the evidence P(B) as a function of likelihood
and prior using the law of total probability assuming a set of Ai forming a
partition:
P(B) = ∑_i P(B, A_i) = ∑_i P(B | A_i) P(A_i)
Thus:
P(A|B) = P(B|A) P(A) / ∑_i P(B | A_i) P(A_i)
Bayes’ rule example: blood test
Consider a blood test to detect a disease D:
the test can be positive ‘+’ or negative ‘-’.
For people having the disease, the test is positive in 95% of cases.
For people without the disease the test is positive in 2% of the cases.
Let’s assume we also know that 1% of the population has this disease.
Question: What is the probability that a person with a positive test result has
the disease?
P(+|D)=0.95, P(+|¬D)=0.02, P(D)=0.01, P(¬D)=0.99.
P(D|+) = P(+|D)P(D)/P(+) ; P(+)=P(+|D)P(D) + P(+|¬D)P(¬D)
= 0.95*0.01 / (0.95*0.01 + 0.02*0.99)
= 32% (lower than you may have expected)
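A quick numerical check in Python (an illustrative sketch, using the values from the slide):

    p_pos_given_D = 0.95      # P(+|D)
    p_pos_given_notD = 0.02   # P(+|not D)
    p_D = 0.01                # P(D)

    p_pos = p_pos_given_D * p_D + p_pos_given_notD * (1 - p_D)   # law of total probability
    print(p_pos_given_D * p_D / p_pos)                           # Bayes' rule: ~0.324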
Bayes’ rule in Perception
Example: consider the following vision problem. You have to recognize an
animal, the possible events being A (mouse, cat, dog, …), in an image I from
a possible set of images. Application of Bayes’ rule gives:
P(A = mouse | I = i) = P(I = i | A = mouse) P(A = mouse) / P(I = i)

Posterior P(A = mouse | I = i): is this a mouse in the image?
Likelihood P(I = i | A = mouse): what images are “generated” by a mouse and how probable is each of them?
Prior P(A = mouse): how frequently do I encounter a mouse?
Evidence P(I = i): how frequently do I encounter any possible image?
Bayes’ rule in decoding
Example: consider the following decoding problem. You have to recognize a
direction of motion, the event being S (dots moving to the left or to the right),
from a recording of neuronal firing rate, where the event is that the firing rate
R is 60Hz. Application of Bayes’ rule gives:
P(S = left | R = 60Hz) = P(R = 60Hz | S = left) P(S = left) / P(R = 60Hz)

Posterior P(S = left | R = 60Hz): what is the probability that the dots are moving left, if we measure a rate of 60Hz?
Likelihood P(R = 60Hz | S = left): if the dots are moving left, how likely is it to observe a rate of 60Hz?
Prior P(S = left): how likely is a motion direction to the left across trials?
Evidence P(R = 60Hz): how likely is it to observe a firing rate of 60Hz?
Part 2: Random Variables
• definition of random variables
• probability mass functions
• probability density function
• expected values, mean, and variance
• example distributions: binomial, exponential, Poisson, Gaussian
Random Variables
So far, we have considered events on sample spaces. But
often we are interested in functions of events, which leads
to the definition:
• A random variable is a real valued function defined on the
sample space.
• Examples:
X: the sum of two dice
Y: the square of the sum of two dice
Z: the number of rolls of two dice until the total sum is
larger than 10
Repeating the experiment over and over again yields a distribution. The
probability of a result expresses the chance of that result
happening.
From events to random variables
• The elementary probability rules hold with regard to
random variables
• Probability of an event E={xk} on the sample space
S={x1, x2, …, xi, …, xk, …xN}:
define random variable X as the outcome and then
P(xk) = P(X=xk)
• P(X=x |Y=y) = P( X=x , Y=y ) / P(Y=y)
Discrete vs. continuous
from Dayan & Abbott (2001)
Probability Mass function (pmf)
• A discrete random variable D has an associated probability
mass function (pmf) P, defined over the set of outcomes U:
P(D = a) = P(a), a ∈ U, with the following properties:
P(D = a) ≥ 0 ∀ a
∑_a P(D = a) = ∑_a P(a) = 1
Example:
die roll S = {1, 2, 3, 4, 5, 6}
What is the pmf of this
random variable?
[Plot: the uniform distribution P(a) = 1/6 for a = 1, …, 6]
The Binomial Distribution
Consider making a sequence of n binary experiments, e.g., coin tosses: Bernoulli
r.v. X~B(p). Assume: P(X=1) = p, P(X=0) = 1 – p = q .
What is the probability of getting a positive result k times?
Answer: the binomial distribution
P(n, k) = (n choose k) p^k q^(n−k), where
(n choose k) = n! / (k!(n−k)!)
is the number of k-element subsets of an n-element set (“n choose k”).
Example: consider a sequence of three fair coin tosses. What is the
probability of observing “heads” exactly once?
• count outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
• use formula: (3! / (1! x 2!)) x (0.5)^1 x (0.5)^2 = 6/(2x2x4) = 3/8
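A minimal binomial pmf in Python (an illustrative sketch) gives the same value:

    from math import comb

    def binomial_pmf(n, k, p=0.5):
        # probability of exactly k successes in n independent trials
        return comb(n, k) * p**k * (1 - p)**(n - k)

    print(binomial_pmf(3, 1))   # 0.375 = 3/8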
Poisson Distribution
• The Poisson distribution can be used if there’s a small
probability of “success” per unit time that does not depend
on previous successes. If the mean number of successes in
a certain time interval is given by µ, then the probability of
observing k successes in a time interval is given by:
P(k) = e^(−µ) µ^k / k!
• Note: The Poisson distribution is also a good
approximation for the binomial distribution if N is large and
p is small (rare successes). This is useful since the binomial
distribution is awkward to compute for large N and k.
Cortical neurons are (approx.) Poisson
Example: spiking of neurons over time (spike train) is often modeled as
a Poisson process: every millisecond there is a certain probability of the
neuron sending a spike independent of what happened earlier. (This
simple model ignores things like the refractory period.)
[Figure: example spike train over time]
Assume a neuron that fires on average 1 spike per 100ms. What is the
probability of observing exactly 3 spikes during a time interval of 250
ms?
P(k) = e^(−µ) µ^k / k!
If the neuron fires on average 1 spike per 100 ms, we expect µ = 2.5 spikes in 250 ms.
P(3) = e^(−2.5) (2.5)^3 / 3! ≈ 0.21
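A short Poisson pmf in Python (an illustrative sketch) reproduces this number:

    from math import exp, factorial

    def poisson_pmf(k, mu):
        # probability of observing k events when the expected count is mu
        return exp(-mu) * mu**k / factorial(k)

    # 1 spike per 100 ms on average -> mu = 2.5 expected spikes in 250 ms
    print(poisson_pmf(3, 2.5))   # ~0.21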
Probability Density function (pdf)
[Figure: probability density function (pdf), from D. Ballard]
Non-negativity and Normalization
The probability density is non-negative; it needs to integrate
to one.
P(A_i) ≥ 0        →   p(x) ≥ 0
∑_i P(A_i) = 1    →   ∫_{−∞}^{∞} p(x) dx = 1
Some made-up examples:
[Three sketched example densities p(x) as a function of x]
Example: uniform distribution
[Plot: uniform density p(x) = 0.5 on the interval [0, 2]]
What is the probability
that X = 1.3 exactly:
P(X = 1.3) = ?
Finite probability corresponds to the area under the pdf:
P(1 < X < 1.5) = ∫_1^1.5 p(x) dx = 0.25
Note: the density p(x) at a particular value x may exceed 1 (compare to a pmf),
while the probability of X taking any exact value, such as P(X = 1.3), is zero.
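A small numerical check in Python (an illustrative sketch; the uniform density 0.5 on [0, 2] is taken from the plot above):

    import numpy as np

    x = np.linspace(1.0, 1.5, 1001)
    p = 0.5 * np.ones_like(x)        # uniform density p(x) = 0.5 on [0, 2]
    print(np.trapz(p, x))            # area ~ 0.25 = P(1 < X < 1.5)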
Gaussian Distribution
The Gaussian is the most important continuous distribution:
p(x) = (1 / (√(2π) σ)) exp( −(x − µ)² / (2σ²) )
(figure from Duda et al., 2001)
Mean and variance of the Gaussian are µ and σ², respectively: X ~ N(µ, σ²)
Note: many distributions observed in nature look like this.
Note: Gaussian can also be used to approximate binomial distribution.
Note: The convolution of two Gaussians is again a Gaussian.
Gaussian and tuning curves
from Dayan & Abbott (2001), originally from Hubel & Wiesel, 1968
Approximating the binomial pmf
Example binomial
distributions for varying
numbers of fair coin tosses,
i.e. p=1/2
from Pitman (1999)
Exponential Distribution
• for x ≥ 0 with parameter µ (often λ = 1/µ):
p(x) = (1/µ) exp(−x/µ)
• mean: µ
• variance: µ²
• The exponential random variable X is ‘memoryless’, because if
it represents the lifetime of a process, then the distribution of
the remaining lifetime is independent of the time already
passed.
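Memorylessness can be seen in a quick simulation (an illustrative Python sketch; the mean µ = 2 and the times s, t are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = 2.0                                   # mean lifetime (illustrative choice)
    x = rng.exponential(scale=mu, size=1_000_000)

    s, t = 1.0, 1.5
    p_remaining = np.mean(x[x > s] > s + t)    # P(X > s+t | X > s), estimated
    p_fresh = np.mean(x > t)                   # P(X > t), estimated
    print(p_remaining, p_fresh)                # both ~exp(-t/mu) ~ 0.47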
Expected values
Consider a numerical random variable X with an associated pmf
P(X=x) = P(x). The expected value or mean is defined as:
E(X) = µ_X = ∑_i x_i P(X = x_i)
We can also consider the expected value of any function f(X) of a
random variable. This is defined as:
E(f(X)) = ∑_i f(x_i) P(X = x_i)
Note: expectation is a linear operation! We will use this when
calculating the mean of the binomial random variable.
Expected values for continuous
random variables
• Straightforward generalization with sums turning into
integrals. In the discrete case:
E(f(X)) = ∑_i f(x_i) P(X = x_i)
• Now the continuous case:
E[f(X)] ≡ ∫_{−∞}^{∞} f(x) p(x) dx
Mean, Variance, and higher moments
• Mean and variance are useful for “summarizing” the most
essential features of a pmf P(x). The mean is defined as
(see above):
E(X) = µ_X = ∑_i x_i P(X = x_i)
• The variance (about the mean) is defined as:
E((X − µ_X)²) = σ_X² = ∑_i (x_i − µ_X)² P(X = x_i)
• Higher order moments (about the mean) are defined
accordingly:
E((X − µ_X)³) = ∑_i (x_i − µ_X)³ P(X = x_i)   related to the skewness of a distribution
E((X − µ_X)⁴) = ∑_i (x_i − µ_X)⁴ P(X = x_i)   related to the kurtosis of a distribution (peakyness)
Calculating mean and variance
• Mean: µ = E[X] = ∑_i x_i P(x_i)   →   E[X] = ∫ x p(x) dx
• Variance: σ² = E[(X − µ)²] = ∑_i (x_i − µ)² P(x_i)   →   E[(X − µ)²] = ∫ (x − µ)² p(x) dx
• useful relation: σ² = E[X²] − (E[X])²
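The useful relation is easy to verify by Monte Carlo (an illustrative Python sketch; Poisson(4) is an arbitrary example distribution):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.poisson(lam=4.0, size=100_000)     # any distribution works; Poisson(4) is an arbitrary choice

    variance_direct = np.mean((x - x.mean())**2)
    variance_shortcut = np.mean(x**2) - x.mean()**2
    print(variance_direct, variance_shortcut)  # both ~4 (for a Poisson, variance = mean = lambda)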
Calculation for binomial distribution
• According to the definition of the expected value: E[X] = ∑_k k · P(n, k)
• But expectation is a linear operation. Each toss of the coin is
a Bernoulli experiment. Assuming independent tosses:
E[X] = E[X1 + X2 + X3 + … + Xn] = ∑_i E[Xi] = ∑_i p = np
Some examples …
• Bernoulli random variable with parameter 0 ≤ p ≤ 1:
  • mean: µ = p
  • variance: σ² = p(1−p)
• Binomial random variable with parameters 0 ≤ n, 0 ≤ p ≤ 1:
  • mean: µ = np
  • variance: σ² = np(1−p)
• Poisson random variable with parameter λ:
  • mean: µ = λ
  • variance: σ² = λ
• Exponential random variable with parameter λ:
  • mean: µ = 1/λ
  • variance: σ² = 1/λ²
• Gaussian random variable with parameters µ, σ²:
  • mean: µ
  • variance: σ²
Cumulative distribution function
pdf: p(x) = lim_{Δx→0} P(x < X ≤ x + Δx) / Δx
cdf: Φ(x) = P(X ≤ x) = ∫_{−∞}^{x} p(x') dx'
(the cdf gives the probability of X being smaller than or equal to a specific x)
Notes:
• cdf is monotonically increasing (probabilities are non-negative)
• cdf takes values between zero and one: 0 ≤ Φ(x) ≤ 1
• pdf is the derivative of the cdf
• probability of X being between A and B is given by: P(A < X ≤ B) = Φ(B) − Φ(A) = ∫_A^B p(x) dx
Graphical representations
[Figure: pdf and corresponding cdf for a uniform and for a bimodal distribution, with points A and B marked]
Central Limit Theorem
Another reason why the Gaussian is fundamentally important: the sum of many
independent, identically distributed random variables with finite variance is
approximately Gaussian distributed.
(figure from Pitman, 1999)
Pdf of sum of N uniform distributions
• What is the distribution of the sum of N uniform random variables?
• N = 1, 2, 3, 4: [plots of the resulting pdfs; see the simulation sketch below]
• Note: one can show that the pdf of the sum of two independent random variables X1
and X2 is the convolution of the two pdfs.
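A simulation of this sum (an illustrative Python sketch; sample sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    for N in (1, 2, 3, 4):
        s = rng.uniform(0.0, 1.0, size=(100_000, N)).sum(axis=1)
        print(N, s.mean(), s.var())   # mean N/2, variance N/12; a histogram of s looks increasingly Gaussian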
Example: Random Walk
• A simple model that shows up in many contexts:
• Brownian motion of particle in fluid
• neuron’s membrane potential under influence of
excitatory and inhibitory synaptic inputs
• and many more
• consider particle at position Xt ∈ {0,±1,±2,…} at time t
• at each time step the particle makes a step to the left or
right with probability 0.5:
• P(Xt+1=x+1|Xt=x) = P(Xt+1=x-1|Xt=x) = 0.5.
Random Walk
Question: Assume particle starts at X0=0, what is probability of finding it
at a particular location X after T steps?
Solution: (with central limit theorem)
• denote distance of single step by d
• mean distance of single step:
µ = E[d] = 0.5*(-1) + 0.5*1 = 0
• variance of distance of single step:
σ2 = E[(d- µ)2] = 0.5*1 + 0.5*1 = 1
• According to central limit theorem:
• distribution of the distance traveled after T steps is approximately Gaussian
with mean µ·T = 0 and variance σ²·T = T, or equivalently a standard
deviation of √T.
• The longer you wait the broader the distribution of X.
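A short simulation (an illustrative Python sketch; T and the number of walks are arbitrary choices) shows the √T spread:

    import numpy as np

    rng = np.random.default_rng(3)
    T, n_walks = 1000, 10_000
    steps = rng.choice([-1, 1], size=(n_walks, T))   # each step is +-1 with probability 0.5
    X_T = steps.sum(axis=1)                          # positions after T steps

    print(X_T.mean(), X_T.std(), np.sqrt(T))         # mean ~0, standard deviation ~sqrt(T) ~ 31.6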
Random walk and neurons
from Shadlen (1999)
Questions?
Thanks to Jochen Triesch for a large part of the material for this lecture