
A Brief Review of Probability,
Bayesian Statistics,
and Information Theory
Brendan Frey
Electrical and Computer Engineering
University of Toronto
[email protected]
http://www.psi.toronto.edu
A system is described by a set of random variables $x_1, \ldots, x_N$ with
domains $\mathcal{X}_1, \ldots, \mathcal{X}_N$. A configuration is an
assignment of values to $x_1, \ldots, x_N$. The sample space is the set of
possible configurations and is given by the product of the domains:
$\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_N$.

Ex: A die. $x$ = # of dots on die, $\mathcal{X} = \{1, 2, 3, 4, 5, 6\}$.

Ex: 2 dice. $x = (x_1, x_2)$,
$\mathcal{X} = \{1, \ldots, 6\} \times \{1, \ldots, 6\}$, giving 36 configurations.

Ex: 2 dice. $x = x_1 + x_2$, $\mathcal{X} = \{2, 3, \ldots, 12\}$. Less intuitive
than above.

Ex: 2 dice, angle of hand. $x = (x_1, x_2, \theta)$ with $\theta \in [0, 2\pi)$,
$\mathcal{X} = \{1, \ldots, 6\} \times \{1, \ldots, 6\} \times [0, 2\pi)$.
The probability of configuration $x$, $P(x)$, is a real number that satisfies
$0 \le P(x) \le 1$ and $\sum_{x \in \mathcal{X}} P(x) = 1$.

Ex: 2 unbiased dice. $P(x_1, x_2) = 1/36$ for every configuration $(x_1, x_2)$.
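As a quick illustration (a small Python sketch added here, not part of the
original slides), we can enumerate the two-dice sample space and check that the
probabilities are normalized:

  from itertools import product

  # Sample space for two dice: the product of the two domains.
  domain = range(1, 7)
  sample_space = list(product(domain, domain))   # 36 configurations

  # Two unbiased dice: every configuration has probability 1/36.
  P = {x: 1.0 / len(sample_space) for x in sample_space}

  print(len(sample_space))                            # 36
  print(sum(P.values()))                              # ~1.0 (normalization)
  print(sum(p for x, p in P.items() if sum(x) == 7))  # P(x1 + x2 = 7) = 6/36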
The probability density for configuration $x$, $p(x)$, is a real number that
satisfies $p(x) \ge 0$ and $\int_{\mathcal{X}} p(x)\, dx = 1$, where $dx$ is a
differential volume of $\mathcal{X}$.

NOTE: $p(x) > 1$ is possible.
A random experiment or simulation produces a configuration $x$.

Discrete case: In $n$ experiments, the fraction of times configuration $x$
occurs converges to $P(x)$ as $n \to \infty$.

Continuous case: Suppose $R$ is a region of $\mathcal{X}$. In $n$ experiments,
the fraction of times a configuration in $R$ occurs converges to
$\int_R p(x)\, dx$ as $n \to \infty$.
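A minimal simulation sketch of the discrete case (my own example, not from the
slides), checking that the empirical frequency approaches $P(x)$:

  import random

  random.seed(0)
  n = 100_000
  counts = {}
  for _ in range(n):
      x = (random.randint(1, 6), random.randint(1, 6))  # one experiment
      counts[x] = counts.get(x, 0) + 1

  # Fraction of times configuration (3, 4) occurs vs. its probability 1/36.
  print(counts.get((3, 4), 0) / n, 1 / 36)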
$P(x \mid y)$ is the probability of $x$ given the value of $y$. Imagine
throwing away all experiments where $y$ is not equal to the given value.

$x$ and $y$ are independent if $P(x, y) = P(x)\, P(y)$. Knowing $x$ tells us
nothing about the value of $y$, and vice versa.
In general, $P(x, y) = P(x \mid y)\, P(y) = P(y \mid x)\, P(x)$ (the chain
rule). If $x$ and $y$ are independent, $P(x \mid y) = P(x)$ and
$P(y \mid x) = P(y)$.

From the chain rule and normalization,
$P(x) = \sum_y P(x, y) = \sum_y P(x \mid y)\, P(y)$.
$P(x)$ is sometimes called the marginal of $P(x, y)$.

For densities, $p(x) = \int p(x, y)\, dy$.
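A small added sketch with a made-up joint distribution over two binary
variables, checking the chain rule and the marginals:

  # Joint distribution over two binary variables (illustrative numbers).
  P_joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

  # Marginals by summing out the other variable.
  P_x = {x: sum(p for (xi, _), p in P_joint.items() if xi == x) for x in (0, 1)}
  P_y = {y: sum(p for (_, yi), p in P_joint.items() if yi == y) for y in (0, 1)}
  P_x_given_y = {(x, y): P_joint[(x, y)] / P_y[y] for (x, y) in P_joint}

  # Chain rule P(x, y) = P(x | y) P(y) holds for every configuration.
  for (x, y), p in P_joint.items():
      assert abs(p - P_x_given_y[(x, y)] * P_y[y]) < 1e-12
  print(P_x, P_y)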
Since $P(x, y) = P(x \mid y)\, P(y) = P(y \mid x)\, P(x)$, we get Bayes rule:

  $P(y \mid x) = \dfrac{P(x \mid y)\, P(y)}{P(x)}$.

For $x$ observed and $y$ hidden, we call $P(y)$ the prior, $P(x \mid y)$ the
likelihood, and $P(y \mid x)$ the posterior.

For densities, $p(y \mid x) = \dfrac{p(x \mid y)\, p(y)}{p(x)}$.
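An added toy example of Bayes rule (the prior and likelihood numbers are
hypothetical): a die is either fair or loaded, and we observe a six.

  # Hidden y in {fair, loaded}, observed x = "rolled a six".
  prior = {"fair": 0.9, "loaded": 0.1}              # P(y)
  likelihood = {"fair": 1 / 6, "loaded": 1 / 2}     # P(x | y)

  evidence = sum(likelihood[y] * prior[y] for y in prior)              # P(x)
  posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}  # P(y | x)
  print(posterior)   # {'fair': 0.75, 'loaded': 0.25}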
The expected value of $x$ is $E[x] = \sum_x x\, P(x)$, or
$E[x] = \int x\, p(x)\, dx$ for densities. $x$ can be a vector, eg,
$x \in \mathbb{R}^n$.

If $x$ and $y$ are independent, $E[xy] = E[x]\, E[y]$.

The variance of $x$ is $\mathrm{Var}(x) = E[(x - E[x])^2]$.

The covariance of $x$ and $y$ is
$\mathrm{Cov}(x, y) = E[(x - E[x])(y - E[y])]$.

If $x$ and $y$ are independent, $\mathrm{Cov}(x, y) = 0$ (not vice versa).

In general, $\mathrm{Cov}(x, y) = E[xy] - E[x]\, E[y]$.

The covariance matrix of a vector $x$ is
$\mathrm{Cov}(x) = E[(x - E[x])(x - E[x])^T]$ or, for $x \in \mathbb{R}^n$, an
$n \times n$ matrix whose $(i, j)$ entry is $\mathrm{Cov}(x_i, x_j)$.
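An added numerical sketch using NumPy (my own example) that checks
$\mathrm{Cov}(x, y) = E[xy] - E[x]\, E[y]$ on samples and forms a sample
covariance matrix:

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=100_000)
  y = 0.5 * x + rng.normal(size=100_000)        # correlated with x

  cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
  print(cov_xy, np.mean(x * y) - x.mean() * y.mean())   # essentially equal

  # Sample covariance matrix of the vector (x, y): a 2 x 2 matrix.
  print(np.cov(np.vstack([x, y]), bias=True))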
Ex: Bernoulli variable. $x \in \{0, 1\}$ (eg, coin toss), where $\theta$ is the
probability that $x$ is 1. Sometimes, we parameterize this using
$P(x) = \theta^x (1 - \theta)^{1 - x}$.

Ex: Beta distribution. $\theta \in [0, 1]$ (eg, prior for the probability that
a coin will land heads up):

  $p(\theta) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,
  \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$ for $0 \le \theta \le 1$,
  0 otherwise.
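An added check (assuming the Beta parameterization above, with illustrative
parameters $\alpha = 3$, $\beta = 2$) that the density integrates to 1:

  import math

  def beta_pdf(theta, a, b):
      # Gamma(a+b)/(Gamma(a)Gamma(b)) * theta^(a-1) * (1-theta)^(b-1)
      const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
      return const * theta ** (a - 1) * (1 - theta) ** (b - 1)

  a, b = 3.0, 2.0
  grid = [(i + 0.5) / 10_000 for i in range(10_000)]      # midpoints in (0, 1)
  print(sum(beta_pdf(t, a, b) for t in grid) / 10_000)    # approximately 1.0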
Machine learning and statistics study how models are
learned from data.
In Bayesian machine learning and statistics, the model is
considered to be a hidden variable with a prior
distribution. Given the data, the posterior distribution over
models can be used to make predictions, interpret the
data, etc.
Maximum likelihood (ML) estimation and maximum a
posteriori (MAP) estimation can be viewed as
approximations to Bayesian learning, where the most
probable model is selected. (In ML estimation, the prior
over models is assumed to be uniform.)
Suppose we flip a coin a bunch of times and see $H$ heads and $T$ tails. In a
"frequentist" approach, we estimate the probability of heads as $H / (H + T)$.

In the Bayesian approach, we first specify a prior, say that $\theta$, the
probability of seeing a head, is uniform on $[0, 1]$.

Using Bayes rule, we obtain

  $p(\theta \mid \text{data}) \propto \theta^H (1 - \theta)^T$,

which is a Beta distribution with mode $H / (H + T)$ and mean
$(H + 1) / (H + T + 2)$.

This distribution can be used to make decisions, compute confidence intervals,
or interpret the data.

For example, the minimum squared loss estimate of $\theta$ is the posterior
mean, $(H + 1) / (H + T + 2)$. This is closer to the prior than the frequentist
estimate.
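A tiny added sketch with hypothetical counts, comparing the frequentist
estimate (the posterior mode) with the posterior mean:

  # H heads and T tails with a uniform prior on theta give a
  # Beta(H + 1, T + 1) posterior.
  H, T = 7, 3

  freq_estimate = H / (H + T)               # frequentist / posterior mode
  posterior_mean = (H + 1) / (H + T + 2)    # minimum squared loss estimate

  print(freq_estimate, posterior_mean)      # 0.7  0.666...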
Entropy is a measure of the maximum average amount of information that a
random variable can convey in its value.

The entropy of a discrete variable $x$ is

  $H(x) = -\sum_x P(x) \log_2 P(x)$ bits.

The more uniform $P(x)$ is, the greater the entropy.

If $-\log_2 P(x)$ is an integer for all $x$, $H(x)$ bits of information can be
conveyed using an encoder that uses $-\log_2 P(x)$ bits to pick $x$.

If $\ln$ (natural logarithm) is used instead of $\log_2$, information is
measured in nats.
Ex:

  $x$   $P(x)$   $-\log_2 P(x)$   String
  1     0.5      1                0
  2     0.125    3                100
  3     0.125    3                101
  4     0.125    3                110
  5     0.125    3                111

  $H(x) = 0.5 \cdot 1 + 4 \cdot 0.125 \cdot 3 = 2$ bits
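An added sketch computing the entropy of the tabled distribution and the code
lengths $-\log_2 P(x)$:

  import math

  # The distribution from the table above; its entropy works out to 2 bits.
  P = {1: 0.5, 2: 0.125, 3: 0.125, 4: 0.125, 5: 0.125}

  H = -sum(p * math.log2(p) for p in P.values())
  print(H)   # 2.0 bits

  # The code lengths -log2 P(x) are integers, so a code achieving H(x) exists.
  print({x: int(-math.log2(p)) for x, p in P.items()})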
Imagine we have a queue of random bits (eg, a compressed image) that we'd like
to convey. We can use this information to produce a series of experiments for
$x$. Each experiment is produced thus:

  Draw a bit from the queue.
  If it is 0, set $x = 1$ and terminate the experiment.
  If it is 1, draw two more bits and use these to pick $x =$ 2, 3, 4 or 5, and
  terminate the experiment.

This procedure picks $x$ according to $P(x)$ and conveys an average of
$0.5 \cdot 1 + 0.5 \cdot 3 = 2$ bits per experiment.
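An added simulation of this procedure (my own sketch), confirming the induced
$P(x)$ and the average of 2 bits per experiment:

  import random
  from collections import deque

  random.seed(0)
  queue = deque(random.randint(0, 1) for _ in range(300_000))  # bits to convey

  def next_experiment(bits):
      # 0 -> x = 1 (1 bit); 1 -> two more bits pick x in {2, 3, 4, 5} (3 bits).
      if bits.popleft() == 0:
          return 1, 1                       # value, bits consumed
      return 2 + 2 * bits.popleft() + bits.popleft(), 3

  counts, total_bits, n = {}, 0, 0
  while len(queue) >= 3:
      x, used = next_experiment(queue)
      counts[x] = counts.get(x, 0) + 1
      total_bits += used
      n += 1

  print({x: c / n for x, c in sorted(counts.items())})  # ~{1: 0.5, 2: 0.125, ...}
  print(total_bits / n)                                 # ~2 bits per experiment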
Instead of encoding a bit string into a random variable $x$, we can encode $x$
into a bit string using a source code. The decoder uses the bit string to
recover $x$.

It turns out that if $x$ has a distribution $P(x)$, then the minimum average
bit string length is $H(x) = -\sum_x P(x) \log_2 P(x)$.

If $-\log_2 P(x)$ is an integer for all $x$, the minimum can be achieved by
mapping each $x$ to a bit string with length $-\log_2 P(x)$.

Ex: Bernoulli variable. $x \in \{0, 1\}$ (eg, coin toss, bit from a magnetic
disk), where $\theta$ is the probability that $x$ is 1:

  $H(x) = -\theta \log_2 \theta - (1 - \theta) \log_2 (1 - \theta)$ bits.
[Figure: Entropy of a Bernoulli variable,
$H = -\theta \log_2 \theta - (1 - \theta) \log_2 (1 - \theta)$, plotted against
the probability $\theta$ that the variable equals 1; the entropy peaks at 1 bit
when $\theta = 0.5$ and falls to 0 at $\theta = 0$ and $\theta = 1$.]
Relative entropy is a measure of the average excess string length when the
wrong source code is used.

Suppose the true distribution for $x$ is $P(x)$, so the minimum average string
length is $-\sum_x P(x) \log_2 P(x)$.

Suppose we use bit strings determined from the wrong distribution, $Q(x)$. The
average string length will be $-\sum_x P(x) \log_2 Q(x)$.

The average excess string length is the relative entropy:

  $D(P \| Q) = \sum_x P(x) \log_2 \dfrac{P(x)}{Q(x)}$.
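An added sketch computing the true entropy, the average length under the wrong
code, and their difference $D(P \| Q)$ for an illustrative $Q$ (uniform):

  import math

  P = {1: 0.5, 2: 0.125, 3: 0.125, 4: 0.125, 5: 0.125}   # true distribution
  Q = {x: 0.2 for x in P}                                 # wrong (uniform)

  H_P = -sum(p * math.log2(p) for p in P.values())
  avg_len_Q = -sum(P[x] * math.log2(Q[x]) for x in P)
  D_PQ = sum(P[x] * math.log2(P[x] / Q[x]) for x in P)

  print(H_P, avg_len_Q, D_PQ)                    # 2.0, ~2.32, ~0.32
  print(abs(avg_len_Q - H_P - D_PQ) < 1e-12)     # excess length equals D(P||Q)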
Suppose we try to use a source code to compress a real variable
$x \in \mathbb{R}$.

We can create infinitesimal bins, where the bin at $x$ will have probability
$p(x)\, dx$. The minimum average string length for this distribution is
$-\int p(x)\, dx \, \log_2 (p(x)\, dx)$.

This average length ("entropy") is infinite; ie, $x$ conveys infinite
information. However, on the next page, we see that the relative entropy is
finite...

Suppose we use bit strings determined from the wrong density $q(x)$. Under
this density, the bin at $x$ will have probability $q(x)\, dx$. The average
string length is $-\int p(x)\, dx \, \log_2 (q(x)\, dx)$.

The relative entropy (excess average string length) is

  $D(p \| q) = \int p(x) \log_2 \dfrac{p(x)}{q(x)}\, dx$.

Since the relative entropy is finite, we refer to
$H(x) = -\int p(x) \log_2 p(x)\, dx$ as entropy, although it may be NEGATIVE!
Ex: Exponential distribution. $x \in [0, \infty)$ (eg, distribution of failure
times):

  $p(x) = \dfrac{1}{\mu} e^{-x/\mu}$ for $x \ge 0$, 0 otherwise.

  $H(x) = 1 + \ln \mu$ nats $= \log_2(e\mu)$ bits.

The entropy increases as $\mu$ increases.
Ex: Gaussian distribution. $x \in \mathbb{R}$ (eg, variable that is a sum of a
large number of other real random variables):

  $p(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right)$.

  $H(x) = \tfrac{1}{2} \ln(2\pi e \sigma^2)$ nats
  $= \tfrac{1}{2} \log_2(2\pi e \sigma^2)$ bits.

The entropy increases as $\sigma$ increases.
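An added Monte Carlo check (my own sketch) that $-E[\log_2 p(x)]$ matches
$\tfrac{1}{2}\log_2(2\pi e\sigma^2)$ for an illustrative $\sigma = 2$:

  import math
  import numpy as np

  rng = np.random.default_rng(0)
  mu, sigma = 0.0, 2.0
  x = rng.normal(mu, sigma, size=200_000)

  # log2 p(x) for the Gaussian density, evaluated at the samples.
  log2_p = (-0.5 * np.log2(2 * np.pi * sigma**2)
            - (x - mu) ** 2 / (2 * sigma**2 * np.log(2)))
  H_mc = -log2_p.mean()
  H_exact = 0.5 * math.log2(2 * math.pi * math.e * sigma**2)

  # Both about 3.05 bits; for sigma < 1/sqrt(2*pi*e) ~ 0.242 the entropy
  # would be negative.
  print(H_mc, H_exact)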
Ex: Multivariate Gaussian. $x \in \mathbb{R}^n$:

  $p(x) = \dfrac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}}
  \exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$,

where $\Sigma$ is an $n \times n$ positive definite matrix (the covariance
matrix) and $|\Sigma|$ is its determinant.

  $H(x) = \tfrac{1}{2} \ln\!\left((2\pi e)^n |\Sigma|\right)$ nats.
Suppose we have an invertible function $y = f(x)$ and a density $p_x(x)$.

When a small volume is mapped from $x$-space to $y$-space, the probability in
the volume should stay constant. However, because the volume may change shape,
the probability density will change, and the Jacobian captures this effect.

Conservation of probability mass gives

  $p_y(y) = p_x(x) \left| \dfrac{\partial x}{\partial y} \right|$,

and $\left| \dfrac{\partial x}{\partial y} \right|$ is called the Jacobian.

For $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$, $\partial x / \partial y$ is
an $n \times n$ matrix of derivatives.
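An added sketch checking $p_y(y) = p_x(x)\,|dx/dy|$ for the illustrative map
$y = e^x$ with $x \sim N(0, 1)$ (a log-normal $y$), against a histogram of
samples:

  import math
  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=500_000)
  y = np.exp(x)

  def p_y(y_val):
      x_val = math.log(y_val)                             # inverse map
      p_x = math.exp(-x_val ** 2 / 2) / math.sqrt(2 * math.pi)
      return p_x / y_val                                  # |dx/dy| = 1 / y

  # Compare the change-of-variables formula to a histogram estimate of p_y.
  hist, edges = np.histogram(y, bins=100, range=(0.1, 4.0), density=True)
  centers = 0.5 * (edges[:-1] + edges[1:])
  for c, h in list(zip(centers, hist))[::25]:
      print(round(c, 2), round(h, 3), round(p_y(c), 3))   # close agreement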
Probability: A. Leon-Garcia, Probability and Random Processes for Electrical
Engineering, Addison-Wesley, New York, NY, 1994.

Bayesian learning: R. M. Neal, Bayesian Learning for Neural Networks,
Springer, New York, NY, 1996.

Information theory: T. M. Cover and J. A. Thomas, Elements of Information
Theory, John Wiley & Sons, New York, NY, 1991.

Matrix identities (useful when we study Gaussian models):
http://www.psi.toronto.edu/matrix/matrix.html