ST213 Mathematics of Random Events:
Outline notes Spring Term 1999-2000
Lecturer: Jonathan Warren
Notes by: Wilfrid S. Kendall
Department of Statistics
University of Warwick
This is /home/fisher/wsk/ms/ST213/st213.tex (unix) version 1.6.
Last edited: 16:23:17, 24/01/2000.
Contents

1 Introduction
  1.1 Aims and Objectives
  1.2 Books
  1.3 Resources (including examination information)
  1.4 Motivating Examples

2 Probabilities, algebras, and σ-algebras
  2.1 Motivation
  2.2 Revision of sample space and events
  2.3 Algebras of sets
  2.4 Limit Sets
  2.5 σ-algebras
  2.6 Countable additivity
  2.7 Uniqueness of probability measures
  2.8 Lebesgue measure and coin tossing

3 Independence and measurable functions
  3.1 Independence
  3.2 Borel-Cantelli lemmas
  3.3 Law of large numbers for events
  3.4 Independence and classes of events
  3.5 Measurable functions
  3.6 Independence of random variables
  3.7 Distributions of random variables

4 Integration
  4.1 Simple functions and Indicators
  4.2 Integrable functions
  4.3 Expectation of random variables
  4.4 Examples

5 Convergence
  5.1 Convergence of random variables
  5.2 Laws of large numbers for random variables
  5.3 Convergence of integrals and expectations
  5.4 Dominated convergence theorem
  5.5 Examples

6 Product measures
  6.1 Product measure spaces
  6.2 Fubini's theorem
  6.3 Relationship with independence
1 Introduction

1.1 Aims and Objectives
The main purpose of the course ST213 Mathematics of Random Events (which we will
abbreviate to MoRE) is to work over again the basics of the mathematics of uncertainty.
You have already covered this in a rough-and-ready fashion in:
(a) ST111 Probability;
(b) and even in ST114 Games and Decisions.
In this course we will cover these matters with more care. It is important to do this
because a proper appreciation of the fundamentals of the mathematics of random events
(a) gives an essential basis for getting a good grip on the basic ideas of statistics;
(b) will be of increasing importance in the future as it forms the basis of the hugely
important field of mathematical finance.
It is appropriate at this level that we cover the material emphasizing concepts rather
than proofs: by-and-large we will concentrate on what the results say and so will on some
occasions explain them rather than prove them. The third-year courses MA305 Measure
Theory, and ST318 Probability Theory go into the matter of proofs. For further discussion of
how Warwick probability courses fit together, see our road-map to probability at Warwick
at
www.warwick.ac.uk/statsdept/teaching/probmap.html
1.2 Books
The book with contents best matching this course is Williams [3], though this gives more
details (and especially many more proofs!) than we cover here; a still more extensive treatment is given by Billingsley [1]. The book by Grimmett and Stirzaker [2] also gives helpful
explanations of some (but not all) of the concepts dealt with in this course.
1.3 Resources (including examination information)
The course is composed of 30 lectures, valued at 12 CATS credit. It has an assessed component (20%) as well as an examination in the summer term. The assessed component will be
conducted as follows: an exercise sheet will be handed out approximately every fortnight,
totalling 4 sheets. In the first 10 minutes of the next lecture you produce an answer, under examination conditions, to one question specified at the start of the lecture. Model
answers will be distributed after the test, and an examples class will be held a week after
the test. The tests will be marked, and the assessed component will be based on the best 3
out of 4 of your answers.
This method helps you learn during the lecture course and so should:
• improve your exam marks;
• increase your enjoyment of the course;
• cost less time than end-of-term assessment.
Further copies of exercise sheets (after they have been handed out in lectures!) can be
obtained at the homepage for the ST213 course:
www.warwick.ac.uk/statsdept/teaching/ST213.html
There are various resources available for you as part of the course. First of all, naturally
enough, are the lectures. These will be supplemented about once a fortnight by an examples
class. You are expected to come to this class prepared to work through examples; I and
some helpers will circulate to offer help and advice when requested.
These notes will also be made available at the above URL, chapter by chapter as they
are covered in lectures. Notice that they do not cover all the material of the lectures: their
purpose is to provide a basic skeleton of summary material to supplement the notes you
make during lectures. For example no proofs are included. In particular you will not find it
possible to cover the course by ignoring lectures and depending on these notes alone!
The notes are in Acrobat pdf format: this is a good way to disseminate information including mathematical formulae, and can be read using widely available free software (Adobe
Acrobat Reader: follow links from Adobe Acrobat Reader downloads homepage; it is also
on numerous CD-ROMS accompanying the more reputable computer magazines!). Acrobat
pdf format allows me to include numerous hypertext references and I have made full use of
this. As a rule of thumb, clicking on coloured text is quite likely to:
• move you to some other relevant text, either in the current document such as here:
Aims and objectives, or occasionally in supporting documents or other course notes;
• launch a Worldwide Web browser as here (assuming your system has configured a
browser appropriately);
• send me an email, as here: [email protected].
In due course I expect also to experiment with animations . . . .
You should notice that Adobe Acrobat Reader includes a facility for going back from the
current page to the previously visited page: bear this in mind should you get lost!
Documents always have typographical errors: please email me to notify me of any you
think you have found. When you do this, please include a reference to the version and the
date of production of these notes (see the page header information on every page!). In any
case I expect to update these notes throughout the lecturing term as I myself discover and
correct errors, and think up improvements.
If you try to print out these pages you will likely discover a snag! I have formatted them
to fit comfortably on computer screens; printing via Adobe Acrobat Reader will use up a
lot of pages unless you know how to play clever tricks with PostScript. The version to be
placed in the library will be re-formatted to fit A4 pages.
Finally notice that if you download a copy of these notes to your own computer then you
may discover that some of the links cease to work. In particular these and other web-based
teaching materials of the Department of Statistics all reside on a special sub-area which
is only accessible from web-browsers originating on machines based within the Warwick
campus. So don’t be surprised that you can’t access the notes from your own internet
account!
Further related material (eg: related courses, some pretty pictures of random processes, ...)
can be obtained by following links from: W.S. Kendall’s homepage:
www.warwick.ac.uk/statsdept/Staff/WSK/
The Statistics Department will in the summer term sell booklets of the previous year’s
examination papers together with (rough) outline solutions, and we will run two revision
classes for this course at that time.
Finally, there is a unix newsgroup for this course:
uwarwick.stats.course.st213
It is intended for self-help discussion of ST213-related matters. Lecturers will lurk on the
group (this means, they will not post, except for occasional announcements of an administrative nature, but will read it at approximately weekly intervals to get an idea of course
feedback).
1.4 Motivating Examples
Here are some examples to help us see what the issues are.
Example 1.1 (J. Bernoulli, circa 1692): Suppose that A1, A2, ... are mutually independent events, each of which has probability p. Define

Sn  =  #{ events Ak which happen for k ≤ n } .

Then the probability that Sn/n is close to p increases to 1 as n tends to infinity:

P [ |Sn/n − p| ≤ ε ]  →  1  as n → ∞, for all ε > 0.
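A quick numerical illustration (a sketch, not part of the original notes; the values p = 0.3, ε = 0.05, the trial counts, and the helper name prob_close are assumptions chosen for the example):

    # Sketch: Monte Carlo check of Bernoulli's theorem (assumed p and eps).
    import random

    def prob_close(n, p=0.3, eps=0.05, trials=2000):
        """Estimate P[ |S_n/n - p| <= eps ] over many simulated runs."""
        hits = 0
        for _ in range(trials):
            s = sum(1 for _ in range(n) if random.random() < p)
            if abs(s / n - p) <= eps:
                hits += 1
        return hits / trials

    for n in (10, 100, 1000):
        print(n, prob_close(n))   # the estimates increase towards 1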
Example 1.2 Suppose the random variable U is uniformly distributed over the continuous
range [0, 1]. Why is it that for all x in [0, 1] we have
P [ U = x ]  =  0
and yet
P [ a ≤ U ≤ b ]  =  b − a

whenever 0 ≤ a ≤ b ≤ 1? Why can't we argue as follows?

P [ a ≤ U ≤ b ]  =  P [ ∪_{x ∈ [a,b]} {x} ]  =  Σ_{x ∈ [a,b]} P [ U = x ]  =  0 ?
Example 1.3 (The Banach-Tarski paradox): Consider a sphere S^2. In a certain qualified sense it is possible to do the following curious thing: we can "find" a subset F ⊂ S^2 and (for any k ≥ 3) rotations τ_1^k, τ_2^k, ..., τ_k^k such that

S^2  =  τ_1^k F ∪ τ_2^k F ∪ ... ∪ τ_k^k F .

What then should we suppose the surface area of F to be? Since S^2 = τ_1^3 F ∪ τ_2^3 F ∪ τ_3^3 F we can argue for area(F) = 1/3. But since S^2 = τ_1^4 F ∪ τ_2^4 F ∪ τ_3^4 F ∪ τ_4^4 F we can equally argue for area(F) = 1/4. Or similarly for area(F) = 1/5. Or 1/6, or ...
Example 1.4 Reverting to Bernoulli’s example (Example 1.1 above) we could ask, what is
the probability that, when we look at the whole sequence S1 /1, S2 /2, S3 /3, ..., we see the
sequence tends to p? Is this different from Bernoulli’s statement?
Example 1.5 Here is a question which is apparently quite different, which turns out to
be strongly related to the above ideas! Can we generalize the idea of a “Riemann integral”
in such a way as to make sense of rather discontinuous integrands, such as the case given
below?
∫_0^1 f(x) dx

where

f(x) = 1 when x is a rational number, and f(x) = 0 when x is an irrational number.
2 Probabilities, algebras, and σ-algebras

2.1 Motivation
Consider two coins A and B which are tossed in the air so as each to land with either heads
or tails upwards. We do not assume the coin-tosses are independent!
It is often the case that one feels justified in assuming the coins individually are equally
likely to come up heads or tails. Using the fact P [ A = T ] = 1 − P [ A = H ], etc, we find
P [ A comes up heads ]  =  P [ B comes up heads ]  =  1/2 .
To find probabilities such as P [ HH ] = P [ A = H, B = H ] we need to say something
about the relationship between the two coin-tosses. It is often the case that one feels justified
in assuming the coin-tosses are independent, so

P [ A = H, B = H ]  =  P [ A = H ] × P [ B = H ] .
However this assumption may be unwise when the person tossing the coin is not experienced!
We may decide that some variant of the following is a better model: the event determining
[B = H] is C if [A = H], D if [A = T ], where
P [ C = H ]  =  3/4 ,
P [ D = H ]  =  1/4 ,
and A, C, D are independent.
There are two stages of specification at work here. Given a collection C of events, and specified probabilities P [ C ] for each C ∈ C, we can find P [ C^c ] = 1 − P [ C ], the probability of the complement C^c of C, but not necessarily P [ C ∩ D ] for C, D ∈ C.
2.2 Revision of sample space and events
Remember from ST111 that we can use notation from set theory to describe events. We can think of events as subsets of sample space Ω. If A is an event, then the event that A does not happen is the complement or complementary event A^c = {ω ∈ Ω : ω ∉ A}. If B is another event then the event that both A and B happen is the intersection A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}. The event that either A or B (or both!) happen is the union A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}.
2.3 Algebras of sets
This leads us to identify classes of sets for which we want to find probabilities.
Definition 2.1 (Algebra of sets): An algebra (sometimes called a field) of subsets of Ω
is a class C of subsets of a sample space Ω satisfying:
(1) closure under complements: if A ∈ C then A^c ∈ C;
(2) closure under intersections: if A, B ∈ C then A ∩ B ∈ C;
(3) closure under unions: if A, B ∈ C then A ∪ B ∈ C.
Definition 2.2 (Algebra generated by a collection): If C is a collection of subsets of
Ω then A(C), the algebra generated by C, is the intersection of all algebras of subsets of Ω
which contain C.
Here are some examples of algebras:
(i) the trivial algebra A = {Ω, ∅};
(ii) supposing Ω = {H, T }, another example is
A
=
{Ω = {H, T }, {H}, {T }, ∅} ;
(iii) now consider the following class of subsets of the unit interval [0, 1]:
A = { finite unions of subintervals } ;
This is an algebra. For example, if

A  =  (a0, a1) ∪ (a2, a3) ∪ ... ∪ (a2n, a2n+1)

is a non-overlapping union of intervals (and we can always re-arrange matters so that any union of intervals is non-overlapping!) then

A^c  =  [0, a0] ∪ [a1, a2] ∪ ... ∪ [a2n+1, 1] .

This checks point (1) of the definition of an algebra of sets. Point (2) is rather easy, and point (3) follows from points (1) and (2) by De Morgan's laws. (A small computational sketch of this complement operation appears after this list of examples.)
(iv) Consider A = {{1, 2, 3}, {1, 2}, {3}, ∅}. This is an algebra of subsets of Ω = {1, 2, 3}.
Notice it does not include events such as {1}, {2, 3}.
(v) Just to give an example of a collection of sets which is not an algebra, consider
{{1, 2, 3}, {1, 2}, {2, 3}, ∅}.
(vi) Algebras get very large. It is typically more convenient simply to give a collection C
of sets generating the algebra. For example, if C = ∅ then A(C) = {∅, Ω} is the trivial
algebra described above!
(vii) If Ω = {H, T } and C = {{H}} then
A = {{H, T }, {H}, {T }, ∅}
as in example (ii) above.
(viii) If Ω = [0, 1] and C = { intervals in [0, 1] } then A(C) is the collection of finite unions
of intervals as in example (iii) above.
(ix) Finally, if Ω = [0, 1] and C is the collection of single-point sets {x} for x in [0, 1], then A(C) is the collection of (a) all finite sets in [0, 1] and (b) all complements of finite sets in [0, 1].
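Here is the small computational sketch promised in example (iii) (illustrative only, not part of the original notes; the helper name complement is invented): an element of the algebra is represented as a sorted list of disjoint intervals in [0, 1], and its complement is again such a finite union.

    # Sketch: complement within [0, 1] of a finite disjoint union of intervals.
    # (Illustrative representation; endpoints are handled loosely.)

    def complement(intervals):
        """intervals: sorted disjoint (a, b) pairs with 0 <= a <= b <= 1."""
        result, prev = [], 0.0
        for a, b in intervals:
            if a > prev:
                result.append((prev, a))
            prev = b
        if prev < 1.0:
            result.append((prev, 1.0))
        return result

    print(complement([(0.1, 0.2), (0.5, 0.7)]))
    # [(0.0, 0.1), (0.2, 0.5), (0.7, 1.0)]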
In realistic examples algebras are rather large: not surprising, since they correspond to the collection of all "true-or-false" statements you can make about a certain experiment! (If your experiment's results can be summarised as n different "yes"/"no" answers – such as, result is hot/cold, result is coloured black/white, etc – then there are 2^n possible outcomes and the relevant algebra is composed of 2^(2^n) different subsets!) Therefore it is of interest that the typical element of an algebra can be written down in a rather special form:
Theorem 2.3 (Representation of typical element of algebra): If C is a collection of subsets of Ω then the event A belongs to the algebra A(C) generated by C if and only if

A  =  ∪_{i=1}^{N} ∩_{j=1}^{Mi} C_{i,j}

where for each i, j either C_{i,j} or its complement C_{i,j}^c belongs to C. Moreover we may write A in this form with the sets

Di  =  ∩_{j=1}^{Mi} C_{i,j}

being disjoint.
(This result corresponds to a basic remark in logic: logical statements, however complicated, can be reduced to statements of the form (A1 and A2 and ... and Am) or (B1 and B2 and ... and Bn) or ... or (C1 and C2 and ... and Cp), where the statements A1 etc are either basic statements or their negations, and no more than one of the (...) or ... or (...) can be true at once.)

We are now in a position to produce our first stab at a set of axioms for probability. Given a sample space and an algebra A of subsets, probability P [ · ] assigns a number between 0 and 1 to each event in the algebra A, obeying the rules given below. There is a close analogy
to the notion of length of subsets of [0, 1] (and also to notions of area, volume, ...): the table
below makes this clear:
    Probability                          Length of subset of [0, 1]
    P [ ∅ ] = 0                          Length(∅) = 0
    P [ Ω ] = 1                          Length([0, 1]) = 1
    P [ A ∪ B ] = P [ A ] + P [ B ]      Length([a, b] ∪ [c, d]) = Length([a, b]) + Length([c, d])
      (if A ∩ B = ∅)                       (if a ≤ b < c ≤ d)
There are some consequences of these axioms which are not completely trivial. For
example, the “law of negation”
P [ A^c ]  =  1 − P [ A ] ;

and the "generalized law of addition", holding when A ∩ B is not necessarily empty,

P [ A ∪ B ]  =  P [ A ] + P [ B ] − P [ A ∩ B ]
(think of "double-counting"); and finally the "inclusion-exclusion law"

P [ A1 ∪ A2 ∪ ... ∪ An ]  =  Σ_i P [ Ai ]  −  Σ_{i<j} P [ Ai ∩ Aj ]  +  ...  +  (−1)^{n+1} P [ A1 ∩ A2 ∩ ... ∩ An ] .

2.4 Limit Sets
Much of the first half of ST111 is concerned with calculations using these various rules
of probabilistic calculation. Essentially the representation theorem above tells us we can
compute the probability of any event in A(C) just so long as we know the probabilities of
the various events in C and also of all their intersections, whether by knowing events are
independent or whether by knowing various conditional probabilities. (We avoid discussing conditional probabilities here for reasons of shortage of time: they have been dealt with in ST111 and figure very largely in ST202.)

However these calculations can become long-winded and ultimately either infeasible or unrevealing. It is better to know how to approximate probabilities and events, which leads us to the following kind of question:
Suppose we have a sequence of events Cn which are decreasing (getting harder and harder
to satisfy) and which converge to a limit C:
Cn  ↓  C .

Can we say P [ Cn ] converges to P [ C ]?
Here is a specific example. Suppose we observe an infinite sequence of coin tosses, and
think therefore of the collection C of events Ai that the ith coin comes up heads. Consider
the probabilities
(a) P [ second toss gives heads ]  =  P [ A2 ];

(b) P [ first n tosses all give heads ]  =  P [ ∩_{i=1}^{n} Ai ];

(c) P [ the first toss which gives a head is even-numbered ].
There is a difference! The first two can be dealt with within the algebra. The third
cannot: suppose Cn is the event “the first toss in numbers 1, ..., n which gives a head is
even-numbered or else all n of these tosses give tails”, then Cn lies in A(C), and converges
down to the event C “the first toss which gives a head is even-numbered”, but C is not in
A(C).
We now find that a number of problems raise their heads.
• Problems with “everywhere being impossible”: Suppose we are running an
experiment with an outcome uniformly distributed over [0, 1]. Then we have a problem
as mentioned in the second of our motivating examples: under reasonable conditions
we are working with the algebra of finite unions of sub-intervals of [0, 1], and the
probability measure which gives P [ [a, b] ] = b − a, but this means P [ {a} ] = 0. Now
we need to be careful, since if we rashly allow ourselves to work with uncountable
unions we get

P [ ∪_{x ∈ [0,1]} {x} ]  =  Σ_{x ∈ [0,1]} 0  =  0 .

But this contradicts P [ [0, 1] ] = 1 and so is obviously wrong.
• Problems with specification: if we react to the above example by insisting we can only give probabilities to events in the original algebra, then we can fail to give probabilities to perfectly sensible events, such as example (c) in the infinite sequence of coin-tosses above. On the other hand if we rashly prescribe probabilities then how can we avoid getting into contradictions such as the above?
It seems sensible to suppose that at least when we have Cn ↓ C then we should be allowed
to say P [ Cn ] ↓ P [ C ], and this turns out to be the case as long as the set-up is sensible.
Here is an example of a set-up which is not sensible:
Example 2.4 Ω = {1, 2, 3, ...}, C = {{1}, {2}, ...}, P [ {n} ] = 1/2^{n+1}. Then A(C) is the collection of finite and co-finite (co-finite: the complement is finite) subsets of the positive integers, and

P [ {1, 2, ..., n} ]  =  Σ_{m=1}^{n} 1/2^{m+1}  =  1/2 − 1/2^{n+1}  →  1/2 ≠ 1 .

We must now investigate how we can deal with limit sets.
2.5 σ-algebras
The first task is to establish a wide range of sensible limit sets. Boldly, we look at sets which can be obtained by any imaginable combination of countable set operations: the collection of all such sets is a σ-algebra. (The σ stands for "countable".)
Definition 2.5 (σ-algebra): A σ-algebra of subsets of Ω is an algebra which is also closed
under countable unions.
In fact σ-algebras are even larger than ordinary algebras; it is difficult to describe a
typical member of a σ-algebra, and it pays to talk about σ-algebras generated by specified
collections of sets.
Definition 2.6 (σ-algebra generated by a collection): For any collection of subsets C
of Ω, we define σ(C) to be the intersection of all σ-algebras of subsets of Ω which contain C:
σ(C)  =  ∩ { S : S is a σ-algebra and C ⊆ S } .
Theorem 2.7 (Monotone limits): Note that σ(C) defined above is indeed a σ-algebra: it is the smallest σ-algebra containing C, and it is closed under monotone limits.
Examples of σ-algebras include: all algebras of subsets of finite sets (because then there
will be no non-finite countable set operations); the Borel σ-algebra generated by the family
of all intervals of the real line; the σ-algebra for the coin-tossing example generated by the
infinite family of events

Ai  =  [ ith coin is heads ] .

2.6 Countable additivity
Now that we have established a context for limit sets (they are sets belonging to a σ-algebra) we can think about what sort of limiting operations we should allow for probability measures.
Definition 2.8 (Measures): A set-function µ : A → [0, ∞] is said to be a finitely-additive
measure if it satisfies:
(FA) µ(A ∪ B) = µ(A) + µ(B) whenever A, B are disjoint. It is said to be countably-additive
(or σ-additive) if in addition
(CA) µ( ∪_{i=1}^{∞} Ai ) = Σ_{i=1}^{∞} µ(Ai) whenever the Ai are disjoint and their union ∪_{i=1}^{∞} Ai lies in A.
We abbreviate “finitely-additive” to (FA), and “countably-additive” to (CA).
We often abbreviate “countably-additive measure” to “measure”.
Notice that if A were actually a σ-algebra then we wouldn't have to check the condition "∪_{i=1}^{∞} Ai lies in A" in (CA).
Definition 2.9 (Probability measures): A set-function P : A → [0, 1] is said to be a finitely-additive probability measure if it is a (FA) measure such that P [ Ω ] = 1. It is a (CA) probability measure (we often just say "probability measure") if in addition it is (CA).
Notice various consequences for probability measures: µ(∅) = 0; (FA) follows from (CA) whenever (CA) applies; we always have µ( ∪_{i=1}^{∞} Ai ) ≤ Σ_{i=1}^{∞} µ(Ai) even when the union is not disjoint; etc.
CA is a kind of continuity condition. A similar continuity condition is that of “monotone
limits”.
Definition 2.10 (Monotone limits): A set-function µ : A → [0, 1] is said to obey the
monotone limits property (ML) if it satisfies:
• µ(Ai ) → µ(A) whenever the Ai increase upwards to a limit set A which lies in A.
(ML) is simpler to check than (CA) but is equivalent for finitely-additive measures.
Theorem 2.11 (Equivalence for countable additivity):

(FA) + (ML)  ⟺  (CA)
Lemma 2.12 (Another condition for countable additivity): Suppose P is a finitely
additive probability measure on (Ω, F), where F is an algebra of sets. Then P is countably
additive if and only if
lim_{n→∞} P [ An ]  =  1
whenever the sequence of events An belongs to the algebra F and moreover An ↑ Ω.
2.7 Uniqueness of probability measures
To illustrate the next step, consider the notion of length/area. (To avoid awkward alternatives, we talk about the measure instead of length/area/volume/...) It is easy to define the measure of very regular sets. But for a stranger, more "fractal-like", set A we would need to define something like an "outer-measure"

µ*(A)  =  inf { Σ_i µ(Bi) : where the Bi cover A }

to get at least an upper bound for what it would be sensible to call the measure of A.

Of course we must give equal priority to considering what is the measure of the complement A^c. Suppose for definiteness that A is contained in a simple set Q of finite measure (a convenient interval for length, a square for area, a cube for volume, ...) so that A^c = Q \ A. Then consideration of µ*(A^c) leads us directly to consideration of an "inner-measure" for A:

µ_*(A)  =  µ(Q) − µ*(A^c) .

Clearly µ*(A) ≥ µ_*(A): moreover we can only expect a truly sensible definition of measure on the set

F  =  { A : µ_*(A) = µ*(A) } .
The fundamental theorem of measure theory states that this works out all right!
Theorem 2.13 (Extension theorem): If µ is a measure on an algebra A which is σ-additive on A then it can be extended uniquely to a countably additive measure on F defined as above: moreover σ(A) ⊆ F.
The proof of this remarkable theorem is too lengthy to go into here. Notice that it can
be paraphrased very simply: if your notion of measure (probability, length, area, volume,
...) can be defined consistently on an algebra in such a way that it is σ-additive whenever
the two sides of

µ( ∪_{i=1}^{∞} Ai )  =  Σ_{i=1}^{∞} µ(Ai)

make sense (whenever the disjoint union ∪_{i=1}^{∞} Ai actually belongs to the algebra), then it can be extended uniquely to the (typically much larger) σ-algebra generated by the original algebra, so as again to be a (σ-additive) measure.
There is an important special part of this theorem which is worth stating separately.
Definition 2.14 (Π-system): A Π-system of subsets of Ω is a collection of subsets including Ω itself and closed under finite intersections.
Theorem 2.15 (Uniqueness for probability measures): Two finite measures which agree on a Π-system Π also agree on the generated σ-algebra σ(Π).
2.8 Lebesgue measure and coin tossing
The extension theorem can be applied to the “uniform probability space” Ω = [0, 1], A
given by finite unions of intervals, P given by lengths of intervals. It turns out P is indeed
σ-additive on A (showing this is non-trivial!) and so the extension theorem tells us there is
a unique countably additive extension P on the σ-algebra B = σ(A) (the Borel σ-algebra
restricted to [0, 1]). We call this Lebesgue measure.
There is a significant connection between infinite sequences of coin tosses and numbers
in [0, 1]. Briefly, we can expand a number x ∈ [0, 1] in binary (as opposed to decimal!): we
write x as
.ω1 ω2 ω3 ...
where ωi equals 1 or 0 according as 2^i x mod 2 is at least 1 or not. The coin-tossing σ-algebra can be viewed as generated by the sequence

{ω1, ω2, ω3, ...}

with 0 standing for tails, 1 for heads. In effect we get a map from coin-tossing space 2^N to number space [0, 1] – with the slight cautionary note that this map very occasionally maps two sequences onto one number (think of .0111111... and .100000...). In particular

[ ω1 = a1, ω2 = a2, ..., ωd = ad ]  =  [ x, x + 2^{−d} )
where x is the number corresponding to (a1 , a2 , ..., ad ).
Remarkably, we can now use the uniqueness theorem to show that the map
T : (a1, a2, ..., ad) ↦ x
preserves probabilities, in the sense that Lebesgue measure is exactly the same as we get by finding the probability of the event T^{−1}(A) as a coin-tossing event, if the coins are independent and fair.
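As an illustration (a code sketch, not part of the original notes; the helper name T mirrors the map above but the values are invented): applying T to a finite coin sequence (a1, ..., ad) gives the left endpoint x of a dyadic interval whose Lebesgue measure 2^{−d} matches the probability of those d fair tosses.

    # Sketch: the map T from finite coin sequences to dyadic intervals.

    def T(bits):
        """Map a 0/1 sequence (a1, ..., ad) to x = 0.a1a2...ad in binary."""
        return sum(b * 2 ** -(i + 1) for i, b in enumerate(bits))

    bits = (1, 0, 1)              # heads, tails, heads
    x = T(bits)                   # x = 0.101 (binary) = 0.625
    d = len(bits)
    length = 2 ** -d              # Lebesgue measure of [x, x + 2^-d)
    coin_prob = 0.5 ** d          # probability of these d fair tosses
    print(x, length, coin_prob)   # 0.625 0.125 0.125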
It is reasonable to ask whether there are any non-measurable sets, since σ-algebras are
so big! It is indeed very hard to find any. Here is the basic example, which is due in essence
to Vitali.
Example 2.16 Consider the following equivalence relation on (Ω, B, P): we say x ∼ y if
x − y is a rational number. Now construct a set A by choosing exactly one member from
each equivalence class. So for any x ∈ [0, 1] there is one and only one y ∈ A such that x − y
is a rational number.
If A were Lebesgue measurable then it would have a value P [ A ]. What would this value
be?
Imagine [0, 1] folded round into a circle. It is the case that P [ A ] does not change when
one turns this circle. In particular we can now consider Aq = {a + q : a ∈ A} for rational
q. By construction Aq and Ar are disjoint for different rational q, r. Now we have
∪_{q rational} Aq  =  [0, 1]

and since there are only countably many rational q, and P [ Aq ] doesn't depend on q, we determine

P [ [0, 1] ]  =  Σ_{q rational} P [ Aq ]  =  Σ_{q rational} P [ A ] .
But this cannot make sense if P [ [0, 1] ] = 1! We are forced to conclude that A cannot be
Lebesgue measurable.
This example has a lot to do with the Banach-Tarski paradox described in the motivating
Example 1.3 above.
3 Independence and measurable functions

3.1 Independence
In ST111 we formalized the idea of independence of events. Essentially we require a “multiplication law” to hold:
Definition 3.1 (Independence of an infinite sequence of events): We say the events
Ai (for i = 1, 2, ...) are independent if, for any finite subsequence i1 < i2 < ... < ik we
have
P [ Ai1 ∩ ... ∩ Aik ]  =  P [ Ai1 ] × ... × P [ Aik ] .
Notice we require all possible multiplication laws to hold: it is possible to build interesting examples where events are independent pair-by-pair, but altogether give non-trivial
information about each other.
We need to talk about infinite sequences of events (often independent). We often have
in the back of our minds a sense that the sequence is revealed to us progressively over time
(though this need not be so!), suggesting two natural questions. First, will we see events
occur in the sequence right into the indefinite future? Second, will we after some point see
all events occur?
Definition 3.2 (“Infinitely often” and “Eventually”): Given a sequence of events
B1 , B2 , ... we say
• Bi holds infinitely often ([Bi i.o.]) if there are infinitely many different i for which the
statement Bi is true: in set-theoretic terms
[Bi i.o.]
=
∞ [
∞
\
Bj .
i=1 j=i
• Bi holds eventually ([Bi ev.]) if for all large enough i the statement Bi is true: in set-theoretic terms

[Bi ev.]  =  ∪_{i=1}^{∞} ∩_{j=i}^{∞} Bj .
Notice these two concepts ev. and i.o. make sense even if the infinite sequence is just a
sequence, with no notion of events occurring consecutively in time!
Notice (you should check this yourself!)
[Bi i.o.]  =  [Bi^c ev.]^c .
3.2 Borel-Cantelli lemmas
The multiplication laws appearing above in Section 3.1 force a kind of “infinite multiplication
law”.
Lemma 3.3 (Probability of infinite intersection): If the events Ai (for i = 1, 2, ...)
are independent then

P [ ∩_{i=1}^{∞} Ai ]  =  Π_{i=1}^{∞} P [ Ai ] .
We have to be careful what we mean by the infinite product Π_{i=1}^{∞} P [ Ai ]: we mean of course the limiting value

lim_{n→∞} Π_{i=1}^{n} P [ Ai ] .
We can now prove a remarkable pair of facts about P [ Ai i.o. ] (and hence its twin
P [ Ai ev. ]!). It turns out it is often easy to tell whether these events have probability 0 or
1.
Theorem 3.4 (Borel-Cantelli lemmas): Suppose the events Ai (for i = 1, 2, ...) form an infinite sequence. Then

(i) if Σ_{i=1}^{∞} P [ Ai ] < ∞ then

P [ Ai holds infinitely often ]  =  P [ Ai i.o. ]  =  0 ;

(ii) if Σ_{i=1}^{∞} P [ Ai ] = ∞ and the Ai are independent then

P [ Ai holds infinitely often ]  =  P [ Ai i.o. ]  =  1 .
Note the two parts of the above result are not quite symmetrical: the second part also
requires independence. It is a good exercise to work out a counterexample to part (ii) if
independence fails.
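To see the two regimes numerically, here is a simulation sketch (the choices P [ Ai ] = 1/i^2 and P [ Ai ] = 1/i, and the helper name last_occurrence, are assumptions, not from the notes). In the first case Σ P [ Ai ] < ∞ and the last Ai to occur typically has a small index; in the second the sum diverges and occurrences continue throughout the run.

    # Sketch: simulating the two Borel-Cantelli regimes (assumed parameters).
    import random

    def last_occurrence(prob, n=100000):
        """Index of the last A_i (i <= n) that occurs, where P[A_i] = prob(i)."""
        last = None
        for i in range(1, n + 1):
            if random.random() < prob(i):
                last = i
        return last

    print(last_occurrence(lambda i: 1.0 / i ** 2))  # typically small: occurrences die out
    print(last_occurrence(lambda i: 1.0 / i))       # typically close to n: they persist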
3.3 Law of large numbers for events
As a consequence of these ideas it can be shown that “limiting frequencies” exist for sequences of independent trials with the same success probability.
Theorem 3.5 (Law of large numbers for events): Suppose that we have a sequence of independent events Ai each with the same probability p. Let Sn count the number of events A1, ..., An which occur. Then

P [ |Sn/n − p| ≤ ε ev. ]  =  1

for all positive ε.
3.4 Independence and classes of events
The idea of independence stretches beyond mere sequences of events. For example, consider
(a) a set of events concerning a football match between Coventry City and Aston Villa at
home for Coventry, and (b) a set of events concerning a cricket test between England and
Australia at Melbourne, both happening on the same day. At least as a first approximation,
one might assume that any combination of events concerning (a) is independent of any
combination concerning (b).
Definition 3.6 (Independence and classes of events): Suppose C1 , C2 are two classes
of events. We say they are independent if A and B are independent whenever A ∈ C1 ,
B ∈ C2 .
Here our notion of Π-systems becomes important.
Lemma 3.7 (Independence and Π-systems): If two Π-systems are independent, then
so are the σ-algebras they generate.
Returning to sequences, the above is the reason why we can jump immediately from
assumptions of independence of events to deducing that their complements are independent.
Corollary 3.8 (Independence and complements): If a sequence of events Ai is independent, then so is the sequence of complementary events Ai^c.
3.5 Measurable functions
Mathematical work often becomes easier if one moves from sets to functions. Probability
theory is no different. Instead of events (subsets of sample space) we can often find it easier
to work with random variables (real-valued functions defined on sample space). You should
think of a random variable as involving lots of different events, namely those events defined
in terms of the random variable taking on different sets of values. Accordingly we need to
take care that the random variable doesn’t produce events which fall outwith our chosen
σ-algebra. To do this we need to develop the idea of a measurable function.
Definition 3.9 (Measurable space): (Ω, F) is a measurable space if F is a σ-algebra of
subsets of Ω.
Definition 3.10 (Borel σ-algebra): The Borel σ-algebra B is the σ-algebra of subsets of
R generated by the collection of intervals of R.
In fact we don’t need all the intervals of R. It is enough to take the closed half-infinite
intervals (−∞, x].
Definition 3.11 (Measurable function): Suppose given two measurable spaces (Ω, F), (Ω′, F′). We say the function

f : Ω → Ω′

is measurable if f^{−1}(A) = {ω : f(ω) ∈ A} belongs to F whenever A belongs to F′.
Definition 3.12 (Random variable): Suppose that X : Ω → R is measurable as a mapping from (Ω, F) to (R, B). Then we say X is a random variable.
As we have said, to each random variable there is a class of related events. This actually
forms a σ-algebra.
Definition 3.13 (σ-algebra generated by a random variable): If X : Ω → R is a random variable then the σ-algebra generated by X is the family of events σ(X) = {X^{−1}(A) : A ∈ B}.
3.6 Independence of random variables
Random variables can be independent too! Essentially here independence means that an event generated by one of the random variables cannot be used to give useful predictions about an event generated by the other random variable.
Definition 3.14 (Independence of random variables): We say random variables X
and Y are independent if their σ-algebras σ(X), σ(Y ) are independent.
Theorem 3.15 (Criterion for independence of random variables): Let X and Y be random variables, and let P be the Π-system of R formed by all half-infinite closed intervals (−∞, x]. Then X and Y are independent if and only if the collections of events X^{−1}P, Y^{−1}P are independent. (Here we define X^{−1}P = {X^{−1}(A) : A ∈ P} = {X^{−1}((−∞, x]) : x ∈ (−∞, ∞)}.)
3.7 Distributions of random variables
We often need to talk about random variables on their own, without reference to other
random variables or events. In such cases all we are interested in is the probabilities they
have of taking values in various regions:
Definition 3.16 (Distribution of a random variable): Suppose that X is a random
variable. Its distribution is the probability measure PX on R given by
PX [ B ]  =  P [ X ∈ B ]

whenever B ∈ B.
4 Integration
One of the main things to do with functions is to integrate them (find the area under the
curve). One of the main things to do with random variables is to take their expectations
(find their average values). It turns out that these are really the same idea! We start with
integration.
4.1 Simple functions and Indicators
Begin by thinking of the simplest possible function to integrate. That is an indicator function, which only takes two possible values, 0 or 1:
Definition 4.1 (Indicator function): If A is a measurable set then its indicator function
is defined by
I[A](x)  =  1 if x ∈ A;  0 if x ∉ A.
The next stage up is to consider a simple function taking only a finite number of values,
since it can be regarded as a linear combination of indicator functions.
Definition 4.2 (Simple functions): A simple function h is a measurable function h :
Ω → R which only takes finitely many values. Thus we can represent it as
h(x)  =  c1 I[A1](x) + ... + cn I[An](x)
for some finite collection A1 , ..., An of measurable sets and constants c1 , ..., cn .
It is easy to integrate simple functions ...
Definition 4.3 (Integration of simple functions): The integral of a simple function h
with respect to a measure µ is given by
∫ h dµ  =  ∫ h(x) µ( dx)  =  Σ_{i=1}^{n} ci µ(Ai)

where

h(x)  =  c1 I[A1](x) + ... + cn I[An](x)

as above.
Note that one really should prove that the definition of ∫ h dµ does not depend on exactly how one represents h as a sum of indicator functions.
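For instance (an illustrative sketch, not from the notes; the measure values and the helper name integral_simple are invented), when µ is a measure on a small finite space the definition can be computed directly:

    # Sketch: integral of a simple function h = sum_i c_i I[A_i] against a
    # measure mu on a small finite space. (Illustrative values.)

    mu = {1: 0.2, 2: 0.3, 3: 0.5}        # a measure on Omega = {1, 2, 3}

    def integral_simple(terms):
        """terms: list of (c_i, A_i) pairs; returns sum_i c_i * mu(A_i)."""
        return sum(c * sum(mu[w] for w in A) for c, A in terms)

    # h = 4 * I[{1, 2}] + 7 * I[{3}]
    print(integral_simple([(4, {1, 2}), (7, {3})]))
    # 4*0.5 + 7*0.5 = 5.5 (up to floating point)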
Integration for such functions has a number of basic properties which one uses all the
time, almost unconsciously, when trying to find integrals.
Theorem 4.4 (Properties of integration for simple functions):

(1) if µ(f ≠ g) = 0 then ∫ f dµ = ∫ g dµ;

(2) Linearity: ∫ (af + bg) dµ = a ∫ f dµ + b ∫ g dµ;

(3) Monotonicity: f ≤ g means ∫ f dµ ≤ ∫ g dµ;

(4) min{f, g} and max{f, g} are simple.
Simple functions are rather boring. For more general functions we use limiting arguments. We have to be a little careful here, since some functions will have integrals built up
from +∞ where they are integrated over one part of the region, and −∞ over another part.
Think for example of

∫_{−∞}^{∞} (1/x) dx  =  ∫_{0}^{∞} (1/x) dx + ∫_{−∞}^{0} (1/x) dx  "equals"  ∞ − ∞ ?
So we first consider just non-negative functions.
Definition 4.5 (Integration for non-negative measurable functions): If f ≥ 0 is
measurable then we define
∫ f dµ  =  sup { ∫ g dµ : for simple g such that 0 ≤ g ≤ f } .
4.2 Integrable functions
For general functions we require that we don’t get into this situation of “∞ − ∞”.
Definition 4.6 (Integration for general measurable functions): If f is measurable
and we can write f = g − h for two non-negative measurable functions g and h, both with
finite integrals, then

∫ f dµ  =  ∫ g dµ − ∫ h dµ .
We then say f is integrable.
One really needs to prove that the integral ∫ f dµ does not depend on the choice f = g − h. In fact if there is any choice which works then the easy choice

g  =  max{f, 0} ,
h  =  max{−f, 0}

will work.
One can show that the integral on integrable functions agrees with its definition on
simple functions and is linear. What starts to make the theory very easy is that the integral
thus defined behaves very well when studying limits.
Theorem 4.7 (Monotone convergence theorem (MON)): If fn ↑ f (all being non-
negative measurable functions) then

∫ fn dµ  ↑  ∫ f dµ .
Corollary 4.8 (Integrability and simple functions): if f is non-negative and measurable then for any sequence of non-negative simple functions fn such that fn ↑ f we have

∫ fn dµ  ↑  ∫ f dµ .
Definition 4.9 (Integration over a measurable set): if A is measurable and f is integrable then

∫_A f dµ  =  ∫ I[A] f dµ .
4.3 Expectation of random variables
The above notions apply directly to random variables, which may be thought of simply as
measurable functions defined on the sample space!
Definition 4.10 (Expectation): if P is a probability measure then we define expectation
(with respect to this probability measure) for all integrable random variables X by
E [ X ]  =  ∫ X dP  =  ∫ X(ω) P( dω) .
The notion of expectation is really only to do with the random variable considered on
its own, without reference to any other random variables. Accordingly it can be expressed
in terms of the distribution of the random variable.
Theorem 4.11 (Change of variables): Let X be a random variable and let g : R → R
be a measurable function. Assuming that the random variable g(X) is integrable,
E [ g(X) ]  =  ∫_R g(x) PX( dx) .
4.4 Examples
You need to work through exercises such as the following to get a good idea of how the
above really works out in practice. See the material covered in lectures for more on this.
Exercise 4.12 Evaluate ∫_0^1 x Leb( dx).

Exercise 4.13 Consider Ω = {1, 2, 3, ...}, P [ {i} ] = pi where Σ_{i=1}^{∞} pi = 1. Show that ∫ f dP = Σ_{i=1}^{∞} f(i) pi.

Exercise 4.14 Evaluate ∫_0^y e^x Leb( dx).

Exercise 4.15 Evaluate ∫_0^n f(x) Leb( dx) where

f(x)  =  1 if 0 ≤ x < 1,  2 if 1 ≤ x < 2,  ...,  n if n − 1 ≤ x < n.

Exercise 4.16 Evaluate ∫ I[0,θ](x) sin(x) Leb( dx).

5 Convergence
Approximation is a fundamental key to making mathematics work in practice. Instead of
being stuck, unable to do a hard problem, we find an easier problem which has almost
the same answer, and do that instead! The notion of convergence (see first-year analysis)
is the formal structure giving us the tools to do this. For random variables there are a
number of different notions of convergence, depending on whether we need to approximate
a whole sequence of actual random values, or just a particular random value, or even just
probabilities.
5.1 Convergence of random variables

Definition 5.1 (Convergence in probability): The random variables Xn converge in probability to Y,

Xn → Y in prob. ,

if for all positive ε we have

P [ |Xn − Y | > ε ]  →  0 .
Definition 5.2 (Convergence almost surely / almost everywhere): The random variables Xn converge almost surely to Y,

Xn → Y a.s. ,

if we have

P [ Xn → Y ]  =  1 .
The (measurable) functions fn converge almost everywhere to f if the set
{x : fn (x) → f (x) fails }
is of Lebesgue measure zero.
The difference is that convergence in probability deals with just a single random value
Xn for large n. Convergence almost surely deals with the behaviour of the whole sequence.
Here are some examples to think about.
Example 5.3 Consider random variables defined on ([0, 1], B, Leb) by Xn(ω) = I[[0,1/n]](ω). Then Xn → 0 a.s.
Example 5.4 Consider the probability space above and the events A1 = [0, 1], A2 = [0, 1/2],
A3 = [1/2, 1], A4 = [0, 1/4], ..., A7 = [3/4, 1], ... Then Xn = I[An ] converges to zero in
probability but not almost surely.
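A sketch of this "typewriter" sequence (illustrative code, not from the notes; the helper name typewriter is invented): P [ Xn ≠ 0 ] is the length of the n-th interval, which tends to 0, yet for each fixed ω the values Xn(ω) return to 1 infinitely often, so there is no almost sure convergence.

    # Sketch: the "typewriter" indicators of Example 5.4.

    def typewriter(n, omega):
        """X_n(omega) for the intervals [0,1], [0,1/2], [1/2,1], [0,1/4], ..."""
        k = 0
        while 2 ** (k + 1) - 1 < n:    # find the dyadic scale containing index n
            k += 1
        j = n - (2 ** k - 1)           # position within scale k (j = 1, 2, ...)
        a = (j - 1) / 2 ** k
        return 1 if a <= omega <= a + 1.0 / 2 ** k else 0

    omega = 0.3
    print([typewriter(n, omega) for n in range(1, 64)])
    # 1s keep reappearing at every scale: X_n(omega) does not converge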
Example 5.5 Suppose in the above that

Xn  =  Σ_{k=1}^{n} (k/n) I[[(k−1)/n, k/n]] .

Then Xn → X a.s., where X(ω) = ω for ω ∈ [0, 1].
Example 5.6 Suppose in the above that Xn ≤ a for all n. Let Yn = maxm≤n Xm . Then
Yn ↑ Y a.s. for some Y .
Example 5.7 Suppose in the above that the Xn are not bounded, but are independent, and furthermore

lim_{a→∞} Π_{n=1}^{∞} P [ Xn ≤ a ]  =  1 .

Then Yn ↑ Y a.s. where

P [ Y ≤ a ]  =  Π_{n=1}^{∞} P [ Xn ≤ a ] .
As one might expect, the notion of almost sure convergence implies that of convergence
in probability.
Theorem 5.8 (Almost sure convergence implies convergence in probability): Xn → X a.s. implies Xn → X in prob.
Almost sure convergence allows for various theorems telling us when it is OK to exchange integrals and limits. Generally this doesn't work: consider the example

1  =  ∫_0^∞ λ exp(−λt) dt  ↛  ∫_0^∞ lim_{λ→∞} λ exp(−λt) dt  =  ∫ 0 dt  =  0 .

However we have already seen one case where it does work: when the limit is monotonic. In fact we only need this to hold almost everywhere (i.e. when the convergence is almost sure).
Theorem 5.9 (MON): if the functions fn, f are non-negative and if fn ↑ f µ-a.e. then

∫ fn dµ  ↑  ∫ f dµ .
It is often the case that the following simple inequalities are crucial to figuring out
whether convergence holds.
Lemma 5.10 (Markov’s inequality): if f : R → R is increasing and non-negative and
X is a random variable then
P [ X ≥ a ] ≤ E [ f (X) ] /f (a)
for all a such that f (a) > 0.
Corollary 5.11 (Chebyshev's inequality): if E [ X^2 ] < ∞ then

P [ |X − E [ X ]| ≥ a ]  ≤  Var(X)/a^2

for all a > 0.
In particular we can get a lot of mileage by combining this with the fact that, while in general the variance of a sum of random variables is not additive, it is additive in the case of independence.
Lemma 5.12 (Variance and independence): if a sequence of random variables Xi is independent then

Var( Σ_{i=1}^{n} Xi )  =  Σ_{i=1}^{n} Var(Xi) .
5.2 Laws of large numbers for random variables
An important application of these ideas is to show that the law of large numbers extends
from events to random variables.
Theorem 5.13 (Weak law of large numbers): if a sequence of random variables Xi is independent, and if the random variables all have the same finite mean and variance E [ Xi ] = µ and Var(Xi) = σ^2 < ∞, then

Sn/n → µ in prob.

where Sn = X1 + ... + Xn is the partial sum of the sequence.
As you will see, the proof is really rather easy when we use Chebyshev’s inequality
above. Indeed it is also quite easy to generalize to the case when the random variables are
correlated, as long as the covariances are small ...
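In outline (a step the notes leave to lectures, reconstructed here from Lemma 5.12 and Corollary 5.11): by independence Var(Sn/n) = (1/n^2) Σ_{i=1}^{n} Var(Xi) = σ^2/n, so Chebyshev's inequality gives, for any ε > 0,

P [ |Sn/n − µ| ≥ ε ]  ≤  σ^2/(n ε^2)  →  0  as n → ∞ ,

which is exactly the statement that Sn/n → µ in probability.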
However the corresponding result for almost sure convergence, rather than convergence in probability, is rather harder to prove.
Theorem 5.14 (Strong law of large numbers): if a sequence of random variables Xi is independent and identically distributed, and if E [ Xi ] = µ then

Sn/n → µ a.s.

where Sn = X1 + ... + Xn is the partial sum of the sequence.
5.3 Convergence of integrals and expectations
We already know a way to relate integrals to limits (MON). What about a general sequence
of non-negative measurable functions?
Theorem 5.15 (Fatou's lemma (FATOU)): If the functions fn : R → R are actually non-negative then

∫ lim inf fn dµ  ≤  lim inf ∫ fn dµ .
We can also go “the other way”:
Theorem 5.16 ("Reverse Fatou"): If the functions fn : R → R are bounded above by g µ-a.e. and g is integrable then

lim sup ∫ fn dµ  ≤  ∫ lim sup fn dµ .
5.4 Dominated convergence theorem
Although in general one can’t interchange limits and integrals, this can be done if all the
functions (equivalently, random variables) involved are bounded in absolute value by a single
non-negative function (random variable) which has finite integral.
Corollary 5.17 (Dominated convergence theorem (DOM)): If the functions fn : R → R are bounded in absolute value by an integrable g µ-a.e. (so |fn| < g a.e.) and also fn → f then

lim ∫ fn dµ  =  ∫ f dµ .
This is a very powerful result ...
5.5 Examples
Example 5.18 If the Xn form a bounded sequence of random variables and they converge almost surely to X then

E [ Xn ]  →  E [ X ] .
Example 5.19 Suppose that U is a random variable uniformly distributed over [0, 1] and

Xn  =  Σ_{k=0}^{2^n − 1} k 2^{−n} I[ k 2^{−n} ≤ U < (k+1) 2^{−n} ] .

Then E [ log(1 − Xn) ] → −1.
Example 5.20 Suppose that the Xn are independent and X1 = 1 while for n ≥ 2

P [ Xn = n + 1 ]  =  P [ Xn = 1/(n + 1) ]  =  1/n^3 ,
P [ Xn = 1 ]  =  1 − 2/n^3 ,

and Zn = Π_{i=1}^{n} Xi. Then the Zn form an almost surely convergent sequence with limit Z∞, and

E [ Zn ]  →  E [ Z∞ ] .

6 Product measures

6.1 Product measure spaces
The idea here is, given two measure spaces (Ω, F, µ) and (Ω′, F′, ν), we build a measure space Ω × Ω′ by using "rectangle sets" A × B with measures µ(A) × ν(B). As you might guess from the "product form" µ(A) × ν(B), in the context of probability this is related to independence.
Definition 6.1 (Product measure space): define the “product measure” µ ⊗ ν on the
Π-system R of rectangle sets A × B as above. Let A(R) be the algebra generated by R.
Lemma 6.2 (Representation of A(R)): every member of A(R) can be expressed as a
finite disjoint union of rectangle sets.
It is now possible to apply the Extension Theorem 2.13 (we need to check σ-additivity –
this is non-trivial but works) to define the “product measure” µ ⊗ ν on the whole σ-algebra
σ(R).
6.2 Fubini's theorem
There are three big results on integration. We have already met two: MON and DOM,
which tell us cases when we can exchange integrals and limits. The other result arises in
the situation where we have a product measure space. In such a case we can integrate any
function in one of three possible ways: either using the product measure, or by first doing
a “partial integration” holding one coordinate fixed, and then integrating with respect to
that one. We call this alternative iterated integration, and obviously there are two ways to
do it depending on which variable we fix first. The final big result is due to Fubini, and
tells us that as long as the function is modestly well-behaved it doesn’t matter which of the
three ways we do the integration, we still get the same answer:
Theorem 6.3 (Fubini's theorem): Suppose f is a real-valued function defined on the product measure space above which is either (a) non-negative or (b) µ ⊗ ν-integrable. Then

∫ f d(µ ⊗ ν)  =  ∫_{Ω′} ( ∫_{Ω} f(ω, ω′) µ( dω) ) ν( dω′) .
Notice the two alternative conditions. Non-negativity (sometimes described as Tonelli's condition) is easy to check but can be limited. Think carefully about Fubini's theorem and especially Tonelli's condition, and you will see that the only thing which can go wrong is when in the product form you have an ∞ − ∞ problem!
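As a concrete check (an illustrative sketch with assumed discrete measures, not from the notes), on a small finite product space the product-measure integral and the two iterated integrals can be compared directly:

    # Sketch: checking Fubini's theorem on a small discrete product space.

    mu = {0: 0.5, 1: 0.5}            # assumed measure on Omega
    nu = {0: 0.2, 1: 0.3, 2: 0.5}    # assumed measure on Omega'

    def f(w, wp):
        return w + 2 * wp            # a bounded (hence integrable) function

    product = sum(f(w, wp) * mu[w] * nu[wp] for w in mu for wp in nu)
    iter_1 = sum(sum(f(w, wp) * mu[w] for w in mu) * nu[wp] for wp in nu)
    iter_2 = sum(sum(f(w, wp) * nu[wp] for wp in nu) * mu[w] for w in mu)
    print(product, iter_1, iter_2)   # all three agree, approximately 3.1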
6.3 Relationship with independence
Suppose X and Y are independent random variables. Then the distribution of the pair (X, Y), a measure on R × R given by

µ*(A)  =  P [ (X, Y) ∈ A ] ,

is exactly the product measure µ ⊗ ν where µ is the distribution of X, and ν is the distribution of Y.
End of outline notes
References

[1] P. Billingsley. Probability and Measure. John Wiley & Sons, 1985. [OPAC]

[2] G.R. Grimmett and D.R. Stirzaker. Probability and Random Processes. Oxford University Press, 1982. [OPAC]

[3] D. Williams. Probability with Martingales. CUP, 1991. [OPAC]