Applied Probability
© Leopold Sögner
Department of Economics and Finance, Institute for Advanced Studies
Stumpergasse 56, 1060 Wien
Tel: +43-1-59991 182
[email protected]
http://www.ihs.ac.at/∼soegner
January, 2014

Course Outline (1)
Learning Objectives:
• Parts of the statistics course were dedicated to probability theory. Some measure theory, the law of large numbers and central limit theorems have already been covered in that course.
• The applied probability course covers further concepts of probability theory and their applications.
• The main concepts are discussed in detail during the lectures. In addition, students have to work through the textbooks and solve problems to improve their understanding and to acquire the skills to apply these tools to related problems.

Course Outline (2)
Literature:
• Durrett, Rick (2010), Probability: Theory and Examples, Cambridge Series in Statistical and Probabilistic Mathematics, 4th edition, Cambridge.
Supplementary Literature:
• Billingsley, Patrick (2012), Probability and Measure: Anniversary Edition, Wiley Series in Probability and Statistics.
• Klenke, Achim (2008), Probability Theory: A Comprehensive Course, Springer, Berlin Heidelberg.
Course Outline (3)
• Measure Theory & the Integral, Durrett Chapter 1; Klenke, Chapters 1 and 4
– Repetition of some concepts taught in the statistics course
– Probability spaces
– Random variables
– Integration and Fubini's Theorem
– Expected value
– Modes of convergence
Expected time: 4 units

Course Outline (4)
• Martingales, Durrett Chapter 5
– The concepts of conditional probability and conditional expectation
– Radon-Nikodym theorem
– Doob's martingale convergence theorem
Expected time: 4 units

Course Outline (5)
• Markov Chains, Durrett Chapter 6
– Markov property and Markov chains
– Recurrence and transience
– Stationary measures
– Asymptotic behavior
Expected time: 6 units

Course Outline (6)
• Ergodic Theorems, Durrett Chapter 7
– Definitions
– Birkhoff's ergodic theorem
– Stationary measures
Expected time: 2 units

Course Outline (7)
• Brownian Motion, Durrett Chapter 8
– Definitions and construction
– Markov property and Brownian motion
– Stopping and hitting times
– Martingales and Brownian motion
– Donsker's Theorem
Expected time: 4 units

Course Outline (8)
• Winter Term 2014
– Time schedule: Google calendar
– Practice session will be organized by Alexander Satdarov.

Course Outline (9)
Some more comments on homework and grading:
• Mid-term test (40%),
• Final test (40%),
• Homework and class-room participation (20%).
• Mid-term, final test & retake: tba

Outline - Measure Theory
• Classes of sets, σ-algebra, Borel σ-algebra.
• Set functions, measure, probability measure.
• Measure extension theorem, the Lebesgue measure, product measure.
• Klenke Chapter 1

Classes of Sets (1)
• Ω ≠ ∅ is a nonempty set.
• A ⊂ 2^Ω, where 2^Ω stands for the set of all subsets of Ω.
• Ω is called the set of elementary events. Its elements are denoted ω.
• A stands for the system of observable events.
Elements of A are the sets A1, A2, . . . ; the Ai are subsets of Ω.
• A may satisfy certain properties. The properties we consider are:

Classes of Sets (2)
• Definition (see Klenke, Definition 1.1): A class of sets A is called
– Closed under intersections (or a π-system, ∩-closed) if A ∩ B ∈ A whenever A, B ∈ A.
– Closed under countable intersections (or σ-∩-closed) if ⋂_{n=1}^∞ An ∈ A for any choice of countably many sets A1, A2, · · · ∈ A.
– Closed under unions (or ∪-closed) if A ∪ B ∈ A whenever A, B ∈ A.
– Closed under countable unions (or σ-∪-closed) if ⋃_{n=1}^∞ An ∈ A for any choice of countably many sets A1, A2, · · · ∈ A.
– Closed under differences (or \-closed) if A \ B ∈ A whenever A, B ∈ A.
– Closed under complements if A^c := Ω \ A ∈ A for any set A ∈ A.

Classes of Sets (3)
• Definition: σ-algebra/σ-field (see Klenke, Definition 1.2) A class of sets A ⊂ 2^Ω is called a σ-algebra if it fulfills the following three conditions:
– Ω ∈ A
– A is closed under complements.
– A is closed under countable unions.
• Remark & outlook: Our goal is to define probabilities on σ-algebras. The events considered in probability theory are elements of A, i.e. for any A ∈ A we have P(A).

Classes of Sets (4)
Some properties/interdependences:
• Theorem (see Klenke, Theorem 1.3): If A is closed under complements, then (i) A is ∩-closed if and only if (⇔) A is ∪-closed, and (ii) A is σ-∩-closed ⇔ A is σ-∪-closed.
• Theorem (see Klenke, Theorem 1.4): Suppose that A is \-closed. Then: (i) A is ∩-closed; (ii) if in addition A is σ-∪-closed, then A is σ-∩-closed; (iii) any countable (resp. finite) union of sets in A can be expressed as a countable (resp. finite) disjoint union of sets in A.
• ⊎ stands for disjoint union in the textbook of Klenke.
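The disjointification in part (iii) of Theorem 1.4 can be sketched numerically: put B1 := A1 and Bn := An \ (A1 ∪ · · · ∪ A_{n−1}); the Bn are pairwise disjoint, stay inside a \-closed and ∪-closed class, and have the same union as the An. A minimal sketch on finite sets (the function name is mine, not Klenke's):

```python
def disjointify(sets):
    """Bn = An \\ (A1 ∪ ... ∪ A_{n-1}): pairwise disjoint sets with the same union."""
    seen, out = set(), []
    for A in sets:
        out.append(A - seen)   # set difference, so each Bn stays in a \-closed class
        seen |= A
    return out

A = [{1, 2}, {2, 3}, {3, 4}]
B = disjointify(A)
assert set().union(*B) == set().union(*A)   # same union
assert all(B[i].isdisjoint(B[j]) for i in range(3) for j in range(i + 1, 3))
```
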
Classes of Sets (5)
• Definition: Algebra (see Klenke, Definition 1.6) A class of sets A ⊂ 2^Ω is called an algebra if it fulfills the following three conditions:
– Ω ∈ A
– A is \-closed.
– A is ∪-closed.
• Remark: Note that σ-∪-closed was a property of a σ-algebra.

Classes of Sets (6)
• Definition: Ring (see Klenke, Definition 1.7) A class of sets A ⊂ 2^Ω is called a ring if it fulfills the following three conditions:
– ∅ ∈ A
– A is \-closed.
– A is ∪-closed.
A ring is called a σ-ring if A is σ-∪-closed.
• Remark: Note that σ-∪-closed was a property of a σ-algebra. A ring A containing Ω is an algebra.

Classes of Sets (7)
• Definition: Semiring (see Klenke, Definition 1.8) A class of sets A ⊂ 2^Ω is called a semiring if it fulfills the following three conditions:
– ∅ ∈ A
– For any two sets A, B ∈ A the difference set B \ A is a finite union of mutually disjoint sets in A.
– A is ∩-closed.

Classes of Sets (8)
• Definition: λ-system (see Klenke, Definition 1.10) A class of sets A ⊂ 2^Ω is called a Dynkin λ-system if it fulfills the following three conditions:
– Ω ∈ A
– For any two sets A, B ∈ A with A ⊂ B the difference set B \ A is in A.
– ⊎_{n=1}^∞ An ∈ A for any choice of countably many pairwise disjoint sets A1, A2, · · · ∈ A.
• Remark: π-system: A class of sets A ⊂ 2^Ω is called a π-system if it is closed under the formation of finite intersections, i.e. A, B ∈ A implies A ∩ B ∈ A. See Definition 1.1.

Classes of Sets (9)
Some properties/interdependences:
• Theorem (see Klenke, Theorem 1.7): A class of sets A ⊂ 2^Ω is an algebra if and only if the following three properties hold:
– Ω ∈ A
– A is closed under complements.
– A is closed under intersections.
• To see the differences in the definitions, see e.g. Klenke, Example 1.11.
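For a finite Ω the closure properties in the definitions above can be checked by brute force, which is a useful way to test small examples against the characterization in Theorem 1.7. A sketch (the helper names are mine):

```python
from itertools import product

def is_pi_system(classA):
    """∩-closed: a ∩ b stays in the class for all pairs."""
    return all(a & b in classA for a, b in product(classA, repeat=2))

def is_algebra(omega, classA):
    """Theorem 1.7: Ω ∈ A, closed under complements, closed under intersections."""
    return (omega in classA
            and all(omega - a in classA for a in classA)
            and is_pi_system(classA))

omega = frozenset({1, 2, 3})
A = {frozenset(), frozenset({1}), frozenset({2, 3}), omega}
assert is_algebra(omega, A)                 # and, Ω being finite, a σ-algebra
assert not is_algebra(omega, {frozenset({1}), omega})   # complement {2,3} missing
```
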
Classes of Sets (10)
Some properties/interdependences:
• Theorem: For a class of sets A ⊂ 2^Ω containing ∅ the following statements are equivalent:
– (i) If A, B ∈ A and A ∩ B = ∅, then A ∪ B ∈ A; (ii) if A, B ∈ A and A ⊂ B, then B \ A ∈ A; and (iii) if A, B ∈ A, then A ∩ B ∈ A.
– (i) A is \-closed and (ii) A is ∪-closed (i.e. A is a ring).
– (i) If A, B ∈ A, then the symmetric difference A∆B ∈ A, and (ii) A ∩ B ∈ A.
• Remark: (A, ∆, ∩) is a ring in the algebraic sense.

Classes of Sets (13)
• Theorem: Relations between classes of sets (see Klenke, Theorem 1.12):
– Every σ-algebra is a λ-system, an algebra and a σ-ring.
– Every σ-ring is a ring.
– Every ring is a semiring.
– Every algebra is a ring. An algebra on a finite Ω is a σ-algebra.
• See e.g. Klenke, Figure 1.1.

Classes of Sets (14)
• Theorem: Intersection of classes of sets (see Klenke, Theorem 1.15): Let I be an arbitrary index set, and assume that Ai is a σ-algebra for every i ∈ I. Then the intersection
A_I := {A ⊂ Ω : A ∈ Ai for every i ∈ I} = ⋂_{i∈I} Ai
is a σ-algebra. (The analogous statement holds for rings, σ-rings, algebras and λ-systems, but not for semirings.)
– By this theorem the intersection of σ-fields is a σ-field.
– It can also be used to construct the smallest σ-field.

Classes of Sets (15)
• Theorem: Generated σ-algebra (see Klenke, Theorem 1.16): Let E ⊂ 2^Ω. Then there exists a smallest σ-algebra σ(E) with E ⊂ σ(E):
σ(E) := ⋂ {A : A ⊂ 2^Ω is a σ-algebra with E ⊂ A}.
σ(E) is called the σ-algebra generated by E. E is called a generator of σ(E). δ(E) is the λ-system generated by E.
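On a finite Ω the generated σ-algebra σ(E) of Theorem 1.16 can be computed directly, not as an intersection but by closing E under complements and (here finite = countable) unions until nothing new appears. A sketch under that finiteness assumption, with a hypothetical helper name:

```python
def generate_sigma_algebra(omega, E):
    """Smallest class containing E with Ω, complements and unions (finite Ω only)."""
    A = set(E) | {frozenset(), omega}
    changed = True
    while changed:
        changed = False
        for a in list(A):
            # candidate new sets: the complement of a and all unions a ∪ c
            for b in [omega - a] + [a | c for c in A]:
                if b not in A:
                    A.add(b)
                    changed = True
    return A

omega = frozenset({1, 2, 3})
sigma = generate_sigma_algebra(omega, {frozenset({1})})
assert sigma == {frozenset(), frozenset({1}), frozenset({2, 3}), omega}
```

Since Ω is finite, closure under complements and finite unions already gives closure under intersections (De Morgan), so the result is a σ-algebra.
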
Classes of Sets (16)
• We observe that:
– E ⊂ σ(E).
– If E1 ⊂ E2 then σ(E1) ⊂ σ(E2).
– A is a σ-algebra if and only if σ(A) = A.
– δ(E) ⊂ σ(E).
– Given that D is a λ-system: D is a π-system if and only if D is a σ-algebra (Theorem 1.18 in Klenke).
– If E is a π-system, then δ(E) = σ(E) (Theorem 1.19 in Klenke).

Classes of Sets (17)
• For the rest of the course we shall mainly consider random variables with values in R^n.
• We restrict ourselves to σ-algebras generated by topologies. For the real numbers we shall consider half-open intervals with rational endpoints.

Classes of Sets (18)
• Definition: Topology (see Klenke, Definition 1.20) Let Ω ≠ ∅ be an arbitrary set. A class of sets τ ⊂ 2^Ω is called a topology on Ω if it has the following three properties:
– ∅, Ω ∈ τ
– A ∩ B ∈ τ for any two sets A, B ∈ τ.
– ⋃_{A∈F} A ∈ τ for any F ⊂ τ.
The pair (Ω, τ) is called a topological space. The sets A ∈ τ are called open sets, and the sets A ⊂ Ω with A^c ∈ τ are called closed.

Classes of Sets (19)
• Topologies are closed under finite intersections, while a σ-algebra is closed under countable intersections.
• Topologies are closed under arbitrary unions, while a σ-algebra is closed under countable unions.

Classes of Sets (20)
• Definition: Metric d (distance function): For any elements x, y, z ∈ Ω, (i) d(x, y) ≥ 0, where d(x, y) = 0 if and only if x = y; (ii) d(x, y) = d(y, x) (symmetry); and (iii) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
• Assume that a metric d exists on Ω. Then we define the open ball with radius r centered at x ∈ Ω by Br(x) = {y ∈ Ω : d(x, y) < r}.
• The class of open sets is the topology
τ = { ⋃_{(x,r)∈F} Br(x) : F ⊂ Ω × (0, ∞) }.

Classes of Sets (21)
• Definition: Borel σ-algebra (see Klenke, Definition 1.21) Let (Ω, τ) be a topological space.
The σ-algebra B(Ω) := B(Ω, τ) := σ(τ) that is generated by the open sets A ∈ τ is called the Borel σ-algebra on Ω. The elements A ∈ B(Ω) are called Borel sets or Borel measurable sets.

Classes of Sets (22)
• Some remarks:
– We are interested in B(R^n). R^n is equipped with the Euclidean distance d(x, y) = ‖x − y‖₂ = ( Σ_{i=1}^n (xi − yi)² )^{1/2}.
– There are subsets of R^n that are not Borel sets, e.g. Vitali sets; see the literature (e.g. Durrett, Appendix, or Billingsley, end of Chapter 2).
– If C ⊂ R^n is a closed set, then C^c ∈ τ is in B(R^n), so that C is also a Borel set. Therefore every singleton {x} with x ∈ R^n is contained in B(R^n), i.e. {x} ∈ B(R^n).
– B(R^n) is not a topology. To see this, consider a set V ⊂ R^n with V ∉ B(R^n) (by the above argument we know that such subsets exist). V = ⋃_{x∈V} {x}. If B(R^n) were a topology, then it would be closed under arbitrary unions. Since {x} ∈ B(R^n) we would get V = ⋃_{x∈V} {x} ∈ B(R^n), a contradiction.

Classes of Sets (23)
• Since the class of open sets that generates the Borel σ-field is quite big, we ask whether the Borel σ-field can also be generated by a smaller class of sets.
• Define:
E1 = {A ⊂ R^n : A is open}, E2 = {A ⊂ R^n : A is closed}, E3 = {A ⊂ R^n : A is compact},
E4 = {Br(x) : x ∈ Q^n, r ∈ Q⁺},
E5 = {(a, b) : a, b ∈ Q^n, a < b}, E6 = {[a, b) : a, b ∈ Q^n, a < b},
E7 = {(a, b] : a, b ∈ Q^n, a < b}, E8 = {[a, b] : a, b ∈ Q^n, a < b},
E9 = {(−∞, b) : b ∈ Q^n}, E10 = {(−∞, b] : b ∈ Q^n},
E11 = {(a, ∞) : a ∈ Q^n}, E12 = {[a, ∞) : a ∈ Q^n}.

Classes of Sets (24)
• Theorem (see Klenke, Theorem 1.23): The Borel σ-field B(R^n) is generated by any of the classes E1-E12. That is to say, σ(Ei) = B(R^n) for i = 1, . . . , 12.
• Remark (see Klenke, Remark 1.24): The classes of sets E1-E3 and E5-E12 are π-systems. Hence the Borel σ-algebra equals the generated λ-system, i.e. B(R^n) = δ(Ei) holds for i = 1, 2, 3, 5, . . . , 12 (see also Dynkin's π-λ theorem, Theorem 1.19).
E4, . . . , E12 are countable classes.

Set Functions (1)
• Definition (see Klenke, Definition 1.27): Let A ⊂ 2^Ω and let µ : A → [0, ∞] be a set function. We say that µ is:
– monotone if µ(A) ≤ µ(B) for any two sets A, B ∈ A with A ⊂ B.
– additive if µ(⊎_{i=1}^n Ai) = Σ_{i=1}^n µ(Ai) for any choice of finitely many mutually disjoint sets A1, . . . , An ∈ A with ⋃_{i=1}^n Ai ∈ A.
– σ-additive if µ(⊎_{i=1}^∞ Ai) = Σ_{i=1}^∞ µ(Ai) for any choice of countably many mutually disjoint sets A1, A2, · · · ∈ A with ⋃_{i=1}^∞ Ai ∈ A.
– subadditive if for any choice of finitely many sets A, A1, . . . , An ∈ A with A ⊂ ⋃_{i=1}^n Ai we have µ(A) ≤ Σ_{i=1}^n µ(Ai).
– σ-subadditive if for any choice of countably many sets A, A1, A2, · · · ∈ A with A ⊂ ⋃_{i=1}^∞ Ai we have µ(A) ≤ Σ_{i=1}^∞ µ(Ai).

Set Functions (2)
• Definition (see Klenke, Definition 1.28): Let A be a semiring and let µ : A → [0, ∞] be a set function with µ(∅) = 0. µ is called a:
– content if µ is additive,
– premeasure if µ is σ-additive,
– measure if µ is a premeasure and A is a σ-algebra,
– probability measure if µ is a measure and µ(Ω) = 1.

Set Functions (3)
• Definition (see Klenke, Definition 1.29): Let A be a semiring. A content µ on A is called:
– finite if µ(A) < ∞ for every A ∈ A, and
– σ-finite if there exists a sequence of sets Ω1, Ω2, · · · ∈ A such that Ω = ⋃_{n=1}^∞ Ωn and such that µ(Ωn) < ∞ for all n ∈ N.
• Discuss the examples in Klenke, pages 12 and 13.

Set Functions (4)
• Theorem: Properties of a content (see Klenke, Lemma 1.31) Let A be a semiring and let µ be a content on A. The following statements hold:
– If A is a ring, then µ(A ∪ B) = µ(A) + µ(B) − µ(A ∩ B) for any two sets A, B ∈ A.
– If A is a ring, then µ(B) = µ(A) + µ(B \ A) for any two sets A, B ∈ A with A ⊂ B. In particular, µ is monotone.
– If µ is σ-additive, then µ is also σ-subadditive.
– If A is a ring, then Σ_{n=1}^∞ µ(An) ≤ µ(⋃_{n=1}^∞ An) for any choice of countably many mutually disjoint sets A1, A2, · · · ∈ A with ⋃_{n=1}^∞ An ∈ A.

Set Functions (5)
• Theorem: Inclusion-exclusion formula (see Klenke, Theorem 1.33) Let A be a ring and let µ be a content on A. Let n ∈ N and A1, . . . , An ∈ A. Then the following inclusion and exclusion formulas hold:
– µ(A1 ∪ · · · ∪ An) = Σ_{k=1}^n (−1)^{k−1} Σ_{{i1,...,ik}⊂{1,...,n}} µ(Ai1 ∩ · · · ∩ Aik),
– µ(A1 ∩ · · · ∩ An) = Σ_{k=1}^n (−1)^{k−1} Σ_{{i1,...,ik}⊂{1,...,n}} µ(Ai1 ∪ · · · ∪ Aik).
• The inner summation is over all subsets of {1, . . . , n} with k elements.

Set Functions (6)
• Definition (see Klenke, Definition 1.34): Let A1, A2, . . . be sets. We write
– An ↑ A and say that (An)n∈N increases to A if A1 ⊂ A2 ⊂ . . . and ⋃_{n=1}^∞ An = A, and
– An ↓ A and say that (An)n∈N decreases to A if A1 ⊃ A2 ⊃ . . . and ⋂_{n=1}^∞ An = A.

Set Functions (7)
• Definition: Limits (see Klenke, Definition 1.13) Let A1, A2, . . . be subsets of Ω. The sets
– lim inf_{n→∞} An := ⋃_{n=1}^∞ ⋂_{m=n}^∞ Am, and
– lim sup_{n→∞} An := ⋂_{n=1}^∞ ⋃_{m=n}^∞ Am
are called limes inferior and limes superior of the sequence (An).
• lim inf_{n→∞} An can be written as lim inf_{n→∞} An = {ω ∈ Ω : #{n ∈ N : ω ∉ An} < ∞}.
• lim sup_{n→∞} An can be written as lim sup_{n→∞} An = {ω ∈ Ω : #{n ∈ N : ω ∈ An} = ∞}.

Set Functions (8)
• Definition: Continuity of contents (see Klenke, Definition 1.35) Let µ be a content on the ring A:
– µ is called lower semicontinuous if (for n → ∞) µ(An) → µ(A) for any A ∈ A and any sequence (An)n∈N in A with An ↑ A.
– µ is called upper semicontinuous if (for n → ∞) µ(An) → µ(A) for any A ∈ A and any sequence (An)n∈N in A with µ(An) < ∞ for some (and then eventually all) n ∈ N and An ↓ A.
– µ is called ∅-continuous if upper semicontinuity holds for A = ∅.

Set Functions (9)
• Theorem: Continuity and premeasure (see Klenke, Theorem 1.36) Let µ be a content on the ring A.
Consider the five properties:
– (i) µ is σ-additive (and hence a premeasure).
– (ii) µ is σ-subadditive.
– (iii) µ is lower semicontinuous.
– (iv) µ is ∅-continuous.
– (v) µ is upper semicontinuous.
• Then the following implications hold: (i) ⇔ (ii) ⇔ (iii) ⇒ (iv) ⇔ (v). If µ is finite, then we also have (iii) ⇐ (iv).
• Discuss: what does this result imply for probability measures on a σ-field?

Set Functions (10)
• Definition: Measurable sets, measure space (see Klenke, Definition 1.38)
– A pair (Ω, A) consisting of a nonempty set Ω and a σ-algebra A ⊂ 2^Ω is called a measurable space. The sets A ∈ A are called measurable sets. If Ω is at most countably infinite and A = 2^Ω, then the measurable space (Ω, 2^Ω) is called discrete.
– A triple (Ω, A, µ) is called a measure space if (Ω, A) is a measurable space and µ is a measure on A.
– If in addition µ(Ω) = 1, then (Ω, A, µ) is called a probability space. The sets A ∈ A are called events.
– The set of all finite measures on (Ω, A) is denoted by Mf(Ω) = Mf(Ω, A). The subset of probability measures is denoted by M1(Ω) = M1(Ω, A). Mσ(Ω) = Mσ(Ω, A) stands for the set of σ-finite measures.

Measure Extension Theorem (1)
• We already defined and considered:
– Classes of sets: ring, semiring, σ-algebra.
– Borel σ-algebra.
– Definitions of measure, content, premeasure.
– Measure space, probability space, measurable set.
• The goal is now to construct measures on σ-algebras. To do this we construct measures on a semiring. By the extension theorem we obtain a measure on the whole σ-algebra.

Measure Extension Theorem (2)
• Example: Lebesgue measure. Let n ∈ N and let A = {(a, b] : a, b ∈ R^n, a < b} be the semiring of half-open rectangles (a, b] ⊂ R^n. The n-dimensional volume of such a rectangle is
µ((a, b]) = ∏_{i=1}^n (bi − ai).
• Can we extend µ((a, b]) to a uniquely determined measure on the Borel σ-algebra B(R^n) = σ(A)?
• The resulting measure is called the Lebesgue measure λ (or λ^n) on (R^n, B(R^n)).

Measure Extension Theorem (3)
• Theorem: Carathéodory measure extension theorem (see Klenke, Theorem 1.41) Let A ⊂ 2^Ω be a ring and let µ be a σ-finite premeasure on A. There exists a unique measure µ̃ on σ(A) such that µ̃(A) = µ(A) for all A ∈ A. Furthermore, µ̃ is σ-finite.
• Theorem: Extension theorem (see Klenke, Theorem 1.53) Let A ⊂ 2^Ω be a semiring and let µ : A → [0, ∞] be an additive, σ-subadditive and σ-finite set function with µ(∅) = 0. Then there exists a unique σ-finite measure µ̃ : σ(A) → [0, ∞] such that µ̃(A) = µ(A) for all A ∈ A.

Measure Extension Theorem (4)
• Theorem: Lebesgue measure (see Klenke, Theorem 1.55) There exists a uniquely determined measure λ^n on (R^n, B(R^n)) with the property that
λ^n((a, b]) = ∏_{i=1}^n (bi − ai)
for all a, b ∈ R^n, a < b. λ^n is called the Lebesgue measure (or Lebesgue-Borel measure) on (R^n, B(R^n)).
• Definition: Lebesgue-Stieltjes measure. Consider a monotone increasing and right-continuous function F. The measure defined by µF((a, b]) = F(b) − F(a) on (R, B(R)) is called the Lebesgue-Stieltjes measure with distribution function F (see Examples 1.56 and 1.57).

Measure Extension Theorem (5)
• If F(x) = x, then µF is equal to the Lebesgue measure.
• Let f : R → [0, ∞) be continuous and let F(x) = ∫_0^x f(t) dt for all x. Then µF is the extension of the premeasure with density f.
• Let x1, x2, · · · ∈ R and αn > 0 for all n ∈ N such that Σ_{n=1}^∞ αn < ∞. Then F = Σ_{n=1}^∞ αn 1_{[xn,∞)} is the distribution function of the finite measure µ = Σ_{n=1}^∞ αn δ_{xn}.
• If lim_{x→∞}(F(x) − F(−x)) = 1, then µF is a probability measure.
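The two assignments above are elementary to evaluate on the generating semiring; what the extension theorems provide is the unique extension to the whole Borel σ-algebra. A minimal sketch (function names are mine):

```python
from math import prod

def rect_volume(a, b):
    """n-dimensional volume of the half-open rectangle (a, b] = (a1,b1] x ... x (an,bn]."""
    return prod(bi - ai for ai, bi in zip(a, b))

def stieltjes(F, a, b):
    """Lebesgue-Stieltjes content: mu_F((a, b]) = F(b) - F(a), F right-continuous increasing."""
    return F(b) - F(a)

assert rect_volume((0, 0), (2, 3)) == 6          # area of (0,2] x (0,3]
assert stieltjes(lambda x: x, 2.0, 5.0) == 3.0   # F(x) = x recovers interval length

step = lambda x: 1.0 if x >= 0 else 0.0          # F = 1_[0,∞): unit point mass at 0
assert stieltjes(step, -1.0, 0.0) == 1.0         # (−1, 0] contains the atom at 0
assert stieltjes(step, 0.0, 1.0) == 0.0          # (0, 1] does not
```

The step-function example is the simplest instance of the third bullet above: F = Σ αn 1_{[xn,∞)} with a single atom α1 = 1 at x1 = 0 gives the Dirac measure δ₀.
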
Measure Extension Theorem (6)
• Definition: Distribution function (see Klenke, Definition 1.59)
– A right-continuous monotone increasing function F : R → [0, 1] with F(−∞) := lim_{x→−∞} F(x) = 0 and F(∞) := lim_{x→∞} F(x) = 1 is called a probability distribution function.

Measure Extension Theorem (7)
• Theorem: Finite products of measures (see Klenke, Theorem 1.61)
– Let n ∈ N and let µ1, . . . , µn be finite measures or Lebesgue-Stieltjes measures on (R, B(R)). Then there exists a unique σ-finite measure µ on (R^n, B(R^n)) such that
µ((a, b]) = ∏_{i=1}^n µi((ai, bi])
for all a, b ∈ R^n, a < b. µ = ⊗_{i=1}^n µi is called the product measure.

Measure Extension Theorem (8)
• Definition: Null set (see Klenke, Definition 1.68)
– A set A is called a µ-null set if µ(A) = 0. By Nµ we denote the class of all subsets of µ-null sets.
– Let E(ω) be a property that a point ω ∈ Ω has or does not have. We say that E holds µ-almost everywhere (a.e.), or for almost all ω, if there exists a null set N such that E(ω) holds for every ω ∈ Ω \ N. If A ∈ A and there exists a null set N such that E(ω) holds for every ω ∈ A \ N, then E holds almost everywhere on A.
– If µ = P is a probability measure, then we say that E holds almost surely (on Ω, on A).
– Let A, B ∈ A and assume that there is a null set N such that A∆B ⊂ N. Then A = B mod µ.

Measure Extension Theorem (9)
• Definition: Complete measure space (see Klenke, Definition 1.69)
– A measure space (Ω, A, µ) is complete if Nµ ⊂ A.
– If a measure space does not contain all the null sets, it can be made complete by adding them. For more details see Klenke, pages 33-34, and further literature.

Measurable Maps (1)
• Structure-preserving maps (homomorphisms):
– Continuous maps for topological spaces.
– Measurable maps for measurable spaces.
• Definition: Measurable maps (see Klenke, Definition 1.76)
– Let (Ω, A) and (Ω', A') be measurable spaces. A map X : Ω → Ω' is called A-A' measurable if X⁻¹(A') := {X⁻¹(A') : A' ∈ A'} ⊂ A, i.e. X⁻¹(A') ∈ A for any A' ∈ A'. If X is measurable, we write X : (Ω, A) → (Ω', A').
– If Ω' = R and A' = B(R) is the Borel σ-algebra on R, i.e. X : (Ω, A) → (R, B(R)), then X is called an A-measurable real map.

Measurable Maps (2)
• Examples:
– The identity map id : Ω → Ω is A-A measurable.
– If A = 2^Ω or A' = {∅, Ω'}, then any map X : Ω → Ω' is A-A' measurable.
– The indicator function 1_A : Ω → {0, 1} is A-2^{0,1} measurable if and only if A ∈ A.

Measurable Maps (3)
• Theorem: Generated σ-algebra (see Klenke, Theorem 1.78)
– Let (Ω', A') be a measurable space and let Ω be a nonempty set. Let X : Ω → Ω' be a map. The preimage X⁻¹(A') := {X⁻¹(A') : A' ∈ A'} is the smallest σ-algebra with respect to which X is measurable. σ(X) = X⁻¹(A') is called the σ-algebra on Ω that is generated by X.
• Theorem: Measurability of continuous maps (see Klenke, Theorem 1.88)
– Let (Ω, τ) and (Ω', τ') be topological spaces and let f : Ω → Ω' be a continuous map. Then f is B(Ω)-B(Ω') measurable.

Measurable Maps (4)
• Definition: Simple function (see Klenke, Definition 1.93) Let (Ω, A) be a measurable space. A map f : Ω → R is called a simple function if there are an n ∈ N, mutually disjoint measurable sets A1, . . . , An ∈ A and numbers α1, . . . , αn ∈ R such that
f = Σ_{i=1}^n αi 1_{Ai}.

Measurable Maps (5)
• Definition: Simple functions II (see Klenke, Definition 1.95) Assume that f, f1, f2, . . . are maps Ω → R̄ such that f1(ω) ≤ f2(ω) ≤ . . . and lim_{n→∞} fn(ω) = f(ω) for any ω ∈ Ω. Then (fn)n∈N increases pointwise to f; notation fn ↑ f. Analogously, fn ↓ f stands for pointwise decrease.

Measurable Maps (6)
• Theorem: Simple functions (see Klenke, Theorem 1.96) Let (Ω, A) be a measurable space and f : Ω → [0, ∞] be measurable.
Then the following statements hold:
– There exists a sequence (fn)n∈N of nonnegative simple functions such that fn ↑ f.
– There are measurable sets A1, A2, · · · ∈ A and numbers α1, α2, · · · ≥ 0 such that
f = Σ_{n=1}^∞ αn 1_{An}.
• For more details on measurable maps see the textbook.

Measurable Maps (7)
• Measurable maps transport measures from one space to another.
• Definition: Image measure/push-forward measure (see Klenke, Definition 1.98)
– Let (Ω, A) and (Ω', A') be measurable spaces, let µ be a measure on (Ω, A), and let X : (Ω, A) → (Ω', A') be measurable. The image measure of µ under X is the measure µ' := µ ◦ X⁻¹ on (Ω', A') defined by
µ ◦ X⁻¹ : A' → [0, ∞], A' ↦ µ(X⁻¹(A')).

Measurable Maps (8)
• Theorem: Density transformation formula (see Klenke, Theorem 1.101)
– Let µ be a measure on R^n that has a continuous (or piecewise continuous) density f : R^n → [0, ∞), i.e.
µ((−∞, x]) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xn} f(t1, . . . , tn) dtn . . . dt1
for all x ∈ R^n. Let A ⊂ R^n be any open or closed set with µ(R^n \ A) = 0, and let B ⊂ R^n be any open or closed set. Assume that φ : A → B is a continuously differentiable bijection with derivative φ'. Then the image measure µ ◦ φ⁻¹ has the density
fφ(x) = f(φ⁻¹(x)) / |det(φ'(φ⁻¹(x)))| if x ∈ B, and fφ(x) = 0 if x ∉ B.

Random Variables (1)
• We consider a probability space (Ω, A, P). The sets A ∈ A are called events.
• Definition: Random variable (see Klenke, Definition 1.102) Let (Ω', A') be a measurable space and let X : Ω → Ω' be measurable.
– X is called a random variable with values in (Ω', A'). If (Ω', A') = (R, B(R)), then X is called a real random variable.
– For A' ∈ A' we write {X ∈ A'} := X⁻¹(A') and P(X ∈ A') := P(X⁻¹(A')). In addition, {X ≥ 0} := X⁻¹([0, ∞)) and {X ≤ b} := X⁻¹((−∞, b]), etc.

Random Variables (2)
• Definition: Distributions (see Klenke, Definition 1.103) Consider a random variable X.
– The probability measure P_X := P ◦ X⁻¹ is called the distribution of X.
– For a real random variable X, the map F_X : x ↦ P(X ≤ x) is called the distribution function of X. We write X ∼ µ (or X ∼ F) if µ = P_X, and say that X has distribution µ.
– A family (Xi)i∈I of random variables is called identically distributed if P_{Xi} = P_{Xj} for all i, j ∈ I. We write X =_D Y (equality in distribution) if P_X = P_Y.

Random Variables (3)
• Theorem: Distributions vs. random variables (see Klenke, Theorem 1.104)
– For any distribution function F, there exists a random variable X with F_X = F.
• By this theorem it is sufficient to model the distribution function F; the corresponding random variable need not be modeled explicitly. We know by this theorem that an X with this distribution function has to exist.

Outline - Independence
• Independent events
• Borel-Cantelli lemma
• Independent random variables
• Klenke Chapter 2

Independence of Events (1)
• Definition: Independence of events (see Klenke, Definition 2.3)
– Let I be an arbitrary index set and let (Ai)i∈I be an arbitrary family of events. The family (Ai)i∈I is called independent if for any finite subset J ⊂ I the product formula holds:
P( ⋂_{j∈J} Aj ) = ∏_{j∈J} P(Aj).
• Discuss Examples 2.1 and 2.2 in the textbook.

Independence of Events (2)
• Now we roll a die infinitely often. What is the probability that the face shows 6 infinitely often? It should be one.
• Now we play roulette. What is the probability of {0 infinitely often}? It should also be one.
• Otherwise there would have to be a last 6 or a last zero.
• Remember: A_* := lim inf_{n→∞} An can be written as lim inf_{n→∞} An = {ω ∈ Ω : #{n ∈ N : ω ∉ An} < ∞}.
• A^* := lim sup_{n→∞} An can be written as lim sup_{n→∞} An = {ω ∈ Ω : #{n ∈ N : ω ∈ An} = ∞}.

Independence of Events (3)
• Theorem: Borel-Cantelli lemma (see Klenke, Theorem 2.7): Let A1, A2, . . .
be events and define A^* := lim sup_{n→∞} An. Then:
– If Σ_{n=1}^∞ P(An) < ∞, then P(A^*) = 0. (Here P can be an arbitrary measure on (Ω, A).)
– If (An)n∈N is independent and Σ_{n=1}^∞ P(An) = ∞, then P(A^*) = 1.
• The Borel-Cantelli lemma belongs to the so-called 0-1 laws.
• Discuss Examples 2.8 to 2.10. Example 2.9 demonstrates why independence is important in the second part of the Borel-Cantelli lemma.

Independence of Events (4)
• Definition: Independence of classes of events (see Klenke, Definition 2.11)
– Let I be an arbitrary index set and let Ei ⊂ A for all i ∈ I. The family (Ei)i∈I is called independent if for any finite subset J ⊂ I and any choice of Ej ∈ Ej, j ∈ J, we have
P( ⋂_{j∈J} Ej ) = ∏_{j∈J} P(Ej).

Independence of Events (5)
• Theorem: Independence of classes (see Klenke, Theorem 2.13):
– Let I be finite and for any i ∈ I let Ei ⊂ A with Ω ∈ Ei. Then (Ei)i∈I is independent if and only if P(⋂_{i∈I} Ei) = ∏_{i∈I} P(Ei) holds for any choice of Ei ∈ Ei, i ∈ I.
– (Ei)i∈I is independent if and only if (Ej)j∈J is independent for all finite J ⊂ I.
– If (Ei ∪ {∅})i∈I is ∩-stable, then (Ei)i∈I is independent if and only if (σ(Ei))i∈I is independent.
– Let K be an arbitrary set and let (Ik)k∈K be mutually disjoint subsets of I. If (Ei)i∈I is independent, then ( ⋃_{i∈Ik} Ei )k∈K is also independent.

Independent Random Variables (1)
• We now consider an arbitrary index set I. For each i ∈ I we consider a measurable space (Ωi, Ai) and a random variable Xi : (Ω, A) → (Ωi, Ai) with generated σ-field σ(Xi) = Xi⁻¹(Ai).
• Definition: Independent random variables (see Klenke, Definition 2.14): The family (Xi)i∈I of random variables is called independent if the family (σ(Xi))i∈I of σ-algebras is independent.

Independent Random Variables (2)
• Definition: Joint distribution (see Klenke, Definition 2.20):
– For any i ∈ I let Xi be a real random variable.
For any finite subset J ⊂ I let
F_J(x) := F_{(Xj)j∈J}(x) : R^J → [0, 1], x ↦ P(Xj ≤ xj for all j ∈ J) = P( ⋂_{j∈J} Xj⁻¹((−∞, xj]) ).
F_J is called the joint distribution function of (Xj)j∈J. The probability measure P_{(Xj)j∈J} on R^J is called the joint distribution of (Xj)j∈J.
– Remark: We consider the probability space (Ω, A, P) and the random vector (Xj)j∈J with values in (R^J, B(R^J)). With Aj = (−∞, xj] we get ⋂_{j∈J} Xj⁻¹(Aj) = A ∈ A; P is then applied to A.

Independent Random Variables (3)
• Theorem: Joint distribution (see Klenke, Theorem 2.21):
– A family (Xi)i∈I of real random variables is independent if and only if, for every finite J ⊂ I and every x = (xj)j∈J ∈ R^J,
F_J(x) = ∏_{j∈J} F_{Xj}(xj).

Independent Random Variables (4)
• Theorem: Joint density (see Klenke, Corollary 2.22):
– In addition (to 2.21), assume that every F_J has a continuous density f_J(x) = f_{(Xj)j∈J}(x), i.e. there exists a continuous map f_J : R^J → [0, ∞) such that
F_J(x) = ∫_{−∞}^{x_{j1}} · · · ∫_{−∞}^{x_{jn}} f_J(t1, . . . , tn) dtn . . . dt1
for all x ∈ R^J, where J = {j1, . . . , jn}. In this case the family (Xi)i∈I of real random variables is independent if and only if, for every finite J ⊂ I,
f_J(x) = ∏_{j∈J} f_{Xj}(xj).

Outline - The Integral
• Construction of integrals with respect to a measure µ.
• Properties of the integral.
• Monotone convergence, the lemma of Fatou and the St. Petersburg game.
• Riemann vs. Lebesgue integral.
• Klenke Chapter 4.

Construction of the Integral (1)
• We consider a measure space (Ω, A, µ).
• The goal of this section is to construct an integral with respect to a measure µ.
• We already observed that measurable functions can be approximated by (increasing sequences of) simple functions (see Definition 1.93 and Theorem 1.96). Hence simple functions play an important role in the construction of the integral.
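As a preview of the construction, the elementary integral of a simple function f = Σᵢ αᵢ 1_{Aᵢ} is just Σᵢ αᵢ µ(Aᵢ). A minimal sketch for a measure on a finite Ω given by point masses (names are mine, assuming this discrete setting):

```python
def integral_simple(alphas, sets, mass):
    """I(f) = sum_i alpha_i * mu(A_i) for f = sum_i alpha_i 1_{A_i},
    where mu(A) = sum of the point masses mass[w] over w in A."""
    return sum(a * sum(mass[w] for w in A) for a, A in zip(alphas, sets))

# f = 2 * 1_{1,2} + 5 * 1_{3} on Omega = {1, 2, 3} with mu({w}) = 1 for each w.
mass = {1: 1.0, 2: 1.0, 3: 1.0}
assert integral_simple([2, 5], [{1, 2}, {3}], mass) == 9.0  # 2*2 + 5*1
```

With the counting measure, as the slides note below, the integral reduces to a sum; this sketch is exactly that special case.
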
75

Construction of the Integral (2) Applied Probability
• Let E be the vector space of simple functions (see Definition 1.93) on (Ω, A) and E⁺ = {f ∈ E : f ≥ 0} the cone of nonnegative simple functions.
• If f = Σ_{i=1}^m α_i 1_{A_i} for some m ∈ N, where α_1, …, α_m ∈ [0, ∞) and A_1, …, A_m ∈ A are mutually disjoint sets, then this representation of f is called a normal representation of f. 76

Construction of the Integral (3) Applied Probability
• Theorem, Normal representation (see Klenke, Lemma 4.1):
– If f = Σ_{i=1}^m α_i 1_{A_i} and f = Σ_{j=1}^n β_j 1_{B_j} are two normal representations of f ∈ E⁺, then
Σ_{i=1}^m α_i µ(A_i) = Σ_{j=1}^n β_j µ(B_j).
• Remark: In the next step we construct the integral. By this theorem the value of the integral does not depend on the normal representation we use. 77

Construction of the Integral (4) Applied Probability
• Definition (see Klenke, Definition 4.2):
– Define the map I : E⁺ → [0, ∞] by I(f) = Σ_{i=1}^m α_i µ(A_i), where f has the normal representation Σ_{i=1}^m α_i 1_{A_i}.
• Conventions for infinity: 0 · ∞ = ∞ · 0 = 0; x · ∞ = ∞ · x = ∞ for 0 < x < ∞; ∞ · ∞ = ∞. 78

Construction of the Integral (5) Applied Probability
• Theorem, Properties of I(f) (see Klenke, Lemma 4.3): The map I is positive, linear and monotone increasing. Let f, g ∈ E⁺ and α ≥ 0. Then the following statements hold:
– I(αf) = αI(f).
– I(f + g) = I(f) + I(g).
– If f ≤ g, then I(f) ≤ I(g). 79

Construction of the Integral (6) Applied Probability
• Definition, Integral (see Klenke, Definition 4.4):
– If f : Ω → [0, ∞] is measurable, then we define the integral of f with respect to µ by
∫ f dµ := sup{I(g) : g ∈ E⁺, g ≤ f}.
• If µ is the Lebesgue measure λ we get the Lebesgue integral. If µ is the counting measure, the integral becomes a sum.
• The integral extends the map I from E⁺ to the set of (nonnegative) measurable functions.
• Note that f ≤ g can hold pointwise, i.e. f(ω) ≤ g(ω) for all ω ∈ Ω, or almost surely (almost everywhere), i.e.
f(ω) ≤ g(ω) for all ω ∈ Ω \ N, where N stands for a µ-null set. 80

Construction of the Integral (7) Applied Probability
• Theorem, Properties of the integral (see Klenke, Theorem 4.6): Let f, g, f_1, f_2, … be measurable maps Ω → [0, ∞]. Then:
– Monotonicity: If f ≤ g, then ∫ f dµ ≤ ∫ g dµ.
– Monotone convergence: If f_n ↑ f, then the integrals also converge, ∫ f_n dµ ↑ ∫ f dµ.
– Linearity: If α, β ∈ [0, ∞], then ∫ (αf + βg) dµ = α ∫ f dµ + β ∫ g dµ. 81

Construction of the Integral (8) Applied Probability
• Until now we have considered measurable functions in E⁺. Now we extend this concept to measurable f.
• First, f = f⁺ − f⁻, where f⁺ = max{0, f} and f⁻ = −min{0, f}.
• f⁺, f⁻ ≤ |f|. Hence if ∫ |f| dµ < ∞, then ∫ f⁺ dµ < ∞ and ∫ f⁻ dµ < ∞. 82

Construction of the Integral (9) Applied Probability
• Definition, Integral of measurable functions (see Klenke, Definition 4.7): A measurable function f : Ω → R̄ is called µ-integrable if ∫ |f| dµ < ∞. We write
– L¹(µ) := L¹(Ω, A, µ) = {f : Ω → R̄ : f measurable and ∫ |f| dµ < ∞}.
– For f ∈ L¹(µ) we define the integral of f with respect to µ by
∫ f(ω) dµ(ω) := ∫ f dµ := ∫ f⁺ dµ − ∫ f⁻ dµ.
– If we only have ∫ f⁺ dµ < ∞ or ∫ f⁻ dµ < ∞, the values −∞ and +∞ are possible.
– ∫_A f dµ := ∫ 1_A f dµ for A ∈ A. 83

Construction of the Integral (10) Applied Probability
• Theorem, Properties of the integral (see Klenke, Theorem 4.8): Let f : Ω → [0, ∞] be a measurable map.
– We have f = 0 almost everywhere if and only if ∫ f dµ = 0.
– If ∫ f dµ < ∞, then f < ∞ almost everywhere. 84

Construction of the Integral (11) Applied Probability
• Theorem, Properties of the integral (see Klenke, Theorem 4.9): Let f, g ∈ L¹(µ).
– Monotonicity: If f ≤ g almost everywhere, then ∫ f dµ ≤ ∫ g dµ. If f = g almost everywhere, then ∫ f dµ = ∫ g dµ.
– Triangle inequality: |∫ f dµ| ≤ ∫ |f| dµ.
– Linearity: If α, β ∈ R, then αf + βg ∈ L¹(µ) and ∫ (αf + βg) dµ = α ∫ f dµ + β ∫ g dµ. This equation also holds if at most one of the integrals is infinite.
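With the counting measure on a finite set the integral reduces to a plain sum, so the linearity and triangle-inequality statements of Theorem 4.9 can be checked mechanically. The two functions below are arbitrary illustrations, not taken from the slides.

```python
# Counting measure on Omega = {0,...,7}: the integral of h is sum(h).
omega = range(8)
f = {w: (-1) ** w * w for w in omega}   # takes positive and negative values
g = {w: 2 * w - 5 for w in omega}

def integral(h):
    """Integral w.r.t. the counting measure: sum over Omega."""
    return sum(h[w] for w in omega)

# Linearity: integral(2f + 3g) = 2*integral(f) + 3*integral(g)
lhs = integral({w: 2 * f[w] + 3 * g[w] for w in omega})
rhs = 2 * integral(f) + 3 * integral(g)

# Triangle inequality: |integral(f)| <= integral(|f|)
triangle = abs(integral(f)) <= integral({w: abs(f[w]) for w in omega})

print(lhs == rhs, triangle)  # True True
```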
85

Construction of the Integral (12) Applied Probability
• Theorem, Image measure, change of variable (see Klenke, Theorem 4.10):
– Let (Ω, A) and (Ω', A') be measurable spaces and let µ be a measure on (Ω, A). X : (Ω, A) → (Ω', A') is measurable and µ' = µ ◦ X⁻¹ is the image measure (push-forward measure) of µ under X (see Klenke, Definition 1.98). Assume that f : Ω' → R̄ is µ'-integrable. Then f ◦ X ∈ L¹(µ),
∫_Ω (f ◦ X) dµ = ∫_{Ω'} f(ω') d(µ ◦ X⁻¹)(ω') = ∫_{Ω'} f(ω') dµ'(ω')
and
∫_{X⁻¹(A')} (f ◦ X) dµ = ∫_{A'} f(ω') d(µ ◦ X⁻¹)(ω') = ∫_{A'} f(ω') dµ'(ω'). 86

Construction of the Integral (13) Applied Probability
Ad change of variable formula:
• If X is a random variable on (Ω, A, P), then P_X = P ◦ X⁻¹ and
∫_Ω f ◦ X(ω) dP(ω) = ∫_Ω f(X(ω)) dP(ω) = ∫_{Ω'} f(ω') d(P ◦ X⁻¹)(ω') = ∫_{Ω'} f(ω') dP_X(ω').
• Suppose that (Ω', A') = (R, B(R)), X is a real-valued function φ and f(x) = x is the identity; then ∫_Ω φ(ω) µ(dω) = ∫_R x (µ ◦ φ⁻¹)(dx).
• Next, let f = sin, (Ω, A) = (Ω', A') = (R, B(R)), X : λ ↦ 2λ, and let µ be the Lebesgue measure λ. Then x = 2λ describes X, and X⁻¹ is given by λ = x/2. Let A = [0, π]; then A' = [0, 2π]. Moreover, µ' = λ ◦ X⁻¹ has density 1/2 with respect to λ, and
∫_0^π sin(2λ) dλ = ∫_0^{2π} sin(x) (1/2) dx.
• Also the density transformation formula of Theorem 1.101 follows from the above theorem. Theorem 4.15 in Klenke is a further version of this result. 87

Construction of the Integral (14) Applied Probability
• Let (Ω, A) be a discrete measurable space and let µ = Σ_{ω∈Ω} α_ω δ_ω for certain numbers α_ω ≥ 0. A map f is integrable if Σ_{ω∈Ω} |f(ω)| α_ω < ∞. In this case
∫ f dµ = Σ_{ω∈Ω} f(ω) α_ω.
• Example: ω_0, ω_1, …, ω_36 are the events when we consider the roulette wheel. If the wheel is fair then p_i = 1/37, i.e. α_ω = 1/37. f(ω) is the gain/loss given some strategy. Assume that we bet 1 EURO on zero. Then f = 35 if ω_0 realizes, while f = −1 with the other ω_i. Then
Σ_{ω∈Ω} f(ω) α_ω = (1/37)·35 − (36/37)·1 = (1/37)·(35 − 36) = −1/37.
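The roulette computation above can be reproduced with exact rational arithmetic; the setup follows the slide (point masses 1/37 on the outcomes 0, …, 36; payoff 35 on zero, −1 otherwise).

```python
from fractions import Fraction

# Discrete measure mu = sum of point masses 1/37 on the outcomes 0..36;
# f is the payoff of a 1 EURO bet on zero.
alpha = Fraction(1, 37)

def payoff(w):
    return 35 if w == 0 else -1

# Integral w.r.t. the discrete measure: sum of f(w) * alpha_w.
expected_gain = sum(payoff(w) * alpha for w in range(37))
print(expected_gain)  # -1/37
```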
88

Construction of the Integral (15) Applied Probability
• Definition, Lebesgue integral (see Klenke, Definition 4.12):
– Let λ be the Lebesgue measure on Rⁿ and let f : Rⁿ → R be measurable with respect to B*(Rⁿ)-B(R) (see Klenke, page 33) and λ-integrable. Then we call ∫ f dλ the Lebesgue integral of f. If A ∈ B*(Rⁿ) and f is measurable, then we write
∫_A f dλ = ∫ 1_A f dλ. 89

Construction of the Integral (16) Applied Probability
• Definition, Lebesgue density (see Klenke, Definition 4.13):
– Let µ be a measure on (Ω, A) and let f : Ω → [0, ∞) be a measurable map. Define ν(A) = ∫ 1_A f dµ for A ∈ A. Then ν has the density f with respect to µ.
– If µ = λ, then ν has a density with respect to the Lebesgue measure.
• Prominent examples of densities with respect to the Lebesgue measure are the normal, the Student-t, the gamma and the exponential density. 90

Construction of the Integral (17) Applied Probability
• Definition (see Klenke, Definition 4.16):
– For a measurable f : Ω → R̄ define
||f||_p := (∫ |f|^p dµ)^{1/p} if p ∈ [1, ∞) and ||f||_∞ := inf{K ≥ 0 : µ({|f| > K}) = 0}.
– For p ∈ [1, ∞] we define the vector space L^p(µ) := {f : Ω → R̄ : f measurable, ||f||_p < ∞}. 91

Construction of the Integral (18) Applied Probability
• Theorem (see Klenke, Theorem 4.17):
• The map f ↦ ||f||_1 is a seminorm on the vector space L¹(µ). That is to say, for all f, g ∈ L¹(µ) and α ∈ R we observe:
– ||αf||_1 = |α| ||f||_1,
– ||f + g||_1 ≤ ||f||_1 + ||g||_1,
– ||f||_1 ≥ 0 for all f, and ||f||_1 = 0 if f = 0 almost everywhere.
• Since ||f||_1 = 0 does not imply f(ω) = 0 for all ω ∈ Ω, we only obtain a seminorm.
• In addition it can be shown that L^p(µ) ⊂ L^{p'}(µ) for 1 ≤ p' ≤ p ≤ ∞ if µ is a finite measure (Klenke, Theorem 4.19). 92

Integral and Limits (1) Applied Probability
• We now investigate the question whether the limit and the integral can be interchanged.
• Two criteria are the monotone convergence theorem and the Lemma of Fatou.
• The St.
Petersburg game is a very prominent example where the limit of the integral is not the integral of the limit. We shall meet the St. Petersburg game again when we talk about martingales (fair games) and martingale convergence theorems.
• Remark: The St. Petersburg game has been investigated by Jakob Bernoulli, Ars conjectandi (1713, published posthumously). 93

Integral and Limits (2) Applied Probability
• Theorem, Monotone convergence [Beppo Levi] (see Klenke, Theorem 4.20): Let f_1, f_2, · · · ∈ L¹(µ) and let f : Ω → R̄ be measurable. Assume that f_n ↑ f almost everywhere for n → ∞. Then
lim_{n→∞} ∫ f_n dµ = ∫ f dµ,
where both sides can equal +∞. 94

Integral and Limits (3) Applied Probability
• Theorem, Fatou's Lemma (see Klenke, Theorem 4.21): Let f ∈ L¹(µ) and let f_1, f_2, … be measurable with f_n ≥ f almost everywhere for all n ∈ N. Then
∫ lim inf_{n→∞} f_n dµ ≤ lim inf_{n→∞} ∫ f_n dµ.
• By this lemma it also follows that
lim sup_{n→∞} ∫ f_n dµ ≤ ∫ lim sup_{n→∞} f_n dµ
(given that there is an integrable majorant g, i.e. f_n ≤ g) and
∫ lim inf_{n→∞} f_n dµ ≤ lim inf_{n→∞} ∫ f_n dµ ≤ lim sup_{n→∞} ∫ f_n dµ ≤ ∫ lim sup_{n→∞} f_n dµ. 95

Integral and Limits (4) Applied Probability
• St. Petersburg game
– Consider a gamble, e.g. roulette. To make it simple we only consider bets on black or red. When we bet on red the probability to win is p = 18/37 < 1/2.
– Suppose that this game is played again and again. Then we have a probability space (Ω, A, P) with Ω = {−1, 1}^N, A = (2^{{−1,1}})^{⊗N} and P = ((1 − p)δ_black + pδ_red)^{⊗N} = ((1 − p)δ_{−1} + pδ_1)^{⊗N}. D_n : Ω → {−1, 1}, ω ↦ ω_n, is the result of the n-th game. 96

Integral and Limits (5) Applied Probability
• St. Petersburg game
– The player plays a so-called doubling strategy: In more detail, H_1 = 1 is the amount invested in the first round. In round i the player bets H_i = 2^{i−1}. If red realizes, then he wins H_i. The player stops playing after the first win. In more formal terms: H_n = 0 for all n ≥ 2 if D_1 = 1; H_n = 0 if there is some D_i = 1, i = 1, …, n − 1.
H_n = 2^{n−1} else, i.e. if D_i = −1 for all i = 1, …, n − 1. Note that H_n depends on D_1, …, D_{n−1} only; therefore it is σ(D_1, …, D_{n−1})-measurable.
– The cumulated gain is S_n = Σ_{i=1}^n H_i D_i. 97

Integral and Limits (7) Applied Probability
• St. Petersburg game
– The probability of no win until the n-th game is P(D_1 = −1 ∩ ⋯ ∩ D_n = −1) = (1 − p)^n. Therefore P(S_n = 1 − 2^n) = (1 − p)^n and P(S_n = 1) = 1 − (1 − p)^n. Hence
∫ S_n dP = (1 − p)^n (1 − 2^n) + (1 − (1 − p)^n) · 1 = 1 − (2(1 − p))^n ≤ 0 for p ≤ 1/2.
Taking limits yields −∞ for p < 1/2 and 0 for p = 1/2.
– Hence, lim_{n→∞} ∫ S_n dP ≤ 0. 98

Integral and Limits (8) Applied Probability
• St. Petersburg game
– The limit S can be −∞ or 1. The probability that S = −∞ is lim_{n→∞} (1 − p)^n = 0, while S = 1 with probability 1. (Let A_n be the set where S_n = 1. Since Σ_n P(A_n^c) = Σ_n (1 − p)^n < ∞, the Borel-Cantelli lemma yields that lim sup_{n→∞} A_n^c has probability zero; hence almost surely S_n = 1 for all large n.) Then
∫ S dP = (−∞) · 0 + 1 · 1 = 1.
– Hence lim_{n→∞} ∫ S_n dP < ∫ S dP. 99

Integral and Limits (9) Applied Probability
• St. Petersburg game
– In the Lemma of Fatou an integrable minorant has been assumed (see f ∈ L¹(µ) in Theorem 4.21). In the St. Petersburg game there is no integrable minorant S̃ for (S_n) (i.e. no integrable S̃ with S_n ≥ S̃ for all n ∈ N).
– Define S̃ := inf{S_n : n ∈ N}; then P(S̃ = 1 − 2^{n−1}) = p(1 − p)^{n−1} and
∫ S̃ dP = Σ_{n=1}^∞ p(1 − p)^{n−1} (1 − 2^{n−1}) = −∞ for p ≤ 1/2. 100

Lebesgue vs. Riemann Integral (1) Applied Probability
• What is the difference between the Lebesgue and the Riemann integral?
• Note that we defined the Lebesgue integral of f with respect to µ = λ by
∫ f dλ := sup{I(g) : g ∈ E⁺, g ≤ f}.
• Let J = [a, b] be an interval in R and λ the Lebesgue measure on J.
• Consider a sequence (t^n)_{n∈N} of partitions t^n = (t^n_i)_{i=0,…,n}, where t^n_0 = a < t^n_1 < ⋯ < t^n_n = b, that becomes finer with increasing n: max_i (t^n_i − t^n_{i−1}) → 0 for n → ∞. 101

Lebesgue vs.
Riemann Integral (2) Applied Probability
• For any f : J → R and any n ∈ N define the lower and the upper Riemann sum:
L^t_n(f) := Σ_{i=1}^n inf f([t^n_{i−1}, t^n_i)) (t^n_i − t^n_{i−1})
and
U^t_n(f) := Σ_{i=1}^n sup f([t^n_{i−1}, t^n_i)) (t^n_i − t^n_{i−1}). 102

Lebesgue vs. Riemann Integral (3) Applied Probability
• Definition: Riemann integrability.
– f is Riemann integrable if there exists a sequence of partitions t such that the limits of the lower and the upper Riemann sums are equal and finite. (In this case the limits do not depend on the choice of t.) We write
∫_a^b f(x) dx = lim_{n→∞} L^t_n(f) = lim_{n→∞} U^t_n(f). 103

Lebesgue vs. Riemann Integral (4) Applied Probability
• Theorem, Riemann and Lebesgue integral (see Klenke, Theorem 4.23):
– Let f : J → R be Riemann integrable on J = [a, b]. Then f is Lebesgue integrable on J with integral
∫_a^b f(x) dx = ∫_J f dλ.
• It can be shown that a bounded function is Riemann integrable if and only if it is continuous almost everywhere, i.e. the set of its points of discontinuity has Lebesgue measure zero (see e.g. Billingsley, 1986; Heuser, 1993, Chapter 17 in the first and Chapters 83-84 in the second book). 104

Lebesgue vs. Riemann Integral (5) Applied Probability
• Example, A function which is Lebesgue integrable but not Riemann integrable (see Klenke, Example 4.24):
– Let f : [0, 1] → R, x ↦ 1_{x∈Q}. Here L_n(f) = 0 and U_n(f) = 1, hence this function is not Riemann integrable. The Lebesgue integral is ∫_{[0,1]} 1_{x∈Q} dλ = 0, since Q ∩ [0, 1] has Lebesgue measure zero. 105

Lebesgue vs. Riemann Integral (6) Applied Probability
• Example, An improperly Riemann integrable function which is not Lebesgue integrable (see Klenke, Example 4.25):
– An improper integral is defined by means of ∫_0^∞ f(x) dx = lim_{n→∞} ∫_0^n f(x) dx. It can be shown that ∫_0^∞ sin(x)/(1 + x) dx exists, while ∫_{[0,∞)} |f| dλ = ∞. 106

Lebesgue vs. Riemann Integral (7) Applied Probability
• Theorem, Properties of the integral (see Klenke, Theorem 4.26):
– Let f : Ω → R be measurable and f ≥ 0 almost everywhere.
Then
Σ_{n=1}^∞ µ({f ≥ n}) ≤ ∫ f dµ ≤ Σ_{n=0}^∞ µ({f > n})
and
∫ f dµ = ∫_0^∞ µ({f ≥ t}) dt. 107

Outline - Expected Value, LLN, Inequalities Applied Probability
• Expected value by using the concept of the integral.
• The Cauchy-Schwarz and the Markov inequality.
• The weak and the strong law of large numbers.
• Klenke, Chapter 5. 108

Moments (1) Applied Probability
• Definition (see Klenke, Definition 5.1): Consider a probability space (Ω, A, P). Let X be a real-valued random variable.
– If X ∈ L¹(P), then X is called integrable and we call E(X) = ∫ X dP the expectation or mean of X. If E(X) = 0, then X is called centered.
– If n ∈ N and X ∈ Lⁿ(P), then the quantities m_k := E(X^k) and M_k := E(|X|^k) for k = 1, …, n are called the k-th moments and k-th absolute moments of X. 109

Moments (2) Applied Probability
• Definition (see Klenke, Definition 5.1): Consider a probability space (Ω, A, P). Let X be a real-valued random variable.
– If X ∈ L²(P), then X is called square integrable and V(X) = E(X²) − E(X)² is the variance of X. The number σ := √V(X) is called the standard deviation of X. (In the textbook sometimes V(X) = ∞ if E(X²) = ∞ is used.)
– If X, Y ∈ L²(P), then we define the covariance of X and Y by
Cov(X, Y) = E((X − E(X))(Y − E(Y))).
X and Y are called uncorrelated if Cov(X, Y) = 0 and correlated otherwise. 110

Moments (3) Applied Probability
• Theorem, Rules for expectations (see Klenke, Theorem 5.3): Let X, Y, X_n, Z_n, n ∈ N, be real integrable random variables on (Ω, A, P).
– If P_X = P_Y, then E(X) = E(Y).
– Linearity: Let c ∈ R. Then cX ∈ L¹(P) and X + Y ∈ L¹(P), as well as E(cX) = cE(X) and E(X + Y) = E(X) + E(Y).
– If X ≥ 0 almost surely, then E(X) = 0 if and only if X = 0 almost surely.
– Monotonicity: If X ≤ Y almost surely, then E(X) ≤ E(Y), with equality if and only if X = Y almost surely.
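The definitions of mean and variance and the linearity rule can be verified exactly for a small discrete example; the fair die below is an illustration not taken from the slides.

```python
from fractions import Fraction

# Fair die: P({i}) = 1/6 for i = 1,...,6; expectations computed exactly.
p = Fraction(1, 6)
support = range(1, 7)

def E(g):
    """Expectation E(g(X)) with respect to the uniform law on {1,...,6}."""
    return sum(g(i) * p for i in support)

EX = E(lambda i: i)                 # mean
VX = E(lambda i: i**2) - EX**2      # variance via V(X) = E(X^2) - E(X)^2
print(EX, VX)  # 7/2 35/12

# Linearity: E(2X + 3) = 2 E(X) + 3
print(E(lambda i: 2 * i + 3) == 2 * EX + 3)  # True
```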
111

Moments (4) Applied Probability
• Theorem, Rules for expectations (see Klenke, Theorem 5.3): Let X, Y, X_n, Z_n, n ∈ N, be real integrable random variables on (Ω, A, P).
– Triangle inequality: |E(X)| ≤ E(|X|).
– If X_n ≥ 0 almost surely for all n ∈ N, then E(Σ_{n=1}^∞ X_n) = Σ_{n=1}^∞ E(X_n).
– If Z_n ↑ Z for some Z, then E(Z) = lim_{n→∞} E(Z_n) ∈ (−∞, ∞]. 112

Moments (5) Applied Probability
• Theorem, Independent vs. uncorrelated (see Klenke, Theorem 5.4):
– Let X, Y ∈ L¹(P) be independent. Then XY ∈ L¹(P), E(XY) = E(X)E(Y) and Cov(X, Y) = 0, i.e. X and Y are uncorrelated. 113

Moments (6) Applied Probability
• Theorem, Wald's identity (see Klenke, Theorem 5.5):
– Let T, X_1, X_2, … be independent real random variables in L¹(P). Let P(T ∈ N_0) = 1 and assume that X_1, X_2, … are identically distributed. Define S_T := Σ_{i=1}^T X_i. Then S_T ∈ L¹(P) and E(S_T) = E(T)E(X_1). 114

Moments (7) Applied Probability
• Theorem, Properties of the variance (see Klenke, Theorem 5.6): Let X ∈ L²(P). Then:
– V(X) = E((X − E(X))²) ≥ 0.
– V(X) = 0 if and only if X = E(X) almost surely.
– The map f : R → R, x ↦ E((X − x)²), is minimal at x_0 = E(X), with f(E(X)) = V(X). 115

Moments (8) Applied Probability
• Theorem, Covariance (see Klenke, Theorem 5.7): The map Cov : L²(P) × L²(P) → R is a positive semidefinite symmetric bilinear form, and Cov(X, Y) = 0 if Y is almost surely constant. The detailed version of this concise statement is: Let X_1, …, X_m, Y_1, …, Y_n ∈ L²(P) and α_1, …, α_m, β_1, …, β_n ∈ R as well as d, e ∈ R. Then
Cov(d + Σ_{i=1}^m α_i X_i, e + Σ_{j=1}^n β_j Y_j) = Σ_{i,j} α_i β_j Cov(X_i, Y_j).
In particular V(αX) = α²V(X) for α ∈ R and X ∈ L²(P), and the Bienaymé formula holds:
V(Σ_{i=1}^m X_i) = Σ_{i=1}^m V(X_i) + Σ_{i,j=1; i≠j}^m Cov(X_i, X_j).
For uncorrelated X_1, …, X_m we have V(Σ_{i=1}^m X_i) = Σ_{i=1}^m V(X_i). 116

Moments (9) Applied Probability
• Theorem, Cauchy-Schwarz inequality (see Klenke, Theorem 5.9): If X, Y ∈ L²(P), then
(Cov(X, Y))² ≤ V(X)V(Y).
Equality holds if and only if there are a, b, c ∈ R with |a| + |b| + |c| > 0 such that aX + bY + c = 0 almost surely. 117

Moments (10) Applied Probability
• Theorem, Blackwell-Girshick (see Klenke, Theorem 5.10):
– Let T, X_1, X_2, … be independent real random variables in L²(P). Let P(T ∈ N_0) = 1 and assume that X_1, X_2, … are identically distributed. Define S_T := Σ_{i=1}^T X_i. Then S_T ∈ L²(P) and
V(S_T) = V(T)E(X_1)² + E(T)V(X_1). 118

The Weak Law of Large Numbers (1) Applied Probability
• Theorem, Markov inequality, Chebyshev inequality (see Klenke, Theorem 5.11):
– Let X be a real random variable and let f : [0, ∞) → [0, ∞) be monotone increasing. Then for any ε with f(ε) > 0 the Markov inequality holds:
P(|X| ≥ ε) ≤ E(f(|X|)) / f(ε).
In the special case f(x) = x² we get P(|X| ≥ ε) ≤ E(X²)/ε². In particular, if X ∈ L²(P) the Chebyshev inequality holds:
P(|X − E(X)| ≥ ε) ≤ V(X)/ε². 119

The Weak Law of Large Numbers (2) Applied Probability
• Definition, Law of large numbers (see Klenke, Definition 5.12): Let (X_n)_{n∈N} be a sequence of real random variables in L¹(P) and let S̃_n := Σ_{i=1}^n (X_i − E(X_i)).
– We say that (X_n)_{n∈N} fulfills the weak law of large numbers if
lim_{n→∞} P(|S̃_n|/n > ε) = 0 for any ε > 0.
– We say that (X_n)_{n∈N} fulfills the strong law of large numbers if
P(lim sup_{n→∞} |S̃_n|/n = 0) = 1. 120

The Weak Law of Large Numbers (3) Applied Probability
• Theorem, Weak law of large numbers (see Klenke, Theorem 5.14):
– Let X_1, X_2, … be uncorrelated random variables in L²(P) with V := sup_{n∈N} V(X_n) < ∞. Then (X_n)_{n∈N} fulfills the weak law of large numbers. More precisely, for any ε > 0 we have
P(|S̃_n|/n ≥ ε) ≤ V/(nε²). 121

The Strong Law of Large Numbers (1) Applied Probability
• Theorem, Strong law of large numbers (see Klenke, Theorem 5.16):
– Let X_1, X_2, · · · ∈ L²(P) be pairwise independent (that is, X_i and X_j are independent for all i, j ∈ N with i ≠ j) and identically distributed.
Then (X_n)_{n∈N} fulfills the strong law of large numbers. 122

The Strong Law of Large Numbers (2) Applied Probability
• Theorem, Etemadi's strong law of large numbers (see Klenke, Theorem 5.17):
– Let X_1, X_2, · · · ∈ L¹(P) be pairwise independent (that is, X_i and X_j are independent for all i, j ∈ N with i ≠ j) and identically distributed. Then (X_n)_{n∈N} fulfills the strong law of large numbers. 123

The Strong Law of Large Numbers (3) Applied Probability
• Example, Monte Carlo integration (see Klenke, Example 5.21):
– Let f : [0, 1] → R be a function. We want to determine the value of the integral I = ∫_0^1 f(x) dx numerically.
– Generate pseudo-random numbers X_1, …, X_n uniformly on [0, 1].
– Î_n := (1/n) Σ_{i=1}^n f(X_i) is an estimate of I.
– Given that f ∈ L¹([0, 1]), the strong law of large numbers yields Î_n → I almost surely. 124

The Strong Law of Large Numbers (4) Applied Probability
• Example, Monte Carlo integration (see Klenke, Example 5.21):
– Without further assumptions we do not know how fast Î_n converges to I. If f ∈ L²([0, 1]), then V_1 = ∫ f²(x) dx − I² can be obtained. The Chebyshev inequality yields
P(|Î_n − I| > ε n^{−1/2}) ≤ V_1/ε².
I.e. the error is of the order n^{−1/2}.
– In the literature different methods to reduce the variance of Î_n are available. One important example is importance sampling. See e.g. Robert and Casella (1999). 125

The Strong Law of Large Numbers (5) Applied Probability
• Definition, Empirical distribution function (see Klenke, Definition 5.22):
– Let X_1, X_2, … be real random variables. The map F_n : R → [0, 1], x ↦ (1/n) Σ_{i=1}^n 1_{(−∞,x]}(X_i), is called the empirical distribution function of X_1, …, X_n. 126

The Strong Law of Large Numbers (6) Applied Probability
• Theorem, Glivenko-Cantelli (see Klenke, Theorem 5.23):
– Let X_1, X_2, … be i.i.d. real random variables with distribution function F and let F_n, n ∈ N, be the empirical distribution functions. Then
lim_{n→∞} sup_{x∈R} |F_n(x) − F(x)| = 0 almost surely.
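The Glivenko-Cantelli theorem can be observed numerically: for uniform draws on [0, 1] the true distribution function is F(x) = x, and the sup distance between F_n and F shrinks as n grows. The sketch below (sample sizes and seed are arbitrary choices) exploits that for sorted data the supremum is attained at the jump points of F_n.

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility

def ecdf_sup_dist(n):
    """sup_x |F_n(x) - x| for n uniform draws on [0, 1]."""
    xs = sorted(random.random() for _ in range(n))
    # F_n jumps from i/n to (i+1)/n at the i-th order statistic, so the
    # supremum over x is attained at these jump points.
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

d_small, d_large = ecdf_sup_dist(100), ecdf_sup_dist(100_000)
print(d_small, d_large)  # the second value is much smaller
```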
127

Outline - Convergence Theorems Applied Probability
• Almost sure convergence.
• Convergence in probability (convergence in measure).
• Mean convergence (L¹ convergence).
• Uniform integrability.
• Klenke, Chapter 6. 128

Almost Sure and Measure Convergence (1) Applied Probability
• The triple (Ω, A, µ) is a σ-finite measure space.
• (E, d) is a separable metric space with Borel σ-algebra B(E). Separable means that there exists a countable dense set (see e.g. Munkres, 2000).
• f_1, f_2, · · · : Ω → E are measurable with respect to A-B(E). 129

Almost Sure and Measure Convergence (2) Applied Probability
• Definition, Almost sure convergence, convergence in probability (see Klenke, Definition 6.2): We say that (f_n)_{n∈N} converges to f
– in µ-measure, symbolically f_n →(meas) f, if µ({d(f, f_n) > ε} ∩ A) → 0 as n → ∞ for all ε > 0 and all A ∈ A with µ(A) < ∞, and
– µ-almost everywhere (a.e.), symbolically f_n →(a.e.) f, if there exists a µ-null set N ∈ A such that d(f(ω), f_n(ω)) → 0 as n → ∞ for any ω ∈ Ω \ N.
– If µ is a probability measure, then convergence in µ-measure is called convergence in probability. If (f_n) converges almost everywhere, then we say that it converges almost surely (a.s.). 130

Almost Sure and Measure Convergence (3) Applied Probability
• Almost sure convergence implies convergence in probability.
• Convergence in probability does not imply almost sure convergence:
– Let (X_n)_{n∈N} be an independent family of real-valued random variables, where X_n is Bernoulli distributed with p_n = 1/n, and let X = 0. Then P({d(X, X_n) > ε}) = P(|X − X_n| > ε) = 1/n → 0, hence X_n →(P) 0. Let A_n be the event where X_n = 1; then lim sup_{n→∞} A_n = A* corresponds to {X_n = 1 infinitely often}. Since Σ_{n=1}^∞ P(|X − X_n| > ε) = Σ_{n=1}^∞ 1/n = ∞, the second Borel-Cantelli lemma implies that lim sup_{n→∞} X_n = 1 almost surely. 131

Almost Sure and Measure Convergence (4) Applied Probability
• Definition, Mean convergence (see Klenke, Definition 6.8):
– Let E = R and f, f_1, f_2, · · · ∈ L¹(µ).
We say that (f_n)_{n∈N} converges in mean (L¹ convergence) to f, symbolically f_n →(L¹) f, if ||f_n − f||_1 → 0.
• Note that ||f_n − f||_1 = ∫ |f_n − f| dµ → 0 as n → ∞. L^p convergence means that ||f_n − f||_p = (∫ |f_n − f|^p dµ)^{1/p} → 0 as n → ∞ (this comes later in more detail).
• L¹ convergence implies convergence in measure, but not vice versa. 132

Almost Sure and Measure Convergence (5) Applied Probability
• Theorem, Fast convergence (see Klenke, Theorem 6.12): Let (E, d) be a separable metric space. In order for the sequence (f_n)_{n∈N} of measurable maps Ω → E to converge almost everywhere, it is sufficient that one of the following conditions holds:
– E = R and there is a p ∈ [1, ∞) with f_n ∈ L^p(µ) for all n ∈ N, and there is an f ∈ L^p(µ) with Σ_{n=1}^∞ ||f_n − f||_p < ∞.
– There is a measurable f with Σ_{n=1}^∞ µ(A ∩ {d(f, f_n) > ε}) < ∞ for all ε > 0 and for all A ∈ A with µ(A) < ∞.
– E is complete and there is a summable sequence (ε_n)_{n∈N} such that Σ_{n=1}^∞ µ(A ∩ {d(f_n, f_{n+1}) > ε_n}) < ∞ for all A ∈ A with µ(A) < ∞. 133

Almost Sure and Measure Convergence (6) Applied Probability
• Corollary, Subsequences and convergence (see Klenke, Corollary 6.13): Let (E, d) be a separable metric space and let f, f_1, f_2, … be measurable maps Ω → E. Then the following statements are equivalent:
– f_n → f in measure as n → ∞.
– For any subsequence of (f_n)_{n∈N} there exists a sub-subsequence that converges to f almost everywhere. 134

Convergence in Distribution (1) Applied Probability
• Convergence in distribution, see Klenke, Definition 13.17.
• Definition, Convergence in distribution (see Karr, 1993, Def. 5.5): The sequence (X_n) converges to X in distribution if
lim_{n→∞} F_{X_n}(t) = F_X(t)
for all t at which F_X is continuous. This is denoted by X_n →(d) X or X_n ⇒ X. 135

Uniform Integrability (1) Applied Probability
• If f is integrable, then f 1_{|f|>α} goes to zero almost everywhere as α → ∞. Therefore lim_{α→∞} ∫_{|f|≥α} |f| dµ = 0.
• Uniform means that if we consider a sequence (f_n) ⊂ L¹(µ), the integrability condition holds uniformly in n, i.e. lim_{α→∞} sup_n ∫_{|f_n|≥α} |f_n| dµ = 0 (see Billingsley, 1986, page 220). Klenke uses the following definition:
• Definition, Uniformly integrable (see Klenke, Definition 6.16):
– A family F ⊂ L¹(µ) is called uniformly integrable if
inf_{0≤g∈L¹(µ)} sup_{f∈F} ∫ (|f| − g)⁺ dµ = 0. 136

Uniform Integrability (2) Applied Probability
• Theorem (see Klenke, Theorem 6.17): The family F ⊂ L¹(µ) is uniformly integrable if and only if
inf_{0≤g̃∈L¹(µ)} sup_{f∈F} ∫_{|f|>g̃} |f| dµ = 0.
If µ(Ω) < ∞, then uniform integrability is equivalent to each of the following two conditions:
– inf_{a∈[0,∞)} sup_{f∈F} ∫ (|f| − a)⁺ dµ = 0 and
– inf_{a∈[0,∞)} sup_{f∈F} ∫_{|f|>a} |f| dµ = 0. 137

Uniform Integrability (3) Applied Probability
• Theorem (see Klenke, Theorem 6.25): Let {f_n : n ∈ N} ⊂ L¹(µ). The following statements are equivalent:
– There is an f ∈ L¹(µ) with f_n → f in L¹.
– (f_n)_{n∈N} is an L¹(µ)-Cauchy sequence, that is, ||f_n − f_m||_1 → 0 for m, n → ∞.
– (f_n)_{n∈N} is uniformly integrable and there is a measurable map f such that f_n → f in measure as n → ∞.
The limits in the first and the third point coincide. 138

Uniform Integrability (4) Applied Probability
• In Chapter 6 we additionally find:
– Lebesgue's dominated convergence theorem.
– Interchanging the integral and differentiation. 139

Outline - Convergence Theorems Applied Probability
• L^p spaces and L^p convergence.
• Jensen's inequality, Hölder's inequality, Minkowski's inequality.
• The Fischer-Riesz theorem (L^p(µ) is a Banach space).
• Hilbert spaces.
• Lebesgue's decomposition theorem, absolute continuity.
• The Radon-Nikodym theorem.
• Klenke, Chapter 7. 140

L^p Spaces (1) Applied Probability
• We consider a σ-finite measure space (Ω, A, µ). For f : Ω → R̄ we define
||f||_p := (∫ |f|^p dµ)^{1/p} for p ∈ [1, ∞) and ||f||_∞ := inf{K ≥ 0 : µ(|f| > K) = 0}.
• Spaces of functions where these terms are finite are the L^p spaces.
L^p(Ω, A, µ) = L^p(µ) = {f : Ω → R̄ measurable and ||f||_p < ∞}. 141

L^p Spaces (2) Applied Probability
• Note that ||f||_p is only a seminorm; we observed this for L¹ in Klenke (2008, Theorem 4.17). The goal is to adapt the space such that we obtain a norm.
• For a norm we need ||f − g||_p = 0 if and only if f = g µ-almost everywhere. For a seminorm we only have that f = g implies ||f − g||_p = 0.
• Hence we define N = {h : h is measurable and h = 0 µ-a.e.}. For any p ∈ [1, ∞], N is a linear subspace of L^p.
• To obtain a norm from the seminorm we build the factor space. 142

L^p Spaces (3) Applied Probability
• Definition, Factor space (see Klenke, Definition 7.1):
– For any p ∈ [1, ∞] define
L^p(Ω, A, µ) = L^p(µ) := L^p/N = {f̄ := f + N : f ∈ L^p}.
For f̄ ∈ L^p(µ) define ||f̄||_p = ||f||_p for any f ∈ f̄. Also let ∫ f̄ dµ = ∫ f dµ if this expression is defined for f.
• I.e. with f̄ we define equivalence classes: f ∼ g if f, g ∈ f̄. 143

L^p Spaces (4) Applied Probability
• Definition, L^p convergence (see Klenke, Definition 7.2):
– Let p ∈ [1, ∞] and f_1, f_2, · · · ∈ L^p(µ). If ||f_n − f||_p → 0 as n → ∞, then we say that (f_n)_{n∈N} converges to f in L^p(µ) and write f_n →(L^p) f. 144

L^p Spaces (5) Applied Probability
• Theorem (see Klenke, Theorem 7.3): Let p ∈ [1, ∞] and f_1, f_2, · · · ∈ L^p(µ). Then the following statements are equivalent:
– There is an f ∈ L^p(µ) with f_n →(L^p) f.
– (f_n)_{n∈N} is a Cauchy sequence in L^p(µ). (I.e. for every ε > 0 there is a positive integer N such that ||f_m − f_n||_p ≤ ε for all m, n ≥ N.)
If p < ∞, then these two statements are equivalent to:
– (|f_n|^p)_{n∈N} is uniformly integrable and there exists a measurable f with f_n converging to f in measure.
The limits in the first and this point coincide. 145

Inequalities and the Fischer-Riesz T. (1) Applied Probability
• Theorem, Jensen's inequality (see Klenke, Theorem 7.9):
– Let I ⊂ R be an interval and let X be an I-valued random variable with E(|X|) < ∞. If φ is convex, then E(φ(X)⁻) < ∞ and
E(φ(X)) ≥ φ(E(X)).
• φ has to be convex on an interval containing the range of X.
• For the extension to Rⁿ see Klenke, Theorem 7.11.
• Example: Consider a random variable X with E(X²) < ∞. Then Jensen's inequality yields E(X²) ≥ (E(X))². Hence V(X) = E(X²) − (E(X))² ≥ 0. 146

Inequalities and the Fischer-Riesz T. (2) Applied Probability
• Theorem, Hölder's inequality (see Klenke, Theorem 7.16):
– Let p, q ∈ [1, ∞] with 1/p + 1/q = 1 and f ∈ L^p(µ), g ∈ L^q(µ). Then fg ∈ L¹(µ) and
||fg||_1 ≤ ||f||_p ||g||_q. 147

Inequalities and the Fischer-Riesz T. (3) Applied Probability
• Theorem, Minkowski's inequality (see Klenke, Theorem 7.17):
– For p ∈ [1, ∞] and f, g ∈ L^p(µ),
||f + g||_p ≤ ||f||_p + ||g||_p. 148

Inequalities and the Fischer-Riesz T. (4) Applied Probability
• Theorem, Fischer-Riesz (see Klenke, Theorem 7.18):
– (L^p(µ), ||.||_p) is a Banach space for every p ∈ [1, ∞].
• A Banach space B is a vector space equipped with a norm ||.|| and complete with respect to that norm (for every Cauchy sequence (f_n)_{n=1}^∞ in B there exists an element f ∈ B such that lim_{n→∞} ||f_n − f|| = 0).
• By the Minkowski inequality the triangle inequality is satisfied in (L^p(µ), ||.||_p); therefore ||.||_p is a norm. By Klenke (2008, Theorem 7.3) the space (L^p(µ), ||.||_p) is complete. 149

Hilbert Spaces (1) Applied Probability
• Definition, Inner product (see Klenke, Definition 7.19): Let V be a real vector space. A map ⟨., .⟩ : V × V → R is called an inner product if
– (linearity) ⟨x, αy + z⟩ = α⟨x, y⟩ + ⟨x, z⟩ for all x, y, z ∈ V and α ∈ R,
– (symmetry) ⟨x, y⟩ = ⟨y, x⟩ for all x, y ∈ V,
– (positive definiteness) ⟨x, x⟩ > 0 for all x ∈ V \ {0}.
• If only the first two properties hold and ⟨x, x⟩ ≥ 0 for all x, then ⟨., .⟩ is called a positive semidefinite symmetric bilinear form, or a semi-inner product.
• If ⟨., .⟩ is an inner product, then (V, ⟨., .⟩) is called a (real) Hilbert space if the norm defined by ||x|| := ⟨x, x⟩^{1/2} is complete, that is, if (V, ||.||) is a Banach space.
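On a finite measure space, the L² inner product ⟨f, g⟩ = ∫ fg dµ reduces to a weighted sum, which makes the Cauchy-Schwarz inequality |⟨f, g⟩| ≤ ||f||₂ ||g||₂ easy to check numerically. The weights and functions below are hypothetical illustrations.

```python
import math

# Finite measure space with three points; `weights` are the point masses
# of mu (hypothetical values), f and g are functions given by their values.
weights = [0.2, 0.3, 0.5]
f = [1.0, -2.0, 3.0]
g = [0.5, 1.0, -1.0]

def inner(u, v):
    """<u, v> = integral of u*v w.r.t. mu = weighted sum."""
    return sum(w * a * b for w, a, b in zip(weights, u, v))

def norm(u):
    """||u||_2 = <u, u>^(1/2)."""
    return math.sqrt(inner(u, u))

# Cauchy-Schwarz: |<f, g>| <= ||f||_2 * ||g||_2
lhs, rhs = abs(inner(f, g)), norm(f) * norm(g)
print(lhs <= rhs)  # True
```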
150

Hilbert Spaces (2) Applied Probability
• Definition (see Klenke, Definition 7.20):
– For f, g ∈ L²(µ) define ⟨f, g⟩ := ∫ fg dµ.
– For f̄, ḡ ∈ L²(µ) define ⟨f̄, ḡ⟩ := ⟨f, g⟩, where f ∈ f̄ and g ∈ ḡ.
• Theorem (see Klenke, Theorem 7.21):
– ⟨., .⟩ is an inner product on the factor space L²(µ) and a semi-inner product on the function space L²(µ). In addition, ||f||_2 = ⟨f, f⟩^{1/2}. 151

Hilbert Spaces (3) Applied Probability
• Theorem (see Klenke, Theorem 7.22):
– The space (L²(µ), ⟨., .⟩) is a Hilbert space. 152

Hilbert Spaces (4) Applied Probability
• Definition, Orthogonal complement (see Klenke, Definition 7.24):
– Let V be a real vector space with inner product ⟨., .⟩. If W ⊂ V, then the orthogonal complement of W is the following linear subspace of V:
W⊥ := {v ∈ V : ⟨v, w⟩ = 0 for all w ∈ W}. 153

Hilbert Spaces (5) Applied Probability
• Theorem, Orthogonal decomposition (see Klenke, Theorem 7.22):
– Let (V, ⟨., .⟩) be a Hilbert space and let W ⊂ V be a closed linear subspace. For any x ∈ V there is a unique representation x = w + w⊥, where w ∈ W and w⊥ ∈ W⊥. 154

Hilbert Spaces (6) Applied Probability
• Let ||x − ŵ|| = inf_{w∈W} ||x − w||.
• It can be shown that this minimization problem is solved by ŵ ∈ W with x − ŵ ∈ W⊥ (see e.g. Ruud, 2000, Section 2.6.2); or:
• Theorem, Projection theorem (see e.g. Brockwell and Davis, 2006, Theorem 2.3.1): If W is a closed subspace of the Hilbert space V and x ∈ V, then
– there is a unique element ŵ ∈ W such that ||x − ŵ|| = inf_{w∈W} ||x − w||, and
– ŵ ∈ W with ||x − ŵ|| = inf_{w∈W} ||x − w|| holds if and only if ŵ ∈ W and (x − ŵ) ∈ W⊥.
• We shall observe that ŵ is given by the conditional expectation. 155

The Radon-Nikodym Theorem (1) Applied Probability
• Definition (see Klenke, Definition 7.30): Let µ and ν be two measures on (Ω, A).
– ν is called absolutely continuous with respect to µ (symbolically ν ≪ µ) if ν(A) = 0 for all A ∈ A with µ(A) = 0. The measures ν and µ are called equivalent if ν ≪ µ and µ ≪ ν.
– µ is called singular to ν (µ ⊥ ν) if there exists an A ∈ A such that µ(A) = 0 and ν(Ω \ A) = 0.

156 The Radon-Nikodym Theorem (2)
• Theorem, Lebesgue's decomposition theorem (see Klenke, Theorem 7.33):
– Let µ and ν be σ-finite measures on (Ω, A). Then ν can be uniquely decomposed into an absolutely continuous part ν_a and a singular part ν_s (with respect to µ): ν = ν_a + ν_s, where ν_a ≪ µ and ν_s ⊥ µ. ν_a has a density with respect to µ; dν_a/dµ is A-measurable and finite µ-almost everywhere.

157 The Radon-Nikodym Theorem (3)
• Theorem, Radon-Nikodym theorem (see Klenke, Corollary 7.34):
– Let µ and ν be σ-finite measures on (Ω, A). Then ν has a density w.r.t. µ ⇔ ν ≪ µ. In this case dν/dµ is A-measurable and finite µ-almost everywhere. The term dν/dµ is called the Radon-Nikodym derivative of ν with respect to µ.

158 Outline - Martingales
• Conditional Expectation.
• Martingales.
• Discrete Stochastic Integrals and No-Arbitrage.
• Optional Sampling Theorem.
• The Martingale Convergence Theorem.
• Klenke, Chapters 8 to 11.

159 Conditional Expectation (1)
• Definition, Conditional Probability (see Klenke, Definition 8.2):
– Let (Ω, A, P) be a probability space and A ∈ A. We define the conditional probability given A for any B ∈ A by
P(B|A) = P(B ∩ A)/P(A) if P(A) > 0, and P(B|A) = 0 else.
• The specification for the case P(A) = 0 is arbitrary but of no importance.

160 Conditional Expectation (2)
• Theorem, (see Klenke, Theorem 8.4):
– If P(A) > 0, then B ↦ P(B|A) is a probability measure on (Ω, A).
• Theorem, (see Klenke, Theorem 8.5): Let A, B ∈ A with P(A), P(B) > 0. Then
– A, B are independent ⇔ P(B|A) = P(B) ⇔ P(A|B) = P(A).

161 Conditional Expectation (3)
• Theorem, Summation formula/law of total probability (see Klenke, Theorem 8.6):
– Let I be a countable set and let (B_i)_{i∈I} be pairwise disjoint sets with P(⋃_{i∈I} B_i) = 1. Then for any A ∈ A,
P(A) = Σ_{i∈I} P(A|B_i) P(B_i).
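A small numerical illustration of the law of total probability and of Bayes' formula (a sketch; the two-urn partition and its probabilities are made up for illustration):

```python
# Two-stage experiment: pick urn B1 or B2, then draw a ball; A = "red ball".
P_B = {"B1": 0.3, "B2": 0.7}            # partition of Omega, disjoint, sums to 1
P_A_given_B = {"B1": 0.5, "B2": 0.2}    # conditional probabilities P(A | B_i)

# Law of total probability: P(A) = sum_i P(A|B_i) P(B_i)
P_A = sum(P_A_given_B[b] * P_B[b] for b in P_B)
print(P_A)  # 0.3*0.5 + 0.7*0.2 = 0.29

# Bayes' formula: P(B1|A) = P(A|B1) P(B1) / P(A)
P_B1_given_A = P_A_given_B["B1"] * P_B["B1"] / P_A
print(P_B1_given_A)  # 0.15 / 0.29
```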
162 Conditional Expectation (4)
• Theorem, Bayes' formula (see Klenke, Theorem 8.7):
– Let I be a countable set and let (B_i)_{i∈I} be pairwise disjoint sets with P(⋃_{i∈I} B_i) = 1. Then for any A ∈ A with P(A) > 0 and any k ∈ I,
P(B_k|A) = P(A ∩ B_k)/P(A) = P(A|B_k) P(B_k) / Σ_{i∈I} P(A|B_i) P(B_i).

163 Conditional Expectation (5)
• Definition, (see Klenke, Definition 8.9):
– Let X ∈ L¹(P) and A ∈ A. Then we define
E(X|A) := ∫ X(ω) P[dω|A] = E(1_A X)/P(A) if P(A) > 0, and E(X|A) := 0 else.

164 Conditional Expectation (6)
• P(B|A) = E(1_B|A) for all B ∈ A.
• Consider a countable set I and pairwise disjoint sets (B_i)_{i∈I} with ⋃_{i∈I} B_i = Ω.
• Define F = σ(B_i, i ∈ I).
• For X ∈ L¹(P) we define the map E(X|F) : Ω → R by E(X|F)(ω) = E(X|B_i) if ω ∈ B_i.

165 Conditional Expectation (7)
• Theorem, (see Klenke, Theorem 8.10): The map E(X|F) has the following properties:
– E(X|F) is F-measurable.
– E(X|F) ∈ L¹(P) and for any A ∈ F we have ∫_A E(X|F) dP = ∫_A X dP.

166 Conditional Expectation (8)
• F ⊂ A is a sub-σ-algebra and X ∈ L¹(Ω, A, P).
• Definition, Conditional Expectation (see Klenke, Definition 8.11): A random variable Y is called a conditional expectation of X given F, symbolically E(X|F) := Y, if
– Y is F-measurable,
– for any A ∈ F we have E(X 1_A) = E(Y 1_A).
– For B ∈ A, P(B|F) := E(1_B|F) is called a conditional probability of B given the σ-algebra F.

167 Conditional Expectation (9)
• Theorem, Conditional Expectation (see Klenke, Theorem 8.12):
– E(X|F) exists and is unique (up to equality almost surely).
• Existence follows from the Radon-Nikodym theorem.

168 Conditional Expectation (10)
• We write/define E(X|Y) = E(X|σ(Y)).
• Theorem, Properties of the Conditional Expectation (see Klenke, Theorem 8.14): Let G ⊂ F ⊂ A be σ-algebras and let X, Y ∈ L¹(Ω, A, P). Then:
– (Linearity) E(λX + Y|F) = λE(X|F) + E(Y|F).
– (Monotonicity) If X ≥ Y a.s.
then E(X|F) ≥ E(Y|F).
– If E(|XY|) < ∞ and Y is measurable with respect to F, then E(YX|F) = Y E(X|F) and E(Y|F) = E(Y|Y) = Y.
– (Tower Property) E(E(X|F)|G) = E(E(X|G)|F) = E(X|G).
– (Triangle inequality) E(|X| | F) ≥ |E(X|F)|.

169 Conditional Expectation (11)
• Theorem, Properties of the Conditional Expectation (see Klenke, Theorem 8.14): Let G ⊂ F ⊂ A be σ-algebras and let X ∈ L¹(Ω, A, P). Then:
– (Independence) If σ(X) and F are independent, then E(X|F) = E(X).
– If P(A) ∈ {0, 1} for any A ∈ F, then E(X|F) = E(X).
– (Dominated convergence) Assume Y ∈ L¹(P), Y ≥ 0, and (X_n)_{n∈N} is a sequence of random variables with |X_n| ≤ Y for n ∈ N and such that X_n → X a.s. Then
lim_{n→∞} E(X_n|F) = E(X|F) a.s. and in L¹(P).

170 Conditional Expectation (12)
• Theorem, Conditional Expectation and Projection (see Klenke, Corollary 8.16):
– Let F ⊂ A be a σ-algebra and let X be a random variable with E(X²) < ∞. Then E(X|F) is the orthogonal projection of X onto L²(Ω, F, P). That is, for any F-measurable Y with E(Y²) < ∞,
E((X − Y)²) ≥ E((X − E(X|F))²),
with equality if and only if E(X|F) = Y.

171 Processes, Filtrations (1)
• In the following (E, τ) is a Polish space (separable completely metrizable topological space, see e.g. Klenke, 2008, p. 184) with Borel σ-algebra E. (Ω, F, P) stands for a probability space, I for an index set.
• Definition, Stochastic Process (see Klenke, Definition 9.1):
– Let I ⊂ R. A family of random variables X = (X_t, t ∈ I) on (Ω, F, P) with values in (E, E) is called a stochastic process with index set I and range E.
• In most cases the 'time notation' instead of the more general index set notation is used.

172 Processes, Filtrations (2)
• Examples:
– Let I = N0 and (Y_n, n ∈ N0) be a family of iid Rademacher random variables (with p = 1/2) on a probability space (Ω, F, P), i.e. P(Y_n = 1) = P(Y_n = −1) = 1/2.
Let E = Z (with the discrete topology) and let X_t = Σ_{n=1}^{t} Y_n for all t ∈ N0. (X_t, t ∈ N0) is called the symmetric random walk in Z. For random walks see e.g. Durrett (2010)[Chapter 4].
– Brownian motion, here I = R₊, see e.g. Klenke (2008)[Chapter 21] or Durrett (2010)[Chapter 8].
– Poisson process, see e.g. Klenke (2008)[Chapter 3].
– Random graphs, see e.g. Durrett (2007).

173 Processes, Filtrations (3)
• Definition, (see Klenke, Definition 9.6):
– If X is a random variable or a stochastic process, we write L[X] = P_X for the distribution of X. If G ⊂ F is a σ-algebra, then we write L[X|G] for the regular conditional distribution of X given G.

174 Processes, Filtrations (4)
• Definition, (see Klenke, Definition 9.7): An E-valued stochastic process X = (X_t)_{t∈I} is called
– real valued if E = R,
– a process with independent increments if X is real valued and for all n ∈ N and all t_0, t_1, . . . , t_n ∈ I with t_0 < t_1 < · · · < t_n the family (X_{t_i} − X_{t_{i−1}})_{i=1,...,n} is independent,
– a Gaussian process if X is real valued and for all n ∈ N and all t_1, . . . , t_n ∈ I, (X_{t_1}, . . . , X_{t_n}) is n-dimensionally normally distributed, and
– integrable (square integrable) if X is real valued and E(|X_t|) < ∞ (E(X_t²) < ∞) for all t ∈ I.

175 Processes, Filtrations (5)
• Definition, (see Klenke, Definition 9.7): Assume that I ⊂ R is closed under addition. An E-valued stochastic process X = (X_t)_{t∈I} is called
– stationary if L[(X_{s+t})_{t∈I}] = L[(X_t)_{t∈I}] for all s ∈ I, and
– a process with stationary increments if X is real valued and L[X_{s+t+r} − X_{t+r}] = L[X_{s+r} − X_r] for all r, s, t ∈ I. If 0 ∈ I, then it is enough to consider r = 0.

176 Processes, Filtrations (6)
• Remark: In econometrics often a weaker form of stationarity is used.
• Definition, Weak Stationarity (see e.g.
Brockwell and Davis, 2006, Definition 1.3.2): The time series (X_t)_{t∈Z} is said to be weakly stationary (covariance stationary, stationary in the wide sense, second order stationary) if
– E(|X_t|²) < ∞ for all t ∈ Z,
– E(X_t) = m for all t ∈ Z,
– E((X_r − m)(X_s − m)) = E((X_{r+t} − m)(X_{s+t} − m)) for all r, s, t ∈ Z.

177 Processes, Filtrations (7)
• In the following definition the index set I should be partially ordered.
• Definition, Filtration (see Klenke, Definition 9.9): Let F = (F_t, t ∈ I) be a family of σ-algebras with F_t ⊂ F for all t ∈ I. F is called a filtration if F_s ⊂ F_t for all s, t ∈ I with s ≤ t.
• Definition, Adapted (see Klenke, Definition 9.10): A stochastic process X is called adapted to the filtration F if X_t is F_t-measurable for all t ∈ I. If F_t = σ(X_s, s ≤ t) for all t ∈ I, then we denote by F = σ(X) the filtration generated by X.

178 Processes, Filtrations (8)
• Definition, Predictable (see Klenke, Definition 9.12): A stochastic process X = (X_n, n ∈ N0) is called predictable with respect to the filtration F = (F_n, n ∈ N0) if X_0 is constant and if for every n ∈ N, X_n is F_{n−1}-measurable.
• For measure theoretic details (adapted, augmented filtration) see e.g. Karatzas and Shreve (1991).

179 Processes, Filtrations (9)
• Definition, Stopping Time (see Klenke, Definition 9.15):
– A random variable τ with values in I ∪ {∞} is called a stopping time with respect to F if for any t ∈ I, {τ ≤ t} ∈ F_t.
• Theorem, Stopping Time (see Klenke, Theorem 9.16):
– Let I be countable. τ is a stopping time if and only if {τ = t} ∈ F_t for all t ∈ I.

180 Processes, Filtrations (10)
• Theorem, Stopping Time (see Klenke, Theorem 9.18): Let σ and τ be stopping times. Then:
– σ ∨ τ and σ ∧ τ are stopping times.
– If σ, τ ≥ 0, then σ + τ is also a stopping time.
– If s ≥ 0, then τ + s is a stopping time. However, in general, τ − s is not.
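A concrete stopping time is the first hitting time of a level by a random walk: whether {τ ≤ t} has occurred is decided by the path up to time t only. A small simulation sketch (not from the lecture notes; the level values 3 and 5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def hitting_time(path, level):
    """First time the path reaches `level`; None encodes tau = infinity.
    This is a stopping time: the event {tau <= t} depends only on the
    path values up to time t."""
    hits = np.nonzero(path >= level)[0]
    return int(hits[0]) if hits.size else None

steps = rng.choice([-1, 1], size=1000)           # Rademacher increments
walk = np.concatenate(([0], np.cumsum(steps)))   # X_0 = 0
sigma = hitting_time(walk, 3)
tau = hitting_time(walk, 5)
if sigma is not None and tau is not None:
    # the walk moves in unit steps, so it must pass 3 before reaching 5,
    # and sigma ^ tau (= min) is again a stopping time
    print(sigma <= tau, min(sigma, tau) == sigma)
```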
181 Processes, Filtrations (11)
• Definition, σ-algebra of the τ-past (see Klenke, Definition 9.19):
– Let τ be a stopping time. Then
F_τ := {A ∈ F : A ∩ {τ ≤ t} ∈ F_t for any t ∈ I}
is called the σ-algebra of the τ-past.

182 Processes, Filtrations (12)
• Theorem, (see Klenke, Lemma 9.21):
– If σ and τ are stopping times with σ ≤ τ, then F_σ ⊂ F_τ.
• Definition, (see Klenke, Definition 9.22):
– If τ < ∞ is a stopping time, then we define X_τ(ω) = X_{τ(ω)}(ω).

183 Martingales (1)
• Definition, Martingales (see Klenke, Definition 9.24): Let (Ω, F, P) be a probability space, I ⊂ R, and let F be a filtration. Let X = (X_t)_{t∈I} be a real-valued, adapted stochastic process with E(|X_t|) < ∞ for all t ∈ I. X is called (with respect to F) a
– martingale if E(X_t|F_s) = X_s for all s, t ∈ I with t > s,
– submartingale if E(X_t|F_s) ≥ X_s for all s, t ∈ I with t > s,
– supermartingale if E(X_t|F_s) ≤ X_s for all s, t ∈ I with t > s.
• Consider the map t ↦ E(X_t). For a martingale this map is constant, for a submartingale it is monotone increasing, while for a supermartingale it is monotone decreasing.
• If not otherwise stated, F_t = σ(X_s, s ≤ t).

184 Martingales (2)
• Theorem, Martingales - properties (see Klenke, Theorem 9.32): Let (Ω, F, P) be a probability space, I ⊂ R, and let F be a filtration. Let X = (X_t)_{t∈I} be a real-valued, adapted stochastic process with E(|X_t|) < ∞ for all t ∈ I.
– X is a supermartingale if and only if (−X) is a submartingale.
– Let X and Y be martingales and let a, b ∈ R. Then (aX + bY) is a martingale.
– Let X and Y be supermartingales and let a, b ≥ 0. Then (aX + bY) is a supermartingale.
– Let X and Y be supermartingales. Then Z := X ∧ Y = (min(X_t, Y_t))_{t∈I} is a supermartingale.
– If (X_t)_{t∈I} is a supermartingale and E(X_T) ≥ E(X_0) for some T ∈ N0, then (X_t)_{t≤T} is a martingale. If there exists a sequence T_N → ∞ with E(X_{T_N}) ≥ E(X_0), then (X_t)_{t∈I} is a martingale.
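The defining property E(X_t|F_s) = X_s can be checked by simulation for the symmetric random walk (a sketch, not part of the lecture notes; the times s = 5, t = 15 are arbitrary): the map t ↦ E(X_t) is constant, and the conditional mean of X_t given X_s = x is approximately x.

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, T, s, t = 100_000, 20, 5, 15

steps = rng.choice([-1.0, 1.0], size=(n_paths, T))
X = np.cumsum(steps, axis=1)   # X_k = sum of k Rademacher variables

# t -> E(X_t) is constant (= 0) for a martingale
print(abs(X[:, t - 1].mean()) < 0.05)

# E(X_t | X_s = x) ~ x: group the paths by the value of X_s
for x in (-3, -1, 1, 3):
    sel = X[:, s - 1] == x
    print(x, round(X[sel, t - 1].mean(), 2))   # each close to x
```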
185 Martingales (3)
• Theorem, (see Klenke, Theorem 9.33): Let X = (X_t)_{t∈I} be a martingale and let φ : R → R be a convex function.
– If E(φ(X_t)⁺) < ∞ for all t ∈ I, then (φ(X_t))_{t∈I} is a submartingale.
– If t* := sup(I) ∈ I, then E(φ(X_{t*})⁺) < ∞ implies E(φ(X_t)⁺) < ∞ for all t ∈ I.
– In particular, if p ≥ 1 and E(|X_t|^p) < ∞ for all t ∈ I, then (|X_t|^p)_{t∈I} is a submartingale.

186 Discrete Stochastic Integral (1)
• Definition, Discrete Stochastic Integral (see Klenke, Definition 9.37):
– Let (X_n)_{n∈N0} be an F-adapted real process and let (H_n)_{n∈N} be a real-valued, F-predictable process. The discrete stochastic integral of H with respect to X is the stochastic process H · X defined by
(H · X)_n := Σ_{m=1}^{n} H_m (X_m − X_{m−1}) for n ∈ N0.
If X is a martingale, then H · X is also called the martingale transform of X.
• Note that (H · X) is F-adapted by construction.

187 Discrete Stochastic Integral (2)
• Theorem, Stability Theorem (see Klenke, Theorem 9.39): Let (X_n)_{n∈N0} be an F-adapted real process with E(|X_0|) < ∞.
– X is a martingale if and only if for any locally bounded predictable process H (i.e. each H_n is bounded), the stochastic integral (H · X) is a martingale.
– X is a submartingale (supermartingale) if and only if (H · X) is a submartingale (supermartingale) for any locally bounded predictable process H ≥ 0.

188 Discrete Stochastic Integral (3)
• Example: St. Petersburg game (Klenke, Example 9.40)
– I = N0; D_1, D_2, . . . are iid Rademacher random variables (with p = 1/2), i.e. P(D_i = 1) = P(D_i = −1) = 1/2 for all i ∈ N.
– D = (D_i)_{i∈N} and F = σ(D).
– D_i is the result of a bet that gives a gain or loss of one Euro for every Euro we put at stake.
– H_n is the Euro amount we bet in gambling round n.
– The gambling strategy has to be predictable; hence H_n is a function of D_1, . . . , D_{n−1}. We use H_n = 2^{n−1} 1_{{D_1 = · · · = D_{n−1} = −1}}; H_1 = 1, which is measurable with respect to the trivial σ-field F_0 = {∅, Ω}.
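The doubling strategy H_n = 2^{n−1} 1{D_1 = ... = D_{n−1} = −1} can be simulated directly (a sketch, not part of the lecture notes): after each loss the stake doubles, and the first win leaves a total gain of exactly one Euro, which foreshadows the almost sure convergence S_n → 1 discussed later.

```python
import numpy as np

rng = np.random.default_rng(3)

def st_petersburg(n_rounds, rng):
    """Play the doubling strategy: bet 2^(n-1) while every previous toss lost."""
    D = rng.choice([-1, 1], size=n_rounds)   # Rademacher bets
    gain, stake = 0, 1
    for d in D:
        gain += stake * d
        if d == 1:       # first win: the strategy stops betting (H_n = 0 afterwards)
            return gain
        stake *= 2       # after a loss, double the stake
    return gain          # never won within n_rounds (probability 2^-n_rounds)

gains = [st_petersburg(60, rng) for _ in range(10_000)]
print(sum(g == 1 for g in gains) / len(gains))   # essentially every path ends at 1
```

The catch is that E(S_n) = 0 for every fixed n: the rare paths with long losing streaks carry enormous losses, which is why the gain process is a martingale despite the almost sure terminal gain.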
189 Discrete Stochastic Integral (4)
• Example: St. Petersburg game
– Define X_n = Σ_{i=1}^{n} D_i. (X_n) is a martingale.
– Let H_1 = 1 and H_n = 2^{n−1} 1_{{D_1 = · · · = D_{n−1} = −1}}. Then S_n = Σ_{i=1}^{n} H_i(X_i − X_{i−1}) = Σ_{i=1}^{n} H_i D_i = (H · X)_n is the gain process.
– (S_n), or S in textbook notation, is a martingale.
– We obtain E(S_n) = 0 for all n ∈ N.
– Note that S_n → 1 almost surely. This issue will be discussed when we investigate martingale convergence. In this example n ∈ N0.

190 Martingales and Option Pricing (1)
• In the following we consider the Cox et al. (1979) model:
– I = {0, 1, . . . , T}.
– A risky asset with binary price process (S_t)_{t∈I}.
– A risk free asset, ≈ bond, paying a fixed interest rate r ≥ 0. The price of the risk free asset in period t is S_t^b = (1 + r)^t.
– We want to price a European call option, which is a derivative security with payoff structure (S_T − K)⁺ = max(0, S_T − K). T is called the expiry date or maturity, while K is called the strike price.
– A European option cannot be exercised before the expiry date T.

191 Martingales and Option Pricing (2)
• By the martingale transform Y = (H · X) we have transformed the martingale X into a further martingale.
• Let Y_0 = 0. When the process X is fixed, which martingales Y can be obtained by means of some H = H(Y)?
• Not all martingales Y can be obtained (see e.g. Klenke, 2008, Example 9.41).
• However, a martingale Y can be represented as a stochastic integral if the increments X_n − X_{n−1} can take only two values.

192 Martingales and Option Pricing (3)
• Definition, Binary Model (see Klenke, Definition 9.42):
– A stochastic process X_0, . . . , X_T is called binary splitting or a binary model if there exist random variables D_1, . . . , D_T with values in {−1, 1} and functions f_n : R^{n−1} × {−1, 1} → R for n = 1, . . . , T as well as x_0 ∈ R such that X_0 = x_0 and X_n = f_n(X_1, . . . , X_{n−1}, D_n) for any n = 1, . . . , T.
By F = σ(X) we denote the filtration generated by X.
• X_n depends only on the past X_i, but not on the full information arising from D_1, . . . , D_n.

193 Martingales and Option Pricing (4)
• Theorem, Representation Theorem (see Klenke, Theorem 9.43):
– Let X be a binary model and let V_T be an F_T-measurable random variable. Then there exist a bounded predictable process H and a v_0 ∈ R with V_T = v_0 + (H · X)_T.

194 Martingales and Option Pricing (5)
• The predictable process H = (H_t^b, H_t^s)ᵀ, H_t ∈ R², is called a trading strategy. H_t^b and H_t^s are the numbers of bonds and shares held by the investor in period 0 ≤ t ≤ T.
• The value of the portfolio at time t is V_t = H_t^b S_t^b + H_t^s S_t; the discounted value is
Ṽ_t = (1/S_t^b) (H_t^b S_t^b + H_t^s S_t).

195 Martingales and Option Pricing (6)
• Definition, Self Financing Trading Strategy (see Lamberton and Lapeyre, 2008):
– A trading strategy is called self financing if
H_t^b S_t^b + H_t^s S_t = H_{t+1}^b S_t^b + H_{t+1}^s S_t
holds for any t ∈ {0, 1, . . . , T − 1}. (H is predictable, i.e. H_t is measurable with respect to F_{t−1}.)
• For t = 0, V_0 (= H_0^b S_0^b + H_0^s S_0) = H_1^b S_0^b + H_1^s S_0 has to hold. V_0 = H_0^b S_0^b + H_0^s S_0 can be, as in Lamberton and Lapeyre (2008), some random variable or some constant v_0; this depends on F_0. In Klenke H_0 has to be constant based on his definition of predictability, while the textbook of Lamberton and Lapeyre (2008) only requires measurability with respect to F_0.

196 Martingales and Option Pricing (7)
• Theorem, (see Lamberton and Lapeyre, 2008, Proposition 1.1.2): The following statements are equivalent:
– The strategy H is self financing.
– For any t ∈ {1, . . . , T},
V_t(H) = V_0(H) + Σ_{i=1}^{t} (H_i^b ∆S_i^b + H_i^s ∆S_i), where ∆S_t = S_t − S_{t−1}.
– For any t ∈ {1, . . . , T},
Ṽ_t(H) = V_0(H) + Σ_{i=1}^{t} (H_i^b ∆S̃_i^b + H_i^s ∆S̃_i), where ∆S̃_t = S̃_t − S̃_{t−1} = S_t/S_t^b − S_{t−1}/S_{t−1}^b.

197 Martingales and Option Pricing (8)
• Theorem, (see Lamberton and Lapeyre, 2008, Proposition 1.1.3):
– For any predictable process H^s and for any F_0-measurable variable V_0, there exists a unique predictable process H^b such that the strategy H = (H^b, H^s) is self financing and the initial value is V_0.

198 Martingales and Option Pricing (9)
• Definition, Admissible Strategy (see Lamberton and Lapeyre, 2008, Definition 1.1.4):
– A strategy H is admissible if it is self financing and if V_t(H) ≥ 0 for any t ∈ {0, 1, . . . , T}.
• Definition, Arbitrage Strategy (see Lamberton and Lapeyre, 2008, Definition 1.1.5):
– An arbitrage strategy is an admissible strategy with zero initial value (i.e. V_0(H) = 0) and non-zero final value (i.e. V_T(H) > 0).
• For different definitions/forms of arbitrage see Mas-Colell et al. (1995) and Werner and Ross (2000).

199 Martingales and Option Pricing (10)
• Definition, Viable Market (see Lamberton and Lapeyre, 2008, Definition 1.1.5):
– A market is viable (arbitrage free) if there is no arbitrage opportunity.
• Theorem, Fundamental Theorem of Asset Pricing (see Lamberton and Lapeyre, 2008, Theorem 1.1.6):
– The market is viable if and only if there exists a probability measure P* equivalent to P such that the discounted prices of assets are martingales.

200 Martingales and Option Pricing (11)
• Remark:
– For different definitions/forms of arbitrage see Mas-Colell et al. (1995) and Werner and Ross (2000).
– The discrete time version of the fundamental theorem of asset pricing goes back to Harrison and Kreps (1979) and Harrison and Pliska (1981).
– The continuous time analog of this theorem has been derived by Delbaen and Schachermayer (1994).
– An easier way to look at this theorem is provided in e.g. Filipović (2009)[Chapter 4].
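The wealth identity of Proposition 1.1.2 can be checked numerically (a sketch, not from the lecture notes; the rebalancing rule for the share position is an arbitrary predictable toy strategy): the bond position is chosen so each rebalancing is self financing, and then V_T = V_0 + Σ (H^b ∆S^b + H^s ∆S) holds exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
T, r, a, b, s0 = 10, 0.02, -0.1, 0.15, 100.0

D = rng.choice([a, b], size=T)
S = np.concatenate(([s0], s0 * np.cumprod(1 + D)))   # risky asset
Sb = (1 + r) ** np.arange(T + 1)                     # bond, S^b_t = (1+r)^t

Hs = np.ones(T + 1)     # shares held over (t-1, t]; H_1 fixed at time 0
Hb = np.zeros(T + 1)    # bond position, chosen to finance the rebalancing
V0 = Hb[0] * Sb[0] + Hs[0] * S[0]
for t in range(1, T + 1):
    Hs[t] = 1.0 + 0.5 * (S[t - 1] < s0)   # predictable: uses prices up to t-1 only
    # self-financing: H_t^b Sb_{t-1} + H_t^s S_{t-1} = H_{t-1}^b Sb_{t-1} + H_{t-1}^s S_{t-1}
    Hb[t] = Hb[t - 1] + (Hs[t - 1] - Hs[t]) * S[t - 1] / Sb[t - 1]

V_T = Hb[T] * Sb[T] + Hs[T] * S[T]
gains = sum(Hb[t] * (Sb[t] - Sb[t - 1]) + Hs[t] * (S[t] - S[t - 1])
            for t in range(1, T + 1))
print(np.isclose(V_T, V0 + gains))   # Proposition 1.1.2: True
```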
201 Martingales and Option Pricing (12)
• Definition, Attainable Claim (see Lamberton and Lapeyre, 2008, Definition 1.3.1):
– A contingent claim h is attainable if there exists an admissible strategy H worth h at time T.
• Example: h = (S_T − K)⁺. H is a linear combination of bonds and shares such that the payoff is (S_T − K)⁺.
• Definition, Complete Market (see Lamberton and Lapeyre, 2008, Definition 1.3.3):
– The market is complete if every contingent claim is attainable.

202 Martingales and Option Pricing (13)
• Theorem, (see Lamberton and Lapeyre, 2008, Theorem 1.3.4):
– A viable market is complete if and only if there exists a unique probability measure P* equivalent to P under which discounted prices are martingales.

203 Martingales and Option Pricing (14)
• Definition, Cox et al. (1979) Model (see Klenke, Definition 9.44):
– Consider an economy with a risky asset S and a risk free asset S^b.
– Let T ∈ N, a ∈ (−1, 0) and b ∈ (0, 1) as well as p ∈ (0, 1). Further, let D_1, . . . , D_T be iid Rademacher random variables with P(D_i = 1) = 1 − P(D_i = −1) = p. We let the initial price of the risky asset be S_0 = s_0 > 0 and, for t = 1, . . . , T, define
S_t = (1 + b)S_{t−1} if D_t = +1, and S_t = (1 + a)S_{t−1} if D_t = −1.
– F_0 = {∅, Ω}, F = 2^Ω.

204 Martingales and Option Pricing (15)
• Definition, Cox et al. (1979) Model (see Klenke, Definition 9.44):
– The bond prices S_t^b are described by the deterministic function S_t^b = (1 + r)^t, where r ≥ 0, fulfilling a < r < b, is a fixed interest rate.
– A European call option with payoff profile h(S_T) = max(0, S_T − K) = (S_T − K)⁺ is written on the risky asset described by S. K is called the strike price, T the expiration date. π(V_T) is the arbitrage free value of this financial derivative.

205 Martingales and Option Pricing (16)
• Let us start with T = 1. In this case the asset price is either S_1 = (1 + b)s_0 or S_1 = (1 + a)s_0.
• The value process is V_T(D_1 = 1) = V_1(D_1 = 1) = ((1 + b)s_0 − K)⁺ with probability p and V_1(D_1 = −1) = ((1 + a)s_0 − K)⁺ with probability 1 − p.
• From the above results we know that S̃ (or S′ in Klenke's notation) has to follow a martingale. I.e. we have to transform p such that S̃ follows a martingale, i.e.
s_0 = (1/(1 + r)) (p*(1 + b)s_0 + (1 − p*)(1 + a)s_0).

206 Martingales and Option Pricing (17)
• Some algebra yields p* = (r − a)/(b − a) ∈ (0, 1). Hence P* ∼ P. The equivalent process X′ is (1 + b)s_0 with probability p* and (1 + a)s_0 with probability 1 − p*. This process is a martingale.
• To obtain the value of the option we calculate the expected value of V_T under the equivalent measure p*. This yields
π(V_T) = (1/(1 + r)) E_{p*}(V_T) = (1/(1 + r)) (p*((1 + b)s_0 − K)⁺ + (1 − p*)((1 + a)s_0 − K)⁺).

207 Martingales and Option Pricing (18)
• For T ∈ N, Cox et al. (1979) derived: the price of a European call option is given by
π(V_T) = (1 + r)^{−T} E_{p*}(V_T)
= s_0/(1 + r)^T · Σ_{i=A}^{T} C(T, i) (p*)^i (1 − p*)^{T−i} (1 + b)^i (1 + a)^{T−i}
− K/(1 + r)^T · Σ_{i=A}^{T} C(T, i) (p*)^i (1 − p*)^{T−i},
where C(T, i) denotes the binomial coefficient and A := min{i ∈ N0 : (1 + b)^i (1 + a)^{T−i} s_0 > K}.

208 Martingales and Efficient Markets
• Remark:
– In the above example martingales have been used to price derivatives.
– On the martingale property of asset prices see e.g. Lucas (1978) and Duffie (2001).
– Regarding the efficient capital markets literature see e.g. LeRoy (1989) or Campbell et al. (1997) and the literature cited there.

209 Optional Sampling Theorems
• Motivation:
– If X is a martingale, then the martingale transform (H · X) provides us with a further martingale.
– Does this also hold for a stopped process?
– In less mathematical terms, by the stability theorem we observe that a fair game (= martingale) cannot be transformed into an unfair game by some gambling strategy H.
– The optional sampling theorems investigate this issue for processes stopped at random times.
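The CRR pricing formula can be coded directly as the discounted expectation of the payoff under p* (a sketch, not part of the lecture notes; the parameter values are arbitrary):

```python
import math

def crr_call(s0, K, r, a, b, T):
    """Arbitrage-free price of a European call in the Cox et al. (1979) model."""
    p_star = (r - a) / (b - a)                  # risk-neutral up-probability
    disc = (1 + r) ** (-T)
    price = 0.0
    for i in range(T + 1):                      # i up-moves, T - i down-moves
        prob = math.comb(T, i) * p_star**i * (1 - p_star) ** (T - i)
        sT = s0 * (1 + b) ** i * (1 + a) ** (T - i)
        price += prob * max(sT - K, 0.0)
    return disc * price

# One-period example: p* = (0.02 + 0.1)/0.25 = 0.48, payoff 15 in the up state,
# so the price is 0.48 * 15 / 1.02
print(round(crr_call(100, 100, 0.02, -0.1, 0.15, 1), 4))  # 7.0588
```

As a consistency check, a "call" with strike K = 0 must cost exactly s_0, since the discounted price process is a martingale under p*.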
210 Doob Decomposition (1)
• Let X = (X_n)_{n∈N0} be an adapted process with E(|X_n|) < ∞ for all n ∈ N0.
• We try to decompose X into a martingale and a predictable process, i.e., for n ∈ N0,
M_n := X_0 + Σ_{k=1}^{n} (X_k − E(X_k|F_{k−1})) and A_n := Σ_{k=1}^{n} (E(X_k|F_{k−1}) − X_{k−1}).
• X_n = M_n + A_n. M is a martingale, A is predictable with A_0 = 0.

211 Doob Decomposition (2)
• Theorem, Doob decomposition (see Klenke, Theorem 10.1):
– Let X = (X_n)_{n∈N0} be an adapted process with E(|X_n|) < ∞ for all n ∈ N0. Then there exists a unique decomposition X = M + A, where A is predictable with A_0 = 0 and M is a martingale. This representation of X is called the Doob decomposition. X is a submartingale if and only if A is monotone increasing.

212 Doob Decomposition (3)
• Definition, Square variation process (see Klenke, Definition 10.3):
– Let X = (X_n)_{n∈I} be a square integrable F-martingale. The unique predictable process A for which (X_n² − A_n)_{n∈I} becomes a martingale is called the square variation process of X and is denoted by (⟨X⟩_n)_{n∈I} = A.

213 Doob Decomposition (4)
• Theorem, Square variation process (see Klenke, Theorem 10.4):
– Let X = (X_n)_{n∈I} be a square integrable F-martingale. Then for n ∈ N0,
⟨X⟩_n = Σ_{i=1}^{n} E((X_i − X_{i−1})²|F_{i−1}) and E(⟨X⟩_n) = V(X_n − X_0).

214 Doob Decomposition (5)
• Discuss the Examples 10.2, 10.6 and 10.7.
• The square variation (quadratic variation) is an important concept/property of processes when continuous time martingales are investigated (see e.g. Klenke, 2008, Chapter 21).

215 Optional Sampling and Stopping (1)
• Theorem, (see Klenke, Lemma 10.10):
– Let I ⊂ R be countable, let X = (X_t)_{t∈I} be a martingale, let T ∈ I and let τ be a stopping time with τ ≤ T. Then X_τ = E(X_T|F_τ) and E(X_τ) = E(X_T).
• Note that this theorem requires that τ is bounded by some T ∈ I.
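To make the decomposition concrete: for the symmetric random walk, X_n² is a submartingale with E((X_k − X_{k−1})² | F_{k−1}) = 1, so the predictable part is A_n = ⟨X⟩_n = n and M_n = X_n² − n is a martingale. A simulation sketch (not part of the lecture notes):

```python
import numpy as np

rng = np.random.default_rng(5)
n_paths, T = 200_000, 10

steps = rng.choice([-1.0, 1.0], size=(n_paths, T))
X = np.cumsum(steps, axis=1)

# Doob decomposition of the submartingale X_n^2:
# A_n = <X>_n = n (predictable, increasing), M_n = X_n^2 - n (martingale).
M = X**2 - np.arange(1, T + 1)
print(np.abs(M.mean(axis=0)).max() < 0.1)   # E(M_n) stays at 0 for every n
```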
216 Optional Sampling and Stopping (2)
• Theorem, Optional Sampling Theorem (see Klenke, Theorem 10.11): Let X = (X_n)_{n∈N0} be a supermartingale and let σ ≤ τ be stopping times.
– Assume there exists a T ∈ N with τ ≤ T. Then
X_σ ≥ E(X_τ|F_σ), and, in particular, E(X_σ) ≥ E(X_τ).
If X is a martingale, then equality holds in each case.

217 Optional Sampling and Stopping (3)
• Theorem, Optional Sampling Theorem (see Klenke, Theorem 10.11): Let X = (X_n)_{n∈N0} be a supermartingale and let σ ≤ τ be stopping times.
– If X is nonnegative and if τ < ∞ almost surely, then we have E(X_τ) ≤ E(X_0) < ∞, E(X_σ) ≤ E(X_0) < ∞, and X_σ ≥ E(X_τ|F_σ).
– Assume that, more generally, X is only adapted and integrable. Then X is a martingale if and only if E(X_τ) = E(X_0) for any bounded stopping time τ.

218 Optional Sampling and Stopping (4)
• Definition, Stopped Process (see Klenke, Definition 10.13):
– Let I ⊂ R be countable, let (X_t)_{t∈I} be adapted and let τ be a stopping time. We define the stopped process X^τ by X_t^τ = X_{τ∧t} for any t ∈ I. Further, F^τ is the filtration (F_t^τ) = (F_{t∧τ}).
• F_{t∧τ} could be σ(X_{t∧τ}).
• X_t^τ is adapted to F^τ and F.

219 Optional Sampling and Stopping (5)
• Theorem, Optional Stopping (see Klenke, Theorem 10.15):
– Let X = (X_n)_{n∈N0} be a (sub-, super-) martingale with respect to F and let τ be a stopping time. Then X^τ is a (sub-, super-) martingale with respect to F and F^τ.
• Discuss the examples 10.16, 10.17 and 10.19.

220 Optional Sampling and Stopping (6)
• Until now we have considered bounded stopping times. To obtain an optional sampling result with unbounded stopping times, stronger assumptions on the stochastic process X become necessary ⇒ uniform integrability.
• Theorem, (see Klenke, Lemma 10.20):
– Let X = (X_n)_{n∈N0} be a uniformly integrable martingale. Then the family (X_τ : τ is a finite stopping time) is uniformly integrable.
• Note that bounded means τ ≤ T, i.e. P(τ ≤ T) = 1, while finite means τ ∈ I, where I ⊂ R, almost surely, i.e. P(τ < ∞) = 1.

221 Optional Sampling and Stopping (7)
• Theorem, Optional Sampling and Uniform Integrability (see Klenke, Theorem 10.21):
– Let X = (X_n)_{n∈N0} be a uniformly integrable martingale (respectively supermartingale) and let σ ≤ τ be finite stopping times. Then E(|X_τ|) < ∞ and X_σ = E(X_τ|F_σ), respectively X_σ ≥ E(X_τ|F_σ).

222 Optional Sampling and Stopping (8)
• Theorem, (see Klenke, Corollary 10.22):
– Let X = (X_n)_{n∈N0} be a uniformly integrable martingale (respectively supermartingale) and let τ_1 ≤ τ_2 ≤ . . . be finite stopping times. Then (X_{τ_n})_{n∈N} is a martingale (respectively supermartingale).

223 Martingale Convergence
• Motivation:
– In the former section we observed that we obtain a martingale from a martingale when the martingale transform (H · X) is applied or by optional stopping.
– I.e. we cannot transform a fair game into an unfair game.
– Now we investigate this question when t → ∞.

224 Doob's Inequality (1)
• Let I ⊂ N0 and let X = (X_n)_{n∈I} be a stochastic process. For n ∈ N we define X_n* = sup{X_k : k ≤ n} and |X|_n* = sup{|X_k| : k ≤ n}.
• Theorem, (see Klenke, Lemma 11.1):
– Let X be a submartingale. Then for all λ > 0,
λ P(X_n* ≥ λ) ≤ E(X_n 1_{{X_n* ≥ λ}}) ≤ E(|X_n| 1_{{X_n* ≥ λ}}).

225 Doob's Inequality (2)
• Theorem, Doob's L^p-inequality (see Klenke, Theorem 11.2): Let X be a martingale or a positive submartingale.
– For any p ≥ 1 and λ > 0, λ^p P(|X|_n* ≥ λ) ≤ E(|X_n|^p).
– For any p > 1,
E(|X_n|^p) ≤ E((|X|_n*)^p) ≤ (p/(p − 1))^p E(|X_n|^p).

226 Martingale Convergence Theorems (1)
• Motivation and notation - upcrossing inequality:
– F = (F_n)_{n∈N0}, F_∞ = σ(⋃_{n∈N0} F_n). (X_n)_{n∈N0} is real valued and adapted to F.
– a, b ∈ R with a < b.
– An upcrossing occurs when X passes the interval [a, b].
227 Martingale Convergence Theorems (2)
• Motivation and notation - upcrossing inequality:
– In more detail: τ_k := inf{n ≥ σ_{k−1} : X_n ≤ a} and σ_k := inf{n ≥ τ_k : X_n ≥ b} for k ∈ N. τ_k = ∞ if σ_{k−1} = ∞; σ_k = ∞ if τ_k = ∞.
– X has its kth upcrossing over [a, b] between τ_k and σ_k if σ_k < ∞.
– For n ∈ N, we define the number of upcrossings over [a, b] at time n by
U_n^{a,b} := sup{k ∈ N0 : σ_k ≤ n}.

228 Martingale Convergence Theorems (3)
• Theorem, Upcrossing inequality (see Klenke, Lemma 11.3):
– Let (X_n)_{n∈N0} be a submartingale. Then
E(U_n^{a,b}) ≤ (E((X_n − a)⁺) − E((X_0 − a)⁺)) / (b − a).

229 Martingale Convergence Theorems (4)
• Theorem, Martingale convergence theorem (see Klenke, Theorem 11.4):
– Let (X_n)_{n∈N0} be a submartingale with sup{E(X_n⁺) : n ≥ 0} < ∞. Then there exists an F_∞-measurable random variable X_∞ with E(|X_∞|) < ∞ and X_n → X_∞ almost surely as n → ∞.

230 Martingale Convergence Theorems (5)
• Theorem, (see Klenke, Corollary 11.5):
– Let (X_n)_{n∈N0} be a nonnegative supermartingale. Then there is an F_∞-measurable random variable X_∞ ≥ 0 with E(X_∞) ≤ E(X_0) and X_n → X_∞ almost surely as n → ∞.

231 Martingale Convergence Theorems (6)
• Example: St. Petersburg game
– Let S_n be the account balance in the St. Petersburg game (Example 9.40). Then S is a martingale.
– S_n ≤ 1 almost surely for any n.
– Therefore the requirements of the martingale convergence theorem are fulfilled. By this, (S_n) converges to a finite random variable almost surely. In the St. Petersburg game (S_n) converges to 1 almost surely.
– Since E(S_n) = 0 for all n ∈ N0 we do not obtain L¹ convergence. We also know that S_n is integrable but not uniformly integrable.

232 Martingale Convergence Theorems (7)
• Theorem, Convergence theorem for uniformly integrable martingales (see Klenke, Corollary 11.7): Let (X_n)_{n∈N0} be a uniformly integrable F-(sub-, super-) martingale.
Then there exists an F_∞-measurable and integrable random variable X_∞ with X_n → X_∞ almost surely and in L¹ as n → ∞. Furthermore:
– X_n = E(X_∞|F_n) for all n ∈ N if X is a martingale,
– X_n ≤ E(X_∞|F_n) for all n ∈ N if X is a submartingale,
– X_n ≥ E(X_∞|F_n) for all n ∈ N if X is a supermartingale.

233 Martingale Convergence Theorems (8)
• Theorem, L^p convergence theorem for martingales (see Klenke, Corollary 11.7): Let p > 1 and let (X_n)_{n∈N0} be an L^p-bounded martingale. Then there exists an F_∞-measurable random variable X_∞ with E(|X_∞|^p) < ∞ and X_n → X_∞ almost surely and in L^p as n → ∞. In particular, (|X_n|^p)_{n∈N0} is uniformly integrable.

234 Outline - Markov Chains
• Markov chains - definitions.
• Discrete Markov chains.
• Recurrence and transience.
• Random walks.
• Invariant distributions, periodicity and convergence.
• Markov chain Monte Carlo methods.
• Klenke, Chapters 17 and 18.

235 Definitions and Construction (1)
• E is a Polish space with Borel σ-algebra B(E), I ⊂ R is some index set, and (X_t)_{t∈I} is an E-valued stochastic process. F = (F_t)_{t∈I} = σ(X) if not otherwise stated.
• Definition, Markov Property (see Klenke, Definition 17.1):
– We say that X has the Markov property if, for every A ∈ B(E) and all s, t ∈ I with s ≤ t:
P(X_t ∈ A|F_s) = P(X_t ∈ A|X_s).
• If E is a countable space, then X has the Markov property if and only if for all n ∈ N, all s_1 < · · · < s_n < t and all i_1, . . . , i_n, i ∈ E with P(X_{s_1} = i_1, . . . , X_{s_n} = i_n) > 0 we have
P(X_t = i|X_{s_1} = i_1, . . . , X_{s_n} = i_n) = P(X_t = i|X_{s_n} = i_n).

236 Definitions and Construction (2)
• Definition, (see Klenke, Definition 17.3): Let I ⊂ [0, ∞) be closed under addition and assume 0 ∈ I. A stochastic process X is called a time-homogeneous Markov process with distributions (P_x)_{x∈E} on the space (Ω, A) if:
– For every x ∈ E, X is a stochastic process on the probability space (Ω, A, P_x) with P_x(X_0 = x) = 1.
– The map κ : E × B(E)^{⊗I} → [0, 1], (x, B) ↦ P_x(X ∈ B), is a stochastic kernel.
237 Definitions and Construction (3)
• Definition, (see Klenke, Definition 17.3, continued):
– X has the time-homogeneous Markov property: for every A ∈ B(E), every x ∈ E and all s, t ∈ I we have
P_x(X_{t+s} ∈ A|F_s) = κ_t(X_s, A), P_x-almost surely.
Here, for every t ∈ I, the transition kernel κ_t : E × B(E) → [0, 1] is the stochastic kernel defined for x ∈ E and A ∈ B(E) by
κ_t(x, A) := κ(x, {y ∈ E^I : y(t) ∈ A}) = P_x(X_t ∈ A).
The family (κ_t(x, A)), t ∈ I, x ∈ E, A ∈ B(E), is also called the family of transition probabilities of X.
238 Definitions and Construction (4)
• Definition, (see Klenke, Definition 17.3, continued): notation for a time-homogeneous Markov process X with distributions (P_x)_{x∈E} on the space (Ω, A):
– We write E_x for the expectation with respect to P_x, L_x(X) = P_x, and L_x(X|F) = P_x(X ∈ ·|F) for a (regular) conditional distribution of X given F.
– If E is countable, then X is called a discrete Markov process.
– In the special case I = N_0, X is called a Markov chain. In this case κ_n is called the family of n-step transition probabilities.
• We shall observe that the existence of the transition kernels κ_t implies the existence of the kernel κ.
• Discuss Examples 17.5 and 17.7.
239 Definitions and Construction (5)
• Definition, Transition kernel (see Klenke, Definition 8.24): Let (Ω_1, A_1) and (Ω_2, A_2) be measurable spaces. A map κ : Ω_1 × A_2 → [0, ∞] is called a (σ-)finite transition kernel from Ω_1 to Ω_2 if
– ω_1 ↦ κ(ω_1, A_2) is A_1-measurable for any A_2 ∈ A_2.
– A_2 ↦ κ(ω_1, A_2) is a (σ-)finite measure on (Ω_2, A_2) for any ω_1 ∈ Ω_1.
If the measure is a probability measure for all ω_1 ∈ Ω_1, then κ is called a stochastic kernel or Markov kernel.
240 Definitions and Construction (6)
• The next theorem constructs a Markov process for a more general Markov semigroup of stochastic kernels. (I.e. we consider a set of kernels K with elements κ_t and a binary operation - in our case κ_s ∗ κ_t - for which the semigroup axioms hold; in particular, ∗ is associative.)
• Theorem, (see Klenke, Theorem 17.8):
– Let I ⊂ [0, ∞) be closed under addition and let (κ_t)_{t∈I} be a Markov semigroup of stochastic kernels from E to E. Then there is a measurable space (Ω, A) and a Markov process ((X_t)_{t∈I}, (P_x)_{x∈E}) on the space (Ω, A) with transition probabilities
P_x(X_t ∈ A) = κ_t(x, A) for all x ∈ E, A ∈ B(E) and t ∈ I.
Conversely, for every Markov process X the above equation defines a semigroup of stochastic kernels. By this equation the finite dimensional distributions of X are uniquely determined.
241 Definitions and Construction (7)
• Theorem, (see Klenke, Theorem 17.9):
– A stochastic process is a Markov process if and only if there exists a stochastic kernel κ : E × B(E)^{⊗I} → [0, 1] such that for every bounded B(E)^{⊗I}-measurable function f : E^I → R and for every s ≥ 0 and x ∈ E we have
E_x(f((X_{t+s})_{t∈I})|F_s) = E_{X_s}(f(X)) := ∫_{E^I} κ(X_s, dy) f(y).
242 Definitions and Construction (8)
• Theorem, (see Klenke, Corollary 17.10):
– A stochastic process (X_n)_{n∈N_0} is a Markov process if and only if
L_x((X_{n+k})_{n∈N_0}|F_k) = L_{X_k}((X_n)_{n∈N_0}) for every k ∈ N_0.
243 Definitions and Construction (9)
• Theorem, (see Klenke, Theorem 17.11):
– Let I = N_0. If (X_n)_{n∈N_0} is a stochastic process with distributions (P_x, x ∈ E), then the Markov property in Definition 17.3(iii) is implied by the existence of a stochastic kernel κ_1 : E × B(E) → [0, 1] with the property that for every A ∈ B(E), every x ∈ E and every s ∈ I we have
P_x(X_{s+1} ∈ A|F_s) = κ_1(X_s, A).
In this case the n-step transition kernel κ_n can be computed inductively by
κ_n = κ_{n−1} ∗ κ_1 = ∫_E κ_{n−1}(·, dx) κ_1(x, ·).
In particular, the family (κ_n)_{n∈N} is a Markov semigroup and the distribution of X is uniquely determined by κ_1.
244 Definitions and Construction (10)
• Definition, Strong Markov property (see Klenke, Definition 17.12):
– Let I ⊂ [0, ∞) be closed under addition. A Markov process (X_t)_{t∈I} with distributions (P_x, x ∈ E) has the strong Markov property if, for every a.s. finite stopping time τ, every bounded B(E)^{⊗I}-measurable function f : E^I → R and every x ∈ E, we have
E_x(f((X_{τ+s})_{s∈I})|F_τ) = E_{X_τ}(f(X)) := ∫_{E^I} κ(X_τ, dy) f(y).
245 Definitions and Construction (11)
• Theorem, (see Klenke, Theorem 17.14):
– If I ⊂ [0, ∞) is countable and closed under addition, then every Markov process (X_n)_{n∈I} with distributions (P_x, x ∈ E) has the strong Markov property.
246 Definitions and Construction (12)
• Theorem, Reflection principle (see Klenke, Theorem 17.15):
– Let Y_1, Y_2, ... be iid real random variables with symmetric distribution L(Y_1) = L(−Y_1). Define X_0 := 0 and X_n := Y_1 + · · · + Y_n for n ∈ N. Then for every n ∈ N_0 and a > 0
P(sup_{m≤n} X_m ≥ a) ≤ 2P(X_n ≥ a) − P(X_n = a).
If P(Y_1 ∈ {−1, 0, 1}) = 1, then equality holds for a ∈ N.
247 Discrete Markov Chains (1)
• In the following E is countable and I = N_0.
• X = (X_n)_{n∈N_0} on E is a discrete Markov chain, or Markov chain with discrete state space.
• If X is discrete then (P_x)_{x∈E} is described by the transition matrix P = (p(x, y))_{x,y∈E} = (P_x[X_1 = y])_{x,y∈E}.
• The n-step transition probabilities are p^{(n)}(x, y) = P_x[X_n = y].
248 Discrete Markov Chains (2)
• The n-step transition probabilities can be obtained by the n-fold matrix product, p^{(n)}(x, y) = p^n(x, y), where p^n(x, y) = Σ_{z∈E} p^{n−1}(x, z) p(z, y) and p^0 = I is the identity matrix.
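The identity p^{(n)} = p^n and the resulting Chapman-Kolmogorov identity are easy to check numerically; a minimal sketch (the 2×2 transition matrix is a hypothetical example, not taken from the slides):

```python
import numpy as np

# hypothetical 2-state transition matrix (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# n-step transition probabilities as the n-fold matrix product P^n
P3 = np.linalg.matrix_power(P, 3)

# Chapman-Kolmogorov: P^(m+n) = P^m @ P^n
lhs = np.linalg.matrix_power(P, 5)
rhs = np.linalg.matrix_power(P, 2) @ np.linalg.matrix_power(P, 3)
assert np.allclose(lhs, rhs)

# every P^n is again a stochastic matrix
assert np.allclose(P3.sum(axis=1), 1.0)
```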
• Induction yields the Chapman-Kolmogorov equation: for all m, n ∈ N_0 and x, y ∈ E we have
p^{(m+n)}(x, y) = Σ_{z∈E} p^{(m)}(x, z) p^{(n)}(z, y).
249 Discrete Markov Chains (3)
• Definition, Stochastic matrix (see Klenke, Definition 17.16):
– A matrix (p(x, y))_{x,y∈E} with nonnegative entries and with
Σ_{y∈E} p(x, y) = 1 for all x ∈ E
is called a stochastic matrix.
250 Discrete Markov Chains (4)
• Remark: stochastic matrix
– A stochastic matrix is a stochastic kernel from E to E.
– By Theorem 17.8 there exists a unique discrete Markov chain.
– Example: (R_n(x), x ∈ E, n ∈ N_0) is an independent family of random variables with values in E and distributions P(R_n(x) = y) = p(x, y) for all x, y ∈ E and n ∈ N_0. We did not require that (R_n(x), x ∈ E) are independent. We also did not require that the R_n have the same distribution; only the one-dimensional marginal distributions are determined. With x ∈ E we define X_0^x = x and X_n^x = R_n(X_{n−1}^x) for n ∈ N.
– Notation: P_x := L(X^x) is the distribution of X^x. This is a probability measure on the space of sequences (E^{N_0}, B(E)^{⊗N_0}).
251 Discrete Markov Chains (5)
• Theorem, (see Klenke, Theorem 17.17):
– With respect to the distributions (P_x)_{x∈E}, the canonical process X on (E^{N_0}, B(E)^{⊗N_0}) is a Markov chain with transition matrix P.
– In particular, to any stochastic matrix p there corresponds a unique discrete Markov chain X with transition probabilities p.
252 Discrete Markov Chains (6)
• Example, Random walk (see Klenke, Example 17.18):
– Let E = Z and assume that p(x, y) = p(0, y − x) for all x, y ∈ Z.
– Here p is translation invariant.
– A discrete Markov chain X with this transition probability matrix p is a random walk on Z.
– X_n has the same distribution as X_0 + Z_1 + · · · + Z_n, where the (Z_n) are iid with P(Z_n = x) = p(0, x).
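The representation X_n = X_0 + Z_1 + · · · + Z_n makes a random walk straightforward to simulate; a sketch for the simple symmetric walk on Z (the choice P(Z_n = ±1) = 1/2 is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# simple symmetric random walk: iid increments Z_n = +/-1 with prob. 1/2 each
n = 1000
Z = rng.choice([-1, 1], size=n)
X = np.concatenate(([0], np.cumsum(Z)))  # X_0 = 0, X_k = Z_1 + ... + Z_k

# the transition probabilities are translation invariant, p(x, y) = p(0, y - x),
# since each step adds an increment whose law does not depend on the state
assert X[0] == 0 and X.shape == (n + 1,)
assert np.all(np.abs(np.diff(X)) == 1)
```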
253 Discrete Markov Chains (7)
• Example, Simulation (see Klenke, Example 17.19):
– Consider a finite state space E = {1, ..., k}. We want to simulate X with transition matrix P.
– (U_n)_{n∈N} is a sequence of iid uniform random numbers on [0, 1].
– Define r(i, 0) = 0 and r(i, j) = p(i, 1) + · · · + p(i, j) for i, j ∈ E. Define R_n by
R_n(i) = j ⇔ U_n ∈ [r(i, j − 1), r(i, j)).
– Then P(R_n(i) = j) = r(i, j) − r(i, j − 1) = p(i, j).
254 Discrete Processes in Continuous Time (1)
• E is countable.
• (X_t)_{t∈[0,∞)} is a process on E with transition matrices p_t(x, y) := P_x(X_t = y) for x, y ∈ E.
255 Discrete Processes in Continuous Time (2)
• For x ≠ y the process X jumps from x to y with rate q(x, y) if the following limit exists:
q(x, y) = lim_{t↓0} (1/t) P_x(X_t = y).
• Assume that these limits exist and that Σ_{y≠x} q(x, y) < ∞.
• We define q(x, x) := −Σ_{y≠x} q(x, y).
• With this convention,
lim_{t↓0} (1/t) (P_x(X_t = y) − 1_{(x=y)}) = q(x, y) for all x, y ∈ E.
256 Discrete Processes in Continuous Time (3)
• Definition, Generator (see Klenke, Definition 17.23):
– If q(x, y) = lim_{t↓0} (1/t) P_x(X_t = y) for x ≠ y, q(x, x) = −Σ_{y≠x} q(x, y), and lim_{t↓0} (1/t)(P_x(X_t = y) − 1_{(x=y)}) = q(x, y) hold, then Q = (q(x, y))_{x,y∈E} is called the Q-matrix of X or the generator of the semigroup (p_t)_{t≥0}.
257 Discrete Processes in Continuous Time (4)
• Theorem, Generator (see Klenke, Theorem 17.25):
– Let Q be an E × E matrix such that q(x, y) ≥ 0 for all x, y ∈ E with x ≠ y, q(x, x) = −Σ_{y≠x} q(x, y), and λ := sup_{x∈E} |q(x, x)| < ∞. Then Q is the Q-matrix of a unique Markov process X.
258 Discrete Processes in Continuous Time (5)
• Example, Poisson process (see Klenke, Example 17.24):
– A Poisson process with intensity α has the Q-matrix q(x, y) = α(1_{(y=x+1)} − 1_{(y=x)}).
• We meet continuous time Markov chains with discrete state space e.g. in credit risk modeling (see e.g.
Schönbucher, 2003, Chapter 8.2).
259 Poisson Process (1)
• Example: Geiger counter; the number of clicks in a time interval I = (a, b] should be a random variable. The number of clicks should be
– Independent for disjoint intervals.
– Homogeneous: i.e. when the interval is shifted by some constant c ∈ R, the distribution remains the same.
– Of finite expectation.
– Such that at any single point in time there is at most one click.
260 Poisson Process (2)
• In more formal terms: I = {(a, b] : a, b ∈ [0, ∞], a ≤ b}, ℓ((a, b]) = b − a. For I ∈ I, N_I ∈ N_0, and N_t = N_{(0,t]} is the number of clicks up to time t. (N_I, I ∈ I) is a family of random variables with:
P1 N_{I∪J} = N_I + N_J if I ∩ J = ∅ and I ∪ J ∈ I.
P2 The distribution of N_I only depends on the length of the interval I, i.e. P_{N_I} = P_{N_J} if ℓ(I) = ℓ(J).
P3 If J ⊂ I with I ∩ J = ∅ for all I, J ∈ J with I ≠ J, then (N_J, J ∈ J) is an independent family.
261 Poisson Process (3)
• In addition:
P4 For any I ∈ I we have E(N_I) < ∞.
P5 lim sup_{ε↓0} ε^{−1} P(N_ε ≥ 2) = 0.
• Let λ := lim sup_{ε↓0} ε^{−1} P(N_ε ≥ 2). Then:
262 Poisson Process (4)
P(double click in (0, 1])
= lim_{n→∞} P(∪_{k=0}^{2^n−1} {N_{(k2^{−n}, (k+1)2^{−n}]} ≥ 2})
= 1 − lim_{n→∞} P(∩_{k=0}^{2^n−1} {N_{(k2^{−n}, (k+1)2^{−n}]} ≤ 1})
= 1 − lim_{n→∞} Π_{k=0}^{2^n−1} P(N_{(k2^{−n}, (k+1)2^{−n}]} ≤ 1)  (by P3)
= 1 − lim_{n→∞} (1 − P(N_{(0, 2^{−n}]} ≥ 2))^{2^n}  (by P2)
= 1 − lim_{n→∞} (1 − 2^{−n} · 2^n P(N_{(0, 2^{−n}]} ≥ 2))^{2^n}
≤ 1 − e^{−λ}.
263 Poisson Process (5)
• Definition, Poisson process (see Klenke, Definition 5.33): A family (N_t, t ≥ 0) of N_0-valued random variables is called a Poisson process with intensity α ≥ 0 if N_0 = 0 and if:
– For any n ∈ N and any choice of n + 1 numbers 0 = t_0 < t_1 < · · · < t_n the family (N_{t_i} − N_{t_{i−1}}, i = 1, ..., n) is independent.
– For t > s ≥ 0 the difference N_t − N_s is Poisson distributed with parameter α(t − s), that is,
P(N_t − N_s = k) = e^{−α(t−s)} (α(t − s))^k / k! for all k ∈ N_0.
264 Poisson Process (6)
• Theorem, (see Klenke, Theorem 5.34):
– If (N_I, I ∈ I) has the properties P1 to P5, then (N_{(0,t]}, t ≥ 0) is a Poisson process with intensity α := E(N_{(0,1]}). If (N_t) is a Poisson process, then (N_t − N_s, (s, t] ∈ I) has the properties P1 to P5.
265 Poisson Process (7)
• Example - Geiger counter
– The waiting time between clicks is the time during which the process does not jump. Hence, P(N_{(s,s+t]} = 0) = P(N_{s+t} − N_s = 0) = e^{−αt}.
– The waiting times are independent, since N_t − N_s and N_s − N_r are independent for t > s > r.
– The Poisson process is a process with independent and stationary increments.
266 Poisson Process (8)
• Ad existence:
– Let W_1, W_2, ... be an independent family of exponentially distributed random variables with parameter α > 0, i.e. P(W_n > x) = e^{−αx}.
– Define T_n = Σ_{k=1}^n W_k; W_n is interpreted as the waiting time between jump n − 1 and jump n, and T_n is the time at which the process jumps from n − 1 to n.
– Let N_t = Σ_{n≥1} 1_{(T_n ≤ t)}.
– Hence, {N_t = k} = {T_k ≤ t < T_{k+1}}.
– Since {N_t ≥ k} ∈ F_t, the jump times T_k are stopping times.
– The process (N_t) is right-continuous and non-decreasing.
267 Poisson Process (9)
• Theorem, (see Klenke, Theorem 5.35):
– Given the above construction of (N_t), the family (N_t, t ≥ 0) is a Poisson process with intensity α.
• We meet the Poisson process e.g. in continuous time option pricing models (see e.g. Lamberton and Lapeyre, 2008, Chapter 7).
268 Discrete Processes in Continuous Time (5)
• Example, Poisson process (see Klenke, Example 17.24):
– A Poisson process with intensity α has the Q-matrix q(x, y) = α(1_{(y=x+1)} − 1_{(y=x)}).
269 Rating Model (1)
• We meet continuous time Markov chains with discrete state space e.g. in credit risk modeling (see e.g.
Schönbucher, 2003, Chapter 8.2).
• Standard & Poor's rating transition matrix (entries in percent; rows = current rating, columns = rating after one period):

P_{(0,1]} =
        AAA     AA      A      BBB     BB      B      CCC     D
AAA    89.10   9.63    0.78    0.19   0.30    0.00    0.00    0.00
AA      0.86  90.10    7.47    0.99   0.29    0.29    0.00    0.00
A       0.09   2.91   88.94    6.49   1.01    0.45    0.00    0.09
BBB     0.06   0.43    6.56   84.27   6.44    1.60    0.18    0.45
BB      0.04   0.22    0.79    7.19  77.64   10.43    1.27    2.41
B       0.00   0.19    0.31    0.66   5.17   82.46    4.35    6.85
CCC     0.00   0.00    1.16    1.16   2.03    7.54   64.93   23.19
D       0.00   0.00    0.00    0.00   0.00    0.00    0.00  100.00

270 Rating Model (2)
• After taking the matrix logarithm (see e.g. Horn and Johnson, 1990), we obtain the generator matrix Q:

        AAA      AA       A       BBB      BB       B       CCC      D
AAA   −0.1159   0.1075   0.0042   0.0013   0.0034  −0.0004   0.0000   0.0000
AA     0.0096  −0.1061   0.0832   0.0081   0.0026   0.0029  −0.0001  −0.0002
A      0.0008   0.0324  −0.1214   0.0746   0.0090   0.0040  −0.0003   0.0006
BBB    0.0006   0.0036   0.0756  −0.1775   0.0790   0.0140   0.0014   0.0033
BB     0.0004   0.0022   0.0058   0.0885  −0.2612   0.1295   0.0138   0.0208
B      0.0000   0.0021   0.0027   0.0047   0.0640  −0.1998   0.0590   0.0673
CCC    0.0000  −0.0004   0.0144   0.0136   0.0245   0.1013  −0.4353   0.2820
D      0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000

271 Rating Model (3)
• Note that P_{(0,1]} = exp(1 · Q), where exp stands for the matrix exponential.
• To obtain P_{(0,s]}, s ∈ R_+, calculate P_{(0,s]} = exp(s · Q).
• The off-diagonal elements q_ij, i, j = 1, ..., N, i ≠ j, are non-negative. (In the estimated Q above a few off-diagonal entries are slightly negative, due to estimation and rounding error.)
• The diagonal elements satisfy q_ii = −Σ_{l≠i} q_il.
272 Rating Model (4)
• Q(t), and therefore P_{(t,T]}, can also depend on time t. E.g. consider P_{(t,t+∆]} ≈ I_N + ∆Q(t).
• If the generator matrices commute, i.e. Q(s)Q(t) = Q(t)Q(s), then P_{(t,T]} = exp(∫_t^T Q(s) ds).
• If Q(t) = Q for all t, then P_{(t,T]} = exp((T − t)Q).
273 Rating Model (5)
• The continuous time Markov chain with generator Q can be viewed as a collection of K compound Poisson processes.
• I.e.
for every rating class k ∈ E, a Poisson process N_k(t) triggers a transition away from class k with intensity −q_kk (recall that q_kk ≤ 0, so −q_kk ≥ 0 is the intensity of leaving class k).
• Whenever a jump occurs, the random variable V indicates the new state of the chain. The transition intensity from class k to class j, j ≠ k, is q_kj.
• The conditional probability of moving from class k to class j when a jump takes place is (approximately) −q_kj/q_kk.
274 Rating Model (6)
• The rating process R(t) can be written as
dR(t) = (V − R(t)) dN_{R(t)}(t).
• For the estimation of the transition intensities see Schönbucher (2003, Chapter 8.3).
275 Recurrence and Transience (1)
• We consider X = (X_n)_{n∈N_0}, a chain on a countable space E with transition matrix P.
• In the following we meet some properties of discrete chains. These properties will be useful when we investigate the limit behavior of chains.
276 Recurrence and Transience (2)
• Definition, (see Klenke, Definition 17.28):
– For any x ∈ E, let τ_x := τ_x^1 := inf{n > 0 : X_n = x} and τ_x^k := inf{n > τ_x^{k−1} : X_n = x} for k ∈ N, k ≥ 2. τ_x^k is the kth entrance time of X for x. For x, y ∈ E let
F(x, y) := P_x(τ_y^1 < ∞) = P_x(there is an n ≥ 1 with X_n = y)
be the probability of ever going from x to y. In particular, F(x, x) is the probability of returning to x after the first jump.
277 Recurrence and Transience (3)
• Theorem, (see Klenke, Theorem 17.29):
– For all x, y ∈ E and k ∈ N we have
P_x(τ_y^k < ∞) = F(x, y) F(y, y)^{k−1}.
278 Recurrence and Transience (4)
• Definition, (see Klenke, Definition 17.30): A state x ∈ E is called
– recurrent if F(x, x) = 1.
– positive recurrent if E_x(τ_x^1) < ∞.
– null recurrent if x is recurrent but not positive recurrent.
– transient if F(x, x) < 1.
– absorbing if p(x, x) = 1.
• The Markov chain X is called (positive/null) recurrent if every state x ∈ E is (positive/null) recurrent, and it is called transient if every recurrent state is absorbing.
• Note that 'absorbing' ⇒ 'positive recurrent' ⇒ 'recurrent'.
279 Recurrence and Transience (5)
• Definition, (see Klenke, Definition 17.33):
– Denote by N(y) = Σ_{n=0}^∞ 1_{(X_n=y)} the total number of visits of X to y, and by
G(x, y) = E_x(N(y)) = Σ_{n=0}^∞ p^n(x, y)
the Green function of X.
280 Recurrence and Transience (6)
• Theorem, (see Klenke, Theorem 17.34):
– For x, y ∈ E we have (with the convention 1/0 = ∞)
G(x, y) = F(x, y) / (1 − F(y, y)) if x ≠ y, and G(x, y) = 1 / (1 − F(y, y)) if x = y.
In both cases, G(x, y) = F(x, y) G(y, y) + 1_{(x=y)}.
– A state x ∈ E is recurrent if and only if G(x, x) = ∞.
281 Recurrence and Transience (7)
• Theorem, (see Klenke, Theorem 17.35):
– If x is recurrent and F(x, y) > 0, then y is also recurrent and F(x, y) = F(y, x) = 1.
282 Recurrence and Transience (8)
• Definition, Irreducible (see Klenke, Definition 17.36): A discrete Markov chain is called
– irreducible if F(x, y) > 0 for all x, y ∈ E, or equivalently G(x, y) > 0.
– weakly irreducible if F(x, y) + F(y, x) > 0 for all x, y ∈ E.
283 Recurrence and Transience (9)
• Theorem, (see Klenke):
– An irreducible discrete Markov chain is either recurrent or transient. If |E| ≥ 2 then there is no absorbing state.
284 Invariant Distributions (1)
• We consider a discrete space E, and (X_n)_{n∈N_0} is a Markov chain.
• Is there a distribution L(X_n) which stays the same for all n?
• If such an invariant distribution exists, the next chapter provides conditions under which convergence to this invariant distribution takes place.
285 Invariant Distributions (2)
• Definition, (see Klenke, Definition 17.42):
– If µ is a measure on E and f : E → R is a map, then we write
µP({x}) = Σ_{y∈E} µ({y}) p(y, x) and Pf(x) = Σ_{y∈E} p(x, y) f(y)
if the sums converge.
286 Invariant Distributions (3)
• Definition, (see Klenke, Definition 17.43):
– A σ-finite measure µ on E is called an invariant measure if µP = µ. A probability measure that is an invariant measure is called an invariant distribution. Denote by I the set of invariant distributions.
– A function f : E → R is called subharmonic if Pf exists and f ≤ Pf. It is called superharmonic if f ≥ Pf, and harmonic if f = Pf.
• Remark: in terms of linear algebra, an invariant measure is a left eigenvector of P corresponding to the eigenvalue 1; a harmonic function is a right eigenvector corresponding to the eigenvalue 1.
287 Invariant Distributions (4)
• Theorem, (see Klenke, Theorem 17.46):
– If X is transient, then an invariant distribution does not exist.
288 Invariant Distributions (5)
• Theorem, (see Klenke, Theorem 17.47):
– Let x be a recurrent state and let τ_x^1 = inf{n ≥ 1 : X_n = x}. Then an invariant measure µ_x is defined by
µ_x({y}) = E_x(Σ_{n=0}^{τ_x^1 − 1} 1_{(X_n=y)}) = Σ_{n=0}^∞ P_x(X_n = y; τ_x^1 > n).
289 Invariant Distributions (6)
• Theorem, (see Klenke, Corollary 17.48):
– If X is positive recurrent, then π := µ_x / E_x(τ_x^1) is an invariant distribution for any x ∈ E.
• Klenke provides a citation for: if X is irreducible and recurrent, then the invariant measure of X is unique up to a multiplicative factor.
• If X is transient there can be more than one invariant measure; see e.g. Remark 17.50(ii).
290 Invariant Distributions (7)
• Theorem, (see Klenke, Theorem 17.51):
– Let X be irreducible. X is positive recurrent if and only if the set of invariant distributions satisfies I ≠ ∅. In this case, I = {π} with
π({x}) := 1 / E_x(τ_x^1) > 0 for all x ∈ E.
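Theorem 17.51 can be illustrated numerically: π computed as the normalized left eigenvector of P for the eigenvalue 1 should satisfy π({x}) = 1/E_x(τ_x^1), where the mean return time is estimated by simulation. The two-state chain below is a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical irreducible two-state chain; its invariant distribution is (2/3, 1/3)
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# invariant distribution: left eigenvector of P for the eigenvalue 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

def return_time(start, P, rng):
    # simulate the chain until it first returns to its starting state
    x, n = start, 0
    while True:
        x = rng.choice(len(P), p=P[x])
        n += 1
        if x == start:
            return n

mean_tau = np.mean([return_time(0, P, rng) for _ in range(20000)])
# Theorem 17.51: pi({0}) = 1 / E_0(tau_0^1)
assert abs(mean_tau - 1.0 / pi[0]) < 0.05
```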
• Discuss Example 17.52 and Examples 1.7.3, 1.7.8 and 1.7.10 in Norris (1998).
291 Convergence of Chains (1)
• We consider a Markov chain X with invariant distribution π.
• When does the distribution of X_n (P_{X_n} or L(X_n)) converge to π as n → ∞?
• We shall observe that it is necessary and sufficient that the state space cannot be decomposed into subspaces that the chain does not leave, or that are visited by the chain periodically (see e.g. the weather model vs. the learning model).
• The first property is called irreducibility, the second aperiodicity (vs. reducible and periodic).
292 Convergence of Chains (2)
• We consider a positive recurrent Markov chain X on the countable space E with transition matrix P, started at some arbitrary µ ∈ M_1(E).
• When does the distribution of X_n converge to π, i.e. µP^n → π as n → ∞?
• First, π has to be unique (up to a factor). π is the unique left eigenvector of P with eigenvalue 1. For uniqueness, irreducibility is sufficient by Theorem 17.49.
• To obtain µP^n → π, contraction properties of P are necessary. 1 is the largest eigenvalue of P. The stochastic matrix is sufficiently contractive if the multiplicity of the eigenvalue 1 is one and there are no further (possibly complex valued) eigenvalues with modulus one.
293 Periodicity of Markov Chains (1)
• For the last property, irreducibility is not sufficient. E.g. with E = {0, ..., N − 1}, the Markov chain with transition matrix p(x, y) = 1_{(y = x+1 (mod N))} is, for N = 3,

P = ( 0 1 0 )
    ( 0 0 1 )
    ( 1 0 0 ).

Every point is visited periodically after N steps.
• For N = 3 the eigenvalues are 1 and −0.5000 ± 0.8660i; for N = 2 the eigenvalues are 1 and −1. In general, for N ≥ 1 the eigenvalues are the N roots of unity e^{2πik/N}, k = 0, 1, ..., N − 1. The uniform distribution is the invariant distribution, but lim_{n→∞} δ_x P^n does not exist for any x ∈ E.
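The spectral picture of the cyclic chain can be verified directly; a short sketch for N = 3:

```python
import numpy as np

# cyclic shift chain on E = {0, 1, 2}: p(x, y) = 1 if y = x + 1 (mod 3)
N = 3
P = np.zeros((N, N))
for x in range(N):
    P[x, (x + 1) % N] = 1.0

# eigenvalues are the N-th roots of unity: all on the unit circle
ev = np.linalg.eigvals(P)
assert np.allclose(np.abs(ev), 1.0)
assert np.allclose(np.sort(ev.real), [-0.5, -0.5, 1.0])

# the uniform distribution is invariant ...
u = np.full(N, 1.0 / N)
assert np.allclose(u @ P, u)

# ... but delta_x P^n cycles with period N, so it does not converge
delta = np.array([1.0, 0.0, 0.0])
assert np.allclose(delta @ np.linalg.matrix_power(P, N), delta)  # back after N steps
assert not np.allclose(delta @ P, delta)                         # moved in between
```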
294 Periodicity of Markov Chains (2)
• Notation: for m, n ∈ N we write m|n if m is a divisor of n, i.e. n/m ∈ N.
• If M ⊂ N, then gcd(M) is the greatest common divisor of all n ∈ M.
295 Periodicity of Markov Chains (3)
• Definition, periodic, aperiodic (see Klenke, Definition 18.1):
– For x, y ∈ E define
N(x, y) = {n ∈ N_0 : p^n(x, y) > 0}.
For any x ∈ E, d_x := gcd(N(x, x)) is called the period of the state x.
– If d_x = d_y for all x, y ∈ E, then d = d_x is called the period of X.
– If d_x = 1 for all x ∈ E, then X is called aperiodic.
296 Periodicity of Markov Chains (4)
• Theorem, (see Klenke, Lemma 18.2):
– For any x ∈ E there exists an n_x ∈ N with p^{n·d_x}(x, x) > 0 for all n ≥ n_x.
297 Periodicity of Markov Chains (5)
• Theorem, (see Klenke, Lemma 18.3): Let X be irreducible. Then the following statements hold:
– d = d_x = d_y for all x, y ∈ E.
– For all x, y ∈ E there exist n_{x,y} ∈ N and L_{x,y} ∈ {0, ..., d − 1} such that nd + L_{x,y} ∈ N(x, y) for all n ≥ n_{x,y}. L_{x,y} is uniquely determined and we have L_{x,y} + L_{y,z} = L_{x,z} (mod d) for all x, y, z ∈ E.
298 Periodicity of Markov Chains (6)
• Theorem, (see Klenke, Theorem 18.4): Let X be irreducible with period d. Then there exists a disjoint decomposition of the state space
E = E_1 ⊎ E_2 ⊎ · · · ⊎ E_d
with the property: p(x, y) > 0 and x ∈ E_i ⇒ y ∈ E_{i+1 (mod d)}.
The decomposition is unique up to cyclic permutations.
299 Convergence Theorem (1)
• Definition, Total variation norm (see Klenke, page 173):
– Consider two probability measures P and Q; then
‖P − Q‖_TV = sup{ ∫ f d(P − Q) : f ∈ L^∞ with ‖f‖_∞ ≤ 1 }
is called the total variation norm.
• It can be shown that if ‖P_n − Q‖_TV → 0, then P_n converges to Q in distribution as n → ∞.
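On a finite state space the total variation norm reduces to ‖µ − ν‖_TV = Σ_x |µ(x) − ν(x)| (with the normalization used here; some authors include a factor 1/2). The convergence ‖µP^n − π‖_TV → 0 for an irreducible, aperiodic chain can then be sketched as follows (the transition matrix is a hypothetical example):

```python
import numpy as np

# hypothetical irreducible, aperiodic chain on three states
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

# invariant distribution via the left eigenvector for eigenvalue 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

def tv(mu, nu):
    # total variation norm on a finite space: sum_x |mu(x) - nu(x)|
    return np.abs(mu - nu).sum()

mu = np.array([1.0, 0.0, 0.0])  # start in state 0
dists = [tv(mu @ np.linalg.matrix_power(P, n), pi) for n in range(50)]

assert dists[-1] < 1e-8               # mu P^n -> pi in total variation
assert all(np.diff(dists) <= 1e-12)   # the map mu -> mu P is a TV contraction
```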
300 Convergence Theorem (2)
• Theorem, Convergence theorem for Markov chains (see Klenke, Theorem 18.18): Let X be an irreducible, positive recurrent Markov chain on E with invariant distribution π. Then the following are equivalent:
– X is aperiodic.
– For every x ∈ E we have ‖L_x(X_n) − π‖_TV → 0 as n → ∞.
– ‖L_x(X_n) − π‖_TV → 0 as n → ∞ for some x ∈ E.
– For every µ ∈ M_1(E) we have ‖µP^n − π‖_TV → 0 as n → ∞.
301 Convergence Theorem (3)
• In Chapter 18 countable E are considered. The results obtained here can be extended to more general E, e.g. E = R^n.
• Convergence theorems can also be obtained for these more general E. See e.g. Robert and Casella (1999), Meyn and Tweedie (2009), or Durrett (2010, Chapter 6.8).
302 Speed of Convergence (1)
• How fast does P_{X_n} converge to π?
• Without going into details, suppose that |E| = N. Consider the eigenvalues of P, sorted according to their modulus, λ_[1] = 1 ≥ |λ_[2]| ≥ · · · ≥ |λ_[N]|. If P is irreducible and aperiodic, then |λ_[2]| < 1 and
‖µP^n − π‖_TV ≤ C |λ_[2]|^n.
• In Examples 18.14 and 18.15 Klenke obtains and investigates the speed of convergence.
303 Speed of Convergence (2)
• The speed of convergence is an important issue when Markov chain Monte Carlo methods are applied. The speed is related to the question of how many draws are necessary to obtain samples from the (approximate) posterior. In most applications the speed of convergence cannot be derived analytically. Therefore, so-called convergence diagnostics are applied in Bayesian statistics and Bayesian econometrics (Gelman and Rubin, 1992; Brooks and Gelman, 1998; Geweke, 1992; Cowles and Carlin, 1996; Chib and Ergashev, 2009).
304 Markov Chains and Linear Algebra (1)
• We now assume that E is finite with |E| = n; then P is an n × n matrix and a probability vector p is an n × 1 column vector.
• The states are (S_1, ..., S_n) = (E_1, ...
, E_n).
305 Markov Chains and Linear Algebra (2)
• Example: Weather model (see Luenberger, 1979, page 225):
– The states (S_1, S_2, S_3) are sunny, cloudy, rainy.
– The stochastic matrix is (rows = today, columns = tomorrow):

          S      C      R
P =  S   0.5    0.5    0
     C   0.5    0.25   0.25
     R   0      0.5    0.5

306 Markov Chains and Linear Algebra (3)
• Example: Estes learning model (see Luenberger, 1979, page 226):
– Two states (S_1, S_2) := (L, N): something has been learned or not. Here it is assumed that nothing is forgotten after it has been learned. The probability to learn is α.
– The stochastic matrix is:

          L      N
P =  L    1      0
     N    α    1 − α

307 Markov Chains and Linear Algebra (4)
• Example, Gambler's Ruin problem (see Luenberger, 1979, Chapters 2 and 8):
– We consider two players: A for the guest, B for the house.
– p is the probability that A wins a coin from player B; q = 1 − p is the probability that B wins one coin from A.
– The initial holdings are a, b ∈ N.
– A player wins overall if he obtains all coins.
– What is the probability that A wins?
308 Markov Chains and Linear Algebra (5)
• Example: Suppose that a = b = 2; then (states = number of coins held by A):

          0    1    2    3    4
P =  0    1    0    0    0    0
     1    q    0    p    0    0
     2    0    q    0    p    0
     3    0    0    q    0    p
     4    0    0    0    0    1

309 Markov Chains and Linear Algebra (6)
• Proposition, (see Luenberger, 1979, page 230):
– Corresponding to a stochastic matrix P, the value λ_0 = 1 = λ_[1] is an eigenvalue. No eigenvalue of P has absolute value greater than 1.
• Definition, Regular chain (see Luenberger, 1979, page 230):
– A Markov chain is called regular if P^m > 0 for some m ∈ N.
310 Markov Chains and Linear Algebra (7)
• Proposition, (see Luenberger, 1979, page 230): Let P be the transition matrix of a regular Markov chain. Then:
– There is a unique probability vector p > 0 such that p^⊤ P = p^⊤.
– For any initial state i (corresponding to an initial probability vector equal to the ith coordinate vector e_i) the limit vector
π^⊤ = lim_{m→∞} e_i^⊤ P^m
exists and is independent of i. Furthermore, π = p.
– lim_{m→∞} P^m = P̄, where P̄ is the n × n matrix each of whose rows is equal to p^⊤.
311 Markov Chains and Linear Algebra (8)
• Example: Weather model
– A chain is regular if P^m > 0 for some m ∈ N.
– For the weather model, P^2 > 0.
– With Matlab (or some other software package) we can derive the left eigenvectors of P. Based on this we obtain the invariant probability vector p.
– In addition we observe that P^m, for m sufficiently large, yields a matrix close to P̄, which is a matrix containing in each row i = 1, ..., n the vector p^⊤.
312 Markov Chains and Linear Algebra (9)
• By ordering the states we can obtain the transition matrix P in blocked form:

P = ( P_r   0_{r×(n−r)} )
    ( R     Q            )

• The r × r matrix P_r collects the closed/absorbing states. R is an (n − r) × r matrix representing the transition probabilities from the transient states to states within the closed class. The (n − r) × (n − r) substochastic matrix Q contains the transition probabilities within the transient states.
• M = (I_{(n−r)} − Q)^{−1} is called the fundamental matrix of the Markov chain.
313 Markov Chains and Linear Algebra (10)
• Proposition, (see Luenberger, 1979, page 240):
– The matrix M = (I_{(n−r)} − Q)^{−1} exists and is positive.
314 Markov Chains and Linear Algebra (11)
• Proposition, (see Luenberger, 1979, page 240):
– The element m_ij of the matrix M of a Markov chain with transient states is equal to the mean number of times the process is in the transient state S_j if it is initiated in a transient state S_i.
315 Markov Chains and Linear Algebra (12)
• Proposition, (see Luenberger, 1979, page 240):
– Let 1_{(n−r)} be a column vector of ones.
In a Markov chain with transient states, the ith component of the vector M 1_{(n−r)} is equal to the mean number of steps before entering a closed class when the process is initiated in transient state S_i.
316 Markov Chains and Linear Algebra (13)
• Proposition, (see Luenberger, 1979, page 241):
– Let b_ij be the probability that, if a Markov chain is started in transient state S_i, it will first enter a closed class by visiting state S_j. Let B be the (n − r) × r matrix with entries b_ij. Then B = MR.
317 Markov Chains and Linear Algebra (14)
• Example: Estes learning model
– With

          L      N
P =  L    1      0
     N    α    1 − α

we already observe the canonical form of P. Here n = 2, r = 1, P_r = 1, R = α and Q = 1 − α.
– Hence, M = (1 − (1 − α))^{−1} = 1/α.
– 1/α is the mean number of steps necessary to enter the closed class ("learned").
– Here, B = MR = (1/α) · α = 1.
318 Markov Chains and Linear Algebra (15)
• Example: Gambler's Ruin with n = 4 coins. P in canonical form (states ordered 0, 4, 1, 2, 3) is given by

          0    4    1    2    3
P =  0    1    0    0    0    0
     4    0    1    0    0    0
     1    q    0    0    p    0
     2    0    0    q    0    p
     3    0    p    0    q    0

• The n coin problem can easily be implemented in Matlab. Luenberger (1979, page 244) presents some analytical results.
319 Markov Chain Monte Carlo (1)
• In the next step we want to sample a random variable Y with distribution π.
• E is a finite set in Chapter 18.3. The following results also hold for more general E.
• U_1, U_2, ... are iid uniform random variables.
• The idea is to construct a Markov chain X whose distribution is (approximately) π.
• The method of producing π-distributed samples is called the Markov chain Monte Carlo method (MCMC).
320 Markov Chain Monte Carlo (2)
• Metropolis algorithm
– Q is the transition matrix of an arbitrary irreducible Markov chain on E. The Metropolis matrix is

p(x, y) = q(x, y) · min{1, (π(y)q(y, x)) / (π(x)q(x, y))}   if x ≠ y and q(x, y) > 0,
p(x, y) = 0                                                 if x ≠ y and q(x, y) = 0,
p(x, y) = 1 − Σ_{z≠x} p(x, z)                               if x = y.
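The acceptance step above can be sketched for a small finite E; the target π and the symmetric proposal below are hypothetical choices (with q symmetric, the acceptance probability reduces to min{1, π(y)/π(x)}):

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical target distribution on E = {0, 1, 2, 3}
pi = np.array([0.1, 0.2, 0.3, 0.4])

def propose(x, rng):
    # symmetric proposal: uniform on E \ {x}
    y = rng.integers(0, 3)
    return y if y < x else y + 1

def metropolis_step(x, rng):
    y = propose(x, rng)
    # accept with probability min(1, pi(y) q(y,x) / (pi(x) q(x,y)));
    # q is symmetric here, so the ratio is pi(y) / pi(x)
    if rng.random() < min(1.0, pi[y] / pi[x]):
        return y
    return x

# run the chain and compare empirical frequencies with pi
x, counts = 0, np.zeros(4)
for n in range(200_000):
    x = metropolis_step(x, rng)
    counts[x] += 1
freq = counts / counts.sum()
assert np.abs(freq - pi).max() < 0.02
```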
– Note that P is reversible, i.e. for all $x, y \in E$ we have $\pi(x)\, p(x,y) = \pi(y)\, p(y,x)$, and π is invariant. 321
Markov Chain Monte Carlo (3) Applied Probability
• Proposition, (see Klenke, Theorem 18.20):
– If Q is irreducible, then the Metropolis matrix P of Q is irreducible with unique invariant measure π. If, in addition, Q is aperiodic or if π is not the uniform distribution on E, then P is aperiodic. 322
Markov Chain Monte Carlo (4) Applied Probability
• Metropolis algorithm
Now we simulate a chain X with distribution converging to π as follows:
– We can draw from the reference chain/proposal distribution q.
– Suppose that a transition from the present state x to state y is proposed. Then we accept this proposal with probability
$$\min\left(1, \frac{\pi(y)\, q(y,x)}{\pi(x)\, q(x,y)}\right).$$ 323
Markov Chain Monte Carlo (5) Applied Probability
• Example: Ising model
– A model of ferromagnetism in crystals.
– Atoms are placed at the sites of a lattice $\Lambda = \{0, \dots, N-1\}^2$.
– Each atom $i \in \Lambda$ has a magnetic spin $x(i) \in \{-1, 1\}$ that either points upwards or downwards.
– Neighboring atoms interact.
– Due to thermic fluctuations the state of the system is random and distributed according to the Boltzmann distribution π on the state space $E = \{-1, 1\}^\Lambda$. The inverse temperature $\beta = 1/T \geq 0$ is a parameter of this distribution. 324
Markov Chain Monte Carlo (6) Applied Probability
• Example: Ising model
– The local energy level of a single atom $i \in \Lambda$ is described by
$$H^i(x) = \frac{1}{2} \sum_{j \in \Lambda:\, j \sim i} \mathbf 1_{\{x(i) \neq x(j)\}},$$
where $i \sim j$ indicates that i and j are neighbors in Λ (coordinate-wise mod N, periodic boundary conditions).
– Total energy (Hamiltonian function)
$$H(x) = \sum_{i \in \Lambda} H^i(x) = \sum_{j \sim i} \mathbf 1_{\{x(i) \neq x(j)\}}.$$ 325
Markov Chain Monte Carlo (7) Applied Probability
• Example: Ising model
– The Boltzmann distribution π is given by
$$\pi(x) = \frac{\exp(-\beta H(x))}{\sum_{x' \in E} \exp(-\beta H(x'))}.$$
Due to the normalizing term we get a probability measure. 326
Markov Chain Monte Carlo (8) Applied Probability
• Example: Ising model
– Consider $x \in E$.
Denote by $x^{i,\sigma}$ the state in which the spin at site i is set to $\sigma \in \{-1, 1\}$, i.e. $x^{i,\sigma}(j) = \sigma$ if $i = j$ and $x^{i,\sigma}(j) = x(j)$ if $i \neq j$.
– $x^i$ is the state where the spin at site i is reversed, i.e. $x^i = x^{i, -x(i)}$.
– We want to simulate the configuration x. 327
Markov Chain Monte Carlo (9) Applied Probability
• Example: Ising model
– Use a reference chain q(x, y), e.g. $q(x,y) = \frac{1}{\#\Lambda}$ if $y = x^i$ for some $i \in \Lambda$ and zero else. I.e. we choose a site $i \in \Lambda$ (uniformly on Λ) and invert the spin at that site. Q is irreducible.
– The Metropolis algorithm accepts the proposal of the reference chain with probability 1 if $\pi(x^i) \geq \pi(x)$. Otherwise the proposal is accepted with probability $\pi(x^i)/\pi(x)$. 328
Markov Chain Monte Carlo (10) Applied Probability
• Example: Ising model
– Note that
$$H(x^i) - H(x) = \sum_{j:\, j \sim i} \mathbf 1_{\{x(j) \neq -x(i)\}} - \sum_{j:\, j \sim i} \mathbf 1_{\{x(j) \neq x(i)\}} = -2 \sum_{j:\, j \sim i} \left( \mathbf 1_{\{x(j) \neq x(i)\}} - \tfrac{1}{2} \right).$$
– This yields $\log(\pi(x^i)/\pi(x)) = 2\beta \sum_{j \sim i} \left( \mathbf 1_{\{x(j) \neq x(i)\}} - \tfrac{1}{2} \right)$. This expression only depends on the 2d neighbors of i, in our case d = 2. I.e. the normalizing constant need not be calculated. 329
Markov Chain Monte Carlo (11) Applied Probability
• Example: Ising model
– By the above we obtain the Metropolis transition matrix
$$p(x,y) = \begin{cases} \frac{1}{\#\Lambda}\, \min\left(1, \exp\left(2\beta \sum_{j \sim i} \left( \mathbf 1_{\{x(j) \neq x(i)\}} - \tfrac{1}{2} \right)\right)\right) & \text{if } y = x^i, \\ 1 - \sum_{i \in \Lambda} p(x, x^i) & \text{if } x = y, \\ 0 & \text{else.} \end{cases}$$ 330
Markov Chain Monte Carlo (12) Applied Probability
• Example: Ising model
– This can be implemented as follows: Draw $I_1, I_2, \dots \sim \mathcal U_\Lambda$ and $U_1, U_2, \dots$ iid uniform on [0, 1]. Then
$$F_n(x) = \begin{cases} x^{I_n} & \text{if } \log U_n \leq 2\beta \sum_{j \sim I_n} \left( \mathbf 1_{\{x(j) \neq x(I_n)\}} - \tfrac{1}{2} \right), \\ x & \text{else.} \end{cases}$$
– The chain $(X_n)_{n \in \mathbb N}$ is obtained via $X_n = F_n(X_{n-1})$ for every $n \in \mathbb N$. 331
Markov Chain Monte Carlo (13) Applied Probability
• An alternative to the Metropolis algorithm is the Gibbs sampler.
• If x is a state and $i \in \Lambda$, then define
$$x_{-i} := \{y \in E : y(j) = x(j) \text{ for } j \neq i\}.$$ 332
Markov Chain Monte Carlo (14) Applied Probability
• Definition, Gibbs sampler (see Robert and Casella (1999)[Chapter 7]): Suppose that for some $k > 1$ the random variable Y can be written as Y = (Y_1, . . .
, Y_k), where $Y_i$ is either uni- or multidimensional. Suppose that we can simulate from the corresponding densities $f_1, \dots, f_k$, that is
$$Y_i \mid y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_k \sim f_i(y_i \mid y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_k)$$
for all $i = 1, \dots, k$. The associated Gibbs sampling algorithm for a transition from $Y_m$ to $Y_{m+1}$ is
Step 1: $Y_{1,m+1} \sim f_1(y_1 \mid y_{2,m}, \dots, y_{k,m})$
Step 2: $Y_{2,m+1} \sim f_2(y_2 \mid y_{1,m+1}, y_{3,m}, \dots, y_{k,m})$
...
Step k: $Y_{k,m+1} \sim f_k(y_k \mid y_{1,m+1}, y_{2,m+1}, \dots, y_{k-1,m+1})$ 333
Markov Chain Monte Carlo (15) Applied Probability
• Our goal is to sample X.
• Definition, Completion, (see Robert and Casella, 1999, Definition 7.1.4): Given a probability density f, a density g that satisfies
$$\int g(x, z)\, dz = f(x)$$
is called a completion of f. 334
Markov Chain Monte Carlo (16) Applied Probability
• Observe that for each fixed i, the Gibbs sampler corresponds to a Metropolis-Hastings sampler with proposal density
$$q(y, y') = \delta_{(y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_k)}(y'_1, \dots, y'_{i-1}, y'_{i+1}, \dots, y'_k) \times f_i(y'_i \mid y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_k).$$
Hence the Metropolis-Hastings ratio is
$$\frac{g(y')\, q(y', y)}{g(y)\, q(y, y')} = \frac{g(y')}{g(y)} \times \frac{f_i(y_i \mid y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_k)}{f_i(y'_i \mid y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_k)} = 1.$$
• The Gibbs sampler is equivalent to a Metropolis-Hastings algorithm with acceptance probability one. For more details see Robert and Casella (1999). 335
Markov Chain Monte Carlo (17) Applied Probability
• Example: Ising model
– Here we have $x_{-i} = \{x^{i,-1}, x^{i,+1}\}$. For $i \in \Lambda$ and $\sigma \in \{-1, 1\}$
$$\pi(x^{i,\sigma} \mid x_{-i}) = \frac{\pi(x^{i,\sigma})}{\pi(\{x^{i,-1}, x^{i,+1}\})} = \frac{\exp(-\beta H(x^{i,\sigma}))}{\exp(-\beta H(x^{i,-1})) + \exp(-\beta H(x^{i,+1}))} = \left( 1 + \exp\left( \beta \left( H(x^{i,\sigma}) - H(x^{i,-\sigma}) \right) \right) \right)^{-1} = \left( 1 + \exp\left( 2\beta \sum_{j:\, j \sim i} \left( \mathbf 1_{\{x(j) \neq \sigma\}} - \tfrac{1}{2} \right) \right) \right)^{-1}.$$
336
Markov Chain Monte Carlo (18) Applied Probability
• Example: Ising model
– For the Ising model we get a Markov chain $(X_n)_{n \in \mathbb N_0}$ with values in $E = \{-1, 1\}^\Lambda$ and transition matrix
$$p(x,y) = \begin{cases} \frac{1}{\#\Lambda} \left( 1 + \exp\left( 2\beta \sum_{j:\, j \sim i} \left( \mathbf 1_{\{x(j) \neq \sigma\}} - \tfrac{1}{2} \right) \right) \right)^{-1} & \text{if } y = x^{i,\sigma} \text{ for some } i \in \Lambda,\ \sigma \in \{-1, 1\}, \\ 0 & \text{otherwise.} \end{cases}$$ 337
Markov Chain Monte Carlo (19) Applied Probability
• Until now a convergent Markov chain has been constructed to simulate from a distribution π.
• The distribution we want to simulate from can also be the distribution of some parameters $\theta \in \Theta$, where Θ stands for some parameter space.
• This fact is extensively used in Bayesian econometrics and statistics, where MCMC methods are used to simulate the posterior distribution of the unknown parameter θ.
• Since the normalizing constant need not be derived to implement the Metropolis-Hastings algorithm or the Gibbs sampler, these methods can be implemented in a straightforward way. 338
Markov Chain Monte Carlo (20) Applied Probability
• Example: Linear Regression model
– Consider the linear stochastic model $y_n = \beta^\top x_n + \varepsilon_n$, where $y_n \in \mathbb R$, $x_n \in \mathbb R^k$, $\beta \in \mathbb R^k$, $x_{n,1} = 1$ for all $n \in \mathbb N$, and $\varepsilon_n$ is iid normal with mean zero and variance $\sigma^2$.
– Then $\Theta = \mathbb R^k \times \mathbb R_+$.
– Suppose that N observations are available. The distribution of $Y^N \mid X^N$ is described by
$$\pi(Y_1, \dots, Y_N \mid X^N, \theta) = \prod_{n=1}^N \pi(Y_n \mid X_n, \theta) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^N \exp\left( -\frac{1}{2\sigma^2} \sum_{n=1}^N (Y_n - \beta^\top X_n)^2 \right).$$
$\pi(\cdot \mid X^N, \theta)$ evaluated at the data $y^N, x^N$ is called the likelihood. 339
Markov Chain Monte Carlo (21) Applied Probability
• Example: Linear Regression model
– In a Bayesian analysis priors on θ have to be assumed. For example, we choose the so-called natural conjugate priors. For our model this is a normal distribution for β and a gamma distribution for $1/\sigma^2$ or an inverse gamma distribution for $\sigma^2$. In particular
$$\pi(\theta) = \pi(\beta, 1/\sigma^2) = \pi_N(\beta \mid b_0, \tilde B_0 \sigma^2)\, \pi_\Gamma(1/\sigma^2 \mid v_0, S_0)$$
or
$$\pi(\theta) = \pi(\beta, \sigma^2) = \pi_N(\beta \mid b_0, \tilde B_0 \sigma^2)\, \pi_{\Gamma^{-1}}(\sigma^2 \mid v_0, S_0).$$
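The link between a gamma prior on $1/\sigma^2$ and an inverse gamma prior on $\sigma^2$ can be checked by simulation. A minimal sketch (not part of the slides), assuming illustrative parameters a = 5, b = 2: if Y is gamma distributed with shape a and rate b, then 1/Y is inverse-gamma distributed with mean b/(a − 1).

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = 5.0, 2.0                                        # illustrative shape and rate
# numpy's gamma uses a scale parameter, i.e. scale = 1/rate
y = rng.gamma(shape=a, scale=1.0 / b, size=1_000_000)
inv_y = 1.0 / y                                        # inverse gamma draws

# E(1/Y) = b/(a-1) = 0.5 for these parameters
assert abs(inv_y.mean() - b / (a - 1)) < 0.01
```

Note the parameterization: statistics texts often state the gamma density with a rate parameter b, while numpy expects the scale 1/b.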
340
Markov Chain Monte Carlo (22) Applied Probability
• Example: Linear Regression model
– The inverse gamma density is
$$\pi_{\Gamma^{-1}}(\sigma^2 \mid v_0, S_0) = \frac{S_0^{v_0}}{\Gamma(v_0)} \left( \frac{1}{\sigma^2} \right)^{v_0 + 1} \exp\left( -\frac{S_0}{\sigma^2} \right).$$
– If y is gamma distributed (with density $f(y; a, b) = \frac{b^a}{\Gamma(a)}\, y^{a-1} \exp(-by)$, $\mathbb E(y) = a/b$, $\mathbb V(y) = a/b^2$), then 1/y follows an inverse gamma distribution (with density $f(y; a, b) = \frac{b^a}{\Gamma(a)} (1/y)^{a+1} \exp(-b/y)$, $\mathbb E(y) = b/(a-1)$, $\mathbb V(y) = b^2/((a-1)^2 (a-2))$) (see e.g. Frühwirth-Schnatter, 2006, p. 434).
– By the Bayes theorem:
$$\pi(\theta \mid y^N, x^N) \propto \pi(y_1, \dots, y_N \mid x^N, \theta)\, \pi(\theta), \quad \text{where } \pi(\theta) = \pi_N(\beta \mid b_0, \tilde B_0 \sigma^2)\, \pi_{\Gamma^{-1}}(\sigma^2 \mid v_0, S_0).$$ 341
Markov Chain Monte Carlo (23) Applied Probability
• Example: Linear Regression model
– By some algebra it can be shown that (see e.g. Cameron and Trivedi, 2005; Frühwirth-Schnatter, 2006)
$$\pi(\theta \mid y^N, x^N) \propto \left( \frac{1}{\sigma^2} \right)^{v_N + 1} \exp\left( -\frac{s_N}{\sigma^2} \right) \left( \frac{1}{\sigma^2} \right)^{k/2} \exp\left( -\frac{1}{2\sigma^2} (\beta - \beta_N)^\top \tilde B_N^{-1} (\beta - \beta_N) \right),$$
where
$$\beta_N = \tilde B_N \left( \tilde B_0^{-1} b_0 + X^\top Y \right) = \tilde B_N \left( \tilde B_0^{-1} b_0 + X^\top X \hat\beta_{OLS} \right), \quad B_N = \sigma^2 \tilde B_N, \quad \hat\beta_{OLS} = (X^\top X)^{-1} X^\top Y,$$
X is an $N \times k$ matrix and Y is of dimension $N \times 1$. Furthermore $\tilde B_N = (\tilde B_0^{-1} + X^\top X)^{-1}$, $v_N = v_0 + \frac{N}{2}$ and $s_N = s_0 + \frac{1}{2} \left( Y^\top Y + b_0^\top \tilde B_0^{-1} b_0 - \beta_N^\top \tilde B_N^{-1} \beta_N \right)$.
– The posterior $\pi(\beta, \sigma^2 \mid y^N, x^N)$ has normal-gamma form. It factorizes into a normal density with parameters $\beta_N$ and $B_N$ and an (inverse) gamma density with shape parameter $v_N$ and scale parameter $s_N$. 342
Markov Chain Monte Carlo (24) Applied Probability
• Example: Linear Regression model
– Suppose that $\sigma^2$ is fixed, and the normal prior $\pi_N(\beta \mid b_0, B_0)$ is used for β. Then $\pi(\beta \mid y^N, x^N, \sigma^2)$ is a normal density with mean $\beta_N = B_N \left( B_0^{-1} b_0 + \frac{1}{\sigma^2} X^\top Y \right)$ and variance parameter $B_N = \left( B_0^{-1} + \frac{1}{\sigma^2} X^\top X \right)^{-1}$.
– When β is fixed and an inverse gamma prior is applied to $\sigma^2$, then $\pi(\sigma^2 \mid y^N, x^N, \beta)$ follows an inverse gamma distribution with shape parameter $v_N = v_0 + \frac{N}{2}$ and scale parameter $s_N = s_0 + \frac{1}{2} \sum_{n=1}^N (y_n - \beta^\top x_n)^2$.
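The two conditional posteriors just stated can be sketched as a two-block Gibbs sampler. The data are simulated and the prior hyperparameters b0, B0, v0, s0 are illustrative choices, not values from the slides; Python with numpy is used here as an alternative to Matlab.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 500, 2
beta_true, sigma2_true = np.array([1.0, -0.5]), 0.25
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # x_{n,1} = 1 for all n
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=N)

b0, B0 = np.zeros(k), 10.0 * np.eye(k)                  # normal prior on beta
v0, s0 = 2.0, 0.5                                       # inverse gamma prior on sigma^2
B0_inv, XtX, Xty = np.linalg.inv(B0), X.T @ X, X.T @ y

M, burn = 3000, 500
sigma2, draws = 1.0, []
for m in range(M):
    # Step 1: beta | y, X, sigma^2 ~ N(beta_N, B_N)
    BN = np.linalg.inv(B0_inv + XtX / sigma2)
    betaN = BN @ (B0_inv @ b0 + Xty / sigma2)
    beta = rng.multivariate_normal(betaN, BN)
    # Step 2: sigma^2 | y, X, beta ~ InvGamma(v_N, s_N), drawn as 1/Gamma
    resid = y - X @ beta
    vN, sN = v0 + N / 2.0, s0 + 0.5 * resid @ resid
    sigma2 = 1.0 / rng.gamma(shape=vN, scale=1.0 / sN)
    if m >= burn:
        draws.append(np.concatenate([beta, [sigma2]]))

post_mean = np.mean(draws, axis=0)   # posterior means of (beta_1, beta_2, sigma^2)
```

With this much data the posterior means should sit close to the values used to simulate the data, which gives a basic correctness check of the sampler.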
343
Markov Chain Monte Carlo (25) Applied Probability
• Example: Linear Regression model
– We observe that with $\sigma^2$ fixed the conditional distribution of β is a normal distribution with mean parameter $\beta_N$ and variance $B_N$.
– With β fixed, we observe that $\sigma^2$ follows an inverse gamma distribution with parameters $v_N$ and $s_N$. 344
Markov Chain Monte Carlo (26) Applied Probability
• Example: Linear Regression model
– We construct a chain $(Z_m)_{m \in \mathbb N}$, where $Z_m = (\beta_m, \sigma^2_m)$. $Z_m$ is called the mth draw of the sampler. When one component is updated, we condition on the most recent draws of the other components.
– This fact can be used in a computationally efficient way. By drawing β from $\pi(\beta \mid y^N, x^N, \sigma^2_{m-1})$ we get $(\beta_m, \sigma^2_{m-1})$. Then we draw $\sigma^2_m$ from an inverse gamma distribution with parameters $v_N$ and $s_N$ (evaluated at $\beta_m$). 345
Markov Chain Monte Carlo (27) Applied Probability
• Example: Linear Regression model
– Hence we get the chain $(Z_m)$ as follows. Given $Z_{m-1}$ draw
Step 1: $\beta_m$ from $\pi_N(\beta \mid y^N, x^N, \sigma^2_{m-1})$
Step 2: $\sigma^2_m$ from $\pi_{\Gamma^{-1}}(\sigma^2 \mid y^N, x^N, \beta_m)$
– The convergence period of this chain is called the burn-in phase; after the burn-in phase we obtain (approximate) draws from the posterior. 346
Markov Chain Monte Carlo (28) Applied Probability
• Alternatively, the Metropolis-Hastings algorithm can be used.
– Here q(x, y) is a proposal density. A proposed y is accepted with probability
$$\rho(x, y) = \min\left( 1, \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} \right).$$
– To implement the Metropolis-Hastings algorithm we simply have to calculate $\min\left( 1, \frac{\pi(y) q(y,x)}{\pi(x) q(x,y)} \right)$. The new y is accepted if some iid uniform variable $U_m$ is smaller than this term. 347
Markov Chain Monte Carlo (29) Applied Probability
• Example: Linear Regression model
– Metropolis-Hastings update of β. Given $X_{m-1} = (\beta_{m-1}, \sigma^2_{m-1})$ draw $\beta^{new}$ from some proposal density $q(\beta_{m-1}, \cdot)$. Then
$$\rho(\beta_{m-1}, \beta^{new}) = \min\left( 1, \frac{\pi(y^N \mid x^N, \beta^{new}, \sigma^2_{m-1})\, \pi(\beta^{new} \mid \sigma^2_{m-1})\, q(\beta^{new}, \beta_{m-1})}{\pi(y^N \mid x^N, \beta_{m-1}, \sigma^2_{m-1})\, \pi(\beta_{m-1} \mid \sigma^2_{m-1})\, q(\beta_{m-1}, \beta^{new})} \right).$$
– Accept $\beta^{new}$ if $U_m \leq \rho(\beta_{m-1}, \beta^{new})$.
– Note that also here the normalizing constant need not be calculated. 348
Markov Chain Monte Carlo (30) Applied Probability
• Example: Linear Regression model
– Metropolis-Hastings update of $\sigma^2$. Given $(\beta_m, \sigma^2_{m-1})$ draw $\sigma^{2,new}$ from some proposal density $q(\sigma^2_{m-1}, \cdot)$. Then
$$\rho(\sigma^2_{m-1}, \sigma^{2,new}) = \min\left( 1, \frac{\pi(y^N \mid x^N, \beta_m, \sigma^{2,new})\, \pi(\beta_m \mid \sigma^{2,new})\, \pi(\sigma^{2,new})\, q(\sigma^{2,new}, \sigma^2_{m-1})}{\pi(y^N \mid x^N, \beta_m, \sigma^2_{m-1})\, \pi(\beta_m \mid \sigma^2_{m-1})\, \pi(\sigma^2_{m-1})\, q(\sigma^2_{m-1}, \sigma^{2,new})} \right).$$
– Accept $\sigma^{2,new}$ if $U_m \leq \rho(\sigma^2_{m-1}, \sigma^{2,new})$.
– Note that conjugate priors are not necessary with the Metropolis-Hastings algorithm. 349
Markov Chain Monte Carlo (31) Applied Probability
• Remark:
– In Klenke, chapters 17 and 18, we investigated a countable state space. The Gibbs sampler and the Metropolis-Hastings algorithm also work on more general state spaces. See Robert and Casella (1999) and Meyn and Tweedie (2009).
– For the Metropolis-Hastings algorithm see Tierney (1998).
– Bayesian methods can also be applied in a lot of models where the likelihood is not available in closed form, e.g. latent variable models, hierarchical models, mixture models (see e.g. Frühwirth-Schnatter, 2006). 350
Outline - Brownian Motion Applied Probability
• Continuous versions and Hölder continuity.
• Definitions and properties.
• Convergence of probability measures.
• Donsker's Theorem.
• Klenke, Chapter 21. 351
Continuous Versions (1) Applied Probability
• Independent normally distributed increments (see Klenke, Example 14.45):
– $I = [0, \infty)$ and $\Omega_i = \mathbb R$, $i \in [0, \infty)$, $\mathcal B = \mathcal B(\mathbb R)$. $\Omega = \mathbb R^{[0,\infty)}$, $\mathcal A = \mathcal B^{\otimes [0,\infty)}$, and let $X_t$ be the coordinate map for $t \in [0, \infty)$. Then $X = (X_t)_{t \geq 0}$ is the canonical process on $(\Omega, \mathcal A)$.
– Construct a probability measure P on $(\Omega, \mathcal A)$ such that X has independent, stationary and normally distributed increments: $(X_{t_i} - X_{t_{i-1}})_{i=1,\dots,n}$ are independent for all $0 = t_0 < t_1 < \dots < t_n$, and $P_{X_t - X_s} = N_{(0, t-s)}$ for all $t > s$.
352
Continuous Versions (2) Applied Probability
• Independent normally distributed increments (see Klenke, Example 14.45):
– Define stochastic kernels $\kappa_t(x, dy) := \delta_x * N_{(0,t)}(dy)$ for $t \in [0, \infty)$, where $N_{(0,0)} = \delta_0$. Here the Chapman-Kolmogorov equation holds:
$$\kappa_s \cdot \kappa_t(x, dy) = \delta_x * \left( N_{(0,s)} * N_{(0,t)} \right)(dy) = \delta_x * N_{(0,s+t)}(dy) = \kappa_{s+t}(x, dy).$$
– For more details on probability measures on product spaces see e.g. Klenke (2008, Chapter 14). 353
Continuous Versions (3) Applied Probability
• Independent normally distributed increments (see Klenke, Example 14.45):
– P is the unique probability measure on Ω according to Corollary 14.44.
– With X we have almost constructed a Brownian motion; what is missing is to investigate whether the paths (i.e. the maps $t \mapsto X_t$) are almost surely continuous. A priori the paths of a canonical process need not be continuous, since every map $[0, \infty) \to \mathbb R$ is possible. The next step is to show that the set of non-continuous paths is P-almost surely negligible (by passing to a suitable modification). 354
Continuous Versions (4) Applied Probability
• Definition, (see Klenke, Definition 21.1): Let X and Y be stochastic processes on $(\Omega, \mathcal A, P)$ with time set I and state space E. X and Y are called
– modifications or versions of each other if, for any $t \in I$, we have $X_t = Y_t$ almost surely;
– indistinguishable if there exists an $N \in \mathcal A$ with $P(N) = 0$ such that $\{X_t \neq Y_t\} \subset N$ for all $t \in I$.
• Indistinguishable processes are modifications. 355
Continuous Versions (5) Applied Probability
• Definition, Hölder continuity (see Klenke, Definition 21.2):
– Let (E, d) and (E′, d′) be metric spaces and $\gamma \in (0, 1]$. A map $\varphi: E \to E'$ is called Hölder continuous of order γ at the point $r \in E$ if there exist $\varepsilon > 0$ and $C < \infty$ such that for any $s \in E$ with $d(s, r) < \varepsilon$ we have $d'(\varphi(r), \varphi(s)) \leq C\, d(r, s)^\gamma$.
– φ is called locally Hölder continuous of order γ if, for every $t \in E$, there exist $\varepsilon > 0$ and $C(t, \varepsilon) > 0$ such that for all $s, r \in E$ with $d(s, r) < \varepsilon$ and $d(r, t) < \varepsilon$, the above inequality holds.
– Finally, φ is called Hölder continuous of order γ if there exists a C such that the above inequality holds for all $s, r \in E$. 356
Continuous Versions (6) Applied Probability
• Remarks: Hölder continuity
– If γ = 1, Hölder continuity is Lipschitz continuity.
– If $E = \mathbb R$ and γ > 1, every locally Hölder continuous function is constant.
– If φ is Hölder-γ-continuous at a given point t, there need not exist an open neighborhood in which φ is continuous; i.e. φ need not be locally Hölder-γ-continuous. 357
Continuous Versions (7) Applied Probability
• Theorem, Hölder continuity - properties (see Klenke, Lemma 21.3): Let $I \subset \mathbb R$ and let $f: I \to \mathbb R$ be locally Hölder continuous of order $\gamma \in (0, 1]$. Then the following statements hold:
– f is locally Hölder continuous of order γ′ for every $\gamma' \in (0, \gamma)$.
– If I is compact, then f is Hölder continuous.
– Let I be a bounded interval of length T > 0. Assume that there exist an ε > 0 and a $C(\varepsilon) < \infty$ such that for all $s, t \in I$ with $|t - s| \leq \varepsilon$ we have $|f(t) - f(s)| \leq C(\varepsilon)\, |t - s|^\gamma$. Then f is Hölder continuous of order γ with constant $C = C(\varepsilon)\, [T/\varepsilon]^{1-\gamma}$. 358
Continuous Versions (8) Applied Probability
• Definition, Path properties (see Klenke, Definition 21.4):
– Let $I \subset \mathbb R$ and let $X = (X_t, t \in I)$ be a stochastic process on some probability space $(\Omega, \mathcal A, P)$ with values in a metric space (E, d). For every $\omega \in \Omega$ we say that the map $I \to E$, $t \mapsto X_t(\omega)$, is a path of X.
– We say that X has almost surely continuous paths, or briefly that X is a.s. continuous, if for almost all $\omega \in \Omega$ the path $t \mapsto X_t(\omega)$ is continuous.
– Similarly, we define locally Hölder-γ-continuous paths, etc. 359
Continuous Versions (9) Applied Probability
• Theorem, (see Klenke, Lemma 21.5): Let X and Y be modifications of each other. Assume that one of the following properties holds:
– I is countable.
– $I \subset \mathbb R$ is a (possibly unbounded) interval and X and Y are almost surely right continuous.
• Then X and Y are indistinguishable. 360
Continuous Versions (10) Applied Probability
• Theorem, Kolmogorov-Chentsov (see Klenke, Theorem 21.6): Let $X = (X_t: t \in [0, \infty))$ be a real valued process. Assume for every T > 0 there are numbers α, β, C > 0 such that
$$\mathbb E\left( |X_t - X_s|^\alpha \right) \leq C\, |t - s|^{1+\beta} \quad \text{for all } s, t \in [0, T].$$
Then the following statements hold:
– There is a modification $\tilde X = (\tilde X_t, t \in [0, \infty))$ of X whose paths are locally Hölder-continuous of every order $\gamma \in \left(0, \frac{\beta}{\alpha}\right)$.
– Let $\gamma \in \left(0, \frac{\beta}{\alpha}\right)$. For every ε > 0 and T < ∞ there exists a number K < ∞ that depends only on ε, T, α, β, C, γ such that
$$P\left( |\tilde X_t - \tilde X_s| \leq K\, |t - s|^\gamma \text{ for all } s, t \in [0, T] \right) \geq 1 - \varepsilon.$$ 361
Continuous Versions (11) Applied Probability
• Remark: Kolmogorov-Chentsov theorem:
– The result of this theorem holds in a Polish space (E, ϱ). The proof does not rely on the assumption that the range is in R.
– If we change the time set, then the assumptions have to be strengthened. E.g. if $(X_t)_{t \in \mathbb R^d}$ takes values in E, we assume
$$\mathbb E\left( \varrho(X_t, X_s)^\alpha \right) \leq C\, \|t - s\|_2^{d+\beta} \quad \text{for all } s, t \in [-T, T]^d.$$
Then for all $\gamma \in \left(0, \frac{\beta}{\alpha}\right)$ there is a locally Hölder-γ-continuous version of X. 362
Construction and Path Properties (1) Applied Probability
• Definition, Brownian Motion (see Klenke, Definition 21.8): A real valued stochastic process $B = (B_t: t \in [0, \infty))$ is called a (standard) Brownian motion if
– $B_0 = 0$,
– B has independent, stationary increments,
– $B_t \sim N_{0,t}$ (normal with mean zero and variance t) for all t > 0, and
– $t \mapsto B_t$ is P-almost surely continuous. 363
Construction and Path Properties (2) Applied Probability
• Theorem, Existence of Brownian Motion (see Klenke, Theorem 21.9): There exist a probability space $(\Omega, \mathcal A, P)$ and a Brownian motion B on $(\Omega, \mathcal A, P)$. The paths are almost surely Hölder-γ-continuous for any $\gamma \in (0, \frac{1}{2})$.
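On a finite grid, the defining properties above translate directly into a simulation: increments over steps of length dt are iid $N(0, dt)$, so cumulative sums of such increments give the values of B at the grid points. A sketch (not from the slides; grid size and path count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
T, n_steps, n_paths = 1.0, 1000, 20000
dt = T / n_steps
# iid N(0, dt) increments, one row per simulated path
increments = rng.normal(scale=np.sqrt(dt), size=(n_paths, n_steps))
B = np.cumsum(increments, axis=1)          # B_{dt}, B_{2dt}, ..., B_T

# B_T should be N(0, T): mean ~ 0, variance ~ T
assert abs(B[:, -1].mean()) < 0.03
assert abs(B[:, -1].var() - T) < 0.05
# increments over disjoint intervals should be (empirically) uncorrelated
inc1, inc2 = B[:, 499] - B[:, 0], B[:, -1] - B[:, 499]
assert abs(np.corrcoef(inc1, inc2)[0, 1]) < 0.03
```

This produces only the grid values; the continuous path of the theorem corresponds to the limit of finer and finer grids (or to linear interpolation between grid points).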
364
Construction and Path Properties (3) Applied Probability
• Remark, Gaussian process
– $(X_t)_{t \in I}$ is called a Gaussian process if for every $n \in \mathbb N$ and $t_1, \dots, t_n \in I$ the vector $(X_{t_1}, \dots, X_{t_n})$ is n-dimensional normally distributed.
– X is called centered if $\mathbb E(X_t) = 0$ for every $t \in I$.
– The map $\Gamma(s,t) = \operatorname{Cov}(X_s, X_t)$ for $s, t \in I$ is called the covariance function. 365
Construction and Path Properties (4) Applied Probability
• Theorem, (see Klenke, Theorem 21.11): Let $X = (X_t)_{t \in [0,\infty)}$ be a stochastic process. Then the following are equivalent:
– X is a Brownian motion.
– X is a continuous centered Gaussian process with $\operatorname{Cov}(X_s, X_t) = s \wedge t$ for all $s, t \geq 0$.
• Theorem, Scaling property of Brownian motion (see Klenke, Corollary 21.12):
– If B is a Brownian motion and if $K \neq 0$, then $(K^{-1} B_{K^2 t})_{t \geq 0}$ is also a Brownian motion. 366
Construction and Path Properties (5) Applied Probability
• Definition, Brownian Bridge (see Klenke, Example 21.13):
– A process $X = (X_t: t \in [0, 1])$ with $X_t := B_t - t B_1$ is called a Brownian bridge.
• The covariance function of the Brownian bridge is $\Gamma(s,t) = s \wedge t - st$. To see this calculate $\mathbb E(X_s X_t) = \mathbb E((B_s - s B_1)(B_t - t B_1)) = \dots = s \wedge t - st$. 367
Construction and Path Properties (6) Applied Probability
• Theorem, Time inversion (see Klenke, Theorem 21.11):
– Let $(B_t)_{t \geq 0}$ be a Brownian motion and define
$$X_t = \begin{cases} t\, B_{1/t} & \text{if } t > 0, \\ 0 & \text{if } t = 0. \end{cases}$$
Then X is a Brownian motion.
• A Brownian motion $(W_t)_{t \geq 0}$ started at zero and with $\mathbb E(W_t^2) = t$ is often called a standard Brownian motion. 368
Construction and Path Properties (7) Applied Probability
• Theorem, Blumenthal's 0-1 law (see Klenke, Theorem 21.15):
– Let $(B_t)_{t \geq 0}$ be a Brownian motion and let $\mathbb F = (\mathcal F_t)_{t \geq 0} = \sigma(B)$ be the filtration generated by B.
– Further, let $\mathcal F_{0+} = \bigcap_{t > 0} \mathcal F_t$.
– Then $\mathcal F_{0+}$ is P-trivial.
• On P-trivial σ-algebras see e.g. Klenke (2008)[Chapter 2.3].
369
Construction and Path Properties (8) Applied Probability
• Theorem, Paley-Wiener-Zygmund (see Klenke, Theorem 21.17):
– For every $\gamma > \frac{1}{2}$, almost surely, the paths of Brownian motion $(B_t)_{t \geq 0}$ are not Hölder continuous of order γ at any point.
– In particular, the paths are almost surely nowhere differentiable. 370
Strong Markov Property (1) Applied Probability
• $P_x$ is the probability measure such that $B = (B_t)_{t \geq 0}$ is a Brownian motion started at $x \in \mathbb R$, i.e. under $P_x$ the process $(B_t - x)_{t \geq 0}$ is a standard Brownian motion. The simple Markov property follows directly from the construction of the process.
• Theorem, Strong Markov property (see Klenke, Theorem 21.18):
– Brownian motion B with distributions $(P_x)_{x \in \mathbb R}$ has the strong Markov property. 371
Strong Markov Property (2) Applied Probability
• Theorem, Reflection principle for Brownian motion (see Klenke, Theorem 21.19):
– For every a > 0 and T > 0,
$$P\left( \sup\{B_t : t \in [0, T]\} > a \right) = 2\, P(B_T > a) \leq \sqrt{\frac{2T}{\pi}}\, \frac{1}{a}\, e^{-a^2/(2T)}.$$ 372
Strong Markov Property (3) Applied Probability
• Theorem, Lévy's arcsine law (see Klenke, Theorem 21.20):
– Let T > 0 and $\zeta_T := \sup\{t \leq T : B_t = 0\}$. Then for $t \in [0, T]$
$$P(\zeta_T \leq t) = \frac{2}{\pi} \arcsin\sqrt{t/T}.$$ 373
Feller Processes (1) Applied Probability
• In some applications a continuous version of a process is too demanding, e.g. when you work with the Poisson process or a Brownian motion with jumps. Often there is a version with right continuous paths and left side limits.
• Definition, Càdlàg (see Klenke, Definition 21.21): Let E be a Polish space. A map $f: [0, \infty) \to E$ is called right continuous with left limits (RCLL) or càdlàg (continue à droite, limites à gauche) if $f(t) = f(t+) := \lim_{s \downarrow t} f(s)$ for every $t \geq 0$ and if for every $t > 0$ the left sided limit $f(t-) = \lim_{s \uparrow t} f(s)$ exists and is finite. 374
Feller Processes (2) Applied Probability
• Definition, (see Klenke, Definition 21.22):
– A filtration $\mathbb F = (\mathcal F_t)_{t \geq 0}$ is called right continuous if $\mathbb F = \mathbb F_+$, where $\mathcal F_{t+} = \bigcap_{s > t} \mathcal F_s$.
We say that the filtration F satisfies the usual conditions if F is right continuous and $\mathcal F_0$ is P-complete.
• See also Karatzas and Shreve (1991): adapted, augmented filtration. 375
Feller Processes (3) Applied Probability
• Theorem, Doob's regularisation (see Klenke, Theorem 21.24):
– Let $\mathbb F = (\mathcal F_t)_{t \geq 0}$ be a filtration that satisfies the usual conditions and let $X = (X_t)_{t \geq 0}$ be an F-supermartingale such that $t \mapsto \mathbb E(X_t)$ is right continuous. Then there exists a modification $\tilde X$ of X with RCLL paths. 376
Feller Processes (4) Applied Probability
• Definition, Feller semigroup (see Klenke, Definition 21.26):
– A Markov semigroup $(\kappa_t)$ on E is called a Feller semigroup if
$$f(x) = \lim_{t \to 0} \kappa_t f(x)$$
for all $x \in E$, $f \in C_0(E)$ (the set of bounded continuous functions that vanish at infinity), and $\kappa_t f \in C_0(E)$ for every $f \in C_0(E)$. 377
Feller Processes (5) Applied Probability
• Theorem, (see Klenke, Theorem 21.27):
– Let $(\kappa_t)_{t \geq 0}$ be a Feller semigroup on the locally compact Polish space E. Then there exists a strong Markov process $(X_t)_{t \geq 0}$ with RCLL paths and transition kernels $(\kappa_t)_{t \geq 0}$.
– Such a process is called a Feller process. 378
The Space C([0, ∞)) (1) Applied Probability
• C([0, ∞)) is the space of continuous functions. Instead of $\Omega = \mathbb R^{[0,\infty)}$ we shall work with Ω = C([0, ∞)) in the following. We need some of these results to investigate the functional central limit theorem.
• Let us consider functionals which depend on the whole path of a Brownian motion. E.g.: is $\sup\{X_t, t \in [0, 1]\}$ measurable?
• For general processes this is not the case, while by the continuity of Brownian motion measurability still holds.
• We can consider Brownian motion as the canonical process on the space Ω = C([0, ∞)) of continuous paths. 379
The Space C([0, ∞)) (2) Applied Probability
• Let $\Omega = C([0, \infty)) \subset \mathbb R^{[0,\infty)}$. The evaluation map is $X_t: \Omega \to \mathbb R$, $\omega \mapsto \omega(t)$.
• For $f, g \in C([0, \infty))$ and $n \in \mathbb N$ let $d_n(f, g) := \|(f - g)|_{[0,n]}\| \wedge 1$ and $d(f, g) = \sum_{n=1}^\infty 2^{-n}\, d_n(f, g)$.
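The metric d can be approximated numerically by evaluating the sup-norms on a grid and truncating the series; the tail after n_max terms is bounded by $2^{-n_{max}}$. A sketch (grid resolution and truncation level are arbitrary choices, not from the slides):

```python
import numpy as np

def d_metric(f, g, n_max=20, pts_per_unit=200):
    """Truncated approximation of d(f,g) = sum_n 2^-n (||(f-g)|_[0,n]|| ∧ 1)."""
    total = 0.0
    for n in range(1, n_max + 1):
        t = np.linspace(0.0, n, n * pts_per_unit + 1)
        dn = min(np.max(np.abs(f(t) - g(t))), 1.0)   # d_n(f,g), capped at 1
        total += 2.0 ** (-n) * dn
    return total

# basic metric properties on two sample functions
assert d_metric(np.sin, np.sin) == 0.0
assert abs(d_metric(np.sin, np.cos) - d_metric(np.cos, np.sin)) < 1e-12
assert 0.0 < d_metric(np.sin, np.cos) <= 1.0          # series is bounded by 1
```

Because every $d_n$ is capped at 1, d itself never exceeds 1; the weights $2^{-n}$ are what make uniform convergence on compact sets equivalent to convergence in d.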
380
The Space C([0, ∞)) (3) Applied Probability
• Theorem, (see Klenke, Theorem 21.30):
– d is a complete metric on $\Omega = C([0, \infty)) \subset \mathbb R^{[0,\infty)}$ that induces the topology of uniform convergence on compact sets. The space (Ω, d) is separable and hence Polish. 381
The Space C([0, ∞)) (4) Applied Probability
• Theorem, (see Klenke, Theorem 21.31):
– With respect to the Borel σ-algebra B(Ω, d) the canonical projections $X_t$, $t \in [0, \infty)$, are measurable.
– On the other hand, the $X_t$ generate B(Ω, d). Hence
$$\mathcal B(\mathbb R)^{\otimes [0,\infty)}\big|_\Omega = \sigma(X_t, t \in [0, \infty)) = \mathcal B(\Omega, d).$$ 382
The Space C([0, ∞)) (5) Applied Probability
• Definition, (see Klenke, Definition 21.33):
– Let P be the probability measure on Ω = C([0, ∞)) with respect to which the canonical process X is a Brownian motion.
– Then P is called the Wiener measure.
– The triple (Ω, A, P) is called the Wiener space and X is called the canonical Brownian motion or Wiener process. 383
Convergence of Prob. M. on C([0, ∞)) (1) Applied Probability
• Let X and $(X^n)_{n \in \mathbb N}$ be random variables with values in C([0, ∞)) with distributions $P_X$ and $P_{X^n}$.
• Definition, (see Klenke, Definition 21.35):
– We say that the finite-dimensional distributions of $(X^n)$ converge to those of X if, for every $k \in \mathbb N$ and $t_1, \dots, t_k \in [0, \infty)$, we have
$$(X^n_{t_1}, \dots, X^n_{t_k}) \overset{n \to \infty}{\Longrightarrow} (X_{t_1}, \dots, X_{t_k}).$$
In this case, we write $X^n \overset{n \to \infty,\, fdd}{\Longrightarrow} X$ or $P_{X^n} \overset{n \to \infty,\, fdd}{\Longrightarrow} P_X$. 384
Convergence of Prob. M. on C([0, ∞)) (2) Applied Probability
• Theorem, (see Klenke, Lemma 21.36):
– $P_n \overset{n \to \infty,\, fdd}{\Longrightarrow} P$ and $P_n \overset{n \to \infty,\, fdd}{\Longrightarrow} Q$ imply P = Q. 385
Convergence of Prob. M. on C([0, ∞)) (3) Applied Probability
• Theorem, (see Klenke, Theorem 21.37):
– Weak convergence in M(Ω, d) implies convergence of the finite-dimensional distributions: $P_n \overset{n \to \infty}{\Longrightarrow} P$ implies $P_n \overset{n \to \infty,\, fdd}{\Longrightarrow} P$. 386
Convergence of Prob. M. on C([0, ∞)) (4) Applied Probability
• Theorem, (see Klenke, Theorem 21.38): Let $(P_n)_{n \in \mathbb N}$ and P be probability measures on C([0, ∞)). Then the following are equivalent:
– $P_n \overset{n \to \infty,\, fdd}{\Longrightarrow} P$ and $(P_n)_{n \in \mathbb N}$ is tight.
– $P_n \overset{n \to \infty}{\Longrightarrow} P$ weakly.
• On weak convergence see e.g. Klenke (2008)[Chapter 13]. In particular, tightness is defined in Definition 13.26. 387
Convergence of Prob. M. on C([0, ∞)) (5) Applied Probability
• To derive a useful criterion for tightness, the Arzelà-Ascoli theorem will be used. For N, δ > 0 and $\omega \in C([0, \infty))$, let
$$V^N(\omega, \delta) := \sup\{ |\omega(t) - \omega(s)| : |t - s| \leq \delta,\ s, t \leq N \}.$$
• Theorem, Arzelà-Ascoli (see Klenke, Theorem 21.39): A set $A \subset C([0, \infty))$ is relatively compact if and only if the following two conditions hold:
– $\{\omega(0) : \omega \in A\} \subset \mathbb R$ is bounded.
– For every N we have $\lim_{\delta \downarrow 0} \sup_{\omega \in A} V^N(\omega, \delta) = 0$. 388
Convergence of Prob. M. on C([0, ∞)) (6) Applied Probability
• Theorem, (see Klenke, Theorem 21.40): A family $(P_i : i \in I)$ of probability measures on C([0, ∞)) is weakly relatively compact if and only if the following two conditions hold:
– $(P_i \circ X_0^{-1},\, i \in I)$ is tight; that is, for every ε > 0 there is a K > 0 such that $P_i(\{\omega : |\omega(0)| > K\}) \leq \varepsilon$ for all $i \in I$.
– For all η, ε > 0 and $N \in \mathbb N$ there is a δ > 0 such that $P_i(\{\omega : V^N(\omega, \delta) > \eta\}) \leq \varepsilon$ for all $i \in I$. 389
Convergence of Prob. M. on C([0, ∞)) (7) Applied Probability
• Theorem, (see Klenke, Corollary 21.41):
– Let $(X_i : i \in I)$ and $(Y_i : i \in I)$ be families of random variables in C([0, ∞)).
– Assume that $(P_{X_i} : i \in I)$ and $(P_{Y_i} : i \in I)$ are tight.
– Then $(P_{X_i + Y_i} : i \in I)$ is tight. 390
Convergence of Prob. M. on C([0, ∞)) (8) Applied Probability
• Theorem, Kolmogorov's criterion for weak relative compactness (see Klenke, Theorem 21.42): Let $(X^i : i \in I)$ be a family of continuous stochastic processes. Assume that the following conditions are satisfied:
– The family $(P(X^i_0 \in \cdot) : i \in I)$ is tight.
– There are numbers C, α, β > 0 such that for all $s, t \in [0, \infty)$ and every $i \in I$ we have
$$\mathbb E\left( |X^i_s - X^i_t|^\alpha \right) \leq C\, |s - t|^{\beta + 1}.$$
Then the family $(P_{X^i} : i \in I) = (\mathcal L(X^i),\, i \in I)$ of distributions of $X^i$ is weakly relatively compact in M(C([0, ∞))).
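As a numerical illustration of the moment condition in this criterion (not from the slides): Brownian motion satisfies it with α = 4, β = 1, C = 3, because $B_t - B_s \sim N(0, |t - s|)$ and the fourth moment of $N(0, v)$ is $3v^2$, so $\mathbb E|B_t - B_s|^4 = 3|t - s|^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
s, t = 0.2, 0.7
# a Brownian increment over [s, t] is N(0, t - s)
incr = rng.normal(scale=np.sqrt(t - s), size=2_000_000)

# Monte Carlo estimate of E|B_t - B_s|^4 versus the exact value 3 (t - s)^2
assert abs(np.mean(incr ** 4) - 3 * (t - s) ** 2) < 0.01
```

This is the same moment bound that drives the Kolmogorov-Chentsov continuity theorem for Brownian motion quoted earlier (α = 4, β = 1 gives Hölder exponents up to 1/4; sharper choices of α push the exponent toward 1/2).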
391 Donsker’s Theorem (1) Applied Probability • We consider iid random variables Y1, Y2, . . . with E(Yi) = 0 and V(Yi) = σ 2 > 0. • For t > 0 define Stn := [nt] X i=1 Yi and S̃tn 1 [nt] X := √ 2 Yi . i=1 σ n [nt] stands for the integer part of nt. • By the central limit theorem L(S̃tn) n→∞ → N0,t . 392 Donsker’s Theorem (2) Applied Probability • Given the properties of Browian motion (Bt ∼ N0,t) we observe that L(S̃tn) n→∞ → L(Bt) for any t > 0. • By the multivariate central limit theorem we observe that → L(Bt1 , . . . , BtN ) . L(S̃tn1 , . . . , S̃tnN ) n→∞ 393 Donsker’s Theorem (3) Applied Probability • Define S̄tn 1 [nt] tn − [tn] X √ := Yi + √ 2 Y[nt]+1 . 2 σ n i=1 σ n • Then for ε > 0 P(|S̃tn−S̄tn| −2 > ε) ≤ ε E (S̃tn − S̄tn)2 1 1 1 1 n→∞ ≤ 2 2 E Y1 ≤ 2 → 0 . ε nσ εn 394 Donsker’s Theorem (4) Applied Probability • By Slutzky’s theorem (see e.g. Klenke, 2008, Theorem 13.18) we obtain convergence of the finitedimensional distributions to the Wiener measure PW . I.e. PS̄ n n→∞,f dd ⇒ PW . 395 Donsker’s Theorem (5) Applied Probability • Donsker’s theorem strengthens this convergence statement to weak convergence on C([0, ∞)). • This theorem is also called functional central limit theorem. • Theorems of this kind are also called invariance principles since the limiting distribution is the same for all distributions of Yi with expectation of zero and the same variance. 396 Donsker’s Theorem (6) Applied Probability • Theorem, Donsker’s Theorem (see Klenke, Theorem 21.43): – In the sense of weak convergence on C([0, ∞)) the distributions of S̄ n converge to the Wiener measure, L(S̄ n) n→∞ → PW . – S̄tn ⇒ Bt, where ⇒ stands for convergence in distribution. • This theorem builds the basis when limit distributions of partial sums are considered in econometrics (see e.g. Davidson, 1994; White, 2001). 397 Donsker’s Theorem (7) Applied Probability • From the continuous mapping theorem (see e.g. Klenke, 2008, Theorem 13.25) it follows: • Theorem, (see e.g. 
Durrett, 2010, Theorem 8.6.6):
– If $\varphi: C([0, 1]) \to \mathbb R$ has the property that it is continuous almost everywhere, then $\varphi(\bar S^n) \Rightarrow \varphi(B)$. 398
Donsker's Theorem (8) Applied Probability
• Example (see Durrett, 2010, Example 8.6.1):
– Let $\varphi(x) = x(1)$. Then $\varphi: C([0, 1]) \to \mathbb R$ is continuous, and the above theorem gives the central limit theorem.
• Example (see Durrett, 2010, Example 8.6.5):
– Let $\varphi(\omega) = \int_{[0,1]} \omega(t)^k\, dt$ with $k \in \mathbb N$. φ(.) is continuous. Then
$$n^{-1-(k/2)} \sum_{m=1}^n (S_m)^k = n^{-1-(k/2)} \sum_{m=1}^n \left( \sum_{i=1}^m Y_i \right)^k \Rightarrow \int_0^1 B_t^k\, dt.$$ 399
References
Billingsley, P. (1986). Probability and Measure. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, 2nd edition.
Brockwell, P. J. and Davis, R. A. (2006). Time Series: Theory and Methods. Springer Series in Statistics. Springer, New York, 2nd edition.
Brooks, S. P. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455.
Cameron, A. C. and Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press, New York.
Campbell, J. Y., Lo, A. W., and MacKinlay, A. C. (1997). The Econometrics of Financial Markets. Princeton University Press, Princeton.
Chib, S. and Ergashev, B. (2009). Analysis of multifactor affine yield curve models. Journal of the American Statistical Association, 104(488):1324–1337.
Cowles, M. K. and Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91(434):883–904.
Cox, J., Ross, S., and Rubinstein, M. (1979). Option pricing: A simplified approach. Journal of Financial Economics, 7:229–263.
Davidson, J. (1994). Stochastic Limit Theory - An Introduction for Econometricians. Oxford University Press, New York.
Delbaen, F. and Schachermayer, W. (1994). A general version of the fundamental theorem of asset pricing. Mathematische Annalen, 300:463–520.
Duffie, D. (2001).
Dynamic Asset Pricing. Princeton University Press, Princeton and Oxford. Durrett, R. (2007). Random Graph Dynamics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge. Durrett, R. (2010). Probability: Theory and Examples. 4th Edition. Cambridge University Press, Cambridge. Filipović, D. (2009). Term-Structure Models: A Graduate Course. Springer, Berlin. Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer Series in Statistics. Springer. 401 Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer Series in Statistics, Springer, New York. Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472. Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics 4, pages 169–193. Oxford University Press, Oxford. Harrison, J. and Pliska, S. R. (1981). Martingales and stochastic integrals in the theory of continuous trading. Stochastic Processes and their Applications, 11(3):215 – 260. Harrison, M. and Kreps, D. (1979). Martingales and arbitrage in multiperiod security markets. Journal of Economic Theory, 20:381–408. Heuser, H. (1993). Lehrbuch der Analysis, Teil 1. Teubner, Wiesbaden, 10th edition. Horn, R. A. and Johnson, C. R. (1990). Matrix analysis. Cambridge University Press, Cambridge. Corrected reprint of the 1985 original. Karatzas, I. and Shreve, S. E. (1991). Brownian Motion and Stochastic 402 Calculus. Springer-Verlag, New York, 2nd edition. Karr, A. F. (1993). Probability Theory. Springer. Klenke, A. (2008). Probability Theory - A Comprehensive Course. Springer. Lamberton, D. and Lapeyre, B. (2008). Introduction to Stochastic Calculus – Applied to Finance. Chapman & Hall, London, 2nd edition. LeRoy, S. F. (1989). 
Efficient capital markets and martingales. Journal of Economic Literature, 27(4):1583–1621. Lucas, Robert E, J. (1978). Asset prices in an exchange economy. Econometrica, 46(6):1429–45. Luenberger, D. G. (1979). Introduction to dynamic systems: theory, models, and applications. John Whiley and Sons, New York. Mas-Colell, A., Whinston, M. D., and Green, J. R. (1995). Microeconomic Theory. Oxford University Press, New York. Meyn, S. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability. Cambridge University Press, New York, 2nd edition. Munkres, J. (2000). Topology. Prentice Hall, Upper Saddle River, NJ, 2nd edition. Norris, J. R. (1998). Markov Chains. Cambridge University Press. 403 Robert, C. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer, New York. Ruud, P. A. (2000). An Introduction to Classical Econometric Theory. Oxford University Press, New York. Schönbucher, P. J. (2003). Credit Derivates Pricing Models: Models, Pricing and Implementation. Wiley Finance Series. John Wiley & Sons. Tierney, L. (1998). A note on metropolis-hastings kernels for general state spaces. Annals of Applied Probability, 8:1–9. Werner, J. and Ross, S. A. (2000). Principles of Financial Economics. Cambridge University Press. White, H. (2001). Asymptotic Theory For Econometricians. Emerald Group Publishing, Bingley, UK, revised edition. 404
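Appendix: the construction behind Donsker's theorem can be illustrated numerically. The following is a minimal simulation sketch (not part of the original slides, assuming NumPy; all variable names are illustrative): it builds the rescaled partial-sum process $\bar{S}^n$ on a grid over $[0,1]$ from iid increments with mean zero and unit variance, and applies the two continuous functionals from Durrett's Examples 8.6.1 and 8.6.5 (with $k = 1$).

```python
import numpy as np

# Illustrative simulation of Donsker's theorem: sample the rescaled
# partial-sum process S-bar^n at the grid points m/n, then apply two
# continuous functionals from C([0, 1]) to R.
rng = np.random.default_rng(0)

n, reps = 2000, 4000
a = np.sqrt(3.0)                    # Y_i ~ U[-a, a] gives V(Y_i) = a^2 / 3 = 1

endpoints = np.empty(reps)          # phi(x) = x(1): the endpoint functional
integrals = np.empty(reps)          # phi(w) = int_[0,1] w(t) dt  (case k = 1)
for r in range(reps):
    Y = rng.uniform(-a, a, size=n)  # iid with E(Y_i) = 0, sigma^2 = 1
    S = np.cumsum(Y) / np.sqrt(n)   # S-bar^n evaluated on the grid m/n
    endpoints[r] = S[-1]
    integrals[r] = S.mean()         # Riemann sum approximating the integral

# By Donsker's theorem, S-bar_1^n is approximately N(0, 1) (the ordinary
# CLT), and the integral functional is approximately N(0, 1/3), since
# V(int_0^1 B_t dt) = int int min(s, t) ds dt = 1/3.
print(endpoints.var(), integrals.var())
```

For large $n$ the two sampled variances should be close to $1$ and $1/3$, matching the limits $B_1 \sim N_{0,1}$ and $\int_0^1 B_t \, dt$; replacing the uniform increments by any other zero-mean, unit-variance distribution leaves the limits unchanged, which is exactly the invariance principle described above.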