cs540 – Fall 2016 (Shavlik©), Lecture 14, Week 8 — 10/27/16

Today’s Topics
• Bayes’ Rule – so you can start on HW3’s code; we will put “full joint probability tables” on hold until after this lecture
• Naïve Bayes (NB)
• Nannon and NB
[Portrait: Thomas Bayes, 1701–1761]

Bayes’ Rule
• Recall  P(A ∧ B) = P(A | B) × P(B) = P(B | A) × P(A)
• Equating the two RHS (right-hand sides), we get

  P(A | B) = P(B | A) × P(A) / P(B)

  This is Bayes’ Rule!

Common Usage – Diagnosing CAUSE Given EFFECTS

  P(disease | symptoms) = P(symptoms | disease) × P(disease) / P(symptoms)

• “symptoms” is usually a big AND of several random variables, so a JOINT probability
• HW3: prob(this move leads to a WIN | NANNON board configuration)

Simple Example (only ONE symptom variable)
• Assume we have estimated from data
  P(headache | disease=haveFlu)    = 0.90
  P(headache | disease=haveStress) = 0.40
  P(headache | disease=healthy)    = 0.01
  P(haveFlu)    = 0.02  // Dropping ‘disease=’ for clarity
  P(haveStress) = 0.20  // Because it’s midterms time!
  P(healthy)    = 0.78  // We assume the 3 ‘disease’ values are disjoint
• A patient comes in with a headache; what is the most likely diagnosis?

Solution
  P(disease | symptoms) = P(symptoms | disease) × P(disease) / P(symptoms)
  P(flu | headache)     = 0.90 × 0.02 / P(headache)
  P(stress | headache)  = 0.40 × 0.20 / P(headache)
  P(healthy | headache) = 0.01 × 0.78 / P(headache)
• Note: we never need to compute the denominator to find the most likely diagnosis!
• STRESS is most likely (by nearly a factor of 5)

Base-Rate Fallacy  (https://en.wikipedia.org/wiki/Base_rate_fallacy)
[Figure: pie chart of the population with a tiny slice for Disease A; not to scale]
• Assume Disease A is rare (one in 1 million, say)
• Assume the population is 10B = 10^10, so 10^4 people have it
• Assume testForA is 99.99% accurate
• You test positive. What is the prob you have Disease A?
• Someone (not in this room) might naively think prob = 0.9999
• But the people for whom testForA = true are 9,999 people who actually have Disease A plus 10^6 people who do NOT have Disease A
• So Prob(A | testForA) ≈ 0.01
• The same issue arises when we have many more negative than positive examples – false positives overwhelm true positives

Dealing with Many Boolean-Valued Symptoms
(D = Disease, Si = Symptom i)

  P(D | S1 ∧ S2 ∧ S3 ∧ … ∧ Sn)
    = P(S1 ∧ S2 ∧ S3 ∧ … ∧ Sn | D) × P(D) / P(S1 ∧ S2 ∧ S3 ∧ … ∧ Sn)   // Bayes’ Rule

• If n is small, we could use a full joint table
• If not, we could design/learn a Bayes Net (next lecture)
• Here we’ll consider ‘conditional independence’ of the S’s

Assuming Conditional Independence
• Repeatedly applying the assumption P(A ∧ B | C) = P(A | C) × P(B | C), we get

  P(S1 ∧ S2 ∧ S3 ∧ … ∧ Sn | D) = ∏i P(Si | D)

• Assuming D has three possible, disjoint values:

  P(D1 | S1 ∧ S2 ∧ … ∧ Sn) = [ ∏i P(Si | D1) ] × P(D1) / P(S1 ∧ S2 ∧ … ∧ Sn)
  P(D2 | S1 ∧ S2 ∧ … ∧ Sn) = [ ∏i P(Si | D2) ] × P(D2) / P(S1 ∧ S2 ∧ … ∧ Sn)
  P(D3 | S1 ∧ S2 ∧ … ∧ Sn) = [ ∏i P(Si | D3) ] × P(D3) / P(S1 ∧ S2 ∧ … ∧ Sn)

• We know Σi P(Di | S1 ∧ S2 ∧ … ∧ Sn) = 1, so if we want we can solve for P(S1 ∧ S2 ∧ … ∧ Sn) and hence need not compute/approximate it!
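As a concrete illustration of the calculation above, here is a minimal Java sketch (not part of the provided course code; the class, method, and array names are made up for illustration). It multiplies the prior by the per-symptom conditionals for each disease value and normalizes at the end, so the denominator P(S1 ∧ … ∧ Sn) is never computed separately.

  // Illustrative sketch only -- not the provided HW3/course code.
  // prior[d]        holds P(D = d)
  // condProb[d][i]  holds P(Si = true | D = d)
  // symptoms[i]     is true if symptom i is present in the case being diagnosed
  public class NaiveBayesSketch {
      static double[] posterior(double[] prior, double[][] condProb, boolean[] symptoms) {
          double[] score = new double[prior.length];
          double sum = 0.0;
          for (int d = 0; d < prior.length; d++) {
              score[d] = prior[d];                                   // P(D = d)
              for (int i = 0; i < symptoms.length; i++) {
                  // Multiply in P(Si | D = d) if symptom i is present, else 1 - P(Si | D = d).
                  score[d] *= symptoms[i] ? condProb[d][i] : 1.0 - condProb[d][i];
              }
              sum += score[d];                                       // Running total = P(S1 and ... and Sn)
          }
          for (int d = 0; d < prior.length; d++) score[d] /= sum;    // Normalize instead of computing the denominator
          return score;
      }
  }

With the single-symptom headache numbers from the earlier slide (priors 0.02, 0.20, 0.78 and conditionals 0.90, 0.40, 0.01), this returns roughly 0.17, 0.76, 0.07 – the same ranking as on the Solution slide.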
Full Joint vs. Naïve Bayes
• Completely assuming conditional independence is called Naïve Bayes (NB)
  – We need to estimate (eg, from data)
    P(Si | Dj)  // For each disease j, the prob that symptom i appears
    P(Dj)       // The prob of each disease j
• If we have N binary-valued symptoms and a tertiary-valued disease, the size of the full joint is (3 × 2^N) – 1
• NB needs only (3 × N) + 3 – 1 estimates

Naïve Bayes Example (for simplicity, ignore m-estimates [covered later] here)

Dataset
  S1  S2  S3 |  D
   T   F   T |  T
   F   T   T |  F
   F   T   T |  T
   T   T   F |  T
   T   F   T |  F
   F   T   T |  T
   T   F   F |  F

Estimates from the dataset
  P(D=true)  = 4/7
  P(D=false) = 3/7
  P(S1=true | D=true)  = 2/4
  P(S1=true | D=false) = 2/3
  P(S2=true | D=true)  = 3/4
  P(S2=true | D=false) = 1/3
  P(S3=true | D=true)  = 3/4
  P(S3=true | D=false) = 2/3

• ‘Law of Excluded Middle’: P(S3=true | D=false) + P(S3=false | D=false) = 1, so there is no need for separate P(Si=false | D=?) estimates

Processing a ‘Test’ Example (here, variables are true unless a NOT (¬) sign is present)

(a) Prob(D = true | S1 ∧ S2 ∧ S3)
    = P(S1 | D) × P(S2 | D) × P(S3 | D) × P(D) / probSymptoms
    = (2/4) × (3/4) × (3/4) × (4/7) / probSymptoms
    = 0.161 / probSymptoms

(b) Prob(D = false | S1 ∧ S2 ∧ S3)
    = P(S1 | ¬D) × P(S2 | ¬D) × P(S3 | ¬D) × P(¬D) / probSymptoms
    = (2/3) × (1/3) × (2/3) × (3/7) / probSymptoms
    = 0.063 / probSymptoms

Because (a) + (b) = 1, probSymptoms = 0.161 + 0.063 = 0.224

Processing a ‘Test’ Example (2) – Replacing probSymptoms

(a) Prob(D = true | S1 ∧ S2 ∧ S3)  = 0.161 / 0.224 = 0.72
(b) Prob(D = false | S1 ∧ S2 ∧ S3) = 0.063 / 0.224 = 0.28

Is NB Naïve?
Surprisingly, the assumption of independence, while most likely violated, is not too harmful!
• Naïve Bayes works quite well
  – Very successful in text categorization (‘bag-of-words’ representation)
  – Used in printer diagnosis in Windows, spam filtering, etc
• The probabilities it produces are not accurate (‘uncalibrated’) due to double counting, but it is good at deciding whether prob > 0.5 or prob < 0.5
• Resurgence of research activity in Naïve Bayes
  – Many ‘dead’ ML algorithms resuscitated by the availability of large datasets (KISS Principle)

The Big Picture of Playing NANNON
• The provided s/w gives you the set of (annotated) legal moves
• If there are zero or one, the s/w passes or makes the only possible move
[Figure: current Nannon board with three possible next boards]
• Four effects of MOVES (before → after):
  HIT:    _XO_  → __X_
  BREAK:  _XX_  → _X_X
  EXTEND: _X_XX → __XXX
  CREATE: _X_X_ → __XX_
• Choose the move that gives the best prob of winning

Reinforcement Learning (RL) vs. Supervised Learning
• Nannon is really an RL task
• We’ll treat it as a SUPERVISED ML task (a counting sketch appears after this slide)
  – All moves in winning games are considered GOOD
  – All moves in losing games are considered BAD
• Noisy data, but good play still results
• ‘Random-move’ and hand-coded players are provided
• The provided code can make 10^6 moves/sec
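The counting idea behind this supervised treatment can be sketched as follows. This is only an illustration, not the provided HW3 interface: the class name, the recordGame method and its arguments are made up, and only one feature (the number of home pieces for X) is shown; the real code would keep one pair of counter arrays per feature, as in the Java snippet later in these notes.

  // Illustrative sketch only -- not the provided HW3 framework.
  import java.util.List;

  public class MoveCounterSketch {
      static final int PIECES = 3;                  // Assumed #pieces per player.
      int[] homeX_win  = new int[PIECES + 1];       // Counts of homeX values seen in games we won.
      int[] homeX_lose = new int[PIECES + 1];       // Counts of homeX values seen in games we lost.
      int wins = 1, losses = 1;                     // Start at 1 as a crude m-estimate.

      // Called once per completed game: every board seen in a winning game
      // counts as GOOD evidence, every board seen in a losing game as BAD evidence.
      void recordGame(List<Integer> homeXValuesSeen, boolean weWon) {
          for (int homeX : homeXValuesSeen) {
              if (weWon) homeX_win[homeX]++; else homeX_lose[homeX]++;
          }
          if (weWon) wins++; else losses++;
      }
  }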
What to Compute? Multiple Possibilities (Pick Only One)
  P(move in winning game | current board ∧ chosen move), OR
  P(move in winning game | next board), OR
  P(move in winning game | next board ∧ current board), OR
  P(move in winning game | next board ∧ effect of move), OR
  etc.
  (‘effect of move’ = hit, break, extend, create, or some combo)

‘Raw’ Random Variables Representing the Board
• What is on board cell i (X, O, or empty)
• # of home pieces for X
• # of home pieces for O
• # of safe pieces for X
• # of safe pieces for O
• The board size varies (L cells), and the number of pieces each player has also varies (K pieces)
• Full joint size for the above = K × (K+1) × 3^L × (K+1) × K
  – for L = 12 and K = 5, |full joint| = 900 × 3^12 = 478,296,900
• You can also create ‘derived’ features, eg, ‘inDanger’

Some Possible Ways of Encoding the Move
• the die value
• which of the 12 (?) possible effect combos occurred
• moved from cell i to cell j (L^2 possibilities; some not possible with a 6-sided die)
• how many possible moves there were (L – 2 possibilities)

Some Possible Java Code Snippets for NB

  private static int boardSize = 6;   // Default width of board.
  private static int pieces    = 3;   // Default #pieces per player.
  …
  int homeX_win[]  = new int[pieces + 1];      // Holds p(homeX=? | win).
  int homeX_lose[] = new int[pieces + 1];      // Holds p(homeX=? | !win).
  int safeX_win[]  = new int[pieces + 1];      // NEED TO ALSO DO FOR ‘O’!
  int safeX_lose[] = new int[pieces + 1];      // Be sure to initialize using m!
  int board_win[][]  = new int[boardSize][3];  // 3 since X, O, or blank.
  int board_lose[][] = new int[boardSize][3];
  int wins   = 1;                              // Remember m-estimates.
  int losses = 1;

NB Technical Detail: Underflow
• If we have, say, 100 features, we are multiplying 100 numbers in [0, 1]
• If many of the probabilities are small, we could underflow the minimum positive double in our computer
• Trick: sum the logs of the probabilities (a small sketch appears at the end of these notes)
• Often we need only compare the exponents of the two calculations

Exploration vs. Exploitation Tradeoff
• We are not getting iid data, since the data we get depends on the moves we choose
• Always doing what we currently think is best (exploitation) might leave us stuck in a local minimum
• So we should try seemingly non-optimal moves now and then (exploration), though doing so is likely to lose that game
• Think about learning how to get from home to work: there are many possible routes; try various ones now and then, but most days take what has been best in the past
• Simple solution for HW3: observe 100,000 games where two random-move choosers play each other (a ‘burn-in’ phase)

Stationarity
• What about the fact that the opponent also learns?
• That changes the probability distributions we are trying to estimate!
• However, we’ll assume that the probability distribution remains unchanged (ie, is stationary) while we learn
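The log-space trick mentioned under “NB Technical Detail: Underflow” can be sketched like this (again an illustration, not the provided code; the class and method names are made up). Instead of multiplying the per-feature probabilities for ‘win’ and ‘lose’ and risking underflow, we sum their logs and compare the two sums; the shared denominator P(features) can be ignored.

  // Illustrative sketch of the log-space trick -- not the provided HW3 code.
  public class LogSpaceSketch {
      // probsGivenWin[i]  = P(feature i's observed value | win)
      // probsGivenLose[i] = P(feature i's observed value | lose)
      static boolean winMoreLikely(double[] probsGivenWin, double[] probsGivenLose,
                                   double probWin, double probLose) {
          double logWin  = Math.log(probWin);
          double logLose = Math.log(probLose);
          for (double p : probsGivenWin)  logWin  += Math.log(p);   // log of a product = sum of logs
          for (double p : probsGivenLose) logLose += Math.log(p);
          // The denominator P(features) is identical for both outcomes,
          // so comparing the two (log) numerators is enough.
          return logWin > logLose;
      }
  }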