CS 540 – Introduction to AI, Fall 2016

Today’s Topics
• Bayes’ Rule – so you can start on HW3’s code; we will put “full joint prob tables” on hold until after this lecture
• Naïve Bayes (NB)
• Nannon and NB

[Portrait: Thomas Bayes, 1701-1761]
Bayes’ Rule
• Recall P(A ∧ B) = P(A | B) × P(B)
                  = P(B | A) × P(A)
• Equating the two RHS (right-hand-sides) we get

      P(A | B) = P(B | A) × P(A) / P(B)
This is Bayes’ Rule!
Common Usage
- Diagnosing CAUSE Given EFFECTS
P(disease | symptoms)
    = P(symptoms | disease) × P(disease) / P(symptoms)

  (‘symptoms’ is usually a big AND of several random
   variables, so a JOINT probability)
HW3: prob(this move leads to a WIN | NANNON board configuration)
Simple Example
(only ONE symptom variable)
• Assume we have estimated from data
    P(headache | disease=haveFlu)    = 0.90
    P(headache | disease=haveStress) = 0.40
    P(headache | disease=healthy)    = 0.01
    P(haveFlu)    = 0.02   // Dropping ‘disease=’ for clarity
    P(haveStress) = 0.20   // Because it’s midterms time!
    P(healthy)    = 0.78   // We assume the 3 ‘disease’ values are disjoint
• A patient comes in with a headache;
  what is the most likely diagnosis?
Solution

  P(disease | symptoms) = P(symptoms | disease) × P(disease) / P(symptoms)

  P(flu | headache)     = 0.90 × 0.02 / P(headache)
  P(stress | headache)  = 0.40 × 0.20 / P(headache)
  P(healthy | headache) = 0.01 × 0.78 / P(headache)

Note: we never need to compute the
denominator to find the most likely diagnosis!

STRESS most likely (by nearly a factor of 5)
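A tiny Java sketch of this calculation (class and variable names are my own, not from any provided code): it scores each diagnosis by the Bayes-Rule numerator and reports the largest.

// Minimal sketch: score each diagnosis with the Bayes-Rule numerator, skipping the shared denominator.
public class HeadacheDiagnosis {
    public static void main(String[] args) {
        String[] diseases        = { "flu",  "stress", "healthy" };
        double[] pHeadacheGivenD = { 0.90,   0.40,     0.01 };   // P(headache | disease), from the slide
        double[] pD              = { 0.02,   0.20,     0.78 };   // P(disease), from the slide

        int best = 0;
        for (int i = 0; i < diseases.length; i++) {
            double score = pHeadacheGivenD[i] * pD[i];           // unnormalized posterior
            System.out.printf("%-8s score = %.4f%n", diseases[i], score);
            if (score > pHeadacheGivenD[best] * pD[best]) best = i;
        }
        System.out.println("Most likely diagnosis: " + diseases[best]);  // prints "stress"
    }
}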
Base-Rate Fallacy
https://en.wikipedia.org/wiki/Base_rate_fallacy

(This same issue arises when we have many more neg than pos ex’s
 – false pos overwhelm true pos)

Assume Disease A is rare (one in 1 million, say – so picture not to scale)
Assume the population is 10B = 10^10, so 10^4 people have Disease A
Assume testForA is 99.99% accurate

You test positive. What is the prob you have Disease A?
Someone (not in this room) might naively think prob = 0.9999

[Figure: Venn diagram of the people for whom testForA = true – it contains the
 9,999 people who actually have Disease A plus about 10^6 people who do NOT]

Prob(A | testForA) = 0.01
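The arithmetic behind that 0.01, as a small Java check (numbers taken directly from the slide; the class name is my own):

// Base-rate fallacy: rare disease, very accurate test, yet most positives are false.
public class BaseRateFallacy {
    public static void main(String[] args) {
        double population = 1e10;     // 10B people
        double pDisease   = 1e-6;     // one in 1 million has Disease A
        double accuracy   = 0.9999;   // testForA is 99.99% accurate

        double haveA    = population * pDisease;                 // 10^4 people
        double truePos  = haveA * accuracy;                      // 9,999 of them test positive
        double falsePos = (population - haveA) * (1 - accuracy); // ~10^6 others also test positive

        System.out.printf("P(A | testForA) = %.4f%n", truePos / (truePos + falsePos)); // ~0.01
    }
}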
Dealing with Many Boolean-Valued Symptoms
(D = Disease, Si = Symptom i)
P(D | S1 ∧ S2 ∧ S3 ∧ … ∧ Sn)
    = P(S1 ∧ S2 ∧ S3 ∧ … ∧ Sn | D) × P(D) / P(S1 ∧ S2 ∧ S3 ∧ … ∧ Sn)    // Bayes’ Rule
If n is small, we could use a full joint table
If not, we could design/learn a Bayes Net (next lecture)
Here we’ll consider ‘conditional independence’ of the S’s
Assuming
Conditional Independence
Repeatedly using P(A ∧ B | C) = P(A | C) × P(B | C), we get

    P(S1 ∧ S2 ∧ S3 ∧ … ∧ Sn | D) = ∏ P(Si | D)

Assuming D has three possible, disjoint values:

    P(D1 | S1 ∧ S2 ∧ … ∧ Sn) = [ ∏ P(Si | D1) ] × P(D1) / P(S1 ∧ S2 ∧ … ∧ Sn)
    P(D2 | S1 ∧ S2 ∧ … ∧ Sn) = [ ∏ P(Si | D2) ] × P(D2) / P(S1 ∧ S2 ∧ … ∧ Sn)
    P(D3 | S1 ∧ S2 ∧ … ∧ Sn) = [ ∏ P(Si | D3) ] × P(D3) / P(S1 ∧ S2 ∧ … ∧ Sn)

We know Σ P(Di | S1 ∧ S2 ∧ … ∧ Sn) = 1, so if we want, we could solve
for P(S1 ∧ S2 ∧ … ∧ Sn) and, hence, need not compute/approximate it!
Full Joint vs. Naïve Bayes
• Completely assuming conditional
independence is called Naïve Bayes (NB)
– We need to estimate (eg, from data)
    P(Si | Dj)   // For each disease j, prob symptom i appears
    P(Dj)        // Prob of each disease j
• If we have N binary-valued symptoms
  and a tertiary-valued disease,
  size of the full joint is (3 × 2^N) – 1
• NB needs only (3 × N) + 3 – 1 estimates
  (a quick check of these counts for N = 10 appears below)
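A two-line check of those counts for a concrete N (runnable sketch; class and variable names are mine):

// Table sizes for N binary symptoms and a tertiary-valued disease.
public class TableSizes {
    public static void main(String[] args) {
        int n = 10;
        long fullJoint  = 3L * (1L << n) - 1;  // (3 x 2^N) - 1  = 3,071 for N = 10
        long naiveBayes = (3L * n) + 3 - 1;    // (3 x N) + 3 - 1 = 32   for N = 10
        System.out.println(fullJoint + " entries vs " + naiveBayes + " estimates");
    }
}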
Naïve Bayes Example
(for simplicity, ignore m-estimates [later] here)
Dataset
  S1  S2  S3 | D
  T   F   T  | T
  F   T   T  | F
  F   T   T  | T
  T   T   F  | T
  T   F   T  | F
  F   T   T  | T
  T   F   F  | F

Estimate from the data:
  P(D=true)             = ?
  P(D=false)            = ?
  P(S1=true | D=true)   = ?
  P(S1=true | D=false)  = ?
  P(S2=true | D=true)   = ?
  P(S2=true | D=false)  = ?
  P(S3=true | D=true)   = ?
  P(S3=true | D=false)  = ?
‘Law of Excluded Middle’
P(S3=true | D=false) + P(S3=false | D=false) = 1
so no need for the P(Si=false | D=?) estimates
Naïve Bayes Example
(for simplicity, ignore m-estimates)
Dataset
  S1  S2  S3 | D
  T   F   T  | T
  F   T   T  | F
  F   T   T  | T
  T   T   F  | T
  T   F   T  | F
  F   T   T  | T
  T   F   F  | F

  P(D=true)             = 4/7
  P(D=false)            = 3/7
  P(S1=true | D=true)   = 2/4
  P(S1=true | D=false)  = 2/3
  P(S2=true | D=true)   = 3/4
  P(S2=true | D=false)  = 1/3
  P(S3=true | D=true)   = 3/4
  P(S3=true | D=false)  = 2/3
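These fractions come from simple counting; here is a small Java sketch (the class and array names are mine) that tallies them from the seven rows above:

// Estimates the Naive Bayes probabilities by counting over the dataset above.
public class NaiveBayesCounts {
    public static void main(String[] args) {
        // Rows: {S1, S2, S3, D} with true = 1, false = 0.
        int[][] data = {
            {1,0,1, 1}, {0,1,1, 0}, {0,1,1, 1}, {1,1,0, 1},
            {1,0,1, 0}, {0,1,1, 1}, {1,0,0, 0}
        };
        int dTrue = 0, dFalse = 0;
        int[] sTrueGivenD    = new int[3];  // counts of Si=true among D=true rows
        int[] sTrueGivenNotD = new int[3];  // counts of Si=true among D=false rows

        for (int[] row : data) {
            if (row[3] == 1) { dTrue++;  for (int i = 0; i < 3; i++) sTrueGivenD[i]    += row[i]; }
            else             { dFalse++; for (int i = 0; i < 3; i++) sTrueGivenNotD[i] += row[i]; }
        }
        System.out.println("P(D=true)  = " + dTrue  + "/" + data.length);   // 4/7
        System.out.println("P(D=false) = " + dFalse + "/" + data.length);   // 3/7
        for (int i = 0; i < 3; i++) {
            System.out.println("P(S" + (i+1) + "=true | D=true)  = " + sTrueGivenD[i]    + "/" + dTrue);
            System.out.println("P(S" + (i+1) + "=true | D=false) = " + sTrueGivenNotD[i] + "/" + dFalse);
        }
    }
}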
Processing a ‘Test’ Example
(a) Prob(D = true | S1 = true ∧ S2 = true ∧ S3 = true) ?
    = P(S1 | D) × P(S2 | D) × P(S3 | D) × P(D) / probOfSymptoms
    = (2 / 4) × (3 / 4) × (3 / 4) × (4 / 7) / probS = 0.161 / probS

(b) Prob(D = false | S1 = true ∧ S2 = true ∧ S3 = true) ?
    = P(S1 | ¬D) × P(S2 | ¬D) × P(S3 | ¬D) × P(¬D) / probOfSymptoms
    = (2 / 3) × (1 / 3) × (2 / 3) × (3 / 7) / probS = 0.063 / probS

(Shorthand: here, vars = true unless a NOT sign (¬) is present)

Because (a) + (b) = 1, probSymptoms = 0.161 + 0.063 = 0.224
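The same computation in Java, with the estimates above hard-coded (a sketch; class and variable names are mine):

// Scores the test example S1=true, S2=true, S3=true with the Naive Bayes estimates above.
public class ScoreTestExample {
    public static void main(String[] args) {
        double scoreTrue  = (2.0/4) * (3.0/4) * (3.0/4) * (4.0/7);  // ~0.161, unnormalized
        double scoreFalse = (2.0/3) * (1.0/3) * (2.0/3) * (3.0/7);  // ~0.063, unnormalized
        double probSymptoms = scoreTrue + scoreFalse;               // ~0.224, the shared denominator

        System.out.printf("P(D=true  | S1,S2,S3) = %.2f%n", scoreTrue  / probSymptoms);  // ~0.72
        System.out.printf("P(D=false | S1,S2,S3) = %.2f%n", scoreFalse / probSymptoms);  // ~0.28
    }
}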
Processing a ‘Test’ Example (2)
Replacing probSymptoms:

(a) Prob(D = true  | S1 = true ∧ S2 = true ∧ S3 = true) = 0.161 / 0.224 = 0.72
(b) Prob(D = false | S1 = true ∧ S2 = true ∧ S3 = true) = 0.063 / 0.224 = 0.28
Is NB Naïve?
Surprisingly, the assumption of independence,
while most likely violated, is not too harmful!
• Naïve Bayes works quite well
– Very successful in text categorization (‘bag-of-words’ rep)
– Used in printer diagnosis in Windows, spam filtering, etc.
• Prob’s not accurate (‘uncalibrated’) due to double counting,
but good at seeing if prob > 0.5 or prob < 0.5
• Resurgence of research activity in Naïve Bayes
– Many ‘dead’ ML algo’s resuscitated by availability
of large datasets (KISS Principle)
The Big Picture of Playing NANNON
- provided s/w gives you the set of (annotated) legal moves
- if there are zero or one legal moves, the s/w passes or makes the only possible move
[Diagram: the Current NANNON Board branches to Possible Next Board ONE,
 Possible Next Board TWO, and Possible Next Board THREE;
 choose the move that gives the best prob of winning]

Four Effects of MOVES
  HIT:     _XO_   →  __X_
  BREAK:   _XX_   →  _X_X
  EXTEND:  _X_XX  →  __XXX
  CREATE:  _X_X_  →  __XX_
Reinforcement Learning (RL)
vs. Supervised Learning
• Nannon is Really an RL Task
• We’ll Treat it as a SUPERVISED ML Task
– All moves in winning games considered GOOD
– All moves in losing games considered BAD
• Noisy Data, but Good Play Still Results
• ‘Random Move’ & Hand-Coded
Players Provided
• Provided Code can make 10^6 Moves/Sec
What to Compute?
Multiple Possibilities (Pick only One)
P(move in winning game | current board ∧ chosen move)      OR
P(move in winning game | next board)                        OR
P(move in winning game | next board ∧ current board)        OR
P(move in winning game | next board ∧ effect of move)       OR
Etc.

(‘Effect of move’ = hit, break, extend, create, or some combo)
`Raw’ Random Variables
Representing the Board
Raw random variables (labels from the board diagram):
  • # of Home Pieces for X      • # of Home Pieces for O
  • # of Safe Pieces for X      • # of Safe Pieces for O
  • What is on Board Cell i (X, O, or empty), for each cell i

You can also create ‘derived’ features, eg, ‘inDanger’

Board size varies (L cells)
Number of pieces each player has also varies (K pieces)

Full Joint Size for Above = K × (K+1) × 3^L × (K+1) × K
  – for L=12 and K=5, |full joint| = 900 × 3^12 = 478,296,900

Some Possible Ways of Encoding the Move
  – die value
  – which of 12 (?) possible effect combo’s occurred
  – moved from cell i to cell j (L^2 possibilities; some not possible with a 6-sided die)
  – how many possible moves there were (L – 2 possibilities)
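A quick check of that full-joint count (sketch; class and variable names are mine):

// Full joint table size over the raw board variables, for a board of L cells and K pieces per player.
public class FullJointSize {
    public static void main(String[] args) {
        int L = 12, K = 5;
        long size = (long) K * (K + 1) * (K + 1) * K * (long) Math.pow(3, L);
        System.out.println(size);  // 478296900
    }
}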
Some Possible Java Code Snippets for NB

private static int boardSize = 6; // Default width of board.
private static int pieces    = 3; // Default #pieces per player.
…
int homeX_win[]  = new int[pieces + 1]; // Counts used to estimate p(homeX=? | win).
int homeX_lose[] = new int[pieces + 1]; // Counts used to estimate p(homeX=? | !win).
int safeX_win[]  = new int[pieces + 1]; // NEED TO ALSO DO FOR ‘O’!
int safeX_lose[] = new int[pieces + 1]; // Be sure to initialize using m!
int board_win[][]  = new int[boardSize][3]; // 3 since X, O, or blank.
int board_lose[][] = new int[boardSize][3];
int wins   = 1; // Remember m-estimates.
int losses = 1;
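One possible way to use those counters, sketched under my own assumptions (eg, that the arrays above are fields of your player class and that you track the board as an int per cell with 0 = blank, 1 = X, 2 = O); this is not the provided HW3 API:

// Sketch: record one observed board in the win or lose counts after the game ends.
void recordGame(boolean won, int homeX, int[] boardCells) {
    if (won) { wins++;   homeX_win[homeX]++; }
    else     { losses++; homeX_lose[homeX]++; }
    for (int cell = 0; cell < boardSize; cell++) {
        if (won) board_win[cell][boardCells[cell]]++;
        else     board_lose[cell][boardCells[cell]]++;
    }
}

// Sketch: unnormalized Naive Bayes score for "this board occurs in a winning game".
double scoreWin(int homeX, int[] boardCells) {
    double score = (double) homeX_win[homeX] / wins;              // estimate of p(homeX=? | win)
    for (int cell = 0; cell < boardSize; cell++)
        score *= (double) board_win[cell][boardCells[cell]] / wins;
    return score * wins / (wins + losses);                        // times the prior p(win)
}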
NB Technical Detail:
Underflow
• If we have, say, 100 features, we are
multiplying 100 numbers in [0,1]
• If many probabilities are small, we could “underflow”
the minimum positive double in our computer
• Trick: sum the log’s of the prob’s instead of multiplying
  Since log is monotonic, we often need only compare the two
  log-sums of the competing calc’s (see the sketch below)
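A minimal sketch of the trick (the 100 feature probabilities here are made up purely for illustration):

// Compare two Naive Bayes hypotheses in log space to avoid underflowing a double.
public class LogSpaceNB {
    public static void main(String[] args) {
        double[] pGivenWin  = new double[100];
        double[] pGivenLose = new double[100];
        java.util.Arrays.fill(pGivenWin,  1e-4);  // multiplying 100 of these directly gives 10^-400,
        java.util.Arrays.fill(pGivenLose, 2e-4);  // far below the minimum positive double

        double logWin = 0, logLose = 0;
        for (int i = 0; i < 100; i++) {
            logWin  += Math.log(pGivenWin[i]);    // sum logs instead of multiplying probs
            logLose += Math.log(pGivenLose[i]);
        }
        // log is monotonic, so comparing the sums picks the same answer as comparing the products would.
        System.out.println(logWin > logLose ? "predict WIN" : "predict LOSE");
    }
}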
Exploration vs. Exploitation
Tradeoff
• We are not getting iid data since the data
we get depends on the moves we choose
• Always doing what we currently think is best
  (exploitation) might leave us stuck in a local optimum
• So we should try out seemingly non-optimal moves
  now and then (exploration), even though we are more likely
  to lose those games (a small ε-greedy sketch appears after this list)
• Think about learning how to get from home to work
- many possible routes, try various ones now and then,
but most days take what has been best in past
• Simple sol’n for HW3: observe 100,000 games where two
random-move choosers play each other (‘burn-in’ phase)
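For the exploration side, one common scheme (not required by HW3, and separate from the burn-in idea above) is ε-greedy: with small probability pick a random legal move, otherwise pick the move your current model scores highest. A hedged sketch, with scoreMove standing in for whatever Naive Bayes estimate you compute per move:

import java.util.List;
import java.util.Random;

// Epsilon-greedy move chooser: mixes exploration (random move) with exploitation (best-scoring move).
public class EpsilonGreedy {
    private static final Random rng = new Random();

    static int chooseMoveIndex(List<int[]> legalMoves, double epsilon) {
        if (rng.nextDouble() < epsilon)                 // explore now and then
            return rng.nextInt(legalMoves.size());
        int best = 0;                                   // otherwise exploit the current model
        for (int i = 1; i < legalMoves.size(); i++)
            if (scoreMove(legalMoves.get(i)) > scoreMove(legalMoves.get(best))) best = i;
        return best;
    }

    // Placeholder: in HW3 this would be your estimate of P(win | this move).
    static double scoreMove(int[] move) {
        return rng.nextDouble();
    }
}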
Stationarity
• What About the Fact that the Opponent also Learns?
• That Changes the Probability Distributions
We are Trying to Estimate!
• However, We’ll Assume that the Prob
Distribution Remains Unchanged
(ie, is Stationary) While We Learn