Advanced Topics in Artificial Intelligence 6. Introduction to Machine Learning and the Version Space Method

Artificial Intelligence
8. Supervised and unsupervised
learning
Japan Advanced Institute of Science and Technology (JAIST)
Yoshimasa Tsuruoka
Outline
• Supervised learning
• Naive Bayes classifier
• Unsupervised learning
• Clustering
• Lecture slides
• http://www.jaist.ac.jp/~tsuruoka/lectures/
Supervised and unsupervised
learning
• Supervised learning
– Each instance is assigned a label
– Classification, regression
– Training data need to be created manually
• Unsupervised learning
– Each instance is just a vector of attribute-values
– Clustering
– Pattern mining
Naive Bayes classifier
Chapter 6.9 of Mitchell, T., Machine Learning (1997)
• Naive Bayes classifier
– Output probabilities
– Easy to implement
– Assumes conditional independence between
features
– Efficient learning and classification
Bayes’ theorem
• Thomas Bayes (1702 – 1761)
P A B  
P APB A
P B 
• The reverse conditional probability can be
calculated using the original conditional
probability and prior probabilities.
Bayes’ theorem
• Can we know the probability of having cancer
from the result of a medical test?
Pcancer positive 
Pcancer positive 
Pcancer   0.008,
Pcancer P positive cancer 
P positive
Pcancer P positive cancer 
P positive cancer   0.98,
P positive
Pcancer   0.992
Pnegative cancer   0.02
P positive cancer   0.03, Pnegative cancer   0.97
Bayes’ theorem
Pcancer P positive cancer   0.008  0.98  0.0078
Pcancer P positive cancer   0.992  0.03  0.298
0.0078
Pcancer positive 
 0.21
0.0078  0298
• The probability of actually having cancer is not
very high.
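The arithmetic on this slide can be checked with a few lines of Python (a minimal sketch; the variable names are ours):

```python
# Bayes' theorem on the cancer-test example:
# posterior = P(positive|cancer) P(cancer) / P(positive)
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03

joint_cancer = p_cancer * p_pos_given_cancer          # 0.00784
joint_not = p_not_cancer * p_pos_given_not_cancer     # 0.02976
posterior = joint_cancer / (joint_cancer + joint_not)
print(round(posterior, 2))  # → 0.21
```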
Naive Bayes classifier
• Assume that features are conditionally
independent
v_NB = argmax_{v_j} P(v_j | a_1, a_2, ..., a_n)

     = argmax_{v_j} P(v_j) P(a_1, a_2, ..., a_n | v_j) / P(a_1, a_2, ..., a_n)    (Bayes' theorem)

     = argmax_{v_j} P(v_j) P(a_1, a_2, ..., a_n | v_j)    (the denominator is constant)

     = argmax_{v_j} P(v_j) Π_i P(a_i | v_j)    (conditional independence)
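The final argmax can be sketched directly in Python (a minimal sketch; the function and parameter names are ours, and the probability tables are assumed to be given as dictionaries):

```python
import math

def naive_bayes_predict(attrs, priors, cond_probs):
    """Pick the class v maximizing P(v) * prod_i P(a_i | v).

    priors:     {class: P(class)}
    cond_probs: {class: {attribute_value: P(attribute_value | class)}}
    Log-probabilities are summed to avoid floating-point underflow.
    """
    best_class, best_score = None, -math.inf
    for v, prior in priors.items():
        score = math.log(prior)
        for a in attrs:
            score += math.log(cond_probs[v][a])
        if score > best_score:
            best_class, best_score = v, score
    return best_class
```

For example, with P(yes) = 0.64, P(no) = 0.36 and P(strong | yes) = 0.33, P(strong | no) = 0.60, the call `naive_bayes_predict(["strong"], ...)` returns "no", since 0.36 × 0.60 > 0.64 × 0.33.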
Training data
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Naive Bayes classifier
• Instance
  <Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong>

v_NB = argmax_{v_j ∈ {yes, no}} P(v_j) Π_i P(a_i | v_j)

     = argmax_{v_j ∈ {yes, no}} P(v_j) P(Outlook = sunny | v_j)
                                       × P(Temperature = cool | v_j)
                                       × P(Humidity = high | v_j)
                                       × P(Wind = strong | v_j)
Class prior probability
• Maximum likelihood estimation
– Just counting the number of occurrences in the
training data
PPlayTennis  yes  
9
 0.64
14
5
PPlayTennis  no  
 0.36
14
Conditional probabilities of
features
• Maximum likelihood
PWind  strong PlayTennis  yes  
3
 0.33
9
3
PWind  strong PlayTennis  no    0.60
5
Class posterior probabilities
P yes  P Outlook  sunny yes 
 P Temperature  cool yes 
 P Humidity  high yes 
 P Wind  strong yes 
 0.0053
P no  P Outlook  sunny no 
0.0053
 0.205
0.0053  0.0206
 P Temperature  cool no 
Normalize
 P Wind  strong no 
0.0206
 0.795
0.0053  0.0206
 P Humidity  high no 
 0.0206
Smoothing
• Maximum likelihood estimation

  n_c / n

  – Estimated probabilities are not reliable when n_c is small

• m-estimate of probability

  (n_c + m p) / (n + m)

  p : prior probability
  m : equivalent sample size
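The m-estimate can be written as a one-line function (a sketch; the names are ours):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m).

    n_c: observed count of the event, n: total count,
    p: prior probability, m: equivalent sample size.
    """
    return (n_c + m * p) / (n + m)

# With zero observed occurrences the estimate is pulled toward the
# prior instead of being an unreliable 0:
print(m_estimate(0, 9, 0.5, 2))  # (0 + 1) / 11 ≈ 0.091
```

With m = 0 the formula reduces to the plain maximum-likelihood estimate n_c / n.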
Text classification with a Naive
Bayes classifier
• Text classification
– Automatic classification of news articles
– Spam filtering
– Sentiment analysis of product reviews
– etc.
There were doors all round the hall, but they were all locked;
and when Alice had been all the way down one side and up
the other, trying every door, she walked sadly down the
middle, wondering how she was ever to get out again.
 
v_NB = argmax_{v_j ∈ {like, dislike}} P(v_j) Π_{i=1}^{45} P(a_i | v_j)

     = argmax_{v_j ∈ {like, dislike}} P(v_j) P(a_1 = "there" | v_j)
                                             × P(a_2 = "were" | v_j)
                                             × ...
                                             × P(a_45 = "again" | v_j)
Conditional probabilities of words
• Cannot be estimated reliably

  P(a_2 = "were" | v_j)

  The probability of the second word of the document being the word "were"

• Ignore the position and apply m-estimate smoothing

  P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
Unsupervised learning
• No “correct” output for each instance
• Clustering
– Merging “similar” instances into a group
– Hierarchical clustering, k-means, etc.
• Pattern mining
– Discovering frequent patterns from a large
amount of data
– Association rules, graph mining, etc.
Clustering
• Organize instances into groups whose
members are similar in some way
Agglomerative clustering
• Define a distance between every pair of instances
  – E.g. Euclidean distance, or one minus cosine similarity
• Algorithm
1. Start with every instance representing a singleton
cluster
2. The closest two clusters are merged into a single
cluster
3. Repeat this process until all clusters are merged
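The three steps above can be sketched in plain Python with the single-link criterion (a toy implementation, not optimized; the names are ours, and the function records each merge until one cluster remains):

```python
def single_link_cluster(points, dist):
    """Agglomerative clustering, single-link criterion: the distance
    between two clusters is the minimum pairwise distance between
    their members. Returns the list of merges performed."""
    clusters = [[p] for p in points]  # 1. every instance is a singleton
    merges = []
    while len(clusters) > 1:
        # 2. find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]  # 3. merge and repeat
        del clusters[j]
    return merges

# On the points 0, 1, 5 with absolute distance, 0 and 1 merge first
# (distance 1), then the pair {0, 1} merges with 5 (distance 4).
print(single_link_cluster([0, 1, 5], lambda a, b: abs(a - b)))
```

The recorded merge distances are exactly what a dendrogram plots on its vertical axis.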
Agglomerative clustering
• Example: five instances are merged step by step; the merge order is drawn as a dendrogram (figure)
Defining a distance between
clusters
• Single link: minimum distance between members of the two clusters
• Complete link: maximum distance between members of the two clusters
• Centroid: distance between the cluster centroids
• Group-average: average of all pairwise distances between members
k-means algorithm
• Centroids c_i
• Minimize

  Σ_{i=1}^{k} Σ_{x ∈ C_i} d(x, c_i)²
• Algorithm
1. Choose k centroids c_1, ..., c_k randomly
2. Assign each instance to the cluster with the
closest centroid
3. Update the centroids and go back to Step 2
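The three steps can be sketched in pure Python (a toy implementation for small data; points are tuples, the names are ours, and the loop stops when the centroids no longer move):

```python
import random

def kmeans(points, k, iters=100):
    """Plain k-means: returns (centroids, cluster index per point)."""
    def dist2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def mean(cluster):  # coordinate-wise mean of a list of tuples
        return tuple(sum(c) / len(cluster) for c in zip(*cluster))

    # 1. choose k centroids randomly from the data
    centroids = random.sample(points, k)
    for _ in range(iters):
        # 2. assign each instance to the cluster with the closest centroid
        assign = [min(range(k), key=lambda i: dist2(p, centroids[i]))
                  for p in points]
        # 3. update the centroids (keeping a centroid whose cluster is
        #    empty) and go back to step 2 until nothing moves
        new = [mean([p for p, a in zip(points, assign) if a == i]
                    or [centroids[i]]) for i in range(k)]
        if new == centroids:
            break
        centroids = new
    return centroids, assign
```

On two well-separated pairs of points, e.g. `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)`, the two nearby points end up in one cluster and the two distant points in the other.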