Artificial Intelligence
8. Supervised and unsupervised learning
Japan Advanced Institute of Science and Technology (JAIST)
Yoshimasa Tsuruoka

Outline
• Supervised learning
• Naive Bayes classifier
• Unsupervised learning
• Clustering
• Lecture slides
• http://www.jaist.ac.jp/~tsuruoka/lectures/

Supervised and unsupervised learning
• Supervised learning
– Each instance is assigned a label
– Classification, regression
– Training data need to be created manually
• Unsupervised learning
– Each instance is just a vector of attribute values
– Clustering
– Pattern mining

Naive Bayes classifier
Chapter 6.9 of Mitchell, T., Machine Learning (1997)
• Naive Bayes classifier
– Outputs probabilities
– Easy to implement
– Assumes conditional independence between features
– Efficient learning and classification

Bayes' theorem
• Thomas Bayes (1702–1761)

  P(A|B) = P(B|A) P(A) / P(B)

• The reverse conditional probability can be calculated from the original conditional probability and the prior probabilities.

Bayes' theorem
• Can we know the probability of having cancer from the result of a medical test?

  P(cancer|positive) = P(positive|cancer) P(cancer) / P(positive)

  P(cancer) = 0.008          P(¬cancer) = 0.992
  P(positive|cancer) = 0.98    P(negative|cancer) = 0.02
  P(positive|¬cancer) = 0.03   P(negative|¬cancer) = 0.97

Bayes' theorem
  P(positive|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
  P(positive|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

  P(cancer|positive) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21

• The probability of actually having cancer is not very high.

Naive Bayes classifier
• Assume that the features are conditionally independent given the class:

  v_NB = argmax_{v_j} P(v_j | a_1, a_2, ..., a_n)
       = argmax_{v_j} P(v_j) P(a_1, a_2, ..., a_n | v_j) / P(a_1, a_2, ..., a_n)   (Bayes' theorem)
       = argmax_{v_j} P(v_j) P(a_1, a_2, ..., a_n | v_j)   (the denominator is constant)
       = argmax_{v_j} P(v_j) ∏_i P(a_i | v_j)   (conditional independence)
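The cancer-test calculation above can be checked with a few lines of Python. This is just a sketch of the arithmetic; the variable names are mine, and 0.0078 and 0.0298 in the slides are the rounded forms of the intermediate products.

```python
# Bayes' theorem on the cancer-test numbers from the slide.
p_cancer = 0.008              # P(cancer), the prior
p_not_cancer = 0.992          # P(¬cancer)
p_pos_given_cancer = 0.98     # P(positive | cancer)
p_pos_given_not_cancer = 0.03 # P(positive | ¬cancer)

# Unnormalized joint probabilities P(positive, cancer) and P(positive, ¬cancer)
joint_cancer = p_pos_given_cancer * p_cancer              # 0.98 * 0.008 ≈ 0.0078
joint_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.03 * 0.992 ≈ 0.0298

# Normalize: P(positive) = P(positive, cancer) + P(positive, ¬cancer)
p_cancer_given_pos = joint_cancer / (joint_cancer + joint_not_cancer)
print(round(p_cancer_given_pos, 2))  # → 0.21
```

Note that the normalization step lets us avoid computing P(positive) directly: summing the two joint probabilities gives the same denominator.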
Training data

  Day  Outlook   Temperature  Humidity  Wind    PlayTennis
  D1   Sunny     Hot          High      Weak    No
  D2   Sunny     Hot          High      Strong  No
  D3   Overcast  Hot          High      Weak    Yes
  D4   Rain      Mild         High      Weak    Yes
  D5   Rain      Cool         Normal    Weak    Yes
  D6   Rain      Cool         Normal    Strong  No
  D7   Overcast  Cool         Normal    Strong  Yes
  D8   Sunny     Mild         High      Weak    No
  D9   Sunny     Cool         Normal    Weak    Yes
  D10  Rain      Mild         Normal    Weak    Yes
  D11  Sunny     Mild         Normal    Strong  Yes
  D12  Overcast  Mild         High      Strong  Yes
  D13  Overcast  Hot          Normal    Weak    Yes
  D14  Rain      Mild         High      Strong  No

Naive Bayes classifier
• Instance: <Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong>

  v_NB = argmax_{v_j ∈ {yes, no}} P(v_j) ∏_i P(a_i | v_j)
       = argmax_{v_j ∈ {yes, no}} P(v_j) P(Outlook = sunny | v_j) P(Temperature = cool | v_j)
                                         P(Humidity = high | v_j) P(Wind = strong | v_j)

Class prior probability
• Maximum likelihood estimation
– Just count the number of occurrences in the training data

  P(PlayTennis = yes) = 9/14 ≈ 0.64
  P(PlayTennis = no)  = 5/14 ≈ 0.36

Conditional probabilities of features
• Maximum likelihood estimation

  P(Wind = strong | PlayTennis = yes) = 3/9 ≈ 0.33
  P(Wind = strong | PlayTennis = no)  = 3/5 = 0.60

Class posterior probabilities

  P(yes) P(Outlook = sunny | yes) P(Temperature = cool | yes) P(Humidity = high | yes) P(Wind = strong | yes) = 0.0053
  P(no)  P(Outlook = sunny | no)  P(Temperature = cool | no)  P(Humidity = high | no)  P(Wind = strong | no)  = 0.0206

• Normalize:

  P(yes | instance) = 0.0053 / (0.0053 + 0.0206) ≈ 0.205
  P(no | instance)  = 0.0206 / (0.0053 + 0.0206) ≈ 0.795

Smoothing
• Maximum likelihood estimation

  n_c / n

– Estimated probabilities are not reliable when n_c is small
• m-estimate of probability

  (n_c + m p) / (n + m)

  p: prior probability
  m: equivalent sample size

Text classification with a Naive Bayes classifier
• Text classification
– Automatic classification of news articles
– Spam filtering
– Sentiment analysis of product reviews
– etc.

"There were doors all round the hall, but they were all locked; and when Alice had been all the way down one side and up the other, trying every door, she walked sadly down the middle, wondering how she was ever to get out again."
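The counting-based estimation and classification above can be sketched in plain Python. This is a minimal illustration with maximum-likelihood estimates and no smoothing; the helper names (`score`, `feature_counts`, etc.) are mine, not from the slides.

```python
# Naive Bayes on the PlayTennis training data, by direct counting.
from collections import Counter

# Each row: ((Outlook, Temperature, Humidity, Wind), PlayTennis)
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),
    (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),
    (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"),
    (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"),
    (("Rain", "Mild", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"),
    (("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"),
    (("Rain", "Mild", "High", "Strong"), "No"),
]

label_counts = Counter(label for _, label in data)  # {"Yes": 9, "No": 5}
# feature_counts[(position, value, label)] = number of co-occurrences
feature_counts = Counter()
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, value, label)] += 1

def score(features, label):
    """Unnormalized posterior: P(label) * prod_i P(a_i | label), by MLE."""
    p = label_counts[label] / len(data)
    for i, value in enumerate(features):
        p *= feature_counts[(i, value, label)] / label_counts[label]
    return p

instance = ("Sunny", "Cool", "High", "Strong")
scores = {label: score(instance, label) for label in label_counts}
total = sum(scores.values())
posterior = {label: s / total for label, s in scores.items()}
print(posterior)  # ≈ {"Yes": 0.205, "No": 0.795}
```

Running this reproduces the slide's numbers: the unnormalized scores come out as roughly 0.0053 for yes and 0.0206 for no, and normalizing gives about 0.205 vs. 0.795.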
  v_NB = argmax_{v_j ∈ {like, dislike}} P(v_j) ∏_{i=1}^{45} P(a_i | v_j)
       = argmax_{v_j ∈ {like, dislike}} P(v_j) P(a_1 = "there" | v_j) P(a_2 = "were" | v_j) ... P(a_45 = "again" | v_j)

Conditional probabilities of words
• Cannot be estimated reliably
– P(a_2 = "were" | v_j) is the probability that the second word of the document is the word "were"
• Ignore the position and apply m-estimate smoothing:

  P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)

Unsupervised learning
• No "correct" output for each instance
• Clustering
– Merging "similar" instances into groups
– Hierarchical clustering, k-means, etc.
• Pattern mining
– Discovering frequent patterns from a large amount of data
– Association rules, graph mining, etc.

Clustering
• Organize instances into groups whose members are similar in some way

Agglomerative clustering
• Define a distance between every pair of instances
– E.g. cosine similarity
• Algorithm
1. Start with every instance as a singleton cluster
2. Merge the closest two clusters into a single cluster
3. Repeat this process until all clusters are merged

Agglomerative clustering
• Example
[figure: dendrogram showing instances 1–5 being merged step by step]

Defining a distance between clusters
• single link
• complete link
• centroid
• group-average

k-means algorithm
• Centroids c_1, ..., c_k
• Minimize

  ∑_{i=1}^{k} ∑_{x ∈ C_i} d(x, c_i)²

• Algorithm
1. Choose k centroids c_1, ..., c_k randomly
2. Assign each instance to the cluster with the closest centroid
3. Update the centroids and go back to Step 2
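The three steps of the k-means algorithm above can be sketched in plain Python. This is an illustrative implementation on toy 2-D points with squared Euclidean distance; the function and variable names are mine, and the initialization (sampling k data points as centroids) is one common choice among several.

```python
# A minimal k-means sketch: random initialization, assignment, centroid update.
import random

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: choose k centroids randomly (here: k distinct data points)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Step 2: assign each instance to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda j: (x - centroids[j][0])**2 + (y - centroids[j][1])**2)
            clusters[i].append((x, y))
        # Step 3: update each centroid to the mean of its cluster
        new_centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments stable: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # the two cluster centers, ≈ (0.33, 0.33) and (10.33, 10.33)
```

With stopping left to a fixed iteration cap plus a convergence check, the loop terminates as soon as the cluster assignments stop changing, which is how Step 3's "go back to Step 2" is usually bounded in practice.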