Lecture notes

Classification and Clustering
CS102 - May 10

Supervised versus Unsupervised Machine Learning
● Supervised: Create a model from well-understood training data, use it for inference or prediction about other data. Examples: regression, classification
● Unsupervised: Try to understand the data, look for patterns or structure. Examples: data mining, clustering
● Also in-between approaches, such as semi-supervised learning and active learning

Classification
Goal: Given a set of feature values (often just called features) for an item not seen before, decide which one of a set of predefined categories the item belongs to.
Supervised: training data is a set of items with both features and categories (labeled data)
Examples:
● Customer purchases - features: age, income, gender, zipcode, profession; categories: likelihood of buying (high, medium, low)
● Medical diagnosis - features: age, gender, history, symptom1-severity, symptom2-severity, test1-result, test2-result; categories: diagnosis
● Fraud detection in online purchases - features: item, volume, price, shipping, address; categories: fraud or okay
Classification using k-nearest neighbors (kNN)
For any pair of items i1, i2 and their features, can compute distance(i1, i2)
● Example: combine numeric measures using equality for gender, profession; absolute difference for age, income; proximity for zipcode
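One way such a mixed-feature distance might be written, as a rough sketch for the customer-purchases features (the dict field names and scaling constants are made up for illustration, not from the lecture):

```python
# Hypothetical distance for the customer-purchases example: equality for
# gender/profession, absolute difference for age/income, proximity for zipcode.
def distance(i1, i2):
    d = 0.0
    d += 0 if i1["gender"] == i2["gender"] else 1               # equality
    d += 0 if i1["profession"] == i2["profession"] else 1       # equality
    d += abs(i1["age"] - i2["age"]) / 100                       # absolute difference, scaled
    d += abs(i1["income"] - i2["income"]) / 100000              # absolute difference, scaled
    d += abs(int(i1["zipcode"]) - int(i2["zipcode"])) / 100000  # crude zip-code proximity
    return d
```

The scaling factors only keep any one feature from dominating the sum; choosing them well is part of designing a good distance function.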
To classify a new item i: Find the k closest items to i, assign the most frequent category

Weather data:
● Categories based on temperature (frigid, cold, cool, comfy, warm)
● Predict category based on latitude and longitude
● Distance is Euclidean distance in two-dimensional space
● Show TempsCat spreadsheet, LatLongCats scatterplot
● Try two cities: Dallas, Texas (long 96.8, lat 32.8); Davenport, Iowa (long 90.6, lat 41.5)
  Truth: Dallas comfy, Davenport cold

Can also use for regression via average values:
● Customer purchases - predict dollars spent
● Weather - predict temperature from latitude and longitude
● Show LatLongTemps scatterplot - Truth: Dallas temp 34, Davenport temp 13

Note: No hidden complicated math! Once the distance function is defined, the rest is easy (though not necessarily efficient).
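A minimal sketch of kNN classification and kNN regression for the weather data, assuming the TempsCat rows are available as (longitude, latitude, category, temperature) tuples; the variable names, tuple layout, and choice of k are illustrative:

```python
import math
from collections import Counter

# cities: list of (longitude, latitude, category, temperature) tuples,
# e.g. loaded from the TempsCat spreadsheet (loading not shown)
def knn_classify(cities, lon, lat, k=5):
    # Euclidean distance in (longitude, latitude) space
    nearest = sorted(cities, key=lambda c: math.hypot(c[0] - lon, c[1] - lat))[:k]
    # most frequent category among the k nearest cities
    return Counter(c[2] for c in nearest).most_common(1)[0][0]

def knn_regress(cities, lon, lat, k=5):
    nearest = sorted(cities, key=lambda c: math.hypot(c[0] - lon, c[1] - lat))[:k]
    # average temperature of the k nearest cities
    return sum(c[3] for c in nearest) / k

# knn_classify(cities, 96.8, 32.8)   # Dallas
# knn_classify(cities, 90.6, 41.5)   # Davenport
```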
Classification using decision tree
Yes/no or partition questions over feature values
Navigate to bottom of the tree, find category
● Customer purchases - gender split, then age partitions, then income; categories on the leaves. Show classification of a new customer.
● Weather - Show TempsCat, want to predict category from other “features”. Speculate latitude as most discriminating (sort on lat), but what next?
Primary challenge is in building “good” trees from training data
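To make the navigate-to-a-leaf idea concrete, here is a tiny hand-built tree for the customer-purchases example; the split points and leaf categories are invented for illustration (in practice the tree is learned from the labeled training data):

```python
# Illustrative hand-built tree: gender split, then age partitions, then income;
# categories (high/medium/low likelihood of buying) sit on the leaves.
def classify_customer(c):
    if c["gender"] == "F":
        if c["age"] < 30:
            return "high" if c["income"] > 60000 else "medium"
        else:
            return "medium"
    else:
        if c["age"] < 30:
            return "medium"
        else:
            return "high" if c["income"] > 100000 else "low"

# classify_customer({"gender": "F", "age": 27, "income": 75000})  -> "high"
```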
Classification using Naive Bayes
Conditional independence assumption: Given two features X1 and X2 and a category Y, the probability that X1=v1 for an item in category Y is independent of the probability that X2=v2 for that item. (Customer purchases example: might be true for gender and age, unlikely to be true for income and zipcode.) This assumption is what makes the approach “naive”.
Algorithm:
● Calculate from training data:
  1. Fraction (probability) of items in each category
  2. For each category, fraction (probability) of items in that category with X=v, for each feature X and value v
● To classify a new item i: for each category Y compute the probability of being in category Y (1) times the probability of i’s feature values given category Y (product of 2’s). Pick the category with the highest result.
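A minimal sketch of both steps for the weather example, assuming the Bayes.csv rows are available as (region, coastal, category) tuples; the field names and tuple layout are assumptions for illustration:

```python
from collections import Counter, defaultdict

# rows: list of (region, coastal, category) tuples, e.g. loaded from Bayes.csv
def train(rows):
    n = len(rows)
    cat_counts = Counter(cat for _, _, cat in rows)
    prior = {cat: cnt / n for cat, cnt in cat_counts.items()}   # step 1: P(category)
    cond = defaultdict(float)                                   # step 2: P(feature=value | category)
    for region, coastal, cat in rows:
        cond[("region", region, cat)] += 1 / cat_counts[cat]
        cond[("coastal", coastal, cat)] += 1 / cat_counts[cat]
    return prior, cond

def classify(prior, cond, region, coastal):
    # pick the category maximizing P(Y) * P(region|Y) * P(coastal|Y)
    scores = {cat: p * cond[("region", region, cat)] * cond[("coastal", coastal, cat)]
              for cat, p in prior.items()}
    return max(scores, key=scores.get)
```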
Example: Predict temperature category from region and coastal. Show Bayes.csv
Coastal city in Northeast, probabilities:
  warm:   0.1  * 1    * 0    = 0
  comfy:  0.27 * 0.8  * 0    = 0
  cool:   0.28 * 0.44 * 0.13 = 0.016
  cold:   0.25 * 0.29 * 0.15 = 0.011
  frigid: 0.09 * 0    * 0.2  = 0
Non-coastal city in Southatlantic, probabilities:
  warm:   0.1  * 0    * 0.5  = 0
  comfy:  0.27 * 0.2  * 0.41 = 0.022
  cool:   0.28 * 0.56 * 0.13 = 0.020
  cold:   0.25 * 0.71 * 0    = 0
  frigid: 0.09 * 1    * 0    = 0
Some hidden math/logic, but mostly just arithmetic.
Disclaimer: Most presentations of the Naive Bayes algorithm (including the readings linked on the website) extend the classification step: “... for each category Y compute the probability of being in category Y (1) times the probability of the feature values given category Y (product of 2’s), divided by the probability of those feature values. Pick the category with the highest result.” Since that last value doesn’t change across categories, the division doesn’t change the outcome.
Classification using logistic regression
Recall: Logistic regression uses training data to compute a function f(x1,...,xn-1) that gives the probability of the result xn being “yes”. Generalize to probabilities of xn falling in a predefined set of categories. As with Naive Bayes, classify xn as belonging to the category with the highest probability.
Lots of hidden math.
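For comparison, a minimal sketch using scikit-learn's logistic regression, which handles multiple categories; X is assumed to be a list of [latitude, longitude] pairs and y the matching list of temperature categories (variable names and data loading are illustrative):

```python
from sklearn.linear_model import LogisticRegression

# X: list of [latitude, longitude] pairs; y: matching temperature categories,
# e.g. taken from the TempsCat data (loading not shown)
model = LogisticRegression()
model.fit(X, y)

print(model.predict_proba([[32.8, 96.8]]))   # probability of each category for Dallas
print(model.predict([[32.8, 96.8]]))         # category with the highest probability
```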
Clustering
Again multidimensional feature space, distance metric between items
Goal: Partition set of items into k groups (clusters) such that items within groups are “close to each other”
Unsupervised, no training data

Clustering using k-means
● Each group has a mean value
● For each item compute square of distance from mean value for its group
● Goal is to find k groups that minimize sum of those squares
Weather example: Cluster cities into six groups based on latitude/longitude
Distance is Euclidean distance in two-dimensional space
Show: Points.jpg, Clusters.jpg, ClusterMeans.jpg
Note clusters need not be of similar sizes.
As with k-nearest neighbors, no hidden complicated math, but efficient algorithm is non-trivial (actually, impossible).
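A minimal sketch of the standard k-means heuristic (Lloyd's algorithm) for the latitude/longitude example, assuming points is a list of (longitude, latitude) pairs; the variable names and the fixed iteration count are illustrative:

```python
import math
import random

# points: list of (longitude, latitude) pairs, one per city (loading not shown)
def kmeans(points, k=6, iterations=20):
    means = random.sample(points, k)                  # start from k random points as means
    for _ in range(iterations):
        # assign each point to the group with the closest mean
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda idx: math.dist(p, means[idx]))
            groups[j].append(p)
        # recompute each group's mean (keep the old mean if a group is empty)
        means = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else m
                 for g, m in zip(groups, means)]
    return means, groups
```

Lloyd's algorithm is only a heuristic: it converges to a local minimum of the sum of squares rather than the guaranteed optimum.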