Probability Review and Naïve Bayes

Probability Review and Naïve Bayes
Instructor: Alan Ritter
Some slides adapted from Dan Jurafsky and Brendan O'Connor
What is Probability?
• "The probability the coin will land heads is 0.5"
– Q: what does this mean?
• 2 interpretations:
– Frequentist (repeated trials)
• If we flip the coin many times…
– Bayesian
• We believe there is equal chance of heads/tails
• Advantage: events that do not have long-term frequencies
Q: What is the probability the polar ice caps will melt by 2050?
Probability Review
∑_x P(X = x) = 1
Conditional Probability:
P(A | B) = P(A, B) / P(B)
Chain Rule:
P(A | B) P(B) = P(A, B)
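As a quick sanity check (not from the slides), here is a small Python sketch that verifies these identities on a made-up joint distribution over two binary variables:

# Hypothetical joint distribution P(A, B) over two binary variables (made-up numbers)
P_joint = {
    (True, True): 0.2,
    (True, False): 0.3,
    (False, True): 0.1,
    (False, False): 0.4,
}

# Marginalize: P(B = True) = sum over a of P(A = a, B = True)
P_B = sum(p for (a, b), p in P_joint.items() if b)

# Conditional probability: P(A = True | B = True) = P(A = True, B = True) / P(B = True)
P_A_given_B = P_joint[(True, True)] / P_B

# Chain rule: P(A | B) P(B) should equal P(A, B)
assert abs(P_A_given_B * P_B - P_joint[(True, True)]) < 1e-12
print(P_B, P_A_given_B)   # 0.3, 0.666...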
Probability Review
∑_x P(X = x, Y) = P(Y)
Disjunction / Union:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Negation:
P(¬A) = 1 − P(A)
Bayes Rule
Hypothesis (unknown): H
Data (observed evidence): D
Generative model of how the hypothesis causes the data: P(D | H)
Bayesian inference: P(H | D)
Bayes rule tells us how to flip the conditional
Reason from effects to causes
Useful if you assume a generative model for your data
P(H | D) = P(D | H) P(H) / P(D)
Bayes Rule
Bayes rule tells us how to flip the conditional
Reason from effects to causes
Useful if you assume a generative model for your data
P(H | D) = P(D | H) P(H) / P(D)
Posterior = Likelihood × Prior / Normalizer
Bayes Rule
Bayes rule tells us how to flip the conditional
Reason from effects to causes
Useful if you assume a generative model for your data
P(H | D) = P(D | H) P(H) / ∑_h P(D | h) P(h)
Posterior = Likelihood × Prior / Normalizer, where the normalizer P(D) = ∑_h P(D | h) P(h)
Bayes Rule
Bayes rule tells us how to flip the conditional
Reason from effects to causes
Useful if you assume a generative model for your data
P(H | D) ∝ P(D | H) P(H)
Posterior ∝ Likelihood × Prior ("proportional to": doesn't sum to 1)
Bayes Rule Example
• There is a disease that affects a tiny fraction of the population (0.01%)
• Symptoms include a headache and stiff neck
– 99% of patients with the disease have these symptoms
• 1% of the general population has these symptoms
• Q: assume you have the symptoms, what is your probability of having the disease?
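For reference, one way to work out the answer with Bayes rule, using the numbers from the slide (a sketch, worked in Python):

# Numbers from the slide
p_disease = 0.0001                 # 0.01% of the population has the disease
p_symptoms_given_disease = 0.99    # 99% of patients with the disease have the symptoms
p_symptoms = 0.01                  # 1% of the general population has the symptoms

# Bayes rule: P(disease | symptoms) = P(symptoms | disease) * P(disease) / P(symptoms)
p_disease_given_symptoms = p_symptoms_given_disease * p_disease / p_symptoms
print(p_disease_given_symptoms)    # 0.0099, i.e. still only about 1%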
Text Classification
Is this Spam?
Who wrote which Federalist papers?
• 1787–8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods
James Madison
Alexander Hamilton
What is the subject of this article?
MEDLINE Article
MeSH Subject Category Hierarchy
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
Text Classification: definition
• Input:
– a document d
– a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
– spam: black-list-address OR ("dollars" AND "have been selected")
• Accuracy can be high
– If rules carefully refined by expert
• But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
• Input:
– a document d
– a fixed set of classes C = {c1, c2, …, cJ}
– a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
– a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
• Any kind of classifier
– Naïve Bayes
– Logistic regression
– Support-vector machines
– k-Nearest Neighbors
– …
Naïve Bayes Intuition
• Simple ("naïve") classification method based on Bayes rule
• Relies on very simple representation of document
– Bag of words
Bag of words for document classification
[Figure: a test document of unknown class ("?") is compared against per-class bags of words, e.g. Machine Learning (learning, training, algorithm, shrinkage, network…), NLP (parser, tag, training, translation, language…), Garbage Collection (garbage, collection, memory, optimization, region…), Planning (planning, temporal, reasoning, plan, language…), GUI (…)]
Bayes' Rule Applied to Documents and Classes
• For a document d and a class c
P(c | d) = P(d | c) P(c) / P(d)
Naïve Bayes Classifier (I)
MAP is "maximum a posteriori" = most likely class
cMAP = argmax_{c∈C} P(c | d)
     = argmax_{c∈C} P(d | c) P(c) / P(d)      (Bayes rule)
     = argmax_{c∈C} P(d | c) P(c)             (dropping the denominator)
Naïve Bayes Classifier (II)
cMAP = argmax_{c∈C} P(d | c) P(c)
     = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)
Naïve Bayes Classifier (IV)
cMAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)
• P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available
• P(c): how often does this class occur? We can just count the relative frequencies in a corpus
Multinomial Naïve Bayes Independence Assumptions
P(x1, x2, …, xn | c)
• Bag of Words assumption: assume position doesn't matter
• Conditional Independence: assume the feature probabilities P(xi | cj) are independent given the class c.
Multinomial Naïve Bayes Classifier
cMAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)
cNB = argmax_{c∈C} P(cj) ∏_{x∈X} P(x | c)
Applying Multinomial Naive Bayes Classifiers to Text Classification
positions ← all word positions in test document
cNB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)
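A minimal Python sketch of this prediction rule, assuming hypothetical dictionaries prior[c] = P(c) and cond_prob[c][w] = P(w | c) estimated from training data (the names and the out-of-vocabulary handling are illustrative, not a reference implementation):

def predict(doc_tokens, prior, cond_prob):
    # Score every class by P(c) * product over word positions of P(x_i | c)
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = prior[c]
        for w in doc_tokens:
            if w in cond_prob[c]:      # skip words outside the vocabulary
                score *= cond_prob[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

Later slides move this product into log space to avoid floating point underflow.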
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
– simply use the frequencies in the data
P̂(cj) = doccount(C = cj) / Ndoc
P̂(wi | cj) = count(wi, cj) / ∑_{w∈V} count(w, cj)
Parameter estimation
P̂(wi | cj) = count(wi, cj) / ∑_{w∈V} count(w, cj)
= fraction of times word wi appears among all words in documents of topic cj
• Create mega-document for topic j by concatenating all docs in this topic
– Use frequency of w in mega-document
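A short counting sketch in Python, assuming training data as a list of (tokens, label) pairs (variable names are illustrative):

from collections import Counter, defaultdict

def count_words_per_class(training_docs):
    # training_docs: list of (list_of_tokens, class_label) pairs
    counts = defaultdict(Counter)       # counts[c][w] = count(w, c)
    for tokens, c in training_docs:
        counts[c].update(tokens)        # concatenating docs = one "mega-document" per class
    return counts

# Maximum likelihood estimate: P_hat(w | c) = counts[c][w] / sum(counts[c].values())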
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?
• Zero probabilities cannot be conditioned away, no matter the other evidence!
cMAP = argmax_c P̂(c) ∏_i P̂(xi | c)
Laplace (add-1) smoothing for Naïve Bayes
P̂(wi | c) = (count(wi, c) + 1) / ∑_{w∈V} (count(w, c) + 1)
          = (count(wi, c) + 1) / ( (∑_{w∈V} count(w, c)) + |V| )
Multinomial Naïve Bayes: Learning
• Calculate P(cj) terms
– For each cj in C do
docsj ← all docs with class = cj
P(cj) ← |docsj| / |total # documents|
Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate P(wk | cj) terms
– Textj ← single doc containing all docsj
– For each word wk in Vocabulary
nk ← # of occurrences of wk in Textj
P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
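Putting the two learning steps together, a minimal training sketch in Python with add-α smoothing (a hypothetical helper that pairs with the earlier predict sketch; not the instructor's reference code):

from collections import Counter, defaultdict

def train_nb(training_docs, alpha=1.0):
    # training_docs: list of (list_of_tokens, class_label) pairs
    vocab = {w for tokens, _ in training_docs for w in tokens}
    doc_counts = Counter(c for _, c in training_docs)
    word_counts = defaultdict(Counter)
    for tokens, c in training_docs:
        word_counts[c].update(tokens)              # mega-document counts per class

    # P(c_j) = |docs_j| / |total # documents|
    prior = {c: n / len(training_docs) for c, n in doc_counts.items()}

    # P(w_k | c_j) = (n_k + alpha) / (n + alpha * |Vocabulary|)
    cond_prob = {}
    for c in doc_counts:
        n = sum(word_counts[c].values())
        cond_prob[c] = {w: (word_counts[c][w] + alpha) / (n + alpha * len(vocab))
                        for w in vocab}
    return prior, cond_prob, vocab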
Exercise
Naïve Bayes Classification: Practical Issues
cMAP = argmax_c P(c | x1, …, xn)
     = argmax_c P(x1, …, xn | c) P(c)
     = argmax_c P(c) ∏_{i=1}^{n} P(xi | c)
• Multiplying together lots of probabilities
• Probabilities are numbers between 0 and 1
• Q: What could go wrong here?
Working with probabilities in log space
Log Identities (review)
log(a × b) = log(a) + log(b)
log(a / b) = log(a) − log(b)
log(a^n) = n log(a)
Naïve Bayes with Log Probabilities
cMAP = argmax_c P(c | x1, …, xn)
     = argmax_c P(c) ∏_{i=1}^{n} P(xi | c)
     = argmax_c log( P(c) ∏_{i=1}^{n} P(xi | c) )
     = argmax_c ( log P(c) + ∑_{i=1}^{n} log P(xi | c) )
Naïve Bayes with Log Probabilities
cMAP = argmax_c ( log P(c) + ∑_{i=1}^{n} log P(xi | c) )
• Q: Why don't we have to worry about floating point underflow anymore?
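The earlier prediction sketch, rewritten in log space under the same assumptions (hypothetical prior and cond_prob dictionaries):

import math

def predict_log(doc_tokens, prior, cond_prob):
    # Sum of log probabilities instead of a product of probabilities
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for w in doc_tokens:
            if w in cond_prob[c]:
                score += math.log(cond_prob[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class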
What if we want to calculate posterior log-probabilities?
P(c | x1, …, xn) = P(c) ∏_{i=1}^{n} P(xi | c) / ∑_{c'} P(c') ∏_{i=1}^{n} P(xi | c')
log P(c | x1, …, xn) = log [ P(c) ∏_{i=1}^{n} P(xi | c) / ∑_{c'} P(c') ∏_{i=1}^{n} P(xi | c') ]
                     = log P(c) + ∑_{i=1}^{n} log P(xi | c) − log [ ∑_{c'} P(c') ∏_{i=1}^{n} P(xi | c') ]
Log Exp Sum Trick: motivation
• We have: a bunch of log probabilities.
– log(p1), log(p2), log(p3), … log(pn)
• We want: log(p1 + p2 + p3 + … + pn)
• We could convert back from log space, sum, then take the log.
– If the probabilities are very small, this will result in floating point underflow
Log Exp Sum Trick:
log[ ∑_i exp(xi) ] = xmax + log[ ∑_i exp(xi − xmax) ]
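A direct Python sketch of the trick (scipy.special.logsumexp offers an equivalent, well-tested implementation):

import math

def log_sum_exp(xs):
    # Computes log(sum_i exp(x_i)) stably by factoring out the largest term
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))

# Example: each exp(-1000) underflows to 0.0, but the trick still works
print(log_sum_exp([-1000.0, -1000.0]))   # -1000 + log(2) ≈ -999.31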
Another issue: Smoothing
P̂(wi | c) = (count(w, c) + 1) / (∑_{w'∈V} count(w', c) + |V|)
Another issue: Smoothing
Alpha doesn't necessarily need to be 1 (it's a hyperparameter)
P̂(wi | c) = (count(w, c) + α) / (∑_{w'∈V} count(w', c) + α|V|)
Another issue: Smoothing
Can think of alpha as a "pseudocount": an imaginary number of times this word has been seen.
P̂(wi | c) = (count(w, c) + α) / (∑_{w'∈V} count(w', c) + α|V|)
Another issue: Smoothing
P̂(wi | c) = (count(w, c) + α) / (∑_{w'∈V} count(w', c) + α|V|)
• Q: What if alpha = 0?
• Q: What if alpha = 0.000001?
• Q: What happens as alpha gets very large?
Overfitting
• Model cares too much about the training data
• How to check for overfitting?
– Training vs. test accuracy
• Pseudocount parameter combats overfitting
Q: how to pick Alpha?
• Split train vs. test
• Try a bunch of different values
• Pick the value of alpha that performs best
• What values to try? Grid search
– (10^-2, 10^-1, …, 10^2)
[Figure: accuracy plotted against α; use the α with the highest accuracy ("use this one")]
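A sketch of that grid search in Python, reusing the hypothetical train_nb and predict_log helpers from the earlier sketches and a held-out split:

def accuracy(docs, prior, cond_prob):
    # docs: list of (list_of_tokens, class_label) pairs
    correct = sum(predict_log(tokens, prior, cond_prob) == c for tokens, c in docs)
    return correct / len(docs)

def pick_alpha(train_docs, heldout_docs, grid=(1e-2, 1e-1, 1, 10, 100)):
    best_alpha, best_acc = None, -1.0
    for alpha in grid:
        prior, cond_prob, _ = train_nb(train_docs, alpha=alpha)
        acc = accuracy(heldout_docs, prior, cond_prob)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha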
Data Splitting
• Train vs. Test
• Better:
– Train (used for fitting model parameters)
– Dev (used for tuning hyperparameters)
– Test (reserve for final evaluation)
• Cross-validation
Feature Engineering
• What is your word/feature representation?
– Tokenization rules: splitting on whitespace?
– Uppercase is the same as lowercase?
– Numbers?
– Punctuation?
– Stemming?
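One illustrative set of answers to these questions, written as a small Python tokenizer (the specific choices here are assumptions, not the course's prescribed pipeline):

import re

def tokenize(text, lowercase=True, strip_punct=True):
    # Split on whitespace; optionally lowercase and strip punctuation.
    # Handling of numbers and stemming are left as further design choices.
    if lowercase:
        text = text.lower()
    tokens = text.split()
    if strip_punct:
        tokens = [re.sub(r"[^\w']", "", t) for t in tokens]
        tokens = [t for t in tokens if t]
    return tokens

print(tokenize("Full of zany characters, and richly applied satire!"))
# ['full', 'of', 'zany', 'characters', 'and', 'richly', 'applied', 'satire']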