Feature-based Discriminative Classifiers
Making features from text for discriminative NLP models
Christopher Manning
Classifiers
• A classifier is a function g which assigns an input datum d to one of |C| classes, c ∈ C: g: D → C
• The classes might be:
  • {PERSON, ORGANIZATION, LOCATION, O} for named entity recognition
  • {politics, sports, finance, technology, arts, leisure, …} for news
  • {spam, not spam} for an email message
  • {coreferent, not-coreferent} for a coreference candidate mention pair
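
A minimal sketch of the g: D → C signature in Python (the spam rule below is a hypothetical toy, not from the lecture):

# A classifier g: D -> C maps an input datum to one of |C| classes.
def g(email: str) -> str:
    # hypothetical toy rule for the {spam, not spam} task above
    return "spam" if "free money" in email.lower() else "not spam"

print(g("Claim your FREE MONEY now!"))   # -> spam
print(g("Meeting moved to 3pm"))         # -> not spam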
Example problem
• Classify a capitalized proper noun as a class: LOCATION, DRUG, PERSON
• For a data example d: taking Zantac
• We work by considering each class c for the word:
  • (LOCATION, taking Zantac)
  • (DRUG, taking Zantac)
  • (PERSON, taking Zantac)
• and using features to score each candidate classification

Features for a classifier
• Features f are elementary pieces of evidence that link aspects of what we observe (d) with a category c that we want to predict
• A feature is a function with a bounded real value: f: C × D → ℝ
• Common special case in NLP: binary features f: C × D → {0, 1}
Example binary features
• f1(c, d) ≡ [c = LOCATION ∧ w-1 = “in” ∧ isCapitalized(w)]
• f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]
[Figure: example snippets with classes and feature weights — “in Arcadia” (LOCATION) and “in Québec” (LOCATION) activate f1 (weight 1.8); “in Québec” also activates f2 (weight –0.6); “taking Zantac” (DRUG) activates f3 (weight 0.3); “saw Sue” (PERSON) activates none]
• Models will assign to each feature a weight:
  • A positive weight votes that this configuration is likely correct
  • A negative weight votes that this configuration is likely incorrect
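
A sketch of these three indicator features in Python (weights and snippets are the ones on the slide; the helper names simply mirror the notation above):

import unicodedata

def has_accented_latin_char(w: str) -> bool:
    # True if any letter decomposes into a base character plus an accent
    return any(unicodedata.decomposition(ch) for ch in w if ch.isalpha())

def f1(c, d):  # d = (previous word, word)
    w_prev, w = d
    return int(c == "LOCATION" and w_prev == "in" and w[:1].isupper())

def f2(c, d):
    _, w = d
    return int(c == "LOCATION" and has_accented_latin_char(w))

def f3(c, d):
    _, w = d
    return int(c == "DRUG" and w.endswith("c"))

d = ("in", "Québec")
for c in ["LOCATION", "DRUG", "PERSON"]:
    print(c, [f(c, d) for f in (f1, f2, f3)])
# LOCATION [1, 1, 0]   DRUG [0, 0, 1]   PERSON [0, 0, 0]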
Binary Features
• Very commonly, a feature specifies
  1. an indicator function – a yes/no boolean matching function – of properties of the input Φ, and
  2. a particular class:
  fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]
• Each feature picks out a data subset and suggests a label for it
• The decision about a data point is based only on the values of the features active at that point.
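
The Φ-times-class factorization can be written as a small feature factory (a hypothetical sketch of the definition above, with d = (previous word, word)):

def make_feature(phi, c_j):
    # f_i(c, d) = [Phi(d) and c = c_j]: value is 0 or 1
    return lambda c, d: int(phi(d) and c == c_j)

phi = lambda d: d[0] == "in" and d[1][:1].isupper()
f = make_feature(phi, "LOCATION")
print(f("LOCATION", ("in", "Québec")))   # 1
print(f("DRUG", ("in", "Québec")))       # 0 — same Phi, different class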
More General Features
• Features can be more general than just binary matching:
  • Can compute a real value from the input, e.g., log(word length)
  • Can match a set of values – e.g., perhaps a partial structure – across “classes”
  • This leads to structured classification, which is common in NLP, for example to match parse tree candidates, etc.
  • A discriminative parser can have features that match a tree with a unary S → VP
  • A coreference model can disfavor a cluster with different-gender items
• We define features (indicator functions) over data points
  • Features represent sets of data points which are distinctive enough to deserve model parameters.
  • Words, but also “word contains number”, “word ends with ing”, POS, syntactic structure, relation between two phrases, etc.

Building a Simple Discriminative Model
• We might simply encode each Φ feature as a unique String
  • A datum will give rise to a set of Strings: the active Φ features
  • Each feature fi(c, d) ≡ [Φ(d) ∧ c = cj] gets a real number weight
• We concentrate on Φ features, but there is one weight for each fi
• Features are normally added in big batches via feature templates
  • E.g., one feature template adds, ∀(i, j) observed: lastWord=wi ∧ c = cj
  • Another is: nextWord=wi ∧ c = cj. Each may add tens of thousands of features
• A model may be specified by the set of feature templates used
• Features are often added during model development to target errors
  • Often, the easiest things to think of are features that mark bad combinations
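
A hedged sketch of instantiating the two templates above as string-encoded Φ features (hypothetical function names):

def phi_features(words, i):
    # String-encode the active Phi features for the word at position i,
    # instantiating the lastWord/nextWord templates from the slide.
    feats = []
    if i > 0:
        feats.append("lastWord=" + words[i - 1])
    if i + 1 < len(words):
        feats.append("nextWord=" + words[i + 1])
    return feats

print(phi_features(["taking", "Zantac", "daily"], 1))
# ['lastWord=taking', 'nextWord=daily']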
Linear classifiers at classification time
• Linear function from feature sets {fi} to classes {c}.
• Assign a weight λi to each feature fi.
• We consider each class for an observed datum d
• For a pair (c, d), features vote with their weights: vote(c) = Σ λi fi(c, d)
  • For “in Québec”: vote(LOCATION) = 1.8 – 0.6 = 1.2, vote(DRUG) = 0.3, vote(PERSON) = 0
• Choose the class c which maximizes Σ λi fi(c, d) = LOCATION
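
A sketch of the vote-and-argmax rule over string-encoded Φ features, using the example weights (the Φ strings are a hypothetical encoding):

weights = {
    ("prevWord=in+isCapitalized", "LOCATION"):  1.8,   # lambda_1
    ("hasAccentedLatinChar",      "LOCATION"): -0.6,   # lambda_2
    ("endsWith=c",                "DRUG"):      0.3,   # lambda_3
}

def vote(c, active_phi):
    # vote(c) = sum_i lambda_i * f_i(c, d), summed over the active Phi features
    return sum(weights.get((phi, c), 0.0) for phi in active_phi)

active = ["prevWord=in+isCapitalized", "hasAccentedLatinChar", "endsWith=c"]  # "in Québec"
scores = {c: vote(c, active) for c in ["LOCATION", "DRUG", "PERSON"]}
print(scores)                       # {'LOCATION': 1.2, 'DRUG': 0.3, 'PERSON': 0.0}
print(max(scores, key=scores.get))  # LOCATION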
Feature-based softmax/maxent linear classifiers
How to put features into a classifier

Feature-Based Linear Classifiers
• Linear classifiers are a linear function from feature sets {fi} to classes {c}
• At test time, we consider each class c for a datum d
  • We generate a feature set {fi} for an observed datum–class pair (c, d)
  • Each feature fi has a weight λi
  • We then score each possible class assignment: vote(c) = Σ λi fi(c, d) = λ·f
  • We choose the class c which maximizes Σ λi fi(c, d)
• At training time, we have observed (c, d) pairs from labeled examples
  • We generate sets of features {fi(c, d)} for them
  • We use information about which features occur and don’t occur to set a weight λi for each feature
Example features
• f1(c, d) ≡ [c = LOCATION ∧ w-1 = “in” ∧ isCapitalized(w)]
• f2(c, d) ≡ [c = LOCATION ∧ ends(w, “c”)? hasAccentedLatinChar(w)]
• f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]
[Figure: the same snippets as before — “in Arcadia” and “in Québec” (LOCATION, weights 1.8 and –0.6), “taking Zantac” (DRUG, weight 0.3), “saw Sue” (PERSON)]

Maxent models (softmax, multiclass logistic, exponential, conditional log-linear, Gibbs)
• Make a probabilistic model from the linear combination Σ λi fi(c, d):
  P(c | d, λ) = exp(Σi λi fi(c, d)) / Σc′ exp(Σi λi fi(c′, d))
  • The exp makes the votes positive; the sum over c′ normalizes the votes
• P(LOCATION | in Québec) = e^1.8 e^–0.6 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.586
• P(DRUG | in Québec) = e^0.3 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.238
• P(PERSON | in Québec) = e^0 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.176
• The weights are the parameters of the probability model, combined via a “softmax” function
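
The softmax computation above in a few lines of Python, reproducing the slide's numbers from the votes:

from math import exp

votes = {"LOCATION": 1.8 - 0.6, "DRUG": 0.3, "PERSON": 0.0}   # from the figure above
Z = sum(exp(v) for v in votes.values())                        # normalizer: sum over classes c'
probs = {c: exp(v) / Z for c, v in votes.items()}
print({c: round(p, 3) for c, p in probs.items()})
# {'LOCATION': 0.586, 'DRUG': 0.238, 'PERSON': 0.176}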
Feature-Based Linear Classifiers
• Maxent models:
  • Given this model form, we choose parameters {λi} that maximize the conditional likelihood of the data according to this model (as discussed later): maxΛ P(C | D, Λ)
  • We construct not only classifications, but probability distributions over classifications.
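
The training details come later in the lecture; as a preview, a standard fact about this model form is that the gradient of the conditional log-likelihood for each λi is the empirical count of fi minus its expected count under the model, which yields a simple (unoptimized, hypothetical) gradient-ascent sketch:

from math import exp

def class_probs(d, classes, features, lam):
    votes = {c: sum(l * f(c, d) for l, f in zip(lam, features)) for c in classes}
    Z = sum(exp(v) for v in votes.values())
    return {c: exp(v) / Z for c, v in votes.items()}

def cll_gradient_step(data, classes, features, lam, lr=0.1):
    # d/d(lambda_i) of the conditional log-likelihood:
    # empirical count of f_i minus its expected count under the model
    grad = [0.0] * len(lam)
    for c_true, d in data:
        p = class_probs(d, classes, features, lam)
        for i, f in enumerate(features):
            grad[i] += f(c_true, d) - sum(p[c] * f(c, d) for c in classes)
    return [l + lr * g for l, g in zip(lam, grad)]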
Feature-Based Linear Classifiers
• There are other (good!) ways to choose weights for features:
  • Perceptron: find a currently misclassified example, and nudge the weights in the direction that corrects the classification
  • Margin-based methods (Support Vector Machines)
  • Boosting algorithms
• But these methods are not as trivial to interpret as probability distributions over classes
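
A minimal sketch of the multiclass perceptron update described above (hypothetical toy interface, same feature/vote conventions as the earlier sketches):

def perceptron_update(c_true, d, classes, features, lam, lr=1.0):
    # Score every class; if the argmax is wrong, nudge the weights
    # toward the true class and away from the predicted one.
    votes = {c: sum(l * f(c, d) for l, f in zip(lam, features)) for c in classes}
    c_pred = max(votes, key=votes.get)
    if c_pred != c_true:
        for i, f in enumerate(features):
            lam[i] += lr * (f(c_true, d) - f(c_pred, d))
    return lam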