Slides - Sameer Singh

Text Classification 1
Prof. Sameer Singh
CS 295: STATISTICAL NLP
WINTER 2017
January 12, 2017
Based on slides from Nathan Schneider, Noah Smith, Dan Klein and everyone else they copied from.
Text Classification 1
Introduction to Text Classification
Naive Bayes Classification
Course Projects
Sentiment Analysis
Filled with horrific dialogue, laughable characters, a laughable plot, and really no interesting stakes during this film, "Star Wars Episode I: The Phantom Menace" is not at all what I wanted from a film that is supposed to be the huge opening to the segue into the fantastic Original Trilogy. The positives include the score, the sound…
Other Examples
• Reviews of films, restaurants, products: positive vs. negative
• Amazon reviews data, IMDB reviews data
• Library-like subjects (e.g., the Dewey decimal system)
• News stories: politics vs. sports vs. business vs. technology...
• 20 Newsgroups data
• Author attributes: identity, political stance, gender, age, ...
• Email: spam vs. not
• Gmail: important, promotion, updates, social media, …
• What is the reading level of a piece of text?
• Automatic graders?
• How influential will a scientific paper be?
• Advertisement recommendations…
• Will a piece of proposed legislation pass?
• Identify the presidential candidate from speeches
• Post recommendations / Fake news detection
• Can majorly influence the world!
Formal Setup
Classification
Supervised Learning
Training
Algorithm
Evaluation: Contingency Table
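A minimal sketch of a contingency table for a binary classifier, assuming labels are encoded as 1 (positive) and 0 (negative). The function name `contingency_table` and the toy labels are illustrative and do not come from the slides.

```python
from collections import Counter

def contingency_table(gold, pred):
    """Count (gold, predicted) label pairs for a binary task."""
    counts = Counter(zip(gold, pred))
    return {
        "tp": counts[(1, 1)],  # gold positive, predicted positive
        "fn": counts[(1, 0)],  # gold positive, predicted negative
        "fp": counts[(0, 1)],  # gold negative, predicted positive
        "tn": counts[(0, 0)],  # gold negative, predicted negative
    }

# Toy data (hypothetical): 3 positive and 5 negative documents.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 0, 1, 0, 0, 1, 0, 0]
table = contingency_table(gold, pred)
print(table)
```

The four cells are the raw material for accuracy, precision, and recall.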
Accuracy
Problem
• Class imbalance hurts.
• Getting one class right matters more than the other (retrieval)
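To make the class-imbalance problem concrete, here is a small sketch with made-up data: 5 positive documents out of 100, and a trivial classifier that always predicts the negative class. The numbers are hypothetical and chosen only for illustration.

```python
def accuracy(gold, pred):
    """Fraction of documents whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Hypothetical imbalanced data: 5 positives, 95 negatives.
gold = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a classifier that ignores its input

acc = accuracy(gold, always_negative)
positives_found = sum(g == 1 and p == 1 for g, p in zip(gold, always_negative))
print(acc, positives_found)
```

The accuracy looks high even though not a single positive document is retrieved, which is why accuracy alone can mislead when one class matters more.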
Precision and Recall
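A short sketch of the standard definitions, assuming counts of true positives (`tp`), false positives (`fp`), and false negatives (`fn`) for the class of interest. The function name and the example counts are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn).

    Returns 0.0 for a measure whose denominator is zero (a convention
    chosen for this sketch, not something the slides specify).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts: 8 correct positives, 2 false alarms, 8 misses.
precision, recall = precision_recall(tp=8, fp=2, fn=8)
print(precision, recall)
```

Precision asks how many of the predicted positives were right; recall asks how many of the actual positives were found.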
> 2 Classes?
Macro-averaged Measures
Micro-averaged Measures
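The difference between the two averages can be sketched with precision as the measure. Macro-averaging computes the measure per class and then averages; micro-averaging pools the counts over all classes first. The class names and counts below are made up, and the sketch assumes every class has at least one predicted document.

```python
def macro_micro_precision(per_class):
    """per_class maps a label to its (true positives, false positives)."""
    # Macro: average of per-class precision, so every class counts equally.
    macro = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)
    # Micro: pool the counts first, so frequent classes dominate.
    total_tp = sum(tp for tp, _ in per_class.values())
    total_fp = sum(fp for _, fp in per_class.values())
    micro = total_tp / (total_tp + total_fp)
    return macro, micro

# Hypothetical counts: one frequent class and two rare ones.
counts = {"politics": (90, 10), "sports": (5, 5), "business": (1, 3)}
macro, micro = macro_micro_precision(counts)
print(macro, micro)
```

With these counts the micro average is pulled toward the frequent class, while the macro average exposes the weak performance on the rare classes.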
Statistical Significance
McNemar’s Test, Psychometrika (1947)
More tests in Smith book, appendix B
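A minimal sketch of McNemar's test for comparing two classifiers on the same test set. It uses the chi-square approximation without a continuity correction, which is only reasonable when the number of disagreements is not small. The function name and the toy predictions are assumptions made for illustration.

```python
import math

def mcnemar(gold, pred_a, pred_b):
    """Return (statistic, p-value) from the documents where A and B disagree."""
    # b: A is right and B is wrong; c: A is wrong and B is right.
    b = sum(a == g and p != g for g, a, p in zip(gold, pred_a, pred_b))
    c = sum(a != g and p == g for g, a, p in zip(gold, pred_a, pred_b))
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree on correctness
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical predictions on 40 documents, all with gold label 1.
gold = [1] * 40
pred_a = [1] * 35 + [0] * 5   # A is wrong on the last 5 documents
pred_b = [0] * 15 + [1] * 25  # B is wrong on the first 15 documents
stat, p_value = mcnemar(gold, pred_a, pred_b)
print(stat, p_value)
```

Only the documents on which exactly one classifier is correct contribute to the statistic; documents both get right or both get wrong carry no evidence about which classifier is better.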
Classification using Joint Probability
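One standard way to write classification with a joint probability, given as a sketch. The notation is assumed here, since the slide's own equation did not survive extraction: x is the document, y is a candidate label, and \hat{y} is the predicted label.

```latex
\hat{y} = \arg\max_{y} P(y \mid x)
        = \arg\max_{y} \frac{P(x, y)}{P(x)}
        = \arg\max_{y} P(x, y)
        = \arg\max_{y} P(y)\, P(x \mid y)
```

The denominator P(x) does not depend on y, so it can be dropped without changing the argmax.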
Naïve Bayes Classifier
Two assumptions
• Word ordering does not matter (Bag of Words)
• Words are independent given category
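Under these two assumptions the class-conditional probability of a document factorizes over its words. The following is a sketch in assumed notation: the document x consists of words w_1, ..., w_n, y is the category, and \hat{y} is the predicted category.

```latex
P(x \mid y) = P(w_1, \dots, w_n \mid y) = \prod_{i=1}^{n} P(w_i \mid y)
\qquad\Rightarrow\qquad
\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(w_i \mid y)
```

The product is unchanged by reordering the words, which is the bag-of-words assumption; each factor conditions only on y, which is the conditional independence assumption.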
Estimation of Parameters
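A minimal sketch of estimating Naïve Bayes parameters by counting, assuming documents are already tokenized into word lists. The add-alpha smoothing constant `alpha`, the function names, and the toy documents are assumptions for illustration and are not taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log P(y) and log P(w | y) from tokenized documents."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
        vocab.update(tokens)
    # Prior: relative frequency of each label.
    log_prior = {y: math.log(n / len(labels)) for y, n in label_counts.items()}
    # Likelihood: smoothed relative frequency of each word within a label.
    log_lik = {}
    for y in label_counts:
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_lik[y] = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(tokens, log_prior, log_lik):
    """Pick the label maximizing log P(y) + sum_i log P(w_i | y).

    Words outside the training vocabulary are skipped.
    """
    def score(y):
        return log_prior[y] + sum(log_lik[y][w] for w in tokens if w in log_lik[y])
    return max(log_prior, key=score)

# Toy training set (hypothetical).
docs = [["great", "score"], ["great", "film"],
        ["horrific", "dialogue"], ["laughable", "plot"]]
labels = ["pos", "pos", "neg", "neg"]
log_prior, log_lik = train_naive_bayes(docs, labels)
label = predict(["great", "plot", "great"], log_prior, log_lik)
print(label)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a word unseen with some label from driving that label's probability to zero.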
Problem with Naïve Bayes
Linear Models
Naïve Bayes as a Linear Model
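A sketch of why Naïve Bayes is a linear model, in assumed notation: x is a document with words w_1, ..., w_n, V is the vocabulary, count(w, x) is the number of times w occurs in x, and f(x) is the vector of those counts. Taking the log of the Naïve Bayes score for label y gives

```latex
\log P(y) + \sum_{i=1}^{n} \log P(w_i \mid y)
  = \log P(y) + \sum_{w \in V} \mathrm{count}(w, x)\, \log P(w \mid y)
  = b_y + \theta_y \cdot f(x)
```

with weights \theta_{y,w} = \log P(w \mid y) and bias b_y = \log P(y). The score is linear in the word-count features, and the prediction is the label with the highest score.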
Group Projects
Groups for the Project
• Ideal team size is 3
• Absolute maximum of 4
• Fewer than 3 if I approve (ongoing work)
Submit Four Reports
• First two reports are very short (1 page)
• Final report matters the most
How do I know it’s NLP?
• Output is any phrase or sentence, definitely!
• Input is any phrase or sentence
• Output is a sequence or structure (yes!)
• Classification: only if over words or phrases
• Output is linguistic classes/structures (yes!)
Scope of Work
Novelty
• New Task/Data
• New Method/Models
• New Application of Existing Method to Existing Task
But not too much!
• You do not have much time!
• Aim to have the whole pipeline done soon
• Keep the “scale” of the data small, sub-sample if needed
• Better to have a complete finished report than grand ideas that did not work
Reuse
• You do not have to code everything
• Exploit existing code, datasets, libraries, web services
• Do not reinvent all the wheels!
Example 1: What’s the word...
definition of a word from the dictionary → Machine Learning (LSTM) → the word itself
What’s the word for someone using pretentious words? → lexiphanic
This can be a cool Twitter bot!
Evaluation
• Accuracy of guessing the word, using definitions from a different dictionary?
• Baselines: Google, reversedictionary.org, …
Example 2: SQuAD
Tesla was the fourth of five children. He
had an older brother named Dane and
three sisters, Milka, Angelina and Marica.
Dane was killed in a horse-riding accident
when Nikola was five. In 1861, Tesla
attended the "Lower" or "Primary" School
in Smiljan where he studied German,
arithmetic, and religion. In 1862, the Tesla
family moved to Gospić, Austrian Empire,
where Tesla's father worked as a pastor.
Nikola completed "Lower" or "Primary"
School, followed by the "Lower Real
Gymnasium" or "Normal School."
How many siblings did Tesla have?
four
What was Tesla’s brother’s name?
Dane
What happened to Dane?
killed in a horse-riding accident
https://rajpurkar.github.io/SQuAD-explorer/
Datasets and Papers
Data
• Search Kaggle, Quora, etc. for large text datasets
• See recent papers in NLP for released datasets
• Look for “shared tasks”, “challenges”, workshops
• Links to some existing datasets coming to website soon
Papers
• NLP Conferences: ACL, EMNLP, NAACL
• ML Conferences: NIPS, ICML, ICLR, AAAI
• Data focused venues: TREC/TAC, SemEval, CoNLL
• Workshops at these conferences: interesting directions
• More papers coming soon to the website
Writing the Pitch
Team
• Team name and members
• Single sentence description for each member
• (approximately) what they will do
• Single sentence on what makes your team diverse
Project
• Motivation and Problem Description
• Planned approach: tentative
• Evaluation: usually, most important
Appointment
• If your team has 1 or 2 members, meet me before/on January 17 (otherwise no need)
• Every group has to meet afterwards to discuss the project
Upcoming…
Homework
• Homework 1 is up!
• Next lectures will continue with more details
• Sign up for the Kaggle account (@uci.edu email)
• Due: January 26, 2017
Project
• Project pitch is due January 23, 2017!
• Start assembling teams now! (use Piazza)
• Start looking at papers, data, etc. for ideas