slides

Introduction
Candidate words
Crowdsourcing
Machine learning
Conclusion
Crowdsourcing slang identification and transcription in Twitter language
Benjamin Milde
February 5, 2013
Introduction
Twitter - an abundant corpus
• millions of new tweets daily
• mostly informal texts
• 140-character limit → abbreviations, (intentional) misspellings, slang
• study (internet) slang!
@seungriyaaa LOL he does have that lmfao XD rofl hehe :)
The fact my mum writes ’lmfao’ ’pmsl’ ’loooool’ ’rofl’ to her
friends makes me cringe
I loveeeeeeee my beautiful girlfriend <3
...
Candidate words
Use OOV words for the candidate list
• Out-of-vocabulary (OOV) words
• = Candidate list
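The OOV step can be sketched as a frequency count followed by a vocabulary filter. This is only a minimal illustration: the regex tokenizer, the function names `tokenize` and `oov_candidates`, and the toy tweets and reference vocabulary are assumptions, not the pipeline actually used for the talk.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens (a rough stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def oov_candidates(tweets, reference_vocab, min_freq=1):
    """Return (word, frequency) pairs for words missing from reference_vocab, most frequent first."""
    counts = Counter(tok for tweet in tweets for tok in tokenize(tweet))
    return [
        (word, freq)
        for word, freq in counts.most_common()
        if word not in reference_vocab and freq >= min_freq
    ]


# Toy data (hypothetical): the reference vocabulary stands in for the Simple Wikipedia word list.
tweets = ["LOL he does have that lmfao", "idk what that is lol", "idk lol"]
reference_vocab = {"he", "does", "have", "that", "what", "is"}

candidates = oov_candidates(tweets, reference_vocab)
print(candidates)
```

Raising `min_freq` would restrict the list to words that occur often enough to be worth paying for a crowd judgement.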
Some common OOVs
Word          Freq.     Word      Freq.
lmao          70785     booze     923
ima           44999     askin     922
idk           43175     bikin     922
hahah         37092     merch     919
tweeting      31835     twitpic   919
hahahaha      29657     ganun     919
retweet       27620     blushes   916
awh           25867     mahone    915
followback    25596     macam     914
smh           22973     dieing    911
follback      22111     uber      910
directioner   20573     bitchy    908
sweetie       19311     ahhaha    906
dnt           19045     heya      906
beliebers     18331     quid      900
Table: Most common out-of-vocabulary (OOV) words in a Twitter corpus of 25 million tweets (summer 2012), i.e. words not in the Simple Wikipedia corpus.
Simple Wikipedia: articles with very common slang terms
Crowdsourcing
Crowdsourcing slang identification and transcription: HIT design
The word is 'jedhead' as found in these examples:
omg amazing ! wet hair ! d i will vote for you guys !! d i am a proud *jedhead* ! d love you !!
so proud of my boys ! im proud to call myself a *jedhead* !
i will always love and support you guys ! kay ? remember that ! forever *jedhead* !
please dont be sad or disappointed you really did your best and every *jedhead* knows that were all so proud of you
Please select a word type for 'jedhead' (required)
Abbreviation / Alternate spelling or misspelled (ex. lol, yr -> your, gonna -> going to, loev you -> love you )
Slang word (slackling -> not working, beliebers -> fans of justin bieber)
Different language (this word is clearly from a different language: Spanish, German, French ...)
Proper name / Brand / Music / Movie title / Website / Family names etc. (smith, nirvana, netflix, youtube, ...)
Interjection / fillers (argh, mh, eh, ah, oh, aha, haha, ...)
Something else / not sure
If you select 'Something else / not sure', please provide a guess in the textbox below and make sure you tried to google the word to infer its meaning
Transcription in normal language, one or more words (mandatory!!!) (required)
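Building one HIT item from a candidate word could look roughly like the sketch below: pick a few tweets that contain the word and mark the word with asterisks, as in the examples above. The function name `build_hit_item`, the dictionary layout, and the cap of four examples are assumptions made for illustration.

```python
import re


def build_hit_item(word, tweets, max_examples=4):
    """Collect up to max_examples tweets containing `word`, with the word wrapped in asterisks."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    examples = []
    for tweet in tweets:
        if pattern.search(tweet):
            # Keep the original casing of the match and surround it with '*'.
            examples.append(pattern.sub(lambda m: "*" + m.group(0) + "*", tweet))
        if len(examples) == max_examples:
            break
    return {"word": word, "examples": examples}


item = build_hit_item(
    "jedhead",
    [
        "so proud of my boys ! im proud to call myself a jedhead !",
        "an unrelated tweet",
        "remember that ! forever jedhead !",
    ],
)
print(item["examples"])
```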
Results I
• $150, 1 week, 1900 finished items, 3 judgements each
• → 7.8 cents per finished item
• workers earned $1.68 per hour on average (7 cents for 5 transcriptions)
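The hourly figure implies a worker throughput, which the following short calculation works out. It uses only the numbers on this slide; the variable names are illustrative, and it assumes every paid task contains exactly 5 transcriptions.

```python
# All amounts in cents to keep the arithmetic exact.
HOURLY_EARNINGS_CENTS = 168      # $1.68 per hour on average
PAY_PER_TASK_CENTS = 7           # 7 cents per task
TRANSCRIPTIONS_PER_TASK = 5      # one task = 5 transcriptions (assumed constant)

tasks_per_hour = HOURLY_EARNINGS_CENTS / PAY_PER_TASK_CENTS
transcriptions_per_hour = tasks_per_hour * TRANSCRIPTIONS_PER_TASK
seconds_per_transcription = 3600 / transcriptions_per_hour

print(tasks_per_hour, transcriptions_per_hour, seconds_per_transcription)
```

Under these assumptions an average worker completed 24 tasks, or 120 transcriptions, per hour, which is about 30 seconds per word.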
Results II
Category                       Percent
Abbreviation / Alt. spelling   58%
Slang word                     12%
Something else / not sure      12%
Different language             10%
Proper name                    6%
Interjection                   2%
Table: Distribution of categories for the candidate words.
Agreement on category: 79.8%
Agreement on transcription: 58.4%
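With 3 judgements per item, one simple way to aggregate them is a majority vote, with agreement measured as the share of judgements that match the majority label. The slides do not say how the agreement figures were computed, so the definition below, the function names, and the toy judgements are all assumptions.

```python
from collections import Counter


def majority_label(judgements):
    """Return (most frequent label, share of judgements that chose it)."""
    label, votes = Counter(judgements).most_common(1)[0]
    return label, votes / len(judgements)


def mean_agreement(items):
    """Average, over all items, of the share of judgements matching the majority label."""
    return sum(majority_label(j)[1] for j in items.values()) / len(items)


# Made-up judgements for two candidate words (3 workers each).
items = {
    "txtd": ["abbreviation", "abbreviation", "abbreviation"],
    "belieber": ["slang", "slang", "proper name"],
}

print(majority_label(items["belieber"]))
print(mean_agreement(items))
```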
Results III - some example transcriptions
ineed -> {i need, the desperate desire to have an apple product}
txtd -> texted
folback -> follow back
belieber -> fans of justin bieber
tourny -> tournament
cakin -> {flirting, baking}
outercourse -> {dry humping, sex with clothes on}
snogging -> {kissing, british slang for kissing}
tatted -> tattooed
wizkid -> {technological genious, rap star}
Machine learning
ML: Automate slang identification and classification
• One class: Abbreviation / Alt. spelling / Slang (70% of all collected data)
• Second class: normal words
• But Simple Wikipedia contains slang
• → intersect its vocabulary with the Gutenberg corpus
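The two classes can be sketched with plain set operations: crowd-labelled abbreviations and slang words form one class, and words found in both reference vocabularies form the other. The category strings, the function name `build_training_sets`, and the toy vocabularies are assumptions for illustration only.

```python
def build_training_sets(crowd_labels, wikipedia_vocab, gutenberg_vocab):
    """Return (slang_words, normal_words) for a two-class classifier."""
    slang_categories = {"abbreviation", "slang"}
    slang = {word for word, category in crowd_labels.items() if category in slang_categories}
    # A word counts as 'normal' only if both reference corpora contain it,
    # which filters out slang that has its own Simple Wikipedia article.
    normal = (wikipedia_vocab & gutenberg_vocab) - slang
    return slang, normal


# Toy inputs (hypothetical).
crowd_labels = {"lmao": "abbreviation", "belieber": "slang", "niall": "proper name"}
wikipedia_vocab = {"house", "study", "lol"}
gutenberg_vocab = {"house", "study", "thou"}

slang_words, normal_words = build_training_sets(crowd_labels, wikipedia_vocab, gutenberg_vocab)
print(sorted(slang_words), sorted(normal_words))
```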
Learning curve for slang word detection
[Plot: F1-score (0.85 to 0.93) against # of training vectors (500 to 4000). Curves: Linear SVM, char n-gram without context; Logistic Regression, char n-gram without context; Naive Bayes, char n-gram without context; Logistic Regression, char n-gram with context.]
Figure: Number of feature vectors used for learning and achieved cross-validated F1-scores for the slang word detection (baseline is roughly 78%).
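To make "char n-gram without context" concrete, here is a small pure-Python multinomial Naive Bayes over character n-grams of a single word. It is a sketch under stated assumptions, not the setup behind the curves: the n-gram range (1 to 3), the `^`/`$` boundary padding, add-one smoothing, and the tiny training set are all choices made for this illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Character n-grams of the word, padded with '^' and '$' as boundary markers."""
    padded = "^" + word + "$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over character n-grams."""

    def fit(self, words, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.ngram_counts = {c: Counter() for c in self.classes}
        for word, label in zip(words, labels):
            self.ngram_counts[label].update(char_ngrams(word))
        self.vocab = set().union(*self.ngram_counts.values())
        self.totals = {c: sum(self.ngram_counts[c].values()) for c in self.classes}
        return self

    def log_scores(self, word):
        n_words = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            score = math.log(self.class_counts[c] / n_words)  # class prior
            denom = self.totals[c] + len(self.vocab)
            for gram in char_ngrams(word):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((self.ngram_counts[c][gram] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)


# Tiny illustrative training set; the real experiments used thousands of vectors.
slang = ["lol", "lmao", "idk", "gn8", "2getha", "hahah"]
normal = ["university", "research", "study", "house", "pretty", "water"]
clf = CharNgramNaiveBayes().fit(slang + normal, ["slang"] * 6 + ["normal"] * 6)

print(clf.predict("lol"), clf.predict("study"))
```

A with-context variant would add features from the surrounding tweet tokens; that part is not sketched here.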
Some example classifications: Logistic regression (MaxEnt) with context
term(s)                              Probability for class 'slang'
lol                                  99.9999992%
gonna                                1.53%
sup, u                               99.99958%, 99.33%
google, beatles                      4.27e-08%, 5.78e-08%
about2, gn8, 2getha                  98.97%, 84.02%, 99.999910%
1998                                 99.99999998%
university, research, study, slang   1.50e-09%, 8.24e-10%, 1.35%, 5.55e-05%
Idea - New candidate list
• All words with similar frequency (10^2 to 10^6 times in the 25 million tweets corpus)
• Get probabilities from classifier
• Rank according to probability and frequency
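One possible reading of these steps: keep words inside the frequency band, keep those the classifier scores as likely slang, and sort by frequency. The slides do not state how probability and frequency are combined, so the threshold `min_proba`, the sort order, and the function name `rank_candidates` are assumptions.

```python
def rank_candidates(words, min_freq=10**2, max_freq=10**6, min_proba=0.9999):
    """Filter (word, freq, proba) triples, then sort by frequency, breaking ties by probability."""
    kept = [
        (word, freq, proba)
        for word, freq, proba in words
        if min_freq <= freq <= max_freq and proba >= min_proba
    ]
    return sorted(kept, key=lambda t: (-t[1], -t[2]))


scored = [
    ("best", 281389, 0.999932982857),
    ("lol", 637444, 0.999999991675),
    ("x", 329343, 0.999999963706),
    ("rareword", 12, 0.99),  # made-up entry, dropped by both filters
]

ranked = rank_candidates(scored)
print([word for word, _, _ in ranked])
```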
New candidate list
Word        Freq.     Proba.
lol         637444    0.999999991675
x           329343    0.999999963706
best        281389    0.999932982857
yeah        272439    0.999999569676
xx          250735    0.99999994069
amazing     214343    0.999999672903
omg         197570    0.999999998845
were        152404    0.9999874413
pretty      125081    0.999999993511
babe        109469    0.999999630948
?!          96779     0.999999619389
followers   96596     0.999999998096
followed    94211     0.99999999844
awww        81060     0.99997855482
niall       80221     0.999999999977
lmao        71448     0.999978506868
fucking     61809     0.999999999922
dm          50457     0.999998836712
Conclusion
Conclusion and future work
• Slang identification and transcription successfully mapped to a crowdsourcing task
• Slang identification can be automated
• Use cases: Twitter normalization, studying slang usage
• Maybe transcription can be automated as well?