Developing Language-tagged Corpora for Code-switching Tweets

Suraj Maharjan¹, Elizabeth Blair², Steven Bethard², and Thamar Solorio¹

¹Department of Computer Science, University of Houston, Houston, TX 77004
²Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, Alabama 35294-1170
[email protected], [email protected], {eablair, bethard}@uab.edu

Introduction
Workflow

Example code-switched tweet (nep-en): "@user tyo schoolko nam k ho?" (roughly, "What is the name of that school?")

•  Identify users: search for users and their friends who code-switch
•  In-lab annotations (find frequent Spanish words)
•  Rerun tweet search for tweets:
   •  containing the most frequent Spanish words
   •  identified as English by the Twitter API
   •  located in areas near California and Texas
•  Harvest tweets
•  Annotate tweets (in-lab and CrowdFlower), with quality monitoring
•  Review and update
•  Final annotations
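The "find frequent words, then re-search" step above can be sketched as follows; `frequent_words` and its signature are our own illustration, not code from the project:

```python
from collections import Counter

def frequent_words(tokens_with_labels, label, top_n=20):
    """Return the most frequent words carrying `label` in the in-lab
    annotations; these become search terms for harvesting more tweets."""
    counts = Counter(tok.lower() for tok, lab in tokens_with_labels if lab == label)
    return [word for word, _ in counts.most_common(top_n)]
```

The resulting terms would then be combined with the Twitter API's language and location filters when rerunning the search.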
Applications
•  Law enforcement: Mexico-USA drug trafficking
•  Child language: analysis of multilingual environments
•  Data mining: ignoring multilingual content may result in significant lost business opportunities

Search for tweets (es-en)
•  containing the most frequent English words (Bangor Miami Corpus)
•  identified as Spanish by the Twitter API
•  located in areas near California and Texas

Data Distribution
Table 2: Distribution of tags across training and test datasets.

            Nepali-English          Spanish-English
Tag         Training (%)  Test (%)  Training (%)  Test (%)
Lang1          31.14       19.76       54.78        43.28
Lang2          41.56       49.10       23.52        30.34
Mixed           0.08        0.60        0.04         0.03
NE              2.73        4.19        2.07         2.22
Ambiguous       0.09        -           0.24         0.12
Other          24.41       26.35       19.34        24.02

In-lab annotations (Gold Data)
Fig 1: Benchmark system performance at the word level (accuracy, precision, recall, and F-measure of the Dictionary and CRF GE systems on Spanish-English and Nepali-English).
Dictionary
•  Build a separate dictionary for each label from the training dataset
•  Use the dictionaries to assign a label to each token
•  Use the majority language to break ties

CRF GE
•  Train LangID (King and Abney) with monolingual texts for each language
•  English-Spanish training set: tweets pulled from Texas and California
•  Nepali monolingual training set: Nepali song lyrics, news websites, and Romanized Nepali tweets
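A minimal sketch of the Dictionary baseline described above (function names are ours): per-label token dictionaries are built from training data, and the majority language breaks ties and covers unseen tokens:

```python
from collections import Counter, defaultdict

def build_dictionaries(labeled_tokens):
    """One token-frequency dictionary per label, from training data."""
    dicts = defaultdict(Counter)
    for token, label in labeled_tokens:
        dicts[label][token.lower()] += 1
    return dicts

def label_tokens(tokens, dicts, majority_label):
    """Label each token by dictionary lookup; ties and unseen tokens
    fall back to the majority language label."""
    labels = []
    for token in tokens:
        t = token.lower()
        counts = {label: c[t] for label, c in dicts.items() if c[t] > 0}
        if not counts:
            labels.append(majority_label)
            continue
        best = max(counts.values())
        winners = [label for label, c in counts.items() if c == best]
        labels.append(winners[0] if len(winners) == 1 else majority_label)
    return labels
```

The fallback choice matters: as the Analysis section notes, tokens like "me" occur in both languages, so the tie-break rule decides their label.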
Analysis

Table 3: N-gram overlap across language pairs.

Tokens    Nepali-English (%)   Spanish-English (%)
words            1.39                 3.54
char-2          52.01                52.21
char-3          33.36                40.36
char-4          12.66                21.31
char-5           3.43                 9.00

The dictionary-based approach had problems with:
•  similar spellings across languages: man (like, heart), gate (date), din (day), me, red
•  unseen tokens: b-lated, yuss, comrade
•  no standard Romanized spelling for Nepali words

CRF GE failed to detect:
•  small code-switched segments embedded inside large monolingual segments

Bilinguals in both pairs frequently used English function words (the, to, yo, he, she, and) and social-media abbreviations (lol, lmao, idk).
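The character n-gram overlap in Table 3 can be approximated as below; the poster does not specify the exact overlap measure, so this sketch assumes a Jaccard-style set overlap over the two vocabularies:

```python
def char_ngrams(words, n):
    """All character n-grams occurring in a vocabulary."""
    return {w[i:i + n] for w in words for i in range(len(w) - n + 1)}

def overlap_pct(vocab_a, vocab_b, n):
    """Jaccard-style percentage of character n-grams shared by two vocabularies."""
    a, b = char_ngrams(vocab_a, n), char_ngrams(vocab_b, n)
    return 100 * len(a & b) / len(a | b)
```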
Final Annotations (nep-en)
441222743661375488  102426321  15 21  lecture  lang1
441222743661375488  102426321  23 24  ma       lang2
441222743661375488  102426321  26 31  dhyaan   lang2
Partially supported by
PDF vs Inline Instructions for Annotators
•  1,000 tweets annotated with the PDF instruction scheme and 500 tweets annotated with the inline instruction scheme were reviewed
•  Inter-annotator agreement measured with Fleiss' multi-π and Cohen's multi-κ
•  At most 0.01 difference in inter-annotator agreement between the PDF and inline instruction schemes
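Fleiss' multi-π (equivalently, Fleiss' kappa) used for the agreement comparison above can be computed as in this sketch; the input format is our own choice:

```python
def fleiss_pi(label_counts):
    """Fleiss' multi-pi over items annotated by k coders each.
    label_counts: one dict per item mapping label -> number of coders
    who chose it; totals must be equal across items."""
    n_items = len(label_counts)
    k = sum(label_counts[0].values())  # coders per item
    # mean observed agreement across items
    p_bar = sum(
        (sum(c * c for c in item.values()) - k) / (k * (k - 1))
        for item in label_counts
    ) / n_items
    # expected agreement from the pooled label distribution
    totals = {}
    for item in label_counts:
        for label, c in item.items():
            totals[label] = totals.get(label, 0) + c
    p_e = sum((t / (n_items * k)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)
```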
Table 1: Tweets distribution for training and test set.

Language Pair      Training   Test
Nepali-English        9,993   2,874
Spanish-English      11,400   3,014

User Information
•  Nepali-English: 15 males and 6 females from Kathmandu
•  Spanish-English: 9 males and 11 females from the Eastern, Central, Mountain, and Pacific (US & Canada) time zones
Results

Example annotated tweets (tags: lang1, lang2, mixed, ambiguous, NE, other):
•  "Ronaldo le ta ramro game khelyo hai." (nep-en)
•  "@user ammo bichara ma :( sab dos chai malai, very sad :'( :P" (nep-en)
•  "@user haha. info ta collect gareeko raichas ta :P" (nep-en)
•  "@user @user no Graciela Eso me Lo Dicen everyday so there's no need stop this right now!!!!" (es-en)
•  "Por favor Gabriela" (es-en)
Code-switching:
•  switching between languages within a single context
•  both inter- and intra-sentential code-switching
Final Annotations (es-en)
437772008151986176  76523773  0 0    I     lang1
437772008151986176  76523773  2 5    hope  lang1
437772008151986176  76523773  37 40  pero  lang2
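The offset-based rows above can be read back against the raw tweet text as in this sketch (helper names are ours); offsets are inclusive character positions:

```python
def parse_annotations(lines):
    """Parse whitespace-separated 'tweet_id user_id start end label' rows."""
    rows = []
    for line in lines:
        tweet_id, user_id, start, end, label = line.split()
        rows.append((tweet_id, user_id, int(start), int(end), label))
    return rows

def extract_tokens(tweet_text, rows):
    """Recover each annotated token via its inclusive character offsets."""
    return [(tweet_text[s:e + 1], label) for _, _, s, e, label in rows]
```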
We thank Mona Diab and Julia Hirschberg for their contributions to the general annotation guidelines, as well as the in-lab annotators and CrowdFlower contributors for their labeling efforts.

Conclusion
•  Code-switching is prevalent, complex, and growing
•  The corpus has been tagged not only for languages but also for named entities, mixed, ambiguous, and irrelevant characters
•  Our methods for searching for and locating code-switching users can be helpful
LAW IX, June 05, 2015, Denver, Colorado