Developing Language-tagged Corpora for Code-switching Tweets

Suraj Maharjan¹, Elizabeth Blair², Steven Bethard², and Thamar Solorio¹

¹ Department of Computer Science, University of Houston, Houston, TX 77004
  [email protected], [email protected]
² Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, AL 35294-1170
  {eablair, bethard}@uab.edu


Introduction

• Code-switching: switching between languages within a single context
• Occurs both inter-sententially and intra-sententially
• Example (nep-en): "@user tyo schoolko nam k ho?" ("@user what is the name of that school?")


Applications

• Law enforcement: Mexico-USA drug trafficking
• Child language: analysis of language use in multilingual environments
• Data mining: ignoring multilingual content may result in a significant loss of business opportunities


Workflow

• Identify users
  – nep-en: search for users, and their friends, who code-switch
  – es-en: search for tweets containing the most frequent English words (from the Bangor Miami Corpus) that the Twitter API identified as Spanish, located in areas near California and Texas
• Harvest tweets
• In-lab annotations (find the most frequent Spanish words)
• Rerun the tweet search (es-en): tweets containing the most frequent Spanish words, identified as English by the Twitter API, from areas near California and Texas
• Harvest tweets
• In-lab annotations (gold data)
• Annotate tweets on CrowdFlower, with a quality monitor; review and update
• Final annotations, e.g. (tweet id, user id, token start offset, token end offset, token, label):

  441222743661375488  102426321  15  21  lecture  lang1
  441222743661375488  102426321  23  24  ma       lang2
  441222743661375488  102426321  26  31  dhyaan   lang2

  437772008151986176   76523773   0   0  I        lang1
  437772008151986176   76523773   2   5  hope     lang1
  437772008151986176   76523773  37  40  pero     lang2


User Information

• Nepali-English: 15 males and 6 females from Kathmandu
• Spanish-English: 9 males and 11 females from the Eastern, Central, Mountain, and Pacific (US & Canada) time zones


Data Distribution

Table 1: Tweet distribution for the training and test sets.

  Language Pair      Training   Test
  Nepali-English        9,993   2,874
  Spanish-English      11,400   3,014

Table 2: Distribution of tags across the training and test datasets.

                 Nepali-English           Spanish-English
  Tag            Training (%)  Test (%)   Training (%)  Test (%)
  Lang1             31.14       19.76        54.78       43.28
  Lang2             41.56       49.10        23.52       30.34
  Mixed              0.08        0.60         0.04        0.03
  NE                 2.73        4.19         2.07        2.22
  Ambiguous          0.09        –            0.24        0.12
  Other             24.41       26.35        19.34       24.02

[Examples panel: sample tweets with word-level tags (lang1, lang2, mixed, NE, ambiguous, other) highlighted, e.g. "Ronaldo le ta ramro game khelyo hai.", "@user ammo bichara ma :( sab dos chai malai, very sad :'( :P", "No! No! No!", "Por favor Gabriela", and "@user haha. info ta collect gareeko raichas ta :P"]


PDF vs. Inline Instructions for Annotators

• 1,000 tweets annotated under the PDF instruction scheme and 500 tweets annotated under the inline instruction scheme were reviewed
• Inter-annotator agreement measured with Fleiss' multi-π and Cohen's multi-κ
• At most a 0.01 difference in agreement between the PDF and inline instruction schemes


Results

[Bar chart: accuracy, precision, recall, and F-measure (0-1 scale) for the two benchmark systems on Nepali-English and Spanish-English]
Fig 1: Benchmark system performance at the word level.

• Dictionary (a minimal sketch follows this list)
  – Build a separate dictionary for each label from the training dataset
  – Use the dictionaries to assign a label to each token
  – Use the majority language to break ties
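The poster describes the Dictionary baseline only in outline; the following is a minimal sketch of that idea, not the authors' implementation. The input format and the helper names (`build_dictionaries`, `label_token`) are assumptions, as is the fallback of labeling unseen tokens with the majority language.

```python
from collections import Counter, defaultdict

def build_dictionaries(training_pairs):
    """Count how often each token appears under each label
    (lang1, lang2, mixed, ne, ambiguous, other)."""
    counts = defaultdict(Counter)
    label_totals = Counter()
    for token, label in training_pairs:
        counts[token.lower()][label] += 1
        label_totals[label] += 1
    # Majority language over the training set, used to break ties
    # (and, as an assumption here, to label unseen tokens).
    majority_lang = max(("lang1", "lang2"), key=lambda l: label_totals[l])
    return counts, majority_lang

def label_token(token, counts, majority_lang):
    """Assign the most frequent training label for the token;
    fall back to the majority language on ties or unseen tokens."""
    seen = counts.get(token.lower())
    if not seen:
        return majority_lang
    ranked = seen.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return majority_lang  # tie between labels
    return ranked[0][0]

# Toy usage with (token, label) pairs drawn from Table 2's tag set:
pairs = [("hope", "lang1"), ("the", "lang1"), ("game", "lang1"),
         ("ma", "lang1"), ("ma", "lang2"), ("pero", "lang2"),
         ("dhyaan", "lang2")]
counts, majority = build_dictionaries(pairs)
print(label_token("pero", counts, majority))  # lang2
print(label_token("ma", counts, majority))    # tie -> lang1 (majority)
print(label_token("yuss", counts, majority))  # unseen -> lang1 (majority)
```

As the Analysis section notes, a lookup of this kind breaks down on tokens spelled identically in both languages and on unseen or non-standardly Romanized tokens.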
• CRF GE
  – Train LangID (King and Abney) with monolingual texts for each language
  – English and Spanish monolingual training sets: tweets pulled from Texas and California
  – Nepali monolingual training set: Nepali song lyrics, news websites, and Romanized Nepali tweets


Analysis

Table 3: N-gram overlap across language pairs (a sketch of one way to compute this overlap follows this section).

  N-gram    Nepali-English (%)   Spanish-English (%)
  words            1.39                 3.54
  char-2          52.01                52.21
  char-3          33.36                40.36
  char-4          12.66                21.31
  char-5           3.43                 9.00

• The dictionary-based approach had problems with:
  – Tokens spelled identically in both languages: man (Nepali: "like", "heart"), gate (Nepali: "date"), din (Nepali: "day"), me, red
  – Unseen tokens: b-lated, yuss, comrade
  – The lack of a standard Romanized spelling for Nepali words
• CRF GE failed to detect small code-switched segments embedded inside large monolingual segments
• Both bilingual groups frequently used English function words (the, to, yo, he, she, and) and social-media abbreviations (lol, lmao, idk)
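Table 3 does not spell out how the overlap was computed; a plausible reading is the proportion of n-gram types shared between the two languages' training vocabularies. The sketch below computes a Jaccard-style version of that; the exact definition used for the table, and the function names, are assumptions for illustration.

```python
def char_ngrams(tokens, n):
    """Set of character n-gram types over a list of tokens."""
    grams = set()
    for tok in tokens:
        tok = tok.lower()
        grams.update(tok[i:i + n] for i in range(len(tok) - n + 1))
    return grams

def ngram_overlap(tokens_l1, tokens_l2, n=None):
    """Shared types / union of types, as a percentage.
    With n=None, whole-word vocabularies are compared instead."""
    if n is None:
        a = {t.lower() for t in tokens_l1}
        b = {t.lower() for t in tokens_l2}
    else:
        a, b = char_ngrams(tokens_l1, n), char_ngrams(tokens_l2, n)
    union = a | b
    return 100.0 * len(a & b) / len(union) if union else 0.0

# Toy usage: lang1 vs. lang2 tokens as they might come from a training file.
english = ["hope", "game", "very", "sad"]
nepali = ["ramro", "khelyo", "dhyaan", "malai"]
for n in (None, 2, 3, 4, 5):
    name = "words" if n is None else f"char-{n}"
    print(f"{name}: {ngram_overlap(english, nepali, n):.2f}%")
```

Under this reading, the higher character-level overlap for Spanish-English in Table 3 reflects two Latin-alphabet languages sharing more orthography, which would make character-based language identification harder for that pair.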
Conclusion

• Code-switching is prevalent, complex, and growing
• The corpus is tagged not only for languages but also for named entities, mixed and ambiguous words, and irrelevant characters
• Our methods for searching for and locating code-switching users can be helpful to others


Acknowledgments

We thank Mona Diab and Julia Hirschberg for their contributions to the general annotation guidelines, as well as the in-lab annotators and the CrowdFlower contributors for their labeling efforts. Partially supported by …

LAW IX, June 5, 2015, Denver, Colorado