Transcultural urban spaces: where geography meets language
16/17 October 2015, Berne

Bon cop, bad cop: A tale of two cities
Kelsey Ball, Barbara E. Bullock, Gualberto Guzmán, Rozen Neupane, Kristopher S. Novak, & Jacqueline Larsen Serigos
The University of Texas

Outline
1. Language, geography, contact, and multilingualism
   ○ Mobility and mixing
2. The Bilingual Annotation Tasks (BATs) Force
   ○ Investigating multilingual texts with automated tools
3. Significance
   ○ Insight into how bilingual speakers use their linguistic resources as they move between different urban spaces

Language, geography, contact, and multilingualism

Superdiversity
● In zones of high contact, such as urban centers, innovative language use is ubiquitous, reflective of superdiversity (Vertovec 2007).
● Contemporary urban speech characterized by mixing is referred to by a variety of terms:
   ○ urban youth languages
   ○ ethnolects
   ○ translanguaging/code-switching
   ○ hybrid language practices

Language mixing and multilingual corpora
● As speakers move between and within spaces, interacting with different interlocutors, they move between and within languages.
● What is the nature of contemporary mixed language forms?
   ○ Linguistic studies: How do they operate structurally?
   ○ Cultural and urban studies: What kinds of meanings and identities do they convey?

Polylingual performances
● Not limited to borrowing words from Language X to insert into Language Y; can include:
   ○ Non-normative usages
   ○ Local forms
   ○ Polylingual language play: memes
   ○ Convergence in structures and meaning between languages
● In order to make sense of this, we need data, and lots of it
● But these features make it difficult to annotate these types of text

Attempts to study multilingual data
● LIDES/LIPPS Group
● BilingBank
   ○ accessible via http://www.talkbank.org/
● Contacts de Langues: Analyses Plurifactorielles (CLAPOTY)
   ○ Isabelle Léglise (CNRS SEDYL/CELIA) - http://clapoty.vjf.cnrs.fr/index.html

Studying multilingual data: Traditionally
● typically manually annotated
● set base/matrix language
● discrete categorization of contact features
   ○ Established borrowings (vs. single-word code-switches)
      ■ “mon chum”
   ○ Syntactic calques (vs. dialectal extensions of existing patterns in a language)
      ■ “take a coffee” < have coffee
   ○ The language of a proper name (Named Entity)
      ■ Robert

Studying multilingual data: More recently
● Probabilistic approaches to contact features
   ○ Allows for gradient/blended classification
● Flexible word sequences (ngrams)
● Automatic annotation

Why automate?
● Eliminate obstacles of cost and time
● Eliminate errors/biases introduced by manual coding
● Increase accountability through replicability
● Process other people’s data to create open resources
● Accelerate production of findings
● Encourage collaboration

Bilingual Annotation Tasks (BATs) Force
/|\ (°v°)/|\

Goals of the BATs
● Present a method and tools for analyzing larger data sets than linguists usually work with
● Demonstrate that it is possible to automatically identify the languages in a data set without the need for human intervention
● Apply a quick metric to calculate the degree to which a text is mixed so that it can be compared with other texts

A first step: written corpus
● Killer Crónicas: Bilingual Memories
   ○ ~49,000 words
   ○ colloquial language
   ○ dialectal forms
   ○ highly bilingual
   ○ extensive mixing

A second step: oral corpus
● Bon Cop Bad Cop
   ○ 13,502 tokens
   ○ colloquial language
   ○ dialectal forms
   ○ extensive mixing
   ○ highly bilingual

Movie Clip
https://www.youtube.com/watch?v=lLISTwHX88k&feature=youtu.be

Bon Cop, Bad Cop, 2006
● Plot
   ○ Officers from Ontario and Quebec join forces to solve a murder on the border
   ○ Film revolves around the mixing of languages and cultures
      ■ Bouchard: Francophone
      ■ Ward: Anglophone
● Bilingualism
   ○ Touted as Canada’s first bilingual film, very successful at the box office
   ○ Filmed with a French and an English script; later edited for final mixed version

Methods

Corpus Creation
1. Monolingual subtitles: open subtitle.com (English), subsmax.com (French)

Corpus Creation
2. Manually combined the two subtitles to create a transcript

Annotation Tags for Manual Gold Standard
Language Tag            Example
English                 years, country
Non-standard English    Hopportunity
French                  frère, ville
Non-standard French     mononcle, tabarnak
Named Entities          Montréal, Toronto, David
Punctuation             .!,?–
Number                  2001, VII

Example of the Gold Standard

Building a language identification model
● Numbers and Punctuation
   ○ Rule-based
● Named Entity
   ○ Stanford Named Entity Recognizer
● English and French
   ○ Character Ngram
   ○ HMM

Character N-gram Model
● What’s the probability that a given word is English? French?
● Model a language using training data (La Presse corpus, CALLHOME corpus)
● Identify the language of an unknown word based on how often its character sequences occur in the training data
● Example:

Hidden Markov Model
● What’s the probability that this word is English, considering that the previous word was English?
● Code-switching is less frequent (therefore less probable) than not switching.
- "me" occurs in both English and French training data → ambiguous
- Because the previous word was English, "me" is more likely to be English

Annotation of Bon Cop, Bad Cop by model

Results

Evaluation of our model
Tag             Accuracy    Precision    Recall
English         98.3%       98.3%        96.6%
French          97.5%       97.4%        98.6%
Named Entity    69.4%       78.7%        85.5%
Overall         97.0%

Overall Language Distribution across Bon Cop Bad Cop
[Chart: token count by language tag (Named Ent., Punct., Eng, French)]

Characters’ Usage of French Across Cities
[Chart panels: Montréal, Toronto]

Logistic Regression
Can we predict language based on character and location?
● Relative to Martin Ward...
   ○ Others: not a significant predictor
   ○ David Bouchard: 15 times more likely to use French
   ○ Tattoo Killer: 1.6 times more likely to use English
● Relative to Toronto...
   ○ In Montréal: 8 times more likely to use French

Bilingual metric applied to Bon Cop Bad Cop
● The M-metric was developed by the LIPPS Group (2000):
   $$M = \frac{1 - \sum_{j} p_j^{2}}{(k - 1) \sum_{j} p_j^{2}}$$
   where
   k = total number of languages/categories
   p_j = total number of words in the given language over total words in the corpus
● The metric allows us to determine how bilingual a text is: the closer the value is to 1, the more ‘bilingual’ the text

M-metric for Bon Cop Bad Cop

Switch points
Types of Switches        Example
Eng to French            He has had an unfortunate contretemps
French to English        On peut aller demander à ses partners
Punct. only              Hey, where you going? Chez nous.
Named Ent. and Punct.    Bye, Patrick. Merci d'avoir appelé.

Switch Pattern
[Chart: Indirect Switches vs. Direct Switches]
Over ⅔ of direct switches involved single-word borrowings (e.g. ses affaires cheap américaines)

Discussion and Conclusions

What we’ve achieved
● Presented a profile of language switching in a large bilingual text
   ○ locale
   ○ speaker
● Demonstrated that it is possible to automatically identify language switches without manually coding all the data
   ○ with a high degree of accuracy: 97%
● Calculated the degree to which segments of the corpus are mixed and how they are mixed
   ○ M score of .71 for overall mixing where 1 = perfectly mixed

What we’ve learned about who switches, where
● Context matters, a lot
   ■ In Toronto, only the Québécois cop (Bouchard) uses French; all others perform in English
   ■ In Montreal, the Torontonian cop (Ward) and Tattoo Killer perform bilingually
● Speaker matters, a lot
   ■ The most bilingual character appears to be Tattoo Killer
   ■ The least is Bouchard

What else we’ve learned about who switches
● The anglophone speakers accommodate more to the francophone than vice versa.
   ○ Given the politics of language in Quebec with regard to French language maintenance, and the fact that the director and producer of the film are Québécois, this might not be surprising.
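
The M-metric reported above (.71 overall) can be illustrated with a short Python sketch. It assumes the commonly cited form of the LIPPS index, M = (1 − Σ p_j²) / ((k − 1) · Σ p_j²), and assumes that only language tags are counted, so punctuation, numbers, and named entities are left out. The tag names `eng` and `fra` and the function name `m_index` are hypothetical, chosen for illustration; this is not the authors' code.

```python
from collections import Counter


def m_index(tags, languages=("eng", "fra")):
    """Return the M-index of a tag sequence (0 = monolingual, 1 = balanced).

    `tags` is one tag per token; tags outside `languages` (punctuation,
    numbers, named entities) are ignored, which is an assumption of this sketch.
    """
    counts = Counter(tag for tag in tags if tag in languages)
    total = sum(counts.values())
    k = len(languages)
    if total == 0 or k < 2:
        raise ValueError("need at least one language-tagged token and two languages")
    # p_j is the share of language-tagged tokens that belong to language j.
    sum_sq = sum((counts[lang] / total) ** 2 for lang in languages)
    return (1 - sum_sq) / ((k - 1) * sum_sq)


if __name__ == "__main__":
    # Toy sequence: three English tokens, one French token, one punctuation mark.
    print(m_index(["eng", "eng", "punct", "eng", "fra"]))
```

Because the score depends only on the proportions p_j, it can be computed for the whole corpus or for any slice of it (for example per character or per city), which is what makes texts of different lengths comparable.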

What we’ve learned about how they switch
● Most switching coincides with punctuation, indicating
   ○ a change in speaker turn or
   ○ a switch at a phrasal boundary
● Within a phrase
   ○ nearly balanced French↔English

Conclusions
This work shows how NLP tools can be fruitfully recruited to shed light on contemporary mixed language use.
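
To make the language identification step concrete, here is a minimal Python sketch that combines a character-bigram model with a two-state hidden Markov model decoded by the Viterbi algorithm. It only illustrates the general technique named in the Methods slides and is not the authors' implementation: the toy word lists stand in for the La Presse and CALLHOME training data, and the smoothing constant `ALPHABET_SIZE`, the stay probability `P_STAY`, and all function names are assumptions.

```python
import math
from collections import Counter

# Toy training words (hypothetical stand-ins for real training corpora).
TRAIN = {
    "eng": "the where you going with this thing what with them they".split(),
    "fra": "le chez nous merci avoir appelé que les pour dans".split(),
}
ALPHABET_SIZE = 64  # assumed character inventory size, used for add-one smoothing
P_STAY = 0.9        # assumed probability that the next word stays in the same language


def bigrams(word):
    """Character bigrams of a word, padded with start (^) and end ($) marks."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(words):
    """Count character bigrams and the characters that begin them."""
    pair_counts, first_counts = Counter(), Counter()
    for word in words:
        for bg in bigrams(word):
            pair_counts[bg] += 1
            first_counts[bg[0]] += 1
    return pair_counts, first_counts


MODELS = {lang: train(words) for lang, words in TRAIN.items()}


def word_logprob(word, lang):
    """Smoothed log-probability of a word's character bigrams under one language."""
    pair_counts, first_counts = MODELS[lang]
    return sum(
        math.log((pair_counts[bg] + 1) / (first_counts[bg[0]] + ALPHABET_SIZE))
        for bg in bigrams(word)
    )


def viterbi(words):
    """Most probable language sequence, penalising switches between words."""
    if not words:
        return []
    langs = list(MODELS)
    switch = (1 - P_STAY) / (len(langs) - 1)
    # best[lang] = (log-probability of the best path ending in lang, that path)
    best = {
        lang: (math.log(1 / len(langs)) + word_logprob(words[0], lang), [lang])
        for lang in langs
    }
    for word in words[1:]:
        step = {}
        for lang in langs:
            score, path = max(
                (best[prev][0] + math.log(P_STAY if prev == lang else switch), best[prev][1])
                for prev in langs
            )
            step[lang] = (score + word_logprob(word, lang), path + [lang])
        best = step
    return max(best.values())[1]


if __name__ == "__main__":
    print(viterbi("where you going chez nous".split()))
```

The transition weights encode the slide's observation that switching is less probable than not switching: an ambiguous word such as "me" inherits the language of its neighbours unless its character evidence is strong enough to outweigh the switch penalty.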