Bon cop, bad cop: A tale of two cities

Transcultural urban spaces: where
geography meets language
16/17 October 2015
Berne
Bon cop, bad cop:
A tale of two cities
Kelsey Ball, Barbara E. Bullock, Gualberto Guzmán,
Rozen Neupane, Kristopher S. Novak,
& Jacqueline Larsen Serigos
The University of Texas
Outline
1. Language, geography, contact, and multilingualism
○
Mobility and mixing
2. The Bilingual Annotation Tasks (BATs) Force
○
Investigating multilingual texts with automated tools
3. Significance
○
Insight into how bilingual speakers use their linguistic resources as they move between different
urban spaces
2
Language, geography, contact, and
multilingualism
3
Superdiversity
●
In zones of high contact, such as urban centers, innovative language use is
ubiquitous, reflective of superdiversity (Vertovec 2007).
●
Contemporary urban speech characterized by mixing is referred to by a variety
of terms:
○
○
○
○
urban youth languages
ethnolects
translanguaging/code-switching
hybrid language practices
4
Language mixing and multilingual corpora
●
As speakers move between and within spaces, interacting with different
interlocutors, they move between and within languages.
●
What is the nature of contemporary mixed language forms?
○
○
Linguistic studies : How do they operate structurally?
Cultural and urban studies: What kinds of meanings and identities do they convey?
5
Polylingual performances
●
Not limited to borrowing words from Language X to insert into Language Y; can
include:
○
○
○
Non-normative usages
Local forms
Polylingual language play: memes
○
Convergence in structures and meaning between languages
●
In order to make sense of this, we need data—lots of it
●
But these features make it difficult to annotate these types of text
6
Attempts to study multilingual data
●
LIDES/LIPPS Group
●
BilingBank
○
●
accessible via http://www.talkbank.org/
Contacts de Langues: Analyses Plurifactorielles (CLAPOTY)
○
Isabelle Léglise (CNRS SEDYL/CELIA) - http://clapoty.vjf.cnrs.fr/index.html
7
Studying multilingual data: Traditionally
●
●
●
typically manually annotated
set base/matrix language
discrete categorization of contact features
○
○
○
Established borrowings (vs. single word code-switch)
■ “mon chum”
Syntactic calques (vs. a dialectal extensions of existing patterns in a language)
■ “take a coffee” < have coffee
The language of a proper name (Named Entity)
■ Robert
8
Studying multilingual data: More recently
●
Probabilistic approaches to contact features
○
Allows for gradient/blended classification
●
Flexible word sequences (ngrams)
●
Automatic annotation
9
Why automate?
●
●
●
●
●
●
Eliminate obstacles of cost and time
Eliminate errors/biases introduced by manual coding
Increase accountability through replicability
Process other people’s data to create open resources
Accelerate production of findings
Encourage collaboration
10
Bilingual Annotation Tasks (BATs) Force
/|\ (°v°)/|\
11
Goals of the BATs
●
Present a method and tools for analyzing larger data sets than linguists usually
work with
●
Demonstrate that it is possible to automatically identify the languages in a data
set without the need for human intervention
●
Apply a quick metric to calculate the degree to which a text is mixed so that it
can be compared with other texts
12
A first step: written corpus
●
Killer Crónicas: Bilingual Memories
○
~49,000 words
○
colloquial language
○
dialectal forms
○
highly bilingual
○
extensive mixing
13
A second step: oral corpus
●
Bon Cop Bad Cop
○
13,502 tokens
○
colloquial language
○
dialectal forms
○
extensive mixing
○
highly bilingual
14
Movie Clip
https://www.youtube.com/watch?
v=lLISTwHX88k&feature=youtu.be
15
Bon cop, Bad Cop, 2006
●
Plot
○
○
●
Officers from Ontario and Quebec join forces to solve a murder on the border
Film revolves mixing of languages and cultures
■ Bouchard: Francophone
■ Ward: Anglophone
Bilingualism
○
○
Touted as Canada’s first bilingual film, very successful at the box office
Filmed with a French and an English script; later edited for final mixed version
16
Methods
17
Corpus Creation
1.
Monolingual subtitles: open subtitle.com (English), subsmax. com (French)
18
Corpus Creation
2.
Manually combined the two subtitles to create a transcript
19
Annotation Tags for Manual Gold Standard
Language Tag
Example
English
years, country
Non-standard English
Hopportunity
French
frère, ville
Non-standard French
mononcle, tabarnak
Named Entities
Montréal, Toronto, David
Punctuation
.!,?–
Number
2001, VII
20
Example of the Gold Standard
21
Building a language identification model
●
●
●
Numbers and Punctuation
○ Rule-based
Named Entity
○ Stanford Named Entity Recognizer
English and French
○ Character Ngram
○ HMM
22
Character N-gram Model
●
●
●
●
What’s the probability that a given word is English? French?
Model a language using training data (La Presse corpus, CALLHOME corpus)
Identify language of an unknown word based on how often it occurs in training
data
Example:
23
Hidden Markov Model
●
●
What’s the probability that this word is English, considering that the previous
word was English?
Code-switching is less frequent (therefore less probable) than not switching.
-Me occurs in both English and French training data → ambiguous
Because previous word was English, me is more likely to be English
-
24
Annotation of Bon Cop, Bad Cop by model
25
Results
26
Evaluation of our model
Tag
Accuracy
Precision
Recall
English
98.3%
98.3%
96.6%
French
97.5%
97.4%
98.6%
Named Entity
69.4%
78.7%
85.5%
Over-all
97.0%
27
Over-all Language Distribution across Bon Cop Bad Cop
Named Ent.
Language
Tag
Punct.
Eng
French
Token count
28
Characters’ Usage of French Across Cities
Montréal
Toronto
29
Logistic
Regression
Can we predict language
based on character and
location?
●
Relative to Martin Ward...
○
Others: not a significant predictor
○
David Bouchard: 15 times more likely t
use French
○
Tattoo Killer: 1.6 times more likely to
use English
●
Relative to Toronto...
○
In Montréal: 8 times more likely to
use French
30
Bilingual metric applied to Bon Cop Bad Cop
●
The M-metric is developed by the LIPPS Group (2000):
where,
k=total number of languages/categories
pj=total number of words in the given language over total words in the corpus
●
The metric allows us to determine how bilingual a text is: the closer the value is
to 1, the more ‘bilingual’ the text
31
M-metric for Bon Cop Bad Cop
32
Switch points
Types of Switches
Example
Eng to French
He has had an unfortunate contretemps
French to English
On peut aller demander à ses partners
Punct. only
Hey, where you going?
Chez nous.
Named Ent. and Punct.
Bye, Patrick. Merci d'avoir appelé.
33
Switch Pattern
Indirect
Switches
Direct
Switches
Indirect
Switches
Over ⅔ of Direct switches involved single word borrowings (eg. ses affaires cheap américaines)
34
Discussion and Conclusions
35
What we’ve achieved
●
●
Presented a profile of language switching in a large bilingual text
○
locale
○
speaker
Demonstrated that it is possible to automatically identify language switches
without manually coding all the data
○
●
with a high degree of accuracy: 97%
Calculated the degree to which segments of the corpus are mixed and how they
are mixed
○
M score of .71 for overall mixing where 1 = perfectly mixed
36
What we’ve learned about who switches, where
●
●
Context matters— a lot
■
In Toronto, only the Québécois cop (Bouchard) uses French; all others perform in English
■
In Montreal, the Torontonian cop (Ward) and Tattoo Killer perform bilingually
Speaker matters— a lot
■
■
The most bilingual character appears to be Tattoo Killer
The least is Bouchard
37
What else we’ve learned about who switches
●
The anglophone speakers accommodate more to the francophone than vice
versa.
○
Given the politics of language in Quebec with regard to French language maintenance, and the
fact that the director and producer of the film are Québécois, this might not be surprising.
38
What we’ve learned about how they switch
●
Most switching coincides with
punctuation, indicating
○
○
●
a change in speaker turn
or a switch at a phrasal boundary
Within a phrase
○
nearly balanced French↔English
39
Conclusions
This work shows how NLP tools can be
fruitfully recruited to shed light on
contemporary mixed language use.
40