Corpora

Università degli Studi di Macerata (University of Macerata)
Department of Humanities: Languages, Mediation, History, Literature, Philosophy
Master’s Degree in Modern Languages for International Communication and Cooperation (Class LM-38)
TPCI English - module B
Tools and technologies for specialised translation - academic year 2016/2017
PART 2: Corpora and Translation
Sara Castagnoli
What is a corpus?
Some (authoritative) definitions
• “a collection of naturally-occurring language text, chosen to characterize a
state or variety of a language” (Sinclair, 1991:171)
• “a collection of texts assumed to be representative of a given language,
dialect, or other subset of a language, to be used for linguistic analysis”
(Francis, 1992:7)
• “a closed set of texts in machine-readable form established for general or
specific purposes by previously defined criteria” (Engwall, 1992:167)
• “a finite-sized body of machine-readable text, sampled in order to be
maximally representative of the language variety under consideration”
(McEnery & Wilson, 1996:23)
• “a collection of (1) machine-readable (2) authentic texts […] which is (3)
sampled to be (4) representative of a particular language or language
variety” (McEnery et al., 2006:5)
What is / is not a corpus…?
• A newspaper archive on CD-ROM?
• An online glossary?
• A digital library (e.g. Project Gutenberg)?
• All RAI 1 programmes (e.g. for spoken TV language)?
The answer is always “NO” (see definition)
Corpora vs. web
• Corpora:
  – Usually stable
    • searches can be replicated
  – Control over contents
    • we can select the texts to be included, or have control over selection strategies
  – Ad-hoc linguistically-aware software to investigate them
    • concordancers can sort / organise concordance lines
• Web (as accessed via Google or other search engines):
  – Very unstable
    • results can change at any time for reasons beyond our control
  – No control over contents
    • what/how many texts are indexed by Google’s robots?
  – Limited control over search results
    • cannot sort or organise hits meaningfully; their order is decided by the search engine’s ranking, not by linguistic criteria
What types of corpora exist? A brief overview
• A corpus is a principled collection of naturally occurring electronic texts designed to be a representative sample of language in actual use
• Some of the main features and criteria used to describe and classify corpora:
  – general vs. specialised
  – written vs. spoken (transcribed) vs. multimodal (audio/video)
  – balanced (sample) vs. opportunistic
  – synchronic vs. diachronic
  – static vs. dynamic
  – closed / finite vs. open-ended (monitor)
  – raw (pre-corpus) vs. augmented: marked-up, POS-tagged, annotated
  – monolingual vs. bi- / multilingual
  – parallel vs. comparable
An example of planned balance: the British National Corpus
• 100 million words of contemporary spoken and written British English
• Representative of British English “as a whole”
• Designed to be appropriate for a variety of uses:
lexicography, education, research, commercial applications
(computational tools)
• Balanced with regard to genre, subject matter and style
• Sampling and representativeness very difficult to ensure
BNC
• 4,124 texts: 90% written, 10% spoken
• Largest collection of spoken English ever collected (10 million words), but reflects the typical imbalance in favour of written text (for understandable practical reasons)
• Written portion: 75% informative, 25% imaginative
NEW: Spoken BNC to be released in 2017!
BNC written material
Sources:
• 60% books
• 25% periodicals
• 5% brochures and other ephemera (e.g. bus tickets, produce containers, junk mail)
• 5% unpublished letters, essays, minutes
• 5% plays, speeches (written to be spoken)
Register levels:
• 30% literary or technical “high”
• 45% “middle”
• 25% informal “low”
BNC Subject coverage
• Planned to reflect pattern of book publishing in UK over last 20 years

Subject             Number of texts    % of total written
Imaginative               625                  22
World affairs             453                  18
Social science            510                  15
Leisure                   374                  11
Applied science           364                   8
Commerce                  284                   8
Arts                      259                   8
Natural science           144                   4
Belief & thought          146                   3
Unclassified               50                   3
BNC Spoken corpus
• Context-governed material
  – Lectures, tutorials, classrooms
  – News reports
  – Product demonstrations, consultations, interviews
  – Sermons, political speeches, public meetings, parliamentary debates
  – Sports commentaries, phone-ins, chat shows
• Samples from 12 different regions
BNC Spoken corpus
• Ordinary conversation
  – 2000 hrs from 124 volunteers, 38 different regions
  – Four different socio-economic groupings
  – Equal numbers of male and female volunteers, age range 15 to 60+
  – All conversations over a 2-day period recorded
  – No secret recording; participants allowed to erase recordings
  – Systematic details kept of time, location, participants (sex, age, race, occupation, education, social group), topic, etc.
• Transcription issues:
  – include false starts, hesitations, etc.
  – some paralinguistic features (shouting, whispering)
  – use of dialect words/grammar
  – but no phonetic information
Dynamic (Monitor) vs static (Finite)
• A static corpus will give a snapshot of language use
at a given time
• Easier to control balance of content
• May limit usefulness, esp. as time passes
• A dynamic corpus is ever-changing
• Called a “monitor” corpus because it allows us to monitor language change over time
Key concepts and technical notions in
corpus-based translation studies
• Wordlist, frequency list, keyword list
• Types, tokens, type/token ratio (lexical variation)
• Function/grammatical words vs. content/lexical words (lexical
density)
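A minimal sketch of two of the notions listed above, a frequency list and lexical density, may help make them concrete. The function-word set below is a deliberately tiny, hypothetical stand-in (a real study would rely on POS tags or a fuller list), and the helper names are illustrative only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny set of function (grammatical) words.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "with", "and", "but"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def lexical_density(tokens):
    """Share of tokens that are content (lexical) words."""
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)

tokens = tokenize("The man saw the girl with the telescope")
print(frequency_list(tokens)[:3])
print(lexical_density(tokens))
```

With this toy word set, half of the tokens in the sentence count as content words; the figure depends entirely on which words are treated as grammatical.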
“Type” and “token”
• “Token” means an individual occurrence of a word
• “Type” means a distinct word, counted once however often it occurs
• The man saw the girl with the telescope
• 8 tokens, 6 types
• “Type” may refer to lexeme, or individual word
form
• run, runs, ran, running: 1 or 4 types?
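The counts above can be checked with a short sketch. It assumes simple whitespace tokenisation and case-insensitive matching; the lemma lookup at the end is a hypothetical mini-dictionary, included only to show why the answer to "1 or 4 types?" depends on whether we count word forms or lexemes.

```python
def type_token_counts(tokens):
    """Return (number of tokens, number of types, type/token ratio)."""
    types = set(tokens)
    return len(tokens), len(types), len(types) / len(tokens)

sentence = "The man saw the girl with the telescope"
tokens = sentence.lower().split()
n_tokens, n_types, ttr = type_token_counts(tokens)
print(n_tokens, n_types, ttr)  # 8 6 0.75

# Hypothetical mini lemma lookup, for illustration only.
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}
forms = ["run", "runs", "ran", "running"]
word_form_types = len(set(forms))                      # counting word forms
lexeme_types = len({LEMMAS.get(f, f) for f in forms})  # counting lexemes
print(word_form_types, lexeme_types)  # 4 1
```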
Key concepts and technical notions
• Concordance (concordancing software)
• KWIC (keyword in context)
• Nodeword
• Sorting
Concordance for nodeword “eyes” (sorted 1L) generated from the BNC
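As a rough illustration of what a concordancer does when it produces KWIC lines "sorted 1L" (by the first word to the left of the nodeword), here is a minimal sketch. The sample text is invented for the example and is not taken from the BNC; the function names are illustrative.

```python
def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_1l(lines):
    """Sort concordance lines by the first word to the left of the node."""
    return sorted(lines, key=lambda line: line[0][-1].lower() if line[0] else "")

# Invented sample text, for illustration only.
text = ("She closed her eyes and then opened her eyes again "
        "while his eyes stayed on the blue eyes of the cat")
tokens = text.split()
for left, node, right in sort_1l(kwic(tokens, "eyes")):
    print(f"{' '.join(left):>25}  {node}  {' '.join(right)}")
```

Sorting on the left or right neighbour groups recurrent patterns together, which is what makes them visible when scanning a concordance.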
Key concepts and technical notions
• Collocation (collocates)
• Lemmatisation (morphological analysis)
• (POS-)Tagging (grammatical analysis)
• Parsing (syntactic analysis)
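Collocates can be thought of as the words that occur within a few positions of the nodeword. The sketch below only counts raw co-occurrences in a window of two tokens on either side; corpus tools normally rank collocates with statistical association measures instead, and the sample sentence is invented.

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample sentence, for illustration only.
tokens = "they took to the streets and we took to the hills".split()
counts = collocates(tokens, "took", span=2)
print(counts.most_common(3))
```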
A translation-relevant corpus typology
Corpora
  – general / reference: (usually) monolingual, (normally ready)
  – specialised
    • monolingual
    • multilingual: parallel, comparable
General / reference monolingual corpora
Corpora
  – general / reference: (usually) monolingual, (normally ready), very big (>100M words)
    • British National Corpus (BrEng)
    • COCA (AmEng)
    • La Repubblica (?)
    • CORIS (Ita), WaCky corpora, PAISÀ
    • Leeds Internet corpora
    • Mannheim corpora
  – specialised
    • monolingual
    • multilingual: parallel, comparable
Example: BNC concordances for nodewords “eyes” and “eye”
Concordance for nodeword “eyes” (sorted 1L) generated from the BNC
Concordance for nodeword “eye” (sorted 1L) generated from the BNC
General / reference monolingual corpora (of English)
“Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.”
(Source: www.nature.com/nature/journal/v455/n7215/full/455835b.html)
Took to the streets
• http://corpus.leeds.ac.uk/internet.html
• English
• Let’s try to understand:
• Meaning
• Extended (sentential) co-text, preferential co-selections
• Context(s) of use
• Semantic preference
• Semantic prosody
Using general / reference monolingual corpora (from/on the Web): Leeds Internet corpora
http://corpus.leeds.ac.uk/internet.html
Let’s explore internal variation. Examples of (possible) useful queries:
• Any other forms of the verb take? (colligational constraints)
• Plural/singular of the noun street? (colligational constraints)
• Other verbs? (collocational flexibility)
• Other nouns? (collocational flexibility)
How to search:
• Select “CQP syntax only” on the search page (automatic POS-tagging!)
• CQP tutorial: http://cwb.sourceforge.net/files/CQP_Tutorial/
• Look at the examples on the following slides for guidance and adapt those models to your searches
• Try out a number of different options to familiarise yourself with the search syntax, and understand what kinds of searches it can support
Examples of (possible) useful queries
• Any other forms of the verb take? (colligational constraints)
• Plural/singular of the noun street? (colligational constraints)
• [lemma="take"] "to" "the" [lemma="street"]
  – Lemmatised search: finds all possible forms of verb and noun
• [pos="V.*"] "to" "the" [lemma="street"]
  – Lemmatised and POS-specific search: as above, but any verb can fill the first slot
• [lemma="take"] "to" "the" [pos="N.*"]
  – Lemmatised and POS-specific search: as above, but any noun can fill the last slot
• Click on the link to the left of the concordance line for context
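To make the logic of these queries more tangible, the sketch below imitates it in Python on a tiny hand-tagged sample. It is not how CQP is implemented: the token dictionaries, tag names and function names are all assumptions made for the example. Each query position becomes a dictionary of attribute/regex constraints, and, as in CQP, a regular expression has to match the whole attribute value.

```python
import re

def matches(token, constraint):
    """True if every attribute regex in `constraint` matches the whole value."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraint.items())

def find(corpus, pattern):
    """Yield the word sequences where consecutive tokens satisfy `pattern`."""
    n = len(pattern)
    for i in range(len(corpus) - n + 1):
        window = corpus[i:i + n]
        if all(matches(t, c) for t, c in zip(window, pattern)):
            yield [t["word"] for t in window]

# Invented, hand-tagged sample; the tag names are placeholders.
corpus = [
    {"word": "Researchers", "lemma": "researcher", "pos": "NNS"},
    {"word": "took", "lemma": "take", "pos": "VBD"},
    {"word": "to", "lemma": "to", "pos": "TO"},
    {"word": "the", "lemma": "the", "pos": "DT"},
    {"word": "streets", "lemma": "street", "pos": "NNS"},
]

# Mirrors: [lemma="take"] "to" "the" [lemma="street"]
lemma_query = [{"lemma": "take"}, {"word": "to"}, {"word": "the"}, {"lemma": "street"}]
# Mirrors: [pos="V.*"] "to" "the" [lemma="street"]
verb_query = [{"pos": "V.*"}, {"word": "to"}, {"word": "the"}, {"lemma": "street"}]

lemma_hits = list(find(corpus, lemma_query))
verb_hits = list(find(corpus, verb_query))
print(lemma_hits)
print(verb_hits)
```

In this sketch an empty dictionary `{}` places no constraint on a position, so it would match any token.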
Now the translation into Italian of “took to the streets”
• Verb?
  – andare
  – scendere
  – …?
• Preposition?
  – in
  – nella/nelle
  – per la/per le?
  – …?
• Noun?
  – strada/strade
  – piazza/piazze
  – …?
Which queries do we need? How many are necessary?
Now the translation into Italian of “take to the street”
• [pos="V.*"] [] [lemma="strada"]
• [pos="V.*"] [] [lemma="piazza"]
  – very general (slower)
  – NB: [] means ‘any word in that position’
• [pos="V.*"] [word="(in|nella|nelle)"] [lemma="strada"]
• [pos="V.*"] [word="(in|nella|nelle)"] [lemma="piazza"]
  – more specific/restrictive
  – NB: | is called ‘pipe’, lists alternatives
• [lemma="scendere"] [word="(in|nella|nelle)"] [lemma="strada"]
• [lemma="andare"] [word="(in|nella|nelle)"] [lemma="piazza"]
  – very specific/restrictive
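One way to see how many "very specific" queries are needed is to generate them from the candidate lists. The sketch below assumes the two verbs, three preposition forms and two nouns considered so far; it only builds the query strings, which would then be pasted into the search interface one at a time.

```python
from itertools import product

# Candidate translations considered so far (assumed lists, extend as needed).
verbs = ["andare", "scendere"]
prepositions = ["in", "nella", "nelle"]
nouns = ["strada", "piazza"]

def build_query(verb, preps, noun):
    """Build one CQP query string in the style shown above."""
    alternatives = "|".join(preps)
    return f'[lemma="{verb}"] [word="({alternatives})"] [lemma="{noun}"]'

queries = [build_query(v, prepositions, n) for v, n in product(verbs, nouns)]
for q in queries:
    print(q)
```

Because the prepositions are listed as alternatives inside a single position, two verbs and two nouns give four queries; every additional verb or noun candidate multiplies that number.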
“Last week, tens of thousands of researchers took to the streets to register their opposition to a proposed bill designed to control civil-service spending.”
REGISTER ONE’S OPPOSITION
• Now search the BNC for this expression.
• What does it mean?
• Which “feelings” are usually “registered”?
  – interest
  – concern
  – support
  – dismay
  – frustrations
  – dissatisfaction
  – disapproval
  – protest
  – commitment
  – …
Monolingual general / reference corpora
available online (at least partially, i.e. as demos)
• British National Corpus (BNC, British English)
• www.natcorp.ox.ac.uk
• COCA (American English)
• http://corpus.byu.edu/coca/
• The CORIS corpus (Italian)
• http://corpora.dslo.unibo.it/coris_ita.html
• Leeds Internet corpora
• English, Chinese, Arabic, French, German, Italian, Japanese, Polish,
Portuguese, Russian, Spanish: http://corpus.leeds.ac.uk/internet.html
• Mannheim corpora (German)
• http://corpora.ids-mannheim.de/ccdb
• Corpus del Español (Spanish)
• www.corpusdelespanol.org
• CREA (Spanish)
• http://corpus.rae.es/creanet.html
Explore the Web to see what other corpora are available!