PolyU Language Bank - KSU Faculty Member websites

Corpus Linguistics
Developing a PolyU Language Bank
Sherman Lee
[email protected]
PI: Grahame Bilbow
Thanks to: Chris Greaves, Raymond Cheung, Li Lan
Outline

Background
  • Goals of corpus linguistics
  • Types of corpora
  • Applications of corpus analysis

As an illustration
  • Exploring units of meaning
  • Case study

Developing a PolyU Language Bank
  • Aims and objectives of project
  • Similar existing projects
  • Procedures

The PolyU Language Bank
  • Current status
  • Sample corpora
  • Sample search
Goals of corpus linguistics

Chomskyan linguistics
  • ‘Langue’ (competence)
  • Ideal speaker/hearer
  • Language = innate mental faculty
  • Intuitive evidence
  • Universals
  • Grammar

Corpus linguistics
  • ‘Parole’ (performance)
  • Complexity/variation
  • Language = social phenomenon
  • Empirical evidence
  • Differences
  • Meaning
Basic tools

Corpus: a systematic collection of speech or writing
that is built according to explicit design criteria for a
specific purpose

cf. EAGLES’ broad definition: “A corpus can
potentially contain any text type, incl. word lists,
dictionaries, etc.”

Concordancer: search engine
(e.g. WordSmith; SARA)

Concordance: occurrences of search item, displayed
in list with immediate context shown
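
To make the idea of a concordance concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It does not reproduce how WordSmith or SARA work internally; the function name `kwic`, the simple regex tokenisation and the choice of a three-word context window are assumptions made for illustration. The sample sentence is the "dog" example quoted elsewhere in these slides.

```python
import re

def kwic(text, keyword, width=3):
    """Return one (left, hit, right) tuple per occurrence of keyword.

    Tokenisation is a plain regex over letters and apostrophes
    (an assumption; real concordancers tokenise more carefully).
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

sample = ("If you dog a dog during the dog days of summer, "
          "you'll be a dog tired dog catcher.")

# Print each hit with its left context right-aligned, as in a concordance list.
for left, hit, right in kwic(sample, "dog"):
    print(f"{left:>20} | {hit} | {right}")
```

Each printed line shows the search item in the centre with its immediate context on either side, which is the display format described above.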
Types of corpora
• Written vs Spoken
• General vs Specialised
    e.g. ESP, Learner corpora
• Monolingual vs Multilingual
    e.g. Parallel, Comparable
• Synchronic vs Diachronic; Monitor
• Annotated vs Unannotated
Written corpora
                     Brown                      LOB
Time of compilation  1960s                      1970s
Compiled at          Brown University (US)      Lancaster, Oslo, Bergen
Language variety     Written American English   Written British English
Size                 1 million words each (500 texts of 2000 words each)
Design               Balanced corpora; 15 genres of text, incl. press reportage,
                     editorials, reviews, religion, government documents,
                     reports, biographies, scientific writing, fiction
Specialised corpora
CSPAE
  Time of compilation: 1990s
  Compiled at / by: Michael Barlow (Rice Univ)
  Language variety: Spoken professional American English
  Size: 2 million words (tagged)
  Design: Transcripts from professional settings (meetings, conferences…)
    by 400 speakers; academia (1 M wds), politics (1 M wds)

CHILDES
  Time of compilation: Since 1980s
  Compiled at / by: Project started at Carnegie Mellon Univ; contributors worldwide
  Language variety: 20 languages, incl.: E.Asian, Germanic, Romance, Slavic…;
    mainly conversational data
  Size: c. 20 million words (growing)
  Design: “Child language data exchange system”, offering transcripts of
    monolingual and bilingual children’s language (language acquisition data)
Other examples of available corpora

First generation major corpora
• Brown Corpus (1960s)
    Compiled at: Brown Univ, US
    Language: Written American English
    Size: 1 million (tagged)
    Design: 15 genres of text: press reportage, religion, fiction…

Second generation mega corpora
• Bank of English (since 1991)
    Compiled at: COBUILD, Birmingham Univ
    Language: Written / spoken English
    Size: 450 million – year 2002 (tagged)
    Design: Monitor corpus; mostly written: newspapers, books;
      spoken: conversations, broadcasts, interviews...
• International Corpus of English [ICE-GB] (1990s)
    Compiled at: UCL, London
    Language: Written / spoken British Engl.
    Size: 1 million (grammatically parsed)
    Design: One of 15 projects worldwide preparing different
      national / regional varieties of English; 200 written,
      300 spoken texts, various genres

Specialised corpora
• Corpus of Spoken Professional American Engl. [CSPAE] (1990s)
    Compiled at: Rice Univ, US
    Language: Spoken American English
    Size: 2 million (tagged)
    Design: Transcripts from professional settings (meetings,
      press conferences) by approximately 400 speakers,
      centred on activities tied to academics and politics

Learner corpora
• International Corpus of Learner English [ICLE] (Since 1990s)
    Compiled at: Louvain Centre for English Corpus Linguistics, Belg.
    Language: Engl. writing by learners from 19 mother tongue
      backgrounds, incl. Chi.
    Size: Over 2 million
    Design: Essay writing by advanced learners of English as a
      foreign language

Non-English monolingual corpora
• HK Cantonese Adult Corpus [HKCAC] (2000)
    Compiled at: Dept Speech & Hearing Sci’s, HKU
    Language: HK Cantonese
    Size: 170,000 characters
    Design: Spontaneous speech recorded from phone-in radio
      programs and forums, by 69 speakers

Multilingual / Parallel corpora
• International Telecommunications Corpus [ITU / CRATER] (1995)
    Compiled at: CRATER project (Corpus Resources & Terminology
      Extraction) Lanc U.
    Language: French, English and Spanish
    Size: 1 million tokens in each language (tagged)
    Design: Trilingual parallel corpus from telecommunications
      domain; aligned at sentence level

Some applications of corpus analysis

Language teaching & learning
  • Empirical teaching data – authentic examples of language use
  • Reference source – answering learners’ questions or explaining
    learner errors:
      “What’s the difference between ‘at last’ and ‘in the end’?”
      “How is ‘hardly’ used?”
  • Preparation of teaching materials – e.g. vocabulary lists, CLOZE tests
  • CALL; concordancing and data-driven learning

Translation
  • Using parallel texts to find suitable translation equivalents
  • Creation of translation databases or glossaries for domain-specific
    terminology, e.g. business, law, science
  • Exploring units of meaning in texts

Linguistics and language research
  • Lexicography & lexical studies – e.g. relative word frequency
  • Language variation – e.g. linguistic features across registers
  • Grammar – corpora used as data to test hypotheses, syntactic theory
  • Pragmatics & discourse – e.g. CA of discourse features in spoken
    (conversational) data
Exploring meaning,
units of meaning

Focus on meaning because:
  • People interested in the meanings of texts, in how language is
    actually used in discourse
  • Meaning is a key problem for translation, language learning,
    information management…

What are basic units of meaning?
  • Language teaching (TEFL): vocabulary often introduced in the
    form of new single words
  • Words considered to be basic units of meaning

Is the word an ideal unit of meaning?
  “… If you dog a dog during the dog days
  of summer, you’ll be a dog tired dog catcher…”
  “… Can I sit down? My dogs are barking…”
  • Most lexical errors made by language learners result from
    failure to deal with ambiguities of single words
‘Unambiguous
Units of Meaning’




Notion of an ‘Unambiguous Unit of Meaning’
necessary for understanding meaning
UUoM = keyword and all words in the context that
contribute to making the word unambiguous
Compounds, idioms, multi-word units, collocations,
set phrases
Often determined by a syntactic pattern

Adj + N
• friendly fire, closing remarks

V+N
• invite proposals, draw conclusions

Adv + A
• politically correct, environmentally friendly

N + of + N
• cause of death, proof of identity, code of practice, duty of care
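
The syntactic patterns above can be used to pull candidate units of meaning out of a part-of-speech-tagged text. The following is a minimal sketch, assuming the text has already been tagged; the simplified tag names (`ADJ`, `N`, `V`, `ADV`, `PREP`, `DET`), the function names and the toy tagged sentence are all hypothetical and do not correspond to any particular tagger or to the project's own tools.

```python
# Pattern items in upper case are tags; items in lower case are literal words.
PATTERNS = {
    "Adj + N": ["ADJ", "N"],
    "V + N": ["V", "N"],
    "Adv + A": ["ADV", "ADJ"],
    "N + of + N": ["N", "of", "N"],
}

def matches(item, token):
    """Check one pattern item against one (word, tag) token."""
    word, tag = token
    return tag == item if item.isupper() else word.lower() == item

def find_candidates(tagged, patterns=PATTERNS):
    """Return (pattern name, matched phrase) for every matching window."""
    found = []
    for name, pattern in patterns.items():
        n = len(pattern)
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if all(matches(p, t) for p, t in zip(pattern, window)):
                found.append((name, " ".join(w for w, _ in window)))
    return found

# Toy input with hypothetical tags.
tagged = [("the", "DET"), ("cause", "N"), ("of", "PREP"), ("death", "N"),
          ("was", "V"), ("friendly", "ADJ"), ("fire", "N")]
print(find_candidates(tagged))
```

A pattern match only yields a candidate: whether a sequence such as "friendly fire" really functions as a unit of meaning still has to be judged from its frequency and its contexts in the corpus.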
Case study

Search for units of meaning in online dictionaries and corpora



friendly fire
environmentally friendly
Corpora from 1990s

British National Corpus (BNC)
• 100,000,000+ words
• Written (90%)
• Extracts from regional/national newspapers, specialist periodicals, academic
books, popular fiction, un/published letters, memos, school/university essays
• Spoken (10%)
• Informal conversation, formal meetings (business, government), radio shows,
phone-ins

The Times (1995, Jan – March)
• 10,220,367 words
• Written : business, home news, readers’ letters, reviews

Corpora from 1960 - 1970s

Brown corpus / LOB corpus
• Each 1 million words
• Written, balanced corpora of 15 genres of text
Search results

                              BNC      The Times   Brown    LOB
                              [100M]   [10.2M]     [1 M]    [1 M]
“friendly”                    3952     363         61       55
“friendly fire”               37       1           0        0

“friendly fire” in the BNC, by form:
  [header] (no context)
    friendly fire             3
  [text] (phrase)
    so-called friendly fire   3
    ‘friendly fire’           18
    friendly fire             10
  [text] (word)
    so-called friendly-fire   1
    ‘friendly-fire’           1
    friendly-fire             1
The single occurrence in The Times is in [text] (phrase) form.

                                        BNC      The Times
“environmentally”                       692      44
“environmentally friendly”              205      23
  environmentally friendly (phrase)     155      23
  environmentally-friendly (word)       50       0
Brown, LOB: 0 occurrences of “environmentally friendly”

Online dictionaries consulted for both terms:
  • Wordnet 2.0
  • Dictionary.com
  • Encarta World English
  • Cambridge Advanced
  • Merriam-Webster Online
  • TigerNT Eng-Chi Online
  • Lexiconer Online Eng-Chi
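
Because the corpora differ so much in size, raw hit counts are easier to compare once normalised to a common base. The sketch below is a minimal illustration, reading the figures above as 205 hits for "environmentally friendly" in the BNC and 23 in The Times; the BNC size is taken as a round 100,000,000 words (the slides give "100,000,000+"), and the function name `per_million` is an assumption.

```python
def per_million(hits, corpus_size):
    """Normalise a raw hit count to occurrences per million words."""
    return hits / corpus_size * 1_000_000

# (hits for "environmentally friendly", corpus size in words)
# The BNC size is rounded to 100,000,000 for this illustration.
counts = {
    "BNC": (205, 100_000_000),
    "The Times": (23, 10_220_367),
}

rates = {name: per_million(hits, size) for name, (hits, size) in counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} per million words")
```

On this reading the two 1990s corpora show similar relative frequencies for the phrase, even though the raw counts differ by a factor of about nine.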
What the results show

‘friendly fire’, ‘environmentally friendly’
  • Represent fairly new concepts
  • Occur in the newer corpora (1990s) as units of meaning
  • Occur as entries in some of the online dictionaries only
    (not bilingual dictionaries)

New terminology and terms of common usage not
always recorded in dictionaries and termbanks

One way of using corpora for learning and
translation:
  • Use corpus evidence to help students recognise units of
    meaning; introduce notion of units of meaning into
    language learning
Aims of PULB project

To design and build an archive of language
corpora = ‘language bank’



To be used by staff and students in the
department
For teaching, language learning and research
purposes
To provide a user-friendly platform


A WWW interface via which users can freely
access the language bank
With browse, search and concordance facilities
Ingredients of PULB






Sources: standard corpora, departmental
collections
Medium: written texts, transcribed spoken
data
Language types: native speaker, learner
corpora
Languages: English, Chinese, Japanese,
French, German
Genres: business, law, academia, media,
social, literature
Target Size: 30 million
words (European) / characters (Asian)
Why a language bank?
- “What’s in it for us”

Free and simple shared access to a collection of language corpora

That you can utilise for your teaching
• Authentic examples of language use at your fingertips
• Empirical teaching data covering different specialisms (ESP, EAP)

That you can utilise for your research
• A ready-made collection of data waiting for you to work on
• Saving on time and resources

Way of incorporating new methods and information technology into
the department’s teaching and research activities




Increase students’ awareness of this rapidly developing methodology /
branch of language studies (corpus linguistics, corpora studies)
Way of integrating theory with technology in the classroom
Train students to be more computer-literate
All of the above can
• Motivate students to become active learners
• Help students to more effectively learn the target language (cf goals of DDL)
Similar existing projects

W3 Corpora Project (Essex)





http://clwww.essex.ac.uk/w3c/
Access to corpora (Gutenberg texts, LOB, LOB-tagged)
Web interface for performing searches
Online tutorial and info on corpus linguistics
Web Concordancer (VLC, PolyU)



http://vlc.polyu.edu.hk/concordance/
Access to variety of corpora and texts (bilingual/parallel
corpora, news, Bible, works of fiction)
Web interface for performing searches
Directions for PULB

Build a language bank with features that
parallel those of similar sites

~ VLC
• Bring together corpora and texts of various types and
genres, of different languages

~ Essex
• Make available different facilities for different
categories of users (cf. legal considerations)
• Provide on-site tutorial, corpora-based info

Include extra features


Allow searches in multiple texts / corpora
simultaneously
Some form of parallel concordancing
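
The two extra features can be pictured together in a small sketch: one search run over several sentence-aligned corpora at once, returning each hit with its aligned translation. This is only an illustration under stated assumptions, not the PULB implementation; the function name `parallel_concordance`, the corpus names and the English-French sentence pairs are invented toy data.

```python
import re

def parallel_concordance(corpora, keyword):
    """Search several sentence-aligned corpora at once.

    corpora maps a corpus name to a list of (source, target) sentence pairs.
    Returns (corpus name, source sentence, target sentence) for every pair
    whose source sentence contains keyword as a whole word.
    """
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    hits = []
    for name, pairs in corpora.items():
        for source, target in pairs:
            if pattern.search(source):
                hits.append((name, source, target))
    return hits

# Invented toy data; corpus names are hypothetical.
corpora = {
    "business": [
        ("The contract was signed yesterday.", "Le contrat a été signé hier."),
        ("The meeting starts at noon.", "La réunion commence à midi."),
    ],
    "legal": [
        ("Please read the contract carefully.",
         "Veuillez lire attentivement le contrat."),
    ],
}

for name, source, target in parallel_concordance(corpora, "contract"):
    print(f"[{name}] {source}  ||  {target}")
```

The sketch relies on the corpora already being aligned at sentence level; alignment itself is a separate step that is not shown here.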
Target composition of PULB
[Diagram: the PolyU Language Bank at the centre, surrounded by general corpora, spoken corpora, specialised corpora and learner corpora. Labelled components include Brown, ICE and BNC; the HK spoken corpus, conference speeches, academic presentations and social interactions; Business English (PUBC), Legal English, Academic English, Workplace English and English literature; business writing, teaching reflections and student work; and business, legal and literature corpora in Chinese, Japanese, French and German.]
Procedures (i)

Collate, sort, categorise data from
various sources
• Commercially available data
• Departmental collections, incl.
    PolyU Business Corpus (Li and Bilbow)
    Bilingual corpora (Xu)
    ESP / EAP corpora (Forey)
    Learner corpora (Sengupta)
    …
Procedures (ii)

For the departmental collections:

Decide how to present each collection


E.g. Sub-categories, macro categories
Clean up texts



E.g. Duplications of text samples
E.g. Structural features (headings, typographic features)
E.g. Personal information found in data
• To protect anonymity or privacy of authors and speakers

Annotate texts

Provide descriptive information about each corpus
• Compiler, time of compilation, type of collection…

Provide descriptive information about the texts
• Number, size, genre of subtexts
• Bibliographic info (written text)
• Ethnographic info (spoken data)

Provide structural information for texts if necessary
• Mark texts for paragraph boundaries etc…
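
Some of the clean-up and annotation steps above lend themselves to simple scripting. The sketch below illustrates three of them: dropping duplicated text samples, masking one kind of personal information (e-mail addresses), and marking paragraph boundaries. It is a minimal sketch, not the project's actual procedure; the function names, the `[EMAIL]` placeholder and the `<p>` markers are assumptions, and real anonymisation would need to cover names, addresses and other identifying details as well.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the text with case and whitespace normalised."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

def drop_duplicates(samples):
    """Keep the first copy of each text sample, in original order."""
    seen, unique = set(), []
    for sample in samples:
        key = fingerprint(sample)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique

# Deliberately simple e-mail pattern (an assumption, not a full validator).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def mask_emails(text, placeholder="[EMAIL]"):
    """Replace e-mail addresses with a placeholder to protect privacy."""
    return EMAIL.sub(placeholder, text)

def mark_paragraphs(text):
    """Wrap blank-line-separated paragraphs in <p>...</p> markers."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)

samples = ["Dear Sir,  thank you.", "dear sir, thank you.", "Other text"]
print(drop_duplicates(samples))
print(mask_emails("Contact jane.doe@example.org for details."))
print(mark_paragraphs("First para.\n\nSecond para.\n"))
```

Exact-match fingerprints only catch samples that are identical after normalisation; near-duplicates with small edits would need a fuzzier comparison.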
Procedures (iii)

Put corpora together on platform; set up search
and support facilities:




‘PULB map’
Browse facility
Search and concordance facilities
Tutorial / general information

Transplant PULB onto dept website for use by
staff and students

Promote PULB among corpora community

Data provider to data archives / distribution sites, e.g.
OLAC; ICAME
The PolyU Language Bank

Current status
• Range of corpora totalling 12M+ words
• Individual corpus descriptions
• Index of corpora
• Simple to use built-in concordancer
• Available at
  http://langbank.engl.polyu.edu.hk/
The PolyU Language Bank

Some of the currently available corpora







PolyU Business Corpus (Eng, Chi, Jap)
BNC Sampler Corpus (Spoken, Written)
Corpus of Multilingual Texts
Corpus of Nursing and Health Science Texts
Learner Corpus of Essays and Reports
HK Bilingual Corpus of Legal and Documentary
Texts
...
How you can contribute

Talk to us about your ideas

What would you like to see being incorporated into PULB?
• In terms of corpora
• In terms of search facilities and supplementary information





Can you think of other ways in which PULB can be organised
and structured?
How likely are you to make use of PULB in your teaching and
research?
Do you have any suggestions for corpus studies based on
available or potentially available corpora from PULB?
Do you know of similar projects being undertaken elsewhere
that we can learn from?
Talk to us about your collections / corpora



Do you have collections of language data from past research
projects that are (could be) presented as a corpus (corpora)?
Can we help you put your collections to good use?
Can we work together to incorporate your collections into
PULB?
Concluding remarks




Corpora represent a valuable but under-exploited
resource for teaching and research
PULB aims to bring together various corpora
under a single departmental archive, accessible
via WWW
You can help us by contributing your ideas
and/or your language collections
Please visit and test the PULB website at
http://langbank.engl.polyu.edu.hk/ and provide
us with feedback using the online evaluation
form
Thank you very much
Social grooming
CLOZE
PolyU Business Corpus


Compiled in 1999-2000 (Li & Bilbow)
Multilingual - comparable corpora:





English (c. 1.3 M words)
Chinese (c. 1.2 M words)
Japanese (c. 1.1 M words)
Business texts from: newspapers,
government reports, company reports
and brochures…
Has been used for creating a bilingual
English-Chinese business lexicon
PolyU Business Lexicon
Duplication
