The Corpógrafo Theory and Practice

Belinda Maia & Luís Sarmento
PoloFLUP
LINGUATECA
ABRAPT Mini-curso 30.08.04
A bit of history
• PALC ’97 – ‘Do-it-yourself corpora ... with a little bit of help from your friends!’
• CULT 1998 – ‘Making corpora – a learning process’
→ Contrastive linguistics
→ Corpora linguistics
→ Translation teaching
→ General > specific language
A bit of history
• 2000 – First Master’s in Terminology and Translation at FLUP
• PALC 2001 – ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
→ Specialized translation and terminology
→ Contact with domain experts
→ Importance of IT
→ Need for technical help for more ambitious students!
A bit of history
• LREC 2002 – ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
• 2002 – Second Master’s in Terminology and Translation at FLUP
→ Plea for help to Diana Santos
→ October 2002 – LINGUATECA Polo FLUP
LINGUATECA
• See http://www.linguateca.pt
• Leader > Diana Santos (SINTEF – Oslo)
• Objective - to create resources and tools for the computational processing of Portuguese
• Poles at Oslo, Lisbon, Braga and Porto
• Porto – Polo CLUP/FLUP
Polo CLUP/FLUP
• See http://www.linguateca.pt/poloclup/
• On-line suite of corpora tools to work with comparable corpora with emphasis on bilingual research
– Focus on special domains
– Construction of terminology databases, ontologies and domain models
Corpógrafo
Polo CLUP/FLUP
• See http://www.linguateca.pt/poloclup/
• General help in constructing resources specific to the needs of FLUP/CLUP
– For researchers, teachers and students
– For teaching methodology at FLUP
→ BNC & Reuter’s corpora on intranet
→ A small ‘chat’ corpus
More history
• 2003 – Poster of the GC at CL2003
• 2003 – ‘What are comparable corpora?’ at CL2003
• 2003 – Experimentation with evaluation of Machine Translation
• 2003 – Experimentation with GC
• 2003 – Third Master’s in Terminology and Translation at FLUP
GC – Integrated Web Environment for Corpora Linguistics
What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for Corpora-Based Linguistic Research. GC allows users to:
• access several Corpora tools from a single entry point using a regular web browser
• access and query generic Corpora (BNC, Reuter’s, COMPARA, CETEMPúblico)
• build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)
• use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, Filters, etc.)
• communicate and exchange results with other users
Motivation
• Lack of Comprehensive, wide-scope Corpora Tools
• Commercial Packages are usually difficult to Integrate/Customize
• Tools are not prepared to support cooperative work.
• Linguistic knowledge is not usually integrated in tools.
Internet Integration
GC provides seamless integration with the World Wide Web allowing users to:
• search specific Corpora resources on the Internet
• use available translation-engines in parallel.
• query the web for concordances
Developer’s Tasks:
• Integrate Existing Tools/Resources
• Develop Additional Generic Tools
• Interact with Users/Administrator
• Develop Custom Tools for particular research needs
Administrator’s Tasks:
• Users, Groups and Disk Quotas
• Corpora Taxonomy (see box)
• Documentation Organization
• Access Service Statistics
Teacher’s Tasks:
• Provide on-line tutorials
• Provide links to:
  • on-line teaching material
  • bibliography and other resources
Tool Pool
• Concordance Engine
• Corpora Bot
• Taggers
• Statistics
• Aligner (Semi-Auto)
• Custom Tools
Inter-User Communication
• Tagging and Aligning Cooperatively
• Messaging Service
• Exchange of Corpora Resources
[Architecture diagram: generic corpora (BNC, CETEMPúblico, COMPARA, Others) reached through Custom Interfaces; Tool Pool, Terminology DB, Terminology Extraction Tool (Auto/Semi-Auto), Internet and Inter-user Communication; roles DEV, ADM and USER; Virtual Desktop with Personal Corpora built from PS, TXT, RTF, HTML, PDF and DOC files]
Corpora Taxonomy
• Medium: written, spoken, multimedia
• Domain: Engineering, medicine, etc.
• Genre: scientific, technical, informative, etc.
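The three taxonomy axes can be pictured as a small record attached to each corpus. This is only an illustrative sketch: the class name `CorpusLabel`, the helper `group_by_domain` and the choice to validate only the medium are assumptions, not a description of how GC stores its taxonomy.

```python
from dataclasses import dataclass

# The medium values come from the taxonomy box. Domain and genre end in
# "etc." there, so they are kept as free text rather than a closed list.
MEDIA = {"written", "spoken", "multimedia"}


@dataclass(frozen=True)
class CorpusLabel:
    """Hypothetical record classifying one corpus along the three axes."""
    medium: str   # written, spoken or multimedia
    domain: str   # e.g. "engineering", "medicine"
    genre: str    # e.g. "scientific", "technical", "informative"

    def __post_init__(self):
        if self.medium not in MEDIA:
            raise ValueError(f"unknown medium: {self.medium!r}")


def group_by_domain(labels):
    """Map each domain to the list of labels filed under it."""
    groups = {}
    for label in labels:
        groups.setdefault(label.domain, []).append(label)
    return groups
```

Grouping by domain is shown because the later slides organize work by special domain; grouping by genre or medium would follow the same pattern.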
And then...
• PoloCLUP’s 3rd function:
• Evaluation of Machine Translation
– Experimentation with evaluation
– Teaching + research focus
• Results:
– TrAva – MT evaluation tool
– CorTA – Corpus of 1 EN input + 4 MT output sentences
Prescriptive v descriptive terminology
• Paper > digital form
• Static > dynamic resources
• ‘Democratization’ of terminology
• ISO standards > socioterminology
• Knowledge structures increasingly recognized as structured but dynamic - ask Gerhard Budin to explain this to you …
Perspectives of terminology users
• Domain experts and vested interests
• Translators
• Information retrieval
• Knowledge engineering
→ Standardized terminology
→ Getting the right word
→ Finding information
→ Perfecting Google
→ Structuring knowledge
→ Finding it fast
Bridging the Gap
• General linguists
• Translation teachers
• Translation students
• Corpus linguists
• Computational linguists
• Computer engineers
Computer-phobia
Computer-worship
The Corpógrafo combines:
• Terminology, translation and language study and research (Belinda)
• Terminology databases (Domain experts)
• Computational linguistics research and production of resources (Diana)
• Information retrieval and artificial intelligence (Luís)
= Discussions on priorities!
Corpora and Terminology
• Corpora as input
• Terminology extraction
• Terminology databases
• Structuring of domain knowledge
• Further corpora
[Diagram: Internet, Corpora, Corpora Analysis, Terminology Database; the label ‘Text details’ appears three times]
Working with the Corpógrafo
• Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
• All research done ONLINE
• Each username/password = separate space on our server
• At present > anyone can work with it using 10 MB space for FREE
• BUT - you get an empty space + tools + tutorial!
Terminology old v new
• Prescriptive > descriptive
• Paper > digital form
• Static > dynamic resources
• ‘Democratization’ of terminology
• ISO standards > socioterminology
• Knowledge structures increasingly recognized as structured but dynamic - ask Gerhard Budin to explain this to you …
Focus of Corpógrafo
• Design priorities are to:
– See the Big Picture
– Create the Overall Framework
– Get feedback from users to see their needs
– Develop according to real research needs
– Fill in the details and improve techniques as needed
Corpógrafo and special domains
• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography – Population Geography, Natural Hazards – Fire, Floods, Earthquakes, Coastal Erosion
– Medicine – Kidney support machines, Neurology
– Science – Genetics
– Technology – GPS (Global Positioning System)
Corpógrafo and terminology/translation research
• Ongoing dissertations on aspects of:
– Terminology – databases for different uses, neologisms, definition searches, semantic relations, conceptual analysis
– Corpora – text analysis, corpora construction
– Technical writing > Electrical Appliances
– Localization
– Terminology in documentaries
– Translation of Multimedia
Linguateca
• Linguateca’s policy - all resources and tools freely available online
• Primary users - Portuguese and Brazilian
Polo CLUP/FLUP
• Bi- or multi-lingual in interest
• Corpógrafo available for experiments on a small scale to the general public
• Possibilities of future work on projects with users from other universities and other countries
Contacts
If you are interested in finding out more, please contact me:
Belinda Maia
[email protected]
The Corpógrafo can be used (with a username and password) at:
http://www.linguateca.pt and
http://poloclup.linguateca.pt/ferramentas/gc
Corpógrafo
1. File Manager - area where each individual or group can:
– convert various text formats to .txt
– upload texts to their space on server
– ‘clean’ them of unnecessary material
– check tokenization and sentence divisions
– consult wordlists – alphabetical, frequency etc
– group texts into corpora
– register full information on source, domain and text type
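The tokenization, sentence-division and wordlist steps listed for the File Manager can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Corpógrafo's own code: the token pattern, the naive sentence splitter and all function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters (accented ones included),
# optionally joined by hyphens or apostrophes. Real tokenizers cover
# many more cases (numbers, abbreviations, clitics).
TOKEN = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens in order."""
    return TOKEN.findall(text.lower())


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def frequency_list(tokens):
    """(word, count) pairs, most frequent first, ties by code-point order."""
    return sorted(Counter(tokens).items(), key=lambda p: (-p[1], p[0]))


def alphabetical_list(tokens):
    """(word, count) pairs in code-point order (accented initials sort after z)."""
    return sorted(Counter(tokens).items())
```

The naive splitter breaks on abbreviations such as "etc.", which is one reason the slide lists checking tokenization and sentence divisions as a manual step.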
Corpógrafo
2. Corpora analysis area:
– Concordancing tools allowing for
  • KWIC concordancing
  • KWIC concordancing sorted according to word to left or right
– N-gram tool
  • N-grams
  • Term-candidates
    – With filters for PT
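KWIC (key word in context) concordancing, with sorting on the word to the left or right of the node, can be sketched as follows. The function names, the context width and the column width are assumptions for illustration; the slides do not describe how the Corpógrafo implements its concordancer.

```python
def kwic(tokens, keyword, width=4):
    """Return (left, node, right) context triples for each hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits


def sort_hits(hits, side="right"):
    """Order hits by the word immediately right or left of the node."""
    if side == "right":
        key = lambda h: h[2][:1]    # first word after the node, [] if none
    else:
        key = lambda h: h[0][-1:]   # last word before the node, [] if none
    return sorted(hits, key=key)


def format_hit(hit, col=30):
    """Render one hit with the left context right-aligned in a fixed column."""
    left, node, right = hit
    return f"{' '.join(left):>{col}}  {node}  {' '.join(right)}"
```

Sorting on the adjacent word groups recurring patterns together, which is what makes collocations and multi-word terms visible in a concordance.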
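The N-gram tool and its filtered term candidates can be illustrated with a frequency count plus a stopword filter. This is a sketch under assumptions: the tiny Portuguese stopword list, the rule "no stopword at either edge" and the function names are all hypothetical, and the actual PT filters in the Corpógrafo may work quite differently.

```python
from collections import Counter

# Tiny illustrative stopword list for Portuguese; an assumption, not the
# filter the Corpógrafo actually uses.
STOPWORDS_PT = {"a", "o", "as", "os", "de", "da", "do", "das", "dos",
                "e", "em", "um", "uma", "para", "por", "com", "que"}


def ngrams(tokens, n):
    """All contiguous n-token sequences, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_candidates(tokens, n=2, min_freq=2, stopwords=STOPWORDS_PT):
    """Frequent n-grams that neither start nor end with a stopword."""
    counts = Counter(ngrams(tokens, n))
    kept = [(gram, c) for gram, c in counts.items()
            if c >= min_freq
            and gram[0] not in stopwords
            and gram[-1] not in stopwords]
    return sorted(kept, key=lambda p: (-p[1], p[0]))
```

Stopwords are allowed inside a candidate so that terms built around a preposition, a common pattern in Portuguese, are not discarded.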
Corpógrafo
3. Terminology database
– Terms
– Definitions
– Examples
– Morphology
– Multilingual equivalents
– Sources and text details of corpora used
– Semantic relations – further complexity
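The database fields listed above map naturally onto a single record type. The sketch below is hypothetical: the class `TermEntry`, its attribute names and the representation of relations as (relation, term) pairs are assumptions made for illustration, not the Corpógrafo's schema.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """Hypothetical record mirroring the database fields listed above."""
    term: str
    language: str
    definitions: list = field(default_factory=list)
    examples: list = field(default_factory=list)     # sentences from the corpus
    morphology: str = ""                             # e.g. "noun, feminine"
    equivalents: dict = field(default_factory=dict)  # language code -> term
    sources: list = field(default_factory=list)      # text details of corpora used
    relations: list = field(default_factory=list)    # (relation, other term) pairs

    def add_equivalent(self, language, term):
        """Record the equivalent of this term in another language."""
        self.equivalents[language] = term

    def related(self, relation):
        """Terms linked to this entry by the given relation name."""
        return [other for rel, other in self.relations if rel == relation]
```

Keeping sources on the entry itself reflects the slide's point that text details of the corpora used travel with the term.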
Future developments – general policy
• General testing and improvement of the Corpógrafo
• Experimentation with ideas from other projects: e.g. WordNet, FrameNet
• Experimentation with theories of semantic primitives, human universals etc
• Development of new ideas or functions – using isomorphic relationships between researchers’ needs and our possibilities
Future developments – File Manager
• Creation of overall framework – perhaps UDC based – for:
– consultation of research available to public
– information on ongoing research
• Coordination of individual corpus projects into bigger projects, when possible or necessary
File Manager – theoretical questions
• Domain organization – UDC or ?
• Categorization of text by genre – how many genres?
• Reliability of texts from Internet – how does one guarantee quality?
• Is a translator or linguist able to distinguish a ‘good text’?
• Should the domain specialist choose the texts?
Corpora construction – theoretical questions / problems
• How large is a good domain corpus?
• No domain corpus will produce EVERY term in the area
• Comparable corpora v. Parallel corpora
• Aligning comparable corpora at term level
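One rough way to probe the corpus-size question is to track how many new word types each added text contributes. The sketch below is an illustration only, with a hypothetical function name; it is not a stopping rule proposed in these slides, and a flattening curve suggests diminishing returns rather than proving that every term has been found.

```python
def vocabulary_growth(texts):
    """For each text added in order, return (total tokens, total types, new types).

    `texts` is a list of token lists, one per document.
    """
    seen = set()
    total = 0
    curve = []
    for tokens in texts:
        new = set(tokens) - seen      # types not met in any earlier text
        seen |= new
        total += len(tokens)
        curve.append((total, len(seen), len(new)))
    return curve
```

The same bookkeeping could be run on term candidates instead of single words, which is closer to the question of term coverage raised above.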
Future developments – Corpora analysis
• Development of finer-grained concordancing
• Experimentation with finding definitions in context
• Semi-automatic creation of keyword shortlists for further text retrieval
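A keyword shortlist is often built by comparing word frequencies in the domain corpus against a general reference corpus. The sketch below uses a simple smoothed frequency ratio; that scoring method, the function name and the parameters are assumptions chosen for illustration, not the technique planned for the Corpógrafo.

```python
from collections import Counter


def keyword_shortlist(domain_tokens, reference_tokens, top=10, min_freq=2):
    """Rank words by how much more frequent they are in the domain corpus."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = len(domain_tokens), len(reference_tokens)
    scored = []
    for word, count in dom.items():
        if count < min_freq:
            continue
        # Add-one smoothing so words absent from the reference corpus
        # do not cause a division by zero.
        ratio = (count / n_dom) / ((ref[word] + 1) / (n_ref + 1))
        scored.append((word, ratio))
    scored.sort(key=lambda p: (-p[1], p[0]))
    return [word for word, _ in scored[:top]]
```

The output is only a shortlist: a person still prunes it before the words are used as search queries, which is the semi-automatic part.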
Corpora Analysis – theoretical questions
• How far can one rely on the computational linguist or computer engineer to produce analyses of corpora?
• If (semi-) automated processes produce 80% possible results, should the linguist / translator rubbish these processes?
• Can we leave it all to the computer engineer?
Future developments – terminology databases
• Refinement of terminology fields
• Development of further multi-lingual functions
• Development of organized and robust set of semantic relations
• Semi-automatic visualizing of semantic relations
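Visualizing semantic relations usually starts from a graph description that a drawing tool can render. The sketch below writes relations out as Graphviz DOT text; the choice of Graphviz, the function name and the triple format are assumptions, since the slides do not say which renderer is intended.

```python
def to_dot(relations, name="terms"):
    """Render (source, relation, target) triples as Graphviz DOT text.

    Assumes terms and relation names contain no double quotes.
    """
    lines = [f"digraph {name} {{"]
    for source, relation, target in relations:
        lines.append(f'  "{source}" -> "{target}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be saved to a file and passed to a DOT renderer; a terminologist would then correct the relations by hand, which keeps the process semi-automatic.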
Terminology databases – theory
• How much information does a database need?
• How much does the user of a database need?
• Is it reasonable to hope that all our databases could one day communicate with each other and help us with translation / information retrieval – or whatever?
How is the Corpógrafo being used at present?
• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography – Population Geography, Natural Hazards – Fire, Floods, Earthquakes, Coastal Erosion
– Medicine – Kidney support machines, Neurology
– Science – Genetics
– Translation and Localization
How is the Corpógrafo being used at present?
• Dissertations completed on:
– Definitions for different purposes + pedagogical glossary for Corrosion, Electrical engineering http://www.fe.up.pt/~cdm/QAE/QAE_gloss_b.htm
– Socioterminology – in the area of Composite Materials
– Graphical representation of Conceptual systems
– Terminology and Metaphors
– Football Metaphors
How is the Corpógrafo being used at present?
• Ongoing dissertations on aspects of:
– Terminology – databases for different uses, neologisms, conceptual analysis
– Corpora – text analysis, corpora construction
– Translation and localization terminology
– Technical writing > Electrical Appliances
– Terminology in documentaries
Pedagogical applications of the Corpógrafo
• Undergraduate courses – only possible if both teachers and students are trained to use it
• Postgraduate research
– Terminology and translation (Belinda + domain experts)
– Computational linguistics (Diana)
– Information retrieval (Luís)
• Long live team work!
To what extent is the Corpógrafo available to others?
• Linguateca’s policy is to make all resources and tools available online
• Primary users are expected to be Portuguese and Brazilian, as most of the resources and tools are for Portuguese
• PoloFLUP’s main objective – comparable corpora and terminology tools
To what extent is the Corpógrafo available to others?
• PoloFLUP is, by definition, bi- or multilingual in interest
• The Corpógrafo is therefore available for experiments on a small scale to the general public
• In the future – we hope to be able to work on projects with users from other universities and other countries