Metadata for sign language corpora
Background document for an ECHO workshop
May 8 + 9, 2003, Radboud University Nijmegen
http://www.let.ru.nl/sign-lang/echo/events.html
Onno Crasborn
Department of Linguistics, Radboud University Nijmegen
[email protected]
Thomas Hanke
Institute for German Sign Language, University of Hamburg
[email protected]
Version: Tuesday, January 12, 2010 (links updated)
0. Preparation for the workshop
The goal of the workshop is to acquaint you with the concept of metadata and the IMDI standard, and to
decide on a set of categories specific to our subfield of the language sciences. The list of categories in
this document (section 6) is a first proposal for such a set.
In order to prepare for the workshop, we would like to ask you to do four things.
1. Study this document and specifically the proposal in section 6, and take a look at the IMDI document
‘Metadata elements for session descriptions, draft proposal v3.02’.
2. Think carefully about the kinds of data collections (corpora) you have at your institute (and others
present in your country that differ in kind from your own corpora), and how they are currently
archived. Is there a database storing information about your video collection, for example, or do you
have a text document listing your tapes?
3. Evaluate the proposal in section 6, and identify which types of information you would like to store
but which are missing from the combination of the IMDI 3.02 proposal and the sign-specific and
deaf-specific additions proposed there.
4. In addition, evaluate the proposal in section 6 from the other side: which types of information would
you never use to describe your corpus, and are redundant from your point of view?
There are several information sources on the internet that provide further information if you still have
questions after reading this document, or which you could browse to get a better view of what metadata
are and what the IMDI standard is. The full description of categories in the IMDI 3.02 proposal is useful
to print and keep at hand for looking up details of the categories whenever you need a more precise
definition, but it is not intended as a text to read from start to end. The two manuals for the IMDI
BCBrowser and the IMDI Editor (MS Word documents), the software tools presently available for
working with the IMDI metadata set, do however give an accessible introduction to the structure of
IMDI. A glossary of some of the technical terms used in the various documents can be found on the
ISLE web site.
(The full links can be found in section 7 at the end of this document.)
1. What is the problem?
Now that sign language research in many European countries is expanding at a rapid pace, data
collections are also growing considerably. Sign language data typically consist of video recordings, but in
many cases are accompanied by transcriptions (whether on paper or in computer format), lexical or
phonological analyses stored in databases, etc. In some cases, sign language data consist of numeric
information such as the outputs of eye tracking systems or movement tracking equipment.
At the same time, since the ESF InterSign workshops in the late 1990s, the development of computer
technology has made it possible to digitize video using a simple desktop computer, and more importantly,
to store many dozens of hours of digitized video even on a local hard disk. A considerable effort in terms
of time is required to digitize existing corpora of analogue video recordings, but it is likely that more and
more institutes will actually do this within the next few years. Digitized video files will be stored on a
server in the institute, and made accessible over the local intranet. The advantages are enormous: video
recordings will no longer degrade as analogue tapes did, and any segment of any recording is accessible
within a few seconds for any member of the team. By storing annotation files on a server as well,
colleagues can incrementally add annotations based on the original media. Even more importantly in the
European context, it will become possible to provide access to data to collaborators in other locations
within the country or in other countries. EU funding will keep emphasizing collaboration between
institutes in different member states, and being able to share data over the internet can be a key factor in
successful collaboration.
However, all these possibilities do require a good way of cataloguing the available data. Data will only be
used in practice if they can be searched in some way, whether locally or from the other side of the world.
If students or new team members still have to go to the local long-time team member to ask whether there
might be a recording of young children with deaf parents, for example, not too much time will actually be
gained. It is the experience of many people that the video data collected over the years (in the case of
sign language, typically a period stretching from 5 to even 25 years) are so extensive and contain data for
so many different types of studies that no one has a full overview of what there is.
To take a modest (that is, relatively small) example, consider the video corpus collected for phonological
studies at the universities of Leiden and Nijmegen since 1994. About 90 hours of analogue video material
are available, which can be divided into 5 broad categories. Three categories consist of well-defined
elicited and spontaneous recordings for specific research projects in the area of phonetics and phonology
(about 50 hours). One category consists of a very disparate set of video recordings made for research
purposes of various kinds, ranging from 5 to 30 minutes per session (about 10 hours in total). The fifth
category (30 hours) comprises sign language tapes that are commercially available (mostly NGT stories
for children), a large set of TV recordings featuring signers or discussing deaf issues (news items,
documentaries and movies), and data from other researchers in various countries. Since the first three
categories were collected by ourselves for our own research, we have a reasonable view of the nature of
the data on these tapes. This is not true for the latter two categories: the recordings are simply too
divergent for us to recollect everything they contain. The small database application we created to list the
tapes we have (describing tape number, title, short description, recording date, type of tape) is not detailed
enough to look up whether we have interactional data by informants from the north of the country, for
example. For new colleagues, the whole collection is opaque. The Dutch Science Foundation has
provided a grant for the archiving of all the analogue tapes on digital tapes or in computer files, but if we
do not use a systematic way to describe the content of the tapes at the same time, the recordings are
bound to stay in the closet, even though they contain a lot of very valuable material that can be used for
many different projects. The need for proper cataloguing will actually increase substantially when video
tapes are digitized, because in many cases an hour of analogue video recording is naturally split up into
multiple digital movie files (one for each elicitation task, for example). In the present example, the 90
hours of analogue video data are expected to result in about 1,000 digital movie files.
The answer to the cataloguing problem is the use of ‘metadata’: descriptions at a general level of the
nature of the data that can be considered constant for a whole recording (as opposed to linguistic
annotation or analysis of the data themselves that describes the actions unfolding in time). The workshop
aims to expand one of the most elaborate metadata proposals for describing spoken language corpora
(IMDI) to cover sign language data collections as well. This document provides an introduction to the
concept of metadata (section 2), presents the IMDI standard and associated software (section 3), discusses
how you can create and distribute your own collections of metadata (section 4), and makes a proposal for
an extension of the IMDI set that can be discussed at the workshop (sections 5 and 6).
The workshop is part of the ‘ECHO’ project, which started in late 2002 and will end in 2004. ECHO is an
acronym for European Cultural Heritage Online, a large pilot study funded by the EU to explore how the
internet can be used to make data in different subdisciplines of the humanities accessible to other
researchers and to a wider public. ECHO aims to make an inventory of existing data collection efforts, to
explore technological developments that are called for, and to carry out a series of case studies in five
different disciplines to show the potential of the technology. The case study on language concerns sign
language. A small set of comparable data is being collected for British Sign Language, Swedish Sign
Language, and Sign Language of the Netherlands. Data will be annotated using a very general set of
transcription conventions (see http://www.let.ru.nl/sign-lang/echo/documents.html) using the ELAN
annotation software, and published on the internet to be used and/or further annotated for linguistic
research by anyone in the world.
2. What are metadata?
Metadata can be described as ‘data about data’, or ‘information about an information resource’. When
we think of the field of linguistics, data typically consist of audio or video recordings, transcriptions
associated with audio or video sources, lexica, etc.[1] Transcriptions or other analytical procedures directly
reflect the linguistic content, describing form and meaning at various linguistic levels. An important
aspect of transcriptions is the dynamic nature of this type of data: it changes with time.
As opposed to these linguistic data, we can also characterize our sources in terms of what one might call
‘administrative data’. These metadata include information about the date and location of recording,
information about the nature of the recording (spontaneous vs. elicited data), elicitation methods, details
about the people participating in the recording, etc.
This separation between data and metadata may appear not to be sharply defined. This is correct: the
exact boundary is often determined by pragmatic considerations of data accessibility and data file format.
What is important is that the metadata characterise the “content” data in a useful and consistent way, so
that meaningful results appear when one is looking for data or resources in a situation where only the
metadata are available.
[1] Examples of further types of data in linguistic research are dataglove signals, laryngograph signals,
and eye tracking signals.
[Figure: a metadata description (recording date, media type, elicitation method, register, access rights,
annotation type, annotator, etc.) characterises both the media data (VHS tape, MPEG-1 file, WAV file,
etc.) and the linguistic annotation data (glossing, HamNoSys transcription, etc.).]
For instance, information about the register used by a signer can be a property of the whole recording
session (such as when a lecture is being presented in a formal register), but if register variation was the
topic of the study and different register versions of the same sentence are being produced, a description of
the register of each sentence would feature in the transcription of the data. Some overlap between the
metadata descriptions and the linguistic transcriptions as in this example is unavoidable. In general
however, the two domains feature a description in terms of two distinct sets of categories.
However, this does not mean that users do not want to search for data combining categories from the two
domains. For example, one can easily imagine queries of the following types:
• How many instances of the sign MONDAY [data] are there in my NGT corpus [metadata] produced by
women [metadata] aged over 50 years [metadata]?
• What are the sentences in our corpus of hearing DGS interpreters [metadata] signing to a Deaf audience
[metadata] from the north of the country [metadata] where the signer uses raised eyebrows [data] in
non-question contexts [data]?
Combining searches in this way should be a possible function of the software used for retrieving data and
metadata, and is not excluded in principle by storing some information in the metadata description and
other information in the linguistic annotation.[2]
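Purely to illustrate the principle, the following minimal Python sketch implements the first of these query
types, assuming (hypothetically) that the relevant metadata values and the gloss annotations of each
session have already been loaded into simple records. None of the names below come from the IMDI
tools, and, as footnote [2] explains, combined searches of this kind are not yet supported by the existing
software.

from dataclasses import dataclass, field

@dataclass
class Session:
    language: str                # metadata: e.g. Content.Languages
    actor_sex: str               # metadata: Actor.Sex
    actor_age: int               # metadata: Actor.Age
    glosses: list[str] = field(default_factory=list)  # data: a gloss annotation tier

# Invented example sessions, standing in for a corpus loaded from disk.
sessions = [
    Session("NGT", "female", 56, ["MONDAY", "SCHOOL", "MONDAY"]),
    Session("NGT", "male", 34, ["MONDAY", "HOUSE"]),
    Session("BSL", "female", 61, ["MONDAY"]),
]

# "How many instances of the sign MONDAY [data] are there in my NGT corpus
#  [metadata] produced by women [metadata] aged over 50 years [metadata]?"
count = sum(
    session.glosses.count("MONDAY")        # data condition
    for session in sessions
    if session.language == "NGT"           # metadata conditions
    and session.actor_sex == "female"
    and session.actor_age > 50
)
print(count)  # prints 2 for the toy corpus above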
For a more elaborate and more technical introduction to metadata, see the technical report created for
ECHO Workpackage 2.
[2] At this moment, the IMDI tools are specifically designed to work with ELAN transcription files, since
files for both can be accessed over the internet. ELAN is a software tool with functionality similar to
SignStream and syncWRITER, with the main difference that it is designed to work with files over a
network, so that data can easily be accessed by different members of the research team, or by people in
different countries. This network functionality is central to the ECHO project, which aims to make
scientific data available online.
Browsing a corpus using the IMDI tools provides users with direct links to the ELAN annotation and video
files (provided they are granted access to the data), but it is not yet possible to do searches combining
queries from both the data and the metadata domain. This will likely be developed in the near future.
3. What is the IMDI standard?
One of the most detailed and most universal proposals for a set of metadata descriptions for linguistic
corpora resulted from the ISLE project (International Standards for Language Engineering), and is called
IMDI (ISLE Metadata Initiative). It is the result of a series of meetings and discussions by a large
group of spoken language researchers and language engineers, and forms one of the cornerstones of the
ECHO project. Workpackage 1 of the ECHO project aims to make an inventory of existing metadata
corpora and to create a number of new ones for various languages.
IMDI comprises different sets of metadata, to describe different things:
• Session metadata: descriptions of combinations of media files and linguistic annotation files.
• Catalogue metadata: descriptions, on a more abstract level, of the corpus as a whole.
• Lexicon metadata: descriptions of lexicons (still under development).
This document and the sign language workshop in May 2003 only discuss metadata elements to describe
sessions or resource bundles (a name that is also used in domains and situations where the term “Session”
has specific other connotations). Sessions are intuitively natural units combining media files (video and/or
audio) with a linguistic annotation file (ELAN, SignStream, etc.). For example, if a data collection on a single
video tape contains the Frog Story signed by different informants, each informant’s rendering of the story
would be considered a separate session. In this way, it is possible to describe and search for every piece
of the source in a unique and extensive way. Each session is described by one IMDI file. Different
sessions can be combined in larger units (subcorpora, e.g. the ‘Frog Story Corpus’), which in turn
constitute larger corpora (e.g. the ‘Nijmegen Sign Language Corpus’).
It is important to make a distinction between the IMDI standard, which is the topic of the workshop, and
the software tools currently available to use the IMDI standard. While the set of elements used to describe
corpora should be relatively stable from the outset, the software to enter and retrieve these elements does
not necessarily have to be standardized, and can more easily be changed in the future. Currently, the only
available software tools to enter and retrieve information from IMDI files are the ones available on the
IMDI web site, developed by the Max Planck Institute for Psycholinguistics. Because of the standard
markup text format of the files (actually, XML), anyone is free to develop software for specific needs that
are not covered by the MPI software. This situation is comparable to standardized transcription systems:
the definition of the IPA or HamNoSys notation systems, or for example glossing conventions, is
independent of their use in a computer program such as MS Word or SignStream.
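Because the files are plain XML, a few lines of code with any standard XML library suffice to extract
information for special-purpose tools. The sketch below uses Python's standard library on a deliberately
simplified session file; the element layout is a placeholder and does not reproduce the exact IMDI 3.02
schema or its namespace.

import xml.etree.ElementTree as ET

# A deliberately simplified stand-in for an IMDI session file.
imdi_xml = """
<Session>
  <Name>frogstory_signer03</Name>
  <Actors>
    <Actor><Name>A. Example</Name><Role>Signer</Role></Actor>
  </Actors>
</Session>
"""

root = ET.fromstring(imdi_xml)
print(root.findtext("Name"))  # -> frogstory_signer03
for actor in root.findall("./Actors/Actor"):
    print(actor.findtext("Name"), "-", actor.findtext("Role"))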
The IMDI elements for session descriptions are grouped into seven broad sets of fixed elements:
1. Session: bundles all information about the circumstances and conditions of the linguistic event,
groups the resources (e.g., video files and annotation files) belonging to this event, and records the
administrative information for the event.
2. Project: information about the project for which the sessions were originally created.
3. Collector: name and contact information for the person who collected the session.
4. Content: a set of categories describing the intellectual content of the session.
5. Actors: names, roles and further information about the people involved in the session.
6. Resources: information about the media files, such as URL, size, etc.
7. References: citations and URLs to relevant publications and other archive resources.
A full description of the IMDI elements can be found in the IMDI 3.02 proposal. Another way to obtain a
quick overview of the available elements and their use is to open one of the IMDI descriptions of the
ECHO sign language corpus, which can be found on the workshop web site. If you open this file in the
IMDI editor, you get a good impression of the distinctions included in the current (3.02) version of the
IMDI proposal for session descriptions.
The available elements form a broad set that is shared between different subdisciplines of language
studies. To allow for the encoding of categories that are specific to a given subdiscipline, such as sign
language studies, or even specific to a single research project, there are several places in the IMDI
scheme where users can add fields of their own: so-called ‘key-value pairs’. Key-value pairs
are available at the following locations in the IMDI element set:
• Session
• Content
• Actor
• Language
In each location, an unbounded number of ‘keys’ can be created to describe project-specific
characteristics of the session. For example, the present standard does not incorporate any fields that are
specific to the sign language community, such as whether an actor is deaf or hearing. The goal of the
workshop is to agree on a first proposal for a set of key-value pairs that will be used by the sign
language field as a whole. In addition to this ‘standard extension’ (or ‘profile’), a sign language
researcher is still free to add extra fields that describe properties of the session that are not in the IMDI
3.02 or in the sign language set.
In addition to the key-value pairs, the IMDI 3.02 proposal also includes a number of ‘description’
elements. These description elements are intended to give a more elaborate prose description in any
language of different aspects of the session. The content of these description elements is not controlled,
and although such descriptions can be very useful when browsing the corpus, the variation in their content
over many thousands of sessions makes searching for words in these elements an unreliable way to find
very specific information. For this reason, it is better to agree on a standard key element for encoding the
deafness of an actor than to always add the text ‘The actor became deaf at age 16’ to the
Actor.Description element. A proposal for a set of keys for the sign
language field is presented in section 6 of this document and forms the main item for discussion at the
workshop.
The kind of information that can be stored in a field, the ‘vocabulary’, can be of different types, as is
summarized in the following table.
Abbrev. | Type | Description | Example element
str | string | free string of text | Actor.Name
c | constrained | the text that can be entered is constrained in some way | Actor.Age
ov | open vocabulary | the content of the element can be chosen from a list, or new items can be added by the user | Resources.MediaFile.Format
ovl | open vocabulary list | the content of the element can be chosen from a near exhaustive list | Content.Genre.Interactional
ccv | closed controlled vocabulary | the content of the element has to be chosen from a restricted and closed set of items | Session.Location.Continent
These vocabulary types and the abbreviations are also used in the proposal in section 6.
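For illustration, a tool consuming these descriptions could enforce a closed controlled vocabulary (ccv)
with a simple membership test, as in the following sketch using the values proposed for the key ‘Hearing
Status.Hearing’ in section 6; an open vocabulary (ov) would instead also accept new items.

# Enforcing a closed controlled vocabulary (ccv) with a membership test.
HEARING_CCV = {"hearing", "hard-of-hearing", "deaf"}

def check_ccv_value(value: str) -> str:
    """Return the value if it is in the closed vocabulary, else raise."""
    if value not in HEARING_CCV:
        raise ValueError(f"{value!r} is not in the closed vocabulary {sorted(HEARING_CCV)}")
    return value

print(check_ccv_value("deaf"))       # accepted
# check_ccv_value("a little deaf")   # would raise ValueError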
4. How can one create metadata descriptions and make them accessible to others?
As mentioned, an editor for IMDI metadata creation is available from the MPI free of charge. For people
who need to start the metadata collection task from scratch, this certainly is the easiest way to go. The
editor is a Java program that can be run on a variety of platforms (Windows 98, Windows XP, Mac OS X,
etc.), and it will be updated to accommodate future changes of the standard. The program automatically
takes care of vocabulary maintenance, among other tasks. The situation might be different, however, if
you have already collected a substantial amount of metadata.
We therefore analysed how the existing metadata resources for the Nijmegen and Hamburg corpora could
be transferred to IMDI resources.
For the Nijmegen data, it turned out that a variety of formats were used for the metadata, and that
automatic transfer into IMDI is not worth the effort. The existing information will therefore be transferred
manually to IMDI by copying & pasting into the IMDI editor when new sessions are created.
For the Hamburg data, a fair part of the information contained in the IMDI standard is already available
in the institute’s transcription database. Other bits of information are kept in a variety of formats, such as
free-format text documents, spreadsheets, and transcriptions of interviews on the sociographic
background of the informants. As it is a goal in Hamburg to store all relevant data in one database, the
following three-part strategy was chosen.
• Data already available in the Hamburg database will be made available by means of XML exports.
This is a fairly easy task, to be completed in Summer 2003. It will already give an overview of the
corpora, although not all fields will be filled in as appropriate.
• In some cases, the suggested standard requires more detailed vocabularies than currently provided
by the Hamburg database. As a short-term solution, a mapping will be provided from the
Hamburg data to the IMDI vocabularies. In the long run, the database will be changed to directly
correspond to the IMDI value sets. Of course, this requires work by the researchers who
conducted the earlier corpus collection tasks, and will therefore take a substantial amount of time
to complete.
• The Hamburg database will be extended to cover the metadata suggested in the IMDI proposal
but not yet covered in the database. This shall be completed in the course of 2003. Where the data
are readily available in other formats, they will be entered into the database as part of routine
database maintenance tasks. A few exceptions, however, will require intervention by the
principal researchers and can therefore not be expected to be finished within a year.
The situation of your metadata may require a similar or yet another approach. We suggest that you start
the analysis by trying to match your current metadata sets with the IMDI proposal field by field, to see
which approach best fits your needs.
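As an illustration of the short-term mapping strategy mentioned above for the Hamburg data, the sketch
below translates one legacy database record into IMDI-style key-value pairs. All legacy field names and
values are invented for the example and do not reflect the actual Hamburg database.

# Hypothetical mapping from legacy database values to an IMDI-style
# controlled vocabulary (here: the proposed Hearing Status.Hearing values).
HEARING_STATUS_MAP = {
    "gl": "deaf",             # legacy abbreviation, e.g. German 'gehoerlos'
    "sh": "hard-of-hearing",  # e.g. 'schwerhoerig'
    "h": "hearing",
}

def map_record(legacy: dict) -> dict:
    """Translate one legacy actor record into IMDI-style key-value pairs."""
    return {
        "Hearing Status.Hearing": HEARING_STATUS_MAP[legacy["hstatus"]],
        "Dialect Background.Raised at": legacy["hometown"],
    }

print(map_record({"hstatus": "gl", "hometown": "Hamburg"}))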
In principle, these metadata descriptions can be kept in-house for use by close colleagues. However, the
ECHO program strongly promotes the sharing of metadata corpora, without the concomitant need or
obligation to share the corresponding data as well. Within IMDI, one can indicate to what extent the data
described by the metadata are available to the outside world. Even if not a single videotape is accessible
to outside users, sharing metadata alone can already be very informative: researchers get an impression of
aspects of the methodology of different projects, such as elicitation methods, and can get in touch with
the researcher responsible for the data collection to discuss methodological aspects of the research project
in question.
IMDI metadata descriptions can be shared by publishing them on an HTTP (web) server, and distributing
the address of the top node of the corpus.
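Any ordinary web server can be used for this; no special software is needed on the server side. For a
quick local test one could, for example, use the HTTP server that ships with Python, run from the
directory containing the corpus (the port and file names below are arbitrary).

# Serve the directory tree containing the .imdi files over HTTP.
# Colleagues can then point their IMDI browser at, e.g.,
# http://yourserver:8000/TopNode.imdi (file name hypothetical).
from http.server import HTTPServer, SimpleHTTPRequestHandler

HTTPServer(("", 8000), SimpleHTTPRequestHandler).serve_forever()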
5. What needs to be discussed at the workshop?
There are several sign language specific properties of linguistic data that are not covered by the IMDI
3.02 metadata set described above. The IMDI standard provides a simple mechanism for extending its
scope: key-value pairs. Most major groups of the session description include a sub-schema for key-value
pairs. To store information here, you add a pair, give it a key comparable to schema elements and choose
a value from a vocabulary to be assigned to the key.
The following section proposes a set of key-value pairs for some of the most apparent properties of sign
language corpora that cannot be described by the IMDI 3.02 proposal for session descriptions. The
proposed keys mainly specify the background of the actors, including their hearing status, sign language
skills, school background, etc. The goal of the workshop is to decide on a list of keys and
vocabularies, based on the present proposal. It is important to realize that, just as in the standard set of
IMDI elements, not all elements are relevant for all sessions. The sign language set should include
information that applies to the average sign language session. For example, sessions with sign language
acquisition materials should be described with keys from both the sign language and the acquisition
fields. Furthermore, anyone is always free to add extra keys specific to one or more subcorpora.
As we just indicated, the goal of the workshop is to decide on a set of IMDI extensions (keys); this means
that you have to think carefully about three questions:
1. Are there any categories missing in the proposal below?
2. Which of the categories below would you never use?
3. Are the proposals for vocabularies for each element well chosen?
It will be impossible to think of all possible scenarios just by looking at paper lists. Describing actual
sessions with the IMDI tools will no doubt lead to further refinements, which is exactly what happened to
the IMDI standard set when it was used by different groups of users.
6. A proposal: a set of key-value pairs to store sign language specific properties
As explained in section 5, the IMDI standard provides a simple mechanism for extending its scope:
key-value pairs in every major section of the session description. This results in a slightly less structured
representation than what could be achieved with an extension of the standard itself, but it allows for a
quick start, with the chance to later formally propose an extension of the standard promoting the keys to
schema elements. In order to simulate the concept of element groups, we suggest key names with dots for
the moment, such as ‘Hearing Status.Hearing’ for ‘Actor.keys’.
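To make the convention concrete, the sketch below generates such dotted key names as key-value pairs in
XML. The <Keys>/<Key Name="..."> layout is a simplification for illustration and may not match the
IMDI schema in every detail.

import xml.etree.ElementTree as ET

# Build a simplified <Keys> block with dotted key names simulating groups.
keys = ET.Element("Keys")
for name, value in [
    ("Hearing Status.Hearing", "deaf"),
    ("Hearing Status.Aid Use", "sometimes"),
]:
    ET.SubElement(keys, "Key", Name=name).text = value

print(ET.tostring(keys, encoding="unicode"))
# prints (on one line):
# <Keys><Key Name="Hearing Status.Hearing">deaf</Key><Key Name="Hearing Status.Aid Use">sometimes</Key></Keys>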
The most important future development of the IMDI tools from the perspective of this proposal concerns
the creation of ‘profiles’. Profiles will contain sets of key-value pairs specific for different subgroups of
users, such as the sign language community. At this moment we can simulate and share a sort of profile
by making and sharing a ‘master document’ such as the one that can be found on the workshop web page
(from early May on). Ideally, one would be able to choose one or more profiles from a list within the
IMDI editor. This will be developed in the near future.
If the set of sign language extensions is seen as a useful and stable set by the sign language community,
the ‘sign language profile’ can perhaps be given a more flexible layout in the IMDI editor and browser.
Numbers refer to paragraphs in the IMDI 3.02 proposal that already provide space for additional
information.
3 Metadata element definitions
3.1 Session
3.1.10 Session . keys
As there is no place for keys in the ‘Resources’ section of IMDI, the following group has to be placed
here. In a new version of the IMDI standard and the associated tools, it will be possible to specify keys
in the Resources section as well.
Multiple Cameras
Group
Definition: Properties of the use and combination of multiple cameras.
Encoding: Multiple Cameras . Number
Multiple Cameras . Layout
Multiple Cameras . Viewpoint
Multiple Cameras . Focus
Comments: A description of the use of multiple cameras during the recording, and how they appear in
the media files of this session. For now, the situation in the sources is assumed to be similar;
ideally, it will become possible to encode this information for each resource (analogue tape,
digital file) separately by means of keys for each resource.
Multiple Cameras . Number
Definition: Number of video cameras used to record the session.
Encoding: c
Comments:
Multiple Cameras . Layout
Definition: Appearance of the video cameras when they are combined in one media file.
Encoding: OV: side-by-side / insert in top-left corner / insert in bottom-left corner / insert in
top-right corner / insert in bottom-right corner
Comments: If the views of the different cameras are not combined in either the source or the
media files, this field is left empty.
Multiple Cameras . Viewpoint
Definition: Viewpoints of the different cameras.
Encoding: string
Comments: Until a set of keys becomes available for the Resource section, people are
encouraged to use a systematic way of entering this information in a single field.
We propose to use a number or letter for the camera, followed by a colon, followed
by a string from the set { front, side, diagonal, top, ... }, for example ‘1: front; 2: side’.
Multiple Cameras . Focus
Definition: What each camera focuses on (the part of the signer or scene in view).
Encoding: string
Comments: Until a set of keys becomes available for the Resource section, people are
encouraged to use a systematic way of entering this information in a single field.
We propose to use a number or letter for the camera, followed by a colon, followed
by a string from the set { upper body, whole body, face, mouth, ... }, for example
‘1: whole body; 2: face’.
3.3 Content
3.3.6 Content . Languages
3.3.6.2 Content . Languages . Description
Space for describing, in prose, the code mixing, sign supported speech, etc. used in this session.
Do we need separate keys for describing code mixing and code switching between different languages or
modalities?
3.3.7 Content . keys
Language Variety
Definition: Description of the language variety used in the session.
Encoding: string
Comments: Space for a more constrained description of the language variety used in this session.
Information about language skills of the individuals should be entered in the actor’s
description (cf. 3.4.2.15 Actor . keys).
Elicitation Method
Definition: A characterization of specific prompts used for eliciting language production.
Encoding: OV: single picture prompt / picture story prompt / written language prompt / sign
language prompt / video prompt.
Comments: When working on the influence of German on DGS compounding, for example, it is
essential to know whether spoken language competence has been activated by the
elicitation situation.
Content . Task might be appropriate for this purpose, but its open vocabulary seems
to suggest different levels of detail: while ‘Wizard of Oz’ is certainly not related to
the utterance’s topic, some other values are, such as ‘room reservation’. ‘Frog story’
could almost carry a (TM): it is well known enough to name both the contents and
the elicitation method.
Content . Involvement would be a good place, if it were open vocabulary.
Interpreting
Group
Definition: Properties of interpreting appearing in the session.
Encoding: Interpreting . Source
Interpreting . Target
Interpreting . Interpreter Name [move to Actor?]
Interpreting . Visibility
Interpreting . Audience
Comments:
Interpreting . Source
Definition: Source modality and language type.
Encoding: OVL: sign language, speech, sign supported speech, text, fingerspelling
Comments:
Interpreting . Target
Definition: Target modality and language type.
Encoding: OVL: sign language, speech, sign supported speech, text (subtitling), fingerspelling
Comments:
Interpreting . Interpreter Name
Definition: Name of the interpreter(s).
Encoding: string
Comments: Use the name or ‘unknown’.
Should this perhaps be a reference to an Actor entry for details?
Interpreting . Visibility
Definition: Visibility of the interpreter in the video recordings.
Encoding: CCV: not visible / in view during whole session / in view during part of session
Comments:
Interpreting . Audience
Definition: Presence and nature of an audience that the interpreter is signing for.
Encoding: CCV: audience not present (signing to camera) / audience known to the interpreter /
heterogeneous group partly known to the interpreter / anonymous audience (e.g.
theatre)
Comments: If Interpreting.Target = subtitling, leave field empty.
3.4 Actors
3.4.2 Actor
3.4.2.15 Actor . keys
We propose to add a number of keys describing different aspects of the actors, mainly to characterize the
language background. All of these keys refer to relatively stable properties (skills) of the actors, not
to their actual behavior in the specific session at hand.
Note: descriptions of groups of keys are aligned with the left margin; descriptions of elements are all
indented. The other formatting of the descriptions follows the IMDI documents. Keys that are further
specified by a set of keys are followed by “(sub)” in the lists.
General comment: most of the subjective data could be paralleled with “objective” data, such as ‘dB left’
and ‘dB right’ for the item ‘hearing’, scores in language competence tests, etc. Is this needed? Does
anyone have suggestions for specific fields and values that are often measured in your corpora?
Actor keys
Group
Encoding: Dialect
Dialect Background (sub)
Hearing Status (sub)
Sign Competence (sub)
Sign Systems Use (sub)
Spoken Language Competence (sub)
Family (sub)
Family . Children (sub)
Deaf Contacts (sub)
Education (sub)
Comments: Stable properties (skills) of the actor, not their actual use in a given session.
Dialect
Definition: Name of the dialect the actor uses.
Encoding: string or OV
Comments: Information on language variety, assuming a priori knowledge about the dialects of a
specific sign language.
Dialect Background
Group
Encoding: Dialect Background . Raised at
Dialect Background . Living in
Dialect Background . Local since
Comments: Groups information on language variety, without a priori knowledge of what dialects exist.
Dialect Background . Raised at
Definition: Town or region where the actor lived at the language acquisition age.
Encoding: string or OV
Comments:
Dialect Background . Living in
Definition: Town or region where the actor lived at the time of the recording.
Encoding: string or OV
Comments:
Dialect Background . Local since
Definition: Year the actor moved to the current place of residence.
Encoding: c
Comments:
Hearing Status
Group
Definition: Groups information about the hearing status of the actor. Only the first element is relevant
for all actors; the other elements specify details about hearing loss.
Encoding: Hearing Status . Hearing
Hearing Status . Hearing Rests
Hearing Status . Aid Type
Hearing Status . Aid Use
Comments:
Hearing Status . Hearing
Definition: Actor’s ability to hear.
Encoding: CCV: hearing / hard-of-hearing / deaf
Comments:
Hearing Status . Hearing Rests
Definition: Description of the types of acoustic signals the actor can still perceive.
Encoding: CCV:
I can hear the phone signals (busy signal etc.)
I can hear voices
I can understand a bit of what people are saying.
Comments: These are subjective data; are objective data needed? Is an OV needed instead of a CCV?
Hearing Status . Aid Type
Definition: Type of hearing aid the actor uses, if any.
Encoding: CCV: none / conventional / CI
Comments:
Hearing Status . Aid Use
Definition: Information on how often the actor wears his/her hearing aid.
Encoding: CCV: always / regularly / sometimes / never
Comments:
Sign Competence
Group
Definition: Groups (partly subjective) information on the actor’s command of sign language.
Encoding: Sign Competence . Acquisition Age
Sign Competence . Acquisition Location
Sign Competence . Use Onset
Sign Competence . Regional
Comments:
Sign Competence . Acquisition Age
Definition: Age at which exposure to sign language and sign language use started.
Encoding: c (years;months)
Comments:
Sign Competence . Acquisition Location
Definition: Place where sign language was learnt.
Encoding: OV: home / kindergarten / school / family beyond home / friends
Comments:
Sign Competence . Use Onset
Definition: Since what age does the actor regularly use sign language?
Encoding: c (years;months)
Comments:
Sign Competence . Regional
Definition: Does the actor consider him/herself a dialect user?
Encoding: OVL:
Using a dialect from my region
Using both regional and standard varieties
Using more than one regional variant
Comments:
Sign Systems Use
Group
Definition: Groups information on what sign subsystems are used.
Encoding: Sign Systems Use . Sign Supported Speech
Sign Systems Use . Fingerspelling
Sign Systems Use . Alternate Fingerspelling
Sign Systems Use . Cued Speech
Comments:
Sign Systems Use . Sign Supported Speech
Definition: How often does the actor use sign supported speech in everyday communication?
Encoding: OVL: never / sometimes / regularly with family / regularly with friends / regularly
with colleagues
Comments:
Sign Systems Use . Fingerspelling
Definition: How often does the actor use fingerspelling (nationally dominant version) in
everyday communication?
Encoding: OVL: never / sometimes / regularly with family / regularly with friends / regularly
with colleagues
Comments:
Sign Systems Use . Alternate Fingerspelling
Definition: How often does the actor use alternate fingerspelling in everyday communication?
Encoding: OVL: never / sometimes / regularly with family / regularly with friends / regularly
with colleagues
Comments: ‘Alternate’ means the version that is not nationally dominant: one-handed for Britain and
two-handed elsewhere.
Sign Systems Use . Cued Speech
Definition: How often does the actor use cued speech in everyday communication?
Encoding: OVL: never / sometimes / regularly with family / regularly with friends / regularly
with colleagues
Comments: Cued speech shall be understood to include methods such as Phoneme-based Manual
System.
Spoken Language Competence
Group
Definition: Groups information on what use the actor can make of spoken language.
Encoding: Spoken Language Competence . Articulation
Spoken Language Competence . Reception
Spoken Language Competence . Reading
Spoken Language Competence . Writing
Communication with Hearing
Spoken Language Competence . Articulation
Definition: How well can the actor articulate (subjective measure)?
Encoding: CCV: well / reasonably / not well / not at all
Comments: Questionnaire form “Hearing persons can understand me X”.
Spoken Language Competence . Reception
Definition: How well can the actor follow an oral utterance?
Encoding: CCV: well / reasonably / not well / not at all
Comments: Questionnaire form “I understand hearing persons X”.
Spoken Language Competence . Reading
Definition: How well can the actor read?
Encoding: CCV: well / reasonably / not well / not at all
Comments: Questionnaire form “I read X”.
Spoken Language Competence . Writing
Definition: How well can the actor write?
Encoding: CCV: well / reasonably / not well / not at all
Comments: Questionnaire form “I write X”.
Communication with Hearing
Definition: Which communication method with hearing people does the actor prefer?
Encoding: OVL: sign / sign-supported speech / gesture / mix between signing and speaking /
speech only / writing
Comments: Questionnaire form “When communicating with hearing persons, I use X”.
Family
Group
Definition: Describes hearing status of closest contact persons as well as preferred communication
systems used.
Encoding: Family . Mother
Family . Father
Family . Household (sub)
Family . Communication at Childhood
Family . Partner
Family . Children (sub)
Family . Communication Nowadays
Family . Mother
Definition: Describes mother’s hearing status.
Encoding: CCV: deaf / hard-of-hearing / hearing / n.a.
Comments: Use n.a. if there is no regular contact with the mother. Replace with data for the
grandmother or a similar person where she was the primary caregiver. (Situation to
be described as it was during the actor’s childhood.)
Family . Father
Definition: Describes father’s hearing status.
Encoding: CCV: deaf / hard-of-hearing / hearing / n.a.
Comments: Use n.a. if there is no regular contact with the father. Replace with data for the
grandfather or a similar person where he was the primary caregiver. (Situation to be
described as it was during the actor’s childhood.)
Family . Household
Group
Definition: Describes hearing status of brothers and sisters and other persons belonging to the household
(not counting mother and father nor the actor).
Encoding: Family . Household . Deaf
Family . Household . Hard-of-hearing
Family . Household . Hearing
Family . Household . Deaf
Definition: Number of deaf persons in the household (not counting mother and father nor the
actor).
Encoding: c
Comments: Situation to be described as it was during the actor’s childhood.
Family . Household . Hard-of-hearing
Definition: Number of hard-of-hearing persons in the household (not counting mother and father
nor the actor).
Encoding: c
Comments: Situation to be described as it was during the actor’s childhood.
Family . Household . Hearing
Definition: Number of hearing persons in the household (not counting mother and father nor the
actor).
Encoding: c
Comments: Situation to be described as it was during the actor’s childhood.
Family . Communication at Childhood
Definition: Prevalent form of communication in the family of the actor at the time of his/her
childhood.
Encoding: OVL: sign / sign-supported speech / gesture / mix between signing and speaking /
speech only / writing
Comments:
Family . Partner
Definition: Describes partner’s hearing status.
Encoding: CCV: deaf / hard-of-hearing / hearing / n.a.
Comments: Describe situation at the time of the recording.
Family . Children
Group
Definition: Describes hearing status of actor’s children at the time of the recording.
Encoding: Family . Children . Deaf
Family . Children . Hard-of-hearing
Family . Children . Hearing
Comments: Special interview guidelines are to be defined for families with mixed nationalities.
Additional data concern who uses which language and who understands which language to
what extent.
Family . Children . Deaf
Definition: Number of actor’s children who are deaf.
Encoding: c
Comments: Situation to be described as it was at the time of the recording.
Family . Children . Hard-of-hearing
Definition: Number of actor’s children who are hard of hearing.
Encoding: c
Comments: Situation to be described as it was at the time of the recording.
Family . Children . Hearing
Definition: Number of actor’s children who are hearing.
Encoding: c
Comments: Situation to be described as it was at the time of the recording.
Family . Communication Nowadays
Definition: Prevalent form of communication in the actor’s family at the time of the recording.
Encoding: OVL: sign / sign-supported speech / gesture / mix between signing and speaking /
speech only / writing
Comments: Describe situation at the time of the recording.
Use n.a. (not applicable) if the actor lives alone. Use data from the actor’s parents if
he/she lives in the parents’ household and has no partner.
Deaf Contacts
Group
Definition: Describes the extent of contacts with other deaf people beyond the family.
Encoding: Deaf Contacts . Work
Deaf Contacts . Friends
Deaf Contacts . Deaf Club
Deaf Contacts . Active in Deaf Community
Comments: Summarize the situation over the last five years.
Deaf Contacts . Work
Definition: Describes the extent of contacts with other deaf people at work / school.
Encoding: CCV: never / sometimes / regularly
Comments:
Deaf Contacts . Friends
Definition: Describes the extent of individual contacts with other deaf people (beyond work and
the Deaf club).
Encoding: CCV: never / sometimes / regularly
Comments:
Deaf Contacts . Deaf Club
Definition: Describes the extent of contacts with other deaf people at a Deaf club / Deaf sports
club.
Encoding: CCV: never / sometimes / regularly
Comments:
Deaf Contacts . Active In Deaf Community
Definition: Describes whether the actor has some specific function in the Deaf community.
Encoding: CCV: no involvement / some participation / full engagement
Comments: ‘Some participation’ means participation in committees and the like; ‘full
engagement’ is for leaders, i.e. “officials” of Deaf organizations, from local to
national level.
Education
Group
Definition: Describes where the actor was educated.
Encoding: Education . Kindergarten (sub)
Education . Primary School (sub)
Education . Secondary School (sub)
Education . Postsecondary Education (sub)
Education . Profession Learnt
Education . Current Profession
Education . Sign Teaching
Comments: Once again, these data are also used to determine language and dialect background.
Education . Kindergarten
Group
Definition: Describes pre-school education.
Encoding: Education . Kindergarten . Kind
Education . Kindergarten . Location
Comments: Also used for preschool.
Education . Kindergarten . Kind
Definition: Describes the education model used at the kindergarten.
Encoding: CCV: deaf / for hard-of-hearing / bilingual / integrated / n.a.
Comments:
Education . Kindergarten . Location
Definition: Describes where (town or region) the institution was located.
Encoding: string
Comments:
Education . Primary School
Group
Definition: Describes primary school education.
Encoding: Education . Primary School . Kind
Education . Primary School . Location
Education . Primary School . Boarding school
Comments: If several schools were attended, create key-value pairs for each school attended for a
minimum of one year.
Education . Primary School . Kind
Definition: Describes the education model used at the school.
Encoding: CCV: deaf / for hard-of-hearing / bilingual / integrated / n.a.
Comments:
Education . Primary School . Location
Definition: Describes where (town or region) the institution was located.
Encoding: string
Comments:
Education . Primary School . Boarding School
Definition: Describes whether the school attended was a boarding school and, if so, gives the
name of the school.
Encoding: string
Comments: Create a value only if a boarding school was attended; the string is the name of the
school.
Education . Secondary School
Group
Definition: Describes secondary school education.
Encoding: Education . Secondary School . Kind
Education . Secondary School . Location
Education . Secondary School . Boarding school
Comments: If several schools were attended, create key-value pairs for each school attended for a
minimum of one year.
Education . Secondary School . Kind
Definition: Describes the education model used at the school.
Encoding: CCV: deaf / for hard-of-hearing / bilingual / integrated / regular with interpreter /
n.a.
Comments:
Education . Secondary School . Location
Definition: Describes where (town or region) the institution was located.
Encoding: string
Comments:
Education . Secondary School . Boarding School
Definition: Describes whether the school attended was a boarding school and, if so, gives the
name of the school.
Encoding: string
Comments: Create a value only if a boarding school was attended; the string is the name of the
school.
Education . Postsecondary Education
Group
Definition: Describes postsecondary / higher education.
Encoding: Education . Postsecondary Education . Kind
Education . Postsecondary Education . Location
Comments: If several institutions were attended, create key-value pairs for those attended for a
minimum of one year.
Education . Postsecondary Education . Kind
Definition: Describes the education model used at the school.
Encoding: OVL: vocational training / vocational training centre with interpreters / university /
university special courses for hard-of-hearing or deaf / university with interpreters
Comments:
Education . Postsecondary Education . Location
Definition: Describes where (town or region) the institution was located.
Encoding: string
Comments:
Education . Profession Learnt
Definition: Describes the profession learnt (e.g. via vocational training or university).
Encoding: string
Comments:
Education . Current Profession
Definition: Describes the actor’s current job.
Encoding: string
Comments: If unemployed, use last job.
Education . Sign Teaching
Definition: Amount of experience with teaching sign language.
Encoding: OVL: none / some / extensive
Comments:
7. Links
Workshop home page: http://www.let.ru.nl/sign-lang/echo/events.html
The background document: http://www.let.ru.nl/sign-lang/echo/docs/Metadata_SL.doc
Sign language master files for IMDI: http://www.let.ru.nl/sign-lang/IMDI
ECHO project, home page: http://echo.mpiwg-berlin.mpg.de/
ECHO project, case study 4: http://www.let.ru.nl/sign-lang/echo
ECHO project, technology: http://www.mpi.nl/echo
ECHO project, state of the art: http://www.ling.lu.se/projects/echo/contributors/
IMDI standard: http://www.mpi.nl/IMDI
IMDI tools: http://www.mpi.nl/IMDI/tools
ISLE metadata glossary: http://www.mpi.nl/ISLE/glossary/glossary_frame.html
ELAN annotation software: http://www.lat-mpi.eu/tools/elan
8. References
IMDI (ISLE Metadata Initiative), 2003. Part 1. Metadata elements for session descriptions. Draft
proposal version 3.02, March 2003. (Warning: the documents and tools available online refer to
versions 2.5-2.8; the 3.02 version of the proposal was sent to you with the present document.)
IMDI (ISLE Metadata Initiative), 2001. Part 1B. Metadata elements for catalogue descriptions. Draft
proposal version 2.1, June 2001.
http://www.mpi.nl/IMDI/documents/Proposals/IMDI_Catalogue_2.1.pdf
IMDI (ISLE Metadata Initiative), 2001. Part 1C. Metadata elements for lexicon descriptions. Draft
proposal version 1.0, December 2001.
http://www.mpi.nl/IMDI/documents/Proposals/ISLE_Lexicon_1.0.pdf
Birgit Hellwig, 2003. IMDI Editor, version 2.0. Manual. Version: 02 Apr 2003.
http://www.mpi.nl/IMDI/tools/IMDI_Editor_Manual_2_0.doc
Birgit Hellwig, 2003. IMDI Browser, version 1.4. Manual. Version: 12 Sep 2002.
http://www.mpi.nl/IMDI/tools/IMDI_Browser_Manual-02-09-08.doc
Peter Wittenburg & Daan Broeder, 2003. Metadata in ECHO. Version: 10 Mar 2003.
http://www.mpi.nl/echo/tec-rep/wp2-tr08-2003v1.pdf