Metadata for sign language corpora

Background document for an ECHO workshop, May 8-9, 2003, Radboud University Nijmegen
http://www.let.ru.nl/sign-lang/echo/events.html

Onno Crasborn, Department of Linguistics, Radboud University Nijmegen, [email protected]
Thomas Hanke, Institute for German Sign Language, University of Hamburg, [email protected]

Version: Tuesday, January 12, 2010 (links updated)

0. Preparation for the workshop

The goal of the workshop is to acquaint you with the concept of metadata and the IMDI standard, and to decide on a set of categories specific to our subfield of the language sciences. The list of categories in this document (section 6) is a first proposal for such a set. In order to prepare for the workshop, we would like to ask you to do four things.

1. Study this document, specifically the proposal in section 6, and take a look at the IMDI document 'Metadata elements for session descriptions, draft proposal v3.02'.
2. Think carefully about the kinds of data collections (corpora) you have at your institute (and others present in your country that differ in kind from your own corpora), and how they are currently archived. Is there a database storing information about your video collection, for example, or do you have a text document listing your tapes?
3. Evaluate the proposal in section 6, and see which types of information you would like to store but are missing from the combination of the IMDI 3.02 proposal and the sign-specific and deaf-specific additions proposed in section 6.
4. In addition, evaluate the proposal in section 6 from the other side: which types of information would you never use to describe your corpus, and are therefore redundant from your point of view?

There are several information sources on the internet that provide further information if you still have questions after reading this document, or that you can browse to get a better view of what metadata are and what the IMDI standard is. The full description of categories in the IMDI 3.02 proposal is useful to print and consult when you need a more detailed description of a category, but it is not intended as a text to read from start to end. The two manuals for the IMDI BCBrowser and the IMDI Editor (MS Word documents), which are the software tools presently available for using the IMDI metadata set, do however give an accessible introduction to the structure of IMDI. A glossary of some of the technical terms used in the various documents can be found on the ISLE web site. (The full links can be found in section 7 at the end of this document.)

1. What is the problem?

Now that sign language research in many European countries is expanding at a rapid pace, data collections are also growing considerably. Sign language data typically consist of video recordings, but in many cases they are accompanied by transcriptions (whether on paper or in computer format), lexical or phonological analyses stored in databases, etc. In some cases, sign language data consist of numeric information, such as the output of eye tracking systems or movement tracking equipment. At the same time, since the ESF InterSign workshops in the late 1990s, the development of computer technology has made it possible to digitize video using a simple desktop computer and, more importantly, to store many dozens of hours of digitized video even on a local hard disk.
A considerable investment of time is required to digitize existing corpora of analogue video recordings, but it is likely that more and more institutes will actually do this within the next few years. Digitized video files will be stored on a server in the institute and made accessible over the local intranet. The advantages are enormous: video recordings will no longer degrade as analogue tapes did, and any segment of any recording is accessible within a few seconds for any member of the team. By storing annotation files on a server as well, colleagues can incrementally build annotations on the same original media. Even more importantly in the European context, it becomes possible to provide access to data to collaborators in other locations within the country or in other countries. EU funding will keep emphasizing collaboration between institutes in different member states, and being able to share data over the internet can be a key factor in successful collaboration.

However, all these possibilities do require a good way of cataloguing the available data. Data will only be used in practice if they can be searched in some way, whether locally or from the other side of the world. If students or new team members still have to go to the long-time local team member to ask whether there might be a recording of young children with deaf parents, for example, not much time will actually be gained. It is the experience of many people that even the video data collected over the years (in the case of sign language, typically a period stretching from 5 to as many as 25 years) are so large and contain data for so many different types of studies that no one has a full overview of what there is.

To take a modest (that is, relatively small) example, consider the video corpus collected for phonological studies at the universities of Leiden and Nijmegen since 1994. About 90 hours of analogue video material are available, which can be divided into five broad categories. Three categories consist of well-defined elicited and spontaneous recordings for specific research projects in the area of phonetics and phonology (about 50 hours). One category consists of a very disparate set of video recordings made for research purposes of various kinds, ranging from 5 to 30 minutes per session (about 10 hours in total). The fifth category (30 hours) comprises sign language tapes that are commercially available (mostly NGT stories for children), a large set of TV recordings featuring signers or discussing deaf issues (news items, documentaries and movies), and data from other researchers in various countries. Since the first three categories were collected by ourselves for our own research, we have a reasonable overview of the nature of the data on these tapes. This is not true for the latter two categories: the recordings are simply too divergent to be able to recollect everything that is present. The small database application we created to list the tapes we have (recording tape number, title, short description, recording date, and type of tape) is not detailed enough to look up whether we have interactional data from informants from the north of the country, for example. For new colleagues, the whole collection is opaque.
The Dutch Science Foundation has provided a grant for archiving all the analogue tapes on digital tapes or in computer files, but if we do not use a systematic way of describing the content of the tapes at the same time, the recordings are bound to stay in the closet, even though they hold a lot of very valuable material that can be used for many different projects. The need for proper cataloguing will actually increase substantially when video tapes are digitized, because in many cases an hour of analogue video recording is naturally split up into multiple digital movie files (one for each elicitation task, for example). In the present example, the 90 hours of analogue video data are expected to result in about 1,000 digital movie files.

The answer to the cataloguing problem is the use of 'metadata': descriptions at a general level of the nature of the data that can be considered constant for a whole recording (as opposed to linguistic annotation or analysis of the data themselves, which describes the actions unfolding in time). The workshop aims to expand one of the most elaborate metadata proposals for describing spoken language corpora (IMDI) to cover sign language data collections as well. This document provides an introduction to the concept of metadata (section 2), presents the IMDI standard and associated software (section 3), discusses how you can create and distribute your own collections of metadata (section 4), and makes a proposal for an extension of the IMDI set that can be discussed at the workshop (sections 5 and 6).

The workshop is part of the ECHO project, which started in late 2002 and will end in 2004. ECHO is an acronym for European Cultural Heritage Online, a large pilot study funded by the EU to explore how the internet can be used to make data in different subdisciplines of the humanities accessible to other researchers and to a wider public. ECHO aims to make an inventory of existing data collection efforts, to explore the technological developments that are called for, and to carry out a series of case studies in five different disciplines to show the potential of the technology. The case study on language concerns sign language. A small set of comparable data is being collected for British Sign Language, Swedish Sign Language, and Sign Language of the Netherlands. The data will be annotated with a very general set of transcription conventions (see http://www.let.ru.nl/sign-lang/echo/documents.html) in the ELAN annotation software, and published on the internet to be used and/or further annotated for linguistic research by anyone in the world.

2. What are metadata?

Metadata can be described as 'data about data', or 'information about an information resource'. In the field of linguistics, data typically consist of audio or video recordings, transcriptions associated with audio or video sources, lexica, etc. [1] Transcriptions and other analytical products directly reflect the linguistic content, describing form and meaning at various linguistic levels. An important aspect of transcriptions is the dynamic nature of this type of data: they change over time. As opposed to these linguistic data, we can also characterize our sources in terms of what one might call 'administrative data'. These metadata include information about the date and location of the recording, information about the nature of the recording (spontaneous vs. elicited data), elicitation methods, details about the people participating in the recording, etc.

[1] Examples of further types of data in linguistic research are dataglove signals, laryngograph signals, and eye tracking signals.
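To make this concrete, the administrative description of one recording might look like the following sketch (a hypothetical record: the element names and all values are purely illustrative, and section 3 introduces the standardized IMDI counterpart of exactly this kind of description):

    <recording>
      <!-- all values below are invented for illustration -->
      <date>1998-06-23</date>
      <location>Nijmegen</location>
      <nature>elicited</nature>
      <elicitation-method>picture story prompt</elicitation-method>
      <participant role="signer" age="34"/>
    </recording>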
This separation between data and metadata may appear to lack a sharp definition. This is correct: the exact boundary is often determined by pragmatic considerations of data accessibility and data file format. What matters is that the metadata characterize the 'content' data in a useful and consistent way, so that meaningful results appear when one searches for data or resources with only the metadata available.

[Figure: the metadata description (recording date, media type, elicitation method, register, access rights, annotation type, annotator, etc.) characterizes both kinds of data: the media data (VHS tape, MPEG-1 file, WAV file, etc.) and the linguistic annotation data (glossing, HamNoSys transcription, etc.).]

For instance, information about the register used by a signer can be a property of the whole recording session (such as when a lecture is presented in a formal register), but if register variation was the topic of the study and different register versions of the same sentence are produced, a description of the register of each sentence would feature in the transcription of the data. Some overlap between the metadata descriptions and the linguistic transcriptions, as in this example, is unavoidable. In general, however, the two domains feature descriptions in terms of two distinct sets of categories.

However, this does not mean that users do not want to search for data combining categories from the two domains. For example, one can easily imagine queries of the following types:

• How many instances of the sign MONDAY [data] are there in my NGT corpus [metadata] produced by women [metadata] aged over 50 years [metadata]?
• What are the sentences in our corpus of hearing DGS interpreters [metadata] signing to a Deaf audience [metadata] from the north of the country [metadata] where the signer uses raised eyebrows [data] in non-question contexts [data]?

Combining searches in this way should be a possible function of the software used for retrieving data and metadata, and is not excluded in principle by storing some information in the metadata description and other information in the linguistic annotation. [2] For a more elaborate and more technical introduction to metadata, see the technical report created for ECHO Workpackage 2.

[2] At this moment, the IMDI tools are specifically designed to work with ELAN transcription files, since files of both kinds can be accessed over the internet. ELAN is a software tool with functionality similar to SignStream and syncWRITER, with the main difference that it is designed to work with files over a network, so that data can easily be accessed by different members of a research team, or by people in different countries. This network functionality is central to the ECHO project, which aims to make scientific data available online. Browsing a corpus with the IMDI tools provides users with direct links to the ELAN annotation and video files (provided they are granted access to the data), but it is not yet possible to run searches combining queries from both the data and the metadata domain. This will likely be developed in the near future.

3. What is the IMDI standard?

One of the most detailed and most universal proposals for a set of metadata descriptions for linguistic corpora resulted from the ISLE project (International Standard for Language Engineering), and is called IMDI (ISLE Metadata Initiative).
It is the result of a series of meetings and discussions by a large group of spoken language researchers and language engineers, and forms one of the cornerstones of the ECHO project. Workpackage 1 of the ECHO project aims to make an inventory of existing metadata corpora and to create a number of new ones for various languages. IMDI comprises different sets of metadata, to describe different things:

• Session metadata: descriptions of combinations of media files and linguistic annotation files
• Catalogue metadata: descriptions, on a more abstract level, of the corpus as a whole
• Lexicon metadata: descriptions of lexicons (still under development)

This document and the sign language workshop in May 2003 only discuss metadata elements to describe sessions, or 'resource bundles' (a name also used in domains and situations where the term 'session' carries other connotations). Sessions are intuitively natural units of media files (video and/or audio) and a linguistic annotation file (ELAN, SignStream, etc.). For example, if a data collection on a single video tape contains the Frog Story signed by different informants, each informant's rendering of the story would be considered a separate session. In this way, it is possible to describe and search for every piece of the source in a unique and extensive way. Each session is described by one IMDI file. Different sessions can be combined in larger units (subcorpora, e.g. the 'Frog Story Corpus'), which in turn constitute larger corpora (e.g. the 'Nijmegen Sign Language Corpus').

It is important to make a distinction between the IMDI standard, which is the topic of the workshop, and the software tools currently available to use the IMDI standard. While the set of elements used to describe corpora should be relatively stable from the outset, the software to enter and retrieve these elements does not necessarily have to be standardized, and can more easily be changed in the future. Currently, the only available software tools to enter and retrieve information from IMDI files are the ones available on the IMDI web site, developed by the Max Planck Institute for Psycholinguistics. Because the files are in a standard markup text format (XML), anyone is free to develop software for specific needs that are not covered by the MPI software. This situation is comparable to standardized transcription systems: the definition of the IPA or HamNoSys notation systems, or of glossing conventions, is independent of their use in a computer program such as MS Word or SignStream.

The IMDI elements for session descriptions are grouped into seven broad sets of fixed elements:

1. Session: bundles all information about the circumstances and conditions of the linguistic event, groups the resources (e.g. video files and annotation files) belonging to this event, and records the administrative information for the event.
2. Project: information about the project for which the sessions were originally created.
3. Collector: name and contact information for the person who collected the session.
4. Content: a set of categories describing the intellectual content of the session.
5. Actors: names, roles and further information about the people involved in the session.
6. Resources: information about the media files, such as URL, size, etc.
7. References: citations and URLs to relevant publications and other archive resources.

A full description of the IMDI elements can be found in the IMDI 3.02 proposal.
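As an illustration, a session description is stored as an XML file along the following lines. This is a simplified sketch: the element names and nesting are only indicative of the grouping described above, not the exact IMDI 3.02 schema, and the session and file names are hypothetical.

    <Session>
      <Name>frog_story_signer03</Name>
      <Date>2003-02-14</Date>
      <Project>
        <Name>ECHO case study 4</Name>
      </Project>
      <Collector>
        <Name>...</Name>
      </Collector>
      <Content>
        <Genre>...</Genre>
      </Content>
      <Actors>
        <Actor>
          <Role>signer</Role>
        </Actor>
      </Actors>
      <Resources>
        <MediaFile>
          <Link>frog03.mpg</Link>
          <Format>MPEG-1</Format>
        </MediaFile>
      </Resources>
      <References/>
    </Session>

One such file per session, grouped into subcorpus and corpus nodes, is what the IMDI browser navigates.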
Another way to obtain a quick overview of the available elements and their use is to open one of the IMDI descriptions of the ECHO sign language corpus, which can be found on the workshop web site. If you open such a file in the IMDI editor, you get a good impression of the distinctions included in the current (3.02) version of the IMDI proposal for session descriptions.

The available elements form a broad set that is shared between different subdisciplines of language studies. To allow for the encoding of categories that are specific to a given subdiscipline, such as sign language studies, or even specific to a single research project, there are several places in the IMDI scheme where users can add fields of their own: so-called 'key-value pairs'. Key-value pairs are available at the following locations in the IMDI element set:

• Session
• Content
• Actor
• Language

In each location, an unbounded number of 'keys' can be created to describe project-specific characteristics of the session. For example, the present standard does not incorporate any fields that are specific to the sign language community, such as whether an actor is deaf or hearing. The goal of the workshop is to agree on a first proposal for a set of key-value pairs that will be used by the sign language field as a whole. In addition to this 'standard extension' (or 'profile'), a sign language researcher is still free to add extra fields that describe properties of the session that are in neither the IMDI 3.02 set nor the sign language set.

In addition to the key-value pairs, the IMDI 3.02 proposal also includes a number of 'description' elements. These description elements are intended to give a more elaborate prose description, in any language, of different aspects of the session. The content of these description elements is not controlled, and although the resulting descriptions can be very useful when browsing the corpus, the variation in their content over many thousands of sessions makes it useless to search for words in these elements when one wants to find very specific information. For this reason, it is better to agree on a standard key element for encoding the deafness of an actor than to always add a text like 'The actor became deaf at age 16' to the Actor.Description element. A proposal for a set of keys for the sign language field is presented in section 6 of this document and forms the main item for discussion at the workshop.

The kind of information that can be stored in a field, the 'vocabulary', can be of different types, as summarized in the following list.

• str (string): free string of text. Example element: Actor.Name
• c (constrained): the text that can be entered is constrained in some way. Example element: Actor.Age
• ov (open vocabulary): the content of the element can be chosen from a list, or new items can be added by the user. Example element: Resources.MediaFile.Format
• ovl (open vocabulary list): the content of the element can be chosen from a near-exhaustive list. Example element: Content.Genre.Interactional
• ccv (closed controlled vocabulary): the content of the element has to be chosen from a restricted and closed set of items. Example element: Session.Location.Continent

These vocabulary types and the abbreviations are also used in the proposal in section 6.
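To make the contrast between keys and description elements concrete, the sketch below encodes the same fact both ways (the key name anticipates the proposal in section 6; the XML rendering of keys and descriptions is simplified, not the exact IMDI 3.02 schema):

    <Actor>
      <!-- controlled key: searchable in a uniform way across thousands of sessions -->
      <Keys>
        <Key Name="Hearing Status.Hearing">deaf</Key>
      </Keys>
      <!-- free prose: helpful when browsing, but unreliable to search -->
      <Description>The actor became deaf at age 16.</Description>
    </Actor>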
4. How can one create metadata descriptions and make them accessible to others?

As mentioned, an editor for IMDI metadata creation is available from MPI free of charge. For people who need to start the metadata collection task from scratch, this is certainly the easiest way to go. The editor is a Java program that runs on a variety of platforms (Windows 98, Windows XP, Mac OS X, etc.), and it will be updated to accommodate future changes to the standard. The program automatically takes care of vocabulary maintenance and similar housekeeping tasks.

The situation may be different, however, if you have already collected a substantial amount of metadata. We therefore analysed how the existing metadata resources for the Nijmegen and Hamburg corpora could be transferred to IMDI. For the Nijmegen data, it turned out that a variety of formats were used for the metadata, and that automatic transfer into IMDI is not worth the effort. The existing information will therefore be transferred manually, by copying and pasting into the IMDI editor when new sessions are created. For the Hamburg data, a fair part of the information contained in the IMDI standard is already available in the institute's transcription database. Other bits of information are kept in a variety of formats, such as free-format text documents, spreadsheets, and transcriptions of interviews on the sociographic background of the informants. As it is a goal in Hamburg to store all relevant data in one database, the following three-part strategy was chosen:

• Data already available in the Hamburg database will be made available by means of XML exports. This is a fairly easy task, to be completed in summer 2003. It will already give an overview of the corpora, although not all fields will be filled in as appropriate.
• In some cases, the suggested standard requires more detailed vocabularies than currently provided by the Hamburg database. As a short-term solution, a mapping will be provided from the Hamburg data to the IMDI vocabularies. In the long run, the database will be changed to correspond directly to the IMDI value sets. Of course, this requires work by the researchers who conducted the earlier corpus collection tasks, and will therefore take a substantial amount of time to complete.
• The Hamburg database will be extended to cover the metadata suggested in the IMDI proposal but not yet covered in the database. This should be completed in the course of 2003. Where the data are readily available in other formats, they will be entered into the database as part of routine database maintenance. A few exceptions, however, will require intervention by the principal researchers and can therefore not be expected to be finished within a year.

The situation of your metadata may call for a similar or yet another approach. We suggest that you start the analysis by trying to match your current metadata sets with the IMDI proposal field by field, to see which approach best fits your needs.

In principle, these metadata descriptions can be kept in-house for use by close colleagues. However, the ECHO program strongly promotes the sharing of metadata corpora, without the concomitant need or obligation to share the corresponding data as well. Within IMDI, one can indicate to what extent the data described by the metadata are available to the outside world.
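A hypothetical sketch of how such access conditions could accompany a resource description follows. IMDI records access information together with the resources, but the element names and values shown here are our own illustrative assumptions, not the exact schema:

    <MediaFile>
      <Link>frog03.mpg</Link>
      <Access>
        <!-- the metadata can be published even when the video itself stays restricted -->
        <Availability>restricted: research use on request</Availability>
        <Contact>...</Contact>
      </Access>
    </MediaFile>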
Even if not a single videotape is accessible to outside users, sharing metadata alone can already be very informative: researchers get an impression of aspects of the methodology of different projects, such as elicitation methods, and can get in touch with the researcher responsible for the data collection to discuss methodological aspects of the research project in question. IMDI metadata descriptions can be shared by publishing them on an HTTP (web) server and distributing the address of the top node of a corpus.

5. What needs to be discussed at the workshop?

There are several sign language specific properties of linguistic data that are not covered by the IMDI 3.02 metadata set described above. The IMDI standard provides a simple mechanism for extending its scope: key-value pairs. Most major groups of the session description include a sub-schema for key-value pairs. To store information here, you add a pair, give it a key name comparable to the schema elements, and choose a value from a vocabulary to assign to the key.

The following section proposes a set of key-value pairs for some of the most apparent properties of sign language corpora that cannot be described with the IMDI 3.02 proposal for session descriptions. The proposed keys mainly specify the background of the actors, including their hearing status, sign language skills, school background, etc. The goal of the workshop is to decide on a list of keys and vocabularies, based on the present proposal.

It is important to realize that, just as in the standard set of IMDI elements, not all elements are relevant for all sessions. The sign language set should include information that applies to the average sign language session. For example, sessions with sign language acquisition materials should be described with keys from both the sign language and the acquisition fields. Furthermore, anyone is always free to add extra keys specific to one or more subcorpora.

As we just indicated, the goal of the workshop is to decide on a set of IMDI extensions (keys); this means that you have to think carefully about three questions:

1. Are there any categories missing in the proposal below?
2. Which of the categories below would you never use?
3. Are the proposals for vocabularies for each element well chosen?

It will be impossible to think of all possible scenarios just by looking at paper lists. Describing actual sessions with the IMDI tools will no doubt lead to further refinements, which is exactly what happened to the IMDI standard set when it was used by different groups of users.

6. A proposal: a set of key-value pairs to store sign language specific properties

As described in section 5, every major section of the session description includes a sub-schema for key-value pairs, in which you add a pair, give it a key name, and assign a value from a vocabulary. This results in a slightly less structured representation than what could be achieved with an extension of the standard itself, but it does allow for a quick start, with the chance to later formally propose an extension of the standard that promotes the keys to schema elements. In order to simulate the concept of element groups, we suggest key names with dots for the moment, such as Hearing Status.Hearing for Actor.keys.
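For example, a shared dotted prefix groups related keys under Actor.keys as in the following sketch (the key names and vocabulary values are taken from the proposal below; the XML rendering of key-value pairs is simplified):

    <Actor>
      <Keys>
        <!-- the dotted prefix simulates a 'Hearing Status' element group -->
        <Key Name="Hearing Status.Hearing">deaf</Key>
        <Key Name="Hearing Status.Aid Type">none</Key>
        <Key Name="Hearing Status.Aid Use">never</Key>
      </Keys>
    </Actor>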
The most important future development of the IMDI tools from the perspective of this proposal concerns the creation of 'profiles'. Profiles will contain sets of key-value pairs specific to different subgroups of users, such as the sign language community. At this moment we can simulate and share a sort of profile by making and sharing a 'master document' such as the one that can be found on the workshop web page (from early May on). Ideally, one would be able to choose one or more profiles from a list within the IMDI editor. This will be developed in the near future. If the set of sign language extensions is seen as a useful and stable set by the sign language community, the 'sign language profile' can perhaps be given a more flexible layout in the IMDI editor and browser.

The numbers below refer to paragraphs in the IMDI 3.02 proposal that already provide space for additional information.

3 Metadata element definitions

3.1 Session

3.1.10 Session . keys

As there is no place for keys in the 'Resources' section of IMDI, the following group has to be placed here. In a new version of the IMDI standard and the associated tools, it will be possible to specify keys in the Resources section as well.

Multiple Cameras Group
Definition: Properties of the use and combination of multiple cameras.
Encoding: Multiple Cameras . Number
          Multiple Cameras . Layout
          Multiple Cameras . Viewpoint
          Multiple Cameras . Focus
Comments: A description of the use of multiple cameras during the recording, and how they appear in the media files of this session. For now, the situation in all sources is assumed to be similar; ideally, it will become possible to encode this information for each resource (analogue tape, digital file) separately by keys for each resource.

Multiple Cameras . Number
Definition: Number of video cameras used to record the session.
Encoding: c
Comments:

Multiple Cameras . Layout
Definition: Appearance of the video cameras when they are combined in one media file.
Encoding: OV: side-by-side / insert in top-left corner / insert in bottom-left corner / insert in top-right corner / insert in bottom-right corner
Comments: If the views of the different cameras are not combined in either the source or the media files, this field is left empty.

Multiple Cameras . Viewpoint
Definition: Viewpoints of the different cameras.
Encoding: string
Comments: Until a set of keys becomes available for the Resources section, people are encouraged to use a systematic way of entering this information in a single field. We propose to use a number or letter for the camera, followed by a colon, followed by a string from the set { front, side, diagonal, top, ... }.

Multiple Cameras . Focus
Definition: The part of the signer each video camera focuses on.
Encoding: string
Comments: Until a set of keys becomes available for the Resources section, people are encouraged to use a systematic way of entering this information in a single field. We propose to use a number or letter for the camera, followed by a colon, followed by a string from the set { upper body, whole body, face, mouth, ... }.
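A session recorded with two cameras might thus carry the following session-level keys (a sketch: the XML rendering is simplified, and the camera labels 'a' and 'b' as well as the semicolon separating the per-camera entries are our own assumptions):

    <Session>
      <Keys>
        <Key Name="Multiple Cameras.Number">2</Key>
        <Key Name="Multiple Cameras.Layout">insert in top-right corner</Key>
        <!-- one 'camera: value' entry per camera, as proposed above -->
        <Key Name="Multiple Cameras.Viewpoint">a: front; b: side</Key>
        <Key Name="Multiple Cameras.Focus">a: whole body; b: face</Key>
      </Keys>
    </Session>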
3.3 Content

3.3.6 Content . Languages

3.3.6.2 Content . Languages . Description

Space for describing, in prose, the code mixing, sign supported speech, etc. used in this session. Do we need separate keys for describing code mixing and code switching between different languages or modalities?

3.3.7 Content . keys

Language Variety
Definition: Description of the language variety used in the session.
Encoding: string
Comments: Space for a more constrained description of the language variety used in this session. Information about the language skills of the individuals should be entered in the actor's description (cf. 3.4.2.15 Actor . keys).

Elicitation Method
Definition: A characterization of specific prompts used for eliciting language production.
Encoding: OV: single picture prompt / picture story prompt / written language prompt / sign language prompt / video prompt
Comments: When working on the influence of German on DGS compounding, for example, it is essential to know whether the spoken language competence has been activated by the elicitation situation. Content . Task might be appropriate for this purpose, but its open vocabulary seems to mix different levels of detail: while 'Wizard of Oz' is certainly not related to the utterance's topic, some other values are, such as 'room reservation'. 'Frog Story' is by now so well known that it names both the content and the elicitation method. Content . Involvement would be a good place, if it were an open vocabulary.

Interpreting Group
Definition: Properties of interpreting appearing in the session.
Encoding: Interpreting . Source
          Interpreting . Target
          Interpreting . Interpreter Name [move to Actor?]
          Interpreting . Visibility
          Interpreting . Audience
Comments:

Interpreting . Source
Definition: Source modality and language type.
Encoding: OVL: sign language / speech / sign supported speech / text / fingerspelling
Comments:

Interpreting . Target
Definition: Target modality and language type.
Encoding: OVL: sign language / speech / sign supported speech / text (subtitling) / fingerspelling
Comments:

Interpreting . Interpreter Name
Definition: Name of the interpreter(s).
Encoding: string
Comments: Use a name or 'unknown'. Or should this be a reference to an Actor for details?

Interpreting . Visibility
Definition: Visibility of the interpreter in the video recordings.
Encoding: CCV: not visible / in view during whole session / in view during part of session
Comments:

Interpreting . Audience
Definition: Presence and nature of an audience that the interpreter is signing for.
Encoding: CCV: audience not present (signing to camera) / audience known to the interpreter / heterogeneous group partly known to the interpreter / anonymous audience (e.g. theatre)
Comments: If Interpreting . Target = text (subtitling), leave this field empty.
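An interpreted session could then be described with Content keys such as these (a sketch: the values are taken from the vocabularies proposed above, and the XML rendering of key-value pairs is simplified):

    <Content>
      <Keys>
        <Key Name="Interpreting.Source">speech</Key>
        <Key Name="Interpreting.Target">sign language</Key>
        <Key Name="Interpreting.Visibility">in view during whole session</Key>
        <Key Name="Interpreting.Audience">audience known to the interpreter</Key>
      </Keys>
    </Content>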
3.4 Actors

3.4.2 Actor

3.4.2.15 Actor . keys

We propose to add a number of keys describing different aspects of the actors, mainly to characterize their language background. All of these keys refer to relatively stable properties (skills) of the actors, not to their actual behaviour in the specific session at hand.

Note: keys that are further specified by a set of keys are followed by '(sub)' in the lists; the formatting of the descriptions otherwise follows the IMDI documents.

General comment: most of the subjective data could be paralleled with 'objective' data, such as 'dB left' and 'dB right' for the item 'hearing', scores on language competence tests, etc. Is this needed? Does anyone have suggestions for specific fields and values that are often measured in your corpus?

Actor keys Group
Encoding: Dialect
          Dialect Background (sub)
          Hearing Status (sub)
          Sign Competence (sub)
          Sign Systems Use (sub)
          Spoken Language Competence (sub)
          Family (sub)
          Family . Children (sub)
          Deaf Contacts (sub)
          Education (sub)
Comments: Stable properties (skills) of the actor, not their actual use in a given session.

Dialect
Definition: Name of the dialect the actor uses.
Encoding: string or OV
Comments: Information on language variety, given a priori knowledge about the dialects of a specific sign language.

Dialect Background Group
Encoding: Dialect Background . Raised at
          Dialect Background . Living in
          Dialect Background . Local since
Comments: Groups information on language variety, without knowing a priori what dialects exist.

Dialect Background . Raised at
Definition: Town or region where the actor lived at the language acquisition age.
Encoding: string or OV
Comments:

Dialect Background . Living in
Definition: Town or region where the actor lived at the time of the recording.
Encoding: string or OV
Comments:

Dialect Background . Local since
Definition: Year the actor moved to the current place of residence.
Encoding: c
Comments:

Hearing Status Group
Definition: Groups information about the hearing status of the actor. Only the first element is relevant for all actors; the other elements specify details about hearing loss.
Encoding: Hearing Status . Hearing
          Hearing Status . Hearing Rests
          Hearing Status . Aid Type
          Hearing Status . Aid Use
Comments:

Hearing Status . Hearing
Definition: Actor's ability to hear.
Encoding: CCV: hearing / hard-of-hearing / deaf
Comments:

Hearing Status . Hearing Rests
Definition: Description of the types of acoustic signals the actor can still perceive (residual hearing).
Encoding: CCV: I can hear phone signals (busy tone etc.) / I can hear voices / I can understand a bit of what people are saying
Comments: These are subjective data; are objective data needed? Is an OV needed instead of a CCV?

Hearing Status . Aid Type
Definition: Type of hearing aid the actor uses.
Encoding: CCV: none / conventional / CI
Comments:

Hearing Status . Aid Use
Definition: Information on how often the actor wears his/her hearing aid.
Encoding: CCV: always / regularly / sometimes / never
Comments:

Sign Competence Group
Definition: Groups (partly subjective) information on the actor's command of sign language.
Encoding: Sign Competence . Acquisition Age
          Sign Competence . Acquisition Location
          Sign Competence . Use Onset
          Sign Competence . Regional
Comments:

Sign Competence . Acquisition Age
Definition: Age at which exposure to sign language and sign language use started.
Encoding: c (years;months)
Comments:

Sign Competence . Acquisition Location
Definition: Place where sign language was learnt.
Encoding: OV: home / kindergarten / school / family beyond home / friends
Comments:

Sign Competence . Use Onset
Definition: Since what age has the actor regularly used sign language?
Encoding: c (years;months)
Comments:

Sign Competence . Regional
Definition: Does the actor consider him/herself a dialect user?
Encoding: OVL: using a dialect from my region / using both regional and standard varieties / using more than one regional variant
Comments:
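Taken together, keys from these groups might describe an actor as in the sketch below (all values are hypothetical; the XML rendering of key-value pairs is simplified). Note how the Dialect Background keys would let one answer the earlier question about informants from the north of the country:

    <Actor>
      <Keys>
        <Key Name="Dialect Background.Raised at">Groningen</Key>
        <Key Name="Dialect Background.Living in">Amsterdam</Key>
        <Key Name="Dialect Background.Local since">1998</Key>
        <Key Name="Sign Competence.Acquisition Age">0;6</Key>
        <Key Name="Sign Competence.Acquisition Location">home</Key>
      </Keys>
    </Actor>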
Sign Systems Use Group
Definition: Groups information on which sign subsystems are used.
Encoding: Sign Systems Use . Sign Supported Speech
          Sign Systems Use . Fingerspelling
          Sign Systems Use . Alternate Fingerspelling
          Sign Systems Use . Cued Speech
Comments:

Sign Systems Use . Sign Supported Speech
Definition: How often does the actor use sign supported speech in everyday communication?
Encoding: OVL: never / sometimes / regularly with family / regularly with friends / regularly with colleagues
Comments:

Sign Systems Use . Fingerspelling
Definition: How often does the actor use fingerspelling (the nationally dominant version) in everyday communication?
Encoding: OVL: never / sometimes / regularly with family / regularly with friends / regularly with colleagues
Comments:

Sign Systems Use . Alternate Fingerspelling
Definition: How often does the actor use alternate fingerspelling in everyday communication?
Encoding: OVL: never / sometimes / regularly with family / regularly with friends / regularly with colleagues
Comments: 'Alternate' means one-handed for Britain and two-handed elsewhere.

Sign Systems Use . Cued Speech
Definition: How often does the actor use cued speech in everyday communication?
Encoding: OVL: never / sometimes / regularly with family / regularly with friends / regularly with colleagues
Comments: Cued speech is understood to include methods such as the Phoneme-based Manual System.

Spoken Language Competence Group
Definition: Groups information on what use the actor can make of spoken language.
Encoding: Spoken Language Competence . Articulation
          Spoken Language Competence . Reception
          Spoken Language Competence . Reading
          Spoken Language Competence . Writing
          Communication with Hearing
Comments:

Spoken Language Competence . Articulation
Definition: How well can the actor articulate (subjective measure)?
Encoding: CCV: well / reasonably / not well / not at all
Comments: Questionnaire form: 'Hearing persons can understand me X.'

Spoken Language Competence . Reception
Definition: How well can the actor follow an oral utterance?
Encoding: CCV: well / reasonably / not well / not at all
Comments: Questionnaire form: 'I understand hearing persons X.'

Spoken Language Competence . Reading
Definition: How well can the actor read?
Encoding: CCV: well / reasonably / not well / not at all
Comments: Questionnaire form: 'I read X.'

Spoken Language Competence . Writing
Definition: How well can the actor write?
Encoding: CCV: well / reasonably / not well / not at all
Comments: Questionnaire form: 'I write X.'

Communication with Hearing
Definition: Which communication method with hearing people does the actor prefer?
Encoding: OVL: sign / sign-supported speech / gesture / mix between signing and speaking / speech only / writing
Comments: Questionnaire form: 'When communicating with hearing persons, I use X.'

Family Group
Definition: Describes the hearing status of the actor's closest contact persons, as well as the preferred communication systems used.
Encoding: Family . Mother
          Family . Father
          Family . Household (sub)
          Family . Communication at Childhood
          Family . Partner
          Family . Children (sub)
          Family . Communication Nowadays
Comments:

Family . Mother
Definition: Describes the mother's hearing status.
Encoding: CCV: deaf / hard-of-hearing / hearing / n.a.
Comments: Use n.a. if there is no regular contact with the mother. Replace with data for the grandmother or a similar person where she was the primary educator. (Situation to be described as it was at the actor's childhood.)

Family . Father
Definition: Describes the father's hearing status.
Encoding: CCV: deaf / hard-of-hearing / hearing / n.a.
Comments: Use n.a. if there is no regular contact with the father. Replace with data for the grandfather or a similar person where he was the primary educator. (Situation to be described as it was at the actor's childhood.)
Family . Household Group
Definition: Describes the hearing status of brothers and sisters and other persons belonging to the household (not counting the mother and father, nor the actor).
Encoding: Family . Household . Deaf
          Family . Household . Hard-of-hearing
          Family . Household . Hearing
Comments:

Family . Household . Deaf
Definition: Number of deaf persons in the household (not counting the mother and father, nor the actor).
Encoding: c
Comments: Situation to be described as it was at the actor's childhood.

Family . Household . Hard-of-hearing
Definition: Number of hard-of-hearing persons in the household (not counting the mother and father, nor the actor).
Encoding: c
Comments: Situation to be described as it was at the actor's childhood.

Family . Household . Hearing
Definition: Number of hearing persons in the household (not counting the mother and father, nor the actor).
Encoding: c
Comments: Situation to be described as it was at the actor's childhood.

Family . Communication at Childhood
Definition: Prevalent form of communication in the family of the actor at the time of his/her childhood.
Encoding: OVL: sign / sign-supported speech / gesture / mix between signing and speaking / speech only / writing
Comments:

Family . Partner
Definition: Describes the partner's hearing status.
Encoding: CCV: deaf / hard-of-hearing / hearing / n.a.
Comments: Describe the situation at the time of the recording.

Family . Children Group
Definition: Describes the hearing status of the actor's children at the time of the recording.
Encoding: Family . Children . Deaf
          Family . Children . Hard-of-hearing
          Family . Children . Hearing
Comments: Special interview guidelines are to be defined for families with mixed nationalities. Additional data concern who uses which language, and who understands which language to what extent.

Family . Children . Deaf
Definition: Number of the actor's children who are deaf.
Encoding: c
Comments: Situation to be described as it was at the time of the recording.

Family . Children . Hard-of-hearing
Definition: Number of the actor's children who are hard of hearing.
Encoding: c
Comments: Situation to be described as it was at the time of the recording.

Family . Children . Hearing
Definition: Number of the actor's children who are hearing.
Encoding: c
Comments: Situation to be described as it was at the time of the recording.

Family . Communication Nowadays
Definition: Prevalent form of communication in the actor's family at the time of the recording.
Encoding: OVL: sign / sign-supported speech / gesture / mix between signing and speaking / speech only / writing
Comments: Describe the situation at the time of the recording. Use n.a. (not applicable) if the actor lives alone. Use data from the actor's parents if he/she lives in the parents' household and has no partner.

Deaf Contacts Group
Definition: Describes the extent of contacts with other deaf people beyond the family.
Encoding: Deaf Contacts . Work
          Deaf Contacts . Friends
          Deaf Contacts . Deaf Club
          Deaf Contacts . Active in Deaf Community
Comments: Summarize the situation over the last five years.

Deaf Contacts . Work
Definition: Describes the extent of contacts with other deaf people at work / school.
Encoding: CCV: never / sometimes / regularly
Comments:

Deaf Contacts . Friends
Definition: Describes the extent of individual contacts with other deaf people (beyond work and the Deaf club).
Encoding: CCV: never / sometimes / regularly
Comments:
Deaf Contacts . Deaf Club
Definition: Describes the extent of contacts with other deaf people at a Deaf club / Deaf sports club.
Encoding: CCV: never / sometimes / regularly
Comments:

Deaf Contacts . Active in Deaf Community
Definition: Describes whether the actor has some specific function in the Deaf community.
Encoding: CCV: no involvement / some participation / full engagement
Comments: 'Some' means participation in committees and the like; 'full' is for leaders, i.e. 'officials' of the Deaf organizations, from the local to the national level.

Education Group
Definition: Describes where the actor was educated.
Encoding: Education . Kindergarten (sub)
          Education . Primary School (sub)
          Education . Secondary School (sub)
          Education . Postsecondary Education (sub)
          Education . Profession Learnt
          Education . Current Profession
          Education . Sign Teaching
Comments: Once again, these data are also used to determine language and dialect background.

Education . Kindergarten Group
Definition: Describes pre-school education.
Encoding: Education . Kindergarten . Kind
          Education . Kindergarten . Location
Comments: Also used for preschool.

Education . Kindergarten . Kind
Definition: Describes the education model used at the kindergarten.
Encoding: CCV: deaf / for hard-of-hearing / bilingual / integrated / n.a.
Comments:

Education . Kindergarten . Location
Definition: Describes where (town or region) the institution was located.
Encoding: string
Comments:

Education . Primary School Group
Definition: Describes primary school education.
Encoding: Education . Primary School . Kind
          Education . Primary School . Location
          Education . Primary School . Boarding School
Comments: If several schools were attended, create key-value pairs for all of them where the stay lasted at least one year.

Education . Primary School . Kind
Definition: Describes the education model used at the school.
Encoding: CCV: deaf / for hard-of-hearing / bilingual / integrated / n.a.
Comments:

Education . Primary School . Location
Definition: Describes where (town or region) the institution was located.
Encoding: string
Comments:

Education . Primary School . Boarding School
Definition: Describes whether the school attended was a boarding school and, if so, gives the name of the school.
Encoding: string
Comments: Create this key only if a boarding school was attended; the string is the name of the school.

Education . Secondary School Group
Definition: Describes secondary school education.
Encoding: Education . Secondary School . Kind
          Education . Secondary School . Location
          Education . Secondary School . Boarding School
Comments: If several schools were attended, create key-value pairs for all of them where the stay lasted at least one year (see the sketch after this group).

Education . Secondary School . Kind
Definition: Describes the education model used at the school.
Encoding: CCV: deaf / for hard-of-hearing / bilingual / integrated / regular with interpreter / n.a.
Comments:

Education . Secondary School . Location
Definition: Describes where (town or region) the institution was located.
Encoding: string
Comments:

Education . Secondary School . Boarding School
Definition: Describes whether the school attended was a boarding school and, if so, gives the name of the school.
Encoding: string
Comments: Create this key only if a boarding school was attended; the string is the name of the school.
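Because key-value pairs can be repeated, an actor who attended two secondary schools for at least a year each would simply receive two sets of keys, as in this sketch (the town names are hypothetical; the XML rendering of key-value pairs is simplified):

    <Actor>
      <Keys>
        <!-- first school -->
        <Key Name="Education.Secondary School.Kind">deaf</Key>
        <Key Name="Education.Secondary School.Location">Haren</Key>
        <!-- second school -->
        <Key Name="Education.Secondary School.Kind">regular with interpreter</Key>
        <Key Name="Education.Secondary School.Location">Rotterdam</Key>
      </Keys>
    </Actor>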
Education . Postsecondary Education Group
Definition: Describes postsecondary / higher education.
Encoding: Education . Postsecondary Education . Kind
          Education . Postsecondary Education . Location
Comments: If several institutions were attended, create key-value pairs for those where the stay lasted at least one year.

Education . Postsecondary Education . Kind
Definition: Describes the education model used at the institution.
Encoding: OVL: vocational training / vocational training centre with interpreters / university / university special courses for hard-of-hearing or deaf / university with interpreters
Comments:

Education . Postsecondary Education . Location
Definition: Describes where (town or region) the institution was located.
Encoding: string
Comments:

Education . Profession Learnt
Definition: Describes the profession learnt (e.g. via vocational training or university).
Encoding: string
Comments:

Education . Current Profession
Definition: Describes the actor's current job.
Encoding: string
Comments: If unemployed, use the last job.

Education . Sign Teaching
Definition: Amount of experience with teaching sign language.
Encoding: OVL: none / some / extensive
Comments:

7. Links

Workshop home page: http://www.let.ru.nl/sign-lang/echo/events.html
The background document: http://www.let.ru.nl/sign-lang/echo/docs/Metadata_SL.doc
Sign language master files for IMDI: http://www.let.ru.nl/sign-lang/IMDI
ECHO project, home page: http://echo.mpiwg-berlin.mpg.de/
ECHO project, case study 4: http://www.let.ru.nl/sign-lang/echo
ECHO project, technology: http://www.mpi.nl/echo
ECHO project, state of the art: http://www.ling.lu.se/projects/echo/contributors/
IMDI standard: http://www.mpi.nl/IMDI
IMDI tools: http://www.mpi.nl/IMDI/tools
ISLE metadata glossary: http://www.mpi.nl/ISLE/glossary/glossary_frame.html
ELAN annotation software: http://www.lat-mpi.eu/tools/elan

8. References

IMDI (ISLE Metadata Initiative), 2003. Part 1: Metadata elements for session descriptions. Draft proposal version 3.02, March 2003. (Warning: the documents and tools available online refer to versions 2.5-2.8; this 3.02 version of the proposal was sent to you with the present document.)

IMDI (ISLE Metadata Initiative), 2001. Part 1B: Metadata elements for catalogue descriptions. Draft proposal version 2.1, June 2001. http://www.mpi.nl/IMDI/documents/Proposals/IMDI_Catalogue_2.1.pdf

IMDI (ISLE Metadata Initiative), 2001. Part 1C: Metadata elements for lexicon descriptions. Draft proposal version 1.0, December 2001. http://www.mpi.nl/IMDI/documents/Proposals/ISLE_Lexicon_1.0.pdf

Birgit Hellwig, 2003. IMDI Editor, version 2.0: Manual. Version of 2 April 2003. http://www.mpi.nl/IMDI/tools/IMDI_Editor_Manual_2_0.doc

Birgit Hellwig, 2003. IMDI Browser, version 1.4: Manual. Version of 12 September 2002. http://www.mpi.nl/IMDI/tools/IMDI_Browser_Manual-02-09-08.doc

Peter Wittenburg & Daan Broeder, 2003. Metadata in ECHO. Version of 10 March 2003. http://www.mpi.nl/echo/tec-rep/wp2-tr08-2003v1.pdf