D5.1AV1 Data Category Registry API - Lirics

LIRICS
Deliverable D5.1A
Data Category Registry API
Project reference number
e-Content-22236-LIRICS
Project acronym
LIRICS
Project full title
Linguistic Infrastructure for Interoperable Resources and Systems
Project contact point
Laurent Romary, INRIA-Loria
615, rue du jardin botanique BP101.
54602 Villers lès Nancy (France)
[email protected]
Project web site
http://lirics.loria.fr
EC project officer
Erwin Valentini
Document title
Data Category Registry API
Deliverable ID
D5.1A
Document type
Report
Dissemination level
Public
Contractual date of delivery
M6
Actual date of delivery
30th June 2005
Status & version
Version 1
Work package, task & deliverable responsible
USDF
Author(s) & affiliation(s)
Marc Kemps-Snijders (MPI)
Peter Wittenburg (MPI)
Additional contributor(s)
With the help of Julien Nioche, Peter Wittenburg,
Gil Francopoulo and Julien Ducret
Keywords
ISO, DCR, API
Document evolution
version   date
1.0       30th June 2005
1.1       29th August 2005
1.2       31st August 2005
Content
1 General Outline
2 Normative References
3 Definitions
4 Specifications
4.1 Requirements
4.2 Principles of Interaction
4.2.1 Introduction
4.2.2 Use Cases
4.2.3 Sequence Diagrams
4.3 Indicative planning
4.3.1 Introduction
4.3.2 First phase API Description
4.3.3 Error handling
Introduction
Within ISO TC37/SC4, which deals with the management of language resources, one of the main
activities is the development of a Data Category Registry (DCR): essentially a flat list of
concepts used in the linguistic domain in the broad sense. The DCR is subdivided into
profiles that contain the concepts relevant to thematic sections such as metadata and
morphosyntax, each controlled by a board of experts. Each data category in the DCR represents
one such concept and is described by structured information. The documents listed under
Normative References describe in detail how the DCR is set up and structured.
The purpose of building a Data Category Registry is to achieve a higher degree of
interoperability between linguistic resources. Two syntactic annotations created by different
researchers may, for example, encode Part-Of-Speech in different ways. If both refer to the
same concepts registered in the ISO DCR, interoperability can nevertheless be achieved, and
modern applications can exploit these references. In an era where it is becoming easier to
join linguistic resources virtually, access to such centrally maintained reference frameworks
for linguistic terminology will be one of the keys to success.
Accessing a centrally maintained DCR requires standardized programming interfaces through
which linguistic applications can request specific information. This document describes
in detail the application programming interface (API) that allows programs to access this
information. The API is accessible via the Internet. We first describe the problem in simple
terms, then refer to existing standards, then define the terms used, and finally give detailed
specifications of the API. This specification will allow application programmers to exploit
the content of the DCR for the intended purposes.
1 General Outline
Increasingly often, specialists create application programs that need to work with
different language resources using different terminologies, i.e. there is no basis for
interoperability. The emergence of the Data Category Registry defined by ISO TC37/SC4 will,
however, allow resource creators to refer to centrally registered concepts and in doing so
create interoperability. Applications, in turn, have to be able to exploit these references.
Several applications are already being developed to work in such environments of resources
with different terminologies: ELAN [1] and ANNEX, tools to create and exploit multimedia
annotations; LEXUS [2], a tool to create and exploit complex LMF-based lexica; and GATE [3],
a framework for carrying out sequences of NLP (Natural Language Processing) operations on
language resources.
The following diagram gives an impression of the domain in which the API described in this
document can be used. We assume that a researcher wants to create a Part-Of-Speech annotation,
either manually or automatically, and that he/she wants to build on existing linguistic
knowledge and integrate the emerging resource into the interoperable domain. Working with an
annotation tool, he/she will want to browse or search the ISO DCR for an appropriate concept
to encode Part-Of-Speech. Once it is found, he/she will want to integrate all useful
information into the schema and make a reference to the DCR entry. The information to be
included could be the name of the concept, to be used as the tier name; the definition of the
concept, to allow a quick look-up by other people; and the value range (conceptual domain), to
be included in menus that constrain the persons or algorithms carrying out the actual
encoding. Search engines or other types of linguistic tools could then make use of the
extracted information or contact the ISO DCR to request even more information.
[1] ELAN is a tool for the manual creation and exploitation of complex annotations of primary
linguistic resources such as texts, sounds, images and videos. ANNEX is its web-based
implementation. (http://www.mpi.nl/tools)
[2] LEXUS is a tool for the creation and exploitation of LMF-compliant lexica. (http://www.mpi.nl/lexus)
[3] GATE is a framework for carrying out sequences of NLP operations. (http://gate.ac.uk)
To make this work, the application program has to be able to contact a service offering all
information in the DCR, i.e. an interface has to be specified through which an arbitrary
application program can access the information contained in the DCR. The application program
then has to include a module that supports this interface, and the DCR, which is a structured
resource, has to be encapsulated in a service that also supports this interface. The service on
the DCR side will be specified and implemented as a web service, i.e. it will support standards
such as WSDL to specify the accessible methods and SOAP to describe the exchange of information.
[Figure: a Linguistic Application, through its Interface Module, accesses the Web Service of the ISO TC37/SC4 Data Category Registry.]
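As an illustration of the SOAP exchange mentioned above, the following minimal sketch builds a request envelope with Python's standard library. The SOAP 1.1 envelope namespace is the standard one; the `urn:example:dcr` namespace and the element layout of the call are assumptions made for illustration only, since the actual message format would be fixed by the WSDL description of the service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
DCR_NS = "urn:example:dcr"  # hypothetical namespace, not defined in this document


def build_request(method, params):
    """Wrap a method call and its named string parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{DCR_NS}}}{method}")
    for name, value in params.items():
        # One child element per parameter (an assumed encoding)
        ET.SubElement(call, f"{{{DCR_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="unicode")


request = build_request("getDataCategories", {"aProfile": "Terminology"})
```

Because every call carries all of its parameters, an interface module built this way needs no session state between requests.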
The API will be implemented in two phases: (1) In a first phase we will implement a simple
version not having all possible features. (2) In a second phase we will implement a complete
interface. The specifications in the following chapters will include the full API. The working
language for all specifications will be English.
2 Normative References
The following references contain further information on terms used in this document.
ISO TC37/SC4   http://www.tc37sc4.org/
UDDI           http://www.uddi.org
WSDL           http://www.w3.org/TR/wsdl
SOAP           http://www.w3.org/TR/soap
3 Definitions
API
An Application Programming Interface exactly specifies an interface between two
programming modules and/or services. It specifies the methods that can be executed and the
parameters that have to be provided or that will be returned. WSDL is a generic API
specification framework for services and modules that interact via the web protocol HTTP.
4 Specifications
4.1 Requirements
This section briefly outlines the technical implementation requirements defined for the
project. Functional requirements on the interface are described in subsequent sections.
The following technical requirements have to be met:
1. The DCR ultimately has to be offered as a web service:
a. there has to be a UDDI description;
b. there has to be a WSDL API description;
c. message interchange should follow the SOAP protocol.
2. The UDDI entry has to be searchable/browsable as one of the linguistic information
services, so that the physical address of the service interface definition (WSDL) is
returned.
4.2 Principles of Interaction
4.2.1 Introduction
The purpose of DCR interaction is, initially, to assist users in selecting suitable
data categories for their lexical resources from an application of their choice. The
interaction should be transparent and seamless, so that the user needs no knowledge of how
the DCR is actually accessed or where it is located. Access to the DCR is performed by
the user's application. The goal is to provide all the functionality an application needs to
guide users through the selection process. In addition, once a selection has been made,
information about a specific data category must be directly accessible, i.e. without the
need for any intermediate steps.
4.2.2 Use Cases
Applications accessing the DCR will want to retrieve information about data categories.
For most applications the retrieval process is user driven: the user either browses or
searches the DCR. The figure below shows the use cases envisaged for
traversing the DCR.
4.2.2.1 Browse catalogue (basic browsing)
The user wishes to browse the data category catalogue in order to select an appropriate data
category. Rather than presenting the end-user with a flat list of all data categories in
the DCR, a list of available profiles is presented, from which the user may select the
profile applicable to the domain he/she is working in. When a profile is selected, the data
categories from that profile are presented. The list should provide sufficient information
for the user to determine whether a data category is of interest. The user may then select a
data category from the list and view its details.
4.2.2.2 Search Catalogue
When an end-user is familiar with the DCR, he/she may want to select a data category without
going through the catalogue browsing process. Instead, a data category may
be selected by searching the catalogue using appropriate search parameters. The user is
then presented with a list of data categories matching the specified search criteria, from
which the desired data category may be selected to display its details.
4.2.2.3 Browse catalogue (ConceptGeneric browsing)
Since the DCR supports is-a relations, stored under BroaderConceptGeneric, another means
of browsing the DCR is to follow these relations. An example of an is-a relation is transitive
verb, which is a verb. Browsing the catalogue in this way may be done either top-down or
bottom-up. Top-down browsing starts by loading all data categories that themselves have no
BroaderConceptGeneric, i.e. those at the top of the hierarchy. Browsing then proceeds by
selecting a data category, after which a list is presented of the data categories of which the
selected data category is the BroaderConceptGeneric. When verb is selected, for example,
the next level in the hierarchy would include transitive verb. The user may then
view the details of a data category by selecting it from the presented list.
In bottom-up browsing, the user is interested in retrieving the information of the
BroaderConceptGeneric of the selected data category. In the case of transitive verb, the user
will be interested in the information related to verb. The user may then view the details of
that data category. Bottom-up browsing is only possible once a data category has already been
found, e.g. by searching the DCR.
4.2.3 Sequence Diagrams
The use cases can be broken down to show the detailed interaction between the
application and the DCR connector. The following sections describe the interaction for each
use case.
4.2.3.1 Browse catalogue (basic browsing)
When an end-user browses the catalogue in this manner, a list of profiles is presented first.
The user then selects a profile of interest, for which the list of data categories is presented.
The user may then select a data category to view all its details.
Browsing the catalogue involves the following steps:
1. A list of profiles is requested from the DCR connector.
2. A list of data categories is requested from the DCR connector for a specific profile.
The list of data categories may be returned in an abbreviated form.
3. The details for a specific data category are requested from the DCR. All information on
the data category is returned.
The following sequence diagram illustrates the interaction process.
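The three browsing steps can also be sketched in code. The example below is a minimal illustration, not the actual connector: `StubConnector` is a hypothetical in-memory stand-in, its sample records and URIDs are invented, and only the three method names are taken from the API described in this document.

```python
class StubConnector:
    """Hypothetical in-memory stand-in for the DCR connector (invented data)."""

    def __init__(self):
        # profile name -> {URID -> full record}
        self._profiles = {
            "Morphosyntax": {
                "dc-1": {"identifier": "verb", "definition": "sample definition"},
            },
            "Terminology": {
                "dc-2": {"identifier": "partOfSpeech", "definition": "sample definition"},
            },
        }

    def getProfiles(self):
        return sorted(self._profiles)

    def getDataCategories(self, aProfile):
        # Abbreviated form: URID and identifier only
        return [{"id": urid, "identifier": rec["identifier"]}
                for urid, rec in self._profiles[aProfile].items()]

    def getDataCategory(self, urid):
        for records in self._profiles.values():
            if urid in records:
                return dict(records[urid], id=urid)
        raise KeyError(urid)


def browse(connector, choose_profile, choose_category):
    """Run the three browsing steps; the two callbacks stand in for the user."""
    profiles = connector.getProfiles()                 # step 1
    profile = choose_profile(profiles)
    summaries = connector.getDataCategories(profile)   # step 2
    summary = choose_category(summaries)
    return connector.getDataCategory(summary["id"])    # step 3


# A "user" who always picks the first entry offered
details = browse(StubConnector(), lambda ps: ps[0], lambda cs: cs[0])
```

Each step depends only on values returned by the previous one, which is what allows the calls themselves to remain stateless.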
4.2.3.2 Search Catalogue
A user may access the DCR by specifying search terms for the data category he/she is
interested in.
The search process comprises the following steps:
1. Search the DCR using the specified keywords and search parameters. The system will
return a list of all data categories that match the specified search criteria.
2. View the details of one of the data categories in the presented result list. The
system will return all detailed information on the requested data category.
The following sequence diagram outlines this process.
4.2.3.3 Browse catalogue (ConceptGeneric browsing)
A user may traverse the DCR by browsing the BroaderConceptGeneric relations. Two
approaches are possible: top-down and bottom-up. In top-down browsing the user starts from
the data categories that are regarded as top-level concepts, i.e. those for which there are no
higher-level concepts. In bottom-up browsing the user traverses the DCR from a preselected
data category to more generic data categories. An example of top-down navigation is going
from verb to transitive verb; an example of bottom-up browsing is going from transitive verb
to verb.
Top-down navigation consists of the following steps:
1. A list of all data categories that do not have a more generic concept is requested from
the DCR. A list of data categories is returned.
2. The user selects a data category from the list and requests all data categories for which
the selected data category is the more generic concept. A list of data categories is
returned. The list may be empty if no such data categories are present.
3. The user selects a data category from the list to view its details.
Bottom-up navigation is based on the assumption that the user has preselected a data category
and is interested in the more generic data category. Bottom-up navigation consists of a
single step:
1. Get the generic concept for this data category. The DCR will return all data category
details, or none if no generic concept is defined for the selected data category.
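Both navigation directions can be illustrated over a small table of BroaderConceptGeneric relations. This is only a sketch under stated assumptions: the function names `top_level`, `narrower` and `broader` are hypothetical (the second-phase API is not yet specified in this document), and the entries other than verb and transitive verb are invented sample data.

```python
# data category -> its BroaderConceptGeneric (None for top-level concepts)
BROADER = {
    "verb": None,
    "noun": None,                  # invented sample entry
    "transitive verb": "verb",
    "intransitive verb": "verb",   # invented sample entry
}


def top_level():
    """Top-down step 1: data categories without a more generic concept."""
    return sorted(dc for dc, parent in BROADER.items() if parent is None)


def narrower(selected):
    """Top-down step 2: data categories whose generic concept is `selected`."""
    return sorted(dc for dc, parent in BROADER.items() if parent == selected)


def broader(selected):
    """Bottom-up: the generic concept of `selected`, or None if undefined."""
    return BROADER[selected]
```

Calling `narrower` on a data category with no more specific concepts yields an empty list, matching the remark in step 2 that the returned list may be empty.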
4.3 Indicative planning
4.3.1 Introduction
Development of the API is foreseen in two phases. The first phase covers basic interaction
with the DCR (basic browsing and searching), while the second phase will model more advanced
DCR interaction (concept browsing and other functionality). For the first phase,
all use cases and interactions have been described. For the second-phase API, only concept
browsing is currently defined. Use cases and specific interactions for other functionality
are still under development.
4.3.2 First phase API Description
From the specifications above, an API can be derived that provides all the functionality
needed to implement the required use cases. In the first phase two use cases have been
identified to be described and implemented: browse catalogue (basic browsing) and search
catalogue. All calls to the API are stateless, i.e. all information needed to understand the call
is present in the call's parameters. The resulting API calls are listed below.
The following methods are distinguished:
a. function List getProfiles ()
returns the names of all profiles available in the DCR (the default language is English).
The function returns a list of strings, one per profile.
The results will be delivered in the following RELAX NG format:
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <ref name="struct_listOfProfiles" />
  </start>
  <define name="struct_listOfProfiles">
    <element name="struct">
      <attribute name="type">
        <value>ListofProfiles</value>
      </attribute>
      <zeroOrMore>
        <element name="feat">
          <attribute name="type">
            <value>profile</value>
          </attribute>
          <text />
        </element>
      </zeroOrMore>
    </element>
  </define>
</grammar>
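To show how a client might consume such a result, the sketch below parses a response instance with Python's standard library. The sample response is hypothetical: it was written by hand to conform to the schema above, and the profile names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical response instance conforming to the schema above
RESPONSE = """
<struct type="ListofProfiles">
  <feat type="profile">Terminology</feat>
  <feat type="profile">Morphosyntax</feat>
</struct>
"""


def parse_profiles(xml_text):
    """Return the profile names contained in a ListofProfiles response."""
    root = ET.fromstring(xml_text)
    if root.tag != "struct" or root.get("type") != "ListofProfiles":
        raise ValueError("not a ListofProfiles response")
    return [feat.text or "" for feat in root.findall("feat[@type='profile']")]


profiles = parse_profiles(RESPONSE)
```

Since the schema allows zero or more `feat` elements, an empty `struct` is also a valid response and simply yields an empty list.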
b. function List getDataCategories (aProfile)
returns all data categories for the specified profile. The method returns a list of
strings that at least have to include the URIDs and names of the data categories.
Example: getDataCategories("Terminology")
Parameter name   Type     ValueRange
aProfile         String   -
The results will be delivered in the following RELAX NG format:
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <ref name="struct_DCS" />
  </start>
  <define name="struct_DCS">
    <element name="struct">
      <attribute name="type">
        <value>DCS</value>
      </attribute>
      <zeroOrMore>
        <ref name="struct_DC" />
      </zeroOrMore>
    </element>
  </define>
  <define name="struct_DC">
    <element name="struct">
      <attribute name="type">
        <value>DC</value>
      </attribute>
      <attribute name="id">
        <text />
      </attribute>
      <element name="feat">
        <attribute name="type">
          <value>registrationStatus</value>
        </attribute>
        <choice>
          <value>standard</value>
          <value>qualified</value>
          <value>candidate</value>
          <value>retired</value>
          <value>superseded</value>
        </choice>
      </element>
      <element name="feat">
        <attribute name="type">
          <value>registrationAuthority</value>
        </attribute>
        <text />
      </element>
      <element name="feat">
        <attribute name="type">
          <value>identifier</value>
        </attribute>
        <text />
      </element>
      <element name="feat">
        <attribute name="type">
          <value>version</value>
        </attribute>
        <text />
      </element>
    </element>
  </define>
</grammar>
As the schema shows, data categories are delivered in a reduced form containing only
the data category's ID, registrationStatus, registrationAuthority, identifier and
version.
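A client could turn such a reduced list into simple records as sketched below. The response instance is hypothetical: it was hand-written to conform to the schema above, and the URID, authority and identifier values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response instance conforming to the schema above
RESPONSE = """
<struct type="DCS">
  <struct type="DC" id="dc-1">
    <feat type="registrationStatus">standard</feat>
    <feat type="registrationAuthority">example authority</feat>
    <feat type="identifier">partOfSpeech</feat>
    <feat type="version">1</feat>
  </struct>
</struct>
"""


def parse_summaries(xml_text):
    """Return one dict per data category, keyed by the feat type names."""
    root = ET.fromstring(xml_text)
    summaries = []
    for dc in root.findall("struct[@type='DC']"):
        summary = {"id": dc.get("id")}
        for feat in dc.findall("feat"):
            summary[feat.get("type")] = feat.text or ""
        summaries.append(summary)
    return summaries


summaries = parse_summaries(RESPONSE)
```

The `id` value of a summary is what a client would pass on to getDataCategory to retrieve the full record.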
c. function List getDataCategories (aProfile, aRegistrationStatus)
returns all data categories for the specified profile with the specified
registration status. The method returns a list of strings that at least have to
include the URIDs and names of the data categories.
Example: getDataCategories("Terminology", "standard")
Parameter name        Type     ValueRange
aProfile              String   -
aRegistrationStatus   String   All, standard, qualified, candidate, retired, superseded
The results will be delivered in the RELAX NG format described
above (getDataCategories(aProfile)).
d. function DataCategory getDataCategory (URID)
returns the data category identified by the specified URID.
Example: getDataCategory("12AJS WSD")
Parameter name   Type     ValueRange
URID             String   -
The results will be delivered in the RELAX NG format of ISO 12620 (available from
the SYNTAX site). A compressed model is shown below.
e. function List searchDataCategories (aListOfKeywords, aListOfFields,
aProfile, aRegistrationStatus)
returns a simplified list of data categories. The search is performed using the
specified list of keywords; an AND operator is assumed between the keywords.
The same wildcards may be used in the keyword list as are currently possible
in the SYNTAX search interface. The search is performed over the specified
list of fields. Possible field values are identifier, definition, explanation, example
and note. An OR operator is assumed between the aListOfFields elements. Profile
and registration status are optional.
Parameter name        Type           ValueRange
aListOfKeywords       String array   -
aListOfFields         String array   identifier, definition, explanation, example, note
aProfile              String         -
aRegistrationStatus   String         All, standard, qualified, candidate, retired, superseded
The results are returned in the RELAX NG format specified under
getDataCategories(aProfile), i.e. a list of summary data categories is returned.
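The combination of AND between keywords and OR between fields can be made concrete with a small in-memory sketch. It rests on several assumptions: the sample records are invented, matching is a plain case-insensitive substring test (the wildcard syntax of the SYNTAX interface is not reproduced), only URIDs are returned instead of full summaries, and the operators are read as "every keyword must occur in at least one of the listed fields", which is one plausible interpretation of the description above.

```python
# Invented sample records; field names follow the possible field values above
RECORDS = [
    {"id": "dc-1", "profile": "Morphosyntax", "registrationStatus": "standard",
     "identifier": "transitiveVerb", "definition": "a verb taking a direct object",
     "explanation": "", "example": "", "note": ""},
    {"id": "dc-2", "profile": "Morphosyntax", "registrationStatus": "candidate",
     "identifier": "noun", "definition": "invented sample definition",
     "explanation": "", "example": "", "note": "contrast with verb"},
]


def searchDataCategories(aListOfKeywords, aListOfFields,
                         aProfile=None, aRegistrationStatus="All"):
    """Return the URIDs of all records matching the search criteria."""
    hits = []
    for rec in RECORDS:
        if aProfile is not None and rec["profile"] != aProfile:
            continue
        if aRegistrationStatus != "All" and rec["registrationStatus"] != aRegistrationStatus:
            continue
        # AND across keywords; each keyword may match in any listed field (OR)
        if all(any(kw.lower() in rec[field].lower() for field in aListOfFields)
               for kw in aListOfKeywords):
            hits.append(rec["id"])
    return hits
```

For example, searching for "verb" in the identifier and note fields matches both sample records, while additionally restricting the registration status to "candidate" leaves only the second.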
4.3.3 Error handling
Situations may occur in which the DCR connector is unable to fulfil a
data request, for example when the details of a data category are
requested with a non-existent identifier. In these situations appropriate error
messages must be returned.
The following table lists all calls and foreseen error messages.
Method call         Error description                                  Error code
All                 General failure                                    FAILURE_GENERAL
getDataCategories   The specified profile does not exist               PROFILE_INVALID
getDataCategories   The specified registration status does not exist   REGISTRATIONSTATUS_INVALID
getDataCategory     The specified URID does not exist                  URID_INVALID
Upon failure the appropriate error code will be returned, signalling the type of error, along with
additional information that may be useful to developers of the calling application. The latter
is intended for debugging purposes only and is optional.
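On the client side, the error codes could be surfaced as exceptions, as in the following sketch. The exception class `DCRError`, the helper `error_code` and the sample data are hypothetical; how the codes are actually transported in the service response is not specified in this document.

```python
class DCRError(Exception):
    """Carries one of the error codes listed above plus optional debug detail."""

    def __init__(self, code, detail=None):
        super().__init__(code if detail is None else f"{code}: {detail}")
        self.code = code
        self.detail = detail  # optional, for debugging only


VALID_STATUSES = {"All", "standard", "qualified", "candidate", "retired", "superseded"}
PROFILES = {"Terminology": ["dc-1"]}  # invented sample data


def getDataCategories(aProfile, aRegistrationStatus="All"):
    if aProfile not in PROFILES:
        raise DCRError("PROFILE_INVALID", f"unknown profile {aProfile!r}")
    if aRegistrationStatus not in VALID_STATUSES:
        raise DCRError("REGISTRATIONSTATUS_INVALID")
    # Filtering by registration status is omitted in this stub
    return PROFILES[aProfile]


def error_code(call, *args):
    """Return the error code raised by call(*args), or None on success."""
    try:
        call(*args)
    except DCRError as err:
        return err.code
    return None
```

Keeping the code and the optional detail as separate attributes lets an application branch on the code while only logging the detail.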