LIRICS

Deliverable D5.1A
Data Category Registry API

Project reference number: e-Content-22236-LIRICS
Project acronym: LIRICS
Project full title: Linguistic Infrastructure for Interoperable Resources and Systems
Project contact point: Laurent Romary, INRIA-Loria, 615, rue du jardin botanique, BP101, 54602 Villers lès Nancy (France), [email protected]
Project web site: http://lirics.loria.fr
EC project officer: Erwin Valentini

Document title: Data Category Registry API
Deliverable ID: D5.1A
Document type: Report
Dissemination level: Public
Contractual date of delivery: M6
Actual date of delivery: 30th June 2005
Status & version: Version 1
Work package, task & deliverable responsible: USDF
Author(s) & affiliation(s): Marc Kemps-Snijders (MPI), Peter Wittenburg (MPI)
Additional contributor(s): With the help of Julien Nioche, Peter Wittenburg, Gil Francopoulo and Julien Ducret
Keywords: ISO, DCR, API

Document evolution

version   date
1.0       30th June 2005
1.1       29th August 2005
1.2       31st August 2005

Content

1 General Outline
2 Normative References
3 Definitions
4 Specifications
4.1 Requirements
4.2 Principles of Interaction
4.2.1 Introduction
4.2.2 Use Cases
4.2.3 Sequence Diagrams
4.3 Indicative planning
4.3.1 Introduction
4.3.2 First phase API Description
4.3.3 Error handling

Introduction

Within ISO TC37/SC4, which deals with the Management of Language Resources, one of the main activities is to develop a Data Category Registry (DCR), which is basically a flat list of concepts used in the linguistic domain in the broad sense. The DCR is sub-divided into profiles that contain concepts relevant for thematic sections such as metadata, morphosyntax, etc., each controlled by a board of experts. Each Data Category in this DCR represents such a concept and is described with the help of structured information. The documents listed under Normative References describe in detail how the DCR is set up and how it is structured.

The purpose of building a Data Category Registry (DCR) is to achieve a higher degree of interoperability between linguistic resources. Two syntactic annotations created by different researchers may, for example, encode Part-Of-Speech in different ways.
By referring to the same concepts registered within the ISO DCR, interoperability can nevertheless be achieved. Modern applications could therefore exploit these references to the DCR. In an era where it is becoming easier to virtually join linguistic resources, access to such centrally maintained reference frameworks for linguistic terminology will be one of the keys to success.

Accessing such a centrally maintained DCR requires standardized programming interfaces that allow linguistic applications to request specific information. This document describes in detail the application programming interface (API) that will allow programs to access this information. The API is accessible via the Internet. First, we describe the problem being dealt with in simple terms; second, we refer to existing standards; third, we define the terms used; and finally, we give detailed specifications of the API. This specification will allow application programmers to exploit the content of the DCR for the intended purposes.

1 General Outline

Increasingly often, specialists will create application programs that need to make use of different language resources covering different terminologies, i.e. resources for which there is no basis for interoperability. The emergence of the Data Category Registry defined by ISO TC37/SC4 will, however, allow resource creators to refer to centrally registered concepts and in doing so create interoperability. Applications, in turn, have to be able to exploit these references. We can already point to a number of applications developed to work in such environments of resources covering different terminologies: ELAN [1] and ANNEX, tools to create and exploit multimedia annotations; LEXUS [2], a tool to create and exploit complex LMF-based lexica; and GATE [3], a strong framework to carry out sequences of NLP (Natural Language Processing) operations on language resources.
[1] ELAN is a tool for the manual creation and exploitation of complex annotations of primary linguistic resources such as texts, sounds, images and videos. ANNEX is its web-based implementation. (http://www.mpi.nl/tools)
[2] LEXUS is a tool for the creation and exploitation of LMF-compliant lexica. (http://www.mpi.nl/lexus)
[3] GATE is a framework for carrying out sequences of NLP operations. (http://gate.usfd.uk??)

The following diagram may give an impression of the domain in which the API described in this document can be used. We assume that a researcher wants to create, either manually or automatically, a Part-Of-Speech annotation. We further assume that he/she wants to build on existing linguistic knowledge and to integrate the emerging resource into the interoperable domain. Working with an annotation tool, he/she will want to browse or search the ISO DCR for an appropriate concept to encode Part-Of-Speech. Once it is found, he/she will want to integrate all useful information into the schema and make a reference to the DCR entry. The information to be included could be the name of the concept, to be used as the tier name; the definition of the concept, to allow a quick look-up by other people; and the value range (conceptual domain), to be included in menus that constrain the persons or algorithms carrying out the actual encoding. Once this is done, search engines or other types of linguistic tools could make use of the extracted information or contact the ISO DCR to request even more information.

To make this work, the application program has to be able to contact a service offering all information of the DCR, i.e. an interface has to be specified through which an arbitrary application program can access the information contained in the DCR. The application program then has to include a module that supports this interface, and the DCR, which is a structured resource, has to be encapsulated by a service that also supports this interface.
The service at the DCR side will be specified and implemented as a web service, i.e. it will support standards such as WSDL to specify the accessible methods and SOAP to describe the exchange of information.

[Figure: diagram with the components ISO TC37/SC4 Data Category Registry, Web Service, Interface Module and Linguistic Application]

The API will be implemented in two phases:
(1) In a first phase we will implement a simple version that does not have all possible features.
(2) In a second phase we will implement a complete interface.
The specifications in the following chapters will include the full API. The working language for all specifications will be English.

2 Normative References

The following references contain further information on terms used in this document.

ISO TC37/SC4   http://www.tc37sc4.org/
UDDI           http://www.uddi.org
WSDL           http://www.w3.org/TR/wsdl
SOAP           www.w3.org/TR/soap

3 Definitions

API
An Application Programming Interface specifies exactly the interface between two programming modules and/or services. It specifies the methods that can be executed and the parameters that have to be provided or that will be returned. WSDL is a generic API specification framework for services and modules that interact via the web protocol HTTP.

4 Specifications

4.1 Requirements

This section briefly outlines the technical implementation requirements defined for the project. Functional requirements on the interface are described in subsequent sections. The following technical requirements have to be met:
1. The DCR ultimately has to be offered as a web service:
   a. there has to be a UDDI description
   b. there has to be a WSDL API description
   c. message interchange should occur according to the SOAP protocol
2. The UDDI entry has to be searchable/browsable as one of the linguistic information services, so a physical address of the service interface definition (WSDL) has to be returned.
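Requirement 1c can be pictured with a minimal Python sketch that wraps a method call in a SOAP envelope. It is only an illustration under stated assumptions: the SOAP 1.1 envelope namespace is used, the service namespace urn:example:dcr is hypothetical, and encoding each parameter as one child element is an assumption. The real message layout would be fixed by the WSDL description of the service. The method name getDataCategories and the parameter aProfile are taken from section 4.3.2.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (assumption: the deliverable does not fix a SOAP version).
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace, not taken from the deliverable.
DCR_NS = "urn:example:dcr"


def build_request(method, params=None):
    """Return a SOAP envelope (as a string) that calls `method` with string parameters."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{DCR_NS}}}{method}")
    # Assumed encoding: one child element per parameter, named after the parameter.
    for name, value in (params or {}).items():
        ET.SubElement(call, f"{{{DCR_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_request("getDataCategories", {"aProfile": "Terminology"}))
```

Because every call is a self-contained message, a client module built this way keeps no session state between requests.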
4.2 Principles of Interaction

4.2.1 Introduction

The purpose of DCR interaction is, initially, to assist users in selecting suitable data categories for their lexical resources from an application of their choice. The interaction should be transparent and seamless, so the user needs no knowledge of how the DCR is actually accessed or of where it is located. Access to the DCR is performed by the user's application. The goal is to provide all the functionality an application needs to guide users through the selection process. Also, once a selection has been made, information regarding a specific data category must be accessible directly, i.e. without the need for any intermediate steps.

4.2.2 Use Cases

Applications accessing the DCR will wish to retrieve information regarding data categories. For most applications the process of retrieving information is user driven: the user will either browse or search the DCR. The figure below shows the use cases envisaged for traversing the DCR.

4.2.2.1 Browse catalogue (basic browsing)

The user wishes to browse the data category catalogue in order to select an appropriate data category. Rather than presenting the end-user with a flat list of all data categories in the DCR, a list of available profiles is presented, from which the user may select the profile applicable to the domain he/she is working in. When a profile is selected, the data categories from that profile are presented. The list should provide sufficient information for the user to determine whether a data category is of any interest. The user may then proceed to select a data category from the list and view the details.

4.2.2.2 Search Catalogue

When an end-user is familiar with the DCR, he/she may want to select a data category without having to go through the catalogue browsing selection process. Instead, a data category may be selected by searching the catalogue using appropriate search parameters.
The user is then presented with a list of data categories matching the specified search criteria, from which the desired data category may be selected; its details are then displayed.

4.2.2.3 Browse catalogue (ConceptGeneric browsing)

Since the DCR supports is-a relations, stored under BroaderConceptGeneric, another means of browsing the DCR is to follow these relations. An example of an is-a relation is transitive verb, which is a verb. Browsing the catalogue in this way may be done either top-down or bottom-up. Top-down browsing starts by loading all data categories that themselves have no BroaderConceptGeneric, i.e. those at the top of the hierarchy. Browsing then proceeds by selecting a data category, after which a list is presented of the data categories that have the selected data category as their BroaderConceptGeneric. When verb is selected, for example, the next level in the hierarchy would include transitive verb. The user may next proceed to view details on a data category from the presented list by selecting a data category of interest.

In bottom-up browsing a user is interested in retrieving the information of the BroaderConceptGeneric of the selected data category. In the case of transitive verb, a user will be interested in the information related to verb. The user may then view the details of that broader data category. Bottom-up browsing is only done when a data category has already been found, e.g. by searching the DCR.

4.2.3 Sequence Diagrams

The various use cases can be broken down to show the detailed interaction between the application and the DCR connector. The next sections describe the interaction for the various use cases.

4.2.3.1 Browse catalogue (basic browsing)

When an end-user browses the catalogue in this manner, a list of profiles is presented first. The user then selects a profile of interest, for which the list of data categories is presented.
The user may then select a data category to view all its details. Browsing the catalogue involves the following steps:
1. A list of profiles is requested from the DCR connector.
2. A list of data categories is requested from the DCR connector for a specific profile. The list of data categories may be returned in an abbreviated form.
3. The details for a specific data category are requested from the DCR. All information on the data category is returned.
The following sequence diagram illustrates the interaction process.

4.2.3.2 Search Catalogue

A user may access the DCR by specifying search terms for the data category he/she is interested in. The search process comprises the following steps:
1. Search the DCR using specified keywords and search parameters. The system will return a list of all data categories that match the specified search criteria.
2. View details regarding one of the data categories from the presented result list. The system will return all detailed information regarding the requested data category.
The following sequence diagram outlines this process.

4.2.3.3 Browse catalogue (ConceptGeneric browsing)

A user may traverse the DCR by browsing the BroaderConceptGeneric relations. Two approaches are possible, top-down and bottom-up. In top-down browsing the user starts from all data categories that are regarded as top-level concepts, i.e. those for which there are no higher-level concepts. In bottom-up browsing the user traverses the DCR from a preselected data category to more generic data categories. An example of top-down navigation is going from verb to transitive verb; an example of bottom-up browsing is going from transitive verb to verb.

Top-down navigation consists of the following steps:
1. A list of all data categories that do not have a more generic concept is requested from the DCR. A list of data categories is returned.
2. The user selects a data category from the list and requests all data categories for which the selected data category is the more generic concept. A list of data categories is returned. The list may be empty if no such data categories are present.
3. The user selects a data category from the list to view its details.

Bottom-up navigation is based on the assumption that a user has preselected a data category and is interested in the more generic data category. Bottom-up navigation consists of only a single step:
1. Get the generic concept for this data category. The DCR will return all data category details, or none if no generic concept is defined for the selected data category.

4.3 Indicative planning

4.3.1 Introduction

Development of the API is foreseen in two phases. The first phase covers basic interaction with the DCR (basic browsing/searching), while the second phase will model more advanced DCR interaction (concept browsing and other functionalities). For the first phase all use cases and interactions have been described. For the second phase API only concept browsing is currently defined. Use cases and specific interactions for other functionalities are still under development.

4.3.2 First phase API Description

From the specifications above an API may be derived that provides all functionality necessary to implement the required use cases. In the first phase two use cases have been identified to be described and implemented: browse catalogue (basic browsing) and search catalogue. All calls to the API are stateless, i.e. all information needed to understand the call is present in the call's parameters. The resulting API calls are listed below. The following methods are distinguished:

a. function List getProfiles()
returns the names of all profiles available in the DCR (the default language is English). The function returns a list of strings, one per profile.
The results will be delivered in the following RelaxNG format:

<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <ref name="struct_listOfProfiles" />
  </start>
  <define name="struct_listOfProfiles">
    <element name="struct">
      <attribute name="type">
        <value>ListofProfiles</value>
      </attribute>
      <zeroOrMore>
        <element name="feat">
          <attribute name="type">
            <value>profile</value>
          </attribute>
          <text />
        </element>
      </zeroOrMore>
    </element>
  </define>
</grammar>

b. function List getDataCategories(aProfile)
returns all data categories for the specified profile. The method returns a list of strings that at least have to include the URIDs and names of the data categories.

Example: getDataCategories("Terminology")

Parameter name   Type     ValueRange
aProfile         String   -

The results will be delivered in the following RelaxNG format:

<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <ref name="struct_DCS" />
  </start>
  <define name="struct_DCS">
    <element name="struct">
      <attribute name="type">
        <value>DCS</value>
      </attribute>
      <zeroOrMore>
        <ref name="struct_DC" />
      </zeroOrMore>
    </element>
  </define>
  <define name="struct_DC">
    <element name="struct">
      <attribute name="type">
        <value>DC</value>
      </attribute>
      <attribute name="id">
        <text />
      </attribute>
      <element name="feat">
        <attribute name="type">
          <value>registrationStatus</value>
        </attribute>
        <choice>
          <value>standard</value>
          <value>qualified</value>
          <value>candidate</value>
          <value>retired</value>
          <value>superseded</value>
        </choice>
      </element>
      <element name="feat">
        <attribute name="type">
          <value>registrationAuthority</value>
        </attribute>
        <text />
      </element>
      <element name="feat">
        <attribute name="type">
          <value>identifier</value>
        </attribute>
        <text />
      </element>
      <element name="feat">
        <attribute name="type">
          <value>version</value>
        </attribute>
        <text />
      </element>
    </element>
  </define>
</grammar>

As the schema shows, data categories are delivered in a reduced form, containing only the data category's ID, registrationStatus, registrationAuthority, identifier and version.

c. function List getDataCategories(aProfile, aRegistrationStatus)
returns all data categories for the specified profile with the specified registration status. The method returns a list of strings that at least have to include the URIDs and names of the data categories.

Example: getDataCategories("Terminology", "standard")

Parameter name        Type     ValueRange
aProfile              String   -
aRegistrationStatus   String   All, standard, qualified, candidate, retired, superseded

The results will be delivered in the RelaxNG format described above (getDataCategories(aProfile)).

d. function DataCategory getDataCategory(URID)
returns the data category identified by the specified URID.

Example: getDataCategory("12AJS WSD");

Parameter name   Type     ValueRange
URID             String   -

The results will be delivered in the RelaxNG format of ISO 12620 (available from the Syntax site). A compressed model is shown below.

e. function List searchDataCategories(aListOfKeywords, aListOfFields, aProfile, aRegistrationStatus)
returns a simplified list of data categories. The search is performed using the specified list of keywords. An AND operator is assumed between the keywords. The same wildcards may be used in the keyword list as are currently possible within the SYNTAX search interface. The search is performed over the specified list of fields. Possible field values are identifier, definition, explanation, example and note. An OR operator is assumed between the aListOfFields elements. Profile and registration status are optional.

Parameter name        Type           ValueRange
aListOfKeywords       String array   -
aListOfFields         String array   identifier, definition, explanation, example, note
aProfile              String         -
aRegistrationStatus   String         All, standard, qualified, candidate, retired, superseded

The results are returned in the RelaxNG format specified under getDataCategories(aProfile), i.e. a list of summary data categories is returned.
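A client could read such a summary list along the following lines. This is a minimal Python sketch, not part of the specification: the sample instance is hypothetical (its id and feature values are invented for illustration), and it assumes that the returned elements carry no XML namespace, which the struct_DCS schema suggests but does not state.

```python
import xml.etree.ElementTree as ET

# Hypothetical response instance shaped after the struct_DCS schema;
# ids and feature values are invented for illustration only.
SAMPLE = """\
<struct type="DCS">
  <struct type="DC" id="dc-0001">
    <feat type="registrationStatus">standard</feat>
    <feat type="registrationAuthority">example-authority</feat>
    <feat type="identifier">partOfSpeech</feat>
    <feat type="version">1.0</feat>
  </struct>
  <struct type="DC" id="dc-0002">
    <feat type="registrationStatus">candidate</feat>
    <feat type="registrationAuthority">example-authority</feat>
    <feat type="identifier">transitiveVerb</feat>
    <feat type="version">0.1</feat>
  </struct>
</struct>
"""


def parse_summaries(xml_text):
    """Turn a DCS list into a list of dicts, one per summary data category."""
    root = ET.fromstring(xml_text)
    if root.tag != "struct" or root.get("type") != "DCS":
        raise ValueError("expected a <struct type='DCS'> root element")
    summaries = []
    for dc in root.findall("struct"):
        if dc.get("type") != "DC":
            continue  # skip anything that is not a data category entry
        entry = {"id": dc.get("id")}
        for feat in dc.findall("feat"):
            # Each feat contributes one key, named after its type attribute.
            entry[feat.get("type")] = (feat.text or "").strip()
        summaries.append(entry)
    return summaries


if __name__ == "__main__":
    for summary in parse_summaries(SAMPLE):
        print(summary["id"], summary["identifier"], summary["registrationStatus"])
```

The id obtained this way is what an application would then pass on to retrieve the full details of one data category.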
4.3.3 Error handling

Situations may occur in which the DCR connector is unable to fulfil the data request, for example when details on a data category are requested using a non-existent identifier. For these situations appropriate error messages must be returned. The following table lists all calls and the foreseen error messages.

Method call         Error description                                    Error code
All                 General failure                                      FAILURE_GENERAL
getDataCategories   The specified profile does not exist                 PROFILE_INVALID
getDataCategories   The specified registration status does not exist     REGISTRATIONSTATUS_INVALID
getDataCategory     The specified URID does not exist                    URID_INVALID

Upon failure the appropriate error code will be returned, signalling the type of error, along with additional information which may be useful for developers of the calling application. The latter is intended merely for debugging purposes and is optional.
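On the client side, the error codes in the table could be mapped to exceptions roughly as follows. This Python sketch is an assumption about how a calling application might organise its error handling. The class and function names are hypothetical, and treating an unknown code as FAILURE_GENERAL is a design choice of the sketch, not something the specification prescribes.

```python
from enum import Enum


class DcrErrorCode(Enum):
    """Error codes from the table above, with their descriptions as values."""
    FAILURE_GENERAL = "General failure"
    PROFILE_INVALID = "The specified profile does not exist"
    REGISTRATIONSTATUS_INVALID = "The specified registration status does not exist"
    URID_INVALID = "The specified URID does not exist"


class DcrError(Exception):
    """Hypothetical client-side exception carrying the code and optional debug detail."""

    def __init__(self, code, detail=None):
        self.code = code
        self.detail = detail  # optional additional information, for debugging only
        message = f"{code.name}: {code.value}"
        if detail:
            message += f" ({detail})"
        super().__init__(message)


def raise_for_error(code_name, detail=None):
    """Raise DcrError for a returned error code; unknown codes become FAILURE_GENERAL."""
    try:
        code = DcrErrorCode[code_name]
    except KeyError:
        code = DcrErrorCode.FAILURE_GENERAL
    raise DcrError(code, detail)


if __name__ == "__main__":
    try:
        raise_for_error("URID_INVALID", "12AJS WSD")
    except DcrError as err:
        print(err)
```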