European Cultural Heritage Online (ECHO)
PUBLIC

Contract n°: HPSE / 2002 / 00137
Title: D2.4 Demonstrator covering the infrastructure and the collaborative tool in an integrated way
       D2.5 Report evaluating the demonstrator on the basis of the general requirements mainly worked out in the AGORA
Author: Peter Wittenburg
Concerned WPs: Workpackage 2 (Technology)
Date of issue of this report: 16th May 2004
Project financed within the Key Action Improving the Socio-economic Knowledge Base

WP2 Deliverable D2.1 Specification Report
Deliverables D2.4 and 2.5
Interoperable Metadata Domain Evaluation
Version 1
Peter Wittenburg
Nijmegen 16.5.2004

This note emerged in collaboration with Lund University and contains contributions from almost all ECHO partners. Since reports 2.4 and 2.5 both concern the metadata infrastructure, we suggest combining them. They draw largely on reports that were partly distributed earlier:
• WP2 Note on ECHO’s Digital Open Resource Area (DORA) - WP2-TR013-2003 - Version 6
• WP2 Note on an ECHO Ontology - WP2-TR017-2004 - Version 2
• WP2 Note on the DORA Search Engine - WP2-TR018-2004 - Version 1

Content

This report includes the three WP2 reports cited on the front page and a note about the availability of the code and the knowledge components.

A. WP2 Note on ECHO’s Digital Open Resource Area (DORA)
1. DORA Design Principles
1.1 Topology
1.2 User Interface Aspects
1.3 Selection & Searching Modes
1.4 Domains and Sub-Domains
1.5 Hitlist
1.6 Implementation Issues
1.7 Harvesting Comments
2. Metadata Mapping
2.1 Introduction
2.2 Metadata Elements for DORA
2.3 Formal Framework for Mapping
Appendix A: Metadata set used by the RMV
Appendix B: Metadata set used in the History of Science (Berlin)
Appendix C: Metadata set used by the IMSS
Appendix D: Metadata set used in the Fotothek
Appendix E: Metadata set used in the Lineamenta Project
Appendix F: Metadata set used in the Maps of Rome Project
Appendix G: Metadata set used in the Language Domain
Appendix H: Metadata set used by NECEP
Appendix I: Metadata set used in Philosophy
Appendix J: Dual Mapping between Structured Elements
Appendix K: Mapping for Views
1. DC View
2. Necep View
3. RMV View
4. Fotothek View
5. Lineamenta View
6. HoS Berlin View
7. Rome Maps View
8. IMSS View
9. Language View
Appendix L: Schemas

B. WP2 Note on an ECHO Ontology
1. Provided Components
2. Generated Components - Overview
3. Components in Detail
3.1 ECHO Concepts
3.2 ECHO Mappings
3.3 OVM-Geographic Thesaurus
3.4 MPI-Geographic Thesaurus
3.5 OVM Category Thesaurus
3.6 Iconclass Category Thesaurus
3.7 IconClass-to-OVM Mapping
3.8 OVM-to-IconClass Mapping
3.7 MPI Content List
4. ECHO Knowledge Repositories
5. Exploitation

C. WP2 Note on the DORA Search Engine
1. Search Engine
1.1 DORA Interface
1.2 Harvesting
1.3 Data Pre-Processing
1.4 Index Creation
1.5 Searching
2. Evaluation
2.1 Formal
2.2 Examples and Semantics
2.3 Ranking
3. Conclusions

D. Availability of the Code and the Knowledge Components

A. WP2 Note on ECHO’s Digital Open Resource Area (DORA)
Peter Wittenburg
24.02.2004

1. DORA Design Principles

DORA is the portal that offers discovery services for resources that were and are being created by major European initiatives, in particular by the ECHO initiative. The ECHO initiative is gathering resources in five disciplines: Linguistics, History of Art, History of Science, Ethnology and Philosophy.

Under the header of Linguistics, resources from several other initiatives will be made available as well:
• the INTERA project, whose goal is to create an integrated domain of language resources;
• the DOBES project, documenting endangered languages all over the world;
• the MPI and Lund University language resources.

While the linguistic part of ECHO focuses on minority languages such as Sign Language and on linguistic objects with a heritage aspect, INTERA focuses on major languages and on combining language resource centers in Europe, and DOBES focuses on languages (in particular non-European ones) that will probably become extinct within a few years.
By combining these initiatives, together with the MPI for Psycholinguistics, DORA will offer access to a large set of resources and thereby form a critical mass.

Under the header of Ethnology various resources will also be made available: the NECEP society database, the collection of the DOGON project and the large collection of the Dutch Ethnology Museum (RMV). Other resources may be integrated at a later time.

In the area of History of Arts three databases will be added: Fotothek, Lineamenta and the ancient maps of Rome. All are housed in the Bibliotheca Hertziana.

In the area of History of Science a number of collections will be part of the DORA domain. IMSS Florence will contribute its large collection, and institutions such as U Bern, the MPI for History of Science and perhaps others will contribute as well.

In the area of Philosophy the collection of texts from the ECHO partner will be integrated.

DORA offers various access methods, primarily to the metadata descriptions, as a simple and easy navigation space. Hits will allow users to access the resources themselves, provided that they have the proper access rights. The metadata descriptions are openly accessible. Access to the resources, which can be texts, images, movies, sounds and 3D objects, may be restricted. Various views and access mechanisms will be available to meet the requirements of the different user groups.

The language resource domain within DORA mainly uses the IMDI metadata standard, although this is not required. The IMDI domain is therefore a large sub-domain in DORA. Many other holdings use different metadata sets, so various mappings have to be carried out to create a unified umbrella. This is described later in this document.

Initially, Lund U and the MPI Nijmegen will maintain DORA. However, others can set up a similar portal, since the sources will be made openly available.

1.1 Topology

The DORA service is a central one, i.e.
all metadata will be harvested to a central server and stored in a form optimized for fast access. This implies that the central server will only hold copies of the data; the originals will stay at the original institutions, where they may also be subject to changes and extensions. With each partner, a procedure will be discussed that will allow us to harvest the metadata records.

The DORA service does not extend to the resources themselves, i.e. the metadata may contain references to the digital objects they describe, such as images, texts, sound files or movies, but these resources stay at the institutions. If an institution does not have sufficient capacity to house videos, ECHO could act as an umbrella and house the resources at a central server [1].

In summary, in the DORA metadata scenario all institutions act as data providers, i.e. they offer their metadata records for harvesting by the DORA service providers. Different protocols will be necessary to harvest the data, and the institutions will offer different types of records.
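As a rough sketch of this scenario, the snippet below shows how a service provider might decide, per data provider, whether to issue an OAI harvesting request or to fetch XML files directly. The split between OAI and XML providers follows this note; the `Provider` record, the function name and all addresses are hypothetical. The request parameters `verb=ListRecords` and `metadataPrefix=oai_dc` are the standard OAI-PMH ones.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class Provider:
    """One data provider; all addresses used below are placeholders."""
    name: str
    protocol: str  # "oai" or "xml"
    address: str


def harvest_request(provider: Provider) -> str:
    """Return the URL the central server would request for this provider."""
    if provider.protocol == "oai":
        # Standard OAI-PMH request for all records in unqualified Dublin Core.
        query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
        return f"{provider.address}?{query}"
    if provider.protocol == "xml":
        # Providers that expose plain XML files are fetched as-is over HTTP.
        return provider.address
    raise ValueError(f"unknown protocol: {provider.protocol}")


# Hypothetical provider list; only the protocol split follows the note.
PROVIDERS = [
    Provider("RMV", "oai", "http://example.org/rmv/oai"),
    Provider("NECEP", "xml", "http://example.org/necep/records.xml"),
]

for p in PROVIDERS:
    print(p.name, harvest_request(p))
```

Keeping the protocol as a property of each provider means that a new institution can be added without touching the harvesting logic.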
[Figure: DORA topology. DORA service providers: the mapping of data and the different types of searches will be carried out on service-providing machines. Data providers: all data providers provide their metadata records via the OAI harvesting protocol, except for IMDI, NECEP and philosophy, where the XML files will be used.]

[1] Under certain circumstances the MPI for Psycholinguistics could house resources.

1.2 User Interface Aspects

First we want to list a number of requirements for the user interface:
• it has to support the normal working environments such as web browsers (at first a limited set of browsers will be supported)
• it has to be simple and robust
• it has to look professional to the normal web user
• it has to offer simple Google-like search on metadata as the first choice [2]
  [2] In a second version a lexicon could be displayed to help people find suitable terms while indicating the domain from which they are taken.
• users can select the domain they want to search in; the default domain is “all”
  o a preference file has to support different defaults for different users (open question where to store this: on the server or as bookmarks, ...)
• users can select a certain view (domain-specific vocabulary) to specify their queries
• the opening page has to be attractive, i.e. the layout has to be designed carefully
• all pages must use one underlying style
• the opening page has to
  o allow jumping to geographic browsing (it is not yet clear whether we can include resources other than those from languages and ethnology)
  o allow jumping to IMDI-type tree browsing
  o allow going to the specific search engines provided by the disciplines, such as the full IMDI infrastructure
• the opening page should contain all relevant links (ECHO, IMDI, MPI, DOBES, ELRA, Lund, INTERA, ...)
• it has to be checked in how far we want to extend to DC/OLAC repositories, i.e.
in how far we want to harvest other sites
• the DORA service should allow OAI (DC) service providers to harvest its holdings
• the first version must be ready as soon as possible, i.e. when components are ready they should be made visible

DORA Main Page

(A test page is available under corpus1.mpi.nl/ds/dora_demo2; please note that it is under construction.)

[Figure: DORA main page. Labelled elements: geographic selection (if possible); domain & sub-domain selection; complex structured search offering domain-dependent views (terms & explanations); browsing (if possible); Google-like full-text search field.]

This figure [3] indicates the major elements of the DORA user interface. It will support simple search, complex structured search, selection of domains and, where possible, geographical and hierarchical browsing. This version of the figure does not yet indicate the possibility of extending the simple search over metadata (keyword type), annotations (general type of metadata) and relations.

For all forms of search (simple and complex), the terms used in the descriptions will be shown in a separate window. This will facilitate searching, since it informs the user about what exists and minimizes typing errors. The best way to present the wordlists in a structured form still has to be worked out, since they can become very long.

[3] An appropriate symbol representing philosophy is still missing.

Complex Search Page

When the user selects Complex Search the following page will show up:

[Figure: complex search page. Labelled elements: search domain is selected; selection of complex search; selection of view (domain vocabulary for complex search), e.g. Ethnology with NECEP view and RMV view; query input fields.]

The user can still select the domain and sub-domain to search in, and whether to search on metadata, annotations and/or relations. When a special view is selected, a vocabulary will be shown that the user may be more familiar with. The offered fields can be used to enter strings to form the structured query.
In general we will use a subset of elements from the different domains. Candidates are those elements that can be mapped to other domains. Users who want to do more specific searches using elements that cannot be mapped will be able to go to the specific search engines. One of the detailed views is the DC view, which will offer the well-known 15 DC elements.

Browsing Page

Currently, we see two domains where browsing in metadata domains is an issue. IMDI uses this concept for language resources, and the Alcatraz environment seems to support browsing according to some thesaurus. Where possible we will support browsing in such metadata domains. Browsing should interact with simple search in that any browsing step can also serve to specify a sub-domain for simple search: if a user has selected some node by browsing, it should be possible to do a simple search that uses the node as a selection criterion to narrow down the search space.

Since date information is used by many metadata sets, it has to be checked in how far it is possible to generate a browsable tree that orders resources according to their date.

Geographic Browsing Page

One very popular form of browsing uses geographical information. Since many metadata sets use geographic indicators such as continent, country, region and place, it may be possible to attach this type of information to geographic maps so that people can make selections based on these maps. DORA has to distinguish the different usages of geographical information, i.e. the place of origin is not the same as the place where an object is located. In general one would use the place of origin within the DORA framework. This has to be analyzed in more detail. Here again it is important to allow selection criteria, i.e. to show information only for the selected domains and sub-domains.

In many cases it is a problem to associate a document with geographical maps.
A society will live within a region, but drawing regions can easily cause political problems. Therefore, DORA will associate information with useful points on the maps, although this is not optimal in many respects.

The world map can be broken up into a number of sub-pages at two or three levels. A possible second layer is indicated in the figure above. That should be sufficient to mark all points in sufficient detail. There may be some detail maps, as for the History of Arts, where most resources point to places in Italy. When a point is selected by clicking, all resources are shown as hits so that people can view or listen to them.

1.3 Selection & Searching Modes

Here we want to summarize the searching modes again.
• Domain Selection. The user can select the domains he wants to operate in, and this has to affect the search and selection modes except the geographic one. We will offer domains and sub-domains for selection.
• Resource-Type Selection. The user can select to operate on metadata, annotations and/or relations in the simple search mode.
• Simple Search offers Google-like facilities, and initially the user does not get any help. At a later stage one could think of a lexicon of all possible terms. Simple search operates on an index that contains all metadata values that occur in the participating domains. This includes in particular the descriptions since, for example in ethnology, it is especially the descriptions that contain the useful material. In doing so, simple search ignores all structure of the metadata sets and therefore loses the high precision of structured search.
• Complex Search offers a few major categories of each domain with domain-specific naming. In particular, those categories that can be mapped between the disciplines should be included. It has yet to be defined which categories will be made available. Of course, in this mode the controlled vocabularies should be available to guide the users.
• Browsing can be chosen to navigate in browsable domains such as the IMDI world with normal web browsers, making use of HTML created on the fly. The possibility of automatically creating a historical browsing tree will be investigated.
• Geographic Selection can be chosen by clicking on the world map. The only possibility is to click on marked spots, which will result in a list of all sessions belonging to this spot being displayed. It has to be checked in how far this can be improved by linking to a node in browsable trees. So clicking on a spot in the map will execute a complex search with the location and/or item information (this has to be carefully checked).
• Domain-Specific Search. The user has the possibility to go to the domain-specific search, which will offer all fields for that particular domain or sub-domain.

Use of Mappings

Since DORA will combine different domains, terminologies have to be mapped while searching. The detailed mappings have to be worked out. The mappings will be used when performing a complex search. In simple search any term can be entered and the program does not know which view the person takes, so term mapping does not make sense for simple search. In complex search a user takes a view. This activates a number of mapping tables from the chosen user view to the other domains. The mappings will extend and modify the search query for the other domains.

1.4 Domains and Sub-Domains

DORA knows a number of domains and sub-domains. They can be changed in a domain configuration file. The domains and sub-domains are:
• Languages
  o ECHO
  o IMDI Domain
  o INTERA
  o DOBES
  o MPI Nijmegen
  o Lund
• Ethnology
  o NECEP Paris
  o DOGON Leiden
  o RMV Leiden
• History of Arts
  o Lineamenta
  o Fotothek
  o Ancient Maps of Rome
• History of Science
  o IMSS Florence
  o Collections from Bern and Berlin
• Philosophy
  o Philosophy Paris

The domain configuration file also has to include addresses that can be used for harvesting purposes.
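A minimal sketch of how such a domain configuration could drive menu generation is shown here. The tab-separated layout, the `parse_config` and `build_menu` names and every address are assumptions made for illustration. The note itself only fixes the domain and sub-domain names, and the columns used here (domain, sub-domain, protocol, address) are a subset of the column sketch given in this section.

```python
import csv
import io

# Hypothetical configuration content: domain, sub-domain, protocol, address.
CONFIG_TEXT = """\
Ethnology\tNECEP Paris\txml\thttp://example.org/necep/
Ethnology\tRMV Leiden\toai\thttp://example.org/rmv/oai
History of Arts\tFotothek\toai\thttp://example.org/fotothek/oai
"""


def parse_config(text):
    """Parse tab-separated rows into a list of dicts, one per sub-domain."""
    fields = ["domain", "sub_domain", "protocol", "address"]
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(fields, row)) for row in reader if row]


def build_menu(entries):
    """Group sub-domains under their domain, keeping file order."""
    menu = {}
    for entry in entries:
        menu.setdefault(entry["domain"], []).append(entry["sub_domain"])
    return menu


entries = parse_config(CONFIG_TEXT)
menu = build_menu(entries)
for domain, sub_domains in menu.items():
    print(domain)
    for sub in sub_domains:
        print("  o", sub)
```

The same parsed entries could feed the harvester, since each row already carries the protocol and address of its sub-domain.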
The domain configuration file can be used to generate the entries and menus. An indication of its columns is given below; the details have to be worked out.

domain-name | sub-domain-name | protocol | address | web-site | cv addresses

1.5 Hitlist

All hits returned as search results have to be shown in a uniform way, in the DORA style and offering a number of choices. The web site should allow the user to continue searching immediately, i.e. the current selection and navigation mode should be shown again. Here we can learn from Google to optimize ergonomics. From the hit list it should be possible to
• view the metadata record and from there jump to other sources such as info files or articles (references)
• view and listen to the resources
• invoke other shells that allow the user to go on with navigation and visualization (how this can be done has to be discussed in detail) [4]

In the case that it is not possible to refer directly to the resources, a suitable shell from the participating sites has to be invoked with the correct arguments. For streaming audio/video, communication with a streaming server has to be realized.

[Figure: schematic hit-list layout. Each hit (session X, session Y, session Z) shows its domain, its sub-domain, a link to the metadata record (MD) and links to the available resources (wav, mpg, text, jpg).]

The layout of the hit-list page is only indicated schematically. The presentation as a simple list is not at all optimal, since people want to exploit results in a more suitable form. In the first version, however, nothing special will be done. Google-like designs should be considered. Initially there is no rating involved. Because different domains are involved, we first have to gain experience with result lists. Different domains may require different criteria for determining the relevance of a document. Possible criteria could be:
• hit comes from structured vs.
non-structured information
• weak mappings are indicated and lower the rating
• spelling differences between terms
• frequency of terms found in a metadata record and in associated documents
This has to be sorted out in a later phase.

[4] Users may want to go from a hit, for example about a DOGON building, directly to images or to the guided DOGON tour that is available at a web site.

1.6 Implementation Issues

At the client side normal HTML and JavaScript are used. For streaming services the QT client has to be invoked (QT has to receive the right parameters to be able to request the execution of a certain file), and for full IMDI requests, for example, the IMDI browser can be used. It has to be checked in how far controlled vocabularies have to be used to support structured search, or whether it is better to offer the actual terms used. At the server side Perl/XSLT scripts will be used to generate the HTML information that is extracted, for example, from the IMDI and other XML files.

[Figure: DORA implementation architecture. Labelled components: client, QT, IMDI browser, other interfaces, CVs, http server, stream server, perl, JSP, IMDI XML, index files, structure file, mapping.]

JavaServer Pages (JSP) will be used to handle all other aspects at the server side. They will access index files to generate results quickly in the two searching modes. It has to be sorted out whether the full-text search will need a different kind of index structure than the one used for the structured search. JSP needs the mapping files for cross-discipline activities and the IMDI structure file to support the restricted search described for the browsing page. When someone is browsing, for example in the IMDI domain, a selected node could be the start of an additional search, i.e. the selection made has to be known to the JSP. To restrict the search, the JSP has to know which sessions belong to that node. Perhaps controlled vocabularies have to be supported in the second phase.
In the configuration file each CV used has to be specified by its address and the category it is associated with.

1.7 Harvesting Comments

With respect to harvesting, some general comments should be made for clarification:
• Only data from known sites will be harvested, i.e. data on local notebooks and the like are not considered.
• The amount of searchable data can become fairly large, in particular if we integrate annotations and relations.
• We assume that the repository content will change, i.e. harvesting should be carried out at regular intervals. This has to be discussed in more detail with the partners, depending on experience.
• The MD schemas may change. Special attention has to be paid to such occasions.
• Keyword-value pairs, as are possible in IMDI, will initially be treated as descriptions.
• Those who choose to be harvested via the OAI harvesting protocol have to register as OAI data providers. The MPI for Psycholinguistics can offer help.

2. Metadata Mapping

WP2 has to realize an infrastructure for joint searching and, where possible, browsing that covers all disciplines in ECHO: history of arts, history of science, ethnology, linguistics and philosophy. The metadata sets applied in the different fields differ in many ways, so that mapping is required. Further, the interface has to be offered in several languages, so that translations of all terms into these languages are required. We also have to accept that at this moment the element names used are not yet defined in open repositories according to international standards such as ISO 11179. We lack appropriate and accepted tools and repository structures. Therefore this note suggests preliminary structures for open repositories (available at the WP2 site) that contain element definitions, translations into some languages and relations between the elements. The information has to be such that it can easily be transformed into future frameworks.
In this version of the document we will not yet translate the schemas into RDF, but first describe the structures in XML. The RDF formulations will be added later. What we will do is describe the immediate requirements resulting from establishing a common search infrastructure.

2.1 Introduction

We are faced with several domain and sub-domain ontologies that all use their own definitions of elements (terms), i.e. there is no common ontology. Therefore, within ECHO we have to develop a framework that allows mapping between the different metadata sets. First, we would like to briefly characterize the metadata sets of the participating domains/sub-domains.

domain = languages; sub-domain = all
All metadata is filled in according to the IMDI standard, so the sub-domains are included just like other linked IMDI repositories; all contributors share the same element semantics; the metadata set includes a rich description that covers the project, the researchers, the formal nature of the resources and their contents; it contains about 40 elements and points to the raw and derived resources; the metadata set was designed to manage and discover resources in a large distributed scenario; the number of metadata records is currently larger than 20,000 and, due to ongoing work, is continuously increasing; for the metadata details see www.mpi.nl/IMDI

domain = ethnology; sub-domain = NECEP (database of societies)
With the help of an exhaustive set of elements (about 150), researchers describe societies; in addition, prose texts elaborate on certain aspects of societies and explain how to interpret the chosen values; where possible, additional media resources illustrate aspects; the metadata set was designed to describe societies in great detail and also to make it easy to find information on societies; the database is in its beginning phase, i.e.
there are only a few records and the expectation is to have about 10 controlled ones at the end of the ECHO project; for the metadata details see appendix H

domain = ethnology; sub-domain = Dutch Ethnology Museum (RMV)
RMV has a huge collection of ethnological objects (> 250,000), of which only a few are available in digital form and described by metadata (> 3500); every year the digital collection grows by about 3500 objects; for budget reasons only 12 elements are used to describe the objects; metadata is used to easily discover objects in the digital archive; for the metadata details see appendix A

domain = history of arts; sub-domain = Fotothek database (Bibliotheca Hertziana)
The Fotothek is a large collection of partly related digital images (6,000 images, 100,000 descriptions); all images are described by metadata created according to the MIDAS standard, which uses the IconClass thesaurus to encode the content; the MIDAS standard is an exhaustive set that has elements to describe the creator, the involved archives, the content ??; it also encodes hierarchical relationships; metadata is used for management and discovery purposes; for the metadata details see appendix D

domain = history of arts; sub-domain = Lineamenta database
The Lineamenta database is a new database; its new integrated design was developed to include all sorts of information; survey-type metadata is included in different tables; internally a rich metadata set is used, but only comparatively few fields will be exported, to fit the metadata scheme introduced by history of science (see below); in total there are 500,000 drawings, but the project assumes that about 300 drawings will have been described by the end of the ECHO project; internally

domain = history of arts; sub-domain = ancient maps of Rome database
The maps of Rome database is currently a small database of about 200 maps described with the help of metadata; the detailed set has to be investigated in more detail; first data
was provided.

domain = history of science; sub-domain = Berlin/Bern
The metadata set is a new one and contains about 30 elements; it is possible to add another 15 elements taken from Dublin Core; most of the metadata elements are used for administrative purposes, i.e. only a few can be used for resource discovery, in particular in cross-discipline environments; for the metadata details see appendix B

domain = history of science; sub-domain = IMSS Florence
IMSS has a large collection of instruments, documents and artistic objects, all of which are being catalogued; recently a new metadata set has been worked out that uses the Dublin Core fields as the core and adds a couple of extra fields for each document type, so that the total number of fields is about 40 and the set is flat; IMSS has just started to fill in these templates to describe its holdings

domain = philosophy
The philosophy domain does not have sub-domains; the philosophy group from Paris is working on a fully linked, rich dictionary that translates “terms” into different languages; there will be a limited set of lexical entries (terms) at the end of the ECHO project; typical metadata fields are used to describe the lexical entries; a precise set is currently being determined - it will be extracted from the texts

2.2 Metadata Elements for DORA [5]

DORA offers a number of ways of searching: full-text search on all metadata elements (and even beyond keyword-type metadata), structured search offering selected elements, and geographical search where possible. For people with detailed queries the portal will link through to the specialized sites.

[5] DORA = the ECHO portal called Digital Open Resource Area

All ways of searching are based on metadata (and partly on annotation) harvesting. The DORA service provider applies two methods of harvesting, as described in chapter 1.1. The DORA service will harvest complete records as they are offered by the data providers.
Filtering and indexing as necessary for the different search options will be done by the DORA service. In a second phase it has to be checked how the annotations and relations will be harvested. For the time being they do not fit the OAI model, since the required Dublin Core set cannot be provided, so registration as an OAI data provider is not possible. If the data is openly available and in XML format, the easiest way would be to read the XML files.

2.2.1 Full-text Search

For full-text search we will include all fields of the different metadata sets and optionally annotations and relations. We assume that fields that do not bear meaningful information to be queried, such as addresses, references/links, contact names etc., will not decrease precision and recall significantly. The DORA service provider will harvest [6] all metadata information offered by the data providers and create joint indexes for full-text search. These will be created such that we can trace back from which domain and sub-domain the hits were taken.

For full-text search there are no different views, i.e. no specialized domain-specific vocabulary. The consequence is that full-text search does not support semantic mapping in the first instance. The search should, however, offer a wordlist that shows the user the possibilities while typing a query. This feature can also be used for catching typing errors and for easy completion.

2.2.2 Structured Search

To support structured search we have to be selective and only support those elements that can be mapped between the different domains and sub-domains. We can expect that users who want to search for domain-specific details will always want to use domain-specific interfaces. For entering and executing queries two options have to be available:
• The user must be able to select the domains and sub-domains the search should include.
• The user must be able to select a view (terminology) in which to enter the query.
Since there are large differences even between the terminologies used by the sub-communities, the user must be able to select a sub-community view. In addition to the domain/sub-domain views we will add a Dublin Core view that offers the Dublin Core vocabulary.

The table below gives a first idea of which fields will be used from the different domains/sub-domains and how they can be mapped. Since there are so many differences between the domains, we started with pairwise (dual) mapping schemes between two sets and derive the mappings for each view from these. In the table, the mapping from Dublin Core to the other domains serves as the basis for explanation. We have to develop such mapping schemes from every view, since we cannot yet identify a common base such as the one used in WordNet, which relies on a common list of concepts. In the first instance we will exclude the unmarked fields (white) from the view, since they do not seem to offer very promising results.

From this exemplary table it is obvious that the semantic mapping of the metadata elements is not at all trivial. The decisions made can lead to misleading results and wrong conclusions. Therefore, it is necessary to allow people to use other mapping schemes. This means that it must be possible either to set up a new service provider easily or to influence the logic machine by pointing it to different ontologies.

[6] Harvesting will be done by requesting XML files using HTTP or by applying the OAI-PMH protocol. The details are described in other WP2 documents.
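The selection of domains and of a view described above can be illustrated with a small sketch. It assumes a lookup table from a (view, field) pair to the corresponding fields of each sub-domain. The names `VIEW_MAPPINGS` and `translate_query`, the sub-domain keys and the sample entries are hypothetical and only loosely follow the fields discussed in this report; this is not the DORA implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical per-view mapping table: (view, field) -> {sub-domain: [fields]}.
# The entries are illustrative examples, not the agreed DORA mappings.
VIEW_MAPPINGS: Dict[Tuple[str, str], Dict[str, List[str]]] = {
    ("dc", "language"): {
        "necep": ["language name"],
        "hos-berlin": ["language"],
        "imdi": ["language"],
    },
    ("dc", "coverage.spatial"): {
        "necep": ["continent", "country", "ethnic region"],
        "rmv": ["cultural region", "geographic region"],
        "imdi": ["continent", "country", "region"],
    },
}


def translate_query(
    view: str, field: str, value: str, selected: List[str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Translate one field/value constraint entered in a view into
    constraints on the selected sub-domains.

    Sub-domains without a mapping for the field are left out, which
    mirrors the decision to support only mappable elements.
    """
    targets = VIEW_MAPPINGS.get((view, field), {})
    plan: Dict[str, List[Tuple[str, str]]] = {}
    for sub_domain in selected:
        fields = targets.get(sub_domain)
        if fields:
            plan[sub_domain] = [(name, value) for name in fields]
    return plan


if __name__ == "__main__":
    # A Dublin Core view query restricted to three sub-domains.
    print(translate_query("dc", "language", "Trumai", ["necep", "rmv", "imdi"]))
```

Keeping the mapping table as data rather than code is one way to let users swap in a different mapping scheme without touching the search logic.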
To illustrate the problems, the following paragraphs discuss three cases:
• the simpler one of “geographic location”
• the slightly more difficult one of “languages”
• the more difficult one of mapping content

DC Ethnology NECEP RMV Title History of Arts Fotothek Lineamenta object name object title title Creator name artist person Subject categorization title of building prim icono sec icono object keywords name artist date period object type Description Publisher Contributor Date date Resource Type Format Resource ID Source Language society name language name Relation Coverage Time Coverage Location date Continent Country Ethnic Region cultural region geo region date period location content place History of Science Berlin IMSS title title creator participant keywords subject content language person m.author contributor participant date m.year date date doc type doc type type type mime type format format language language language language content.language date year m.date m.year coverage.t date coverage.l Continent Country Region location m.title creator m.author Languages IMDI Rights

For almost all metadata sets it makes sense to describe the location with which the resource is primarily associated.
• In NECEP the area is described where the society is located, i.e. related objects such as images, videos etc. are also associated with that geographical area. The information is contained in three levels of detail.
• In the RMV catalogue the areal information is contained in two fields, “cultural region” and “geographic region”. The cultural region is ambiguous, since in many cases ethnic information will be mentioned.
• The Fotothek has two entries that could map. The element “location” contains information about the place of creation. The element “content place” refers to a place that is referred to in the document itself (a painting created in Rome can include a scene from Egypt).
• The IMDI set used in the languages domain has elements that refer to the geographical area in three levels.
• DC has the field coverage, which has a qualifier for areal coverage.

The elements that contain language information have two different meanings: they can refer to the language a document is about or to the language a document is in. So a text can be in English but describe the Trumai language. Different user groups are interested in different aspects of this.
• DC’s language field has the meaning “the language a document is written in”. One would describe the language a document is about in the “subject” element. However, there is no qualifier for this, so we do not know whether the element is used to encode it.
• NECEP has a language element, but it also has a society element. Often the language and society names are the same or at least similar.
• The HoS-Berlin set has the element “language”, but it is assumed that only the language a document is written in is coded.
• The IMDI set is specialized and has options for both.

In fact, we cannot differentiate between the two meanings at the beginning.

The most difficult element (element sub-set) is the content description. Completely different dimensions and thesauri are used for content encoding.
• DC uses the element subject, which is however not specified in more detail. So it can include all types of content description values.
• The NECEP set is meant to describe societies, so the society is the object. In this way almost all elements describe the content.
• The RMV catalogue has an element called categorization. The value of this element is a list of keywords extracted from the SNVT thesaurus (see appendix A). So basically the content description has one dimension filled with keywords classifying a given object.
• The Fotothek primarily uses two entries, “primary iconography” and “secondary iconography”. Both elements can have values taken from the complex IconClass thesaurus (see appendix D).
The construction is similar to that of RMV; however, the classes differ considerably.
• The HoS Berlin archive has the element “keywords” in its metadata sets, but its values are not yet specified.
• The IMDI set has a rather elaborate sub-set to describe the content. The sub-elements are Genre, SubGenre, CommunicationContext, Task, Modality, Subject, Description and Keys [7]. Task and Subject, both of which are fairly unconstrained, can be mapped most easily to what other domains describe as content.

[7] The Language element, describing the language the resource is about, is also part of the content description block.

[Figure: mappings from the selected view to the metadata sets K, L, M and N]

Special attention has to be devoted to the question of how to map the content descriptions to allow useful joint queries. We first have to check how these elements are actually used within the domains. A careful analysis may reduce the necessary effort.

In summary, only starting from pairwise comparisons led us to useful interpretations (see appendix J). From these we will derive per-view mappings to all other sets, as indicated in the above figure. We also realize that at this moment we start from the proper definitions of the semantics of the elements. However, it is known that the usage of the elements varies to a certain extent, i.e. for the second phase we will have to check the actual usage of the elements.

2.3 Formal Framework for Mapping

The mapping requires a number of information types:
• definitions of the terms in English (element names, controlled vocabulary elements)
• translations (“dedications”) of all terms into the following languages:
  o French
  o German
  o Italian
  o Swedish
  o Dutch
• the relations between the terms
• alternatives (synonyms) in some cases, as for language and society names

Alternatives are seen as a special type of relation.
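The information types listed above can be pictured as two simple record types. The following is a minimal sketch, assuming Python dataclasses; the class names, the identifiers such as `dora:necep.society_name` and the helper `expand_term` are hypothetical. The match factor follows the schema given later in this section, where each relation carries a value between 1 and 3 and 1 means an almost perfect match.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Term:
    """A term definition: an element name or a controlled vocabulary value."""
    term_id: str
    name: str                      # English name
    domain: str
    sub_domain: str
    description: str = ""
    # Translations keyed by language code, e.g. "fre", "ger", "ita", "swe", "dut".
    translations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    """A directed relation between two terms, written as namespace:termID."""
    source: str
    target: str
    relation_type: str             # e.g. "maps_to" or "alternative"
    match_factor: int              # 1 = almost perfect match, 3 = weakest

    def __post_init__(self) -> None:
        if not 1 <= self.match_factor <= 3:
            raise ValueError("match_factor must be between 1 and 3")


def expand_term(term_id: str, relations: List[Relation], max_factor: int = 3) -> List[Relation]:
    """Return the relations starting at term_id, best matches first.

    Alternatives (synonyms) are treated like any other relation type,
    so they are returned together with the mappings.
    """
    hits = [r for r in relations if r.source == term_id and r.match_factor <= max_factor]
    return sorted(hits, key=lambda r: r.match_factor)


if __name__ == "__main__":
    society = Term("dora:necep.society_name", "society name", "ethnology", "NECEP")
    relations = [
        Relation(society.term_id, "dora:rmv.cultural_region", "maps_to", 3),
        Relation(society.term_id, "dora:imdi.language_name", "maps_to", 2),
    ]
    for rel in expand_term(society.term_id, relations):
        print(rel.target, rel.match_factor)
```

Sorting by match factor is one possible use of the value for ranking; restricting `max_factor` would instead trade recall for less noise.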
All definitions will appear in the DORA namespace for the sake of simplicity, although the IMDI definitions are currently being integrated into open RDF-based repositories.

For the term definitions we will use the following schema [8]:
  termID
  term-name
  term-XPath
  domain-name
  sub-domain-name
  description
  dedications
    fre = french-name
    ger = german-name
    ita = italian-name
    swe = swedish-name
    dut = dutch-name

For the relations we will use the following schema:
  namespace:termID
  namespace:termID
  relation-type
  match-factor

[8] The schemas will be translated to XML/RDF schemas within the first phase implementation.

The terms can be elements of the metadata sets, but also elements of the controlled vocabularies of elements. In some cases thesauri are used. It still has to be analyzed to what extent an equality of nodes in such thesauri implies an equality of sub-trees.

Within the project we have to find out what kinds of relation types will be used. In the first instance we will make use of the “equality” relationship from OWL and define a “maps_to” relationship. This relationship is associated with a matching factor that specifies the degree of match between 1 and 3, with “1” meaning an almost perfect match. While searching, this can be used as an indicator of how much noise is to be expected. It could also be used for ranking.

A deeper semantic modeling could be carried out, but this would require more time and specialists. Therefore, we will not include this in the current ECHO project. For the same reason we are not interested in specifying everything in RDF right now. We will use a specific search engine that makes use of the simple relation types. The schemas for the two structures can be found in appendix L.

Appendix A : Metadata set used by the RMV

The following elements are used within the Ethnology Museum in Leiden (RMV).
Nr | Element Name | Description
1 | cultural origin | culture, style and period taken from the OMV thesaurus, which is continent and region oriented; religion oriented description (society, ...)
2 | date | different formal options are given: exact date dd-mm-yyyy; from/to yyyy/yyyy; before yyyy; after yyyy; about yyyy; before 00 yyyy (vC)/yyyy (vC)
3 | presentation title | short title to be used in exhibitions; there can be other title choices such as: sorting title, local title, official title, series title, descriptive title, printing title, function title, English title; there is a field to specify the language the title is in
4 | name of object | short but specific object indication; additional information can be associated such as sorting name, alternative name, active name; here, too, the language can be specified
5 | material/fabrication | a description of the major materials the object consists of; can be several terms
6 | size | physical size of object
7 | special physical features | possibility to indicate special features of the object
8 | publicly oriented description | a prose description of the object that can be used for public presentations
9 | object history | this element offers the possibility to mention the collection the object was part of beforehand, or a number that identifies its relation to an earlier exhibition or similar
10 | geographic origin | location where the object was used; all geographic terms have to be taken from the OMV thesaurus; some additional info can be specified such as sorting location, comments
11 | categorization | description of the content with the help of keywords extracted from the OMV category thesaurus
12 | source links | references to different types of sources such as publications, related literature, unpublished documents, exhibitions; for each of these there is a field
13 | reference to digital object | not yet fully defined
14 | others | not yet fully defined, manual speaks about meta objects

mapping column (values in table order): st st pr pr pr - st st -

For mapping purposes we can identify three
different options: no usage (-), usage in a structured way (st), usage as free prose text (pr).

The original RMV catalog, handled in their internal database, is transformed into the categories mentioned in the table below. These are the categories offered when using the OAI interface.

Nr | Element Name | Description
1 | identifier | identification number
2 | date | different formal options are given: exact date dd-mm-yyyy; from/to yyyy/yyyy; before yyyy; after yyyy; about yyyy; about xx century; from/to century/century; before 00 yyyy (vC)/yyyy (vC)
3 | format dimensions | dimensions: height; width; depth
4 | format materials | the type of material used and the type of technique used
5 | description | a prose description of the object that can be used for public presentations
6 | cultural origin | style, period and culture taken from the OMV category thesaurus; indicating the cultural origin of the object (continent and region oriented), sometimes identical to coverage-spatial
7 | geographical origin | geographical origin of the object, taken from the OMV category thesaurus, which is region oriented (continent, region, country, district, reservation or city)
8 | content description | description of the content with the help of keywords extracted from the OMV category thesaurus
9 | coverage spatial | cultural origin of the object taken from the OMV thesaurus, which is region and religion oriented
10 | coverage temporal | temporal period, can be prose text
11 | title | type of object and short description, or name of object
12 | contributor | name of person or institute contributing to the acquisition of the object

mapping column (values in table order): - st - - st st st pr pr -

Content Description

The content is described by categories according to the SNVT thesaurus. Here we want to introduce the main categories and discuss their usefulness for the joint infrastructure.
Nr | Category | mapping to HoA | mapping to HoS | mapping to languages

The three mapping columns record, per category, either “overlap little” or one of the following remarks: “can have similar motives encoded in IconClass and texts” (HoA), “can have similar motives encoded in texts or titles” (HoS), “can have similar motives encoded in texts or in MD content” (languages).

01 hunting, fishery, food gathering
  0101 hunting
  0102 fishing
  0103 gathering food
02 weapons & war
  0201 fist weapons and accessories
  0202 casting weapons & accessories
  0203 defense and protection means
  0204 ornamental weapons
  0205 artifacts related to war
03 agriculture, horticulture, forestry
  0301 agriculture and horticulture
  0302 forestry
04 cattle breeding and products
  0401 herding cattle and poultry (vee en pluimvee hoeden)
  0402 insect breeding
05 food, drink, drugs
  0501 preparation of food
  0503 food
  0504 beverages
  0505 serving and consuming
  0506 conservation and storage
  0507 drinks, drugs and stimulants
06 clothing and ornamental parts of clothing
  0601 clothing
  0602 footwear
  0603 ornamentation of the body
  0604 personal ornament
  0605 clothing accessories
07 hygienic care, medicine, personal comfort
  0701 care of the body, hygiene
  0702 medicine
  0703 personal care, making toilet
08 housing
  0801 choosing and preparing the building site
  0802 parts of construction
  0803 furniture and household effects
  0804 lighting, heating and fire
  0805 domestic animals
  0806 water supply
  0807 (architectural) structures
09 trade and commerce
  0901 gathering raw material and natural products
  0902 handicrafts and industries
  0903 industry
  0904 recycling
  0905 measures and weights
  0906 media of exchange
  0907 trade and commerce
10 transportation
  1001 transport by human strength
  1002 transport by animal mount or animal traction
  1003 traffic on the water
  1004 route and appliances
  1005 airborne traffic
11 communication
  1101 mnemotechnical appliances
  1102 scripts
  1103 signaling means
  1104 education, teaching, educational appliances
  1105 demonstrating, explication, transmission
12 social, law, political life
  1201 symbols of status, rank and dignity, means of identification
  1202 legal system
  1203 artifacts related to slavery
  1204 memorabilia
13 life cycle
  1301 pregnancy, birth and first year
  1302 initiation
  1303 marriage
  1304 aging
  1305 death and mourning
14 religion and ritual
  1401 representations of the supernatural
  1402 cult objects and other holy objects
  1403 altars, sanctuaries and their interior decoration and furniture
  1404 sacrifices
  1405 magical protection and defence
  1406 ritual appliances
  1407 symbols of religious status
15 art
  1501 dance and appurtenances
  1502 theatre
  1503 plastic art
  1504 cartography
  1505 music
16 recreation, sports and games
  1601 toys for children
  1602 equipment for sports and games
  1603 knick-knacks, collectors items
17 indefinite
  1701 indefinite general
  1702 indefinite dishes
  1703 indefinite
textile

The object is classified according to these categories, i.e. a set of numbers determines what the object is. For some categories there are even more fine-grained semantics, which seem difficult to use in an interoperable scenario.

Meaning of classification: if an object falls into the categories 0205 and 1505, we may conclude that the object is a song about war. If, further, the cultural origin says that the object is from the Amazonas area in Brazil, we may find it when someone searches for music related to war for the Trumai people (a tribe living in the Amazonas area).

Appendix B: Metadata set used in the History of Science (Berlin)

The metadata set recently proposed by the HoS group focuses primarily on management tasks, i.e. the number of elements that describe the content of a resource is small. The set is a flat list that offers a category “meta”, which can be used to enter Dublin Core type descriptions.

element description name creator archive-creation-date archive-storage-date archive-path derive-from sub-element archive-path description comment informal textual description of the resource filename of the resource project or person that created the resource, not useful time and date of creation of the archive entry not useful within DORA linked-with archive-path description content-type meta dir document type comparable to MIME type substructure see below description name path meta not useful within DORA substructure see below file description name path date modificationdate creation-date size mime-type md5cs meta not useful within DORA substructure see below

The meta substructure contains elements that partly depend on the type of document. The generic type listed in the following may give an impression.
language DRI context the language a document is in not useful for searching link name link to collection as a context description of that collection author year title secondary-author secondary-title Dublin-Core type of fields generic 25 volume number pages date place-published publisher edition tertiary-author tertiary-title number-of-volumes type-of-work subsidiary author alternative-title isbn-issn call-number label keywords abstract notes url not useful for searching Dublin-Core type of field not useful for searching DC type of fields not useful for searching useful but unconstrained not useful for searching 26 Appendix C: Metadata set used by the IMSS Here we will list the elements used for describing instruments. The other two schemes for documents and artistic objects share the same core and are very similar. element belongsTo contextualized DCcontributor DCcopyright DCcoverage DCcreator DCdate DCdescription DCformat DCidentifier DClanguage DCpublisher DCrelation DCsource DCsubject DCtitle DCtype Giver hasComponentType hasInstrumentType hasWR historicallyLocatedIn inventor isDedicated isDocumentedIn isPartOf locatedIn objectRelated owner preservedIn purchaser receiver refersToDiscipline relatedConcept shortname shown simulatedBy usedFor user comment not useful for searching not useful for searching name of artists or engineers not useful for searching not yet clear how the field will be used name of artists etc prose text not yet clear how the field will be used not useful for searching to describe the language the descriptions are in not useful for searching not useful for searching not useful for searching not yet clear how the field will be used not yet clear how the field will be used not useful for searching not useful for searching not useful for searching not useful for searching not useful for searching ? 
not useful for searching not useful for searching not useful for searching not useful for searching not useful for searching not useful for searching not useful for searching not useful for searching not useful for searching not useful for searching not useful for searching not clear whether useful not useful for searching not useful for searching not useful for searching not useful for searching IMSS uses a flat list where a number of pointers contain relations, i.e. implicitly a hierarchical scheme is realized. For us it is not clear yet for all fields how they will be used. Examples will help. 27 Appendix D: Metadata set used in the Fotothek For the Fotothek, BH uses the MIDAS rules to describe their image objects with metadata records. The purpose of the MIDAS rules is beyond the pure discovery and is also used for management. It is a fairly exhaustive structured description set and allows creating linked hierarchies between objects. Only the most relevant elements are shown in the following table. The important description of the content of an image is done according to the IconClass rules. Object-Document Objekt-Verwalter Ort Verwalterart Name-Museum Abteilung Inventar-Nr Person Titel Obj ob28 2864 2890 2900 2930 2950 2910 2914 ObjektAufbewahrung Ort Ortsteil Straße Nr Stelle 5108 5110 5116 5117 5125 Objekt-Künstler Name Name in BH Authentizität Tätigkeit Datierung Zeitangabe ob30 3100 31bh 3470 3475 5064 5062 Entstehungsort 5130 Objekttitel Bauwerksname Gattung Art Sachbegriff Material Technik prim. Ikonogr. sec. 
Ikonogr lokaler Bezug Objekt-Bauwerk Ort Sachbegriff Träger etc Objekt-Person Name Beziehung zu Objekt Link Hersteller Sachbegriff Titel 5200 5202 5220 5226 5230 5260 5300 5500 5510 5560 ob26 2664 2690 2694 ob40 4100 5007 5008 5009 5010 5013 Description description fields about owner or administrator description fields about where the object is housed: some geographical or topographical information like Australia, Venice description fields about artist date of creation or period of time could be any other date descr. place of creation here “Kunststil” like Venetian etc… known name of the object instead of 5200 for building sub-genre for paintings topic of sub-genre, e.g. “Architecturzeichnung” Object type type of material used type of technique used primary content descr secondary content descr place the content refers to Description of the relation between the object and a building (there are many more descriptive fields) Relation to other person Relation to other object and description of other object (a normalization would be better, i.e. to include the object as a regular one in the domain and have just a link to it) 28 Bauwerk Ort Zeit etc Ereigniskurztitel Literaturnachweis Foto Nummer Verwalter Fotograf AufnahmeDatum Zugangsdatum Inhalt Signatur Dateiname Kommentar Urheber etc 5014 5015 5011 7190 8350 8450 8470 8460 8490 8498 8496 8510 8515 8540 9990 9902 Description of the photo of the The content is described according to the IconClass proposal that is widely used in the arts domain. IconClass was worked out by Dutch scientists and is available at the Dutch academy of sciences. (a short description will follow – the thesaurus is too large to be described fully at this place) 29 Appendix E: Metadata set used in the Lineamenta Project The Lineamenta collection uses internally a rich description set, however, it seems that they will only export a limited set. 
For this export the same core metadata set is used as for the History of Science (Berlin) collections. They use a slightly different specialized “meta” set, which is indicated here.

element | comment | DORA usage
image | reference to an image | not useful for DORA search
language | language the document is written in | useful
document type | associated with fixed vocabulary, e.g. “architectural drawing” | useful
title | short description of a drawing (the entry “Gegenstand”) | useful
person | equivalent to DC:creator and contributor, all persons related, with their respective fields of activity | useful
location | place, institution where the object is placed | useful
date | date of origin, YYYY.MM.DD or YYYY.MM or YYYY or YYYY-YYYY | useful
object | detailed description of the object, i.e. related building or name of an event which was the background for the genesis of the work of art | useful
keywords | this field seems to contain no data | ?

Here further examples should be made available.

Appendix F: Metadata set used in the Maps of Rome Project

The descriptive data is kept in a relational database that has three tables: PDR, Piantecopie, Persone. These were exported to separate XML documents. From the XML documents we received we can identify the following metadata elements that are relevant for DORA:

element <autorlink> author-name alternative names date of birth date of death place of birth place of acting <data> date <titolo> title method dim-alt dim-long orientation <incislink> engraver <editlink> editor huelsen scaccia frutaz rome-veduta description collection image reference comment metadata elements describing the author date of origin of the object, YYYY or YYYY-YYYY transcription of the title not clear whether this can be mapped engraver, is it a relevant contributor? these terms are not yet clear probably not a search term at DORA level DORA usage useful useful not useful not useful not useful not useful useful useful ?
not useful for searching not useful for searching not useful for searching ? useful ? ? ? ? not useful ? for backlinking This list has to be checked with Bibl Herziana. 31 Appendix G: Metadata set used in the Language Domain All metadata descriptions in the language area are created according to the IMDI standard (see www.mpi.nl/IMDI). IMDI provides a structured set that is used for resource discovery and management. Session Name Title Date Location Continent Country Region + Address Description + Resource Reference Keys Project + Name Title Id Contact Decription + Content Genre SubGenre + Communication Context Interactivity Planning Type Involvement Social Context Event Structure Channel Task Modalities Subject + Languages Language + Description + Description + Keys Actors Description + Actor + Resource Refs Role Family Social Role Name + Full Name Code Language + Ethnic group Age Sex Education Anonymous Contact Description + Keys Session Resources Media File + Resource Id Resource Link Type Size Format Quality Recording Conditions Position Access Description + Written Resource + Resource Id Resource Link Media Resource Link Date Type SubType Format Size Derivation Content Encoding Character Encoding Validation Access Language Id Anonymized Description + Source + Id Format Quality Position Access Description + Anonyms Resource Link Access References Description + 32 Language Access Id (ccv) Name + (str) MotherTongue (ccv) Primary (ccv) Dominant (ccv) Description + (sub) Keys Availability (string) Description + (sub) Date (c) Owner (string) Publisher (string) Contact (sub) Contact Key + (sub) Name (string) Address (string) E-mail (c) Organisation (string) Key Name = Value (string) Vocabulary Link (c) Resource Reference Type (cv) Description Text (string) Language Id (ccv) Link (c) Name (string) SubType (ocv) Format (cv) Link (c) Validation Type Methodology Level Description (sub) 33 Appendix H: Metadata set used by NECEP The following elements are used within Non 
European Components of European Patrimony (NECEP).

Nr | Element Name | Comment
1 | society name | usual anthropological designation
2 | alternative name | alternative names and spellings used
3 | language name |
4 | country | more than one, countries of residence
5 | continent | continent or areas
6 | ethnic region | this element is not found in the data we received

Appendix I: Metadata set used Philosophy

For the philosophical lexicon the IMDI metadata structure was used for reasons of simplicity. Four elements were filled in:
• project
• researcher as creator
• concept in focus as title and content description
• location of creation

The texts were included as descriptions to integrate them into the full-text search supported under simple search. All mappings that are valid for the IMDI metadata set are valid for the philosophy domain as well.

Appendix J: Dual Mapping between Structured Elements

This chapter can be seen as a set of exercises towards the final mappings for the different views (see K) and has therefore not been adapted. For a couple of dual sets some topics are discussed that are relevant and indicate the problems we expect.

The NECEP-RMV mapping makes sense since NECEP describes in detail societies of which RMV will have objects in its repository.
NECEP | RMV | comment
A1 society names | subject-cultural region | has to be checked whether values are the same, probably value matching necessary
A7 alternative names | subject-cultural region | has to be checked whether values are the same, probably value matching necessary
B2 continent | subject-cultural region, subject-geographical | RMV has two fields that apply, details have to be checked
B1 country | subject-cultural region, subject-geographical | RMV has two fields that apply, details have to be checked
B3 ethnic region | subject-cultural region, subject-geographical | RMV has two fields that apply, details have to be checked
C1 language name | subject-cultural region | a mapping between languages and societies is necessary

The NECEP-IMDI mapping makes sense since NECEP describes societies for which one can probably find language resources in the languages domain.

NECEP | IMDI | comment
A1 society names | language name | a mapping between languages and societies is necessary
A7 alternative names | language name |
B2 continent | continent |
B1 country | country |
B3 ethnic region | region |
C1 language name | language name |
For four of the rows after A1 the comment given is “perhaps mapping due to different names”.

The RMV-IMDI mapping makes sense since one may find objects in the RMV repository that are related to language resources.

RMV | IMDI | comment
fields mentioned above will be used | see above |
date | date | rmv.date is the date of creation; imdi.date is the date of recording; overlap seems to be small
categorization | content | rmv.categorization contains a set of numbers describing the type of content included; IMDI uses a whole sub-structure for content; has to be checked how this can be mapped

With respect to the HoS-IMDI mapping we do not expect much overlap within the scope of the ECHO project. There may be language resources that appear in both repositories.
HoS Berlin IMDI comment creator meta.author9 language actor actor language meta.year date title10 content title 9 not much overlap to be expected not much overlap to be expected here is a difference: hos.language refers to the language the resource is in while imdi.language refers to the language the resource is about; nevertheless, hos.language could be useful for linguists; hos.meta.date means year of publication while imdi.date refers to the date of the recording The hos set includes secondary and tertiary authors. The indicated mapping should include them as well. 36 keywords content hos.meta.keywords describe the content of the resource and can be mapped with the content description in IMDI; it is not clear how keywords will be used in HoS With respect to the IMSS – IMDI mapping we don’t expect too much overlap as well despite the formal overlap between the fields used. HoS IMSS IMDI comment DCcontributor DCcoverage actor location, date DCcreator DCdate actor date DCformat DClanguage language DCsubject DCtitle DCtype inventor content title type actor IMSS will have to use qualifiers to separate the two information types in IMSS probably the language the document is in, in IMDI both is possible no information yet how this field will be used not yet clear whether this field is relevant In the current ECHO project we do not expect too much overlap, which is due to the fact that both repositories will not have too many resources that are related. However, in principle much overlap can be expected, since texts from the language resource area can for example explain objects in the HoA area. 
HoArts IMDI comment Fotothek 3100 name artist 5064 date 5062 period 5130 location of creation 5200 object title 5202 title of building 5230 object type 5500 prim iconography 5510 sec iconography 5560 place of content actor date date location title title content content content location overlap estimated to be small hoa.date is precise; hoa.period offers different options; both can be matched with imdi.date hoa title in case of buildings not yet clear whether there is a potential for matching here a classification according to the IconClass system is used location as part of the content of the painting Not much overlap is expected since the resources probably are not that much related. HoArts IMDI comment Lineamenta document type creator m.language m.person m.year m.title m.date m.keywords object m.location 10 actor language actor date title date content title location no real equivalence in IMDI since the vocabulary is different overlap estimated to be small Lin is encoding the language the document is in overlap estimated to be small no specifications yet as how to fill in keywords in Lin no formal distinction in continent, countries etc The HoS set includes secondary and tertiary titles. The indicated mapping should include them as well. 37 Here one can expect some overlap in principle. However, the metadata set chosen by HoS does not allow to draw too many relations. HoArts HoS Berlin comment Fotothek 3100 name artist 5064 date 5062 period 5200 object title 5202 title of building 5230 object type 5500 prim iconography 5510 sec iconography creator meta.author meta.year meta.year title(s) title(s) keywords keywords keywords it is not yet clear how keywords will be used in HoS it is not yet clear how keywords will be used in HoS it is not yet clear how keywords will be used in HoS A number of Dublin Core mappings will be used. Therefore, we compare some sets from the DC view point. 
Dublin Core HoS-Berlin comment DCcontributor DCcoverage DCcreator DCdate DCformat DClanguage DCsubject DCtitle DCtype author secondary author tertiary author year author secondary author tertiary author date document type mime type language keywords title secondary title tertiary title doc type DC not very clear – so not clear how to map The mapping between DC and IMDI is fairly straightforward. Dublin Core IMDI participant DCcontributor location DCcoverage DCcreator DCdate DCformat DClanguage DCsubject DCtitle DCtype date participant date format language content language title DC language is language a document is written in not at all clear how subject is used language the doc is about would fall under DC:subject DC semantics not very clear The mapping between DC and HoA-Fotothek. Dublin Core HoA-Fotothek 3100 name artist DCcontributor 5062 period DCcoverage DCcreator DCdate DCformat DClanguage comment comment 5130 place 3100 name artist 5064 date 38 DCsubject DCtitle DCtype prim iconography sec iconography 5220 5200 object title 5202 building title not at all clear how subject is used object type DC semantics not very clear The mapping between RMV and DC does not give many options. Dublin Core RMV comment DCcontributor contributor DCcoverage date subject-cultural region subject-geographic coverage-spatial coverage-temporal DCcreator DCdate date DCformat format DClanguage DCsubject subject-cultural region subject-geographical subject-content DCtitle presentation title name of object DCtype 39 Appendix K: Mapping for Views As mentioned above we have to evaluate the usage of the various fields to optimize the mapping schemes. First it seems to be handy to describe the metadata elements to be used in short form as an overview. 
Set IMDI Lineamenta element name language continent country region date actors title content type format appearance language continent country region date actors title content type format title person object date keywords title person object date keywords document type language location document type language location Set IMSS NECEP element name creator date subject title type format language contributor inventor coverage spatial coverage temporal appearance creator date subject title type format language contributor inventor coverage spatial coverage temporal antropological designation alternative name continent countries of residence official ethnic regions society name alternative name continent country ethnic region language name language name Set Fotothek RMV Leiden this set is derived from the XML files we received HoS Berlin author content-type language year title keywords date author content type language year title keywords date element name name artist (3100) creator (9902) person name (4100) date (5064) period (5062) location (5130) content place (5560) place (2864) name museum(2900) short title (7190) object title (5200) building title (5202) object type (5230) type (5226) prim. iconography (5500) sec. iconography (5510) appearance artist object artist photo person name date period place of creation content place place institute short title object title building title object type type primary iconography secondary iconography coverage spatial coverage temporal subject geographical origin date subject category coverage spatial coverage temporal geographical origin date content description title title this set is derived from the XML files we received Rome Maps author-name/autorlink alternative names date title editor/editlink incisore/incislink author name alternative author date title editor engraver Philosophy 40 1. DC View We refer to the names in the table above. DC Ethnology NECEP RMV Title title Creator Contributor Subject content descr. 
Date date Type Format Language “jpg”, “mpeg”, “wav” society name altern. name language name Coverage temporal Coverage spatial continent country ethnic region “jpg” Fotothek object title building title artist object artist photo History of Arts Lineamenta title person Rome Maps title author name editor author name editor artist object person prim icono sec icono date period object type object keywords “rome maps” date date “jpg” document type “tiff”, “jpg” date period date geogr. origin coverage spatial place of creation content place location date Philosophy Languages IMDI title title title author creator actors author contributor actors keywords subject content date date type type format format language language language date year coverage temp. date coverage spat. continent country region year date content type “jpg” “image” language date coverage temp. History of Science Berlin IMSS 41 2. Necep View NECEP Ethnology NECEP RMV society name alt. name coverage spat. coverage spat. coverage spat. geogr. origin coverage spat. geogr. origin coverage spat. geogr. origin coverage spat. continent country ethnic region language name Fotothek History of Arts Lineamenta Rome Maps History of Science Berlin IMSS Philosophy Languages IMDI language language place of creation content place place of creation content place place of creation content place location “europe” continent location “italy” country location “rome” region language coverage spat. language 3. RMV View RMV coverage spatial Ethnology NECEP RMV society name continent country ethnic region date Fotothek History of Arts Lineamenta Rome Maps History of Science Berlin IMSS geogr. origin content descr. continent country ethnic region Languages IMDI language continent country region place of creation content place location “europe” “italy” “rome” date period date date year date coverage temp. date coverage temp. 
object title title object title title title title place of creation content place location “europe” “italy” “rome” coverage spat. continent country region prim.iconogr. sec. iconogr. keywords subject content coverage spat. coverage temp. title Philosophy keywords date 42 4. Fotothek View Fotothek Ethnology NECEP RMV Fotothek History of Arts Lineamenta institute location place location place of creation content place object title continent country region continent country region coverage spat. geogr. origin. location coverage spat. geogr. origin location title building title short title title object object title object Rome Maps “europe” “italy” “rome” “europe” “italy” “rome” “europe” “italy” “rome” “europe” “italy” “rome” coverage spat. coverage spat. coverage spat. coverage spat. Philosophy Languages IMDI continent country region continent country region continent country region continent country region title title title title author name editor engraver author creator actors year date year date date coverage temp. date coverage temp. keywords keywords type subject subject artist object person artist photo person person name person editor engraver author name date date date date period date date date content descr. content descr. document type document type keywords keywords “maps” “maps” type object type prim. iconogr. sec. iconogr. History of Science Berlin IMSS date date content content 43 5. Lineamenta View Lineamenta location Ethnology NECEP RMV continent country ethnic region geogr. origin coverage spat. title title date date object document type language keywords person Fotothek place of creation content place place institute object title artist object short title date period object title building title short title History of Science Berlin IMSS “europe” “italy” Philosophy Languages IMDI coverage spat. continent country region title title title title date date year date coverage temp date “rome maps” title language prim.iconogr. sec. iconogr. 
object type “maps” keywords artist object person name editor engraver author name coverage spat. content descr. Rome Maps “printed map” “landscape drawing” “italien” type language name History of Arts Lineamenta type language subject content 44 6. HoS Berlin View HoS Berlin Ethnology NECEP RMV author language Fotothek artist object language name society name History of Arts Lineamenta person Rome Maps History of Science Berlin IMSS author name editor coverage spatial year date date date date period date period date date date date Philosophy creator actors language language date coverage temp. date coverage temp. type content type Languages IMDI date date title title object title title object title title title keywords content descr. prim.iconogr. sec.iconogr. keywords “maps” subject content 7. Rome Maps View Rome Maps author name altern. author date title editor engraver Ethnology NECEP RMV date title Fotothek History of Arts Lineamenta Rome Maps History of Science Berlin IMSS Philosophy Languages IMDI artist object person author creator actors date object title date title date title date title contributor date title 45 8. IMSS View (same as the DC view) IMSS Ethnology NECEP RMV Fotothek History of Arts Lineamenta Rome Maps object title building title title creator artist photo person contributor artist object person prim. iconogr. sec. iconogr. date period date period object type object keywords “rome maps” date date date date title title title author name editor author name editor History of Science Berlin IMSS Philosophy Languages IMDI title title author actors author actors keywords content inventor subject content descr. date date coverage temporal date coverage temp. type format language coverage spatial “jpg”, “mpeg”, “wav” society name language name continent country ethnic region “jpg” “jpg” document type “tiff”, “jpg” “jpg” “image” language coverage spatial geogr. 
origin place of creation content place location date year date year content type date type format language “rome” date language continent country region

9. Language View

Language NECEP Ethnology RMV language society name language name continent continent country country region ethnic region Fotothek coverage spatial coverage spatial geogr. origin coverage spatial geogr. origin coverage spatial geogr. origin History of Arts Lineamenta Rome Maps language place of creation content place place of creation content place place of creation content place History of Science Berlin IMSS language date date coverage temp. content content description actors title date period prim.iconogr. sec.iconogr. “europe” coverage spatial location “italy” coverage spatial location “rome” coverage spatial date date date year type format date coverage temp. keywords “maps” keywords subject author name editor author creator title title title artist photo title object title title object Languages IMDI language location type format Philosophy

Appendix L: Schemas

Schema for Term Definitions

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name="term">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="termID" type="xs:ID"/>
        <xs:element name="term-name" type="xs:string"/>
        <xs:element name="xpath" type="xs:anyURI"/>
        <xs:element name="domain-name" type="xs:string"/>
        <xs:element name="sub-domain-name" type="xs:string"/>
        <xs:element name="description" type="xs:string"/>
        <xs:element name="dedications">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="fra" type="xs:string"/>
              <xs:element name="ger" type="xs:string"/>
              <xs:element name="ita" type="xs:string"/>
              <xs:element name="swe" type="xs:string"/>
              <xs:element name="dut" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Schema for relations

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <!-- ECHO definitions schema: http://www.mpi.nl/echo/schemas/ECHO-def-schema -->
  <xs:element name="mapping">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="termID" type="xs:ID"/>
        <xs:element name="termID" type="xs:ID"/>
        <xs:element name="relation-type" type="xs:string"/>
        <xs:element name="match-factor" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

B. WP2 Note on an ECHO Ontology

Peter Wittenburg
20.2.2004

An essential part of the DORA [11] ECHO portal, which was presented several times at meetings and discussed in detail with nearly all ECHO participants, is the integration of ontological knowledge from several domains. This paper documents the knowledge components, their extraction processes and their relations. The resulting components will be available at the end of the ECHO project in well-documented formats.

This document can be seen as supplementary to the one that describes the DORA infrastructure, the selections made with respect to the semantics and the mapping choices. From several projects and initiatives we know that mapping choices can always be questioned, since no two persons will fully agree on them. This is exactly the reason why we rely on practical ontologies that can easily be changed and amended by others, so that the chosen mappings better reflect their intentions. Despite many difficulties we can state that we were able to establish an ECHO ontology that covers the offered semantics of the participating disciplines and that is now the basis of the DORA machinery.

1. Provided Components

The following components were provided by the participants and external sources:

1. Metadata Descriptions
   XML repositories covering the metadata descriptions of the various data providers, often without any form of validation. These were partially associated with
   a. the list of the metadata vocabularies, of which some referred to Dublin Core concepts, others to proper definitions and others to verbal explanations [12];
   b.
formal syntax descriptions (only in three cases).

2. Content Thesauri
   Two metadata sets make use of thesauri to describe the content of the object.
   a. The RMV uses the OVM thesaurus, which is derived from the AAT thesaurus [13].
   b. The Fotothek uses the IconClass [14] thesaurus, which was available as an interactive CDROM.
   c. Other metadata sets use either unconstrained keyword elements or a limited number of narrowly defined elements.

3. Geographic Information
   a. The RMV uses a geographic thesaurus.
   b. Other metadata sets use either unconstrained elements or a limited number of more clearly defined and constrained elements such as continent and country.
   c. It was noted that language and society names in many cases include geographical information.

[11] Digital Open Resource Area: see WP2-TR16-2003; web-site to come.
[12] Metadata definitions will always include some tolerance in usage due to different interpretations of the definitions of the semantic scope. Non-existing or unclear definitions of course lead to wider tolerances in usage.
[13] It should be noted here that Brik de Zwart supported the ECHO work by not only providing the only real OAI implementation, but also providing the OVM thesaurus in a structured form. Thanks a lot!!

2. Generated Components - Overview

From this basic information a number of essential components were extracted. Most of them are in XML; others are in a structured form that is easy to process and will be transformed to XML by the end of ECHO. RDF has not yet been used to represent knowledge. Concept definitions can be done in XML, and this is the way that is used by ISO groups such as TC37/SC4. For the mapping file, which contains assertions about concepts, RDF is the most suitable format. However, since there is no complete logic, since we have many fuzzy mapping relations and since we lack appropriate standard inference engines, there is no immediate need to formulate the relations as RDF assertions.
The mappings are embedded in XML so that they can be easily transformed to RDF.

[14] IconClass was bought from the KNAW Amsterdam.

1. Validated Metadata Sets
   The metadata information was transformed into validated and machine-readable formats. Structure and character encoding were standardized to XML and Unicode.
2. ECHO Concepts
   This XML file consists of all elements from the various metadata sets that were selected to be used in DORA, i.e. that are not too specialized. The current version is: echo-term-v6.xml
3. ECHO Mappings
   This XML file consists of an exhaustive mapping between all elements found in the concepts file. It is guided by the wish to support access from different views. The current version is: echo-mapping-v5.xml
4. OVM-Geographic Thesaurus
   This file contains the geographic thesaurus as used within the RMV descriptions. Where possible the OVM geographic thesaurus points to comparable entries in the MPI geographic thesaurus. The current version is: ovm-geo-thesaurus-v3.xml
5. MPI-Geographic Thesaurus
   An analysis was carried out on all geographically oriented fields in all metadata records of all data providers except RMV to get a list of geographic concepts that are actually used. From these a “complete” geographic thesaurus [15] was created. Where possible the MPI geographic thesaurus points to comparable entries in the OVM geographical thesaurus. The current version is: mpi-geo-thesaurus-v4.xml
6. OVM Category Thesaurus
   This thesaurus contains all values that are used in the RMV content description field, ordered in a hierarchical way. It is based on the AAT thesaurus. The current version is: ovm-category-thesaurus-v2.xml
7. Iconclass Category Thesaurus
   This thesaurus contains all values that are used in the Fotothek content description field (Iconography), ordered in a hierarchical way. The current version is: iconclass-category-thesaurus-v2.xml
8.
IconClass-to-OVM Mapping
   This file contains a mapping from IconClass nodes to OVM nodes where this is semantically feasible. It was clear that only a one-directional mapping would serve the needs. The current version is: iconclass2ovm-mapping-v3.xml
9. OVM-to-IconClass Mapping
   This file contains a mapping from OVM nodes to IconClass nodes where this is semantically feasible. It was clear that only a one-directional mapping would serve the needs. The current version is: ovm2iconclass-mapping-v3.xml
10. MPI Content List
   An analysis was made of all content type fields that can be found in all metadata records of all data providers except RMV and Fotothek. A mapping file was created that links these descriptors with those to be found in the OVM and the IconClass thesauri. The current version is: IMDI2iconclass-and-ovm-v1.xml
11. Other Components
   There are a few other files that are used to facilitate the DORA searching machinery, but they do not contain essential knowledge representations.

[15] The OVM geographical thesaurus is not complete and not appropriately structured. Different types of concepts appear at a certain depth. Therefore, we could not use it as master thesaurus. A conversion would have required manual work.

3. Components in Detail

In this chapter we discuss some aspects in more detail.

3.1 ECHO Concepts

This file contains all concepts from the different metadata sets that were selected for use in the DORA interface. We chose a setup that now seems to be followed by many groups representing knowledge: concept definitions are separated from any relational information, except if a sub/superclass relation is an evident part of the concept definition. This gives everyone the possibility to relate concepts in their own way; nothing is prescribed. In ISO TC37/SC4 it is argued that equality and sub/superclass relations can be part of the definition of a concept. This is very dependent on the scope of the domain considered.
According to the ISO 11179 model the domain description has to be part of the concept definition. We have adopted a strict rule to separate definition and relation, since we do not yet have a sufficiently detailed view on the semantic scope of all terms. Each concept found is defined by a number of attributes, which are indicated in the following XML fragment.

<terms>
  <term>
    <termID> 001 </termID>                                   <!-- unique identifier -->
    <term-name> title </term-name>                           <!-- concept name -->
    <xpath> dc.title </xpath>                                <!-- how to find it -->
    <domain-name> DublinCore </domain-name>                  <!-- ECHO domain name -->
    <sub-domain-name> </sub-domain-name>                     <!-- ECHO subdomain -->
    <description> name given to resource </description>      <!-- a prose definition -->
    <dedications>
      <fra> titre </fra>                                     <!-- French dedication -->
      <ger> Titel </ger>                                     <!-- German dedication -->
      <ita> titolo </ita>                                    <!-- Italian dedication -->
      <swe> titel </swe>                                     <!-- Swedish dedication -->
      <dut> titel </dut>                                     <!-- Dutch dedication -->
    </dedications>
  </term>
  <term>
    ....
    ....
  </term>
</terms>

If there is enough time left in the ECHO project we will transform this into an ISO 11179 and ISO 12620 compliant XML form so that it can be put openly on the web and used by others. However, in ECHO we will not introduce relational information into the document and will not eliminate equivalent concepts (synonyms etc.), mainly because the machinery has been developed to use this normalized type of representation. The file was generated automatically only to a small extent. All translations were created manually.

3.2 ECHO Mappings

The mappings are done according to the Technical Report WP2-TR16-2003 about Mapping. They consist of references to the concept file, a relation type and a matching factor that is currently not used. Before using the matching factor we first have to gain more experience. The intention is to indicate the quality of the mapping, i.e. the amount of semantic overlap between the related concepts. The following XML fragment indicates how the file is structured.
For ease of reading a supplementary file was created that contains all concept information. However, this cannot be the basis for the DORA machinery, since the information would be stored in two places, which is not acceptable for maintenance reasons.

<mappings>
  <mapping>
    <termID>001</termID>                          <!-- first concept reference -->
    <termID>080</termID>                          <!-- second concept reference -->
    <relation-type>isEqualTo</relation-type>      <!-- relation type -->
    <match-factor>1</match-factor>                <!-- matching factor -->
  </mapping>
  <mapping>
    <termID>002</termID>
    <termID>027</termID>
    <relation-type>mapsTo</relation-type>
    <match-factor>1</match-factor>
  </mapping>
  <mapping>
    ....
    ....
  </mapping>
</mappings>

It can easily be seen that this structure can be transformed into an RDF assertion. Let us take the example from the first fragment.

<termID>001</termID>
<termID>080</termID>
<relation-type>isEqualTo</relation-type>

This XML substructure would translate to the following RDF assertion.

concept 001 isEqualTo concept 080

The following semantic relations are used in the mapping file:

isEqualTo       the two related terms are semantically equivalent
                Example: DC:Date isEqualTo IMDI:Date
isSubclassOf    the first concept is a hyponym of the second one
                Example: DC:Creator isSubclassOf IMDI:Participant
isSuperclassOf  the first concept is a hypernym of the second one
                Example: IMDI:Participant isSuperclassOf DC:Creator
mapsTo          the first concept is related to the second one
                This relation was chosen in many cases, but the semantic overlap cannot be specified in terms that can be exploited by strict logic; it represents a kind of fuzzy matching, i.e. only the move to some granular feature space would allow us to make the relation more specific and precise.
                Example: DC:Creator mapsTo RomeMaps:Editor

All relations were created based on manual inspection of the definitions and after talking with the sub-domain experts. We are currently starting to analyze the usage of the fields, which may lead to changes.
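The transformation from the mapping file to assertions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names mappings_to_triples and format_assertion are hypothetical and not part of the DORA machinery, the sample data is copied from the XML fragment in this section, and the output is the simple prose form "concept X relation concept Y" rather than a real RDF serialization.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the mapping fragment shown in this section.
MAPPINGS_XML = """<mappings>
  <mapping>
    <termID>001</termID>
    <termID>080</termID>
    <relation-type>isEqualTo</relation-type>
    <match-factor>1</match-factor>
  </mapping>
  <mapping>
    <termID>002</termID>
    <termID>027</termID>
    <relation-type>mapsTo</relation-type>
    <match-factor>1</match-factor>
  </mapping>
</mappings>"""


def mappings_to_triples(xml_text):
    """Return one (subject, relation, object) tuple per <mapping> element.

    Hypothetical helper: the first <termID> is taken as subject and the
    second as object, following the order used in the text.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for mapping in root.findall("mapping"):
        term_ids = [el.text.strip() for el in mapping.findall("termID")]
        if len(term_ids) != 2:
            raise ValueError("each <mapping> must hold exactly two <termID> elements")
        relation = mapping.findtext("relation-type", default="").strip()
        triples.append((term_ids[0], relation, term_ids[1]))
    return triples


def format_assertion(triple):
    """Render a triple in the prose form used in the text."""
    subject, relation, obj = triple
    return f"concept {subject} {relation} concept {obj}"


if __name__ == "__main__":
    for triple in mappings_to_triples(MAPPINGS_XML):
        print(format_assertion(triple))
```

The match-factor is deliberately ignored here, since the text states that it is currently not used. A real conversion to RDF would additionally need URIs for the concepts and relation types, which the report does not specify.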
3.3 OVM-Geographic Thesaurus

This thesaurus was extracted semi-automatically from a web-representation. For reasons of simplicity we indicate the thesaurus in table form. It has three entries: (1) the OVM abbreviation that is used in the metadata records; (2) the geographic name used by OVM in Dutch and (3) a reference to the appropriate node in the so-called MPI geographic thesaurus.

OVM Abbreviation              OVM Geo-Name               MPI Geo-Name (reference to mpi-geo-thesaurus)
OVM.AAA                       Geografische herkomst
OVM.AAA.AAA                   Afrika                     Africa
OVM.AAA.AAA.AAA               Afrikaanse eilanden        Island nations
OVM.AAA.AAA.AAA.AAA           Afrikaanse eilanden- Oost
OVM.AAA.AAA.AAA.AAA.AAA       Comoren                    Comoros
OVM.AAA.AAA.AAA.AAA.AAB       Madagascar                 Madagascar
OVM.AAA.AAA.AAA.AAA.AAB.AAA   Antananarivo
OVM.AAA.AAA.AAA.AAA.AAB.AAB   Betafo
OVM.AAA.AAA.AAA.AAA.AAB.AAC   Nosy Bé
OVM.AAA.AAA.AAA.AAA.AAC       Mauritius                  Mauritius
OVM.AAA.AAA.AAA.AAA.AAD       Seychellen                 Seychelles
OVM.AAA.AAA.AAA.AAB           Afrikaanse eilanden- West
OVM.AAA.AAA.AAA.AAB.AAA       Canarische eilanden
OVM.AAA.AAA.AAA.AAB.AAA.AAA   Tenerife
OVM.AAA.AAA.AAA.AAB.AAB       St. Helena
OVM.AAA.AAA.AAB               Centraal-Afrika            Central Africa
OVM.AAA.AAA.AAB.AAA           Angola                     Angola
OVM.AAA.AAA.AAB.AAA.AAA       Angola:regionaal
OVM.AAA.AAA.AAB.AAA.AAA.AAA   Angola- Noordwest
OVM.AAA.AAA.AAB.AAA.AAB       Bengo
OVM.AAA.AAA.AAB.AAA.AAC       Benguela
OVM.AAA.AAA.AAB.AAA.AAC.AAA   Catumbela
OVM.AAA.AAA.AAB.AAA.AAD       Bié
OVM.AAA.AAA.AAB.AAA.AAD.AAA   Chinguar
OVM.AAA.AAA.AAB.AAA.AAE       Cabinda
OVM.AAA.AAA.AAB.AAA.AAE.AAA   Futila
OVM.AAA.AAA.AAB.AAA.AAE.AAB   Loango
OVM.AAA.AAA.AAB.AAA.AAF       Cuamato
OVM.AAA.AAA.AAB.AAA.AAF.AAA   Forte Rocadas
OVM.AAA.AAA.AAB.AAA.AAG       Cuanza

The OVM geographic thesaurus does not have a canonical hierarchical structure that could look like:

<continent> <sub-continent> <country> <region> <place> ...

It leaves out nodes where nothing suitable could be filled in, i.e. countries can appear at different levels of depth.
This makes it difficult to transform this thesaurus automatically into a canonical structure, and it is too large for a manual transformation within ECHO. Therefore, the resulting XML structure can only use arbitrary <struct> tags. This does not harm searching, since the nodes represent super-classes that can be exploited. The link to a node in the MPI geographic thesaurus can also be exploited.

[Figure: OVM geographic thesaurus and MPI geographic thesaurus]

The figure indicates the partial match between the two geographic thesauri. Partial matching in the geographical domain means in by far most cases that complete sub-trees can be matched. Only in a few cases at the regional level may the classifications be unclear.

3.4 MPI-Geographic Thesaurus

Due to the non-canonical form of the OVM geographic thesaurus it was decided to add another, canonical thesaurus and to enter into it all geographically oriented names that can be found in any of the metadata records (except OVM). An analysis of all other metadata records revealed that there were not too many different names. For example, in the large Fotothek repository only a few names recur. In the large language domain, too, the categorization is mostly done systematically down to the country level. Some used the region element, but in total there were not too many different ones. So it was an easy job to add all names to a canonical structure that was extracted semi-automatically from an official and reliable web-site.
<continents>
  <continent>
    <cnt-name> Africa </cnt-name>
    <dedications>
      <ger> Afrika </ger>
    </dedications>
    <ovm-code> OVM.AAA.AAA </ovm-code>
    <sub-continents>
      <sub-continent>
        <sc-name> North Africa </sc-name>
        <ovm-code> OVM.AAA.AAA.AAC </ovm-code>
        <countries>
          <country>
            <cou-name> Algeria </cou-name>
            <ovm-code> OVM.AAA.AAA.AAC.AAA </ovm-code>
          </country>
          <country>
            <cou-name> Egypt </cou-name>
            <dedications>
              <ger> Ägypten </ger>
            </dedications>
            <ovm-code> OVM.AAA.AAA.AAC.AAB </ovm-code>
          </country>
          <country>
            <cou-name> Libya </cou-name>
            <ovm-code> OVM.AAA.AAA.AAC.AAC </ovm-code>
          </country>
          <country>
            <cou-name> Morocco </cou-name>
            <ovm-code> OVM.AAA.AAA.AAC.AAD </ovm-code>
          </country>
          <country>
            <cou-name> Sudan </cou-name>
            <ovm-code> OVM.AAA.AAA.AAC.AAF </ovm-code>
          </country>
          <country>
            <cou-name> Tunisia </cou-name>
            <ovm-code> OVM.AAA.AAA.AAC.AAG.AAX </ovm-code>
            <places>
              <place>
                <pl-name> Tunis </pl-name>
                <ovm-code> OVM.AAA.AAA.AAC.AAG.AAY </ovm-code>
              </place>
              ...
            </places>
          </country>
          <country>
            ...
          </country>
        </countries>
        ...
      </sub-continent>
      ...
    </sub-continents>
  </continent>
  ...
</continents>

The links in the OVM geographical thesaurus are not yet XML path expressions. These have to be generated to obtain a fully XML-compliant version that can easily be re-used by others. For the DORA machinery this is not relevant, since optimized index structures are generated anyway for fast processing. Language dedications are specified for some entries only. It would be too much work to create names in the different languages for all entries, unless we find reliable multilingual geographic lexicons.

3.5 OVM Category Thesaurus

The categories and the Dutch labels of this thesaurus were extracted semi-automatically from a web-representation. For reasons of simplicity we indicate the thesaurus in table form. It has three entries: (1) the OVM abbreviation that is used in the metadata records; (2) the English category naming and (3) the original Dutch category naming.
OVM indeling/categories | English | Dutch
OVM.AAC | OVM Category | OVM Categorie
OVM.AAC.AAA | "hunting, fishery, food gathering" | "jacht, visserij, voedselgaring"
OVM.AAC.AAA.AAA | hunting | jacht
OVM.AAC.AAA.AAA.AAA | hunting without tools | jacht zonder handwerktuigen
OVM.AAC.AAA.AAA.AAB | hunting with lures | jacht met lokmiddelen
OVM.AAC.AAA.AAA.AAC | hunting with traps and snares | jacht met vallen en strikken
OVM.AAC.AAA.AAA.AAE | hunting with weapons | jacht met wapens (inclusief accessoires)
OVM.AAC.AAA.AAA.AAE.AAA | hunting with fist weapons | jacht met handwapens
OVM.AAC.AAA.AAA.AAE.AAB | hunting with projectiles | jacht met projectielen
OVM.AAC.AAA.AAB | fishery | visserij
OVM.AAC.AAA.AAB.AAA | fishery without tools | visserij zonder handwerktuigen
OVM.AAC.AAA.AAB.AAB | fishery with lures | visserij met lokmiddelen
OVM.AAC.AAA.AAB.AAC | fishery with traps and nets | visserij met vallen en netten
OVM.AAC.AAA.AAB.AAE | fishery with weapons | visserij met wapens (inclusief accessoires)
OVM.AAC.AAA.AAB.AAE.AAA | fishery with fist weapons | visserij met handwapens
OVM.AAC.AAA.AAB.AAE.AAB | fishery with projectiles | visserij met projectielen
OVM.AAC.AAA.AAC | gathering food | voedsel verzamelen
OVM.AAC.AAB | "weapons, warfare, war" | "wapens, strijd en oorlog"
OVM.AAC.AAB.AAA | fist weapons and accessories | handwapens en accessoires

Since the IconClass thesaurus uses English labeling and since at least English labeling should be used at the user interface, all entries were translated into English as well. It would be too much work within ECHO to generate dedications for other languages. This should be done semi-automatically using appropriate technology. An XML version is currently being created and will be made public at the end of the ECHO project.

3.6 Iconclass Category Thesaurus

The categories of this thesaurus were extracted semi-automatically from a CD-ROM. Again, for reasons of simplicity we indicate the thesaurus in table form.
It has two entries: (1) the IC abbreviation that is used in the metadata records and (2) the English category labeling.

IC code | English category label
1 | Religion and Magic
10 | (symbolic) representations ~ creation, cosmos, cosmogony, universe, and life (in the broadest sense)
11 | Christian religion
11A | Deity, God (in general) ~ Christian religion
11A1 | God the Creator
11A11 | God measuring the Universe (with compasses)
11A2 | Divine Nature
11A21 | Divinity, 'Divinità' (Ripa)
11A22 | symbols ~ Divine Nature
11A221 | circle symbolizing God's perfectness
11A23 | God's perfections
11A3 | God's wrath
11A31 | 'Flagello di Dio' (Ripa)
11B | the Holy Trinity, 'Trinitas coelestis'; Father, Son and Holy Ghost ~ Christian religion
11B1 | Trinity represented by tripartite symbols
11B11 | symbols of the Trinity ~ circular and/or triangular forms or arrangements
11B114 | three animals, geometrically arranged within a circle or triangle
11B12 | Trinity represented as a person with three heads
11B13 | Trinity represented by three animals sharing one head
11B14 | other tripartite symbols of the Trinity
11B2 | Trinity in which each of the Persons (God, Christ, Holy Ghost) is represented either by an object or by an animal
11B21 | representation of the Trinity: hand (Father), lamb (Son), and dove (Holy Ghost)
11B22 | representation of the Trinity: hand, cross and dove
11B23 | representation of the Trinity: hand, chalice and dove
11B3 | Holy Trinity in which one, two or all figures are represented in human shape
11B31 | Trinity as three persons
11B32 | Trinity in which God the Father and Christ are represented as persons, the Holy Ghost as dove
11B321 | God the Father seated, holding the youthful Christ (Emmanuel) in his lap
11B322 | God the Father and Christ enthroned
11B3231 | God the Father holding the crucifix, 'Gnadenstuhl', Mercy-Seat, Throne of Grace
11B3232 | God the Father standing or seated, holding the body of Christ, 'Pitié-de-Notre...
11B33 | representations of the Trinity

The extraction of a clean, complete and well-structured file was not trivial and
partially manual work had to be carried out. The thesaurus had to be complete since many mappings were found between OVM and IconClass nodes. An XML version is currently being created and will be made public at the end of the ECHO project, provided no IPR restrictions are involved. This has to be discussed with KNAW.

3.7 IconClass-to-OVM Mapping

This mapping file is the result of a careful one-directional comparison. The comparison could only be done manually, since any formal comparison based on purely linguistic knowledge could lead to misleading results. The context had to be considered to arrive at the right interpretations.

<mappings>
  <mapping>
    <ic-code> 1 </ic-code>
    <ic-label> Religion and Magic </ic-label>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAC </ovm-code>
      <ovm-label> altars, sanctuaries and their interior decoration and furniture </ovm-label>
    </ovm-mapping>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAD </ovm-code>
      <ovm-label> sacrifices </ovm-label>
    </ovm-mapping>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAF </ovm-code>
      <ovm-label> ritual appliances </ovm-label>
    </ovm-mapping>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAG </ovm-code>
      <ovm-label> symbols of religious status </ovm-label>
    </ovm-mapping>
  </mapping>
  <mapping>
    <ic-code> 10 </ic-code>
    <ic-label> Religion and Magic </ic-label>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAC </ovm-code>
      <ovm-label> (symbolic) representations, creation, cosmos, cosmogony, universe, life </ovm-label>
    </ovm-mapping>
  </mapping>
  <mapping>
    <ic-code> 13 </ic-code>
    <ic-label> magic, supernaturalism, occultism </ic-label>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAB </ovm-code>
      <ovm-label> cult objects and other holy objects </ovm-label>
    </ovm-mapping>
  </mapping>
  <mapping>
    <ic-code> 13C3 </ic-code>
    <ic-label> magic objects, apotropaia </ic-label>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAE </ovm-code>
      <ovm-label> magical protection and defence </ovm-label>
    </ovm-mapping>
  </mapping>
  ...
</mappings>

In contrast to the geographic mapping described above, a mapping between two nodes often does not mean that the complete sub-trees map as well. For ECHO a complete analysis would be too much work. This has to be left to other projects.

[Figure: OVM category thesaurus and IconClass category thesaurus]

As indicated above, there will be much debate about particular mappings. It is therefore all the more important that individuals or groups can influence inferencing by modifying the mappings easily. This requires open definitions, as envisaged for example in ISO TC37/SC4 based on ISO 11179 and ISO 12620, and suitable tools, but in the area of cultural heritage we are far away from such a situation.

3.8 OVM-to-IconClass Mapping

This mapping file is complementary to the one-directional comparison described above. For the same reasons this comparison, too, could only be done manually.

<mappings>
  <mapping>
    <ovm-code> OVM.AAC.AAA.AAA.AAA </ovm-code>
    <ic-label> hunting without tools </ic-label>
    <ovm-mapping>
      <ovm-code> 43C111 </ovm-code>
      <ovm-label> game, hunted animals, hunt, bird hunting </ovm-label>
    </ovm-mapping>
  </mapping>
  <mapping>
    <ovm-code> OVM.AAC.AAA.AAA.AAB </ovm-code>
    <ic-label> hunting with lures </ic-label>
    <ovm-mapping>
      <ovm-code> 43C132 </ovm-code>
      <ovm-label> duck decoy </ovm-label>
    </ovm-mapping>
    <ovm-mapping>
      <ovm-code> 43C1(+43) </ovm-code>
      <ovm-label> lures (hunting) </ovm-label>
    </ovm-mapping>
  </mapping>
  <mapping>
    <ovm-code> OVM.AAC.AAA.AAA.AAC </ovm-code>
    <ic-label> hunting with traps and snares </ic-label>
    <ovm-mapping>
      <ovm-code> 43C131 </ovm-code>
      <ovm-label> finch trap, finchery </ovm-label>
    </ovm-mapping>
  </mapping>
  ...
</mappings>

The comments made in section 3.7 apply here as well.

3.9 MPI Content List

To achieve content mappings where possible, it is important to try to map all content-describing elements from all metadata sets to the thesauri used by RMV and Fotothek and, of course, to find links between them.
We extracted the list of all values we have found so far and are currently comparing the entries. All of this can only be done manually.

<mappings>
  <mapping>
    <mpi-label> Speech </mpi-label>
    <ic-code> 31B6235 </ic-code>
    <ic-label> speaking </ic-label>
  </mapping>
  <mapping>
    <mpi-label> writing </mpi-label>
    <ic-code> 49L11 </ic-code>
    <ic-label> handwriting, writing as activity </ic-label>
    <ovm-code> OVM.AAC.AAK.AAB </ovm-code>
    <ovm-label> script </ovm-label>
  </mapping>
  <mapping>
    <mpi-label> Speech, some gesture </mpi-label>
    <ic-code> 31B6235 </ic-code>
    <ic-label> speaking </ic-label>
    <ic-code> 31A25 </ic-code>
    <ic-label> postures and gestures of arms and hands </ic-label>
  </mapping>
  ...
</mappings>

4. ECHO Knowledge Repositories

In chapter 3 we made some comments about the need for flexible knowledge representation infrastructures for the area of cultural heritage. This is mainly due to the fact that people will not agree on definitions, so it should be possible to add new definitions. The mappings are even more problematic, since only in a few cases can one speak of a perfect match. In the case of the thesaurus mappings we have not yet used relation types. It is beyond the scope of the ECHO project to sort out how the inherent semantics can be modeled more precisely so that the mappings can be exploited in a more fine-grained way. Currently, all mappings between the thesaurus nodes are of the type “mapsTo”, which implements a fuzzy mapping indicating some form of overlap without being more precise.

To arrive at a more open and flexible knowledge representation infrastructure we will set up an ISO TC37/SC4 compliant repository and start defining the DORA categories with the help of this framework. For the mapping files appropriate open repositories will be offered at the MPI web-address, including all schemas [16]. RDF seems to be a primary candidate for the representation in the Semantic Web era. Currently, however, XML is seen as being sufficient.
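Because the mapping files are plain XML, reusing them outside the DORA machinery needs little code. The following sketch is an illustration only: it is written in Python (the DORA engine itself is largely Java), the function name load_maps_to is made up for this example, and the element names are taken from the IconClass-to-OVM excerpt in section 3.7. It loads the untyped “mapsTo” links into a table that a group could then modify locally.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Excerpt in the structure of the IconClass-to-OVM mapping file (section 3.7).
MAPPING_XML = """
<mappings>
  <mapping>
    <ic-code> 13 </ic-code>
    <ic-label> magic, supernaturalism, occultism </ic-label>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAB </ovm-code>
      <ovm-label> cult objects and other holy objects </ovm-label>
    </ovm-mapping>
  </mapping>
  <mapping>
    <ic-code> 13C3 </ic-code>
    <ic-label> magic objects, apotropaia </ic-label>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAE </ovm-code>
      <ovm-label> magical protection and defence </ovm-label>
    </ovm-mapping>
  </mapping>
</mappings>
"""


def load_maps_to(xml_text):
    """Return {iconclass_code: [(ovm_code, ovm_label), ...]} (hypothetical helper)."""
    table = defaultdict(list)
    for mapping in ET.fromstring(xml_text).iter("mapping"):
        # The sample files pad element text with blanks, so strip it.
        ic_code = mapping.findtext("ic-code", default="").strip()
        for target in mapping.iter("ovm-mapping"):
            ovm_code = target.findtext("ovm-code", default="").strip()
            ovm_label = target.findtext("ovm-label", default="").strip()
            table[ic_code].append((ovm_code, ovm_label))
    return dict(table)


maps_to = load_maps_to(MAPPING_XML)

# A local group can add or remove links before using the table in its own tools.
maps_to.setdefault("13C3", []).append(
    ("OVM.AAC.AAN.AAB", "cult objects and other holy objects")
)
```

Keeping the table as ordinary data, separate from any search code, is what makes local modification cheap; typed relations could later be added as a third element of each tuple.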
Open repositories of this kind would allow everyone to modify aspects of the mapping and use them in their own machinery. We see this start of an open knowledge representation infrastructure as one of the outcomes of ECHO. The current DORA machinery will not make use of this open infrastructure, since it would cost too much effort to rewrite all programs and scripts.

[16] Before doing this at the end of the ECHO project we have to check the IPR situation.

5. Exploitation

Within ECHO we have created a practical ontology covering a number of knowledge components. From careful inspection of certain representations such as the thesauri we could identify many useful mappings that can be exploited by the DORA machinery. However, we cannot yet say enough about how the various metadata categories are used by the people who generate the metadata descriptions. From experience we know that there is some semantic spreading, but we cannot yet make any quantifying statements. Once DORA uses the full set of components described here [17], we have to start investigating how effective the mappings are in exploiting possible relations between the different domains and sub-domains. Here we are at the beginning. This partly has to do with the fact that only a few repositories are large (Fotothek, RMV, Languages).

[17] The machinery is constantly being extended with the goal of being ready by the end of April 2004.

C. WP2 Note on the DORA Search Engine

Peter Wittenburg
9.5.2004

In two reports we have described the DORA [18] concept and the underlying mapping scheme (WP2-TR16-2004) and its ontology components (WP2-TR17-2004). In this document we describe the search engine and summarize its evaluation [19]. While the DORA document describes the intentions and possibilities, this document describes what was implemented. It is not a technical documentation, but describes in some detail which implementation decisions were taken and which problems were encountered.
The search engine is based on the mappings as described in the DORA note and in the Ontology note, i.e., it implements the mappings and semantic relations in specific ways to achieve high performance. The evaluation part has to consider two aspects: (1) the formal correctness of the algorithms has to be checked, and (2) the usefulness and appropriateness of the semantics included in DORA have to be evaluated. Finally, answers to the following questions have to be given:

• Are the chosen semantic relations useful?
• Does interdisciplinary metadata help to answer questions?
• What kind of infrastructure is necessary to overcome current limitations?

It should be noted here that about 95,000 records are included and that their distribution is uneven. It is obvious that searching only makes sense in large collections such as those delivered by Fotothek (75,715 records) and languages (17,403 records). The relatively small number of records provided by the other repositories at this moment (20 to 1,100) limits the strength of the evaluation. Any data offered by the data providers was integrated [20].

[18] Digital Open Resource Area: see WP2-TR16-2003; web-site to come.
[19] The evaluation will be updated in May 2004.
[20] In the case of the RMV repository it is being checked why not more than the current 20 records can be harvested.

1. Search Engine

In this chapter we describe the actual DORA interface, the harvesting principles, the data correction steps that had to be taken, the nature of the index creation process and the searching process. It should be mentioned that the DORA engine is implemented largely in Java [21].

1.1 DORA Interface

The DORA interface was implemented as described in the original DORA document. However, during the ECHO project it became apparent that some of the goals were too challenging to be met within the short period of time. Everyone interested can make use of the DORA engine; it is available at the following URL:
http://corpus1.mpi.nl/ds/dora/

[21] A technical documentation will go into more detail.

The user can select the disciplines and, within the disciplines, the data providers to be included in the search. The disciplines are indicated by images and the data providers by menu lists. The interface offers two search options: (1) In simple search the user can specify words that are searched for in all metadata fields provided, including full-text fields that contain prose text. (2) In complex search the user can select a view that is derived from the vocabulary used by the different data providers. All details of these views are explained in the DORA note.

Originally, it was intended to include browsing, geographical browsing and annotations in the search. These features were not implemented. Languages is the only domain where browsing is available, so here it makes sense to go to the language portal directly. The geographical browsing turned out to be too difficult to implement in the ECHO period. Due to the large difference in scale (from continents to maps of ancient Rome) we would have needed scalable maps that allow stepping down to details of Rome, and it was seen as too much work to provide the exact coordinates of all locations involved in the DORA domain. Metadata descriptions do not yet include formal geographical coordinates from which points could be created automatically.

The option to search on annotations is provided, and it would not be too difficult to add annotations to the index; however, this option is not effective yet. Here, too, some plans were too ambitious to be realized in the short ECHO period. The idea in history of science was to relate web-sites to each other by entering typed relations. These annotations would be excellent resources to integrate into searches. However, no such data could be created yet.
It should be mentioned that the interface is driven by a configuration file, i.e., it can easily be adapted to other configurations that would imply other

• disciplines
• data providers within them
• views

Every data source in DORA gets an ID which is used as the key to combine different knowledge.

1.2 Harvesting

The way data providers deliver data within ECHO differs, as the table indicates.

Data provider | Availability | Delivery
NECEP | online | XML
RMV | online | OAI
Languages | online | XML/OAI
Lineamenta | off-line | email
CIPRO | off-line | email
Fotothek | off-line | email
IMSS | online | OAI
Berlin | not yet up |
Philosophy | online | XML

Five collections were online and could be harvested according to various schemes. Three of them offer an OAI-PMH compliant interface. In the case of languages the XML variant was preferred since it includes all metadata fields. The three off-line data sources extracted files at certain moments and provided them by email; here a harvesting concept was not applicable.

For those data sources that could be harvested a process file was created. It can be modified in a simple way with the help of a web interface. The following parameters can be defined via this interface to tune the harvesting engine:

• data provider ID
• frequency of harvesting
• time of day to execute the harvesting (hour/minute)
• day to execute the harvesting
• import prefix
• classpath to the data processing programs
• the label of the data provider
• root URL as harvesting address

In addition the file contains parameters such as the location of logging information, the date and time of the last harvesting, etc. The classpath reference is of great importance since it refers to executable code that contains the knowledge about how to grab the data from the specified URL (OAI/XML) and how to pre-process the data delivered by the source. A log file is created that contains protocol information describing the harvesting process.
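As an illustration, the parameters listed above could be held in a small record from which the harvesting request is derived. This is a sketch in Python under stated assumptions, not the actual DORA process file: all field names, the example address and the choice of the oai_dc metadata prefix are hypothetical; only the OAI-PMH request arguments (verb, metadataPrefix, resumptionToken) come from the protocol itself.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class HarvestJob:
    """One entry of a harvesting process file (all field names are hypothetical)."""
    provider_id: str
    label: str
    root_url: str        # harvesting address
    protocol: str        # "OAI" or "XML"
    import_prefix: str
    handler: str         # reference to the provider-specific processing code
    every_n_days: int    # frequency of harvesting
    hour: int            # time of day to execute the harvesting
    minute: int


def request_url(job, resumption_token=None):
    """Build the URL to fetch for one harvesting step."""
    if job.protocol != "OAI":
        # Plain XML sources are fetched from the root URL as they are.
        return job.root_url
    if resumption_token:
        # In OAI-PMH a resumption token is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return job.root_url + "?" + urlencode(params)


# Hypothetical job; example.org stands in for a real provider address.
job = HarvestJob("imss", "IMSS", "http://example.org/oai", "OAI",
                 "imss", "handlers.ImssOai", 7, 2, 30)
first_request = request_url(job)
```

Keeping the provider-specific knowledge behind the handler reference means that a new provider only needs a new entry and a new handler, not a change to the scheduler.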
In addition to the information mentioned above, the log file states how many records were received per source and which types of errors were encountered. This file is also used to document other steps and to log the query handling.

1.3 Data Pre-Processing

The data delivered had to be corrected and modified in different ways. Here we can only give a few examples. The purpose of this chapter is not to complain, but to show the problems one is faced with at the various levels when building an interoperable metadata domain. Initiatives such as OAI have great value, although the metadata harvesting protocol is very simple. Its wide acceptance makes clear to every data provider that providing correct data is the task of the data provider and not of the service provider. Experience, not only in ECHO, shows that we are still far away from that goal.

Much effort was caused by changes in the delivered data over time. The language domain changed the IMDI version such that new XPaths were necessary and new mappings had to be established. However, this step was an explicit one supported by proper schemas. In many cases changes were made without notice or without providing a schema. Path corrections could only be carried out after visual inspection.

OAI-PMH Type of Harvesting (RMV, IMSS)

In the case of OAI harvesting the pre-processing was comparatively simple. This has to do with the fact that a validation check is carried out when registering as an OAI data provider. A schema has to be provided and the data delivered is validated against this schema, i.e., correct data can be assumed at the encoding and syntax level. At the content encoding level some pre-processing still had to be carried out, since this is beyond schemas. Due to the limited number of fields in Dublin Core, RMV chose to package different types of information into one Dublin Core field. During pre-processing these had to be separated again.
Some of the encodings also had to be interpreted and modified to separate formal encodings from explanatory (and therefore searchable) strings. In principle, however, the choice of OAI to put all validation errors on the shoulders of the data provider seems to be the best one can do. It requires that the data providers, who know their data very well, take responsibility for cleaning up all encoding and syntax problems. In general the broad semantic definitions of Dublin Core fields such as DC:Coverage or DC:Subject make it difficult to create suitable mappings at the semantic level. In some cases it is too early to make statements about the usage of such fields.

XML Type of Harvesting (NECEP, Languages, Philosophy)

When harvesting XML data available online, a schema was available in two cases (NECEP, Languages) and validation was carried out by the data provider, so proper metadata was delivered. In the case of philosophy, IMDI-type metadata descriptions were created manually from the given texts, so proper schema-based metadata was available here as well. In fact the philosophy data consists of textual descriptions that were interpreted as prose descriptions, i.e., they are not part of the complex search but are integrated into the index for simple search.

In the language case a major schema change was made during the DORA work, so several utility files containing XPaths etc. had to be adapted. Some repositories, such as those created by Lund University within ECHO, still use the old IMDI version, i.e., it had to be recorded which version is used for the different parts of the language domain. A proper harvesting scheme would therefore have to check the version of the underlying schema regularly to make sure that the settings are still valid. The IMDI import module has the appropriate knowledge and can adapt the import schema; however, the XPath specifications have to be updated.
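The version check suggested above can be sketched as a lookup from schema version to a set of XPath expressions. The following Python fragment is a hedged illustration only: the version attribute, the version strings and all paths are invented for the example and do not reproduce the real IMDI schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath tables, one per schema version (not the real IMDI paths).
XPATHS = {
    "1.0": {"title": "./Session/Title",
            "country": "./Session/Location/Country"},
    "2.0": {"title": "./Session/Title",
            "country": "./Session/MDGroup/Location/Country"},
}


class UnknownSchemaVersion(Exception):
    """Raised when a record uses a schema version without an XPath table."""


def extract(xml_text):
    """Pick the XPath table by the record's declared version and extract fields."""
    root = ET.fromstring(xml_text)
    version = root.get("version")  # assumed attribute carrying the schema version
    if version not in XPATHS:
        # Fail loudly instead of silently extracting nothing with stale paths.
        raise UnknownSchemaVersion(version)
    return {field: (root.findtext(path) or "").strip()
            for field, path in XPATHS[version].items()}


old_record = ('<record version="1.0"><Session><Title>Demo</Title>'
              '<Location><Country>Mali</Country></Location></Session></record>')
new_record = ('<record version="2.0"><Session><Title>Demo</Title><MDGroup>'
              '<Location><Country>Mali</Country></Location></MDGroup></Session></record>')
```

Both records yield the same field dictionary although the country element sits at different depths; a record with an unlisted version raises an error, which is the point at which the XPath specifications would have to be updated.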
Static Harvesting (other providers)

In the case of the other data providers in ECHO static files were exchanged, in general by email. As far as we know, the XML data was generated by extracting data from relational database repositories of different types. Here many problems were encountered. Again it should be mentioned that our colleagues did their best to provide useful metadata; this is simply a picture of the state of technology.

• lack of proper XML headers;
• no UTF-8 character encoding although the XML header claims it [22];
• lack of an XML schema, prohibiting any validation;
• invalid XML constructions;
• existence of several XML document headers in one file;
• changes of the underlying schema

In the case of the Fotothek it was known that the records are highly nested, so a normalized structure had to be created. It was not always clear to the DORA developers which of the fields had to be replicated. It also became apparent that the encodings found in the metadata records did not fit the encodings found in the thesauri, for example. Some pre-processing had to be done here as well.

[22] Problems of this kind are very serious, since no errors are raised during parsing. In general such errors are only noticed when searches do not lead to appropriate results. For example, the string “Milano” was not extended via the geographic thesaurus to “Italy” and “Europe” because it contained non-UTF-8 character encodings. We assume that some of these errors are still hidden in the index.

Normalized validated DORA Repositories

Before any further processing, normalized and (as far as possible) validated XML files were created for all repositories. These are part of the DORA ontology and have a documented structure, so that the XPath definitions contained in the various other resources are correct. In general, this pre-processing step was necessary to arrive at useful repositories, but it took too much time.
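Two of the problems listed above, wrong encoding declarations and several XML headers in one file, can be detected before parsing. The sketch below (Python, with hypothetical function names) is not the check used in DORA; it only shows that strict decoding reports bytes that a lenient pipeline would pass on silently.

```python
def first_invalid_utf8_offset(raw: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        raw.decode("utf-8")  # decoding is strict by default
    except UnicodeDecodeError as err:
        return err.start
    return None


def count_xml_declarations(raw: bytes) -> int:
    """Count XML document headers; more than one means concatenated documents."""
    return raw.count(b"<?xml")


record = '<?xml version="1.0" encoding="UTF-8"?><place>Ägypten</place>'

# Same text, once really UTF-8 and once Latin-1 under a header that claims UTF-8.
really_utf8 = record.encode("utf-8")
claims_utf8_but_latin1 = record.encode("latin-1")

offset_ok = first_invalid_utf8_offset(really_utf8)
offset_bad = first_invalid_utf8_offset(claims_utf8_but_latin1)
headers_in_concatenated_file = count_xml_declarations(really_utf8 + really_utf8)
```

Such a check does not catch every mis-encoded file, since some byte sequences from other encodings happen to be valid UTF-8, but it turns the most common case into an explicit error at import time rather than a missing search hit later.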
When creating these normalized XML files, the punctuation characters were also removed from the data to allow proper and easy matching. For presentation purposes the original string is preserved as well.

1.4 Index Creation

Since DORA now contains about 95,000 records and since it can be expected that these numbers will increase rapidly, it was decided to focus on fast indexing mechanisms and to do as much semantic processing as possible off-line, i.e., not during search. Exploiting the different knowledge components in real time would lead to unacceptable delays. It was decided to use a binary tree in which every word found somewhere in the metadata descriptions (including the prose texts) is included as a sequence of nodes. With proper encoding techniques such a tree would guarantee almost equal access times for all queries. It was checked whether an API provided by one of the existing search engines could be used. Since the search algorithm itself was not seen as the component that would take much time, this option was not chosen, i.e., a tree-traversing algorithm was programmed based on existing experience and knowledge.

Before creating the index tree the semantic extension had to take place. To accomplish this, the codes found in the Fotothek and RMV metadata descriptions were first replaced by the corresponding strings and separated. At the same time the mappings between the three content thesauri were used to add the appropriate strings (iconclass2ovm-mapping-v3.xml, ovm2iconclass-mapping-v3.xml, IMDI2iconclass-and-ovm-v1.xml). Due to the semantic vagueness of the entries found and of the relations between the thesauri it was decided not to extend to all super-classes in the thesauri. Tests have shown that this would result in a semantic explosion and a decrease in precision [23].

The following example may illustrate the operation. The following relation is taken from the iconclass2ovm-mapping file. A specific Iconclass code has relations to two OVM codes.
31D | Iconclass code that maps to OVM classes
human life and its ages | corresponding Iconclass string
OVM.AAC.AAM | OVM code
life cycle | appropriate OVM string
OVM.AAC.AAM.AAA | OVM code
pregnancy, birth and first year | appropriate OVM string

When the entry “31D” is found in a record of the Fotothek repository, it is first replaced by the corresponding string. Then the two semantically overlapping strings of the OVM thesaurus are added. The resulting entry is transformed from “31D” to “human life and its ages; life cycle; pregnancy, birth and first year”. In doing so the user would also find this entry if the search string “life cycle” was entered.

[23] Here the term “precision” is used as known from the field of information extraction. It reflects how many of the obtained hits are appropriate. A decrease in precision means that too many “wrong” hits were found.

For all geographic information a full extension was made. Two thesauri were used: ovm-geo-thesaurus-v3.xml and mpi-geo-thesaurus-v4.xml. The first is used for the OVM collection; the second was assembled by looking through all geographically relevant fields in the other repositories, including the names of museums, names of languages spoken in an area, etc. (for more details we refer to the ontology document). Where possible, names other than the English ones were added as well [24]. So if Milano was found, Milan and Mailand were also added. The mpi-geo-thesaurus-v4 thesaurus also contains mappings to the appropriate categories in the OVM thesaurus. The following example is taken from the mpi-geo-thesaurus-v4 thesaurus.

West Africa        OVM.AAA.AAA.AAE
  Benin            OVM.AAA.AAA.AAE.AAA.AAA
  Burkina Faso     OVM.AAA.AAA.AAE.AAB.AAA
    <lang>Dogon

It says that Benin and Burkina Faso can be found in West Africa and that the language Dogon is spoken in the area of Burkina Faso. During index creation, therefore, several types of information were added to an entry such as “Milano”.
It would result in the entry “Milano, Milan, Mailand, Italy, Italien, Italia, Europe, Europa”. This would return the corresponding record as a hit if, for example, the string “Italien” were used to specify the location in a query. In this case hierarchy extension makes sense, since the geographic concepts are exactly defined.

Since only one index is used for both simple and complex search, special care had to be taken about how the extension is done for prose text. For keyword-type metadata elements it was assumed that the vocabulary is used properly, i.e., we expect to find the complete string for an institution such as “Sterling and Francine Clark Art Institute” (an institution in Williamstown, Massachusetts, USA). This allows us to match the complete string and therefore reduces the chance of false hits. In prose text, however, we may find variants of such a string, such as “the Art Institute from Sterling and Francine Clark”; nevertheless the search engine should find the entry. We could only implement policies that do not rely on advanced Natural Language Processing. Therefore, during the extension it was allowed to break the found string down and to match, for example, “Sterling”. Such a policy increases the risk of false hits, but if the query contains more information, such as “Francine Clark”, those records that come from the mentioned institution get a high rating and appear at the top.

The result of these processes is a large index file that includes all necessary types of information for each node in the tree, such as Document ID, Repository ID and XPath information. So when a hit is found, it can, for example, immediately be determined where it comes from.

[24] This could only be done in a limited and unsystematic way to help in using the DORA engine.

1.5 Searching

Searching is simply done by traversing the binary tree for every entry found in the query.
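A minimal sketch of such an index follows, assuming a character trie in place of the binary tree mentioned in the text (the actual encoding used by DORA is not documented here) and hypothetical repository and document identifiers. Each node reached by a complete word carries the postings (repository ID, document ID, XPath), so a lookup is one walk down the tree per query word.

```python
class Node:
    """One character position in the index tree."""

    def __init__(self):
        self.children = {}   # next character -> Node
        self.postings = []   # (repository ID, document ID, XPath) triples


class Index:
    def __init__(self):
        self.root = Node()

    def add(self, word, posting):
        """Insert a word as a sequence of nodes and attach the posting at its end."""
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, Node())
        node.postings.append(posting)

    def lookup(self, word):
        """Walk the tree for one query word; the cost depends on the word length only."""
        node = self.root
        for ch in word.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.postings)


index = Index()
posting = ("fotothek", "doc-1", "/record/place")  # hypothetical IDs and path

# The off-line extension has already turned "Milano" into this list of strings,
# so every variant and super-class points at the same record.
for term in ["Milano", "Milan", "Mailand", "Italy", "Italien",
             "Italia", "Europe", "Europa"]:
    index.add(term, posting)

hits_for_italien = index.lookup("Italien")
hits_for_paris = index.lookup("Paris")
```

Because the semantic extension is done when the index is built, the lookup itself needs no thesaurus access; this is the trade-off described in section 1.4 between a larger index and fast queries.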
The traversal results in a number of hits, which are filtered according to the selections made in the interface. When looking for the string “horse”, the hits for the morphological variant “horses” are used as well. No lexical processing is used in the search algorithm yet. The filtering covers domains, sub-domains and, for complex search, the field names. The latter includes all semantic mapping relations between the metadata categories as explained in the DORA note. In doing so the task of semantic mapping is reduced to a filtering step, which makes mapping very fast.

A simple ranking mechanism is applied in the search algorithm. When two or more separate items, as for example in “Sterling and Francine Clark Art Institute” (5 different items), all result in hits, the hit receives a very high ranking. Further, the number of occurrences of a certain string in a metadata record is used to increase the ranking. We can therefore speak of three ranking levels:

(1) Highest ranking for the co-occurrence of multiple words appearing in the query.
(2) Moderate ranking when a word occurs several times in a record.
(3) Lowest ranking for a single occurrence of one word of the query string.

With respect to the hits, all information provided by the data providers is used to give feedback as quickly as possible. In the above figures a few examples are given. The first example is the result of entering “horses” in simple search. It results in 8 hits from three different domains. In the case of the IMSS hits a back link is provided to the web page with the following object: “PAOLO SANTINI (after TACCOLA) - Double-grindstone mill powered by two horses”, i.e., when clicking on the back link the page shown appears. In the case of languages, when querying for example “wittenburg”, a resource with gesture data is shown. When clicking on the back link one first gets the metadata entry, but can then request the annotations with the appropriate video fragment.
Two options are available: (1) The annotations created with ELAN can be viewed with the help of HTML, where clicking on an annotation will activate the appropriate video fragment. (2) ELAN allows generating a SMIL [25] object which is addressable via the metadata. When clicking on it, streaming video is shown with subtitles. ELAN allows selecting the tiers to be seen and the time fragment that is of interest.

[25] SMIL is a W3C supported standard for media presentations and will be supported by an increasing number of browsers.

In the third example the word “rome” is entered as query, delivering many hits, for example from the CIPRO repository. Here two options are given. When clicking on the thumbnail a larger image of the map is shown. When clicking on the back link a page is offered showing the appropriate map within the DIGILIB image processing framework.

The presentation of the hits and the back link possibilities can certainly be improved, but they were not at the center of the ECHO work. Also, some repositories include many resources that are not open.

2. Evaluation

This evaluation is split into three parts. In the first we make some comments about the formal correctness, which we distinguish from the usefulness of the chosen semantic mappings and operations, which we discuss with the help of examples. While in the case of formal correctness one can speak of “errors”, the semantic mappings are a matter of subjective evaluation. The third part makes statements about the ranking.

2.1 Formal

Formal correctness includes aspects such as:

• Are all specifications made in the ontology correctly implemented?
• Are the final metadata files (created by conversion) correct?
• Are the extension mechanisms that create the final index file correct?
• Are the extensions such that we don’t get a semantic explosion?

The latter also has to do with semantic evaluation, so it could also appear under 2.2.
During the last weeks much testing was done to see whether the engine and the underlying mapping files are correct. We distinguish two types of mappings: (1) Those mappings that are specified between the different metadata elements. (2) Those mappings that are established between the thesauri. The mapping scheme between the metadata elements was provided to and discussed with the data providing teams very early. The first version of the DORA document was distributed in late 2003, so that all teams could respond. The corrections we received were integrated. During the tests it was checked in detail whether the mappings are effective while searching. The method was to investigate specific examples that were obvious from studying the metadata sets. As far as can be seen from these investigations, the specified mappings are used correctly. The correctness of the implementation of the thesaurus mappings and extensions was tested especially for the geographical elements. Here we discovered a number of errors which mainly had to do with incorrect character encodings in the metadata files. Although UTF-8 was declared in the header, we found that this declaration was not correct in some cases. Also, in some cases additional characters had been introduced in the strings. Only through these operational checks could we find these errors. For the obvious cases corrections were carried out, although we cannot claim that these kinds of problems are completely removed. Another problem we encountered was that the thesaurus extension leads to an explosion of hits in the case of the content description. In the case of geographical terms we have a well-defined domain that is organized hierarchically. In the case of content descriptions we don’t have such a well-structured domain. The combination of semantic mappings between nodes of the content thesauri and the hierarchical extension leads to cycles and an explosion, resulting in too many non-useful hits.
Therefore, we concluded that for the content description within ECHO we will only exploit the mapping specifications and not use the hierarchy information. A more detailed semantic analysis would have to be carried out to arrive at refinements. This was beyond the scope of the ECHO project.
2.2 Examples and Semantics
First, we will give a number of examples and then give a first evaluation.
Example 1
Simple Search “weapons”
87 matches are found: Fotothek: 84, RMV: 1, IMSS: 2
Complex Search “weapons”
Fotothek - Iconography: 84, RMV - Content Description: 1, IMSS - title: 2
Both search types lead to the same result. In the case of complex search the mapping between the fields becomes effective, leading to acceptable results.
Example 2
Simple Search “dogon”
1 match was found: NECEP: 1
Complex Search “dogon”
View NECEP - society name: 1 in NECEP
View IMSS - language: 1 in NECEP
View DC - language: 1 in NECEP
View Language - language: 1 in NECEP
Complex Search “mali”
View Language - country: 1 in NECEP
This example demonstrates the effect of the mapping and of the geographical thesaurus. The language element is mapped to the society name element in NECEP although this is semantically not fully correct. Entering “mali” in the country specification yields a hit since “mali” is seen as a superclass to “dogon”. Here a relation type such as “has_language” would be semantically more appropriate.
Example 3
Simple Search “inuit”
2 matches are found: Language: 1, NECEP: 1
Complex Search “inuit”
View Language - *: 0 in Language (could not be found in the Language domain)
View Language - language: 1 in NECEP
Complex Search “greenland”
View Language - language: 1 in NECEP
The results are similar to those of Example 2. They indicate that the element including “inuit” in the language domain is not an element that is used for mapping. It was used as an optional field by one specific researcher.
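The behaviour seen in Example 2 can be illustrated with a small sketch in which thesaurus expansion happens when the index is built and the field mapping acts as a filter at query time. Everything concrete here is an assumption made for illustration: the toy `BROADER` relation, the `FIELD_MAP` entries (in particular the mapping of the Language country field onto the NECEP society name field, which is assumed only so that the “mali” query can be reproduced), the record layout and the function names. It is not the DORA implementation.

```python
from collections import defaultdict

# Toy broader-term relation of a geographical thesaurus (assumed data).
BROADER = {"dogon": "mali", "mali": "africa"}

# Assumed mapping relations: selected (domain, field) -> equivalent (domain, field) pairs.
FIELD_MAP = {
    ("Language", "language"): {("Language", "language"), ("NECEP", "society name")},
    ("Language", "country"): {("Language", "country"), ("NECEP", "society name")},
}

RECORDS = [
    {"id": "necep-1", "domain": "NECEP", "fields": {"society name": "Dogon"}},
]


def ancestors(term):
    """Yield all broader terms of a term; the seen set guards against cycles."""
    seen = {term}
    while term in BROADER and BROADER[term] not in seen:
        term = BROADER[term]
        seen.add(term)
        yield term


def build_index(records):
    """Index every word and its broader terms under (domain, field, record id)."""
    index = defaultdict(set)
    for rec in records:
        for field, value in rec["fields"].items():
            for word in value.lower().split():
                for term in (word, *ancestors(word)):
                    index[term].add((rec["domain"], field, rec["id"]))
    return index


def complex_search(index, term, domain, field):
    """Look the term up, then keep only entries whose field is mapped to the selection."""
    allowed = FIELD_MAP.get((domain, field), {(domain, field)})
    entries = index.get(term.lower(), ())
    return sorted(rid for d, f, rid in entries if (d, f) in allowed)


if __name__ == "__main__":
    idx = build_index(RECORDS)
    print(complex_search(idx, "dogon", "Language", "language"))
    print(complex_search(idx, "mali", "Language", "country"))
```

Because the mapping is applied only as a set-membership test on entries that already matched, it adds very little cost to a lookup. The cycle guard in `ancestors` matters as soon as mapping relations and hierarchy relations are mixed, which is the situation the report describes for the content thesauri.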
Example 4
Simple Search “agriculture”
75 matches are found: Language: 73, Fotothek: 2
Complex Search “agriculture”
View Fotothek - iconography: 2 in Fotothek
View RMV - content: 2 in Fotothek
View IMDI - content: 2 in Fotothek
The results can be misleading. The 73 hits for language result from matching with the recording place (“southern agriculture kindergarten”) or the affiliation of an actor (“ministry of agriculture”). In the case of Fotothek the hits make sense since they are about “harvesting”. The mapping in complex search works properly as indicated. Of course, in complex search the misleading hits from the language domain are not found.
Example 5
Simple Search “clothing”
22 matches: Language: 8, RMV: 8, Fotothek: 6
Complex Search “clothing”
View RMV - content: 8 in RMV, 6 in Fotothek
View Fotothek - iconography: 8 in RMV, 6 in Fotothek
View Language - content: 8 in RMV, 6 in Fotothek
Again the rich annotations that are inserted in various free-text fields in the language domain lead to hits that are not useful. They are about chats at the bakery shop and the clothes people are wearing, so they are not about clothing as an object, which may be what the person specifying the search intended. The results for complex search from different domains show the correctness of the mappings. The language hits are excluded, but the others are found.
Example 6
Simple Search “horses”
7 matches: Fotothek: 2, Language: 2, IMSS: 3
Complex Search “horses”
View Fotothek - object title: 3 in IMSS
View Fotothek - iconography: 2 in Fotothek
View Lineamenta - title: 3 in IMSS
View Lineamenta - keywords: 2 in Fotothek
View IMSS - title: 3 in IMSS
View IMSS - subject: 2 in Fotothek
View Language - title: 3 in IMSS
View Language - content: 2 in Fotothek
This example clearly indicates the strength of simple search and the weakness of complex search. The pattern of complex search is like a narrow path in the complex semantic space.
If one looks at title one finds the IMSS hits, if one looks at content one finds the Fotothek hits. Both, however, lead to useful hits in which “horses” play an important role. The reason is partly that metadata in many cases is very sparsely encoded. In the case of IMSS the term horses is only mentioned in the title, while the content element is not yet used. In the language case thesaurus information is used to infer from the title content “spatial layout task, farm scenarios” to “horses”. Further tests and examples will follow. So far there is no clear statement whether simple or complex search is better. Simple search is good when one wants to be sure to get a large number of hits where the probability is very high that the documents looked for are included, even at the price of a large number of hits. Complex search is more selective and its matching operations are much stricter. In general, complex search is excellent for those metadata elements that describe a more precise domain such as date, geographic location and authors. Content descriptions are done in very different ways and according to different categorization principles (thesauri, keywords). Any professional search on these elements requires a high degree of knowledge about the underlying category system and its semantics. If one wants to exploit the advantages a thesaurus such as IconClass can offer, one has to know its semantic construction principles. One big advantage of simple search is that it uses all fields even if they contain prose text. However, this also increases the number of inappropriate hits, as was shown in the examples.
2.3 Ranking
Ranking is a possibility to satisfy the user in case of low precision. It is a general rule to offer more hits even if non-appropriate documents are included, since there is always a trade-off between “recall” and “precision”.
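To make this trade-off concrete, the figures of Example 4 can be put into the usual definitions: recall is the share of all appropriate documents that were found, and precision is the share of the found documents that are appropriate. The sketch below assumes that the two Fotothek records are the only appropriate documents for “agriculture”; the report does not state this, and the function names are chosen here for illustration.

```python
def recall(appropriate_found, appropriate_total):
    """Share of all appropriate documents that the search returned."""
    return appropriate_found / appropriate_total if appropriate_total else 0.0


def precision(appropriate_found, found_total):
    """Share of the returned documents that are appropriate."""
    return appropriate_found / found_total if found_total else 0.0


# Figures from Example 4; the number of appropriate documents is an assumption.
APPROPRIATE_TOTAL = 2
searches = {
    "simple": {"found": 75, "appropriate_found": 2},
    "complex": {"found": 2, "appropriate_found": 2},
}

if __name__ == "__main__":
    for name, s in searches.items():
        r = recall(s["appropriate_found"], APPROPRIATE_TOTAL)
        p = precision(s["appropriate_found"], s["found"])
        print(f"{name}: recall={r:.2f} precision={p:.3f}")
```

Under this assumption both search types reach the same recall, while the 73 language hits reduce the precision of simple search to 2 out of 75. Ranking is the means to push those two appropriate hits to the top of the list.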
If the “recall” (ratio of appropriate documents found to the total number of appropriate documents) is to be increased, normally the “precision” (ratio of appropriate documents found to all documents found) decreases. But the primary goal is to find all appropriate documents and offer them. A compromise then is to offer first those documents for which there is clear evidence that they are appropriate. The implemented ranking is based on frequency of occurrence and not on semantic criteria. It makes sense to weight multiple occurrences of different terms higher than multiple occurrences of one term: the more terms of the query input match, the higher the probability that the document found is a useful hit. The results found are in general satisfying. An implementation of a ranking based on semantic criteria requires much more experience and insight into the usage of all concepts. Since many metadata sets were offered at a very late moment within the project, there was no chance to include semantics in the ranking. Including semantics also means including a bias. It is obvious that people disagree on semantic relations and want to be able to tune the semantically related operations according to their wishes. Therefore, we refrained from making use of the “mapping quality” parameter which can be added to the mapping relations between the different metadata elements. It would require much more time to come up with useful defaults. At this stage of the DORA search engine, ranking based on formal criteria is much more appropriate than including semantic criteria.
3. Conclusions
The final conclusions will be drawn when all evaluations have been done in June. Here some preliminary conclusions are made. Creating an interoperable and interdisciplinary search space is a difficult task. DORA is one of the first attempts to do this in a flexible and unbiased way without a specific goal in mind. It is not yet clear whether this approach is useful.
A project approach, even if it includes a few disciplines, may have specific objectives in mind that will require a careful analysis of the included semantics, and it may include strong biases. DORA was intended to make it easy to integrate other domains into the search space. Integrating another discipline requires activities at the harvesting and data preprocessing level which will not be commented on here. It was already described that most of the repositories are not yet able to offer validated, correct and stable output. The OAI-PMH protocol is important, but many repositories are not ready. Even the concept of metadata was new for some, and a fair debate showed that some question the usefulness of keyword-type metadata. Here we can see a difference between institutions that hold large collections of multimedia objects and those that are more text oriented. Discipline integration also requires various operations to integrate the semantics:
• The mappings to other metadata elements have to be added to support complex search.
• In the case of geographic descriptions one has to create a discipline-specific list of terms and relate them to nodes in a geographic thesaurus.
• In the case of content descriptions one also has to create relations to concepts used in other domains.
Currently, the effort is very high, since there is no structural support and there are no existing knowledge documents one can refer to in the area of the humanities. What is needed to support such work and also allow individuals or groups to tune the semantics to their needs is as follows:
• Open Data Category Registries that contain ISO compliant concept definitions occurring in a discipline. Compliance with standards such as ISO 11179 would guarantee a certain degree of homogeneity and increase the reusability. The definitions should be included in XML files that are associated with a schema.
These definitions should contain only those relations that are part of the proper definition of a concept, i.e., if for example the sub-class relation is important to define a concept then a relation to another concept could be included. However, it is wise to reduce this to a minimum, since relations often are a matter of disagreement even within a domain. This is also valid for the thesauri. As far as is known to us, the big thesauri have their own definition style, come with a particular access interface and are not openly available as an XML file [26].
• For the mappings we also need frameworks to easily create practical ontologies. These should be described in RDF and refer to concepts defined in open registries.
• It must be possible for users to easily create their own versions, i.e., to adapt existing relations or to add new ones.
• All these components must be machine-readable and inference engines must be available that can operate on them.
• Registration mechanisms have to be designed that allow registering knowledge components and searching for them.
• The RDF-S and OWL definitions are an excellent start to formalize relation types; however, in practical work we are often faced with fuzzy or unclear relations that cannot be described by RDF-S/OWL types.
Part of the work has been started in the area of Language Resources (ISO TC37/SC4). This can be seen as an example for starting such work in other disciplines of the humanities. It will pave the way of the humanities towards the Semantic Web. DORA is an attempt to tackle some of the problems based on open and well-structured ontology components; yet, most of them are not based on established standards.
[26] To make IconClass useful in the DORA framework, the database format used on the distributed CD-ROM had to be decoded with the help of scripts and some manual intervention to arrive at an appropriately structured XML file.
A key point for the success of DORA-like approaches with complex search based on selected metadata categories will be the flexibility for users and groups to tune the semantics. The above-mentioned steps will help in doing this, but smart and user-friendly tools have to be available. From the experience it is obvious that the choice not to offer Dublin Core as the Gold Semantic Standard was appropriate. The success of selective search will depend on the knowledge about the vocabularies and the quality of the mappings. Dublin Core presents a rather reduced vocabulary with loosely defined concepts. It is not obvious how different disciplines will map their concepts onto the Dublin Core ones, and in general this mapping is not open. So the concept of a Gold standard may be useful for cases like the domain of book descriptions, where concepts such as title, author, year of appearance and publisher have developed over many years and are used by all libraries. For purposes such as DORA, which want to go beyond these formalized elements, Dublin Core cannot be recommended. It may play a role for occasional users, but it can be questioned whether DC search is preferable to simple search. An important aspect that restricts the quality of this evaluation is the lack of detailed metadata descriptions in many cases and the comparatively small number of objects in some of the repositories. Only the Fotothek and Language repositories have a large number of records. For repositories that offer about 100 records or fewer, browsing is sufficient and then superior to searching. However, it is obvious that this will change in all disciplines since the number of digital objects stored increases extremely fast. The DORA technology has to be seen as one of the possible initiatives to indicate how difficult semantic integration is and how much has to be done in future.
We need more such attempts to build the infrastructures and tools to cope with the challenges of the Semantic Web and to prepare the disciplines of the humanities for these challenges.
D. Availability of the Code and the Knowledge Components
Since we received suggestions for optimizations from the various partners until the date of writing this report, we will finish the modification work in May 2004. After that date we will generate two ZIP packages:
• one containing all relevant code for the DORA search engine
• one containing all relevant knowledge components
We intend to have this done in mid-June and make the two parts available at the WP2 web-site. The first package will not include all the scripts that were necessary to pre-process the various data sets. We will provide the code of programs that are still in operation. With respect to the latter we have to check what the terms are for putting our XML version of IconClass on the web.