
European Cultural Heritage Online
ECHO
PUBLIC
Contract n°:
HPSE / 2002 / 00137
Title:
D2.4 Demonstrator covering the infrastructure and the
collaborative tool in an integrated way
D2.5 Report evaluating the demonstrator on the basis of
the general requirements mainly worked out in the
AGORA
Author:
Peter Wittenburg
Concerned WPs:
Workpackage 2 (Technology)
Abstract:
Published in:
Keywords:
Date of issue of this report:
16th May 2004
Project financed within the Key Action
Improving the Socio-economic Knowledge Base
WP2 Deliverable D2.1
Specification Report
Deliverables D2.4 and 2.5
Interoperable Metadata Domain
Evaluation
Version 1
Peter Wittenburg
Nijmegen
16.5.2004
This note emerged in collaboration with Lund University and contains various contributions from
almost all ECHO partners. Since the reports 2.4 and 2.5 are about the metadata infrastructure we
suggest combining them. They largely make use of reports that were partly distributed earlier:
• WP2 Note on ECHO’s Digital Open Resource Area (DORA) - WP2-TR013-2003 – Version 6
• WP2 Note on an ECHO Ontology – WP2-TR017-2004 – Version 2
• WP2 Note on the DORA Search Engine - WP2-TR018-2004 – Version 1
Content
This report includes the three WP2 reports cited at the front page and a note about the availability
of the code and the knowledge components.
A. WP2 Note on ECHO’s Digital Open Resource Area (DORA)
1. DORA Design Principles
1.1 Topology
1.2 User Interface Aspects
1.3 Selection & Searching Modes
1.4 Domains and Sub-Domains
1.5 Hitlist
1.6 Implementation Issues
1.7 Harvesting Comments
2. Metadata Mapping
2.1 Introduction
2.2 Metadata Elements for DORA
2.3 Formal Framework for Mapping
Appendix A: Metadata set used by the RMV
Appendix B: Metadata set used in the History of Science (Berlin)
Appendix C: Metadata set used by the IMSS
Appendix D: Metadata set used in the Fotothek
Appendix E: Metadata set used in the Lineamenta Project
Appendix F: Metadata set used in the Maps of Rome Project
Appendix G: Metadata set used in the Language Domain
Appendix H: Metadata set used by NECEP
Appendix I: Metadata set used in Philosophy
Appendix J: Dual Mapping between Structured Elements
Appendix K: Mapping for Views
1. DC View
2. Necep View
3. RMV View
4. Fotothek View
5. Lineamenta View
6. HoS Berlin View
7. Rome Maps View
8. IMSS View
9. Language View
Appendix L: Schemas
B. WP2 Note on an ECHO Ontology
1. Provided Components
2. Generated Components - Overview
3. Components in Detail
3.1 ECHO Concepts
3.2 ECHO Mappings
3.3 OVM-Geographic Thesaurus
3.4 MPI-Geographic Thesaurus
3.5 OVM Category Thesaurus
3.6 Iconclass Category Thesaurus
3.7 IconClass-to-OVM Mapping
3.8 OVM-to-IconClass Mapping
3.7 MPI Content List
4. ECHO Knowledge Repositories
5. Exploitation
C. WP2 Note on the DORA Search Engine
1. Search Engine
1.1 DORA Interface
1.2 Harvesting
1.3 Data Pre-Processing
1.4 Index Creation
1.5 Searching
2. Evaluation
2.1 Formal
2.2 Examples and Semantics
2.3 Ranking
3. Conclusions
D. Availability of the Code and the Knowledge Components
A. WP2 Note on ECHO’s Digital Open Resource Area (DORA)
Peter Wittenburg
24.02.2004
1. DORA Design Principles
DORA is the portal that offers discovery services for various resources that were and are created
by major European initiatives, in particular by the ECHO initiative. The ECHO initiative is
gathering resources in the five different disciplines Linguistics, History of Art, History of Science,
Ethnology and Philosophy.
Under the header of Linguistics, resources from a couple of other initiatives will be made
available as well:
• the INTERA project, whose goal is to create an integrated domain of language resources;
• the DOBES project documenting endangered languages all over the world;
• the MPI and the Lund University language resources.
While the linguistic part in ECHO focuses on minority languages such as Sign Language and
linguistic objects with a heritage aspect, INTERA focuses on major languages and on combining
language resource centers in Europe, and DOBES focuses on languages (in particular
non-European ones) that will probably become extinct within a few years. By combining these
initiatives, and the MPI for Psycholinguistics as well, DORA will offer access to a large set of
resources and thereby form a critical mass.
Under the header of Ethnology also various resources will be made available: the NECEP
society database, the collection of the DOGON project and the large collection of the Dutch
Ethnology Museum (RMV). Other resources may be integrated as well, at a later time.
In the area of History of Arts three databases will be added: Fotothek, Lineamenta and ancient
maps of Rome. All are housed in the Bibliotheca Hertziana.
In the area of History of Science a number of collections will be part of the DORA domain. IMSS
Florence will contribute with its large collection and institutions such as U Bern, MPI for History of
Science and perhaps others will contribute as well.
In the area of Philosophy the collection of texts from the ECHO partner will be integrated.
DORA offers various access methods, primarily to the metadata descriptions, as a simple and
easy navigation space. Hits will allow the users to access the resources themselves, given that
they have the proper access rights. The metadata descriptions are openly accessible. Access
to the resources, which can be text, images, movies, sounds and 3D objects, may be restricted.
Various views and access mechanisms will be available to meet the requirements of the different
user groups.
The language resource domain within DORA mainly uses the IMDI metadata standard,
although this is not a requirement. The IMDI domain is therefore a large sub-domain in DORA.
Many other holdings use different metadata sets, so various mappings have to be carried out
to create a unified umbrella. This is described later in this document.
Initially, Lund U and the MPI Nijmegen will maintain DORA. However, others can set up a
similar portal since the sources will be made openly available.
1.1 Topology
The DORA service is a central one, i.e. all metadata will be harvested at a central server and
stored optimally for fast access. This implies that the central server will only have copies of data,
the original copies will stay at the original institutions where they also may be subject to changes
and extensions. With each partner, a procedure will be discussed that will allow us to harvest the
metadata records. The DORA service is not a service that extends to the resources themselves,
i.e. the metadata may have references to the digital objects they describe such as images, texts,
sound files or movies, but these resources stay at the institutions. If a certain institution does not
have sufficient resources to house videos, ECHO could act as an umbrella and also house the
resources at a central server [1].
Summarizing we can conclude that in the DORA metadata scenario all institutions act as data
providers, i.e. they offer their metadata records for being harvested by the DORA service
providers. Different protocols will be necessary to harvest the data. Different types of records will
be offered by the different institutions.
[Figure: DORA topology. DORA service providers: the mapping of data and the different types of
searches will be carried out on service providing machines. Data providers: all data providers
provide their metadata records via the OAI harvesting protocol, except for IMDI, NECEP and
philosophy, where the XML files will be used.]
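The harvesting step for the OAI-based data providers can be sketched roughly as follows. This is a minimal sketch, assuming a data provider that speaks OAI-PMH 2.0 and exposes Dublin Core (`oai_dc`) records; the base URL `http://example.org/oai`, the sample response and the function names are hypothetical and only serve as illustration. No network access is performed: the sketch only builds the request URL and parses a response that is already at hand.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH the resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, [titles])], resumption token or None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("{%s}record" % OAI_NS):
        ident = rec.findtext("{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        titles = [t.text for t in rec.iter("{%s}title" % DC_NS)]
        records.append((ident, titles))
    token = root.findtext("{%s}ListRecords/{%s}resumptionToken" % (OAI_NS, OAI_NS))
    return records, (token or None)


# Hypothetical response from a data provider (shortened).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:obj-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example object</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE_RESPONSE)
```

A harvester would repeat the request with the returned resumption token until no token is left. Providers that deliver plain XML files (IMDI, NECEP, philosophy) would need a separate reader.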
1.2 User Interface Aspects
First we want to list a number of requirements for the user interface:
• it has to support the normal working environments such as web browsers (first a limited
set of browsers will be supported)
• it has to be simple and robust
• it has to look professional for the normal web user
• it has to offer simple Google like search on metadata as the first choice [2]
• users can select the domain they want to search in - the default domain is “all”
o a preference file has to support that different users have different defaults
(question where to store this: on server or as bookmarks, ...)
• users can select a certain view (domain specific vocabulary) to specify their queries
• the opening page has to be attractive, i.e. the layout has to be designed carefully
• all pages must use one underlying style
• the opening page has to
o allow to jump to geographic browsing (no idea yet whether we can include other
resources than from languages and ethnology)
o allow to jump to IMDI type tree browsing
o allow to go to the specific search engines provided by the disciplines such as the
full IMDI infrastructure
• the opening page should contain all relevant links (ECHO, IMDI, MPI, DOBES, ELRA,
Lund, INTERA, ...)
• it has to be checked in how far we want to extend to DC/OLAC repositories, i.e. in how
far we want to harvest other sites
• the DORA service should allow OAI (DC) service providers to harvest its holding
• the first version must be ready as soon as possible, i.e. when components are ready they
should be made visible
[1] Under certain circumstances the MPI for Psycholinguistics could house resources.
[2] In a second version a lexicon could be displayed to help people to find suitable terms while indicating the
domain from which they are taken.
DORA Main Page
(test page is available under: corpus1.mpi.nl/ds/dora_demo2; please, note that it is under construction)
[Figure: DORA main page, showing geographic selection (if possible), domain & sub-domain
selection, complex structured search offering domain dependent views (terms & explanations),
browsing (if possible) and a Google like full text search field.]
This figure [3] indicates the major elements of the DORA user interface. It will support simple
search, complex structured search, selection of domains and, where possible, geographical and
hierarchical browsing. This version does not yet indicate the possibility to extend the simple
search to metadata (keyword type), annotations (general type of metadata) and relations.
For all forms of search (simple and complex) the terms used in the descriptions will be
shown in a separate window. This will facilitate searching, since it informs the user about
what exists and minimizes typing errors. The best way to offer the word list in a structured way
still has to be worked out, since it can become very long.
[3] An appropriate symbol representing philosophy is still missing.
Complex Search Page
When the user selects Complex Search the following page will show up:
[Figure: complex search page. Labels: search domain is selected; selection of complex search;
selection of view (domain vocabulary for complex search), here Ethnology with NECEP view and
RMV view; query input fields.]
The user can still select the domain and sub-domain he/she wants to search in and whether
he/she wants to search on metadata, annotations and/or relations. When a special view is
selected, a suitable vocabulary that the user may be more familiar with will be shown. The
offered fields can be used to enter strings to form the structured query. In general we will use a
subset of elements from the different domains. Candidates are those elements that can be
mapped to other domains. Users who want to do more specific searches using elements that
cannot be mapped will be able to go to the specific search engines.
One of the detailed views is the DC view; it will offer the well-known 15 DC elements.
Browsing Page
Currently, we see two domains where browsing in metadata domains is an issue. IMDI uses this
concept for language resources and the Alcatraz environment seems to support browsing
according to some thesaurus. Where possible we will support browsing in such metadata
domains.
Browsing should also interact with simple search: any node selected by browsing specifies a
sub-domain. If a user has selected a node by browsing, it should therefore be possible to do a
simple search and use the node as a selection criterion to narrow down the search space.
Since date information is used by many metadata sets, it has to be checked how far it is
possible to generate a browsable tree that orders resources according to their date.
Geographic Browsing Page
One very popular form of browsing is to use geographical information. Since many metadata sets
are using geographic indicators such as continent, country, region and place it may be possible to
add this type of information to geographic maps such that people can make selections based on
these maps.
DORA has to differentiate the different usages of the geographical information, i.e. the place of
origin is not the same as the place where an object is located. In general one would use the place
of origin within the DORA framework. This has to be analyzed in more detail.
Here again it is important to allow selection criteria, i.e. to show information only for the selected
domains and sub-domains. In many cases it is a problem to associate a document with
geographical maps. A society will live within a region, but drawing regions can easily cause
political problems. Therefore, DORA will associate information with useful points on the maps,
although this is not optimal in many respects.
The world map can be broken up into a number of sub-pages at two or three levels. A possible
second layer is indicated in the figure above. That should be sufficient to mark all points with
sufficient detail. There may be some detail maps, as for the History of Arts, where most resources
point to places in Italy. When a point is selected by clicking, all resources are shown as hits so
that people can view or listen to them.
1.3 Selection & Searching Modes
Here we want to summarize the searching modes again.
• Domain Selection. The user can select the domains he wants to operate in, and this has
to affect the search and selection modes except the geographic one. We will offer
domains and sub-domains for selection.
• Resource-Type Selection. The user can select to operate on metadata, annotations
and/or relations in the simple search mode.
• Simple Search offers Google like facilities and initially the user does not get any
help. At a later stage one could think of a lexicon of all possible terms. This simple search
operates on an index that contains all metadata values that occur in the participating
domains. This includes in particular the descriptions since, for example in ethnology,
especially the descriptions contain the useful material. In doing so simple search ignores
all structure of the metadata sets and therefore loses the high precision of structured search.
• Complex Search offers a few major categories of each domain with a domain specific
naming. In particular those categories that can be mapped between the disciplines
should be mentioned. It has yet to be defined which categories will be made available. Of
course, in this mode the controlled vocabularies should be available to guide the users.
• Browsing can be chosen to navigate in browsable domains such as the IMDI world with
normal web browsers, making use of html created on the fly. The possibility of
automatically creating a historical browsing tree will be investigated.
• Geographic Selection can be chosen by clicking on the world map. The only possibility
is to click on marked spots; this results in a list of all sessions belonging to the spot,
which are then displayed. It has to be checked how far this can be improved by linking to a
node in browsable trees. So clicking on a spot in the map will execute a complex search
with the location and/or item information (this has to be carefully checked).
• Domain-Specific Search. The user has the possibility to go to the domain specific
search that will offer all fields for that particular domain or sub-domain.
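The simple search mode described in this list can be illustrated with a small inverted index. This is a minimal sketch under stated assumptions: the two records, their element names and the AND semantics for multi-term queries are hypothetical and not taken from the DORA implementation. The point it shows is that all metadata values end up in one index and the element names (the structure) are deliberately ignored.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case a value and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(records):
    """records: {record_id: {element_name: value}}.

    Only the values are indexed; element names are ignored, which is
    what makes this a structure-free (simple) search.
    """
    index = defaultdict(set)
    for record_id, elements in records.items():
        for value in elements.values():
            for token in tokenize(value):
                index[token].add(record_id)
    return index


def simple_search(index, query):
    """Return the ids of records containing every query term (assumed AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


# Hypothetical records from two different sub-domains.
RECORDS = {
    "rmv-1": {"title": "Wooden mask", "description": "Mask used in a harvest ritual"},
    "imdi-7": {"Title": "Harvest song", "Genre": "song"},
}
index = build_index(RECORDS)
```

With these records, a query for `harvest` matches both entries although the term occurs in differently named elements; this is the loss of precision relative to structured search that the text mentions.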
Use of Mappings
Since DORA will combine different domains, terminologies have to be mapped while searching.
The detailed mappings have to be worked out. The mappings will be used when performing a
complex search. In simple search any term can be entered and the program does not know which
view the person takes, so term mapping does not make sense for simple search.
In complex search a user takes a view. This activates a number of mapping tables from the
chosen user view to the other domains. The mappings will extend and modify the search query
for the other domains.
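How a view activates mapping tables and extends a query can be sketched as follows. This is only an illustration under assumptions: the mapping table, the domain names `domain_a` and `domain_b` and the target element names are hypothetical, since the detailed mappings still have to be worked out.

```python
# Hypothetical mapping tables: view -> element -> {target domain: [target elements]}
MAPPINGS = {
    "dc": {
        "creator": {"domain_a": ["maker"], "domain_b": ["author", "collector"]},
        "title": {"domain_a": ["object_name"]},
    }
}


def expand_query(view, query, mappings=MAPPINGS):
    """Translate {element: term} given in the chosen view into per-domain queries.

    Returns {domain: [(element, term), ...]}. An element that has no mapping
    for a domain produces no constraint for that domain.
    """
    expanded = {}
    for element, term in query.items():
        for domain, targets in mappings.get(view, {}).get(element, {}).items():
            for target in targets:
                expanded.setdefault(domain, []).append((target, term))
    return expanded
```

For example, `expand_query("dc", {"creator": "Galilei"})` yields one constraint for `domain_a` and two alternative constraints for `domain_b`. Whether alternatives are combined with OR, and how weak mappings are flagged, is left open here.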
1.4 Domains and Sub-Domains
DORA knows a number of domains and sub-domains. They can be changed in a domain
configuration file.
The Domains and Sub-Domains are:
• Languages
o ECHO
o IMDI Domain
o INTERA
o DOBES
o MPI Nijmegen
o Lund
• Ethnology
o NECEP Paris
o DOGON Leiden
o RMV Leiden
• History of Arts
o Lineamenta
o Fotothek
o Ancient Maps of Rome
• History of Science
o IMSS Florence
o Collections from Bern and Berlin
• Philosophy
o Philosophy Paris
The domain configuration file also has to include addresses that can be used for harvesting
purposes. This configuration file can be used to generate the entries and menus. An indication
of its fields is given below. The details have to be worked out.
domain-name | sub-domain-name | protocol | address | web-site | cv addresses
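One possible shape of such a configuration file, and its use for generating menus and harvesting targets, is sketched below. The semicolon-separated layout, the `example.org` addresses and the function names are assumptions made for illustration; only the six field names come from the indication above.

```python
import csv
import io

# Hypothetical configuration content; all addresses are placeholders.
CONFIG_TEXT = """\
domain-name;sub-domain-name;protocol;address;web-site;cv-addresses
Ethnology;RMV Leiden;oai;http://example.org/rmv/oai;http://example.org/rmv;
Languages;DOBES;xml;http://example.org/dobes/imdi.xml;http://example.org/dobes;http://example.org/cv/genre.xml
Languages;INTERA;xml;http://example.org/intera/imdi.xml;http://example.org/intera;
"""


def load_config(text):
    """Parse the configuration into a list of dicts, one per sub-domain."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


def build_menu(entries):
    """Group sub-domains under their domain, preserving file order."""
    menu = {}
    for entry in entries:
        menu.setdefault(entry["domain-name"], []).append(entry["sub-domain-name"])
    return menu


def harvest_targets(entries, protocol):
    """Return the harvesting addresses of all sub-domains using a protocol."""
    return [e["address"] for e in entries if e["protocol"] == protocol]


entries = load_config(CONFIG_TEXT)
```

Keeping the domain list, the harvesting addresses and the CV addresses in one file means that adding a sub-domain changes both the menus and the harvesting schedule in a single place.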
1.5 Hitlist
All hits as search results have to be shown in a uniform way offering the DORA style and a
number of choices. The web site should immediately allow the user to continue searching etc.,
i.e. the current selection and navigation mode should be shown again. Here we can learn from
Google to optimize ergonomics.
From the hit list it should be possible to
• view the metadata record and from there jump to other sources such as info files or
articles (references)
• view and listen to the resources
• invoke other shells that allow the user to go on with navigating and visualization (how
this can be done has to be discussed in detail) [4]
If it is not possible to refer directly to the resources, a suitable shell from the
participating sites has to be invoked with the correct arguments. For streaming audio/video,
communication with a streaming server has to be realized.
[Figure: schematic hit list. Each hit (session X, session Y, session Z) is shown with its domain,
its sub-domain, its metadata record (MD) and its resources (wav, mpg, text, jpg).]
The layout for the hit-list page is only indicated schematically. The presentation as a simple list is
not at all optimal, since people want to exploit results in a more suitable form. But in the first
version nothing special will be done. Google-like designs should be considered.
Initially there is no rating involved. Due to the involvement of different domains we first
have to gain experience with result lists. Different domains may require different criteria for
determining the relevance of a document.
Possible criteria could be:
• hit comes from structured vs. non-structured information
• weak mappings are indicated and drop the rating
• spelling differences between terms
• frequency of terms found in a metadata record and in associated documents
This has to be sorted out in a later phase.
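Purely as an illustration of how the criteria listed above could be combined once rating is introduced, a hedged sketch follows. The function name, the parameters and all weights are arbitrary assumptions; the document does not define any rating formula.

```python
def score_hit(structured, weak_mapping, exact_spelling, term_frequency):
    """Combine the four candidate criteria into one score (illustrative weights).

    structured      -- hit comes from a structured element rather than free text
    weak_mapping    -- hit was reached through a mapping marked as weak
    exact_spelling  -- query term and found term are spelled identically
    term_frequency  -- occurrences of the term in the record and its documents
    """
    score = 1.0
    if structured:
        score += 1.0          # structured information counts more
    if weak_mapping:
        score *= 0.5          # weak mappings drop the rating
    if not exact_spelling:
        score *= 0.8          # spelling variants are slightly penalized
    score += 0.1 * min(term_frequency, 10)  # capped frequency bonus
    return score
```

Since different domains may require different criteria, such weights would probably have to be configurable per domain rather than fixed in code.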
1.6 Implementation Issues
On the client side normal html and JavaScript are used. For streaming services the QT client has
to be invoked (QT has to receive the right parameters to be able to request the execution of a
certain file) and, for example, for full IMDI requests the IMDI browser can be used. It has to be
checked how far controlled vocabularies have to be used to support structured search or
whether it is better to offer the actual terms used. On the server side Perl/XSLT scripts will be
used to generate the html information that is extracted, for example, from the IMDI and other XML
files.
[4] Users may want to go from a hit, for example about a DOGON building, directly to images or to the guided
DOGON tour that is available at a web-site.
[Figure: DORA implementation overview. Components shown: client, IMDI browser, QT, other
interfaces, http server, stream server, Perl, JSP, IMDI XML, index files, structure file, mapping
and CVs.]
JavaServer Pages (JSP) will be used to solve all other aspects on the server side. They will
access index files to quickly generate results in the two searching modes. It has to be sorted out
whether the full text search will need a different kind of index structure than the one used for the
structured search. The JSP need the mapping files for cross-discipline activities.
The JSP need the IMDI structure file to support the restricted search that was described for the
browsing page. When someone is browsing, for example in the IMDI domain, a selected node
could be the start for an additional search, i.e. this requires that the selection made is known to
the JSP. To restrict the search the JSP have to know which sessions belong to that node.
Perhaps controlled vocabularies have to be supported in the second phase. In the configuration
file all CVs used have to be specified by their address and the category they are associated with.
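The node-restricted search needs a way to find all sessions below a browsed node. A minimal sketch of that lookup is given here, written in Python purely for illustration (the text foresees JSP on the server side); the tree, the session identifiers and the function names are hypothetical stand-ins for the content of the structure file.

```python
# Hypothetical content of a structure file:
# node -> child nodes, and node -> sessions attached directly to it.
CHILDREN = {"corpus": ["europe", "africa"], "africa": ["dogon"]}
SESSIONS = {"europe": ["s1"], "africa": ["s2"], "dogon": ["s3", "s4"]}


def sessions_under(node, children=CHILDREN, sessions=SESSIONS):
    """Collect all sessions attached to a node or to any of its descendants."""
    found = set(sessions.get(node, []))
    for child in children.get(node, []):
        found |= sessions_under(child, children, sessions)
    return found


def restrict_hits(hits, node):
    """Keep only those search hits that belong to the browsed node."""
    allowed = sessions_under(node)
    return [hit for hit in hits if hit in allowed]
```

A search started from the node `africa` would thus only return hits among `s2`, `s3` and `s4`, whatever the unrestricted index lookup delivers.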
1.7 Harvesting Comments
With respect to the harvesting some general comments should be made for clarification:
• Only data from known sites will be harvested, i.e. data on local notebooks or so are not
considered.
• The amount of searchable data can become fairly large, in particular if we integrate
annotations and relations.
• We assume that the repository content will change, i.e. harvesting should be carried out
at regular intervals. This has to be discussed in more detail with the partners depending
on the experiences.
• The MD schemas may change. Special attention has to be drawn to such occasions.
• Keyword-value pairs as possible in IMDI will initially be treated as descriptions.
• Those who choose to be harvested via the OAI harvesting protocol have to register as OAI
data providers. MPI for Psycholinguistics can offer help.
2. Metadata Mapping
WP2 has to realize an infrastructure for joint searching and, where possible, browsing covering all
disciplines in ECHO: history of arts, history of science, ethnology, linguistics and philosophy. The
metadata sets applied in the different fields differ in many ways, so mapping is required.
Further, the interface has to be offered in several languages, so translations of all terms into
these languages are required. We also have to accept that at this moment the element names
used are not yet defined in open repositories according to international standards such
as ISO 11179. We lack appropriate and accepted tools and repository structures.
Therefore this note suggests preliminary structures for open repositories (available at the WP2
site) that contain element definitions, translations to some languages and relations between the
elements. The information has to be such that it can be easily transformed into future
frameworks. In this document version we will not yet translate the schemas into RDF, but first
describe the structures in XML. The RDF formulations will be added later. What we will do is
describe the immediate requirements resulting from establishing a common search infrastructure.
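To make the idea of a repository entry with definition, translations and relations more concrete, a small sketch follows. The XML layout, the tag and attribute names, the element identifiers and the sample content are all hypothetical; they are not the schemas of this report and only illustrate the three kinds of information mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository fragment; not the actual ECHO schema.
REPOSITORY_XML = """
<repository>
  <element id="view_x:creator">
    <definition>Person or body that made the object.</definition>
    <translation lang="de">Urheber</translation>
    <relation type="equivalent" target="view_y:maker"/>
    <relation type="broader" target="view_z:agent"/>
  </element>
</repository>
"""


def load_elements(xml_text):
    """Read element definitions, translations and relations into a dict."""
    elements = {}
    for el in ET.fromstring(xml_text).iter("element"):
        elements[el.get("id")] = {
            "definition": el.findtext("definition"),
            "translations": {t.get("lang"): t.text for t in el.findall("translation")},
            "relations": [(r.get("type"), r.get("target")) for r in el.findall("relation")],
        }
    return elements


elements = load_elements(REPOSITORY_XML)
```

Because each relation is a typed link from one element identifier to another, a structure of this kind could later be rewritten as RDF statements without loss of information.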
2.1 Introduction
We are faced with several domain and sub-domain ontologies that all use their own definitions of
elements (terms), i.e. there is no such thing as a common ontology. Therefore, within ECHO we
have to develop a framework that allows the mapping between the different metadata sets.
First, we would like to briefly characterize the metadata sets of the participating
domains/sub-domains.
domain = languages
all metadata is filled in according to the IMDI standard; so sub-domains are included just
as other linked IMDI repositories;
sub-domain = all contributors share the same element semantics
the metadata set includes a rich description that describes the project, the researchers,
the formal nature of the resources and their contents; it contains about 40 elements and
points to the raw and derived resources
the metadata set was designed to manage and discover resources in a large distributed
scenario
the number of metadata records is currently larger than 20,000; due to ongoing work this
number is continuously increasing;
for the metadata details see www.mpi.nl/IMDI
domain = ethnology
sub-domain = NECEP (database of societies)
with the help of an exhaustive set of elements (about 150) researchers are describing
societies; in addition prose texts elaborate on certain aspects of societies and explain
how to interpret the chosen values; where possible additional media resources illustrate
aspects;
the metadata set was designed to describe societies in great detail and also to easily find
information on societies;
the database is in its beginning phase, i.e. there are only a few records and the
expectation is to have about 10 controlled ones at the end of the ECHO project;
for the metadata details see appendix H
domain = ethnology
sub-domain = Dutch Ethnology Museum (RMV)
RMV has a huge collection of ethnological objects (>250,000) of which only a few are
available in digital form and described by metadata (> 3500); every year the digital
collection increases in size by about 3500 objects;
for budget reasons only 12 elements are used to describe the objects;
metadata is used to easily discover objects in the digital archive;
for the metadata details see appendix A
domain = history of arts
sub-domain = fotothek database (Bibliotheca Hertziana)
The Fotothek is a large collection of partly related digital images (6,000 images, 100,000
descriptions); all images are described by metadata that are created according to the
MIDAS standard that uses the IconClass thesaurus to encode the content;
the MIDAS standard is an exhaustive set that has elements to describe the creator, the
involved archives, the content ??; it also encodes hierarchical relationships;
metadata is used for management and discovery purposes;
for the metadata details see appendix D
domain = history of arts
sub-domain = lineamenta database
The lineamenta database is a new database, its new integrated design was developed to
include all sorts of information; survey type of metadata is included in different tables;
internally they use a rich metadata set, but only comparatively few fields will be exported
to fit with the metadata scheme introduced by history of science (see below); in total
there are 500,000 drawings, but the project assumes that at the end of the ECHO project
about 300 drawings will be described; internally
domain = history of arts
sub-domain = ancient maps of Rome database
The maps of Rome database is currently a small database of about 200 maps described with
the help of metadata; the detailed set still has to be investigated, although first data has been
provided.
domain = history of science
sub-domain = Berlin/Bern
The metadata set is a new one and contains about 30 elements; it is possible to add
another 15 elements taken from Dublin Core;
most of the metadata elements are used for administrative purposes, i.e. only a few can
be used for resource discovery, in particular in cross-discipline environments;
for the metadata details see appendix B
domain = history of science
sub-domain = IMSS Florence
IMSS has a large collection of instruments, documents and artistic objects, all of which are
being catalogued; recently a new metadata set has been worked out that uses the Dublin Core
fields as the core and has a couple of extra fields for each document type; the total
number of fields is therefore about 40 and the set is flat; IMSS has just started to fill in
these templates to describe its holdings
domain = philosophy
The philosophy domain does not have sub-domains; the philosophy group from Paris is
working on a fully-linked rich dictionary that translates “terms” into different languages;
there will be a limited set of lexical entries (terms) at the end of the ECHO project; typical
metadata fields are used to describe the lexical entries; a precise set is being determined
currently – it will be extracted from the texts
2.2 Metadata Elements for DORA5
DORA offers a number of ways for searching: full-text searching on all metadata elements (and
even beyond keyword type metadata), structured search offering selected elements and
geographical search where possible. For people with detailed queries the portal will link through
to the specialized sites.
5
DORA = the ECHO portal called Digital Open Resource Area
All ways of searching are based on metadata (and partly on annotation) harvesting. The DORA
service provider applies two methods of harvesting as described in chapter 1.1. The DORA
service will harvest complete records such as they are offered by the data providers. Filtering and
indexing as necessary for the different search options will be done by the DORA service.
How the annotations and relations will be harvested has to be checked in a second phase.
Initially they do not fit the OAI model, since the required Dublin Core set cannot be
provided, so registration as an OAI data provider is not possible. If the data is openly
available in XML format, the easiest way would be to read the XML files.
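The two harvesting routes can be sketched as follows. This is only a minimal illustration under stated assumptions: the provider descriptions and URLs are hypothetical placeholders, the function names are ours and not part of the DORA code, and `oai_dc` is the Dublin Core metadata prefix that the OAI protocol expects every data provider to support.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def harvest_plan(provider):
    """Choose the harvesting route for one provider description (a dict).

    Hypothetical keys: 'oai_base_url' for registered OAI data providers,
    'xml_url' for providers that only expose plain XML files.
    """
    if provider.get("oai_base_url"):
        return ("oai", oai_list_records_url(provider["oai_base_url"]))
    # Annotations and relations cannot supply the Dublin Core set,
    # so their XML files are simply read over HTTP.
    return ("xml-over-http", provider["xml_url"])
```

The sketch only decides which request to issue; fetching, filtering and indexing would remain the task of the DORA service.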
2.2.1 Full-text Search
For full-text search we will include all fields of the different metadata sets and optionally
annotations and relations. We assume that those fields that don’t bear meaningful information to
be queried such as addresses, references/links, contact names etc will not decrease the
precision and recall significantly.
The DORA service provider will harvest6 all metadata information that will be offered by the data
providers and for full-text search create joint indexes. These will be created such that we can
trace back from which domain and sub-domain the hits were taken.
For full-text search there are no different views, i.e. no specialized domain-specific vocabulary.
The consequence is that full-text search initially does not support semantic mapping. The
search should, however, offer a wordlist that shows the user the possibilities while typing a
query. This feature can also be used for checking typing errors and for easy completion.
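A joint index that keeps track of the originating domain and sub-domain, together with a wordlist for completion, could look roughly like the following sketch. The record layout and the sample records are invented for illustration; the real DORA index structure is not described in this note.

```python
from collections import defaultdict
import re


def build_joint_index(records):
    """Map each lower-cased word to the set of (domain, sub_domain, record_id) it occurs in."""
    index = defaultdict(set)
    for rec in records:
        origin = (rec["domain"], rec["sub_domain"], rec["id"])
        for value in rec["fields"].values():
            for word in re.findall(r"\w+", value.lower()):
                index[word].add(origin)
    return index


def search(index, query):
    """Return the origins of the records that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


def wordlist(index, prefix):
    """Sorted index words starting with the prefix, usable for completion and typo checks."""
    prefix = prefix.lower()
    return sorted(w for w in index if w.startswith(prefix))


# Invented sample records from two sub-domains.
records = [
    {"domain": "ethnology", "sub_domain": "RMV", "id": "r1",
     "fields": {"title": "Wooden mask", "description": "Ritual mask from Brazil"}},
    {"domain": "history of arts", "sub_domain": "Fotothek", "id": "f1",
     "fields": {"title": "Map of Rome"}},
]
index = build_joint_index(records)
```

Because every hit carries its (domain, sub-domain) pair, results can be grouped or filtered by origin without a second lookup.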
2.2.2 Structured Search
To support structured search we have to be selective and only support those elements that can
be mapped between the different domains and sub-domains. We can expect that the user who
wants to search for domain-specific details will always want to use domain-specific interfaces.
For inputting and executing queries two options have to be available:
• The user must be able to select the domains and sub-domains the search should include.
• The user must be able to select a view (terminology) to input his query. Since there are
  even large differences between the terminologies used by the sub-communities, the user
  must be able to select a sub-community view.
In addition to the domain/sub-domain views we will add a Dublin Core view that offers the
Dublin Core vocabulary. The table below gives a first idea of which fields will be used from the
different domains/sub-domains and how they can be mapped. Since there are so many
differences between the domains, we started with dual (pairwise) mapping schemes between two
sets and derive from these the mappings for each view. In the table the mapping from Dublin Core
to the other domains serves as a basis for explanation. We have to develop such mapping
schemes for every view, since we cannot yet identify a common base such as the common list of
concepts used in WordNet.
Initially we will exclude the unmarked (white) fields from the view, since they do not seem
to offer very promising results.
From this exemplary table it is obvious that the semantic mapping of the metadata elements is
not at all trivial. The decisions made can lead to misleading results and wrong conclusions.
Therefore, it is necessary to allow people to use other mapping schemes. This would mean that it
must be possible either to set up a new service provider easily or to influence the logic
machine by pointing to different ontologies.
6
Harvesting will be done by requesting XML files using HTTP or by applying the OAI MH protocol. The
details are described in other WP2 documents.
As examples of the problems, three cases are discussed in the following paragraphs:
• the simpler one of “geographic location”
• the slightly more difficult one of “languages”
• the more difficult one of mapping content
DC
Ethnology
NECEP
RMV
Title
History of Arts
Fotothek
Lineamenta
object name
object title
title
Creator
name artist
person
Subject
categorization
title of building
prim icono
sec icono
object
keywords
name artist
date
period
object type
Description
Publisher
Contributor
Date
date
Resource
Type
Format
Resource ID
Source
Language
society name
language name
Relation
Coverage
Time
Coverage
Location
date
Continent
Country
Ethnic Region
cultural region
geo region
date
period
location
content place
History of Science
Berlin
IMSS
title
title
creator
participant
keywords
subject
content
language
person
m.author
contributor
participant
date
m.year
date
date
doc type
doc type
type
type
mime type
format
format
language
language
language
language
content.language
date
year
m.date
m.year
coverage.t
date
coverage.l
Continent
Country
Region
location
m.title
creator
m.author
Languages
IMDI
Rights
For almost all metadata sets it makes sense to describe the location with which the resource is
primarily associated.
• In NECEP the area is described where the society is located, i.e. related objects
  such as images, videos etc. are also associated with that geographical area. The
  information is contained in three levels of detail.
• In the RMV catalogue the areal information is contained in the two fields “cultural region”
  and “geographic region”. The cultural region is ambiguous, since in many cases ethnic
  information will be mentioned.
• The Fotothek has two entries that could map. It has an element “location” that
  contains information about the place of creation. The element “content place” refers to a
  place that is referred to in the document itself (a painting created in Rome can include a
  scene from Egypt).
• The IMDI set used in the languages domain has elements that refer to the geographical
  area in three levels.
• DC has the field coverage, which has a qualifier for areal coverage.
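The geographical case can be illustrated with a small lookup that lists, per metadata set, the fields named above as carrying location information. This is a sketch under assumptions: the field spellings follow the mapping table, while the record layout, the sample records and the simple substring matching are ours and stand in for whatever value matching DORA will actually use.

```python
# Fields that carry geographical information, as named in the text above.
GEO_FIELDS = {
    "NECEP": ["continent", "country", "ethnic region"],
    "RMV": ["cultural region", "geographic region"],
    "Fotothek": ["location", "content place"],
    "IMDI": ["continent", "country", "region"],
    "DC": ["coverage"],
}


def find_by_location(records, place, selected_sets=None):
    """Return ids of records whose geographical fields mention the place."""
    place = place.lower()
    hits = []
    for rec in records:
        if selected_sets is not None and rec["set"] not in selected_sets:
            continue
        for field in GEO_FIELDS.get(rec["set"], []):
            if place in rec["fields"].get(field, "").lower():
                hits.append(rec["id"])
                break  # one matching field is enough for this record
    return hits


# Invented sample records.
records = [
    {"set": "NECEP", "id": "n1", "fields": {"country": "Brazil", "continent": "South America"}},
    {"set": "Fotothek", "id": "f1", "fields": {"location": "Rome", "content place": "Egypt"}},
    {"set": "RMV", "id": "m1", "fields": {"geographic region": "Brazil"}},
]
```

Note that a query for "Egypt" finds the Fotothek record through “content place” even though the object was created in Rome, which is exactly the ambiguity described for that set.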
The elements that contain language information have two different meanings, they can refer to
the language a document is about or a language a document is in. So a text can be in English,
but describe the Trumai language. Different user groups are interested in different aspects of this.
• DC’s language field has the meaning “the language a document is written in”. One would
  describe the language a document is about in the “subject” element. Yet there is no
  qualifier for this, so we don’t know whether the element is used to encode this.
• NECEP has a language element, but it also has a society element. Often the language
  and society names are the same or at least similar.
• The HoS-Berlin set has the element “language”, but it is assumed that they only code the
  language a document is written in.
• The IMDI set is specialized and has options for both.
In fact we can’t differentiate between the two meanings at the beginning.
The most difficult element (element sub-set) is the content description. Completely different
dimensions and thesauri are used for content encoding.
• DC uses the element subject, which is however not specified in more detail. So it can
  include all types of content description values.
• The NECEP set is meant to describe societies, so the society is the object. In this way
  almost all elements describe the content.
• The RMV catalogue has an element called categorization. The value this element can
  take is a list of keywords extracted from the SNVT thesaurus (see appendix A). So
  basically the content description has one dimension filled with keywords classifying a
  given object.
• The Fotothek primarily uses two entries, “primary iconography” and “secondary
  iconography”. Both elements can have values that are taken from the complex IconClass
  thesaurus (see appendix D). The construction is similar to that of RMV; however, the
  classes differ considerably.
• The HoS Berlin archive has the element “keywords” in its metadata sets, but they are not
  yet specified.
• The IMDI set has a rather elaborate sub-set to describe the content. The sub-elements
  are Genre, SubGenre, CommunicationContext, Task, Modality, Subject, Description and
  Keys7. Task and Subject, both of which are fairly unconstrained, can be mapped most
  easily to what other domains describe as content.
[Figure: the selected view is linked by mappings to the metadata sets K, L, M and N]
Special concern has to be devoted to the question of how to map the content descriptions to
allow useful joint queries. We first have to check how these elements are actually used within the
domains. A careful analysis may reduce the necessary effort.
In summary, only by starting with pairwise comparisons did we arrive at useful
interpretations (see appendix J). From these we will derive, per view, mappings to all other sets as
indicated in the above figure. We also realize that at this moment we start from the proper
definitions of the semantics of the elements. However, it is known that the usage of the elements
varies to a certain extent, i.e. in the second phase we will have to check the actual usage of the
elements.
7
The Language element, describing the language the resource is about, is also part of the content
description block.
2.3 Formal Framework for Mapping
The mapping requires a number of information types:
• definition of terms in English (element names, controlled vocabulary elements)
• dedications of all terms to the following languages:
  o French
  o German
  o Italian
  o Swedish
  o Dutch
• the relations between the terms
• alternatives (synonyms) in some cases, as for language and society names
Alternatives are seen as a special type of relation.
All definitions will appear in the DORA namespace for matters of simplicity, although the IMDI
definitions are currently being integrated in open RDF-based repositories.
For the term definitions we will use the following schema8:
termID
term-name
term-XPath
domain-name
sub-domain-name
description
dedications
fre = french-name
ger = german-name
ita = italian-name
swe = swedish-name
dut = dutch-name
For the relations we will use the following schema:
namespace:termID
namespace:termID
relation-type
match-factor
The terms can be elements of the metadata sets, but also elements of the controlled vocabularies
of elements. In some cases thesauri are used. It remains to be analyzed to what extent equality of
nodes in such thesauri implies equality of sub-trees.
Within the project we have to find out what kind of relation types will be used. Initially we
will make use of the “equality” relationship from OWL and define a “maps_to” relationship. This
relationship is associated with a matching factor that specifies the degree of match on a scale
from 1 to 3, with “1” meaning an almost perfect match. This can be used while searching as an
indicator of how much noise is expected. It could also be used for ranking.
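The two schemas and the use of the matching factor for ranking can be sketched as follows. This is an illustration only: the Python class layout, the term identifiers and the factors in the sample relations are invented, and the real schemas are the ones referred to in appendix L.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str
    name: str
    xpath: str
    domain: str
    sub_domain: str
    description: str = ""
    dedications: dict = field(default_factory=dict)  # keys: fre, ger, ita, swe, dut


@dataclass(frozen=True)
class Relation:
    source: str         # namespace:termID
    target: str         # namespace:termID
    relation_type: str  # e.g. "maps_to"
    match_factor: int   # 1 = almost perfect match ... 3 = weakest match


def expand(term_id, relations, max_factor=3):
    """Terms that term_id maps to in one step, best match first."""
    hits = [r for r in relations
            if r.source == term_id and r.match_factor <= max_factor]
    return [(r.target, r.match_factor)
            for r in sorted(hits, key=lambda r: r.match_factor)]


# Invented sample relations in a hypothetical "dora" namespace.
relations = [
    Relation("dora:dc.language", "dora:necep.society_name", "maps_to", 3),
    Relation("dora:dc.language", "dora:necep.language_name", "maps_to", 1),
    Relation("dora:dc.title", "dora:rmv.title", "maps_to", 1),
]
```

Lowering `max_factor` trades recall for less noise, and the sorted order gives a simple ranking of the expanded query terms.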
A deeper semantic modeling could be carried out, but this would require more time and
specialists. Therefore, we will not include this in the current ECHO project, and we are also
not interested in specifying everything in RDF right now. We will use a specific search engine that
makes use of the simple relation types. The schemas for the two structures can be found in
appendix L.
8
The schemas will be translated to XML/RDF schemas within the first phase implementation.
Appendix A : Metadata set used by the RMV
The following elements are used within the Ethnology Museum in Leiden (RMV).
Nr | Element Name | Description
1 | cultural origin | culture, style and period taken from the OMV thesaurus, which is continent and region oriented; religion oriented description (society, ...)
2 | date | different formal options are given: exact date dd-mm-yyyy; from/to yyyy/yyyy; before yyyy; after yyyy; about yyyy; before 00 yyyy (vC)/yyyy (vC)
3 | presentation title | short title to be used in exhibitions; there can be other title choices such as: sorting title, local title, official title, series title, descriptive title, printing title, function title, English title; there is a field to specify the language the title is in
4 | name of object | short but specific object indication; additional information can be associated such as sorting name, alternative name, active name; also here the language can be specified
5 | material/fabrication | a description of the major materials the object consists of; can be several terms
6 | size | physical size of object
7 | special physical features | possibility to indicate special features of the object
8 | publicly oriented description | a prose description of the object that can be used for public presentations
9 | object history | this element offers the possibility to mention the collection the object was part of beforehand or a number that identifies its relation to an earlier exhibition or so
10 | geographic origin | location where the object was used; all geographic terms have to be taken from the OMV thesaurus; some additional info can be specified such as sorting location, comments
11 | categorization | description of the content with the help of keywords extracted from the OMV category thesaurus
12 | source links | references to different types of sources such as publications, related literature, unpublished documents, exhibitions; for each of these there is a field
13 | reference to digital object | not yet fully defined
14 | others | not yet fully defined, manual speaks about meta objects
mapping
st
st
pr
pr
pr
-
st
st
-
For mapping purposes we can identify three different options: no usage (-), usage in a structured
way (st), usage as free prose text (pr).
The original RMV-catalog, handled in their internal database, is transformed into the categories
mentioned in the table below. These are the categories offered when using the OAI-interface.
Nr
1
2
Element Name
identifier
date
3
format dimensions
4
format materials
5
description
6
cultural origin
7
8
geographical origin
content description
9
coverage spatial
10
11
coverage temporal
title
12
contributor
Description
identification number
different formal options are given:
exact date dd-mm-yyyy
from/to
yyyy/yyyy
before
yyyy
after
yyyy
about
yyyy
about
xx century
from/to
century/century
before 00
yyyy (vC)/yyyy (vC)
dimensions: height; width; depth
mapping
-
st
-
the type of material used and the type of
technique used.
a prose description of the object that can be used
for public presentations
style, period and culture taken from the OMV
category thesaurus; indicating the cultural origin of
the object (continent and region oriented),
sometimes identical to coverage-spatial
geographical origin of the object, taken from the
OMV category thesaurus which is region oriented
(continent, region, country, district, reservation or
city)
description of the content with the help of
keywords extracted from the OMV category
thesaurus;
cultural origin of the object taken from the OMV
thesaurus which is region and religion oriented
temporal period, can be prose text
type of object and short description, or name of
object
name of person or institute contributing to the
acquisition of the object
-
st
st
st
pr
pr
-
Content Description
The content is described by categories according to the SNVT thesaurus. Here we want to
introduce the main categories and discuss their usefulness for the joint infrastructure.
mapping to
languages
can have similar
motives encoded
in texts or in MD
content
Nr
Category
mapping to HoA
mapping to HoS
01
0101
0102
0103
02
0201
0202
0203
0204
0205
03
hunting, fishery, food gathering
can have similar
motives encoded
in IconClass and
texts
can have similar
motives encoded
in texts or titles
can have similar
motives encoded
in IconClass and
texts
can have similar
motives encoded
in texts or titles
can have similar
motives encoded
in texts or in MD
content
0301
agriculture and horticulture
overlap little
0302
forestry
can have similar
motives encoded
in texts or in MD
content
hunting
fishing
gathering food
weapons & war
fist weapons and accessories
casting weapons & accessories
defense and protection means
ornamental weapons
artifacts related to war
agriculture, horticulture, forestry
overlap little
04
0401
0402
05
0501
0503
0504
0505
0506
0507
06
0601
0602
0603
0604
0605
07
0701
0702
0703
08
0801
0802
0803
0804
0805
0806
0807
09
0901
0902
0903
0904
0905
0906
0907
10
1001
1002
1003
1004
1005
11
1101
1102
1103
1104
1105
12
1201
1202
1203
1204
cattle breeding and products
herding cattle and poultry
overlap little
overlap little
overlap little
overlap little
overlap little
overlap little
overlap little
overlap little
overlap little
overlap little
overlap little
can have similar
motives encoded
in texts or in MD
content
can have similar
motives encoded
in IconClass and
texts
can have similar
motives encoded
in texts or titles
can have similar
motives encoded
in texts or in MD
content
overlap little
can have similar
motives encoded
in texts or titles
overlap little
can have similar
motives encoded
in texts or titles
overlap little
overlap little
overlap little
overlap little
overlap little
overlap little
overlap little
insect breeding
food, drink, drugs
preparation of food
food
beverages
serving and consuming
conservation and storage
drinks, drugs and stimulants
clothing and ornamental parts of
clothing
clothing
footwear
ornamentation of the body
personal ornament
clothing accessories
hygienic care, medicine, personal
comfort
care of the body, hygiene
medicine
personal care, making toilet
housing
choosing and preparing the
building site
parts of construction
furniture and household effects
lighting, heating and fire
domestic animals
water supply
(architectural) structures
trade and commerce
gathering raw material and
natural products
handicrafts and industries
industry
recycling
measures and weights
media of exchange
trade and commerce
transportation
transport by human strength
transport by animal mount or
animal traction
traffic on the water
route and appliances
airborne traffic
communication
mnemotechnical appliances
scripts
signaling means
education, teaching, educational
appliances
demonstrating, explication,
transmission
social, law, political life
symbols of status, rank and
dignity, means of identification
legal system
artifacts related to slavery
memorabilia
13
1301
1302
1303
1304
1305
14
1401
1402
1403
1404
1405
1406
1407
15
1501
1502
1503
1504
1505
16
1601
1602
1603
17
1701
1702
1703
life cycle
overlap little
can have similar
motives encoded
in texts or in MD
content
overlap little
can have similar
motives encoded
in texts or in MD
content
overlap little
overlap little
can have similar
motives encoded
in texts or in MD
content
overlap little
overlap little
overlap little
overlap little
overlap little
overlap little
pregnancy, birth and first year
initiation
marriage
overlap little
aging
death and mourning
religion and ritual
representations of the
supernatural
cult objects and other holy objects
altars, sanctuaries and their
interior decoration and furniture
sacrifices
overlap little
magical protection and defence
ritual appliances
symbols of religious status
art
dance and appurtenances
theatre
plastic art
cartography
music
recreation, sports and games
toys for children
equipment for sports and games
knick-knacks, collectors items
indefinite
indefinite general
indefinite dishes
indefinite textile
The object is classified according to these categories, i.e. a set of numbers determines what this
object is. For some categories there are even more fine-grained semantics that seem to be
difficult to use in an interoperable scenario.
Meaning of classification: If an object falls into the categories 0205 and 1505 then we may
conclude that the object is a song about war. When further the cultural origin says that the object
is from the Amazonas area in Brazil we may find it if someone searches for music related to war
for the Trumai people (a tribe living in the Amazonas area).
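The classification logic described above can be sketched in a few lines. The category labels are taken from the table; the object records, their identifiers and the matching on the cultural origin string are invented for illustration.

```python
# A few category codes and labels taken from the table above.
CATEGORIES = {
    "02": "weapons & war",
    "0205": "artifacts related to war",
    "15": "art",
    "1505": "music",
}


def main_category(code):
    """The two-digit main category a four-digit code belongs to."""
    return code[:2]


def matches(obj, required_codes):
    """True if the object carries every required category code."""
    return set(required_codes) <= set(obj["categories"])


def find(objects, required_codes, cultural_origin=None):
    """Ids of objects with all required codes and, optionally, a matching cultural origin."""
    hits = []
    for obj in objects:
        if not matches(obj, required_codes):
            continue
        if cultural_origin and cultural_origin.lower() not in obj["cultural_origin"].lower():
            continue
        hits.append(obj["id"])
    return hits


# Invented sample objects.
objects = [
    {"id": "obj-1", "categories": ["0205", "1505"], "cultural_origin": "Amazonas, Brazil"},
    {"id": "obj-2", "categories": ["1505"], "cultural_origin": "Amazonas, Brazil"},
]
```

A search for music related to war in the Amazonas area then corresponds to `find(objects, ["0205", "1505"], "Amazonas")`; relating that area to the Trumai people would additionally need the language and society mappings discussed in chapter 2.2.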
Appendix B: Metadata set used in the History of Science (Berlin)
The metadata set recently proposed by the HoS group focuses primarily on
management tasks, i.e. the number of elements that describe the content of a resource is small.
The set is a flat list that offers a category “meta” that can be used to enter Dublin Core type
descriptions.
element
description
name
creator
archive-creation-date
archive-storage-date
archive-path
derive-from
sub-element
archive-path
description
comment
informal textual description of the resource
filename of the resource
project or person that created the resource, not useful
time and date of creation of the archive entry
not useful within DORA
linked-with
archive-path
description
content-type
meta
dir
document type comparable to MIME type
substructure see below
description
name
path
meta
not useful within DORA
substructure see below
file
description
name
path
date
modificationdate
creation-date
size
mime-type
md5cs
meta
not useful within DORA
substructure see below
The meta substructure contains elements that are partly dependent on the type of document. The
generic type as listed in the following may give an impression.
language
DRI
context
the language a document is in
not useful for searching
link
name
link to collection as a context
description of that collection
author
year
title
secondary-author
secondary-title
Dublin-Core type of fields
generic
volume
number
pages
date
place-published
publisher
edition
tertiary-author
tertiary-title
number-of-volumes
type-of-work
subsidiary author
alternative-title
isbn-issn
call-number
label
keywords
abstract
notes
url
not useful for searching
Dublin-Core type of field
not useful for searching
DC type of fields
not useful for searching
useful but unconstrained
not useful for searching
Appendix C: Metadata set used by the IMSS
Here we will list the elements used for describing instruments. The other two schemes for
documents and artistic objects share the same core and are very similar.
element
belongsTo
contextualized
DCcontributor
DCcopyright
DCcoverage
DCcreator
DCdate
DCdescription
DCformat
DCidentifier
DClanguage
DCpublisher
DCrelation
DCsource
DCsubject
DCtitle
DCtype
Giver
hasComponentType
hasInstrumentType
hasWR
historicallyLocatedIn
inventor
isDedicated
isDocumentedIn
isPartOf
locatedIn
objectRelated
owner
preservedIn
purchaser
receiver
refersToDiscipline
relatedConcept
shortname
shown
simulatedBy
usedFor
user
comment
not useful for searching
not useful for searching
name of artists or engineers
not useful for searching
not yet clear how the field will be used
name of artists etc
prose text
not yet clear how the field will be used
not useful for searching
to describe the language the descriptions are in
not useful for searching
not useful for searching
not useful for searching
not yet clear how the field will be used
not yet clear how the field will be used
not useful for searching
not useful for searching
not useful for searching
not useful for searching
not useful for searching
?
not useful for searching
not useful for searching
not useful for searching
not useful for searching
not useful for searching
not useful for searching
not useful for searching
not useful for searching
not useful for searching
not useful for searching
not useful for searching
not clear whether useful
not useful for searching
not useful for searching
not useful for searching
not useful for searching
IMSS uses a flat list in which a number of pointers contain relations, i.e. a hierarchical
scheme is realized implicitly. It is not yet clear to us how all fields will be used; examples
will help.
Appendix D: Metadata set used in the Fotothek
For the Fotothek, BH uses the MIDAS rules to describe their image objects with metadata
records. The MIDAS rules serve not only pure discovery but also management purposes. They
form a fairly exhaustive structured description set and allow the creation of linked
hierarchies between objects. Only the most relevant elements are shown in the following table.
The important description of the content of an image is done according to the IconClass rules.
Object-Document
Objekt-Verwalter
Ort
Verwalterart
Name-Museum
Abteilung
Inventar-Nr
Person
Titel
Obj
ob28
2864
2890
2900
2930
2950
2910
2914
ObjektAufbewahrung
Ort
Ortsteil
Straße
Nr
Stelle
5108
5110
5116
5117
5125
Objekt-Künstler
Name
Name in BH
Authentizität
Tätigkeit
Datierung
Zeitangabe
ob30
3100
31bh
3470
3475
5064
5062
Entstehungsort
5130
Objekttitel
Bauwerksname
Gattung
Art
Sachbegriff
Material
Technik
prim. Ikonogr.
sec. Ikonogr
lokaler Bezug
Objekt-Bauwerk
Ort
Sachbegriff
Träger
etc
Objekt-Person
Name
Beziehung zu Objekt
Link
Hersteller
Sachbegriff
Titel
5200
5202
5220
5226
5230
5260
5300
5500
5510
5560
ob26
2664
2690
2694
ob40
4100
5007
5008
5009
5010
5013
Description
description fields about owner or administrator
description fields about where the object is
housed:
some geographical or topographical information
like Australia, Venice
description fields about artist
date of creation or period of time
could be any other date descr.
place of creation
here “Kunststil” like Venetian etc…
known name of the object
instead of 5200 for building
sub-genre for paintings
topic of sub-genre, e.g. “Architecturzeichnung”
Object type
type of material used
type of technique used
primary content descr
secondary content descr
place the content refers to
Description of the relation between the object
and a building (there are many more descriptive
fields)
Relation to other person
Relation to other object and description of other
object (a normalization would be better, i.e. to
include the object as a regular one in the
domain and have just a link to it)
Bauwerk
Ort
Zeit
etc
Ereigniskurztitel
Literaturnachweis
Foto
Nummer
Verwalter
Fotograf
AufnahmeDatum
Zugangsdatum
Inhalt
Signatur
Dateiname
Kommentar
Urheber
etc
5014
5015
5011
7190
8350
8450
8470
8460
8490
8498
8496
8510
8515
8540
9990
9902
Description of the photo of the
The content is described according to the IconClass proposal that is widely used in the arts
domain. IconClass was worked out by Dutch scientists and is available at the Dutch academy of
sciences.
(a short description will follow – the thesaurus is too large to be described fully at this place)
Appendix E: Metadata set used in the Lineamenta Project
The Lineamenta collection uses internally a rich description set, however, it seems that they will
only export a limited set. For this export the same core metadata set is used as for the History of
Science – Berlin collections. They use a slightly different specialized “meta” set that is indicated
here.
element | comment | DORA usage
image | reference to an image | not useful for DORA search
language | language the document is written in | useful
document type | associated with fixed vocabulary, e.g. “architectural drawing” | useful
title | short description of a drawing (the entry “Gegenstand”) | useful
person | equivalent to DC:creator and contributor, all persons related with their respective fields of activity | useful
location | place, institution where the object is placed | useful
date | date of origin, YYYY.MM.DD or YYYY.MM or YYYY or YYYY-YYYY | useful
object | detailed description of the object, i.e. related building or name of an event which was the background for the genesis of the work of art | useful
keywords | this field seems to contain no data | ?
Here further examples should be made available.
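As one such example, the four date notations listed for the Lineamenta date element (YYYY.MM.DD, YYYY.MM, YYYY and YYYY-YYYY) could be normalized to a year range before they are compared with date values from other sets. The following is a minimal sketch; the function names and the reduction to whole years are our assumptions, not part of the Lineamenta export.

```python
import re

# One pattern per notation listed for the Lineamenta date element.
_PATTERNS = [
    re.compile(r"^(\d{4})-(\d{4})$"),        # YYYY-YYYY
    re.compile(r"^(\d{4})\.\d{2}\.\d{2}$"),  # YYYY.MM.DD
    re.compile(r"^(\d{4})\.\d{2}$"),         # YYYY.MM
    re.compile(r"^(\d{4})$"),                # YYYY
]


def year_range(value):
    """Return (first_year, last_year) for a date value, or None if it is not recognised."""
    value = value.strip()
    for pattern in _PATTERNS:
        m = pattern.match(value)
        if m:
            years = [int(g) for g in m.groups()]
            return (years[0], years[-1])
    return None


def overlaps(value, start, end):
    """True if the date value falls at least partly within the years start..end."""
    rng = year_range(value)
    return rng is not None and rng[0] <= end and start <= rng[1]
```

Values that do not follow one of the four notations yield `None`, so they can be reported back to the data provider instead of being silently mismatched.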
Appendix F: Metadata set used in the Maps of Rome Project
The descriptive data is kept in a relational database that has three tables: PDR, Piantecopie,
Persone. These were exported to separate XML documents.
From these XML documents received we can identify the following metadata elements that are
relevant for DORA:
element | comment | DORA usage
<autorlink> author-name | metadata elements describing the author | useful
alternative names | | useful
date of birth | | not useful
date of death | | not useful
place of birth | | not useful
place of acting | | not useful
<data> date | date of origin of the object, YYYY or YYYY-YYYY | useful
<titolo> title | transcription of the title | useful
method | not clear whether this can be mapped | ?
dim-alt | | not useful for searching
dim-long | | not useful for searching
orientation | | not useful for searching
<incislink> engraver | engraver, is it a relevant contributor? | ?
<editlink> editor | | useful
huelsen | these terms are not yet clear | ?
scaccia | | ?
frutaz | | ?
rome-veduta | | ?
description | probably not a search term at DORA level | not useful
collection | | ?
image reference | | for backlinking
This list has to be checked with the Bibliotheca Hertziana.
Appendix G: Metadata set used in the Language Domain
All metadata descriptions in the language area are created according to the IMDI standard (see
www.mpi.nl/IMDI). IMDI provides a structured set that is used for resource discovery and
management.
Session
Name
Title
Date
Location
Continent
Country
Region +
Address
Description +
Resource Reference
Keys
Project +
Name
Title
Id
Contact
Description +
Content
Genre
SubGenre +
Communication Context
Interactivity
Planning Type
Involvement
Social Context
Event Structure
Channel
Task
Modalities
Subject +
Languages
Language +
Description +
Description +
Keys
Actors
Description +
Actor +
Resource Refs
Role
Family Social Role
Name +
Full Name
Code
Language +
Ethnic group
Age
Sex
Education
Anonymous
Contact
Description +
Keys
Session
Resources
Media File +
Resource Id
Resource Link
Type
Size
Format
Quality
Recording Conditions
Position
Access
Description +
Written Resource +
Resource Id
Resource Link
Media Resource Link
Date
Type
SubType
Format
Size
Derivation
Content Encoding
Character Encoding
Validation
Access
Language Id
Anonymized
Description +
Source +
Id
Format
Quality
Position
Access
Description +
Anonyms
Resource Link
Access
References
Description +
Language
Access
Id (ccv)
Name + (str)
MotherTongue (ccv)
Primary (ccv)
Dominant (ccv)
Description + (sub)
Keys
Availability (string)
Description + (sub)
Date (c)
Owner (string)
Publisher (string)
Contact (sub)
Contact
Key + (sub)
Name (string)
Address (string)
E-mail (c)
Organisation (string)
Key
Name = Value (string)
Vocabulary Link (c)
Resource Reference
Type (cv)
Description
Text (string)
Language Id (ccv)
Link (c)
Name (string)
SubType (ocv)
Format (cv)
Link (c)
Validation
Type
Methodology
Level
Description (sub)
Appendix H: Metadata set used by NECEP
The following elements are used within Non European Components of European Patrimony
(NECEP).
Nr | Element Name | Comment
1 | society name | usual anthropological designation
2 | alternative name | alternative names and spellings used
3 | language name |
4 | country | more than one, countries of residence
5 | continent | continent or areas
6 | ethnic region | this element is not found in the data we received
Appendix I: Metadata set used by Philosophy
For the philosophical lexicon the IMDI metadata structure was used for reasons of simplicity. Four
elements were filled in:
• project
• researcher as creator
• concept in focus as title and content description
• location of creation
The texts were included as descriptions to integrate them into the full-text search supported under
simple search. All mappings that are valid for the IMDI metadata set are valid for the philosophy
domain as well.
Appendix J: Dual Mapping between Structured Elements
This appendix should be read as a set of exercises on the way to the final mappings for the
different views (see Appendix K) and has therefore not been updated. For a number of pairs of
metadata sets we discuss relevant topics and indicate the problems that we expect.
The NECEP-RMV mapping makes sense since NECEP describes in detail societies of which
RMV will have objects in its repository.
NECEP                  RMV                                             comment
A1 society names       subject-cultural region                         has to be checked whether values are the same, probably value matching necessary
A7 alternative names   subject-cultural region                         has to be checked whether values are the same, probably value matching necessary
B2 continent           subject-cultural region, subject-geographical   RMV has two fields that apply, details have to be checked
B1 country             subject-cultural region, subject-geographical   RMV has two fields that apply, details have to be checked
B3 ethnic region       subject-cultural region, subject-geographical   RMV has two fields that apply, details have to be checked
C1 language name       subject-cultural region                         a mapping between languages and societies is necessary
The NECEP-IMDI mapping makes sense since NECEP describes societies for which one can
probably find language resources in the languages domain.
NECEP
IMDI
comment
A1 society Names
A7 alternative
names
B2 continent
B1 country
B3 ethnic Region
C1 language name
language name
a mapping between languages and societies is necessary
language name
continent
country
region
language name
perhaps mapping due to different names
perhaps mapping due to different names
perhaps mapping due to different names
perhaps mapping due to different names
The RMV-IMDI mapping makes sense since one may find objects in the RMV repository that may
be related with language resources.
RMV                                   IMDI        comment
fields mentioned above will be used   see above
date                                  date        rmv.date is date of creation; imdi.date is date of recording; overlap seems to be small
categorization                        content     rmv.categorization contains a set of numbers describing the type of content included; IMDI uses a whole sub-structure for content; has to be checked how this can be mapped
With respect to the HOS-IMDI mapping we don’t expect too much overlap in the scope of the
ECHO project. There may be language resources that appear in both repositories.
HoS Berlin      IMDI             comment
creator         actor            not much overlap to be expected
meta.author⁹    actor            not much overlap to be expected
language        language         here is a difference: hos.language refers to the language the resource is in while imdi.language refers to the language the resource is about; nevertheless, hos.language could be useful for linguists
meta.year       date             hos.meta.date means year of publication while imdi.date refers to the date of the recording
title¹⁰         content, title
keywords        content          hos.meta.keywords describe the content of the resource and can be mapped with the content description in IMDI; it is not clear how keywords will be used in HoS
⁹ The HoS set includes secondary and tertiary authors. The indicated mapping should include them as well.
With respect to the IMSS-IMDI mapping we do not expect much overlap either, despite the
formal overlap between the fields used.
HoS IMSS
IMDI
comment
DCcontributor
DCcoverage
actor
location, date
DCcreator
DCdate
actor
date
DCformat
DClanguage
language
DCsubject
DCtitle
DCtype
inventor
content
title
type
actor
IMSS will have to use qualifiers to separate the two
information types
in IMSS probably the language the document is in, in IMDI
both is possible
no information yet how this field will be used
not yet clear whether this field is relevant
In the current ECHO project we do not expect too much overlap, which is due to the fact that both
repositories will not have too many resources that are related. However, in principle much overlap
can be expected, since texts from the language resource area can for example explain objects in
the HoA area.
HoArts
IMDI
comment
Fotothek
3100 name artist
5064 date
5062 period
5130 location of creation
5200 object title
5202 title of building
5230 object type
5500 prim iconography
5510 sec iconography
5560 place of content
actor
date
date
location
title
title
content
content
content
location
overlap estimated to be small
hoa.date is precise; hoa.period offers different options; both
can be matched with imdi.date
hoa title in case of buildings
not yet clear whether there is a potential for matching
here a classification according to the IconClass system is
used
location as part of the content of the painting
Not much overlap is expected since the resources probably are not that much related.
HoArts
IMDI
comment
Lineamenta
document type
creator
m.language
m.person
m.year
m.title
m.date
m.keywords
object
m.location
10
actor
language
actor
date
title
date
content
title
location
no real equivalence in IMDI since the vocabulary is different
overlap estimated to be small
Lin is encoding the language the document is in
overlap estimated to be small
no specifications yet as how to fill in keywords
in Lin no formal distinction in continent, countries etc
The HoS set includes secondary and tertiary titles. The indicated mapping should include them as well.
Here one can expect some overlap in principle. However, the metadata set chosen by HoS does
not allow many relations to be drawn.
HoArts
HoS Berlin
comment
Fotothek
3100 name artist
5064 date
5062 period
5200 object title
5202 title of building
5230 object type
5500 prim iconography
5510 sec iconography
creator
meta.author
meta.year
meta.year
title(s)
title(s)
keywords
keywords
keywords
it is not yet clear how keywords will be used in HoS
it is not yet clear how keywords will be used in HoS
it is not yet clear how keywords will be used in HoS
A number of Dublin Core mappings will be used. Therefore, we compare some sets from the DC
view point.
Dublin Core
HoS-Berlin
comment
DCcontributor
DCcoverage
DCcreator
DCdate
DCformat
DClanguage
DCsubject
DCtitle
DCtype
author
secondary author
tertiary author
year
author
secondary author
tertiary author
date
document type
mime type
language
keywords
title
secondary title
tertiary title
doc type
DC not very clear – so not clear how to map
The mapping between DC and IMDI is fairly straightforward.
Dublin Core
IMDI
participant
DCcontributor
location
DCcoverage
DCcreator
DCdate
DCformat
DClanguage
DCsubject
DCtitle
DCtype
date
participant
date
format
language
content
language
title
DC language is language a document is written in
not at all clear how subject is used
language the doc is about would fall under DC:subject
DC semantics not very clear
The mapping between DC and HoA-Fotothek.
Dublin Core
HoA-Fotothek
3100 name artist
DCcontributor
5062 period
DCcoverage
DCcreator
DCdate
DCformat
DClanguage
comment
comment
5130 place
3100 name artist
5064 date
DCsubject
DCtitle
DCtype
prim iconography
sec iconography
5220
5200 object title
5202 building title
not at all clear how subject is used
object type
DC semantics not very clear
The mapping between RMV and DC does not give many options.
Dublin Core
RMV
comment
DCcontributor
contributor
DCcoverage
date
subject-cultural region
subject-geographic
coverage-spatial
coverage-temporal
DCcreator
DCdate
date
DCformat
format
DClanguage
DCsubject
subject-cultural region
subject-geographical
subject-content
DCtitle
presentation title
name of object
DCtype
Appendix K: Mapping for Views
As mentioned above, we have to evaluate how the various fields are used in order to optimize the mapping schemes. As a first step, it is useful to list
the metadata elements to be used in short form as an overview.
Set
IMDI
Lineamenta
element name
language
continent
country
region
date
actors
title
content
type
format
appearance
language
continent
country
region
date
actors
title
content
type
format
title
person
object
date
keywords
title
person
object
date
keywords
document type
language
location
document type
language
location
Set
IMSS
NECEP
element name
creator
date
subject
title
type
format
language
contributor
inventor
coverage spatial
coverage temporal
appearance
creator
date
subject
title
type
format
language
contributor
inventor
coverage spatial
coverage temporal
anthropological designation
alternative name
continent
countries of residence
official ethnic regions
society name
alternative name
continent
country
ethnic region
language name
language name
Set
Fotothek
RMV Leiden
this set is derived from the XML files we received
HoS Berlin
author
content-type
language
year
title
keywords
date
author
content type
language
year
title
keywords
date
element name
name artist (3100)
creator (9902)
person name (4100)
date (5064)
period (5062)
location (5130)
content place (5560)
place (2864)
name museum(2900)
short title (7190)
object title (5200)
building title (5202)
object type (5230)
type (5226)
prim. iconography (5500)
sec. iconography (5510)
appearance
artist object
artist photo
person name
date
period
place of creation
content place
place
institute
short title
object title
building title
object type
type
primary iconography
secondary iconography
coverage spatial
coverage temporal
subject geographical origin
date
subject category
coverage spatial
coverage temporal
geographical origin
date
content description
title
title
this set is derived from the XML files we received
Rome Maps
author-name/autorlink
alternative names
date
title
editor/editlink
incisore/incislink
author name
alternative author
date
title
editor
engraver
Philosophy
1. DC View
We refer to the names in the table above.
DC
Ethnology
NECEP
RMV
Title
title
Creator
Contributor
Subject
content descr.
Date
date
Type
Format
Language
“jpg”, “mpeg”,
“wav”
society name
altern. name
language name
Coverage
temporal
Coverage
spatial
continent
country
ethnic region
“jpg”
Fotothek
object title
building title
artist object
artist photo
History of Arts
Lineamenta
title
person
Rome Maps
title
author name
editor
author name
editor
artist object
person
prim icono
sec icono
date
period
object type
object
keywords
“rome maps”
date
date
“jpg”
document type
“tiff”, “jpg”
date
period
date
geogr. origin
coverage spatial
place of creation
content place
location
date
Philosophy
Languages
IMDI
title
title
title
author
creator
actors
author
contributor
actors
keywords
subject
content
date
date
type
type
format
format
language
language
language
date
year
coverage temp.
date
coverage spat.
continent
country
region
year
date
content type
“jpg”
“image”
language
date
coverage temp.
History of Science
Berlin
IMSS
2. Necep View
NECEP
Ethnology
NECEP
RMV
society name
alt. name
coverage spat.
coverage spat.
coverage spat.
geogr. origin
coverage spat.
geogr. origin
coverage spat.
geogr. origin
coverage spat.
continent
country
ethnic region
language name
Fotothek
History of Arts
Lineamenta
Rome Maps
History of Science
Berlin
IMSS
Philosophy
Languages
IMDI
language
language
place of creation
content place
place of creation
content place
place of creation
content place
location
“europe”
continent
location
“italy”
country
location
“rome”
region
language
coverage spat.
language
3. RMV View
RMV
coverage spatial
Ethnology
NECEP
RMV
society name
continent
country
ethnic region
date
Fotothek
History of Arts
Lineamenta
Rome Maps
History of Science
Berlin
IMSS
geogr. origin
content descr.
continent
country
ethnic region
Languages
IMDI
language
continent
country
region
place of creation
content place
location
“europe”
“italy”
“rome”
date
period
date
date
year
date
coverage temp.
date
coverage temp.
object title
title
object
title
title
title
title
place of creation
content place
location
“europe”
“italy”
“rome”
coverage spat.
continent
country
region
prim.iconogr.
sec. iconogr.
keywords
subject
content
coverage spat.
coverage temp.
title
Philosophy
keywords
date
4. Fotothek View
Fotothek
Ethnology
NECEP
RMV
Fotothek
History of Arts
Lineamenta
institute
location
place
location
place of
creation
content place
object title
continent
country
region
continent
country
region
coverage spat.
geogr. origin.
location
coverage spat.
geogr. origin
location
title
building title
short title
title
object
object
title
object
Rome Maps
“europe”
“italy”
“rome”
“europe”
“italy”
“rome”
“europe”
“italy”
“rome”
“europe”
“italy”
“rome”
coverage spat.
coverage spat.
coverage spat.
coverage spat.
Philosophy
Languages
IMDI
continent
country
region
continent
country
region
continent
country
region
continent
country
region
title
title
title
title
author name
editor
engraver
author
creator
actors
year
date
year
date
date
coverage temp.
date
coverage temp.
keywords
keywords
type
subject
subject
artist object
person
artist photo
person
person name
person
editor
engraver
author name
date
date
date
date
period
date
date
date
content descr.
content descr.
document type
document type
keywords
keywords
“maps”
“maps”
type
object type
prim. iconogr.
sec. iconogr.
History of Science
Berlin
IMSS
date
date
content
content
5. Lineamenta View
Lineamenta
location
Ethnology
NECEP
RMV
continent
country
ethnic region
geogr. origin
coverage spat.
title
title
date
date
object
document type
language
keywords
person
Fotothek
place of creation
content place
place
institute
object title
artist object
short title
date
period
object title
building title
short title
History of Science
Berlin
IMSS
“europe”
“italy”
Philosophy
Languages
IMDI
coverage spat.
continent
country
region
title
title
title
title
date
date
year
date
coverage temp
date
“rome maps”
title
language
prim.iconogr.
sec. iconogr.
object type
“maps”
keywords
artist object
person name
editor
engraver
author name
coverage spat.
content descr.
Rome Maps
“printed map”
“landscape
drawing”
“italien”
type
language name
History of Arts
Lineamenta
type
language
subject
content
6. HoS Berlin View
HoS Berlin
Ethnology
NECEP
RMV
author
language
Fotothek
artist object
language name
society name
History of Arts
Lineamenta
person
Rome Maps
History of Science
Berlin
IMSS
author name
editor
coverage spatial
year
date
date
date
date
period
date
period
date
date
date
date
Philosophy
creator
actors
language
language
date
coverage temp.
date
coverage temp.
type
content type
Languages
IMDI
date
date
title
title
object title
title
object
title
title
title
keywords
content descr.
prim.iconogr.
sec.iconogr.
keywords
“maps”
subject
content
7. Rome Maps View
Rome Maps
author name
altern. author
date
title
editor
engraver
Ethnology
NECEP
RMV
date
title
Fotothek
History of Arts
Lineamenta
Rome Maps
History of Science
Berlin
IMSS
Philosophy
Languages
IMDI
artist object
person
author
creator
actors
date
object title
date
title
date
title
date
title
contributor
date
title
8. IMSS View
(same as the DC view)
IMSS
Ethnology
NECEP
RMV
Fotothek
History of Arts
Lineamenta
Rome Maps
object title
building title
title
creator
artist photo
person
contributor
artist object
person
prim. iconogr.
sec. iconogr.
date
period
date
period
object type
object
keywords
“rome maps”
date
date
date
date
title
title
title
author name
editor
author name
editor
History of Science
Berlin
IMSS
Philosophy
Languages
IMDI
title
title
author
actors
author
actors
keywords
content
inventor
subject
content descr.
date
date
coverage temporal
date
coverage temp.
type
format
language
coverage spatial
“jpg”, “mpeg”,
“wav”
society name
language name
continent
country
ethnic region
“jpg”
“jpg”
document type
“tiff”, “jpg”
“jpg”
“image”
language
coverage spatial
geogr. origin
place of creation
content place
location
date
year
date
year
content type
date
type
format
language
“rome”
date
language
continent
country
region
9. Language View
Language
NECEP
Ethnology
RMV
language
society name
language name
continent
continent
country
country
region
ethnic region
Fotothek
coverage spatial
coverage spatial
geogr. origin
coverage spatial
geogr. origin
coverage spatial
geogr. origin
History of Arts
Lineamenta
Rome Maps
language
place of creation
content place
place of creation
content place
place of creation
content place
History of Science
Berlin
IMSS
language
date
date
coverage temp.
content
content description
actors
title
date
period
prim.iconogr.
sec.iconogr.
“europe”
coverage spatial
location
“italy”
coverage spatial
location
“rome”
coverage spatial
date
date
date
year
type
format
date
coverage temp.
keywords
“maps”
keywords
subject
author name
editor
author
creator
title
title
title
artist photo
title
object title
title
object
Languages
IMDI
language
location
type
format
Philosophy
Appendix L: Schemas
Schema for Term Definitions
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name="term">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="termID" type="xs:ID"/>
        <xs:element name="term-name" type="xs:string"/>
        <xs:element name="xpath" type="xs:anyURI"/>
        <xs:element name="domain-name" type="xs:string"/>
        <xs:element name="sub-domain-name" type="xs:string"/>
        <xs:element name="description" type="xs:string"/>
        <xs:element name="dedications">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="fra" type="xs:string"/>
              <xs:element name="ger" type="xs:string"/>
              <xs:element name="ita" type="xs:string"/>
              <xs:element name="swe" type="xs:string"/>
              <xs:element name="dut" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
Schema for relations
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
           xmlns:ECHO='http://www.mpi.nl/echo/schemas/ECHO-def-schema'>
  <xs:element name="mapping">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="termID" type="xs:ID"/>
        <xs:element name="termID" type="xs:ID"/>
        <xs:element name="relation-type" type="xs:string"/>
        <xs:element name="match-factor" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
B. WP2 Note on an ECHO Ontology
Peter Wittenburg
20.2.2004
An essential part of the DORA¹¹ ECHO portal, which was presented at several
meetings and discussed in detail with nearly all ECHO participants, is the integration
of ontological knowledge from several domains. This paper documents the
knowledge components, their extraction processes and their relations. The resulting
components will be available at the end of the ECHO project in well-documented
formats.
This document can be seen as supplementary to the one that describes the DORA
infrastructure, the selections made with respect to the semantics and the mapping
choices. From several projects and initiatives we know that mapping choices can
always be questioned, since no two persons will fully agree. This is exactly why
we rely on practical ontologies that others can easily change and amend
so that the chosen mappings better reflect their intentions.
Despite many difficulties we can state that we were able to establish an ECHO
ontology that covers the offered semantics of the participating disciplines and that
now forms the basis of the DORA machinery.
1. Provided Components
The following components were provided by the participants and external sources:
1. Metadata Descriptions
XML repositories covering the metadata descriptions of the various data
providers, often without any form of validation. These were partially
associated with
a. the list of the metadata vocabularies, of which some referred to Dublin
Core concepts, others to proper definitions and others to verbal
explanations¹²;
b. formal syntax descriptions (only in three cases).
2. Content Thesauri
Two metadata sets make use of thesauri to describe the content of the
object.
a. The RMV uses the OVM thesaurus that is derived from the AAT
thesaurus¹³.
b. The Fotothek uses the IconClass¹⁴ thesaurus, which was available as an
interactive CD-ROM.
¹¹ Digital Open Resource Area: see WP2-TR16-2003; web-site to come
¹² Metadata definitions will always include some tolerance in usage due to the different interpretations of
the definitions of the semantic scope. Non-existing or unclear definitions of course lead to wider tolerances in
usage.
¹³ It should be noted here that Brik de Zwart supported the ECHO work by not only providing the only real OAI
implementation, but also providing the OVM thesaurus in a structured form. Thanks a lot!!
c. Other metadata sets use either unconstrained keyword elements or
a limited number of narrowly defined elements.
3. Geographic Information
a. The RMV uses a geographic thesaurus.
b. Other metadata sets use either unconstrained elements or a limited
number of more clearly defined and constrained elements such as continent and
country.
c. It was noted that language and society names in many cases include
geographical information.
2. Generated Components - Overview
From this basic information a number of essential components were extracted. Most
of them are in XML; others are in a structured form that is easy to process and will
be transformed to XML by the end of ECHO. RDF has not yet been used to represent
knowledge. Concept definitions can be expressed in XML, and this is the approach used
by ISO groups such as TC37/SC4. For the mapping file, which contains assertions
about concepts, RDF is the most suitable format. However, since there is no
complete logic, since we have many fuzzy mapping relations and since we lack
appropriate standard inference engines, there is no immediate need to formulate the
relations as RDF assertions. The mappings are embedded in XML in such a way that they can
easily be transformed to RDF.
1. Validated Metadata Sets
The metadata information was transformed into validated and machine-readable
formats. Structure and character encoding were standardized to XML and
Unicode.
2. ECHO Concepts
This XML file consists of all elements from the various metadata sets that were
selected to be used in DORA, i.e. that are not too specialized.
The current version is: echo-term-v6.xml
3. ECHO Mappings
This XML file consists of an exhaustive mapping between all elements found in
the concepts file. It is guided by the wish to do the access from different views.
The current version is: echo-mapping-v5.xml
4. OVM-Geographic Thesaurus
This file contains the geographic thesaurus as used within the RMV descriptions.
Where possible the OVM geographic thesaurus points to comparable entries in
the MPI geographic thesaurus.
The current version is: ovm-geo-thesaurus-v3.xml
5. MPI-Geographic Thesaurus
An analysis was carried out on all geographically oriented fields in all metadata
records of all data providers except RMV to get a list of the geographic concepts that
are actually used. From these a “complete” geographic thesaurus¹⁵ was created.
Where possible the MPI geographic thesaurus points to comparable entries in
the OVM geographical thesaurus.
The current version is: mpi-geo-thesaurus-v4.xml
¹⁴ IconClass was bought from the KNAW Amsterdam.
6. OVM Category Thesaurus
This thesaurus contains all values that are used in the RMV content description
field and they are ordered in a hierarchical way. This thesaurus is based on the
AAT thesaurus.
The current version is: ovm-category-thesaurus-v2.xml
7. Iconclass Category Thesaurus
This thesaurus contains all values that are used in the Fotothek content
description field (Iconography) and they are ordered in a hierarchical way.
The current version is: iconclass-category-thesaurus-v2.xml
8. IconClass-to-OVM Mapping
This file contains a mapping between IconClass and OVM nodes where this is
semantically feasible. It was clear that only a one-directional mapping would
serve the needs.
The current version is: iconclass2ovm-mapping-v3.xml
9. OVM-to-IconClass Mapping
This file contains a mapping from OVM nodes to IconClass nodes where this is
semantically feasible. It was clear that only a one-directional mapping would
serve the needs.
The current version is: ovm2iconclass-mapping-v3.xml
10. MPI Content List
An analysis was made on all content type fields that can be found in all metadata
records of all data providers except RMV and Fotothek. A mapping file was
created that links these descriptors with those to be found in the OVM and the
IconClass thesauri.
The current version is: IMDI2iconclass-and-ovm-v1.xml
11. Other Components
There are a few other files that are used to facilitate the DORA searching
machinery, but they don’t contain essential knowledge representations.
3. Components in Detail
In this chapter we want to discuss some aspects in more detail.
3.1 ECHO Concepts
This file contains all concepts from the different metadata sets that were selected
for use in the DORA interface. We chose a setup that now seems to be followed by many
groups representing knowledge: concept definitions are separated from any
relational information, except where a sub/superclass relation is an evident part of the
concept definition. This gives everyone the possibility to relate concepts in their own
way; nothing is prescribed. In ISO TC37/SC4 it is argued that equality and
sub/superclass relations can be part of the definition of a concept. This is very
dependent on the scope of the domain considered. According to the ISO 11179
model the domain description has to be part of the concept definition.
¹⁵ The OVM geographical thesaurus is not complete and not appropriately structured. Different types of
concepts appear at a certain depth. Therefore, we could not use it as master thesaurus. A conversion would have
required manual work.
We have taken a strict approach and separate definition from relation, since we do not
yet have a sufficiently detailed view on the semantic scope of all terms. Each concept
found is defined by a number of attributes which are indicated in the following XML
fragment.
<terms>
  <term>
    <termID> 001 </termID>                                <!-- unique identifier -->
    <term-name> title </term-name>                        <!-- concept name -->
    <xpath> dc.title </xpath>                             <!-- how to find it -->
    <domain-name> DublinCore </domain-name>               <!-- ECHO domain name -->
    <sub-domain-name> </sub-domain-name>                  <!-- ECHO subdomain -->
    <description> name given to resource </description>   <!-- a prose definition -->
    <dedications>
      <fra> titre </fra>                                  <!-- French dedication -->
      <ger> Titel </ger>                                  <!-- German dedication -->
      <ita> titolo </ita>                                 <!-- Italian dedication -->
      <swe> titel </swe>                                  <!-- Swedish dedication -->
      <dut> titel </dut>                                  <!-- Dutch dedication -->
    </dedications>
  </term>
  <term>
    ....
    ....
  </term>
</terms>
If there is enough time left in the ECHO project we will transform this into an ISO
11179 and ISO 12620 compliant XML form so that it can be put openly on the web and
used by others. However, in ECHO we will not introduce relational information into
the document and will not eliminate equivalent concepts (synonyms etc.), mainly
because the machinery has been developed to use this normalized type of
representation.
The file was generated automatically only to a small extent. All translations were
created manually.
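As an illustration of how a concept file of this shape could be consumed, the following minimal Python sketch reads the terms into a dictionary keyed by termID. It is not the DORA implementation; the sample data and the function name load_terms are assumptions made for this example, and only the element names are taken from the fragment above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that follows the fragment above; the real
# echo-term file may contain more terms and different values.
SAMPLE_TERMS = """<terms>
  <term>
    <termID> 001 </termID>
    <term-name> title </term-name>
    <xpath> dc.title </xpath>
    <domain-name> DublinCore </domain-name>
    <sub-domain-name> </sub-domain-name>
    <description> name given to resource </description>
    <dedications>
      <fra> titre </fra>
      <ger> Titel </ger>
      <ita> titolo </ita>
      <swe> titel </swe>
      <dut> titel </dut>
    </dedications>
  </term>
</terms>"""

FIELDS = ("term-name", "xpath", "domain-name", "sub-domain-name", "description")


def load_terms(xml_text):
    """Return {termID: {field: value, 'labels': {lang: label}}}."""
    terms = {}
    for term in ET.fromstring(xml_text).iter("term"):
        # Surrounding whitespace is stripped, empty elements become "".
        entry = {f: (term.findtext(f) or "").strip() for f in FIELDS}
        dedications = term.find("dedications")
        entry["labels"] = {} if dedications is None else {
            el.tag: (el.text or "").strip() for el in dedications
        }
        terms[(term.findtext("termID") or "").strip()] = entry
    return terms


terms = load_terms(SAMPLE_TERMS)
print(terms["001"]["term-name"], terms["001"]["labels"]["ger"])
```

Keeping the language labels in a separate sub-dictionary mirrors the separation in the XML and leaves the concept definition itself free of any relational information.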
3.2 ECHO Mappings
The mappings are done according to the Technical Report WP2-TR16-2003 about
Mapping. They consist of references to the concept file, a relation type and a matching
factor that is currently not used. Before using this information we first have to gain
more experience. The intention is to indicate the quality of the mapping, i.e. the
amount of semantic overlap between the related concepts. The following XML
fragment indicates how the file is structured. For ease of reading a
supplementary file was created that contains all concept information. However, this
cannot be the basis for the DORA machinery, since the information would be stored
in two places, which is not acceptable for maintenance reasons.
<mappings>
  <mapping>
    <termID>001</termID>                       <!-- first concept reference -->
    <termID>080</termID>                       <!-- second concept reference -->
    <relation-type>isEqualTo</relation-type>   <!-- relation type -->
    <match-factor>1</match-factor>             <!-- matching factor -->
  </mapping>
  <mapping>
    <termID>002</termID>
    <termID>027</termID>
    <relation-type>mapsTo</relation-type>
    <match-factor>1</match-factor>
  </mapping>
  <mapping>
    ....
    ....
  </mapping>
</mappings>
The structure can easily be transformed into RDF assertions. Let us take the first
mapping from the fragment above as an example.
<termID>001</termID>
<termID>080</termID>
<relation-type>isEqualTo</relation-type>
This XML substructure would translate to the following RDF assertion.
concept 001   isEqualTo   concept 080
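The translation from a mapping record to an assertion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name mappings_to_triples is hypothetical, the result is a plain list of subject-predicate-object tuples rather than a real RDF serialization, and the unused match-factor is ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of the mapping fragment above.
SAMPLE_MAPPINGS = """<mappings>
  <mapping>
    <termID>001</termID>
    <termID>080</termID>
    <relation-type>isEqualTo</relation-type>
    <match-factor>1</match-factor>
  </mapping>
  <mapping>
    <termID>002</termID>
    <termID>027</termID>
    <relation-type>mapsTo</relation-type>
    <match-factor>1</match-factor>
  </mapping>
</mappings>"""


def mappings_to_triples(xml_text):
    """Turn every <mapping> into a (subject, predicate, object) tuple."""
    triples = []
    for mapping in ET.fromstring(xml_text).iter("mapping"):
        ids = [(el.text or "").strip() for el in mapping.findall("termID")]
        if len(ids) != 2:
            raise ValueError("a <mapping> needs exactly two <termID> elements")
        # The first termID is read as subject, the second as object.
        subject, obj = ids
        predicate = (mapping.findtext("relation-type") or "").strip()
        triples.append((subject, predicate, obj))
    return triples


for triple in mappings_to_triples(SAMPLE_MAPPINGS):
    print(triple)
```

Because the order of the two termID elements carries the direction of the relation, the sketch rejects records that do not contain exactly two references.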
The following semantic relations are used in the mapping file:
isEqualTo        the two related terms are semantically equivalent
                 Example: DC:Date isEqualTo IMDI:Date
isSubclassOf     the first concept is a hyponym of the second one
                 Example: DC:Creator isSubclassOf IMDI:Participant
isSuperclassOf   the first concept is a hyperonym of the second one
                 Example: IMDI:Participant isSuperclassOf DC:Creator
mapsTo           the first concept is related with the second one;
                 this relation was chosen in many cases, but the semantic overlap
                 cannot be specified in terms that can be exploited by strict logic;
                 it represents a kind of fuzzy matching, i.e. only the move to some
                 granular feature space would allow us to make the relation more
                 specific and precise.
                 Example: DC:Creator mapsTo RomeMaps:Editor
All relations were created based on manual inspection of the definitions and after
having talked with the sub-domain experts. We are currently starting to analyze how
the fields are actually used, which may lead to changes.
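How such relations might be exploited when a query is issued from one view can be sketched as follows. The choice of which relation types to follow, the treatment of mapsTo as a directed relation and the function name related are assumptions made for this illustration; the report does not specify how the DORA machinery evaluates the relations. The three triples are the examples given above.

```python
# Inverse relation names; mapsTo is deliberately left out because this
# sketch treats it as directed (an assumption, not stated in the report).
INVERSE = {
    "isEqualTo": "isEqualTo",
    "isSubclassOf": "isSuperclassOf",
    "isSuperclassOf": "isSubclassOf",
}

EXAMPLES = [
    ("DC:Date", "isEqualTo", "IMDI:Date"),
    ("DC:Creator", "isSubclassOf", "IMDI:Participant"),
    ("DC:Creator", "mapsTo", "RomeMaps:Editor"),
]


def related(concept, triples, follow=("isEqualTo", "isSuperclassOf", "mapsTo")):
    """Concepts one step away from `concept` via the relation types in `follow`.

    Following isSuperclassOf means that a query on a broad concept
    also reaches the narrower concepts below it.
    """
    found = set()
    for subject, predicate, obj in triples:
        if subject == concept and predicate in follow:
            found.add(obj)
        # Read the same assertion from the other side where an inverse exists.
        inverse = INVERSE.get(predicate)
        if obj == concept and inverse in follow:
            found.add(subject)
    return found


print(related("IMDI:Participant", EXAMPLES))
print(related("DC:Creator", EXAMPLES))
```

In this sketch a query on IMDI:Participant also reaches DC:Creator, whereas a query on DC:Creator reaches RomeMaps:Editor through the fuzzy mapsTo relation but does not widen to IMDI:Participant.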
3.3 OVM-Geographic Thesaurus
This thesaurus was extracted semi-automatically from a web-representation. For
reasons of simplicity we indicate the thesaurus in table form. It has three entries: (1)
the OVM abbreviation that is used in the metadata records; (2) the geographic name
used by OVM in Dutch and (3) a reference to the appropriate node in the so-called
MPI geographic thesaurus.
OVM Abbreviation               OVM Geo-Name                 MPI Geo-Name (reference to mpi-geo-thesaurus)
OVM.AAA                        Geografische herkomst
OVM.AAA.AAA                    Afrika                       Africa
OVM.AAA.AAA.AAA                Afrikaanse eilanden          Island nations
OVM.AAA.AAA.AAA.AAA            Afrikaanse eilanden- Oost
OVM.AAA.AAA.AAA.AAA.AAA        Comoren                      Comoros
OVM.AAA.AAA.AAA.AAA.AAB        Madagascar                   Madagascar
OVM.AAA.AAA.AAA.AAA.AAB.AAA    Antananarivo
OVM.AAA.AAA.AAA.AAA.AAB.AAB    Betafo
OVM.AAA.AAA.AAA.AAA.AAB.AAC    Nosy Bé
OVM.AAA.AAA.AAA.AAA.AAC        Mauritius                    Mauritius
OVM.AAA.AAA.AAA.AAA.AAD        Seychellen                   Seychelles
OVM.AAA.AAA.AAA.AAB            Afrikaanse eilanden- West
OVM.AAA.AAA.AAA.AAB.AAA        Canarische eilanden
OVM.AAA.AAA.AAA.AAB.AAA.AAA    Tenerife
OVM.AAA.AAA.AAA.AAB.AAB        St. Helena
OVM.AAA.AAA.AAB                Centraal-Afrika              Central Africa
OVM.AAA.AAA.AAB.AAA            Angola                       Angola
OVM.AAA.AAA.AAB.AAA.AAA        Angola:regionaal
OVM.AAA.AAA.AAB.AAA.AAA.AAA    Angola- Noordwest
OVM.AAA.AAA.AAB.AAA.AAB        Bengo
OVM.AAA.AAA.AAB.AAA.AAC        Benguela
OVM.AAA.AAA.AAB.AAA.AAC.AAA    Catumbela
OVM.AAA.AAA.AAB.AAA.AAD        Bié
OVM.AAA.AAA.AAB.AAA.AAD.AAA    Chinguar
OVM.AAA.AAA.AAB.AAA.AAE        Cabinda
OVM.AAA.AAA.AAB.AAA.AAE.AAA    Futila
OVM.AAA.AAA.AAB.AAA.AAE.AAB    Loango
OVM.AAA.AAA.AAB.AAA.AAF        Cuamato
OVM.AAA.AAA.AAB.AAA.AAF.AAA    Forte Rocadas
OVM.AAA.AAA.AAB.AAA.AAG        Cuanza
The OVM geographic thesaurus does not have a canonical hierarchical structure
that could look like:
<continent>
<sub-continent>
<country>
<region>
<place>
...
It leaves out nodes where nothing suitable could be filled in, i.e. countries can
appear at different levels of depth. This makes it difficult to automatically transform
this thesaurus into a canonical structure and it is too large to do a manual
transformation within ECHO. Therefore, the resulting XML structure can only use
arbitrary <struct> tags. This does not harm searching, since the nodes represent
super-classes that can be exploited. The link to a node in the MPI geographic
thesaurus can also be exploited.
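That the nodes act as super-classes even without canonical level names can be illustrated with the dotted OVM abbreviations from the table above. The following Python sketch assumes that every prefix of an abbreviation denotes an enclosing node, which is consistent with the table but is not stated as a rule in the OVM material; the function names are hypothetical.

```python
def ancestors(ovm_code):
    """Enclosing OVM nodes of a dotted abbreviation, nearest first.

    The shortest prefix returned is the root "OVM.AAA".
    """
    parts = ovm_code.split(".")
    return [".".join(parts[:n]) for n in range(len(parts) - 1, 1, -1)]


def is_within(ovm_code, region_code):
    """True if ovm_code is the region itself or lies somewhere below it."""
    return ovm_code == region_code or ovm_code.startswith(region_code + ".")


ANGOLA = "OVM.AAA.AAA.AAB.AAA"
BENGO = "OVM.AAA.AAA.AAB.AAA.AAB"
MADAGASCAR = "OVM.AAA.AAA.AAA.AAA.AAB"

print(ancestors(ANGOLA))
print(is_within(BENGO, ANGOLA), is_within(MADAGASCAR, ANGOLA))
```

A search for Angola can thus include records tagged with Bengo without knowing whether a given node is a country, a region or a place.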
[Figure: the OVM geographic thesaurus and the MPI geographic thesaurus]
The figure indicates the partial match between the two geographic thesauri. In the
geographical domain a partial match means, in the vast majority of cases, that complete
sub-trees can be matched. Only in a few cases at the regional level may the
classification be unclear.
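Because the OVM codes are dotted paths, the super-classes of a node can be read off the code itself, even though the thesaurus is not canonical. The following is a minimal sketch under that assumption; the function name and the small label table are illustrative and not part of the DORA code.

```python
# Labels copied from the table above; only the rows needed for the example.
LABELS = {
    "OVM.AAA": "Geografische herkomst",
    "OVM.AAA.AAA": "Afrika",
    "OVM.AAA.AAA.AAB": "Centraal-Afrika",
    "OVM.AAA.AAA.AAB.AAA": "Angola",
}


def superclasses(code: str) -> list[str]:
    """Return all ancestor codes of a dotted thesaurus code, nearest first."""
    parts = code.split(".")
    # Drop one trailing segment at a time; stop before the bare "OVM" prefix.
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 1, -1)]


ancestors = superclasses("OVM.AAA.AAA.AAB.AAA")
print([LABELS[a] for a in ancestors])
```

For Angola this yields Centraal-Afrika, Afrika and the thesaurus root, regardless of the depth at which the country happens to sit.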
3.4 MPI-Geographic Thesaurus
Because of the non-canonical form of the OVM geographic thesaurus it was decided to
add a second, canonical thesaurus and to enter into it all geographically oriented
names that occur in the metadata records of the other repositories (all except OVM).
An analysis of these metadata records revealed that there were not too many
different names. For example, in the large Fotothek repository only a few names
recur. In the large language domain the categorization is mostly done
systematically down to the country level. Some providers used the region element, but
in total there were not too many different values. It was therefore easy to add all
names to a canonical structure that was extracted semi-automatically from an official
and reliable web-site.
<continents>
<continent>
<cnt-name> Africa </cnt-name>
<dedications>
<ger> Afrika </ger>
</dedications>
<ovm-code> OVM.AAA.AAA </ovm-code>
<sub-continents>
<sub-continent>
<sc-name> North Africa </sc-name>
<ovm-code> OVM.AAA.AAA.AAC </ovm-code>
<countries>
<country>
<cou-name> Algeria </cou-name>
<ovm-code> OVM.AAA.AAA.AAC.AAA </ovm-code>
</country>
<country>
<cou-name> Egypt </cou-name>
<dedications>
<ger> Ägypten </ger>
</dedications>
<ovm-code> OVM.AAA.AAA.AAC.AAB </ovm-code>
</country>
<country>
<cou-name> Libya </cou-name>
<ovm-code> OVM.AAA.AAA.AAC.AAC </ovm-code>
</country>
<country>
<cou-name> Morocco </cou-name>
<ovm-code> OVM.AAA.AAA.AAC.AAD </ovm-code>
</country>
<country>
<cou-name> Sudan </cou-name>
<ovm-code> OVM.AAA.AAA.AAC.AAF </ovm-code>
</country>
<country>
<cou-name> Tunisia </cou-name>
<ovm-code> OVM.AAA.AAA.AAC.AAG.AAX </ovm-code>
<places>
<place>
<pl-name> Tunis </pl-name>
<ovm-code> OVM.AAA.AAA.AAC.AAG.AAY </ovm-code>
</place>
...
</places>
</country>
<country>
...
</country>
</countries>
...
</sub-continent>
...
</sub-continents>
</continent>
...
</continents>
The links to the OVM geographic thesaurus are not yet XML path expressions.
These still have to be generated to obtain a fully XML compliant version that can
easily be re-used by others. For the DORA machinery this is not relevant, since optimal
index structures are generated anyhow for fast processing.
Language dedications are specified only for some entries. It would be too much work
to create names in the different languages for all entries, unless we find
reliable multilingual geographic lexicons.
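The benefit of the canonical structure is that the chain of enclosing regions follows directly from the nesting. The sketch below reads a shortened, well-formed excerpt of the listing above with Python's standard ElementTree parser; the function name and the excerpt are illustrative, and the DORA machinery itself builds its own index structures instead.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed excerpt in the tag vocabulary of the listing above.
SAMPLE = """<continents>
  <continent>
    <cnt-name> Africa </cnt-name>
    <ovm-code> OVM.AAA.AAA </ovm-code>
    <sub-continents>
      <sub-continent>
        <sc-name> North Africa </sc-name>
        <ovm-code> OVM.AAA.AAA.AAC </ovm-code>
        <countries>
          <country>
            <cou-name> Algeria </cou-name>
            <ovm-code> OVM.AAA.AAA.AAC.AAA </ovm-code>
          </country>
        </countries>
      </sub-continent>
    </sub-continents>
  </continent>
</continents>"""

# Element that carries the name for each level of the hierarchy.
NAME_TAGS = {"continent": "cnt-name", "sub-continent": "sc-name",
             "country": "cou-name", "place": "pl-name"}


def name_paths(root):
    """Map each geographic name to the list of names leading down to it."""
    result = {}

    def walk(elem, trail):
        name_tag = NAME_TAGS.get(elem.tag)
        if name_tag is not None:
            name = elem.findtext(name_tag, default="").strip()
            trail = trail + [name]
            result[name] = trail
        for child in elem:
            walk(child, trail)

    walk(root, [])
    return result


paths = name_paths(ET.fromstring(SAMPLE))
print(paths["Algeria"])
```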
3.5 OVM Category Thesaurus
The categories and the Dutch labels of this thesaurus were extracted semi-automatically
from a web representation. For reasons of simplicity we show the
thesaurus in table form. It has three columns: (1) the OVM abbreviation that is used in
the metadata records, (2) the English category name and (3) the original Dutch
category name.
OVM indeling/categories    English                              Dutch
OVM.AAC                    OVM Category                         OVM Categorie
OVM.AAC.AAA                "hunting, fishery, food gathering"   "jacht, visserij, voedselgaring"
OVM.AAC.AAA.AAA            hunting                              jacht
OVM.AAC.AAA.AAA.AAA        hunting without tools                jacht zonder handwerktuigen
OVM.AAC.AAA.AAA.AAB        hunting with lures                   jacht met lokmiddelen
OVM.AAC.AAA.AAA.AAC        hunting with traps and snares        jacht met vallen en strikken
OVM.AAC.AAA.AAA.AAE        hunting with weapons                 jacht met wapens (inclusief accessoires)
OVM.AAC.AAA.AAA.AAE.AAA    hunting with fist weapons            jacht met handwapens
OVM.AAC.AAA.AAA.AAE.AAB    hunting with projectiles             jacht met projectielen
OVM.AAC.AAA.AAB            fishery                              visserij
OVM.AAC.AAA.AAB.AAA        fishery without tools                visserij zonder handwerktuigen
OVM.AAC.AAA.AAB.AAB        fishery with lures                   visserij met lokmiddelen
OVM.AAC.AAA.AAB.AAC        fishery with traps and nets          visserij met vallen en netten
OVM.AAC.AAA.AAB.AAE        fishery with weapons                 visserij met wapens (inclusief accessoires)
OVM.AAC.AAA.AAB.AAE.AAA    fishery with fist weapons            visserij met handwapens
OVM.AAC.AAA.AAB.AAE.AAB    fishery with projectiles             visserij met projectielen
OVM.AAC.AAA.AAC            gathering food                       voedsel verzamelen
OVM.AAC.AAB                "weapons, warfare, war"              "wapens, strijd en oorlog"
OVM.AAC.AAB.AAA            fist weapons and accessories         handwapens en accessoires
Since the IconClass thesaurus uses English labels, and since the user interface
should offer at least English labels, all entries were translated into English
as well. It would be too much work within ECHO to generate dedications for other
languages. This should be done semi-automatically with appropriate
technology.
An XML version is currently being created and will be made public at the end of
the ECHO project.
3.6 Iconclass Category Thesaurus
The categories of this thesaurus were extracted semi-automatically from a CD-ROM.
Again, for reasons of simplicity we show the thesaurus in table form. It has two
columns: (1) the IC abbreviation that is used in the metadata records and (2) the
English category label.
IC code    English category label
1          Religion and Magic
10         (symbolic) representations ~ creation, cosmos, cosmogony, universe, and life (in the broadest sense)
11         Christian religion
11A        Deity, God (in general) ~ Christian religion
11A1       God the Creator
11A11      God measuring the Universe (with compasses)
11A2       Divine Nature
11A21      Divinity, 'Divinità' (Ripa)
11A22      symbols ~ Divine Nature
11A221     circle symbolizing God's perfectness
11A23      God's perfections
11A3       God's wrath
11A31      'Flagello di Dio' (Ripa)
11B        the Holy Trinity, 'Trinitas coelestis'; Father, Son and Holy Ghost ~ Christian religion
11B1       Trinity represented by tripartite symbols
11B11      symbols of the Trinity ~ circular and/or triangular forms or arrangements
11B114     three animals, geometrically arranged within a circle or triangle
11B12      Trinity represented as a person with three heads
11B13      Trinity represented by three animals sharing one head
11B14      other tripartite symbols of the Trinity
11B2       Trinity in which each of the Persons (God, Christ, Holy Ghost) is represented either by an object or by an animal
11B21      representation of the Trinity: hand (Father), lamb (Son), and dove (Holy Ghost)
11B22      representation of the Trinity: hand, cross and dove
11B23      representation of the Trinity: hand, chalice and dove
11B3       Holy Trinity in which one, two or all figures are represented in human shape
11B31      Trinity as three persons
11B32      Trinity in which God the Father and Christ are represented as persons, the Holy Ghost as dove
11B321     God the Father seated, holding the youthful Christ (Emmanuel) in his lap
11B322     God the Father and Christ enthroned
11B3231    God the Father holding the crucifix, 'Gnadenstuhl', Mercy-Seat, Throne of Grace
11B3232    God the Father standing or seated, holding the body of Christ, 'Pitié-de-Notre...
11B33      representations of the Trinity
Extracting a clean, complete and well-structured file was not trivial, and
some of the work had to be carried out manually. The thesaurus had to be complete
since many mappings were found between OVM and IconClass nodes.
An XML version is currently being created and will be made public at the end of
the ECHO project, provided there are no IPR restrictions. This has to be discussed
with KNAW.
3.7 IconClass-to-OVM Mapping
This mapping file is the result of a careful one-directional comparison. The
comparison could only be done manually, since a formal comparison based on
purely linguistic knowledge could lead to misleading results. The context had to be
considered to arrive at the right interpretations.
<mappings>
<mapping>
<ic-code> 1 </ic-code>
<ic-label> Religion and Magic </ic-label>
<ovm-mapping>
<ovm-code> OVM.AAC.AAN.AAC </ovm-code>
<ovm-label> altars, sanctuaries and their interior decoration and furniture </ovm-label>
</ovm-mapping>
<ovm-mapping>
<ovm-code> OVM.AAC.AAN.AAD </ovm-code>
<ovm-label> sacrifices </ovm-label>
</ovm-mapping>
<ovm-mapping>
<ovm-code> OVM.AAC.AAN.AAF </ovm-code>
<ovm-label> ritual appliances </ovm-label>
</ovm-mapping>
<ovm-mapping>
<ovm-code> OVM.AAC.AAN.AAG </ovm-code>
<ovm-label> symbols of religious status </ovm-label>
</ovm-mapping>
</mapping>
<mapping>
<ic-code> 10 </ic-code>
<ic-label> Religion and Magic </ic-label>
<ovm-mapping>
<ovm-code> OVM.AAC.AAN.AAC </ovm-code>
<ovm-label> (symbolic) representations, creation, cosmos, cosmogony, universe, life </ovm-label>
</ovm-mapping>
</mapping>
<mapping>
<ic-code> 13 </ic-code>
<ic-label> magic, supernaturalism, occultism </ic-label>
<ovm-mapping>
<ovm-code> OVM.AAC.AAN.AAB </ovm-code>
<ovm-label> cult objects and other holy objects </ovm-label>
</ovm-mapping>
</mapping>
<mapping>
<ic-code> 13C3 </ic-code>
<ic-label> magic objects, apotropaia </ic-label>
<ovm-mapping>
<ovm-code> OVM.AAC.AAN.AAE </ovm-code>
<ovm-label> magical protection and defence </ovm-label>
</ovm-mapping>
</mapping>
...
</mappings>
In contrast to the geographic mapping described above, a mapping between two
nodes often does not mean that the complete sub-trees map onto each other. A complete
analysis would be too much for ECHO and has to be left to other projects.
[Figure: OVM category thesaurus and IconClass category thesaurus]
As indicated above, there will be much debate about particular mappings. It is
therefore all the more important that individuals or groups are able to influence the
inferencing by modifying the mappings easily. This requires open definitions, as
envisaged for example in ISO TC37/SC4 based on ISO 11179 and ISO 12620,
and suitable tools. In the area of cultural heritage, however, we are still far away
from such a situation.
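Since the mappings are plain XML, anyone who wants to adapt them can load the IconClass-to-OVM file into a simple lookup table. The following sketch assumes a well-formed file with the element names used in the listing of this section; the function name and the two-entry excerpt are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Well-formed excerpt of the IconClass-to-OVM listing shown in this section.
SAMPLE = """<mappings>
  <mapping>
    <ic-code> 1 </ic-code>
    <ic-label> Religion and Magic </ic-label>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAD </ovm-code>
      <ovm-label> sacrifices </ovm-label>
    </ovm-mapping>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAF </ovm-code>
      <ovm-label> ritual appliances </ovm-label>
    </ovm-mapping>
  </mapping>
  <mapping>
    <ic-code> 13C3 </ic-code>
    <ic-label> magic objects, apotropaia </ic-label>
    <ovm-mapping>
      <ovm-code> OVM.AAC.AAN.AAE </ovm-code>
      <ovm-label> magical protection and defence </ovm-label>
    </ovm-mapping>
  </mapping>
</mappings>"""


def load_ic_to_ovm(xml_text):
    """Return {IconClass code: [(OVM code, OVM label), ...]}."""
    table = defaultdict(list)
    for mapping in ET.fromstring(xml_text).iter("mapping"):
        ic_code = mapping.findtext("ic-code", default="").strip()
        for target in mapping.findall("ovm-mapping"):
            code = target.findtext("ovm-code", default="").strip()
            label = target.findtext("ovm-label", default="").strip()
            table[ic_code].append((code, label))
    return dict(table)


table = load_ic_to_ovm(SAMPLE)
print(table["13C3"])
```

Editing a mapping then amounts to changing one `<ovm-mapping>` element and reloading the file.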
3.8 OVM-to-IconClass Mapping
This mapping file is complementary to the one-directional comparison described
above. For the same reasons, this comparison could also only be done manually.
<mappings>
<mapping>
<ovm-code> OVM.AAC.AAA.AAA.AAA </ovm-code>
<ovm-label> hunting without tools </ovm-label>
<ic-mapping>
<ic-code> 43C111 </ic-code>
<ic-label> game, hunted animals, hunt, bird hunting </ic-label>
</ic-mapping>
</mapping>
<mapping>
<ovm-code> OVM.AAC.AAA.AAA.AAB </ovm-code>
<ovm-label> hunting with lures </ovm-label>
<ic-mapping>
<ic-code> 43C132 </ic-code>
<ic-label> duck decoy </ic-label>
</ic-mapping>
<ic-mapping>
<ic-code> 43C1(+43) </ic-code>
<ic-label> lures (hunting) </ic-label>
</ic-mapping>
</mapping>
<mapping>
<ovm-code> OVM.AAC.AAA.AAA.AAC </ovm-code>
<ovm-label> hunting with traps and snares </ovm-label>
<ic-mapping>
<ic-code> 43C131 </ic-code>
<ic-label> finch trap, finchery </ic-label>
</ic-mapping>
</mapping>
...
</mappings>
The comments made on the IconClass-to-OVM mapping above apply here as well.
3.9 MPI Content List
To achieve content mappings where possible, it is important to try to map all
content-describing elements from all metadata sets to the thesauri used by RMV and
Fotothek and, of course, to find links between them. We extracted the list of all
values found so far and are currently comparing the entries. All of this can only be
done manually.
<mappings>
<mapping>
<mpi-label> Speech </mpi-label>
<ic-code> 31B6235 </ic-code>
<ic-label> speaking </ic-label>
</mapping>
<mapping>
<mpi-label> writing </mpi-label>
<ic-code>49L11</ic-code>
<ic-label> handwriting, writing as activity </ic-label>
<ovm-code> OVM.AAC.AAK.AAB </ovm-code>
<ovm-label> script </ovm-label>
</mapping>
<mapping>
<mpi-label> Speech, some gesture </mpi-label>
<ic-code>31B6235</ic-code>
<ic-label> speaking </ic-label>
<ic-code>31A25</ic-code>
<ic-label> postures and gestures of arms and hands </ic-label>
</mapping>
...
</mappings>
4. ECHO Knowledge Repositories
In chapter 3 we made some comments about the need for flexible knowledge
representation infrastructures in the area of cultural heritage. This is mainly due to
the fact that people will not agree on definitions, so it should be possible to add
new definitions. The mappings are even more problematic, since only in a few cases
can one speak of a perfect match.
For the thesaurus mappings we have not yet used relation types. It is beyond
the scope of the ECHO project to sort out how the inherent semantics can be
modeled more precisely so that the mappings can be exploited in a more fine-grained
way. Currently, all mappings between thesaurus nodes are of the type “mapsTo”,
which implements a fuzzy mapping indicating some form of overlap without being
more precise.
To arrive at a more open and flexible knowledge representation infrastructure we will
set up an ISO TC37/SC4 compliant repository and start defining the DORA
categories with the help of this framework. For the mapping files, appropriate open
repositories will be offered at the MPI web address, including all schemas [16]. RDF
seems to be a primary candidate for the representation in the Semantic Web era.
Currently, however, XML is seen as sufficient. This could allow everyone to
modify aspects of the mappings and use them in their own machinery.
We see this start of an open knowledge representation infrastructure as one of the
outcomes of ECHO. The current DORA machinery will not make use of this open
infrastructure, since it would cost too much effort to rewrite all programs and scripts.
5. Exploitation
Within ECHO we have created a practical ontology covering a number of knowledge
components. From careful inspection of certain representations such as the thesauri
we could identify many useful mappings that can be exploited by the DORA
machinery. However, we cannot yet say enough about how the various
metadata categories are used by the people who generate the metadata descriptions.
From experience we know that there is some semantic spreading, but we cannot yet make
any quantitative statements.
When DORA uses the full set of components described here [17], we have to start
investigating how effective the mappings are in exploiting possible relations
between the different domains and sub-domains. Here we are only at the beginning.
This is partly due to the fact that only a few repositories are large
(Fotothek, RMV, Languages).
[16] Before doing this at the end of the ECHO project we have to check the IPR situation.
[17] The machinery is constantly being extended with the goal of being ready by the end of April 2004.
C. WP2 Note on the DORA Search Engine
Peter Wittenburg
9.5.2004
In two reports we have described the DORA [18] concept and the underlying mapping
scheme (WP2-TR16-2004) and its ontology components (WP2-TR17-2004). In this
document we describe the search engine and summarize its evaluation [19].
While the DORA document describes the intentions and possibilities, this document
describes what was implemented. It is not a technical documentation, but describes
in some detail which implementation decisions were taken and which problems
were encountered. The search engine is based on the mappings described in the
DORA note and in the Ontology note, i.e., it implements the mappings and semantic
relations in specific ways to achieve high performance.
The evaluation part has to consider two aspects: (1) the formal correctness of the
algorithms has to be checked and (2) the usefulness and appropriateness of the
semantics included in DORA have to be evaluated. Finally, answers to the following
three questions have to be given:
• Are the chosen semantic relations useful?
• Does interdisciplinary metadata help to answer questions?
• What kind of infrastructure is necessary to overcome current limitations?
It should be noted here that about 95,000 records are included and that their
distribution is uneven. Searching obviously only makes sense in
large collections such as those delivered by Fotothek (75715 records) and Languages
(17403 records). The relatively small number of records provided by the other
repositories at this moment (20 to 1100) limits the strength of the evaluation. All
data offered by the data providers was integrated [20].
1. Search Engine
In this chapter we describe the actual DORA interface, the harvesting
principles, the data correction steps that had to be taken, the index creation
process and the searching process. It should be mentioned that the DORA engine is
largely implemented in Java [21].
1.1 DORA Interface
The DORA interface was implemented as described in the original DORA document.
However, during the ECHO project it became apparent that some of the goals were
too ambitious to be met within the short period of time. Everyone interested can
make use of the DORA engine, which is available at the following URL:
http://corpus1.mpi.nl/ds/dora/
[18] Digital Open Resource Area: see WP2-TR16-2003; web-site to come
[19] The evaluation will be updated in May 2004
[20] In the case of the RMV repository it is being checked why not more than the current 20 records can be harvested.
[21] A technical documentation will go into more detail
The user can select the disciplines and within the disciplines the data providers to
be included in the search. The disciplines are indicated by images and the data
providers by menu lists. The interface offers two search options: (1) In simple search
the user can specify words that are searched for in all metadata fields provided
including full-text fields that contain prose-text. (2) In complex search the user can
select a view that is derived from the vocabulary used by the different data
providers. All details of these views are explained in the DORA note.
Originally, it was intended to include browsing, geographical browsing and
annotations in the search. These features were not implemented. Languages is the
only domain where browsing is available, so here it makes sense to go to
the language portal directly. Geographical browsing turned out to be too
difficult to implement within the ECHO period. Because of the large differences in
scale (from continents to maps of ancient Rome) we would have needed scalable maps
that allow stepping down to details of Rome, and it was seen as too much work to
provide the exact coordinates of all locations involved in the DORA domain. Metadata
descriptions do not yet include formal geographical coordinates from which points
could be created automatically.
The option to search on annotations is provided, and it would not be too difficult to
add annotations to the index; however, the option is not yet effective. Here, too,
some plans were too ambitious to be realized in the short ECHO period. The idea in the
history of science was to relate web-sites to each other by entering typed relations.
These annotations would be excellent resources to integrate into searches. So far,
however, no such data could be created.
It should be mentioned that the interface is driven by a configuration file, i.e., it
can easily be adapted to other configurations that would imply other
• disciplines
• data providers within them
• views
Every data source in DORA gets an ID which is used as the key for combining the
different knowledge components.
1.2 Harvesting
Data providers within ECHO deliver their data in different ways, as the table indicates.
Data provider   Status       Delivery
NECEP           online       XML
RMV             online       OAI
Languages       online       XML/OAI
Lineamenta      off-line     email
CIPRO           off-line     email
Fotothek        off-line     email
IMSS            online       OAI
Berlin          not yet up
Philosophy      online       XML
Five collections were online and could be harvested according to various
schemes. Three of them offer an OAI-PMH compliant interface. In
the case of Languages the XML variant was preferred since it includes all metadata
fields. Three further data sources (Lineamenta, CIPRO, Fotothek) extracted files at
certain moments and provided them by email. In this latter case a harvesting concept
was not applicable.
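For the OAI-based providers, a harvesting run consists of repeated `ListRecords` requests. The sketch below only builds the request URLs as defined by the OAI-PMH protocol (`verb`, `metadataPrefix`, `resumptionToken`); the base URL and the token value are placeholders, and the function is an illustration, not the DORA harvester.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH the resumptionToken is an exclusive argument:
        # it replaces metadataPrefix when fetching the next chunk.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Placeholder address; a real provider publishes its own base URL.
first = list_records_url("http://example.org/oai")
following = list_records_url("http://example.org/oai", resumption_token="abc123")
print(first)
print(following)
```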
For those data sources that could be harvested a process file was created. It can be
modified in a simple way with the help of a web-interface. The following parameters
can be defined via this interface to tune the harvesting engine:
• data provider ID
• frequency of harvesting
• time of day to execute the harvesting (hour/minute)
• day to execute the harvesting
• import prefix
• classpath to the data processing programs
• the label of the data provider
• root URL as harvesting address
In addition, the file contains parameters such as the location of the logging
information, the date and time of the last harvesting, etc.
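The parameters listed above can be pictured as one record per data provider. The following is a hedged sketch of such a record with a few plausibility checks; all field names, the example values and the class name are assumptions made for illustration, since the actual format of the process file is not described in this note.

```python
from dataclasses import dataclass


@dataclass
class HarvestProcess:
    """One entry of a harvesting process file (field names are hypothetical)."""
    provider_id: str    # data provider ID
    label: str          # label of the data provider
    root_url: str       # root URL used as harvesting address
    frequency: str      # frequency of harvesting, e.g. "weekly" (assumed value)
    day: str            # day on which to execute the harvesting
    hour: int           # time of day: hour
    minute: int         # time of day: minute
    import_prefix: str
    classpath: str      # code that knows how to fetch and preprocess the data

    def problems(self) -> list[str]:
        """Return a list of human-readable configuration problems."""
        found = []
        if not 0 <= self.hour <= 23:
            found.append("hour must be between 0 and 23")
        if not 0 <= self.minute <= 59:
            found.append("minute must be between 0 and 59")
        if not self.root_url.startswith(("http://", "https://")):
            found.append("root URL must be an http(s) address")
        return found


example = HarvestProcess(
    provider_id="imss", label="IMSS", root_url="http://example.org/oai",
    frequency="weekly", day="Sunday", hour=2, minute=30,
    import_prefix="oai_dc", classpath="org.example.dora.OaiImporter",
)
print(example.problems())
```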
The classpath reference is of great importance since it refers to executable code
that contains the knowledge about how to fetch the data from the specified URL
(OAI/XML) and how to preprocess the data delivered by the source.
A log file is created that contains protocol information describing the harvesting
process. In addition to the information mentioned above it records how many records
were received per source and which types of errors were encountered. This file is also
used to document other steps and to log the query handling.
1.3 Data Pre-Processing
The data delivered had to be corrected and modified in different ways. Here we can
only give a few examples. The purpose of this chapter is not to complain, but to
show the problems one faces at the various levels when building an interoperable
metadata domain. Initiatives such as OAI have great value, although
the metadata harvesting protocol is very simple. Its wide acceptance makes clear to
every data provider that it is the task of the data provider, and not of the service
provider, to provide correct data. Experience, not only in ECHO, shows that
we are still far away from that goal.
Much effort was caused by changes over time in the data delivered. The language
domain changed the IMDI version, so that new XPaths were necessary and new
mappings had to be established. This step, however, was an explicit one supported
by proper schemas. In many other cases changes were made without notice or without
providing a schema. Path corrections could then only be carried out after visual
inspection.
OAI-PMH Type of Harvesting (RMV, IMSS)
In the case of OAI harvesting the preprocessing was comparatively simple.
This has to do with the fact that a validation check is carried out when registering
as an OAI data provider. A schema has to be provided and the data delivered is
validated against this schema, i.e., correct data can be assumed at the encoding and
syntax level. Still, some pre-processing had to be carried out at the content encoding
level, since this is beyond the reach of schemas. Due to the limited number of fields
in Dublin Core, RMV chose to package different types of information into one Dublin
Core field. During preprocessing these had to be separated again. Also, some of the
encodings had to be interpreted and modified to separate formal encodings from
explanatory (and therefore searchable) strings. In principle, however, the choice of
OAI to put all validation errors on the shoulders of the data provider seems to be the
best one can do. It requires that the data providers, who know their data very well,
take the responsibility for cleaning up all encoding and syntax problems. In general,
the broad semantic definitions of Dublin Core fields such as DC:Coverage or
DC:Subject make it difficult to create suitable mappings at the semantic level. In
some cases it is too early to make statements about the usage of such fields.
XML Type of Harvesting (NECEP, Languages, Philosophy)
In the case of harvesting XML data available online, a schema was available in two
cases (NECEP, Languages) and validation was carried out by the data provider,
so proper metadata was delivered. In the case of Philosophy, IMDI-type metadata
descriptions were created manually from the given texts, so proper
schema-based metadata was available here as well. In fact, the philosophy data
consists of textual descriptions that were treated as prose descriptions, i.e., they
are not part of the complex search but are integrated into the index for simple search.
In the language case a major schema change was made during the DORA work,
so several utility files containing XPaths etc. had to be adapted. Some
repositories, such as those created by Lund University within ECHO, still use
the old IMDI version, i.e., it had to be recorded which version is used for the
different parts of the language domain. A proper harvesting scheme would therefore
have to check the version of the underlying schema regularly to make sure that the
settings are still valid. The IMDI import module has the appropriate knowledge and can
adapt the import schema; the XPath specifications, however, have to be updated.
Static Harvesting (other providers)
In the case of the other data providers in ECHO static files were exchanged, in
general by email. As far as we know, the XML data was generated by extracting data
from relational database repositories of different types. Here many problems were
encountered. Again it should be mentioned that our colleagues did their best to
provide useful metadata; the problems simply reflect the state of the technology.
• lack of proper XML headers;
• no UTF-8 character encoding although the XML header claims it [22];
• lack of an XML schema, which prevents any validation;
• invalid XML constructions;
• existence of several XML document headers in one file;
• changes of the underlying schema
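The encoding problem in the list above can be caught before any parsing by decoding the raw bytes strictly. This is a minimal sketch of such a check, not a description of what the DORA pre-processing actually did; the function name is illustrative.

```python
def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError as err:
        return err.start
    return None


good = "Bié".encode("utf-8")    # what the XML header promises
bad = "Bié".encode("latin-1")   # what some delivered files contained
print(first_invalid_utf8(good), first_invalid_utf8(bad))
```

Reporting the offset makes it possible to point the data provider to the exact place in the file.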
In the case of the Fotothek it was known that the records are highly nested, so a
normalized structure had to be created. It was not always clear to the DORA
developers which of the fields had to be replicated.
It also became apparent that the encodings found in the metadata records did not
match the encodings found in, for example, the thesauri. Some pre-processing had to
be done here as well.
Normalized validated DORA Repositories
Before any further processing, normalized and validated (as far as
possible) XML files were created for all repositories. These are part of the DORA
ontology and have a documented structure, so that the XPath definitions contained in
the various other resources are correct. In general, this pre-processing step was
necessary to arrive at useful repositories, but it took too much time.
[22] These kinds of problems are very serious, since no errors are raised during parsing. In general, errors can only be detected when searches do not lead to appropriate results. For example, the string “Milano” was not extended via the geographic thesaurus as a subpart of “Italy” and “Europe” since it contained non-UTF-8 character encodings. We assume that some of these errors are still hidden in the index.
When creating these normalized XML files the punctuation characters were also
removed from the data to allow proper and easy matching. For presentation
purposes the original string is preserved as well.
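The punctuation step can be pictured as producing a pair of strings per value: one for matching and one for display. The sketch below assumes that punctuation is replaced by a space and that only ASCII punctuation is handled; both are simplifications chosen for illustration, not documented DORA behaviour.

```python
import string

# Translation table that turns every ASCII punctuation character into a space.
_PUNCTUATION = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(value: str):
    """Return (matching key, original string) for one metadata value."""
    key = " ".join(value.translate(_PUNCTUATION).split())
    return key, value


print(normalize("Angola- Noordwest"))
print(normalize("pregnancy, birth and first year"))
```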
1.4 Index Creation
Since DORA now contains about 95,000 records and since it can be expected that
this number will increase rapidly, it was decided to focus on fast indexing
mechanisms and to do as much semantic processing as possible off-line, i.e., not
during search. Exploiting the different knowledge components in real time would lead
to unacceptable delays. It was decided to use a binary tree in which every word found
anywhere in the metadata descriptions (including the prose texts) is included as a
sequence of nodes. With proper encoding techniques such a tree guarantees
almost equal access times for all queries. It was checked whether an API provided
by one of the existing search engines could be used. Since the search
algorithm itself was not seen as the component that would take much time, this
option was not chosen, i.e., a tree-traversing algorithm was programmed on the basis
of existing experience and knowledge.
Before creating the index tree the semantic extension had to take place. To
accomplish this, the codes found in the Fotothek and RMV metadata
descriptions were first replaced by the corresponding strings or separated,
respectively. At the same time the mappings between the three content thesauri were
used to add the appropriate strings (iconclass2ovm-mapping-v3.xml,
ovm2iconclass-mapping-v3.xml, IMDI2iconclass-and-ovm-v1.xml). Due to the semantic
vagueness of the entries found and of the relations between the thesauri it was
decided not to extend to all super-classes in the thesauri. Tests have shown that this
would result in a semantic explosion and a decrease in precision [23]. The following
example may illustrate the operation.
The following relation is taken from the iconclass2ovm-mapping file. A specific
Iconclass code has relations to two OVM codes.
31D                               Iconclass code that maps to OVM classes
human life and its ages           corresponding Iconclass string
OVM.AAC.AAM                       OVM code
life cycle                        appropriate OVM string
OVM.AAC.AAM.AAA                   OVM code
pregnancy, birth and first year   appropriate OVM string
When the entry “31D” is found in a record of the Fotothek repository, it is first
replaced by the corresponding string. Then the two semantically overlapping strings
of the OVM thesaurus are added. The resulting entry is thus transformed from
“31D” to
“human life and its ages; life cycle; pregnancy, birth and first year”
In this way the user also finds this entry if the search string “life cycle” is
entered.
[23] Here the term “precision” is used as known from the field of information extraction. It indicates what fraction of the hits obtained is appropriate. A decrease in precision means that too many “wrong” hits were found.
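The replacement step of the “31D” example can be sketched as follows. The three small dictionaries stand in for the Iconclass thesaurus, the mapping file and the OVM thesaurus; their names are hypothetical, and only the values quoted in the example are included.

```python
# Stand-ins for the thesauri and the mapping file (example values only).
ICONCLASS_LABELS = {"31D": "human life and its ages"}
IC_TO_OVM = {"31D": ["OVM.AAC.AAM", "OVM.AAC.AAM.AAA"]}
OVM_LABELS = {
    "OVM.AAC.AAM": "life cycle",
    "OVM.AAC.AAM.AAA": "pregnancy, birth and first year",
}


def extend_code(ic_code: str) -> str:
    """Replace an Iconclass code by its label plus the mapped OVM labels."""
    parts = [ICONCLASS_LABELS[ic_code]]
    # Only directly mapped nodes are added; super-classes are not followed.
    parts += [OVM_LABELS[code] for code in IC_TO_OVM.get(ic_code, [])]
    return "; ".join(parts)


print(extend_code("31D"))
```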
For all geographic information a full extension was made. Two thesauri were used:
ovm-geo-thesaurus-v3.xml and mpi-geo-thesaurus-v4.xml. The first is used
for the OVM collection; the second was assembled by looking through all
geographically relevant fields in the other repositories, including the names of
museums, names of languages spoken in an area, etc. (for more details we refer to the
ontology document). Where possible, names in languages other than English were
added as well [24]. So if Milano was found, Milan and Mailand were also added.
The mpi-geo-thesaurus-v4 thesaurus also contains mappings to the appropriate
categories in the OVM thesaurus. The following example is taken from the
mpi-geo-thesaurus-v4 thesaurus.
West Africa     OVM.AAA.AAA.AAE
Benin           OVM.AAA.AAA.AAE.AAA.AAA
Burkina Faso    OVM.AAA.AAA.AAE.AAB.AAA
<lang>Dogon
It says that Benin and Burkina Faso can be found in West Africa and that the
language Dogon is spoken in the area of Burkina Faso. During index creation,
name variants and super-ordinate regions were therefore added to an entry such as
“Milano”. This would result in the entry
“Milano, Milan, Mailand, Italy, Italien, Italia, Europe, Europa”
The corresponding record would thus be returned as a hit if, for example, the string
“Italien” were used to specify the location in a query. In this case hierarchy
extension makes sense, since the geographic concepts are exactly defined.
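The full geographic extension of the “Milano” example amounts to walking up the hierarchy and collecting the name variants at each level. In this sketch the two dictionaries are hypothetical stand-ins for the geographic thesaurus and contain only the values from the example.

```python
# Stand-ins for the geographic thesaurus (example values only).
PARENT = {"Milano": "Italy", "Italy": "Europe"}
VARIANTS = {
    "Milano": ["Milan", "Mailand"],
    "Italy": ["Italien", "Italia"],
    "Europe": ["Europa"],
}


def extend_place(name: str) -> str:
    """Add name variants and all super-ordinate regions to a place name."""
    terms = []
    node = name
    while node is not None:
        terms.append(node)
        terms.extend(VARIANTS.get(node, []))
        node = PARENT.get(node)  # None once the top of the hierarchy is reached
    return ", ".join(terms)


print(extend_place("Milano"))
```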
Since only one index is used for both simple and complex search, special care had
to be taken over how the extension is done for prose text. For keyword-type
metadata elements it was assumed that the vocabulary is used properly, i.e. we
expect to find the complete string for an institution such as “Sterling and Francine
Clark Art Institute” (an institution in Williamstown, Massachusetts, USA). This allows
us to match the complete string and therefore to reduce the chance of false hits.
In prose text, however, we may find different variants of such a string, such as “the
Art Institute from Sterling and Francine Clark”; nevertheless the search engine
should find the entry. We could only implement policies that do not rely on advanced
Natural Language Processing. Therefore, during the extension it was allowed to
break the string down and to match, for example, “Sterling” alone. Such a policy
increases the risk of false hits, but if the query contains more information
such as “Francine Clark”, the records that come from the mentioned institution
get a high rating and appear at the top.
The result of these processes is a large index file that includes, for each node in
the tree, all necessary types of information such as Document ID, Repository ID and
XPath information. So when a hit is found, it can for example immediately be
determined where it comes from.
[24] This could only be done in a limited and unsystematic way to help using the DORA engine.
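An index in which each word is stored as a sequence of nodes, with the source information attached to the node that ends the word, can be sketched as a character trie. This is an assumption about the general shape only: the report speaks of a binary tree with special encoding, whose details are not given here, and the class and field names below are illustrative. The prefix lookup is likewise an assumed way of obtaining hits for “horses” from the query “horse”.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class Posting(NamedTuple):
    document_id: str
    repository_id: str
    xpath: str


@dataclass
class Node:
    children: dict = field(default_factory=dict)
    postings: list = field(default_factory=list)


class WordIndex:
    def __init__(self):
        self.root = Node()

    def add(self, word, posting):
        """Store the word as a chain of nodes and attach its source."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.postings.append(posting)

    def _find(self, text):
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def lookup(self, word):
        """Postings of exactly this word."""
        node = self._find(word)
        return list(node.postings) if node else []

    def lookup_prefix(self, prefix):
        """Postings of all words starting with the prefix."""
        start = self._find(prefix)
        found, stack = [], [start] if start else []
        while stack:
            current = stack.pop()
            found.extend(current.postings)
            stack.extend(current.children.values())
        return found


index = WordIndex()
hit = Posting("doc-17", "IMSS", "/record/title")  # hypothetical identifiers
index.add("horses", hit)
print(index.lookup("horses"), index.lookup_prefix("horse"))
```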
1.5 Searching
Searching is done simply by traversing the binary tree for every entry found in the
query. This results in a number of hits, which are filtered according to the selections
made in the interface. When looking for the string “horse”, the hits for the
morphological variant “horses” are also used. So far, no lexical processing is used
in the search algorithm.
Filtering is applied for domains, for sub-domains and, in complex search, for the
field names. The latter includes all semantic mapping relations between the
metadata categories as explained in the DORA note. In this way the task of
semantic mapping is reduced to a filtering step, which makes mapping very fast.
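The combination of tree lookup and filtering can be sketched roughly as follows. The sketch assumes an unbalanced binary search tree keyed on index strings and filters only by domain; the classes Posting and Node, the functions and the sample data are hypothetical and only illustrate the principle, not the actual DORA index format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Posting:
    """Where an index string was found (all field names are assumptions)."""
    document_id: str
    repository_id: str
    xpath: str
    domain: str


@dataclass
class Node:
    key: str
    postings: list = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, posting):
    """Insert a posting under key and return the (possibly new) root."""
    if root is None:
        return Node(key, [posting])
    if key == root.key:
        root.postings.append(posting)
    elif key < root.key:
        root.left = insert(root.left, key, posting)
    else:
        root.right = insert(root.right, key, posting)
    return root


def lookup(root, key):
    """Traverse the tree and return the postings stored under key."""
    node = root
    while node is not None:
        if key == node.key:
            return node.postings
        node = node.left if key < node.key else node.right
    return []


def search(root, query, domains=None):
    """Look up every query word, then filter the hits by the selected domains."""
    hits = []
    for word in query.lower().split():
        for posting in lookup(root, word):
            if domains is None or posting.domain in domains:
                hits.append(posting)
    return hits


# Invented sample data.
root = None
root = insert(root, "horses", Posting("doc1", "IMSS", "/record/title", "IMSS"))
root = insert(root, "horses", Posting("doc2", "Fotothek", "/record/iconography", "Fotothek"))
root = insert(root, "mill", Posting("doc1", "IMSS", "/record/title", "IMSS"))

print(len(search(root, "horses")))
print([p.document_id for p in search(root, "horses", domains={"IMSS"})])
```

In this sketch the selection made in the interface only narrows the list of postings after the lookup, which mirrors the idea of treating mapping as a cheap filtering step.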
A simple ranking mechanism is applied in the search algorithm. When two or more
separate items of a query, as for example in “Sterling and Francine Clark Art Institute” (5
different items), all result in hits, then the hit receives a very high ranking. Further,
the number of occurrences of a certain string in a metadata record is used to
increase the ranking. We can therefore speak of three ranking levels: (1)
Highest ranking for the co-occurrence of multiple words appearing in the query. (2)
Moderate ranking when a word occurs several times in a record. (3) Lowest ranking
for a single occurrence of one word of the query string.
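The three ranking levels can be expressed in a few lines. The numeric values 3, 2, 1 and 0 and the function name rank_level are assumptions made for this sketch; the report does not state how the levels are encoded.

```python
def rank_level(record_words, query_words):
    """Return 3, 2, 1 or 0 according to the three ranking levels (0 = no hit)."""
    counts = {word: record_words.count(word) for word in set(query_words)}
    matched = [word for word, count in counts.items() if count > 0]
    if len(matched) >= 2:
        return 3  # several different query words co-occur in the record
    if len(matched) == 1 and counts[matched[0]] > 1:
        return 2  # one query word occurs several times
    if len(matched) == 1:
        return 1  # one query word occurs once
    return 0


# Invented records, already split into lower-case words.
records = {
    "a": ["sterling", "and", "francine", "clark", "art", "institute"],
    "b": ["clark", "family", "letters", "clark"],
    "c": ["a", "note", "on", "clark"],
}
query = ["francine", "clark"]

ranked = sorted(records, key=lambda rid: rank_level(records[rid], query), reverse=True)
print(ranked)
```

Sorting by this level puts the record that matches both “francine” and “clark” first, followed by the record with a repeated word and then the record with a single occurrence.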
For the hits, all information supplied by the data providers is used
to give feedback as quickly as possible. In the above figures a few examples are
given. The first example is the result of entering “horses” in simple search. It results
in 8 hits from three different domains. In the case of the IMSS hits a back link is
provided to the web page with the following object: “PAOLO SANTINI (after
TACCOLA) - Double-grindstone mill powered by two horses”, i.e., clicking on
the back link brings up the page shown.
In the case of languages, when querying for example “wittenburg”, a resource
with gesture data is shown. Clicking on the back link first gives the metadata
entry, from which the annotations with the appropriate video fragment can be requested.
Two options are available: (1) The annotations created with ELAN can be viewed
as HTML, where clicking on an annotation will activate the appropriate
video fragment. (2) ELAN can generate a SMIL object (see footnote 25) which is
addressable via the metadata. Clicking on it shows streaming video with subtitles.
ELAN allows the user to select the tiers to be shown and the time fragment of interest.
In the third example the word “rome” is entered as query, delivering many hits, for
example from the CIPRO repository. Here two options are given. Clicking on
the thumbnail shows a larger image of the map. Clicking on the back link
opens a page showing the appropriate map within the DIGILIB image
processing framework. The presentation of the hits and the back link possibilities
can certainly be improved, but they were not the focus of the ECHO work. Also,
some repositories include many resources that are not open.
25 SMIL is a W3C supported standard for media presentations and will be supported by an increasing number of
browsers.
2. Evaluation
This evaluation is split into three parts. In the first we will make some comments about
the formal correctness, which we distinguish from the usefulness of the chosen
semantic mappings and operations, which we will discuss with the help of examples.
While in the case of the formal correctness one can speak about “errors”, the
semantic mappings are a matter of subjective evaluation. The third part will make
statements about the ranking.
2.1 Formal
Formal correctness covers aspects such as:
• Are all specifications made in the ontology correctly implemented?
• Are the final metadata files (created by conversion) correct?
• Are the extension mechanisms that create the final index file correct?
• Are the extensions such that we don’t get a semantic explosion?
The last question also concerns semantic evaluation, so it could equally appear under 2.2.
During the last weeks much testing was done to see whether the engine and the
underlying mapping files are correct. We distinguish two types of mappings: (1)
Those mappings that are specified between the different metadata elements. (2)
Those mappings that are established between the thesauri.
The mapping scheme between the metadata elements was provided and discussed
very early with the data providing teams. The first version of the DORA document
was distributed in late 2003, so that all teams could respond. The corrections we
received were integrated. It was checked in detail during the tests whether the
mappings are effective while searching. Here the method was to investigate specific
examples that were obvious from studying the metadata sets. As far as can be seen
from these investigations the specified mappings are used correctly.
The correctness of the implementation of the thesaurus mappings and
extensions was tested especially for the geographical elements. Here we discovered
a number of errors, which mainly had to do with incorrect character encodings in the
metadata files. Although UTF-8 was declared in the header, we found that this
declaration was not correct in some cases. Also, in some cases additional
characters had been introduced into the strings. Only through these operational checks
could we find these errors. For the obvious cases corrections were carried out,
although we cannot claim that these kinds of problems have been completely removed.
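A check of this kind can be automated in a simple way. The following sketch only tests whether a byte string that claims to be UTF-8 really decodes as UTF-8; it is not the procedure used in ECHO, and the function name and sample strings are assumptions made for illustration.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same invented string, once stored as UTF-8 and once as Latin-1.
good = "Città di Milano".encode("utf-8")
bad = "Città di Milano".encode("latin-1")

print(decodes_as_utf8(good))
print(decodes_as_utf8(bad))
```

Such a test finds files whose header promises UTF-8 although the content was written in another encoding. It cannot detect the second kind of problem mentioned above, additional characters that are themselves validly encoded.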
Another problem we encountered was that the thesaurus extension leads to an
explosion of hits in the case of the content description. In the case of geographical
terms we have a well-defined domain that is organized hierarchically. In the case of
content descriptions we don’t have such a well-structured domain. Applying both the
semantic mappings between nodes of the content thesauri and the
hierarchical extension leads to cycles and to an explosion that produces too many
useless hits. Therefore, we concluded that for the content description within
ECHO we will only exploit the mapping specifications and not use the hierarchy
information. A more detailed semantic analysis would have to be carried out to arrive
at refinements. This was beyond the scope of the ECHO project.
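The effect can be made visible with a small sketch. The two relation tables below are invented and do not reflect the real ECHO thesauri; the function expand simply follows all given relations and uses a set of visited terms so that cycles do not lead to endless loops.

```python
from collections import deque

# Invented relations, for illustration only.
# MAPPINGS: equivalences between terms of different content thesauri.
# HIERARCHY: links to broader or narrower terms within a thesaurus.
MAPPINGS = {"clothing": ["costume"], "costume": ["clothing"]}
HIERARCHY = {
    "clothing": ["material culture"],
    "material culture": ["tools", "clothing"],  # closes a cycle
    "costume": ["theatre"],
    "theatre": ["performance"],
}


def expand(term, *relations):
    """Return all terms reachable from term via the given relation tables."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation in relations:
            for neighbour in relation.get(current, ()):
                if neighbour not in seen:  # guards against cycles
                    seen.add(neighbour)
                    queue.append(neighbour)
    return seen


print(sorted(expand("clothing", MAPPINGS)))
print(sorted(expand("clothing", MAPPINGS, HIERARCHY)))
```

With the mappings alone the query term is extended to two terms; adding the hierarchy triples this in the toy example, and in a real thesaurus the growth is far larger. This is the kind of growth that motivated the decision to use only the mapping specifications for content descriptions.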
2.2 Examples and Semantics
First, we will give a number of examples and then give a first evaluation.
Example 1
Simple Search “weapons”
87 matches are found: Fotothek: 84, RMV: 1, IMSS: 2
Complex Search “weapons”
Fotothek - Iconography: 84, RMV - Content Description: 1, IMSS - title: 2
Both search types lead to the same result. In the case of complex search the
mapping between the fields becomes effective leading to acceptable results.
Example 2
Simple Search “dogon”
1 match was found: NECEP: 1
Complex Search “dogon”
View NECEP - society name: 1 in NECEP
View IMSS - language: 1 in NECEP
View DC - language: 1 in NECEP
View Language - language: 1 in NECEP
Complex Search “mali”
View Language - country: 1 in NECEP
This example demonstrates the effect of the mapping and of the geographical thesaurus.
The language element is mapped to the society name element in NECEP, although this is
semantically not fully correct. Entering “mali” in the country specification yields a hit
since “mali” is seen as a superclass of “dogon”. Here a relation type such as
“has_language” would be semantically more appropriate.
Example 3
Simple Search “inuit”
2 matches are found: Language: 1, NECEP: 1
Complex Search “inuit”
View Language - *: 0 in Language (could not be found in the Language domain)
View Language – language: 1 in NECEP
Complex Search “greenland”
View Language – language: 1 in NECEP
The results are similar to those of example 2. They indicate that the element
containing “inuit” in the language domain is not an element that is used for mapping. It
was used as an optional field by one specific researcher.
Example 4
Simple Search “agriculture”
75 matches are found: Language: 73, Fotothek: 2
Complex Search “agriculture”
View Fotothek - iconography: 2 in Fotothek
View RMV – content: 2 in Fotothek
View IMDI – content: 2 in Fotothek
The results can be misleading. The 73 hits for language result from matching with
recording place (“southern agriculture kindergarten”) or affiliation of an actor
(“ministry of agriculture”). In the case of Fotothek the hits make sense since it is
about “harvesting”. The mapping in complex search works properly as indicated. Of
course, in complex search the misleading hits from the language domain are not
found.
Example 5
Simple Search “clothing”
22 matches: Language: 8, RMV: 8, Fotothek: 6
Complex Search “clothing”
View RMV – content: 8 in RMV, 6 in Fotothek
View Fotothek – iconography: 8 in RMV, 6 in Fotothek
View Language – content: 8 in RMV, 6 in Fotothek
Again the rich annotations that are inserted in various free-text fields in the language
domain lead to hits that are not useful. They are about chats at the bakery shop and the
clothes people are wearing, so they are not about clothing as an object, which may be
what the person specifying the search intended. The results for complex search from
different domains show the correctness of the mappings. The language hits are
excluded, but the others are found.
Example 6
Simple Search “horses”
7 matches: Fotothek: 2, Language: 2, IMSS: 3
Complex Search “horses”
View Fotothek – object title: 3 in IMSS
View Fotothek – iconography: 2 in Fotothek
View Lineamenta – title: 3 in IMSS
View Lineamenta – keywords: 2 in Fotothek
View IMSS – title: 3 in IMSS
View IMSS – subject: 2 in Fotothek
View Language – title: 3 in IMSS
View Language – content: 2 in Fotothek
This example clearly indicates the strength of simple search and the weakness of
complex search. The pattern of complex search is like a narrow path in the complex
semantic space. If one looks at title one finds the IMSS hits; if one looks at content
one finds the Fotothek hits. Both, however, lead to useful hits in which “horses”
play an important role. The reason partly is that metadata in many cases is very
sparsely encoded. In the case of IMSS the term horses is only mentioned in the title,
and the content element is not yet used. In the language case thesaurus information
is used to infer “horses” from the title content “spatial layout task, farm scenarios”.
Further tests and examples will follow.
So far, no clear statement can be made on whether simple or complex search is better.
Simple search is good when one wants to be sure to get a set of hits in which
the documents one is looking for are very probably included, even
at the price of a large number of hits. Complex search is more selective and its
matching operations are much stricter. In general complex search is excellent for
those metadata elements that describe a more precise domain such as date,
geographic location and authors. Content descriptions are done in very different
ways and according to different categorization principles (thesauri, keywords). Any
professional search on these elements requires a high degree of knowledge about
the underlying category system and its semantics. If one wants to exploit the
advantages a thesaurus such as IconClass can offer, one has to know its semantic
construction principles.
One big advantage of simple search is that it uses all fields even if they contain
prose text. However, this also increases the number of inappropriate hits, as was shown
in the examples.
2.3 Ranking
Ranking is a way to satisfy the user in the case of low precision. It is a general rule
to offer more hits even if inappropriate documents are included, since there is
always a trade-off between “recall” and “precision”. If the “recall” (ratio of appropriate
documents found to the total number of appropriate documents) is to be increased,
the precision (ratio of appropriate documents found to all documents found) normally
decreases. But the primary goal is to find all appropriate documents and offer them. A
compromise then is to offer first those documents for which there is clear evidence
that they are appropriate.
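The two measures can be illustrated with the figures of Example 4. The sketch uses the usual definitions (appropriate hits divided by all appropriate documents for recall, appropriate hits divided by all hits for precision) and assumes that the two Fotothek records are the only appropriate documents for “agriculture”; the record identifiers are invented.

```python
def recall(found, appropriate):
    """Share of the appropriate documents that were found."""
    return len(found & appropriate) / len(appropriate)


def precision(found, appropriate):
    """Share of the found documents that are appropriate."""
    return len(found & appropriate) / len(found)


# Assumption: only the two Fotothek records are appropriate.
appropriate = {"fotothek-1", "fotothek-2"}
# Simple search: the 2 Fotothek hits plus 73 misleading Language hits.
simple_hits = appropriate | {f"language-{i}" for i in range(73)}
# Complex search: only the 2 Fotothek hits.
complex_hits = set(appropriate)

print(recall(simple_hits, appropriate), precision(simple_hits, appropriate))
print(recall(complex_hits, appropriate), precision(complex_hits, appropriate))
```

Under these assumptions both search types reach full recall, but the precision of simple search is only 2/75, which is why a ranking that shows the well-supported hits first is helpful.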
The implemented ranking is based on frequency of occurrence and not on semantic
criteria. It makes sense to weight multiple occurrences of different terms higher than
multiple occurrences of one term. The fact that more of the terms in the query input
match raises the probability that the found document is a useful hit. The
results found are in general satisfactory.
An implementation of a ranking based on semantic criteria requires much more
experience and insight into the usage of all concepts. Since many metadata sets were
offered at a very late moment within the project, there was no chance to include
semantics in the rating. Including semantics also means including a bias. It is obvious
that people disagree on semantic relations and want to be able to tune the
semantically related operations according to their wishes. Therefore, we refrained
from making use of the “mapping quality” parameter which can be added to the
mapping relations between the different metadata elements. It would require much
more time to come up with useful defaults.
At this stage of the DORA search engine ranking based on formal criteria is much
more appropriate than including semantic criteria.
3. Conclusions
The final conclusions will be drawn when all evaluations have been done in June.
Here some preliminary conclusions are made.
Creating an interoperable and interdisciplinary search space is a difficult task.
DORA is one of the first attempts to do this in a flexible and unbiased way without a
specific goal in mind. It is not yet clear whether this approach is useful. A project
approach, even if it includes only a few disciplines, may have specific objectives in
mind that will require a careful analysis of the included semantics, and it may include
strong biases.
DORA was intended to make it easy to integrate other domains into the search
space. Integrating another discipline requires activities at the harvesting and data
preprocessing level, which will not be commented on here. It was already described that
most of the repositories are not yet able to offer validated, correct and stable
output. The OAI-PMH protocol is important, but many repositories are not ready.
Even the concept of metadata was new for some, and a fair debate showed that
some question the usefulness of keyword-type metadata. Here we can see a
difference between institutions that hold large collections of multimedia objects and
those that are more text oriented.
Discipline integration also requires various operations to integrate the semantics:
• The mappings to other metadata elements have to be added to support
complex search.
• In the case of geographic descriptions one has to create a discipline-specific
list of terms and relate them to nodes in a geographic thesaurus.
• In the case of content descriptions one also has to create relations to
concepts used in other domains.
Currently, the effort is very high, since there is no structural support and there are no
existing knowledge documents one can refer to in the area of the humanities. What
is needed to support such work and also allow individuals or groups to tune the
semantics to their needs is as follows:
• Open Data Category Registries that contain ISO-compliant definitions of the
concepts occurring in a discipline. Compliance with standards such as ISO
11179 would guarantee a certain degree of homogeneity and increase reusability.
The definitions should be included in XML files that are associated
with a schema. These definitions should contain only those relations that are
part of the proper definition of a concept, i.e., if for example the sub-class
relation is important to define a concept, then a relation to another concept
could be included. However, it is wise to reduce this to a minimum, since
relations are often a matter of disagreement even within a domain.
• This is also valid for the thesauri. As far as is known to us, the big thesauri
have their own definition style, come with a particular access interface and
are not openly available as an XML file (see footnote 26).
• For the mappings we also need frameworks to easily create practical
ontologies. These should be described in RDF and refer to concepts defined
in open registries. It must be possible for users to easily create their own
versions, i.e., to adapt existing relations or to add new ones.
• All these components must be machine-readable and inference engines must
be available that can operate on them.
• Registration mechanisms have to be designed that allow knowledge components
to be registered and searched for.
• The RDF-S and OWL definitions are an excellent start for formalizing relation
types; however, in practical work we are often faced with fuzzy or unclear
relations that cannot be described by RDF-S/OWL types.
26 To make IconClass useful in the DORA framework the database format used on the distributed CD-ROM had
to be decoded with the help of scripts and some manual intervention to arrive at an appropriately structured XML
file.
Part of the work has been started in the area of Language Resources (ISO
TC37/SC4). This can be seen as an example to start such work in other disciplines
of the humanities. It will pave the way for the humanities towards the Semantic Web.
DORA is an attempt to tackle some of the problems based on open and well-structured
ontology components; however, most of them are not based on established
standards.
A key point for the success of DORA-like approaches with complex search based on
selected metadata categories will be the flexibility for users and groups to tune the
semantics. The above-mentioned steps will help in doing this, but smart and
user-friendly tools have to be available.
From experience it is obvious that the choice not to offer Dublin Core as the
Gold Semantic Standard was appropriate. The success of selective search will
depend on the knowledge about the vocabularies and the quality of the mappings.
Dublin Core presents a rather reduced vocabulary with loosely defined concepts. It
is not obvious how different disciplines will map their concepts onto the Dublin Core
ones, and in general this mapping is not open. So the concept of a Gold Standard
may be useful for cases like the domain of book descriptions, where concepts
such as title, author, year of appearance and publisher have developed over many years
and are used by all libraries. For purposes such as DORA, which wants to go beyond
these formalized elements, Dublin Core cannot be recommended. It may play a role
for occasional users, but it can be questioned whether DC search is preferable
to simple search.
An important aspect that restricts the quality of this evaluation is the lack of detailed
metadata descriptions in many cases and the comparatively small number of objects
in some of the repositories. Only the Fotothek and Language repositories have a
large number of records. For repositories that offer about 100 records or fewer,
browsing is sufficient and even superior to searching. However, it is obvious that this
will change in all disciplines, since the number of digital objects stored is increasing
extremely fast.
The DORA technology has to be seen as one of the possible initiatives to indicate
how difficult semantic integration is and how much has to be done in future. We
need more of such attempts to build the infrastructures and tools to cope with the
challenges of the Semantic Web and to prepare the disciplines of the humanities for
these challenges.
D. Availability of the Code and the Knowledge Components
Since we received suggestions for optimizations from the various partners up to the date of writing this
report, we will finish the modification work in May 2004. After that date we will generate two ZIP packages:
• one containing all relevant code for the DORA Search engine
• one containing all relevant knowledge components
We intend to have this done in mid June and make the two parts available at the WP2 web-site.
The first package will not include all the scripts that were necessary to pre-process the various data
sets. We will provide the code of programs that are still in operation.
With respect to the second package we have to check under which terms we can put our XML version of
IconClass on the web.