Thesaurus modelling, data acquisition and design

BioCASE - A Biological Collection Access Service for Europe – CVR-CT-2001-40017
A Biodiversity Collection Access Service for Europe
Workpackage 4: Thesaurus modelling, data acquisition and design
Deliverable D4: Thesaurus criteria, candidate thesauri and catalogues
Charles Copp
Natural History Museum (London)
The BioCASE Thesaurus Team
Charles Copp      Natural History Museum (Clevedon, UK)       CC
Neil Caithness    Natural History Museum (London, UK)         NC
Richard White     Southampton University (Southampton, UK)    RW
John Robinson     Southampton University (Southampton, UK)    JR
Contents
1. INTRODUCTION
2. THE PURPOSE OF THE BIOCASE THESAURUS
3. THE BIOCASE THESAURUS DATABASE
3.1. THE BIOCASE THESAURUS MODEL
4. SOURCES OF TERMS IN THE BIOCASE THESAURUS
4.1. SOURCES
4.2. DOMAIN RESPONSIBILITIES
5. CRITERIA FOR THE SELECTION OF TERM LISTS, CLASSIFICATIONS AND THESAURI
6. PROGRESS WITH CANDIDATE LISTS, THESAURI AND CATALOGUES
6.1. IDENTIFICATION AND ACQUISITION OF CANDIDATE TERM LISTS
6.2. DATA QUALITY ISSUES RELATED TO THE THESAURUS
7. ADDING AND MANAGING TERM LISTS IN THE THESAURUS
7.1. DATA SUPPLY AND UPDATE
7.2. THESAURUS CONTENT MANAGEMENT
7.3. THESAURUS DATABASE MANAGEMENT
7.4. THESAURUS DISTRIBUTION AND USE
8. ANNEX 1: TERM LISTS AND NUMBERS OF TERMS IMPORTED INTO THE PROTOTYPE THESAURUS
9. ANNEX 2: TERM RELATIONS IN THE THESAURUS
1. Introduction
This report covers the work of the WP4 (Thesaurus) Team to define the means by which
we can establish, populate and manage an efficient thesaurus of relevant and related terms for
the BioCASE Project. The BioCASE Thesaurus has been designed as a single relational
database which can accommodate multiple hierarchical term lists relating to different subject
areas (e.g. taxonomy, gazetteers and habitats).
An initial trawl of potential sources has indicated a vast and overlapping resource of both
complementary and competing thesauri, term lists and classifications that could be relevant to
BioCASE. The WP4 Team have developed a methodology whereby potential sources are
documented and scored against a number of criteria, including uniqueness, completeness,
accuracy, format and availability, and candidate lists are marked for acquisition. Some important
sources covering taxonomy, gazetteers and biotopes have already been identified and steps
taken to obtain or negotiate use of copies. The overall size of the task means that it is likely to
continue through the length of the BioCASE Project.
Much of the early work of the project has been concerned with establishing a sound model for
the management of these term lists. Work with the prototype has included importing more
than 250,000 terms from 169 source lists, both to explore data quality issues and to aid the
development of thesaurus import and editing tools.
2. The purpose of the BioCASE Thesaurus
The purpose of the BioCASE Thesaurus is to provide controlled and classified terminology to
enable reliable and accurate data retrieval from partner databases. The entries in the thesaurus
will be derived from other classifications and term lists or derived from the indexing of
partner databases. It is not the objective of the BioCASE Thesaurus to become a terminology
standard and there is no claim that its content will be either comprehensive or authoritative.
There is no intention that the Thesaurus attempts to replace or rival existing or potential
terminology standards.
The BioCASE Thesaurus will be an enabling technology for providing maximum retrieval
from a disparate and varied set of data sources. Where possible it will allow for inevitable
variation in term form and the existence of synonyms and multi-lingual alternatives. The
thesaurus will be structured, where appropriate, into related hierarchies that allow for broader,
narrower and related term searches, again with the purpose of maximising the likelihood of
positive results from a user query.
A further function of the BioCASE Thesaurus will be to enable different user community
views of the same data. For instance taxon specialists will expect to access data using formal
taxonomic names and may be very specific in their requirements whereas general users and
non-specialists may wish to use common names (probably in their own language) and may
prefer broader group terms, some of which will be informal (e.g. sea birds). The Thesaurus
will therefore need to provide a network of related terms and concepts, ideally with weighted
associations, that can be explored by users starting from terminology that they understand and
which can lead them to broader, narrower and equivalent terms or different terms that might
produce fruitful results [1].
3. The BioCASE Thesaurus Database
3.1. The BioCASE Thesaurus Model
The structure of the BioCASE Thesaurus Database is detailed in a separate paper [2]. The
principal feature of the BioCASE Thesaurus Database model is that it is designed as a
mechanism for storing many term lists and versions of lists together with the means for
translating or relating from one to another. The model we have implemented is optimised for
the collection and management of multiple term lists and hierarchical classifications relating
to any number of knowledge domains. This means that the same structure can be used to
manage and relate taxon classifications, biotope and habitat classifications, gazetteers,
stratigraphic hierarchies and museological terms.
The key principle that we are working to is that no individual list or hierarchy is regarded as
definitive but that it is important to know the source of terms and their scope within the
source. The partner databases that will be indexed for BioCASE may include specimen data
(e.g. place names, taxon names, habitat terms) that derive from a wide range of lists and many
older terms that would not be regarded as current. The object of the thesaurus is to capture
these terms and where possible relate them to a known list. The BioCASE Thesaurus
managers will attempt to build links between terms and hierarchies of terms that enable users
both to change the scope of their searches (broader or narrower, near terms and related terms)
and to account automatically for equivalent terms in searches (e.g. common names and
binomials, multi-lingual versions, syntactic and spelling variations). The system will need to
be closely managed to ensure consistency and ideally be able to learn through use and user
input.
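As an illustration only, the linking idea described above can be sketched in a few lines of Python. This is not the BioCASE data model, which is relational and documented separately; the class names, the relation names ("equivalent", "broader", "narrower") and the list names are assumptions made for the example. The sketch expands a search term by one step along the requested relations and does not follow chains of links.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    text: str
    source_list: str  # each term records the list it came from


class ThesaurusSketch:
    """Stores typed links between terms; the relation names are illustrative."""

    def __init__(self):
        # relations[kind][term] -> set of directly linked terms
        self.relations = defaultdict(lambda: defaultdict(set))

    def relate(self, kind, a, b, inverse):
        """Link a -> b under `kind` and b -> a under `inverse`."""
        self.relations[kind][a].add(b)
        self.relations[inverse][b].add(a)

    def expand(self, term, kinds=("equivalent",)):
        """Return the term plus its direct neighbours for the given relation kinds."""
        found = {term}
        for kind in kinds:
            found |= self.relations[kind].get(term, set())
        return found


# Hypothetical lists and links, for illustration only.
thesaurus = ThesaurusSketch()
binomial = Term("Larus argentatus", "hypothetical taxon list")
common = Term("Herring Gull", "hypothetical common-name list")
group = Term("sea birds", "hypothetical informal groups")
thesaurus.relate("equivalent", binomial, common, inverse="equivalent")
thesaurus.relate("broader", common, group, inverse="narrower")

# A search on the common name also picks up the binomial.
print(sorted(t.text for t in thesaurus.expand(common)))
```

Keeping the source list on every term reflects the principle that no list is definitive but the origin of each term must be known.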
The structure of the BioCASE Thesaurus Database is complex with many tables. This
complexity allows us to document and manage any kind of list or hierarchical classification
from any of the domains [3] with which we are concerned. The structural complexity makes
manual maintenance of the database difficult and we will therefore be relying heavily on the
use of specially developed software tools [4] to handle additions, deletions and edits of list
items. The Thesaurus team are currently exploring the best way to make the thesaurus
available to the other work packages for whom the complex data model and extended
metadata relating to lists will not be appropriate. We will therefore be presenting the core
thesaurus information to the other work packages in a simplified and de-normalised format.
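One possible shape for such a de-normalised presentation is sketched below. It uses Python's built-in sqlite3 module purely so that the example is self-contained; the project database itself runs on MySQL. All table, column and view names are hypothetical and far simpler than the real model. The sketch keeps terms separate from list entries, so that one term can appear in several lists, and then flattens them into one row per entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, heavily simplified normalised tables.
CREATE TABLE term_list (list_id INTEGER PRIMARY KEY, list_name TEXT, domain TEXT);
CREATE TABLE term (term_id INTEGER PRIMARY KEY, term_text TEXT);
CREATE TABLE list_entry (
    list_id INTEGER REFERENCES term_list(list_id),
    term_id INTEGER REFERENCES term(term_id),
    parent_term_id INTEGER REFERENCES term(term_id)
);
INSERT INTO term_list VALUES (1, 'Example stratigraphy list', 'Geology');
INSERT INTO term VALUES (1, 'Triassic'), (2, 'Upper Triassic'), (3, 'Rhaetian');
INSERT INTO list_entry VALUES (1, 1, NULL), (1, 2, 1), (1, 3, 2);

-- Flat view: one row per list entry, with names in place of keys.
CREATE VIEW flat_thesaurus AS
SELECT l.domain, l.list_name, t.term_text AS term, p.term_text AS broader_term
FROM list_entry e
JOIN term_list l ON l.list_id = e.list_id
JOIN term t ON t.term_id = e.term_id
LEFT JOIN term p ON p.term_id = e.parent_term_id;
""")

rows = conn.execute("SELECT * FROM flat_thesaurus ORDER BY term").fetchall()
for row in rows:
    print(row)
```

A consumer of the flat view needs no knowledge of keys or list metadata, at the cost of repeating the list name and domain on every row.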
[1] See Annex 2 for a list of relationships modelled in the Thesaurus
[2] The BioCASE Thesaurus, Logical and Physical Data Models, Charles Copp, BioCASE Report, June 2002
[3] Subject areas e.g. species names, place names, habitat types, modes of specimen preservation
[4] Being written by John Robinson at Southampton University, School of Biological Sciences
Figure 1: Prototype Thesaurus Viewer showing the ability to explore hierarchical trees of
terms and to list equivalent and related terms from other lists
4. Sources of terms in the BioCASE Thesaurus
4.1. Sources
The BioCASE Thesaurus Database has been designed to hold and relate many term lists and
hierarchical classifications. We wish to ensure that the indexing package has a high likelihood
of finding terms it encounters in partner databases. This means that the Thesaurus Team have
to look for the most inclusive and widely used term lists available and also allow for the
addition of more localised and specialised lists as these are identified.
The BioCASE Thesaurus will borrow from or link to terminology standards where they exist
and where they are relevant to data retrieval within the BioCASE project. The addition of
terms to the thesaurus will not be a guide to their validity, only their utility. The BioCASE
Thesaurus will carry no guarantee that its included term lists are comprehensive although it
will draw wherever possible from the most accurate and comprehensive sources available.
In addition to copies of or links to published lists, classifications and thesauri, the BioCASE
thesaurus will include terms derived from indexing partner databases and terms supplied with
collections metadata. This approach is needed because partner databases may be in a variety
of languages and include many free terms or be derived from in-house term lists. There are
obvious dangers in allowing a thesaurus to grow in this way because simple lists of terms put
together without rules would soon become unusable. This implies that the thesaurus will need
to be managed and work within a set of rules. What these rules are and how they will be
applied will be defined as the work progresses.
There are 4 principal sources of terms that will be incorporated into the BioCASE Thesaurus:
1. Recognised national and international standards (e.g. TDWG
Geographic Codes). These should be consistent and reliable but may
include both full terms and codes (e.g. ISO 639 Languages), either of
which may have been used in databases. Standard lists are generally
static or addition of new terms is strictly controlled.
2. Existing lists of terms that are maintained by a recognised
organisation (e.g. Botanical Society of the British Isles Plant Names).
These may be treated as emerging standards and are generally reliable, but
several versions of the list may be in use. Some lists are represented by large
and complex databases (e.g. the Getty Placenames Thesaurus and the
Species 2000 Project). This category includes both static lists (e.g. UK
Phase I habitats) and developing lists (e.g. EUNIS habitats). Some lists
contain essentially the same terms (e.g. CORINE and EUNIS) but there
may be orthographic differences, often unintended.
3. Existing informal lists of terms that are not fully controlled but may
have been widely used (e.g. the Stratigraphy lists incorporated in the
prototype of the thesaurus database). These lists may be derived from
various published sources but may include a number of problems
including misspellings, duplication, and inconsistent updates without
version control. There can be multiple lists relating to the same topic but
managed by different people and organisations. Some lists that have
grown informally may include duplicated terms including full duplicates
and orthographic duplicates.
4. Terms derived from indexing partner databases. These terms may be
derived from controlled lists or could be free terms. Typical problems
include varying spelling, plurals, gender and use of abbreviations.
Across the BioCASE area free terms and text descriptions will include
many spelling, abbreviation and language variants as well as the
inevitable typographic errors. The growth of new terms from this source
could be exponential and the indexing system will therefore need to deal
with term reduction using stop lists, word stemming and other
techniques. Even with term and ‘noise’ reduction there will be a
significant resource implication for relating new terms to existing ones.
4.2. Domain Responsibilities
The tasks of identifying and obtaining checklists, term lists, classifications and catalogues have
been divided amongst the thesaurus team. The responsibilities are:
Richard White (Southampton University)            Taxon lists and classifications
Neil Caithness (Natural History Museum London)    Gazetteers and Administrative Area names
Charles Copp (Natural History Museum London)      Geological, Ecological, Museological and other terms
5. Criteria for the Selection of Term Lists, Classifications and Thesauri
The BioCASE Thesaurus database has been designed to enable us to manage and index term
data derived from a wide range of sources and formats and to provide a common means of
relating terms from different sources. However, even a casual search on the ‘web’ or in
museological literature demonstrates a potentially vast source of classifications, thesauri,
gazetteers and term lists. We therefore needed to establish the criteria by which we can judge
the suitability of any particular source of terms for our purposes.
We have identified a number of criteria that will help us select and prioritise appropriate
sources of terms for the BioCASE Project. Some of the criteria are free text descriptions but
where possible we will be applying fixed values that can be scored to help in the selection
process.
It has been decided to include the selection and acquisition information relating to term lists
as metadata in the BioCASE Thesaurus database. The intention is to maintain the BioCASE
Thesaurus Database on a MySQL database at Southampton University and for members of
the Thesaurus Team to access the database over the web to record metadata about term lists
and their sources as they find them. The metadata will record further information about the
lists we decide to acquire such as cost, constraints on use and update agreements.
The criteria and metadata that we are recording include:
1. List Type
List Type refers to a controlled list of terms describing the ‘domain’ of any given
term list and gives a convenient way of sorting lists within the Thesaurus Database.
List type is hierarchical so that sub-domains can be grouped. Top level types such as
Taxon list, Biotope list, Gazetteer and Geology can have subgroups; for instance,
Minerals and Stratigraphy fall within Geology.
2. List Topic
Within a list type domain a list can have a specific topic, e.g. a regional taxon list
may cover only the Leguminosae. In terms of selection for inclusion in the
BioCASE Thesaurus the topic may be judged as:
• Covering only items relevant to BioCASE
• Also extending to items outside of the current BioCASE remit
• Outside but related (e.g. mineral names) and possibly useful in the future
• Completely outside of current BioCASE scope
3. Theme Coverage
Within the topic covered, lists may:
• Cover the whole theme
• Cover a subset of the theme
4. Geographic Coverage
Some lists are global in their coverage but most are restricted to some named
geographic area.
• Is the term list applicable to a specific geographic area and, if so, where?
5. Language
The BioCASE Project is initially concentrating on English as a common language
for indexing and retrieval but multi-lingual terms will be encountered increasingly
as the project progresses and especially when indexing of ‘unit’ data within partner
databases takes place. Lists may be available in many languages and some include
multi-lingual synonyms; place names are likely to be the earliest sources of
non-English terms.
• What language?
• Does the list provide an essential source of terms, e.g. place names in
countries where there has been much recent change?
6. Standard
This is a flag to record whether this list is an international or national standard.
7. Uniqueness
If a term list or thesaurus is unique in its content and relevant to BioCASE then it
will become a priority for acquisition. For some domains, however, there seems to
be an almost inexhaustible supply of alternative lists that may entirely or partially
duplicate each other and we may then use other criteria in selection for the
BioCASE Thesaurus. Possible values for lists are:
• Only source of these terms – must have
• Includes some unique terms that are important to include
• Most terms covered elsewhere but could be of value
• Fully duplicated by another more readily available or reliable source
8. Completeness
It is useful to know how complete a list is within its given domain.
• Complete
• Incomplete
9. Accuracy
The BioCASE Thesaurus does not set out to become an authoritative standard but it
is important to know the origin and quality of the term lists it incorporates, e.g. for
choosing a ‘preferred’ term as a listing heading when faced with synonyms and
orthographic variants. Possible values for lists are:
• International or national standard
• Classification or thesaurus assembled or maintained by acknowledged
‘expert’, respected Society or consortium
• Informal list but assumed to be accurate
• List known to include inaccuracies but widely used
• List considered unreliable or inaccurate
10. Version Detail and Date of List Version
Term lists, thesauri and classifications are often released or replaced in different
versions. It is important for us to know which version we have and how it relates to
other versions. For some lists we may simply need the most recent version, for
others we may need all versions.
11. Maintenance
Very few of the larger term lists are either static or complete and therefore it is
necessary to know how these lists are updated and maintained in order that if they
are incorporated into the BioCASE Thesaurus our copy can be kept up-to-date.
• Static international or national standard
• Static informal list
• Maintained international or national standard
• Maintained formal list not adopted as standard
• Maintained informal list
12. Updates
Where lists have a controlling authority and are subject to change we will need to
record how and when we can receive updates.
• Complete copy of static list – no updates needed
• Copy of a maintained list with arrangement for update
• Remote access to an on-line maintained thesaurus or dictionary
13. Current Storage Format
Current storage format refers to the ‘native’ format of the list, which will affect the
ease with which it can be manipulated or incorporated into the BioCASE Thesaurus.
Some likely types include:
• Manuscript list
• Published list (paper format)
• List in text format
• Spreadsheet-style format (single table)
• Thesaurus structured text list
• Relational database
• Proprietary electronic format
14. List structure suitability
This field provides us with an indication of whether the list can be readily
incorporated or will need significant manipulation and restructuring to suit our
purposes.
• Directly usable by BioCASE
• Needs simple re-structuring
• Needs significant restructuring
• Not known
15. Availability
The availability of a term list for use or incorporation into our own thesaurus is a
key criterion. Some lists are available to subscribers only and may only be
accessible on-line, not as copies. Other lists may have use constraints placed upon
them. In either situation we will need to judge whether such lists are critical to
BioCASE or can be replaced with more freely available alternatives.
• Freely available to copy and use without constraint
• Freely available to copy and use within copyright or negotiated constraints
• Freely available to access on-line but not copy
• Copy available for a one-off cost
• Copy and updates available through subscription
• On-line access by subscription
• Not known
16. Cost
If the list is not freely available, what are the costs involved in obtaining and
maintaining a copy?
17. Source of copy
This field allows us to record the means and format in which we can obtain a copy
of the list.
• Available by download from web or by ftp
• Available on web for searching but not download
• Available in electronic format on disk or CD
• Paper format only
• Subject to negotiation
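To show how fixed criterion values could be turned into a score, the following Python sketch combines three of the criteria above into a priority on a scale of 1 (must have) to 5 (don't need). It is a sketch under stated assumptions, not the team's scoring method: the choice of criteria, the weights, the assumption that each value list runs from most to least favourable, and the function name `acquisition_priority` are all invented for the example.

```python
# Value lists copied from the criteria above, assumed ordered best to worst.
CRITERIA = {
    "uniqueness": [
        "Only source of these terms – must have",
        "Includes some unique terms that are important to include",
        "Most terms covered elsewhere but could be of value",
        "Fully duplicated by another more readily available or reliable source",
    ],
    "accuracy": [
        "International or national standard",
        "Classification or thesaurus assembled or maintained by acknowledged "
        "‘expert’, respected Society or consortium",
        "Informal list but assumed to be accurate",
        "List known to include inaccuracies but widely used",
        "List considered unreliable or inaccurate",
    ],
    "structure": [
        "Directly usable by BioCASE",
        "Needs simple re-structuring",
        "Needs significant restructuring",
    ],
}

# Hypothetical weights: uniqueness counts most, structure least.
WEIGHTS = {"uniqueness": 3.0, "accuracy": 2.0, "structure": 1.0}


def acquisition_priority(assessment):
    """Map {criterion: chosen value} to a priority from 1 (must have) to 5."""
    total = 0.0
    for name, value in assessment.items():
        values = CRITERIA[name]
        # Position of the chosen value: 0.0 = most favourable, 1.0 = least.
        position = values.index(value) / (len(values) - 1)
        total += WEIGHTS[name] * position
    weighted_position = total / sum(WEIGHTS[name] for name in assessment)
    return 1 + round(weighted_position * 4)


example = {
    "uniqueness": CRITERIA["uniqueness"][1],
    "accuracy": CRITERIA["accuracy"][1],
    "structure": CRITERIA["structure"][1],
}
print(acquisition_priority(example))
```

Criteria recorded as free text, or as "Not known", are left out of the sketch and would still need a human judgement.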
In addition to the above criteria we are recording the following metadata for named list
versions:
1. Acquisition Priority based on an assessment of the above criteria.
Scored on a scale of 1 (must have) to 5 (don’t need)
2. Actual Cost of Acquisition
3. Flag for lists actually acquired
4. Update arrangements
5. Source and agreement metadata
6. Contact name and organisation data
7. URL for related website
6. Progress with Candidate Lists, Thesauri and Catalogues
6.1. Identification and Acquisition of Candidate Term Lists
The team has met several times to discuss the task of identifying and evaluating candidate
thesauri and has established through a preliminary trawl of lists and websites that this
task will need to run through the whole project. The range and coverage of lists and thesauri
covering natural science topics (including taxonomy and gazetteers) is vast with much
overlap.
The late start to the project and early difficulties in getting staff in place have meant that
progress in selecting term lists according to the established criteria has been slower than
expected, although major taxonomic and gazetteer sources have been identified and steps
taken to obtain copies. These include the BIOSIS and Species 2000 taxonomic classifications
(covering worldwide taxa) and the US National Imagery and Mapping Agency Gazetteer (c. 3
million place names). Importing of these lists has been given a lower priority whilst work
has concentrated on the development of the Thesaurus data model and prototype database.
C. Copp has, however, brought together a number of specifically European lists covering a
range of domains to test the developing prototype database and the various thesaurus-handling
tools. The prototype database currently holds 169 different lists and classifications
relating to 265,323 terms (use of the same terms in different lists brings the listable entries to
305,376). These lists have been imported from various sources including dictionaries
maintained by the UK National Biodiversity Network. The lists have been imported with
minimal data cleaning, which has highlighted a number of data quality issues that the
project will need to address.
6.2. Data Quality Issues related to the Thesaurus
A prototype version of the BioCASE Thesaurus, populated with circa 30,000 terms
representing a number of earth science and biotope lists and classifications, was circulated for
testing and comment early in the project. The Paris team quickly identified a number of
issues deriving from inconsistencies, duplications and term formats that will create potential
problems for the indexing work package. Although the lists of terms delivered with the
prototype thesaurus were for test purposes only, they did give a good indication of what
happens when you bring lists together from different sources. The test also gave an indication
of what will happen when we start trying to index partner databases and attempt to link terms
to existing thesaurus term lists.
The underlying problem is that we cannot control what is in existing databases. If we are
building a database (e.g. the collections metadatabase) from new then we have the
opportunity to enforce fairly strict terminology control (at least for higher level terms) but for
existing unit-level databases this will not be possible. However, in mapping unit databases to
a common schema (as views) we can, at least, identify term groups (e.g. geographic,
taxonomic, biotope etc.) so that we do not confuse overlapping concepts (e.g. Essex Emerald
in an Identification field is a taxon and does not refer to a collection site in Essex).
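The idea of scoping a lookup by term group can be sketched as follows. The field names, the group names and the tiny in-memory thesaurus are hypothetical; the sketch only shows that a value is matched against terms of the group assigned to its field, so that 'Essex Emerald' in an Identification field is never compared with place names.

```python
# Hypothetical mapping from common-schema fields to term groups.
FIELD_GROUP = {
    "Identification": "taxonomic",
    "Locality": "geographic",
    "Habitat": "biotope",
}

# Hypothetical thesaurus entries keyed by (term group, lower-cased term).
THESAURUS_ENTRIES = {
    ("taxonomic", "essex emerald"): "Essex Emerald (taxon)",
    ("geographic", "essex"): "Essex (place)",
}


def lookup(field_name, value):
    """Look a value up only among terms of the group assigned to its field."""
    group = FIELD_GROUP.get(field_name)
    if group is None:
        return None  # unmapped field: no term group, so no lookup
    return THESAURUS_ENTRIES.get((group, value.strip().lower()))


print(lookup("Identification", "Essex Emerald"))
print(lookup("Locality", "Essex"))
```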
Terms derived from indexing unit data might have any of the following characteristics:
• Source term list may or may not be identifiable
• May be in international formal language (e.g. taxonomic) or in any
national language
• If a term is not already in the thesaurus, it might not be readily
assignable to a higher hierarchical level, e.g. new taxon or geographic
terms
• May be concatenated with other terms
• May be misspelt or wrongly capitalised
• May be pluralised or, in some languages, in a different gender
• May be abbreviated or entered as a code (e.g. habitat code) or
symbolic notation (e.g. chemical composition)
• May include other text (e.g. and, the, of)
• May include qualifying words (e.g. outside, near, cf.)
• May include punctuation or other symbols (e.g. = , > ? [ ] )
• Terms may be made up of several words (Lower Jurassic, Lesser
Spotted Woodpecker)
This raises a number of problems for both the indexing and thesaurus teams:
• Simple atomised indexing (i.e. each separate word) will miss critical
links between terms. It might be necessary to parse terms for word
position, connecting words and punctuation (e.g. to resolve ‘Lower
Rhaetian, Upper Triassic’).
• Qualifiers may be important. In the UK, square brackets [ ] are commonly
used to denote inferred information; ? and cf. are important in
identifications. Qualifiers might need to be stripped for indexing
purposes but some may form part of a name (e.g. aff. or var.).
• The Index could fill up with spurious variants of terms; there will
need to be a means whereby common variants such as plurals can be
recognised.
• It might not be possible to link an indexed term to one in the thesaurus
without further evidence (e.g. if a locality is given as ‘Germany’ the
date is also important; there are also numerous instances of the same
taxonomic binomial referring to a plant, an animal and a fossil).
• New terms derived from indexing may not be accessible under broader
term/narrower term searches until a link is added.
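Some of the parsing problems listed above can be illustrated with a short Python sketch. It is not the indexing package: the stop words and qualifiers are the examples given in this section, the function names are invented, and plural recognition and word stemming are left out because they would need language-specific rules. The sketch splits a compound entry on commas, records qualifiers and bracket or question marks instead of silently discarding them, and leaves aff. and var. in place because they may form part of a name.

```python
import re

STOP_WORDS = {"and", "the", "of"}                     # examples from the text
STRIPPABLE_QUALIFIERS = {"cf.", "near", "outside"}    # recorded, then removed
MARKS = re.compile(r"[\[\]?]")                        # inferred or doubtful data


def clean_term(raw):
    """Return (cleaned term, qualifiers found) for one raw entry."""
    qualifiers = sorted(set(MARKS.findall(raw)))
    words = []
    for word in MARKS.sub(" ", raw).split():
        lowered = word.lower()
        if lowered in STRIPPABLE_QUALIFIERS:
            qualifiers.append(lowered)
        elif lowered not in STOP_WORDS:
            words.append(word)  # 'aff.' and 'var.' stay: they may be part of a name
    return " ".join(words), qualifiers


def split_compound(raw):
    """Split a comma-separated entry and clean each part separately."""
    return [clean_term(part) for part in raw.split(",") if part.strip()]


print(split_compound("Lower Rhaetian, Upper Triassic"))
print(clean_term("cf. Larus argentatus?"))
print(clean_term("[Essex]"))
```

Blind stop-word removal is itself risky, since it would reduce a place name such as 'Isle of Wight' to 'Isle Wight', which is one reason the rules would need to vary by term group.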
What this means for the thesaurus:
We have to be prepared to work with and develop a strategy to deal with inconsistent terms
derived from multiple term sources, typographic errors, inter-changeability of terms,
abbreviations and codes. We might be able to influence terms used in new metadata but there
is no way that we can influence the form of existing unit data so our products must be
designed to deal with inconsistencies and uncertainties.
7. Adding and Managing Term Lists in the Thesaurus
This section provides an outline of the likely processes and 'job roles' involved in establishing
and maintaining the BioCASE Thesaurus. The envisaged pattern of information flow is
further summarised in Figure 2 (below). The processes identified are original data supply,
thesaurus update, thesaurus content management, thesaurus database management,
distribution and thesaurus use.
7.1. Data supply and update
Figure 2 illustrates the diversity of sources that the BioCASE Thesaurus will derive its term
lists from. It is the role of the Thesaurus Team to identify and acquire sufficient term lists to
provide a sound basis for indexing partner databases and to provide a framework into which
new terms can be fitted. It is not the role of the Thesaurus Team to validate or alter imported
lists although the physical structure will often be modified to suit the BioCASE Thesaurus
Model. Lists might be scanned for consistency but changes should only be initiated by the list
or classification owner. The Thesaurus Team will have to negotiate use of lists and
arrangements for update with the list owners and all terms and data imported into the
BioCASE Thesaurus should have enough associated metadata to indicate their origins and
any constraints attached to use. For practical reasons, the task of acquiring and importing term
lists and classifications for the BioCASE Thesaurus has been split into three areas: taxonomy,
gazetteers and the rest (habitats, museology, earth science etc.).
[Figure 2 is a flow diagram. Recoverable elements: term sources (Static Lists, Maintained Lists, Published Standards, On-line Thesauri), Partner Databases, Indexing software, the BioCASE Thesaurus, Data access software, the User Interface, and other potential products derived from or using the BioCASE Thesaurus. Recoverable flow labels: copy terms and updates; add terms; check for term (in on-line thesauri if not in the BioCASE Thesaurus); derive equivalent and related terms; submit and supply search terms; search thesaurus; submit query with alternative terms where required; copy thesaurus (simplified structure?).]
Figure 2: BioCASE Thesaurus in relation to term sources and user queries
7.2. Thesaurus Content Management
Each of the three Thesaurus areas mentioned has an individual responsible for content and
acquisition. The workpackage leader co-ordinates their work for reporting purposes. The
content and performance of the thesaurus will be monitored throughout the project. To aid
this, a version of the thesaurus will be available on-line to all partners and a report on content
will be delivered at each BioCASE Technical Committee meeting. As the project progresses,
the process whereby we apply criteria, possibly with respective weightings, will mature and
will also be modified as we gain more experience in negotiating term list use with owners.
Where there are costs involved in acquiring a list or gaining access to a thesaurus or
classification, these will need to be balanced against the cost of not having access and also the
likelihood of longer term maintenance of the BioCASE Thesaurus.
During the course of the project it will be necessary to formulate a forward plan for how the
content of the thesaurus will be sustainably maintained and updated in the future. In particular
we will need to define who will be responsible for on-going agreements with term list and
thesaurus suppliers to provide corrections and updates and how any financial implications
will be met.
7.3. Thesaurus Database Management
The BioCASE Thesaurus has been built and is maintained in a MySQL database hosted at the
Southampton University School of Biological Sciences. This database will be accessible to
the team and other technical workpackage members through a Thesaurus website
(http://biodiversity.soton.ac.uk/biocase/). The Southampton team has responsibility for
maintaining the physical database and providing software tools to both edit and access it.
The structure of the master database will be reviewed now that it is substantially populated with
lists from each of the identified topic areas. It may also be necessary to migrate the thesaurus
to PostgreSQL, which is being used by Berlin and Paris in their development work, although
this is not a priority at present.
7.4. Thesaurus Distribution and Use
It is also the role of the Thesaurus Team to provide a version of the thesaurus to the other
technical workpackages in a format that lends itself to their needs. The current format is quite
complex with many relational tables that suit the flexibility we need for collating lists of
different types and versions from many sources but is not the ideal delivery format for other
users. We will therefore work with partners to define the best format for their purposes and
export terms from the master database into a simpler format, e.g. one optimised for servicing
queries (see Figure 2). Once again, this arrangement will need a strategy for maintenance
when the BioCASE project comes to an end.
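As a rough illustration of exporting from a relational master into a simpler structure optimised for servicing queries, the sketch below flattens two hypothetical row sets into a single lookup keyed on term text. The row layouts, relation labels and the `flatten` function are assumptions for this example and do not describe the actual master database; the sample rows are invented.

```python
from collections import defaultdict

# Hypothetical rows as they might be read from relational tables:
# (term_id, text) and (from_id, relation, to_id).
TERMS = [(1, "Compositae"), (2, "Bellis"), (3, "Asteraceae")]
RELATIONS = [(2, "is_child_of", 1), (3, "equals", 1)]


def flatten(terms, relations):
    """Build a lookup keyed on lower-cased term text for fast query servicing."""
    text_by_id = dict(terms)
    index = {
        text.lower(): {"term": text, "relations": defaultdict(list)}
        for _, text in terms
    }
    for from_id, relation, to_id in relations:
        source_text = text_by_id[from_id]
        target_text = text_by_id[to_id]
        # Store the related term's text directly so no join is needed at query time.
        index[source_text.lower()]["relations"][relation].append(target_text)
    return index
```

A flat lookup of this kind trades the flexibility of the master structure for speed; identical term text held in different lists would collide under this key, so a real export format would need to decide how to handle that case.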
Figure 2 envisages that the thesaurus or an optimised copy of the thesaurus will be used in
several ways. The indexing package will check terms against it and identify new terms, which
may then be checked against other on-line thesauri such as the Alexandria Gazetteer. New
terms will then be added to the master database. New terms will have to be flagged so that the
Thesaurus Team members responsible for thesaurus content can check them for links to
existing terms.
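The indexing flow described above (check each term, look up unknown terms elsewhere, add them flagged for review) might be sketched as follows. This is an assumption-laden outline, not the indexing package: the dictionary layout, the `needs_review` flag and the `external_lookup` callable are hypothetical, and the callable merely stands in for a query to an on-line thesaurus or gazetteer.

```python
def index_terms(candidate_terms, thesaurus, external_lookup):
    """Add unknown terms to the thesaurus, flagged for review, and return them.

    thesaurus: dict mapping term text to a record dict (hypothetical structure).
    external_lookup: callable returning a truthy value if an outside source
        knows the term.
    """
    new_terms = []
    for term in candidate_terms:
        if term in thesaurus:
            continue  # already held, nothing to do
        thesaurus[term] = {
            # Flag so that content editors can link the term to existing terms.
            "needs_review": True,
            "found_externally": bool(external_lookup(term)),
        }
        new_terms.append(term)
    return new_terms
```

For example, `index_terms(["Bellis", "Clevedon"], thesaurus, lookup)` would return only the terms not already present and leave each of them marked for checking.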
Users of the search portal will have access to the thesaurus in some way, so that it can provide
not only valid search terms but also broader, narrower and related terms with which to modify
their searches. This latter function could also be performed automatically by the search
software, e.g. in looking for equivalent terms such as multi-lingual versions or synonyms.
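Automatic expansion of a search term with equivalent, broader or narrower terms could look something like the sketch below. The index layout, the relation labels and the `expand_query` function are assumptions for illustration only, and the sample entries are invented.

```python
# Hypothetical optimised copy of the thesaurus: term -> relation -> related terms.
SAMPLE_INDEX = {
    "Compositae": {"equivalent": ["Asteraceae"], "narrower": ["Bellis"]},
    "Bellis": {"broader": ["Compositae"]},
}


def expand_query(term, index, include=("equivalent",)):
    """Return the search term followed by the selected related terms, no duplicates."""
    entry = index.get(term, {})
    expanded = [term]
    for relation in include:
        for other in entry.get(relation, []):
            if other not in expanded:
                expanded.append(other)
    return expanded
```

By default only equivalent terms are added, which matches the automatic case mentioned in the text; a user interface could pass `include=("equivalent", "narrower")` to widen the search deliberately.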
As the Thesaurus grows it will potentially become very valuable for purposes other than the
indexing and searching of partner databases and this is perhaps an area that could be
investigated towards the end of the project.
8. Annex 1: Term Lists and Numbers of Terms Imported into the Prototype Thesaurus
The prototype thesaurus currently holds 169 term lists relating to 265,323 terms covering a
selection of British and European terrestrial and marine taxa, placenames, biotopes, minerals
and stratigraphic terms.
List Name | Number of terms
French Place Names - US National Imagery and Mapping Agency | 97589
Recorder 3.3 (1998) - British terrestrial taxa | 80275
UK Place Names - US National Imagery and Mapping Agency | 31261
Ulster Museum and Marine Conservation Society Marine Species Directory | 15527
English Civil Parishes | 10421
British Lithostratigraphic Names | 9534
General List of Mineral Names | 4139
A review of the scarce and threatened flies of Great Britain - Part 1 (Falk, S.J.) | 3103
British Butterflies and Moths (Bradley, J.D. and Fletcher, D.S., 1979) | 2716
CORINE Biotopes Project Habitat Classification | 2610
Botanical Society for the British Isles checklist (Kent, 1992) | 2512
A review of the scarce and threatened beetles of Great Britain Part 1 (Hyman, P.S. revised and updated by M.S. Parsons.) | 2407
EUNIS Biotopes Classification | 2378
British Red Data Book of Insects | 1800
A review of the scarce and threatened beetles of Great Britain Part 2 (Hyman, P.S. revised and updated by M.S. Parsons.) | 1327
British Biodiversity Action Plan Long list of taxa 1995 | 1252
British Biodiversity Action Plan Priority Species List 1998 | 1203
British Ornithologists Union British Checklist | 1120
A provisional Review of the status of British Microlepidoptera (Parsons, M.S. 1984.) | 1017
BRC 0820 - Bryopsida | 923
British National Vegetation Classification | 911
Berne Convention (Appendix II) Taxa | 709
Berne Convention (Appendix I) Taxa | 658
Habitats and Species directive (Annex II) Taxa | 648
British Arachnological Society checklist | 632
BRC Recording Card RA65 - BRC Araneae: Spiders | 623
British Spiders (Locket, Millidge & Merrett vol III, 1974) | 613
British Freshwater checklist (source unknown) | 606
BRC Recording Card RA8 - Butterflies & Moths | 590
A National Review of British Macrolepidoptera (Hadley, M.) | 508
A review of the scarce and threatened bees, wasps and ants of Great Britain (Falk, S.J.) | 502
BRC Recording Card RA57 - Terrestrial Heteroptera | 489
A review of the scarce and threatened Hemiptera of Great Britain (Kirby, P.) | 484
Phase 1 Habitat Classification | 481
English Placename in National Monuments Record | 446
Habitats and Species directive (Annex IV) Taxa | 399
British Trust for Ornithology five letter coding scheme | 391
BRC Recording Card RA66 - Diptera: Empids | 372
British Marine Nature Conservation Review Habitats | 367
BRC Recording Card RA37 - Homoptera: Auchenorhyncha | 359
BRC 6453 - Carabidae | 355
BRC Recording Card RA29 - Coleoptera: Carabidae | 342
British Red Data Book Vascular Plants | 327
BRC 0810 - Hepaticopsida | 326
BRC Recording Card RA11 - Diptera: Craneflies | 324
BRC Recording Card RA64 - Diptera: Fungus Gnats | 323
BRC Recording Card RA43 - Hymenoptera: Aculeata 1 - Ants & Wasps (excluding Dryinidae) | 319
British Trust for Ornithology two letter coding scheme | 310
English Districts 1974 | 302
British Biostratigraphy (selected) | 290
General Chronostratigraphy List | 287
A review of the Nationally Notable Spiders of Great Britain (Merrett, P.) | 281
BRC Recording Card RA41 - Coleoptera: Bruchidae & Chrysomelidae | 267
Berne Convention (Appendix III) Taxa | 267
BRC Recording Card RA67 - Diptera: Dolichopodidae | 266
BRC Recording Card RA44 - Hymenoptera: Aculeata 2 - Bees | 260
Scarce plants in Britain | 253
British Red Data Book of Bryophytes | 251
Rare marine benthic flora and fauna in Great Britain: the development of criteria for assessment | 247
BRC Recording Card RA36 - Aquatic Coleoptera (obsolete) | 245
BRC Recording Card RA33 - Diptera: Syrphidae | 237
Shimwell Urban Habitat Classification | 237
Berne Convention (Appendix I (continuation)) | 215
Bonn Convention (Appendix II) Taxa | 215
BRC Recording Card RA39 - Trichoptera | 202
Wildlife (Northern Ireland) Order (1985) | 190
BRC Recording Card RA18 - (obsolete) | 189
Wildlife and Countryside Act (Schedule 8) Taxa | 188
Birds directive (Annex I) Taxa | 181
International obligations for the protection of British species other than birds (Palmer 1996) | 180
British Red Data Book Lichens | 177
A review of the scarce and threatened Ethmiidae, Gelechiidae and Stathmopodidae moths of Great Britain (Parsons, M.S.) | 167
Habitats and Species directive (Annex II) - species for Macronesia | 166
BRC Recording Card RA34 - Diptera: Larger Brachycera | 151
British Red Data Book of Birds | 147
British Red Data Book of Invertebrates | 144
Habitats and Species directive (Annex V) Taxa | 143
Bonn Convention (Appendix I) Taxa | 140
Birks and Ratcliffe Upland Survey Biotopes | 121
Red Data Book of European Bryophytes | 118
Wildlife and Countryside Act (Schedule 5) Taxa | 116
Watsonian Vice Counties of Great Britain | 115
A review of the scarce and threatened pyralid moths of Great Britain (Parsons, M.S.) | 114
Wildlife and Countryside Act (Schedule 4) Taxa | 96
Threatened Rhoplocers (Heath) | 96
A review of the Trichoptera of Great Britain (Wallace, I.D.) | 94
English Nature Natural Areas | 92
BRC Recording Card RA9E - Lepidoptera: Butterflies - English names (obsolete) | 90
Peterken Woodland Stand Types | 89
A National Review of non-marine Molluscs (Foster, A.P.) | 84
BRC Recording Card RA48 - Lepidoptera: Oecophoridae | 83
Birds directive (Annex II) Taxa | 82
Wildlife and Countryside Act (Schedule 1) Taxa | 82
CITES UK Species only | 81
BRC Recording Card RA50 - Coleoptera: Elateroidea | 80
BRC Recording Card RA52 - Lepidoptera: Butterflies | 78
Habitats of Community Interest | 78
IUCN Red List of Threatened Animals (1996) for species occurring in the UK | 76
Wildlife and Countryside Act (Schedule 9) Taxa | 70
BRC Recording Card RA32 - Neuroptera & Mecoptera (obsolete) | 69
British Trust for Ornithology habitats list | 68
Biodiversity Action Plan Broad Habitats List | 64
Protection of Dragonflies (Van Toll) | 63
BRC Recording Card RA54 - Aquatic Heteroptera | 62
Seabird 2000 Habitats | 61
BRC Recording Card RA45 - Coleoptera: Cerambycidae | 59
Seabird 2000 Checklist | 58
BRC Recording Card RA46 - Odonata (obsolete) | 55
BRC 6411 - Odonata | 54
Scottish Districts 1974-96 | 53
Vegetation communities of British Isles | 52
BRC Recording Card RA59 - Diplopoda: Millipedes | 52
BRC Recording Card RA4/B - Orthoptera/Dermaptera/Dictyoptera | 49
BRC Recording Card RA56 - Ephemeroptera - Mayflies | 47
Conservation (Natural Habitats, &c.) Regulations 1994 (Statutory Instrument No. 2716) | 46
BRC Recording Card RA58 - Centipedes | 46
Biodiversity Action Plan Priority Habitats | 45
BRC Recording Card RA4 - (obsolete) | 44
BRC Recording Card RA47 - Coleoptera: Coccinellidae | 43
BRC Recording Card RA28 - (obsolete) | 42
Guidelines for selection of SSSI's | 41
Irish Vice-counties | 40
English Counties 1974 | 39
BRC Recording Card RA51 - Non-Marine Isopoda | 38
English Shire Counties (Pre 1974) | 38
Welsh Districts 1974-96 | 37
Birds directive (Annex III) Taxa | 36
English Metropolitan Districts | 36
BRC Recording Card RA53 - Diptera: Culicidae | 35
Scottish Counties? - 1974 | 33
Wildlife and Countryside Act (Schedule 3) Taxa | 33
London Boroughs 1974 | 32
Wildlife and Countryside Act (Schedule 2) Taxa | 32
British Sea Areas | 30
Scottish Unitary Councils 1996 | 29
A National Review of Orthoptera (Hadley, M.) | 28
Northern Irish Districts 1974 | 26
BRC Recording Card RA55 - Pseudoscorpiones: False Scorpions | 25
BRC Recording Card RA10 - Hymenoptera: Bumblebees (obsolete) | 25
A review of the scarce and threatened Emphemeroptera and Plecoptera of Great Britain (Bratton, J.H.) | 25
BRC Recording Card RA69 - Diptera: Conopidae | 25
English Nature Maritime Areas | 24
Welsh Unitary Councils 1996 | 22
BRC Recording Card RA27 - Opiliones | 22
English Nature Local Team Areas | 21
British Red Data Book Stoneworts | 21
A review of the scarcer Neuroptera of Great Britain (Kirby, P.) | 20
IUCN Red List of Threatened Plants (1997) for species occurring in the UK | 19
CCW Region/Area | 17
SNH Region/Areas | 17
NCC Region | 15
Isle of Man | 15
Wildlife and Countryside Act (Schedule 6) Taxa | 14
English Unitary Authorities 1996 | 14
Welsh Counties 1888-1974 | 13
English National Parks | 11
Botanical Classification of habitats | 10
BRC Recording Card RA40 - Coleoptera: Scolytidae (obsolete) | 10
Scottish Regions 1974-96 | 9
Country Names | 8
Welsh Counties 1974-96 | 8
Metropolitan Counties 1974-1985 | 7
Channel Islands | 7
BRC 6575 - Crayfish | 5
Scottish Island Councils | 3
London County Council | 1
Corporation of London | 1
Isle of Man Boroughs | 1
9. Annex 2: Term Relations in the Thesaurus
Terms may be related in various ways within the BioCASE Thesaurus. These include:
1. Is congruent with or equals. Typical relationships of this type occur in List_Item. In this case the terms may also be qualified by the thesaurus concepts UF (Use for), U (Use) or P (Preferred). Relationships in the Item_In_List table:
   • Is preferred term (for this list; there could be multiple preferred terms, e.g. preferred scientific name, preferred common name etc.)
   • Is a current alternative term (possibly a language version)
   • Is a synonym (may follow taxonomic rules)
2. Contains / Is contained by [typical relationship in Term_Version_Relations]. A version of a term may represent the merging of other terms, e.g. taxonomic merging or the addition of East Germany to Germany. Also occurs in the List_Item self-referential relationship for hierarchical relations, e.g. Bellis is in the Compositae.
3. Overlaps (e.g. 1974 County of Avon overlaps pre-1974 County of Gloucestershire).
4. Touching but not overlapping (e.g. partial concurrent boundary); useful for geographic relationships.
5. Is adjacent to but separated from. As above but with no shared boundary, e.g. for recording the geographic proximity of two place names.
6. Part of a non-touching set, e.g. a nature reserve made up of several distinct land parcels, or islands in an island group.
7. Pre-dates / Post-dates. May be used in Term_Version_Relation or can be inferred from the term introduced date.
8. Is parent of / Is child of. In hierarchies or trees this can be represented by a self-referential pointer (as in List_Item), but the relationship may be multiple and non-hierarchical (e.g. a complex hybrid may need several parents in term_version_relation).
9. Association. This is the related term of a classical thesaurus, linking different kinds of lists to allow branching associations, e.g. Alps is a gazetteer term and Alp is a geomorphological term. It can also be used to link terms in different term lists that relate to the same things, e.g. habitat equivalents (Alpine boreal in CORINE = Alpine Boreal in EUNIS).
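One way to make the relation types listed above explicit in software is an enumeration with a lookup for the paired relations. The sketch below is illustrative only: the names `Relation` and `inverse` are assumptions, and returning unpaired relations unchanged (that is, treating them as symmetric) is a simplification made here, not something the model states.

```python
from enum import Enum


class Relation(Enum):
    """Relation types taken from the list above (identifier names are hypothetical)."""
    EQUALS = "is congruent with or equals"
    CONTAINS = "contains"
    CONTAINED_BY = "is contained by"
    OVERLAPS = "overlaps"
    TOUCHES = "touching but not overlapping"
    ADJACENT = "is adjacent to but separated from"
    NON_TOUCHING_SET = "part of a non-touching set"
    PRE_DATES = "pre-dates"
    POST_DATES = "post-dates"
    PARENT_OF = "is parent of"
    CHILD_OF = "is child of"
    ASSOCIATION = "association"


# Paired relations named in the list; each maps to its opposite direction.
_INVERSE = {
    Relation.CONTAINS: Relation.CONTAINED_BY,
    Relation.PRE_DATES: Relation.POST_DATES,
    Relation.PARENT_OF: Relation.CHILD_OF,
}
# Add the reverse direction of every pair.
_INVERSE.update({v: k for k, v in list(_INVERSE.items())})


def inverse(relation):
    """Return the opposite relation; unpaired relations are returned unchanged
    (an assumption of this sketch)."""
    return _INVERSE.get(relation, relation)
```

With such a lookup, storing "Bellis is child of Compositae" would be enough to answer the reverse question without a second stored row.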
BioCASE Thesaurus
Welcome to the BioCASE Thesaurus Team home page!
• Thesaurus design, implementation and documentation
• Prototype Thesaurus Editor (Java)
• BioCASE Project home page
Related project pages at Southampton
In many cases these pages represent local activities in Southampton on larger collaborative projects, and
provide links to the corresponding parent organisations.
• ERMS (European Register of Marine Species)
• LITCHI (integrity and consistency of checklist databases)
• Species 2000
• Spice Project (system architecture for Species 2000)
BioCASE Thesaurus Team web site designed and maintained by Richard White, last edited on 12 June 2002.
Copyright © 2002 by BioCASE project. All rights reserved. Server hosted by School of Biological Sciences, Southampton
University.