BioCASE - A Biological Collection Access Service for Europe – CVR-CT-2001-40017

A Biodiversity Collection Access Service for Europe

Workpackage 4: Thesaurus modelling, data acquisition and design

Deliverable D4: Thesaurus criteria, candidate thesauri and catalogues

Charles Copp
Natural History Museum (London)

The BioCASE Thesaurus Team

Charles Copp (CC), Natural History Museum (Clevedon, UK)
Neil Caithness (NC), Natural History Museum (London, UK)
Richard White (RW), Southampton University (Southampton, UK)
John Robinson (JR), Southampton University (Southampton, UK)

Contents

1. Introduction
2. The purpose of the BioCASE Thesaurus
3. The BioCASE Thesaurus Database
3.1. The BioCASE Thesaurus Model
4. Sources of terms in the BioCASE Thesaurus
4.1. Sources
4.2. Domain Responsibilities
5. Criteria for the Selection of Term Lists, Classifications and Thesauri
6. Progress with Candidate Lists, Thesauri and Catalogues
6.1. Identification and Acquisition of Candidate Term Lists
6.2. Data Quality Issues related to the Thesaurus
7. Adding and Managing Term Lists in the Thesaurus
7.1. Data supply and update
7.2. Thesaurus Content Management
7.3. Thesaurus Database Management
7.4. Thesaurus Distribution and Use
8. Annex 1: Term lists and numbers of terms imported into the prototype thesaurus
9. Annex 2: Term relations in the Thesaurus

1. Introduction

This report covers the work of the WP4 (Thesaurus) Team to define the means by which we can establish, populate and manage an efficient thesaurus of relevant and related terms for the BioCASE Project. The BioCASE Thesaurus has been designed as a single relational database which can accommodate multiple hierarchical term lists relating to different subject areas (e.g. taxonomy, gazetteers and habitats). An initial trawl of potential sources has indicated a vast and overlapping resource of both complementary and competing thesauri, term lists and classifications that could be relevant to BioCASE.
The WP4 Team have developed a methodology whereby potential sources are documented and scored against a number of criteria, including uniqueness, completeness, accuracy, format and availability, and candidate lists are marked for acquisition. Some important sources covering taxonomy, gazetteers and biotopes have already been identified and steps taken to obtain or negotiate use of copies. The overall size of the task means that it is likely to continue through the length of the BioCASE Project. Much of the early work of the project has been concerned with establishing a sound model for the management of these term lists. Work with the prototype has included importing more than 250,000 terms from 169 source lists, both to explore data quality issues and to aid the development of thesaurus import and editing tools.

2. The purpose of the BioCASE Thesaurus

The purpose of the BioCASE Thesaurus is to provide controlled and classified terminology to enable reliable and accurate data retrieval from partner databases. The entries in the thesaurus will be derived from other classifications and term lists or from the indexing of partner databases.

It is not the objective of the BioCASE Thesaurus to become a terminology standard and there is no claim that its content will be either comprehensive or authoritative. There is no intention that the Thesaurus should replace or rival existing or potential terminology standards.

The BioCASE Thesaurus will be an enabling technology for providing maximum retrieval from a disparate and varied set of data sources. Where possible it will allow for inevitable variation in term form and the existence of synonyms and multi-lingual alternatives. The thesaurus will be structured, where appropriate, into related hierarchies that allow for broader, narrower and related term searches, again with the purpose of maximising the likelihood of positive results from a user query.
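The kind of query expansion described above can be illustrated with a small sketch. It is not the BioCASE data model: the class `MiniThesaurus`, its method names and the example terms are assumptions made purely for illustration, and equivalence is treated as transitive for simplicity.

```python
from collections import defaultdict


class MiniThesaurus:
    """Toy in-memory thesaurus (hypothetical; not the BioCASE schema)."""

    def __init__(self):
        self.narrower = defaultdict(set)    # term -> directly narrower terms
        self.equivalent = defaultdict(set)  # term -> synonyms and translations

    def add_narrower(self, broader_term, narrower_term):
        self.narrower[broader_term].add(narrower_term)

    def add_equivalent(self, term_a, term_b):
        # Equivalence is stored symmetrically.
        self.equivalent[term_a].add(term_b)
        self.equivalent[term_b].add(term_a)

    def expand(self, term, include_narrower=True):
        """Return the term, its equivalents and, optionally, all narrower terms."""
        found = set()
        stack = [term]
        while stack:
            current = stack.pop()
            if current in found:
                continue
            found.add(current)
            stack.extend(self.equivalent[current])
            if include_narrower:
                stack.extend(self.narrower[current])
        return found


# Illustrative content only: an informal group, a family, a species,
# an English common name and a German common name.
demo = MiniThesaurus()
demo.add_narrower("sea birds", "Laridae")
demo.add_narrower("Laridae", "Larus argentatus")
demo.add_equivalent("Larus argentatus", "Herring Gull")
demo.add_equivalent("Herring Gull", "Silbermöwe")

broad_search = demo.expand("sea birds")
synonym_search = demo.expand("Herring Gull", include_narrower=False)
```

A general user starting from "sea birds" would have the query widened to every narrower and equivalent term, whereas a specialist could restrict the expansion to equivalents only.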
A further function of the BioCASE Thesaurus will be to enable different user community views of the same data. For instance, taxon specialists will expect to access data using formal taxonomic names and may be very specific in their requirements, whereas general users and non-specialists may wish to use common names (probably in their own language) and may prefer broader group terms, some of which will be informal (e.g. sea birds). The Thesaurus will therefore need to provide a network of related terms and concepts, ideally with weighted associations, that can be explored by users starting from terminology that they understand and which can lead them to broader, narrower and equivalent terms or different terms that might produce fruitful results [1].

3. The BioCASE Thesaurus Database

3.1. The BioCASE Thesaurus Model

The structure of the BioCASE Thesaurus Database is detailed in a separate paper [2]. The principal feature of the BioCASE Thesaurus Database model is that it is designed as a mechanism for storing many term lists and versions of lists, together with the means for translating or relating from one to another.

The model we have implemented is optimised for the collection and management of multiple term lists and hierarchical classifications relating to any number of knowledge domains. This means that the same structure can be used to manage and relate taxon classifications, biotope and habitat classifications, gazetteers, stratigraphic hierarchies and museological terms.

The key principle that we are working to is that no individual list or hierarchy is regarded as definitive, but that it is important to know the source of terms and their scope within the source. The partner databases that will be indexed for BioCASE may include specimen data (e.g.
place names, taxon names, habitat terms) that derive from a wide range of lists and many older terms that would not be regarded as current. The object of the thesaurus is to capture these terms and where possible relate them to a known list.

The BioCASE Thesaurus managers will attempt to build links between terms and hierarchies of terms that enable users both to change the scope of their searches (broader or narrower, near terms and related terms) and to account automatically for equivalent terms in searches (e.g. common names and binomials, multi-lingual versions, syntactic and spelling variations). The system will need to be closely managed to ensure consistency and ideally be able to learn through use and user input.

The structure of the BioCASE Thesaurus Database is complex, with many tables. This complexity allows us to document and manage any kind of list or hierarchical classification from any of the domains [3] with which we are concerned. The structural complexity makes manual maintenance of the database difficult and we will therefore be relying heavily on the use of specially developed software tools [4] to handle additions, deletions and edits of list items.

The Thesaurus team are currently exploring the best way to make the thesaurus available to the other work packages, for whom the complex data model and extended metadata relating to lists will not be appropriate. We will therefore be presenting the core thesaurus information to the other work packages in a simplified and de-normalised format.

[1] See Annex 2 for a list of relationships modelled in the Thesaurus.
[2] The BioCASE Thesaurus, Logical and Physical Data Models, Charles Copp, BioCASE Report, June 2002.
[3] Subject areas, e.g. species names, place names, habitat types, modes of specimen preservation.
[4] Being written by John Robinson at Southampton University, School of Biological Sciences.

Figure 1: Prototype Thesaurus Viewer showing the ability to explore hierarchical trees of terms and to list equivalent and related terms from other lists

4. Sources of terms in the BioCASE Thesaurus

4.1. Sources

The BioCASE Thesaurus Database has been designed to hold and relate many term lists and hierarchical classifications. We wish to ensure that the indexing package has a high likelihood of finding terms it encounters in partner databases. This means that the Thesaurus Team have to look for the most inclusive and widely used term lists available and also allow for the addition of more localised and specialised lists as these are identified.

The BioCASE Thesaurus will borrow from or link to terminology standards where they exist and where they are relevant to data retrieval within the BioCASE project. The addition of terms to the thesaurus will not be a guide to their validity, only their utility. The BioCASE Thesaurus will carry no guarantee that its included term lists are comprehensive, although it will draw wherever possible from the most accurate and comprehensive sources available.

In addition to copies of or links to published lists, classifications and thesauri, the BioCASE Thesaurus will include terms derived from indexing partner databases and terms supplied with collections metadata. This approach is needed because partner databases may be in a variety of languages and include many free terms or be derived from in-house term lists. There are obvious dangers in allowing a thesaurus to grow in this way because simple lists of terms put together without rules would soon become unusable.
This implies that the thesaurus will need to be managed and to work within a set of rules. What these rules are and how they will be applied will be defined as the work progresses.

There are four principal sources of terms that will be incorporated into the BioCASE Thesaurus:

1. Recognised national and international standards (e.g. TDWG Geographic Codes). These should be consistent and reliable but may include both full terms and codes (e.g. ISO 639 Languages), either of which may have been used in databases. Standard lists are generally static, or the addition of new terms is strictly controlled.

2. Existing lists of terms that are maintained by a recognised organisation (e.g. Botanical Society of the British Isles Plant Names). These may be treated as emerging standards and are generally reliable, but several versions of a list may be in use. Some lists are represented by large and complex databases (e.g. the Getty Placenames Thesaurus and the Species 2000 Project). This category includes both static lists (e.g. UK Phase I habitats) and developing lists (e.g. EUNIS habitats). Some lists contain essentially the same terms (e.g. CORINE and EUNIS) but there may be orthographic differences, often unintended.

3. Existing informal lists of terms that are not fully controlled but may have been widely used (e.g. the Stratigraphy lists incorporated in the prototype of the thesaurus database). These lists may be derived from various published sources but may include a number of problems, including misspellings, duplication and inconsistent updates without version control. There can be multiple lists relating to the same topic but managed by different people and organisations. Some lists that have grown informally may include duplicated terms, both full duplicates and orthographic duplicates.

4. Terms derived from indexing partner databases. These terms may be derived from controlled lists or could be free terms. Typical problems include varying spelling, plurals, gender and use of abbreviations. Across the BioCASE area, free terms and text descriptions will include many spelling, abbreviation and language variants as well as the inevitable typographic errors. The growth of new terms from this source could be exponential and the indexing system will therefore need to deal with term reduction using stop lists, word stemming and other techniques. Even with term and ‘noise’ reduction there will be a significant resource implication for relating new terms to existing ones.

4.2. Domain Responsibilities

The tasks of identifying and obtaining checklists, term lists, classifications and catalogues have been divided amongst the thesaurus team. The responsibilities are:

• Richard White (Southampton University): Taxon lists and classifications
• Neil Caithness (Natural History Museum London): Gazetteers and Administrative Area names
• Charles Copp (Natural History Museum London): Geological, Ecological, Museological and other terms

5. Criteria for the Selection of Term Lists, Classifications and Thesauri

The BioCASE Thesaurus database has been designed to enable us to manage and index term data derived from a wide range of sources and formats and to provide a common means of relating terms from different sources. However, even a casual search on the ‘web’ or in museological literature demonstrates a potentially vast source of classifications, thesauri, gazetteers and term lists. We therefore needed to establish the criteria by which we can judge the suitability of any particular source of terms for our purposes.
We have identified a number of criteria that will help us select and prioritise appropriate sources of terms for the BioCASE Project. Some of the criteria are free text descriptions but where possible we will be applying fixed values that can be scored to help in the selection process.

It has been decided to include the selection and acquisition information relating to term lists as metadata in the BioCASE Thesaurus database. The intention is to maintain the BioCASE Thesaurus Database on a MySQL database at Southampton University and for members of the Thesaurus Team to access the database over the web to record metadata about term lists and their sources as they find them. The metadata will record further information about the lists we decide to acquire, such as cost, constraints on use and update agreements.

The criteria and metadata that we are recording include:

1. List Type

List Type refers to a controlled list of terms describing the ‘domain’ of any given term list and gives a convenient way of sorting lists within the Thesaurus Database. List type is hierarchical so that sub-domains can be grouped. Top level types such as Taxon list, Biotope list, Gazetteer and Geology can have subgroups; for instance, Minerals and Stratigraphy fall within Geology.

2. List Topic

Within a list type domain a list can have a specific topic, e.g. a regional taxon list may cover only the Leguminosae. In terms of selection for inclusion in the BioCASE Thesaurus the topic may be judged on:

• Does the subject cover only items relevant to BioCASE
• Also extends to items outside of the current BioCASE remit
• Outside but related (e.g. mineral names) and may be useful in the future
• Completely outside of current BioCASE scope

3. Theme Coverage

Within the topic covered, lists may:

• Cover the whole theme
• Cover a subset of the theme

4. Geographic Coverage

Some lists are global in their coverage but most are restricted to some named geographic area.

• Is the term list applicable to a specific geographic area – if so, where?

5. Language

The BioCASE Project is initially concentrating on English as a common language for indexing and retrieval, but multi-lingual terms will be encountered increasingly as the project progresses and especially when indexing of ‘unit’ data within partner databases takes place. Lists may be available in many languages and some include multi-lingual synonyms; place names are likely to be the earliest sources of non-English terms.

• What language?
• Does the list provide an essential source of terms, e.g. place names in countries where there has been much recent change?

6. Standard

This is a flag to record whether this list is an international or national standard.

7. Uniqueness

If a term list or thesaurus is unique in its content and relevant to BioCASE then it will become a priority for acquisition. For some domains, however, there seems to be an almost inexhaustible supply of alternative lists that may entirely or partially duplicate each other and we may then use other criteria in selection for the BioCASE Thesaurus. Possible values for lists are:

• Only source of these terms – must have
• Includes some unique terms that are important to include
• Most terms covered elsewhere but could be of value
• Fully duplicated by another more readily available or reliable source

8. Completeness

It is useful to know how complete a list is within its given domain.

• Complete
• Incomplete

9. Accuracy

The BioCASE Thesaurus does not set out to become an authoritative standard but it is important to know the origin and quality of the term lists it incorporates, e.g. for choosing a ‘preferred’ term as a listing heading when faced with synonyms and orthographic variants. Possible values for lists are:

• International or national standard
• Classification or thesaurus assembled or maintained by acknowledged ‘expert’, respected Society or consortium
• Informal list but assumed to be accurate
• List known to include inaccuracies but widely used
• List considered unreliable or inaccurate

10. Version Detail and Date of List Version

Term lists, thesauri and classifications are often released or replaced in different versions. It is important for us to know which version we have and how it relates to other versions. For some lists we may simply need the most recent version, for others we may need all versions.

11. Maintenance

Very few of the larger term lists are either static or complete and therefore it is necessary to know how these lists are updated and maintained in order that, if they are incorporated into the BioCASE Thesaurus, our copy can be kept up-to-date.

• Static international or national standard
• Static informal list
• Maintained international or national standard
• Maintained formal list not adopted as standard
• Maintained informal list

12. Updates

Where lists have a controlling authority and are subject to change we will need to record how and when we can receive updates.

• Complete copy of static list – no updates needed
• Copy of a maintained list with arrangement for update
• Remote access to an on-line maintained thesaurus or dictionary

13. Current Storage Format

Current storage format refers to the ‘native’ format of the list, which will affect the ease with which it can be manipulated or incorporated into the BioCASE Thesaurus. Some likely types include:

• Manuscript list
• Published list (paper format)
• List in text format
• Spreadsheet-style format (single table)
• Thesaurus structured text list
• Relational database
• Proprietary electronic format

14. List structure suitability

This field provides us with an indication of whether the list can be readily incorporated or will need significant manipulation and restructuring to suit our purposes.

• Directly usable by BioCASE
• Needs simple re-structuring
• Needs significant restructuring
• Not known

15. Availability

The availability of a term list for use or incorporation into our own thesaurus is a key criterion. Some lists are available to subscribers only and may only be accessible on-line, not as copies. Other lists may have use constraints placed upon them. In either situation we will need to judge whether such lists are critical to BioCASE or can be replaced with more freely available alternatives.

• Freely available to copy and use without constraint
• Freely available to copy and use within copyright or negotiated constraints
• Freely available to access on-line but not copy
• Copy available for a one-off cost
• Copy and updates available through subscription
• On-line access by subscription
• Not known

16. Cost

If the list is not freely available, what are the costs involved in obtaining and maintaining a copy?

17. Source of copy

This field allows us to record the means and format in which we can obtain a copy of the list.

• Available by download from web or by ftp
• Available on web for searching but not download
• Available in electronic format on disk or CD
• Paper format only
• Subject to negotiation

In addition to the above criteria we are recording the following metadata for named list versions:

1. Acquisition Priority based on an assessment of the above criteria. Scored on a scale of 1 (must have) to 5 (don’t need)
2. Actual Cost of Acquisition
3. Flag for lists actually acquired
4. Update arrangements
5. Source and agreement metadata
6. Contact name and organisation data
7. URL for related website

6. Progress with Candidate Lists, Thesauri and Catalogues

6.1. Identification and Acquisition of Candidate Term Lists

The team has met several times to discuss the task of identifying and evaluating candidate thesauri and has established, through a preliminary trawl of lists and websites, that this task will need to run through the whole project. The range and coverage of lists and thesauri covering natural science topics (including taxonomy and gazetteers) is vast, with much overlap.

The late start to the project and early difficulties in getting staff in place have meant that progress in selecting term lists according to the established criteria has been slower than expected, although major taxonomic and gazetteer sources have been identified and steps taken to obtain copies. These include the BIOSIS and Species 2000 taxonomic classifications (covering worldwide taxa) and the US National Imagery and Mapping Agency Gazetteer (c. 3 million place names). Importing of these lists has been given a lower priority whilst work has concentrated on the development of the Thesaurus data model and prototype database. C.
Copp has, however, brought together a number of specifically European lists covering a range of domains to test the developing prototype database and the various thesaurus-handling tools. The prototype database currently holds 169 different lists and classifications relating to 265,323 terms (use of the same terms in different lists brings the listable entries to 305,376). These lists have been imported from various sources, including dictionaries maintained by the UK National Biodiversity Network. The lists have been imported with minimal data cleaning, which has highlighted a number of data quality issues that the project will need to address.

6.2. Data Quality Issues related to the Thesaurus

A prototype version of the BioCASE Thesaurus, populated with circa 30,000 terms representing a number of earth science and biotope lists and classifications, was circulated for testing and comment early in the project. The Paris team quickly identified a number of issues deriving from inconsistencies, duplications and term formats that will create potential problems for the indexing work package. Although the lists of terms delivered with the prototype thesaurus were for test purposes only, they did give a good indication of what happens when lists from different sources are brought together. The test also gave an indication of what will happen when we start trying to index partner databases and attempt to link terms to existing thesaurus term lists.

The underlying problem is that we cannot control what is in existing databases. If we are building a database (e.g. the collections metadatabase) from scratch then we have the opportunity to enforce fairly strict terminology control (at least for higher level terms), but for existing unit-level databases this will not be possible.
However, in mapping unit databases to a common schema (as views) we can, at least, identify term groups (e.g. geographic, taxonomic, biotope etc.) so that we do not confuse overlapping concepts (e.g. Essex Emerald in an Identification field is a taxon and does not refer to a collection site in Essex).

Terms derived from indexing unit data might have any of the following characteristics:

• Source term list may or may not be identifiable
• May be in international formal language (e.g. taxonomic) or in any national language
• If a term is not already in the thesaurus, it might not be readily assignable to a higher hierarchical level (e.g. new taxon or geographic terms)
• May be concatenated with other terms
• May be misspelt or wrongly capitalised
• May be pluralised or, in some languages, in a different gender
• May be abbreviated or entered as a code (e.g. habitat code) or symbolic notation (e.g. chemical composition)
• May include other text (e.g. and, the, of)
• May include qualifying words (e.g. outside, near, cf.)
• May include punctuation or other symbols (e.g. = , > ? [ ] )
• Terms may be made up of several words (Lower Jurassic, Lesser Spotted Woodpecker)

This raises a number of problems for both the indexing and thesaurus teams:

• Simple atomised indexing (i.e. each separate word) will miss critical links between terms. It might be necessary to parse terms for word position, connecting words and punctuation (e.g. to resolve ‘Lower Rhaetian, Upper Triassic’).
• Qualifiers may be important. In the UK, square brackets [ ] are commonly used to denote inferred information; ? and cf. are important in identifications. Qualifiers might need to be stripped for indexing purposes but some may form part of a name (e.g. aff. or var.).
• The Index could fill up with spurious variants of terms – there will need to be a means whereby common variants such as plurals can be recognised.
• It might not be possible to link an indexed term to one in the thesaurus without further evidence (e.g. if a locality is given as ‘Germany’ the date is also important; there are also numerous instances of the same taxonomic binomial referring to a plant, an animal and a fossil).
• New terms derived from indexing may not be accessible under broader term/narrower term searches until a link is added.

What this means for the thesaurus:

We have to be prepared to work with, and develop a strategy to deal with, inconsistent terms derived from multiple term sources, typographic errors, inter-changeability of terms, abbreviations and codes. We might be able to influence terms used in new metadata but there is no way that we can influence the form of existing unit data, so our products must be designed to deal with inconsistencies and uncertainties.

7. Adding and Managing Term Lists in the Thesaurus

This section provides an outline of the likely processes and 'job roles' involved in establishing and maintaining the BioCASE Thesaurus. The envisaged pattern of information flow is further summarised in Figure 2 (below). The processes identified are original data supply, thesaurus update, thesaurus content management, thesaurus database management, distribution and thesaurus use.

7.1. Data supply and update

Figure 2 illustrates the diversity of sources that the BioCASE Thesaurus will derive its term lists from. It is the role of the Thesaurus Team to identify and acquire sufficient term lists to provide a sound basis for indexing partner databases and to provide a framework into which new terms can be fitted. It is not the role of the Thesaurus Team to validate or alter imported lists although the physical structure will often be modified to suit the BioCASE Thesaurus Model.
Lists might be scanned for consistency but changes should only be initiated by the list or classification owner. The Thesaurus Team will have to negotiate use of lists and arrangements for update with the list owners, and all terms and data imported into the BioCASE Thesaurus should have enough associated metadata to indicate their origins and any constraints attached to use.

For practical reasons, the task of acquiring and importing term lists and classifications for the BioCASE Thesaurus has been split into three areas: taxonomy, gazetteers and the rest (habitats, museology, earth science etc.).

Figure 2: BioCASE Thesaurus in relation to term sources and user queries

7.2. Thesaurus Content Management

Each of the three Thesaurus areas mentioned has an individual responsible for content and acquisition. The workpackage leader co-ordinates their work for reporting purposes.

The content and performance of the thesaurus will be monitored throughout the project. To aid this, a version of the thesaurus will be available on-line to all partners and a report on content will be delivered at each BioCASE Technical Committee meeting.
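The fixed-value criteria of section 5 and the acquisition priority scale of 1 (must have) to 5 (don't need) could be combined along the following lines. This is only a sketch: the choice of three criteria, the shortened option labels, the weights and the function name `acquisition_priority` are all assumptions for illustration and do not represent an agreed scoring method.

```python
# Option ranks follow the order in which section 5 lists them (1 = most favourable).
# The labels are shortened paraphrases, and 'Not known' values are left out.
UNIQUENESS = {
    "only source": 1,
    "some unique terms": 2,
    "mostly covered elsewhere": 3,
    "fully duplicated": 4,
}
ACCURACY = {
    "standard": 1,
    "expert maintained": 2,
    "informal, assumed accurate": 3,
    "known inaccuracies, widely used": 4,
    "unreliable": 5,
}
SUITABILITY = {
    "directly usable": 1,
    "simple restructuring": 2,
    "significant restructuring": 3,
}

# Hypothetical weightings: uniqueness counts most, structure least.
WEIGHTS = {"uniqueness": 3.0, "accuracy": 2.0, "suitability": 1.0}


def acquisition_priority(uniqueness, accuracy, suitability):
    """Map three criterion values to a priority from 1 (must have) to 5 (don't need)."""
    chosen = {
        "uniqueness": (UNIQUENESS, uniqueness),
        "accuracy": (ACCURACY, accuracy),
        "suitability": (SUITABILITY, suitability),
    }
    weighted_total = 0.0
    for name, (scale, value) in chosen.items():
        rank = scale[value]
        # Rescale each rank to 0.0 (best) .. 1.0 (worst) so scales of
        # different lengths contribute comparably.
        normalised = (rank - 1) / (len(scale) - 1)
        weighted_total += WEIGHTS[name] * normalised
    fraction = weighted_total / sum(WEIGHTS.values())
    return 1 + round(fraction * 4)


best_case = acquisition_priority("only source", "standard", "directly usable")
worst_case = acquisition_priority("fully duplicated", "unreliable", "significant restructuring")
```

A score of this kind would only be a starting point for discussion; criteria such as cost and availability are deliberately left out of the sketch.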
As the project progresses, the process by which we apply the selection criteria, possibly with respective weightings, will mature; it will also be modified as we gain more experience in negotiating term list use with owners. Where there are costs involved in acquiring a list, or in gaining access to a thesaurus or classification, these will need to be balanced against the cost of not having access and against the likelihood of longer-term maintenance of the BioCASE Thesaurus.

During the course of the project it will be necessary to formulate a forward plan for how the content of the thesaurus will be sustainably maintained and updated in the future. In particular, we will need to define who will be responsible for ongoing agreements with term list and thesaurus suppliers to provide corrections and updates, and how any financial implications will be met.

7.3. Thesaurus Database Management

The BioCASE Thesaurus has been built and is maintained in a MySQL database based at the Southampton University School of Biological Sciences. This database will be accessible to the team and other technical workpackage members through a Thesaurus website (http://biodiversity.soton.ac.uk/biocase/). The Southampton team is responsible for maintaining the physical database and for providing software tools to both edit and access it. The structure of the master database will be reviewed now that it is substantially populated with lists from each of the identified topic areas. It may also be necessary to migrate the thesaurus to PostgreSQL, which is being used by Berlin and Paris in their development work, although this is not a priority at present.

7.4. Thesaurus Distribution and Use

It is also the role of the Thesaurus Team to provide a version of the thesaurus to the other technical workpackages in a format that lends itself to their needs.
The current format is quite complex, with many relational tables that give us the flexibility we need for collating lists of different types and versions from many sources, but it is not the ideal delivery format for other users. We will therefore work with partners to define the best format for their purposes and export terms from the master database into a simpler format, e.g. one optimised for servicing queries (see Figure 2). Once again, this arrangement will need a strategy for maintenance when the BioCASE project comes to an end.

Figure 2 envisages that the thesaurus, or an optimised copy of it, will be used in several ways. The indexing package will check terms against it and identify new terms, which may then be checked against other on-line thesauri such as the Alexandria Gazetteer. New terms will then be added to the master database. New terms will have to be flagged so that the thesaurus team members responsible for thesaurus content can check them for links to existing terms. Users of the search portal will have access to the thesaurus in some way, to provide not only valid search terms but also broader, narrower and related terms with which to modify their searches. This latter function could also be performed automatically by the search software, e.g. in looking for equivalent terms such as multilingual versions or synonyms.

As the Thesaurus grows, it will potentially become very valuable for purposes other than the indexing and searching of partner databases, and this is perhaps an area that could be investigated towards the end of the project.

8. Annex 1: Term Lists and Numbers of Terms Imported into the Prototype Thesaurus

The prototype thesaurus currently holds 169 term lists relating to 265,323 terms covering a selection of British and European terrestrial and marine taxa, placenames, biotopes, minerals and stratigraphic terms.

List Name | Number of terms
French Place Names - US National Imagery and Mapping Agency | 97589
Recorder 3.3 (1998) - British terrestrial taxa | 80275
UK Place Names - US National Imagery and Mapping Agency | 31261
Ulster Museum and Marine Conservation Society Marine Species Directory | 15527
English Civil Parishes | 10421
British Lithostratigraphic Names | 9534
General List of Mineral Names | 4139
A review of the scarce and threatened flies of Great Britain - Part 1 (Falk, S.J.) | 3103
British Butterflies and Moths (Bradley, J.D. and Fletcher, D.S., 1979) | 2716
CORINE Biotopes Project Habitat Classification | 2610
Botanical Society for the British Isles checklist (Kent, 1992) | 2512
A review of the scarce and threatened beetles of Great Britain Part 1 (Hyman, P.S. revised and updated by M.S. Parsons.) | 2407
EUNIS Biotopes Classification | 2378
British Red Data Book of Insects | 1800
A review of the scarce and threatened beetles of Great Britain Part 2 (Hyman, P.S. revised and updated by M.S. Parsons.) | 1327
British Biodiversity Action Plan Long list of taxa 1995 | 1252
British Biodiversity Action Plan Priority Species List 1998 | 1203
British Ornithologists Union British Checklist | 1120
A provisional Review of the status of British Microlepidoptera (Parsons, M.S. 1984.) | 1017
BRC 0820 - Bryopsida | 923
British National Vegetation Classification | 911
Berne Convention (Appendix II) Taxa | 709
Berne Convention (Appendix I) Taxa | 658
Habitats and Species directive (Annex II) Taxa | 648
British Arachnological Society checklist | 632
BRC Recording Card RA65 - BRC Araneae: Spiders | 623
British Spiders (Locket, Millidge & Merrett vol III, 1974) | 613
British Freshwater checklist (source unknown) | 606
BRC Recording Card RA8 - Butterflies & Moths | 590
A National Review of British Macrolepidoptera (Hadley, M.) | 508
A review of the scarce and threatened bees, wasps and ants of Great Britain (Falk, S.J.) | 502
BRC Recording Card RA57 - Terrestrial Heteroptera | 489
A review of the scarce and threatened Hemiptera of Great Britain (Kirby, P.) | 484
Phase 1 Habitat Classification | 481
English Placename in National Monuments Record | 446
Habitats and Species directive (Annex IV) Taxa | 399
British Trust for Ornithology five letter coding scheme | 391
BRC Recording Card RA66 - Diptera: Empids | 372
British Marine Nature Conservation Review Habitats | 367
BRC Recording Card RA37 - Homoptera: Auchenorhyncha | 359
BRC 6453 - Carabidae | 355
BRC Recording Card RA29 - Coleoptera: Carabidae | 342
British Red Data Book Vascular Plants | 327
BRC 0810 - Hepaticopsida | 326
BRC Recording Card RA11 - Diptera: Craneflies | 324
BRC Recording Card RA64 - Diptera: Fungus Gnats | 323
BRC Recording Card RA43 - Hymenoptera: Aculeata 1 - Ants & Wasps (excluding Dryinidae) | 319
British Trust for Ornithology two letter coding scheme | 310
English Districts 1974 | 302
British Biostratigraphy (selected) | 290
General Chronostratigraphy List | 287
A review of the Nationally Notable Spiders of Great Britain (Merrett, P.) | 281
BRC Recording Card RA41 - Coleoptera: Bruchidae & Chrysomelidae | 267
Berne Convention (Appendix III) Taxa | 267
BRC Recording Card RA67 - Diptera: Dolichopodidae | 266
BRC Recording Card RA44 - Hymenoptera: Aculeata 2 - Bees | 260
Scarce plants in Britain | 253
British Red Data Book of Bryophytes | 251
Rare marine benthic flora and fauna in Great Britain: the development of criteria for assessment | 247
BRC Recording Card RA36 - Aquatic Coleoptera (obsolete) | 245
BRC Recording Card RA33 - Diptera: Syrphidae | 237
Shimwell Urban Habitat Classification | 237
Berne Convention (Appendix I (continuation)) | 215
Bonn Convention (Appendix II) Taxa | 215
BRC Recording Card RA39 - Trichoptera | 202
Wildlife (Northern Ireland) Order (1985) | 190
BRC Recording Card RA18 - (obsolete) | 189
Wildlife and Countryside Act (Schedule 8) Taxa | 188
Birds directive (Annex I) Taxa | 181
International obligations for the protection of British species other than birds (Palmer 1996) | 180
British Red Data Book Lichens | 177
A review of the scarce and threatened Ethmiidae, Gelechiidae and Stathmopodidae moths of Great Britain (Parsons, M.S.) | 167
Habitats and Species directive (Annex II) - species for Macaronesia | 166
BRC Recording Card RA34 - Diptera: Larger Brachycera | 151
British Red Data Book of Birds | 147
British Red Data Book of Invertebrates | 144
Habitats and Species directive (Annex V) Taxa | 143
Bonn Convention (Appendix I) Taxa | 140
Birks and Ratcliffe Upland Survey Biotopes | 121
Red Data Book of European Bryophytes | 118
Wildlife and Countryside Act (Schedule 5) Taxa | 116
Watsonian Vice Counties of Great Britain | 115
A review of the scarce and threatened pyralid moths of Great Britain (Parsons, M.S.) | 114
Wildlife and Countryside Act (Schedule 4) Taxa | 96
Threatened Rhopalocera (Heath) | 96
A review of the Trichoptera of Great Britain (Wallace, I.D.) | 94
English Nature Natural Areas | 92
BRC Recording Card RA9E - Lepidoptera: Butterflies - English names (obsolete) | 90
Peterken Woodland Stand Types | 89
A National Review of non-marine Molluscs (Foster, A.P.) | 84
BRC Recording Card RA48 - Lepidoptera: Oecophoridae | 83
Birds directive (Annex II) Taxa | 82
Wildlife and Countryside Act (Schedule 1) Taxa | 82
CITES UK Species only | 81
BRC Recording Card RA50 - Coleoptera: Elateroidea | 80
BRC Recording Card RA52 - Lepidoptera: Butterflies | 78
Habitats of Community Interest | 78
IUCN Red List of Threatened Animals (1996) for species occurring in the UK | 76
Wildlife and Countryside Act (Schedule 9) Taxa | 70
BRC Recording Card RA32 - Neuroptera & Mecoptera (obsolete) | 69
British Trust for Ornithology habitats list | 68
Biodiversity Action Plan Broad Habitats List | 64
Protection of Dragonflies (Van Toll) | 63
BRC Recording Card RA54 - Aquatic Heteroptera | 62
Seabird 2000 Habitats | 61
BRC Recording Card RA45 - Coleoptera: Cerambycidae | 59
Seabird 2000 Checklist | 58
BRC Recording Card RA46 - Odonata (obsolete) | 55
BRC 6411 - Odonata | 54
Scottish Districts 1974-96 | 53
Vegetation communities of British Isles | 52
BRC Recording Card RA59 - Diplopoda: Millipedes | 52
BRC Recording Card RA4/B - Orthoptera/Dermaptera/Dictyoptera | 49
BRC Recording Card RA56 - Ephemeroptera - Mayflies | 47
Conservation (Natural Habitats, &c.) Regulations 1994 (Statutory Instrument No. 2716) | 46
BRC Recording Card RA58 - Centipedes | 46
Biodiversity Action Plan Priority Habitats | 45
BRC Recording Card RA4 - (obsolete) | 44
BRC Recording Card RA47 - Coleoptera: Coccinellidae | 43
BRC Recording Card RA28 - (obsolete) | 42
Guidelines for selection of SSSI's | 41
Irish Vice-counties | 40
English Counties 1974 | 39
BRC Recording Card RA51 - Non-Marine Isopoda | 38
English Shire Counties (Pre 1974) | 38
Welsh Districts 1974-96 | 37
Birds directive (Annex III) Taxa | 36
English Metropolitan Districts | 36
BRC Recording Card RA53 - Diptera: Culicidae | 35
Scottish Counties? - 1974 | 33
Wildlife and Countryside Act (Schedule 3) Taxa | 33
London Boroughs 1974 | 32
Wildlife and Countryside Act (Schedule 2) Taxa | 32
British Sea Areas | 30
Scottish Unitary Councils 1996 | 29
A National Review of Orthoptera (Hadley, M.) | 28
Northern Irish Districts 1974 | 26
BRC Recording Card RA55 - Pseudoscorpiones: False Scorpions | 25
BRC Recording Card RA10 - Hymenoptera: Bumblebees (obsolete) | 25
A review of the scarce and threatened Ephemeroptera and Plecoptera of Great Britain (Bratton, J.H.) | 25
BRC Recording Card RA69 - Diptera: Conopidae | 25
English Nature Maritime Areas | 24
Welsh Unitary Councils 1996 | 22
BRC Recording Card RA27 - Opiliones | 22
English Nature Local Team Areas | 21
British Red Data Book Stoneworts | 21
A review of the scarcer Neuroptera of Great Britain (Kirby, P.) | 20
IUCN Red List of Threatened Plants (1997) for species occurring in the UK | 19
CCW Region/Area | 17
SNH Region/Areas | 17
NCC Region | 15
Isle of Man | 15
Wildlife and Countryside Act (Schedule 6) Taxa | 14
English Unitary Authorities 1996 | 14
Welsh Counties 1888-1974 | 13
English National Parks | 11
Botanical Classification of habitats | 10
BRC Recording Card RA40 - Coleoptera: Scolytidae (obsolete) | 10
Scottish Regions 1974-96 | 9
Country Names | 8
Welsh Counties 1974-96 | 8
Metropolitan Counties 1974-1985 | 7
Channel Islands | 7
BRC 6575 - Crayfish | 5
Scottish Island Councils | 3
London County Council | 1
Corporation of London | 1
Isle of Man Boroughs | 1

9. Annex 2: Term Relations in the Thesaurus

Terms may be related in various ways within the BioCASE Thesaurus. These include:

1. Is congruent with or equals. Typical relationships of this type occur in List_Item. In this case the terms may also be qualified by the thesaurus concepts UF (Use for) and U (Use) or P (Preferred). Relationships in the Item_In_List table:
• Is preferred term (for this list; there could be multiple preferred terms, e.g. preferred scientific name, preferred common name etc.)
• Is a current alternative term (possibly a language version)
• Is a synonym (may follow taxonomic rules)
2. Contains / Is contained by [typical relationship in Term_Version_Relations]. A version of a term may represent the merging of other terms, e.g. taxonomic merging or the addition of East Germany to Germany. Also occurs in the List_Item self-referential relationship for hierarchical relations, e.g. Bellis is in the Compositae.
3. Overlaps (e.g. the 1974 County of Avon overlaps the pre-1974 County of Gloucestershire).
4. Touching but not overlapping (e.g. partial concurrent boundary); useful for geographic relationships.
5. Is adjacent to but separated from. As above but with no shared boundary, e.g. for recording the geographic proximity of two place names.
6. Part of a non-touching set, e.g. a nature reserve made up of several distinct land parcels, or islands in an island group.
7. Pre-dates / Post-dates. May be used in Term_Version_Relation or can be inferred from the term introduced date.
8. Is parent of / Is child of. In hierarchies or trees this can be represented by a self-referential pointer (as in List_Item), but the relationship may be multiple and non-hierarchical (e.g. a complex hybrid may need several parents in term_version_relation).
9. Association. This is the related term of a classical thesaurus, linking different kinds of lists to allow branching associations, e.g. Alps is a gazetteer term and Alp is a geomorphological term. It can also be used to link terms in different term lists that relate to the same things, e.g. habitat equivalents (Alpine boreal in CORINE = Alpine Boreal in EUNIS).
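The relation types listed in Annex 2 could be represented along the following lines. This is a minimal Python sketch and not the BioCASE implementation: the names `Relation`, `RelationStore` and `expand`, the in-memory storage, and the treatment of 'is child of' as a broader term for search expansion are all assumptions made for illustration. The example terms are those used in Annex 2.

```python
from collections import defaultdict
from enum import Enum


class Relation(Enum):
    """Relation types from Annex 2 (member names are hypothetical)."""
    EQUALS = "is congruent with or equals"
    CONTAINS = "contains"
    CONTAINED_BY = "is contained by"
    OVERLAPS = "overlaps"
    TOUCHES = "touching but not overlapping"
    ADJACENT = "is adjacent to but separated from"
    PART_OF_SET = "part of a non-touching set"
    PRE_DATES = "pre-dates"
    POST_DATES = "post-dates"
    PARENT_OF = "is parent of"
    CHILD_OF = "is child of"
    ASSOCIATION = "association"


# Symmetric relations map to themselves; directional ones map to their pair.
# PART_OF_SET has no named inverse in Annex 2, so it is stored one way only.
INVERSE = {
    Relation.EQUALS: Relation.EQUALS,
    Relation.OVERLAPS: Relation.OVERLAPS,
    Relation.TOUCHES: Relation.TOUCHES,
    Relation.ADJACENT: Relation.ADJACENT,
    Relation.ASSOCIATION: Relation.ASSOCIATION,
    Relation.CONTAINS: Relation.CONTAINED_BY,
    Relation.CONTAINED_BY: Relation.CONTAINS,
    Relation.PRE_DATES: Relation.POST_DATES,
    Relation.POST_DATES: Relation.PRE_DATES,
    Relation.PARENT_OF: Relation.CHILD_OF,
    Relation.CHILD_OF: Relation.PARENT_OF,
}


class RelationStore:
    """In-memory store of (term, relation, term) links."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        """Record a link and, where one is defined, its inverse."""
        self._edges[subject].add((relation, obj))
        inverse = INVERSE.get(relation)
        if inverse is not None:
            self._edges[obj].add((inverse, subject))

    def related(self, term, relation):
        """Return the terms linked to `term` by `relation`, sorted."""
        return sorted(o for r, o in self._edges.get(term, ()) if r is relation)


def expand(store, term):
    """Suggest equivalent, broader, narrower and related search terms."""
    return {
        "equivalent": store.related(term, Relation.EQUALS),
        "broader": store.related(term, Relation.CHILD_OF),
        "narrower": store.related(term, Relation.PARENT_OF),
        "related": store.related(term, Relation.ASSOCIATION),
    }


store = RelationStore()
store.add("Compositae", Relation.PARENT_OF, "Bellis")
store.add("County of Avon (1974)", Relation.OVERLAPS,
          "County of Gloucestershire (pre-1974)")
store.add("Alps", Relation.ASSOCIATION, "Alp")
store.add("Alpine boreal (CORINE)", Relation.EQUALS, "Alpine Boreal (EUNIS)")

print(expand(store, "Bellis"))
```

Storing the inverse link at insertion time keeps lookups simple in either direction, at the cost of duplicating each link; a relational implementation could instead store each link once and query it in both directions.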