PART 1 : What is a thesaurus ? Concept and samples

Christine Laaboudi-Spoiden
Publications Office of the European Communities
EUR-LEX Unit – Documentary section
Cape Town, June 2006
EUR-Lex – Searching information
 EUR-LEX http://eur-lex.europa.eu/en/index.htm
direct free access to European Union law
• the treaties, legislation, case-law and legislative
proposals
– Official Journal of the European Union
– Official Journal L – Legislation
– Official Journal C – Information and notices
– Official Journal – Special editions
– European Court Reports
– Documents of the institutions
– Consolidated texts
EUR-Lex : Searching information
COMPUTER CRIME
• Title and text: computer crime  40 Hits
COMPUTER RELATED CRIME
• Title and text: computer crime 58 Hits
CYBERCRIME
• Title and text: cybercrime  55 Hits
CYBER CRIME
• Title and text: cyber crime  48 Hits
COMPUTER CRIME, CYBERCRIME, CYBER
CRIME (Boolean - OR)
• Title and text: computer crime, cybercrime  129
Hits
 USE OF SYNONYMS OR EQUIVALENT TERMS
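The effect of combining synonyms with a Boolean OR can be sketched in a few lines of Python. This is only an illustration: the four documents below are invented and are not EUR-Lex records, and the helper names `search` and `search_any` are hypothetical.

```python
# Hypothetical mini-corpus: document id -> title/text (not real EUR-Lex data).
DOCUMENTS = {
    1: "Council decision on attacks against information systems and computer crime",
    2: "Communication on fighting cybercrime",
    3: "Report on cyber crime and network security",
    4: "Regulation on the common organisation of the market in wine",
}


def search(term):
    """Return the ids of documents whose text contains the phrase."""
    needle = term.lower()
    return {doc_id for doc_id, text in DOCUMENTS.items() if needle in text.lower()}


def search_any(terms):
    """Boolean OR: the union of the hit sets of every term."""
    hits = set()
    for term in terms:
        hits |= search(term)
    return hits


synonyms = ["computer crime", "cybercrime", "cyber crime"]
for term in synonyms:
    print(term, "->", sorted(search(term)))
print("OR ->", sorted(search_any(synonyms)))
```

Each spelling on its own finds only part of the collection; the union finds every document on the topic, which is what a controlled vocabulary achieves without asking the user to guess all the variants.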
EUR-Lex sample –Bibliographic Notice
EUROVOC
DESCRIPTORS
(in French: TERMES D’INDEXATION or DESCRIPTEURS)
INDEXING TERMS
PREFERRED TERMS
CLASSIFICATION SCHEME
SUBJECT HEADINGS
Indexing process
 Indexing = identifying the concepts
represented in a document
EUROVOC descriptor:
information society, computer crime, personal data,
electronic mail, confidentiality
For information retrieval (information request)
Title and text: computer crime, cybercrime  129 Hits
Content indexing = done only once!
Searching = start again if the results are not
relevant to the question.
Search Results
Relevance = relationship between a
document and a request.
– The document is relevant to the topic
– It answers the user’s request
Pertinence = relationship between a document and
an information need.
• Relevant and useful for a user
• Relevant but the user doesn’t find it useful
(language, level of comprehensibility, type)
Irrelevant results = NOISE
Non-retrieved results = SILENCE
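Noise and silence can be made concrete with set arithmetic. The sketch below is an illustration only; the function name `evaluate_search` and the document ids are invented, and precision and recall are the standard information-retrieval measures, added here as an assumption because the slide itself names only noise and silence.

```python
def evaluate_search(retrieved, relevant):
    """Compare what a search returned with what it should have returned."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    return {
        # Noise: retrieved but not relevant.
        "noise": retrieved - relevant,
        # Silence: relevant but not retrieved.
        "silence": relevant - retrieved,
        # Share of the retrieved documents that are relevant.
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
        # Share of the relevant documents that were retrieved.
        "recall": len(hits) / len(relevant) if relevant else 0.0,
    }


result = evaluate_search(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
print("noise:", sorted(result["noise"]))
print("silence:", sorted(result["silence"]))
print("precision:", result["precision"])
print("recall:", round(result["recall"], 2))
```

More noise lowers precision; more silence lowers recall.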
Causes of search failures
 Two words don’t mean exactly the same thing
 Enormous range of choices of words and expressions
 No true synonyms, although words are often close in
meaning
 Words are not clearly understood
 Inconsistent use of words
 Users are unlikely to choose all the relevant terms
 The user might choose the terms used by the indexer
with a different understanding of meaning.
Need for a controlled vocabulary
 A controlled vocabulary = A consistent set of
words/expressions, along with rules of usage, to
be followed when indexing / searching
 Nature of indexing language
A list of terms acceptable to users
Mechanisms for structuring and using those terms
Minimize the ambiguity of isolated vocabulary that
may be out of context
Out of context information
 What does SENSITIVE AREA mean?
 urban
 military
 environmental
 sensitive epidermis …
 A sensitive area protected by special measures to
preserve a highly vulnerable habitat (Eurovoc
thesaurus)
Types of Vocabulary – Authority List
 Simple list or index enumerating the terms
available for indexing a collection of documents
Author names, organization names, countries, etc.
E.g.
• Library of Congress Authorities
• ISO Country Codes
Vocabulary control – Classification Scheme
Heading / Caption
Notation
Upper level
Class
Sub-classes
Lower level
EUR-Lex directory codes
Vocabulary control – Classification Scheme
 Systematic arrangement of entities/concepts into
classes (groups or categories)
group of concepts whose members share a common
feature
vertical arrangement – level of specificity
Words may appear in several classes
Vocabulary control – Classification Scheme
 Classes are identified by
a heading/caption
a notation (alphabetical and/or numerical code)
• Key for arranging items in physical libraries
Expressiveness (reflects the structure of the
scheme)
11.60.30.20 External relations / Commercial policy /
Trade arrangements / Common import arrangements
EUR-Lex Directory Codes
Numerical classification of the “Directory of
Community legislation in force”, used to
index legislation and preparatory acts.
http://eur-lex.europa.eu/RECH_repertoire.do
20 principal chapters, each covering a specific area
of European Union activity.
Each descriptor is composed of eight digits
• (principal chapter heading and up to three
subsequent subdivisions, each represented by two
digits)
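The eight-digit structure can be illustrated with a small parser. This is a sketch under stated assumptions: the function name `parse_directory_code` is hypothetical, and treating a trailing "00" as an unused subdivision is an assumption inferred from codes such as 03.60.55.00, not a documented rule.

```python
import re

# Four two-digit groups separated by dots, e.g. "11.60.30.20".
CODE_PATTERN = re.compile(r"\d{2}(\.\d{2}){3}")


def parse_directory_code(code):
    """Split a directory code into its chapter and its subdivisions."""
    if not CODE_PATTERN.fullmatch(code):
        raise ValueError(f"not an eight-digit directory code: {code!r}")
    parts = code.split(".")
    chapter, subdivisions = parts[0], parts[1:]
    # Assumption: trailing "00" groups mean "no further subdivision".
    while subdivisions and subdivisions[-1] == "00":
        subdivisions.pop()
    return {
        "chapter": chapter,
        "subdivisions": subdivisions,
        "depth": 1 + len(subdivisions),
    }


print(parse_directory_code("11.60.30.20"))
print(parse_directory_code("03.60.55.00"))
```

Because the notation is expressive, the position of a class in the scheme can be read from the code alone: the first two digits give the chapter and each following pair goes one level deeper.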
EUR-Lex – Subject Headings
From one to a maximum of five descriptors based on the
subject-matter list of terms
The alphabetically structured list of over 200
keywords is based on the subdivisions of the
treaties and the areas of activity of the institutions.
The descriptors are less specific than those of the
Directory code but provide a general overview of
the content of the document.
Thesaurus - Definition
ISO 2788 (1984)
A structured list of expressions intended to
represent, in an unambiguous way, the conceptual
content of the documents in a documentary system
and of the queries addressed to the system.
= NOUN, NOUN PHRASE
= INDEXING PROCESS
= ONE SINGLE INTERPRETATION
Thesaurus - Definition
 BSI 8723 (2006)
= MUTUALLY EXCLUSIVE RELATIONSHIPS
A controlled vocabulary in which concepts are
represented by descriptors, formally organized so
that paradigmatic relationships between the
concepts are made explicit,
and the descriptors are accompanied by lead-in
entries for synonyms and quasi-synonyms.
= EQUIVALENCE
The purpose of a thesaurus is
• to guide both the indexer and the searcher to select
the same descriptor or combination of descriptors to
represent a given subject.
= INDEXING PROCESS
Eurovoc - Scope
 Eurovoc
A multilingual thesaurus (hierarchical list of terms)
Multidisciplinary vocabulary
• Community and national point of view
• Parliamentary activities
 Definition of concepts
Samples from Eurovoc
Eurovoc - Coverage
21 FIELDS = HEADINGS
04 POLITICS
08 INTERNATIONAL RELATIONS
10 EUROPEAN COMMUNITIES
12 LAW
16 ECONOMICS
20 TRADE
24 FINANCE
28 SOCIAL QUESTIONS
32 EDUCATION AND COMMUNICATIONS
36 SCIENCE
40 BUSINESS AND COMPETITION
44 EMPLOYMENT AND WORKING CONDITIONS
48 TRANSPORT
52 ENVIRONMENT
56 AGRICULTURE, FORESTRY AND FISHERIES
60 AGRI-FOODSTUFFS
64 PRODUCTION, TECHNOLOGY AND RESEARCH
66 ENERGY
68 INDUSTRY
72 GEOGRAPHY
76 INTERNATIONAL ORGANISATIONS
127 MICROTHESAURI = CLASSES
 0806 international affairs
 0811 cooperation policy
 0816 international balance
 0821 defence
Eurovoc - Equivalence
NON-DESCRIPTOR
USE DESCRIPTOR
Eurovoc – Contextual information
DESCRIPTOR
MT - MICROTHESAURUS (MAIN CLASS)
UF (USED FOR) - NON-DESCRIPTOR
This descriptor is USED FOR a non-descriptor
BT - BROADER TERM / GENERIC TERM
NT - NARROWER TERM / SPECIFIC TERM
RT – RELATED TERM
Eurovoc – Relationships
TOP TERM = highest in the hierarchy
Equivalence relationship
(USE, UF)
SCOPE NOTE (SN) =
Usage or definition note
NT1
NT3
Hierarchical relationship
(MT, BT, NT)
Associative relationship
(RT)
Vocabulary Control – Thesaurus
The scope of a descriptor is limited to a single
meaning (unambiguous)
• Nouns or Noun phrases
• Pre-coordination of concepts
The context is provided by :
• The hierarchical relationships (MT, BT, NT)
• The scope note (SN)
– (state the chosen meaning or indicate other meanings
excluded for indexing purposes)
A concept is represented by two or more synonyms
• One term selected as a descriptor (indexing term)
• Equivalents = non-descriptors
– (lead-in entries or references to the descriptor – USE, UF)
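The equivalence relationship (USE / UF) amounts to a lookup table from non-descriptors to the single preferred descriptor. The sketch below is illustrative: "computer crime" is the Eurovoc descriptor quoted earlier, but the list of non-descriptors attached to it here is an assumption made for the example, and the function names are hypothetical.

```python
# Preferred terms (descriptors).
DESCRIPTORS = {"computer crime"}

# Lead-in entries: non-descriptor -> USE descriptor (illustrative assumption).
USE = {
    "cybercrime": "computer crime",
    "cyber crime": "computer crime",
    "computer related crime": "computer crime",
}


def to_descriptor(term):
    """Map any entry term to its descriptor, or None if it is unknown."""
    key = " ".join(term.lower().split())  # normalise case and spacing
    if key in DESCRIPTORS:
        return key
    return USE.get(key)


def used_for(descriptor):
    """UF: list the non-descriptors that lead to this descriptor."""
    return sorted(nd for nd, d in USE.items() if d == descriptor)


print(to_descriptor("Cybercrime"))
print(to_descriptor("computer crime"))
print(used_for("computer crime"))
```

Whatever word the indexer or the searcher starts from, both end up on the same descriptor, which is the purpose of the lead-in entries.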
Vocabulary control - Targets
 Represents the general conceptual structure of a subject
area and presents a guide to the user of an index
 Reflects closely the literature vocabulary and the user’s
own technical usage
 Employs pre-coordinated phrases to reduce false drops
to a minimum
• Venetian Blind
 Controls synonyms and near-synonyms in order to
increase consistency
 Only one term from a list of similar terms will be used in
indexing
 Horizontal and vertical relationships among terms
(cross-references)
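The "Venetian blind" case shows why a pre-coordinated phrase reduces false drops. The sketch below is only an illustration: both documents and the descriptors assigned to them are invented, and the function names are hypothetical.

```python
# Invented documents: the second one is about a person, not a window blind.
DOCUMENTS = {
    1: "Fitting a venetian blind to a sash window",
    2: "A blind Venetian merchant in the sixteenth century",
}

# Invented indexing: descriptors assigned to each document by an indexer.
INDEX = {
    1: {"venetian blind"},
    2: {"blindness", "venice"},
}


def search_single_words(words):
    """AND search on isolated words, ignoring how they combine."""
    wanted = {w.lower() for w in words}
    return {i for i, text in DOCUMENTS.items() if wanted <= set(text.lower().split())}


def search_descriptor(descriptor):
    """Search on a pre-coordinated descriptor assigned at indexing time."""
    return {i for i, terms in INDEX.items() if descriptor in terms}


print(sorted(search_single_words(["venetian", "blind"])))
print(sorted(search_descriptor("venetian blind")))
```

The word-by-word search also returns document 2, a false drop; the search on the phrase descriptor returns only document 1.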
Classification & Thesaurus - Difference
 Classification
Single preferred location (physical libraries)
• Directory code:
03.60.55.00 Agriculture / Products subject to market
organisation / Wine
• Post-coordination of concepts
 Eurovoc
Allows relationships between terms, such as hierarchical ones
wine
MT 6021 beverages and sugar
BT1 alcoholic beverage
BT2 beverage
NT1 bottled wine
NT1 champagne
NT1 flavoured wine
NT1 fortified wine
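The hierarchical relationships of the wine entry can be stored as a simple "broader term" table and walked in both directions. This is a minimal sketch: it contains only the terms shown on this slide (the real thesaurus has more), and the function names are hypothetical.

```python
# term -> its broader term (BT), taken from the wine entry on this slide.
BROADER = {
    "wine": "alcoholic beverage",
    "alcoholic beverage": "beverage",
    "bottled wine": "wine",
    "champagne": "wine",
    "flavoured wine": "wine",
    "fortified wine": "wine",
}


def broader_chain(term):
    """Return BT1, BT2, ... up to the top term."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def narrower_terms(term):
    """Return the direct narrower terms (NT1) of a term."""
    return sorted(t for t, bt in BROADER.items() if bt == term)


def expand(term):
    """Return the term followed by all of its descendants."""
    result = [term]
    for nt in narrower_terms(term):
        result.extend(expand(nt))
    return result


print(broader_chain("wine"))
print(narrower_terms("wine"))
print(expand("alcoholic beverage"))
```

A search system can use `expand` to widen a query on a generic term to all of its specific terms.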
Indexing systems - Types
Derived-term system
 All descriptors are taken from the text itself
 Natural language or free-text indexing
 Descriptors identify the concepts expressed by the documents
 Automatic indexing
Assigned-term system
 Subject heading list, thesaurus, classification, taxonomy
 Intellectual effort
 The indexer determines the scope of the document and assigns descriptors from a controlled vocabulary
 Greater time and effort
 Cost is important
PART 2 : EUROVOC THESAURUS
Christine Laaboudi-Spoiden
Publications Office of the European Communities
EUR-LEX Unit – Documentary section
Eurovoc 4.2 - Languages
http://europa.eu/eurovoc/
 Official EU languages
ES, CS, DA, DE, EL, ET(*), EN, FR, IT, LV, LT, HU, NL, PL, PT, SI, SK, FI, SV
 Acceding countries
 BG – Bulgarian, RO – Romanian
 Candidate country
 HR – Croatian
Local sites
 Other languages
 Albanian, Ukrainian, Russian, Georgian, Serbian
 Regional languages: Basque, Catalan
Eurovoc 4.2 in figures
                            Eurovoc 4.1   Eurovoc 4.2
DOMAINS                          21            21
MICROTHESAURI                   127           127
DESCRIPTORS                    6501          6645
GENERIC RELATIONSHIPS          6510          6669
ASSOCIATIVE RELATIONSHIPS      3542          3636
Eurovoc – fields most frequently used
[Bar chart: number of users per Eurovoc field; horizontal axis "Number of users", 0 to 20.
Fields in chart order: 76 International organisations; 68 Industry; 40 Business and competition; 36 Science; 66 Energy; 56 Agriculture, forestry and fisheries; 48 Transport; 72 Geography; 44 Employment and working conditions; 32 Education and communications; 24 Finance; 20 Trade; 52 Environment; 16 Economics; 08 International relations; 28 Social questions; 10 European Communities; 12 Law; 04 Politics.
Value labels as extracted: 1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 5, 7, 9, 10, 11, 17, 18.]
Eurovoc – Polyhierarchical relationship
 Main rule : Descriptors belong to one category (1
BT, 1 MT)
 Exception : Descriptors from Domains 72 & 76
Field 72 : Geography
Field 76 : International Organizations
Eurovoc - Advantages
 Multilingualism
 Indexing in the documentalist’s language
 Search in the user’s language
 Update
Every 18 months
 Cooperation
 National parliaments
 Candidate descriptors
 Standardisation
 ISO 2788 & 5964
Eurovoc - Limits
Generic vocabulary, not specific
Does not cover national specificities
Eurovoc - Display
 Formats
Printed – paper version
Web site http://europa.eu/eurovoc/
XML Files (provided to licensees)
PDF Files to download
 Types of display
Alphabetical
Thematic
• Alphabetical listing by field/domain
Eurovoc – Thematic display
Languages
Field/Domain
Microthesauri
NAVIGATING
Eurovoc – Thematic display
Microthesauri
Top Term / Broader Term
Related Terms
Alphabetical index of
descriptors/non-descriptors
of the current field
Specific Terms
NT1 – NT2
Eurovoc – Terminology of the field
Alphabetical index of
descriptors and non-descriptors
Eurovoc – Searching for concept
Eurovoc – Alphabetical display
Eurovoc – Alphabetical display
PT
FR
Eurovoc – Translations
A descriptor =
an equivalent concept in every language
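One way to picture "an equivalent concept in every language" is a table keyed by a language-independent concept identifier. The sketch below is an assumption-laden illustration: the identifiers C1 and C2, the document name and the function names are invented, and the labels are ordinary dictionary translations rather than entries checked against the thesaurus.

```python
# Hypothetical concept ids mapped to one label per language.
LABELS = {
    "C1": {"en": "wine", "fr": "vin", "pt": "vinho"},
    "C2": {"en": "beverage", "fr": "boisson", "pt": "bebida"},
}


def concept_for(label, lang):
    """Find the concept id behind a label in a given language."""
    for concept_id, labels in LABELS.items():
        if labels.get(lang) == label.lower():
            return concept_id
    return None


def translate(label, source, target):
    """Translate a descriptor by going through its concept id."""
    concept_id = concept_for(label, source)
    return LABELS[concept_id][target] if concept_id else None


# A documentalist indexes a document in French ...
INDEX = {"doc-A": {concept_for("vin", "fr")}}


def search(label, lang):
    """... and a user searches in any other language."""
    concept_id = concept_for(label, lang)
    return sorted(doc for doc, concepts in INDEX.items()
                  if concept_id is not None and concept_id in concepts)


print(translate("wine", "en", "fr"))
print(search("vinho", "pt"))
```

Because documents are attached to the concept rather than to a word, indexing in one language and searching in another meet on the same identifier.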
Eurovoc - History
 1982 :
• comparative study of the existing documentary languages
at the European Commission and the European Parliament
 1984 : first edition
• seven languages (DA, DE, EN, FR, EL, IT, NL)
 1987 : 2nd edition
• + ES, PT
 1995 : 3rd edition - 1999 : 3.1 edition
• + SV, FI
 2002 : 4.0 edition - 2004 : 4.1 edition
 2005 : 4.2 edition
• 17 languages
 2006 : 4.3 edition
• 21 languages
Eurovoc - Users
National parliaments
European institutions (European Parliament,
Publications Office, Court of Justice)
Private users = Eurovoc licence holders
Eurovoc – Users
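[Bar chart: number of Eurovoc users by type of organisation: national parliaments, national administrations, EU institutions, consultants, universities, private users, research institutes. Vertical axis from 0 to 16. Value labels on the bars: 16, 6, 5, 4, 3, 2, 2.]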
Eurovoc – Users
[Pie chart: Eurovoc users by profession. Categories: translators, informatics, terminologists, linguists, librarians, documentalists, researchers, other. Shares as extracted: 20%, 1%, 6%, 3%, 14%, 56%.]
Eurovoc - Licenses (1)
[Bar chart: number of Eurovoc licences per year, 2003 to 2005. Vertical axis from 0 to 50. Value labels as extracted: 44, 25, 15.]
Eurovoc – Licenses (2)
[Bar chart: number of Eurovoc licences by type of use (academic, commercial, translation, indexing) for 2004, 2005 and 2006. Vertical axis from 0 to 35.]
PART 3 : EUROVOC MAINTENANCE
Christine Laaboudi-Spoiden
Publications Office of the European Communities
EUR-LEX Unit – Documentary section
Eurovoc - Maintenance
 2 interinstitutional committees
Maintenance committee
• Commission, Council, Parliament, Court of Justice,
Court of Auditors
Steering committee
• Commission, Council, Parliament, Court of Justice,
Court of Auditors
 Eurovoc Maintenance Team
Publications Office
Eurovoc - Steering committee
Supervises the Eurovoc project
• Objectives, priorities, overall timetable
• Resources and budget
Officially adopts each new version
Chaired by a representative of the European
Parliament
Eurovoc – The maintenance committee
Examines and votes on the proposals for updating
the thesaurus
Decides on the amendments to be made
Chaired by the Publications Office
Meets twice a year
Eurovoc – The maintenance team
Location: Publications Office
Collects and examines the proposals made by all
users
Coordinates the work of the Maintenance Committee
Responsible for IT developments, translation
monitoring, web site
Works through a maintenance interface
Eurovoc – Maintenance process
 The European Parliament
– Collects, examines and filters the proposals from the national
parliaments
 The Maintenance Team
– Collects the proposals made by all users (EP, licensees,
OPOCE)
– Manages the proposals through the maintenance system
 The Maintenance Committee
– Votes on the various proposals
– Decides on the final amendments
 The Maintenance Team
– New descriptors and amendments are sent to the EC
for translation
 The Maintenance Committee
– Reviews the multilingual draft version
 The Steering Committee
– Officially adopts the new version
EUROVOC – The maintenance interface
 https://webgate.cec.eu.int/eurovoc/maint
 Users
EU Institutions : Members of the maintenance
committee, Translators
National parliaments
 Features
Propose Candidate descriptors, amendments
Translation module
A dedicated layer for each user
EUROVOC – Maintenance
CANDIDATE DESCRIPTOR
 How to propose new concepts / amendments
Eurovoc maintenance form (web site)
Email to [email protected]
EUROVOC – Maintenance
 Criteria for acceptance / non-acceptance of
candidate descriptors
 Acceptance :
 Creation necessary :
• European Food Safety Authority (new European
body)
• Greater Poland province in Regions of Poland in
MT7211 (new regions to incorporate)
 New concept interesting and useful
• Access to healthcare
• Self-regulation
EUROVOC – Maintenance
 Criteria for acceptance / non-acceptance of
candidate descriptors
 Non-acceptance:
 Descriptor already exists in another form
• Second home  secondary home
• Community Customs Code exists as a non-descriptor of "Customs regulations"
 Concept that can be obtained by combining two
or three existing descriptors
• European Refugee Fund  EC fund + aid to
refugees
EUROVOC – Maintenance
 Criteria for acceptance / non-acceptance of
candidate descriptors
 Non-acceptance:
 Term too specific (not used enough)
• Arctic agriculture
 Term too national (not useful for the other users)
• Popular school (in SV)
 Term too vague
• Right to peace
• Small states
PART 4 : INDEXING AND SEARCHING
WITH EUROVOC & the EP Library
Christine Laaboudi-Spoiden
Publications Office of the European Communities
EUR-LEX Unit – Documentary section
Isabelle Gautier – European Parliament - Library
INDEXING AND SEARCHING
WITH EUROVOC
1. Content analysis and subject determination :
 Example from Eur-Lex database (Directive 50/2006)
 Example from Eur-Lex database (Regulation 802/2006)
INDEXING AND SEARCHING
WITH EUROVOC
2. Term selection in Eurovoc
• Check the relationships (hierarchy and semantic
environment of a descriptor)
• Definition of horizontal or vertical specificity
• Translation of concepts into indexing terms: cases of
generic terms, compound terms, lack of precision, proper
names.
3. Depth of indexing:
• Exhaustivity and selectivity
4. Making choices: indexing policy
EUROVOC at EP LIBRARY
 1999: the library's catalogue moved to a new data processing
system, which required a new indexing policy.
 New catalogue => need to establish consistent indexing;
 to achieve this consistency, training was organised for all
indexers;
 a Working Group was created to coordinate indexing
within the library.
EUROVOC at EP LIBRARY
The Indexing Coordination Group
Working Group made up of indexers and information
specialists (of different nationalities and languages),
in charge of:
• writing an internal guide for the department on the
practical rules for indexing;
• creating up-to-date lists (descriptors studied
and descriptors created for the Library) and
templates (for proposing a new descriptor or a
modification) for colleagues;
• holding regular meetings on the
indexing policy and its implementation;
• training new colleagues.
EUROVOC at EP LIBRARY
The Indexing Guide
Aim: to improve the consistency of indexing in the
catalogue and to ensure good knowledge of the new
data processing system.
Contents: three parts
 definition and basic rules for indexing;
 the indexing policy in the library;
 practical application in our catalogue.
Supplemented by advice sheets on indexing where
necessary.
EUROVOC at EP LIBRARY
Indexing Meetings
 Aim: the group studies the proposals for new
descriptors or modifications sent by
colleagues;
 it answers specific questions asked by
colleagues and, if necessary, writes advice sheets;
 questions are analysed by the group in its
meetings and presented in meetings at
department level;
 advisory and support role.
EUROVOC at EP LIBRARY
Examples of proposals received by the Group
Candidate-descriptor created (library level) :
 Community law-international law
MT 1231 international law - BT international law
SN influence of Community law on international
law and vice versa
Candidate-descriptor rejected :
 environmental damage principle
Advice: index with environmental impact + risk
prevention
Modification of a descriptor :
 polluter pays principle
Proposal to change the English term (replacing "polluter
pays policy").
EUROVOC at THE EP LIBRARY
Training
 Training organised for new colleagues:
 internal, with a presentation of the thesaurus,
the indexing guide, the indexing policy of the
department and indexing in our catalogue, plus short
practical exercises;
 internal, but with an external trainer, to refresh or,
if necessary, train a group of people in indexing;
 external: according to the needs expressed by indexers and
the training available in the different countries.
EUROVOC at EP LIBRARY
The European Parliament’s role as a member of the Maintenance
Committee:
 Represents both the EP and the national parliaments on the
Maintenance Committee;
 Receives, as their representative, the proposals of the national
parliaments that use the thesaurus;
 Filters the proposals (rejection criteria: concept too
national, too specific or too vague);
 Forwards the proposals of the department and of the
national parliaments to the Committee;
 Regularly organises seminars with national parliaments.
IN CONCLUSION : USEFUL LINKS
 EUROVOC : http://eurovoc.europa.eu
 Eur-Lex : http://eur-lex.europa.eu
 European Parliament : http://www.europarl.europa.eu