6.2.2. Controlled vocabulary

Overview
Anyone who has struggled to find the exact search term to retrieve information about a certain subject
can benefit from controlled vocabulary. Controlled vocabulary allows just one term, spelled one way, to
represent a given concept. It is an indexing solution to problems stemming from the ambiguity of natural
language that tend to result in imprecise and incomplete retrieval.
A controlled vocabulary is a set of authorized (standardized) terms. Most controlled vocabularies
represent subjects and are listed in subject authority files called thesauri or subject headings lists.
An indexer or cataloger chooses controlled-vocabulary terms from a particular authority file and assigns
them to a controlled-vocabulary field in the metadata record. A searcher should also consult the authority
file to find terms for searching the controlled-vocabulary field.
This module explains how a controlled vocabulary works. It describes three kinds of indexing problems,
then shows how controlled vocabulary provides solutions. The problems are:
1. Naming single concepts: What is the best term for a given concept?
2. Showing relationships among single concepts: What concept is related to a given concept?
3. Showing relationships among multiple concepts: What if the subject of a document contains
two concepts?
1. Naming single concepts
Problems
What is the best term for a given concept? How does one choose among variant word forms for a
concept?
INFO 5200 / Controlled vocabulary / p. 2
Solutions
The creator of the controlled vocabulary addresses these problems based on understandings of the users
and the collection. The "best" term for a concept is the most accurate, common and current word at the
time the controlled vocabulary is created. Typical approaches are to:
    Focus on concrete nouns     fish, fishing
    Include multiword terms     aircraft carrier
    Include proper nouns        American
    Exclude commercial names    IBM
Preferred word forms are shown by example:
    Spelling     theater [not theatre]
    Singular     theater [general; the profession]
    Plural       theaters [specific; buildings]
    Multiword    performing arts [one term]
Some terms are more ambiguous than others and need further clarification. Most of these are
homographs: terms that have the same spellings but different meanings. In a controlled vocabulary,
these are often distinguished by parenthetical qualifiers:
    letter (correspondence)    vs.    letter (alphabet)
    port (opening)             vs.    port (wine)
Multiword terms, with more than one word representing a single concept, are also called compound
terms. In some controlled vocabularies, terms and their parenthetical qualifiers are treated as compound
terms: all the words must be kept together in indexing and searching.
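As a minimal sketch of how parenthetical qualifiers keep homographs apart, the Python below treats each qualified term as one compound string. The function name candidates and the tiny term list are illustrative assumptions, not taken from any real authority file.

```python
# Hypothetical mini authority file: each qualified term is one compound term.
AUTHORIZED_TERMS = [
    "letter (alphabet)",
    "letter (correspondence)",
    "port (opening)",
    "port (wine)",
    "theater",
]


def candidates(word):
    """Return every authorized term whose base word matches `word`."""
    word = word.strip().lower()
    matches = []
    for term in AUTHORIZED_TERMS:
        base = term.split(" (")[0]  # text before any parenthetical qualifier
        if base == word:
            matches.append(term)
    return matches


print(candidates("port"))     # two qualified forms: the searcher must pick one
print(candidates("theater"))  # a single authorized term, no qualifier needed
```

Because the qualifier is part of the term, a search on the bare word port is not itself a valid controlled-vocabulary search in this sketch; it only points to the qualified forms.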
2. Showing relationships among single concepts
Problems
What concept is related to a given concept? How is it related?
Suppose you have these terms:
motor vehicles, automobiles, cars, sports cars, trucks
Clearly some concepts are broader than (encompass) others and some terms are actually synonyms.
Solutions
Again, the creator of the controlled vocabulary addresses these problems, based on understandings of
the users and the collection.
Relationships based on word meanings are called semantic relationships. Three kinds of semantic
relationships are equivalent, hierarchical, and associative. Each raises its own questions:
    Equivalent (synonymous or nearly synonymous): How to show preferred terms?
    Hierarchical (genus-species or broad-narrow): How to show levels of meaning?
    Associative (related but not synonymous or hierarchical): How to link related terms?
The solutions are cross references that show the relationships. For example, in an authority file on
transportation, all three of these relationships pertain to the term automobiles:
    Equivalent      USE FOR cars
    Hierarchical    BROADER TERM motor vehicles
                    NARROWER TERM sports cars
    Associative     RELATED TERM trucks
Each term in the authority file is listed separately. For each relationship, there must be a pair of cross
references, called mandatory reciprocals:
    USE FOR         and    USE
    BROADER TERM    and    NARROWER TERM
    RELATED TERM    and    RELATED TERM
Cross references are commonly abbreviated UF, USE, BT, NT, and RT. All terms in the authority file are
listed alphabetically. Here is the display for automobiles:
    automobiles
        UF cars
        BT motor vehicles
        NT sports cars
        RT trucks
    cars
        USE automobiles
    motor vehicles
        NT automobiles
    sports cars
        BT automobiles
    trucks
        RT automobiles
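The rule of mandatory reciprocals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names RECIPROCAL, build_thesaurus, and display are hypothetical, and real thesaurus software may store relationships quite differently. Only the four cross references for automobiles are entered by hand; the sketch generates the other half of each pair.

```python
from collections import defaultdict

# Each cross reference paired with its mandatory reciprocal.
RECIPROCAL = {"UF": "USE", "USE": "UF", "BT": "NT", "NT": "BT", "RT": "RT"}
DISPLAY_ORDER = ["UF", "USE", "BT", "NT", "RT"]


def build_thesaurus(statements):
    """Store each (term, relation, target) statement plus its reciprocal."""
    entries = defaultdict(lambda: defaultdict(set))
    for term, relation, target in statements:
        entries[term][relation].add(target)
        entries[target][RECIPROCAL[relation]].add(term)
    return entries


def display(entries):
    """Return the alphabetical display as a list of text lines."""
    lines = []
    for term in sorted(entries):
        lines.append(term)
        for relation in DISPLAY_ORDER:
            for target in sorted(entries[term][relation]):
                lines.append(f"  {relation} {target}")
    return lines


# Only the cross references listed under automobiles are entered by hand.
statements = [
    ("automobiles", "UF", "cars"),
    ("automobiles", "BT", "motor vehicles"),
    ("automobiles", "NT", "sports cars"),
    ("automobiles", "RT", "trucks"),
]
thesaurus = build_thesaurus(statements)
print("\n".join(display(thesaurus)))
```

Printing the result reproduces the alphabetical display above, including the entries for cars, motor vehicles, sports cars, and trucks that were never typed in directly.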
This example shows all reciprocals for automobiles. In the equivalent relationship, automobiles is the
preferred term (or authorized term, or descriptor) and cars is the lead-in term (or nonpreferred term).
The lead-in term is not used to represent or search for a subject: it is the term that people may look for
first in the authority file and is included to lead them to the preferred term.
This is how you read a thesaurus entry:
Given:
    automobiles
        UF cars
        BT motor vehicles
        NT sports cars
        RT trucks
“You can search using the term automobiles and find something. Search using automobiles instead of
searching using cars. Also, you can search and find something using the broader term motor vehicles, or
by the narrower term sports cars, or by the related term trucks.”
We know we will find something using automobiles because it is bolded (bolding means that records are
guaranteed to be found with this term).
We also know we will find something using motor vehicles, or sports cars, or trucks because all three of
those terms are bound into either a hierarchical relationship or an associative relationship, and only
authorized terms can be so bound.
Project Alert! You must show at least one example of each kind of semantic relationship in your sample
thesaurus. Do not force a relationship on every term. You must have at least 15 authorized terms in the
thesaurus. Note: for the field on which you executed your thesaurus:
• All authorized terms in the thesaurus must be found in at least one of your InMagic records.
• No unauthorized terms should be found in any InMagic records.
• All terms in the records must be in the thesaurus as authorized terms.
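One way to check these rules before submitting is to compare the two lists by hand or with a short script. The sketch below is an assumption-laden illustration in plain Python: it does not read InMagic files, the function name check_consistency is hypothetical, and the sample terms are invented. Each record is represented simply as the list of terms in its controlled-vocabulary field.

```python
def check_consistency(authorized_terms, records):
    """Compare authorized terms against the terms assigned in records.

    Returns (unused, unauthorized): authorized terms that appear in no
    record, and record terms that are not authorized.
    """
    authorized = set(authorized_terms)
    used = set()
    for record_terms in records:
        used.update(record_terms)
    unused = sorted(authorized - used)
    unauthorized = sorted(used - authorized)
    return unused, unauthorized


# Hypothetical sample data, typed in by hand for this sketch.
authorized_terms = ["automobiles", "motor vehicles", "sports cars", "trucks"]
records = [
    ["automobiles", "sports cars"],
    ["motor vehicles", "cars"],  # "cars" is a lead-in term, not authorized
]

unused, unauthorized = check_consistency(authorized_terms, records)
print("Authorized but never used:", unused)
print("Used but not authorized:", unauthorized)
```

Both lists should come back empty for a thesaurus and record set that satisfy the rules above.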
The arrangement of a controlled vocabulary using cross references to show relationships is known as its
syndetic structure. See the assigned reading "Thesaurus construction and format" (2001) and the
thesaurus tutorial module.
3. Showing relationships among multiple concepts
Problems
What if the subject of a document contains two concepts? What if it contains more than two concepts?
This problem is even more complicated when there are not only multiple concepts in one document . . .
Drama in the lives of teachers
. . . but also multiple documents with similar multiple concepts!
Methods for teaching drama
Drama as a teaching method
A subject that includes more than one concept is known as a composite subject; it may also be called a
complex or compound subject.
Solutions
Use precoordinate or postcoordinate indexing to link the concepts. These are rather mysterious terms for
what are really simple concepts.
Precoordinate indexing is combining several terms in some logical order, as in library catalog subject
headings. "Pre" means the terms are combined prior to searching, at the time of indexing.
• Precoordination is the combination of indexing terms at the time of indexing.
• Combined terms represent composite or complex subjects.
• Typical combinations are controlled-vocabulary subject headings used in subject cataloging.
• Searching usually does not require the entry of all terms in the subject heading.
Some examples, with alternatives:
    Drama in the lives of teachers
        Education--Teachers
        Education--Teaching--Psychological aspects
    Methods for teaching drama
        Education--Drama--Teaching methods
        Drama--Teaching methods
    Drama as a teaching method
        Education--Teaching methods--Drama
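The precoordinate headings above can be modeled as strings assembled at indexing time. The Python below is a minimal sketch, assuming one heading per title; the catalog contents and the function names find_by_element and find_by_lead are hypothetical. It illustrates two points from the list above: a search need not enter every term in the heading, and the order of terms within a heading is meaningful.

```python
# Hypothetical catalog: document title -> one precoordinated subject heading.
CATALOG = {
    "Drama in the lives of teachers": "Education--Teachers",
    "Methods for teaching drama": "Drama--Teaching methods",
    "Drama as a teaching method": "Education--Teaching methods--Drama",
}


def find_by_element(element):
    """Return titles whose heading contains `element` as one subdivision."""
    wanted = element.lower()
    hits = []
    for title, heading in CATALOG.items():
        parts = [part.lower() for part in heading.split("--")]
        if wanted in parts:
            hits.append(title)
    return hits


def find_by_lead(lead):
    """Return titles whose heading begins with `lead` (order matters)."""
    prefix = lead.lower()
    return [t for t, h in CATALOG.items() if h.lower().startswith(prefix)]


print(find_by_element("Drama"))  # any position in the heading
print(find_by_lead("Drama"))     # leading position only
```

Searching on Drama as an element finds both drama titles, while searching on Drama as the lead term finds only the heading in which the indexer placed it first.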
Postcoordinate indexing is combining single terms using boolean operators (AND, OR, NOT).
"Post" means the terms are combined after indexing, at the time of searching.
• Postcoordination is the combination of indexing terms at the time of searching.
• Terms represent single, simple concepts.
• Typical combinations are controlled-vocabulary descriptors used in indexing.
• Searcher uses boolean operators and other techniques to combine terms.
Some examples, with alternatives:
    Drama in the lives of teachers
        drama AND lives AND teachers
        (teachers OR teaching) AND psychology
        teachers AND psychology NOT methods
    Methods for teaching drama
        drama AND teaching AND methods
        drama AND (teaching OR methods)
        (drama AND education) AND methods
    Drama as a teaching method
        drama AND teaching AND methods
        drama AND (teaching OR methods)
As you study the examples above, you may wonder whether the order of the terms matters. In
precoordinate indexing like the subject headings shown, the order of terms, or syntax, does matter: this is
known as a syntactic relationship. In postcoordinate indexing, like the boolean combinations shown,
syntax may or may not matter, depending on the database. For more information, see the module on
indexing, searching, and retrieval.
In the examples above, you may also notice that none of the alternatives for either precoordinate or
postcoordinate indexing fully conveys the meanings of the titles. Unfortunately, some meaning is almost
always lost in a representation.
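Postcoordinate searching can be sketched with Python sets, where AND is intersection, OR is union, and NOT is difference. This is a simplified illustration, not how any particular database works: the index contents and the function name postings are assumptions. Record 1 stands in for "Drama in the lives of teachers", record 2 for "Methods for teaching drama", and record 3 for "Drama as a teaching method".

```python
# Hypothetical postcoordinate index: descriptor -> set of record numbers.
INDEX = {
    "drama": {1, 2, 3},
    "teachers": {1},
    "teaching": {2, 3},
    "methods": {2, 3},
    "psychology": {1},
}


def postings(term):
    """Records indexed with `term` (empty set if the term is unused)."""
    return INDEX.get(term, set())


# drama AND teaching AND methods
and_query = postings("drama") & postings("teaching") & postings("methods")

# (teachers OR teaching) AND psychology
or_query = (postings("teachers") | postings("teaching")) & postings("psychology")

# teachers AND psychology NOT methods
not_query = (postings("teachers") & postings("psychology")) - postings("methods")

print(sorted(and_query), sorted(or_query), sorted(not_query))
```

The first query retrieves records 2 and 3 together. With single descriptors alone, the sketch cannot tell a document about methods for teaching drama from one about drama as a teaching method, which is the loss of meaning noted above.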
Summary
Indexing problems stem from the ambiguity of natural language. In controlled vocabulary approaches,
most of the burden of solving these problems falls on the indexers who create and use subject authority
files. Searchers must also assume some of the burden, however, in knowing how and when to consult
subject authority files and how to use boolean operators to search multiple concepts.
This module contains many key concepts and terms. It is especially important to distinguish among
concepts in these sets of terms:
• semantic, syndetic, syntactic
• equivalent, hierarchical, associative
You may also want to compare the solutions in this module with those in the module on natural language.
Cites & sites
Thesaurus construction and format. (2001). In Thesaurus of ERIC Descriptors (14th ed.). Phoenix, AZ:
Oryx Press. [xxvii-xxxi]
All INFO 5200/4200 course materials are copyrighted and may not be copied, revised, or distributed in any form or venue, beyond their use
by students for purposes of fulfilling course requirements, without prior permission of the authors or the University of North Texas.