From thesaurus to ontology: the development
of the Kaunokki Finnish fiction thesaurus
Jarmo Saarti and Kaisa Hypén
Jarmo Saarti and Kaisa Hypén look briefly at pre-computer age fiction retrieval methods. They discuss the challenges works of fiction present in the information management and retrieval context and the need to develop fiction
indexing tools suitable for web-based services, and describe how the creation of Kaunokki – the Finnish fiction
thesaurus – in 1996, and its development into the Kirjasampo-SAHA web service for readers and librarians, are
helping to meet these challenges.
Background
Finding what you don’t know is there: free text searching, the
index and the social networker
Works of fiction have traditionally been ordered on a shelf-arrangement basis in such a way as to allow browsing by
potential readers as they walk between open stacks (see
Saarti, 1997). We all know how much we resent it when
books are out of sight in basement stores and have to be
called up by known title. These shelf arrangements were
based mainly on genres, and are known as genre classifications. The usual library tools were used for determining the
arrangements: classification, indexing and abstracts (see
Saarti, 1999b, 2002b). But the reality, shown up by developments in the information technology field, including in
particular digital distribution methods and retrospective
digitization of material, is that in the years of paper catalogues much material (above all fiction) went unindexed and
uncatalogued in any meaningful way. Unless steps are taken
to remedy the situation this could continue in the Internet
age since, as digitization demonstrates, texts and other types
of documents that have not been analysed and classified and/
or indexed in full text databases are simply ignored. If the
system cannot see them, it will not find them.
Not only did the advent of highly sophisticated computer
technology show up the gaps of the past, it also provided
the means to move towards filling them. For natural science
materials, the task is fairly easy. The fact that texts are usually
topic-based makes them relatively unambiguous and thus
ready for automatic text-search ‘indexing’. Fiction, being
concept-based, is another matter, the greatest challenge in
the representation of fictional content being that interpretation of the content is always subjective. Readers make their
own judgement. (See Bell, 1992; and for a discussion of the
problem in relation to the indexing of poetry, Johnstone,
2010.) The challenge for those seeking to develop a fiction
retrieval system is to identify which aspects of a fictional
work are the least subjective, and which are the aspects most
readers will be able to agree on (see Saarti, 2002a; and for
a discussion of the problem in relation to the indexing of
poetry, Johnstone, 2010).
In a sense this process began during the period of card
catalogues, with the first experiments in developing subject
headings for fiction and the creation of systematic thesauri.
The concept of faceted classification goes back to 1933,1
and there are now many lists describing the most objective aspects (or ‘facets’) in fiction content in the library
and information science context (see e.g. Beghtol, 1994),
the facets most often mentioned being genre, time and
space. Other ‘objective’ elements include the style of the
work, its language, and factual objects mentioned in the
text including, for example, characters in a novel. However,
‘objective’ though they may be, this does not mean that they
are necessarily easy to define objectively. It all comes back to
the subjectivity of fiction interpretation. We discuss the role
of facets and ways of dealing with subjectivity in the context
of the Kaunokki thesaurus below.
The most radical change introduced by computer technology into the world of information management and
retrieval has been the possibility of arranging masses of text
in searchable databases. The arrival of interactive user interfaces and social networking has taken things a step further,
revolutionizing the publication and dissemination of information and making it possible to incorporate user behaviour
into the content representation of fiction.
One exciting aspect of the analysis of fiction texts and the
fiction information dissemination process is the challenge
it offers to traditional models, rigorously testing and, it is
hoped, expanding the theoretical tools and concepts used in
our field of research. (See e.g. Beghtol, 1994, 1997; Green,
1997.) In this article we explore developments in the field
of fiction knowledge management, focusing especially on
the problems of fiction indexing and on the development of
Kaunokki, the Finnish fiction thesaurus, from printed
thesaurus to online ontology.
Relevance as interface between work and
user
In library and information science (LIS), the relationship
between a potential reader and a work is defined by the
concept of relevance. This is a somewhat fuzzy notion, but
roughly speaking, ‘relevance’ happens when the process
of information seeking and searching leads to a positive
encounter between the searcher and the work. In other
words, the searchers find something corresponding to what
The Indexer Vol. 28 No. 2 June 2010
Saarti and Hypén: From thesaurus to ontology
they are looking for. The fuzziness derives from the fact that
information needs of an individual and the information in the
documents under scrutiny are both infinitely variable, and
so, therefore, is the scope for a ‘positive encounter’. Positive
encounters are many-faceted and multi-dimensional. The
first document turned up as the result of an information-seeking and searching process may not meet the needs of
the seeker, but it can be used to redefine or refine the query,
with the prospect of getting nearer to the target.
Tefko Saracevic first addressed the concept of relevance
in 1975. Returning to the subject in 1996, he suggested that
‘relevance’ manifests itself in five ways: system or algorithmic
relevance (relationship between a query and information objects); subject relevance (relationship between the
subject or topic expressed in a query and the subject or topic covered by the texts retrieved); cognitive relevance
or pertinence (relationship between the state of knowledge
and cognitive information needs of a user, and the texts
retrieved); situational relevance or utility (relationship
between the situation, task, or problem identified by the
user, and the texts retrieved); and motivational or affective
relevance (relationship between the user’s goals and motivations, and the texts retrieved). These all pose particular
challenges for the knowledge management of works of
fiction, whether we are talking about traditional library tools
or modern IT methods.
Topicality in fiction: the problems of term
selection
Distinguishing fact and fiction
As already mentioned, works of fiction present particular
problems when it comes to ‘topicality’ – in other words,
term selection. For example, novels drawing on the life of a
real person or a historical novel representing events of the
past are based explicitly on reality, but the fact that a novel
based on a real person has been published as fiction implies
that the author has deliberately moved the real person into a
fictional world. So the rules are changed: the novel is a recreation of a life that did occur but which is now reinterpreted
and rewritten. This sets one minimum requirement for the
chosen system, particularly one that contains metadata
about both factual and fictional works: it must be able to
recognize the difference between them! And, as noted in
studies on fiction knowledge management, the systems must
be able to handle multi-faceted content descriptions and be
open to multi-faceted information searching.
User motivation
There are essentially two sorts of motivation in the fiction-searching context. Either the user wants to track down works
on a particular topic, or they are simply looking for enjoyment in the shape of a good novel, just as with a good movie
or good music.
Those looking for fiction of topical or subject relevance
may be in search of something very precise, or may be interested more broadly in ‘aboutness’. So if a fiction information
retrieval (IR) system is to be effective, it must be able to
respond to searches about how truthfully or realistically a
novel handles a historical fact, for example, or addresses
aspects of real life such as illnesses or building one’s identity.
People searching fiction by topic are often motivated by
factors other than just the pleasure of reading – for example,
fictional works may be used as material for historical or
sociological studies, travel guides, even as language learning
resources – but their search is essentially objective.
‘Good fiction’: a subjective concept
When it comes to searching for ‘good’ fiction we are in more
subjective territory, and it is very difficult to address the
requirement using the traditional classification and content
representation tools. Searchers are looking for something
new based on their previous experiences (‘I want to read
something like book X or similar to author A . . .’) but at
the same time they want to experiment (‘. . . but I want to
read something new’). It could indeed be that the greatest
pleasure is to stumble upon a totally new world of fiction,
with new types of material or authors or genres that do not
correspond to anything read previously.
In fact, the concept of ‘a good work’ in this context is
totally dependent on the individual’s own, ever-changing
point of view. A novel that someone considers ‘good’ today
might not resonate with them tomorrow, but could become
meaningful again at some time in the future. This is an
experience we have all had when we re-read something we
enjoyed when we were young and now find disappoints, or
vice versa.
For somebody looking for a ‘good’ work of fiction, ‘good’
only really has meaning by reference to the needs of that
particular person at that particular moment in time: and the
needs may not even be clear until the search is successful,
or until they have been redefined and refined to the point
of total clarity in response to unsuccessful searches. So
when searching fiction, the opportunity to browse is all-important.
The sociohistorical context
However, although only individuals can judge what is ‘good’
for them, fiction is always defined within a sociohistorical
context. The interpretation and reception of works of fiction
is invariably a social and historical construction, an aspect
reflected in the use of categorizations (for example, ‘the
most important movies in the world’, ‘Finnish national
writers’, ‘a hundred horror movies you have to see’, ‘Death
metal – the most important tracks of last year’). To be effective an information retrieval system must be able to build in
this sociohistorical context, and to make use of the language
and terminology of a given group for search and retrieval
purposes.
Actional relevance
One further ‘relevance’ category can be added to Saracevic’s
list especially in relation to fiction: actional or interpretative
relevance. Reception of fiction is based on interpretation,
and these days interpretation is actively disseminated in
Internet forums, in book club discussions, and in reviews.
Actional relevance also occurs when new works of fiction
are created, based on those that have been read, or perhaps
through role playing activity. Fan-fiction portals and social
networking are excellent examples of actional relevance.
Folksonomies
The reader as interpreter of fiction and creator of fictional
environments
Indexing carried out in libraries and similar institutions tends
to follow one or other of two main approaches: intellectual
or assigned indexing done by humans, or derived indexing
using different types of algorithms in conjunction with full-text databases. In the first approach different types of word
lists, thesauri and so on are used to guide and control the
indexer. In the latter, the text in the databases is indexed by
computer using statistical approaches based on language.
The first approach, intellectual or assigned indexing,
has been transformed with the evolution of Web 2.0 technologies. Users of the material can now be involved in the
indexing process, identifying their needs and preferences by
a process of ‘tagging’. Word lists or thesauri thus created
are known as folksonomies, defined (Wikipedia, 2009;
Noruzi, 2007) as taxonomies created within a network-based
community for the indexing (tagging) of Internet or other
documents, the key word here perhaps being ‘community’.
Folksonomies started to develop, as was the case with
printed documents, when the amount of material available
reached the point that the target community needed tools to
manage it. Web technologies also offer other possibilities for
developing thesauri, such as linking between thesauri and
dictionaries in order to define and tighten or broaden the
terminology used initially.
From anarchy to structure and control
The building of folksonomies started in an anarchic fashion,
but it was soon realized that they needed structure and
vocabulary control. Spiteri (2007) saw folksonomies as a tool
for broadening library services, but stressed the importance
of structural and term control if library catalogues were to
be open to indexing (or rather, tagging) by the users. Noruzi
(2007) identified thesaurus-like structural requirements
as including term control (form and meaning), hierarchies
between terms used, spelling and vocabulary structures
(indexes, guides and so on).
The structures and rules for ‘open’ indexing need to be
produced professionally, a task we have already embarked
on in Finland using various approaches. These include
‘ontologized’ net thesauri (VESA and Kaunokki), the use
of statistical algorithms to regulate free indexing by users
(see Saarti, 2002b), and Internet bookstore voting methods
in which users are invited to evaluate the tagging done by
other users. Another possibility is ‘tagging’ moderation by
professionals (Hidderley and Rafferty, 1997).
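As a rough illustration of what term control over user tagging might look like, the following Python sketch maps free tags to preferred terms through a small table of variant forms, and sets unknown tags aside for moderation. The vocabulary, the variant table and the function name are invented for the example; they are not taken from Kaunokki or VESA, and the sketch does not describe how any Finnish service actually works.

```python
# Hypothetical controlled vocabulary: preferred term -> known variant forms.
# The terms are illustrative only, not drawn from Kaunokki or VESA.
PREFERRED_TERMS = {
    "detective stories": {"detective story", "whodunit", "crime novels"},
    "historical novels": {"historical novel", "historical fiction"},
}


def normalise_tags(user_tags):
    """Split user tags into accepted preferred terms and tags needing review."""
    # Build a lookup from every accepted form to its preferred term.
    lookup = {}
    for preferred, variants in PREFERRED_TERMS.items():
        lookup[preferred] = preferred
        for variant in variants:
            lookup[variant] = preferred

    accepted, for_review = [], []
    for tag in user_tags:
        key = " ".join(tag.lower().split())  # control the form: case and spacing
        if key in lookup:
            if lookup[key] not in accepted:
                accepted.append(lookup[key])
        elif key not in for_review:
            for_review.append(key)
    return accepted, for_review


accepted, for_review = normalise_tags(["Whodunit", "detective  stories", "cosy"])
print(accepted)    # ['detective stories']
print(for_review)  # ['cosy']
```

Tags that fall into the review list could then be handled by any of the approaches mentioned above, such as user voting or professional moderation.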
Folksonomies and controlled vocabularies compared
The basic difference between these two systems (see Table 1)
is that a folksonomy is built within an interactive subcultural
Internet community and is continuously tested and modified.
Building a controlled vocabulary is a much more elaborate
process requiring a great deal of sophisticated professional input, and therefore adopting new terms takes time
(McCutcheon, 2009). Folksonomy responds to the cognitive
relevance for users, and provides tools to put them in the
driving seat, whereas a controlled vocabulary is a tool for
controlling and managing the topical or subject relevance.
There is clearly a need for further studies on how these tools
can be united, working towards a controlled way of creating
open and flexible vocabularies and discovering how to make
use of all available tools in combination. In other words, as
McCutcheon puts it, ‘the one with the most tools wins’.
Kaunokki – the Finnish fiction thesaurus
First steps
The origins of the Finnish thesaurus for fiction can be traced
to experiments on indexing fiction carried out by a group of
Finnish librarians and booksellers using the Finnish general
thesaurus (YSA). They soon discovered that the YSA lacked
the terms for indexing fiction. It was decided, therefore,
to develop centralized indexing services for fiction. But
indexing fiction tends to be seen as laborious in itself, and
suffers (at least in Finland) from a lack of tradition and
guidelines, including subject heading lists and thesauri.
In the autumn of 1993, the University of Helsinki Library
(acting also as the Finnish National Library) and the Finnish
Library Service embarked on the construction of a subject
heading list for fiction under the supervision of a general
editor and editorial board. The initial draft of the Kaunokki
was tested in Finnish public libraries, and the first edition
published in 1996.
Table 1 Folksonomies and controlled vocabularies
| Folksonomy | Controlled vocabulary |
| --- | --- |
| Answers in an open manner to the aboutness and topicality of fiction | Answers in a controlled manner to the aboutness and topicality of fiction |
| Used within communities to meet their particular interpretive needs | Helps bibliographic control by library professionals |
| Flexibility of definition | Strict definition |
| Transition from ‘controlled’ indexing to describing one’s own experiences is straightforward | Transition from ‘controlled’ indexing to describing one’s own experiences is practically impossible |
Structuring the Kaunokki: a faceted thesaurus
The first problem was to decide on how to structure the
Kaunokki: subject heading list or thesaurus? The editorial
board opted for thesaurus format so that the new list would
match up with other thesauri published by the University
of Helsinki Library. The organization of the thesaurus was
to be compatible with the facets and topics identified as
relevant in various studies on the classification and indexing
of fiction. An alphabetical ‘index’ was added of all the terms
used in the thesaurus.
The following facets were used:
• fictional genres and their explanations
• events, motives and themes
• characters
• settings
• times
• ‘other’ (mostly technical and typographical aspects).
Four of these facets – events, characters, spaces (or settings)
and times – are mentioned in almost all studies as the main
categories for fiction indexing: ‘Characters, Events, Spaces
and Times may be taken as fundamental data categories for
fiction’ (Beghtol, 1994: 157).
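To make the idea of a faceted description concrete, here is a minimal Python sketch of an index record that stores terms under the six facets listed above and can be searched by any combination of them. The short facet labels, the record layout and the sample data are assumptions made for illustration; they do not reproduce the structure of Kaunokki itself.

```python
from dataclasses import dataclass, field

# Short labels for the six facets listed above; the labels are our own shorthand.
FACETS = ("genre", "theme", "character", "setting", "time", "other")


@dataclass
class FictionRecord:
    title: str
    terms: dict = field(default_factory=dict)  # facet label -> set of index terms

    def add_term(self, facet, term):
        if facet not in FACETS:
            raise ValueError(f"unknown facet: {facet}")
        self.terms.setdefault(facet, set()).add(term)


def search(records, **wanted):
    """Return titles of records that carry every requested facet/term pair."""
    return [
        r.title
        for r in records
        if all(term in r.terms.get(facet, set()) for facet, term in wanted.items())
    ]


# Invented sample data, for illustration only.
a = FictionRecord("Novel A")
a.add_term("genre", "detective stories")
a.add_term("setting", "Helsinki")
b = FictionRecord("Novel B")
b.add_term("genre", "detective stories")
b.add_term("setting", "Turku")

print(search([a, b], genre="detective stories", setting="Turku"))  # ['Novel B']
```

Keeping the terms separated by facet is what allows a query to combine, say, a genre with a setting rather than matching a flat list of keywords.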
Beghtol’s list is very similar to Ranganathan’s PMEST
schema (personality, matter, energy, space, time), as is Shatford’s ‘who, what, when, where’ approach to indexing pictures
(Shatford, 1986). Shatford decided to combine personality
and matter facets as ‘characters’, and ascribes activities by
characters to the energy facet. In Kaunokki, the solution was
to treat terms describing the genre of the fictional work as
corresponding to Ranganathan’s personality facet, since the
genre in fact describes the personality of the work and determines many of the events, spaces and times used in the novel.
Events and motives in Kaunokki were treated as Ranganathan’s matter facet, while Kaunokki characters corresponded
to Ranganathan’s energy facet. Using the facets in association
with Ranganathan’s basic class structure2 made it possible to
separate out different types of media, such as fiction, comics,
movies and musical works.
The miscellaneous ‘other’ group included mainly terms
that describe aspects which are outside the factual text of
the work but are regularly inquired about in libraries. These
aspects, such as readability and language, fall into Pejtersen’s
accessibility category (Pejtersen and Austin, 1983: 234).
Horses for courses
It was realized from an early stage that the context in
which the thesaurus was to be used was key to choosing the
right terms and the right depth. For example, the decision
to use the Kaunokki in public libraries meant that many
terms which were important for literature studies were
not appropriate for the Kaunokki’s less specialized users.
This gap will be made good when the planned Thesaurus
for literary research is published, and some features of the
Thesaurus for literary research may also be useful to public
library users. Hans Jørn Nielsen argues that traditional
fiction indexing, which is mainly based on the factual
aspects, should be extended to include thematic factors, as
well as aspects related to the narrative structures. (It must
be remembered that in modern and postmodern fiction the
main point is not what is told, but how it is told.) Nielsen
also emphasizes the importance of including in the index
cultural and historical factors that have affected the work.
Some of these have been included in the Kaunokki (for
example schools of art and cultural periods) (Nielsen,
1997).
A test run
From the beginning, a database was used to collect the terms
and manage the Kaunokki, but at that time database systems
had not evolved sufficiently to allow for an electronic version
of the first edition. It was decided therefore to produce the
first edition in hard copy only. To emphasize its pioneering
spirit, it was described as a ‘test’ edition. This first edition
of Kaunokki was published in 1996 by BTJ Kirjastopalvelu,
a company that provides services for Finnish libraries. BTJ
Kirjastopalvelu then decided to index all published Finnish
fiction using Kaunokki. As it was already providing library
cataloguing data for the majority of Finnish public libraries,
its decision ensured that Kaunokki had an immediate impact
on the whole Finnish library field.
Fiction indexing: why and how?
BTJ Kirjastopalvelu’s decision also gave rise to considerable
debate about the why and how of fiction indexing. Should
works of fiction be indexed at all? That hurdle safely negotiated, the debate turned to the practical aspects of indexing:
how many terms should be used per work, do all the aspects
of the work of fiction need to be indexed, and importantly,
how exhaustive and how specific should the indexing of
works of fiction be? Perhaps the most important outcome
of this discussion was that it forced libraries to develop
their own policies on fiction indexing, taking this debate
into account. BTJ Kirjastopalvelu had to revise and refine
its fiction indexing policy in response to points made by the
libraries.
A Swedish version – Bella
Once the first version of Kaunokki was published, the next
step was to produce a Swedish version (Bella) to meet
the needs of Finland’s Swedish-speaking population. The
Finnish version was first translated into Swedish, or rather
into the Swedish that is used in Finland. The result was
then carefully edited to ensure that the terms and the
meanings attributed to them worked in ‘Finnish’ Swedish.
Although Finnish and Swedish are both official languages
of Finland, some cultural meanings are not identical in the
two languages, so Bella has some terms that are unique to it.
Bella was published in 1997, just a year after Kaunokki.
Expanding the scope
Once Kaunokki/Bella was firmly established as a tool for
fiction indexing in Finland, it was decided to prepare a
second edition, drawing on the expertise of the editorial
team that had been involved with the project since the
beginning. As a result of feedback from libraries, it was
decided to extend Kaunokki to include movies, comics and
so on, and not limit it to traditional text. This meant a lot of
new terms were needed. It was also decided again to publish
a Swedish version, and for these next editions to continue
as hard-copy publications. They duly appeared in 2000
(Finnish) and 2004 (Swedish). To emphasize the broadened
nature of the thesaurus, it was named Kaunokki: thesaurus
for fictional materials.
A web-based version
It was now essential to move to a web-based version. This
was ready in 2006 (Finnish version available from
http://Kaunokki.kirjastot.fi/ and the Swedish version from
http://Kaunokki.kirjastot.fi/sv-FI/). The service is hosted
and maintained by Helsinki City Library. The editing process
has remained the same, with the editorial board deciding on
terminology and structure. The key advantages of the web
version are its enhanced searchability and the possibilities of
keeping users informed about the changes to the thesaurus.
Kirjasampo,3 a web fiction retrieval service
The Kirjasampo project is a web fiction retrieval service
for readers and librarians, which is being set up to help
solve some of the problems of fiction description and
retrieval discussed above. It will in
particular offer a variety of ways of looking for and recommending ‘good fiction’, ‘good books’ or ‘something similar’.
It will be based on two communities: the professional
community of librarians, who will catalogue the books,
including the ‘long tail’ of literature from the past, using
their traditional tools, and the non-professional community
of readers, who will discuss, tag and rate the books via the
Kirjasampo user interface. As explained below, it will use
semantic techniques.
This project is an extensive enterprise involving many
parties. A project group in Turku City Library is responsible
for the contents. The editorial staff of the libraries.fi web
service (http://www.libraries.fi/en-GB/) is responsible for
developing the user interface and maintaining the service.
Vaasa City Library is responsible for providing the information about Finnish writers. The Semantic Computing
Research Group (SeCo) at Aalto University (see
http://www.seco.tkk.fi/) is responsible for the semantic annotation tool SAHA. The project is funded by the Ministry
of Education, and will be available later in 2010 at
www.kirjasampo.fi.
A model for a fiction search and retrieval system
The concept of the Kirjasampo web service is based on the
model for a search and retrieval system described by Saarti
(1999a).
In 1999, when this model was first developed, it was not
possible to build a system that could take account of all
the aspects listed. Today, when information technology is
so much more sophisticated, the model can be turned into
a reality. In essence it remains much as it was – we have
neither added nor removed any aspects or topics, but have
just applied modern web technologies to make it work.
Figure 1 A model for a fiction search and retrieval system. Source: Saarti (1999a: 194). The figure comprises five components:
• Catalogues and indexes of fictional works: catalogues; subject indexes; abstracts
• Reception by readers: reception of individual works by different readers; criticism; scientific studies
• Fictional work: digitized works; their intertextual references
• Data about authors: personal history; publication history; manuscripts
• Cultural historical context: history of fiction; cultural history; history of reception
We are not in the business of digitizing books, just
providing more information about them. The most important part of Kirjasampo is the semantic database. This is
based on a metadata schema, in which we have specified
which elements are important for the system, emphasizing
the linking and metadata features, which make it possible to
describe fictional works including poems and short stories in
several ways, and to link together works that have something
in common.
The Kirjasampo database, which uses a metadata format
similar to the Dublin Core, displays fields for author, title,
publisher and so on. Works can be linked to related web-based resources, such as critical reviews and studies. Data
is fed into the database using the SAHA annotation editor
(http://demo.seco.tkk.fi/smetana/kirjasampo/index.shtml),
which is browser-based and available for simultaneous annotation by a number of contributors. The principles
of functional cataloguing are followed, with the substance or
abstract of each work included only once. Physical manifestations, such as translations and availability of the work in
different media and formats, are linked to the abstract.
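The cataloguing principle described here, one abstract record per work with its physical manifestations linked to it, can be sketched as a simple data structure. The Python below is only an illustration under our own assumptions: the class names and fields are hypothetical, the field names merely echo Dublin Core elements, and nothing here reproduces the actual Kirjasampo metadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class Manifestation:
    form: str       # e.g. "printed book", "translation", "audiobook"
    language: str   # language code of this manifestation


@dataclass
class Work:
    # Field names loosely echo Dublin Core elements; the schema is an assumption.
    title: str
    creator: str
    description: str                                     # stored once per work
    manifestations: list = field(default_factory=list)   # linked physical forms
    related: list = field(default_factory=list)          # links to reviews, studies

    def add_manifestation(self, form, language):
        self.manifestations.append(Manifestation(form, language))

    def languages(self):
        """Languages in which the work is available, without duplicates."""
        return sorted({m.language for m in self.manifestations})


# Invented sample data, for illustration only.
work = Work("Example novel", "Example Author", "Summary stored once.")
work.add_manifestation("printed book", "fi")
work.add_manifestation("translation", "sv")
work.add_manifestation("audiobook", "fi")
print(work.languages())  # ['fi', 'sv']
```

Because the description lives on the work rather than on each manifestation, adding a new translation or format does not require the content to be described again.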
Figure 2 shows Kirjasampo-SAHA in action, describing
the content of a fictional work. The facets, though phrased
slightly differently, are the same as in Kaunokki: genre,
theme, character, place and time. The language used can
be very precise, or take the form of broad ‘aboutness’ terms
corresponding to the different sorts of user motivation
discussed above. Detailed information can be provided
about the main characters: for example, information can
be given about the kind of policeman Adam Dalgliesh is. It
is also possible to specify which keywords belong together:
for example ‘symbolics: stones’. And in the keyword field
it is possible for users to introduce their own keywords, if
suitable ones cannot be found in the thesaurus. It is also
possible to provide a summary describing the work.
The book cover can be saved and indexed as text. Information can be included such as the fact that a work has
received an award, that it is a part of a series, or is the basis
for a film script, libretto, stage play or similar. A short story
or a poem can be included in its own right, and linked to
anthologies or collected works where it can be found.
Figure 2 The appearance of a view of Kirjasampo-SAHA.
‘Novels’ has been picked up from the Kaunokki ontology, and
‘social conflicts’ is to be picked up.
The librarian recommends
As mentioned above, readers are often looking for ‘something like’. They have read a good book and they would like
to read other books like it. It is not always easy to determine what exactly it is that appeals to the reader. It may
be something to do with the plot, the characters, themes,
style or mood of the work, and (for reasons discussed at
the beginning of this article) this means that ‘likeness’ is
not susceptible to automatic indexing. This is where the
talents of the librarian (and the good human indexer) are
so important: intuition and sensitivity, familiarity with the
literature and strong deductive powers come to the fore. For
instance a customer might ask for ‘a satire in Daniil Harms
style’. This is a complex question, and the librarian has to
process many different aspects simultaneously: who is Daniil
Harms, what is unique about his writing, who produces work
that is in some way similar? After identifying some suitable
writers and books, the librarian can add them to Kirjasampo-SAHA, enabling colleagues to share the results of these
efforts. Other librarians, snowball-like, can add more titles
and more authors, producing an accumulation of shared
knowledge.
So the librarian is increasingly in a position to make
recommendations. Semantic tools can also produce recommendations. For example, books that have five of the same
index terms can be linked. It will be interesting to see
whether librarians and the semantic systems recommend
the same books. Human recommendation can take into
account aspects that a machine will never consider. On the
other hand, artificial intelligence can identify hints about
books that may be forgotten by the current generation of
librarians. However, it depends of course on the description
of books made by people. Artificial intelligence cannot
function without index terms given to books, but it can
combine those terms in multiple ways.
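The rule of thumb mentioned above, linking books that have five index terms in common, is easy to sketch. The Python below is a minimal illustration, not the Kirjasampo implementation: the function name, the threshold parameter and the sample index terms are all invented for the example.

```python
from itertools import combinations


def link_by_shared_terms(index, threshold=5):
    """Return (title, title, shared terms) for pairs sharing >= threshold terms."""
    links = []
    # Titles are dictionary keys, so sorting the items orders the pairs by title.
    for (title_a, terms_a), (title_b, terms_b) in combinations(sorted(index.items()), 2):
        shared = set(terms_a) & set(terms_b)
        if len(shared) >= threshold:
            links.append((title_a, title_b, sorted(shared)))
    return links


# Invented index terms, for illustration only.
index = {
    "Book A": {"novels", "social conflicts", "1930s", "Helsinki", "police officers", "family"},
    "Book B": {"novels", "social conflicts", "1930s", "Helsinki", "police officers", "war"},
    "Book C": {"novels", "war", "poets"},
}

links = link_by_shared_terms(index)
print([(a, b) for a, b, _ in links])  # [('Book A', 'Book B')]
```

As the text notes, such a link is only as good as the index terms people have assigned: Book C is not linked to the others because too few of its terms overlap, not because it is necessarily dissimilar.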
Authors, readers and contexts
Readers looking for a book often ask questions that combine
aspects of the work and its author, asking for example, for
‘a French female writer of detective stories’. A fiction
retrieval system must be able to answer this kind of question.
Kirjasampo-SAHA offers just this, since it is possible to
record an author’s biographical details including nationality
and the language they use, and to link them to the relevant
literary school and period. Readers can add comments, tag
and rate books, and discuss them in the user interface. This
means, as with LibraryThing (http://www.librarything.
com/), that two different perspectives on the book are on
offer: book information – the way that librarians describe
the books – and social information, provided by readers.
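A question such as ‘a French female writer of detective stories’ combines attributes of the author with attributes of the work. The following Python sketch shows one way such a combined query could be answered once author details are recorded alongside the index terms. The classes, field names and records are hypothetical and stand in for whatever structure a real system such as Kirjasampo-SAHA uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    name: str
    nationality: str
    gender: str
    language: str


@dataclass(frozen=True)
class Book:
    title: str
    author: Author
    genres: frozenset


def find_authors(books, genre, **author_attrs):
    """Names of authors matching the attributes who have a book in the genre."""
    names = {
        b.author.name
        for b in books
        if genre in b.genres
        and all(getattr(b.author, k) == v for k, v in author_attrs.items())
    }
    return sorted(names)


# Invented records, for illustration only.
x = Author("Author X", "French", "female", "French")
y = Author("Author Y", "Finnish", "male", "Finnish")
books = [
    Book("Title 1", x, frozenset({"detective stories"})),
    Book("Title 2", y, frozenset({"detective stories"})),
    Book("Title 3", x, frozenset({"poems"})),
]
print(find_authors(books, "detective stories", nationality="French", gender="female"))
# ['Author X']
```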
One of the most interesting aspects of a fictional
work in Saarti’s model is its cultural-historical context. It is
also the greatest challenge in applying the model, and until
now it has been almost impossible to implement. One of
the main aims of the semantic web is to contextualize
cultural and social phenomena to show how they affect each
other. Content is crucial, since it forms the base, but it is equally
important to describe the contents by placing them in
different contexts and in this way to create new knowledge
and provide experiences. In this sense, fiction is a very interesting field: fictional works link with reality in many ways, as
well as with cultural history and other fictional works.
The Kaunokki ontology
The concept of the semantic web is to build a metadata layer
that describes the contents on the web with sufficient precision and accuracy for a machine to use, allowing web systems
to achieve better interoperability and enabling end users to
access more intelligent services (Hyvönen, 2006). (For a fuller
discussion of the semantic web, see also Northedge, 2008.)
The key to effective use of the semantic web is the
use of ontologies, which help in many application areas, such
as semantic search, information retrieval, semantic linking
of contents, and making contents semantically interoperable (Hyvönen et al., 2007). Ciravegna and Petrelli (2006)
also explored the role of ontology-based annotation as a
means of making document content amenable to automatic
searching and indexing.
Ontologization of the Kaunokki thesaurus and maintenance of the ontology are two of the Kirjasampo project
tasks. The conversion from thesaurus to ontology (that is,
defining the terms and their relationships, and connecting
them to the YSO, the General Finnish Ontology) took
about five months. (For a detailed description of the
thesaurus/ontology conversion, see Ruotsalo et al., 2008.)
This done, the indexers had access to an ontology
(http://www.yso.fi/onki/kauno) with over 25,000 terms in a single
location, and no longer needed to search for terms in two
web services.
The Kaunokki ontology and thesaurus are similar in
content, with the thesaurus facets carried through to the
ontology. The hierarchical structure of the ontology (which will
be updated in step with the thesaurus) makes
it easier to notice faults and to spot hierarchical and associative relationships and distinctions between the terms, and is
invaluable in developing both thesaurus and ontology.
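How an explicit structure helps in noticing faults can be illustrated with a small Python sketch. It assumes a simplified thesaurus in which each term lists its broader terms (BT) and related terms (RT); the terms, the dictionary layout and the two helper functions are invented for illustration and are not taken from Kaunokki or from the conversion described by Ruotsalo et al. (2008).

```python
# Hypothetical mini-thesaurus: BT = broader terms, RT = related terms.
THESAURUS = {
    "fiction": {"BT": [], "RT": []},
    "novels": {"BT": ["fiction"], "RT": []},
    "detective novels": {"BT": ["novels"], "RT": ["crimes"]},
    "crimes": {"BT": ["social problems"], "RT": []},
}


def broader_chain(term, thesaurus):
    """Return all transitive broader terms of `term`, nearest first."""
    seen = []
    queue = list(thesaurus.get(term, {}).get("BT", []))
    while queue:
        current = queue.pop(0)
        if current in seen or current == term:
            continue  # guards against duplicates and circular hierarchies
        seen.append(current)
        queue.extend(thesaurus.get(current, {}).get("BT", []))
    return seen


def find_faults(thesaurus):
    """Report broader terms that are undefined and RT links that are one-way."""
    faults = []
    for term, relations in thesaurus.items():
        for parent in relations["BT"]:
            if parent not in thesaurus:
                faults.append(f"{term}: broader term '{parent}' is not defined")
        for other in relations["RT"]:
            if other in thesaurus and term not in thesaurus[other]["RT"]:
                faults.append(f"{term}: related term '{other}' does not point back")
    return faults


print(broader_chain("detective novels", THESAURUS))  # ['novels', 'fiction']
for fault in find_faults(THESAURUS):
    print(fault)
```

In this toy data the check flags two problems: 'crimes' refers to a broader term that was never defined, and the associative link between 'detective novels' and 'crimes' is recorded in one direction only. Checks of this kind are easy to run once the relationships are explicit.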
The Kaunokki ontology is integrated into SAHA, whose
autocomplete function helps the indexer to choose the
right term (see Figure 2). The ontology is multilingual,
with terms in both Finnish and Swedish, and, for those terms
included in the YSO, also in English. Indexing terms can
be displayed in any of the languages depending on the user
interface. SAHA itself is trilingual: the names of the fields
and the instructions are in Finnish, Swedish and English.
SAHA is an inspiring tool, encouraging and motivating
the annotator to seek more information and to save it in
the database. As the user interface and its functionalities
develop, it will be possible to explore further how ontologies
and other semantic techniques can benefit fiction retrieval.
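The multilingual autocomplete described above can be sketched as follows. This is a minimal illustration, assuming each concept carries one label per language and that an English label may be missing; the concept identifiers, the sample labels and both functions are hypothetical and do not reflect SAHA's actual implementation.

```python
# Hypothetical concepts; the labels are ordinary Finnish, Swedish and English
# words chosen for illustration, not entries copied from the Kaunokki ontology.
CONCEPTS = [
    {"id": "c1", "labels": {"fi": "romaanit", "sv": "romaner", "en": "novels"}},
    {"id": "c2", "labels": {"fi": "rakkaus", "sv": "kärlek", "en": "love"}},
    {"id": "c3", "labels": {"fi": "rikokset", "sv": "brott"}},  # no English label
]


def autocomplete(prefix, lang, concepts=CONCEPTS, limit=10):
    """Return (label, concept id) pairs whose label in `lang` starts with `prefix`."""
    prefix = prefix.casefold()
    hits = [
        (concept["labels"][lang], concept["id"])
        for concept in concepts
        if lang in concept["labels"]
        and concept["labels"][lang].casefold().startswith(prefix)
    ]
    return sorted(hits)[:limit]


def display_label(concept, lang, fallback="fi"):
    """Show the label in the interface language, falling back if it is missing."""
    return concept["labels"].get(lang, concept["labels"][fallback])


print(autocomplete("r", "fi"))
# [('rakkaus', 'c2'), ('rikokset', 'c3'), ('romaanit', 'c1')]
print(autocomplete("Ro", "sv"))           # [('romaner', 'c1')]
print(display_label(CONCEPTS[2], "en"))   # rikokset
```

Because the indexer selects a concept identifier rather than a string, the same annotation can be displayed in whichever language the user interface is set to.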
The use of ontologies makes indexing easier, but it does
not replace the human thought process of asking questions
such as: what is this book about, and how should this work
be described? The fundamental question is how far semantic
recommendations and semantic linking can help readers to
find good books, to widen their choice and to find
texts more easily than hitherto.
Challenges for the future
The strength of the network communities and technologies is the active social interaction they promote. But social
interaction is not enough in itself. This poses an interesting
challenge especially for libraries: how to adapt tools created
for printed documents controlled within institutions to the
various types of documents published on the web and outside
the control of institutions. The challenge has a mirror image:
how can databases created within libraries be made available to network communities, and how, if users are given
a role in indexing a library’s documents, is total chaos to
be avoided? How can such traditional library tools as classification and indexing be combined with modern tools such
as ontology-based annotation, tagging and user-evaluation?
Defining the borderlines and the overlap between library
work and the work done by the web communities will be an
essential part of finding an answer to the challenges.
Many areas remain to be investigated. How, for example,
does the social environment impact on the indexing of
fiction, and how does access to an index influence a reader’s
choice of book? What scope is there for taking further the
concept of democratic, user-directed, indexing, already in
use in several libraries? The new web-based communities
and their ways of organizing and disseminating information offer a living laboratory where it is possible to observe
and analyse the evolution of information retrieval tools and
actions. And in the specific field of fiction retrieval, the
most important next step is to analyse in depth the special
information systems already available, including commercial
models such as Amazon (see also Adkins and Bossaller,
2007; Arvidsson and Tolstoy, 2005) and systems developed
for library environments, together with open systems where
the library is one participant with other actors.
Acknowledgements
The authors are grateful to Dr Ewen MacDonald for
revising the English in this paper, and to the editor Maureen
MacGlashan.
Notes
1 With the publication of S. R. Ranganathan’s Colon
classification.
2 See note 1.
3 Kirja = book. In Finnish mythology the Sampo is a magical artefact
that brings good fortune to its holder. It is also a mill which can
make things like flour, salt and gold out of thin air.
References
Adkins, D. and Bossaller, J. E. (2007) Fiction access points across
computer-mediated book information sources: a comparison of
online bookstores, reader advisory databases, and public library
catalogs. Library and Information Science Research 29, 354–68.
Arvidsson, S. and Tolstoy, T. (2005) Internetbokhandelns
rekommendationssystem: En undersökning av Amazon.coms
Similar Items. Borås, Sweden: Högskolan i Borås.
Beghtol, C. (1994) The classification of fiction: the development
of a system based on theoretical principles. Metuchen, N.J.:
Scarecrow Press.
Beghtol, C. (1997) Stories: applications of narrative discourse
analysis to issues in information storage and retrieval. Knowledge
Organization 24(2), 64–71.
Bell, H. K. (1992) Should fiction be indexed? The indexability of
text. The Indexer 18(2), 83–6.
Bella (1997) Specialtesaurus för skönlitteratur, ed. J. Saarti,
trans. M. Rajalin, R. Sandelin and Y. Thölix. Helsinki: BTJ
Kirjastopalvelu.
Bella (2004) Specialtesaurus för fiktiv material, ed. Le. Rehnström
and J. Saarti. Helsinki: BTJ Kirjastopalvelu.
Ciravegna, F. and Petrelli, D. (2006) Annotating document
content: a knowledge-management perspective. The Indexer
25(1), 23–7.
Gadamer, H.-G. (2005) Totuus hengentieteissä. In Hermeneutiikka:
ymmärtäminen tieteissä ja filosofiassa, pp. 3–11, trans. I. Nikander.
Tampere, Finland: Vastapaino. (Translation of Wahrheit in den
Geisteswissenschaften, 1953).
Green, R. (1997) The role of relational structures in indexing for
the humanities. Information Services and Use 17(2–3), 85–100.
Hidderley, R. and Rafferty, P. (1997) Democratic indexing: an
approach to the retrieval of fiction. Information Services and
Use 17(2–3), 101–9.
Hyvönen, E. (2006) FinnONTO – building the basis for a national
Semantic Web infrastructure in Finland. Developments in
Artificial Intelligence and the Semantic Web – Proceedings of the
12th Finnish AI Conference STeP 2006, October 26–27, 2006.
Available at: http://www.seco.tkk.fi/publications/2006/hyvonen-finnonto-building-the-basis-for-a-national-semantic-web-infrastructure-in-finland-2006.pdf
Hyvönen, E., Viljanen, K., Mäkelä, E., Kauppinen, T., Ruotsalo, T.,
Valkeapää, O., Seppälä, K., Suominen, O., Alm, O., Lindroos,
R., Känsälä, T., Henriksson, R., Frosterus, M., Tuominen,
J., Sinkkilä, R. and Kurki, J. (2007) Elements of a national
semantic web infrastructure – case study Finland on the
semantic web. Proceedings of the First International Semantic
Computing Conference (IEEE ICSC 2007), Irvine, California,
September, 2007, IEEE Press. Available at:
www.seco.tkk.fi/publications/2007/hyvonen-et-al-elements-2007.pdf
(accessed 18 March 2010).
Johnstone, J. (2010) Poetry and the indexing thereof: the role of
the Scottish Poetry Library (SPL). The Indexer 28(1), 2–5.
Kaunokki (1996) Kaunokirjallisuuden asiasanasto, ed. J. Saarti.
Helsinki: BTJ Kirjastopalvelu.
Kaunokki (2000) Fiktiivisen aineiston asiasanasto, ed. J. Saarti.
Helsinki: BTJ Kirjastopalvelu.
McCutcheon, S. (2009) Keyword vs controlled vocabulary
searching: the one with the most tools wins. The Indexer 27(2),
62–5.
Nielsen, H. J. (1997) The nature of fiction and its significance
for classification and indexing. Information Services and Use
17(2–3), 171–82.
Noruzi, A. (2007) Editorial. Webology 4(2), 12. Available at:
www.webology.ir/2007/v4n2/editorial12.html (accessed 11
January 2009).
Northedge, R. (2008) The medium is not the message: topic maps
and the separation of presentation and content in indexes. The
Indexer 26(2), 60–4.
Pejtersen, A. M. and Austin, J. (1983) Fiction retrieval:
experimental design and evaluation of a search system based
on users’ value criteria: part 1. Journal of Documentation 39(4),
230–46.
Rich, E. (1979) User modeling via stereotypes. Cognitive Science
3, 329–54.
Rich, E. (1983) Users are individuals: individualizing user models.
International Journal of Man–Machine Studies 18, 199–214.
Ruotsalo, T., Seppälä, K., Viljanen, K., Mäkelä, E., Kurki, J.,
Alm, O., Kauppinen, T., Tuominen, J., Frosterus, M., Sinkkilä,
R. and Hyvönen, E. (2008) Ontology-based approach for
interoperability of digital collections. Signum 5, 5–13. Available
at: http://pro.tsv.fi/stks/signum/ (accessed 18 March 2010).
Saarti, J. (1997) Feeding with the spoon, or the effects of shelf
classification of fiction on the loaning of fiction. Information
Services and Use 17(2–3), 159–69.
Saarti, J. (1999a) Kaunokirjallisuuden sisällönkuvailun aspektit:
kirjastoammattilaisten ja kirjastonkäyttäjien tekemien romaanien
tiivistelmien ja asiasanoitusten yhdenmukaisuus. Acta
Universitatis Ouluensis. B, Humaniora, 33. Oulu, Finland:
Oulun yliopisto. Available at:
http://herkules.oulu.fi/isbn9514254767 (accessed 18 March 2010).
Saarti, J. (1999b) Fiction indexing and the development of fiction
thesauri. Journal of Librarianship and Information Science
31(2), 85–92.
Saarti, J. (2002a) The analysis of the information process of fiction:
a holistic approach to information processing, pp. 74–9 in M.
J. López-Huertas (ed.), Challenges in Knowledge Representation
and Organization for the 21st Century: Integration of Knowledge
across Boundaries, Proceedings of the Seventh International
ISKO Conference 10–13 July 2002, Granada, Spain. Advances in
Knowledge Organization, vol. 8. Würzburg, Germany: Ergon.
Saarti, J. (2002b) Consistency of subject indexing of novels by public
library professionals and patrons. Journal of Documentation
58(1), 49–65.
Saracevic, T. (1975) Relevance: a review of and a framework for
thinking on the notion in information science. Journal of the
American Society for Information Science 26(6), 321–43.
Saracevic, T. (1996) Relevance reconsidered, pp. 201–18 in Information
science: integration in perspectives. Proceedings
of the Second Conference on Conceptions of Library and
Information Science, Copenhagen.
Shatford, S. (1986) Analyzing the subject of a picture: a theoretical
approach. Cataloging and Classification Quarterly 6(3), 39–62.
Spiteri, L. F. (2007) Structure and form of folksonomy tags: the
road to the public library catalogue. Webology 4(2), article 41.
Available at: http://www.webology.ir/2007/v4n2/a41.html
(accessed 11 January 2009).
Wikipedia (2009) Folksonomy, Wikipedia Finland. Available at:
http://en.wikipedia.org/wiki/Folksonomy (accessed 11
January 2009).
Jarmo Saarti works at the University of Eastern Finland Library, P.O.
Box 1627, FIN-70211 Kuopio, Finland. Email: jarmo.saarti@uef.fi
Kaisa Hypén is based at Turku City Library, Linnankatu 2,
FIN-20100 Turku, Finland. Email: [email protected]