From thesaurus to ontology: the development of the Kaunokki Finnish fiction thesaurus

Jarmo Saarti and Kaisa Hypén

Jarmo Saarti and Kaisa Hypén look briefly at pre-computer age fiction retrieval methods. They discuss the challenges works of fiction present in the information management and retrieval context and the need to develop fiction indexing tools suitable for web-based services, and describe how the creation of Kaunokki, the Finnish fiction thesaurus, in 1996, and its development into the Kirjasampo-SAHA web service for readers and librarians, are helping to meet these challenges.

Background

Finding what you don’t know is there: free text searching, the index and the social networker

Works of fiction have traditionally been ordered on a shelf-arrangement basis in such a way as to allow browsing by potential readers as they walk between open stacks (see Saarti, 1997). We all know how much we resent it when books are out of sight in basement stores and have to be called up by known title. These shelf arrangements were based mainly on genres, and are known as genre classifications. The usual library tools were used for determining the arrangements: classification, indexing and abstracts (see Saarti, 1999b, 2002b).

But the reality, shown up by developments in the information technology field, including in particular digital distribution methods and retrospective digitization of material, is that in the years of paper catalogues much material (above all fiction) went unindexed and uncatalogued in any meaningful way. Unless steps are taken to remedy the situation this could continue in the Internet age since, as digitization demonstrates, texts and other types of documents that have not been analysed and classified and/or indexed in full text databases are simply ignored. If the system cannot see them, it will not find them.
Not only did the advent of highly sophisticated computer technology show up the gaps of the past, it also provided the means to move towards filling them. For natural science materials, the task is fairly easy. The fact that texts are usually topic-based makes them relatively unambiguous and thus ready for automatic text-search ‘indexing’. Fiction, being concept-based, is another matter. The greatest challenge in representing fictional content is that interpretation of the content is always subjective: readers make their own judgement. (See Bell, 1992; and for a discussion of the problem in relation to the indexing of poetry, Johnstone, 2010.) The challenge for those seeking to develop a fiction retrieval system is to identify which aspects of a fictional work are the least subjective, and which are the aspects most readers will be able to agree on (see Saarti, 2002a).

In a sense this process began during the period of card catalogues, with the first experiments in developing subject headings for fiction and the creation of systematic thesauri. The concept of faceted classification goes back to 1933 [1], and there are now many lists describing the most objective aspects (or ‘facets’) in fiction content in the library and information science context (see e.g. Beghtol, 1994), the facets most often mentioned being genre, time and space. Other ‘objective’ elements include the style of the work, its language, and factual objects mentioned in the text including, for example, characters in a novel. However, ‘objective’ though these elements may be, this does not mean that they are necessarily easy to define objectively. It all comes back to the subjectivity of fiction interpretation. We discuss the role of facets and ways of dealing with subjectivity in the context of the Kaunokki thesaurus below.
The most radical change introduced by computer technology into the world of information management and retrieval has been the possibility of arranging masses of text in searchable databases. The arrival of interactive user interfaces and social networking has taken things a step further, revolutionizing the publication and dissemination of information and making it possible to incorporate user behaviour into the content representation of fiction. One exciting aspect of the analysis of fiction texts and the fiction information dissemination process is the challenge it offers to traditional models, rigorously testing and, it is hoped, expanding the theoretical tools and concepts used in our field of research. (See e.g. Beghtol, 1994, 1997; Green, 1997.) In this article we explore developments in the field of fiction knowledge management, focusing especially on the problems of fiction indexing and on the development of Kaunokki, the Finnish fiction thesaurus, from printed thesaurus to online ontology.

The Indexer Vol. 28 No. 2 June 2010 Saarti and Hypén: From thesaurus to ontology

Relevance as interface between work and user

In library and information science (LIS), the relationship between a potential reader and a work is defined by the concept of relevance. This is a somewhat fuzzy notion, but roughly speaking, ‘relevance’ happens when the process of information seeking and searching leads to a positive encounter between the searcher and the work. In other words, the searchers find something corresponding to what they are looking for. The fuzziness derives from the fact that the information needs of an individual and the information in the documents under scrutiny are both infinitely variable, and so, therefore, is the scope for a ‘positive encounter’. Positive encounters are many-faceted and multi-dimensional.
The first document turned up as the result of an information-seeking and searching process may not meet the needs of the seeker, but it can be used to redefine or refine the query, with the prospect of getting nearer to the target. Tefko Saracevic first addressed the concept of relevance in 1975. Returning to the subject in 1996, he suggested that ‘relevance’ manifests itself in five ways: system or algorithmic relevance (relationship between a query and information objects); subject relevance (relationship between the subject or topic expressed in a query, and the topic or subject covered by the texts retrieved); cognitive relevance or pertinence (relationship between the state of knowledge and cognitive information needs of a user, and the texts retrieved); situational relevance or utility (relationship between the situation, task, or problem identified by the user, and the texts retrieved); and motivational or affective relevance (relationship between the user’s goals and motivations, and the texts retrieved). These all pose particular challenges for the knowledge management of works of fiction, whether we are talking about traditional library tools or modern IT methods.

Topicality in fiction: the problems of term selection

Distinguishing fact and fiction

As already mentioned, works of fiction present particular problems when it comes to ‘topicality’, in other words term selection. For example, a novel drawing on the life of a real person or a historical novel representing events of the past is based explicitly on reality, but the fact that a novel based on a real person has been published as fiction implies that the author has deliberately moved the real person into a fictional world. So the rules are changed: the novel is a recreation of a life that did occur but which is now reinterpreted and rewritten. This sets one minimum requirement for the chosen system, particularly one that contains metadata about both factual and fictional works: it must be able to recognize the difference between them!
And, as noted in studies on fiction knowledge management, the systems must be able to handle multi-faceted content descriptions and be open to multi-faceted information searching.

User motivation

There are essentially two sorts of motivation in the fiction-searching context. Either the user wants to track down works on a particular topic, or they are simply looking for enjoyment in the shape of a good novel, just as with a good movie or good music. Those looking for fiction of topical or subject relevance may be in search of something very precise, or may be interested more broadly in ‘aboutness’. So if a fiction information retrieval (IR) system is to be effective, it must be able to respond to searches about how truthfully or realistically a novel handles a historical fact, for example, or addresses aspects of real life such as illnesses or building one’s identity. People searching fiction by topic are often motivated by factors other than just the pleasure of reading (for example, fictional works may be used as material for historical or sociological studies, travel guides, even as language learning resources), but their search is essentially objective.

‘Good fiction’: a subjective concept

When it comes to searching for ‘good’ fiction we are in more subjective territory, and it is very difficult to address the requirement using the traditional classification and content representation tools. Searchers are looking for something new based on their previous experiences (‘I want to read something like book X or similar to author A . . .’) but at the same time they want to experiment (‘. . . but I want to read something new’). It could indeed be that the greatest pleasure is to stumble upon a totally new world of fiction, with new types of material or authors or genres that do not correspond to anything read previously.
In fact, the concept of ‘a good work’ in this context is totally dependent on the individual’s own, ever-changing point of view. A novel that someone considers ‘good’ today might not resonate with them tomorrow, but could become meaningful again at some time in the future. This is an experience we have all had when we re-read something we enjoyed when we were young and now find disappoints, or vice versa. For somebody looking for a ‘good’ work of fiction, ‘good’ only really has meaning by reference to the needs of that particular person at that particular moment in time: and the needs may not even be clear until the search is successful, or until they have been redefined and refined to the point of total clarity in response to unsuccessful searches. So when searching fiction, the opportunity to browse is all-important.

The sociohistorical context

However, although only individuals can judge what is ‘good’ for them, fiction is always defined within a sociohistorical context. The interpretation and reception of works of fiction is invariably a social and historical construction, an aspect reflected in the use of categorizations (for example, ‘the most important movies in the world’, ‘Finnish national writers’, ‘a hundred horror movies you have to see’, ‘Death metal: the most important tracks of last year’). To be effective an information retrieval system must be able to build in this sociohistorical context, and to make use of the language and terminology of a given group for search and retrieval purposes.

Actional relevance

One further ‘relevance’ category can be added to Saracevic’s list especially in relation to fiction: actional or interpretative relevance. Reception of fiction is based on interpretation, and these days interpretation is actively disseminated in Internet forums, in book club discussions, and in reviews.
Actional relevance also occurs when new works of fiction are created, based on those that have been read, or perhaps through role playing activity. Fan-fiction portals and social networking are excellent examples of actional relevance.

Folksonomies

The reader as interpreter of fiction and creator of fictional environments

Indexing carried out in libraries and similar institutions tends to follow one or other of two main approaches: intellectual or assigned indexing done by humans, or derived indexing using different types of algorithms in conjunction with full-text databases. In the first approach different types of word lists, thesauri and so on are used to guide and control the indexer. In the latter, the text in the databases is indexed by computer using statistical approaches based on language.

The first approach, intellectual or assigned indexing, has been transformed with the evolution of Web 2.0 technologies. Users of the material can now be involved in the indexing process, identifying their needs and preferences by a process of ‘tagging’. Word lists or thesauri thus created are known as folksonomies, defined (Wikipedia, 2009; Noruzi, 2007) as taxonomies created within a network-based community for the indexing (tagging) of Internet or other documents, the key word here perhaps being ‘community’. Folksonomies started to develop, as was the case with printed documents, when the amount of material available reached the point that the target community needed tools to manage it. Web technologies also offer other possibilities for developing thesauri, such as linking between thesauri and dictionaries in order to define and tighten or broaden the terminology used initially.

From anarchy to structure and control

The building of folksonomies started in an anarchic fashion, but it was soon realized that they needed structure and vocabulary control.
Spiteri (2007) saw folksonomies as a tool for broadening library services, but stressed the importance of structural and term control if library catalogues were to be open to indexing (or rather, tagging) by the users. Noruzi (2007) identified thesaurus-like structural requirements as including term control (form and meaning), hierarchies between the terms used, spelling, and vocabulary structures (indexes, guides and so on). The structures and rules for ‘open’ indexing need to be produced professionally, a task we have already embarked on in Finland using various approaches. These include ‘ontologized’ net thesauri (VESA and Kaunokki), the use of statistical algorithms to regulate free indexing by users (see Saarti, 2002b), and Internet bookstore voting methods in which users are invited to evaluate the tagging done by other users. Another possibility is ‘tagging’ moderation by professionals (Hidderley and Rafferty, 1997).

Folksonomies and controlled vocabularies compared

The basic difference between these two systems (see Table 1) is that a folksonomy is built within an interactive subcultural Internet community and is continuously tested and modified. Building a controlled vocabulary is a much more elaborate process requiring a great deal of sophisticated professional input, and therefore adopting new terms takes time (McCutcheon, 2009). A folksonomy responds to cognitive relevance for users, and provides tools to put them in the driving seat, whereas a controlled vocabulary is a tool for controlling and managing topical or subject relevance. There is clearly a need for further studies on how these tools can be united, working towards a controlled way of creating open and flexible vocabularies and discovering how to make use of all available tools in combination. In other words, as McCutcheon puts it, ‘the one with the most tools wins’.
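One way of combining the two tools, in the spirit of the term control (form and meaning) that Noruzi describes, is to reconcile free user tags against a controlled vocabulary and to keep the unmatched tags as candidate terms for the editors. The following is only a minimal sketch under our own assumptions: the vocabulary, its variant forms and the function names are hypothetical and are not taken from Kaunokki or from any of the systems cited.

```python
from collections import Counter

# Hypothetical controlled vocabulary: preferred term -> known variant forms.
# The terms are illustrative only and are not taken from Kaunokki.
CONTROLLED_TERMS = {
    "detective novels": {"detective novel", "whodunit", "whodunits", "crime fiction"},
    "satire": {"satires", "satirical"},
}


def build_variant_index(vocabulary):
    """Map every preferred term and variant (lower-cased) to its preferred term."""
    index = {}
    for preferred, variants in vocabulary.items():
        index[preferred.lower()] = preferred
        for variant in variants:
            index[variant.lower()] = preferred
    return index


def reconcile_tags(user_tags, vocabulary):
    """Split free tags into counts of controlled terms and of unmatched tags."""
    index = build_variant_index(vocabulary)
    matched, unmatched = Counter(), Counter()
    for tag in user_tags:
        key = " ".join(tag.lower().split())  # normalise case and whitespace
        if key in index:
            matched[index[key]] += 1
        else:
            unmatched[key] += 1  # candidate term for editorial review
    return matched, unmatched


tags = ["Whodunit", "crime  fiction", "satirical", "cosy", "cosy", "Satire"]
matched, unmatched = reconcile_tags(tags, CONTROLLED_TERMS)
```

In this sketch the controlled vocabulary stays under editorial control, while frequently used unmatched tags (here ‘cosy’) surface as evidence of what the community is asking for.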
Table 1 Folksonomies and controlled vocabularies
Folksonomy | Controlled vocabulary
Answers in an open manner to the aboutness and topicality of fiction | Answers in a controlled manner to the aboutness and topicality of fiction
Used within communities to meet their particular interpretive needs | Helps bibliographic control by library professionals
Flexibility of definition | Strict definition
Transition from ‘controlled’ indexing to describing one’s own experiences is straightforward | Transition from ‘controlled’ indexing to describing one’s own experiences is practically impossible

Kaunokki: the Finnish fiction thesaurus

First steps

The origins of the Finnish thesaurus for fiction can be traced to experiments on indexing fiction carried out by a group of Finnish librarians and booksellers using the Finnish general thesaurus (YSA). They soon discovered that the YSA lacked the terms for indexing fiction. It was decided, therefore, to develop centralized indexing services for fiction. But indexing fiction tends to be seen as laborious in itself, and suffers (at least in Finland) from a lack of tradition and guidelines, including subject heading lists and thesauri. In the autumn of 1993, the University of Helsinki Library (acting also as the Finnish National Library) and the Finnish Library Service embarked on the construction of a subject heading list for fiction under the supervision of a general editor and editorial board. The initial draft of the Kaunokki was tested in Finnish public libraries, and the first edition was published in 1996.

Structuring the Kaunokki: a faceted thesaurus

The first problem was to decide on how to structure the Kaunokki: subject heading list or thesaurus? The editorial board opted for thesaurus format so that the new list would match up with other thesauri published by the University of Helsinki Library.
The organization of the thesaurus was to be compatible with the facets and topics identified as relevant in various studies on the classification and indexing of fiction. An alphabetical ‘index’ was added of all the terms used in the thesaurus. The following facets were used:

• fictional genres and their explanations
• events, motives and themes
• characters
• settings
• times
• ‘other’ (mostly technical and typographical aspects).

Four of these facets (events, characters, spaces or settings, and times) are mentioned in almost all studies as the main categories for fiction indexing: ‘Characters, Events, Spaces and Times may be taken as fundamental data categories for fiction’ (Beghtol, 1994: 157). Beghtol’s list is very similar to Ranganathan’s PMEST schema (personality, matter, energy, space, time), as is Shatford’s ‘who, what, when, where’ approach to indexing pictures (Shatford, 1986). Shatford decided to combine personality and matter facets as ‘characters’, and ascribes activities by characters to the energy facet.

In Kaunokki, the solution was to treat terms describing the genre of the fictional work as corresponding to Ranganathan’s personality facet, since the genre in fact describes the personality of the work and determines many of the events, spaces and times used in the novel. Events and motives in Kaunokki were treated as Ranganathan’s matter facet, while Kaunokki characters corresponded to Ranganathan’s energy facet. Using the facets in association with Ranganathan’s basic class structure [2] made it possible to separate out different types of media, such as fiction, comics, movies and musical works.

The miscellaneous ‘other’ group included mainly terms that describe aspects which are outside the factual text of the work but are regularly inquired about in libraries. These aspects, such as readability and language, fall into Pejtersen’s accessibility category (Pejtersen and Austin, 1983: 234).
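To make the faceted structure concrete, the six facets listed above can be pictured as the fields of an index record, with a search that requires a match in each facet the searcher specifies. This is a minimal sketch only: the class, the field names and the sample titles and terms are our own hypothetical choices and do not describe how Kaunokki or any Finnish library system is actually implemented.

```python
from dataclasses import dataclass, field

# The six facets named in the text; the identifiers are our own choice.
FACETS = ("genres", "events_motives_themes", "characters", "settings", "times", "other")


@dataclass
class FictionRecord:
    """An index record holding a set of terms for each facet."""
    title: str
    facets: dict = field(default_factory=dict)  # facet name -> set of index terms

    def add_term(self, facet, term):
        if facet not in FACETS:
            raise ValueError(f"unknown facet: {facet}")
        self.facets.setdefault(facet, set()).add(term)


def faceted_search(records, **wanted):
    """Return titles of records containing every requested facet/term pair."""
    hits = []
    for record in records:
        if all(term in record.facets.get(facet, set()) for facet, term in wanted.items()):
            hits.append(record.title)
    return hits


# Hypothetical sample data, for illustration only.
records = [FictionRecord("Example novel A"), FictionRecord("Example novel B")]
records[0].add_term("genres", "detective novels")
records[0].add_term("settings", "London")
records[1].add_term("genres", "detective novels")
records[1].add_term("settings", "Helsinki")

result = faceted_search(records, genres="detective novels", settings="Helsinki")
```

Keeping each facet separate is what allows a searcher to narrow by genre alone, by setting alone, or by any combination of the two.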
Horses for courses

It was realized from an early stage that the context in which the thesaurus was to be used was key to choosing the right terms and the right depth. For example, the decision to use the Kaunokki in public libraries meant that many terms which were important for literature studies were not appropriate for the Kaunokki’s less specialized users. This gap will be made good when the planned Thesaurus for literary research is published, and some features of the Thesaurus for literary research may also be useful to public library users.

Hans Jørn Nielsen argues that traditional fiction indexing, which is mainly based on the factual aspects, should be extended to include thematic factors, as well as aspects related to the narrative structures. (It must be remembered that in modern and postmodern fiction the main point is not what is told, but how it is told.) Nielsen also emphasizes the importance of including in the index cultural and historical factors that have affected the work. Some of these have been included in the Kaunokki (for example schools of art and cultural periods) (Nielsen, 1997).

A test run

From the beginning, a database was used to collect the terms and manage the Kaunokki, but at that time database systems had not evolved sufficiently to allow for an electronic version of the first edition. It was decided therefore to produce the first edition in hard copy only. To emphasize its pioneering spirit, it was described as a ‘test’ edition. This first edition of Kaunokki was published in 1996 by BTJ Kirjastopalvelu, a company that provides services for Finnish libraries. BTJ Kirjastopalvelu then decided to index all published Finnish fiction using Kaunokki. As it was already providing library cataloguing data for the majority of Finnish public libraries, its decision ensured that Kaunokki had an immediate impact on the whole Finnish library field.

Fiction indexing: why and how?
BTJ Kirjastopalvelu’s decision also gave rise to considerable debate about the why and how of fiction indexing. Should works of fiction be indexed at all? That hurdle safely negotiated, the debate turned to the practical aspects of indexing: how many terms should be used per work, do all the aspects of the work of fiction need to be indexed, and importantly, how exhaustive and how specific should the indexing of works of fiction be? Perhaps the most important outcome of this discussion was that it forced libraries to develop their own policies on fiction indexing, taking this debate into account. BTJ Kirjastopalvelu had to revise and refine its fiction indexing policy in response to points made by the libraries.

A Swedish version: Bella

Once the first version of Kaunokki was published, the next step was to produce a Swedish version (Bella) to meet the needs of Finland’s Swedish-speaking population. The Finnish version was first translated into Swedish, or rather into the Swedish that is used in Finland. The result was then carefully edited to ensure that the terms and the meanings attributed to them worked in ‘Finnish’ Swedish. Although Finnish and Swedish are both official languages of Finland, some cultural meanings are not identical in the two languages, so Bella has some terms that are unique to it. Bella was published in 1997, just a year after Kaunokki.

Expanding the scope

Once Kaunokki/Bella was firmly established as a tool for fiction indexing in Finland, it was decided to prepare a second edition, drawing on the expertise of the editorial team that had been involved with the project since the beginning. As a result of feedback from libraries, it was decided to extend Kaunokki to include movies, comics and so on, and not limit it to traditional text. This meant a lot of new terms were needed.
It was also decided again to publish a Swedish version, and for these next editions to continue as hard-copy publications. They duly appeared in 2000 (Finnish) and 2004 (Swedish). To emphasize the broadened nature of the thesaurus, it was named Kaunokki: thesaurus for fictional materials.

A web-based version

It was now essential to move to a web-based version. This was ready in 2006 (Finnish version available from http://Kaunokki.kirjastot.fi/ and the Swedish version from http://Kaunokki.kirjastot.fi/sv-FI/). The service is hosted and maintained by Helsinki City Library. The editing process has remained the same, with the editorial board deciding on terminology and structure. The key advantages of the web version are its enhanced searchability and the possibility of keeping users informed about changes to the thesaurus.

Kirjasampo [3], a web fiction retrieval service

The Kirjasampo project is a web fiction retrieval service for readers and librarians, which is being set up to help solve some of the problems of fiction description and retrieval discussed above. It will in particular offer a variety of ways of looking for and recommending ‘good fiction’, ‘good books’ or ‘something similar’. It will be based on two communities: the professional community of librarians, who will catalogue the books, including the ‘long tail’ of literature from the past, using their traditional tools, and the non-professional community of readers, who will discuss, tag and rate the books via the Kirjasampo user interface. As explained below, it will use semantic techniques.

This project is an extensive enterprise involving many parties. A project group in Turku City Library is responsible for the contents. The editorial staff of the libraries.fi web service (http://www.libraries.fi/en-GB/) is responsible for developing the user interface and maintaining the service.
Vaasa City Library is responsible for providing the information about Finnish writers. The Semantic Computing Research Group (SeCo) at Aalto University (see http://www.seco.tkk.fi/) is responsible for the semantic annotation tool SAHA. The project is funded by the Ministry of Education, and will be available later in 2010 at www.kirjasampo.fi.

A model for a fiction search and retrieval system

The concept of the Kirjasampo web service is based on the model for a search and retrieval system described by Saarti (1999a). In 1999, when this model was first developed, it was not possible to build a system that could take account of all the aspects listed. Today, when information technology is so much more sophisticated, the model can be turned into a reality. In essence it remains much as it was: we have neither added nor removed any aspects or topics, but have just applied modern web technologies to make it work.

Figure 1 A model for a fiction search and retrieval system. Source: Saarti (1999a: 194). The figure shows five linked components:

• Catalogues and indexes of fictional works: catalogues; subject indexes; abstracts
• Reception by readers: reception of individual works by different readers; criticism; scientific studies
• Fictional work: digitized works; their intertextual references
• Data about authors: personal history; publication history; manuscripts
• Cultural historical context: history of fiction; cultural history; history of reception

We are not in the business of digitizing books, just providing more information about them. The most important part of Kirjasampo is the semantic database.
This is based on a metadata schema, in which we have specified which elements are important for the system, emphasizing the linking and metadata features, which make it possible to describe fictional works including poems and short stories in several ways, and to link together works that have something in common. The Kirjasampo database, which uses a metadata format similar to the Dublin Core, displays fields for author, title, publisher and so on. Works can be linked to related web-based resources, such as critical reviews and studies. Data is fed into the database using the SAHA annotation editor (http://demo.seco.tkk.fi/smetana/kirjasampo/index.shtml), which is browser-based and available for simultaneous annotation by a number of contributors. The principles of functional cataloguing are followed, with the substance or abstract of each work included only once. Physical manifestations, such as translations and availability of the work in different media and formats, are linked to the abstract.

Figure 2 shows Kirjasampo-SAHA in action, describing the content of a fictional work. The facets, though phrased slightly differently, are the same as in Kaunokki: genre, theme, character, place and time. The language used can be very precise, or take the form of broad ‘aboutness’ terms corresponding to the different sorts of user motivation discussed above. Detailed information can be provided about the main characters: for example, information can be given about the kind of policeman Adam Dalgliesh is. It is also possible to specify which keywords belong together: for example ‘symbolics: stones’. And in the keyword field it is possible for users to introduce their own keywords, if suitable ones cannot be found in the thesaurus. It is also possible to provide a summary describing the work. The book cover can be saved and indexed as text.
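The functional cataloguing principle described above, in which the abstract work is recorded once and each physical manifestation is linked to it, can be pictured with a small data model. This is a minimal sketch under our own assumptions: the class and field names are hypothetical, the matching rule (author plus title, ignoring case) is a simplification chosen for illustration, and none of it reflects the actual Kirjasampo schema or the SAHA editor.

```python
from dataclasses import dataclass, field


@dataclass
class Manifestation:
    """One physical or digital form of a work, e.g. a translation or an audiobook."""
    language: str
    medium: str
    publisher: str


@dataclass
class Work:
    """The abstract work, described once and shared by all its manifestations."""
    author: str
    title: str
    summary: str = ""
    keywords: set = field(default_factory=set)
    manifestations: list = field(default_factory=list)
    related_resources: list = field(default_factory=list)  # e.g. links to reviews


class Catalogue:
    """Stores each work once and attaches manifestations to it."""

    def __init__(self):
        self._works = {}

    def add_manifestation(self, author, title, manifestation):
        # Assumption: author plus title (case-insensitive) identifies a work.
        key = (author.casefold(), title.casefold())
        if key not in self._works:
            self._works[key] = Work(author, title)
        work = self._works[key]
        work.manifestations.append(manifestation)
        return work

    def __len__(self):
        return len(self._works)


# Hypothetical sample data, for illustration only.
catalogue = Catalogue()
catalogue.add_manifestation("Example Author", "Example Title",
                            Manifestation("fi", "print", "Example Publisher"))
work = catalogue.add_manifestation("Example Author", "example title",
                                   Manifestation("sv", "audiobook", "Example Publisher"))
```

Because keywords, summary and related resources hang on the work rather than on each edition, content description done once is available for every translation and format.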
Information can be included such as the fact that a work has received an award, that it is a part of a series, or is the basis for a film script, libretto, stage play or similar. A short story or a poem can be included in its own right, and linked to anthologies or collected works where it can be found.

Figure 2 The appearance of a view of Kirjasampo-SAHA. ‘Novels’ has been picked up from the Kaunokki ontology, and ‘social conflicts’ is to be picked up.

The librarian recommends

As mentioned above, readers are often looking for ‘something like’. They have read a good book and they would like to read other books like it. It is not always easy to determine what exactly it is that appeals to the reader. It may be something to do with the plot, the characters, themes, style or mood of the work, and (for reasons discussed at the beginning of this article) this means that ‘likeness’ is not susceptible to automatic indexing. This is where the talents of the librarian (and the good human indexer) are so important: intuition and sensitivity, familiarity with the literature and strong deductive powers come to the fore. For instance a customer might ask for ‘a satire in Daniil Harms style’. This is a complex question, and the librarian has to process many different aspects simultaneously: who is Daniil Harms, what is unique about his writing, who produces work that is in some way similar? After identifying some suitable writers and books, the librarian can add them to Kirjasampo-SAHA, enabling colleagues to share the results of these efforts. Other librarians, snowball-like, can add more titles and more authors, producing an accumulation of shared knowledge. So the librarian is increasingly in a position to make recommendations.

Semantic tools can also produce recommendations. For example, books that have five of the same index terms can be linked. It will be interesting to see whether librarians and the semantic systems recommend the same books. Human recommendation can take into account aspects that a machine will never consider. On the other hand, artificial intelligence can identify hints about books that may be forgotten by the current generation of librarians. However, it depends of course on the description of books made by people. Artificial intelligence cannot function without index terms given to books, but it can combine those terms in multiple ways.

Authors, readers and contexts

Readers looking for a book often ask questions that combine aspects of the work and its author, asking for example, for ‘a French female writer of detective stories’. A fiction retrieval system must be able to answer this kind of question. Kirjasampo-SAHA offers just this, since it is possible to record an author’s biographical details including nationality and the language they use, and to link them to the relevant literary school and period. Readers can add comments, tag and rate books, and discuss them in the user interface. This means, as with LibraryThing (http://www.librarything.com/), that two different perspectives on the book are on offer: book information (the way that librarians describe the books) and social information, provided by readers.

One of the most interesting aspects dealing with a fictional work in Saarti’s model is its cultural-historical context. It is also the greatest challenge in applying the model, and until now it has been almost impossible to implement. One of the main concepts of the semantic web is to contextualize cultural and social phenomena to show how they affect each other. Content is crucial, it forms the base, but it is equally important to describe the contents by placing them in different contexts and in this way to create new knowledge and provide experiences.
In this sense, fiction is a very interesting field: fictional works link with reality in many ways, as well as with cultural history and other fictional works.

Kaunokki-ontology

The concept of the semantic web is to build a metadata layer that describes the contents on the web with sufficient precision and accuracy for a machine to use, allowing web systems to achieve better interoperability and enabling end users to access more intelligent services (Hyvönen, 2006). (See also, for a fuller discussion of the semantic web, Northedge, 2008.) The key to effective use of the semantic web is the use of ontologies. This helps in many application areas, such as semantic search, information retrieval, semantic linking of contents, and making contents semantically interoperable (Hyvönen et al, 2007). Ciravegna and Petrelli (2006) also explored the role of ontology-based annotation as a means of making document content amenable to automatic searching and indexing.

Ontologization of the Kaunokki thesaurus and maintenance of the ontology are two of the Kirjasampo project tasks. The conversion from thesaurus to ontology (that is, defining the terms and their relationships, and connecting them to the YSO, the General Finnish Ontology) took about five months. (For a detailed description of the thesaurus/ontology conversion, see Ruotsalo et al, 2008.) This done, the indexers had access to an ontology (http://www.yso.fi/onki/kauno) with over 25,000 terms in a single location, and no longer needed to search for terms from two web services. The Kaunokki ontology and thesaurus are similar in content, with the thesaurus facets carried through to the ontology. The hierarchical structure of the ontology (which will be updated pari passu with updating of the thesaurus) makes it easier to notice faults and to spot hierarchical and associative relationships and distinctions between the terms, and is invaluable in developing both thesaurus and ontology.
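To make the role of the hierarchy more concrete, the following is a minimal sketch, assuming an ontology can be reduced to a table of terms and their broader terms. The term labels, book titles and function names are hypothetical and are not taken from Kaunokki, YSO or SAHA. It illustrates two ideas from the text: spotting faults in the hierarchy, and using the hierarchy for a simple form of semantic search in which a query on a broader term also finds books indexed with narrower terms.

```python
# Hypothetical mini-ontology: each term maps to its broader terms.
# Labels are illustrative only and are not taken from Kaunokki.
BROADER = {
    "fiction": [],
    "novels": ["fiction"],
    "detective novels": ["novels"],
    "historical novels": ["novels"],
    "short stories": ["fiction"],
}

# Hypothetical index: book title -> set of index terms.
BOOK_TERMS = {
    "Book A": {"detective novels"},
    "Book B": {"historical novels"},
    "Book C": {"short stories"},
}


def ancestors(term, broader):
    """Return every term reachable by following broader links from term."""
    seen = set()
    stack = list(broader.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, []))
    return seen


def find_faults(broader):
    """Return (dangling, cyclic).

    dangling: (term, broader term) pairs where the broader term is undefined.
    cyclic:   terms that are their own ancestor, i.e. lie on a cycle.
    """
    dangling = sorted(
        (term, parent)
        for term, parents in broader.items()
        for parent in parents
        if parent not in broader
    )
    cyclic = sorted(t for t in broader if t in ancestors(t, broader))
    return dangling, cyclic


def search(term, broader, book_terms):
    """Return books indexed with term or with any term narrower than it."""
    return sorted(
        title
        for title, terms in book_terms.items()
        if any(t == term or term in ancestors(t, broader) for t in terms)
    )
```

With this toy data, a search for the broader term "novels" retrieves both Book A and Book B, although neither is indexed with "novels" itself. A flat list of subject headings could not do this without extra work by the indexer, which is one way to read the claim that the hierarchy benefits both indexing and retrieval.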
The Kaunokki ontology is integrated into SAHA, the autocomplete function helping the indexer to choose the right term (see Figure 2). The ontology is multilingual, with terms in both Finnish and Swedish, and those terms included in the YSO also in English. Indexing terms can be displayed in any of the languages depending on the user interface. SAHA itself is trilingual: the names of the fields and the instructions are in Finnish, Swedish and English. SAHA is an inspiring tool, encouraging and motivating the annotator to seek more information and to save it in the database.

As the user interface and its functionalities develop, it will be possible to explore further how ontologies and other semantic techniques can benefit fiction retrieval. The use of ontologies makes indexing easier, but it does not replace the human thought process of asking questions such as: what is this book about, and how should this work be described? The fundamental question is how far semantic recommendations and semantic linking can help readers to find good books, to increase their choice, and to let them find texts more easily than hitherto.

Challenges for the future

The strength of the network communities and technologies is the active social interaction they promote. But social interaction is not enough in itself. This poses an interesting challenge especially for libraries: how to adapt tools created for printed documents controlled within institutions to the various types of documents published on the web and outside the control of institutions. The challenge has a mirror image: how can databases created within libraries be made available to network communities, and how, if users are given a role in indexing a library’s documents, is total chaos to be avoided? How can such traditional library tools as classification and indexing be combined with modern tools such as ontology-based annotation, tagging and user-evaluation?
Defining the borderlines and the overlap between library work and the work done by the web communities will be an essential part of finding an answer to the challenges.

Many areas remain to be investigated. How, for example, does the social environment impact on the indexing of fiction, and how does access to an index influence a reader’s choice of book? What scope is there for taking further the concept of democratic, user-directed, indexing, already in use in several libraries? The new web-based communities and their ways of organizing and disseminating information offer a living laboratory where it is possible to observe and analyse the evolution of information retrieval tools and actions. And in the specific field of fiction retrieval, the most important next step is to analyse in depth the special information systems already available, including commercial models such as Amazon (see also Adkins and Bossaller, 2007; Arvidsson and Tolstoy, 2005) and systems developed for library environments, together with open systems where the library is one participant with other actors.

Acknowledgements

The authors are grateful to Dr Ewen MacDonald for revising the English in this paper, and to the editor Maureen MacGlashan.

Notes

1 With the publication of S. R. Ranganathan’s Colon classification.
2 See note 1.
3 Kirja = book. Sampo is, in Finnish mythology, a magical artefact that brings good fortune to its holder. It is also a mill which can make things like flour, salt and gold out of thin air.

References

Adkins, D. and Bossaller, J. E. (2007) Fiction access points across computer-mediated book information sources: a comparison of online bookstores, reader advisory databases, and public library catalogs. Library and Information Science Research 29, 354–68.
Arvidsson, S. and Tolstoy, T. (2005) Internetbokhandelns rekommendationssystem: En undersökning av Amazon.coms Similar Items. Borås, Sweden: Högskolan i Borås.
Beghtol, C. (1994) The classification of fiction: the development of a system based on theoretical principles. Metuchen, N.J.: Scarecrow Press.
Beghtol, C. (1997) Stories: applications of narrative discourse analysis to issues in information storage and retrieval. Knowledge Organization 24(2), 64–71.
Bell, H. K. (1992) Should fiction be indexed? The indexability of text. The Indexer 18(2), 83–6.
Bella (1997) Specialtesaurus för skönlitteratur, ed. J. Saarti, trans. M. Rajalin, R. Sandelin and Y. Thölix. Helsinki: BTJ Kirjastopalvelu.
Bella (2004) Specialtesaurus för fiktiv material, ed. Le. Rehnström and J. Saarti. Helsinki: BTJ Kirjastopalvelu.
Ciravegna, F. and Petrelli, D. (2006) Annotating document content: a knowledge-management perspective. The Indexer 25(1), 23–7.
Gadamer, H.-G. (2005) Totuus hengentieteissä. In Hermeneutiikka: ymmärtäminen tieteissä ja filosofiassa, pp. 3–11, trans. I. Nikander. Tampere, Finland: Vastapaino. (Translation of Wahrheit in den Geisteswissenschaften, 1953).
Green, R. (1997) The role of relational structures in indexing for the humanities. Information Services and Use 17(2–3), 85–100.
Hidderley, R. and Rafferty, P. (1997) Democratic indexing: an approach to the retrieval of fiction. Information Services and Use 17(2–3), 101–9.
Hyvönen, E. (2006) FinnONTO – building the basis for a national Semantic Web infrastructure in Finland. Developments in Artificial Intelligence and the Semantic Web – Proceedings of the 12th Finnish AI Conference STeP 2006, October 26–27, 2006. Available at: http://www.seco.tkk.fi/publications/2006/hyvonen-finnonto-building-the-basis-for-a-national-semantic-web-infrastructure-in-finland-2006.pdf
Hyvönen, E., Viljanen, K., Mäkelä, E., Kauppinen, T., Ruotsalo, T., Valkeapää, O., Seppälä, K., Suominen, O., Alm, O., Lindroos, R., Känsälä, T., Henriksson, R., Frosterus, M., Tuominen, J., Sinkkilä, R. and Kurki, J. (2007) Elements of a national semantic web infrastructure – case study Finland on the semantic web. Proceedings of the First International Semantic Computing Conference (IEEE ICSC 2007), Irvine, California, September, 2007, IEEE Press. Available at: www.seco.tkk.fi/publications/2007/hyvonen-et-al-elements-2007.pdf (accessed 18 March 2010).
Johnstone, J. (2010) Poetry and the indexing thereof: the role of the Scottish Poetry Library (SPL). The Indexer 28(1), 2–5.
Kaunokki (1996) Kaunokirjallisuuden asiasanasto, ed. J. Saarti. Helsinki: BTJ Kirjastopalvelu.
Kaunokki (2000) Fiktiivisen aineiston asiasanasto, ed. J. Saarti. Helsinki: BTJ Kirjastopalvelu.
McCutcheon, S. (2009) Keyword vs controlled vocabulary searching: the one with the most tools wins. The Indexer 27(2), 62–5.
Nielsen, H. J. (1997) The nature of fiction and its significance for classification and indexing. Information Services and Use 17(2–3), 171–82.
Noruzi, A. (2007) Editorial. Webology 4(2), 12. Available at: www.webology.ir/2007/v4n2/editorial12.html (accessed 11 January 2009).
Northedge, R. (2008) The medium is not the message: topic maps and the separation of presentation and content in indexes. The Indexer 26(2), 60–4.
Pejtersen, A. M. and Austin, J. (1983) Fiction retrieval: experimental design and evaluation of a search system based on users’ value criteria: part 1. Journal of Documentation 39(4), 230–46.
Rich, E. (1979) User modeling via stereotypes. Cognitive Science 3, 329–54.
Rich, E. (1983) Users are individuals: individualizing user models. International Journal of Man–Machine Studies 18, 199–214.
Ruotsalo, T., Seppälä, K., Viljanen, K., Mäkelä, E., Kurki, J., Alm, O., Kauppinen, T., Tuominen, J., Frosterus, M., Sinkkilä, R. and Hyvönen, E. (2008) Ontology-based approach for interoperability of digital collections. Signum 5, 5–13. Available at: http://pro.tsv.fi/stks/signum/ (accessed 18 March 2010).
Saarti, J. (1997) Feeding with the spoon, or the effects of shelf classification of fiction on the loaning of fiction. Information Services and Use 17(2–3), 159–69.
Saarti, J. (1999a) Kaunokirjallisuuden sisällönkuvailun aspektit: kirjastoammattilaisten ja kirjastonkäyttäjien tekemien romaanien tiivistelmien ja asiasanoitusten yhdenmukaisuus. Acta Universitatis Ouluensis. B, Humaniora, 33. Oulu, Finland: Oulun yliopisto. Available at: http://herkules.oulu.fi/isbn9514254767 (accessed 18 March 2010).
Saarti, J. (1999b) Fiction indexing and the development of fiction thesauri. Journal of Librarianship and Information Science 31(2), 85–92.
Saarti, J. (2002a) The analysis of the information process of fiction: a holistic approach to information processing, pp. 74–9 in M. J. López-Huertas (ed.), Challenges in Knowledge Representation and Organization for the 21st Century: Integration of Knowledge across Boundaries, Proceedings of the Seventh International ISKO Conference 10–13 July 2002, Granada, Spain. Advances in Knowledge Organization, vol. 8. Würzburg, Germany: Ergon.
Saarti, J. (2002b) Consistency of subject indexing of novels by public library professionals and patrons. Journal of Documentation 58(1), 49–65.
Saracevic, T. (1975) Relevance: a review of and a framework for thinking on the notion in information science. Journal of the American Society for Information Science 26(6), 321–43.
Saracevic, T. (1996) Relevance reconsidered. In: Information science: Integration in perspectives, pp. 201–18 in Proceedings of the Second Conference on Conceptions of Library and Information Science, Copenhagen.
Shatford, S. (1986) Analyzing the subject of a picture: a theoretical approach. Cataloging and Classification Quarterly 6(3), 39–62.
Spiteri, L. F. (2007) Structure and form of folksonomy tags: the road to the public library catalogue. Webology 4(2), article 41. Available at: http://www.webology.ir/2007/v4n2/a41.html (accessed 11 January 2009).
Wikipedia (2009) Folksonomy, Wikipedia Finland. Available at: http://en.wikipedia.org/wiki/Folksonomy (accessed 11 January 2009).

Jarmo Saarti works at the University of Eastern Finland Library, P.O. Box 1627, FIN-70211 Kuopio, Finland. Email: jarmo.saarti@uef.fi

Kaisa Hypén is based at Turku City Library, Linnankatu 2, FIN-20100 Turku, Finland. Email: [email protected]

The Indexer Vol. 28 No. 2 June 2010