Information Processing and Management 37 (2001) 147–161
www.elsevier.com/locate/infoproman

Aspects of Swedish morphology and semantics from the perspective of mono- and cross-language information retrieval

Turid Hedlund*, Ari Pirkola, Kalervo Järvelin

Department of Information Studies, University of Tampere, PO Box 607, FIN-33101, Tampere, Finland

Received 21 January 2000; accepted 12 April 2000

Abstract

This paper analyzes the features of the Swedish language from the viewpoint of mono- and cross-language information retrieval (CLIR). The study was motivated by the fact that Swedish is poorly known from the IR perspective. This paper shows that Swedish has unique features, in particular gender features, the use of fogemorphemes in the formation of compound words, and a high frequency of homographic words. Especially in dictionary-based CLIR, correct word normalization and compound splitting are essential. It was shown in this study, however, that publicly available morphological analysis tools used for normalization and compound splitting have pitfalls that might decrease the effectiveness of IR and CLIR. A comparative study was performed to test the degree of lexical ambiguity in Swedish, Finnish and English. The results suggest that part-of-speech tagging might be useful in Swedish IR due to the high frequency of homographic words. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: Text retrieval; Cross-language information retrieval; Swedish language; Natural language processing

* Corresponding author: Swedish School of Economics and Business Administration Library, PO Box 479, FIN-00101 Helsinki, Finland. Tel.: +358-9-43133378; fax: +358-9-43133425. E-mail address: turid.hedlund@shh.fi (T. Hedlund).
0306-4573/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0306-4573(00)00024-8

1. Introduction

Our society depends on written communication, which today is often created and stored in digital form. We have an increasing amount of full text material in different languages available through the Internet and other information suppliers. Information retrieval (IR) from operational full text databases, or text retrieval, is characterized by the lack of controlled vocabularies. A relatively new research area is cross-language information retrieval (CLIR), the process of selecting and ranking documents in a language different from the query language.

Text retrieval involves the use of natural language in the form of textual descriptions of a topic. The variation in word forms and homographic and polysemous expressions causes disambiguation problems in text retrieval, and the effect increases in CLIR. We can express the same information content in many different ways depending on who the writer is and for whom or for what purpose a text is written.

Full text retrieval as a response to a query in natural language involves methods developed in computational linguistics (Grishman, 1986) and natural language processing (NLP) (Strzalkowski, 1994, 1995). Increasing recall and precision depends on these applications to match non-identical expressions belonging to the same concept. Computational linguistics offers research in language analysis and knowledge representation (syntactic and semantic knowledge), and NLP applications rely strongly on statistical text analysis and machine readable texts and dictionaries (Boguraev & Briscoe, 1989; Guthrie, Pustejovsky, Wilks & Slator, 1996). Analytic tools relevant for IR purposes include morphological analysis programs, part-of-speech taggers, syntactic parsers, and truncation algorithms (stemmers) (Harman, 1991; Karlsson, 1992; Koskenniemi, 1983; Porter, 1980).

Because English has been the main language for IR system development, much research on IR and NLP involves English.
However, IR systems for small languages like Finnish and Swedish and other Nordic languages cannot be developed properly without studying their special features. Although Spanish and Chinese have been given special tracks in the TREC Conferences¹ and even Finnish (Alkula & Honkela, 1992) has been explored, the results cannot be applied to linguistically quite different languages.

¹ The fourth, fifth and sixth Text REtrieval Conferences, 1996–1998. URL: http://trec.nist.gov/

This paper analyzes Swedish as a document and query language for IR. Swedish is spoken as a native language by eight to nine million people, mainly in Sweden but also by a minority population in Finland. However, due to the close relationships between the Nordic countries and the other Scandinavian languages, the number of people who speak Swedish and can understand it is much larger (Teleman, Hellberg & Andersson, 1999). Approximately 20 million people have a basic knowledge of Swedish.

Since the Scandinavian languages Swedish, Danish and Norwegian are close, results of scientific research on one language can help to create a deeper understanding of the same phenomena in the other Scandinavian languages (Teleman et al., 1999). Thus, the results of this study can be carefully generalized to all three languages. So far we can report very few analyses of the Scandinavian languages concerning IR, only a small Norwegian study (Fjeldvig & Golden, 1988).

Swedish and the Scandinavian languages have unique and novel features which affect IR. We will identify and present the features of the Swedish language that are relevant to IR on the morphological, syntactic and semantic levels. The problem areas studied are (a) morphological features such as inflection, derivation, gender and compound words, and (b) semantic features such as homonymy, polysemy and hyponymy.
The viewpoints for the analysis are full text IR, database indexing, query formulation and CLIR. Publicly available tools to overcome some of the problems have some pitfalls, which we analyze. We believe that these questions deserve serious research efforts.

The rest of this paper is organized as follows. Section 2 summarizes the linguistic problems of IR. Section 3 considers IR and natural language processing. In Section 4 the main linguistic features of Swedish are analyzed from the perspective of IR. In Section 5 the use of NLP tools in Swedish IR and CLIR is discussed. Section 6 presents concluding remarks.

2. Linguistic problems in IR

In the ideal case, a search would return all the relevant documents of a database, ranked in order of descending relevance, and none of the irrelevant documents. In a typical case, however, search results include many irrelevant documents, and many of the relevant documents are not retrieved. This is primarily caused by linguistic phenomena. The basic linguistic problems of IR can be classified as follows: (1) the selection of alternative concepts and search keys, (2) the morphological variation of search keys, (3) referred and omitted search keys, (4) search key ambiguity, and (5) multilinguality.

The first problem, the selection of alternative concepts and search keys, is related to the fact that different documents which consider the same topic may use, and often do use, different concepts and expressions. Typically only part of the relevant documents are retrieved, as users are not able to produce an exhaustive set of alternative concepts for the concepts of requests and are not able to expand the queries using synonyms. The problem of query expansion has been investigated in several studies (Efthimiadis, 1996; Kekäläinen, 1999). In IR terms, the search concepts which describe a given aspect of a request, and which thus are alternative concepts, are either in hierarchical or associative relations to each other.
In linguistic terms, these relationships include hyponymy, a hierarchical subordinate relation (animal – horse), meronymy, a relationship between the whole and its parts (hand – finger), and other sense relations, such as antonymy (hot – cold) (for associative relations). In IR terms, the search keys which represent a given concept are equivalents. In linguistic terms, equivalence covers synonymy (Information Retrieval – IR).

The second problem, the morphological variation of search keys, is associated with the matching process. In morphologically simple languages, such as English, word form variation can be handled by quite simple means, e.g., for nouns, by recognizing the plural forms (-s) (Harman, 1991; Porter, 1980). However, several European languages are morphologically much more complex (e.g., German, French, Italian and the Scandinavian languages). Documents are not retrieved if the search key and its occurrence in a database index (the index term) are not identical in form. Thus a search key given in a base form does not match the inflected forms of the key. The same problem also concerns derivationally related keys. Moreover, in a database index, search key occurrences may be embedded in compound words. All these forms may, however, reflect the information need behind the query. In this study a compound is defined as a word formed by two or more components that are spelled together. The term phrase is used for the case where the constituents are spelled separately.

The third problem, referred and omitted search keys, covers anaphoric and elliptical keys. In the sentences "Where is Mary? She is in the garden.", the pronoun she and the noun Mary are coreferential, and the pronoun is an anaphor. An example of an omitted search key, or ellipsis, is found in the sentences "I have a dog. You have one too." In certain cases in proximity searching a considerable improvement in retrieval performance is achieved if anaphors and ellipses are resolved (Pirkola & Järvelin, 1996a, 1996b).

The fourth problem, search key ambiguity, covers polysemy, where a word has more than one sense or subsense (star: a star in the sky, or a famous person), and homonymy, where different words have the same form (bank: a financial institution, or the side of a river). Due to the ambiguity of search keys, the intended senses of search keys are not always present in retrieved documents. In other words, despite successful matching, irrelevant documents are retrieved.

The fifth problem, multilinguality, concerns multilingual and monolingual document collections from which documents can be retrieved using more than one language. Cross-language retrieval is concerned in particular with the issues associated with the cross-lingual equivalence of words (Ballesteros & Croft, 1997; Davis & Ogden, 1997; Hull, 1998; Oard & Dorr, 1996; Pirkola, 1998).

3. IR and natural language processing

Natural language processing involves linguistic methods, and the analysis can take place on different levels of a language, i.e., the morphological, syntactic and semantic levels. On the morphological level the structure of words is analyzed. Recognizing different word forms as variants of the same basic word affects both indexing (word weights) and retrieval (matching). Syntactic analysis determines the structure of phrases and sentences, while semantic analysis investigates the meaning or sense of words and sentences.

Commonly used methods in document indexing are word form normalization (Koskenniemi, 1983; Pirkola, 1999) and stemming (Harman, 1991). A stemmer removes affixes from the word forms, and the output is a common root, not necessarily a real word. Normalization is a similar type of process, but in this case the output is the base form, a real word.
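As a rough illustration of the difference between stemming and normalization, the sketch below contrasts a naive suffix stripper with a lexicon lookup. The suffix list, the lookup table and the function names are hypothetical and chosen only for this example (the word forms are inflected forms of the Swedish nouns flicka and bok); they do not describe any particular stemmer or morphological analyzer.

```python
# Hypothetical toy data: a few suffixes and a tiny form-to-base-form table.
SUFFIXES = ("orna", "erna", "or", "er", "en", "an", "a")
BASE_FORMS = {
    "flickan": "flicka",
    "flickor": "flicka",
    "flickorna": "flicka",
    "boken": "bok",
    "böcker": "bok",
}


def stem(word: str) -> str:
    """Strip the longest matching suffix; the result need not be a real word."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(word: str) -> str:
    """Map a word form to its base form by lookup; unknown forms are returned unchanged."""
    return BASE_FORMS.get(word, word)


if __name__ == "__main__":
    for form in ("flicka", "flickan", "flickor", "böcker", "bok"):
        print(form, stem(form), normalize(form))
```

In this toy setting the stemmer conflates the forms of flicka into the non-word flick but keeps böcker and bok apart, whereas the lookup returns a real base form for both nouns.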
Stemming and normalization may yield three kinds of benefits (Harman, 1991; Järvelin, 1995; Alkula & Honkela, 1992). (1) A user does not need to worry about truncation and inflection, because different forms of the key are automatically conflated into the same form. (2) Stemming and normalization result in storage savings. (3) Stemming and normalization may improve retrieval performance, especially recall, since a larger number of potentially relevant documents are retrieved. However, no significant improvement in performance was found by Harman (1991) in her experiment with simple stemmers for the English language. For inflectionally more complex languages the results are not necessarily the same.

Swedish is rich in compound words, and is thus confronted with the problem of embedded search keys. Splitting the compounds into their components allows the use of the component words as separate search keys. For instance, the decomposition of the compound hustak (roof of the house) gives the expansion keys hus (house) and tak (roof). If the compound hustak was truncated and used as a key, documents including the word tak would not be found.

In IR, part-of-speech (POS) tagging may be used to identify the central words (word classes, especially nouns) and phrases of a sentence. In CLIR, part-of-speech tagging is useful in matching the source language keys with translation dictionary entry words.

A syntactic parser (program) determines the structure of a sentence according to a particular grammar (Grishman, 1986). The parsing procedure may involve the assignment of a tree structure to the input sentence. Linguistic transformations, such as transforming active sentences into passive ones, can be of potential value for IR. Syntactic analysis can be used as a basis for further analysis, e.g., anaphor resolution.
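The decomposition of hustak into hus and tak can be sketched as a lexicon-driven split. This is a minimal sketch assuming a small hypothetical lexicon and a hypothetical helper named split_compound; it ignores joining elements and inflection, both of which a real analyzer would have to handle.

```python
from typing import List, Optional, Set

# Hypothetical toy lexicon of base forms.
LEXICON: Set[str] = {"hus", "tak"}


def split_compound(word: str, lexicon: Set[str], min_len: int = 2) -> Optional[List[str]]:
    """Return the first split of `word` into lexicon words, or None if there is none."""
    if word in lexicon:
        return [word]
    # Try every split point that leaves at least `min_len` characters on each side.
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None


def expansion_keys(word: str, lexicon: Set[str]) -> List[str]:
    """Return the original key followed by its components, if it can be split."""
    parts = split_compound(word, lexicon)
    if parts is None or parts == [word]:
        return [word]
    return [word] + parts


if __name__ == "__main__":
    print(expansion_keys("hustak", LEXICON))
```

Keeping the full compound alongside its components reflects the idea that the components serve as additional expansion keys rather than as replacements for the original key.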
Word sense disambiguation is an NLP method which aims at finding the correct senses for word occurrences. It has been studied intensively in IR and other fields. The methods used in the studies include dictionaries (Dagan, Itai & Schwall, 1991; Guthrie, Guthrie, Wilks & Aidinejad, 1991; Krovetz & Croft, 1989), knowledge bases (Hirst, 1987), statistical methods (Brown, Della Pietra, Della Pietra & Mercer, 1991; Schütze & Pedersen, 1995), multiple knowledge sources (McRoy, 1992), thesauri (Vorhees, 1993) and pseudo-words (Sanderson, 1994, 1997). Most IR studies have reported no or only slight improvements in retrieval performance due to word sense disambiguation (Krovetz & Croft, 1992; Sanderson, 1994, 1997; Vorhees, 1993).

4. Linguistic features of Swedish from IR perspective

In Swedish we can identify the following features that are likely to affect information retrieval: (1) a fairly rich morphology, (2) gender features (gender uter and gender neuter²), (3) a high frequency of compound and derivative word forms, (4) a lower frequency of common noun phrases than, for example, in English, and (5) a high frequency of homographic word forms.

4.1. Morphology

Swedish inflectional and derivational morphology is more complex than that of English. Nouns can be divided into five declension types according to the plural suffixes they take, i.e., -or, -ar, -er (-r), -n, and ∅ (no suffix). Genitive forms are formed by the suffix -s. Definite forms of nouns are expressed using the articles den, det (sing.), de (pl.), and the definite suffixes -en (-n) (gender uter) and -et (-t) (gender neuter). Due to the fairly rich inflectional morphology, nouns have several inflectional forms and even the stems change (umlaut) (Table 1). Therefore simple indexing and matching methods are insufficient for Swedish IR. Inflection interferes with the weighting of index terms, and in matching, documents are lost because of inflected words. Adjectives are congruent with their headwords in gender and number (Table 2).
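The suffix rules above can be illustrated by generating the inflected forms of a noun, for instance as expansion keys for matching. The sketch below is a simplification under stated assumptions: the function name is hypothetical, the base form, definite singular and plural are supplied by the caller (as if taken from a lexicon, so stem changes are not computed), only plurals ending in -r are covered, and the definite plural ending -na is assumed for those plurals.

```python
from typing import List


def noun_forms(base: str, definite_sg: str, plural: str) -> List[str]:
    """Return eight forms: four nominative forms followed by their genitives in -s."""
    if not plural.endswith("r"):
        # Other declension types (plural in -n or without suffix) are not handled here.
        raise ValueError("this sketch covers only plurals in -or, -ar and -er")
    definite_pl = plural + "na"  # assumption: -na after a plural ending in -r
    nominative = [base, definite_sg, plural, definite_pl]
    genitive = [form + "s" for form in nominative]  # genitive suffix -s
    return nominative + genitive


if __name__ == "__main__":
    print(noun_forms("bok", "boken", "böcker"))
    print(noun_forms("flicka", "flickan", "flickor"))
```

Because the plural is passed in rather than derived, the umlaut in bok/böcker poses no problem here; deriving it automatically would require lexical knowledge that this sketch deliberately leaves out.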
Irregularities to the general inflectional rules exist in nouns as well as in adjectives and verbs. In Swedish, strong verbs have vowel change. For a comprehensive description of inflection we refer the reader to books on Swedish morphology, e.g., Hellberg (1978), Kiefer (1970), Malmgren (1994), Teleman et al. (1999). For the problem of inflectional forms, morphological analysis programs should be used for word normalization at the indexing and searching stages. We shall consider morphological analyzers and word form normalization in Section 5.

² Gender is an inalienable property of a Swedish noun. A noun is defined as gender 'uter' or gender 'neuter' according to the inflectional suffix it takes in the definite form singular (uter: -en, -n, neuter: -et, -t). Homonyms often have different gender, for example: plan-en (plan), plan-et (plane).

Table 1
Examples of inflected noun forms

Base form         Definite form sing.   Plural            Definite form pl.
Flicka (girl)     Flicka-n              Flick-or          Flickor-na
Bro (bridge)      Bro-n                 Bro-ar            Broar-na
Bok (book)        Bok-en                Böck-er           Böcker-na

Genitive sing.    Genitive def. sing.   Genitive plural   Genitive def. pl.
Flicka-s          Flickan-s             Flickor-s         Flickorna-s
Bro-s             Bron-s                Broar-s           Broarna-s
Bok-s             Boken-s               Böcker-s          Böckerna-s

Table 2
Examples of adjective inflection

Indef. form sing.                       Definite form sing.                 Indef. form plural            Definite form pl.
en vit hare (uter) (a white rabbit)     den vita haren (the white rabbit)   vita harar (white rabbits)    de vita hararna (the white rabbits)
ett vitt hus (neuter) (a white house)   det vita huset (the white house)    vita hus (white houses)       de vita husen (the white houses)

4.2. Gender

Swedish nouns have two grammatical genders, gender uter and gender neuter (Teleman et al., 1999). The separation of gender in Swedish has a positive effect on the effectiveness of anaphor resolution algorithms, as in the algorithm for pronoun resolution in Swedish constructed by Fraurud (1988). Gender-based resolution relies on the fact that the pronouns den (uter) and det (neuter) (both meaning it) should be in agreement with their correlates. Homographic nouns often possess different genders (Teleman et al., 1999), thus the use of gender is helpful also in the resolution of homographs.

4.3. Derivatives

New words are formed by derivation and compounding. Derivatives and their stem words may belong to different part-of-speech categories. The possibility to change the part-of-speech class of a word by derivation enables, for example, the derivation of a noun (an actor or an action) from a verb. Such words are in many cases lexicalized. Words belonging to different part-of-speech categories have different derivational suffixes, and the inflection of derived and underived words is similar. Examples of noun suffixes are -are, -het and -ning, e.g., målare (painter), skönhet (beauty) and förkortning (abbreviation). Examples of adjective suffixes are -bar and -aktig, e.g., mätbar (measurable) and livaktig (vivid). The suffixes -era and -na, e.g., spionera (spy), vakna (wake up), are examples of verb suffixes (Malmgren, 1994).

Derivation is a common method to form antonyms. They are often formed by prefixes, i.e., o-, a-, in-, il-, im-, ir-, e.g., lycklig/olycklig (happy/unhappy), symmetrisk/asymmetrisk, tolerant/intolerant, legal/illegal, produktiv/improduktiv, rationell/irrationell. Derivational prefixes are, as in the examples above, in many cases of foreign origin (Malmgren, 1994).
Derivatives can enter into compounds as a stem, and the rules for derivation must precede those of compounding, for example:

lära – lärare – lärarförbundet (teach – teacher – teacher association)
fängsla – fängelse – fängelsestraff (arrest – prison – penalty of imprisonment)
antik – antikvitet – antikvitetsaffär (antique – antiquity – antique shop)

In a normalization process for IR, derivatives may be related back to the original word/stem. This increases ambiguity and sometimes makes it impossible to find the right translation alternative for CLIR, because in many cases the derived forms are lexicalized and have separate entries in a translation dictionary.

4.4. Compounds and phrases

Swedish is characterized by a high frequency of compound words. Phrases are not as common as in English. Compounds are easier to deal with than phrases, because there is no need for identification. On the other hand, for effective IR and CLIR, compounds need to be decomposed. Especially in CLIR this is often necessary, because for many compounds dictionaries include only the components of compounds.

A typical feature of Swedish is also the use of fogemorphemes in compound word formation. Fogemorphemes are elements used to join the constituents of compounds. The fogemorpheme types are as follows:

(omission) – flicknamn (maiden name)
-s – rättsfall (legal case)
-e – flickebarn (female child)
-a – gästabud (feast, banquet)
-u – gatubelysning (street lighting)
-o – människokärlek (love of mankind)

Sometimes the component preceding the fogemorpheme is a stem, e.g., gat(a), sometimes a base form, e.g., rätt. Proper weighting of index terms and effective matching require that fogemorphemes can be handled and the correct base forms of component words can be identified.

Many nominal compounds are lexicalized so that it is not always possible to derive their meaning from the meanings of the component words, e.g., jordgubbe (strawberry). In many compound words, however, the last component is a hyperonym of the full compound and is thus a valuable key (Pirkola, 1999). For example, compound words denoting different types of schools have the component skola (school) as their last component. In IR the decomposition of compounds would often be helpful in cases like this. The last component may also be used as an ellipsis later in the text.

Particle verbs (compound verbs where one component is a particle) are special forms of compounds. There are four types of particle verbs: always tightly compounded, e.g., påminna (to remind); always loosely compounded (except in the past participle), e.g., tycka om/omtyckt (to like); both loosely and tightly compounded with no change in sense, e.g., omtala/tala om (speak about); and both loosely and tightly compounded with a change in sense, e.g., avbryta/bryta av (interrupt/break) (Malmgren, 1994). From the IR point of view, some particle verbs are similar to nominal compounds and some are similar to nominal phrases. Tightly compounded particle verbs are mostly lexicalized, but with the other forms we can see the benefits of identifying phrases and the benefits of splitting compounds into components.

4.5. Homonymy and polysemy

Homonymy and polysemy are two different forms of lexical ambiguity. Homonyms are different lexemes spelled similarly. A polysemous word is a word that has several subsenses which are related to one another (Teleman et al., 1999). For example, the word stjärna (star) may denote a famous person or a star in the sky. Due to lexical ambiguity, the intended senses of keys are not always reflected in the retrieval results, and irrelevant documents are retrieved.
In CLIR the negative effects of lexical ambiguity on retrieval performance are often more severe than in monolingual retrieval, because in CLIR both source and target language keys may be ambiguous.

Homographs are very common in Swedish. According to Karlsson (1994), around 65% of Swedish words in running text are homographs. In Finnish the percentage is only 15% and in English around 50%. For example, the Swedish form för can be used as a preposition, verb, noun, and adverb. Since Swedish is a language rich in homographic word forms, it is possible that the disambiguation of homonyms, or the analysis of the meaning of a word from its context, would have a positive effect on retrieval results, contrary to the findings for English, e.g., Sanderson (1994).

5. Natural language analysis programs for Swedish

5.1. Normalization

SWETWOL is a morphological analyzer developed to analyze Swedish words (Karlsson, 1992). The program incorporates a full description of Swedish morphology and assigns morphological descriptions to word forms. Normalization returns all the derivational and inflectional forms of a word to the base form by recognizing/deleting inflectional and derivational affixes.

Let us consider some examples of Swedish word normalization using SWETWOL (web version, used in October 1999). Table 3 shows that most sample words are correctly normalized. The noun behandling (treatment) is normalized to the verb behandla (to treat). In CLIR this may be a problem, because dictionaries typically give nouns and verbs as separate entries. It would benefit the translation if the normalization output gave both forms. In monolingual IR this problem is less severe, since both document and query words are treated similarly.

Table 3
Normalization for Swedish

Input word                                  Normalization to base form                 SWETWOL analysis
avgöras                                     avgöra (decide)                            V PASS INF (verb, passive, infinitive)
utsläpp                                     utsläppa (dumping)                         V ACT IMP (verb, active, imperative)
                                            utsläpp (outlet)                           N NEU INDEF SG/PL NOM (noun, neuter, indefinite, sg./pl. nominative)
behandling (treatment, usage, management)   behandla (treat, deal with)                VDER/-ning N UTR INDEF SG NOM (verb derivation/-ning, noun, uter, indefinite, sg. nominative)
väsentligt (essentially, mainly)            väsentlig (essential, fundamental, main)   A NEU INDEF SG NOM (adjective, neuter, sg. nominative)
operation (operation)                       operation (operation)                      N UTR INDEF SG NOM (noun, uter, indefinite, sg. nominative)

5.2. Compound splitting

Table 4 presents some examples of compound splitting by SWETWOL. In the first example, skogsindustrin (the forest industry), the components skog (forest) and industrin (the industry) are joined with the fogemorpheme "s". The latter component industrin is normalized to the base form industri, which is a hyperonym of skogsindustri and therefore often a valuable search key. The former component skogs has retained the "s", and is not normalized to the base form skog. In monolingual IR this has the effect that the matching process only concerns the genitive form skogs and compounds with the first component skogs. Compounds with skog- (the fogemorpheme "s" not used), like skogbevuxen (wooded) and skoglös (treeless), are not found. For CLIR this is even more problematic, since almost every word has several translation alternatives.

If the compound word kärnkraftverk (nuclear power plant) (example 2 in Table 4) is found in the dictionary, we will get an unambiguous translation. If the compound is split, the last component is normalized to the base form but the other components are retained in the form in which they occur in the compound word. In this case we have an example of omission (see Section 4.4 on fogemorphemes), where kärna is transformed to kärn.

For the word toppmötet (meeting, summit) there are two outputs: topp and möte, and topp, mö and te (example 3 in Table 4). The latter case provides an example of a semantically false decomposition.

Table 4
Examples of compound splitting in Swedish

Input word: skogsindustrin (the forest industry)
  Compound splitting: skogs#industri (forest # industry)
  SWETWOL analysis: <N> # N UTR DEF SG NOM (noun # noun, uter, definite, sg. nominative)

Input word: kärnkraftverken (the nuclear power plants)
  Compound splitting: kärn#kraft#verk (kärn-a = kernel, seed, grain, nuclear etc.; kraft = force, strength, etc.; verk = work(s), labor, plant, mill etc.)
  SWETWOL analysis: <N> # <N> # N NEU DEF PL NOM (noun # noun # noun, neuter, definite, pl. nominative)

Input word: toppmöte (summit, (top-level) meeting)
  Compound splitting (1): topp#möte (topp = done!, agreed!, it's a bargain!, top, crest, summit, pinnacle, peak, apex; möte = meeting, encounter, appointment, date, gathering, assembly, conference)
  SWETWOL analysis (1): <N> # N NEU DEF SG NOM (noun # noun, neuter, def. sg. nominative)
  Compound splitting (2): topp#mö#te (mö = maid, maiden; te = tea)
  SWETWOL analysis (2): <N> # <N> # N NEU DEF SG NOM (noun # noun # noun, neuter, def. sg. nominative)

Table 5
Average number of homographic base form source words in dictionaries

                      Finnish   English   Swedish
Nouns                 0         0.05      0.25
Verbs                 0         0.15      0.4
Adjectives+Adverbs    0         0         0
Average all POS       0         0.07      0.22

5.3. Lexical ambiguity

Homograph disambiguation might be useful in Swedish IR due to the high frequency of homographic expressions in Swedish. We performed a test in which 35 Finnish test requests used in IR and CLIR research were translated into English and Swedish.
We picked out 60 equivalent Finnish, English and Swedish words representing different part-of-speech classes, and studied the degree of ambiguity in the three languages. We used off-the-shelf dictionaries³ to count the number of homographic word forms and translation alternatives. By translation alternatives we mean the alternative translations (including polysemy) and translation examples given for a word in dictionaries similar to those used in CLIR research. The test with homographic base form words in Table 5 shows an average of 0.22 homographic expressions for Swedish, 0.07 for English and no homographic expressions in the Finnish test sample. The number of translation alternatives for the same test sample is shown in Table 6. The number of translation alternatives depends on the dictionary used, in this case standard dictionaries of approximately the same size. Similar translation tools would have to be used for CLIR.

³ Suomi-ruotsi suursanakirja [Finnish–Swedish comprehensive dictionary] / Knut Cannelin, Lauri Hirvensalo, Nils Hedlund. 3rd ed. Porvoo: WSOY, 1976 (150,000 words). Ruotsi-suomi-suursanakirja [Swedish–Finnish comprehensive dictionary] / Lea Lampén. Porvoo: WSOY, 1973 (70,000 words). Norstedts engelsk-svenska ordbok [Norstedt's English–Swedish dictionary] / Britt-Marie Berlund ... [et al.]. Stockholm: Norstedts, 1983 (60,000 words and phrases).

Table 6
Average number of translation alternatives for base form words

                      Fin–Swe   Eng–Swe   Swe–Fin
Nouns                 20.8      8.1       8.7
Verbs                 43.9      36.4      19.3
Adjectives+Adverbs    5.7       7         5
Average all POS       23.4      17        11.0

To test the effect of normalization we used the same test sample, this time using the inflected word forms of the original test requests. The TWOL⁴ morphological analyzers were used in the experiments. In this case (Table 7), we can see a clear difference between the languages.
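The comparison between base forms and normalized inflected forms can be expressed as a relative increase in the average number of translation alternatives. The sketch below recomputes such percentages from the noun, verb and adjective+adverb averages reported in Tables 6 and 7 for the Finnish–Swedish and Swedish–Finnish dictionaries. The function and variable names are hypothetical, and because the inputs are rounded table values, recomputed figures may differ slightly from published ones. Averaging the three per-class percentages is an assumption about how the "Average all POS" percentage is formed.

```python
from typing import Dict

# Average number of translation alternatives, as reported in Table 6 (base forms)
# and Table 7 (inflected forms after morphological analysis).
BASE: Dict[str, Dict[str, float]] = {
    "Fin-Swe": {"nouns": 20.8, "verbs": 43.9, "adj_adv": 5.7},
    "Swe-Fin": {"nouns": 8.7, "verbs": 19.3, "adj_adv": 5.0},
}
NORMALIZED: Dict[str, Dict[str, float]] = {
    "Fin-Swe": {"nouns": 22.9, "verbs": 50.5, "adj_adv": 13.9},
    "Swe-Fin": {"nouns": 33.6, "verbs": 34.7, "adj_adv": 18.1},
}


def percent_increase(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def increases(pair: str) -> Dict[str, float]:
    """Per-class percentage increases for one dictionary pair, rounded to one decimal."""
    return {
        pos: round(percent_increase(BASE[pair][pos], NORMALIZED[pair][pos]), 1)
        for pos in BASE[pair]
    }


def mean_increase(pair: str) -> float:
    """Mean of the unrounded per-class increases (assumed definition of the overall figure)."""
    values = [percent_increase(BASE[pair][p], NORMALIZED[pair][p]) for p in BASE[pair]]
    return round(sum(values) / len(values), 1)


if __name__ == "__main__":
    for pair in BASE:
        print(pair, increases(pair), mean_increase(pair))
```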
For Finnish nouns we can see a small increase in translation alternatives, and a larger one for verbs and adjectives+adverbs. English shows practically no difference, but in Swedish we have a considerable increase in translation alternatives for all part-of-speech classes.

Table 7
Average number of translation alternatives after normalization
Average number of translation alternatives of inflected word forms after morphological analysis, and the increase in percent relative to words in base form (Table 6)

             Nouns   Verbs   Adjectives+Adverbs   Average all POS
Fin–Swe       22.9    50.5                 13.9              29.1
% increase    10.1    15.0                143.9              56.3
Eng–Swe        8.1    36.4                  8.9              17.8
% increase     0       0                   29.0               9.7
Swe–Fin       33.6    34.7                 18.1              28.8
% increase   286.2    79.8                262.0             209.3

⁴ SWETWOL, Copyright © Lingsoft Ab 1995. http://www.lingsoft.fi/cgi-pub/swetwol
ENGTWOL, Copyright © Atro Voutilainen, Juha Heikkilä, Timo Järvinen and Lingsoft, Inc. 1993–1995. http://www.lingsoft.fi/cgi-pub/engtwol
FINTWOL, Copyright © Kimmo Koskenniemi & Lingsoft Oy 1995–1997. http://www.lingsoft.fi/cgi-pub/fintwol

Caution is needed about the prospects of word sense disambiguation strategies and the hope that they will significantly improve retrieval performance (Lewis & Sparck Jones, 1996). Here, too, we have no research results for Swedish.

One method to disambiguate word senses is the use of a concordance, where the context of the word is listed along with the word. Homonymous words can be disambiguated by establishing their co-occurrence or collocability. For example, in Swedish, as in English, the word "bank" means both financial institution and reef or sand bank. Bank in the sense of financial institution probably occurs with words like "lån" (credit), "pengar" (money), "ekonomi" (economy), while the word "bank" (reef) occurs with "jord" (earth), "flod" (river) etc.

Consider one of our test requests:
"Skuldkrisen i Sydamerika. Hur har skuldsättningsproblemet utvecklats?"
"South American debt crisis. How has the debt problem developed?"
From a dictionary we can establish the following meanings for the Swedish words skuld, kris and skuldsättning:
skuld – debt, amount due, liabilities, fault, blame, guilt
kris – crisis, depression
skuldsättning – getting into debt

If we can establish that the words skuld and skuldsättning usually co-occur, we can establish that the right translation would be "debt crisis", not "guilt crisis" or "guilt depression". However, the method of using collocating words for disambiguation is not easy to implement in operational IR systems, especially if we only have a relatively short query to work with. This is the case for dictionary-based methods in CLIR.

Part-of-speech tagging might be useful in Swedish IR because of the high frequency of homographic word forms. The words in the query and in the documents are part-of-speech marked, and in the matching process the part-of-speech marks for the search keys and the document index terms are taken into account. With this method, for example, the preposition för (because) could be separated from the verb för(a) (bring). In CLIR, part-of-speech tagging of source language words has been used successfully for finding correct translation equivalents in the dictionary (Ballesteros & Croft, 1998).

With CLIR research in mind we tested part-of-speech marking, using the same test sample of 60 words in Finnish, English and Swedish.
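The idea of part-of-speech marking in dictionary-based translation can be sketched as a filter on dictionary lookup. The toy analyzer readings, the dictionary entries, the tag names and the function `translation_alternatives` below are illustrative assumptions; they are not the dictionaries or the TWOL output used in the test. The sketch only shows how a POS tag on the source word reduces the number of translation alternatives for a homographic word form such as för.

```python
from typing import Dict, List, Optional, Tuple

Reading = Tuple[str, str]  # (base form, POS tag)

# Hypothetical morphological readings for an inflected word form.
TOY_READINGS: Dict[str, List[Reading]] = {
    "för": [("för", "PREP"), ("för", "CONJ"), ("föra", "VERB")],
}

# Hypothetical Swedish-English dictionary keyed by (base form, POS tag).
TOY_DICTIONARY: Dict[Reading, List[str]] = {
    ("för", "PREP"): ["for", "to", "by"],
    ("för", "CONJ"): ["because", "for"],
    ("föra", "VERB"): ["bring", "carry", "lead"],
}


def translation_alternatives(word_form: str,
                             pos: Optional[str] = None) -> List[str]:
    """Collect translations for every reading of word_form.

    If pos is given, only readings carrying that POS tag are looked up.
    Duplicates are dropped while the first-seen order is kept.
    """
    alternatives: List[str] = []
    for reading in TOY_READINGS.get(word_form, []):
        if pos is not None and reading[1] != pos:
            continue  # reading does not match the POS mark of the search key
        for translation in TOY_DICTIONARY.get(reading, []):
            if translation not in alternatives:
                alternatives.append(translation)
    return alternatives


# Without POS marking every reading contributes translations.
print(translation_alternatives("för"))
# With POS marking only the verb reading is translated.
print(translation_alternatives("för", pos="VERB"))
```

In this toy setting the unmarked lookup yields seven alternatives and the verb-marked lookup three, which illustrates the kind of reduction that the test measures on real dictionaries.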
The effect on the number of translation alternatives is shown in Table 8. We can see a small decrease in the number of translation alternatives for English nouns, no decrease for Finnish nouns and a larger decrease for Swedish nouns. The results agree with earlier research by Leppänen (1995) (Finnish) and Sanderson (1994, 1997) (English). From the test we conclude that part-of-speech marking might be effective in CLIR for Swedish.

Table 8
Average number of translation alternatives after POS-marking
Average number of translation alternatives after POS-marking, and the decrease in percent relative to normalization without POS-marking (Table 7)

             Nouns   Verbs   Adjectives+Adverbs   Average all POS
Fin–Swe       22.9    45.5                  5.0              24.5
% decrease     0      -9.9                -64.0             -24.6
Eng–Swe        7.2    31.9                  4.8              14.6
% decrease   -11.1   -12.4                -46.1             -23.2
Swe–Fin       21.4    29.8                  7.5              19.5
% decrease   -36.3   -14.1                -58.6             -36.3

6. Conclusions

Swedish natural language IR currently lacks research, and there has been no CLIR research on Swedish. A Norwegian study (Fjeldvig & Golden, 1988), done 10 years ago with a fairly small test database, is the only research result that might give some leads on how the Swedish language behaves in similar research situations. The Norwegian study shows the effectiveness of automatic normalization and truncation, but it did not show an improvement in the results for compound splitting. We have described the properties of the Swedish language and pointed out a number of research problems for Swedish IR. We also argued that they deserve serious research attention, because (1) the Nordic countries have a fairly large population, (2) they are technically advanced countries with lots of IR-related activity, and (3) the features of Swedish differ from those of many other languages, such as English and Finnish.
The research results for other languages do not necessarily apply to Swedish due to its unique features, e.g., the frequency of homographs, the use of fogemorphemes in compound words, and particle verbs. We have also shown that the publicly available morphological analysis tools for word normalization and compound splitting have pitfalls that might decrease the effectiveness of IR and CLIR. The identification of base forms and the analysis of compounds, in particular, may present problems for IR. Therefore the morphological analysis programs cannot be utilized directly but need to be tuned for IR applications.

References

Alkula, R., & Honkela, T. (1992). Tekstin tallennus- ja hakumenetelmien kehittäminen suomen kielen tulkintaohjelmien avulla. FULLTEXT-projektin loppuraportti. [Linguistic processing and retrieval techniques in Finnish fulltext databases. Final report of the FULLTEXT project.] VTT julkaisuja – publikationer 765. Espoo: VTT.
Ballesteros, L., & Croft, W. B. (1997). Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the Twentieth ACM/SIGIR Conference (pp. 84–91).
Ballesteros, L., & Croft, W. B. (1998). Resolving ambiguity for cross-language retrieval. In Proceedings of the Twenty-first Annual International ACM/SIGIR Conference (pp. 64–71).
Boguraev, B., & Briscoe, T. (1989). Computational lexicography for natural language processing. London: Longman.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., & Mercer, R. L. (1991). Word-sense disambiguation using statistical methods. In Proceedings of the Twenty-ninth Annual Meeting of the Association for Computational Linguistics (pp. 264–270).
Dagan, I., Itai, A., & Schwall, U. (1991). Two languages are more informative than one. In Proceedings of the Twenty-ninth Annual Meeting of the Association for Computational Linguistics (pp. 130–137).
Davis, M., & Ogden, W. C. (1997). QUILT: Implementing a large-scale cross-language text retrieval system. In Proceedings of the Twentieth Annual International ACM/SIGIR Conference (pp. 92–98).
Efthimiadis, E. N. (1996). Query expansion. Annual Review of Information Science and Technology, 31, 121–187.
Fjeldvig, T., & Golden, A. (1988). Experiments with language-based aids in information retrieval systems. Nordic Journal of Linguistics, 11(1-2), 33–46.
Fraurud, K. (1988). Pronoun resolution in unrestricted text. Nordic Journal of Linguistics, 11(1-2), 47–68.
Grishman, R. (1986). Computational linguistics: an introduction. Cambridge: Cambridge University Press.
Guthrie, J. A., Guthrie, L., Wilks, Y., & Aidinejad, H. (1991). Subject-dependent co-occurrence and word sense disambiguation. In Proceedings of the Twenty-ninth Annual Meeting of the Association for Computational Linguistics (pp. 146–152).
Guthrie, L., Pustejovsky, J., Wilks, Y., & Slator, B. (1996). The role of lexicons in natural language processing. Communications of the ACM, 39(1), 63–72.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1), 7–15.
Hellberg, S. (1978). The morphology of present-day Swedish: word-inflection, word-formation, basic dictionary. Stockholm: Almqvist & Wiksell.
Hirst, G. (1987). Semantic interpretation and the resolution of ambiguity. Cambridge: Cambridge University Press.
Hull, D. (1998). A weighted Boolean model for cross-language text retrieval. In G. Grefenstette, Cross-language information retrieval (pp. 119–136). Boston: Kluwer Academic Publishers.
Järvelin, K. (1995). Tekstitiedonhaku tietokannoista. Johdatus periaatteisiin ja menetelmiin. Espoo: Suomen ATK-kustannus [Text retrieval from databases. An introduction to principles and methods].
Karlsson, F. (1992). SWETWOL: A Comprehensive Morphological Analyser for Swedish. Nordic Journal of Linguistics, 15, 1–45.
Karlsson, F. (1994). Yleinen kielitiede. Helsinki: Yliopistopaino [General linguistics].
Kekäläinen, J. (1999). The effects of query complexity, expansion and structure on retrieval performance in probabilistic text retrieval. Ph.D. Thesis, University of Tampere.
Kiefer, F. (1970). Swedish morphology. Stockholm: Skriptor.
Koskenniemi, K. (1983). Two-level morphology: a general computational model for word-form recognition and production. Ph.D. Thesis, University of Helsinki.
Krovetz, R., & Croft, W. B. (1989). Word sense disambiguation using machine-readable dictionaries. In Proceedings of the Twelfth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 127–136).
Krovetz, R., & Croft, W. B. (1992). Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10(2), 115–141.
Leppänen, E. (1995). Homografien disambiguoinnin vaikutukset ja toteuttaminen tekstihaussa [The effects of disambiguation of homographs and its implementation in text retrieval]. Master's Thesis, University of Tampere.
Lewis, D., & Sparck Jones, K. (1996). Natural language processing for information retrieval. Communications of the ACM, 39(1), 92–101.
Malmgren, S. G. (1994). Svensk lexikologi. Ord, ordbildning, ordböcker och orddatabaser. Lund: Studentlitteratur [Swedish lexicology. Words, word formation, dictionaries and word databases].
McRoy, S. W. (1992). Using multiple knowledge sources for word sense disambiguation. Computational Linguistics, 18(1), 1–30.
Oard, D., & Dorr, B. (1996). A survey of multilingual text retrieval. Technical Report UMIACS-TR-96-19. University of Maryland, Institute for Advanced Computer Studies.
Pirkola, A. (1998). The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval. In Proceedings of the Twenty-first Annual International ACM/SIGIR Conference (pp. 55–63).
Pirkola, A. (1999). Studies on linguistic problems and methods in text retrieval. Ph.D. Thesis, University of Tampere.
Pirkola, A., & Järvelin, K. (1996a). The effect of anaphor and ellipsis resolution on proximity searching in a text database. Information Processing & Management, 32(2), 199–216.
Pirkola, A., & Järvelin, K. (1996b). Recall and precision effects of anaphor and ellipsis resolution in proximity searching in a text database. In P. Ingwersen, & N. O. Pors, Proceeding CoLIS2, Second International Conference on Conceptions of Library and Information Science: Integration in Perspective. Copenhagen, October 13–16, 1996 (pp. 459–475). Copenhagen: The Royal School of Librarianship.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130–137.
Sanderson, M. (1994). Word sense disambiguation and information retrieval. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 142–151).
Sanderson, M. (1997). Word sense disambiguation and information retrieval. Ph.D. Thesis, University of Glasgow, Department of Computing Science.
Schütze, H., & Pedersen, J. O. (1995). Information retrieval based on word senses. In Proceedings of the Symposium on Document Analysis and Information Retrieval (pp. 161–175).
Strzalkowski, T. (1994). Robust text processing in automated information retrieval. Proceedings of the Fourth Conference on Applied Natural Language Processing, 168–173. Stuttgart, Germany: Association for Computational Linguistics. Reprinted in K. Sparck Jones & P. Willett (Eds.). Readings in Information Retrieval (1997). San Francisco, CA: Morgan Kaufmann Publishers.
Strzalkowski, T. (1995). Natural language information retrieval. Information Processing & Management, 31(3), 397–417.
Teleman, V., Hellberg, S., & Andersson, E. (1999). Svenska Akademiens grammatik 1–4. Stockholm: Svenska Akademien [Grammar of the Swedish Academy 1–4].
Voorhees, E. M. (1993). Using WordNet to disambiguate word senses for text retrieval. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 171–180).