Information Processing and Management 37 (2001) 147-161
www.elsevier.com/locate/infoproman
Aspects of Swedish morphology and semantics from the
perspective of mono- and cross-language information
retrieval
Turid Hedlund*, Ari Pirkola, Kalervo Järvelin
Department of Information Studies, University of Tampere, PO Box 607, FIN-33101, Tampere, Finland
Received 21 January 2000; accepted 12 April 2000
Abstract
This paper analyzes the features of the Swedish language from the viewpoint of mono- and cross-language information retrieval (CLIR). The study was motivated by the fact that Swedish is poorly
known from the IR perspective. This paper shows that Swedish has unique features, in particular gender
features, the use of fogemorphemes in the formation of compound words, and a high frequency of
homographic words. Especially in dictionary-based CLIR, correct word normalization and compound
splitting are essential. It was shown in this study, however, that publicly available morphological
analysis tools used for normalization and compound splitting have pitfalls that might decrease the
effectiveness of IR and CLIR. A comparative study was performed to test the degree of lexical
ambiguity in Swedish, Finnish and English. The results suggest that part-of-speech tagging might be
useful in Swedish IR due to the high frequency of homographic words. © 2000 Elsevier Science Ltd. All
rights reserved.
Keywords: Text retrieval; Cross-language information retrieval; Swedish language; Natural language processing
* Corresponding author: Swedish School of Economics and Business Administration Library, PO Box 479, FIN-00101 Helsinki, Finland. Tel.: +358-9-43133378; fax: +358-9-43133425.
E-mail address: turid.hedlund@shh.fi (T. Hedlund).
0306-4573/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved.
PII: S0306-4573(00)00024-8
1. Introduction
Our society depends on written communication, which today is often created and stored in
digital form. We have an increasing amount of full text material in different languages
available through the Internet and other information suppliers. Information retrieval (IR) from
operational full text databases or text retrieval is characterized by the lack of controlled
vocabularies. A relatively new research area is cross-language information retrieval (CLIR). It
is a process of selecting and ranking documents in a language different from the query
language.
Text retrieval involves the use of natural language in the form of textual descriptions of a
topic. The variation in word forms and homographic and polysemous expressions causes
disambiguation problems in text retrieval, and the effect increases in CLIR. We can express the
same information content in many different ways depending on who the writer is and for
whom or for what purpose a text is written.
Full text retrieval as a response to a query in natural language involves methods developed
in computational linguistics (Grishman, 1986) and natural language processing (NLP)
(Strzalkowski, 1994, 1995). Increasing recall and precision depends on these applications to
match non-identical expressions belonging to the same concept. Computational linguistics
offers research in language analysis and knowledge representation (syntactic and semantic
knowledge), and NLP applications rely strongly on statistical text analysis and machine
readable texts and dictionaries (Boguraev & Briscoe, 1989; Guthrie, Pustejovsky, Wilks &
Slator, 1996). Analytic tools relevant for IR purposes involve morphological analysis programs,
part of speech taggers, syntactic parsers, and truncation algorithms (stemmers) (Harman, 1991;
Karlsson, 1992; Koskenniemi, 1983; Porter, 1980).
Because English has been the main language for IR system development, much research on
IR and NLP involves English. However, IR systems for small languages like Finnish and
Swedish and other Nordic languages cannot be developed properly without studying their
special features. Although Spanish and Chinese have been given special tracks in the TREC
Conferences¹ and even Finnish (Alkula & Honkela, 1992) has been explored, the results cannot
be applied to linguistically quite different languages.
This paper analyzes Swedish as document and query language for IR. Swedish is spoken as
a native language by eight to nine million people, mainly in Sweden but also by a minority
population in Finland. However, due to close relationships between the Nordic countries and
the other Scandinavian languages the number of people who speak Swedish and can
understand it is much larger (Teleman, Hellberg & Andersson, 1999). Approximately 20
million people have a basic knowledge of Swedish.
Since the Scandinavian languages Swedish, Danish and Norwegian are close, results of
scientific research on one language can help to create a deeper understanding of the same
phenomena in the other Scandinavian languages (Teleman et al., 1999). Thus, a careful
generalization of the results in this study can be made to all three languages.
So far we can report very few analyses of the Scandinavian languages concerning IR, only a
small Norwegian study (Fjeldvig & Golden, 1988). Swedish and the Scandinavian languages
have unique and novel features which affect IR. We will identify and present the features of
the Swedish language relevant to IR on the morphological, syntactic and semantic levels.
¹ The fourth, fifth and sixth REtrieval Conferences, 1996-1998. URL: http://trec.nist.gov/
The problem areas studied are (a) morphological features such as inflection, derivation, gender
and compound words, and (b) semantic features such as homonymy, polysemy and hyponymy. The
viewpoints for the analysis are full text IR, database indexing, query formulation and CLIR.
Publicly available tools for overcoming some of these problems have pitfalls, which we
analyze. We believe that these questions deserve serious research efforts.
The rest of this paper is organized as follows. Section 2 summarizes the linguistic problems
of IR. Section 3 considers IR and natural language processing. In Section 4 the main linguistic
features of Swedish are analyzed from the perspective of IR. In Section 5 the use of NLP tools
in Swedish IR and CLIR is discussed. Section 6 presents concluding remarks.
2. Linguistic problems in IR
An ideal case in searching would be that a search would give all the relevant documents of a
database, ranked in an order of descending relevance, and none of the irrelevant documents. In
a typical case, however, search results often include many irrelevant documents, and many of
the relevant documents are not retrieved. This is primarily caused by linguistic phenomena.
The basic linguistic problems of IR can be classified as follows: (1) the selection of alternative
concepts and search keys, (2) the morphological variation of search keys, (3) referred and omitted
search keys, (4) search key ambiguity, and (5) multilinguality.
The first problem, the selection of alternative concepts and search keys, is related to the fact
that different documents which consider the same topic may use, and often do use, different
concepts and expressions. Typically only part of the relevant documents are retrieved, as users
are not able to list exhaustively the alternative concepts for the concepts of a request and are
not able to expand the queries using synonyms. The problem of query expansion has been
investigated in several studies (Efthimiadis, 1996; Kekäläinen, 1999). In IR terms, the search
concepts which describe a given aspect of a request and which thus are alternative concepts are
either in hierarchical or associative relations to each other. In linguistic terms, these
relationships include hyponymy, a hierarchical subordinate relation (animal - horse),
meronymy, a relationship between the whole and its parts (hand - finger), and other sense
relations, such as antonymy (hot - cold) (for associative relations). In IR terms, the search
keys which represent a given concept are equivalents. In linguistic terms, equivalence covers
synonymy (Information Retrieval - IR).
The second problem, the morphological variation of search keys, is associated with the
matching process. In morphologically simple languages, such as English, word form variation
can be handled by quite simple means, e.g., for nouns, by recognizing the plural forms (-s)
(Harman, 1991; Porter, 1980). However, several European languages are morphologically much
more complex (e.g., German, French, Italian and the Scandinavian languages). Documents are
not retrieved if the search key and its occurrence in a database index (the index term) are not
identical in form. Thus a search key given in a base form does not match the inflected
forms of the key. The same problem also concerns derivationally related keys. Moreover, in a
database index, search key occurrences may be embedded in compound words. All these forms
may, however, reflect the information need behind the query. In this study a compound is
defined as a word formed by two or more components that are spelled together. The term
phrase is used for the case where the constituents are spelled separately.
The third problem, referred and omitted search keys, covers anaphoric and elliptical keys. In
the sentences "Where is Mary? She is in the garden.", the pronoun she and the noun Mary are
coreferential, and the pronoun is an anaphor. An example of an omitted search key, or ellipsis, is
found in the sentences "I have a dog. You have one too." In certain cases in proximity
searching a considerable improvement in retrieval performance is achieved if anaphors and
ellipses are resolved (Pirkola & Järvelin, 1996a, 1996b). The fourth problem, search key
ambiguity, covers polysemy, a word with more than one sense or subsense (star - a star in
the sky or star - a famous person), and homonymy, different words with the same form (bank
- a financial institution or bank - the side of a river). Due to the ambiguity of search keys
the intended senses of search keys are not always present in retrieved documents. In other
words, despite successful matching, irrelevant documents are retrieved.
The fifth problem, multilinguality, concerns multilingual and monolingual document
collections from which documents can be retrieved using more than one language. Cross-language retrieval is concerned in particular with the issues associated with the cross-lingual
equivalence of words (Ballesteros & Croft, 1997; Davis & Ogden, 1997; Hull, 1998; Oard &
Dorr, 1996; Pirkola, 1998).
3. IR and natural language processing
Natural language processing involves linguistic methods and the analysis can take place on
different levels of a language, i.e., morphological, syntactic and semantic levels. On a
morphological level the structure of words is analyzed. Recognizing different word forms as
variants of the same basic word affects both indexing (word weights) and retrieval (matching).
Syntactic analysis determines the structure of phrases and sentences, while semantic analysis
investigates the meaning or sense of words and sentences.
Commonly used methods in document indexing are word form normalization (Koskenniemi,
1983; Pirkola, 1999) and stemming (Harman, 1991). A stemmer removes affixes from the word
forms and the output is a common root, not necessarily a real word. Normalization is a similar
process, but in this case the output is the base form, a real word. Stemming
and normalization may yield three kinds of benefits (Harman, 1991; Järvelin, 1995; Alkula
& Honkela, 1992). (1) A user does not need to worry about truncation and inflection, because
different forms of the key are automatically conflated into the same form. (2) Stemming and
normalization result in storage savings. (3) Stemming and normalization may improve retrieval
performance, especially recall, since a larger number of potentially relevant documents are
retrieved. However, no significant improvement in performance was found by Harman (1991)
in her experiment with simple stemmers for the English language. For inflectionally more
complex languages the results are not necessarily the same.
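The difference between stemming and normalization can be illustrated with a toy sketch. The suffix list and the small lexicon below are hypothetical and chosen only to cover a few Swedish noun forms; a real stemmer or morphological analyzer is far more elaborate.

```python
# Toy illustration only: SUFFIXES and LEXICON are hypothetical, not taken
# from any real stemmer or from SWETWOL.
SUFFIXES = ("orna", "arna", "erna", "or", "ar", "er", "en", "et", "na", "n", "s")

# Normalization needs lexical knowledge: inflected form -> base form.
LEXICON = {
    "flickan": "flicka", "flickor": "flicka", "flickorna": "flicka",
    "boken": "bok", "böcker": "bok", "böckerna": "bok",
}


def stem(word: str) -> str:
    """Strip the longest matching suffix; the result need not be a real word."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(word: str) -> str:
    """Return the base form if the lexicon knows the word, else the word itself."""
    return LEXICON.get(word, word)


for form in ("boken", "böckerna", "flickorna"):
    print(form, "stem:", stem(form), "base form:", normalize(form))
```

In this sketch the stemmer maps boken to bok but böckerna to böck, so the two forms are not conflated because of the vowel change, whereas the lexicon-based normalization returns bok for both.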
Swedish is rich in compound words, and is thus confronted with the problem of embedded
search keys. Splitting the compounds into their components allows the use of the component
words as separate search keys. For instance, the decomposition of the compound hustak (roof
of the house) gives the expansion keys hus (house) and tak (roof). If the compound hustak was
truncated and used as a key, documents including the word tak would not be found.
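A minimal sketch of this point, assuming a tiny hypothetical inverted index (the terms and document numbers are invented for illustration): a truncated key only matches index terms that begin with the compound, while the decomposed components also reach documents indexed under tak or hus.

```python
# Hypothetical inverted index: index term -> set of document ids.
INDEX = {"tak": {1}, "hus": {2}, "hustak": {3}, "hustaket": {4}}


def truncated_search(prefix, index):
    """Right-hand truncation: match every index term starting with the prefix."""
    docs = set()
    for term, ids in index.items():
        if term.startswith(prefix):
            docs |= ids
    return docs


def expanded_search(compound, components, index):
    """Truncated compound plus its components used as separate search keys."""
    docs = truncated_search(compound, index)
    for component in components:
        docs |= index.get(component, set())
    return docs


print(truncated_search("hustak", INDEX))
print(expanded_search("hustak", ["hus", "tak"], INDEX))
```

The component keys would normally be combined with some query structure or weighting; the plain union here is only the simplest possible choice.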
In IR part-of-speech (POS) tagging may be used to identify central words (word classes,
especially nouns) and phrases of a sentence. In CLIR part-of-speech tagging is useful in
matching the source language keys with translation dictionary entry words.
A syntactic parser (program) determines the structure of a sentence according to a particular
grammar (Grishman, 1986). The parsing procedure may involve the assignment of a tree
structure to the input sentence. Linguistic transformation, such as transforming active
sentences into passive can be of potential value for IR. Syntactic analysis can be used as a
basis for further analysis, e.g., anaphor resolution.
Word sense disambiguation is an NLP method which aims at finding correct senses for
word occurrences. It has been studied intensively in IR and other fields. The methods used
in the studies include dictionaries (Dagan, Itai & Schwall, 1991; Guthrie, Guthrie, Wilks &
Aidinejad, 1991; Krovetz & Croft, 1989), knowledge bases (Hirst, 1987), statistical methods
(Brown, Della Pietra, Della Pietra & Mercer, 1991; Schütze & Pedersen, 1995), multiple
knowledge sources (McRoy, 1992), thesauri (Vorhees, 1993) and pseudo-words (Sanderson,
1994, 1997). Most IR studies have reported no or only slight improvements in retrieval
performance due to word sense disambiguation (Krovetz & Croft, 1992; Sanderson, 1994,
1997; Vorhees, 1993).
4. Linguistic features of Swedish from IR perspective
In Swedish we can identify the following features that are likely to affect information
retrieval: (1) a fairly rich morphology, (2) gender features (gender uter and gender neuter²),
(3) a high frequency of compound and derivative word forms, (4) a lower frequency of common noun
phrases than, for example, in English, and (5) a high frequency of homographic word forms.
4.1. Morphology
The inflectional and derivational morphology of Swedish is more complex than that of English. Nouns
can be divided into five declension types according to the plural suffixes they take, i.e., -or, -ar,
-er (-r), -n, and ø (no suffix). Genitive forms are formed with the suffix -s. Definite forms of nouns
are expressed using the articles den, det (sing.), de (pl.), and the definite suffixes -en (-n) (gender
uter) and -et (-t) (gender neuter). Due to the fairly rich inflectional morphology, nouns have
several inflectional forms and even the stems change (umlaut) (Table 1). Therefore simple
indexing and matching methods are insufficient for Swedish IR. Inflection interferes with the
weighting of index terms, and in matching, documents are lost due to inflected words.
Adjectives are congruent with their headwords in gender and number (Table 2). Irregularities
to the general inflectional rules exist in nouns as well as in adjectives and verbs. In Swedish
strong verbs have vowel change. For a comprehensive description of inflection we refer the
reader to books on Swedish morphology, e.g., Hellberg (1978), Kiefer (1970), Malmgren
(1994), Teleman et al. (1999).
For the problem of inflectional forms, morphological analysis programs should be used for
word normalization at the indexing and searching stages. We shall consider morphological
analyzers and word form normalization in Section 5.
² Gender is an inalienable property of a Swedish noun. A noun is defined as gender 'uter' or gender 'neuter'
according to the inflectional suffix it takes in the definite form singular (uter: -en, -n, neuter: -et, -t). Homonyms often
have different gender, for example: plan-en (plan), plan-et (plane).
Table 1
Examples of inflected noun forms
Base form        Definite form sing.   Plural            Definite form pl.
Flicka (girl)    Flicka-n              Flick-or          Flickor-na
Bro (bridge)     Bro-n                 Bro-ar            Broar-na
Bok (book)       Bok-en                Böck-er           Böcker-na

Genitive sing.   Genitive def. sing.   Genitive plural   Genitive def. pl.
Flicka-s         Flickan-s             Flickor-s         Flickorna-s
Bro-s            Bron-s                Broar-s           Broarna-s
Bok-s            Boken-s               Böcker-s          Böckerna-s
4.2. Gender
Swedish nouns have two grammatical genders, gender uter and gender neuter (Teleman et al.,
1999). The gender distinction in Swedish has a positive effect on the effectiveness of anaphor
resolution algorithms, as in the algorithm for pronoun resolution in Swedish constructed by
Fraurud (1988). Gender-based resolution relies on the fact that the pronouns den (uter) and
det (neuter) (meaning it) should agree with their correlates. Homographic nouns
often possess different genders (Teleman et al., 1999); thus gender is also helpful in
the resolution of homographs.
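As a rough sketch of how gender could be exploited, the following code infers gender from the definite singular suffix and uses it to separate the homographs plan-en (plan) and plan-et (plane), and to filter candidate correlates of a pronoun by agreement. It does not reproduce Fraurud's algorithm; the function names, the mini-lexicon and the candidate list are hypothetical.

```python
# Hypothetical mini-lexicon: (base form, gender) -> English sense.
SENSES = {("plan", "uter"): "plan", ("plan", "neuter"): "plane"}

# Gender required by the pronouns den and det (both meaning "it").
PRONOUN_GENDER = {"den": "uter", "det": "neuter"}


def gender_from_definite(form, base):
    """Infer gender from the definite singular suffix (-en/-n uter, -et/-t neuter)."""
    if not form.startswith(base):
        return None
    suffix = form[len(base):]
    if suffix in ("en", "n"):
        return "uter"
    if suffix in ("et", "t"):
        return "neuter"
    return None


def sense_of(form, base):
    """Pick the sense of a homographic noun from the gender of its definite form."""
    return SENSES.get((base, gender_from_definite(form, base)))


def agreeing_candidates(pronoun, candidates):
    """Keep only the candidate correlates whose gender agrees with the pronoun."""
    wanted = PRONOUN_GENDER[pronoun]
    return [noun for noun, gender in candidates if gender == wanted]


print(sense_of("planen", "plan"), sense_of("planet", "plan"))
print(agreeing_candidates("det", [("hare", "uter"), ("hus", "neuter")]))
```

Agreement only narrows the candidate set; when several nouns of the same gender precede the pronoun, other evidence is still needed.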
Table 2
Examples of adjective inflection
Indef. form sing.                       Definite form sing.                 Indef. form plural           Definite form pl.
en vit hare (uter) (a white rabbit)     den vita haren (the white rabbit)   vita harar (white rabbits)   de vita hararna (the white rabbits)
ett vitt hus (neuter) (a white house)   det vita huset (the white house)    vita hus (white houses)      de vita husen (the white houses)
4.3. Derivatives
New words are formed by derivation and compounding. Derivatives and their stem words
may belong to different part-of-speech categories. The possibility of changing the part-of-speech class
of a word by derivation enables, for example, the derivation of a noun (an actor or an action)
from a verb. Such words are in many cases lexicalized. Words belonging to different part-of-speech categories have different derivational suffixes, and the inflection of derived and
underived words is similar. Examples of noun suffixes are -are, -het and -ning, e.g., målare
(painter), skönhet (beauty) and förkortning (abbreviation). Examples of adjective suffixes are
-bar and -aktig, e.g., mätbar (measurable) and livaktig (vivid). The suffixes -era and -na, e.g.,
spionera (spy), vakna (wake up), are examples of verb suffixes (Malmgren, 1994).
Derivation is a common method of forming antonyms. They are often formed with prefixes, i.e.,
o-, a-, in-, il-, im-, ir-, e.g., lycklig/olycklig (happy/unhappy), symmetrisk/asymmetrisk, tolerant/
intolerant, legal/illegal, produktiv/improduktiv, rationell/irrationell. Derivational prefixes are, as
in the examples above, in many cases of foreign origin (Malmgren, 1994).
Derivatives can enter into compounds as a stem and the rules for derivation must precede
those of compounding, for example:
lära - lärare - lärarförbundet (teach - teacher - teacher association)
fängsla - fängelse - fängelsestraff (arrest - prison - penalty of imprisonment)
antik - antikvitet - antikvitetsaffär (antique - antiquity - antique shop)
In a normalization process for IR, derivatives may be related back to the original word/stem.
This increases ambiguity and sometimes makes it impossible to find the right translation
alternative for CLIR, because in many cases the derived forms are lexicalized and have
separate entries in a translation dictionary.
4.4. Compounds and phrases
Swedish is characterized by high frequency of compound words. Phrases are not so common
as in English. Compounds are easier to deal with than phrases, because there is no need for
identification. On the other hand, for effective IR and CLIR, compounds need to be
decomposed. Especially in CLIR this is often necessary, because for many compounds
dictionaries include only the components of compounds.
A typical feature of Swedish is also the use of fogemorphemes in compound word formation.
Fogemorphemes are elements used to join the constituents of compounds. The fogemorpheme
types are as follows:
-(omission): flicknamn (maiden name)
-s: rättsfall (legal case)
-e: flickebarn (female child)
-a: gästabud (feast, banquet)
-u: gatubelysning (street lighting)
-o: människokärlek (love of mankind)
Sometimes the component preceding the fogemorpheme is a stem, e.g., gat(a), and sometimes a base
form, e.g., rätt. Proper weighting of index terms and effective matching require that
fogemorphemes be handled and that the correct base forms of component words be
identified.
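The fogemorpheme types listed above can be handled, in a very simplified way, by a lexicon-driven splitter. The sketch below is an assumption-laden illustration, not the method of any existing analyzer: it handles two-component compounds only, the lexicon is a hypothetical toy list, and restoring a final -a is the only stem change it knows.

```python
# Hypothetical toy lexicon of base forms.
LEXICON = {
    "flicka", "namn", "rätt", "fall", "barn", "gäst", "bud",
    "gata", "belysning", "människa", "kärlek",
}

# Fogemorphemes from the list above; "" covers plain joining and omission.
LINKS = ("", "s", "e", "a", "u", "o")


def first_component_bases(head):
    """Return the base forms that the first component may stand for."""
    bases = []
    for link in LINKS:
        if link and not head.endswith(link):
            continue
        stem = head[: len(head) - len(link)] if link else head
        # The stem may be a base form (rätt) or a base form minus final -a (gat-a).
        for base in (stem, stem + "a"):
            if base in LEXICON and base not in bases:
                bases.append(base)
    return bases


def split_compound(word):
    """Return all (first base form, last component) analyses found in the lexicon."""
    analyses = []
    for i in range(2, len(word) - 1):
        head, tail = word[:i], word[i:]
        if tail in LEXICON:
            for base in first_component_bases(head):
                analyses.append((base, tail))
    return analyses


for compound in ("flicknamn", "rättsfall", "flickebarn", "gästabud",
                 "gatubelysning", "människokärlek"):
    print(compound, split_compound(compound))
```

With this toy lexicon each of the six example compounds receives exactly one analysis whose first component is a proper base form, e.g. gata rather than gatu.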
Many nominal compounds are lexicalized so that it is not always possible to derive their
meaning from the meanings of component words, e.g., jordgubbe (strawberry). In many
compound words, however, the last component is a hyperonym of the full compound being
thus a valuable key (Pirkola, 1999). For example, compound words denoting different types
of schools have the component skola (school) as their last component. In IR the decomposition
of compounds would often be helpful in cases like this. The last component may also be used
as an ellipsis later in the text.
Particle verbs (compound verbs where one component is a particle) are special forms of
compounds. There are four types of particle verbs: always tightly compounded, e.g., påminna
(to remind); always loosely compounded (except in the past participle), e.g., tycka om/omtyckt (to
like); both loosely and tightly compounded with no change in sense, e.g., omtala/tala om
(speak about); and both loosely and tightly compounded with a change in sense, e.g.,
avbryta/bryta av (interrupt/break) (Malmgren,
1994). From the IR point of view, some particle verbs are similar to nominal compounds and
some are similar to nominal phrases. Tightly compounded particle verbs are mostly lexicalized, but
with the other forms we can see the benefits of identifying phrases and the benefits of splitting
compounds into components.
4.5. Homonymy and polysemy
Homonymy and polysemy are two different forms of lexical ambiguity. Homonyms are
different lexemes spelled alike. A polysemous word is a word that has several subsenses
which are related to one another (Teleman et al., 1999). For example, the word stjärna (star)
may denote a famous person or a star in the sky. Due to lexical ambiguity the intended
senses of keys are not always realized in the retrieval results, and irrelevant documents are
retrieved. In CLIR the negative effects of lexical ambiguity on retrieval performance are often
more severe than in monolingual retrieval, because in CLIR both source and target language
keys may be ambiguous.
Homographs are very common in Swedish. According to Karlsson (1994) around 65% of
Swedish words in running text are homographs. In Finnish the percentage is only 15% and in
English around 50%. For example, the Swedish form för can be used as a preposition, verb,
noun, and adverb. Since Swedish is a language rich in homographic word forms, it is possible
that the disambiguation of homonyms or the analysis of the meaning of a word from its
context would have a positive effect on retrieval results, contrary to the findings for English,
e.g., Sanderson (1994).
5. Natural language analysis programs for Swedish
5.1. Normalization
SWETWOL is a morphological analyzer developed to analyze Swedish words (Karlsson,
1992). The program incorporates a full description of Swedish morphology, and morphological
descriptions are assigned. Normalization returns all the derivational and inflectional forms of a
word to the base form by recognizing and deleting inflectional and derivational affixes.
Table 3
Normalization for Swedish
Input word                                  Normalization to base form                 SWETWOL analysis
avgöras                                     avgöra (decide)                            V PASS INF (verb, passive, infinitive)
utsläpp                                     utsläppa (dumping)                         V ACT IMP (verb, active, imperative)
                                            utsläpp (outlet)                           N NEU INDEF SG/PL NOM (noun, neuter, indefinite, sg./pl. nominative)
behandling (treatment, usage, management)   behandla (treat, deal with)                VDER/-ning N UTR INDEF SG NOM (verb derivation/-ning, noun, uter, indefinite, sg. nominative)
väsentligt (essentially, mainly)            väsentlig (essential, fundamental, main)   A NEU INDEF SG NOM (adjective, neuter, sg. nominative)
operation (operation)                       operation (operation)                      N UTR INDEF SG NOM (noun, uter, indefinite, sg. nominative)
Let us consider some examples of Swedish word normalization using SWETWOL (web version, used in October 1999). Table 3 shows that most sample words are correctly
normalized. The noun behandling (treatment), however, is normalized to the verb behandla (to treat). In
CLIR this may be a problem, because dictionaries typically give nouns and verbs as separate
entries. Translation would benefit if the normalization output gave both forms. In
monolingual IR this problem is less severe since both document and query words are treated
similarly.
5.2. Compound splitting
Table 4 presents some examples of compound splitting by SWETWOL. In the first example,
skogsindustrin (the forest industry), the components skog (forest) and industrin (the industry) are
joined with the fogemorpheme "s". The latter component industrin is normalized to the base
form industri, which is a hyperonym of skogsindustri and therefore often a valuable search key.
The former component skogs has retained the "s", and is not normalized to the base form
skog. In monolingual IR this has the effect that the matching process only concerns the
genitive form skogs and compounds with the first component skogs. The compounds with
skog- (the fogemorpheme "s" not used) like skogbevuxen (wooded), skoglös (treeless) are not
found.
Table 4
Examples of compound splitting in Swedish
Input word: skogsindustrin (the forest industry)
  Compound splitting: skogs]industri (forest ] industry)
  SWETWOL analysis: <N> ] N UTR DEF SG NOM (noun ] noun, uter, definite, sg. nominative)
Input word: kärnkraftverken (the nuclear power plant)
  Compound splitting: "kärn]kraft]verk" (kärn-a = kernel, seed, grain, nuclear etc.; kraft = force, strength, etc.; verk = work(s), labor, plant, mill etc.)
  SWETWOL analysis: <N> ] <N> ] N NEU DEF PL NOM (noun ] noun ] noun, neuter, definite, pl. nominative)
Input word: toppmöte (summit, (top-level) meeting)
  Compound splitting: "topp]möte" (topp = done!, agreed!, it's a bargain!, top, crest, summit, pinnacle, peak, apex; möte = meeting, encounter, appointment, date, gathering, assembly, conference)
  SWETWOL analysis: <N> ] N NEU DEF SG NOM (noun]noun, neuter, def. sg. nominative)
  Compound splitting: "topp]mö]te" (mö = maid, maiden; te = tea)
  SWETWOL analysis: <N> ] <N> ] N NEU DEF SG NOM (noun]noun, neuter, def. sg. nominative)
Table 5
Average number of homographic base form source words in dictionaries
                      Finnish   English   Swedish
Nouns                 0         0.05      0.25
Verbs                 0         0.15      0.4
Adjectives+Adverbs    0         0         0
Average all POS       0         0.07      0.22
For CLIR this is still more puzzling since there are several translation alternatives for almost
every word. If the compound word kärnkraftverk (nuclear power plant) (example 2 in Table 4)
is found in the dictionary, we will get an unambiguous translation. If the compound is split, we
will have a situation where the last component is normalized to its base form but the other
components are retained in the form in which they occur in the compound word. In this case we have an
example of omission (see Section 4.4 on fogemorphemes), where kärna is transformed to kärn.
For the word toppmötet (meeting, summit) there are two outputs: topp and möte, and topp,
mö and te (example 3 in Table 4). The latter case provides an example of a semantically false
coordination.
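The toppmöte case can be reproduced with a small exhaustive splitter. The sketch below enumerates every segmentation licensed by a hypothetical four-word lexicon and then prefers the analysis with the fewest components. That preference is an assumed heuristic introduced here for illustration; it is not described in this study and is not claimed to be what SWETWOL does.

```python
# Hypothetical toy lexicon containing exactly the components discussed above.
LEXICON = {"topp", "möte", "mö", "te"}


def all_splits(word):
    """Yield every segmentation of word into lexicon entries."""
    if word in LEXICON:
        yield [word]
    for i in range(1, len(word)):
        head = word[:i]
        if head in LEXICON:
            for rest in all_splits(word[i:]):
                yield [head] + rest


def preferred_split(word):
    """Assumed heuristic: prefer the analysis with the fewest components."""
    candidates = list(all_splits(word))
    return min(candidates, key=len) if candidates else None


print(list(all_splits("toppmöte")))
print(preferred_split("toppmöte"))
```

Under this heuristic the semantically false analysis topp, mö, te is discarded in favor of topp, möte, but the choice is purely structural and can fail for other compounds.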
5.3. Lexical ambiguity
Homograph disambiguation might be useful in Swedish IR due to the high frequency of
homographic expressions in Swedish. We performed a test in which 35 Finnish test requests
used in IR and CLIR research were translated into English and Swedish. We picked out 60
Finnish, English and Swedish equivalent words representing different part-of-speech classes,
and studied the degree of ambiguity in the three languages. We used off-the-shelf dictionaries³
to count the number of homographic word forms and translation alternatives. By translation
alternatives we mean the alternative translations (including polysemy) and translation examples
given for a word in dictionaries similar to those used for CLIR research. The test with homographic
base form words in Table 5 shows an average of 0.22 homographic expressions for Swedish,
0.07 for English and no homographic expressions in the Finnish test sample. The number of
translation alternatives for the same test sample is shown in Table 6. The number of
translation alternatives depends on the dictionary used, in this case standard dictionaries of
approximately the same size. Similar translation tools would have to be used for CLIR.
³ Suomi-ruotsi suursanakirja [Finnish-Swedish comprehensive dictionary]/Knut Cannelin, Lauri Hirvensalo, Nils
Hedlund. 3rd ed. Porvoo: WSOY, 1976 (150,000 words). Ruotsi-suomi-suursanakirja [Swedish-Finnish comprehensive dictionary]/Lea Lampén. Porvoo: WSOY, 1973 (70,000 words). Norstedts engelsk-svenska ordbok [Norstedt's
English-Swedish dictionary]/Britt-Marie Berlund . . . [et al.]. Stockholm: Norstedts, 1983 (60,000 words and phrases).
Table 6
Average number of translation alternatives for base form words
                      Fin-Swe   Eng-Swe   Swe-Fin
Nouns                 20.8      8.1       8.7
Verbs                 43.9      36.4      19.3
Adjectives+Adverbs    5.7       7         5
Average all POS       23.4      17        11.0
To test the effect of normalization we used the same test sample, this time using the inflected word
forms of the original test requests. The TWOL⁴ morphological analyzers were used in the
experiments. In this case (Table 7), we can see a clear difference between the languages. For Finnish
nouns we can see a small increase in translation alternatives, and a larger one for verbs and
adjectives+adverbs. English shows practically no difference, but in Swedish we have a
considerable increase in translation alternatives for all part-of-speech classes.
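The percentage increases can be checked with a short calculation. The sketch below uses the Swedish-Finnish figures reported in Tables 6 and 7 (8.7, 19.3 and 5 alternatives for base forms; 33.6, 34.7 and 18.1 after normalization of inflected forms). The variable names are ours, and the reading of the "Average all POS" percentage as the mean of the three per-class increases is an inference from the reported numbers, not something stated in the text.

```python
def percent_increase(before, after):
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100


# Swedish-Finnish translation alternatives (Tables 6 and 7).
base_forms = {"nouns": 8.7, "verbs": 19.3, "adjectives+adverbs": 5.0}
normalized = {"nouns": 33.6, "verbs": 34.7, "adjectives+adverbs": 18.1}

increases = {
    pos: round(percent_increase(base_forms[pos], normalized[pos]), 1)
    for pos in base_forms
}
# Assumption: the reported average is the mean of the per-class percentages.
average = round(sum(increases.values()) / len(increases), 1)

print(increases)
print(average)
```

The computed values, 286.2, 79.8 and 262.0 percent with a mean of 209.3, agree with the Swedish-Finnish column of Table 7.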
General caution is needed about the prospects of word sense disambiguation strategies
and the hope that they will significantly improve retrieval performance (Lewis & Sparck
Jones, 1996). However, here too we have no research results for Swedish.
One method of disambiguating word senses is the use of a concordance, where the context of
the word is listed along with the word. Homonymous words can be disambiguated by
establishing their co-occurrence or collocability. For example, in Swedish, as in English, the
word "bank" means both financial institution and reef or sand bank. Bank in the sense of
financial institution probably occurs with words like "lån" (credit), "pengar" (money),
"ekonomi" (economy), while the word "bank" (reef) occurs with "jord" (earth), "flod" (river)
etc.
Consider one of our test requests:
"Skuldkrisen i Sydamerika. Hur har skuldsättningsproblemet utvecklats?"
"South American debt crisis. How has the debt problem developed?"
From a dictionary we can establish the following meanings for the Swedish words skuld, kris
and skuldsättning:
skuld: debt, amount due, liabilities, fault, blame, guilt
kris: crisis, depression
skuldsättning: getting into debt
⁴ SWETWOL, Copyright © Lingsoft Ab 1995. http://www.lingsoft.fi/cgi-pub/swetwol ENGTWOL, Copyright ©
Atro Voutilainen, Juha Heikkilä, Timo Järvinen and Lingsoft, Inc. 1993-1995. http://www.lingsoft.fi/cgi-pub/engtwol
FINTWOL, Copyright © Kimmo Koskenniemi & Lingsoft Oy 1995-1997. http://www.lingsoft.fi/cgi-pub/fintwol
Table 7
Average number of translation alternatives after normalization
(Average number of translation alternatives of inflected word forms after morphological analysis, with the increase in percent relative to words in base form, Table 6)
                      Fin-Swe   % increase   Eng-Swe   % increase   Swe-Fin   % increase
Nouns                 22.9      10.1         8.1       0            33.6      286.2
Verbs                 50.5      15.0         36.4      0            34.7      79.8
Adjectives+Adverbs    13.9      143.9        8.9       29.0         18.1      262.0
Average all POS       29.1      56.3         17.8      9.7          28.8      209.3
If we can establish that the words skuld and skuldsättning usually co-occur, we can
conclude that the right translation is "debt crisis", not "guilt crisis" or "guilt
depression". However, disambiguation by collocating words is not
easy to implement in operational IR systems, especially when only a relatively
short query is available, as is the case for dictionary-based methods in CLIR.
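The size of the translation problem for this request, and the way a co-occurring word can narrow it, can be sketched as follows. The dictionary entries are the ones listed above. The two helper functions and the gloss-overlap heuristic are hypothetical illustrations, not the procedure used in this study.

```python
from itertools import product

# Dictionary entries as listed in the text.
DICTIONARY = {
    "skuld": ["debt", "amount due", "liabilities", "fault", "blame", "guilt"],
    "kris": ["crisis", "depression"],
    "skuldsättning": ["getting into debt"],
}


def phrase_candidates(parts, dictionary=DICTIONARY):
    """All word-by-word translations of a compound split into `parts`."""
    return [" ".join(combo) for combo in product(*(dictionary[p] for p in parts))]


def supported_translations(word, cooccurring, dictionary=DICTIONARY):
    """Translations of `word` that also occur in glosses of co-occurring words."""
    gloss_tokens = {
        token
        for other in cooccurring
        for gloss in dictionary[other]
        for token in gloss.split()
    }
    return [t for t in dictionary[word] if t in gloss_tokens]


candidates = phrase_candidates(["skuld", "kris"])
print(len(candidates))  # 6 senses of skuld x 2 senses of kris
print(supported_translations("skuld", ["skuldsättning"]))
```

In this sketch the overlap narrows skuld to "debt" but gives no evidence for choosing between the two senses of kris, which illustrates why a short query offers little context for disambiguation.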
Part-of-speech tagging might be useful in Swedish IR because of the high frequency of
homographic word forms. The words in the query and in the documents are tagged with their
part of speech, and the matching process takes the tags of the search keys and the
document index terms into account. With this method, for example, the preposition för
(because) could be separated from the verb för(a) (bring). In CLIR, part-of-speech tagging of
source language words has been used successfully for finding correct translation equivalents in
the dictionary (Ballesteros & Croft, 1998).
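A minimal sketch of part-of-speech-aware dictionary lookup is given below. The glosses are the ones mentioned above. The tag names, the two lookup tables and the function are assumptions made for illustration and do not reproduce the output format of any particular tagger or analyzer.

```python
# Hypothetical tagged dictionary: (base form, part of speech) -> translations.
# The glosses come from the text; the tag names are assumptions.
TAGGED_DICTIONARY = {
    ("för", "PREP"): ["because"],
    ("föra", "VERB"): ["bring"],
}

# Hypothetical analyzer output: the surface form "för" is a homograph
# with one prepositional and one verbal reading.
ANALYSES = {
    "för": [("för", "PREP"), ("föra", "VERB")],
}


def translations(surface_form, pos=None):
    """Collect translations for every reading of a surface form.

    When `pos` is given, readings with another part of speech are skipped.
    """
    result = []
    for base_form, reading_pos in ANALYSES.get(surface_form, []):
        if pos is None or reading_pos == pos:
            result.extend(TAGGED_DICTIONARY.get((base_form, reading_pos), []))
    return result


print(translations("för"))              # both readings contribute
print(translations("för", pos="PREP"))  # only the preposition
print(translations("för", pos="VERB"))  # only the verb
```

Without a tag the homograph contributes the translations of both readings to the target query, whereas a tagged search key contributes only those of the intended reading.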
With CLIR research in mind we tested part-of-speech marking, using the same test sample
of 60 words in Finnish, English and Swedish. The effect on the number of translation
alternatives is shown in Table 8. We can see a small decrease in the number of translation
alternatives for English nouns, no decrease for Finnish nouns and a larger decrease for Swedish nouns.
The results agree with earlier research by Leppänen (1995) (Finnish) and Sanderson (1994,
1997) (English). From the test we conclude that part-of-speech marking might be
effective in CLIR for Swedish.

Table 8
Average number of translation alternatives after POS-marking, and the decrease in percent
relative to normalization without POS-marking (Table 7)

                      Fin–Swe   % decrease   Eng–Swe   % decrease   Swe–Fin   % decrease
Nouns                    22.9            0       7.2        -11.1      21.4        -36.3
Verbs                    45.5         -9.9      31.9        -12.4      29.8        -14.1
Adjectives+Adverbs        5.0        -64.0       4.8        -46.1       7.5        -58.6
Average all POS          24.5        -24.6      14.6        -23.2      19.5        -36.3
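The percentages in Table 8 can be checked against Table 7 with a short calculation. The sketch below copies the averages from the two tables; the dictionary and function names are illustrative. It also suggests that the "Average all POS" percentage appears to be the mean of the three per-class percentages rather than the change between the averaged counts, which for Fin–Swe would be (24.5 - 29.1) / 29.1, or about -15.8%.

```python
# Average number of translation alternatives per dictionary pair, in the
# order nouns, verbs, adjectives+adverbs (values copied from the tables).
TABLE_7 = {  # after normalization, without POS-marking
    "Fin-Swe": [22.9, 50.5, 13.9],
    "Eng-Swe": [8.1, 36.4, 8.9],
    "Swe-Fin": [33.6, 34.7, 18.1],
}
TABLE_8 = {  # after POS-marking
    "Fin-Swe": [22.9, 45.5, 5.0],
    "Eng-Swe": [7.2, 31.9, 4.8],
    "Swe-Fin": [21.4, 29.8, 7.5],
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def changes(pair):
    """Per-class changes and their mean, each rounded to one decimal."""
    per_class = [percent_change(b, a) for b, a in zip(TABLE_7[pair], TABLE_8[pair])]
    mean = sum(per_class) / len(per_class)
    return [round(x, 1) for x in per_class], round(mean, 1)


for pair in TABLE_7:
    print(pair, *changes(pair))
```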
6. Conclusions
Swedish natural language IR currently lacks research, and no CLIR research has been
done for Swedish either. A Norwegian study (Fjeldvig & Golden, 1988),
done some 10 years ago with a fairly small test database, is the only research result that might give
some leads on how the Swedish language behaves in similar research situations. The
Norwegian study showed the effectiveness of automatic normalization and truncation, but
it did not show an improvement in the results from compound splitting.
We have described the properties of the Swedish language and pointed out a number of research
problems for Swedish IR. We have also argued that they deserve serious research attention, because
(1) the Nordic countries have a fairly large population, (2) they are technically advanced
countries with much IR-related activity, and (3) the features of Swedish differ from
those of many other languages, such as English and Finnish. The research results for other
languages do not necessarily apply to Swedish due to its unique features, e.g., the frequency of
homographs, the use of fogemorphemes in compound words, and particle verbs.
We have also shown that the publicly available morphological analysis tools for word
normalization and compound splitting have pitfalls that might decrease the effectiveness of IR
and CLIR. The identification of base forms and the analysis of compounds, in particular, may
present problems for IR. Therefore the morphological analysis programs cannot be used
directly but need to be tuned for IR applications.
References
Alkula, R., & Honkela, T. (1992). Tekstin tallennus- ja hakumenetelmien kehittäminen suomen kielen tulkintaohjelmien
avulla. FULLTEXT-projektin loppuraportti. [Linguistic processing and retrieval techniques in Finnish fulltext
databases. Final report of the FULLTEXT project.] VTT julkaisuja – publikationer 765. Espoo: VTT.
Ballesteros, L., & Croft, W. B. (1997). Phrasal translation and query expansion techniques for cross-language
information retrieval. In Proceedings of the Twentieth ACM/SIGIR Conference (pp. 84–91).
Ballesteros, L., & Croft, W. B. (1998). Resolving ambiguity for cross-language retrieval. In Proceedings of the
Twenty-first Annual International ACM/SIGIR Conference (pp. 64–71).
Boguraev, B., & Briscoe, T. (1989). Computational lexicography for natural language processing. London: Longman.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., & Mercer, R. L. (1991). Word-sense disambiguation using
statistical methods. In Proceedings of the Twenty-ninth Annual Meeting of the Association for Computational
Linguistics (pp. 264–270).
Dagan, I., Itai, A., & Schwall, U. (1991). Two languages are more informative than one. In Proceedings of the
Twenty-ninth Annual Meeting of the Association for Computational Linguistics (pp. 130–137).
Davis, M., & Ogden, W. C. (1997). QUILT: Implementing a large-scale cross-language text retrieval system. In
Proceedings of the Twentieth Annual International ACM/SIGIR Conference (pp. 92–98).
Efthimiadis, E. N. (1996). Query expansion. Annual Review of Information Science and Technology, 31, 121–187.
Fjeldvig, T., & Golden, A. (1988). Experiments with language-based aids in information retrieval systems. Nordic
Journal of Linguistics, 11(1-2), 33–46.
Fraurud, K. (1988). Pronoun resolution in unrestricted text. Nordic Journal of Linguistics, 11(1-2), 47–68.
Grishman, R. (1986). Computational linguistics: an introduction. Cambridge: Cambridge University Press.
Guthrie, J. A., Guthrie, L., Wilks, Y., & Aidinejad, H. (1991). Subject-dependent co-occurrence and word sense
disambiguation. In Proceedings of the Twenty-ninth Annual Meeting of the Association for Computational
Linguistics (pp. 146–152).
Guthrie, L., Pustejovsky, J., Wilks, Y., & Slator, B. (1996). The role of lexicons in natural language processing.
Communications of the ACM, 39(1), 63–72.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1),
7–15.
Hellberg, S. (1978). The morphology of present-day Swedish: word-inflection, word-formation, basic dictionary.
Stockholm: Almqvist & Wiksell.
Hirst, G. (1987). Semantic interpretation and the resolution of ambiguity. Cambridge: Cambridge University Press.
Hull, D. (1998). A weighted Boolean model for cross-language text retrieval. In G. Grefenstette, Cross-language
information retrieval (pp. 119–136). Boston: Kluwer Academic Publishers.
Järvelin, K. (1995). Tekstitiedonhaku tietokannoista. Johdatus periaatteisiin ja menetelmiin. Espoo: Suomen
ATK-kustannus [Text retrieval from databases. An introduction to principles and methods].
Karlsson, F. (1992). SWETWOL: A Comprehensive Morphological Analyser for Swedish. Nordic Journal of
Linguistics, 15, 1–45.
Karlsson, F. (1994). Yleinen kielitiede. Helsinki: Yliopistopaino [General linguistics].
Kekäläinen, J. (1999). The effects of query complexity, expansion and structure on retrieval performance in
probabilistic text retrieval. Ph.D. Thesis, University of Tampere.
Kiefer, F. (1970). Swedish morphology. Stockholm: Skriptor.
Koskenniemi, K. (1983). Two-level morphology: a general computational model for word-form recognition and
production. Ph.D. Thesis, University of Helsinki.
Krovetz, R., & Croft, W. B. (1989). Word sense disambiguation using machine-readable dictionaries. In Proceedings
of the Twelfth Annual International ACM SIGIR Conference on Research and Development in Information
Retrieval (pp. 127–136).
Krovetz, R., & Croft, W. B. (1992). Lexical ambiguity and information retrieval. ACM Transactions on Information
Systems, 10(2), 115–141.
Leppänen, E. (1995). Homografien disambiguoinnin vaikutukset ja toteuttaminen tekstihaussa [The effects of
disambiguation of homographs and its implementation in text retrieval]. Master's Thesis, University of Tampere.
Lewis, D., & Sparck Jones, K. (1996). Natural language processing for information retrieval. Communications of the
ACM, 39(1), 92–101.
Malmgren, S. G. (1994). Svensk lexikologi. Ord, ordbildning, ordböcker och orddatabaser. Lund: Studentlitteratur
[Swedish lexicology. Words, word formation, dictionaries and word databases].
McRoy, S. W. (1992). Using multiple knowledge sources for word sense disambiguation. Computational Linguistics,
18(1), 1–30.
Oard, D., & Dorr, B. (1996). A survey of multilingual text retrieval. Technical Report UMIACS-TR-96-19.
University of Maryland, Institute for Advanced Computer Studies.
Pirkola, A. (1998). The effects of query structure and dictionary setups in dictionary-based cross-language
information retrieval. In Proceedings of the Twenty-first Annual International ACM/SIGIR Conference (pp. 55–63).
Pirkola, A. (1999). Studies on linguistic problems and methods in text retrieval. Ph.D. Thesis, University of Tampere.
Pirkola, A., & Järvelin, K. (1996a). The effect of anaphor and ellipsis resolution on proximity searching in a text
database. Information Processing & Management, 32(2), 199–216.
Pirkola, A., & Järvelin, K. (1996b). Recall and precision effects of anaphor and ellipsis resolution in proximity
searching in a text database. In P. Ingwersen, & N. O. Pors, Proceeding CoLIS2, Second International Conference
on Conceptions of Library and Information Science: Integration in Perspective. Copenhagen, October 13–16, 1996
(pp. 459–475). Copenhagen: The Royal School of Librarianship.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130–137.
Sanderson, M. (1994). Word sense disambiguation and information retrieval. In Proceedings of the Seventeenth
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
(pp. 142–151).
Sanderson, M. (1997). Word sense disambiguation and information retrieval. Ph.D. Thesis, University of Glasgow,
Department of Computing Science.
Schütze, H., & Pedersen, J. O. (1995). Information retrieval based on word senses. In Proceedings of the Symposium
on Document Analysis and Information Retrieval (pp. 161–175).
Strzalkowski, T. (1994). Robust text processing in automated information retrieval. Proceedings of the Fourth
Conference on Applied Natural Language Processing, 168–173. Stuttgart, Germany: Association for
Computational Linguistics. Reprinted in K. Sparck Jones & P. Willet (Eds.). Readings in Information Retrieval
(1997). San Francisco, CA: Morgan Kaufmann Publishers.
Strzalkowski, T. (1995). Natural language information retrieval. Information Processing & Management, 31(3),
397–417.
Teleman, V., Hellberg, S., & Andersson, E. (1999). Svenska Akademiens grammatik 1–4. Stockholm: Svenska
Akademien [Grammar of the Swedish Academy 1–4].
Vorhees, E. M. (1993). Using WordNet to disambiguate word senses for text retrieval. In Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
(pp. 171–180).