Stemming of French Words Based on Grammatical Categories

Jacques Savoy
Université de Montréal, Département d'Informatique et de Recherche Opérationnelle, P.O. Box 6128, Station A, Montreal, Quebec H3C 3J7, Canada

Automatic indexing systems use suffix stripping algorithms to cluster various words derived from a common root under the same stem. Currently, removing affixes is either a context-free or a context-sensitive operation, where the context refers to the remaining stem. In this article, we propose a suffixing algorithm which uses grammatical categories to enhance the stemming process. This approach supports the use of foreign languages. In our case, the language is French, and a morphological analysis is required for removing inflectional suffixes or morphosyntactic variants of a lemma. After this analysis, we implement a suffix stripping algorithm which uses a dictionary and the grammatical categories to remove derivational suffixes. Our approach always returns a linguistically correct lemma, but not necessarily the “right” one. Based on our tests, this solution is an attractive one, with a mean error rate of 16%. We finish by explaining why we cannot expect significantly better results with this approach.

Received October 17, 1991; revised January 27, 1992; accepted April 9, 1992.
© 1993 by John Wiley & Sons, Inc.
JASIS: Journal of the American Society for Information Science. 44(1):1-9, 1993. CCC 0002-8231/93/010001-09

Introduction

A stemming algorithm reduces inflectional and derivational variants of words to a common form. For example, the words “thinking,” “thinkers,” or “thinks” are reduced to the stem “think.” To be precise, the root of a word is obtained by removing both suffixes and prefixes; the stem is obtained by deleting only the suffixes. In information retrieval, grouping words having the same root under the same stem (or indexing term) will increase the success of matching documents to a query (Harman, 1991; van Rijsbergen, 1979, chap. 2).

Linguists are more interested in finding the “right” separation between the root and its affixes, but sometimes they pay more attention to the suffix than to the root itself. In fact, the suffix represents information about the grammatical function of a word, and thus provides clues helping the syntactical analysis of a sentence.

In information retrieval, the stems are often nonlinguistic elements such as the stem “appreci” from “appreciate.” This approach is unsatisfactory for linguists who are interested in finding a linguistically correct lemma. In this article, we describe the problem of removing inflectional suffixes in the French language, which leads to the proposal of another form of suffixing algorithm based on a dictionary and grammatical categories. This results in a process which always returns a linguistically correct stem. However, in information retrieval, linguistic correctness is regarded as irrelevant. Thus, in this context, the main criterion is that semantically equivalent words be merged or “conflated” to the same form and that semantically distinct words remain separate. Our solution tries to reduce words to their shortest root and thus it should satisfy both linguists and information scientists.

Current Solutions

Various suffix stripping algorithms have already been proposed, ranging from a weak stemmer which removes plural inflections (and also perhaps the past participle “-ed” and the gerund or present participle “-ing”), to more sophisticated schemes designed to remove suffixes and even prefixes. The design of such procedures is based mainly on one of two principles: (1) removing the longest match suffix; and (2) iterating over a set of predefined classes of suffixes. In the latter case, the various suffixes are classified according to derivation rules (e.g., the first class groups plural inflections together with the suffixes “-ed” and “-ing”).
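The two design principles can be made concrete with a short sketch. The Python below is illustrative only: the suffix lists are tiny, hypothetical samples chosen to match the English examples in the text, the function names are invented here, and neither function reproduces any published stemmer.

```python
# Toy suffix lists (assumptions for illustration, not taken from a real stemmer).
SUFFIXES = ["s", "ed", "ing", "er", "ers"]
SUFFIX_CLASSES = [
    ["s", "ed", "ing"],  # first class: plural inflection plus "-ed" and "-ing"
    ["er", "ness"],      # second class: a few derivational endings
]


def strip_longest_match(word, suffixes=SUFFIXES):
    """Principle (1): remove the longest listed suffix that ends the word."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def strip_by_classes(word, classes=SUFFIX_CLASSES):
    """Principle (2): visit the classes in order, removing at most one suffix per class."""
    for suffixes in classes:
        word = strip_longest_match(word, suffixes)
    return word


for w in ("thinking", "thinkers", "thinks", "ring"):
    print(w, strip_longest_match(w), strip_by_classes(w))
```

With these toy lists both functions reduce “thinking,” “thinkers,” and “thinks” to “think.” They also reduce “ring” to “r,” because nothing about the remaining stem is checked before the suffix is removed.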
Other approaches may exist, such as the stemmer proposed by Paice (1990), which is iterative but does not have its endings classified. Based on one of these two principles, the system finds a suffix and the stripping operation is done without further consideration (context-free). However, this scheme leads to a significant error rate (van Rijsbergen, 1979). To produce “better” stems, and even to find the “right” root, a context-sensitive approach adds constraints to the stripping operation. Three general types of constraints are proposed:

(1) Quantitative constraints: the length of the remaining stem must exceed a given number (e.g., from the word “ring” and the suffix “-ing,” the system cannot derive “r” because a minimum length of two characters is required).
(2) Qualitative constraints: the stem ending must satisfy a given condition (e.g., it does not end with “e” when the suffix “-ize” is removed. In this case, “seize” minus “-ize” is not allowed).
(3) Recoding rules: spelling or adjustment rules must be used to improve the accuracy of conflation of the stems produced by the suffix stripping algorithm.
(3.1) Remove one of double “b,” “d,” “g,” “m,” “n,” “p,” “r,” “s,” or “t” at the end of the stem (e.g., “hopping” minus “-ing” gives “hopp,” and, after correction, “hop”; “admittance” less “-ance” gives “admitt,” which is modified to “admit”).
(3.2) Turn terminal “d,” “r,” “t,” “z” into “s” (the previous stem “admit” is transformed into “admis”; thus, “admittance” and “admission” will be reduced to the same stem “admis”). This last example shows that the recoding rules are ordered.
(3.3) Change “-rpt” into “-rb” (“absorption” minus “-ion” gives “absorpt,” and “absorbing” minus “-ing” gives “absorb.” The recoding rule transforms “absorpt” into “absorb,” and thus, we find the common root.)

Lovins’ algorithm (1968) is based on the longest match principle, and uses a list of 260 endings. After a suffix is removed, the stem is compared with a set of 34 recoding rules to take into account spelling exceptions. For Paice (1990), only one table containing both deletion and replacement rules is needed. Porter’s (1980) suffixing procedure operates in five stages, using five different classes of suffixes to simulate the inflectional and derivational process of words. Some of Porter’s rules do not actually delete a suffix but they are required to recode the stem; for example, words ending with “-anci” are transformed with the new ending “-ance”; thus, “hesitanci” gives “hesitance.”

Using a stemming algorithm does not improve search effectiveness in all circumstances (Harman, 1991). However, the storage occupied by the inverted files is reduced because various forms are reduced to a unique stem (Salton, 1989; van Rijsbergen, 1979). Stemming procedures may also be useful for statistical analysis of a corpus; e.g., word-frequency or lemma-frequency counts (Muller, 1985). Of course, other statistical analyses would suffer from stemming; e.g., statistics of word co-occurrences, concordance analysis or KWIC (keyword-in-context) (see Smith 1990, Chap. 8). Should we use the above schemes for other languages, and, in our context, for French?

A Lemmatizer for French Texts

Plural inflection of the English noun is simple, since most English nouns take “-s.” Of course, one can also find exceptions; for example, nouns that end with “-ss,” “-x,” “-o,” “-sh,” and some in “-ch,” take “-es” (e.g., “boss, bosses,” “box, boxes,” “dish, dishes,” etc.). More complex rules exist but they occur more rarely (“child, children,” “mouse, mice,” etc.). Beale (1987) proposes an automatic lemmatizer for English words in which exceptions are handled by a special word list. For example, the rule “-nging → -ng” is well adapted for “belonging” or “bringing,” but not for “changing,” which appears in an exception list.

In French, nouns, adjectives, pronouns, articles, etc., are declined according to gender (masculine, feminine) and number (singular, plural). For verbs, we have to add person and tense. Overall, however, French has more numerous and more complex inflections than English [e.g., the verb «être» (to be) has 40 different forms]. In the current study, we use the symbols «...» for French words, affixes, and stems. In German, the task is more complex because the gender can be masculine, feminine, or neuter, the grammar includes cases (nominative, accusative, genitive, and dative case) and, moreover, different suffixes are used with each combination of gender and case.

French has a large number of irregularities in both morphology and orthography. A weaker stemmer for French, using a context-free removal procedure, will require a large table to represent the set of all French inflectional suffixes (around 3,000). The main problem is not the elaboration or the management of this table; but with this large size, an unacceptable number of conflation errors occurs. For example, Table 1 (top) shows the results when removing the most common inflectional suffix in French («-s»). The various forms of the adjective «neuf» (new) cannot be reduced to the same root. In considering all inflectional suffixes and removing the longest match suffix, the various forms of the adjective «neuf» are now conflated to the same nonlinguistic stem «neu» in Table 1 (bottom). But semantically distinct words will also be reduced to the same stem. Some of the inflectional suffixes used in Table 1 (bottom) appear in Table 3.

TABLE 1. Examples of weaker stemmers in French.

Word                         Suffix    Stem

Removing the ending in «-s»
neuf (new)                             neuf
neufs (new)                  -s        neuf
neuve (new)                            neuve
neuves (new)                 -s        neuve

Context-free suffix removal with French words
neuf (new)                   -f        neu
neufs (new)                  -fs       neu
neuve (new)                  -ve       neu
neuves (new)                 -ves      neu
parle (he speaks)            -le       par
parlerai (I shall speak)     -erai     parl
parent (related)             -ent      par
parents (parents)            -ents     par
pares (you parry)            -es       par
paris (bets)                 -is       par
parois (walls)               -ois      par
parts (shares)               -ts       par
val (valley)                           val
vals (valleys)               -s        val
vallonement (undulation)     -ment     vallone
valves (valves)              -ves      val
valet (servant)              -et       val
valeurs (values)             -eurs     val
valais (was worth)           -ais      val

The problem of removing inflectional endings can be resolved using another point of view. The French dictionary is more stable than the English or German; a new word cannot be included without acceptance from the “Académie Française.” Contrary to Beale (1987), who built a lexicon based on a given corpus, our solution uses a given French dictionary. We begin the stemming of a word by a morphological analysis (Sabah, 1989), which requires the dictionary and a declension file. Our dictionary contains 52,627 entries (see Table 2 for examples). The declension file stores 100 entries for nouns, adjectives, pronouns, etc., and 132 entries for verbs (see Table 3).

TABLE 2. Fragments from our dictionary file.

aimer          verb    v3     infpre avoir    (to love)
monsieur       noun    n8     masc sing       (gentleman)
...
messieurs      noun    n9     masc plur       (gentlemen)
...
moral          adje    n47    masc sing       (moral)
neuf           adje    n46    masc sing       (new)
nul            adje    n48    masc sing       (useless, nil)
robuste        adje    n5     masc sing       (robust)
robustement    adve                           (robustly)
robustesse     noun    n4     femi sing       (robustness)

For example, the dictionary entry for «neuf» indicates that this adjective uses declension number 46, and is in the masculine and singular form. To generate all correct forms for this adjective, linking “n46” to the declension file gives the needed information. In this file, we first determine the final character(s) (or nil, represented by a point «.») to be removed to obtain the radix of the word, and then the various inflections.
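The link between a dictionary entry and the declension file can be sketched as follows. This is a minimal illustration and not the system's actual file format: the `DECLENSIONS` mapping, its tuple layout, and the function name `inflect` are assumptions made here, and the three sample entries are a reading of declensions n46, n47, and n4.

```python
FORMS = ("masc sing", "femi sing", "masc plur", "femi plur")

# Assumed in-memory layout: declension number -> (ending to strip, one inflection
# per entry of FORMS). "." stands for nil and None for a form that does not exist.
DECLENSIONS = {
    "n46": ("f", ("f", "ve", "fs", "ves")),  # e.g. «neuf»
    "n47": ("l", ("l", "le", "ux", "les")),  # e.g. «moral»
    "n4": (".", (None, ".", None, "s")),     # feminine nouns, e.g. «robustesse»
}


def inflect(lemma, declension):
    """Return every inflected form of a dictionary entry, keyed by gender and number."""
    strip, endings = DECLENSIONS[declension]
    if strip != "." and not lemma.endswith(strip):
        raise ValueError(f"{lemma!r} does not end with {strip!r}")
    # Remove the final character(s) to obtain the radix, then append each inflection.
    radix = lemma if strip == "." else lemma[: -len(strip)]
    return {
        form: radix + ("" if ending == "." else ending)
        for form, ending in zip(FORMS, endings)
        if ending is not None
    }


print(inflect("neuf", "n46"))
```

Under these assumptions, `inflect("neuf", "n46")` yields «neuf», «neuve», «neufs», and «neuves», while a feminine-only declension such as n4 yields just the two feminine forms.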
For the word «neuf», we delete the final «-f» and add «-ve» to form the feminine singular («neuve»). In Table 3, a line beginning with «#» is a comment, and the slash «/» means no field. For example, declension 4 does not allow for a masculine form.

TABLE 3. Examples from our declension file.

#number   ending   masc sing   femi sing   masc plur   femi plur
n4        .        /           .           /           s
n5        .        .           .           s           s
...
n46       f        f           ve          fs          ves
n47       l        l           le          ux          les
n48       .        .           le          s           les
...

#number   ending
v3        er
#tense    1st sing   2nd sing   3rd sing   1st plur   2nd plur   3rd plur
ipr       e          es         e          ons        ez         ent
iim       ais        ais        ait        ions       iez        aient
imp       /          e          /          ons        ez         /
...
v4        uer
ipr       ue         ues        ue         uons       uez        uent
...

The morphological analysis process works in reverse; it starts with word endings. Thus, our declension file cannot be directly helpful, but from it, we have built a truncated digital-search tree (Salton, 1989, sect. 7.6). The information attached to each node represents the possible endings found in the path from that node back to the root. For example, Figure 1 shows the endings of nouns and adjectives given in Table 3 (declensions n4, n5, n46, n47, and n48). If a node does not correspond to a final inflectional suffix, no constraints exist (e.g., «x» in Fig. 1). The root node does not have conditions because the words will be in the dictionary (e.g., «monsieur» and «messieurs» in Table 2). One can find all inflections of a given language in such a tree, and, in our experiment, the tree is composed of 3,013 nodes.

The morphological analysis can be better explained by considering an example. In its analysis of the word «neuves», the computer first tries to find the word «neuves» in the dictionary, and this attempt fails. After that, characters are removed one by one from the end of the word. Thus, the computer takes out the final character («s») and tries to find the word «neuve» in the dictionary. This goes on until the computer reaches the node corresponding to the path «sev». At this point, the declension stored at the node is “n46,” for which the ending character is «-f» (see Table 3). The computer removes the ending «-ves» and adds the ending «-f» to form the word «neuf». This word is finally found in the dictionary (Table 2), and the information stored in the dictionary at the entry «neuf» (declension n46) agrees with the conditions stored at the node reached. This scheme is very general and for other languages we merely have to change the dictionary and the declension file.

Morphological analysis removes the inflectional suffix (e.g., plural forms), the past participle is analyzed, and the infinitive form of the verb is returned. Incidentally, we can build a stop list based on grammatical categories, by removing, for example, the definite article, indefinite article, conjunction, preposition, personal pronoun, possessive pronoun, etc. (“the,” “an,” “and,” “over,” “them,” “their,” etc.). This method is more interesting in French because we find more forms than in English. For example, the word “which” is considered a nonsignificant word and it appears in stop lists; for example, Fox (1990) or van Rijsbergen (1979, pp. 18-19). This pronoun can be translated into nine different pronouns in French («lequel», «laquelle», «auxquels», «desquelles», ...), “mine” has four corresponding forms («mien», «mienne», «miens», «miennes») and the verb “to have” 43 forms. Such a stop list based on grammatical categories is already used in our project (Savoy, 1992).

A Stemmer Based on Grammatical Categories

If we limit the stemming process to remove only inflectional endings, derived words will never reduce to the same stem (e.g., “robust” and “robustness”). To reduce these variations under the same stem, we have to consider the role of derivational suffixes.
In English, such suffixes are used to change the grammatical category of a word; for example, “white” gives “whiteness.” In other languages, one of the most important roles of derivational suffixes is to transform a word from one gender to another; English also presents gender variants such as “count,” which gives the feminine “countess.” Finally, they are used to slightly modify the meaning of a root as, for example, “green” gives “greenish.”

The derivation of new words is not done arbitrarily, and some algorithms take this into consideration. In an iterating suffixing algorithm, the various tables of suffixes are ordered to simulate the derivation process. One might expect that the derivational suffixes precede the inflectional ones, but exceptions can be found; in the word “relatedness” the inflectional suffix “-ed” appears first. In Porter’s (1980) method, the word “related” is reduced to “relate” but the word “relatedness” gives the stem “related.”

FIG. 1. Example of a truncated digital-search tree built from our declension file.

For French texts, we also consider the derivational process and we have proposed the design of a suffix-stripping algorithm which has two stages. The first one corresponds to the inflectional analysis described previously. In the second phase, the derivational suffixes are removed according to the grammatical category, since this information was already obtained from the morphological analyzer. The declaration of all French suffixes was not done on an ad hoc basis; the elaboration of this list follows a linguistic analysis (Grevisse, 1988). We have formed four different tables of suffixes corresponding to the four grammatical categories (noun, adjective, verb, and adverb). Once we have a word’s grammatical category and its suffix, we may then find its stem and the grammatical category of this stem. Some examples will clarify the derivational process.
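Before turning to those examples, the second stage can be sketched in a few lines of Python. This is a hedged illustration, not the implemented system: the dictionary fragment, the `RULES` layout, and the name `strip_one_suffix` are hypothetical, and the handful of rules shown is borrowed from the examples of Tables 2 and 4 rather than from the full suffix tables.

```python
# Hypothetical dictionary fragment: lemma -> grammatical category.
DICTIONARY = {
    "robuste": "adje", "robustesse": "noun",
    "ample": "adje", "ampleur": "noun",
    "chercher": "verb", "chercheur": "noun",
    "discuter": "verb", "discutable": "adje",
    "val": "noun", "valable": "adje", "table": "noun",
}

# One suffix table per category of the derived word. Each rule holds the suffix,
# the category required for the stem, and the endings tried on the bare radix.
RULES = {
    "noun": [
        ("esse", "adje", ("", "e")),
        ("eur", "adje", ("", "e")),
        ("eur", "verb", ("", "er", "ir", "re")),
    ],
    "adje": [
        ("able", "verb", ("", "er", "ir", "re")),
    ],
}


def strip_one_suffix(word):
    """Remove one derivational suffix; return (stem, category of the stem)."""
    category = DICTIONARY.get(word)
    for suffix, stem_category, endings in RULES.get(category, ()):
        if not word.endswith(suffix):
            continue
        radix = word[: -len(suffix)]
        for ending in endings:
            candidate = radix + ending
            # Accept only a dictionary word of the category the rule demands.
            if DICTIONARY.get(candidate) == stem_category:
                return candidate, stem_category
    return word, category  # no rule applies: the word is left unchanged


for w in ("robustesse", "chercheur", "discutable", "valable", "table"):
    print(w, strip_one_suffix(w))
```

In this sketch «valable» and «table» are left untouched: the first because no verb matches the radix «val», the second because «table» is a noun and the «-able» rule is only listed for adjectives. Every stem returned is a dictionary entry, which is what keeps the result linguistically correct.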
(1) «-able» forms adjectives from verbs, like «discuter» (to discuss), whose radix is «discut», and whose suffix «-able» gives «discutable» (debatable);
(2) «-ique» forms adjectives from nouns; «volcan» plus «-ique» gives «volcanique» (volcano, volcanic). This suffix may have variants like «-istique» or «-atique»;
(3) «-eur» is used to obtain a feminine noun from an adjective [e.g., «blanche» (white) plus «-eur» gives the feminine noun «blancheur» (whiteness)].

Other suffixes produce verbs from nouns (e.g., «-iser», etc.). Adverbs are derived from adjectives, and more rarely from nouns, with the suffix «-ment». In Table 4 other examples are shown. Thus for each transformation rule, we attach conditions about the grammatical category of the resulting stem. For some suffixes, the conditions can also include restrictions about the gender or the length of the remaining stem.

TABLE 4. Examples of suffix derivation.

From         To           Suffix    Example
adjective    noun         -esse     robuste      robustesse
adjective    noun         -eur      ample        ampleur
adjective    adverb       -ment     étrange      étrangement
noun         adjective    -ique     volcan       volcanique
noun         verb         -iser     utile        utiliser
verb         adjective    -able     discuter     discutable
verb         noun         -eur      chercher     chercheur
verb         noun         -ement    renverser    renversement

The inclusion of grammatical categories can improve the effectiveness of the suffix-stripping algorithm. For example, the adjective «valable» (valid) cannot be derived from the noun «val» (valley) plus the suffix «-able», because this suffix forms an adjective from a verb; with the word «table», the system cannot remove the suffix «-able» because «table» is not an adjective.

Spelling Corrections and Adjustments

As in other algorithms, we have also included spelling corrections to obtain the linguistic lemma. For example, in Table 4, the word «ampleur» minus the suffix «-eur» does not produce «ampl», but «ample». Thus, with each rule, we have attached one or two characters to produce a new stem. For example, a rule can be defined as:

• Given a feminine noun ending with «-eur» (e.g., «ampleur»)
  - consider a corresponding adjective;
  - if none can be found (e.g., «ampl»), consider a corresponding adjective ending with «-e» (e.g., «ample»).

Or, for a verb:

• Given a noun ending with «-eur» (e.g., «chercheur»)
  - consider a corresponding verb;
  - if none can be found (e.g., «cherch»), consider a corresponding verb ending with «-er», «-ir», or «-re» (e.g., «chercher»).

In addition to these spelling adjustments, we sometimes have to slightly modify the radix. For example, the adjective «verdâtre» (greenish) minus the suffix «-âtre» does not give the adjective «vert» (green) and, in this case, the final character «d» must be transformed into a «t». In our algorithm, we have defined 35 spelling correction rules (see Table 5). The definition of this set of rules is the only part of our system which is ad hoc or based on experiments.

These correction rules are also used to correct the presence of accents. French accents indicate the precise pronunciation or sound and identify homographs (e.g., «où» means “where,” and «ou» “or”). A derived word does not necessarily have the same accent as its lemma. For example, the accent «è» can be transformed into an «é» as with «collégien» (schoolboy) derived from «collège» (school). The suffix derivation does not always incorporate the accent; for example, the adjective «aromatique» (aromatic) comes from the noun «arôme» (aroma). See Table 5 for examples and Savoy (1991) for all spelling corrections.

TABLE 5. Examples of spelling corrections.

Rule             Constraint    Example
*ess → *ès       length > 5    congress[iste]   →  congrès
*ββ  → *β        length > 5    paliss[ade]      →  palis
*al  → *el       length > 4    actual[ité]      →  actuel
*ch  → *c                      duch[esse]       →  duc
*ct  → *t        length > 4    délict[ueux]     →  délit
*ss  → *x                      touss[er]        →  toux
*d   → *t                      verd[âtre]       →  vert
*g   → *gu                     navig[able]      →  navigu[er]
*v   → *f                      veuv[age]        →  veuf
*é*  → *è*                     collég[ien]      →  collèg[e]
*é*  → *ê*                     extrém[ité]      →  extrêm[e]
*ç*  → *c*                     balanç[oire]     →  balanc[e]

* Any sequence (including nil) of characters. β represents a given character; the additional size constraint is checked before the transformation.

Examples

As a more complete example, we have derived all words from the common form «navig» and, after a spelling correction of the common root, «naviguer» (to sail). Table 6 shows the results obtained by our approach.

TABLE 6. Examples of suffixing.

Word                                Suffix      Removing One Suffix
navigabilité (seaworthiness)        -abilité    naviguer
navigable                           -able       naviguer
navigant (seagoing personnel)       -ant        naviguer
navigateur (sailor)                 -ateur      naviguer
navigation (sailing)                -ation      naviguer

The results of Table 6 are derived by the following rules. For the feminine noun «navigabilité», the computer considers the following rule:

• Given a feminine noun ending with «-abilité» (e.g., «navigabilité»)
  - consider a corresponding verb;
  - if none can be found, consider a corresponding verb ending with «-er», «-ir», or «-re».

After removing the suffix, the system obtains the stem «navig». In the dictionary, we cannot find a verb like «naviger», «navigir», or «navigre». The radix «navig» is passed through the spelling corrector. As shown in Table 5, this radix is transformed into «navigu», and the verb «naviguer» can be found in the dictionary. Finally, the resulting stem «naviguer» is reduced to the noun «navire» (ship), which is the final root (the suffix «-guer» is transformed into «-re»).

Removing Prefixes

Rules for removing prefixes are not as specific as those for suffixes. As Table 7 illustrates, the rules attached to prefix derivation are less stringent, since a prefix does not generally change the grammatical category of a word. In French, we encounter only a few examples, but the prefix may radically change the meaning of the word, and sometimes the semantics may be very different. For this reason, we do not suggest removing the prefixes for automatic indexing. For example, «prédire» (to foretell, to predict) is derived from «dire» (to say, to tell), and the meaning of “to tell before” is not directly related to “to foretell.” Regarding English words, Paice (1977, sect. 4.3.3) also suggests ignoring prefix removal for general texts. However, we may consider prefix stripping for technical subjects such as chemistry or medicine (e.g., DNA or desoxyribonucleic acid).

TABLE 7. Examples of prefix derivation.

                                          Example
From                        Prefix    From Verb to Verb    From Noun to Noun
noun or verb                pré-      préétablir           préretraite
verb, adjective, or noun    dé-       décharger            dénatalité
noun or verb                co-       coexister            codirecteur

Evaluation

Although in French we do not have test collections like CACM or CRANFIELD to study the retrieval effectiveness of our algorithm, our solution should be helpful to the linguist or computer scientist for whom the resulting stem is often a nonlinguistic element. Thus, to evaluate our scheme we have built three lists of French words where each list contains a set of words and their corresponding roots. However, the definition of the “correct” root is not always clear. For example, «communiste» (communist) gives «commune» but the ultimate root is «commun» (common). In our case, we have stored the shortest root.

To evaluate our algorithm, we have used three main tests. The first contained 50 words presenting only inflectional suffixes (plural, past participle, etc.), and was designed to evaluate the removing of inflectional endings (weak stemmer).
According to results shown in Table 8, the morphological analysis is done correctly. An error in this experiment would indicate the presence of an incorrect coding in our dictionary or declension file. The input lists for the further tests will not contain inflectional suffixes.

In a second test, we tried removing the prefixes. To do this we built a list of 73 words having real prefixes and 73 counterexamples (words beginning with the same characters but which are not a prefix). The incorrect results of our system are reported in Table 9A. As mentioned previously, removing prefixes is an operation to be considered by linguists and not by information scientists.

Finally, 402 words with only derivational suffixes were used to test our suffix-stripping solution. These words are derived from various roots in order to represent most of the French derivational suffixes. In removing these derivational suffixes, the rules are defined according to a linguistic analysis (Grevisse, 1988); therefore, they are not formulated on an ad hoc basis or given by experiments. The results of our tests are listed in Table 8, and examples of incorrect results are given in Table 9B. The three lists of words in Table 8 are established to reflect most of the inflections and all derivational suffixes used in French. They contain both regular and abnormal examples extracted mainly from Grevisse (1988).

TABLE 8. Results according to our five experiments.

Experiment                            Size    Correct    Success Rate
1. Weak stemmer                       50      50         100%
2. Prefix stripping                   146     139        95.2%
3. Suffix stripping                   402     338        84.1%
4. Without grammatical categories     402     322        80.1%
5. Without spelling corrections       402     291        72.4%

Two additional tests have been done to evaluate the impact of using grammatical categories and spelling correction. As shown in Table 8, combining all suffixes in one table and ignoring grammatical categories does not produce satisfactory results (a decrease of 4%). However, this scheme does not correspond to Lovins’ (1968) algorithm, because the stemming is guided by a dictionary and always produces a linguistically correct lemma. Some examples of suffix stripping using the longest match principle are given in Table 1. The fifth test showed that the inclusion of spelling correction rules may improve the solution by up to 11.7%. This suffix-stripping algorithm is already used with more general French texts (Savoy, 1992) and the conflation of semantically distinct words seems to confirm the present results.

TABLE 9A. Prefix removing errors.

Word           Expected Stem    Resulting Stem
coefficient    efficient        coefficient
début          début            but
discours       discours         cours
méchant        méchant          chant
ménager        ménager          nager
mélanger       mélanger         langer
retable        retable          table

TABLE 9B. Examples of suffixing errors.

Word            Expected Stem    Resulting Stem    Error Type
ailé            aile             ail               overstemming
aisément        aise             ais               overstemming
boxer           boxe             box               overstemming
courser         course           cours             overstemming
crachat         cracher          crac              overstemming
maquisard       maquis           maque             overstemming
marquisat       marquis          marque            overstemming
paperasse       papier           pw                overstemming
périssable      périr            ptre              overstemming
secrétariat     secrétaire       secrète           overstemming
valetaille      valet            val               overstemming

féminisme       femme            féminin           understemming
humanitaire     homme            humain            understemming

chandelier      chandelle        chandelier        unchanged
chevelure       cheveu           chevelure         unchanged
lisible         lire             lisible           unchanged
ruelle          rue              ruelle            unchanged
sprinter        sprint           sprinter          unchanged

aileron         aile             ail               other
bruyance        bruit            bru               other
chauffard       chauffeur        chauffe           other
écrivailler     écrire           écrier            other
lapereau        lapin            laper             other
lavasse         laver            lave              other
linguistique    langue           linge             other

Analysis of Errors

For a better comprehension of errors, we have divided them into four classes. The first class represents overstemming errors, where, after finding the expected root, the algorithm carries on removing suffixes. For example, «colonisation» (colonization) comes from «coloniser» (to colonize), which derives from «colon» (colonist). However, this last word has nothing to do with «col» (collar or pass). If the stemming process stops before reaching the appropriate root, we encounter an understemming error. For example, «féminisme» (feminism) gives «féminin» (feminine) and not «femme» (woman). When the suffix-stripping algorithm does not modify a word, we place this error in the “nonstemming” class, which represents a special case of an understemming error. Miscellaneous errors form the last class, which in this case occur during the suffix stripping; the algorithm follows an incorrect path within which it never finds the expected root.

TABLE 10. Error distribution.

Experiment    Overstemming    Understemming    Nonstemming    Other Errors
2             6 (85.7%)       0 (0%)           1 (14.3%)      0 (0%)
3             36 (56.3%)      2 (3.1%)         14 (21.9%)     12 (18.8%)
4             43 (53.8%)      1 (1.3%)         7 (8.8%)       29 (36.3%)
5             24 (21.6%)      13 (11.7%)       69 (62.2%)     5 (4.5%)

The results of Table 10 show that overstemming represents the main source of error. When the system does not consider grammatical categories (experiment 4), the number of overstemmings increases, and the resulting stems (even the wrong ones) are shorter than the stems of test three (suffix stripping). Thus, a consideration of grammatical categories can be viewed as a way of restraining overstemming.

We should be able to improve our results a little by progressive elimination of errors from our dictionary and declension files. However, for various reasons, we can only expect limited improvements. First, the main source of errors comes from our rules and spelling adjustments. A given rule may almost always produce the correct answer, but an incorrect lemma is sometimes derived. For example, from the noun «traîneau» (sleigh), the correct stem is «traîne» ([robe] train), but from the noun «manteau» (coat), the stem «mante» (mantis) is incorrect.

Second, to retrieve the correct lemma, we sometimes have to consider the Latin or Greek etymon of the word. For example, the correct stem for the adjective «simiesque» (monkey-like) is «singe» (monkey); but «simiesque» comes directly from the Latin word «simius». The word «luminescent» (luminescent) is derived from the Latin word «lumen, luminis» and not from the French noun «lumière» (light). Different forms of a given root may exist because they come from different Latin forms; for example, «natal» (native, Latin root «natalis») and «naître» (to be born, Latin root «nasci»). If our previous example of “sailor, sailing, ...” confirms that the terms “concept” and stem are synonymous, then the current example clearly demonstrates the limits of such an assessment. Of course, we may introduce the correct stem for each word in our dictionary and the stemming process can be reduced to a table lookup.

Third, some derivations are very irregular, especially for compound words. For example, the adjective «moyenâgeux» (medieval) is derived from «moyen âge» (Middle Ages), and the sport «ping-pong» has given the noun «pongiste» (table tennis player).

Fourth, the derivation of a word is sometimes produced by simultaneously adding a prefix and a suffix. For example, «débarquer» (to land) comes from the noun «barque» (small boat), and in this case, the words «débarque» or «barquer» do not exist in French. Our solution is unable to handle this exception because it cannot simultaneously remove a prefix and a suffix. Other words are formed by combining two existing words (e.g., “motor” plus “hotel” gives “motel”).

Finally, for a given word, the morphological analysis can be ambiguous, and relating it to its corresponding dictionary entry is not an easy task. For example, the word «boxer» can be a verb (to box) or a dog.
In the former case, the correct root will be «boxe» (boxing), and in the latter «boxer». The disambiguation problem is beyond the scope of the present study; this task is difficult even for human beings, as shown in Choueka and Lusignan (1985).

Summary

The design of suffix-stripping algorithms may be based on the longest match suffix or on a set of predefined classes of endings. In both cases, only the sequence of characters determines the result. In this study, we have shown that these solutions may be difficult to implement for other languages, particularly French. We have shown how to remove inflectional suffixes from French words and described a morphological analysis which is more complex than for English. We have also explained how to implement another stemming algorithm based on a dictionary and grammatical categories. We have implemented this solution and tested its performance. This study merely outlines our general approach. For more details, see Savoy (1991).

Acknowledgments

I wish to thank Georges Côté Jr., Serge Simard, and Daniel Moïse, students at our department, for their participation in the implementation of the morphological analyzer and the stemming algorithm. I also thank Professor C. D. Paice of Lancaster University for suggestions and helpful remarks.

References

Beale, A. (1987). Towards a distributional lexicon. In R. Garside, G. Leech, & G. Sampson (Eds.), The computational analysis of English: A corpus-based approach (pp. 149-162). London: Longman.
Choueka, Y., & Lusignan, S. (1985). Disambiguation by short contexts. Computers and the Humanities, 19, 147-157.
Fox, C. (1990). A stop list for general text. SIGIR Forum, 24, 19-35.
Grevisse, M., & Goosse, A. (1988). Le bon usage. Paris: Duculot.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42, 7-15.
Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 22-31.
Lovins, J. B. (1971). Error evaluation for stemming algorithms as clustering algorithms. Journal of the American Society for Information Science, 22, 28-40.
Muller, C. (1985). Langue française, linguistique quantitative, informatique. Geneva: Slatkine-Champion.
Paice, C. D. (1977). Information retrieval and the computer. London: McDonald & Jane's.
Paice, C. D. (1990). Another stemmer. SIGIR Forum, 24, 56-61.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.
van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths.
Sabah, G. (1989). L'intelligence artificielle et le langage: Processus de compréhension (vol. 2). Paris: Hermès.
Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.
Savoy, J. (1991, October). Stemming of French words. Département d'informatique et de recherche opérationnelle, #793, Université de Montréal, p. 48.
Savoy, J. (1992). Bayesian inference networks and spreading activation in hypertext systems. Information Processing & Management, 28, 389-406.
Smith, P. D. (1990). An introduction to text processing. Cambridge, MA: The MIT Press.

JASIS: Journal of the American Society for Information Science, January 1993