Stemming of French words based on grammatical categories

Jacques Savoy
Université de Montréal, Département d’Informatique et de Recherche Opérationnelle, P.O. Box 6128, Station A, Montreal, Quebec H3C 3J7 Canada

Automatic indexing systems use suffix stripping algorithms to cluster various words derived from a common root under the same stem. Currently, removing affixes is either a context-free or a context-sensitive operation, where the context refers to the remaining stem. In this article, we propose a suffixing algorithm which uses grammatical categories to enhance the stemming process. This approach supports languages other than English. In our case, the language is French, and a morphological analysis is required for removing inflectional suffixes or morphosyntactic variants of a lemma. After this analysis, we implement a suffix stripping algorithm which uses a dictionary and the grammatical categories to remove derivational suffixes. Our approach always returns a linguistically correct lemma, but not necessarily the “right” one. Based on our tests, this solution is an attractive one, with a mean error rate of 16%. We finish by explaining why we cannot expect significantly better results with this approach. © 1993 John Wiley & Sons, Inc.
Received October 17, 1991; revised January 27, 1992; accepted April 9, 1992.
JASIS: Journal of the American Society for Information Science. 44(1):1-9, 1993. CCC 0002-8231/93/010001-09

Introduction
A stemming algorithm reduces inflectional and derivational variants of words to a common form. For example, the words “thinking,” “thinkers,” or “thinks” are reduced to the stem “think.” To be precise, the root of a word is obtained by removing both suffixes and prefixes; the stem is obtained by deleting only the suffixes. In information retrieval, grouping words having the same root under the same stem (or indexing term) will increase the success of matching documents to a query (Harman, 1991; van Rijsbergen, 1979, chap. 2).
Linguists are more interested in finding the “right” separation between the root and its affixes, but sometimes they pay more attention to the suffix than to the root itself. In fact, the suffix represents information about the grammatical function of a word, and thus provides clues helping the syntactical analysis of a sentence. In information retrieval, the stems are often nonlinguistic elements such as the stem “appreci” from “appreciate.” This approach is unsatisfactory for linguists who are interested in finding a linguistically correct lemma.
In this article, we describe the problem of removing inflectional suffixes in the French language, which leads to the proposal of another form of suffixing algorithm based on a dictionary and grammatical categories. This results in a process which always returns a linguistically correct stem. However, in information retrieval, linguistic correctness is regarded as irrelevant. Thus, in this context, the main criterion is that semantically equivalent words be merged or “conflated” to the same form and that semantically distinct words remain separate. Our solution tries to reduce words to their shortest root and thus it should satisfy both linguists and information scientists.

Current Solutions
Various suffix stripping algorithms have already been proposed, ranging from a weak stemmer which removes plural inflections (and also perhaps the past participle “-ed” and the gerund or present participle “-ing”), to more sophisticated schemes designed to remove suffixes and even prefixes. The design of such procedures is based mainly on one of two principles: (1) removing the longest match suffix; and (2) iterating over a set of predefined classes of suffixes. In the latter case, the various suffixes are classified according to derivation rules (e.g., the first class groups plural inflections together with the suffixes “-ed” and “-ing”). Other approaches may exist, such as the stemmer proposed by Paice (1990), which is iterative but does not have its endings classified.
Based on one of these two principles, the system finds a suffix and the stripping operation is done without further consideration (context-free). However, this scheme leads to a significant error rate (van Rijsbergen, 1979). To produce “better” stems, and even to find the “right” root, a context-sensitive approach adds constraints to the stripping operation. Three general types of constraints are proposed:
(1) Quantitative constraints: the length of the remaining stem must exceed a given number (e.g., from the word “ring” and the suffix “-ing,” the system cannot derive “r” because a minimum length of two characters is required).
(2) Qualitative constraints: the stem ending must satisfy a given condition (e.g., it does not end with “e” when the suffix “-ize” is removed. In this case, “seize” minus “-ize” is not allowed).
(3) Recoding rules: spelling or adjustment rules must be used to improve the accuracy of conflation of the stems produced by the suffix stripping algorithm.
(3.1) Remove one of double “b,” “d,” “g,” “m,” “n,” “p,” “r,” “s,” or “t,” at the end of the stem (e.g., “hopping” minus “-ing” gives “hopp,” and, after correction, “hop”; “admittance” less “-ance” gives “admitt” which is modified to “admit”).
(3.2) Turn terminal “d,” “r,” “t,” “z” into “s” (the previous stem “admit” is transformed into “admis”; thus, “admittance” and “admission” will be reduced to the same stem “admis”). This last example shows that the recoding rules are ordered.
(3.3) Change “-rpt” into “-rb” (“absorption” minus “-ion” gives “absorpt,” and “absorbing” minus “-ing” gives “absorb.” The recoding rule transforms “absorpt” into “absorb,” and thus, we find the common root.)
Lovins’ algorithm (1968) is based on the longest match principle, and uses a list of 260 endings. After a suffix is removed, the stem is compared with a set of 34 recoding rules to take into account spelling exceptions. For Paice (1990), only one table containing both deletion and replacement rules is needed. Porter’s (1980) suffixing procedure operates in five stages, using five different classes of suffixes to simulate the inflectional and derivational process of words. Some of Porter’s rules do not actually delete a suffix but they are required to recode the stem; for example, words ending with “-anci” are transformed with the new ending “-ance”; thus, “hesitanci” gives “hesitance.”
Using a stemming algorithm does not improve search effectiveness in all circumstances (Harman, 1991). However, the storage occupied by the inverted files is reduced because various forms are reduced to a unique stem (Salton, 1989; van Rijsbergen, 1979). Stemming procedures may also be useful for statistical analysis of a corpus; e.g., word-frequency or lemma-frequency counts (Muller, 1985). Of course, other statistical analyses would suffer from stemming; e.g., statistics of word co-occurrences, concordance analysis or KWIC (keyword-in-context) (see Smith 1990, Chap. 8).

A Lemmatizer for French Texts
Should we use the above schemes for other languages, and, in our context, for French? Plural inflection of the English noun is simple, since most English nouns take “-s.” Of course, one can also find exceptions; for example, nouns that end with “-ss,” “-x,” “-o,” “-sh,” and some in “-ch,” take “-es” (e.g., “boss, bosses,” “box, boxes,” “dish, dishes,” etc.). More complex rules exist but they occur more rarely (“child, children,” “mouse, mice,” etc.). Beale (1987) proposes an automatic lemmatizer for English words in which exceptions are handled by a special word list. For example, the rule “-nging → -ng” is well adapted for “belonging” or “bringing,” but not for “changing,” which appears in an exception list.
In French, nouns, adjectives, pronouns, articles, etc., are declined according to gender (masculine, feminine) and number (singular, plural). For verbs, we have to add person and tense. Overall, French has more numerous and more complex inflections than English [e.g., the verb «être» (to be) has 40 different forms]. In the current study, we use the symbols «...» for French words, affixes, and stems. In German, the task is more complex because the gender can be masculine, feminine, or neuter, the grammar includes cases (nominative, accusative, genitive, and dative) and, moreover, different suffixes are used with each combination of gender and case.
French has a large number of irregularities in both morphology and orthography. A weak stemmer for French, using a context-free removal procedure, will require a large table to represent the set of all French inflectional suffixes (around 3,000). The main problem is not the elaboration or the management of this table; rather, with a table of this size, an unacceptable number of conflation errors occurs. For example, Table 1 (top) shows the results when removing the most common inflectional suffix in French («-s»). The various forms of the adjective «neuf» (new) cannot be reduced to the same root. In considering all inflectional suffixes and removing the longest match suffix, the various forms of the adjective «neuf» are now conflated to the same nonlinguistic stem «neu» in Table 1 (bottom). But semantically distinct words will also be reduced to the same stem. Some of the inflectional suffixes used in Table 1 (bottom) appear in Table 3.

TABLE 1. Examples of weaker stemmers in French.
Word                        Suffix    Stem
Removing the ending in «-s»
neuf (new)                            neuf
neufs (new)                 -s        neuf
neuve (new)                           neuve
neuves (new)                -s        neuve
Context-free suffix removal with French words
neuf (new)                  -f        neu
neufs (new)                 -fs       neu
neuve (new)                 -ve       neu
neuves (new)                -ves      neu
parle (he speaks)           -le       par
parlerai (I shall speak)    -erai     parl
parent (related)            -ent      par
parents (parents)           -ents     par
pares (you parry)           -es       par
paris (bets)                -is       par
parois (walls)              -ois      par
parts (shares)              -ts       par
val (valley)                          val
vals (valleys)              -s        val
vallonement (undulation)    -ment     vallone
valves (valves)             -ves      val
valet (servant)             -et       val
valeurs (values)            -eurs     val
valais (was worth)          -ais      val

The problem of removing inflectional endings can be resolved using another point of view. The French dictionary is more stable than the English or German; a new word cannot be included without acceptance from the “Académie Française.” Contrary to Beale (1987), who built a lexicon based on a given corpus, our solution uses a given French dictionary. We begin the stemming of a word by a morphological analysis (Sabah, 1989), which requires the dictionary and a declension file. Our dictionary contains 52,627 entries (see Table 2 for examples). The declension file stores 100 entries for nouns, adjectives, pronouns, etc., and 132 entries for verbs (see Table 3).

TABLE 2. Fragments from our dictionary file.
Entry          Category   Declension   Attributes      Gloss
aimer          verb       v3           infpre avoir    (to love)
monsieur       noun       n8           masc sing       (gentleman)
...
messieurs      noun       n9           masc plur       (gentlemen)
...
moral          adje       n47          masc sing       (moral)
neuf           adje       n46          masc sing       (new)
nul            adje       n48          masc sing       (useless, nil)
robuste        adje       n5           masc sing       (robust)
robustement    adve                                    (robustly)
robustesse     noun       n4           femi sing       (robustness)
For example, the dictionary entry for «neuf» indicates that this adjective uses declension number 46, and is in the masculine and singular form. To generate all correct forms for this adjective, linking “n46” to the declension file gives the needed information. In this file, we first determine the final character(s) (or nil, represented by a point «.») to be removed to obtain the radix of the word, and then, the various inflections. For the word «neuf», we delete the final «-f» and add «-ve» to form the feminine singular («neuve»). In Table 3, a line beginning with «#» is a comment, and the slash «/» means no field. For example, declension 4 does not allow for a masculine form.
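The generation of inflected forms from a dictionary entry and its declension can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `DICTIONARY`, `DECLENSIONS`, `FORMS`, and `generate_forms` are hypothetical, the empty string stands in for the nil marker «.», `None` stands in for the «/» (no field) marker, and the feminine inflections given for declension n4 are assumed for the example rather than taken from the actual declension file.

```python
# Hypothetical miniature of the dictionary and the declension file (not the real data).
DICTIONARY = {
    "neuf": {"category": "adje", "declension": "n46"},
    "robustesse": {"category": "noun", "declension": "n4"},
}

# Order of the inflection fields in each declension entry.
FORMS = ("masc sing", "femi sing", "masc plur", "femi plur")

DECLENSIONS = {
    # n46 follows the text: strip the final "f", then add f / ve / fs / ves.
    "n46": {"ending": "f", "inflections": ("f", "ve", "fs", "ves")},
    # n4 has no masculine form (None); the feminine inflections are assumed here.
    "n4": {"ending": "", "inflections": (None, "", None, "s")},
}


def generate_forms(lemma):
    """Return every inflected form of a dictionary entry, keyed by gender and number."""
    declension = DECLENSIONS[DICTIONARY[lemma]["declension"]]
    ending = declension["ending"]
    if not lemma.endswith(ending):
        raise ValueError(f"{lemma!r} does not end with {ending!r}")
    # Remove the ending to obtain the radix, e.g. "neuf" -> "neu".
    radix = lemma[: len(lemma) - len(ending)]
    return {
        form: radix + inflection
        for form, inflection in zip(FORMS, declension["inflections"])
        if inflection is not None  # skip fields marked as "no field"
    }


print(generate_forms("neuf"))
print(generate_forms("robustesse"))
```

For «neuf» the sketch yields the four forms «neuf», «neuve», «neufs», and «neuves»; for the feminine-only entry it returns no masculine forms at all.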
TABLE 3. Examples from our declension file.
#number   ending   masc sing   femi sing   masc plur   femi plur
n4        .        /           .           /           s
n5        .        .           .           s           s
...
n46       f        f           ve          fs          ves
n47       l        l           le          ux          les
n48       .        .           le          s           les
...
#number   ending
v3        er
#tense    1st sing   2nd sing   3rd sing   1st plur   2nd plur   3rd plur
ipr       e          es         e          ons        ez         ent
iim       ais        ais        ait        ions       iez        aient
imp       /          e          /          ons        ez         /
...
v4        uer
ipr       ue         ues        ue         uons       uez        uent
...

The morphological analysis process works in reverse; it starts with word endings. Thus, our declension file cannot be used directly, but from it, we have built a truncated digital-search tree (Salton, 1989, sect. 7.6). The information attached to each node represents the possible endings found in the path from that node back to the root. For example, Figure 1 shows the endings of nouns and adjectives given in Table 3 (declension n4, n5, n46, n47, and n48).
If a node does not correspond to a final inflectional suffix, no constraints exist (e.g., «x» in Fig. 1). The root node does not have conditions because the words will be in the dictionary (e.g., «monsieur» and «messieurs» in Table 2). One can find all inflections of a given language in such a tree, and, in our experiment, the tree is composed of 3,013 nodes.
The morphological analysis can be better explained by considering an example. In its analysis of the word «neuves», the computer first tries to find the word «neuves» in the dictionary, and this attempt fails. After that, characters are removed one by one from the end of the word. Thus, the computer takes out the final character («s») and tries to find the word «neuve» in the dictionary. This goes on until the computer reaches the node corresponding to the path «s-e-v». At this point, the declension stored at the node is “n46” and the ending character is «-f» (see Table 3). The computer removes the ending «-ves» and adds the ending «-f» to form the word «neuf». This word is finally found in the dictionary (Table 2), and the information stored in the dictionary at the entry «neuf» (declension n46) agrees with the conditions stored at the node reached.
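The reverse analysis of «neuves» can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the system's implementation: a plain nested-dictionary trie stands in for the truncated digital-search tree, the names `build_trie` and `analyze` are hypothetical, and the two declensions shown (in particular the endings listed for n47) are illustrative data only.

```python
# Hypothetical data: lemma -> declension number.
DICTIONARY = {"neuf": "n46", "moral": "n47"}

# Hypothetical data: declension -> (ending of the lemma, possible inflections).
DECLENSIONS = {
    "n46": ("f", ["f", "ve", "fs", "ves"]),
    "n47": ("l", ["l", "le", "ux", "les"]),
}


def build_trie(declensions):
    """Index every inflection by its characters read from the end of the word."""
    root = {"children": {}, "entries": []}
    for number, (ending, inflections) in declensions.items():
        for inflection in inflections:
            node = root
            for char in reversed(inflection):
                node = node["children"].setdefault(
                    char, {"children": {}, "entries": []}
                )
            # Reaching this node means a full inflection has been read.
            node["entries"].append((number, ending))
    return root


TRIE = build_trie(DECLENSIONS)


def analyze(word, trie=TRIE, dictionary=DICTIONARY):
    """Return the dictionary lemma of an inflected word, or None if none is found."""
    if word in dictionary:
        return word
    node = trie
    # Remove characters one by one from the end of the word.
    for i in range(len(word) - 1, -1, -1):
        node = node["children"].get(word[i])
        if node is None:
            return None  # no inflection continues along this path
        radix = word[:i]
        for number, ending in node["entries"]:
            candidate = radix + ending
            # The dictionary entry must agree with the declension stored at the node.
            if dictionary.get(candidate) == number:
                return candidate
    return None


for form in ("neuves", "neufs", "moraux", "table"):
    print(form, "->", analyze(form))
```

Checking the declension number against the dictionary entry is what prevents a word such as «table» from being reduced merely because it ends with the character sequence of some inflection.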
This scheme is very general, and for other languages we merely have to change the dictionary and the declension file.
Morphological analysis removes the inflectional suffix (e.g., plural forms); when a past participle is analyzed, the infinitive form of the verb is returned.
Incidentally, we can build a stop list based on grammatical categories, by removing, for example, the definite article, indefinite article, conjunction, preposition, personal pronoun, possessive pronoun, etc. (“the,” “an,” “and,” “over,” “them,” “their,” etc.). This method is more interesting in French because we find more forms than in English. For example, the word “which” is considered a nonsignificant word and it appears in stop lists; for example, Fox (1990) or van Rijsbergen (1979, pp. 18-19). This pronoun can be translated into nine different pronouns in French («lequel», «laquelle», «auxquels», «desquelles», ...), “mine” has four corresponding forms («mien», «mienne», «miens», «miennes») and the verb “to have” 43 forms. Such a stop list based on grammatical categories is already used in our project (Savoy, 1992).
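A stop list driven by grammatical categories, rather than by an enumeration of word forms, might look like the following Python sketch. The lexicon fragment, the category labels (including “relative pronoun,” which is assumed here for the forms of “which”), and the function name `remove_stop_words` are all hypothetical; in a full system the category would come from the morphological analyzer.

```python
# Hypothetical lexicon fragment: word form -> grammatical category.
LEXICON = {
    "le": "definite article",
    "un": "indefinite article",
    "et": "conjunction",
    "sur": "preposition",
    "lequel": "relative pronoun",
    "auxquels": "relative pronoun",
    "miennes": "possessive pronoun",
    "navire": "noun",
    "naviguer": "verb",
}

# Categories treated as nonsignificant (an assumed selection).
STOP_CATEGORIES = {
    "definite article",
    "indefinite article",
    "conjunction",
    "preposition",
    "personal pronoun",
    "possessive pronoun",
    "relative pronoun",
}


def remove_stop_words(tokens, lexicon=LEXICON, stop_categories=STOP_CATEGORIES):
    """Keep only the tokens whose grammatical category is not a stop category."""
    # Unknown words have no category (None) and are therefore kept.
    return [t for t in tokens if lexicon.get(t.lower()) not in stop_categories]


print(remove_stop_words(["Le", "navire", "et", "lequel", "naviguer"]))
```

With this design, every inflected form of a pronoun or article is filtered as soon as its category is known, so the nine French forms of “which” do not have to be listed one by one.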
FIG. 1. Example of a truncated digital-search tree built from our declension file.

A Stemmer Based on Grammatical Categories
If we limit the stemming process to remove only inflectional endings, derived words will never reduce to the same stem (e.g., “robust” and “robustness”). To reduce these variations under the same stem, we have to consider the role of derivational suffixes. In English, such suffixes are used to change the grammatical category of a word; for example, “white” gives “whiteness.” In other languages, one of the most important roles of derivational suffixes is to transform a word from one gender to another; English also presents gender variants such as “count” which gives the feminine “countess.” Finally, they are used to slightly modify the meaning of a root as, for example, “green” gives “greenish.”
The derivation of new words is not done arbitrarily, and some algorithms take this into consideration. In an iterative suffixing algorithm, the various tables of suffixes are ordered to simulate the derivation process. One might expect that the derivational suffixes precede the inflectional ones, but exceptions can be found; in the word “relatedness” the inflectional suffix “-ed” appears first. In Porter’s (1980) method, the word “related” is reduced to “relate” but the word “relatedness” gives the stem “related.”
For French texts, we also consider the derivational process and we have designed a suffix-stripping algorithm which has two stages. The first one corresponds to the inflectional analysis described previously. In the second stage, the derivational suffixes are removed according to the grammatical category, since this information was already obtained from the morphological analyzer.
The list of French suffixes was not drawn up on an ad hoc basis; its elaboration follows a linguistic analysis (Grevisse, 1988). We have formed four different tables of suffixes corresponding to the four grammatical categories (noun, adjective, verb, and adverb). Once we have a word’s grammatical category and its suffix, we may then find its stem and the grammatical category of this stem. Some examples will clarify the derivational process.
(1) «-able» forms adjectives from verbs, like «discuter» (to discuss), whose radix is «discut», and whose suffix «-able» gives «discutable» (debatable);
(2) «-ique» forms adjectives from nouns; «volcan» plus «-ique» gives «volcanique» (volcano, volcanic). This suffix may have variants like «-istique» or «-atique»;
(3) «-eur» is used to obtain a feminine noun from an adjective [e.g., «blanche» (white) plus «-eur» gives the feminine noun «blancheur» (whiteness)].
Other suffixes produce verbs from nouns (e.g., «-iser», etc.). Adverbs are derived from adjectives and more rarely from nouns, with the suffix «-ment». In Table 4 other examples are shown.

TABLE 4. Examples of suffix derivation.
From        To          Suffix    Example
adjective   noun        -esse     robuste      robustesse
adjective   noun        -eur      ample        ampleur
adjective   adverb      -ment     étrange      étrangement
noun        adjective   -ique     volcan       volcanique
noun        verb        -iser     utile        utiliser
verb        adjective   -able     discuter     discutable
verb        noun        -eur      chercher     chercheur
verb        noun        -ement    renverser    renversement

Thus for each transformation rule, we attach conditions about the grammatical category of the resulting stem. For some suffixes, the conditions can also include restrictions about the gender or the length of the remaining stem.
The inclusion of grammatical categories can improve the effectiveness of the suffix-stripping algorithm. For example, the adjective «valable» (valid) cannot be derived from the noun «val» (valley) plus the suffix «-able», because this suffix forms an adjective from the verb; with the word «table», the system cannot remove the suffix «-able» because «table» is not an adjective.

Spelling Corrections and Adjustments
As in other algorithms, we have also included spelling corrections to obtain the linguistic lemma. For example, in Table 4, the word «ampleur» minus the suffix «-eur» does not produce «ampl», but «ample». Thus, with each rule, we have attached one or two characters to produce a new stem. For example, a rule can be defined as:
• Given a feminine noun ending with «-eur» (e.g., «ampleur»)
  - consider a corresponding adjective;
  - if none can be found (e.g., «ampl»), consider a corresponding adjective ending with «-e» (e.g., «ample»).
Or, for a verb:
• Given a noun ending with «-eur» (e.g., «chercheur»)
  - consider a corresponding verb;
  - if none can be found (e.g., «cherch»), consider a corresponding verb ending with «-er», «-ir», or «-re» (e.g., «chercher»).
In addition to these spelling adjustments, we sometimes have to slightly modify the radix. For example, the adjective «verdâtre» (greenish) minus the suffix «-âtre» does not give the adjective «vert» (green) and, in this case, the final character «d» must be transformed into a «t». In our algorithm, we have defined 35 spelling correction rules (see Table 5). The definition of this set of rules is the only part of our system which is ad hoc or based on experiments.
These correction rules are also used to correct the presence of accents. French accents indicate the precise pronunciation or sound and identify homographs (e.g., «où» means “where,” and «ou» “or”). A derived word does not necessarily have the same accent as its lemma. For example, the accent «è» can be transformed into an «é» as with «collégien» (schoolboy) derived from «collège» (school). The suffix derivation does not always incorporate the accent; for example, the adjective «aromatique» (aromatic) comes from the noun «arôme» (aroma). See Table 5 for examples and Savoy (1991) for all spelling corrections.

Examples
As a more complete example, we have derived all words from the common form «navig» and, after a spelling correction of the common root, «naviguer» (to sail). Table 6 shows the results obtained by our approach.
The results of Table 6 are derived by the following rules. For the feminine noun «navigabilité», the computer considers the following rule:
• Given a feminine noun ending with «-abilité» (e.g., «navigabilité»)
  - consider a corresponding verb;
  - if none can be found, consider a corresponding verb ending with «-er», «-ir», or «-re».
After removing the suffix, the system obtains the stem «navig». In the dictionary, we cannot find a verb like «naviger», «navigir», or «navigre». The radix «navig» is passed through the spelling corrector. As shown in Table 5, this radix is transformed into «navigu», and the verb «naviguer» can be found in the dictionary. Finally, the resulting stem «naviguer» is reduced to the noun «navire» (ship), which is the final root (the suffix «-guer» is transformed into «-re»).

Removing Prefixes
Rules for removing prefixes are not as specific as those for suffixes. As Table 7 illustrates, the rules attached to prefix derivation are less stringent, since a prefix does not generally change the grammatical category of a word. In French, we encounter only a few examples, but the prefix may radically change the meaning of the word, and sometimes the semantics may be very different. For this reason, we do not suggest removing the prefixes for automatic indexing. For example, «prédire» (to foretell, to predict) is derived from «dire» (to say, to tell), and the meaning of “to tell before” is not directly related to “to foretell.” Regarding English words, Paice (1977, sect. 4.3.3) also suggests ignoring prefix removal for general texts. However, we may consider prefix stripping for technical subjects such as chemistry or medicine (e.g., DNA or desoxyribonucleic acid).
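The derivational rule applied to «navigabilité» in the Examples section, together with the category conditions and the spelling corrector, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the dictionary fragment, the rule table, the two spelling corrections (taken from the «navig» and «verd» examples), and the names `RULES`, `SPELLING`, `radix_variants`, and `strip_derivational_suffix` are hypothetical, and the gender condition on the noun is ignored for brevity.

```python
# Hypothetical dictionary fragment: lemma -> grammatical category.
DICTIONARY = {
    "naviguer": "verb",
    "chercher": "verb",
    "ample": "adjective",
    "val": "noun",
    "table": "noun",
}

# (category of the word, suffix, category required for the stem, endings to try).
RULES = [
    ("noun", "abilité", "verb", ("", "er", "ir", "re")),
    ("noun", "eur", "adjective", ("", "e")),
    ("noun", "eur", "verb", ("", "er", "ir", "re")),
    ("adjective", "able", "verb", ("", "er", "ir", "re")),
]

# Two spelling corrections on the radix: navig -> navigu, verd -> vert.
SPELLING = [("g", "gu"), ("d", "t")]


def radix_variants(radix):
    """Yield the radix as is, then its spelling-corrected variants."""
    yield radix
    for old, new in SPELLING:
        if radix.endswith(old):
            yield radix[: len(radix) - len(old)] + new


def strip_derivational_suffix(word, category):
    """Remove one derivational suffix; return (stem, category of the stem)."""
    for word_cat, suffix, stem_cat, endings in RULES:
        if category != word_cat or not word.endswith(suffix):
            continue
        radix = word[: len(word) - len(suffix)]
        for variant in radix_variants(radix):
            for ending in endings:
                candidate = variant + ending
                # Accept only a dictionary entry of the required category.
                if DICTIONARY.get(candidate) == stem_cat:
                    return candidate, stem_cat
    return word, category  # no rule applies: the word is left unchanged


for word, cat in [("navigabilité", "noun"), ("ampleur", "noun"),
                  ("chercheur", "noun"), ("valable", "adjective"),
                  ("table", "noun")]:
    print(word, "->", strip_derivational_suffix(word, cat))
```

Because every candidate must be a dictionary entry of the required category, «valable» is not reduced to the noun «val», and «table» is left untouched since no «-able» rule applies to nouns.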
TABLE 5. Examples of spelling corrections.
From    To      Size constraint    Example           Result
*ess    *ès     length > 5         congress[iste]    congrès
*ρρ     *ρ      length > 5         paliss[ade]       palis
*al     *el     length > 4         actual[ité]       actuel
*ch     *c                         duch[esse]        duc
*ct     *t                         délict[ueux]      délit
*ss     *x      length > 4         touss[er]         toux
*d      *t                         verd[âtre]        vert
*g      *gu                        navig[able]       navigu[er]
*v      *f                         veuv[age]         veuf
*é*     *è*                        collég[ien]       collèg[e]
*é*     *ê*                        extrém[ité]       extrêm[e]
*ç*     *c*                        balanç[oire]      balanc[e]
* Any sequence (including nil) of characters.
ρ Represents a given character; the additional size constraint is checked before the transformation.

TABLE 6. Examples of suffixing.
Word                             Suffix      Removing One Suffix
navigabilité (seaworthiness)     -abilité    naviguer
navigable                        -able       naviguer
navigant (seagoing personnel)    -ant        naviguer
navigateur (sailor)              -ateur      naviguer
navigation (sailing)             -ation      naviguer

TABLE 7. Examples of prefix derivation.
Prefix    From                        Example: From Verb to Verb    Example: From Noun to Noun
pré-      noun or verb                préétablir                    préretraite
dé-       verb, adjective, or noun    décharger                     dénatalité
co-       noun or verb                coexister                     codirecteur

Evaluation
Although in French we do not have test collections like CACM or CRANFIELD to study the retrieval effectiveness of our algorithm, our solution should be helpful to the linguist or computer scientist for whom the resulting stem is often a nonlinguistic element. Thus, to evaluate our scheme we have built three lists of French words where each list contains a set of words and their corresponding roots. However, the definition of the “correct” root is not always clear. For example, «communiste» (communist) gives «commune» but the ultimate root is «commun» (common). In our case, we have stored the shortest root.
To evaluate our algorithm, we have used three main tests. The first contained 50 words presenting only inflectional suffixes (plural, past participle, etc.), and was designed to evaluate the removal of inflectional endings (weak stemmer). According to the results shown in Table 8, the morphological analysis is done correctly. An error in this experiment would indicate the presence of an incorrect coding in our dictionary or declension file. The input lists for the further tests do not contain inflectional suffixes.
In a second test, we tried removing the prefixes. To do this we built a list of 73 words having real prefixes and 73 counterexamples (words beginning with the same characters, where these characters do not form a prefix). The incorrect results of our system are reported in Table 9A. As mentioned previously, removing prefixes is an operation to be considered by linguists and not by information scientists.
Finally, 402 words with only derivational suffixes were used to test our suffix-stripping solution. These words are derived from various roots in order to represent most of the French derivational suffixes. In removing these derivational suffixes, the rules are defined according to a linguistic analysis (Grevisse, 1988); therefore, they are not formulated on an ad hoc basis or given by experiments. The results of our tests are listed in Table 8, and examples of incorrect results are given in Table 9B.
The three lists of words in Table 8 are established to reflect most of the inflections and all derivational suffixes used in French. They contain both regular and abnormal examples extracted mainly from Grevisse (1988).
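The success rates of Table 8 and the error counts of Table 10 can be cross-checked with a short Python calculation. The figures below are copied from those two tables; the names `RESULTS`, `ERRORS`, and `success_rate` are introduced only for this illustration.

```python
# Experiment number -> (name, list size, correctly stemmed words), from Table 8.
RESULTS = {
    1: ("Weak stemmer", 50, 50),
    2: ("Prefix stripping", 146, 139),
    3: ("Suffix stripping", 402, 338),
    4: ("Without grammatical categories", 402, 322),
    5: ("Without spelling corrections", 402, 291),
}

# Experiment number -> (overstemming, understemming, nonstemming, other), from Table 10.
ERRORS = {
    2: (6, 0, 1, 0),
    3: (36, 2, 14, 12),
    4: (43, 1, 7, 29),
    5: (24, 13, 69, 5),
}


def success_rate(size, correct):
    """Percentage of correctly stemmed words, rounded to one decimal."""
    return round(100.0 * correct / size, 1)


for number, (name, size, correct) in RESULTS.items():
    errors = size - correct
    # The four error classes must account for every incorrect result.
    if number in ERRORS:
        assert sum(ERRORS[number]) == errors
    print(f"{number}. {name}: {success_rate(size, correct)}% ({errors} errors)")

# Effect of dropping grammatical categories, then spelling corrections.
base = success_rate(402, 338)
print(round(base - success_rate(402, 322), 1))
print(round(base - success_rate(402, 291), 1))
```

The two differences printed at the end correspond to the 4% decrease observed without grammatical categories and the 11.7% gain attributed to the spelling correction rules, and the 84.1% rate of the third experiment corresponds to the mean error rate of 16% quoted in the abstract.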
Two additional tests have been done to evaluate the impact of using grammatical categories and spelling correction. As shown in Table 8, combining all suffixes in one table and ignoring grammatical categories does not produce satisfactory results (a decrease of 4%). However, this scheme does not correspond to Lovins’ (1968) algorithm, because the stemming is guided by a dictionary and always produces a linguistically correct lemma. Some examples of suffix stripping using the longest match principle are given in Table 1. The fifth test showed that the inclusion of spelling correction rules may improve the solution by up to 11.7%. This suffix-stripping algorithm is already used with more general French texts (Savoy, 1992) and the conflation of semantically distinct words seems to confirm the present results.

TABLE 8. Results according to our five experiments.
Experiment                           Size    Correct    Success Rate
1. Weak stemmer                      50      50         100%
2. Prefix stripping                  146     139        95.2%
3. Suffix stripping                  402     338        84.1%
4. Without grammatical categories    402     322        80.1%
5. Without spelling corrections      402     291        72.4%

Analysis of Errors
For a better comprehension of errors, we have divided them into four classes. The first class represents overstemming errors, where, after finding the expected root, the algorithm carries on removing suffixes. For example, «colonisation» (colonization) comes from «coloniser» (to colonize) which derives from «colon» (colonist). However, this last word has nothing to do with «col» (collar or pass).
If the stemming process stops before reaching the appropriate root, we encounter an understemming error. For example, «féminisme» (feminism) gives «féminin» (feminine) and not «femme» (woman). When the suffix-stripping algorithm does not modify a word, we place this error in the “nonstemming” class, which represents a special case of an understemming error. Miscellaneous errors form the last class; in this case, during the suffix stripping, the algorithm follows an incorrect path within which it never finds the expected root.
The results of Table 10 show that overstemming represents the main source of error. When the system does not consider grammatical categories (experiment 4), the number of overstemmings increases, and the resulting stems (even the wrong ones) are shorter than the stems of test three (suffix stripping). Thus, a consideration of grammatical categories can be viewed as a way of restraining overstemming.

TABLE 9A. Prefix removing errors.
Word           Expected Stem    Resulting Stem
coefficient    efficient        coefficient
début          début            but
discours       discours         cours
méchant        méchant          chant
ménager        ménager          nager
mélanger       mélanger         langer
retable        retable          table

TABLE 9B. Examples of suffixing errors.
Word            Expected Stem    Resulting Stem    Error Type
ailé            aile             ail               overstemming
aisément        aise             ais               overstemming
boxer           boxe             box               overstemming
courser         course           cours             overstemming
crachat         cracher          crac              overstemming
maquisard       maquis           maque             overstemming
marquisat       marquis          marque            overstemming
paperasse       papier           pw                overstemming
périssable      périr            père              overstemming
secrétariat     secrétaire       secrète           overstemming
valetaille      valet            val               overstemming
féminisme       femme            féminin           understemming
humanitaire     homme            humain            understemming
chandelier      chandelle        chandelier        unchanged
chevelure       cheveu           chevelure         unchanged
lisible         lire             lisible           unchanged
ruelle          rue              ruelle            unchanged
sprinter        sprint           sprinter          unchanged
aileron         aile             ail               other
bruyance        bruit            bru               other
chauffard       chauffeur        chauffe           other
écrivailler     écrire           écrier            other
lapereau        lapin            laper             other
lavasse         laver            lave              other
linguistique    langue           linge             other

TABLE 10. Error distribution.
Experiment    Overstemming    Understemming    Nonstemming    Other Errors
2             6 (85.7%)       0 (0%)           1 (14.3%)      0 (0%)
3             36 (56.3%)      2 (3.1%)         14 (21.9%)     12 (18.8%)
4             43 (53.8%)      1 (1.3%)         7 (8.8%)       29 (36.3%)
5             24 (21.6%)      13 (11.7%)       69 (62.2%)     5 (4.5%)

We should be able to improve our results a little by progressive elimination of errors from our dictionary and declension files. However, for various reasons, we can only expect limited improvements. First, the main source of errors comes from our rules and spelling adjustments. A given rule may almost always produce the correct answer, but an incorrect lemma is sometimes derived. For example, from the noun «traîneau» (sleigh), the correct stem is «traîne» ([robe] train), but from the noun «manteau» (coat), the stem «mante» (mantis) is incorrect.
Second, to retrieve the correct lemma, we sometimes have to consider the Latin or Greek etymon of the word. For example, the correct stem for the adjective «simiesque» (monkey-like) is «singe» (monkey); but «simiesque» comes directly from the Latin word «simius». The word «luminescent» (luminescent) is derived from the Latin word «lumen, luminis» and not from the French noun «lumière» (light). Different forms of a given root may exist because they come from different Latin forms; for example, «natal» (native, Latin root «natalis») and «naître» (to be born, Latin root «nasci»). If our previous example of “sailor, sailing, ...” confirms that the terms “concept” and stem are synonymous, then the current example clearly demonstrates the limits of such an assessment. Of course, we may introduce the correct stem for each word in our dictionary and the stemming process can be reduced to a table lookup.
Third, some derivations are very irregular, especially
for compound words. For example, the adjective «moyenâgeux»
(medieval) is derived from «moyen âge»
(Middle Ages), and the sport «ping-pong» has given the
noun «pongiste» (table tennis player).
Fourth, the derivation of a word is sometimes produced
by simultaneously adding a prefix and a suffix.
For example, «débarquer» (to land) comes from the
noun «barque» (small boat), and in this case, the words
«débarque» or «barquer» do not exist in French. Our
solution is unable to handle this exception because it
cannot simultaneously remove a prefix and a suffix.
Other words are formed by combining two existing
words (e.g., “motor” plus “hotel” gives “motel”).
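The difficulty with «débarquer» can be made concrete with a small sketch. Everything in it is a hypothetical illustration (the one-word lexicon, the affix lists, and both function names); it is not part of our system, which, as stated above, removes only one affix at a time.

```python
# Hypothetical toy data, for illustration only.
LEXICON = {"barque"}
PREFIXES = ["dé"]
SUFFIXES = [("er", "e")]  # (verb ending, noun ending)


def strip_one_affix(word):
    """Remove either one prefix or one suffix; keep candidates in the lexicon."""
    candidates = []
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.append(word[len(prefix):])
    for suffix, replacement in SUFFIXES:
        if word.endswith(suffix):
            candidates.append(word[: -len(suffix)] + replacement)
    return [c for c in candidates if c in LEXICON]


def strip_both_affixes(word):
    """Remove a prefix and a suffix in a single step, then validate."""
    candidates = []
    for prefix in PREFIXES:
        for suffix, replacement in SUFFIXES:
            if word.startswith(prefix) and word.endswith(suffix):
                core = word[len(prefix): -len(suffix)]
                candidates.append(core + replacement)
    return [c for c in candidates if c in LEXICON]


print(strip_one_affix("débarquer"))     # []
print(strip_both_affixes("débarquer"))  # ['barque']
```

Removing one affix at a time yields only «barquer» and «débarque», neither of which is in the lexicon, so the dictionary check rejects both; only the simultaneous removal reaches «barque».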
Finally, for a given word, the morphological analysis
can be ambiguous, and relating it to its corresponding
dictionary entry is not an easy task. For example, the
word «boxer» can be a verb (to box) or a noun (the dog breed). In the
former case, the correct root will be «boxe» (boxing),
and in the latter «boxer». The disambiguation problem
is beyond the scope of the present study; this task is
difficult even for human beings, as shown in Choueka and Lusignan
(1985).
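A minimal sketch of why the grammatical category must accompany the word is given below. The table `LEMMAS`, the category labels, and the two functions are hypothetical names chosen for this illustration; the sketch does not attempt disambiguation, it only shows that the stem is determined once the category is known.

```python
# Hypothetical lemma table keyed on (word, grammatical category).
LEMMAS = {
    ("boxer", "verb"): "boxe",   # to box -> boxing
    ("boxer", "noun"): "boxer",  # the dog breed keeps its form
}


def stem(word, category):
    """Return the stem for a word whose category is already known."""
    return LEMMAS.get((word, category), word)


def candidate_stems(word):
    """All stems still possible when the category is unknown."""
    return sorted({lemma for (w, _), lemma in LEMMAS.items() if w == word})


print(stem("boxer", "verb"))     # boxe
print(stem("boxer", "noun"))     # boxer
print(candidate_stems("boxer"))  # ['boxe', 'boxer']
```

Without the category, the word alone leaves two candidate stems, which is exactly the ambiguity described above.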
Summary
The design of suffix-stripping algorithms may be
based on the longest matching suffix or on a set of predefined
classes of endings. In both cases, only the sequence
of characters determines the result. In this
study, we have shown that these solutions may be difficult
to implement for languages other than English, particularly
French.
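For comparison, a purely character-based longest-match stripper can be sketched in a few lines. The suffix list and the minimum stem length are hypothetical choices made for this illustration, not those of any published stemmer.

```python
# Hypothetical suffix list, for illustration only.
SUFFIXES = ["ations", "ation", "ement", "ments", "ment", "es", "s", "e"]


def longest_match_stem(word, min_stem_len=3):
    """Strip the longest listed suffix that leaves a long enough stem."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched


print(longest_match_stem("changement"))  # chang
print(longest_match_stem("serment"))     # ser
```

Because only the character sequence is examined, «serment» (oath) loses its ending exactly as the derived noun «changement» (change) does, and neither result is checked against a dictionary or a grammatical category.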
We have shown how to remove inflectional suffixes
from French words and described a morphological
analysis which is more complex than for English. We
have also explained how to implement another stemming
algorithm based on a dictionary and grammatical
categories. We have implemented this solution and
tested its performance. This study
merely outlines our general approach. For more details,
see Savoy (1991).
Acknowledgments
I wish to thank Georges Côté Jr., Serge Simard, and
Daniel Moïse, students at our department, for their participation
in the implementation of the morphological
analyzer and the stemming algorithm. The author also
thanks Professor C. D. Paice of Lancaster University
for her suggestions and helpful remarks.
References
Beale, A. (1987). Towards a distributional lexicon. In R. Garside, G. Leech, & G. Sampson (Eds.), The computational analysis of English: A corpus-based approach (pp. 149-162). London: Longman.
Choueka, Y., & Lusignan, S. (1985). Disambiguation by short contexts. Computers and the Humanities, 19, 147-157.
Fox, C. (1990). A stop list for general text. SIGIR Forum, 24, 19-35.
Grevisse, M., & Goosse, A. (1988). Le bon usage. Paris: Duculot.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42, 7-15.
Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 22-31.
Lovins, J. B. (1971). Error evaluation for stemming algorithms as clustering algorithms. Journal of the American Society for Information Science, 22, 28-40.
Muller, C. (1985). Langue française, linguistique quantitative, informatique. Geneva: Slatkine-Champion.
Paice, C. D. (1977). Information retrieval and the computer. London: McDonald & Jane's.
Paice, C. D. (1990). Another stemmer. SIGIR Forum, 24, 56-61.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.
van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths.
Sabah, G. (1989). L'intelligence artificielle et le langage: Processus de compréhension (vol. 2). Paris: Hermès.
Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.
Savoy, J. (1991, October). Stemming of French words. Département d'informatique et de recherche opérationnelle, #793, Université de Montréal, p. 48.
Savoy, J. (1992). Bayesian inference networks and spreading activation in hypertext systems. Information Processing & Management, 28, 389-406.
Smith, P. D. (1990). An introduction to text processing. Cambridge, MA: The MIT Press.
JASIS: Journal of the American Society for Information Science, January 1993