A Contribution of Medical Terminology to Medical Language Processing Resources: Experiments in Morphological Knowledge Acquisition from Thesauri

Pierre Zweigenbaum and Natalia Grabar
Service d’Informatique Médicale / DSI / Assistance Publique – Paris Hospitals & Département de Biomathématiques, Université Paris 6, Paris, France
{pz,ngr}@biomath.jussieu.fr
http://www.biomath.jussieu.fr/~pz/

Morphological knowledge, especially derivation and compounding, is extremely useful both for natural language processing and information retrieval. Whereas large morphological knowledge bases are available for some languages, this is not the case for French. In order to fill this gap, we aim to set up a method that can automatically acquire various kinds of morphological knowledge for a given language and domain. This method relies on the synonym terms present in a thesaurus of the domain and a list of words that can be drawn from the same thesaurus. This paper presents a series of experiments whose goal is to learn morphological knowledge from this initial data and without a priori linguistic knowledge. It shows that one can instantaneously obtain a massive, coarse description of word morphology in the domain addressed.

Introduction

Morphological knowledge has long been acknowledged as an important area of medical language processing and medical information indexing [1,2,3,4]. Three types of morphological relations are classically considered. Inflection produces the various forms of the same word, such as plural, feminine or the multiple forms of a verb according to person, tense, etc. Derivation is used to obtain, e.g., the adjectival form of a noun (noun aorta ↔ adjective aortic). Compounding combines several radicals to obtain complex word forms (e.g., aorta + coronary yields aortocoronary). Medical terminology indeed draws heavily on derivation and compounding to form new words. A complete list of these words cannot be expected to be present in a lexicon.
Furthermore, the decomposition of a word into its component elements or morphemes is a step towards its decomposition into atomic concepts. Therefore, it may lead to better indexing of medical texts and better structuring of medical terminologies.

To appear in “Proc Conference on Natural Language Processing and Medical Concept Representation, Phoenix, Az, 16–19 Dec 1999”.

Inflectional morphology has been widely studied, and for many languages, inflection analysis tools (lemmatizers) are available along with the suitable morphological knowledge (for instance for French, Silberztein’s INTEX system [5] or Namer’s FLEMM lemmatizer [6]). In contrast, no complete description of French derivational or compositional morphology is available. This situation prevails for most languages, notable exceptions being Dutch, English and German with the CELEX base [7]. For medical language, the only publicly available morphological knowledge base is that of the UMLS Specialist Lexicon [8,9], which addresses inflection and derivation for medical English. Derivational morphology is the subject of recent projects (e.g., the FRANLEX project on French derivational morphology [10]). Building medical morphological resources for several languages is beginning to be tackled in a global way [11].

The present work participates in this movement towards building resources for medical morphology processing in multiple languages. Our goal is to design a method that can help acquire automatically various types of morphological knowledge for a given language and domain [12]. We chose the approach of an automated, non-interactive tool, with no a priori linguistic knowledge, that can thus be applied quickly and with little effort to a new language or domain [13].

Automated methods for learning morphological variants have been designed to collect resources for information retrieval or natural language processing systems.
Some rely on corpus collocations [14] or on approximate matches between thesaurus terms and corpus occurrences [15]. In the framework of the FRANLEX project, another method searched for correspondences between morphologically similar words with differing syntactic categories [10]. Contrastive studies on large word lists have also been performed to identify the most frequent suffixes of a language [16]. Machine learning algorithms have been applied to morphology too [17]. Tools to study morphological variants among terms of multiple vocabularies have also been designed when building the UMLS [9]. Whereas most of these methods work with classical string operations, linguistically more elaborate models have also been used, such as two-level morphology [18], to learn more general rules from morphologically related pairs of words [19].

The base material we use to learn morphological knowledge is a terminology with synonym terms: this can be a thesaurus, classification, nomenclature, controlled vocabulary or any other such terminological system where a « concept » can be denoted by several terms. This material can be complemented with a list of additional word forms of the same domain. In the medical domain, such terminologies exist for many languages: depending on the country, both national and international terminologies may be available. We worked with data obtained from the SNOMED Nomenclature and the International Classification of Diseases in French, English and Russian.

This paper describes a series of experiments which aim to assess the kind of morphological knowledge on medical words that can be extracted from this data, using very little a priori knowledge on the language addressed. It first reviews more precisely the kind of knowledge that is expected. The material used throughout these experiments is then presented. The rest of the paper describes several experiments performed on a simple list of words, on the synonym terms of a thesaurus, and on both.
The contribution of additional syntactic knowledge, in the form of part-of-speech tagging and lemmatization, is also assessed. We conclude with a synthesis of these experiments and perspectives for further work.

Objectives

We aim to collect morphological knowledge of the following types.

Morphologically related pairs of word forms:
(1) abdominal, abdominaux (inflection); aorte, aortique (derivation); cardio, cardiomégalie (compounding)
but also any combination of these three types of relations.

Families of morphologically related word forms:
(2) abdomen, abdominal, abdominale, abdominales, abdominaux, abdomino; cardia, cardiaque, cardiaques, cardio, cardiomégalie, cardiopathie, cardiopathies, cardite
These families are interesting, e.g., for information retrieval, where they can increase recall as a complement to or a substitute for stemming [15].

Morphological rules that transform a word form into another, morphologically related word form:
(3) cicatrice → cicatriciel    e|iel
Examples: e|ique, e|aire, ε|s. Depending on their precision, they can be used to segment a word form into its morphological constituents or to produce a derived form.

Language or domain suffix strings, which play a role in these rules:
(4) -e, -s, -aire, -ique, -ome, -ose
These strings are close to morphemes, some of which have a strong semantic value in the domain; they can correspond to semantic primitives (atomic concepts) for domain knowledge representations (e.g., -ome = tumor).

Radicals (or bases) of the language or the domain, to which suffix strings can be appended:
(5) fibr, granul, hépat, hyalin, immun, lymph
These radicals are morphemes too, most of which have strong semantic value in the domain and correspond to concept types.

Prefixes of the language or the domain, that can be added in front of another word form:
(6) hyper-, poly-, para-

Decompositions (segmentations) of forms into their constitutive elements:
(7) cicatricielles → cicatriciel+le+s
    cicatriciel → cicatrice+iel
    lymphoblastique → lympho+blastique
    blastique → blaste+ique
where we find adjectivation suffixes -iel and -ique, feminine -le and plural -s inflections, radicals cicatrice and blaste, and prefix lympho- (related to radical lymphe). This kind of decomposition gives access to semantic components of a word, e.g., to index it [2] or to propose a definition for a new word in a lexicon [3].

All the methods presented in this article deal with strings rather than morphemes. However, we often use words such as prefix, suffix or radical to denote strings that might play the role of such word parts.

Material

The following material provides the basis for this series of experiments. We describe additional material where relevant in the following sections.

Thesaurus

We started from the series of synonym terms found in the SNOMED International nomenclature. Depending on the experiment, we worked with French, English or Russian:

French: the French version of the SNOMED Microglossary for Pathology [20] (12,555 terms); this is a precommercial version of the French Microglossary, graciously lent by Dr Roger Côté;

English: SNOMED International version 3.5, as included in the 1999 UMLS Metathesaurus (128,855 terms). These terms were obtained by collecting Metathesaurus strings with SAB=SNMI98 (source abbreviation = SNOMED International, 1998);

Russian: the Russian version of the SNOMED Microglossary for Pathology (13,462 terms); this is a working version of an ongoing translation, graciously lent by Yvan Emelin.

Reference word list

We also used a reference list of attested word forms occurring in:

French: the French version of the SNOMED Microglossary for Pathology or the main terms of the French ICD10 [21]; capitalized words (proper names) were filtered out; total: 8,874 word forms;

English: SNOMED International version 3.5 and ICD9-CM: Metathesaurus strings with SAB=SNMI98 or ICD99 (source abbreviation = SNOMED International, 1998 or ICD-9-CM, 1998); total: 49,627 word forms;

Russian: the Russian version of the SNOMED Microglossary for Pathology; 9,871 word forms.

All punctuation characters were considered as delimiters, and word forms containing digits were filtered out. All forms were converted to lower case.

Bootstrapping on a list of word forms

In this section, we examine what can be done starting from a “flat” list of word forms. This list of word forms has no a priori structure. Nevertheless, being drawn from medical terminologies, it has the specific property of containing a large proportion of words that are characteristic of the medical domain: domain-specific words (leucokératose, pemphigoïde) or words that typically occur in medical corpora (bénin, aberration).

Many medical words are formed by compounding of Latin and Greek radicals: the so-called neo-classical compounds. For instance, fibro- and amélo- are prefixed to blastique to form fibro+amélo+blastique. A list of such radicals is a key resource for algorithms that segment word forms into their constituents (e.g., [3,4,22]). Our aim here is to collect such radicals.

Material

For this series of experiments, we worked with the reference list of 8,874 word forms obtained from the French version of the SNOMED Microglossary for Pathology and the French ICD-10.

Methods

Detecting prefixes and compositional radicals

We performed an analysis of our reference word list based on commutation tests. By definition, if fibro- is a radical, this means that it can combine with various forms F to form fibroF compounds (with a possible local morphological adaptation at the frontier of the two components). We then assume, conversely, that if a large enough number of pairs {F, fibroF} occur in the reference list, this is a clue that fibro- is a « candidate radical » or a candidate prefix. For instance, for fibro-, 18 pairs can be found in the list:
(8) {améloblastique, fibroaméloblastique}
    {blastique, fibroblastique}
    {chondrosarcome, fibrochondrosarcome}
    {kystique, fibrokystique}
    ...
    {sclérose, fibrosclérose}
    {vasculaire, fibrovasculaire}
    {xanthome, fibroxanthome}
which provides good support for this candidate.

Transposition to suffixes

The same basic principle may be applied to identify suffix strings. Here, we want to spot word pairs of the form:
(9) {ligament, ligamentaire}
    {ligament, ligamenteux}
    {ligament, ligaments}
    {essentielle, essentiellement}
    {pneumo, pneumocyte}
which reveal suffixable strings -aire, -eux, -s, -ment, -cyte.

Results

Prefixes and compositional radicals

We applied the prefix method to the whole reference list and obtained 779 radical strings R that occur in at least one {F, RF} pair, among which 235 occur in at least two pairs and 120 in at least three. The 33 radicals with a frequency greater than or equal to 10 are shown below.

37 intra    36 in       34 péri     28 para     26 hyper    23 pré
21 hypo     20 trans    19 inter    18 fibro    17 ostéo    17 dé
15 neuro    15 micro    15 anti     14 poly     13 épi      13 sub
13 rétro    13 ré       13 pro      13 endo     13 dys      13 angio
13 a        11 myo      11 di       11 bi       10 pseudo   10 myélo
10 hém      10 extra    10 adéno

The first 9 elements found (intra-, in-, péri-, para-, hyper-, pré-, hypo-, trans-, inter-) are prefixes of general French that are frequent in the medical domain. The first domain-specific radicals can be found at rank 10 (fibro-, ostéo-). The first error occurs at rank 61 (frequency = 5): p- is neither a radical nor a prefix, although pairs {laque, plaque}, {liée, pliée}, {lèvre, plèvre}, {réparation, préparation}, {urine, purine} are attested in the reference list. In total, for frequencies greater than 2, we found 13 errors in 120 radicals, i.e., a precision of 89.2%. If we exclude one-letter prefixes (only a- is a correct prefix), we reduce errors further down to 5 out of 111 radicals, yielding a 95.5% precision.
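The commutation test described in the Methods above can be sketched as follows. This is a minimal illustration, not the authors' program: the function name `candidate_affixes` and its parameters are assumptions introduced here, and the sketch only handles strict concatenation (it ignores the local morphological adaptation at the frontier of the two components mentioned in the text). The sample words are taken from examples (8) and (9).

```python
from collections import defaultdict


def candidate_affixes(word_forms, side="prefix", min_pairs=3):
    """Commutation test on a flat word list (hypothetical helper).

    side="prefix": collect strings R such that both F and R+F are attested.
    side="suffix": collect strings S such that both F and F+S are attested.
    Returns (affix, supporting pairs) sorted by decreasing number of pairs.
    """
    forms = set(word_forms)
    support = defaultdict(set)
    for word in forms:
        # Try every split of the word into two non-empty parts.
        for cut in range(1, len(word)):
            head, tail = word[:cut], word[cut:]
            if side == "prefix" and tail in forms:
                support[head].add((tail, word))
            elif side == "suffix" and head in forms:
                support[tail].add((head, word))
    ranked = [(affix, sorted(pairs))
              for affix, pairs in support.items()
              if len(pairs) >= min_pairs]
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked


# Word forms quoted in examples (8) and (9).
words = ["blastique", "fibroblastique", "kystique", "fibrokystique",
         "sclérose", "fibrosclérose",
         "ligament", "ligamentaire", "ligamenteux", "ligaments"]

for affix, pairs in candidate_affixes(words, side="prefix", min_pairs=3):
    print(affix, len(pairs))
for affix, pairs in candidate_affixes(words, side="suffix", min_pairs=1):
    print(affix, len(pairs))
```

The `min_pairs` threshold plays the role of the "large enough number of pairs" in the text; the results above report counts for thresholds of one, two and three pairs.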
Suffixes

We applied the suffix method to the whole reference list and obtained 733 suffix strings S that occur in at least one pair {F, FS}, among which 170 in at least two pairs and 78 in at least three. The 15 suffixes with at least 10 occurrences are:

1136 s    168 e    34 ment    24 es    21 ux
17 ne     15 me    14 se      13 le    12 ïde
12 use    11 blastome    10 ur    10 sarcome    10 cyte

The first two are inflection suffixes (plural -s and feminine -e), then adverb formation suffix -ment. The first domain-specific suffix is -me (rank 7), which actually corresponds to forms ending in -ome. Finally, the first radical found in a suffix position is -blastome (rank 12).

Discussion

In the prefix method, the fact that an initial form (such as blastique) can itself combine with other radicals (e.g., centroblastique, chondroblastique, lipoblastique, etc.) could be considered an additional clue. It confirms that in the obtained segmentation fibro+blastique, we do have two components that commute with various elements. However, several « parasitic » forms such as tique also give rise to numerous wrong decompositions. For instance, {tique, attique}, together with {tente, attente} and {teinte, atteinte}, pushes forward an erroneous prefix at-. Therefore, we did not include this criterion in our procedure, since it is not discriminative enough.

Prefix analysis detects radicals or prefixes that are frequent in the domain with a good precision. At the same time, the 111 radicals Ri occur in 904 pairs of forms {Fj, RiFj} for which they propose a segmentation of a compound form RiFj into Ri + Fj. Among these 904 pairs, the 5 erroneous prefixes produce 13 wrong segmentations, i.e., a 1.4% error rate. To these we must add possible but rare irrelevant segmentations obtained with otherwise correct prefixes, such as a+voir or a+mont.

Working on pairs of attested word forms is a powerful filter against irrelevant decompositions. Among the 173 forms that start with an-, only 6 were segmented with this prefix: an+aplasique, an+aérobies, an+euploïdie, an+ictérique, an+ovulation, an+ovulatoire. Most of the other forms starting with an- do not belong to this segmentation class, whether they be atomic (angine) or involve other prefixes or radicals (ana-, angio-, anté-, anti-, ano-). However, some an- segmentations are missed because the word form corresponding to the radical is either not present in our reference word list (e.g., algésie in an+algésie) or only exists as a bound morpheme in modern language (e.g., -ergie in an+ergie).

In the suffix method, the pairs of forms found are by construction restricted to strict additions of suffixes. This prevents this process from identifying inflections or derivations, e.g., {abdominal, abdominaux} or {aorte, aortique}, where morphotactic constraints can result in what looks like suffix substitution. The following method deals with these cases by sifting additional knowledge from thesauri.

Bootstrapping on series of synonym terms

A flat word list, as used in the previous section, has no inherent structure that can provide clues as to which word forms should be matched together. We relied instead on the sole internal form of words to build contrastive pairs. Thesauri do provide structure around terms. Given a concept, they often include both a preferred term and synonym terms. They also often specify hierarchical and other relations among concepts and terms.

Table 1 – Some preferred and synonym SNOMED terms containing words related to the mucous membrane.

Code       Class  English term
T-00400    01     Mucous membrane, NOS
T-00400    02     Mucosa
T-00400    02     Tunica mucosa
D0-A0100   01     Mucous membrane inflammation, NOS
D0-A0100   02     Mucitis, NOS
D0-A0100   02     Mucosal inflammation, NOS
D0-A0100   02     Mucositis, NOS
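As an illustration of the structure a thesaurus provides, the rows of Table 1 can be grouped into series of synonym terms, each term being reduced to a set of word forms. This is only a sketch under stated assumptions: the names `TABLE_1`, `tokenize` and `synonym_series` are hypothetical, and applying the reference-list tokenization (punctuation as delimiters, forms with digits dropped, lower case) to thesaurus terms is an assumption made here, not something the paper states.

```python
import re
from collections import defaultdict

# Rows of Table 1: (concept code, class, term); class 01 = preferred, 02 = synonym.
TABLE_1 = [
    ("T-00400", "01", "Mucous membrane, NOS"),
    ("T-00400", "02", "Mucosa"),
    ("T-00400", "02", "Tunica mucosa"),
    ("D0-A0100", "01", "Mucous membrane inflammation, NOS"),
    ("D0-A0100", "02", "Mucitis, NOS"),
    ("D0-A0100", "02", "Mucosal inflammation, NOS"),
    ("D0-A0100", "02", "Mucositis, NOS"),
]


def tokenize(term):
    """Lower-case a term, split on punctuation and spaces, drop forms with digits."""
    forms = re.split(r"[\W_]+", term.lower())
    return {f for f in forms if f and not any(ch.isdigit() for ch in f)}


def synonym_series(rows):
    """Group terms by concept code; each series is a list of word-form sets."""
    series = defaultdict(list)
    for code, _term_class, term in rows:
        series[code].append(tokenize(term))
    return dict(series)


for code, terms in synonym_series(TABLE_1).items():
    print(code, [sorted(t) for t in terms])
```

Each series then offers a small, constrained context in which word forms of different terms can be compared.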
In this series of experiments, we take advantage of the series of synonym terms that occur in medical thesauri to identify pairs of morphologically related word forms. Instances of such terms are shown in table 1 (Class 01 means preferred term, whereas 02 means synonym term). Our bootstrapping method performs a kind of reverse engineering of the synonym enumeration process initially performed by the terminology authors.

Material

We start from the series of synonym terms found in the SNOMED International nomenclature. This experiment was performed for French, English and Russian.

Methods

Aligning morphologically related word forms

Given two synonym terms {T1, T2}, where each term is a set of word forms, we consider the pairs of word forms {IS1, IS2} found in the two terms (with IS1 ∈ T1 and IS2 ∈ T2) and that share a « long enough » initial substring I. We worked with minimal lengths of three and four characters; this threshold is indeed a parameter of the method. Unless otherwise mentioned, in this paper, this threshold is set to 3. For instance, given the terms in table 1, this algorithm aligns the following word pairs:
(10) {mucous, mucosa} (T-00400)
     {mucous, mucosal} (D0-A0100)
     {mucous, mucositis} (D0-A0100)
     {mucosal, mucositis} (D0-A0100)

Note that such an initial common substring algorithm, applied in an unconstrained context (e.g., our flat list of word forms), would give very bad results. For instance, a four-character initial substring length would collect together all the words that start with abso-: absolute, absonum, absorbable, absorbent, absorber, absorptiometry, absorption, absorptive. This would be correct for the absorb-/absorp- family, but would mix with them absolute and absonum.

Inducing morphological rules

The aligned word pairs are then considered as examples from which morphological rules are induced. These rules are potentially applicable to identify other word pairs that undergo the same morphological relation. At this step of the process, we aim to obtain rules that embody the minimal specific features of the word pairs, while keeping a good generality. Each example {IS1, IS2}, where I is the maximal initial common substring of the two word forms, is generalized into a rule {S1, S2} that can be interpreted as follows: given a word form that ends with suffix string S1, one can derive a word form in which S1 is substituted with S2; or the reverse, these rules being considered symmetric. We note rules {S1, S2} with the abbreviated notation S1|S2. Concretely, aligning two word forms {IS1, IS2} identifies at the same time the suffix strings {S1, S2}, thus the associated rule S1|S2. Some of the most frequent rules obtained for French are (ε denotes the null string):

ε|s          (plural inflection)                              {articulaire, articulaires}
ε|e          (feminine inflection)                            {surrénal, surrénale}
e|ique       (adjectives ending with -ique)                   {prostate, prostatique}
e|que        (adjectives ending with -ique)                   {hyperkaliémie, hyperkaliémique}
e|aire       (adjectives ending with -aire)                   {valvule, valvulaire}
me|sarcome   (derivation -ome / compounding -osarcome)        {mélanome, mélanosarcome}

and for English:

ε|s          (plural inflection)        {organ, organs}
a|osis       (-osis derivation)         {diverticula, diverticulosis}
us|osis      (-osis derivation)         {thrombus, thrombosis}
a|c          (-ia / -ic derivation)     {aphasia, aphasic}
aemia|emia   (alternation)              {tularaemia, tularemia}
ε|al         (-al derivation)           {orbit, orbital}

Suffix strings are collected at the same time. For instance, suffix strings -s, -a, -osis, -us, etc. are identified in the word pairs cited in the above rules.

Collecting morphological families

Each word pair specifies that two word forms are linked by a morphological relation: they belong to the same morphological family. We can then try to recover these morphological families. Two word forms are included in the same family in the following situations:

They were identified as a word pair by the alignment process;

Both were segmented by the alignment process on the same maximal initial substring I;

By transitivity, two families that share a common word form are merged.

For instance, the word pairs {ischial, ischiadic} (T-16330) and {ischial, ischium} (T-12351) result in the morphological family ischiadic, ischial, ischium which is correctly separated from ischaemia, ischaemic, ischemia, ischemic, although their initial four-character substring (isch-) is the same.

Results

Table 2 – Bootstrapping: acquired knowledge

Knowledge elements        French   English   Russian
Terms                     12,555   128,855    13,462
Series of synonyms         2,344    26,312     2,636
Word pairs                 1,718    16,653     2,661
Unique word pairs          1,180     6,757     1,692
  Precision                99.9%         –     99.9%
Suffix strings               539     2,610       818
Morphological rules          644     3,188       955
Morphological families       662     3,207       824
  Precision                98.6%         –     99.8%
Words per family            2.57      2.94      2.80

This method was applied to the SNOMED Microglossary for Pathology (French and Russian) and to the full SNOMED (English). The results are displayed in table 2. The full English SNOMED, with a size ten times larger than that of the French or Russian Microglossary, yields six times more unique word pairs and five times more rules. The precision of word pairs and morphological families was evaluated through human review. All of them were examined for French and Russian, whereas for English word pairs and families, random samples were extracted and reviewed. The most frequent suffix strings found in unique English word pairs were (frequency over 100):

671 s      630 a      555 osis   517 us     380 al
371 e      279 ic     239 um     211 y      193 idae
168 es     155 is     132 ed     131 c      128 ectomy
113 sis    110 tic    109 emia   108 ing    107 ia

Discussion

The precision of the word pairs and of the morphological families is extremely high, and even higher when using a four-character threshold. We believe this is due to the very constrained context in which initial common substring matching is performed: synonym terms. This may be compared to results obtained with other kinds of constraints: using cooccurrences in corpora [14] or initial common substring matching between thesaurus terms and corpus occurrences [15].

Errors occur when two word forms occurring in two synonym terms accidentally share a long enough initial substring. This is the case in synonym terms for concept C-A0700: oral contraceptive preparation, nos and birth control pill, nos lead to the alignment of {control, contraceptive} and to the rule ol|aceptive. It would be possible to create manually exception lists of word pairs, which would prevent errors produced at a low threshold. However, in these experiments, we have opted for a method that can be applied to another language or domain with virtually no human intervention (currently, the only human intervention is an optional tuning of the threshold).

The word pairs identified include the three types of morphological variation (inflection, derivation and compounding). Additional linguistic knowledge may cancel the effect of inflection, so that only derivation and compounding remain. This will be shown later in this paper. The observation of occurrences of suffix substrings in initial position in other word forms may help differentiate (compositional) radicals from derivational or inflectional suffixes; however, we have not yet come up with an automated method for separating out these three types of variation.

The acquired rules only have an associative value: given two word forms, the rules can propose to relate them or not. It would not be wise however to apply these rules to any word form which ends with one of its suffix strings in order to generate new word forms. In particular, the rules that include the zero suffix do not mean that a correct word form can be obtained by applying them to any word form! The next section shows how to use the rules in an associative manner.

Besides, the suffix strings may not all look appropriate. For instance, the suffix strings -c and -a in rule a|c would be better expressed as suffix -ic applied to word forms ending in -ia (as in {aphasia, aphasic}). The next section also shows how to obtain a better segmentation of these suffix strings.

Applying induced rules to reference word list

We now use the set of rules acquired in the bootstrapping step to identify new word pairs in a set of word forms.

Material

Our material in this experiment is the set of rules acquired in the bootstrapping step (previous section) and the reference list of attested word forms, for each of the three languages. To evaluate the recall for English word pairs, we used the UMLS 1999 Specialist Lexicon lvg V1.83 tool [8] in inflection and derivation modes [13]. We ran it with options (i) lvg -m -fi to produce inflections and (ii) lvg -m -fRf for derivations.

Methods

Identifying new word pairs

For a given language, each morphological rule is applied to each word form Wi of the reference list, producing another word form Wj. If the resulting word form Wj also belongs to the reference list, i.e., it is attested in the language, the word pair {Wi, Wj} is considered as morphologically related (the implemented program actually uses a more efficient algorithm). It is then added to a new set of word pairs. If the reference list contains all the word forms of the bootstrapping synonym terms, as is the case here, all the word pairs aligned at bootstrapping are found again here. The aim is that new word pairs can be identified too. As in the alignment step, we limit noise by constraining the initial substring I of each pair to have a minimal length over a given threshold. We also set this threshold to three characters.
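The chain described so far (aligning word forms in synonym terms, inducing suffix-substitution rules, applying them to attested word forms, and merging pairs into families) can be sketched as below. This is a hedged illustration rather than the authors' implementation: all function names are assumptions, the rule application is the naive version (the paper notes that the actual program is more efficient), and `families` only implements merging by shared word form, not the "same maximal initial substring" criterion.

```python
from itertools import combinations
from os.path import commonprefix  # character-wise longest common prefix


def align(synonym_terms, min_len=3):
    """Pair word forms of two synonym terms sharing an initial substring."""
    pairs = set()
    for term1, term2 in combinations(synonym_terms, 2):
        for w1 in term1:
            for w2 in term2:
                if w1 != w2 and len(commonprefix([w1, w2])) >= min_len:
                    pairs.add(tuple(sorted((w1, w2))))
    return pairs


def induce_rule(w1, w2):
    """Strip the maximal common initial substring I; return the rule S1|S2."""
    cut = len(commonprefix([w1, w2]))
    return tuple(sorted((w1[cut:], w2[cut:])))


def apply_rules(rules, reference, min_len=3):
    """Relate attested forms Wi, Wj whenever a (symmetric) rule maps Wi to Wj."""
    attested = set(reference)
    found = set()
    for s1, s2 in rules:
        for old, new in ((s1, s2), (s2, s1)):
            for wi in attested:
                if not wi.endswith(old):
                    continue
                wj = wi[:len(wi) - len(old)] + new
                if (wj != wi and wj in attested
                        and len(commonprefix([wi, wj])) >= min_len):
                    found.add(tuple(sorted((wi, wj))))
    return found


def families(pairs):
    """Merge word pairs that share a word form (transitive closure)."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for w in list(parent):
        groups.setdefault(find(w), set()).add(w)
    return sorted(sorted(g) for g in groups.values())


# Rules learnt from two pairs quoted in the text, then applied to attested forms.
rules = {induce_rule("prostate", "prostatique"),
         induce_rule("articulaire", "articulaires")}
reference = ["aorte", "aortique", "ligament", "ligaments", "cicatrice"]
print(sorted(rules))
print(sorted(apply_rules(rules, reference)))
```

On this toy input the two induced rules are e|ique and ε|s (the empty string standing for ε), and only pairs whose both members are attested are proposed, which is the associative use of rules argued for in the discussion.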
Extending the morphological families Here again, word pairs are merged into morphological families. These families are either extensions of the ones found in the bootstrapping step (when they contain a word pair that existed in the bootstrapping step) or are new families. Evaluating precision Precision was defined as the proportion of word pairs identified by our method that actually are in a relation of inflection, derivation, compounding or a combination thereof. Precision 6 forms from SNOMED Microglossary for Pathology and ICD10; for English, word forms from full SNOMED and ICD-9CM; for Russian, word forms from SNOMED Microglossary for Pathology. The results are shown in table 3. Rule suffix fitting was only experimented on French. was evaluated by a human review of the word pairs and morphological families returned by our program. For English, the size of the results was too large, so that samples were extracted for review: one word pair every 125 and one morphological family every 25. Evaluating recall for English Table 3 – Extension: acquired knowledge. Knowledge elements French English Russian From bootstrapping Morphological families 662 3,207 824 Words per family 2.57 2.94 2.80 Morphological rules 644 3,188 955 Input Word forms 8,874 49,627 9,871 Output Word pairs 4,838 25,740 6,887 99.6% Precision 98.1% 845% Morphological families 1,663 6,424 1,707 99.3% Precision 97% 864% Words per family 3.21 3.84 3.55 For English, we are in a position where morphological resources for derivation (as well as inflection) are available: those distributed in the UMLS Specialist lexicon and its lvg tool. We ran lvg on the list of word forms collected for English to identify the word pairs where either (i) one word form was an inflected form of the other, or (ii) one word form was a derived form of the other. We took these word pairs as the gold standard for evaluating the recall of our method. 
Recall was defined as the proportion of lvg-returned word pairs also identified by our method. Fitting suffix strings in rules The rules learnt in the bootstrapping step are the most general suffix substitutions that account for the initial word pairs. Now that more word pairs have been collected, we can try to make these rules more specific. In the bootstrapping step, we had individually generalized each example Ei = {Ii S1 , Ii S2 } to the rule R = {S1 , S2 }, on the basis of the maximal initial substring Ii common to both words in the pair. We shall now globally consider all examples EiR subsumed by this rule, and identify the maximal final substring SR common to all their maximal initial substrings IiR . We then specialize the rule R = {S1 , S2 } to {SR S1 , SR S2 }. This corresponds to extending suffixes S1 and S2 of rule R maximally to the left while still covering all the examples. Let us emphasize that if we had fitted the rules at bootstrapping, many of them might have been overfitted because of the smaller number of instances available at that step. They would therefore have been unable to apply to some word pairs of the reference list. We expect this to be less the case on the larger set of word pairs collected in this second step. The rules that apply to only one word pair have no generality on the set of word forms considered. They correspond to word pairs learnt on only one word pair during the bootstrapping step, and for which no external confirmation was found in the reference word list. These rules are kept as is. They can be considered as having a lower confidence than the other rules. The obtained suffix strings are now the longest possible that can allow a rule to apply to the largest possible subset of the reference list. We expect these strings to be a better approximation of the morphemes that a linguistic analysis would identify in these word forms. 
Results

The extension of word pairs and morphological families was performed on the material described above: for French, word

The number of unique word pairs identified, compared with that obtained at bootstrapping, is multiplied by 4. Morphological families are twice as numerous, and they are larger on average. Recall values for inflection and derivation are shown in Table 4.

Table 4 – Inflection and derivation recall for English.

Knowledge elements    Automatic    lvg      Recall
Word pairs            25,740
Inflection                         2,697    91.2%
Derivation                         2,973    79.2%

For French, the most frequent rule applications are:

1149 |s          297 |e           145 |es          75 e|ique
  69 aire|e       64 e|s           64 e|que        57 ation|é
  53 se|x         46 al|o          45 l|ux         45 e|o
  41 le|ux        41 e|ose         40 ite|o        38 |le
  37 e|ome        37 aire|es       34 |ne          34 |ment
  34 f|ve         34 e|é           32 me|sarcome

539 suffixes are involved, among which the most frequent are:

1271 s      1111 e      272 es      269 o       227 é       222 aire
 177 me      173 se     173 ique    145 ome     142 ose     117 ux
 116 al      100 que     97 x        88 ïde      86 le       82 sarcome

2879 initial strings are involved; the most frequent ones are:

29 myélo        25 ostéo        23 fibr         22 fibro        21 adéno
19 lympho       16 lipo         16 granul       15 angio        13 plasmocyt
13 myo          13 immun        13 chondro      12 méning       12 mélano
12 lymphocyt    12 histiocyt    11 hémangio     11 hyalin       11 dur
10 plan         10 neuro

in which one can find the radicals directly obtained from the flat list of words in the first experiment; their relative frequency is different, two-letter radicals and prefixes are indeed absent, and some variant segmentations occur (e.g., fibro / fibr).

Rule suffix fitting (performed only on French data, at threshold four) applied to 462 rules out of 644 (72%). Among the 23 above-mentioned most frequent rules, suffix extension applied to ie|ique (+i), euse|eux (+eu), al|aux (+a), ale|aux (+a), if|ive (+i) and ome|osarcome (+o):

1149 |s          297 |e           145 |es          75 e|ique
  69 aire|e       64 ie|ique       64 e|s          57 ation|é
  53 euse|eux     46 al|o          45 l|ux         45 e|o
  41 e|ose        41 ale|aux       40 ite|o        38 |le
  37 e|ome        37 aire|es       34 |ne          34 |ment
  34 if|ive       34 e|é           32 ome|osarcome 31 atose|e

The most frequent suffixes are now:

1243 s      1027 e      299 ome     271 o       261 ique    254 es
 234 aire    202 é      177 ose     158 eux     133 oïde    119 ation
 115 al      108 euse   100 ie       92 ion      83 if       83 ale
  82 osarcome

Discussion

Precision was still excellent for French and Russian on both word pairs and families, and was good for English (the confidence interval was computed with α = 0.05). Whereas the rules learnt may be overgeneral, their application to pairs of attested word forms is a powerful filter. The fact that the reference word list mostly contains domain-specific words may also have helped. For a large part, erroneous word pairs involve a three-character common initial substring: {den, dent}, {fel, felids}, {levamphetamine, lev}, {maia, main}, {plan, plane}, {sepia, septic}, {tea, teal}, and fewer ones involve a larger common initial substring (e.g., {china, chin}).

Recall for English inflected word pairs was very good (over 90%; see Table 4), and recall for derivation was fairly good too (close to 80%). Silence (missed word pairs) is mainly due to rules that were not represented in the training material (word forms in SNOMED terms), or to rules that are observed in the training material but could not be identified through the alignment method. It will be interesting to check whether this silence could be reduced by using the results of the direct observation of the reference word list, as described earlier in this paper. This is yet to be done.

Our method identifies a much larger number of word pairs than were obtained by lvg. Several reasons can be invoked to account for this difference. First, the evaluation of precision shows that 16 ± 5% of these word pairs are spurious. Second, lvg did not generate the compound word forms found by our method. But, more importantly, our rules are somewhat redundant: one rule may perform the same substitution as the combination of several more elementary rules also identified on aligned word pairs. Given n words in a family, the method can then collect up to O(n²) word pairs (n(n−1)/2), depending on the redundancy of the acquired rules.

In the most frequent rules, inflection again is the most frequent (-s, -e, -es); adjectival (-ique, -aire) and nominal (-ation) derivation comes next; domain-specific suffixes (-ose, -ome) and compounding (-sarcome) come later.

Extended suffix strings as obtained by rule fitting look closer to what one could expect to be the morphemes involved in these word forms, and the rules obtained seem to agree better with a linguistic analysis of the morphological operations involved. For instance, rule ique|ie pertains to the formation of adjectives with an -ique ending rather than adjectives with a -que ending, as the former rule que|e seemed to propose. It specifies how this adjectival derivation applies to forms ending in -ie, and was learnt on 64 word forms among which achromie, allergie, amnésie, etc. Some suffix strings (such as -ique, just discussed, or -ome, -ose or -ation) were previously split among several forms; they are now better identified, and their relative frequency is larger.

Using more syntactic knowledge

Introduction and material

As already mentioned, the rules learnt are overgeneral. This may come from several origins, among which:

1. the constraints they put on string form (suffix strings) may be too loose (suffix strings may be too short); the rule fitting step just discussed may slightly reduce this shortcoming;
2. they do not include any exception, as is often done (e.g., in lvg); this requires human intervention;
3. they do not take into account the syntactic properties of word forms; part-of-speech (POS) is generally used to constrain rule application (as in lvg); this is the point we address in this section;
4. they do not take into account the semantic properties of word forms; we have not yet worked on that point, although the multiaxial nature of terminologies like SNOMED might provide useful clues here.

In this section, we introduce syntactic constraints on rules by applying the previous method to POS-tagged input: each word form in the thesaurus terms was tagged with its part-of-speech, and each word form in the reference word list was tagged with all its possible parts-of-speech. This experiment [POS] was performed on French and Russian, using a specifically trained version of the Brill tagger [23].

Furthermore, as already discussed, rules mix the different kinds of morphological variation: inflection, derivation and compounding, inflection being the most frequently occurring one. If each word is replaced with its lemma (canonical form), inflectional variation is canceled and the acquired rules and suffixes should be cleaner. Again, the previous rule acquisition method is then applied to the lemmatized input material. This experiment [LEM] was performed on French using Namer’s FLEMM lemmatizer [6].

Methods

For the French [POS] experiment, we first used the Brill tagger trained for French at INaLF [24] to tag a subset of the Microglossary terms with the INaLF tagset. We then manually corrected the errors in the assigned tags, and trained the tagger on this corpus. The process was repeated on subsets of the thesaurus of increasing sizes, until we obtained a satisfactory POS-tagged version of the thesaurus. Then all these tagged SNOMED word forms were copied into the reference word list, replacing the non-tagged SNOMED word forms therein. Word forms with several possible part-of-speech tags have several entries in this list. For instance, the French word technique may be either a noun (English technique) or an adjective (English technical). The remaining non-tagged word forms (ICD-10 forms not occurring in the Microglossary) were tagged manually.

The same process was applied to the Russian version of the Microglossary for Pathology. However, since we had no already trained tagger for Russian, the first subset was tagged manually.

Part-of-speech tags are appended to words or suffixes with a slash character “/” as separator. For instance, one term for code M-09150 is tissu/SBC:sg égaré/ADJ:sg pendant/PREP la/DTN:sg manipulation/SBC:sg technique/ADJ:sg. We handle these tags as extensions to word suffix strings, so that the rule e/SBC:sg|ique/ADJ:sg stands for a rule e|ique constrained to apply between a singular noun (SBC:sg) and a singular adjective (ADJ:sg).

For the [LEM] experiment (for French only), FLEMM was run on the tagged terms and on the tagged word list, and tagged word forms were replaced with their root forms. The rule acquisition method was then applied to both inputs.

Results

Table 5 shows the results obtained for both experiments. Word counts (word pairs, word forms, words per family) are given for tagged word forms in the [POS] experiment and for lemmatized words in the [LEM] experiment.

Table 5 – Acquired knowledge with more syntactic information (threshold = 4)

                              [POS]     [POS]     [LEM]
Knowledge elements            French    Russian   French
Terms                         12,555    13,462    12,555
Series of synonyms            2,344     2,636     2,344
Bootstrapping
  Word pairs                  1,595     2,554     1,414
  Unique word pairs           1,104     1,826     935
  Suffix strings              489       246       412
  Morphological rules         594       854       507
Reference word list
  Word forms                  8,989     9,871     7,188
Extension
  Word pairs                  4,512     6,710     2,390
  Morphological families      1,650     1,731     1,096
  Words per family            3.10      3.45      2.82

In the [POS] experiment, the number of aligned word pairs has increased compared to the previous, non-tagged experiment. This comes from the fact that the same word form may occur with different part-of-speech tags depending on its usage. The number of rules and of new word pairs is also increased. However, the number of morphological families is slightly decreased, with a larger average number of words per family. On the contrary, in the [LEM] experiment, all these figures are lower than those obtained in the non-tagged experiment.

Most frequent rule applications ([POS], French):

585 /ADJ:sg|s/ADJ:pl              519 /SBC:sg|s/SBC:pl
267 /ADJ:sg|e/ADJ:sg              128 /ADJ:sg|es/ADJ:pl
 73 e/SBC:sg|ique/ADJ:sg           69 aire/ADJ:sg|e/SBC:sg
 64 ie/SBC:sg|ique/ADJ:sg          56 /ADJ:sg|/SBC:sg
 51 ation/SBC:sg|é/ADJ:sg          49 euse/ADJ:sg|eux/ADJ:sg
 49 e/ADJ:sg|s/ADJ:pl              46 al/ADJ:sg|o/PFX
 41 e/SBC:sg|o/PFX                 40 ite/SBC:sg|o/PFX
 38 e/SBC:sg|ose/SBC:sg            38 ale/ADJ:sg|aux/ADJ:pl
 38 al/ADJ:sg|aux/ADJ:pl           37 aire/ADJ:sg|es/SBC:pl
 34 /ADJ:sg|ment/ADV               32 ome/SBC:sg|osarcome/SBC:sg
 32 if/ADJ:sg|ive/ADJ:sg           32 e/SBC:sg|ome/SBC:sg
 31 en/ADJ:sg|enne/ADJ:sg          31 atose/SBC:sg|e/SBC:sg
 27 l/ADJ:sg|lle/ADJ:sg            27 if/ADJ:sg|ion/SBC:sg
 27 e/SBC:sg|é/ADJ:sg              27 aire/ADJ:sg|o/PFX
 26 ique/ADJ:sg|o/PFX              25 me/SBC:sg|se/SBC:sg

Most frequent suffixes ([POS], French):

1120 /ADJ:sg           654 /SBC:sg            634 s/ADJ:pl
 530 e/SBC:sg          519 s/SBC:pl           398 e/ADJ:sg
 310 ome/SBC:sg        302 o/PFX              254 ique/ADJ:sg
 222 aire/ADJ:sg       191 ose/SBC:sg         162 é/ADJ:sg
 152 eux/ADJ:sg        144 al/ADJ:sg          128 es/ADJ:pl
 118 oïde/ADJ:sg       112 ation/SBC:sg       104 ie/SBC:sg
 104 euse/ADJ:sg        98 es/SBC:pl           86 ion/SBC:sg
  79 osarcome/SBC:sg    79 if/ADJ:sg           76 aux/ADJ:pl
  74 ale/ADJ:sg         64 ite/SBC:sg          59 oblastome/SBC:sg
  54 ive/ADJ:sg         49 l/ADJ:sg            48 se/SBC:sg

Most frequent rule applications ([LEM], French):

75 e|ique       69 aire|e       64 e|s          64 e|que        57 ation|é
53 se|x         46 al|o         45 e|o          41 e|ose        40 ite|o
37 e|ome        34 |ment        34 e|é          32 me|sarcome   31 atose|e
28 f|on         28 aire|o       27 ique|o       25 me|se        23 |se
23 se|tique     22 al|e         21 |ux          20 ique|ome     20 eux|ose
20 ce|t         19 |me          19 me|ïde       19 blastome|me  19 ant|é

Most frequent suffixes ([LEM], French):

683 e       263 o       216 é         177 me      177 ique      175 aire
173 se      117 al      111 ose       105 ome      86 ïde        83 que
 82 sarcome  80 s        70 x          66 eux      61 ation      56 ite
 55 blastome 38 oïde     37 um         35 ant      34 on         34 ment
 33 us       33 cytaire  32 tion       32 ien      32 ement      32 atose

Discussion

Working on POS-tagged input increases the number of tagged word pairs identified because of polycategorial word forms. On the one hand, they can undergo part-of-speech conversions (such as the word technique mentioned earlier). On the other hand, the same pair of untagged word forms may give rise to several tagged word pairs when one of them is polycategorial.

In contrast, a pair of untagged word forms, although identified in the non-tagged experiment, may happen not to be picked in the [POS] experiment. An analysis of these cases showed that this occurred in the following situations:

1. Remaining tagging errors: a tagged word form was not identified because it was incorrectly tagged;
2. Missing category in word list: a polycategorial word form lacked some entries in the tagged word list, so that a rule that would have applied to its non-tag part could not apply because of tag discrepancy;
3. Missing tagged rule variant: here, a rule was learnt with a specific syntactic constraint represented in the aligned word forms; however, these constraints are more specific than would be appropriate, which blocks the application of this rule in a correct context; for instance, the plural rule l/ADJ:sg|ux/ADJ:pl could not apply to the pair of noun forms canal/SBC:sg and canaux/SBC:pl, although a similar rule (not learnt from the aligned word pairs) exists in French for nouns: l/SBC:sg|ux/SBC:pl;
4. Correct blocking of a morphologically unrelated word pair; for instance, rule /ADJ:sg|e/ADJ:sg produces the feminine form of regular adjectives; the adjective constraint blocked the incorrect pair {gain/SBC:sg, gaine/SBC:sg} ({gain, sheath}).

In summary, using part-of-speech information decreases the error rate (increases precision), but also decreases recall. Some of these problems might be reduced by considering that a polycategorial word form occurrence should represent all its categories, as proposed by Krovetz [25] in a different context.
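The effect of handling tags as extensions of suffix strings can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used in the experiments: the function name related_pairs and its min_stem parameter are hypothetical, a rule is applied in one direction only, and the pair normal / normale is an illustrative adjective pair added here, whereas gain / gaine and canal / canaux are the examples discussed above.

```python
def related_pairs(rule, word_forms, min_stem=4):
    """Return the pairs of attested word forms related by `rule`.

    `rule` is a tuple (s1, s2) of suffix strings; when the input is
    POS-tagged, each suffix string includes its tag (e.g. "e/ADJ:sg"),
    so the tag constrains rule application like any other suffix.
    `min_stem` mimics the threshold on the common initial substring.
    """
    s1, s2 = rule
    forms = set(word_forms)
    pairs = set()
    for form in forms:
        if not form.endswith(s1):
            continue
        stem = form[:len(form) - len(s1)]
        if len(stem) < min_stem:
            continue
        candidate = stem + s2
        # Only attested word forms are accepted: this is the filter
        # provided by the reference word list.
        if candidate != form and candidate in forms:
            pairs.add((form, candidate))
    return pairs


# Untagged input: the feminine rule |e also links gain and gaine.
untagged = ["gain", "gaine", "normal", "normale"]
untagged_pairs = related_pairs(("", "e"), untagged)

# Tagged input: the adjective constraint blocks the noun pair.
tagged = ["gain/SBC:sg", "gaine/SBC:sg", "normal/ADJ:sg", "normale/ADJ:sg"]
tagged_pairs = related_pairs(("/ADJ:sg", "e/ADJ:sg"), tagged)

# A rule learnt on adjectives only misses the noun pair canal / canaux,
# whereas the corresponding noun rule would find it.
nouns = ["canal/SBC:sg", "canaux/SBC:pl"]
missed = related_pairs(("l/ADJ:sg", "ux/ADJ:pl"), nouns)
found = related_pairs(("l/SBC:sg", "ux/SBC:pl"), nouns)

print(sorted(untagged_pairs))
print(sorted(tagged_pairs))
print(sorted(missed), sorted(found))
```

The same string-matching procedure serves both settings; only the suffix strings change. This is what makes the tagged rules more precise (situation 4) and also what makes them miss correct pairs when the tagged variant of a rule was not learnt (situation 3).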
Working on lemmatized input appreciably reduced the number of word pairs and morphological families by filtering out inflection-related rules. The resulting rules, as expected, are much more focused on derivation and compounding, which are the semantically more interesting phenomena. The error rate, however, is increased due to lemmatization errors. The lemmatizer has a very good precision on general French, but does not know some specific medical words with irregular inflection, nor does it include provision for handling Latin words. Many of these specific properties were correctly acquired in the non-lemmatized method.

Conclusion and further work

The rationale behind the methods in this work is that terminologies contain a wealth of linguistic knowledge that can be “reverse engineered” to a large extent. Synonymy relations include explicit morphological variants for terms, purposely inserted to facilitate “natural language” indexing and search. They also contain more implicit morphological variants for single words. Observing such morphological word variants in the synonym terms that denote the same concept is a very strong clue that these words belong to the same morphological family. We have seen, with experiments in three different languages (of the Romance, Germanic and Slavic families), that this can lead to a high-precision collection of morphologically related word pairs.

This initial collection is then used to bootstrap a larger acquisition process that identifies more morphologically related word pairs across concepts. This increases recall, while keeping a very good precision. The application to large-size English data showed that recall can be fairly good too. How precisely recall depends on training set size remains to be investigated.

Besides, the direct exploitation of an unstructured list of word forms also proved to yield good quality segmentations and affixes. Altogether, these methods can segment a large part of the input word forms. Coupling them more tightly is a subject for further work.

Several methods were examined to obtain more precise segmentation and rules. Setting a higher threshold (four instead of three) on the minimal initial common substring removes a large part of the errors. A strategy for human validation could then be to systematically review generated word pairs with a common prefix substring of exactly three characters. A posteriori rule suffix fitting uses a larger language data sample to obtain better-confidence word segmentation and affixes. If part-of-speech-tagged input is available, rules can be made syntactically more precise and block some incorrect word pairs; however, polycategorial word forms result in a decrease in recall. A special treatment might be able to reduce this drawback. If, furthermore, lemmatized input can be obtained, rules and affixes are “cleaned” from the influence of inflection. The interesting affixes, among which morphemes can be found, are those of derivation and compounding, which are then more overtly exhibited.

In any case, a (small) proportion of the word pairs identified still require human correction. The number of word pairs to be reviewed may be reduced by observing that some of them are redundant. Finally, the remaining incorrect word pairs can be handled as exceptions, as is the case in many morphological processors such as lvg and FLEMM.

Synthesis

We have shown that starting from generally available terminological resources, a massive, gross description of medical word morphology can instantaneously be obtained in the form of a collection of affix strings and morphological rules. Its precision is high, and good recall may be obtained depending on the size of training data. Whereas many areas for improvement were already identified along these experiments, some additional research directions seem to be worth exploring:

We have been working up to now independently on several languages. It may be useful to take advantage of the parallel nature of international terminologies to collect multilingual lexical and morphological resources [26,11]. In return, this may help obtain better monolingual morphological resources.

One of the obstacles to the identification of canonical morphemes is the variation of affix strings, especially as caused by morphotactic constraints. These are elegantly dealt with by more elaborate linguistic models [18,27] that may help obtain more adequate morphological rules.

Just like phrase-level syntax is constrained by lexical semantics, word-level morphology is constrained by morpheme semantics (e.g., [28]). Semantic clues may then be another knowledge source to constrain morphological rule application. The semantic types available in the UMLS or in multiaxial terminologies such as SNOMED or MeSH might provide a useful source for these; their appropriateness for morphemes instead of terms should however be assessed.

We aim to refine this set of methods and to apply them to a larger set of languages.

Acknowledgement

We wish to thank Dr Roger Côté for a precommercial version of the French SNOMED Microglossary for Pathology, Yvan Emelin for a draft version of the Russian SNOMED Microglossary for Pathology, INaLF for lending us a version of Brill’s tagger trained for French, Fiammetta Namer for the FLEMM lemmatizer, and the NLM for the lvg tool, Specialist Lexicon and UMLS resources.

References

[1] Pacak MG, Norton LM, and Dunham GS. Morphosemantic analysis of -ITIS forms in medical language. Methods Inf Med 1980;19:99–105.

[2] Wingert F, Rothwell D, and Côté RA. Automated indexing into SNOMED and ICD. In: Scherrer JR, Côté RA, and Mandil SH, eds, Computerised Natural Medical Language Processing for Knowledge Engineering. North-Holland, Amsterdam, 1989:201–39.

[3] Wolff S. Automatic coding of medical vocabulary. In: Sager N, Friedman C, and Lyman MS, eds, Medical Language Processing. Computer Management of Narrative Data. Addison-Wesley, New York, 1986:145–62.

[4] Lovis C, Baud R, Rassinoux AM, Michel PA, and Scherrer JR. Medical dictionaries for patient encoding systems: a methodology. Artif Intell Med 1998;14:201–14.

[5] Silberztein M. Dictionnaires électroniques et analyse automatique de textes : le système INTEX. Masson, Paris, 1993.

[6] Toussaint Y, Namer F, Daille B, et al. Une approche linguistique et statistique pour l’analyse de l’information en corpus. In: Zweigenbaum P, ed, Actes de TALN 1998, Paris. June 1998.

[7] Burnage G. CELEX - A Guide for Users. Nijmegen: Centre for Lexical Information, University of Nijmegen, 1990.

[8] National Library of Medicine. UMLS Knowledge Sources Manual, 1999.

[9] McCray AT, Srinivasan S, and Browne AC. Lexical methods for managing variation in biomedical terminologies. In: Proc Eighteenth Annu Symp Comput Appl Med Care, Washington. McGraw-Hill, 1994:235–9.

[10] Dal G, Namer F, and Hathout N. Construire un lexique dérivationnel : théorie et réalisations. In: TALN99, 1999.

[11] Schulz S, Romacker M, Franz P, et al. Towards a multilingual morpheme thesaurus for medical free-text retrieval. In: Proceedings of MIE’99, Ljubljana, Slovenia. IOS Press, 1999.

[12] Grabar N and Zweigenbaum P. Acquisition automatique de connaissances morphologiques sur le vocabulaire médical. In: Amsili P, ed, Actes de TALN 1999, Cargèse. July 1999:175–84.

[13] Grabar N and Zweigenbaum P. Language-independent automatic acquisition of morphological knowledge from synonym pairs. J Am Med Inform Assoc 1999;6(suppl):77–81.

[14] Xu J and Croft WB. Corpus-based stemming using cooccurrence of word variants. ACM Transactions on Information Systems 1998;16(1):61–81.

[15] Jacquemin C. Guessing morphology from terms and corpora. In: Actes, 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’97), Philadelphia, PA. 1997:156–67.

[16] Déjean H. Morphemes as necessary concept for structures discovery from untagged corpora. In: Workshop on Paradigms and Grounding in Natural Language Learning, Adelaide. 1998:295–9.

[17] van den Bosch A, Daelemans W, and Weijters T. Morphological analysis as classification: an inductive-learning approach. In: Tsujii JI, ed, Proc 16th COLING, Copenhagen, Denmark. 5–9 August 1996.

[18] Koskenniemi K. Two-level morphology: a general computational model for word-form recognition and production. PhD thesis, University of Helsinki Department of General Linguistics, Helsinki, 1983.

[19] Theron P and Cloete I. Automatic acquisition of two-level morphological rules. In: ANLP97, Washington, DC. 1997:103–10.

[20] Côté RA. Répertoire d’anatomopathologie de la SNOMED internationale, v3.4. Université de Sherbrooke, Sherbrooke, Québec, 1996.

[21] Organisation mondiale de la Santé, Genève. Classification statistique internationale des maladies et des problèmes de santé connexes – Dixième révision, 1993.

[22] Spyns P. A robust category guesser for Dutch medical language. In: Proceedings of ANLP 94 (ACL), 1994:150–5.

[23] Brill E. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics 1995;21(4):543–65.

[24] Lecomte J. Le catégoriseur Brill14-JL5 / WinBrill0.3. Technical report, INaLF, December 1998. Available at http://jupiter.inalf.cnrs.fr/WinBrill/winbrill.etiquetage.html.

[25] Krovetz R. Viewing morphology as an inference process. In: Proceedings of the 16th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, 1993:191–202.

[26] Baud RH, Lovis C, Rassinoux AM, Michel PA, and Scherrer JR. Extracting linguistic knowledge from an international classification. In: Pappas C, Maglaveras N, and Scherrer JR, eds, Proceedings of MIE’97, Thessaloniki, Greece. IOS Press, 1997.

[27] Antworth EL. PC-KIMMO: a two-level processor for morphological analysis. Number 16 in Occasional Publications in Academic Computing. Summer Institute of Linguistics, Dallas, TX, 1990.

[28] Dujols P, Aubas P, Baylon C, and Grémy F. Morphosemantic analysis and translation of medical compound terms. Methods Inf Med 1991;30:30–5.