A Contribution of Medical Terminology to

A Contribution of Medical Terminology to Medical Language Processing
Resources: Experiments in Morphological Knowledge Acquisition from
Thesauri
Pierre Zweigenbaum and Natalia Grabar
Service d’Informatique Médicale / DSI / Assistance Publique – Paris Hospitals &
Département de Biomathématiques, Université Paris 6, Paris, France
{pz,ngr}@biomath.jussieu.fr http://www.biomath.jussieu.fr/~pz/
Morphological knowledge, especially derivation and compounding, is extremely useful both for natural language processing and information retrieval. Whereas large morphological knowledge bases are available for some languages, this
is not the case for French. In order to fill this gap, we aim
at setting up a method that can acquire automatically various
kinds of morphological knowledge for a given language and
domain. This method relies on the synonym terms present in a
thesaurus of the domain and a list of words that can be drawn
from the same thesaurus. This paper presents a series of experiments whose goal is to learn morphological knowledge
from this initial data and without a priori linguistic knowledge.
It shows that one can obtain instantaneously a massive, gross
description of word morphology in the domain addressed.
Introduction
Morphological knowledge has long been acknowledged as
an important area of medical language processing and medical information indexing [1,2,3,4]. Three types of morphological relations are classically considered. Inflection produces
the various forms of a same word such as plural, feminine
or the multiple forms of a verb according to person, tense, etc.
Derivation is used to obtain, e.g., the adjectival form of a noun
(noun aorta $ adjective aortic). Compounding combines several radicals to obtain complex word forms (e.g., aorta + coronary yields aortocoronary).
Medical terminology indeed heavily draws on derivation and
compounding to form new words. A complete list of these
words cannot be expected to be present in a lexicon. Furthermore, the decomposition of a word into its component elements or morphemes is a step towards its decomposition into
atomic concepts. Therefore, it may lead to better indexing of
medical texts and better structuring of medical terminologies.
Inflectional morphology has been widely studied, and for
many languages, inflection analysis tools (lemmatizers) are
0 To appear in “Proc Conference on Natural Language Processing and
Medical Concept Representation, Phoenix, Az, 16–19 Dec 1999”.
available along with the suitable morphological knowledge
(for instance for French, Silberztein’s INTEX system [5] or
Namer’s FLEMM lemmatizer [6]).
In contrast, no complete description of French derivational
or compositional morphology is available. This situation prevails for most languages, notable exceptions being Dutch, English and German with the CELEX base [7]. For medical language, the only publicly available morphological knowledge
base is that of the UMLS Specialist Lexicon [8,9], which addresses inflection and derivation for medical English. Derivational morphology is the subject of recent projects (e.g., the
FRANLEX project on French derivational morphology [10]).
Building medical morphological resources for several languages is beginning to be tackled in a global way [11].
The present work participates in this movement towards
building resources for medical morphology processing in multiple languages. Our goal is to design a method that can help
acquire automatically various types of morphological knowledge for a given language and domain [12]. We chose the
approach of an automated, non-interactive tool, with no a priori linguistic knowledge, that can thus be applied quickly and
with little effort to a new language or domain [13] .
Automated methods for learning morphological variants
have been designed to collect resources for information retrieval or natural language processing systems. Some rely on
corpus collocations [14] or on approximate matches between
thesaurus terms and corpus occurrences [15]. In the framework of the FRANLEX project, another method searched for
correspondences between morphologically similar words with
differing syntactic categories [10]. Constrastive studies on
large word lists have also been performed to identify the most
frequent suffixes of a language [16]. Machine learning algorithms have been applied to morphology too [17]. Tools to
study morphological variants among terms of multiple vocabularies have also been designed when building the UMLS [9].
Whereas most of these methods work with classical string operations, linguistically more elaborated models have also been
used, such as two-level morphology [18], to learn more general rules from morphologically related pairs of words [19].
The base material we use to learn morphological knowledge is a terminology with synonym terms: this can be a thesaurus, classification, nomenclature, controlled vocabulary or
any other such terminological system where a « concept » can
be denoted by several terms. This material can be complemented with a list of additional word forms of the same domain. In the medical domain, such terminologies exist for
many languages: depending on the country, both national and
international terminologies may be available. We worked with
data obtained from the SNOMED Nomenclature and the International Classification of Diseases in French, English and
Russian.
This paper describes a series of experiments which aim to
assess the kind of morphological knowledge on medical words
that can be extracted from this data, using very little a priori
knowledge on the language addressed. It first reviews more
precisely the kind of knowledge that is expected. The material used throughout these experiments is then presented. The
rest of the paper describes several experiments performed on a
simple list of words, on the synonym terms of a thesaurus, and
on both. The contribution of additional syntactic knowledge,
in the form of part-of-speech tagging and lemmatization, are
also assessed. We conclude with a synthesis of these experiments and perspectives for further work.
(4) -e, -s, -aire, -ique, -ome, -ose
These strings are close to morphemes, some of which
have a strong semantic value in the domain; they can correspond to semantic primitives (atomic concepts) for domain knowledge representations (e.g., -ome = tumor).
These radicals are morphemes too, most of which have
strong semantic value in the domain and correspond to
concept types.
Decompositions (segmentations) of forms into their constitutive elements:
(7) cicatricielles ! cicatriciel+le+s
cicatriciel ! cicatrice+iel
lymphoblastique ! lympho+blastique
blastique ! blaste+ique
where we find adjectivation suffixes -iel and -ique, feminine -le and plural -s inflections, radicals cicatrice and
blaste, and prefix lympho- (related to radical lymphe).
This kind of decomposition gives access to semantic
components of a word, e.g., to index it [2] or to propose
a definition for a new word in a lexicon [3].
Morphologically related pairs of word forms:
(1) abdominal, abdominaux (inflection);
aorte, aortique (derivation);
cardio, cardiomégalie (compounding)
All the methods presented in this article deal with strings
rather than morphemes. However, we often use words such
as prefix, suffix or radical to denote strings that might play the
role of such word parts.
but also any combination of these three types of relations.
Families of morphologically related word forms:
(2) abdomen, abdominal, abdominale, abdominales,
abdominaux, abdomino;
cardia, cardiaque, cardiaques, cardio, cardiomégalie, cardiopathie, cardiopathies, cardite
Material
The following material provides the basis for this series of
experiments. We describe additional material where relevant
in the following sections.
These families are interesting, e.g., for information retrieval, where they can increase recall as a complement
to or a substitute for stemming [15].
Prefixes of the language or the domain, that can be added
in front of another word form:
(6) hyper-, poly-, para-
We aim to collect morphological knowledge of the following
types.
Radicals (or bases) of the language or the domain, to
which suffix strings can be appended:
(5) fibr, granul, hépat, hyalin, immun, lymph
Objectives
Language or domain suffix strings, which play a role in
these rules:
Thesaurus
Morphological rules that transform a word form into another, morphologically related word form:
We started from the series of synonym terms found in the
SNOMED International nomenclature. Depending on the experiment, we worked with French, English or Russian:
ejiel
(3) cicatrice ! cicatriciel
Examples : ejique, ejaire, js.
Depending on their precision, they can be used to segment a word form into its morphological constituents or
to produce a derived form.
2
French: the French version of the SNOMED Microglossary for Pathology [20] (12,555 terms); this is a precommercial version of the French Microglossary, gracefully
lent by Dr Roger Côté;
English: SNOMED International version 3.5, as included
in the 1999 UMLS Metathesaurus (128,855 terms).
These terms were obtained by collecting Metathesaurus strings with SAB=SNMI98 (source abreviation =
SNOMED International, 1998);
Methods
Detecting prefixes and compositional radicals
We performed an analysis of our reference word list based
on commutation tests. By definition, if fibro- is a radical,
this means that it can combine with various forms F to form
fibroF compounds (with a possible local morphological adaptation at the frontier of the two components). We then assume,
conversely, that if a large enough number of pairs { F , fibroF }
occur in the reference list, this is a clue that fibro- is a « candidate radical » or a candidate prefix.
For instance, for fibro-, 18 pairs can be found in the list:
Russian: the Russian version of the SNOMED Microglossary for Pathology (13,462 terms); this is a working version of an ongoing translation, gracefully lent by
Yvan Emelin.
Reference word list
We also used a reference list of attested word forms occurring in:
(8) {améloblastique, fibroaméloblastique}
{blastique, fibroblastique}
{chondrosarcome, fibrochondrosarcome}
{kystique, fibrokystique}
French: the French version of the SNOMED Microglossary for Pathology or the main terms of the French ICD10 [21]; capitalized words (proper names) were filtered
out; total: 8,874 word forms;
:::
{sclérose, fibrosclérose}
{vasculaire, fibrovasculaire}
{xanthome, fibroxanthome}
English: SNOMED International version 3.5 and ICD9-CM: Metathesaurus strings with SAB=SNMI98 or
ICD99 (source abreviation = SNOMED International,
1998 or ICD-9-CM, 1998); total: 49,627 word forms;
which provides good support for this candidate.
Transposition to suffixes
Russian: the Russian version of the SNOMED Microglossary for Pathology; 9,871 word forms.
The same basic principle may be applied to identify suffix
strings. Here, we want to spot word pairs of the form:
All punctuation characters were considered as delimiters, and
word forms containing digits were filtered out. All forms were
converted to lower case.
(9) {ligament, ligamentaire}
{ligament, ligamenteux}
{ligament, ligaments}
{essentielle, essentiellement}
{pneumo, pneumocyte}
Bootstrapping on a list of word forms
In this section, we examine what can be done starting from
a “flat” list of word forms. This list of word forms has no a
priori structure. Nevertheless, being drawn from medical terminologies, it has the specific property of containing a large
proportion of words that are characteristic of the medical domain: domain-specific words (leucokératose, pemphigoïde) or
words that typically occur in medical corpora (bénin, aberration).
Many medical words are formed by compounding of Latin
and Greek radicals: the so-called neo-classical compounds.
For instance, fibro- and amélo- are prefixed to blastique to
form fibro+amélo+blastique. A list of such radicals is a key
resource for algorithms that segment word forms into their
constituents (e.g., [3,4,22]). Our aim here is to collect such
radicals.
which reveal suffixable strings -aire, -eux, -s, -ment, -cyte.
Results
Prefixes and compositional radicals
We applied the prefix method to the whole reference list and
obtained 779 radical strings R that occur in at least one {F ,
RF } pair, among which 235 occur in at least two pairs and
120 in at least three. The 33 radicals with a frequency greater
than or equal to 10 are shown below.
37 intra
36 in
34 péri
28 para
26 hyper
23 pré
21 hypo
20 trans
19 inter
Material
For this series of experiments, we worked with the reference
list of 8,874 word forms obtained from the French version of
the SNOMED Microglossary for Pathology and the French
ICD-10.
3
18 fibro
17 ostéo
17 dé
15 neuro
15 micro
15 anti
14 poly
13 épi
13 sub
13 rétro
13 ré
13 pro
13 endo
13 dys
13 angio
13 a
11 myo
11 di
11 bi
10 pseudo
10 myélo
10 hém
10 extra
10 adéno
The first 9 elements found (intra-, in-, péri-, para-, hyper-,
pré-, hypo-, trans-, inter-) are prefixes of general French that
are frequent in the medical domain. The first domain-specific
radicals can be found at rank 10 (fibro-, ostéo-).
The first error occurs at rank 61 (frequency = 5): pis neigher a radical nor a prefix, although pairs {laque,
plaque}, {liée, pliée}, {lèvre, plèvre}, {réparation, préparation}, {urine, purine} are attested in the reference list. In total, for frequencies greater than 2, we found 13 errors in 120
radicals, i.e., a precision of 89.2%. If we exclude one-letter
prefixes (only a- is a correct prefix), we reduce errors further
down to 5 out of 111 radicals, yielding a 95.5% precision.
Table 1 – Some preferred and synonym SNOMED terms containing words related to the mucous membrane.
Suffixes
that start with an-, only 6 were segmented with this prefix: an+aplasique, an+aérobies, an+euploïdie, an+ictérique,
an+ovulation, an+ovulatoire. Most of the other forms starting with an- do not belong to this segmentation class, whether
they be atomic (angine) or involve other prefixes or radicals
(ana-, angio-, anté-, anti-, ano-). However, some an- segmentations are missed because the word form corresponding to the
radical is either not present in our reference word list (e.g., algésie in an+algésie) or only exists as a bound morpheme in
modern language (e.g., -ergie in an+ergie).
In the suffix method, the pairs of forms found are by construction restricted to strict additions of suffixes. This prevents this process from identifying inflections or derivations,
e.g., {abdominal, abdominaux} or {aorte, aortique}, where
morphotactic constraints can result in what looks like suffix
substitution. The following method deals with these cases by
sifting additional knowledge from thesauri.
Code
T-00400
T-00400
T-00400
D0-A0100
D0-A0100
D0-A0100
D0-A0100
We applied the suffix method to the whole reference list and
obtained 733 suffix strings S that occur in at least one pair {F ,
F S }, among which 170 in at least two pairs and 78 in at least
three. The 15 suffixes with at least 10 occurrences are:
1136 s
168 e
34 ment
24 es
21 ux
17 ne
15 me
14 se
13 le
12 ïde
12 use
11 blastome
10 ur
10 sarcome
10 cyte
The first two are inflection suffixes (plural -s and feminine
-e), then adverb formation suffix -ment. The first domainspecific suffix is -me (rank 7), which actually corresponds to
forms ending in -ome. Finally, the first radical found in a suffix position is -blastome (rank 12).
Class
01
02
02
01
02
02
02
English term
Mucous membrane, NOS
Mucosa
Tunica mucosa
Mucous membrane inflammation, NOS
Mucitis, NOS
Mucosal inflammation, NOS
Mucositis, NOS
Discussion
Bootstrapping on series of synonym terms
In the prefix method, the fact that an initial form (such as
blastique) can itself combine with other radicals (e.g., centroblastique, chondroblastique, lipoblastique, etc.) could be
considered an additional clue. It confirms that in the obtained
segmentation fibro+blastique, we do have two components
that commute with various elements. However, several « parasitic » forms such as tique also give rise to numerous wrong
decompositions. For instance, {tique, attique} which, in parallel with {tente, attente} and {teinte, atteinte}, push forward
an erroneous prefix at-. Therefore, we did not include this criterion in our procedure, since it it not discriminative enough.
Prefix analysis detects radicals or prefixes that are frequent
in the domain with a good precision. At the same time, the 111
radicals Ri occur in 904 pairs of forms { Fj , Ri Fj } for which
they propose a segmentation of a compound form Ri Fj into
Ri + Fj . Among these 904 pairs, the 5 erroneous prefixes
produce 13 wrong segmentations, i.e., an 1.4% error rate. We
must add to these possible but rare irrelevant segmentations
obtained with otherwise correct prefixes, such as a+voir or
a+mont.
Working on pairs of attested word forms is a powerful filter against irrelevant decompositions. Among the 173 forms
A flat word list, as used in the previous section, has no inherent structure that can provide clues as to which word forms
should be matched together. We relied instead on the sole internal form of words to build constrastive pairs.
Thesauri do provide structure around terms. Given a concept, they often include both a preferred term and synonym
terms. They also often specify hierarchical and other relations
among concepts and terms. In this series of experiments, we
take advantage of the series of synonym terms that occur in
medical thesauri to identify pairs of morphologically related
word forms. Instances of such terms are shown on table 1
(Class 01 means preferred term, whereas 02 means synonym
term). Our bootstrapping method performs a kind of reverse
engineering of the synonym enumeration process initially performed by the terminology authors.
Material
We start from the series of synonym terms found in the
SNOMED International nomenclature. This experiment was
performed for French, English and Russian.
4
Methods
and for English:
Aligning morphologically related word forms
js (plural inflection) {organ, organs}
ajosis (-osis derivation) {diverticula, diverticulosis}
usjosis (-osis derivation) {thrombus, thrombosis}
ajc (-ia / -ic derivation) {aphasia, aphasic}
aemiajemia (alternation) {tularaemia, tularemia}
jal (-al derivation) {orbit, orbital}
Given two synonym terms {T1 , T2 }, where each term is a
set of word forms, we consider the pairs of word forms { IS1 ,
IS2 } found in the two terms (with IS1 2 T1 and IS2 2 T2)
and that share a « long enough » initial substring I . We
worked with minimal lengths of three and four characters; this
threshold is indeed a parameter of the method. Unless otherwise mentioned, in this paper, this threshold is set to 3. For
instance, given the terms in table 1, this algorithm aligns the
following word pairs:
Suffix strings are collected at the same time. For instance,
suffix strings -s, -a, -osis, -us, etc. are identified in the word
pairs cited in the above rules.
Collecting morphological families
(10) {mucous, mucosa} (T-00400)
{mucous, mucosal} (D0-A0100)
{mucous, mucositis} (D0-A0100)
{mucosal, mucositis} (D0-A0100)
Each word pair specifies that two word forms are linked by a
morphological relation: they belong to the same morphological family. We can then try to recover these morphological
families. Two word forms are included in the same family in
the following situations:
Note that such an initial common substring algorithm, applied in an unconstrained context (e.g., our flat list of word
forms), would give very bad results. For instance, a fourcharacter initial substring length would collect together all the
words that start with abso-: absolute, absonum, absorbable,
absorbent, absorber, absorptiometry, absorption, absorptive.
This would be correct for the absorb-/absorp- family, but
would mix with them absolute and absonum.
Inducing morphological rules
The aligned word pairs are then considered as examples
from which morphological rules are induced. These rules are
potentially applicable to identify other word pairs that undergo
the same morphological relation.
At this step of the process, we aim to obtain rules that embody the minimal specific features of the word pairs, while
keeping a good generality. Each example {IS1 , IS2 }, where
I is the maximal initial common substring of the two word
forms, is generalized into a rule { S1 , S2 } that can be interpreted as follows: given a word form that ends with suffix
string S1 , one can derive a word form in which S1 is substituted with S2 ; or the reverse, these rules being considered
symmetric. We note rules {S1 , S2 } with the abbreviated notation S1 jS2 .
Concretely, aligning two word forms { IS1 , IS2 } identifies
at the same time the suffix strings {S1 , S2 }, thus the associated rule S1 jS2 . Some of the most frequent rules obtained for
French are ( denotes the null string):
They were identified as a word pair by the alignment process;
Both were segmented by the alignment process on the
same maximal initial substring I ;
By transitivity, two families that share a common word
form are merged.
For instance, the word pairs {ischial, ischiadic} (T-16330)
and {ischial, ischium} (T-12351) result in the morphological
family ischiadic, ischial, ischium which is correctly separated
from ischaemia, ischaemic, ischemia, ischemic, although their
initial four-character substring (isch-) is the same.
Results
Table 2 – Bootstrapping: acquired knowledge
Knowledge elements
French English Russian
Terms
12,555 128,855
13,462
Series of synonyms
2,344
26,312
2,636
Word pairs
1,718
16,653
2,661
Unique word pairs
1,180
6,757
1,692
Precision
99.9%
–
99.9%
Suffix strings
539
2,610
818
Morphological rules
644
3,188
955
Morphological families
662
3,207
824
Precision
98.6%
–
99.8%
Words per family
2.57
2.94
2.80
js
je
(plural inflection) {articulaire, articulaires}
(feminine inflection) {surrénal, surrénale}
ejique (adjectives ending with -ique) {prostate, prostatique}
ejque (adjectives ending with -ique)
{hyperkaliémie, hyperkaliémique}
ejaire (adjectives ending with -aire) {valvule, valvulaire}
mejsarcome (derivation -ome / compounding -osarcome)
{mélanome, mélanosarcome}
This method was applied to the SNOMED Microglossary
for Pathology (French and Russian) and to the full SNOMED
(English). The results are displayed on table 2. The full
English SNOMED, with a size ten times larger than that of
5
erate new word forms. In particular, the rules that include the
zero suffix do not mean that a correct word form can be obtained by applying them to any word form! The next section
shows how to use the rules in an associative manner.
Besides, the suffix strings may not all look appropriate. For
instance, the suffix strings -c and -a in rule ajc would be better
expressed as suffix -ic applied to word forms ending in -ia (as
in {aphasia, aphasic}). The next section also shows how to
obtain a better segmentation of these suffix strings.
the French or Russian Microglossary, yields six times more
unique word pairs and five times more rules.
The precision of word pairs and morphological families was
evaluated through human review. All of them were examined
for French and Russian, whereas for English word pairs and
families, random samples were extracted and reviewed.
The most frequent suffix strings found in unique English
word pairs were (frequency over 100):
671 s
630 a
555 osis
517 us
380 al
371 e
279 ic
239 um
211 y
193 idae
168 es
155 is
132 ed
131 c
128 ectomy
113 sis
110 tic
109 emia
108 ing
107 ia
Applying induced rules to reference word list
We now use the set of rules acquired in the bootstrapping
step to identify new word pairs in a set of word forms.
Material
Our material in this experiments is the set of rules acquired
in the bootstrapping step (previous section) and the reference
list of attested word forms, for each of the three languages.
To evaluate the recall for English word pairs, we used the
UMLS 1999 Specialist Lexicon lvg V1.83 tool [8] in inflection and derivation modes [13]. We ran it with options
(i) lvg -m -fi to produce inflections and (ii) lvg -m
-fRf for derivations.
Discussion
The precision of the word pairs and of the morphological
families is extremely high, and even higher when using a fourcharacter threshold. We believe this is due to the very constrained context in which initial common substring matching
is performed: synonym terms. This may be compared to results obtained with other kinds of constraints: using cooccurrences in corpora [14] or initial common substring matching
between theraurus terms and corpus occurrences [15].
Errors occur when two word forms occurring in two synonym terms accidentally share a long enough initial substring.
This is the case in synonyms terms for concept C-A0700: oral
contraceptive preparation, nos and birth control pill, nos lead
to the alignment of {control, contraceptive} and to the rule
oljaceptive.
It would be possible to create manually exception lists of
word pairs, which would prevent errors produced at a low
threshold. However, in these experiments, we have opted for
a method that can be applied to another language or domain
with virtually no human intervention (currently, the only human intervention is an optional tuning of the threshold).
The word pairs identified include the three types of morphological variation (inflection, derivation and compounding).
Additional linguistic knowledge may cancel the effect of inflection, so that only derivation and compounding remain.
This will be shown later in this paper. The observation of occurrences of suffix substrings in initial position in other word
forms may help differentiate (compositional) radicals from
derivational or inflectional suffixes; however, we have not yet
come out with an automated method for separating out these
three types of variation.
The acquired rules only have an associative value: given two
word forms, the rules can propose to relate them or not. It
would not be wise however to apply these rules to any word
form which ends with one of its suffix strings in order to gen-
Methods
Identifying new word pairs
For a given language, each morphological rule is applied to
each word form Wi of the reference list, producing another
word form Wj . If the resulting word form Wj also belongs
to the reference list, i.e., it is attested in the language, the
word pair {Wi , Wj } is considered as morphologically related
(the implemented program actually uses a more efficient algorithm). It is then added to a new set of word pairs. If the
reference list contains all the word forms of the bootstrapping
synonym terms, as is the case here, all the word pairs aligned
at bootstrapping are found again here. The aim is that new
word pairs can be identified too.
As in the alignment step, we limit noise by constraining the
initial substring I of each pair to have a minimal length over a
given threshold. We also set this threshold to three characters.
Extending the morphological families
Here again, word pairs are merged into morphological families. These families are either extensions of the ones found
in the bootstrapping step (when they contain a word pair that
existed in the bootstrapping step) or are new families.
Evaluating precision
Precision was defined as the proportion of word pairs identified by our method that actually are in a relation of inflection,
derivation, compounding or a combination thereof. Precision
6
forms from SNOMED Microglossary for Pathology and ICD10; for English, word forms from full SNOMED and ICD-9CM; for Russian, word forms from SNOMED Microglossary
for Pathology. The results are shown in table 3. Rule suffix
fitting was only experimented on French.
was evaluated by a human review of the word pairs and morphological families returned by our program. For English, the
size of the results was too large, so that samples were extracted
for review: one word pair every 125 and one morphological
family every 25.
Evaluating recall for English
Table 3 – Extension: acquired knowledge.
Knowledge elements
French English Russian
From bootstrapping
Morphological families
662
3,207
824
Words per family
2.57
2.94
2.80
Morphological rules
644
3,188
955
Input
Word forms
8,874
49,627
9,871
Output
Word pairs
4,838
25,740
6,887
99.6%
Precision
98.1% 845%
Morphological families
1,663
6,424
1,707
99.3%
Precision
97% 864%
Words per family
3.21
3.84
3.55
For English, we are in a position where morphological resources for derivation (as well as inflection) are available:
those distributed in the UMLS Specialist lexicon and its lvg
tool. We ran lvg on the list of word forms collected for English to identify the word pairs where either (i) one word form
was an inflected form of the other, or (ii) one word form was
a derived form of the other. We took these word pairs as the
gold standard for evaluating the recall of our method. Recall
was defined as the proportion of lvg-returned word pairs also
identified by our method.
Fitting suffix strings in rules
The rules learnt in the bootstrapping step are the most general suffix substitutions that account for the initial word pairs.
Now that more word pairs have been collected, we can try to
make these rules more specific. In the bootstrapping step, we
had individually generalized each example Ei = {Ii S1 , Ii S2 }
to the rule R = {S1 , S2 }, on the basis of the maximal initial
substring Ii common to both words in the pair. We shall now
globally consider all examples EiR subsumed by this rule, and
identify the maximal final substring SR common to all their
maximal initial substrings IiR . We then specialize the rule
R = {S1 , S2 } to {SR S1 , SR S2 }. This corresponds to extending suffixes S1 and S2 of rule R maximally to the left while
still covering all the examples.
Let us emphasize that if we had fitted the rules at bootstrapping, many of them might have been overfitted because of the
smaller number of instances available at that step. They would
therefore have been unable to apply to some word pairs of the
reference list. We expect this to be less the case on the larger
set of word pairs collected in this second step.
The rules that apply to only one word pair have no generality
on the set of word forms considered. They correspond to word
pairs learnt on only one word pair during the bootstrapping
step, and for which no external confirmation was found in the
reference word list. These rules are kept as is. They can be
considered as having a lower confidence than the other rules.
The obtained suffix strings are now the longest possible that
can allow a rule to apply to the largest possible subset of the
reference list. We expect these strings to be a better approximation of the morphemes that a linguistic analysis would
identify in these word forms.
The number of unique word pairs identified, compared with
that obtained at bootstrapping, is multiplied by 4. Morphological families are twice as numerous, and they are larger on
average. Recall values for inflection and derivation are shown
on table 4.
Table 4 – Inflection and derivation recall for English.
Knowledge elements Automatic
lvg Recall
Word pairs
25,740
Inflection
2,697 91.2%
Derivation
2,973 79.2%
For French, The most frequent rule applications are:
1149 js
297 je
145 jes
75 ejique
69 aireje
64 ejs
64 ejque
57 ationjé
53 sejx
46 aljo
45 ljux
45 ejo
41 lejux
41 ejose
40 itejo
38 jle
37 ejome
37 airejes
34 jne
34 jment
34 fjve
34 ejé
32 me|sarcome
539 suffixes are involved, among which the most frequent are:
1271 s
1111 e
272 es
269 o
227 é
222 aire
Results
The extension of word pairs and morphological families was
performed on the material described above : for French, word
7
177 me
173 se
173 ique
145 ome
142 ose
117 ux
116 al
100 que
97 x
88 ïde
86 le
82 sarcome
septic}, {tea, teal}, and fewer ones involve a larger common
initial substring (e.g., {china, chin}).
Recall for English inflected word pairs was very good (over
90%; see table 4), and recall for derivation was fairly good too
(close to 80%). Silence is basically due to rules that were not
represented in the training material (word forms in SNOMED
terms), or to rules that are observed in the training material but
could not be identified through the alignment method. It will
be interesting to check whether this silence could be reduced
by using the results of the direct observation of the reference
word list, as described earlier in this paper. This is yet to be
done.
Our method identifies a much larger number of word pairs
than were obtained by lvg. Several reasons can be invoked
to account for this difference. First indeed, the evaluation of
precision shows that 165% of these word pairs are spurious. Second, lvg did not generate the compound word forms
found by our method. But, more important, our rules are
somewhat redundant: one rule may perform the same substitution as the combination of several more elementary rules also
identified on aligned word pairs. Given n words in a family, it
can then collect up to O(n2 ) word pairs ( n(n2?1) ), depending
on the redundancy of the acquired rules.
In the most frequent rules, inflection again is the most frequent (-s, -e, -es); adjectival (-ique, -aire) and nominal (-ation)
derivation comes next; domain-specific suffixes (-ose, -ome)
and compounding (-sarcome) come later.
Extended suffix strings as obtained by rule fitting look closer
to what one could expect to be the morphemes involved in
these word forms, and the rules obtained seem to better agree
with a linguistic analysis of the morphological operations involved. For instance, rule iquejie pertains to the formation of
adjectives with an -ique ending rather than adjectives with a
-que ending, as the former rule queje seemed to propose. It
specifies how this adjectival derivation applies to forms ending in -ie, and was learnt on 64 word forms among which
achromie, allergie, amnésie, etc.
Some suffix strings (such as -ique, just discussed, or -ome, ose or -ation) were previously split among several forms; they
are now better identified, and their relative frequency is larger.
2879 initial strings are involved; the most frequent ones are:
29 myélo
25 ostéo
23 fibr
22 fibro
21 adéno
19 lympho
16 lipo
16 granul
15 angio
13 plasmocyt
13 myo
13 immun
13 chondro
12 méning
12 mélano
12 lymphocyt
12 histiocyt
11 hémangio
11 hyalin
11 dur
10 plan
10 neuro
in which one can find the radicals directly obtained from the
flat list of words in the first experiment; their relative frequency is different, two-letters radicals and prefixes are indeed absent, and some variant segmentations occur (i.e., fibro
/ fibr).
Rule suffix fitting (performed only on French data, at threshold four) applied to 462 rules out of 644 (72%). Among the
23 above-mentioned most frequent rules, suffix extension applied to iejique (+i), eusejeux (+eu), aljaux (+a), alejaux (+a),
ifjive (+i) and omejosarcome (+o):
1149 js
297 je
145 jes
75 ejique
69 aireje
64 iejique
64 ejs
57 ationjé
53 eusejeux
46 aljo
45 ljux
45 ejo
41 ejose
41 alejaux
40 itejo
38 jle
37 ejome
37 airejes
34 jne
34 jment
34 ifjive
34 ejé
32 omejosarcome
31 atoseje
The most frequent suffixes are now:
1243 s
1027 e
299 ome
271 o
261 ique
254 es
234 aire
202 é
177 ose
158 eux
133 oïde
119 ation
115 al
108 euse
100 ie
92 ion
83 if
83 ale
82 osarcome
Discussion
Using more syntactic knowledge
Precision was still excellent for French and Russian on both
word pairs and families, and was good for English (the confidence interval was computed with = 0.05). Whereas the
rules learnt may be overgeneral, their application to couples of
attested word forms is a powerful filter. The fact that the reference word list mostly contains domain-specific words may
also have helped.
For a large part, erroneous word pairs involve a threecharacter common initial substring: {den, dent}, {fel, felids},
{levamphetamine, lev}, {maia, main}, {plan, plane}, {sepia,
Introduction and material
As already mentioned, the rules learnt are overgeneral. This
may come from several origins, among which:
1. the constraints they put on string form (suffix strings)
may be too loose (suffix strings may be too short); the
rule fitting step just discussed may slightly reduce this
default;
2. they do not include any exception, as is often done (e.g.,
8
rule e/SBC:sgjique/ADJ:sg stands for a rule ejique constrained
to apply between a singular noun (SBC:sg) and a singular adjective (ADJ:sg).
For the [LEM] experiment (for French only), FLEMM was
run on the tagged terms and on the tagged word list, and
tagged word forms were replaced with their root forms.
The rule acquisition method was then applied to both inputs.
in lvg); this requires human intervention;
3. they do not take into account the syntactic properties of
word forms; part-of-speech (POS) is generally used to
constrain rule application (as in lvg); this is the point
we address in this section;
4. they do not take into account the semantic properties of word forms; we have not yet worked on that
point, although the multiaxial nature of terminologies
like SNOMED might provide useful clues here.
Results
Table 5 – Acquired knowledge with more syntactic information
(threshold = 4)
[POS]
[POS] [LEM]
Knowledge elements
French Russian French
Terms
12,555
13,462 12,555
Series of synonyms
2,344
2,636
2,344
Bootstrapping
Word pairs
1,595
2,554
1,414
Unique word pairs
1,104
1,826
935
Suffix strings
489
246
412
Morphological rules
594
854
507
Reference word list
Word forms
8,989
9,871
7,188
Extension
Word pairs
4,512
6,710
2,390
Morphological families
1,650
1,731
1,096
Words per family
3.10
3.45
2.82
In this section, we introduce syntactic constraints on rules by
applying the previous method to POS-tagged input: each word
form in the thesaurus terms was tagged with its part-of-speech;
and each word form in the reference word list was tagged with
all its possible parts-of-speech. This experiment [POS] was
performed on French and Russian, using a specifically trained
version of the Brill tagger [23].
Furthermore, as already discussed too, rules mix the different kinds of morphological variation: inflection, derivation
and compounding, inflection being the most frequently occurring one. If each word is replaced with its lemma (canonical
form), inflectional variation is canceled and the acquired rules
and suffixes should be cleaner. Again, the previous rule acquisition method is then applied to the lemmatized input material. This experiment [LEM] was performed on French using
Namer’s FLEMM lemmatizer [6].
Methods
Table 5 shows the results obtained for both experiments.
Word counts (word pairs, word forms, words per family) are
given for tagged word forms in the [POS] experiment and for
lemmatized words in the [LEM] experiment.
In the [POS] experiment, the number of aligned word pairs
has increased compared to the previous, non-tagged experiment. This comes from the fact that the same word form may
occur with different part-of-speech tags depending on its usage. The number of rules and of new word pairs is also increased. However, the number of morphological families is
slightly decreased, with a larger average number of words per
family.
On the contrary, in the [LEM] experiment, all these figures
are lower than those obtained in the non-tagged experiment.
For the French [POS] experiment, we first used the Brill tagger trained for French at INaLF [24] to tag a subset of the Microglossary terms with the INaLF tagset. We then manually
corrected the errors in the assigned tags, and trained the tagger on this corpus. The process was repeated on subsets of the
thesaurus of increasing sizes, until we obtained a satisfactory
POS-tagged version of the thesaurus.
Then all these tagged SNOMED word forms were copied
into the reference word list, replacing the non-tagged
SNOMED word forms therein. Word forms with several possible part-of-speech tags have several entries in this list. For
instance, the French word technique may be either a noun (English technique) or an adjective (English technical). The remaining non-tagged word forms (ICD-10 forms not occurring
in the Microglossary) were tagged manually.
The same process was applied to the Russian version of the
Microglossary for Pathology. However, since we had no already trained tagger for Russian, the first subset was tagged
manually.
Part-of-speech tags are appended to words or suffixes with
a slash character “/” as separator. For instance, one term for
code M-09150 is tissu/SBC:sg égaré/ADJ:sg pendant/PREP
la/DTN:sg manipulation/SBC:sg technique/ADJ:sg. We handle these tags as extensions to word suffix strings, so that the
9
37 um
683 e
86 ïde
35 ant
83 que
263 o
34 on
82 sarcome
216 é
585 /ADJ:sgjs/ADJ:pl
38 ale/ADJ:sgjaux/ADJ:pl
34 ment
80 s
177 me
519 /SBC:sgjs/SBC:pl
38 al/ADJ:sgjaux/ADJ:pl
33 us
70 x
177 ique
267 /ADJ:sgje/ADJ:sg
37 aire/ADJ:sgjes/SBC:pl
33 cytaire
66 eux
175 aire
128 /ADJ:sgjes/ADJ:pl
34 /ADJ:sgjment/ADV
32 tion
61 ation
73 e/SBC:sgjique/ADJ:sg
32 ome/SBC:sgjosarcome/SBC:sg 173 se
32 ien
56 ite
117 al
69 aire/ADJ:sgje/SBC:sg
32 if/ADJ:sgjive/ADJ:sg
32 ement
55 blastome
111 ose
64 ie/SBC:sgjique/ADJ:sg
32 e/SBC:sgjome/SBC:sg
32 atose
38 oïde
105 ome
56 /ADJ:sgj/SBC:sg
31 en/ADJ:sgjenne/ADJ:sg
51 ation/SBC:sgjé/ADJ:sg
31 atose/SBC:sgje/SBC:sg
49 euse/ADJ:sgjeux/ADJ:sg
27 l/ADJ:sgjlle/ADJ:sg
Discussion
49 e/ADJ:sgjs/ADJ:pl
27 if/ADJ:sgjion/SBC:sg
Working on POS-tagged input increases the number of
46 al/ADJ:sgjo/PFX
27 e/SBC:sgjé/ADJ:sg
tagged
word pairs identified because of polycategorial word
41 e/SBC:sgjo/PFX
27 aire/ADJ:sgjo/PFX
forms.
On
the one hand, they can undergo part-of-speech con40 ite/SBC:sgjo/PFX
26 ique/ADJ:sgjo/PFX
versions
(such
as the word technique mentioned earlier). On
38 e/SBC:sgjose/SBC:sg
25 me/SBC:sgjse/SBC:sg
the other hand, the same pair of untagged word forms may
give rise to several tagged word forms when one of them is
Most frequent suffixes ([POS], French):
polycategorial.
In contrast, a pair of untagged word forms, although iden1120 /ADJ:sg
118 oïde/ADJ:sg
tified in the non-tagged experiment, may happen not to be
654 /SBC:sg
112 ation/SBC:sg
picked in the [POS] experiment. An analysis of these cases
634 s/ADJ:pl
104 ie/SBC:sg
showed that this occurred in the following situations:
530 e/SBC:sg
104 euse/ADJ:sg
519 s/SBC:pl
98 es/SBC:pl
1. Remaining tagging errors: a tagged word form was not
398 e/ADJ:sg
86 ion/SBC:sg
identified because it was incorrectly tagged;
310 ome/SBC:sg
79 osarcome/SBC:sg
302 o/PFX
79 if/ADJ:sg
2. Missing category in word list: a polycategorial word
254 ique/ADJ:sg
76 aux/ADJ:pl
form lacked some entries in the tagged word list, so that
222 aire/ADJ:sg
74 ale/ADJ:sg
a rule that would have applied to its non-tag part could
191 ose/SBC:sg
64 ite/SBC:sg
not apply because of tag discrepancy;
162 é/ADJ:sg
59 oblastome/SBC:sg
3. Missing tagged rule variant: here, a rule was learnt with
152 eux/ADJ:sg
54 ive/ADJ:sg
a specific syntactic constraint represented in the aligned
144 al/ADJ:sg
49 l/ADJ:sg
word forms; however, these constraints are more specific
128 es/ADJ:pl
48 se/SBC:sg
than would be appropriate, which blocks the application
of this rule in a correct context; for instance, the plural
Most frequent rule applications ([LEM], French):
rule l/ADJ:sgjux:ADJ:pl could not apply to the pair of
noun forms canal/SBC:sg and canaux/SBC:pl, although
37 ejome
23 sejtique
75 ejique
a similar rule (not learnt from the aligned word pairs)
69 aireje
34 jment
22 alje
exists in French for nouns: l/SBC:sgjux:SBC:pl;
64 ejs
34 ejé
21 jux
64 ejque
32 mejsarcome
20 iquejome
4. Correct blocking of a morphologically unrelated word
57 ationjé
31 atoseje
20 euxjose
pair; for instance, rule /ADJ:sgje/ADJ:sg produces
53 sejx
28 fjon
20 cejt
the feminine form of regular adjectives; the adjec46 aljo
28 airejo
19 jme
tive constraint blocked the incorrect pair {gain/SBC:sg,
45 ejo
27 iquejo
19 mejïde
gaine/SBC:sg} ({gain, sheath}).
41 ejose
25 mejse
19 blastomejme
In summary, using part-of-speech information decreases the
40 itejo
23 jse
19 antjé
error rate (increases precision), but also decreases recall.
Some of these problems might be reduced by considering that
Most frequent suffixes ([LEM], French):
a polycategorial word form occurrence should represent all its
categories, as proposed by Krovetz [25] in a different context.
Working on lemmatized input sensibly reduced the number of word pairs and morphological families by filtering out
Most frequent rule applications ([POS], French):
10
inflection-related rules. The resulting rules, as expected, are
much more focused on derivation and compounding, which
are the semantically more interesting phenomena. The error rate, however, is increased due to lemmatization errors.
The lemmatizer has a very good precision on general French,
but does not know some specific medical words with irregular inflection, nor does it include provision for handling Latin
words. Much of these specific properties were correctly acquired in the non-lemmatized method.
then more overtly exhibited.
In any case, a (small) proportion of the word pairs identified
still require human correction. The number of word pairs to be
reviewed may be reduced by observing that some of them are
redundant. Finally, the remaining incorrect word pairs can be
handled as exceptions, as is the case in many morphological
processors such as lvg and FLEMM.
Synthesis
We have shown that starting from generally available terminological resources, a massive, gross description of medical
word morphology can instantaneously be obtained in the form
of a collection of affix strings and morphological rules. Its
precision is high, and good recall may be obtained depending
on the size of training data.
Whereas many areas for improvement were already identified along these experiments, some additional research directions seem to be worth exploring:
Conclusion and further work
The rationale behind the methods in this work is that terminologies contain a wealth of linguistic knowledge that can be
« reverse engineered » to a large extent. Synonymy relations
include explicit morphological variants for terms, purposely
inserted to facilitate « natural language » indexing and search.
They also contain more implicit morphological variants for
single words. Observing such morphological word variants
in the synonym terms that denote the same concept is a very
strong clue that these words belong to the same morphological family. We have seen, with experiments in three different
languages (of the romance, germanic and slavic families), that
this can lead to high precision collection of morphologically
related word pairs.
This initial collection is then used to bootstrap a larger acquisition process that identifies more morphologically related
word pairs across concepts. This increases recall, while keeping a very good precision. The application to large-size English data showed that recall can be fairly good too. Its more
precise dependence to training set size needs to be investigated.
Besides, the direct exploitation of an unstructured list of
word forms also proved to yield good quality segmentations
and affixes. Altogether, these methods can segment a large
part of the input word forms. Coupling them more tightly is a
subject for further work.
Several methods were examined to obtain more precise segmentation and rules. Setting a higher threshold (four instead
of three) on the minimal initial common substring reduces a
large part of the errors. A strategy for human validation could
then be to systematically review generated word pairs with a
common prefix substring of exactly three characters. A posteriori rule suffix fitting uses a larger language data sample
to obtain better-confidence word segmentation and affixes. If
part-of-speech-tagged input is available, rules can be made
syntactically more precise and block some incorrect word
pairs; however, polycategorial word forms result in a decrease
in recall. A special treatment might be able to reduce this
drawback. If, furthermore, lemmatized input can be obtained,
rules and affixes are « cleaned » from the influence of inflection. The interesting affixes, among which morphemes can be
found, are those of derivation and compounding, which are
We have been working up to now independently on several languages. It may be useful to take advantage
of the parallel nature of international terminologies to
collect multilingual lexical and morphological resources
[26,11]. In return, this may help obtain better monolingual morphological resources.
One of the obstacles to the identification of canonical
morphemes is the variation of affix strings, especially as
caused by morphotactic constraints. These are elegantly
dealt with by more elaborate linguistic models [18,27]
that may help obtain more adequate morphological rules.
Just like phrase-level syntax is constrained by lexical semantics, word-level morphology is constrained by morpheme semantics (e.g., [28]). Semantic clues may then
be another knowledge source to constrain morphological rule application. The semantic types available in the
UMLS or in multiaxial terminologies such as SNOMED
or MeSH might provide a useful source for these; their
appropriateness for morphemes instead of terms should
however be assessed.
We aim to refine this set of methods and to apply them to a
larger set of languages.
Acknowledgement
We wish to thank Dr Roger Côté for a precommercial version of the French SNOMED Microglossary for Pathology,
Yvan Emelin for a draft version of the Russian SNOMED
Microglossary for Pathology, INaLF for lending us a version
of Brill’s tagger trained for French, Fiametta Namer for the
FLEMM lemmatizer, and the NLM for the lvg tool, Specialist Lexicon and UMLS resources.
11
References
[14] Xu J and Croft BW. Corpus-based stemming using cooccurrence of word variants. ACM Transactions on Information Systems 1998;16(1):61–81.
[1] Pacak MG, Norton LM, and Dunham GS. Morphosemantic analysis of -ITIS forms in medical language.
Methods Inf Med 1980;19:99–105.
[15] Jacquemin C. Guessing morphology from terms and corpora. In: Actes, 20th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR’97), Philadelphia, PA. 1997:156–
67.
[2] Wingert F, Rothwell D, and Côté RA. Automated indexing into SNOMED and ICD. In: Scherrer JR,
Côté RA, and Mandil SH, eds, Computerised Natural
Medical Language Processing for Knowledge Engineering. North-Holland, Amsterdam, 1989:201–39.
[16] Déjean H. Morphemes as necessary concept for structures discovery from untagged corpora. In: Workshop on
Paradigms and Grounding in Natural Language Learning, Adelaide. 1998:295–9.
[3] Wolff S. Automatic coding of medical vocabulary. In:
Sager N, Friedman C, and Lyman MS, eds, Medical Language Processing. Computer Management of Narrative
Data. Addison-Wesley, New-York, 1986:145–62.
[17] van den Bosch A, Daelemans W, and Weijters T.
Morphological analysis as classification: an inductivelearning approach. In: Tsujii JI, ed, Proc 16 th COLING,
Copenhagen, Denmark. 5–9 August 1996.
[4] Lovis C, Baud R, Rassinoux AM, Michel PA, and Scherrer JR. Medical dictionaries for patient encoding systems: a methodology. Artif Intell Med 1998;14:201–14.
[5] Silberztein M. Dictionnaires électroniques et analyse
automatique de textes : le système INTEX. Masson,
Paris, 1993.
[18] Koskenniemi K. Two-level morphology: a general computational model for word-form recognition and production. PhD thesis, University of Helsinki Department of
General Linguistics, Helsinki, 1983.
[6] Toussaint Y, Namer F, Daille B, et al. Une approche linguistique et statistique pour l’analyse de l’information en
corpus. In: Zweigenbaum P, ed, Actes de TALN 1998,
Paris. June 1998.
[19] Theron P and Cloete I. Automatic acquisition of twolevel morphological rules. In: ANLP97, Washington,
DC. 1997:103–10.
[7] Burnage G. CELEX - A Guide for Users. Nijmegen:
Centre for Lexical Information, University of Nijmegen,
1990.
[20] Côté RA.
Répertoire d’anatomopathologie de la
SNOMED internationale, v3.4. Université de Sherbrooke, Sherbrooke, Québec, 1996.
[8] National Library of Medicine.
Sources Manual, 1999.
[21] Organisation mondiale de la Santé, Genève. Classification statistique internationale des maladies et des problèmes de santé connexes — Dixième révision, 1993.
UMLS Knowledge
[9] McCray AT, Srinivasan S, and Browne AC. Lexical
methods for managing variation in biomedical terminologies. In: Proc Eighteenth Annu Symp Comput Appl
Med Care, Washington. Mc Graw Hill, 1994:235–9.
[22] Spyns P. A robust category guesser for Dutch medical language. In: Proceedings of ANLP 94 (ACL),
1994:150–5.
[10] Dal G, Namer F, and Hathout N. Construire un lexique dérivationnel : théorie et réalisations. In: TALN99,
1999.
[23] Brill E. Transformation-based error-driven learning
and natural language processing: A case study in
part-of-speech tagging.
Computational Linguistics
1995;21(4):543–65.
[11] Schulz S, Romacker M, Franz P, et al. Towards a multilingual morpheme thesaurus for medical free-text retrieval. In: Proceedings of MIE’99, Ljubliana, Slovenia.
IOS Press, 1999.
[24] Lecomte J. Le catégoriseur Brill14-JL5 / WinBrill0.3.
Technical report, INaLF, December 1998.
Available at http://jupiter.inalf.cnrs.fr/
WinBrill/winbrill.etiquetage.html.
[12] Grabar N and Zweigenbaum P. Acquisition automatique de connaissances morphologiques sur le vocabulaire médical. In: Amsili P, ed, Actes de TALN 1999,
Cargèse. July 1999:175–84.
[25] Krovetz R. Viewing morphology as an inference process. In: Proceedings of the 16th Annual International
ACM-SIGIR Conference on Research and Development
in Information Retrieval, 1993:191–202.
[13] Grabar N and Zweigenbaum P. Language-independent
automatic acquisition of morphological knowledge
from synonym pairs.
J Am Med Inform Assoc
1999;6(suppl):77–81.
[26] Baud RH, Lovis C, Rassinoux AM, Michel PA, and
Scherrer JR. Extracting linguistic knowledge from an
12
international classification. In: Pappas C, Maglaveras N,
and Scherrer JR, eds, Proceedings of MIE’97, Thessaloniki, Grece. IOS Press, 1997.
[27] Antworth EL. PC-KIMMO: a two-level processor for
morphological analysis. Number 16 in Occasional Publications in Academic Computing. Summer Institute of
Linguistics, Dallas, TX, 1990.
[28] Dujols P, Aubas P, Baylon C, and Grémy F. Morphosemantic analysis and translation of medical compound
terms. Methods Inf Med 1991;30:30–5.
13

Download Report

A Contribution of Medical Terminology to

Paperzz.com

Your Paperzz