Modeling Just the Important and Relevant Concepts in Medicine
for Medical Language Understanding: A Survey of the Issues
Anne-Marie Rassinoux 1, Randolph A. Miller 1, Robert H. Baud 2, Jean-Raoul Scherrer 2
1 Division of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
2 Medical Informatics Division, University Hospital of Geneva, Switzerland
Over the past two decades, two challenging, ongoing domains of
medical informatics research have been the construction of
models for medical concept representation and the inter-related
task of understanding the deep meaning of medical free texts.
These domains can take advantage of each other by exploiting
the rich semantic content embedded in both concept models and
medical texts.
This review highlights how these two inter-related domains have
evolved by focusing on a number of significant works in this
area. The discussion examines one particular aspect: how to
employ medical modeling for the purpose of medical language
understanding. The understanding process analyzes and extracts the content of medical free texts, and stores the information in a deep semantic representation, useful for future, more elaborate semantic-driven information retrieval. It is now recognized in
the medical informatics community that such understanding
processes can be augmented through use of a domain-specific
knowledge base that describes what can actually occur in a
given domain. For this, a well-balanced representation schema should be developed, lying somewhere between a partial but accurate and a complete but complex semantic representation.
These observations are illustrated using examples from two
major independent efforts undertaken by the authors: the
elaboration and the subsequent adjustment of the RECIT
multilingual analyzer to a solid model of medical concepts, and
the recasting of a frame-based interlingua system, originally
developed to map equivalent concepts between controlled clinical
vocabularies.
Introduction
Building models for medical concept representation and
understanding the deep meaning of free medical texts represent
ongoing challenges for research in the medical informatics
community [1]. It is no accident that the related areas of
Natural Language Processing (NLP) and Knowledge
Representation (KR) have been “hot topics” during the previous
and current international working conferences of the International
Medical Informatics Association Working Group 6 (IMIA WG6)
[2, 3].
The two disciplines can now realize the potential of a combined
approach. This convergence is all the more relevant for the
medical domain. First, a large amount of useful clinical
information is still embedded in natural language free texts.
Second, several different medical nomenclatures and thesauri have
already emerged, fostered by the need to standardize the
parameters of clinical practice and to organize the literature.
Several experiments have been reported, showing that advanced
NLP tools can help in refining such large medical vocabularies
[4, 5, 6, 7, 8, 9]. But, even if these methods greatly assist the expert modeler, an important gap still remains between this huge corpus of linguistic knowledge and the actual construction of a model of the domain.
Existing models for medical concept representation present a
significant amount of relevant information upon which natural
language processors can be grounded semantically [10, 11, 12].
The authors recently published a review of potential existing
medical knowledge sources that are candidates for exploitation
within NLP tools [13]. The existence of a list of multiple useful sources reminds us that no standard medical terminology or common representation for medical information has yet emerged across clinical institutions. Nevertheless, several important joint efforts have been undertaken to find solutions to well-known drawbacks, such as the great degree of variability, redundancy, and inconsistency emerging from all these sources [14]. Exemplary
efforts include the Unified Medical Language System (UMLS) of
the National Library of Medicine, as well as the initiative of the
Canon Group. The UMLS is a vast, long-term project which aims at collecting and integrating electronic biomedical information from a variety of other controlled vocabularies [15,
16]. At a more formal level, the Canon Group’s effort aims at
building a merged representational model for clinical radiology,
representing a consensus among Canon Group members, for use
in exchanging data and applications [17].
This paper addresses the benefits of linking the semantic
components of a medical text analyzer to a solid model of
medical concepts. In order to expound upon the main advantages
of such concept models for use by NLP tools, a brief review of
existing medical knowledge sources, their evolution and
interaction, as well as a view of the different modeling levels are
given. During this process, it is important to clarify the basic
units which are manipulated at each stage and which support the
bridge from free texts to a deep semantic representation of their
embedded meaning. Significant current efforts in this area are also
considered and compared, in order to gain a greater insight
toward modeling just the important and relevant concepts in
medicine for the purpose of medical language understanding.
From Texts to Models: Organizing the Medical
Knowledge
This paper considers how to use models for medical concept
representation to perform medical language processing.
Nowadays, access to some kind of “semantic model” or “domain
knowledge base”, specified through a formal framework that
describes what makes sense in a given domain, is a requisite for
succeeding with natural language processing. In particular,
Wehrli et al. [18] mention that “The connection between
language and the ‘real world’, which is what a real semantic
analysis should perform, is likely to remain out of reach as long
as we do not know more precisely how to give a computational
representation of the ‘real world’.” From a modeler’s viewpoint,
Rector argues [19] that “The automatic processing and analysis
of medical texts... is dependent upon concept systems to perform
the analysis and represent the structured information... Work on
the processing of medical texts is showing that the analysis of a
text depends primarily on a most important element: a model of
medical concepts.” But, what is the current status in organizing
the medical knowledge?
Different sources of medical information currently exist and are increasingly available in machine-readable form. However, the way such medical information is expressed can vary greatly, from unstructured formats to well-structured systems of representation.
To be used effectively (by a computer as well as by human
beings), information must be rapidly accessible, and therefore
organized in an efficient way.
Medical Texts: A Natural and Rich Source of Medical
Knowledge
Large corpora, such as medical reports written by physicians in
their daily practice, or textbooks containing a huge corpus of
descriptive medical narratives, represent a preeminent source of
medical information expressed through natural language. Even if a textbook eliminates specific medical jargon and local “oddities” (e.g., nonstandard abbreviations) as encountered in medical reports, the large expressiveness of language (which permits ambiguity and vagueness) significantly hampers its use by
computerized medical applications which require error-free and
clinically pertinent input.
Controlled Medical Vocabularies: A Precise and Technical Source of Medical Knowledge
A large range of controlled medical terminologies or vocabularies is now available [20, 21, 22, 23, 24, 25, 26]. All differ from one another based on the specific domains and scopes for which they have been built [14]. For example, MeSH [22] has evolved at the National Library of Medicine (NLM) as a controlled keyword vocabulary for indexing biomedical literature. Likewise, the QMR vocabulary [26] (which is a superset of the original INTERNIST-I vocabulary) was created to describe possible (reported) patient findings in diseases in general internal medicine. This vocabulary was derived from extensive manual literature review and serves the purpose of providing input for the QMR diagnostic program [27].
Unfortunately, a tradeoff must be made by the developers of controlled medical vocabularies (CMVs) that influences the ultimate utility of each CMV for NLP. The two counterbalancing features of a CMV are its breadth of scope (ability to easily incorporate a large number of entries from diverse topics in biomedicine) and its depth of representation (ability to represent concepts in a computationally meaningful manner). The force that drives the tradeoff is the amount of work that is humanly possible on a given project. Deeper representations are significantly more time consuming to build, probably by at least one order of magnitude. As a result, most of the large CMVs, even though specified through precise and technical expressions called terms (generally used to name codes), are expressed through language surface forms, which present well-known drawbacks. In particular, redundancy and inconsistency occur because of a lack of formal definitions suitable for automatic manipulation. There is no algorithmic mechanism in these systems to precisely define what a term is and express how it differs from others. Moreover, as these terms are in most cases noun phrases, their underlying interpretation is both language-dependent and context-dependent. First, these terms are understandable only according to the syntax of the specific language used to express them. Second, interpretation of a term is often defined through its position in a hierarchy (the implicit link between a parent term and its child). Furthermore, the same term may appear in different positions in the hierarchy. Exploiting such large terminologies by computer, where terms take their full meaning only from their position inside the whole system, requires developers to make explicit, at least, all the various components of information currently embedded in such partitions.
Concept Models: A Structured and Tractable Source of
Medical Knowledge
Due to the analytic complexities engendered by the above
surface-form controlled medical vocabularies, computerized
vocabulary mapping has become an important active area of
research in medical informatics. Methods employed have ranged
from lexical matching to conceptual matching.
Lexical Matching Methods
Lexical matching methods identify similarities among CMVs
by retrieving common “strings” (words or phrases, extended
through the notion of synonyms and related terms) in both the
source and the target vocabularies. This approach has been favored because of the availability of the UMLS knowledge sources [28], which integrate nearly 30 authoritative medical terminologies [29, 30, 31, 32]. Managing the large variability
inherent in natural language, and still present in biomedical
terminologies, has also required a fourth component for the
UMLS knowledge sources: the SPECIALIST lexicon [33].
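To make the string-based flavor of this approach concrete, the following minimal sketch pairs source and target codes whenever they share a normalized string; the codes, term strings, and synonym sets are invented for illustration and are not taken from the UMLS or from any cited system.

```python
# Minimal sketch of lexical matching between two controlled vocabularies.
# All codes and strings below are illustrative, not real vocabulary entries.

def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so that surface variants compare equal."""
    return " ".join(term.lower().split())

source_vocabulary = {
    "T001": {"pleural effusion", "effusion, pleural"},
    "T002": {"hepatomegaly", "enlarged liver"},
}
target_vocabulary = {
    "X-17": {"Pleural Effusion"},
    "X-42": {"Liver enlargement", "enlarged liver"},
}

def lexical_matches(source, target):
    """Pair source and target codes that share at least one normalized string."""
    matches = []
    for s_code, s_terms in source.items():
        s_norm = {normalize(t) for t in s_terms}
        for t_code, t_terms in target.items():
            if s_norm & {normalize(t) for t in t_terms}:
                matches.append((s_code, t_code))
    return matches

print(lexical_matches(source_vocabulary, target_vocabulary))
# [('T001', 'X-17'), ('T002', 'X-42')]
```

Such purely lexical pairing is cheap, but it inherits every variability problem of the surface strings themselves, which is precisely what motivates the conceptual methods described next.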
Conceptual Matching Methods
Conceptual matching methods attempt to map different terms,
not at the level of words and phrases, but at the level of
“meaning”, implying a deeper representation of the intricate
concepts of medicine. Such representation can be directly built
over existing CMVs [34, 35]. But, at this level, it is more
crucial to develop formal systems which make a clear distinction
between the concepts to be represented and the linguistic terms
(or other mechanisms) used to refer to those concepts. That is
why formal systems for representing the concepts underlying
medical terminology have emerged [36, 37, 38, 39, 40, 41, 42,
43]. Among these, we describe three important efforts, as they will be discussed in more detail in this paper.
• In 1981-83 [38], and again in 1988-91, Miller, Masarie et al. developed and refined a frame-based interlingua [39], initially to capture the complexity of clinical findings in
INTERNIST-I, and later to facilitate the translation
between CMVs. This system, supported in part by the
UMLS project, was based on the assumption that clinically
relevant statements about patients contain at least one
identifiable central concept, and this set of central concepts
can serve as focus for mapping between medical
vocabularies. Generic finding frames were used to specify
how a central concept may be expressed and also be
qualified by general modifiers. Each generic frame has a
superstructure (including its concept name, its status
descriptor, its potential site descriptor, its potential
subcategory descriptor, its potential qualifiers, as well as
“the methods of elicitation” descriptor), and finer details
which are encapsulated in the form of “item lists” (also
called qualifiers). Thus, the generic frames provide a template for describing the medical meaning of specific medical terms in a standardized manner (a schematic sketch of such a frame is given after this list). Over 750 generic
frames were created for describing the medical meaning of a
test set of 1,500 medical terms for general internal medicine
identified from the Quick Medical Reference (QMR) lexicon
[27], as well as portions of the HELP PTXT lexicon [24],
and parts of the DXplain lexicon [25].
• Cimino et al. have constructed the Medical Entities
Dictionary (MED) [40], a hybrid of terminology and
knowledge, using a semantic network based on the Unified
Medical Language System (UMLS) [28, 44], with a
directed acyclic graph which defines a multiple inheritance
hierarchy. Each concept node in the MED graph can be
viewed as a frame, and has links to nodes other than parent-child nodes through the semantic relationships. Every
concept in the MED is a generic concept and as such
should be regarded as a type or class. In order to support
NLP and the ability to map one controlled vocabulary to
another, a compositional modeling of lexically complex
concepts is also maintained in MED. This system, which
is beginning to reach critical mass (it currently contains
over 34,000 conceptual components), forms the heart of the
medical representation in the Clinical Information System
(CIS) of the Columbia-Presbyterian Medical Center
(CPMC). Clinical applications retrieve patient data using
MED concepts.
• Since 1992, Rector et al. have been developing, through
the GALEN consortium, a fully compositional and
generative system of medical concepts [41, 42]. These
concepts are represented in the Common Reference, or
CORE model, and are expressed in a language-independent
manner through the GALEN Representation and Integration
Language (GRAIL) Kernel. One important feature of this
model is that it attempts to restrict entries to valid
combinations of concepts that form medically sensible
expressions. In this regard, it is similar to the generic frame
system [39], which directly specifies how findings can be
constructed from concept definitions, and limits modifiers
to those that both make sense and are not self-contradictory. An advantage of the notation used by the
GRAIL Kernel is that it can be converted directly to that of
conceptual graphs [45] and the set of criteria associated
with a concept can be seen as a frame-like structure. The
current version, which contains nearly 6,000 concepts,
must nevertheless be extended in order to be useful in
general clinical applications.
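As an illustration of the frame superstructure described in the first bullet above, the following sketch shows how a generic finding frame might be laid out as a data structure; the slot values are invented and do not reproduce any actual frame from the QMR, HELP, or DXplain work.

```python
# Illustrative sketch of a generic finding frame with the superstructure
# described above (concept name, status, site, subcategory, qualifiers,
# methods of elicitation). All values are invented for illustration.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GenericFrame:
    concept_name: str
    status: str                          # e.g. "present", "absent", "history of"
    sites: List[str] = field(default_factory=list)
    subcategories: List[str] = field(default_factory=list)
    qualifiers: Dict[str, List[str]] = field(default_factory=dict)  # "item lists"
    methods_of_elicitation: List[str] = field(default_factory=list)

pain_frame = GenericFrame(
    concept_name="Pain",
    status="present",
    sites=["abdomen", "back"],
    subcategories=["colicky", "burning"],
    qualifiers={"Severity": ["mild", "moderate", "severe"],
                "Chronicity": ["acute", "chronic"]},
    methods_of_elicitation=["history taking", "physical examination"],
)

# A specific term such as "severe chronic abdominal pain" can then be described
# by selecting one value along each relevant axis of the generic frame.
```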
A Combined Solution
Examining the way these dual approaches (lexical and conceptual) manage medical information suggests their combination as an ideal solution. This conclusion is reinforced
by the observation that medical language presents typical
characteristics of a sublanguage (specific types of texts or reports
have a writing style often designated as medical jargon), which
implies that under certain circumstances, the meaning of natural
language sentences is closely connected to the contextual medical
domain [46, 47]. Thus, NLP tools coupled with concept models
should succeed in managing medical information from texts. The
afore-mentioned models have served as examples already:
• The frame-based interlingua system developed by Miller,
Masarie et al. [38, 39] was successfully used to map among
the “pseudo” natural language embedded in QMR [27],
HELP [24], and DXplain terms [25].
• The MedLEE text processor (an acronym for Medical
Language Extraction and Encoding System) developed by
Friedman et al. [11, 48] maps chest x-ray and
mammography reports into unique medical concepts
defined in the MED [40, 49].
This analyzer provides three phases of processing, all of which are driven by different knowledge sources (a toy sketch of these three phases is given after this list). The first
phase, parsing, identifies the structure of the text through
use of a grammar that defines semantic patterns and a target
form. The second phase, regularization, standardizes the
terms in the initial target structure via a compositional
mapping of multi-word phrases. Finally, the third phase,
encoding, maps the terms into the controlled-vocabulary
concepts that are maintained in the MED knowledge base,
thus ensuring that the data could be used by subsequent
automated applications. The formalism of conceptual
graphs is used to represent these concepts [50].
• Likewise, the RECIT multilingual analyzer (a French
acronym for “Représentation du Contenu Informationnel
des Textes médicaux”) developed by Rassinoux et al. [51,
52] at the Geneva University Hospital, improved its
semantic validating and its inference capabilities by
grounding its semantic components directly in the GALEN
model [41, 12].
This multilingual analyzer (operational for French, English
and, to a minor extent, German) uses a two-phase process
to deal with the specific features of medical language. The
first phase, called “Proximity Processing”, is a
deterministic phase which combines the application of non-conventional syntactic procedures with the checking of the
semantic compatibilities in order to group neighboring
words together. From this set of relevant fragments, the
second phase deals with the building of a sound
representation of the sentence meaning into the formalism
of conceptual graphs [50]. Conceptual schemata are used to
select the heading concept and to establish the links
between it and the other concepts in the analyzed sentence.
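The three processing phases described above for MedLEE can be pictured with the following toy pipeline; the grammar, phrase table, concept code, and negation handling are simplistic stand-ins, not MedLEE’s actual knowledge sources.

```python
# Schematic three-phase pipeline (parse, regularize, encode) mirroring the
# processing stages described above. All resources are toy stand-ins.
import re

PHRASE_TABLE = {"pleural effusion": "pleural effusion",
                "effusion in the pleural cavity": "pleural effusion"}
CONCEPT_CODES = {"pleural effusion": "MED:0421"}   # hypothetical identifier

def parse(sentence: str) -> dict:
    """Phase 1: map the sentence onto a crude target form (finding + certainty)."""
    finding = sentence.lower().rstrip(".")
    certainty = "negated" if finding.startswith("no ") else "asserted"
    return {"finding": re.sub(r"^no ", "", finding), "certainty": certainty}

def regularize(structure: dict) -> dict:
    """Phase 2: standardize multi-word phrases to a canonical compositional form."""
    structure["finding"] = PHRASE_TABLE.get(structure["finding"], structure["finding"])
    return structure

def encode(structure: dict) -> dict:
    """Phase 3: replace the canonical phrase by a controlled-vocabulary concept."""
    structure["concept"] = CONCEPT_CODES.get(structure["finding"])
    return structure

print(encode(regularize(parse("No effusion in the pleural cavity."))))
# {'finding': 'pleural effusion', 'certainty': 'negated', 'concept': 'MED:0421'}
```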
From our experience, it appears that the success of this combined approach relates mostly to the specification of a well-balanced semantic representation. In order to be useful for NLP, a
model must at least provide the relevant concepts and their
relationships that naturally occur in medical texts. This
“coverage” criterion can be extended toward existing
nomenclatures and vocabularies in order to bridge with what is
currently available.
Medical Modeling: Limiting Large Domains of Expertise
Due to the large volume and diversity of medical knowledge,
the conceptual modeling task is naturally difficult and labor-intensive, but worthwhile to ensure an efficient use of such
potential sources (expressed either through natural language or
controlled vocabularies) by computerized applications. At the
same time, the computational tractability of a knowledge base
(i.e. being suitable for manipulations by a computer program)
requires restrictions in the kinds of knowledge to be represented,
as well as the degree of details. Both must have a manageable
form and size. Finally, a requisite for sharing models and systems
across institutions is a formal structure - sufficiently expressive to
represent complex knowledge, yet simple enough for semi-automated manipulation, and adequate for the domain needs. These considerations point to the need to develop domain models that answer to a well-defined goal, in order to yield concrete outcomes.
The problem again relates to the tradeoff between depth of
representation and breadth of coverage. Because concept models
are far more labor-intensive to build, it would be a serious
mistake for a project with finite resources to attempt to build a
general concept model for all of biomedicine, because it would be
difficult to achieve closure. As a result, many existing concept
models were constructed for a specific purpose that further defined
and constrained their scopes. For example, the frame-based
interlingua system [38, 39] was limited to representing concepts
from the QMR lexicon of 4500 possible patient findings in
internal medicine. The MED [40] was limited to actual
laboratory procedures (and their related findings) at one
institution, Columbia-Presbyterian Medical Center. Any open-ended attempt to represent, for example, the scope of concepts
found in MeSH (basic biomedical science and clinical practice as
described in the literature) in a “deep” model, would probably
never come to closure - due to lack of focus, lack of qualified
experts to represent concepts in a consistent manner, and lack of
financial resources.
Modeling for NLP needs
The representational needs of NLP are different from, but overlap with, the needs of medical vocabulary system builders. Let us start with a concrete example: what general kinds of knowledge are useful when we start speaking about the finding pleural effusion? For the modeling
task, this question can be reformulated into: what kinds of
knowledge, describing our understanding of the concept
PleuralEffusion, must be represented in a model? First, a
definition of the literal meaning of a pleural effusion can be given
as being “an effusion located in the pleural cavity”. In the same
way, an effusion can be defined as “an accumulation of pathologic
fluid”. These descriptions highlight the compositionality of this
concept which is built from a number of different more elementary
concepts. Therefore, this concept might be classified as being a
kind of effusion. Medical domain and NLP modelers might also
include descriptive features, such as the size (small, medium or
large), the gross appearance (purulent, serous, ...) or the laterality
(left, right) of an effusion, as well as the list of methods useful to
elicit such a finding (such as a chest plain film or a chest
percussion). Finally, medical domain models may incorporate
specific inferential knowledge, such as the relationships between
findings and disease states (e.g., blunted costophrenic angle is a
radiological sign which provides non-specific evidence for a small
pleural effusion; clinical manifestations of large pleural effusions include atelectasis, egophony, and dullness to percussion; and possible diseases causing a pleural effusion include congestive heart failure, nephrotic syndrome, pulmonary infarction, and rheumatoid arthritis). When such information is
part of the model, specific reasoning processes can be triggered
(e.g., for purposes of recommending diagnoses or therapies).
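Gathering the kinds of knowledge just enumerated into a single structure might look like the following sketch; the slot names are illustrative and do not come from any of the cited models.

```python
# Sketch collecting, in one structure, the kinds of knowledge listed above for
# PleuralEffusion: a compositional definition, descriptive features, methods of
# elicitation, and inferential links to findings and diseases.
pleural_effusion = {
    "definition": {"kind_of": "Effusion",                 # "an effusion ..."
                   "hasLocation": "PleuralCavity"},       # "... located in the pleural cavity"
    "descriptive_features": {
        "size": ["small", "medium", "large"],
        "gross_appearance": ["purulent", "serous"],
        "laterality": ["left", "right"],
    },
    "methods_of_elicitation": ["chest plain film", "chest percussion"],
    "inferential_knowledge": {
        "suggested_by": ["blunted costophrenic angle"],
        "manifestations_when_large": ["atelectasis", "egophony",
                                      "dullness to percussion"],
        "possible_causes": ["congestive heart failure", "nephrotic syndrome",
                            "pulmonary infarction", "rheumatoid arthritis"],
    },
}
```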
[Figure 1 - The Sphere of Medical Modeling. The diagram relates the pragmatic level (patient documents such as free texts and notes, textbooks), the meta level (concept or terminology models), the intermediate NLP level (instantiated information), and the domain knowledge level (inferential information), connected by the links A through E discussed in the text.]
A global model of medical vocabulary usage and processing is
presented in Figure 1. Two world views are combined: at one
end, the “pragmatic level”, formed by the set of all meaningful
medical utterances ever made by qualified domain experts and
practitioners, and at the other end, the “meta level” - a general
method for representing the concepts of a domain in a “deep
model”. The pragmatic level represents the textual information
available in the medical domain. The meta level represents
structures, rules, and constraints that permit modelers to
construct potential utterances in the domain. The meta level
involves capturing the concepts of medicine and formalizing them
into a concept or terminology model (link A in Figure 1). This
level, by organizing the concepts and specifying the relevant
relationships that occur between them, provides the semantics of
what is medically reasonable to say in a particular domain (such
as, a pleural effusion can be seen on radiograph or diagnosed by
percussion). Ideally, the pragmatic level and the meta level would
overlap completely. However, it is not possible to construct
meta-level representational systems that are sufficiently
constrained to only permit those sensible utterances that are
observed at the pragmatic level. Similarly, the rules of the
English language allow construction of grammatically correct, yet
nonsensical utterances - a feature exploited by Lewis Carroll in
the poem, “Jabberwocky” [53]. Most meta-level models, even
though they substantially limit the scope of allowable utterances
and force them to be related to meaningful concepts, are
sufficiently underconstrained to allow construction of
jabberwocky-like expressions.
In between the two extremes of pragmatic and meta is the
intermediate level of “instantiated concepts”. Instantiated
concepts are the specific models derived from the meta level to
capture the actual content expressed at the pragmatic level. In fact,
it is the goal of NLP tools to perform exactly this conversion
(link B in Figure 1). The meta level can be directly exploited by
NLP tools for checking the semantics of any natural language
utterances against this set of sensible combinations of concepts
(link C in Figure 1). For example, retrieving the finding pleural
effusion from free texts requires the ability to cope with the
variety of ways this finding can be expressed in natural language.
In English, this finding can be formulated through the following
expressions: pleural effusion, presence of serous fluid in the
pleural cavity, hydrothorax, chylothorax, pleurisy with
effusion... (the corresponding terms in French being,
épanchement pleural, présence de liquide séreux dans la cavité
pleurale...).
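A minimal sketch of this conversion (link B), with an invented annotation table, shows how several surface expressions, in more than one language, can be recognized as instances of a single modeled concept.

```python
# Minimal sketch of mapping pragmatic-level utterances to an instantiated
# concept: several surface expressions are recognized as instances of one
# modeled concept. The annotation table is illustrative only.
ANNOTATIONS = {
    "cl_PleuralEffusion": [
        "pleural effusion",
        "presence of serous fluid in the pleural cavity",
        "hydrothorax",
        "pleurisy with effusion",
        "épanchement pleural",                      # French annotation
    ],
}

def instantiate(text: str):
    """Return the concepts whose annotations occur in the input text."""
    found = []
    lowered = text.lower()
    for concept, phrases in ANNOTATIONS.items():
        if any(p in lowered for p in phrases):
            found.append(concept)
    return found

print(instantiate("Chest film shows a small pleural effusion on the right."))
# ['cl_PleuralEffusion']
```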
Conceptual Modeling: Is There a Best Approach?
An important set of relationships has not been previously well-exploited in medical informatics. Figure 1 illustrates these relationships: the ability to use the intermediate (instantiated) level as the basic representational schema for a medical decision support system’s knowledge base (link D), and the ability to add pragmatic constraints to the meta level through use of a medical knowledge base constructed in the instantiated format (link E). An additional level is thus highlighted - “the domain knowledge level” - which requires more sophisticated mechanisms, such as the introduction of probabilistic relationships or temporal progressions to reason about complex cases. This information, embedding domain expertise and reasoning capabilities, corresponds basically to what makes sense clinically. Due to its inherent complexity, the domain knowledge level is not yet fully formalized and thus is not directly available for use by other applications such as NLP systems. This is also a reason why, in a traditional linguistic analysis including the following steps: morphologic, lexical, syntactic, semantic, pragmatic and reasoning [54], the last reasoning step is almost non-existent in the current NLP tools, as it needs a description which goes beyond a “pure” conceptual description as defined in the meta level. In particular, dealing with the implicitness and vagueness of natural language requires an “assertional” or “inferential” component that suggests the unstated concepts in order to achieve a full semantic representation of the implicit missing content of texts. The availability of the QMR knowledge base and lexicon to the authors [27] should allow future development of “intelligent” NLP tools that take advantage of this level. However, efforts to build domain-specific medical knowledge base systems, such as MYCIN [55], ILIAD [56], QMR [27] or DXplain [25], are usually independent of NLP projects.
The rest of this paper concentrates on models for medical
concept representation and their potential use by NLP tools. The
frame-based interlingua [39], the MED [40], and the GALEN
model [41] are examples of concept models which provide meta-level definitions, partially constraining how meaningful utterances can be constructed in a domain. The way such models are built can greatly influence their size and degree of granularity, which are all the more important to ensure a large coverage and a detailed representation in the analysis of medical texts.
There is no general methodology for developing a concept
model. The quality of the modeling process is, on the one hand,
greatly dependent on the depth of understanding of the concepts
underlying the domain, and is thus best performed by a specialist
in the covered domain. On the other hand, building a concept
model is a formal process which requires skills in abstracting and
structuring, and is thus best performed by experienced analysts.
This last point ensures the tractability of such a framework, as the
underlying logical process must be sufficiently practical and
pragmatic in order that specific operations (such as general
inferences) can be performed by computer applications. The
specification of a computationally tractable formalism also
implies some effort and precision at the level of identifying the
primitive and indivisible medical concepts and of distinguishing
the relationships used to link these concepts. Several formalisms,
suitable for a conceptual representation, have emerged from
Artificial Intelligence, such as semantic networks [57], frame
systems [58], conceptual graphs [50], or logical-statement
languages [59]. These formalisms have proved to be reasonably
equivalent.
Determining the extent of conceptual modeling for NLP can be
considered as an iterative process, involving a combination of
both “top-down” and “bottom-up” approaches [60].
The top-down approach is taken to set up the general
characteristics of the domain and to organize them into a
hierarchically-structured view. This facilitates navigation through
the representation scheme, as well as assisting with the
consistency and maintenance of the parts of the model. The
highest levels of the GALEN schema [61] provide a typical example of
a top-down approach, which presents interesting features for its
exploitation by NLP tools. The initial division at the highest
level structure occurs between the “DomainCategory” and the
“DomainAttribute”, which correspond respectively to the notions
of concepts and relationships. Another important division takes
place under “DomainCategory”, between the following
categories: structures, substances and processes, and the category
of modifiers. These subdivisions have had an important impact
on the adjustment of the RECIT analyzer to the GALEN model
[12]. The fact that medical language is, in essence, highly
compositional and logically structured, entails that its semantics
is intuitively based on the definition of relationships between
each pair of sensible concepts. A clear separation between the
notions of concepts and relationships is nevertheless important
even if they can be considered to some extent as interchangeable
[62]. As the concept form is usually more convenient for mapping
from natural language, it is advantageous to express most of the
semantic features through concepts and to keep the relationships
as simple as possible. Basically, “content words” (i.e. words
conveying a strong semantics) should map to concepts and
“function words”, such as prepositions, conjunctions or modal
auxiliaries (like “can” or “must” in English) should map to
relationships. Moreover, adding some specific descriptors (as a
type or time modifier) is always possible for a concept, whereas
relationships cannot be further specified. Another important
distinction occurs between the meaningful concepts of medicine
(e.g., “Abdominal Pain”) and the modifiers (or qualifiers) which
can characterize these concepts. Modifiers generally have a broader
usage not restricted to the medical domain (e.g., “Severity”, “Chronicity”). The tradeoff between medical concepts and modifiers allows information to be weighed and fits the requirements of NLP systems quite well. Indeed, the aim of NLP systems is first, to extract relevant medical concepts from texts and then, to
complete the representation with specification of the different
properties attached to those concepts.
Such an a priori conceptual organization defined in the GALEN
model seems reasonable to describe most of the generic medical
concepts. Nevertheless, its limited scope lacks the precision to be
directly used in clinical applications. Current efforts of the
GALEN consortium are to extensively cover the subdomain of
surgical procedures at the level of their representation in a
classification like ICD9-CM (vol. 3).
Examining empirical data in order to refine and delimit the
scope of modeling is known as the bottom-up approach. This
approach aims at exploiting information currently handled within
the considered domain. This approach was chosen to build the
frame-based interlingua system [38, 39], which consisted of
collecting all the relevant axes and terms that clinicians might
use to describe any and all medical concepts embedded in QMR
terms [27]. This led to specification of two interconnected (semi-independent) levels of information: a rich and accurate set of
generic medical concepts described through generic frames, and a
large set of well-defined qualifiers that are applicable across a
number of generic concepts. The qualifier description incorporates
both a limited set of values as well as a measure of the distance
between these different possible values (e.g. the qualifier
“Severity” is defined as a progressive deviation with three allowable values: “mild”, “moderate”, and “severe”). This
bottom-up approach, by reviewing what is currently expressed in
the QMR terms, ensures the robustness of the representation as
the generic frames directly fit instances of concepts defined
through QMR terms. Nevertheless, used alone, this approach
cannot handle well all possible linguistic variations, as shown
below. Even if the frame structure used to represent the generic
medical concepts is convenient to express a first level of
description (through slots and fillers), allowing the initial
structure to be inverted according to some criteria, this
representation is nevertheless not easy to maintain. Indeed, the
frames per se do not specify any hierarchically-structured view of
the primitive concepts which are useful to describe more complex
medical information. Determining whether a generic frame or a
qualifier exists is difficult without knowledge of the entire contents
of the frame system. No restriction is mandated on the choice of a
given name to express a concept, so that redundancy and
inconsistency might appear. For example, names such as motion,
exercise, movement or moderate activity are used in the initial
system to designate concepts which influence generic frames such
as “Abdominal Pain”, “Back Pain”, or “Myalgia”. This
example highlights the extensive use of subtle words in a
specified language which may be easily confused with the
concepts that they name. In this example, all the specified terms
can be considered, at first glance, as synonyms or lexical variants
of the unique concept representing the notion of Movement.
However, there are subtleties that make their meaning slightly
different clinically, and these latter ones should be taken into
account by adding appropriate modifiers to this unique concept.
Moreover, this thorough and enumerative building method has
led to a series of frame descriptions which make extensive use of
linguistic names to specify generic medical concepts, such as
“Abdominal Aortic Aneurysm By Imaging”, as well as to designate the suitable qualifiers, such as “Type Of Aortic
Aneurysm”. The literal interpretation of the meaning of these
medical concepts is largely in their linguistic names rather than
in the model itself. A recasting of this original frame system has
been undertaken [63] to overcome these problems.
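The notion of a qualifier defined as a progressive deviation, as described above for “Severity”, can be sketched as an ordered value set with a distance between values; the numeric scale below is an assumption made only for illustration.

```python
# Sketch of a qualifier defined as a "progressive deviation": an ordered set
# of allowable values plus a distance between them, as described for Severity.
SEVERITY = ["mild", "moderate", "severe"]   # ordered allowable values

def severity_distance(a: str, b: str) -> int:
    """Number of steps separating two severity values along the scale."""
    return abs(SEVERITY.index(a) - SEVERITY.index(b))

print(severity_distance("mild", "severe"))    # 2
print(severity_distance("moderate", "mild"))  # 1
```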
The two above examples emphasize the advantages of each
approach - top-down and bottom-up - as well as their weaknesses
when applied alone. The top-down approach to conceptual
modeling is useful to set up the general architecture of the model
but applied alone, it lacks practical feedback to become usable by
clinical systems. Conversely, the bottom-up approach to
conceptual modeling ensures the construction of an accurate
model based on empirical data directly extracted from a given
domain, but applied alone, it is often confounded by too many
linguistic details. This suggests a combined approach, and both of the afore-mentioned examples have evolved toward one.
In particular, the high level structure of the GALEN Common
Reference Model has evolved iteratively through experiences
conducted with some coding systems [64] as well as practical
experiments in building clinical systems within the GALEN
project [65]. In the same way, the frame-based interlingua has
been recast [63], for integrating a more uniform and formal
description of the generic frames. First, the nature of the
conceptual information has been refined, through the distinction
between existential and quantitative frames, as well as the
specification of features essential to the description of a generic
frame (in particular, distinctions between general and local
qualifiers). Second, formal definitions of complex generic frames
have been introduced using the formalism of conceptual graphs
[50]. Finally, attempts to build a multi-level hierarchy upon this
frame system are under way.
The two above examples show the clear benefit of having both top-down and bottom-up approaches during the modeling
process. They also emphasize the distinction between the concept
model level and the precise language used to express these
conceptual components. This is an important feature toward the
specification of a Medical Linguistic Knowledge Base [66], and
thus entails clarifying the basic units sustaining concept models as well as those underlying the medical language.
From Words To Concepts: Identifying the Basic
Units
The development in parallel of several inter-connected
disciplines linked to the field of medical informatics has shown a
widespread use of notions such as “word”, “noun phrase”,
“sentence”, “term”, “code”, or “concept”. But depending on the
context of usage, the same notion covers different topics and this
interchangeable use unfortunately blurs the precise and technical
meaning assigned to these notions in a specific discipline. In
order to guarantee an accurate and correct communication between
experts and scientists in medical informatics, it is important to
preserve the distinctions as already emphasized by Tuttle et al.
[67].
Preserving the Distinctions while Highlighting the
Connections
Roughly speaking, after accepting Tuttle’s more formal
definitions, we can say that concepts are basic building blocks for
modeling; that sentences, noun phrases and words are basic
building blocks for natural language processing; and that codes
and terms are basic building blocks for classification. But the
reality is more complex since there is no one-to-one
correspondence between these different notions.
The Notion of “Concept”
A concept is a unit of thought which is the “pure fruit” of an
abstraction effort by human beings trying to represent the units of
meaning in a particular domain. In medicine, these mental
constructs strongly reflect the domain expert’s ability to extract
medical entities from clinical reality.
In order to refer to a particular concept, a unique identifier must
be defined. Different formats, such as icons, numbers, or words
can be used to specify this unique identifier. In practice, most
systems [40, 41, 63] choose a unique number as an internal
identifier. But, even if a unique number is a good means to refer to a specific concept in an unambiguous way, it is less expressive.
That is why these systems usually associate with the internal
identifier an external “knowledge-name” - also called “concept
name” - which is used (instead of the numeric internal identifier)
in the system interface to designate the concept. These concept
names, expressed through words, are for convenience the most common way to display to the user a readable and readily understandable system of concepts; as such, it is preferable to define, in a specific language, unique concept names with a well-accepted usage. Moreover, in order to distinguish the concept names from other words, one should normalize these names, for
example, by adding a specific prefix before each concept name, or
by starting all words, belonging to a group of words naming a
concept, with a capital letter and then concatenating them into
one block, or by naming the concepts in another language than
the native language of the system’s builder (in so far as this
language is also known by all potential users). All these methods
were combined in the RECIT analyzer [68] (i.e. “cl_” is used as
prefix to mark all the concepts: for example, cl_AbdominalPain
or cl_Heart). But, we can also think about concepts that have no
“common” name but for which, at least a clear definition exists.
Numerous examples can be found in the GALEN model [42]
where such concepts constitute an important part of the
compositional process. For example, the two abstract categories
defined by:
Culturing which actsOn ‘BloodSample’
and
Culturing which actsOn ‘UrineSample’
are mentioned in the GALEN model interface through the above
definitions and are both classified under the concept Culturing.
These unnamed composite concepts are useful in the GALEN
model for defining other composite concepts. For example, the
composite concept BloodCulture is defined as:
LaboratoryTest which hasSubprocess (Culturing which actsOn ‘BloodSample’).
These above examples also introduce the notion of
“compositionality”, which allows some compound concepts to
be decomposed into more primitive concepts. Three new notions
must then be considered:
• primitive (or basic) concepts: they are atomic semantic
entities, in the sense that they do not need to be further subdivided in order to reflect their literal
meaning (such as cl_Abdomen or cl_Excision).
• composite concepts: they can be expressed through
interconnected primitive concepts (for example, the literal
meaning of the concept cl_Cardiomegaly can be expressed
as an enlargement of the heart).
• relationships: they are set up to symbolize the links
amongst concepts, and thus act as “semantic glue” toward
the specification of a complete semantic representation.
Meaningful names (which can be prefixed by the characters “rel_”) are also chosen to correctly exhibit the underlying semantics of each relationship (e.g., rel_hasLocation, rel_IsAssociatedWith).
Therefore, the definition of a composite concept can always be
formalized through primitive concepts and their relationships
between each other.
Finally, an important feature which must be clearly
distinguished from the concept name is the concept annotation.
Indeed, in order to be able to extract concepts from textual
sources, it is important to annotate concepts with all the relevant
words (simple words or multi-word phrases) in a specific
language that are used to refer to this concept. For example, the
concept cl_Liver can be annotated in English by the words
“liver”, “hepatic”, or even the prefix “hepato-”. Notice that, if
the name given to a concept is unambiguous with a well-accepted
medical usage (which is the basic rule!), a first annotation can be
more or less automatically performed by looking at the concept
name.
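The distinctions drawn in this subsection (primitive concepts, composite concepts with a formal definition, relationships, and per-language annotations) can be summarized in a small sketch; the “cl_” and “rel_” prefixes follow the convention mentioned for the RECIT analyzer, while the individual entries and the “nephro-” prefix annotation are illustrative.

```python
# Compact sketch of primitive concepts, a composite concept with a formal
# definition, relationships, and per-language annotations. Entries illustrative.
CONCEPTS = {
    "cl_Kidney":      {"definition": None,                       # primitive
                       "annotations": {"en": ["kidney", "renal", "nephro-"]}},
    "cl_Excision":    {"definition": None,                       # primitive
                       "annotations": {"en": ["excision", "removal"]}},
    "cl_Nephrectomy": {"definition": [("cl_Excision", "rel_actsOn", "cl_Kidney")],
                       "annotations": {"en": ["nephrectomy"]}},  # composite
}

def is_primitive(concept_id: str) -> bool:
    """A concept is primitive when it carries no compositional definition."""
    return CONCEPTS[concept_id]["definition"] is None

print(is_primitive("cl_Kidney"))       # True
print(is_primitive("cl_Nephrectomy"))  # False
```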
The Notion of “Word”
Words, which are strings of characters without blanks,
constitute the basic units of natural language. They are both
useful to express particular objects or acts (such as scissors or
ablation) as well as entities with a more abstract meaning (such
as pain or severity). They are also used to compose more
complex syntactic structures such as noun phrases (i.e. groups of
words centered around a word belonging to the grammatical
category of a noun), or full sentences (which are constructed
around a verb). These structures take their full meaning by
considering the narrative context in which they occur (i.e. the
surrounding information that clarifies how these structures must
be interpreted). As highlighted before, words are also useful to
designate concepts and constitute the basic elements for the
concepts’ annotation process. Synonyms are features defined at
the word level [69]. By and large, the set of expressions that
annotate a concept can be considered as a set of synonyms
(reflecting an identity of meaning). But, for specific applications,
this notion can be extended to equivalent expressions (i.e.
expressions conveying the same main idea while not being
totally equal) [16].
The Notions of “Code” and “Term”
In between concepts and words are codes and terms. A code is a
unit of partition (i.e. a unit useful to define some classifications or
categorizations), generally expressed through a numeric
expression, which has no intrinsic meaning in itself but rather
encodes important contextual information through two
complementary mechanisms. First, the position of a code in the
partition (i.e. the classification in which it is found) gives
important information about its meaning. Second, codes are also
specified by terms (or any definition), used to label an element of
the partition.
Terms (also called vocabulary entries), whether naming
particular codes or not, are units of technical language intended
for reuse [67]. They represent typical phrases selected by domain
experts and are usually specified through a formal and scientific
(technical) language, the structure of which is mainly expressed
through noun phrases. Moreover, precise definition of terms
through concepts is a useful way of clarifying their meaning [34,
35].
Common Confusions Among Concepts, Terms, and Words
These above descriptions emphasize the strong connection
existing between these units, whose casual usage often results in
confusion. A common confusion occurs between the
notions of “concept” and “term”. In this matter, we can use the
example of the UMLS source, where some misunderstandings can
occur. The concepts defined in UMLS are only described through
a unique alphanumeric identifier (CUI), with which no concept name is associated. Then, each concept identified through its
unique CUI is annotated by a set of terms or noun phrases, which
could be interpreted as potential names for that concept, whereas
they are not (unless a “preferred form” is explicitly specified).
Moreover, all concepts in the Metathesaurus (and, by extension,
the terms that annotate these concepts) are connected through the
“isa” link to one or more semantic types (or type categories)
such as Virus, Acquired Abnormality, Disease or Syndrome,
which are nothing more than other concepts specified through a
generic name.
Another common confusion takes place between a concept (or
more precisely, its name), and the words (or groups of words)
used to annotate this concept in the different considered
languages. In fact, any annotation of a concept can potentially be
chosen as the name used to refer to it (see the examples below).
That is why, choosing a specific naming convention can easily
remove confusion about the corresponding words from different
languages. Moreover, there is clearly an asymmetry between the
notions of words and concepts as highlighted by the examples
shown in Table 1.
The first two examples show that an annotation for a primitive
concept (i.e. a concept that is indivisible) can be done either
through a single word or a group of words (sometimes annotation
by prefixes can also be considered, as “cardio-” for the concept
cl_Heart). The following two examples highlight that a
composite concept can possibly be annotated by a single word if
one exists in the specific language of the annotation. This also
implies that, although the “word” constitutes the basic unit of
any textual object (such as discharge summaries, reports,
notes...), it does not always correspond to the notion of primitive
(or basic) concept. Indeed, the latter can be smaller than the word (i.e. more than one primitive concept is embedded in a single word, for example, nephrectomy), or larger than the word (i.e. one primitive concept needs more than one word to be expressed,
for example, abdominal aorta). It is also important to notice that
the specification of the annotation kidney excision is not
mandatory for the composite concept cl_Nephrectomy in so far as
a definition is given (for example, by using the formalism of
conceptual graphs as defined by Sowa [50]), as follows:
[Nephrectomy] is: [Excision] ->(actsOn)-> [Kidney].
Therefore, every primitive concept needs to be annotated (with
the different words and synonyms which serve to express this
concept, in the several treated languages), whereas composite
concepts (for which a definition is given) may be annotated based
on the availability of words in the respective language.
Maintaining equivalent definitions can then greatly reduce the
need for introducing a large range of lexical variants in the
system.
Table 1 - Mapping between words and concepts

Words                            Concepts
a word -> a primitive concept:
    kidney                       cl_Kidney
    renal                        cl_Kidney
a group of words -> a primitive concept:
    abdominal aorta              cl_AbdominalAorta
    renal failure                cl_RenalFailure
a word -> a composite concept:
    nephrectomy                  cl_Nephrectomy
a group of words -> a composite concept:
    kidney excision              cl_Nephrectomy
a word -> several concepts:
    left (adjective)             cl_Left, cl_LeftSided
Finally, the last example deals with one important problem of
natural language: its ambiguity. The word left is ambiguous both
at the syntactic and semantic levels. Indeed, it can be an
adjective, a noun or the past form of the verb “to leave”. As an
adjective, it can take two different meanings. It can represent
either the left-right selector as in “pain in the left arm” (i.e.
being an annotation of the concept cl_Left, which is a child of
cl_LeftRightState), or the laterality position as in “pain in the
left side of the stomach” (i.e. being an annotation of the concept
cl_LeftSided, which is a child of cl_LateralityPositionState).
However, the adjective left can also occur in specific medical
expressions such as “left heart”. This expression is an example
of “medical jargon”, characteristic to the medical sublanguage,
and as such, it must be treated as a single unit (also referred to as an “idiomatic expression”) by NLP tools. Indeed, the meaning of
the whole expression cannot be deduced from the combination of
the meaning of each word composing this expression and
moreover, such groups of words always occur contiguously in textual records. In this way, “abdominal aorta” and “renal
failure” can also be considered as idiomatic expressions. Finally,
in some cases, taking into account the syntactic information is
sufficient to solve the ambiguity. For example, the English word
patient, considered as a noun annotates the concept cl_Patient,
and considered as an adjective annotates the concept cl_Patience.
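The two disambiguation cases just discussed can be sketched as follows; the lexicon entries are illustrative, and a word such as “left” keeps two candidate concepts even after its grammatical category is known.

```python
# Small sketch of the cases above: the grammatical category is sometimes enough
# ("patient" as noun vs. adjective), while "left" remains semantically ambiguous
# as an adjective and needs the surrounding context to be resolved.
LEXICON = {
    "patient": {"noun": ["cl_Patient"], "adjective": ["cl_Patience"]},
    "left":    {"adjective": ["cl_Left", "cl_LeftSided"]},
}

def candidate_concepts(word: str, pos: str):
    """Return the concepts a word may annotate, given its grammatical category."""
    return LEXICON.get(word, {}).get(pos, [])

print(candidate_concepts("patient", "adjective"))  # ['cl_Patience']
print(candidate_concepts("left", "adjective"))     # ['cl_Left', 'cl_LeftSided'] - still ambiguous
```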
Identifying the Corresponding Knowledge Levels
The preceding section has highlighted two main basic units:
“concept” and “word”. Each of these units fits into a particular
domain of knowledge, which is respectively the conceptual level
and the lexical level.
The Conceptual Level
The conceptual level embeds at least three kinds of information,
which constitute the required basis for a robust concept model:
• the set of concepts and their names, relevant for the treated
domain,
• a typology or ontology organizing these concepts relative
to each other,
• a set of semantic rules specifying how these concepts can be
combined together in a manner that makes sense and is
relevant.
All the concepts deemed relevant for a particular domain must
be described at this level. Specification of a structure through
which the concepts can be organized allows users to maintain a
consistent view of all the relevant clinical entities and their
associated attributes [70]. Such a hierarchy should reflect an
appropriate level of generality and granularity, which may greatly
vary with the degree of precision needed by each application.
Generalization and specialization, conveyed along the “isa” link
between nodes of the hierarchy, constitute the basic principles on
which inference mechanisms can be implemented. The
complexity of these mechanisms is also dependent on the nature
of the hierarchy, which can be simple or multiple (i.e. allowing
more than one parent node per child). Finally, an important part
of the semantics of the domain under consideration is specified
through a set of semantic rules, which allows relationships to be
set up between each pair of sensible concepts. These
compatibility rules are also useful to define composite concepts.
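A toy sketch of these three kinds of information, with invented concepts and compatibility rules that are not taken from any cited model, might combine a multiple-inheritance “isa” hierarchy with rules stated on general concepts and inherited by their descendants.

```python
# Toy conceptual level: an "isa" hierarchy allowing more than one parent per
# child, plus semantic compatibility rules inherited along the hierarchy.
ISA = {
    "cl_PleuralEffusion": ["cl_Effusion", "cl_ChestFinding"],   # multiple parents
    "cl_Effusion":        ["cl_PathologicProcess"],
    "cl_PleuralCavity":   ["cl_BodyCavity"],
}

def generalizations(concept):
    """The concept itself plus every ancestor along the 'isa' links."""
    result = {concept}
    for parent in ISA.get(concept, []):
        result |= generalizations(parent)
    return result

# A relationship is allowed between two concepts if some pair of their
# generalizations is explicitly declared compatible.
RULES = {("cl_PathologicProcess", "rel_hasLocation", "cl_BodyCavity")}

def compatible(source, relation, target):
    return any((s, relation, t) in RULES
               for s in generalizations(source)
               for t in generalizations(target))

print(compatible("cl_PleuralEffusion", "rel_hasLocation", "cl_PleuralCavity"))  # True
```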
The Lexical Level
The lexical level is necessary to recognize words and phrases
and thus plays an important role in natural language processing.
Indeed, acquiring the list of words used in a given language
(usually referred to as the lexicon or dictionary), is the first step of
any attempt to understand free texts. The lexical level is then the
place to support morpho-syntactic information about words
(being either simple words or multi-word phrases such as
idiomatic expressions) as well as the notions of synonyms,
lexical variants, abbreviations and acronyms. Three kinds of
information are usually asserted at this level for each lexical entry:
• the lexical unit: It corresponds to the specification of the
basic form of a word. Such basic form is generally
expressed through the masculine, singular, and infinitive forms,
depending on the word category and applicability in a
specific language. Recognizing the basic form from any
morphological variant is usually considered as a task
belonging to the NLP side. Such a task is language-dependent and even though analogies exist between
languages, it has to be redesigned for each new language.
• the syntactic argument: This argument allows each lexical
unit to be “grammatically” categorized, and may be rather
complex depending on the specific language and the
purpose of the application. It generally describes the
grammatical category (preposition, noun, adjective, verb...)
with some morpho-syntactic features as needed (i.e.
number, gender, mode variations), as well as some usage
information (for example, considering the usual position of
an adjective relatively to the qualified noun, for the French
language).
• the semantic argument: This argument aims at describing
the “meaning” of each lexical unit, this “meaning” being
exactly conveyed by what we have called a concept. It can
then be considered as a pointer toward one concept (or
more than one in case of semantic ambiguity), which is
precisely defined in the conceptual level.
These lexical units can also be viewed as annotations of the
semantic argument, useful to ensure a large coverage in searching
for instances of concepts in medical texts, written in a specific
language. The semantic argument is also the key element to
define dictionaries in a multilingual environment [66, 69].
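A lexical entry carrying these three kinds of information might be sketched as follows; the feature names and the sample entries (including the French lexeme “foie” pointing to cl_Liver) are illustrative.

```python
# Sketch of a lexical entry with the three kinds of information listed above:
# the lexical unit (basic form), a syntactic argument, and a semantic argument
# pointing to one or more concepts in the conceptual level.
from dataclasses import dataclass
from typing import List

@dataclass
class LexicalEntry:
    lexical_unit: str            # basic form (e.g. masculine singular, infinitive)
    syntactic_argument: dict     # grammatical category and morpho-syntactic features
    semantic_argument: List[str] # pointer(s) to concepts, more than one if ambiguous

lexicon = [
    LexicalEntry("hepatic",
                 {"category": "adjective"},
                 ["cl_Liver"]),
    LexicalEntry("foie",                                   # French lexeme, same concept
                 {"category": "noun", "gender": "masculine", "number": "singular"},
                 ["cl_Liver"]),
    LexicalEntry("nephrectomy",
                 {"category": "noun", "number": "singular"},
                 ["cl_Nephrectomy"]),
]

# The shared semantic argument (cl_Liver) is what ties "hepatic" and "foie"
# together in a multilingual dictionary.
```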
As defined above, the content of the lexical and conceptual
levels must be broadened in order to fit with NLP and KR
purposes.
Extending these Levels for NLP and KR Purposes
In addition to the local syntactic properties associated with
individual lexical entries, syntactic rules can be added at the
lexical level to deal with more complex structures such as
sentences. These syntactic rules are language-dependent, and they
clarify the valid syntactic structures (i.e. well-formed
combinations of grammatical categories), which are permitted to
support the expressions formulated in the treated language. The
lexical level augmented with the syntactic rules allows NLP tools
to precisely manage the syntactic information embedded in
words, phrases, and sentences, as found in textual documents.
The semantic rules defined at the conceptual level are useful to
define binary relationships between two concepts, which can be
roughly categorized as thematic and attribute relationships.
Moreover, these rules are only locally validated. Describing the
roles that a concept plays in a particular situation requires taking
into account more precise information about the clinical context
where this event can occur. Such contexts are a good place to
specify more complex information, such as causal or temporal
information, as well as default values and basic common
knowledge, useful to build a semantic representation that clarifies
the implicit information not said in the texts, although well-known by people reading these texts. This requires modelers to
incorporate contextual information at the conceptual level. For
this, frame-based representation systems [71, 72, 73, 39] are
suitable, as they provide a uniform environment for describing a
network of associations between concepts representing a
stereotyped situation.
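As a purely illustrative sketch, such a stereotyped situation could be captured by a single frame gathering slots, permitted fillers, and default values; the frame and slot names below are hypothetical and are not taken from any of the cited frame-based systems.

% Hypothetical frame for a stereotyped fracture situation: each slot names an
% expected piece of contextual information, the kind of filler allowed, and a
% default value used when the text leaves the information implicit.
frame(fracture_situation,
      [slot(location, cl_Bone,     default(unspecified)),
       slot(cause,    cl_Trauma,   default(unknown)),
       slot(onset,    time_point,  default(unknown)),
       slot(severity, cl_Severity, default(unspecified))]).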
Terms (and their associated codes) are critical to the collection
of accurate and aggregate health care data and to linking patient
records to decision support tools. Therefore, terms should be tied
to the two previously mentioned levels. First, such terms can be
smoothly integrated at the conceptual level if a conceptual
definition clarifying their meaning is provided [34, 35, 64].
Second, the technical vocabulary used to express these terms
must also be incorporated at the lexical level.
Once the basic units and corresponding levels of specification
have been determined, their use by NLP tools can be considered.
From Sentences to a Conceptual Representation:
Holding the Important and Relevant Information
Characteristics of Existing NLP systems
Developing analyzers that yield a conceptual representation of
medical narratives has long been a considerable research topic in
medical informatics. Several analytic techniques have emerged
[74, 75, 76, 77, 78, 79, 48, 49, 68, 51, 52]. The common
approach taken by these systems consists of transforming sentences, from the words of the language in which they are expressed,
into the chosen conceptual representation, which will be used as a
standardized format for further information access. Different kinds
of knowledge are involved in the analysis process, which can be clustered into two main categories. The first category is concerned with
the morpho-syntactic knowledge related to the sentence structure.
This knowledge is precisely defined in what we have called the
lexical level. The second category deals with the semantic (or
conceptual) knowledge related to the sentence meaning. This
knowledge corresponds to what we have embedded in the
conceptual level, and is usually achieved as part of the domain
modeling task. In between is the integration process [80] dealing
with the problem of using syntactic and semantic information one
after the other, or together. MENELAS, a medical language
understanding project [78], is an example of a system that follows
the standard division of natural language processing: morpho-syntactic, semantic and pragmatic analyses. In the Linguistic String Project (LSP) system [74, 75], which is more syntax-oriented, the semantic restrictions are precoded at the level of the
grammar rules and thus must be entirely anticipated during the
conception of the system. In other systems, the weight of the
semantic argument is amplified because of the importance
conferred to a semantic-driven approach for medical texts analysis.
In the MedLEE analyzer [48, 49], the structure of the source
language is specified in a context-free semantic grammar which
defines the well-formed semantic structures of the domain,
integrating only a few syntactic features. The METEXA (“MEdical
TEXt Analysis”) [77] and the RECIT [68, 51] systems both use
local syntactic rules to trigger the checking of any combination of
sensible concepts. That is to say, as soon as two syntactic
constituents have been detected (such as an adjective plus a noun,
or a noun plus a noun complement), a valid semantic
interpretation (specified by a pair of conceptual entities linked by
a relationship) is sought in the domain model, if there is any.
Moreover, the syntactic constraints are relaxed in the second stage
of the RECIT analyzer [51], devoted to the building of the
conceptual representation for the whole analyzed sentence. In this
way, ill-formed phrases (i.e. phrases constructed without
“syntactic glue”) are also treated insofar as they are sensible. This
relaxation fits in with the particularities of medical language
being both technical (using specific medical jargon) and written
in a concise and direct style (resembling the telegraphic style).
Some general remarks also apply to the above systems. Each of
these systems uses a conceptual structure to store the meaning of
natural language inputs. The formalism of conceptual graphs
(CGs) as defined by J.F. Sowa [50] is nowadays the most
popular formalism, chosen as the language-independent representation for storing the results of medical text analysis [77, 78, 48, 52]. Several attractive features account for this popularity, such as their ability to reconcile the constraints of
expressive power and notational efficacy. In particular, CGs allow
distinctive features to be expressed and they support various
kinds of operations. Moreover, their straightforward readability
makes them easily understood.
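For instance, in the linear notation of conceptual graphs, an expression such as “fracture of a bone” might be captured, with illustrative concept and relation labels, as [Fracture]->(LOC)->[Bone], where concept nodes appear in square brackets and the conceptual relation in parentheses.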
Although many groups have been working on medical language
processing, very few useful and practical systems exist at the
present time. Indeed, the strong medical constraints to be error-free and accurate slow down the overall development. This is
why a common feature of most of the existing systems is that
they choose the radiology domain as the focus domain to test and
evaluate their implemented analysis strategies [48, 77, 79]. The
special appeal of X-ray reports is mainly due to their well-defined
physical and conceptual structure, encompassing a delimited
domain of clinical medicine that yields useful clinical information
for decision support and research. Finally, the fact that on-line
radiology reports are readily accessible from central patient
databases in most hospitals enhances their potential use for NLP.
Relying on Concept Models For NLP Needs: What are the
Requirements?
As highlighted before, the underlying properties of medical
language have oriented research in medical language processing
toward investigating semantic-driven approaches, which make use
of a large body of semantic or conceptual knowledge [77, 48, 51].
Anticipating the amount of useful conceptual knowledge during the design phase of a semantic-driven analyzer, even when a narrow domain and a specific task are considered, remains unrealistic, given the sheer volume and expressiveness of the information to be managed. Moreover, this requires skills that range beyond linguistic informatics into conceptual modeling. A solution is to draw this conceptual
information directly from some existing conceptual knowledge
bases. RECIT [12, 68, 51, 52] is a typical example of an analyzer
that grounds its semantic components directly in a model for medical concept representation developed apart from the analysis process, namely the GALEN model [41, 42].
MedLEE [11, 48, 49] is another example of an analyzer that takes
advantage of the existence of the MED [40] to produce structures
which are compatible with, but not directly built from, the
findings as modeled in the MED. Although these two
systems differ in the way they rely on a concept model, they
emphasize some requirements for the success of a combined
approach. These requirements cover abilities defined from both
the NLP system side and the concept model side, as emphasized
in the following sections.
Separate Processor From Knowledge Components
A first requirement, essential for the implementation of an
analyzer relying on a concept model, is that the core engine of the
processor be clearly separated from the knowledge components.
This separation is a crucial implementation feature designed to
cope with the fluctuations in a concept model, which can at any
moment evolve (by editing, removing or adding pieces of
knowledge). This also ensures the independence of the processor
toward specific clinical applications, as redefining the
corresponding domain-specific knowledge sources should be
sufficient to switch to another clinical application. In MedLEE,
this separation is clearly emphasized [48, 81]. For the RECIT
system, its architecture has taken advantage of the ambition, from
the start, to develop an analyzer in a multilingual environment
[68] (i.e. first applied to French, then adjusted to English, and to
a minor extent, German). In order to minimize the development
efforts needed to adapt the RECIT system to another language, a
modular structure has been implemented. This allows new rules
to be inserted without disturbing the general computational
mechanism implemented to select and apply them at the right
time during the analysis process. Moreover, the declarative style
in which this analyzer is written (using Prolog as the logical
programming language [82]) ensures great expressiveness for the setup and management of rules as well as for the design of a
knowledge representation. Finally, European languages, although different from a syntactic viewpoint (even if some analogies exist, such as between French and English [83, 68]), allow the same concepts to be expressed. That is why a semantic-driven approach has been chosen, in order to take advantage of a
single conceptual knowledge base, independent of any language
and thus, accessible by any version of the multilingual RECIT
analyzer. As a result, a Medical Linguistic Knowledge Base
(MLKB) has been set up as a recipient for all the declarative
knowledge used during the analysis of medical texts [52, 66].
Several dividing lines structure this knowledge base, the main one separating the semantic part (domain-dependent but language-independent) from the syntactic part (domain-independent but language-dependent). The semantic part can typically be
supplied with a concept model.
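A minimal sketch of this partition, using invented predicate names, would store the language-independent semantic knowledge once and attach per-language lexical annotations to it:

% Hypothetical illustration of the MLKB split: the sensible statement is
% domain-dependent but language-independent, while the annotations are
% language-dependent.
sensible(cl_Fracture, rel_hasLocation, cl_Bone).
annotation(cl_Fracture, english, fracture, noun).
annotation(cl_Fracture, french,  fracture, noun).
annotation(cl_Bone,     english, bone,     noun).
annotation(cl_Bone,     french,  os,       noun).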
The separation between processor and knowledge (and
furthermore between different kinds of knowledge) is nevertheless
not absolute. Indeed, the way of formulating specific information
can be strongly dependent on the language style used as well as
on the nature of the information to be communicated. This
requires advance specification of the precise types of knowledge
which will be manipulated during the analysis process, as well as
their functionality. Finally, it is important to bridge the gap
between the way information is expressed in the medical texts
and the way it is represented at the conceptual level. The amount
of attention paid to these previous points determines the depth of
integration, that is to say, the required efforts to adjust NLP tools
to existing concept models and vice versa.
Exploiting Relevant Information From Concept Models:
Determining the Depth of Integration
Most existing medical knowledge sources have been developed
with objectives other than their exploitation by NLP tools.
Nevertheless, these sources embed categories of linguistic
knowledge (as outlined at the lexical and conceptual levels)
which are often applicable for NLP needs, as reviewed by the
authors [13]. In particular, the MED and GALEN models enclose
interesting, though different, features with respect to strict NLP needs. The MED
provides a large vocabulary which encompasses the needs of
ancillary clinical systems at the Columbia-Presbyterian Medical
Center. However, semantic rules are not directly available but
could be extracted from the frame concept nodes, which embed
semantic links and attributes, and which describe significant
contextual information. The GALEN model has a more restricted
vocabulary coverage but presents a large set of directly available
semantic rules expressed through the so-called “sensible
statements”. Contextual information is only partially present in
this system and has to be extracted from the set of criteria
associated with relevant concepts. The experiments conducted
with MedLEE [48] and RECIT [12] emphasize two different ways
of integrating, in the analysis process, relevant information
derived from a concept model, respectively the MED and the GALEN model. However, the MedLEE processor does
not use MED as a direct source of conceptual knowledge, but
rather as a “reference model”, useful to specify the structure of the
analysis output which must map findings as modeled in the
MED. For this, additional knowledge sources have been
elaborated separately from the model (such as a formal semantic
grammar and a lexicon, a mapping knowledge base, and a
synonym knowledge base) which act as bridges between the
language of the texts and the unique concepts in the controlled
vocabulary as defined in the MED [48]. On the other hand, the
RECIT system uses the GALEN model as a direct semantic
source providing both the set of concepts which can be combined
to form the analysis output (this being expressed in the formalism
of conceptual graphs) as well as the sanctioning rules useful to
check the pertinence of any medical language expression against
the concept model. This integration process is presented below,
highlighting the relevant pieces of conceptual information
provided by such a concept model and that are of direct use to the analysis process.
The first version of the RECIT system relied on a knowledge base built by the authors. However, the limited size of such a domain knowledge base greatly reduced the capacity of the analyzer. That is why the idea emerged to import as much conceptual knowledge as possible from the GALEN model. This transfer has been facilitated by different factors from both the analyzer and the model sides.
First, the typology of GALEN, through its high-level structure [61], corresponds quite well to the main partition initially implemented in RECIT, which emphasized the distinctions
between the actors, the medical events, the qualifiers, the values,
and the modalities. These subdivisions are taken into account
during the analysis process, especially for the triggering of
relevant heading concepts, around which conceptual graphs can be
built. In order to preserve these analysis strategies, an alignment with the GALEN typology has been performed by specifying pointers at the highest possible levels.
Second, a strong similarity was observed between the semantic
part of compatibility rules as implemented in the RECIT system
(and which are used during the proximity processing phase to
link neighboring words together) and the GALEN sensible
statements. Both aim at describing a relationship between each
pair of sensible concepts, as shown in the following example:
The last three semantic arguments of the NLP rule:
compatibility_rule(#Number, Syntax,
cl_Fracture, cl_Bone, ‘LOC’).
are equivalent to the GALEN sensible statement:
Fracture which hasLocation Bone.
Such rules are quite general (not every bone is actually a candidate for fracture), but they are adequate for analyzing sentences that are sensible per se, where the need for sanctioning arises essentially in the presence of ambiguities. Moreover, taking into account other kinds of fractures (such as fractures occurring in cartilage) will require additional statements to be specified.
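Under this equivalence, one can imagine, as a sketch only and with invented predicate names, storing the sensible statements as facts and deriving the semantic part of a compatibility check from them, so that linking two neighboring constituents reduces to a lookup in the model:

% Hypothetical sensible statements, and a mapping from model relationships
% to the relation labels recorded in the analysis output.
sensible_statement(cl_Fracture, rel_hasLocation, cl_Bone).
relation_label(rel_hasLocation, 'LOC').

% Two concepts may be linked whenever the model sanctions a relationship
% between them; the label is what the analyzer keeps for the output graph.
semantically_compatible(Concept1, Concept2, Label) :-
    sensible_statement(Concept1, Relation, Concept2),
    relation_label(Relation, Label).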
Third, the compositionality property of the GALEN model is
largely exploited in the RECIT system to replace composite
concepts by their definition (using the expansion operation
defined with the conceptual graph formalism [50]), thus ensuring
better results in future querying as information is decomposed
into its primitive components.
Finally, the last piece of conceptual knowledge used by the RECIT analyzer deals with the specification of conceptual schemata. The latter are used in the second analysis phase to link the heading concept of the sentence with all the other concepts (highlighted in the sentence during the proximity processing phase), in order to produce the CG representation expressing the sentence meaning. As seen before, such information still needs to be extracted from the GALEN model. Conceptual modeling as implemented in a system like the frame-based interlingua [39, 63] seems more appropriate for handling this kind of knowledge, as it explicitly describes the properties of concepts relative to a specific context.
Bridging the Gap Between Reality and Abstraction
The above experiment has shown the different kinds of knowledge that a concept model like GALEN can provide for NLP needs. However, the main challenge is to stress the distinction between information as it is formulated in medical texts and as it is expressed in concept models. This entails mediation between the large expressiveness, permissiveness, and implicitness of natural language on the one hand, and the generality, granularity, and conciseness of the concept model on the other hand. Such a gap between the “language of the texts” and the “language of concepts” can be filled by considering what linguistic information must be attached to the conceptual level in order to manage the analysis of medical texts.
Such syntactic attachments have been defined at different
strategic points in the RECIT system. First, it is important to
translate the model typology into the context of the analyzed texts.
This is performed through the typology annotation which allows
concepts to be annotated by words and expressions available in
the different languages together with their syntactic properties.
These annotations result in the creation of the dictionary as
needed by NLP tools.
Second, the application of the sensible statements to natural
language expressions implies clarifying the syntactic structures
supporting the expression of the concepts and the relationships in
a specific language. For example, the sensible statement linking
the concepts cl_Fracture and cl_Bone through the relationship
rel_hasLocation can be instantiated by different expressions in
English such as “ fractured femur”, “ fracture of the femur”,
which are respectively supported by the syntactic structures:
“adjective plus noun” and “noun plus noun complement”.
Relying directly on the sensible statements as described in the GALEN model has permitted such syntactic constraints, initially specified for each compatibility rule (second argument of the clause compatibility_rule), to be defined at the level of the relationships, without losing information. This syntactic information is specified for each relationship at the highest possible level, and can always be refined by defining a more restrictive conceptual context. For example, the relationship rel_hasLocation can be annotated with the two above syntactic structures when used in the restrictive context occurring between the concepts cl_PathologicalCondition and cl_BodyStructure. These syntactic constraints, described as syntactic annotations of relationships, are also easier to maintain, as the number of relationships is much smaller than the number of sensible statements.
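As an illustrative sketch, again with invented predicate names, such an annotation could be attached once to the relationship, together with the restrictive conceptual context in which it applies, rather than being repeated on every sensible statement:

% Hypothetical syntactic annotation of a relationship: the listed syntactic
% structures are valid whenever rel_hasLocation links a pathological
% condition to a body structure.
relationship_syntax(rel_hasLocation,
                    context(cl_PathologicalCondition, cl_BodyStructure),
                    [adjective_plus_noun, noun_plus_noun_complement]).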
Finally, another problem encountered was that linguistic
relationships do not always fit with the conceptual relationships
as specified in the GALEN model. Indeed, in order to link the
expression “severe chest pain”, during the proximity processing
phase, RECIT needs to check the presence of a sensible statement
which specifies the relationship occurring between the concepts
cl_ChestPain and cl_Severity. But the granularity of the GALEN
model furnishes two sensible statements:
Pain which hasSeverity Severity
Severity which hasState SeverityState
where the concept cl_ChestPain is a child of the concept cl_Pain.
A combination of these two statements is necessary to deal with
the implicitness of natural language, where the qualifier Severity is inherently embedded in its values. Such an operation can be
performed automatically by considering the transitivity between
the relationships hasFeature (being an ancestor of hasSeverity)
and hasState, to produce the following statement which applies
directly to natural language input:
Pain which hasFeatureState SeverityState.
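A sketch of this combination, assuming invented predicate names and a deliberately simplified view of the GALEN relationship hierarchy, could derive the merged statement automatically:

% Hypothetical encoding of the two sensible statements and of the fact that
% hasSeverity is a descendant of hasFeature.
sensible_statement(cl_Pain,     rel_hasSeverity, cl_Severity).
sensible_statement(cl_Severity, rel_hasState,    cl_SeverityState).
descendant_of(rel_hasSeverity, rel_hasFeature).

% Combining a feature relationship with the state of that feature yields a
% statement applying directly to expressions such as "severe chest pain".
combined_statement(Concept, rel_hasFeatureState, State) :-
    sensible_statement(Concept, FeatureRel, Feature),
    descendant_of(FeatureRel, rel_hasFeature),
    sensible_statement(Feature, rel_hasState, State).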
The specification of additional linguistic information on top of a “pure” concept model has proved to be a key solution, allowing a natural language processor to smoothly integrate such conceptual information during the analysis process. For this, any concept model, specified through a modular and declarative structure, and providing at least the relevant inter-connected concepts as naturally found in medical texts, should be considered a potential conceptual source by NLP tools.
Conclusions
The experience of the authors’ group in managing models for
medical concept representation, first by adjusting the RECIT
analyzer to the GALEN model [12], then by recasting the generic
interlingua frame system, initially developed by Miller, Masarie
et al. [63], has reinforced our belief that a solid model of medical
concepts must be developed and used for feeding the semantic
components of a medical language processor. Integrating in the
analysis process, all the basic conceptual components (from
which the conceptual representation will be built) as well as the
sanctioning mechanism (used to set up an accurate representation)
rom a concept model, ensures a consistent follow-up of the
analyzer as the concept model is evolving. The major constraint
is that the success of the analysis process is then greatly
dependent on the accuracy and efficiency of the model. This
implies that developers should focus on a well-limited domain,
answering to a well-specified goal, in order to yield concrete
outcomes. Moreover, as natural language is, in essence, highly
permissive and generative, by authorizing ambiguity and
vagueness as well as neologism, it is all the more important to
rely on a concept model that has the quality of being more
restrictive while still preserving compositionality. Finally,
linking the semantic components of a medical language processor
to a concept model allows the combination of a usually top-down
approach to define the general structure of concepts in a given
domain with a bottom-up analysis of medical language texts.
This results in a knowledge-oriented representation which adds new functionality by moving from data to concepts.
Acknowledgments
This work is supported by grant number 8220-046502 from the “Fonds National Suisse de la Recherche Scientifique”. Work on the generic frame schema was originally supported through NLM Contract N01-LM-6-3522.
References
1. Evans DA, Cimino JJ, Hersh WR, Huff SM, Bell DS, for
the Canon Group. Toward a Medical-concept Representation
Language. J Am Med Informatics Assoc 1994, 1: 207-217.
2. Scherrer J-R, Côté RA, Mandil SH (eds). Computerized
Natural Medical Language Processing for Knowledge
Representation. Proceedings of the IFIP-IMIA WG6 International
Working Conference, Geneva, Switzerland, 12-15 September,
1988. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 1989.
3. McCray AT, Scherrer J-R, Safran C, Chute CG (eds).
Special Issue on Concepts, Knowledge, and Language in Health-Care Information Systems (IMIA). Methods of Information in
Medicine 1995, 34.
4. Evans DA, Ginther-Webster K, Hart M, Lefferts R, Monarch
I. Automatic indexing using selective NLP and first-order
thesauri. In: RIAO’91. Barcelona: Autonoma University of
Barcelona, 1991: 624-644.
5. Bell DS, Pattison-Gordon E, Greenes RA. Experiments in
Concepts Modeling for Radiographic Image Reports. J Am Med
Informatics Assoc 1994, 1: 249-262.
6. Spackman KA, Hersh WR. Recognizing Noun Phrases in
Medical Discharge Summaries: An Evaluation of Two Natural
Language Parsers. In: Cimino JJ (ed). Proceedings of the 1996
AMIA Annual Fall Symposium (Formerly SCAMC).
Philadelphia: Hanley & Belfus, Inc. 1996: 155-158.
7. Hersh WR, Campbell EH, Evans DA, Brownlow ND.
Empirical, Automated Vocabulary Discovery Using Large Text
Corpora and Advanced Natural Language Processing Tools. In:
Cimino JJ (ed). Proceedings of the 1996 AMIA Annual Fall
Symposium (Formerly SCAMC). Philadelphia: Hanley & Belfus,
Inc. 1996: 159-163.
8. Evans DA, Brownlow ND, Hersh WR, Campbell EM.
Automating Concept Identification in the Electronic Medical
Record: An Experiment in Extracting Dosage Information. In:
Cimino JJ (ed). Proceedings of the 1996 AMIA Annual Fall
Symposium (Formerly SCAMC). Philadelphia: Hanley & Belfus,
Inc. 1996: 388-392.
9. Hahn U, Schnattinger K, Romacker M. Automatic
Knowledge Acquisition from Medical Texts. In: Cimino JJ (ed).
Proceedings of the 1996 AMIA Annual Fall Symposium
(Formerly SCAMC). Philadelphia: Hanley & Belfus, Inc. 1996:
383-387.
10. Baud RH, Lovis C, Alpay L, Rassinoux A-M, Scherrer J-R,
Nowlan A, Rector A. Modelling for Natural Language
Understanding. In: Safran C (ed). Proceedings of SCAMC 93.
New York: McGraw-Hill, Inc. 1993: 289-293.
11. Friedman C, Cimino JJ, Johnson SB. A Schema for
Representing Medical Language Applied to Clinical Radiology. J
Am Med Informatics Assoc 1994, 1: 233-248.
12. Rassinoux A-M, Wagner JC, Lovis C, et al. Analysis of
Medical Texts Based on a Sound Medical Model. In: Gardner
RM (ed). Proceedings of SCAMC 95. Philadelphia:
Hanley & Belfus, Inc., 1995: 27-31.
13. Baud RH, Rassinoux A-M, Lovis C, Wagner J, Griesser V,
Michel P-A, Scherrer J-R. Knowledge Sources for Natural
Language Processing. In: Cimino JJ (ed). Proceedings of the
1996 AMIA Annual Fall Symposium (Formerly SCAMC).
Philadelphia: Hanley & Belfus, Inc. 1996: 70-74.
14. Ingenerf J. Taxonomic Vocabularies in Medicine: The
Intention of Usage Determines Different Established Structures.
In: Greenes RA et al. (eds). Proceedings of MEDINFO 95.
Alberta: HC&CC, 1995: 136-139.
15. McCray AT, Hole WT. The Scope and Structure of the First
Version of the UMLS Semantic Network. In: Miller RA (ed).
Proceedings of SCAMC 90. Los Alamitos: IEEE Computer
Society Press, 1990: 126-130.
16. McCray AT, Nelson SJ. The Representation of Meaning in
the UMLS. In [3]: 193-201.
17. Friedman C, Huff SM, Hersh WR, Pattison-Gordon E,
Cimino JJ. The Canon Group’s Effort: Working Toward a
Merged Model. J Am Med Informatics Assoc 1995; 2: 4-18.
18. Wehrli E, Clark R. Natural Language Processing, Lexicon
and Semantics. In: [3]: 68-74.
19. Rector A. Compositional Models of Medical Concepts:
Towards Re-usable Application-Independent Medical
Terminologies. In: Barahona P, Christensen JP (eds). Knowledge
and Decisions in Health Telematics. IOS Press, 1994: 109-114.
20. The International Classification of Diseases, 9th revision,
Clinical Modification. 2nd ed. Vols. 1-3. U.S. Department of
Health and Human Services, September 1980.
21. Rothwell DJ. SNOMED-Based Knowledge Representation.
In: [3]: 209-213.
22. “Medical Subject Headings - Annotated Alphabetical List”,
National Library of Medicine, published annually.
23. Read J. The Read Clinical Classification. NHS Centre for
Coding and Classification, Loughborough, UK, 1993.
24. Pryor TA, Gardner RM, Clayton PD, Warner HR. The
HELP system. J Med Syst 1983, 7(2): 87-102.
25. Barnett GO, Cimino JJ, Hupp JA, Hoffer EP. DXplain: An
Evolving Diagnostic Decision-Support System. JAMA 1987, 258: 67-74.
26. Masarie FE, Jr, Miller RA, Myers JD. INTERNIST-I
Properties: Representing Common Sense and Good Medical
Practice in a Computerized Medical Knowledge Base. Comput
Biomed Res 1985, 18: 458-479.
27. Miller RA, Masarie FE, Jr. Use of the Quick Medical
Reference (QMR) Program as a Tool for Medical Education. Meth
Inform Med 1989, 28(4): 340-345.
28. Lindberg DAB, Humphreys BL, McCray AT. The Unified
Medical Language System. Meth Inform Med 1993, 32: 281-291.
29. Sherertz DD, Tuttle MS, Blois MS, Erlbaum MS.
Intervocabulary Mapping within the UMLS: The Role of Lexical
Matching. In: Greenes RA (ed). Proceedings of SCAMC 88. Los
Angeles: IEEE Computer Society, 1988: 201-206.
30. Huff SM, Warner HR. A comparison of Meta-1 and HELP
terms: Implications for clinical data. In: Miller RA (ed).
Proceedings of SCAMC 1990. Los Angeles: IEEE Computer
Society, 1990: 166-169.
31. Rocha RA, Huff SM. Using Digrams to Map Controlled
Medical Vocabularies. In: Ozbolt JG (ed). Proceedings of
SCAMC 94. Philadelphia: Hanley & Belfus, Inc., 1994: 172-176.
32. Miller RA, Gieszczykiewicz FM, Vries JK, Cooper GF.
CHARTLINE: Providing bibliographic references relevant to
patient charts using the UMLS Metathesaurus Knowledge
Sources. In: Frisse ME (ed). Proceedings of SCAMC 1992. New
York: McGraw Hill, 1992: 86-90.
33. McCray AT, Srinivasan S, Browne AC. Lexical Methods for
Managing Variation in Biomedical Terminologies. In: Ozbolt JG
(ed). Proceedings of SCAMC 1994. Philadelphia: Hanley & Belfus, Inc., 1994: 235-239.
34. Campbell KE, Musen MA. Representation of Clinical Data
Using SNOMED III and Conceptual Graphs. In: Frisse ME (ed).
Proceedings of SCAMC 92. New York: McGraw-Hill, 1992:
354-358.
35. Joubert M, Miton F, Fieschi M, Robert J-J. A Conceptual
Graphs Modeling of UMLS Components. In: Greenes RA et al.
(eds). Proceedings of MEDINFO 95. Alberta: HC&CC, 1995:
90-94.
36. Evans DA. Final Report on the MedSORT-II Project:
Developing and Managing Medical Thesauri. Technical Report.
Pittsburgh, PA: Laboratory for Computational Linguistics,
Carnegie Mellon University, 1987.
37. Evans DA. Pragmatically-Structured, Lexical-Semantic
Knowledge Bases For Unified Medical Language Systems. In:
Greenes RA (ed). Proceedings of SCAMC 88. Los Angeles: IEEE
Computer Society Press, 1988: 169-173.
38. Miller RA. A Computer-based Patient Case Simulator. Clin
Research 1984, 32: 651A.
39. Masarie FE, Miller RA, Bouhaddou O, Giuse NB, Warner
HR. An Interlingua for Electronic Interchange of Medical
Information: Using Frames to Map between Clinical
Vocabularies. Comput Biomed Res 1991, 24(4): 379-400.
40. Cimino JJ, Clayton PD, Hripcsak G, Johnson SB.
Knowledge-based Approaches to the Maintenance of a Large
Controlled Medical Terminology. J Am Med Informatics Assoc
1994, 1: 35-50.
41. Rector AL, Nowlan WA, Glowinski A. Goals for Concept
Representation in the GALEN project. In: Safran C (ed).
Proceedings of SCAMC 93. New York: McGraw-Hill, Inc. 1993:
414-418.
42. Rector AL. Coordinating Taxonomies: Key to Re-usable
Concept Representations. In: Barahona P, Stefanelli M, Wyatt J
(eds). Proceedings of Artificial Intelligence in Medicine (AIME
95). Berlin: Springer, 1995: 17-28.
43. Rossi-Mori A, Bernauer J, Pakarinen V, et al.
CEN/TC251/PT003 models for representation of terminologies
and coding systems in medicine. Proceedings of the Seminar:
Opportunities for European and US Cooperation in
Standardization in Health Care Informatics, Geneva,
Switzerland, September 1992.
44. Cimino JJ. Use of the Unified Medical Language System in
Patient Care at the Columbia-Presbyterian Medical Center. In:
[3]: 158-164.
45. Alpay L, Baud RH, Rassinoux A-M, Wagner J, Lovis C,
Scherrer J-R. Interfacing Conceptual Graphs (CG) and the Galen
Master Notation (MN) for medical knowledge representation and
modelling. In: Andreassen S, Engelbrecht R, Wyatt J (eds).
Proceedings of Artificial Intelligence in Medicine 1993 (AIME
93). Amsterdam: IOS Press, 1993: 337-347.
46. Grishman R, Kittredge R. Analysing Language in Restricted
Domains: Sublanguage Description and Processing. Hillsdale,
NJ: Lawrence Erlbaum Associates, 1986.
47. Hirschman L, Sager N. Automatic Information Formatting of
a Medical Sublanguage. In: Kittredge R, Lehrberger J (eds).
Sublanguage: Studies of Language in Restricted Semantic
Domains. Berlin: Walter de Gruyter, 1982: 27-80.
48. Friedman C, Alderson PO, Austin JHM, Cimino JJ,
Johnson SB. A General Natural-language Text Processor for
Clinical Radiology. J Am Med Informatics Assoc 1994, 1: 161-174.
49. Friedman C, Cimino JJ, Johnson SB. A Conceptual Model
for Clinical Radiology Reports. In: Safran C (ed). Proceedings of
SCAMC 93. New York: McGraw-Hill, Inc. 1993: 829-833.
50. Sowa JF. Conceptual Structures: Information Processing in
Mind and Machine. Reading, MA: Addison-Wesley Publishing
Company, 1984.
51. Rassinoux A-M, Juge C, Michel P-A, Baud RH, Lemaitre
D, Jean F-C, Degoulet P, Scherrer J-R. Analysis of Medical
Jargon: The RECIT System. In: Barahona P, Stefanelli M,
Wyatt J (eds). Proceedings of Artificial Intelligence in Medicine
(AIME 95). Berlin: Springer, 1995: 42-52.
52. Baud RH, Rassinoux A-M, Wagner JC, Lovis C, Juge C,
Alpay LL, Michel P-A, Degoulet P, Scherrer J-R. Representing
Clinical Narratives Using Conceptual Graphs. In: [3]: 176-186.
53. Lewis Carroll. Jabberwocky. Further details on this poem can be found at the URL http://www.iit.edu/~beberg/jabberwocky.html. See also http://www.math.luc.edu/~vande/jabfrench.html or http://www.math.luc.edu/~vande/jabgerman.html.
54. Allen J. Natural Language Understanding. Menlo Park,
CA: The Benjamin/Cummings Publishing Company, 1987.
55. Shortliffe EH, Davis R, Axline SG, Buchanan BG, Green
CC, Cohen SN. Computer-based consultations in clinical
therapeutics: explanation and rule acquisition capabilities of the
MYCIN system. Comput Biomed Res 1975, 8(4): 303-320.
56. Warner HR, Haug P, Bouhaddou O, Lincoln M et al. ILIAD
As An Expert Consultant to Teach Differential Diagnosis. In:
Greenes RA (ed). Proceedings of SCAMC 88. Los Angeles: IEEE
Computer Society, 1988: 371-376.
57. Quillian MR. Semantic memory. In: Minsky M (ed).
Semantic information processing. Cambridge, MA: MIT Press,
1968: 227-270.
58. Minsky M. A framework for representing knowledge. In:
Winston PH (ed). The psychology of computer vision. New
York: McGraw-Hill, 1975: 211-277.
59. Brachman R, Schmolze J. An Overview of the KL-ONE
Knowledge Representation System. Cognitive Science 1985,
9(2): 171-216.
60. Barr CE, Komorowski HJ, Pattison-Gordon E, Greenes RA.
Conceptual Modeling for the Unified Medical Language System.
In: Greenes RA (ed). Proceedings of SCAMC 88. Los Angeles:
IEEE Computer Society Press, 1988: 148-151.
61. Rector AL, Rogers JE, Pole P. The GALEN High Level
Ontology. In: Brender J, Christensen JP, Scherrer J-R, McNair P
(eds). Proceedings of Medical Informatics Europe ‘96 (MIE 96).
Amsterdam: IOS Press, 1996:174-178.
62. Sowa JF. Toward the Expressive Power of Natural
Language. In: Sowa JF (ed). Principles of Semantic Networks:
Explorations in the Representation of Knowledge. San Mateo,
CA: Morgan Kaufmann Publishers, 1991: 157-189.
63. Rassinoux A-M, Miller R A, Baud R H, Scherrer J-R.
Modeling Principles for QMR Medical Findings. In: Cimino JJ
(ed). Proceedings of the 1996 AMIA Annual Fall Symposium
(Formerly SCAMC). Philadelphia: Hanley & Belfus, Inc. 1996:
264-268.
64. Pole PM, Rector AL. Mapping the GALEN CORE Model
to SNOMED-III: Initial Experiments. In: Cimino JJ (ed).
Proceedings of the 1996 AMIA Annual Fall Symposium
(Formerly SCAMC). Philadelphia: Hanley & Belfus, Inc. 1996:
100-104.
65. Wigertz O, Hripcsak G, Shasavar M, Bagenholm P, Ahlfeldt H, Gill H. Data-driven medical knowledge-based systems based on Arden Syntax. In: Barahona P, Christensen JP (eds). Knowledge
and Decisions in Health Telematics. IOS Press, 1994: 126-131.
66. Baud RH, Lovis C, Rassinoux A-M, Michel P-A, Alpay L,
Wagner JC, Juge C, Scherrer J-R. Towards a Medical Linguistic
Knowledge Base. In: Greenes RA et al. (eds). Proceedings of
MEDINFO 95. Alberta: HC&CC, 1995: 13-17.
67. Tuttle MS, Campbell KE, Olson NE et al. Concept, Code,
Term and Word: Preserving the Distinctions. In: Gardner RM
(ed). Proceedings of SCAMC 95. Philadelphia: Hanley & Belfus,
Inc., 1995: 956.
68. Rassinoux A-M, Baud RH, Scherrer J-R. A Multilingual
Analyser of Medical Texts. In: Tepfenhart WM, Dick JP, Sowa
JF (eds). Proceedings of the Second International Conference on
Conceptual Structures (ICCS 94). Berlin: Springer-Verlag, 1994:
84-96.
69. Baud RH, Lovis C, Rassinoux A-M, Scherrer J-R. Alternate
Ways for Knowledge Collection, Indexing and Robust Language
Retrieval. To appear in: Proceedings of the Fourth International
Conference on Medical Concept Representation, Jacksonville,
Florida, January 19-22, 1997.
70. Zweigenbaum P, Bachimont B, Bouaud J, Charlet J,
Boisvieux J-F. Issues in the Structuring and Acquisition of an
Ontology for Medical Language Understanding. In: [3] :15-24.
71. Lytinen SL. Frame selection in parsing. In: American
Association for Artificial Intelligence. Proceedings of the third
national conference on artificial intelligence (AAAI 84). Los
Altos, CA: William Kaufmann, 1984: 222-225.
72. Binot J-L, Ribbens D. Dual frames: a new tool for semantic
parsing. In: American Association for Artificial Intelligence.
Proceedings of the fifth national conference on artificial
intelligence (AAAI 86). Los Altos, CA: Morgan Kaufmann
Publishers, 1986: 579-583.
73. Rocha RA, Rocha BHSC, Huff SM. Automated Translation
Between Medical Vocabularies Using a Frame-Based Interlingua.
In: Safran C (ed). Proceedings of SCAMC 93. New York:
McGraw-Hill, Inc. 1993: 690-694.
74. Sager N, Lyman M, Bucknall C, Nhan N, Tick LJ. Natural
Language Processing and the Representation of Clinical Data. J
Am Med Informatics Assoc 1994, 1: 142-160.
75. Sager N, Lyman M, Nhàn NT, Tick LJ. Medical Language
Processing: Applications to Patient Data Representation and
Automatic Encoding. In: [3]: 140-146.
76. Berrut C, Cinquin P. Natural language understanding of
medical reports. In: [2]: 129-137.
77. Schröder M. Knowledge-based Processing of Medical
Language: A Language Engineering Approach. In: Ohlbach H-J
(ed). Proceedings of the Sixteenth German Workshop on AI
(GWAI 92). Berlin: Springer-Verlag, 1992: 221-234.
78. Zweigenbaum P, Consortium Menelas. MENELAS: an
access system for medical records using natural language. Comput
Meth Prog Biomed 1994, 45:117-120.
79. Haug P, Koehler S, Lau LM, Wang P, Rocha R, Huff S. A
Natural Language Understanding System Combining Syntactic
and Semantic Techniques. In: Ozbolt JG (ed). Proceedings of
SCAMC 1994. Philadelphia: Hanley & Belfus, Inc., 1994: 247-251.
80. Lesmo L, Torasso P. Weighted Interaction of Syntax and
Semantics in Natural Language Analysis. In: Joshi A (ed).
Proceedings of the Ninth International Joint Conference on
Artificial Intelligence (IJCAI 85). Los Altos, CA: Morgan
Kaufmann Publishers, 1985: 772-778.
81. Friedman C, Johnson SB, Forman B, Starren J.
Architectural Requirements for a Multipurpose Natural Language
Processor in the Clinical Environment. In: Gardner RM (ed).
Proceedings of SCAMC 95. Philadelphia: Hanley & Belfus, Inc.,
1995: 347-351.
82. Gazdar G, Mellish C. Natural Language Processing in
PROLOG: An Introduction to Computational Linguistics.
Workingham, England: Addison-Wesley Publishing Company,
1989.
83. Nhan NT, Sager N, Lyman M, Tick LJ, Borst F, Su Y. A
Medical Language Processor for Two Indo-European Languages.
In: Kingsland LC III (ed). Proceedings of SCAMC 89.
Washington: IEEE Computer Society Press, 1989: 554-558.