Partial Parsing Method Applied to Rules
Acquisition for Medical Expert System
Maciej Piasecki and Jerzy Sas
Computer Science Department of Wroclaw University of Technology
ul. Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
{Piasecki, Sas}@ci.pwr.wroc.pl
Abstract. The paper presents a variant of the partial parsing method
(PPM) applied to the acquisition of expert rules from Polish medical text.
PPM is based on the premise that the knowledge domain has already been defined
by a knowledge engineer (i.e. names for classes, attributes, values, etc.).
The definitions are automatically translated from natural language into
formal expressions stored partly in the knowledge base and partly in
the semantic dictionary. PPM preserves the compositionality principle and is
based on the sublanguage method. Successive sentences are scanned for occurrences of words belonging to subcategories. Parsing is used to recognise compound phrases.
1 Introduction
During the implementation of an expert system, most of the effort and cost is consumed
by the process of knowledge acquisition. These costs could be reduced by using existing electronic texts. However, full understanding and intelligent analysis
of text meaning is still impossible. In this paper, a simpler solution is proposed:
intelligent text scanning based on a previously given, detailed specification of the knowledge
domain. The whole process is designed to be controlled by a human operator, a knowledge engineer. The domain specification is written in natural
language and then automatically translated into a given knowledge representation language. The work focuses on a specific area of application: a
medical expert system providing disease diagnosis. The expert system is based
on probabilistic knowledge (i.e. expert rules to which probability
values are assigned, and a set of cases of real diagnoses). The expert rules are
very often one-level and unstructured.
The main task of the system is to identify all sentences in the text that possibly
contain rule-like information and to propose draft versions of rules to the knowledge
engineer. In the following sections the knowledge representation (KR)
formalism is presented together with some more details concerning the
expert system. Next, the results of the linguistic analysis of the collected corpus are
discussed and, finally, the applied Partial Parsing Method is presented
together with its implementation.
2 Knowledge Representation
Expert knowledge is represented as a set of rules of the general form:
$$\text{IF } w_k(x) \text{ THEN } j = i \text{ WITH } \langle \underline{p}_k, \overline{p}_k \rangle$$
where $x$ is a vector of attributes describing the particular case for diagnosis,
$w_k(x)$ is a logical sentence whose value depends on the vector of attributes,
and $j$ is the symbol of the concluded diagnosis.
We assume that the set of possible diagnoses is finite.
Moreover, each rule is associated with a pair of values $\langle \underline{p}_k, \overline{p}_k \rangle$ describing
the lower and upper bounds of the a posteriori probability $p(j \mid w_k(x))$.
The set of rules of this form is used by the diagnostic algorithm (based on
a combined rule-based and case-based approach), which evaluates the probabilities of
classes for a given feature vector (diagnostic case) [2].
Rules are written in a special formal language, RECLAN. The set
of rules must be accompanied by a knowledge domain specification (an input
specification). It includes the following elements:
– specification of the classes of recognition (mainly the definition of a symbol for each
one),
– specification of attributes and their domains (including symbols for values).
An example of a typical domain specification and a rule (based on Polish
terms) is given below:
ATTRIBUTES: cisnienie_krwi
WITH VALUES: obnizone,podwyzszone,w_normie
CLASSES:
przewlekla_niewydolnosc_nerek
IF cisnienie_krwi = podwyzszone
THEN przewlekla_niewydolnosc_nerek WITH <0.3,0.7>
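The relation between a domain specification and a rule can be pictured with a small data model. The following is only a sketch in Python, not RECLAN itself: the names `DomainSpec`, `Rule` and `is_valid` are hypothetical, and the validity check (known class, value inside the attribute's domain, ordered probability bounds) is an assumption about what a consistency check might cover.

```python
from dataclasses import dataclass

# Hypothetical in-memory mirror of the RECLAN example above; the class and
# field names are illustrative assumptions, not RECLAN syntax.
@dataclass(frozen=True)
class DomainSpec:
    attributes: dict     # attribute symbol -> set of allowed value symbols
    classes: frozenset   # diagnosis (class) symbols

@dataclass(frozen=True)
class Rule:
    attribute: str
    value: str
    diagnosis: str
    p_low: float    # lower bound of the a posteriori probability
    p_high: float   # upper bound of the a posteriori probability

def is_valid(rule, spec):
    """Check a rule against the domain specification (assumed criteria)."""
    return (
        rule.diagnosis in spec.classes
        and rule.value in spec.attributes.get(rule.attribute, set())
        and 0.0 <= rule.p_low <= rule.p_high <= 1.0
    )

spec = DomainSpec(
    attributes={"cisnienie_krwi": {"obnizone", "podwyzszone", "w_normie"}},
    classes=frozenset({"przewlekla_niewydolnosc_nerek"}),
)
rule = Rule("cisnienie_krwi", "podwyzszone",
            "przewlekla_niewydolnosc_nerek", 0.3, 0.7)
print(is_valid(rule, spec))
```

A rule that uses a value outside the declared domain of `cisnienie_krwi` would fail this check.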
3 General Linguistic Analysis of the Problem
The main goal of the system is to recognise sentences containing rule-like
information in the input text and to generate rules written in the RECLAN
language for them. Simplifying the problem a little, we assume that:
– each rule is contained in a separate sentence,
– texts delivered to the system use correct language,
– texts are relevant.
The only information conveyed by sentences that interests us is possible KR
rules. Starting with these assumptions, we designed an 'optimistically' working
semantic parser delivering only draft versions of rules to a knowledge engineer.
Additionally, we left untouched the difficult problem of assigning a posteriori probability
values. Even a preliminary approach to it requires a cognitive analysis. From the very beginning we rejected the possibility of full parsing as
ineffective for Polish.
The simplest possible solution to the problem of text scanning seems to be
a kind of pattern matching technique used to look for names of features and
values. However, applying it to Polish text is almost impossible. Due to
the almost free word order in Polish and the richness of Polish morphology, the number
of possible forms of an average phrase is large, e.g. the following simple phrases
have the same meaning: ciśnienie krwi (eng. blood pressure), krwi ciśnienie, and
ciśnieniem krwi (the same meaning but a different grammatical case).
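The weakness of plain pattern matching can be seen on the three variants just listed. The snippet below is a deliberately naive baseline written for illustration (the function `naive_match` is hypothetical): it looks for the attribute name as an exact substring and finds only the first variant.

```python
# Naive exact-substring matching for an attribute name (illustrative baseline).
def naive_match(name, text):
    return name in text.lower()

name = "ciśnienie krwi"
variants = ["ciśnienie krwi", "krwi ciśnienie", "ciśnieniem krwi"]

# Only the first variant matches; reordering or inflection defeats the pattern.
hits = [naive_match(name, v) for v in variants]
print(hits)  # [True, False, False]
```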
Other difficult constructions are compound phrases including attributes,
values and conjunctions, e.g. [ciśnienie krwi i temperatura] podwyższone (eng.
increased [blood pressure and temperature]).
Next, synonyms are very often used interchangeably, including very specific
synonymy relations (not existing in everyday Polish), e.g. silna gorączka (eng.
strong fever), gorączka (eng. fever) and infekcja (eng. infection).
However, some characteristic features of the investigated example corpus simplify the task of Knowledge Acquisition (KA). For example, because most of the sentences
have a generic character (they communicate general rules and dependencies in some
reality), the problems of reference and, partially, anaphora are of minor importance
in KA.
The chosen approach is based on partial syntactic parsing and the sublanguage
method [5]. Additional syntactic categories influenced by semantic considerations are defined: class name, attribute name and value name, all in several
syntactic variants (see next section). Because each of them includes compound
expressions, some special subcategories were defined too, e.g. element of class
name, element of attribute name, etc. Information about the sublanguage category assigned
to each word is stored in the semantic dictionary. The sublanguage grammar is based
on a subset of the general grammar of Polish [6]. The appropriate subset was chosen after analysis of the corpus of example texts. The subset of rules used
is limited mainly to noun phrases and adjective phrases because of the Partial
Parsing Method used for text analysis.
4 Partial Parsing Method
The main task of the parser is to assign to each sentence its meaning, i.e. a
sequence: the identifier(s) of a class $j$ and a predicate expression $w_k(x)$. The predicate
expression describes pairs of attribute identifiers and value identifiers. The
pairs are connected by various conjunctions. If the sentence does not
contain rule-like information, the empty expression is assigned as its meaning.
To make the parsing relatively fast, it is limited to the phrases that seem significant for the expected meaning of the sentence. Each sentence is
scanned for words from the semantic dictionary. On the basis of the sublanguage category of the processed word, the appropriate grammar rules (with the appropriate
sublanguage head category) are activated to reconstruct the phrase following the
word. The parsing is limited to the group of words of the same sublanguage
category. This process is illustrated by the following example. Consider a
typical sentence:
U dzieci wczesnym objawem przewlekłej niewydolności nerek może być podwyższone ciśnienie krwi.¹
After the assignment of categories we obtain (X marks a word not found in
the semantic dictionary and therefore ignored by the parser):
X[u] X[dzieci] X[wczesnym] X[objawem] CN_E[przewlekłej] CN_E[niewydolności] CN_E[nerek]
X[może] X[być] VN_E[podwyższone] AN_E[ciśnienie] AN_E[krwi]
where CN_E, VN_E and AN_E are sublanguage categories meaning class
name element, value name element and attribute name element, respectively.
Next, applying partial syntactic parsing, we obtain:
X[u] X[dzieci] X[wczesnym] X[objawem] CN_NP[przewlekłej niewydolności nerek]
X[może] X[być] VN_ADJ[podwyższone] AN_NP[ciśnienie krwi]
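The two steps of the example (category assignment, then collecting neighbouring words of one category) can be sketched as follows. This is a simplification under stated assumptions: `SEMANTIC_DICT` is a toy dictionary holding only the word forms of the example, category labels are written with underscores, and adjacency grouping stands in for the real grammar-driven partial parser, which also checks syntactic agreement.

```python
from itertools import groupby

# Toy semantic dictionary: word form -> sublanguage category (assumed entries).
SEMANTIC_DICT = {
    "przewlekłej": "CN_E", "niewydolności": "CN_E", "nerek": "CN_E",
    "podwyższone": "VN_E",
    "ciśnienie": "AN_E", "krwi": "AN_E",
}

def tag(sentence):
    """Assign a sublanguage category to each word; X marks unknown words."""
    words = sentence.rstrip(".").split()
    return [(SEMANTIC_DICT.get(w.lower(), "X"), w) for w in words]

def group_phrases(tagged):
    """Collect adjacent words of the same category into one phrase, dropping X."""
    return [
        (cat, " ".join(word for _, word in run))
        for cat, run in groupby(tagged, key=lambda pair: pair[0])
        if cat != "X"
    ]

sentence = ("U dzieci wczesnym objawem przewlekłej niewydolności nerek "
            "może być podwyższone ciśnienie krwi.")
phrases = group_phrases(tag(sentence))
print(phrases)
```

The result lists one class-name phrase, one value phrase and one attribute-name phrase, in the order in which they occur in the sentence.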
When all words of the same category have been collected into a phrase, a
meaning must be assigned to the phrase. From each phrase a unique semantic
key for the semantic dictionary must be generated. Because of the free word order there
can be many order variants and derivation trees for the same name, e.g. cisnienie
krwi and krwi cisnienie mean the same. The unique key is generated by
means of the normal derivation tree mechanism. The grammar of the parser contains
all the necessary rules, but from each set of similar rules one is
arbitrarily chosen as the normal one. The key is produced as a concatenation (with
spaces between words) of the leaves of the derivation tree read from left to right.
The strings stored in the leaves are not identical with the words of the processed
sentence but represent the basic morphological form of each word (together with
stored information about the values of morphological attributes). For instance, for
two phrases that are order variants of each other, przewlekla niewydolnosc nerek
and przewlekla nerek niewydolnosc (there are more possible variants),
one unique semantic key is generated: przewlekly niewydolnosc nerka.
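The effect of key generation (one key for all order variants, built from base forms) can be imitated with a much simpler device than a normal derivation tree. The sketch below is a stand-in technique, not the parser's mechanism: a hypothetical `LEXICON` maps each word form to its base form and a role in the phrase, and a fixed role order (adjective, head noun, genitive complement) plays the part of the arbitrarily chosen normal rule.

```python
# Toy lexicon: surface form -> (base form, role in the name phrase).
# Entries and role labels are assumptions made for this illustration.
LEXICON = {
    "przewlekla": ("przewlekly", "adj"),
    "przewleklej": ("przewlekly", "adj"),
    "niewydolnosc": ("niewydolnosc", "head"),
    "niewydolnosci": ("niewydolnosc", "head"),
    "nerek": ("nerka", "gen"),
}

# Fixed "normal" order of roles, replacing the normal derivation tree.
ROLE_ORDER = {"adj": 0, "head": 1, "gen": 2}

def semantic_key(phrase):
    """Return base forms of the phrase words in the normal order."""
    entries = [LEXICON[word] for word in phrase.lower().split()]
    entries.sort(key=lambda entry: ROLE_ORDER[entry[1]])
    return " ".join(base for base, _ in entries)

print(semantic_key("przewlekla niewydolnosc nerek"))
print(semantic_key("przewlekla nerek niewydolnosc"))
```

Both calls yield the same key, so a single dictionary entry serves every order variant of the name.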
Semantic keys are identical with the names used in the domain specification. If the
semantic key generated for a phrase is found in the semantic dictionary, a meaning
is assigned to the phrase, i.e. a formula of Lambda Calculus (LC) including the
semantic key.
Semantic keys are also the basis for synonym recognition. A special key-to-key translation table is established, and because synonyms are stored together
with their normal derivation trees, including variables for some values of morphological attributes, it is possible to exchange phrases in the derivation trees of
sentences.
Starting from the level of name phrases, the grammar used in PPM becomes
compositional, i.e. there is a semantic rule for each syntactic rule. Most semantic
rules are just simple functional applications based on LC. For instance, continuing
the last example and considering the value-attribute pair, there is a syntactic rule
(written in DCG format):
VA_NP(Cs, Num,...) = VN_ADJ(Cs, Num,...) AN_NP(Cs, Num,...)
where the meanings assigned on the basis of the semantic dictionary are:
AN_NP[cisnienie krwi] ⇒ λP.P(cisnienie krwi)
VN_ADJ[podwyzszone] ⇒ λN.[let(podwyzszony, N)]
and the semantic rule is just a functional application. After applying the
semantic rule we obtain:
(λP.P(cisnienie krwi))(λN.[let(podwyzszony, N)])
= let(podwyzszony, cisnienie krwi)
¹ Eng.: In children, an early symptom of chronic renal insufficiency can
be elevated blood pressure.
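The functional application in the example above can be mimicked with ordinary closures. This is a sketch in Python rather than the Prolog-based LC implementation; the tuple encoding of `let(...)` and the helper name `apply_rule` are assumptions made for illustration.

```python
# Meanings from the semantic dictionary, written as Python closures.
an_np = lambda P: P("cisnienie krwi")          # λP.P(cisnienie krwi)
vn_adj = lambda N: ("let", "podwyzszony", N)   # λN.let(podwyzszony, N)

def apply_rule(attribute_meaning, value_meaning):
    """Semantic rule of the value-attribute phrase: functional application."""
    return attribute_meaning(value_meaning)

result = apply_rule(an_np, vn_adj)
print(result)  # ('let', 'podwyzszony', 'cisnienie krwi')
```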
Each semantic rule includes conditions which must be fulfilled by its arguments to make the rule applicable, e.g. a value must belong to the set of possible
values of the given attribute. The conditions mostly concern information stored
in the domain specification, e.g. some ambiguities in conjunction constructions
can be resolved on the basis of the specification of attribute domains. For example, in the phrase znieksztalcone [krwinki i temperatura] (eng. deformed [blood
cells and temperature]), information from the domain specification allows the
association of the attribute temperatura with the value znieksztalcony to be rejected.
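The domain condition on a conjunction can be illustrated as follows. The sketch assumes a hypothetical `DOMAINS` table whose value sets are invented for the example; only the check itself (a value may attach to an attribute only if it belongs to that attribute's domain) comes from the text.

```python
# Hypothetical attribute domains; the value sets are invented for illustration.
DOMAINS = {
    "krwinki": {"znieksztalcony", "w_normie"},
    "temperatura": {"podwyzszony", "obnizony", "w_normie"},
}

def attach_value(value, attributes, domains=DOMAINS):
    """Try to attach one value to each conjoined attribute.

    Returns the accepted let-terms and the attributes for which the
    association was rejected by the domain condition.
    """
    accepted, rejected = [], []
    for attr in attributes:
        if value in domains.get(attr, set()):
            accepted.append(("let", value, attr))
        else:
            rejected.append(attr)
    return accepted, rejected

accepted, rejected = attach_value("znieksztalcony", ["krwinki", "temperatura"])
print(accepted)  # [('let', 'znieksztalcony', 'krwinki')]
print(rejected)  # ['temperatura']
```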
Finally, as the result of semantic analysis, a semantic representation of the
processed sentence is generated, e.g.
let( podwyzszony, cisnienie_krwi),
cl:przewlekla_niewydolnosc_nerek
and next it is transformed into a draft rule, e.g.
if (cisnienie_krwi = podwyzszony)
then przewlekla_niewydolnosc_nerek probability in <..>
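The last transformation is essentially a rendering step. A minimal sketch, assuming let-terms are encoded as tuples and that several premises would be joined with `and` (the text does not specify how multiple pairs are rendered); the function name `draft_rule` is hypothetical.

```python
def draft_rule(let_terms, diagnosis):
    """Render a draft rule; the probability interval is left unfilled."""
    # Joining several premises with "and" is an assumption of this sketch.
    premise = " and ".join(f"{attr} = {value}" for _, value, attr in let_terms)
    return f"if ({premise})\nthen {diagnosis} probability in <..>"

text = draft_rule([("let", "podwyzszony", "cisnienie_krwi")],
                  "przewlekla_niewydolnosc_nerek")
print(text)
```

The probability interval stays as `<..>` because it is supplied later by the knowledge engineer.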
Draft rules are presented to the knowledge engineer (KE) together with the initial
sentence and can be accepted, modified or rejected. The KE must assign
the appropriate probability values to each draft rule.
5 Implementation
The architecture of the system includes the following modules: Morphological Preanalyser, Partial Syntactic Parser (PSP), Semantic Analyser (SA) and Draft Rules
Generator (DRG). The modules use the following dictionaries: General Syntactic Dictionary, Temporary Syntactic Dictionary, Semantic Dictionary and
Domain Specification, stored as data in the system. We assumed the maximum
possible use of existing Polish language resources. This assumption strongly
influenced the construction of the system, especially the choice of Prolog as the
main implementation language (for the modules PSP, SA and DRG). Prolog was
chosen because the largest existing formal description of Polish grammar is written
in DCG format [6]. That is why the partial parser, as well as the LC implementation,
is based on classical methods [3].
There is no large electronic syntactic dictionary of Polish in a ready-to-use
format. The only possible source is the morphological analyser SAM-95 [4], which unfortunately
produces complicated output. SAM-95 was used to produce a prototype of the
General Syntactic Dictionary (GSD) on the basis of the corpus. The dictionary
was implemented as finite-state automata using software prepared by Jan Daciuk
and described in [1].
The efficiency of the parser was improved by morphological preanalysis² [7]
and switches³ [7].
The User Interface (UI) is written in C++ and runs under Windows
NT. Communication between the UI and the text processing module is established
by means of the DDE (Dynamic Data Exchange) mechanism.
6 Further Development of the System
PPM shows promising processing speed and accuracy. Its most serious limitation is that it does not work well for compound sentences. However, the technique of templates, currently being developed, shows the possibility
of extending PPM to compound sentences as well. The application of the compositionality paradigm as a basis for the parser proved to be very successful. We
obtained a clear system construction that is easy to maintain. Still, the problem
of assigning a posteriori probability values on the basis of the input sentence remains a
big challenge.
References
1. Daciuk J., Watson B., Watson R.: Incremental Construction of Minimal Acyclic
Finite State Automata and Transducers. In: Proceedings of Finite State Methods
in Natural Language Processing, Bilkent University, Ankara, Turkey, 1998.
2. Huzar Z., Kurzyński M., Sas J.: Rule-Based Pattern Recognition With Learning.
Wrocław University of Tech. Press, Wrocław, 1994.
3. Pereira F.C.N., Shieber S.M.: PROLOG and Natural-Language Analysis. CSLI,
Stanford, 1987.
4. Szafran K.: Analizator morfologiczny SAM-95: opis użytkowy. Technical Report TR
96-05 of Computer Science Institute of Warsaw University, Warsaw, May 1996.
5. Sager N., Friedman C., Lyman M.S.: Medical Language Processing: Computer Management of Narrative Data. Addison-Wesley, 1987.
6. Świdziński M.: Gramatyka formalna języka polskiego. In: Rozprawy Uniwersytetu
Warszawskiego. Warsaw University Press, 1992.
7. Vetulani Z.: POLINT - system automatycznej interpretacji pytań w języku polskim
i jego realizacja w PROLOGU. In: Eufonia i Logos. ed. Pogonowski J., UAM Press,
Poznań, 1995.
² Before parsing, a temporary dictionary including only the forms of words found in the
sentence is created.
³ Dynamic cutting of some branches of inference.