Challenges solved and unsolved in Fact Extraction from Natural Language

David Mott
Emerging Technology Services, IBM United Kingdom Ltd
Hursley Park, Winchester, UK

Stephen Poteet, Ping Xue, Anne Kao
Boeing Research & Technology
Seattle, WA, US

Ann Copestake
University of Cambridge, UK

Cheryl Giammanco
Human Research & Engineering, US Army Research Laboratory
Aberdeen Proving Ground, MD, USA
Abstract— Information from unstructured sources is key to
human-machine cognitive tasks, but requires Natural Language
fact extraction, together with a reasoning capability that allows
the user to express assumptions and rules to infer high value
information, all based upon a domain conceptual model. We
describe research in the use of Controlled English (CE) for fact
extraction and reasoning, applied to a complex problem-solving
task, the ELICIT identification task, which has posed many
challenges.
Our research has integrated the DELPH-IN English Resource
Grammar (ERG) to extract detailed linguistic information based
upon a deep parsing of sentences, used a CE domain model to
guide transformation of the linguistic information to domain
facts in a principled way, extended CE syntax to allow meta-reasoning for dynamic rule extraction from sentences, applied
assumptions to handle NL ambiguities and track sources of
uncertainty such as sentence interpretations and linguistic
expressions of uncertainty, and solved a component of the
ELICIT task.
However, significant challenges remain in the search for a
general solution to the application of domain knowledge to fact
extraction. Even ELICIT sentences show considerable
uncertainty and ambiguity that require a detailed understanding
of the domain to overcome, and we have therefore started to
analyse the specific relationships between domain knowledge and
the resolution of ambiguities, in order to make further progress.
I. INTRODUCTION
This paper reports work under the International Technology
Alliance (ITA) [1] on supporting collaborations of human and
machine in the execution of cognitive problem-solving tasks,
such as that faced by analysts when providing high-value
information from a variety of sources, including unstructured
textual reports, as well as more structured sources such as
databases and spreadsheets. These are complex cognitive tasks
that require the making of assumptions and reasoning based
upon a "conceptual model" of the domain in which the analysis
is taking place. The performance of these tasks requires support for extracting facts from sentences, querying, inference, handling of uncertainty, making hypotheses and understanding the rationale for conclusions reached. We are researching how to provide such support to users, based on NL processing, use of the human-readable language ITA Controlled English (CE) and a reasoning system. A more
detailed version of parts of this paper is given in [2].
CE [3] is a Controlled Natural Language, a subset of
English, that is both human-readable and machine parseable,
suitable for the expression of domain knowledge, concepts and
reasoning. It is relatively easy for human analysts to use, but it
also has a formal interpretation that is sufficiently
unambiguous that a computer can interpret the input of the
domain analysts and use it to perform inferencing. Central to
the use of CE is a conceptual domain model, a structure that
holds all of the users' knowledge (concepts, relationships,
logical inferences, constraints, and assumptions) of the domain
in which the reasoning and problem solving is to be undertaken.
For example, the domain model for an intelligence analyst
might include the concepts of mission, enemy, terrain and
weather, troops and support available, time available, and civil
considerations (METT-TC). The analyst’s problem solving
strategies may also be represented in CE as ways of reasoning
and of making assumptions, such assumptions varying on the
level of expertise of the analyst and the domain of analysis.
The resulting reasoning can be tracked through the rationale,
showing how conclusions are dependent upon givens and
assumptions.
Our fact extraction combines a deep linguistic parsing
system (the ERG [8]) that generates detailed linguistic
information together with CE modelling and reasoning to map
the linguistic information onto domain facts expressed in CE.
Deep parsing can provide context to support more precise
application of the concepts and rules in the domain model to
identify ambiguities and uncertainties in sentence interpretation.
For example, a domain model that includes concepts of
“human intelligence, counter intelligence, signals intelligence,
and imagery intelligence” expressed in CE provides clues as to
the state of the world of an intelligence analyst in which
“intelligence” is an ambiguous term for multiple concepts with
various meanings.
However, there are many challenges in fact extraction from
NL sentences, and some remain unsolved within the scientific
community. Accurate extraction of the detail contained in a NL
sentence requires a deep analysis of the often complex
syntactic structure of the sentence, and a construction of the
meaning (or semantics) of the sentence from the meaning of its
syntactic component parts. There can be ambiguities in the
structure or meaning of sentences, or their components,
including individual words, which must be resolved in order to
make sense of the whole sentence. Disambiguation may be
impossible without background knowledge of the domain in
which the sentences occur, and this background knowledge
may have to be extensive. Even the relatively "simple"
sentences used in the task described below exhibit such
problems. To turn sentences into CE facts, it is necessary to
transform the sentence meaning into the particular concepts
used in the user's CE domain model. This paper describes our
research into these challenges, how some have been solved and
some of the work needed in order to solve others.
II. THE ELICIT TASK
The Experimental Laboratory for Investigating Collaboration, Information-sharing, and Trust (ELICIT) [4]
has devised the ELICIT framework for researching into how
organisational structures and communication patterns affect
human collaborative solving of problems requiring reasoning
and interpretation of facts. This framework contains the
"ELICIT task" that involves the identification of key aspects of
a (simulated) planned terrorist attack, by interpretation of a set
of simple English sentences (or factoids). The key aspects are
"who" is going to perform the attack, "what" will be attacked,
"when" and "where" the attack will take place. The ELICIT
task requires a domain model and reasoning steps similar to
the domain knowledge and cognitive processes expressed by
intelligence analysts when reasoning about attacks. It provides
a suitable problem on which to apply our research for the
extraction of facts from sentences and the subsequent
reasoning about the facts in order to perform the identification.
However, analysis of the ELICIT sentences [5] discovered
that there were significant ambiguities as to their interpretation
(and no specific contextual background is given to the
participants which might help disambiguation). An example is
the word "in" in "Dignitaries in Epsilonland employ private
guards" which could be interpreted as "belonging to" or
"located", with the decision affecting subsequent reasoning.
Significantly, some disambiguation could not occur without
general "commonsense" reasoning about the world, something
that is notoriously difficult to achieve by computer, so it was
decided at this stage to simplify the sentences by removing
ambiguities that would require common sense reasoning. Even
though this requires human intervention, it is of benefit as the
problem solving (which is itself complex) from the extracted
facts can be achieved automatically. It was also decided to
focus on identifying "who" was performing the attack.
To use CE to extract facts and perform reasoning we
developed a conceptual model ([6]) of the domain, extending
previous ITA models and including:
- agents, operatives, groups of operatives
- financial institutions, visiting dignitaries, embassies
- time intervals, daytime, nighttime
- attack situations, participants, non-participants, targets
- working relationships (works with, cannot work with)
It was also necessary to define a problem solving strategy that would guide the CE reasoning system in performing the "who" identification. This strategy was determined to be a process of elimination: by using reasoning and facts to eliminate possible participants in the attack, the one remaining had to be the actual participant. By using the CE domain model and problem solving strategy the system is able to infer the participants from the simplified NL sentences, as summarised in the figure below:
The figure shows the sentences (black text) and the flow of
reasoning through the rules shown as round rectangles, to the
conclusion (bottom right). One participant is the Lion, based
upon the sentence "the Lion is involved". Another is the Violet
group, inferred by the process of elimination; a rule detects that
all other possible participants have been eliminated, leaving the
Violet group as the only one remaining. This rule is generated
dynamically (by rule-writing rules) from the ELICIT sentences
that indicate the set of possible (group) participants, together
with a user assumption that the world of possible (group)
participants is "closed", in that there are no more possibilities.
This assumption must be a judgement of the user, since it is not
explicitly stated in the sentences.
The figure shows how each possible group, apart from the
Violet group, was eliminated. The reasoning pathways include:
- because the group is not one of those in "the Lion only works with the Azuregroup and the Browngroup and the Violetgroup". This sentence is turned into a rule dynamically and applied to all the groups.
- because the group is directly stated as being operational
- because the group operates at a different time of day to a known participant (the Lion), as specified in a sentence such as "the Azuregroup operates in the nighttime".
- because the group is recruiting locals and "the Lion does not work with locals".
The following section describes how the facts were extracted,
and used to generate these reasoning pathways.
III. NL PROCESSING AND FACT EXTRACTION
As described in [7], our NL processing research utilises
DELPH-IN linguistic resources, the English Resource
Grammar (ERG) [8] to perform a deep parse of the sentences
and Minimal Recursion Semantics (MRS) [9] to represent the
extracted linguistic semantics. A key task of our research is to
transform the linguistic semantics into CE facts, expressed in
the domain semantics of the CE conceptual model, as shown in
the figure and the steps below:
Step 1: English text is sent to the ERG for parsing,
resulting in the output of MRS "predicates" representing the
linguistic semantic information in the text. These predicates are
converted into CE; for example, the sentence "John chases the cat" is turned into "the mrs elementary predication #ep2 is an instance of the mrs predicate '_chase_v_1_rel' and has the situation e3 as zeroth argument and has the thing x7 as first argument and has the thing x9 as second argument.". This CE
requires an understanding of the linguistic processing to be
readable, and it is intended that it be shown only to linguistic
specialists rather than end users. It is also in a form that is
available to further processing by the CE reasoning
components. A tabular form of the CE can also be derived, in which there is a column for the thing x7 that is named "John",
the first argument of the predicate "_chase_v_1_rel",
expressing the situation (e3) of "chasing". There is also a
column for x9 that is the first argument of the predicate
"_cat_n_1_rel", expressing that x9 is a "cat".
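The MRS output of Step 1 can be pictured with a small data-structure sketch. This is simplified: the quantifier and naming predicates shown are illustrative, and real ERG output carries much more structure (handles, scopes, features) than the fields used here.

```python
from dataclasses import dataclass, field

# A minimal stand-in for an MRS elementary predication, with only the
# fields needed for the example "John chases the cat".
@dataclass
class ElementaryPredication:
    predicate: str                            # e.g. "_chase_v_1_rel"
    args: dict = field(default_factory=dict)  # role label -> variable/value

eps = [
    ElementaryPredication("proper_q_rel", {"ARG0": "x7"}),
    ElementaryPredication("named_rel", {"ARG0": "x7", "CARG": "John"}),
    ElementaryPredication("_chase_v_1_rel",
                          {"ARG0": "e3", "ARG1": "x7", "ARG2": "x9"}),
    ElementaryPredication("_the_q_rel", {"ARG0": "x9"}),
    ElementaryPredication("_cat_n_1_rel", {"ARG0": "x9"}),
]

# Render the verbal predication in the CE style quoted in the text.
chase = next(ep for ep in eps if ep.predicate == "_chase_v_1_rel")
ce = (f"the mrs elementary predication #ep2 is an instance of the mrs "
      f"predicate '{chase.predicate}' and has the situation {chase.args['ARG0']} "
      f"as zeroth argument and has the thing {chase.args['ARG1']} as first "
      f"argument and has the thing {chase.args['ARG2']} as second argument.")
```
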
Step 2: Some MRS is "linguistically nuanced", containing
information about how the sentences were constructed rather
than about what real world situations were being described,
and it is therefore desirable to abstract the MRS into a more generic form. One such form is "intermediate MRS", which includes "quantifications" on things (indicating whether a thing is definite, or indefinite, or a group). Another key abstraction is the concept
of a situation, or state of affairs, that may have attributes (time
and place), relationships (causal, temporal), and roles that
entities play in these situations. For example, "John chases the
cat" could be expressed as the CE "there is a situation e3 that
has the thing x7 as first role and has the thing x9 as second
role". This generic semantic representation provides an easier
starting point for the transformation into full domain semantics.
Other linguistic phenomena, such as adjectives, modal verbs
("may be") and negations require different processing leading
to differing conceptualizations, as exemplified below.
Step 3: The generic semantics is transformed into CE facts
that conform to the specific domain semantics of the user's
conceptual model, which might include such concepts as
people, places, attacks, targets etc. Various sources of
knowledge (as CE background facts) may be used to guide this
transformation: CE facts may express how words (more
precisely MRS predicates) are mapped to concepts in the
domain model and can list known entities with their types
(such as the person John1). In this way the situation and
entities involved (such as x7) can be given names and types. In addition, situations involving one or two entities can be mapped to more readable expressions, such as the domain-specific CE "the person John1 chases the cat x9".
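Steps 2 and 3 amount to a table-driven mapping from linguistic predicates to domain concepts, which might be sketched as follows. The lexicon entries here are illustrative stand-ins for the CE background facts described in the text.

```python
# Illustrative lexicon: MRS predicates mapped to domain concepts, and
# known reference entities with their types (cf. "the person John1").
predicate_to_entity_concept = {"_cat_n_1_rel": "cat"}
predicate_to_situation_concept = {"_chase_v_1_rel": "chases"}
reference_entities = {"John": ("person", "John1")}

def to_domain_ce(verb_pred, first_name, second_pred, second_var):
    """Render a two-role situation as a readable domain-specific CE fact."""
    relation = predicate_to_situation_concept[verb_pred]
    first_type, first_id = reference_entities[first_name]
    second_concept = predicate_to_entity_concept[second_pred]
    return f"the {first_type} {first_id} {relation} the {second_concept} {second_var}"

fact = to_domain_ce("_chase_v_1_rel", "John", "_cat_n_1_rel", "x9")
# Yields the readable fact quoted in the text.
```
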
Step 4: The CE facts may then be used to perform domain
specific reasoning, leading to inference of high valued
information of use to the analysts, based upon inference rules
written by the users, as exemplified by the inference of the
ELICIT participants in the previous section.
The processing of an ELICIT simplified sentence "The
Azuregroup may be a participant" into the CE fact "the group
Azuregroup is a possible participant" is shown below:
This involves the analysis of a modal sentence ("may be"),
following a similar line to that already described, but requires
further information about the modal CE concept to be used
("possible participant"). The figure shows the processing from
the MRS predicates (top left) through several rules (rectangles)
to the conclusion (bottom right). There are two pathways, one
detecting a "modal situation", and one matching the linguistic
information about "the Azuregroup" to a set of (known)
reference entities. These pathways converge on the final rule
that matches the situation to the correct modal CE concept. The
CE rule dealing with the recognition of a reference entity is:
[ nn_lookup ]
if ( the thing T has the value W as common name ) and
( it is false that the thing T is a reference entity ) and
( there is a reference entity named REF that
has the value W as common name )
then
( the thing REF is the same as the thing T ).
which matches the common name of the thing x7 (from the
string "Azuregroup" in the MRS) with an existing reference
entity defined in CE (the group Azuregroup), and concludes
that they are the same thing, thus inferring the "group
Azuregroup" is the role-player in the situation.
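The [ nn_lookup ] rule has a direct procedural analogue, sketched below. This stands in for the CE engine only to show the matching step; the dictionary of reference entities plays the role of the user-provided CE facts.

```python
# Known reference entities, keyed by common name (a stand-in for the CE
# facts such as "the group Azuregroup has 'Azuregroup' as common name").
reference_entities = {"Azuregroup": "the group Azuregroup"}

def nn_lookup(things):
    """things: variable -> common name, for things not yet identified.
    Returns a mapping from each variable to the matched reference entity,
    mirroring the conclusion "the thing REF is the same as the thing T"."""
    same_as = {}
    for var, common_name in things.items():
        if common_name in reference_entities:
            same_as[var] = reference_entities[common_name]
    return same_as

links = nn_lookup({"x7": "Azuregroup"})
```
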
The details of the CE processing inside the boxes are not of
importance to the end user, but there are three types of CE fact
that must be provided by the user for this reasoning to work,
shown as the CE facts in blue boxes. These are the reference
entities ("the group Azuregroup has 'Azuregroup' as common
name and is a reference entity"), the mapping of MRS
predicate to entity concept ("the mrs predicate '_participant_n_1_rel' expresses the entity concept 'participant'") and the semantic relation between the basic and
modal version of the concept ("the entity concept 'participant'
is modal to the entity concept 'possible participant'"). Thus this
is a "black box" that handles sentences of a certain linguistic
type, configured by a set of CE facts provided by the user.
In trying to analyse the ELICIT sentences it was discovered that, even though they are simplified, there are linguistic "puzzles" to be solved, including the handling of negated and
modal situations (noted above), and the handling of generic
entities such as "daylight" and "locals" which could not be
simply mapped into specific entities. For example, in "the
Browngroup is recruiting locals" and "the Lion does not work
with locals", the word "locals" indicates some unspecified
group of people, but there is no indication that the group of
locals is the same in the two sentences. However in the context
of the ELICIT task, it seems plausible that the "locals" are
intended in some sense to be the same thing, and that the two
sentences operate to rule out the Lion working with the
Browngroup. Thus there is a conflict between the views of
"different instances" and "the same thing" in the interpretation
of "locals", but a specific interpretation must be made in order
that the facts can be extracted and used. Such challenges
correspond to known linguistic phenomena, some of which are
still subject to research, but an advantage of using CE is that
there is a computational but readable explicit specification of
the linguistic theory which can be reviewed and debated.
IV. META-LOGIC, AND RULE EXTRACTION
In these transformations, significant use was made of the ability of CE to undertake meta-reasoning, where CE facts define explicit information about concepts themselves, and where rules can reason with this information. Simple examples have already been given: in "the mrs predicate '_participant_n_1_rel' expresses the entity concept 'participant'" a link is defined between a predicate (in effect a word) and a concept; in "the entity concept 'participant' is modal to the entity concept 'possible participant'" a relationship between two concepts is given. However there are more complex examples of the need for meta-logic, in the creation of rules that write rules and in the attempt to generalise sets of rules.
One significant challenge in interpreting the ELICIT sentences was that some sentences express rules of habitual behaviour rather than facts (e.g. "the Lion only works with the Bluegroup and the Azuregroup and the Violetgroup"). Our interpretation of this rule, in the context of the reasoning by elimination strategy, is that if a group is not a member of the set (Blue, Azure, Violet) then the Lion cannot work with that group. This can be stated more formally as the CE rule:
if ( there is an agent named Lion ) and
( the agent A is different to the agent Bluegroup and
is different to the Azuregroup and
is different to the Violetgroup )
then
( the agent Lion cannot work with the agent A ).
Since all of the linguistic analysis is performed by CE, it was necessary to extend the CE syntax to allow meta-reasoning so that rules could themselves construct other rules. This was achieved by creating a new type of CE object, a "logical inference", that represents a rule and has complete CE statements (with variables) as premises and conclusions. Such logical inferences can be contained in the premises and conclusions of "rule-writing rules"; see [2] for more details. Once these rules have been constructed automatically, they may be added to the domain model rules in order to be involved in the domain reasoning.
V. SOLVING THE ELICIT "WHO" IDENTIFICATION TASK
The summary reasoning figure above showed how facts and rules extracted from sentences, together with the domain model and problem solving strategy, were used to automatically identify the participants in the attack. The figure below shows some of this reasoning in more detail, following the general layout of the summary figure above, but replacing some of the rule rectangles by "proof tables" that show detailed rule applications, with rows for premises and a row for the conclusion at the bottom.
A full description is given in [2]; here we focus on a few examples. In the top middle, reasoning about working at incompatible times of day is executed by the domain rule:
[ no_time_overlap ]
if ( the operative A operates in the time interval TA ) and
( the group B operates in the time interval TB ) and
( the time interval TA does not overlap the time interval TB )
then
( the operative A cannot work with the group B ).
and the top proof table shows this being applied between the operative Lion and the Azuregroup.
At the right side of the figure is the identification of "Violetgroup" as a participant, effected in several steps. At the bottom right, an "elimination rule" is used to infer that the Violetgroup is a participant on the basis that all other participants are ruled out. The proof table at the bottom shows this reasoning, and also indicates that the conclusion is derived from an assumption (ass321) about the closed nature of the set of participants. This rule is itself generated automatically by a set of "rule-writing rules" driven from the making of the closed world assumption shown in the top right. In [2] we show the effects of making this assumption prematurely, before receipt of all the relevant sentences, which leads to an inconsistency as all possible candidates are ruled out; the rationale shows the inconsistency being caused by the closed world assumption, indicating that it would need to be unmade.
The result of the fact extraction and problem solving undertaken by the CE system is the inference of the following CE sentence:
the attack situation Elicitattack involves the operative Lion
and involves the group Violetgroup.
which constitutes a solution to the question of "who" is participating in the attack; the rationale for both participants is available for visualization and examination by the user. This solution is based upon the assumption that the world is closed in regard of the entity concept "possible participant". If further processing with this information (such as an attempt to solve "what" the target is) were to encounter an inconsistency, then this assumption would have to be considered suspect and be revoked.
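The closed-world elimination with assumption tracking might be sketched as follows. This is a simplified illustration, not the CE implementation; the assumption label "ass321" follows the text, and the revocation behaviour is modelled by raising an error.

```python
# Sketch of assumption-tracked elimination: each conclusion records the
# assumptions it depends on, so a later inconsistency can point back at
# the closed-world assumption that produced it.
def infer_participant(candidates, eliminated, assumption="ass321"):
    remaining = [g for g in candidates if g not in eliminated]
    if len(remaining) == 1:
        return remaining[0], {assumption}   # conclusion plus its support
    if not remaining:
        # All candidates ruled out: the closed-world assumption is suspect
        # and would need to be unmade.
        raise ValueError(f"inconsistency: revoke assumption {assumption}")
    return None, set()                      # not yet determined

who, support = infer_participant(
    ["Azuregroup", "Browngroup", "Violetgroup"],
    eliminated={"Azuregroup", "Browngroup"})
```
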
VI. FURTHER CHALLENGES
The approach described has made progress in solving some of the challenges of NL processing. However there are other significant challenges, and this section outlines some of these and how we approach their solution.
A. Generalisation
It is important that CE processing is not tailored completely to the set of sentences used for analysis. It is too easy to construct "one rule per sentence", resulting in a system that is non-generic, and of little use. We attempt to avoid this problem by generalisation.
Firstly, the CE reasoning that leads from MRS output to extracted facts uses concepts that have been generalised from specific cases. For example, "intermediate mrs" provides a more generic view of the MRS output, and "situations" provide an abstract view of the semantics expressed in sentences. Such generalisations allow reuse across a wider range of sentences.
Secondly, meta-logic is used extensively in the reasoning, where rules operate on statements about the concepts and their relationships themselves. Meta-logic is used for mapping between the "world of words" (as expressed in MRS predicates) and the "world of concepts", where it is necessary to reason about how the words express the existence of the concepts. Meta-logic is also used to allow generalisation of rules covering specific concepts into rules that can handle all concepts of a particular type. For example, a specific rule that infers that B is married to A given that A is married to B can be generalised by meta-logic to state that all relations defined as "symmetric" can be handled in this way. This requires a design of the generalisation (i.e. that there is a concept of "symmetric"), a rewrite of the rule using meta-statements about relations and their (pair-wise) instances, and the writing of CE statements indicating which relations are symmetric.
This is shown in the figure below, where the specific "is married to" rule on the left is generalised to the rule on the right that operates on all relations that are defined as "symmetric". These definitions are provided by the user as CE meta-facts of the form "the relation concept EC is a symmetric concept". Each specific statement about the "is married to" relation is generalised to a CE meta-fact of the form "there is a relation named R that has the sequence ( the thing X , and the thing Y ) as realisation.", which states that the ordered pair of things X and Y are instances of (or "realise") the relation R. Since R is a variable here, it is possible to state facts in general terms about any relation R. In the generalised rule the premise and conclusion match against all pairs of things P1 and P2 that realise a relation RC (as long as it is symmetric) and essentially infer that the relation holds when the order of the pair is reversed to P2 and P1. As the rule stands, it operates independently of the types of the things P1 and P2, but it is possible (and sometimes necessary) to take account of the range and domain of the relationships, and there are CE meta-statements that can express this information.
A further approach to generalisation planned to be undertaken is the use of the "MRS test suite" defined by DELPH-IN. This is a set of example sentences that covers all of the main linguistic phenomena handled by the ERG, and our aim is to ensure that each linguistic phenomenon is covered by generic CE reasoning, providing coverage of most sentences in a principled way.
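The symmetric-relation generalisation can be illustrated with a small sketch. The relation instances and the names Anne and Bert are hypothetical; the `symmetric` set plays the role of the user's CE meta-facts declaring which relation concepts are symmetric.

```python
# One meta-rule closes every relation declared symmetric, instead of one
# hand-written rule per relation such as "is married to".
symmetric = {"is married to", "works with"}

# Realisations: (relation, (X, Y)) pairs, mirroring the CE meta-fact
# "there is a relation named R that has the sequence (X, Y) as realisation".
realisations = {("is married to", ("Anne", "Bert"))}

def apply_symmetry(realisations):
    inferred = set()
    for relation, (x, y) in realisations:
        if relation in symmetric:
            inferred.add((relation, (y, x)))  # the reversed pair also holds
    return inferred

new = apply_symmetry(realisations)
```
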
B. Linking words and predicates to concepts
Our approach relies upon the mapping between words (or
more specifically MRS predicates) and concepts in the CE
domain model. Currently these mappings need to be
constructed by hand, which is a potential bottleneck. Previous
ITA work started to explore the use of linguistic resources,
such as WordNet [10], to suggest suitable concepts for
unknown words to the user. It is now proposed [11] that these
links be made by distributional semantics, which uses a
database of MRS parses of a large corpus of text, and seeks
matches between unknown words and this corpus, in the
linguistic context defined by the sentence parse.
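As an illustration of the distributional idea only (not the proposed MRS-based system), a toy sketch with hand-made co-occurrence vectors: an unknown use of a word is matched to the known sense whose context vector is most similar.

```python
import math

# Toy co-occurrence count vectors for two senses of "tank"; a real system
# would derive these from MRS parses of a large corpus.
vectors = {
    "tank_vehicle":   {"drive": 4, "armour": 3, "water": 0},
    "tank_container": {"drive": 0, "armour": 0, "water": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Context counts for an unknown occurrence of "tank".
context = {"drive": 1, "armour": 1}
best = max(vectors, key=lambda sense: cosine(vectors[sense], context))
```
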
C. Handling Uncertainty and Ambiguity
A significant issue is the handling of ambiguities and
uncertainties, which may arise from a number of different
sources, including ambiguous interpretations of words (such as
"tank") and sentences (such as "dignitaries in Epsilonland") or
incomplete knowledge about the context (such as the complete
set of possible participants). As reported in [12] we use
assumptions and numerical certainty values to represent CE
facts that are uncertain, and apply assumption-based reasoning
to infer new information, detect inconsistencies and label
different sources of uncertainty to the user in the rationale. An
example already described above is the assumption that the set
of possible participants is known (i.e. the world is closed in
respect of other possibilities), leading to the inference of the
only possible participant, or to the detection of an
inconsistency if the assumption is incorrect.
Ambiguities arise in the interpretation of words (such as
"tank") and we have reported an assumption-based approach in
[12]. This is taken up in more detail in the next section on the
use of domain knowledge. Ambiguities also arise from
interpretation of sentences, and we represent such ambiguities
as types of assumption that represent a specific "sentence
interpretation" that form a premise of the rules that extract the
facts on the basis of this interpretation, see [2]. For example,
consider the sentence "Dignitaries in Epsilonland are
protected", which we might wish to turn into a rule inferring
that certain dignitaries are a "protected thing". However, as
noted above, there is an ambiguity as to the meaning of "in". If
we take "in" to mean "working for" then a sentence interpretation is being made, which can be formulated as the assumed CE proposition "it is assumed that there is a sentence interpretation named 'si_in' that has 'we take "in" to mean working for' as description". This assumed proposition may then be placed in the premise of the rule that is the expression of the sentence, for example:
if
( the dignitary D is an official of the country Epsilonland )
and ( there is a sentence interpretation 'si_in' )
then
( the dignitary D is a protected thing ).
and when this rule is applied to a dignitary D, the conclusion
that D is protected will be dependent upon the assumed
sentence interpretation 'si_in'.
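This assumption-as-premise pattern might be sketched procedurally as follows. The dignitary identifier "D1" is hypothetical; each conclusion carries the set of assumptions it depends on, so the rationale can surface them.

```python
# Registered sentence-interpretation assumptions, keyed by label.
assumptions = {"si_in": "we take 'in' to mean working for"}

def protected_dignitaries(officials, interpretation="si_in"):
    """Apply the rule only when the assumed interpretation is in force;
    each conclusion records the assumption label as its support."""
    facts = []
    if interpretation in assumptions:        # the assumed premise holds
        for d in officials:
            facts.append((f"the dignitary {d} is a protected thing",
                          {interpretation}))
    return facts

facts = protected_dignitaries(["D1"])
```
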
In some NL sentences, there are expressions that explicitly
indicate uncertainty in propositions stated in that sentence, for
example "it is thought that", "possibly", "John claimed that",
and it is useful to extract such sources of uncertainty and
present these to the user as part of the rationale for a fact. We
aim to achieve this by representing such information as specific
types of CE assumptions, with associated information (such as
the actual terms used), see [13].
All of these techniques rely on the fact that these different
types of assumptions can be tracked through the rationale and
presented to the users so they may know the sources of
ambiguity or uncertainty in any conclusion of high value
information. This approach based on assumptions is related to
the use of argumentation for the elucidation of sources of
uncertainty in reasoning, and the resolution of conflicts
between different arguments for conclusions; work is being
undertaken to link the ITA work on NL fact extraction with
that on argumentation [14].
D. Why Domain Knowledge Is Necessary for Disambiguation
A significant challenge to NL processing is that words and
sentences are potentially ambiguous, although human listeners
are usually able to disambiguate utterances without conscious
effort. This was exemplified in the ELICIT sentences above.
It is generally stated that disambiguation requires
knowledge of the world, and it is useful to understand why this
is the case. A typical simple algorithm for disambiguating a
word with multiple senses (e.g. tank as military vehicle or
liquid container) is the Lesk algorithm [15]. This determines
the context of the word, as being the other words surrounding
it in a sentence, and attempts to match this context with the
text definitions of the possible senses of the word from a
predefined dictionary, such as WordNet [10]. The definition
that matches the greatest number of words with the context
gives the "best" sense of the word. This algorithm demonstrates
two key aspects of disambiguation, determining the context of
a term, and finding the "best" match between the context and
the predefined senses of the term.
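A minimal version of the Lesk procedure just described might look as follows. The two glosses are hand-written stand-ins for dictionary definitions (e.g. from WordNet), and the context set is illustrative.

```python
# Simplified Lesk: pick the sense whose gloss shares the most words with
# the context of the ambiguous word.
glosses = {
    "military_vehicle": "an armoured combat vehicle that moves on tracks",
    "liquid_container": "a large vessel or container for holding liquid",
}

def lesk(context_words, glosses):
    """Return the sense with the greatest word overlap with the context."""
    def overlap(gloss):
        return len(set(gloss.split()) & set(context_words))
    return max(glosses, key=lambda sense: overlap(glosses[sense]))

sense = lesk({"the", "armoured", "vehicle"}, glosses)
```
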
The intuition is that words (or terms) in the same context will
share the same general topic, since they are (likely to be)
referring to the same situation in the world; there is a
relationship between the word W and its context C by virtue of
a common third factor, that of the topic or situation on which
word and context are dependent. Thus in the real world, a
situation may be about someone driving a tank, and this will be
expressed by words about drivers, driving and vehicles. Words
about liquid containers will not appear, since that is not what
the situation is actually about. In the Lesk algorithm, it is
assumed that words textually close to a word are in the same
context ("referring to the same topic") and that "being based on
the same topic" can be measured by textual identity of words in
the word definition and words in the context. However these
are assumptions, and are prone to error. There are ways to
improve the definition of context and similarity of sense. Some
of these involve the use of syntactic knowledge, others the use
of semantic knowledge.
For syntactic knowledge, consider the sentence "the dog
didn't bark by the tree", where the word "bark" could have the
sense of "making a dog-like noise" or "outer covering of a tree".
Simple textual matching of the context would not rule out
either of these senses. However an understanding of the parse
structure of the sentence (or possibly just the part-of-speech
analysis) would indicate that "bark" is the head of a verb
phrase (or just has "verb" as part of speech), and hence the
second sense would be ruled out, as being a noun. In more
complex sentences such as "the tank in the house was leaking
so the engineer drove the tank to the house as quickly as
possible", parsing could more accurately determine the context
(in the sense of the situation being referred to) from the phrase
structure than simple textual context alone. Given that the ERG
system generates output that is closely related to the syntactic
parsing, it might be possible to partition the predicates into
"similar contexts" on the basis of the arguments alone (which
therefore does not require any semantic analysis). However
there are cases where syntax alone cannot disambiguate (such
as "the engineer drove the tank"), and it is then necessary to
use semantic knowledge to match sense and topic context.
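The part-of-speech filtering described for "bark" can be sketched as follows. The sense inventory and the tagged sentence are hand-built for illustration; in practice the tags would come from a parser or POS tagger.

```python
# Sketch of part-of-speech filtering: each candidate sense carries the POS it
# requires, and senses whose POS conflicts with the tagger's output are ruled
# out. Sense inventory and tags are invented for the example.

SENSES = {
    "bark": [("make a dog-like noise", "VERB"),
             ("outer covering of a tree", "NOUN")],
}

def filter_by_pos(word, tagged_sentence, senses):
    # tagged_sentence: list of (token, pos) pairs
    pos = dict(tagged_sentence)[word]
    return [gloss for gloss, required in senses[word] if required == pos]

tagged = [("the", "DET"), ("dog", "NOUN"), ("did", "AUX"), ("not", "PART"),
          ("bark", "VERB"), ("by", "ADP"), ("the", "DET"), ("tree", "NOUN")]
print(filter_by_pos("bark", tagged, SENSES))
```

Because "bark" is tagged as a verb, only the verbal sense survives; the noun sense ("outer covering of a tree") is eliminated without any appeal to meaning.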
People have knowledge of the world which they use to
understand utterances. This semantic knowledge may be seen
as the situations and entities that may exist in the world,
expressed by their connections, relationships, subparts, causal
links, times, places and attributes. Such knowledge may also be
seen as a set of constraints on what is and is not possible in the
world, either physical or psychological. In this view, "being in
the same context" can be stated more precisely as "being in the
same situation", and therefore all NL expressions about that
situation should be semantically consistent, and not break any
of the constraints¹. For example, role-players in a situation
must be of the right type to play the role of the situation; this
constraint is the basis of the disambiguation of "tank" as being
an entity that had to be "driven" (and hence could not be a
liquid container). It seems reasonable that use of semantic
knowledge could be the most powerful way to perform
disambiguation, since it most closely mirrors the real world
constraints. However there are difficulties in the use of such
knowledge.
It is useful to split "semantics" into some loose categories
of generic, commonsense and specialist. Generic semantics
may be considered as the fundamental structuring of the
conceptual space, and may include such concepts as "situation",
"role", "container" and "causality". This level of abstraction is
perhaps not particularly useful in disambiguation, since it does
not provide many detailed constraints. Commonsense
knowledge is more detailed knowledge about many
situations in the world, without being specialist, and might
include understanding of how things grow and change, how
people interact, how machinery and natural forces "work". The
world is almost infinite in its possibilities, so commonsense
knowledge is almost infinite, although it can be applied easily
and quickly by humans. Specialist knowledge is that specific
and detailed knowledge of a particular area, such as cooking,
medicine, and the behaviour of tribal influences. In this paper
we will use the term "domain knowledge" to refer to a formal
and limited combination of all of these, mostly specialist
knowledge, together with a (relatively small) degree of
commonsense and generic knowledge.
¹ We ignore the case where speakers are deliberately or inadvertently
reporting the situations incorrectly.
Research into Artificial Intelligence has found that
computers can be made to represent detailed domain
knowledge in a limited area, but not the totality of
commonsense knowledge applicable to all areas. (Such tasks
are "AI-complete", mirroring the term "NP-complete" for tasks
that are computationally intractable). Machine assistance to
problem solving tasks is most successful in specific domains,
and this probably applies to NL processing as well.
Thus domain knowledge can assist disambiguation, by
providing constraints based upon representations of situations
in the real world, and by using these to rule out impossible
situations as a result of inferring inconsistencies. However, it
may be the case that inconsistencies are not determined directly
from the information in a sentence, but from indirect inference
over a chain of reasoning steps, based upon application of
semantic knowledge. Therefore it is necessary to ensure that
the conceptual domain model includes much detail of the
regularities of situations and that all possible inferences have
been made from the data in the sentence, before it is certain
that disambiguation can be undertaken.
The work that we have done so far in disambiguation [12]
is in keeping with this approach, based upon the use of
assumptions to represent possible senses, and domain
knowledge to infer inconsistencies, in effect using domain
constraints to rule out impossible situations. We therefore
proposed to continue with this approach whilst seeking to
represent as much commonsense knowledge as possible.
However this will still not permit disambiguation in all cases,
particularly those where common sense is required, and it may
still be necessary to involve the human in disambiguation.
The use of the ERG/MRS is also an advantage, in that higher
precision of parsing and greater detail of semantic output is
more likely to cause the processed situations to be a more
accurate reflection of the situation in the real world, and hence
provide a better context to perform disambiguation.
In order to extend this approach to a wider range of
sentences, it is useful to consider the sources of semantic
knowledge (and constraints) available for disambiguation. The
user's domain model is of course available for this purpose, but
other generic sources of knowledge may be also available.
The ERG itself contains a source of knowledge, both in the
lexicon and in the grammar. The grammar mostly encodes
syntactic knowledge but the lexicon does provide a (relatively
small degree of) semantic knowledge in the form of types to
which the entries may belong. Thus some of the constraints
may already have been applied in the course of the ERG
processing, leading to less requirement for disambiguation.
However the ERG deliberately does not attempt to perform
word sense disambiguation (the word tank is output in the
MRS in the same way for both senses).
External sources of generic semantic knowledge include
WordNet [10], VerbNet [16], and FrameNet [17], and have
been used in the literature for disambiguation. The first two
have been converted into CE, and so could be (and have been)
used in the CE reasoning. We approach disambiguation by
making assumptions for different senses of a term, and it may
be possible to use, say, WordNet, to assess and rank the likelihood
of each assumption of sense. Such a ranking could then
prioritise the processing of alternative senses, or could
contribute to the overall degree of uncertainty of a conclusion.
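Such a ranking can be sketched very simply: each sense assumption receives a prior weight, and the alternatives are explored in descending order of likelihood. The frequency counts below are invented, standing in for the kind of sense-frequency data WordNet provides.

```python
# Sketch of ranking sense assumptions before reasoning: invented counts stand
# in for corpus sense frequencies; the reasoner would then explore the
# alternative assumptions in descending order of likelihood.

def rank_assumptions(sense_counts):
    total = sum(sense_counts.values())
    ranked = sorted(sense_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(sense, count / total) for sense, count in ranked]

# invented frequency counts for the two senses of "tank"
print(rank_assumptions({"tank#container": 30, "tank#vehicle": 10}))
```

The normalised weights could either prioritise which assumption the reasoner tries first, or propagate into the degree of uncertainty attached to any conclusion that depends on the assumption.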
An alternative use of WordNet might be as a means of
extending the domain model, either automatically or manually
by virtue of an "analyst's helper" that could suggest extensions
to the user. The extended domain model could then be
used directly in the disambiguation process.
The largest resource of semantic knowledge is Wikipedia,
and it is useful to consider how this knowledge could be used.
In theory, Wikipedia could be converted into CE (even if only
by converting the DELPH-IN MRS version of Wikipedia into
CE expressing the MRS). However this is very low level, and it
may be necessary to consider more structured sources such as
DBpedia. Information that is most likely to be of use in
disambiguation would be general semantic relations between
concepts rather than just specific instances of those concepts,
and further work is necessary to determine if such information
is extractable from Wikipedia or DBpedia.
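As a first step in that direction, type information for a concept can be retrieved from DBpedia's public SPARQL endpoint. The endpoint URL and resource naming follow real DBpedia conventions, but the idea of feeding the returned types into the disambiguation process is our assumption, not an established pipeline.

```python
# Sketch of building a DBpedia SPARQL query for the rdf:type assertions of a
# resource, the kind of general type information that might support
# disambiguation. Only the query is constructed here; no network call is made.
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def type_query(resource):
    # ask for all rdf:type assertions of the given DBpedia resource
    return ("SELECT ?type WHERE { "
            f"<http://dbpedia.org/resource/{resource}> rdf:type ?type . }}")

def query_url(resource):
    # full GET URL for the public endpoint, requesting JSON results
    return DBPEDIA_ENDPOINT + "?" + urlencode(
        {"query": type_query(resource), "format": "json"})

print(type_query("Tank"))
```

Whether the classes returned for a resource are general enough to act as domain constraints, rather than instance-level facts, is exactly the open question raised above.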
VII. CONCLUSIONS
The challenge of obtaining a deep representation of the
syntactic structure and general meaning of a sentence has been
addressed by the use of the ERG parser and its output in MRS.
The challenge of bridging the gap between this linguistic
semantics and the domain semantics of a CE model has in part
been solved by the use of ITA CE to apply domain knowledge
to transform linguistic expressions to domain concepts, and to
address ambiguity in sentence interpretation by the use of
assumptions and the ruling out of those that are inconsistent
with the domain model.
CE has also been used to externalize human reasoning and
problem-solving strategies in a complex problem-solving task,
and by making the reasoning visible to the human, CE can
facilitate the human understanding of the solution to the
problem. This has been applied to a realistic task that is used
elsewhere to experiment with collaborative problem solving,
and discussions with the ELICIT team suggest that our
formulation of the problem may be useful to these experiments.
However, there are still challenges that remain unresolved.
The range of sentences handled by our system must be
extended, by applying the techniques to the MRS test suite to
ensure coverage of the main linguistic phenomena to be found
in the output of the ERG system. The most significant
challenge is the need for considerable domain-specific and
commonsense knowledge for the disambiguation of sentences
such as the ELICIT sentences. The work reported here has
sidestepped this problem by human simplification of sentences,
although more recent work is handling some of the ELICIT
sentences in their original form, in order to contribute facts in
CE to the collaborative sensemaking undertaken by the ITA
ACT-R research [18]. In this paper we have started to explore
in more detail why domain knowledge is so important for
disambiguation, with a view to understanding how our work is
to be extended. By using more information within the domain
model, potentially extended from other sources of domain
knowledge, we aim to apply domain knowledge to more
powerfully influence the processing and disambiguation of
more complex types of sentence.
Our research has the potential to provide support for users
and analysts in the types of cognitive problem-solving task
described in the introduction, and we are exploring with the
ELICIT team, and others, how these capabilities compare to
the human reasoning that occurs in these types of task.
ACKNOWLEDGMENT
This research was sponsored by the U.S. Army Research Laboratory and
the U.K. Ministry of Defence and was accomplished under Agreement Number
W911NF-06-3-0001. The views and conclusions contained in this document
are those of the author(s) and should not be interpreted as representing the
official policies, either expressed or implied, of the U.S. Army Research
Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K.
Government. The U.S. and U.K. Governments are authorized to reproduce and
distribute reprints for Government purposes notwithstanding any copyright
notation hereon.
REFERENCES
[1] International Technology Alliance, https://www.usukita.org/
[2] Mott, D., Poteet, S., Xue, P., & Copestake, A. (2014) Natural Language
Fact Extraction and Domain Reasoning using Controlled English,
DELPH-IN 2014, Portugal. http://www.delphin.net/2014/Mot:Pot:Xue:14.pdf
[3] Mott, D. (2010) Summary of CE,
https://www.usukita.org/papers/5658/details.html
[4] http://www.dodccrp.org/html4/elicit.html
[5] Mott, D. (2014) On Interpreting ELICIT sentences,
https://www.usukitacs.com/node/2603
[6] Mott, D. (2014) Conceptualising ELICIT sentences,
https://www.usukitacs.com/node/2604
[7] Mott, D., Poteet, S., Xue, P., Kao, A., & Copestake, A. (2013) Using
the English Resource Grammar to extend fact extraction capabilities,
ITA Annual Fall Meeting, 1st-3rd October 2013,
https://www.usukitacs.com/node/2498
[8] Flickinger, D. (2007) The English Resource Grammar, LOGON
technical report #2007-7, www.emmtee.net/reports/7.pdf
[9] Copestake, A., Flickinger, D., Sag, I. A., & Pollard, C. (2005)
Minimal Recursion Semantics: an introduction. Research on Language
and Computation, 3(2-3):281-332.
[10] Miller, G. A. (1995) WordNet: A Lexical Database for English.
Communications of the ACM, 38(11):39-41.
[11] O Seaghdha, D., Copestake, A. & Mott, D. (2014) Investigating the use
of distributional semantics to expand domain vocabulary, ITA Annual
Fall Meeting, https://www.usukitacs.com/node/2754
[12] Mott, D. (2014) CE-based mechanisms for handling ambiguity in
Natural Language, Feb 2014, https://www.usukitacs.com/node/261
[13] Xue, P., Poteet, S., Kao, A., Mott, D. & Giammanco, C. (2014)
Representing Natural Language Expressions of Uncertainty in CE,
ITA Annual Fall Meeting, https://www.usukitacs.com/node/2753
[14] Cerutti, F., Braines, D., Mott, D., Norman, T. J., Oren, N. & Pipes, S.
(2014) Reasoning under Uncertainty in Controlled English: an
Argumentation-based Perspective, ITA Annual Fall Meeting,
https://www.usukitacs.com/node/2756
[15] Lesk, M. (1986) Automatic sense disambiguation using machine
readable dictionaries: how to tell a pine cone from an ice cream cone. In
SIGDOC '86: Proceedings of the 5th Annual International Conference on
Systems Documentation, pages 24-26, New York, NY, USA. ACM.
[16] Palmer, M., VerbNet, http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
[17] Fillmore, C. J., FrameNet, https://framenet.icsi.berkeley.edu/fndrupal/about
[18] Smart, P., Tang, Y., Stone, P., et al. (2014) Socially-Distributed
Cognition and Cognitive Architectures: Towards an ACT-R-Based
Cognitive Social Simulation Capability, ITA Annual Fall Meeting,
https://www.usukitacs.com/node/2746