
Feature Label Ordering in Learning the Meaning of the Polysemous Verb: ‘Run’ / ‘Lopen’
Jacob Verdegaal
5677688
bachelor thesis
Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam
Supervisor
dr. H. W. Zeevat
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam
Abstract
Feature Label Ordering in Learning the Meaning of a Polysemous Verb
by Jacob Verdegaal
Feature-label ordering was introduced as the better-performing order for predicting symbols from features. This research explores whether it is also useful for finding the relevant semantic properties in the lexicalization of a polysemous verb. The cue-competition effect of feature-label ordering is hypothesized, and shown, to make it a good candidate when data must be classified on the basis of overlapping feature sets. Cue competition performs well because, if features depend on one another, the predictive value of a feature (with respect to predicting a symbol from features) must also depend on the predictive values of the other features. Label-feature-order learning does not account for these dependencies as well as feature-label-order learning does.
Contents

Abstract

1 Introduction
    1.1 Related work
        1.1.1 Semantic analysis
        1.1.2 Feature-label-ordering
    1.2 Cue competition and disambiguation

2 Method
    2.1 Used features
    2.2 Learning associations
        2.2.1 Prediction

3 Results, Conclusion and Discussion
    3.1 Results
    3.2 Conclusion
    3.3 Discussion

A featureStructure

B Table of features

C Extracted feature sets

Bibliography
Chapter 1
Introduction
Word meaning is important for Artificial Intelligence (AI), because the Turing test requires artificial agents to master language in order to pass and to be perceived (by humans) as intelligent. The meaning of a lexical item varies with the context in which it is used, but how to define the exact meaning of a word in a given context is a non-trivial question. A dictionary lists many possible meanings for each word, but the elements of such lists are defined by examples and are therefore not directly usable for AI. A lexicon would be optimally usable for AI if a logical meaning were defined for each possible use.
Much work has been done by linguists on the formalization of meaning [1], in which both typological and cognitive approaches have been taken. The cognitive approach can serve as an inspiration for AI, but was not developed per se to be formalized. Typological work is mostly done to compare languages and to research their origin and relatedness. Within AI, distributional semantics [2] has been developed in order to formalize semantics, but it is based on the idea that meaning can be defined fully in terms of the company words keep.
This paper is part of an ongoing project investigating the possible feature sets that define the correct meaning of a polysemous verb in its context. These feature sets must be unique and usable for both production and interpretation of language by AI. The challenge is to find feature sets that work equally well for Dutch, English, German or, in principle, any language whatsoever. This restriction requires the feature sets to be independent of syntactic rules, so they can only result from semantic analysis. The project is new in its approach in that disambiguation, the prediction of local meaning, should be the result of the semantic definition (determined by a lexicon) of a single word, not of context restrictions.
This paper focuses on ‘run’ and ‘lopen’, which have similar meanings. A direct translation of ‘John runs to catch his train’ into Dutch is ‘Jan rent om zijn trein te halen’. However, a direct translation is often not correct: for example, ‘the engine is running’ corresponds to ‘de motor loopt’, where the meaning of running and loopt is exactly the same, whereas ‘de motor rent’ is meaningless in Dutch.
1.1 Related work

1.1.1 Semantic analysis
Perhaps the best-known work on the meaning of words is WordNet, a collection of 117,000 sets of synonyms linked by basic relations such as hypo- and hypernymy. However, knowing a word’s place in the ‘hierarchy of things’ is not sufficient to represent its local meaning. As noted earlier, meaning depends on context in many cases, especially for polysemous words.
Currently a huge effort is being devoted to distributional semantics, in which the semantic representation of words is captured in models known as vector spaces, semantic spaces, word spaces or distributional semantic models [2]. These models capture the semantics of words by defining relations that express shared attribute values. If two words share many attribute values, for example as dog and puppy share barking, having four legs and having a tail, their meanings are similar. Attribute values are extracted from surrounding words, from WordNet and from the syntactic structure of the sentence in which the word to be defined is encountered. However, syntactic definitions do not have (enough) logical meaning to be used for AI or disambiguation.
1. Jan loopt al 40 jaar (roughly: ‘Jan has been walking for 40 years’)
2. Jan loopt al in de 40 (roughly: ‘Jan is already in his forties’)
Example 1: Different uses sharing attributes
Polysemous words need extra attention in this approach in order to define any useful meaning. For example, 1. and 2. share many attribute values, but the actual meaning of lopen is very different in each. Disambiguation requires a definition of meaning with which AI could reason (i.e. logical meaning).
Disambiguation depends on a deeper semantic analysis of possible contexts (i.e. the sentence in which run is encountered). Syntactic roles are not sufficient [3]. Semantic roles of contextual elements, as described by Grimm in ‘Semantics of case’ [4], can provide some of these semantic properties. In the models described above, disambiguation depends on knowledge of the syntactic structure and the meaning of the other words. However, disambiguation of loopt in this example is determined by a qualitative change of the property age of Jan (he is getting older). A qualitative change is what Grimm calls a persistence type. He forms a fine-grained case grammar by constructing a lattice of agentivity and persistence properties. Semantic properties (combinations of agentive and persistence types) emerging from this grammar can then be used in semantic models. And because systems are being developed to automatically extract these properties from corpora [3], they are candidates for use in automatic disambiguation by a lexicon. Section 2.1 explains how these semantic properties are used in this research.
1.1.2 Feature-label-ordering
So, to construct a lexicon with which disambiguation can be done automatically, feature sets must be found. Each of these sets will represent a specific meaning of the polysemous verb. The elements of a specific set are hypothesized to be dependent on the other elements in the set. Because of these dependencies, Feature Label Order learning will be tested as a way to learn which features form the sets.
Feature Label Order (FLO) learning, as opposed to Label Feature Order (LFO) learning, was introduced by Ramscar et al. [5]. They claim that humans are only able to learn symbols properly in this order. This claim breaks with the current paradigm, because symbol learning has always been approached with the goal of predicting a set of properties (that an object has) given a symbol. The predictor has thus always been the symbol, and the outcome a set of features. But the use of symbols demands the prediction of a symbol given a set of properties. Hence learning the relations is more fruitful when the set of properties is presented first, followed by the symbol [5]. The properties are termed features, the symbols labels.
FLO learning has the advantage of cue competition. Prediction is the process of a feature set, or the ‘cue (predictor)’ [5, p. 913], voting for an ‘outcome (thing to be predicted)’ [5, p. 913]. When learning in LFO, the predictor is a symbol and the outcome a set of features, so information about dependencies between features is disregarded: the predictive ‘strength’ of the label is increased or decreased for the entire set. When learning in FLO, however, the cue consists of features whose individual predictiveness can change. For example, when a strongly predictive feature is present but the outcome is other than the one predicted (by this individual feature), its predictive strength (association value) is decreased. Furthermore, this decrease makes it possible for other features to increase their association strength, since associations between cues and outcomes have a maximal value they can reach.¹ A small worked example is given below.
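As an illustration (with made-up association values, using the update rule of section 2.2 and $\alpha\beta = 0.1$): suppose features $f_1$ and $f_2$ are both present and the outcome is label $A$, with current associations $V_{f_1,A} = 0.9$ and $V_{f_2,A} = 0$. The shared error is $\lambda_A - V_{total} = 1 - 0.9 = 0.1$, so both associations increase by only $0.1 \times 0.1 = 0.01$: the already strong cue $f_1$ leaves little associative value for $f_2$ to gain. If on a later trial $f_1$ is present but the outcome is a different label, the error for $A$ becomes $0 - V_{total}$ and $V_{f_1,A}$ drops, freeing associative value that $f_2$ can acquire on subsequent trials. Under LFO the cue is the single label, so no such competition between features arises.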
1.2 Cue competition and disambiguation
Disambiguation of a polysemous word by means of a lexicon requires definitions of meaning in terms of the combinations of features that determine the local meaning. Since the features in a specific set are assumed to depend on the other features of that set (more specifically: when one feature is present in the context, some others are likely to be present as well, while yet others are not), the effect of cue competition is a precondition for learning these sets, at least according to the FLO learning hypothesis.
Cue competition as a precondition is tested by comparing the performance of FLO-learned and LFO-learned associations on two prediction tasks:²
• Given a label and its context, predict the features defining the local meaning.
• Given a set of features, i.e. the semantic interpretation of a specific example, predict the correct label.
¹ A maximal association between a cue and an outcome cannot increase further, because it expresses predictiveness, which cannot grow once the cue predicts the outcome with 100% certainty.
² From here on, the polysemous words run and lopen are labels and their local (disambiguated) meanings are feature sets.
The first prediction task requires a local meaning to be defined before learning takes place; section 2.2.1 explains how this is done. In the second prediction task a correct label has to be predicted. Although a polysemous word is only one label, the data used makes it possible to discriminate between run and lopen. Furthermore, one of the meanings of lopen, viz. walk, is added as a label to guarantee discreteness of labels.
The Rescorla-Wagner model is used to learn associations. Section 2.2 explains the technical difference between FLO and LFO learning. Section 3.1 presents the results of the prediction tasks, and section 3.2 gives the interpretation and implications of the results.
Chapter 2
Method
2.1 Used features
The project requires features that are independent of any specific language, so the most basic principles in language are used. In any language, an utterance is about something; there is a main topic of discussion, commonly known as the theme. The theme can be an agent (an entity that at least has sentience, i.e. a human) or not. An agent itself can have further properties. Grimm constructed an agentivity lattice in which possible degrees of agentivity are ordered. The lattice also contains degrees of persistence, expressing change of existence (where an event causes a theme to start or cease to exist) or change of qualities of a theme (for example the quality of place, which is especially useful if the meaning of run is movement). Whereas Grimm assigns these properties only to agents, here the entities in example sentences may also be assigned persistence values. Furthermore, example sentences can be annotated with a second entity (ENT 2), which can also be assigned properties. In this way the exact working of a process on entities can be specified.
Other features are obtained with the help of linguistic resources. A list of all possible uses of run/lopen was constructed and analyzed. The most important result of this analysis is that every use of run/lopen denotes a process. All encountered processes were listed and grouped into higher categories, which resulted in 13 process features. Two further process categories, concrete and abstract, group all 13. One of these process types, transformation, is listed under both, representing the shared meaning of an engine and a computer running. See appendix B for a full list of the features used, with a description of their meaning.
Examples of the use of run, walk and lopen were extracted from the British National Corpus and the Alpino Treebank (for Dutch). Examples of rennen were also analyzed, but 99% of them were uses with the meaning ‘currere’, which is not polysemous use, so these were not included in testing.
All examples were annotated with a specially designed program. This program supports annotation by means of a feature structure: some features imply others, or reduce the set of features that remain possible. The structure used for annotation can be found in appendix A. An important note is that this structure should not drive the categorization; it only serves to speed up annotation. A small sketch of how such implications can be applied is given below.
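As a minimal sketch (not the annotation program itself): the fragment below shows how the ‘{ }’ implications from the feature structure in appendix A could be used to close an annotation under implied features. The implication table is only a small illustrative subset, and the function name is hypothetical.

# Illustrative subset of the '{ }' implications in appendix A:
# a key followed by '{ ... }' implies every feature in that set.
IMPLIES = {
    "volition(ENT_1)": {"sentience(ENT_1)"},
    "motion(ENT_1)":   {"qualitative(ENT_1)"},
    "movement":        {"cause(AGENT,start(PROCESS))"},
}

def close_under_implications(annotation):
    """Keep adding implied features until the annotation no longer changes."""
    result = set(annotation)
    changed = True
    while changed:
        changed = False
        for feature in list(result):
            implied = IMPLIES.get(feature, set()) - result
            if implied:
                result |= implied
                changed = True
    return result

# Example: {"THEME=AGENT", "movement", "volition(ENT_1)"} also receives
# sentience(ENT_1) and cause(AGENT,start(PROCESS)).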
2.2 Learning associations
The Rescorla-Wagner model is used to test association learning in LFO and FLO, identical to the research of Ramscar et al. [5]. This model learns association values between cues and outcomes. The association values result from an update rule:

$$V_{ij}^{n+1} = V_{ij}^{n} + \Delta V_{ij}^{n}$$
$$\Delta V_{ij}^{n} = \alpha_i \beta_j (\lambda_j - V_{total})$$
where:
• $\Delta V_{ij}^{n}$ is the change in associative strength between a set of cues i and an outcome j on trial n;
• $\alpha_i$ is a parameter that allows individual cues to be marked as more or less salient;
• $\beta_j$ is the parameter that determines the learning rate; in all experiments it was set to 0.001;
• $\lambda_j$ denotes the maximum amount of associative value (total cue value) that an outcome j can support. In all experiments, $\lambda_j$ was set to 1 (when the outcome j was present in a trial) or 0 (when it was not);
• $V_{total}$ is the sum of all current cue values on a given trial.
In the FLO tests, i are the features and j the labels; in the LFO tests the roles are switched. This means that in the FLO tests the association values of different cues are updated towards one outcome with a specific value. In the LFO tests, at each trial the association values for one feature are updated, relative to each other, for each of the labels. The resulting association values can then be represented in a matrix. A minimal sketch of this learning step is given below.
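A minimal sketch of this learning step, assuming each annotated example is given as a (feature set, label) pair; the function names and data layout are illustrative, not the original implementation:

from collections import defaultdict

ALPHA, BETA, LAMBDA = 1.0, 0.001, 1.0   # cue salience, learning rate, maximal associative value

def rw_update(V, cues, present_outcomes, all_outcomes):
    """One Rescorla-Wagner trial: adjust every cue -> outcome association value."""
    for j in all_outcomes:
        lam = LAMBDA if j in present_outcomes else 0.0    # lambda_j = 1 only for outcomes present on this trial
        v_total = sum(V[(i, j)] for i in cues)            # summed support of all present cues for outcome j
        for i in cues:
            V[(i, j)] += ALPHA * BETA * (lam - v_total)   # the shared error term produces cue competition

def train(examples, order="FLO"):
    """examples: list of (feature_set, label). FLO: features cue the label; LFO: the label cues the features."""
    V = defaultdict(float)
    labels = {lab for _, lab in examples}
    features = {f for feats, _ in examples for f in feats}
    for feats, lab in examples:
        if order == "FLO":
            rw_update(V, cues=feats, present_outcomes={lab}, all_outcomes=labels)
        else:
            rw_update(V, cues={lab}, present_outcomes=feats, all_outcomes=features)
    return V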
2.2.1 Prediction
In the first prediction task, where context and label are given, the local meaning must be predicted. The local meaning is chosen to be a process, because every use of run/lopen implies a process, so one of the process features must represent the local meaning. For the second prediction task, the labels run, lopen and walk are associated with features. The examples of run differ from those of lopen in the frequency of the distinct local meanings, so while the meaning is formally equal, the distribution of local meanings in the data is different.
All tests used a random 90% of the data to learn associations; the remaining 10% was used for testing. The reported numbers are averages over 10 such runs for each of the 4 prediction tasks. This method was chosen instead of a pre-selected test set because the total data set contains only 591 examples. Repeated training and testing on different splits reduces the influence of chance and of outliers (in either the training or the test portion). A sketch of this evaluation loop is given below.
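A sketch of the repeated 90/10 evaluation, under the same illustrative assumptions as above; train() is the sketch from section 2.2 and predict() is sketched below, after the next paragraph:

import random

def evaluate(examples, order="FLO", n_runs=10, train_frac=0.9, seed=0):
    """Average accuracy over repeated random 90/10 splits of the annotated examples."""
    rng = random.Random(seed)
    classes = sorted({lab for _, lab in examples})
    accuracies = []
    for _ in range(n_runs):
        data = list(examples)
        rng.shuffle(data)
        cut = int(train_frac * len(data))
        train_set, test_set = data[:cut], data[cut:]
        V = train(train_set, order=order)                     # train() from the sketch in section 2.2
        hits = sum(predict(V, feats, classes, order) == lab   # predict() is sketched below
                   for feats, lab in test_set)
        accuracies.append(hits / len(test_set))
    return sum(accuracies) / len(accuracies)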
A prediction results from the activation of each feature with which a test example is annotated. These activations are multiplied by the value the feature has in the association matrix, and a summation of these values yields a prediction value per label or process. The highest prediction value indicates the most likely class for the example, as in the sketch below.
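A minimal sketch of this scoring step, under the same illustrative assumptions (an FLO-trained matrix is keyed (feature, class), an LFO-trained matrix (class, feature)):

def predict(V, feats, classes, order="FLO"):
    """Sum the association values of the active features for each candidate class
    (label or process) and return the class with the highest prediction value."""
    def assoc(f, c):
        return V[(f, c)] if order == "FLO" else V[(c, f)]   # LFO matrices are indexed label -> feature
    scores = {c: sum(assoc(f, c) for f in feats) for c in classes}
    return max(scores, key=scores.get)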
Chapter 3
Results, Conclusion and Discussion

3.1 Results
The results listed in the table below show that the FLO-learned associations perform better at prediction than the LFO-learned associations on both tasks. These percentages are averages; the deviation was never more than 10%.
order   labels   processes
FLO     71%      78%
LFO     50%      70%
Below, some association matrices (those closest to the average) are shown. A red box means that a feature has a negative association value for the corresponding label, and green means a positive association. The brightness of a colour reflects associative strength: opaque green must be interpreted as maximal positive association and opaque red as maximal negative association.
Figure 3.1: FLO on labels
Figure 3.2: LFO on labels
Figure 3.3: FLO on processes
Figure 3.4: LFO on processes
3.2 Conclusion
The matrices show a strong cue-competition effect. Some features in the FLO-learned association matrices have no value at all, while in the LFO-learned association matrices every feature has at least one green square. This is the direct result of association values being learned for a set of features per label (in FLO), independently of that feature's association values for other labels. The association values of features learned in FLO change relative to the other features for one outcome only. The values of features in LFO-learned matrices change relative to the values of the same feature for the other labels, so a strong association is learned even when a feature is present for only one label in a few examples.
3.3 Discussion
It appears that when features overlap heavily across labels, FLO is a much better approach for predicting which label follows a set of features. However, the data set was small, containing only 591 examples. The performance of FLO must therefore be tested further on bigger data sets, to exclude the possibility that FLO only performs better on small data sets.
These results indicate that the project's goal of finding feature sets can benefit from the cue-competition effect of FLO. Since the associations in the FLO-learned matrices are sparse, it is possible to extract feature sets that are unique and can thus be used to define local meaning. Appendix C lists, for each process, the set of features with an association value greater than 0.2 (THEME=... is excluded). From this point on it is a task for linguists to investigate whether the found sets can sensibly define local meaning.
Appendix A
featureStructure
% feature structure to be used for fast annotation of data
% keys can imply whole sets, or can be further specified
% 1 line may start with '>>', meaning this is the root layer
% keys are the first list (not in any brackets)
% sets embraced by '{' and '}' mean that the key implies the following set
% sets embraced by '[' and ']' mean that the key implies one of the following set
% sets embraced by '<' and '>' mean that the key implies a subset of the following set

>> THEME=AGENT THEME=PROCESS
THEME=AGENT { [ movement PROCESS=concrete PROCESS=abstract ] ENT_1 ENT_1=animate < ENT_2 SOURCE GOAL partofword > }
THEME=PROCESS { [ PROCESS=concrete PROCESS=abstract ] ENT_1 < ENT_2 SOURCE GOAL partofword > }
PROCESS=concrete [ movement distance liquid transformation volume body instrument ]
PROCESS=abstract [ temporal rational organisation social money communication transformation descriptive ]
ENT_1 [ total(ENT_1) qualitative(ENT_1) existentialB(ENT_1) existentialE(ENT_1) nonExistentce(ENT_1) ]
ENT_1=animate < instigation(ENT_1) motion(ENT_1) volition(ENT_1) sentience(ENT_1) >
ENT_2 { [ total(ENT_2) qualitative(ENT_2) existentialB(ENT_2) existentialE(ENT_2) nonExistence(ENT_2) ] ENT_1 }
ENT_2=animate < instigation(ENT_2) motion(ENT_2) volition(ENT_2) sentience(ENT_2) >
movement THEME=agent { MANNER PROCESS=concrete < MANNER ENT_2 > }
movement ENT_1=animate { MANNER motion(ENT_1) }
volition(ENT_1) { sentience(ENT_1) }
volition(ENT_2) { sentience(ENT_2) }
motion(ENT_1) { qualitative(ENT_1) }
motion(ENT_2) { qualitative(ENT_2) }
partofword [ noun adjective verb other ]
MANNER { [ ambulare currere ] < ENT_2=animate > }
movement { cause(AGENT,start(PROCESS)) }
organisation { cause(AGENT,start(PROCESS)) }
rational { cause(AGENT,start(PROCESS)) }
instrument { cause(AGENT,start(PROCESS)) }
communication { cause(AGENT,start(PROCESS)) }
rational { cause(AGENT,start(PROCESS)) }
0 run lopen walk 0
Appendix B
Table of features
Figure B.1: Table of used features with their meaning in a lexicon of run, walk, lopen

THEME=AGENT: true if the theme is a human or an explicitly mentioned group of people
THEME=PROCESS: true if not THEME=AGENT
movement: true if the theme moves from one place to another
PROCESS=concrete: used for fast annotation
PROCESS=abstract: used for fast annotation
ENT 1: always true, in every example a theme is present
ENT 1=animate: true if the theme is animate
ENT 2: true if there is another entity or object mentioned which is not the theme
ENT 2=animate: true if ENT 2 is animate
SOURCE: true if an explicit source is mentioned, generally only applicable when movement is true
GOAL: true if an explicit goal is mentioned, generally only applicable when movement is true
partofword: true if the label is not a verb or part of a word
distance: true if the process describes a physical path
liquid: true if the process is, or describes properties of, a liquid substance
transformation: true if the process alters input to output
volume: true if the process can be linked with an amount of something, or with a physical mass
body: true if the process can be linked with the human body
instrument: true if an agent uses something or ‘runs’ a hand over something
temporal: true if the process exclusively describes a specific time or the passing of time
rational: true if the process can be linked with a product of the human mind
organisation: true if the process can be linked with an organization
social: true if the process describes an event involving two or more agents
money: true if the process can be linked with money
communication: true if the process can be linked with information which is not part of the theme
descriptive: true if the label has no meaning, no real process can be identified and attributes are assigned to the theme
total(ENT X): true if ENT X does not change
qualitative(ENT X): true if ENT X changes in the course of the process
existentialB(ENT X): true if ENT X ceases to exist in the course of the process
existentialE(ENT X): true if ENT X exists at the end of the process and did not at the start
nonExistentce(ENT X): true if ENT X does not exist (agentive property)
instigation(ENT X): true if ENT X causes an event (agentive property)
motion(ENT X): true if ENT X undergoes physical movement (agentive property)
volition(ENT X): true if ENT X intends a process to happen (agentive property)
sentience(ENT X): true if ENT X is consciously involved in the process (agentive property)
MANNER: true if the process is movement and THEME=AGENT, used for faster annotation; implies currere or ambulare
ambulare: true if movement is a process of walking
currere: true if movement is a process of running
noun: true if partofword is true and the word is a noun
adjective: true if partofword is true and the word is an adjective
verb: true if partofword is true and the word is a verb
other (word): true if partofword is true and the word is another kind of word
cause(AGENT,start(PROCESS)): true if a human is necessary to the process
run: label
lopen: label
walk: label
Appendix C
Extracted feature sets
Figure C.1: Important features in defining the processes
movement: THEME=AGENT, ENT 1=animate, qualitative(ENT 1), instigation(ENT 1), motion(ENT 1), volition(ENT 1), sentience(ENT 1), MANNER, ambulare, currere, cause(AGENT,start(PROCESS))
distance: THEME=PROCESS, ENT 1, partofword, total(ENT 1), noun, run, walk
liquid: THEME=PROCESS, ENT 1, ENT 2, SOURCE, partofword, total(ENT 1), adjective, run
transformation: THEME=PROCESS, ENT 1, descriptive, total(ENT 1), run
volume: THEME=PROCESS, ENT 1, ENT 2, descriptive, existentialB(ENT 1), existentialB(ENT 2), run, lopen
temporal: THEME=PROCESS, ENT 1, partofword, existentialB(ENT 1), noun, lopen
rational: THEME=PROCESS, ENT 1, ENT 2, total(ENT 1), total(ENT 2), cause(AGENT,start(PROCESS)), run
organisation: THEME=PROCESS, ENT 2, total(ENT 1), qualitative(ENT 2), cause(AGENT,start(PROCESS)), run
money: THEME=PROCESS, ENT 1, ENT 2, qualitative(ENT 1), total(ENT 2), lopen
social: THEME=AGENT, ENT 1, ENT 1=animate, ENT 2, total(ENT 1), sentience(ENT 1), total(ENT 2), ENT 2=animate, run
body: THEME=AGENT, ENT 1=animate, ENT 2, qualitative(ENT 1), existentialE(ENT 1), sentience(ENT 1), total(ENT 2), lopen
instrument: ENT 2, total(ENT 1), instigation(ENT 1), volition(ENT 1), total(ENT 2), adjective, cause(AGENT,start(PROCESS)), run
Bibliography
[1] James Pustejovsky. The generative lexicon. Computational linguistics, 17(4):409–
441, 1991.
[2] Marco Baroni and Alessandro Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721, December 2010.
[3] N. Kambhatla and I. Zitouni. Systems and methods for automatic semantic role
labeling of high morphological text for natural language processing applications,
September 3 2013. URL http://www.google.com/patents/US8527262. US Patent
8,527,262.
[4] Scott Grimm. Semantics of case. Morphology, 21(3-4):515–544, October 2011.
[5] Michael Ramscar, Daniel Yarlett, Melody Dye, Katie Denney, and Kirsten Thorpe.
The effects of feature-label-order and their implications for symbolic learning. Cognitive Science, 34:909–957, November 2010.
[6] Hwee T. Ng and John Zelle. Corpus-based approaches to semantic interpretation in natural language processing. AI Magazine, pages 45–64, Winter 1997.