
MASTER'S THESIS
Unsupervised Relation Discovery for
Prepositions and Noun Compounds
Supervisor:
Dr. Ivan Titov
Co-supervisor:
Diego Marcheggiani
Author:
Benno Kruit
10576223
A thesis submitted in fulfillment of the requirements
for the degree of MSc Artificial Intelligence
in the
Language and Computation Group
Institute for Logic, Language and Computation
(42 ECTS)
February 26, 2016
Abstract
Faculty of Science
Institute for Logic, Language and Computation
MSc Artificial Intelligence
Unsupervised Relation Discovery for Prepositions and Noun
Compounds
by Benno Kruit
10576223
This thesis introduces the problem of unsupervised discovery of semantic relations expressed by prepositions and noun compounds. These relations are rarely present in annotated resources, but contain useful information. I evaluate a model that induces a feature-rich semantic labeler for
prepositions and noun compounds by coupling the distribution over the
outputs of a classifier to a reconstruction component. First, I induce relations only for prepositions, but they do not correspond to gold-standard
annotations. While prepositions are undeniably polysemous, the baseline
assumption that they are not is hard to beat. Second, I evaluate the architecture of the reconstruction component on the task of predicting preposition
paraphrases for noun compounds. Trained from factorizing the prepositions as relation labels, its output distribution is close to the distribution of
a set of paraphrase annotations. Third, I induce a combined set of semantic relations for prepositions and noun compounds. The reconstruction-based models often predict the same relation for noun compounds and their
preposition paraphrase, while the baselines do not. While the results show
that rich latent representations are beneficial for relating prepositions and
noun compounds, relation induction for this class of phrases remains an
open problem.
Acknowledgements
First, I would like to thank my thesis advisor Ivan Titov, who inspired
me to tread the deep waters of unsupervised learning and gave me the
freedom to try anything and everything. Second, I want to thank Diego
Marcheggiani, who tirelessly helped me through those all-too-common “dips
in intelligence”. Third, I would like to thank my fellow students in the AI
program for their friendship, help and motivation: Sara, Joost, Otto and
Bas.
Finally, I want to thank my parents, who have seemingly unconditional
pride and the most enticing footsteps, and most of all my girlfriend Hester,
who keeps me happy, confident and sane.
Contents

Abstract

1 Introduction
    1.1 Semantic Parsing
    1.2 Reconstruction-error minimization
    1.3 Thesis Contributions

2 Background
    2.1 Distributed Word Representations
    2.2 Supervised Semantic Parsing
        2.2.1 Parsing Prepositions
        2.2.2 Parsing Noun-Noun compounds
    2.3 Information Extraction
        2.3.1 Unsupervised Semantic Parsing and Factorization
        2.3.2 Latent Structured Prediction

3 Model
    3.1 Reconstruction-Error Minimization
    3.2 Encoder
    3.3 Decoder
        3.3.1 Independent Argument Probability
        3.3.2 Joint Argument Probability
    3.4 Optimization
        3.4.1 Stochastic Gradient Descent
        3.4.2 Encoder regularization using its entropy
        3.4.3 Extension: decoder approximation by sampling

4 Experiments
        4.0.1 Initialization and Hyperparameters
        4.0.2 External Data
        4.0.3 Evaluation
    4.1 Discovering Preposition Relations
        4.1.1 Data and Evaluation
        4.1.2 Baselines
        4.1.3 Results
    4.2 Paraphrasing Noun-Noun Compounds with Preposition Relations
        4.2.1 Model
        4.2.2 Data and Evaluation
        4.2.3 Baselines
        4.2.4 Results
    4.3 Discovering Combined Preposition and Noun-Noun Relations
        4.3.1 Data and Evaluation
        4.3.2 Results

5 Discussion
    5.1 Future Work
        5.1.1 Task-oriented evaluation
        5.1.2 Variational Methods
        5.1.3 Larger Dataset
        5.1.4 Word Sense Induction and Clustering
    5.2 Conclusion

Bibliography

Chapter 1
Introduction
1.1 Semantic Parsing
The field of semantic parsing aims to design computer programs that interpret the meaning of natural language text and generate data structures
which reflect that meaning. Often the first step in that endeavour is to design the structures that we would like the computer to generate. Consider
the following example:
(1) John is reading a book on the history of the Netherlands by a professor
from Amsterdam.
There are a number of projects that aim to annotate such sentences with
well-defined structures. In the FrameNet project (Baker, Fillmore, and Lowe,
1998), for instance, a sentence is assigned one or more semantic frames from
an inventory curated by experts. In the example above, the frame would
typically be labeled as Reading_perception, involving the frame elements
Reader (annotated to the word John) and Text (annotated to the rest of
the sentence). We can then use these annotations to train a computer program — a semantic parser — to predict these structures automatically when
given an input sentence.
However, there is more information in sentence (1) than only the fact
that John is reading something. Consider the prepositions in this sentence.
Using not the FrameNet inventory but a relation inventory for prepositions
from Srikumar and Roth, 2013b, those prepositions express the semantic
relations of Topic (book on history), Attribute (history of the Netherlands),
Source (book by professor), and Source (professor from Amsterdam). In these
examples, the preposition expresses a relation between the word on the
left, called the governor, and the word on the right, called the object. Given
enough high-quality annotations, we could train a semantic parser for not
only the verb-based frames, but these preposition relations too. Sadly, in
most annotated resources these constituents are not annotated with their
own structure, and thus many current semantic parsers do not generate
these structures. Indeed, there is no project in existence that would annotate
noun compounds like history book or Amsterdam professor in this context with
this kind of label. This way, the limitations of these datasets of sentence
annotations also become the limitations of the parsers that are trained on
them.
Supervised parsers Training a semantic parser in a purely supervised
manner has a number of shortcomings. First, though annotation projects
such as FrameNet cost huge amounts of manual effort, they cannot provide complete coverage of language use. Additionally, domains such as
medicine, law or governance use different vocabularies and discuss different concepts, which require their own massive annotation efforts.
Second, when semantic parsers are used in applications, the generated
structures should be suitable for automated reasoning. In question answering, they can be used to parse both the questions and the text that is used to
answer those questions. In other information retrieval tasks, they might be
used to build a general-purpose knowledge base of information extracted
from large volumes of text (e.g. on the web). However, it is often unclear
how the annotated structures correspond to the application that the semantic parser should be used for. Structures that may be useful for a task do not
necessarily exist in the annotations, or the structures may not be adequate.
For instance, annotations might group together antonyms such as open and
close, or related terms such as eat and drink.
This problem can be mitigated by using data related to a specific task —
such as questions and (structured) answers (Liang, Jordan, and Klein, 2012)
— instead of manual annotations. Though those structures are then by construction useful for the task at hand, designing a task and collecting data for
general-purpose semantic parsing is extremely difficult. Such datasets are
therefore typically constrained to very narrow tasks, each requiring its own
data collection, and there is no reason to expect the structures learned for
one task to be useful for another.
Finally, annotated datasets typically do not exist for languages other
than English. Translating a semantic parser is virtually impossible due to
the language-specific data and features that are used.
Unsupervised parsers Therefore, we would like to train a computer program to discover patterns in language that reflect its meaning without manual annotations. Current work on unsupervised semantic parsing typically
uses models with strong independence assumptions, a limited input representation and an inflexible learning system (Titov and Khoddam, 2015). That
means these systems have limited ability to model the relations that exist
between parts of the sentence, and how concepts relate to each other in general. Additionally, these models have difficulties with languages that have
a freer word order than English.
In Chapter 2, I will discuss existing approaches to semantic parsing and
unsupervised language processing.
1.2 Reconstruction-error minimization
The use of unsupervised learning with a flexible reconstruction-based objective, as in this work, is a strategy intended to overcome these problems.
First of all, it allows us to use orders of magnitude more data than are used
in supervised approaches. By training on any available text instead of text
with annotations, we are not restricted to any domain, task or language.
Secondly, reconstruction-based methods have succeeded in other fields as
well. In machine learning, there has been a large amount of work on autoencoders, which learn to encode data into lower-dimensional latent representations that allow the data to be reconstructed as well as possible. These
methods find intrinsic patterns in data by minimizing the reconstruction error. We would like to use this successful concept to create a semantic parser.
Finally, it allows us to search for structure in parts of sentences that are
normally not annotated.
[Figure 1.1: Inducing preposition relations by reconstruction-error minimization. The encoder g(x) maps the input features (gov: professor, obj: Stanford, prep: from, pos: NN-IN-NN) to a distribution over relations r, from which the reconstruction component predicts the arguments a1(x) and a2(x).]
In this thesis, I will evaluate a program that tries to discover the relations
that are expressed by prepositions and noun compounds (Figure 1.1). The program is based on a statistical model that reflects the likelihood of those parts
of the sentence. For example, the predicted relations for ‘on’ and ‘about’ in
book on/about history in contexts such as sentence (1) should become similar.
This happens because in the training data book occurs with words similar to
history in those contexts, which makes the word pair highly likely to occur
in that relation. In other contexts, such as book on shelf, the predicted relation for ‘on’ should be different, because in that relation book occurs with
words similar to shelf. Using its own predicted relations, the system tries
to reconstruct the governor from the object, and vice versa. By minimizing
the reconstruction error, the program jointly learns a measure of word similarity, a clustering of the input data and a probabilistic model of word pairs
for each cluster.
The parameters of the model are optimized to make the input dataset
as likely as possible, which forces it to distribute the word pairs over different relations. These parameters then express the probabilities of word pairs
and relations. Modelling the probabilities of words together allows us to
predict one word from the other word in the pair, and the relation that the
model associates with the input. This way, the model is more expressive
than existing generative models. Hopefully, these properties will allow the
program to induce meaningful and useful representations of natural language semantics.
In Chapter 3, I will discuss the details of the model.
Of course, creating a semantic parser without a task to evaluate it on is a
pernicious affair. This thesis describes a very limited parser — it is only for
interpreting prepositions and noun compounds — which makes it useless
for any task on its own. I have no choice but to evaluate the model by comparing the relations that it discovers to annotations that have a linguistic
foundation.
I will evaluate the model in three situations: (1) discovering relations
that are expressed by prepositions, (2) the performance of the decoder when
using prepositions to express noun-noun relations, and (3) finding one set
of relations for both prepositions and noun compounds.
In Chapter 4, I will describe my experimental setup and results. In
Chapter 5, I will discuss what they imply for the model and the problem
of relation induction for prepositions and noun compounds.
1.3 Thesis Contributions
In this thesis,
• I introduce the problem of relation induction for prepositions and
noun compounds, and relate it to existing work.
• I evaluate an expressive reconstruction-error minimization model on
this problem, which highlights some of its shortcomings in this situation. In particular,
– I induce relations only for prepositions, but they do not correspond to gold-standard annotations. While prepositions are undeniably polysemous, the baseline assumption that they are not
is hard to beat.
– I evaluate the architecture of the reconstruction component on
predicting preposition paraphrases for noun-noun compounds.
Trained from factorizing the prepositions as relation labels, its
output distribution is close to the distribution of a set of paraphrase annotations.
– I induce a combined set of semantic relations for prepositions
and noun compounds. The model beats the baselines at predicting the same relation for noun compounds and their preposition
paraphrase.
Chapter 2
Background
The model described in the next chapter derives its inspiration from many
fields, and this chapter is meant to give an overview of the research that has
motivated its design. As this thesis introduces the problem of unsupervised
preposition and noun-noun compound relation discovery, I will describe
its context in the literature on information extraction, semantic parsing
and on the meaning of prepositions and noun compounds.
2.1 Distributed Word Representations
The main problem in making language models is sparsity: people are very
productive with language, and combine words and phrases in so many
ways that it is impossible to encounter all possible patterns in texts (Manning and Schütze, 1999). Natural language processing systems must find a
way to keep the probability of unseen patterns high enough, and generalize
correctly from observations. If a model assigned a probability of zero to all
patterns it had never seen, it would not be useful. This is also called the
curse of dimensionality. A popular way to cope with sparsity is distributed
word representations (Bengio et al., 2003).
Learning distributed word representations can be seen as a form of dimensionality reduction, where word and phrase tallies are approximated
instead of tracked explicitly. The approximation is computed using word
representations, which are more constrained than the tallies and are thus
lower-dimensional. Those representations are not observed directly, but
created by optimizing the model in a certain way. This results in a model
that generalizes observations in order to overcome the sparsity problem.
Factorization For instance, a table with the tally of all word pairs in a
vocabulary of size V requires V × V cells, but the same information can
be approximated using fewer parameters. If we create representations for
words that encode the most important parts of how they occur together, we
can combine the representations for the words in the column and row of a
cell to estimate the tally of that word pair. When the original information
is a matrix (such as in this example) and the matrix is approximated using
the multiplication of latent representations, this is known as truncated (or
rank-reduced) matrix factorization (or decomposition). The reconstruction of
the original matrix is known as expansion.
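To make this concrete, the following is a minimal numpy sketch of truncated matrix factorization on a toy co-occurrence table; the matrix contents, its size and the chosen rank are invented for illustration:

```python
import numpy as np

# Toy co-occurrence tally: counts[i, j] = how often word i occurred with word j.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 100)).astype(float)  # V x V table

# Truncated (rank-reduced) factorization: keep only the top k components.
k = 10
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
row_repr = U[:, :k] * s[:k]  # latent representation of each row word
col_repr = Vt[:k, :]         # latent representation of each column word

# Expansion: combining the two representations approximates the original
# tally, using 2 * V * k parameters instead of V * V cells.
approx = row_repr @ col_repr
print(np.abs(counts - approx).mean())
```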
The similarity between these latent representations of words often corresponds to our intuitions about language. This observation is central in
the field of distributional semantics, which is based on the hypothesis that
semantically related words occur in the same contexts (Harris, 1954). In
distributional semantics, the model is not tested directly for its language
modeling properties, but for the correspondence of the latent representations themselves to human judgements of word similarity and analogy.
Bengio et al., 2003, introduce a method to learn latent word representations by predicting words from the words that surround them, using a neural network. A related but more efficient model is described in Mikolov,
Sutskever, et al., 2013, and trained on more data. Pennington, Socher, and
Manning, 2014, describe a more direct way of finding latent representations
from word co-occurrence statistics. Levy and Goldberg, 2014, explain how
prediction-based and count-based methods are related, presenting the former model as implicit matrix factorization.
Some of the models described in the next chapter induce word representations similarly to these approaches. In most experiments, I will also use
word representations induced using one of the methods described above.
However, the goal of this thesis is not the induction of these representations
themselves, but to use them for unsupervised semantic parsing of prepositions and noun-noun compounds.
2.2 Supervised Semantic Parsing
Most current systems for semantic parsing are trained using a corpus of
annotations. It is important to understand the choices that are made when
creating these corpora, and the performance of the parsers trained on them.
This is particularly important for prepositions and noun-noun compounds,
because there are significant disagreements on the way their semantics are
annotated.
2.2.1 Parsing Prepositions
Prepositions specify relations between two words: the governor (which can
be a noun, adjective or verb) and the object (which is usually a noun). Governor, preposition and object form a chain within the syntactic dependency
structure of a sentence. Prepositions are highly polysemous (Baldwin, Kordoni, and Villavicencio, 2009), which makes their interpretation a challenging problem in the field of natural language processing.
O’Hara and Wiebe, 2003, introduce the semantic role labeling of prepositional phrases for 7 Penn Treebank labels and 129 FrameNet roles. They
train a decision tree on features derived from WordNet for generalization,
and show that using high-level features, such as semantic roles, significantly
aids prediction. The roles that they predict from FrameNet, however, are
quite specific and primarily related to the frame that is evoked by a verb instead of to the preposition semantics itself. The labels attached to the Penn
Treebank are coarse and heavily skewed. Neither of the two inventories is
based on preposition word senses. Therefore, later approaches have created
a different set of relations that is motivated bottom-up from word senses,
merging frame elements together.
Hartrumpf, Helbig, and Osswald, 2006, make a distinction between
regular and irregular phenomena (though they also note ‘subregular’ phenomena), in which a preposition has a literal or non-literal meaning. They
also take into account the linguistic distinction between the prepositional
phrase as complement or adjunct, which indicates how strongly it is linked
to the governor. Their system for preposition interpretation creates rule-based proposals, and selects from those proposals using a statistical back-off model. Their parse structures are part of the Multi-Net paradigm, which
is a knowledge representation formalism for natural language semantics
used in their semantic annotation efforts of German text. Their rule-based
system makes use of extensive rule-based semantic modelling structures
that constrain the types of predictions, hindering the comparison of their
approach to related work.
Litkowski and Hargraves, 2005, introduce the Preposition Project, which
is a dataset of prepositions in FrameNet sentences disambiguated using
senses from the Oxford English Dictionary. By labeling prepositions in
FrameNet, they create a corpus that was used for the SemEval2007 challenge of preposition word-sense disambiguation (Litkowski and Hargraves,
2007). In this challenge, three teams participated. The team that achieved
the highest score used a feature-rich logistic regression classifier.
Tratz and D. Hovy, 2009, later improve preposition sense disambiguation, beating the systems created for the SemEval2007 challenge. They introduce a large number of features, based on syntactic context and WordNet synonyms and hypernyms. These were expanded and analysed in D.
Hovy, Tratz, and E. Hovy, 2010, and their performance was compared on
the SemEval2007 challenge dataset and the Penn Treebank labels. They
conclude that close mutual constraints hold between the elements of the
prepositional phrase, which motivates the use of a joint model of preposition arguments in this thesis.
Srikumar and Roth, 2013b, train a supervised model of preposition semantics using a structural SVM. Their Model 1 predicts a class label given a
WordNet-based feature-rich representation of the arguments, both on their
own and conjoined with the preposition. They extend the latent structural
SVM for their Model 2, which also predicts the arguments themselves from
a set of candidates proposed by a dependency parser. This work uses their
inventory of preposition relations for evaluation, but does not predict the
preposition arguments.
Nakashole and Mitchell, 2015, propose other interesting features related
to prepositions. They train a log-linear model of Prepositional Phrase attachment through Expectation-Maximization on extra unlabeled data. They
evaluated different linguistic feature sets, and found that the most salient
features came from the semantic types of noun phrases (from categorizations such as WordNet), the semantic type of the subject of the clause, and
n-grams containing the preposition.
Schneider et al., 2015, extend the inventory of Srikumar and Roth, 2013b,
to create a hierarchy of preposition senses, with many connections between
different senses. It also broadens the set of prepositions to include multiword prepositions that were not annotated in the Preposition Project.
Unsupervised Approaches D. Hovy, Vaswani, et al., 2011, introduce unsupervised preposition sense disambiguation, but they do not use supersenses. They train a separate generative model for each preposition, which
finds latent word classes for the governor and object and the preposition
itself. They train the model using Expectation-Maximization, Maximum a
posteriori EM and Bayesian Inference with Dirichlet priors, and evaluate it
by mapping the resulting preposition sense clusters to the labels with which
they most frequently overlap. They achieve a score of 0.55 accuracy, which
is a significant improvement over the baseline of 0.40.
2.2.2 Parsing Noun-Noun compounds
Noun-noun compounds express a latent relation between the modifier and
head word (Downing, 1997).
Compound Splitting Some languages, such as German, compound nouns
together into single words. Splitting those compounds reduces sparsity in
natural language processing tasks. Some approaches, e.g. Daiber et al.,
2015, perform compound word splitting using word embeddings.
In contrast, Dima, 2015, tries to find a function that combines the word
embeddings of both nouns in a compound into the word embedding of the
compound word itself. The most effective function is one that learns an
additional representation for each word in the modifier and head position,
and transforms them with a matrix of weights.
However, neither approach attempts to induce a latent relation inherent to the compound. One could argue that compound words might have
different senses for every relation they might express, but in practice compounds themselves are rarely ambiguous.
Paraphrasing with Verbs Downing, 1997, finds it impossible to constrain
the relations expressible by noun-noun compounds. Downing identifies a
minimum set of 12 relations which commonly occur, but argues that this set
does not adequately capture all relations that can occur between nouns in
novel compounds. This inspires Nakov, 2007, to use an unconstrained set
of verbs (or verbs+preposition) for paraphrasing these relations. The two
resulting datasets of search-engine-based and annotator-based paraphrases
had significant overlap, indicating clear latent relations. In this way, verbs
allow the set of relations to be unconstrained, but the relation does not get
one unambiguous label. That makes classification difficult, and means the approach cannot directly yield a semantic parser.
Parsing into Classes Classes make semantic parsing easier, but require
the creation of a relation inventory.
Levi, 1978, introduces 9 categories of noun-noun compounds, based on
‘recoverably deletable predicates’. However, Warren, 1978, finds 6 major
hierarchical categories by analysis of the Brown corpus. Other approaches
result in 14 categories (Vanderwende, 1996, statistically from online dictionaries), 8 relations (Séaghdha, 2007, by analysis of the British National
Corpus), or 22 relations (Girju, 2009, in a cross-lingual comparison).
Tratz and E. Hovy, 2010, create a taxonomy that was comprehensively
compared to and associated with existing sets of relations, and created a
large dataset of noun-noun compounds. This taxonomy was used in Tratz
and E. Hovy, 2011, to enrich a statistical syntactic parser. The dataset is the
largest of its kind, and publicly available online
(http://www.isi.edu/publications/licensed-sw/fanseparser/). However, the noun-noun
compounds are not in sentence contexts, making them difficult to use for
unsupervised semantic parsing.
In recent work, Dima and Hinrichs, 2015, improve the state of the art in
noun-noun parsing on this dataset using a 2-layer neural network and pre-trained word embeddings. They also show that the representation in their
hidden layer corresponds to our intuition on noun-noun compound semantics.
Paraphrasing with Prepositions With prepositions, there is no need to
create a well-defined inventory. However, the ambiguity of prepositions
presents a problem for annotators, resulting in low inter-annotator agreement.
Lauer, 1996, uses 8 prepositions for paraphrasing. Motivating this, Lauer
explains: “French makes limited use of compounding, while in German it
is heavily productive. It is therefore commonly necessary to render a German compound into an alternative French syntactic construction, such as
a prepositional phrase.” However, some types of noun-noun compounds
are excluded, and the resulting dataset is relatively small. Girju, 2009, compared these paraphrases to a set of classes.
Bos and Nissim, 2015, create more data using a competitive game called
“Burgers” as an annotation framework. Players paraphrase noun-noun
compounds using prepositions in sentence contexts, earning points for inter-annotator agreement based on the player's confidence. The pre-selection of
a limited number of prepositions to present to the players is done case-by-case in a data-driven fashion using Google n-grams. The sentences are
generally in the same domain as FrameNet (e.g. newswire, texts from the
American National Corpus). This results in a dataset of player annotations
with confidence indications, with noun-noun paraphrases using 24 possible
prepositions. I used this dataset for evaluating noun-noun relation learning
from preposition data.
2.3 Information Extraction
Information extraction systems build a general-purpose knowledge base
from large volumes of text (e.g. on the web). The following methods all assume the knowledge base consists of (subject, relation, object) triples, where
the relation always has the same type of subject and object roles.
Lin and Pantel, 2001, introduce DIRT, a system for clustering extracted
relations in order to discover inference rules. The relations are extracted
using manually created rules for syntactic patterns from full parse trees,
and then clustered together based on co-occurrence statistics.
Banko et al., 2007, introduce TEXTRUNNER, a system for large-scale open information extraction through a single pass over a corpus. It
is based on an efficient linear classifier, trained on a small set of syntactic
parses, which labels parts of the sentence as the relation and arguments,
after which the relation and argument tuples are normalized, stored and
assigned a probability. This method is much faster than using a full syntactic parser. However, the extracted structures are still based directly on
the words that are used in the text, instead of underlying semantics. This is
extended by Kok and Domingos, 2008, who merge relations and arguments
together into abstract concepts using Markov Logic. The result is a semantic network grounded in text, but with no way to judge the probability of
unseen facts.
Yates and Etzioni, 2009, propose a generative model that estimates the
probability that two predicates are synonymous by comparing pairs of arguments. Their model, RESOLVER, then performs hierarchical agglomerative clustering and achieves high precision. However, the model is unable
to deal with polysemy.
Carlson, Betteridge, and Kisiel, 2010, introduce NELL, which combines
a hand-crafted taxonomy of entities and relations with self-supervised large-scale extraction from the Web, but they require additional processing for
linking and integration. They combine multiple high-precision classifiers,
and create a corpus of beliefs by combining their predictions and previously
extracted facts.
Fader, Soderland, and Etzioni, 2011, describe REVERB, which extracts
informative relations by combining a simple syntactic and a lexical constraint. Just like TEXTRUNNER, it forgoes a full syntactic parser for efficiency reasons. However, it extracts higher-quality relations using a constraint on part-of-speech tags and noun chunks.
Galárraga et al., 2014, canonicalize relations and arguments extracted by
NELL and REVERB. They use high-recall extractors, followed by clustering methods to improve the precision. The arguments are merged together
through hierarchical agglomerative clustering with various similarity scores,
and the relations are merged using association rule mining. It approaches
the performance of the method from Krishnamurthy and Mitchell, 2011,
which uses the typed taxonomy from NELL relations to disambiguate arguments.
2.3.1 Unsupervised Semantic Parsing and Factorization
To integrate textual information into a knowledge base, one needs to create useful representations that merge textual descriptions into abstract concepts. The (subject, relation, object) triples from the above approaches have a
number of shortcomings. First of all, relations can hold between more than
two entities. Second, every relation in the methods described above can
have an inverse form, in which the subject and object are reversed. Therefore, it can be useful to split the relation into a semantic frame and semantic
roles, as in the FrameNet project. When the relations or roles are less related to their surface form, it becomes possible to predict unseen facts and
perform reasoning tasks.
Lang and Lapata, 2010, introduce the Latent Logistic model for inducing
semantic roles in text. Their approach allows for the induction of latent
classes using a rich feature set.
Titov and Klementiev, 2012, introduce two Bayesian models for unsupervised semantic role induction. As generative models, they necessarily make several independence assumptions in order to remain tractable.
However, they are able to induce a hierarchical model that shares information about semantic roles between relations, using a language-independent
feature set.
Yao, Riedel, and McCallum, 2012, perform semantic relation clustering using variations on LDA. They construct feature representations from dependency parse patterns and the named entities occurring within them. More
specifically, they use clusters from LDA as local features (from entities and
words) and global features (from the sentence and document). They partition entity pairs of a path into different sense clusters with sense-LDA,
a topic model that draws multiple observations from a hidden topic variable. These sense clusters are merged into semantic relations using hierarchical agglomerative clustering. The graphical model that they employ
assumes the features are conditionally independent given the topic assignments, which is not always desirable.
Bordes et al., 2013, introduce a model for inducing latent representations
of concepts in a semantic network, in order to predict the probability of unseen connections. Their model factorizes a semantic network using stochastic gradient descent. Factorizing the semantic network that results from a
surface-form information extraction system, it is able to integrate multiple
data sources and effectively perform entity ranking and word-sense disambiguation. Weston et al., 2013, describe a similar system to jointly perform
knowledge-base factorization and semantic parsing. They use a large-scale
semantic network to improve semantic parsing.
Riedel et al., 2013, perform a factorization of a matrix of entity pairs
and relations, from both shallow (surface-form) relations and relations from
knowledge bases. That factorization allows for the reconstruction of other
valid relations through asymmetric implicature, e.g. historian-at implies professor-at. Their weakly-supervised model induces latent representations of the relations and entities. The factorization provides inspiration for possible alternatives to the bilinear decoding in our work.
Lewis and Steedman, 2013, discover semantic relations by clustering
typed surface-form predicates. Predicates from a categorial grammar are
refined by the entity types of their arguments, after which similar typed predicates are merged using LDA. The probabilistic categorial grammar can then
be used to parse sentences into precise logical forms, over which reasoning
can be performed through logical proofs. They treat prepositions as argument positions of the predicate, which means the model incorporates the
preposition into the predicate but does not disambiguate it.
Titov and Khoddam, 2015, introduce the Reconstruction-error Minimization framework for semantic role induction. They induce an efficient linear
classifier by reconstructing the arguments of the latent relation. Because
this thesis is primarily based directly on this framework, I will discuss it
fully in the Model chapter. For semantic role induction, they report high
cluster overlap with semantic role annotation datasets in English and German. Compared to related approaches their method induces fewer roles,
which means they are more interpretable.
2.3.2 Latent Structured Prediction
Unsupervised parsing is structure prediction with latent structure. Recently, there has been progress on induction of linear classifiers through
reconstruction-error minimization, and the induction of latent representations of structured outputs. As this thesis concerns the induction of a linear
classifier through the reconstruction of argument structures, this research
has inspired parts of the model described in the next chapter.
Daumé III, Langford, and Marcu, 2009, reduce unsupervised structure
learning to supervised binary classification in a reconstruction framework.
The resulting search-based structure prediction framework is evaluated on
sequence labeling and unsupervised syntactic parsing.
Ammar, Dyer, and Smith, 2015, demonstrate a Conditional Random
Field Auto-Encoder that is able to work tractably with global features. It
is trained by block coordinate descent, alternating gradient descent and
Expectation-Maximization for the encoding and decoding.
Srikumar and Manning, 2014, induce latent representations of structured outputs, learning a joint model of the output with latent representations, which is related to the decoder used in this thesis.
Chapter 3
Model
3.1 Reconstruction-Error Minimization
[Figure 3.1: Variables for encoding and decoding preposition relations. The features of the input (gov: professor, obj: Stanford, prep: from, pos: NN-IN-NN) are encoded by g(x) into a distribution over relations r, from which the arguments a1(x) and a2(x) are reconstructed.]
In the Reconstruction-Error Minimization framework (Titov and Khoddam, 2015), the goal is to find a clustering of the input into relations that
help reconstruct the arguments of that relation. Inspired by neural network
auto-encoders, the model consists of two parts, an encoder and a decoder. The
encoder expresses how likely relations are given the input features. The decoder expresses how likely the arguments are together given a relation.
In my case, the model is trained on a dataset of parsed sentences, where
every preposition usage x has two arguments a, consisting of a
governor a_1 and an object a_2.
The model is parameterized by the encoder and decoder parameters θ.
These parameters are optimized to maximize the likelihood of the arguments in the training data. Given a set of relations R, the following probability is optimized for every preposition usage in the dataset:
p_\theta(a \mid x) = \sum_{r \in R} p_\theta(r \mid x)\, p_\theta(a \mid r)  \qquad (3.1)
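As a minimal numeric illustration of Equation 3.1, with invented encoder and decoder values for a single preposition usage:

```python
import numpy as np

# Hypothetical values for one preposition usage x, with |R| = 3 relations:
p_r_given_x = np.array([0.7, 0.2, 0.1])      # encoder: p(r|x)
p_a_given_r = np.array([0.01, 0.05, 0.002])  # decoder: p(a|r) of the observed pair

# Equation 3.1: marginalize out the latent relation.
p_a_given_x = np.sum(p_r_given_x * p_a_given_r)
loss = -np.log(p_a_given_x)  # the negative log-likelihood minimized in training
```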
3.2 Encoder
The encoder expresses the likelihood of relations given the features of the
input, g(x). For example, in the phrase a professor from Amsterdam, the
governor is professor, the object is Amsterdam, and the features could be
represented by a set { gov_professor, obj_Amsterdam, prep_from,
POS_NN-IN-NN }. This is expressed as a sparse boolean feature vector by
viewing each possible feature (out of m total features) as a dimension which
is 1 if that feature is present, and 0 otherwise. The output of the encoder is
a distribution over relations r, using weight vectors w_r ∈ R^m.
The encoder is log-linear, in the sense that the log-probability of a relation given the input is proportional to a linear combination of the features
and the weights:
p_\theta(r \mid x) = \frac{\exp(w_r \cdot g(x))}{\sum_{r' \in R} \exp(w_{r'} \cdot g(x))}  \qquad (3.2)
where the parameters θ include w_r ∈ R^m for r ∈ R. The right-hand term
is normalized by dividing it by a summation over r. This results in a term
equivalent to multinomial logistic regression, also known as the softmax
function.
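A minimal sketch of this encoder in numpy, assuming the sparse boolean feature vector is represented by the indices of its active features (names and shapes are illustrative, not the thesis implementation):

```python
import numpy as np

def encoder_distribution(W, active_features):
    """Compute p(r|x) as in Equation 3.2.

    W:               |R| x m matrix stacking the weight vectors w_r
    active_features: indices where the boolean feature vector g(x) is 1
    """
    scores = W[:, active_features].sum(axis=1)  # w_r . g(x) for every relation r
    scores -= scores.max()                      # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()        # softmax over relations
```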
In the reconstruction-error minimization framework, the encoder can
generally be any differentiable function as long as the posterior distribution
of relations r can be efficiently computed or approximated. After training,
the encoder constitutes the final induced classifier. Because it is an efficient
log-linear model, such a classifier can be used for open-domain information
extraction (Banko et al., 2007). Of course, the specific model that I evaluate
in this thesis is not a complete semantic parser but only a component of
one. However, it is important to evaluate such a component separately, in
order to research its performance on this sub-task.
3.3 Decoder
The decoder reconstructs the arguments given the relation. For example,
the decoder should, for some relation, assign a high likelihood to the argument pair (a_1 = professor, a_2 = Amsterdam) and other pairs that are instances of an origin or source. For other relations, the decoder should assign
a high likelihood to pairs such as (a_1 = professor, a_2 = statistics)
(and other attribute pairs) and a lower likelihood to the former. In this way,
the decoder expresses the probability of arguments from a vocabulary V.
The decoder is log-proportional to a function ϕ of the arguments and
relation, parameterized by θ:
p_\theta(a \mid r) = \frac{\exp(\varphi(a_1, a_2, r, \theta))}{\sum_{a_1' \in V} \sum_{a_2' \in V} \exp(\varphi(a_1', a_2', r, \theta))}  \qquad (3.3)
The scoring function ϕ expresses the ‘match’ between the relation and arguments.
As in the encoder (Equation 3.2), the right-hand term is normalized by dividing
over a summation. Here, the normalization term sums over both arguments.
To fully analyze the properties of the reconstruction-error minimization
framework, I will incrementally specify more complex decoders. I will also
report the performance of these decoders in the Experiments chapter, in
order to illustrate how the model generalizes.
3.3.1 Independent Argument Probability
While I aim at modelling the joint probability of argument pairs, I will first
introduce independent argument decoders. These are similar to generative
models which typically assume that arguments are conditionally independent. Additionally, this allows me to break down the performance of the
reconstruction-error minimization framework into different aspects.
Categorical The simplest decoder model directly keeps track of independent argument probabilities. This is inspired by Ammar, Dyer, and Smith,
2015, who use a categorical (i.e. multinomial) distribution to model the reconstruction of the input to their Conditional Random Field auto-encoder.
However, their work is a graphical model in which the decoder is assigned
a Dirichlet prior, thereby benefiting from regularization.
\varphi(a_1, a_2, r, \theta) = \theta_{a_1 \mid r} + \theta_{a_2 \mid r}  \qquad (3.4)

where θ include θ_{a_1|r}, θ_{a_2|r} ∈ R for a_1, a_2 ∈ V, r ∈ R. The normalization
ensures that the final output is a probability distribution.
Selectional Preference To generalize this model, I factorize the independent argument model using word embeddings. The parameter θ_{a_1|r} that
expresses the score of an argument given a relation is replaced by an inner
product of a word vector u_a with a parameter vector c. This also allows me
to use pre-trained word vectors.
The scoring function now expresses the degree to which the word and
argument representations are in the same direction in vector space:
\varphi(a_1, a_2, r, \theta) = u_{a_1} \cdot c_{1 \mid r} + u_{a_2} \cdot c_{2 \mid r}  \qquad (3.5)

where θ include u_a, c_{1|r}, c_{2|r} ∈ R^d for a ∈ V, r ∈ R.
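For illustration, the two independent-argument scoring functions could be implemented as below; the parameter layouts (score tables and embedding matrices) are assumptions made for the sketch:

```python
import numpy as np

def phi_categorical(theta1, theta2, a1, a2, r):
    # Equation 3.4: one scalar score per (argument, relation) pair.
    # theta1, theta2: |V| x |R| score tables for the governor and object slots.
    return theta1[a1, r] + theta2[a2, r]

def phi_selectional(U, c1, c2, a1, a2, r):
    # Equation 3.5: scores factorized into word vectors and relation vectors.
    # U: |V| x d word embeddings; c1, c2: |R| x d relation parameters per slot.
    return U[a1] @ c1[r] + U[a2] @ c2[r]
```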
Because these decoders do not express the interaction between arguments, they are more similar to the generative models, which assume the
arguments are conditionally independent. However, they do allow me to
analyse the interaction between the encoder, which does capture interdependencies, and the working of the decoder.
3.3.2 Joint Argument Probability
In reality I want to model the joint probability of arguments for each
relation. However, that joint signal is very sparse, which precludes the explicit training of argument pair scores. Therefore, a joint argument model
is necessarily factorized to overcome sparsity.
While many factorized joint models are possible (e.g. Srikumar and
Manning, 2014, Bordes et al., 2012, or Weston et al., 2013), I will examine
only a bilinear model due to the scope of this thesis. As the interaction
between encoder and decoder will prove to be the primary cause of local
minima, that has been the focus of this work.
Bilinear Model To encode the interaction between both arguments, here
the scoring function is computed using the product of word representations
and a square matrix that represents the relation itself. By taking the left
or right product of the relation matrix with one of the word vectors, it is
possible to make a prediction about the other argument. Word vectors that
are near the result of this operation in vector space are likely candidates for
that argument.
For example, if a_1 = professor the model should learn a relation r for
which \hat{u}_{a_2} = u_{professor}^T C_r is near word representations such as u_{Amsterdam}
or u_{Stanford}.
The motivation of this model is that the signal from the combination of
the arguments is much more informative than from the arguments separately. Hopefully, such a signal can replace direct supervision for training
semantic parsers. In this work, the relations are induced from a limited signal that is constrained to predicate-argument structures in certain contexts.
The experiments will show whether the learning signal is strong enough
for such an expressive model.
\varphi(a_1, a_2, r, \theta) = u_{a_1}^T C_r u_{a_2}

where θ include u_a ∈ R^d, C_r ∈ R^{d×d} for a ∈ V, r ∈ R.
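A sketch of the bilinear decoder combined with the full normalization of Equation 3.3 (shapes and names are illustrative); the sum over all |V|^2 argument pairs in the last step is exactly the bottleneck discussed in Section 3.4.3:

```python
import numpy as np

def bilinear_joint_probability(U, C, a1, a2, r):
    """p(a1, a2 | r) with a bilinear scoring function.

    U: |V| x d word embeddings; C: |R| x d x d relation matrices.
    """
    scores = U @ C[r] @ U.T      # scores[i, j] = u_i^T C_r u_j for every pair
    scores -= scores.max()       # numerical stability
    Z = np.exp(scores).sum()     # normalization over all |V|^2 argument pairs
    return np.exp(scores[a1, a2]) / Z
```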
3.4 Optimization

3.4.1 Stochastic Gradient Descent
The model is trained by stochastic gradient descent, using AdaGrad (Duchi,
Hazan, and Singer, 2011) for learning rate adaptation. Stochastic Gradient
Descent is based on iteratively updating the parameters of the model using
the gradient of the loss function evaluated on part of the training data. In
my case, the unregularized loss function per training example is the negative log-likelihood
Q_i(\theta) = -\log p_\theta(a_i \mid x_i)

for each input x_i.
The training consists of iterating over random subsets S of the data
(called 'mini-batches'), calculating the loss and updating the parameters
accordingly, using a learning rate η:

\theta \leftarrow \theta - \eta \sum_{i \in S} \nabla Q_i(\theta)
AdaGrad However, this often causes the updates to perform poorly on
parameters that are used infrequently. Additionally, it is difficult to find the
right learning rate in practice. To overcome this problem I used AdaGrad,
which is a subgradient method that dynamically incorporates information
about the problem during the training. Here the gradient is divided by its
historical magnitude per parameter θ_j:

h_j \leftarrow h_j + \left(\nabla_j Q_i(\theta)\right)^2

\theta_j \leftarrow \theta_j - \eta \sum_{i \in S} \frac{\nabla_j Q_i(\theta)}{\sqrt{h_j}}
This results in faster and more reliable convergence.
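One AdaGrad step per parameter array might look as follows; the small epsilon guard against division by zero is a common addition that does not appear in the update rule above:

```python
import numpy as np

def adagrad_step(theta, grad, hist, eta=0.5, eps=1e-8):
    """Update parameters by the gradient scaled by its historical magnitude.

    theta, grad and hist are arrays of the same shape; hist accumulates
    squared gradients over all mini-batches seen so far.
    """
    hist += grad ** 2
    theta -= eta * grad / (np.sqrt(hist) + eps)
    return theta, hist
```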
3.4.2 Encoder regularization using its entropy
One problem when training the model described in this chapter is early
convergence on local minima. This happens when the encoder initially does
not reflect sufficiently high probabilities for all relations. In that case, the
learning signal of the decoder is focussed on only the relations predicted by
the encoder, leading the decoder to neglect the remaining relations and discard
them in the course of training.
A simple way to overcome this is to add a term to the objective that favors encoder predictions that are more evenly distributed over relations. This is
reflected in the entropy of the encoder predictions. The higher the entropy,
the more evenly distributed the encoder predictions are. The loss thus becomes:
Q_i^H(\theta) = Q_i(\theta) - H\left(p_\theta(r \mid x_i)\right) = Q_i(\theta) + \sum_r p_\theta(r \mid x_i) \log p_\theta(r \mid x_i)
In the experiments, I report on both the model with the entropy term,
and without. I also show it is possible to anneal the entropy by scaling
it during training. This should avoid local minima in the beginning, and
focus the model later on.
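A sketch of the entropy-regularized loss, with a weight that can be annealed towards zero during training as described above (the names and the small constant inside the logarithm are illustrative):

```python
import numpy as np

def entropy_regularized_loss(p_a_given_x, p_r_given_x, weight=1.0):
    """Negative log-likelihood minus the (scaled) entropy of the encoder."""
    nll = -np.log(p_a_given_x)
    entropy = -np.sum(p_r_given_x * np.log(p_r_given_x + 1e-12))
    return nll - weight * entropy  # evenly spread predictions lower the loss
```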
3.4.3 Extension: decoder approximation by sampling
An extension that would dramatically increase the efficiency of this model
is candidate sampling. In the model described in this chapter, the main
performance bottleneck is the calculation of the normalization for the argument prediction in Equation 3.3, because it needs to sum over all words in the
vocabulary. Many approaches to factorization problems like this approximate the normalization instead of computing it completely (e.g. Mnih,
2013, Mikolov, Chen, et al., 2013, Bordes et al., 2013, Weston et al., 2013,
Riedel et al., 2013). Instead of summing over all possible words in order to
compute the normalization term, these approaches treat the argument prediction as a ranking problem. The model is thus trained to rank observed
arguments higher than random arguments.
For candidate sampling, the decoder log-proportional to ϕ would be
replaced by a ranking:
p_\theta(a \mid r) \approx \sum_{(\tilde a_1, \tilde a_2) \in N} f\left(\varphi(a_1, a_2, r, \theta), \varphi(\tilde a_1, \tilde a_2, r, \theta)\right)
where N is a set of random argument pairs following some sampling distribution and f is a monotonically increasing function.
Using the candidate sampling function explored by Titov and Khoddam, 2015, the decoder probability would become:

p_\theta(a \mid r) \approx \exp\Big(\log \sigma(\varphi(a_1, a_2, r, \theta)) - \sum_{(\tilde a_1, \tilde a_2) \in N} \log \sigma(\varphi(\tilde a_1, \tilde a_2, r, \theta))\Big)
where the sampled pairs are generated to always contain one observed argument (i.e. in the form (a_1, ã_2) or (ã_1, a_2)) and σ is the logistic sigmoid
function.
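As an illustration, the sampled approximation to the decoder log-probability could be computed as follows; the sampler that generates the negative pairs is left out, and the names are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sampled_log_prob(phi_observed, phi_negatives):
    """Rank the observed argument pair above the sampled pairs.

    phi_observed:  scalar score of the observed pair
    phi_negatives: array of scores for pairs with one argument resampled
    """
    return np.log(sigmoid(phi_observed)) - np.sum(np.log(sigmoid(phi_negatives)))
```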
While I have experimented with this approximation in the initial stages
of my research, preliminary results indicated that its use would obscure the
primary contributions of this thesis. Namely, the optimization difficulties
arising from the interaction between the encoder and decoder outweighed
the gains that would result from efficiency. Althrough it would allow for a
larger training corpus, it would have been unclear how it had contributed
to the induced relations. Therefore, I have not used this setup in my final
experiments, and I report results only using the full softmax normalization.
Chapter 4
Experiments
Ideally, unsupervised semantic parsing methods should be evaluated in a
task-based setting such as question answering or information extraction.
However, such tasks are too broad for the narrow domain of preposition
parsing. Therefore, I compare unsupervised semantic parses to datasets of
gold-standard annotations.
To evaluate the reconstruction-error minimization framework for preposition relation discovery, I performed three experiments. Two of these relate
preposition relations to noun-noun compound relations. My assumption is
that the relations expressed by prepositions and compounds are similar, as
described in Chapter 3.
4.0.1 Initialization and Hyperparameters
To prevent an explosion of the number of parameters in the bilinear model,
I used a relatively small word representation dimensionality of 40. When
using larger pre-trained word representations, I reduced their dimensionality using principal component analysis, and re-scaled them to have zero
mean and a unit L2 norm.
The encoder parameters w were initialized with a standard gaussian
distribution. The parameters of the decoder were initialized differently
per type of decoder. The categorical decoder (3.3.1) was initialized with
a flat distribution, i.e. all weights set to 1. The selectional preference decoder (3.3.1) was initialized with a standard gaussian, to match the word
representations. I initialized the parameters of the bilinear decoder (3.3.2)
with a zero-mean gaussian, with a standard deviation of √d, where d is the
word representation dimensionality. This ensured that its dot product with
a standard gaussian vector resulted in a vector of the same magnitude.
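Collecting the initialization choices above in one sketch (the sizes here are small stand-ins; the vocabulary and feature counts of Section 4.1.1 are much larger):

```python
import numpy as np

rng = np.random.default_rng(0)
V, R, d, m = 1000, 33, 40, 5000  # stand-ins; Section 4.1.1 uses |V|=53367, m=320763

W = rng.standard_normal((R, m))                  # encoder: standard gaussian
theta_cat = np.ones((V, R))                      # categorical decoder: flat distribution
c = rng.standard_normal((2, R, d))               # selectional preference: standard gaussian
C = rng.normal(0.0, np.sqrt(d), size=(R, d, d))  # bilinear: zero mean, std sqrt(d)
```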
All models were optimized using AdaGrad with a step size of 0.5. I
used mini-batches of 50 inputs, and trained on 90% of the data until the
objective on the held-out 10% stopped falling.
The models were always trained using 33 possible classes, following
Srikumar and Roth, 2013a.
4.0.2 External Data
For the word representations, I used Google’s word2vec pre-trained word
vectors, trained on about 100 billion words of Google News (from Mikolov, Chen,
et al., 2013). These are 300-dimensional representations of 3 million words
and phrases. Some words from my input data (e.g. some names) did not
occur in this vocabulary, so for those I used the average over all word representations.
For the input features, I used the 320 Brown clusters from Turian, Ratinov, and Bengio, 2010.
4.0.3 Evaluation
As in previous work on unsupervised semantic parsing, I evaluate my model
using purity, collocation and their harmonic mean F1 . Purity is the ratio of
largest overlaps of the clusters with the gold, and collocation is the ratio of
largest overlaps of the gold with the clusters:
\mathrm{PU} = \frac{1}{N} \sum_i \max_j |G_j \cap C_i|  \qquad (4.1)

\mathrm{CO} = \frac{1}{N} \sum_j \max_i |G_j \cap C_i|  \qquad (4.2)

F_1 = 2\, \frac{\mathrm{PU} \times \mathrm{CO}}{\mathrm{PU} + \mathrm{CO}}  \qquad (4.3)
where C_i is the set of instances in the i-th induced cluster, G_j is the set of
instances in the j-th gold cluster, and N is the total number of instances.
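These metrics can be computed directly from two flat label assignments; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def purity_collocation_f1(induced, gold):
    """Equations 4.1-4.3 from two arrays of cluster labels."""
    induced, gold = np.asarray(induced), np.asarray(gold)
    n = len(gold)
    overlap = np.array([[np.sum((induced == c) & (gold == g))
                         for g in np.unique(gold)]
                        for c in np.unique(induced)])
    pu = overlap.max(axis=1).sum() / n  # best gold overlap per induced cluster
    co = overlap.max(axis=0).sum() / n  # best induced overlap per gold cluster
    return pu, co, 2 * pu * co / (pu + co)
```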
4.1 Discovering Preposition Relations
To discover preposition relations, I trained the model on the prepositions in
a large parsed dataset of FrameNet sentences (Bauer, Fürstenau, and Rambow, 2012). This resulted in 360469 preposition instances. Then I evaluated
it with classes from Srikumar and Roth, 2013a.
4.1.1 Data and Evaluation
The majority of FrameNet sentences is taken from the British National Corpus. Bauer, Fürstenau, and Rambow, 2012, parsed all sentences using the
Stanford dependency parser, version 2.0.1, and aligned the FrameNet annotations to the parses. I used the head and dependent of these prepositions
in the automatic parses as governor and object, resulting in a vocabulary of
53367 words.
Features From these parses I extracted a set of features that were inspired
by Srikumar and Roth, 2013b, who extend the feature set from Tratz and
D. Hovy, 2009. Because my aim is to induce a fully unsupervised and
language-independent classifier, I did not use any features based on WordNet, a thesaurus, or language-specific heuristics. My feature set therefore
consists of only word, part-of-speech, capitalization indicator and Brown
cluster for the governor and object. Like Srikumar and Roth, 2013b, I use
both these base features, and the features conjoined with the preposition to
be classified. This resulted in a set of 320763 features. Evaluating a logistic
regression classifier on this feature set gives a cross-validation accuracy of
0.814.
Labels Then, I evaluated the clustering using datasets of preposition supersenses. Using the preposition 33-class supersense mapping from Srikumar and Roth, 2013a, I associated each preposition usage from the SemEval2007
challenge dataset (Litkowski and Hargraves, 2007) to a relation from the
inventory. This dataset consists of 14857 preposition annotations, on sentences from the FrameNet corpus. I treated this dataset as the gold standard
to evaluate against using the metrics described above.
Interestingly, this is not the only supersense inventory based on the
senses from the Preposition Project (Litkowski and Hargraves, 2005). The
parser described by Tratz and E. Hovy, 2011, uses a 29-class inventory of
preposition supersenses, but these are not described in any publication. However, the supersenses are defined in the data supplied with the parser. Many
preposition senses from the Preposition Project are matched to a specific
‘major cluster’, exactly like in Srikumar and Roth, 2013a. However, these
are incomplete: some preposition senses are not matched to any relation.
Therefore, I chose not to report any results on this relation inventory.
4.1.2 Baselines
Trivial Baselines In order to place the results in perspective, I include a
number of trivial baselines in the results table. Every preposition can be
assigned a random class from a set of some size (Random) or the same
single class (One Class).
Another trivial baseline is assigning each of the 33 most frequent prepositions its own class (Preposition Classes). This is equivalent to assuming that every preposition has a single meaning, and that there is no overlap in word senses between different prepositions. We shall see that this
assumption is stronger than we might like to imagine.
Non-Negative Matrix Factorization Non-Negative Matrix Factorization
is a decomposition method that assumes the data and components are non-negative. It finds a decomposition of the inputs into two matrices of non-negative elements, by optimizing for the squared Frobenius norm. When
using a low-rank decomposition, the components can be seen as classes
that explain the data. The implementation that I used alternates the minimization between both matrices using coordinate descent and is initialized
by nonnegative double singular value decomposition.
Latent Dirichlet Allocation Latent Dirichlet Allocation (Blei and Hoffman, 2010) is a generative model that infers latent classes from a set of sparse inputs. Every input is associated with a mixture of classes (or ‘topics’) that explain the distribution of the input features. It is trained only on the input features, and does not assume that the classes express a relation between arguments.
For my experiments, I used the logistic regression, NMF and LDA implementations present in the Python package scikit-learn (Pedregosa et al.,
2011). The reconstruction-error minimization models were implemented in
Theano (Bergstra et al., 2010).
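Both clustering baselines reduce to a few lines of scikit-learn. In the sketch below, X stands for the sparse instance-by-feature matrix; the hyperparameter settings are assumptions.

    from sklearn.decomposition import NMF, LatentDirichletAllocation

    # Coordinate descent with NNDSVD initialization, as described above;
    # 33 components to match the supersense inventory size (an assumption).
    nmf = NMF(n_components=33, init="nndsvd", solver="cd")
    nmf_clusters = nmf.fit_transform(X).argmax(axis=1)

    lda = LatentDirichletAllocation(n_components=33, learning_method="online")
    lda_clusters = lda.fit_transform(X).argmax(axis=1)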
(A) Categorical model

Size   Instance                     p(r|x)
1/18   clack behind him             1.000
       smile on him                 1.000
       confided in him              1.000
       note for them                1.000
       trod on him                  1.000
1/23   padded across room           0.999
       lurched across room          0.998
       rustled around room          0.993
       sauntering around room       0.985
       tiptoeing into room          0.983
1/26   heights above he             0.993
       Festival in June             0.991
       ownership of goods           0.984
       criticized as breach         0.982
       threw from bucket            0.974
1/27   wander along road            0.989
       stamped down drive           0.977
       hidden inside skirts         0.976
       disapproval at behaviour     0.951
       meditates on meaning         0.948
1/28   descended on one             0.993
       skip from side               0.984
       take round get               0.975
       steps through woods          0.969
       remarks from Jefferson       0.956

(B) Selectional Preference model

Size   Instance                     p(r|x)
1/17   waded among crowds           0.999
       clattered to ground          0.998
       slithered along veins        0.994
       wound around margins         0.993
       hobbled to side              0.980
1/18   dwelt with me                0.996
       renting from him             0.996
       arched at him                0.993
       glanced about him            0.986
       smirked at him               0.986
1/21   certainty about Russia       0.998
       pilgrimage to Grantchester   0.997
       pilgrimage around world      0.997
       consultant in Aberdeen       0.997
       people in Vienna             0.997
1/23   wonderful with orchestras    0.999
       unfriendly to outsiders      0.996
       free for members             0.994
       investigations by groups     0.986
       distinguish between couples  0.967
1/24   draped over table            0.996
       festooned with fruits        0.996
       rested on table              0.996
       reigned over Palestine       0.996
       followed towards table       0.995

(C) Bilinear model

Size   Instance                     p(r|x)
1/12   swimming in gravy            0.999
       aspire to comfort            0.997
       sidled through door          0.981
       aspire to consciousness      0.978
       stared across table          0.971
1/20   walk around Bolton           0.987
       resignation on 2             0.975
       skip from side               0.968
       excitement in eyes           0.954
       revenge on enemies           0.953
1/20   cut through corridor         0.997
       skulking among vegetation    0.995
       reduce to month              0.982
       ambled along riverbank       0.932
       swished over sand            0.919
1/21   hem of jacket                0.989
       torn into bits               0.976
       decayed into shapes          0.971
       brilliance of writing        0.969
       happened before death        0.961
1/27   head of latter               0.957
       view through haze            0.956
       suspicions against him       0.940
       displeasure in voice         0.926
       astonished at achieved       0.925

FIGURE 4.1: The five most confident predictions for the five largest clusters induced by the reconstruction-error minimization models
Model                 PU    CO    F1
Random                0.20  0.05  0.08
One Class             0.20  1.00  0.33
Preposition Classes   0.48  0.58  0.52

FIGURE 4.2: Trivial Baselines
                        -prep (classifier: 0.814)   +prep (classifier: 0.816)
Model                   PU    CO    F1              PU    CO    F1
NMF                     0.26  0.42  0.32            0.47  0.47  0.47
LDA                     0.34  0.40  0.37            0.42  0.53  0.47
Categorical             0.23  0.16  0.19            0.26  0.18  0.21
Selectional Preference  0.24  0.20  0.21            0.24  0.19  0.21
Bilinear                0.22  0.30  0.25            0.21  0.37  0.27

FIGURE 4.3: Preposition Relation Induction Results. The left side excludes the preposition feature template, the right side includes it. All reconstruction-error models are trained with an entropy regularization term. The scores in brackets are cross-validated accuracy scores of a logistic regression classifier.
Model                     PU    CO    F1
No entropy term
  Categorical             0.24  0.50  0.32
  Selectional Preference  0.23  0.49  0.31
  Bilinear                0.20  0.52  0.29
Annealed entropy term
  Categorical             0.26  0.20  0.23
  Selectional Preference  0.23  0.20  0.21
  Bilinear                0.21  0.56  0.30

FIGURE 4.4: Preposition Relation Induction Results, with varying entropy term. Here, no preposition feature template was used.
[Confusion matrix images: (A) Preposition Classes, (B) NMF]
FIGURE 4.5: Confusion matrices for the Preposition Classes baseline, and for NMF when the preposition feature template is used.
4.1.3 Results
From table 4.2, we can see that the prepositions themselves correspond strongly to the annotated classes in the inventories. Inspecting the clustering baselines (NMF and LDA), I found that the features with the most weight were always frequently occurring features conjoined to the preposition. Indeed, when adding the preposition itself as a feature template, we can observe striking results.
Comparing the left and right side of table 4.3, it is evident that the preposition feature template has profound effects on unsupervised learning performance. However, when training a supervised logistic regression classifier, the difference in performance is much smaller: the cross-validated accuracy is 0.814 without, and 0.816 with, the preposition feature. Additionally, from the confusion matrices of the Preposition Classes baseline and the clustering baselines (figure 4.5), it is clear that NMF creates a clustering very similar to the Preposition Classes. Therefore it is safe to say that these clusterings use the strong signal from the prepositions themselves to cluster the data. The type of features that lead to good clustering performance might be very different from the features that lead to good classification performance. Nevertheless, neither of the baselines is able to beat the Preposition Classes baseline, which is based on the naïve assumption that prepositions are not polysemous.
The models based on reconstruction-error minimization also fail to induce relations that correspond to the annotations. I have tried to shed light on the interaction between encoder and decoder training through two additional experiments. In table 4.4, I report the clustering scores for two variations of the models. In the first, the models are trained without the entropy term. In the second, the entropy term is gradually removed using a sigmoidal annealing schedule. Both variants generally end up with a lower number of predicted relations, which is reflected in higher collocation scores. However, it is the purity of the models that we are most interested in, and this is unfortunately hardly affected.
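A sigmoidal annealing schedule of this kind can be written as follows; the midpoint and steepness values are assumptions, not the exact settings used.

    import math

    def entropy_weight(step, total_steps, steepness=10.0):
        # Weight of the entropy term: decays smoothly from ~1 to ~0,
        # with the transition centred halfway through training.
        progress = step / total_steps
        return 1.0 - 1.0 / (1.0 + math.exp(-steepness * (progress - 0.5)))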
Qualitative evaluation Why do the reconstruction-error models induce clusters that differ so much from the annotations? In table 4.1, we can inspect the largest clusters, and the training instances that belong to them with the highest probability.
For the Categorical model, the largest two clusters seem to be heavily skewed towards a single unique object (‘him’ and ‘room’). These trivial clusters are successful in reconstructing the highest-frequency words, but do not correspond to any interesting semantics. The other clusters do not seem to reflect distinct concepts either.
The clusters in the Selectional Preference model have clearer semantics. The largest cluster has a clear location focus. However, here too there are clusters with trivial interpretations: the second-largest cluster again almost exclusively groups instances based on the object word. Interestingly, the third-largest cluster has clearly geographically-oriented object matches, but doesn’t express the relation that captures them. The fourth is clearly about groups of people, but again doesn’t match an actual relation.
The clusters in the Bilinear model are very hard to interpret. They sometimes tend towards a location or non-location emphasis, but there is no clear pattern. It is likely that the bilinear model is able to encode subtle vector-space operations in its matrices that optimize argument reconstruction, but do not lead to interpretable relations.
4.2 Paraphrasing Noun-Noun Compounds with Preposition Relations
To investigate whether it is possible to transfer preposition relations to noun-noun compounds, I trained several classifiers to predict a preposition from a pair of nouns. In particular, I wanted to know whether the factorization of the preposition occurrences would correspond to human paraphrases of noun-noun compounds.
First, I factorized the noun pairs of every noun-preposition-noun occurrence in the parsed dataset of FrameNet sentences (Bauer, Fürstenau, and Rambow, 2012). Then I evaluated the suitability of those factorizations for noun-noun pairs. This was done by comparing the noun-noun pair preposition paraphrases from Bos and Nissim, 2015 — the results of the Burgers game in Wordrobe — to the expansion of the pairs for every preposition factorization.
I also trained a number of classification baselines with which to compare
the factorizations.
4.2.1 Model
For this problem, I created a classifier based on the factorization of the noun-noun co-occurrences. Like the bilinear decoder, the word pairs are factorized by finding a matrix for every relation that encodes the compatibility of two word vectors. However, for this problem the matrix is induced by optimizing the prediction of the preposition from the noun pair.
[Bar chart: one bar per preposition (about, across, against, among, around, at, before, between, by, for, from, in, into, of, on, over, per, through, toward, towards, under, via, with, within); proportions range from 0.0 to 0.6]
FIGURE 4.6: Differences in preposition distribution between Burgers and FrameNet

More formally, I maximized the likelihood of the labels in the dataset:

$$p(r \mid n_1, n_2) = \frac{\exp\left(u_{n_1}^{\top} C_r\, u_{n_2}\right)}{\sum_{r' \in R} \exp\left(u_{n_1}^{\top} C_{r'}\, u_{n_2}\right)}$$
For the word embeddings, I explored two options: random initialization, and using fixed pre-trained representations.
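As a concrete sketch of this bilinear scoring (the thesis models were implemented in Theano; this numpy illustration and all names and shapes in it are assumptions):

    import numpy as np

    def preposition_probs(U, C, n1, n2):
        # U: (V, d) word embeddings; C: (R, d, d), one matrix per relation;
        # n1, n2: vocabulary indices of the two nouns.
        scores = np.einsum("d,rde,e->r", U[n1], C, U[n2])  # u_n1^T C_r u_n2
        scores -= scores.max()                             # numerical stability
        exp = np.exp(scores)
        return exp / exp.sum()                             # p(r | n1, n2)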
4.2.2 Data and Evaluation
The training set consisted of the 125448 noun-preposition-noun triples in
the parsed FrameNet corpus, for the 14 prepositions that were used by Bos
and Nissim, 2015. This results in a vocabulary of 30236 different nouns.
The Burgers annotations from Bos and Nissim, 2015 are not disambiguated: the noun-noun compounds are paraphrased using different prepositions by different annotators. Therefore, I also evaluate against the distribution of annotations, using the cross-entropy between the annotation distributions and the model prediction distribution for all classes. Table 4.7
shows some examples of the Burgers paraphrases.
The training accuracy was computed by 3-fold cross-validation. The
Burgers accuracy was computed by taking the preposition annotated most
often as the gold label.
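A sketch of these two Burgers scores, assuming per-compound annotation counts and model prediction distributions as inputs (all names here are illustrative):

    import numpy as np

    def burgers_scores(pred_dist, annotation_counts):
        # pred_dist: (N, 14) model distributions over the prepositions;
        # annotation_counts: (N, 14) annotator counts per preposition.
        gold = annotation_counts.argmax(axis=1)      # most frequent annotation
        accuracy = (pred_dist.argmax(axis=1) == gold).mean()
        ann = annotation_counts / annotation_counts.sum(axis=1, keepdims=True)
        # Summed cross-entropy; a hard zero prediction on an annotated
        # preposition yields the infinite score of the majority-class baseline.
        mask = ann > 0
        with np.errstate(divide="ignore"):
            cross_entropy = -(ann[mask] * np.log(pred_dist[mask])).sum()
        return accuracy, cross_entropy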
4.2.3 Baselines
The most frequent preposition in the dataset is of, but the training data is
much more skewed towards of than the Burgers annotations (see table 4.6).
Noun compound       Paraphrase
chemical company    company in chemical(s)
rebel hideout       hideout for rebel(s)
grain exports       exports on grain(s)
hunger strike       strike with hunger(s)
fashion houses      houses of fashion(s)
opium production    production of opium(s)
level officials     officials on level(s)
drug trafficker     trafficker in drug(s)
sting operation     operation for sting(s)
Government forces   forces of Government(s)

FIGURE 4.7: Examples of noun-noun paraphrases from Burgers
Logistic Regression The first baseline is a logistic regression model that
predicts a preposition from the two nouns. It uses separate parameters for
a noun as the governor or as the object.
RESCAL RESCAL is an approximate tensor factorization method based on an alternating least-squares algorithm. It has previously been used to factorize YAGO (Nickel, Tresp, and Kriegel, 2012). In this experiment, I performed a grid search over possible values of the ‘rank’ hyperparameter, which indicates the size of the word representations. Through cross-validation, a rank of 8 performed best, and that is the value that I report results on. To make the Bilinear model with random initialization comparable to RESCAL, I constrained it to use word representations of the same size. It is, however, still possible that the bilinear model would perform differently with a different rank setting.
Word Vector Classifier To test the effectiveness of the word representations themselves, I trained another logistic regression model on the concatenated word vectors of the nouns, i.e. 80-dimensional dense vectors.
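This baseline amounts to a few lines with scikit-learn; U, pairs and y below are assumed names for the embedding matrix, the noun-index pairs and the preposition labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # U: (V, 40) pre-trained embeddings; pairs: (N, 2) noun indices;
    # y: (N,) preposition labels. All of these are illustrative names.
    X = np.hstack([U[pairs[:, 0]], U[pairs[:, 1]]])  # 80-dimensional inputs
    clf = LogisticRegression(multi_class="multinomial", solver="lbfgs").fit(X, y)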
4.2.4 Results
From table 4.9, it is evident that a simple logistic regression model is hard to beat using factorization for classification. Often these models do not even outperform the majority-class baseline. However, factorizing the relations using the pre-trained word embeddings does lead to a distribution that is more similar to the human annotations. That seems to indicate that this model generalizes noun-(preposition)-noun relations in a way that is closer to human intuitions.
However, no matter how well the single-relation assumption for prepositions performs (as we saw in the previous section), the fact remains that prepositions are polysemous. This is also highlighted by Bos and Nissim, 2015, who stress that prepositional paraphrases of noun-noun compounds are not a panacea for relation annotation. In the next section, I will suggest an evaluation for joint noun-noun and preposition relation induction that strives to overcome this single-relation assumption.
Logistic Regression
  of:    kilograms of half; arrest of mass; aircraft of drone; staff of member; mine of coal
  in:    cases in tetanus; relay in torch; rates in interest; relations in trade; sectors in telecommunications
  for:   donations for cash; output for Electricity; items for food; reforms for market; offenses for press
  on:    freedoms on press; grounds on commission; charges on fraud; side on home; record on police
  with:  rating with bond; spokesman with hospital; roofs with tin; attendants with flight; burst with ice

RESCAL
  of:    body of world; home of way; reasons of health; sentence of life; water of waste
  in:    period in year; body in world; war in world; number in world; sentence in life
  for:   sentences for life; period for year; war for world; leaders for world; state for century
  on:    side on home; war on world; period on year; state on century; view on eye
  with:  number with world; state with century; sentences with life; visits with family; troops with government

Bilinear model
  of:    rates of poverty; base of force; supply of water; sports of roller; organization of relief
  in:    force in police; resources in nickel; organizations in relief; systems in weather; sugar in cane
  for:   facilities for detention; plant for power; economy for world; warrant for arrest; rates for kidnapping
  on:    dollars on aid; cases on corruption; workers on aid; company on chemical; limits on term
  with:  package with stimulus; protesters with student; ministry with health; rooms with hotel; officials with company

Log. Reg. using U
  of:    management of sound; roofs of tin; programs of destruction; arrest of mass; destruction of mass
  in:    line in subway; network in subway; system in subway; students in college; seats in college
  for:   opportunities for job; training for job; development for job; camp for training; seminar for training
  on:    teams on morning; event on slam; system on warning; injury on shoulder; strain on shoulder
  with:  planes with fighter; houses with fashion; tournament with tennis; injuries with cord; prosecutor with chief

Bilinear model using U
  of:    management of sound; roofs of tin; programs of destruction; ore of iron; prison-break of mass
  in:    line in subway; system in subway; network in subway; seats in college; students in college
  for:   opportunities for job; development for job; training for job; camp for training; seminar for training
  on:    teams on morning; event on slam; system on warning; injury on shoulder; strain on shoulder
  with:  planes with fighter; houses with fashion; tournament with tennis; injuries with cord; conversation with phone

FIGURE 4.8: Top five paraphrases for the top five prepositions, using noun compounds from Burgers. Bold prepositions indicate the paraphrase is present in the Burgers corpus.
Model                        CV     Burgers Acc.   Cross-entropy
Random                       0.04   0.04           3499
Majority Class               0.59   0.34           ∞
Logistic Regression          0.67   0.25           2933
RESCAL (rank 8)              0.57   0.32           3048
Bilinear model (rank 8)      0.32   0.23           2981
Logistic Regression using U  0.59   0.30           2276
Bilinear model using U       0.42   0.34           1960

FIGURE 4.9: Results for paraphrasing noun-noun compounds with preposition relations. CV is the cross-validation accuracy on the training set. U is a set of 40-dimensional pre-trained word embeddings.
Qualitative evaluation From table 4.8, we can see the kind of noun compounds that are most associated with the top five prepositions in every
model. While this is not the same as the actual predictions made by these
models, it gives an intuition for the distribution that they encode.
An interesting observation we can make is that RESCAL is biased towards high-frequency nouns, such as world, life, year and health.
It is also clear that both models that use the pre-trained word embeddings U are very similar, which indicates that the Bilinear model is primarily capturing independent relations instead of the joint distribution. However, though it does not show in the top five paraphrases, the low cross-entropy score from table 4.9 shows that the dependence assumption still contributes to a distribution over paraphrases that is closer to the annotations.
4.3 Discovering Combined Preposition and Noun-Noun Relations
To discover relations for both prepositions and noun-noun compounds, I trained on both phrase types in the large parsed dataset of FrameNet sentences (Bauer, Fürstenau, and Rambow, 2012). Then I evaluated the correspondence between the preposition and noun classes using the Wordrobe annotations (Bos and Nissim, 2015). Because human annotators paraphrased the noun compounds using prepositions, both the noun compound and its paraphrase should express the same relation. This assumption allows me to evaluate the combined relation discovery.
4.3.1 Data and Evaluation
The corpus yielded 504596 examples, with a total of 409178 features. Of
these examples, 357390 were triggered by a preposition and the remaining
147206 were noun-noun compounds. The vocabulary consisted of 74557
words.
The overlap is measured as whether the noun compound and the preposition paraphrase from Bos and Nissim, 2015 are clustered into the same relation. For this I used all raw annotations, reflecting the complete set of paraphrases.

Model                   Overlap   Entropy
Random                  0.19      2.46
NMF                     0.00      1.49
LDA                     0.05      1.35
Categorical             0.55      1.73
Selectional Preference  0.63      2.88
Bilinear                0.27      2.62

FIGURE 4.10: Results for combined noun-noun and preposition relation discovery. ‘Overlap’ is the proportion of noun compound and preposition paraphrase pairs from Bos and Nissim, 2015 that are assigned the same relation. ‘Entropy’ is the entropy of the distribution of predictions.
To put this measure into perspective, it is important to also keep track of the distribution over relations that the model predicts for the input. If all noun compounds and prepositional paraphrases are clustered into the same relation, the overlap will be total but meaningless. Therefore, I report the entropy of the empirical relation distribution: the lower the entropy, the smaller the number of relations into which the input is clustered.
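Both measures are simple to compute; in the sketch below, `predict` stands for whatever maps an instance to an induced relation id, and is an assumption.

    import numpy as np
    from collections import Counter

    def overlap(predict, compounds, paraphrases):
        # Fraction of (compound, paraphrase) pairs given the same relation.
        same = [predict(c) == predict(p)
                for c, p in zip(compounds, paraphrases)]
        return sum(same) / len(same)

    def prediction_entropy(predictions):
        # Entropy of the empirical distribution of predicted relations.
        counts = np.array(list(Counter(predictions).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log(p)).sum())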
4.3.2 Results
On this task, the baseline models do not induce a combined set of relations for prepositions and noun-noun compounds. In contrast, the reconstruction-error minimization models do induce relations that are shared. Their relatively high entropy indicates that several different relations are predicted, and that there is an overlap between the noun-noun compounds and their paraphrases. However, this overlap is modest, and it is especially small for the bilinear model, which expresses the joint probability of the arguments.
Qualitative evaluation As in table 4.1, we can see from table 4.11 that several clusters for the Categorical model are based primarily on single words. The second, third and fifth largest clusters are associated with which, part and is, respectively. The top instances for the largest cluster are a mix of temporal instances and the word back, which hints at the fact that the other clusters might be mixed as well for less probable instances.
The top instances of the clusters in the Selectional Preference model are based on the word family, parse errors, (geographical) names, the word way, and (personal) names, respectively. Interestingly, they all have low probability, which means that the encoder has high entropy and spreads its predictions. This allows the decoder to spread its chances of reconstructing the correct word. In a way, the different clusters are combined by the decoder to assign the reconstruction a higher probability. However, as we saw in table 4.4, removing the entropy term from the objective only significantly raises collocation (i.e. decreases the number of predicted clusters) without increasing the purity of the correspondence with annotations. This model does, however, clearly combine the preposition and noun-compound instances into shared clusters.
The Bilinear model again does not provide clearly interpretable clusters. Only the first is clearly biased, towards years. While some seem to correspond to parsing errors, others have no internal consistency. As in the Selectional Preference model, the predictions have low probability.
Overall, it seems that the models combine ad-hoc optimal ways to reconstruct specific groups of arguments, without expressing clear semantics of their own. If there were a way to ‘purify’ the decoder classes, this might increase their correspondence to annotations.
(A) Categorical model

Size   Instance                     p(r|x)
1/16   broke on back                0.866
       started on years             0.787
       left for months              0.712
       had after months             0.690
       shed at back                 0.687
1/19   made in which                0.941
       seen in which                0.704
       made at which                0.701
       made along length            0.682
       made on on                   0.580
1/20   part of Germany              0.740
       part of community            0.719
       part of site                 0.718
       awareness of work            0.714
       part of France               0.713
1/24   supply NN money              0.977
       range of services            0.928
       range of products            0.908
       transfers NN money           0.870
       range of ovens               0.856
1/25   is in way                    0.941
       is in place                  0.813
       is of course                 0.756
       is against policy            0.644
       is in artefact               0.634

(B) Selectional Preference model

Size   Instance                     p(r|x)
1/4    doctor NN family             0.068
       process NN family            0.063
       dog NN family                0.062
       tradition NN family          0.062
       man like father              0.061
1/5    failures of 10               0.089
       appellant in and             0.083
       role in and                  0.083
       flow out to                  0.080
       music of and                 0.078
1/6    Canyon in Utah               0.084
       half-askew NN White          0.083
       break-up with White          0.081
       cliffs at Stackpole          0.081
       Yorkshire NN West            0.080
1/7    way of framing               0.229
       way of expressing            0.221
       way of saving                0.214
       way of accomplishing         0.212
       way of alerting              0.211
1/10   bag NN Tesco                 0.046
       stories NN Holmes            0.045
       shoulders of Couples         0.045
       love of Simpson              0.044
       goal from Manley             0.044

(C) Bilinear model

Size   Instance                     p(r|x)
1/9    began in 1990                0.191
       began in 1991                0.181
       sat around him               0.171
       moved in 1852                0.169
       began in 1996                0.167
1/9    deterrent against Iraq       0.085
       is against it                0.080
       questions from human         0.076
       is without doubt             0.065
       throws on paraffin           0.065
1/15   is out certainly             0.065
       top of peak                  0.063
       disclose without breaching   0.063
       print by roll                0.062
       is without doubt             0.062
1/16   turned since long            0.133
       lived for in                 0.125
       turned at which              0.117
       seen in which                0.113
       lived in which               0.113
1/19   Isles of Skye                0.067
       title of care                0.065
       lot like sports              0.063
       suntanned than youth         0.062
       help in evaluating           0.060

FIGURE 4.11: The five most confident predictions for the five largest clusters induced by the reconstruction-error minimization models on both prepositions and noun-noun compounds
Chapter 5
Discussion
The possible relations between two words that are expressed by prepositions or noun compounds are difficult to capture in a manually-created inventory. Nonetheless, it would be useful to be able to parse these parts of a sentence with a computer program in order to extract the information in them for further processing. In this thesis, I have explored the automatic unsupervised induction of relations for prepositions and noun compounds in several experiments. Because of its expressiveness, I focused on the joint prediction and factorization of relations in a reconstruction-error minimization framework.
In the first experiment, I evaluated several approaches to finding relations for prepositions that resembled a set of gold-standard labels based on preposition supersenses. However, none of the unsupervised learning approaches was able to outperform the naive and incorrect assumption that every preposition always expresses one unique relation. When the prepositions were included as features, the resulting clusterings were merely more similar to the Preposition Classes baseline. Using the reconstruction-error minimization framework, I was also unable to induce classes that roughly corresponded to the gold-standard labels.
In the second experiment, I evaluated the scheme of prepositions-as-relations by training classifiers to paraphrase noun-noun compounds using prepositions. In particular, I explored the performance of the bilinear decoder, the most expressive decoder from the reconstruction-error minimization framework. Here, a simple logistic regression model outperformed the factorization approaches on the training set, but none of the approaches predicted prepositions similarly to a dataset of paraphrases. However, the distribution over prepositions predicted by the bilinear model using pre-trained embeddings was closest to the distribution of paraphrase annotations.
In the third experiment, I induced one set of relations for both prepositions and noun compounds in several models. Here, the independent-argument model based on word embeddings often predicted the same relation for noun compounds and their preposition paraphrases, while the baselines hardly ever did. Nevertheless, the evaluation does not elucidate whether those relations correspond to concepts that are useful for semantic parsing.
5.1 Future Work
5.1.1 Task-oriented evaluation
The main challenge for open-domain unsupervised approaches is the lack
of task-oriented evaluation. The motivation in this work for jointly modeling the arguments of the relation was to use the signal from the combination
of arguments to train a semantic parser. If such a signal can replace supervised training, semantic parsers can be trained for far larger domains and
applications. However, if we abandon manual annotations we must find a
higher-level way to check the performance of the parser.
Question answering tasks are messy, but probably unavoidable. Recently, there has been progress in creating large datasets for question answering based on machine reading (Hermann et al., 2015). This type of
research invites unsupervised semantic parsing to be as empirical as possible, and evaluate its assumptions on real-world data.
5.1.2
Variational Methods
Abstractly, this thesis concerned a discrete latent variable model for a joint
distribution. Variational methods are a more principled approach to latent
variable models that would be intractable without approximations. However, it is still generally an open problem how to estimate the parameters of
a model with a discrete latent variable.
Most variational approaches based on reconstruction error use Gaussian latent variables. Bowman et al., 2015 introduce a variational sequence-to-sequence LSTM, using Gaussian latent variables. They anneal the weight
of the variational ‘regularization’, in order to avoid local minima in which
the system encodes a general language model without using the latent variable. This is related to the problem of avoiding single-cluster discrete latent variables. Miao, Yu, and Blunsom, 2015 introduce a generic variational
inference framework for generative and conditional models of text, using
Gaussian latent variables. They train a neural variational document model
by minimizing the reconstruction error.
For discrete latent variables, the variance in the learning signal is often
reduced using control variates. Gu et al., 2015 present MuProp, an unbiased
gradient estimator for stochastic networks, using discrete latent variables.
Their control variate corresponds to the first-order Taylor expansion around
the learning signal for some fixed input. However, this is mainly relevant
when computing the full distribution for the encoder is intractable, which
is not the case for log-linear encoders.
5.1.3 Larger Dataset
Most unsupervised NLP approaches (e.g. Mikolov, Chen, et al., 2013; Riedel et al., 2013; Yao, Riedel, and McCallum, 2012) train on huge datasets in order to find meaningful representations. There is always the chance that my dataset was simply not large enough to find meaningful relations. If the reconstruction-error minimization framework were to be trained on a similar task but using more data, the models should be trained using candidate sampling (section 3.4.3).
5.1.4 Word Sense Induction and Clustering
One approach that I did not cover in this thesis is disambiguating prepositions into their word senses as a first step, and clustering those senses afterwards. This is most related to the supervised approaches (such as Srikumar and Roth, 2013b; D. Hovy, Tratz, and E. Hovy, 2010). It might be possible to induce preposition senses in an unsupervised manner (either with general-purpose word sense induction methods such as Bartunov et al., 2015, or preposition-oriented ones such as D. Hovy, Vaswani, et al., 2011), after which the individual senses are clustered in an unsupervised manner to end up with a classification similar to Srikumar and Roth, 2013a. However, this would not couple the disambiguation and clustering into a single process. It would be an interesting research direction to compare both approaches.
5.2 Conclusion
The main research question of this thesis concerned the unsupervised induction of semantic relations expressed by prepositions and noun phrases. Specifically, I was interested in the performance of expressive models based on reconstruction-error minimization. While prepositions are undeniably polysemous, the baseline assumption that they are not is hard to beat. However, using those prepositions as labels to train classifiers for noun-noun compound paraphrases does not result in predictions that correspond strongly to human annotations. This indicates that neither the latent relations nor the surface relations are easy to capture. Therefore, relation induction in this setting remains an open problem.
Reconstruction-error minimization is a very general approach to unsupervised learning. The expressive models evaluated in this thesis do not induce relations that correspond to annotations, but there are many options left to explore. More sophisticated statistical methods that have been successful in latent structure prediction are a particularly promising development.
Much of the progress in semantic parsing has been due to the ability to learn from weaker supervision. The main challenge for open-domain unsupervised approaches is the lack of task-oriented evaluation. Being constrained to gold-standard annotations for the evaluation of relation induction models that should improve upon those annotations is a suboptimal state of affairs. While there is plenty of linguistic research possible on relation induction, the field of semantic parsing would benefit most from innovative application-based training and evaluation methods.
Bibliography
Ammar, Waleed, Chris Dyer, and Noah A Smith (2015). “Conditional Random Field Autoencoders for Unsupervised Structured Prediction”.
Baker, Collin F, Charles J Fillmore, and John B Lowe (1998). “The Berkeley FrameNet Project”. In: Proceedings of the 17th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, pp. 86–90.
Baldwin, Timothy, Valia Kordoni, and Aline Villavicencio (2009). “Prepositions in Applications: A Survey and Introduction to the Special Issue”.
In: Computational Linguistics 35.2, pp. 119–149. ISSN: 0891-2017.
Banko, Michele et al. (2007). “Open Information Extraction from the Web”.
In: Proceedings of IJCAI-07, the International Joint Conference on Artificial
Intelligence, pp. 2670–2676. ISSN: 00010782.
Bartunov, Sergey et al. (2015). “Breaking Sticks and Ambiguities with Adaptive Skip-gram”. In: arXiv: 1502.07257.
Bauer, Daniel, Hagen Fürstenau, and Owen Rambow (2012). “The Dependency-Parsed FrameNet Corpus”. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pp. 3861–3867.
Bengio, Yoshua et al. (2003). “A Neural Probabilistic Language Model”. In:
The Journal of Machine Learning Research 3, pp. 1137–1155. ISSN: 15324435.
arXiv: arXiv:1301.3781v3.
Bergstra, James et al. (2010). “Theano: a CPU and GPU Math Expression Compiler”. In: Proceedings of the Python for Scientific Computing Conference (SciPy).
Blei, David and M Hoffman (2010). “Online Learning for Latent Dirichlet
Allocation”. In: Neural Information Processing Systems.
Bordes, Antoine et al. (2012). “Joint learning of words and meaning representations for open-text semantic parsing”. In: International . . . 22, pp. 127–
135.
— (2013). “A semantic matching energy function for learning with multirelational data”. In: Machine Learning 94.2, pp. 233–259. ISSN: 0885-6125.
arXiv: arXiv:1301.3485v1.
Bos, Johan and Malvina Nissim (2015). “Uncovering Noun-Noun Compound
Relations by Gamification”. In: Nordic Conference of Computational Linguistics . . . i.
Bowman, Samuel R. et al. (2015). “Generating Sentences from a Continuous
Space”. In: pp. 1–13. arXiv: 1511.06349.
Carlson, Andrew, Justin Betteridge, and Bryan Kisiel (2010). “Toward an
Architecture for Never-Ending Language Learning”. In: Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 1306–1313.
ISSN : 1098-2345.
Daiber, Joachim et al. (2015). “Splitting Compounds by Semantic Analogy”.
In: arXiv: 1509.04473.
Daumé III, Hal, John Langford, and Daniel Marcu (2009). “Search-based
structured prediction”. In: Machine Learning 75.3, pp. 297–325. ISSN: 08856125.
arXiv: 0907.0786.
Dima, Corina (2015). “Reverse-engineering Language: A Study on the Semantic Compositionality of German Compounds”, pp. 1637–1642.
Dima, Corina and Erhard Hinrichs (2015). “Automatic Noun Compound
Interpretation using Deep Neural Networks and Word Embeddings”.
In: pp. 173–183.
Downing, Pamela (1997). “On the creation and use of English noun compounds”. In: Language, pp. 810–842.
Duchi, John, Elad Hazan, and Yoram Singer (2011). “Adaptive Subgradient
Methods for Online Learning and Stochastic Optimization”. In: Journal
of Machine Learning Research 12, pp. 2121–2159. ISSN: 15324435. arXiv:
arXiv:1103.4296v1.
Fader, Anthony, Stephen Soderland, and Oren Etzioni (2011). “Identifying
relations for open information extraction”. In: Proceedings of the Conference on . . . Pp. 1535–1545. ISSN: 1937284115.
Galárraga, Luis et al. (2014). “Canonicalizing Open Knowledge Bases”. In:
Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management - CIKM ’14, pp. 1679–1688.
Girju, Roxana (2009). “The Syntax and Semantics of Prepositions in the Task
of Automatic Interpretation of Nominal Phrases and Compounds: A
Cross-Linguistic Study”. In: Computational Linguistics 35.2, pp. 185–228.
ISSN: 0891-2017.
Gu, Shixiang et al. (2015). “MuProp: Unbiased Backpropagation for Stochastic Neural Networks”. In: 2014, pp. 1–11. arXiv: 1511.05176.
Harris, Zellig S (1954). “Distributional structure”. In: Word.
Hartrumpf, Sven, Hermann Helbig, and Rainer Osswald (2006). “Semantic
interpretation of prepositions for NLP applications”. In: Computational
Linguistics April, pp. 29–36.
Hermann, Karl Moritz et al. (2015). “Teaching Machines to Read and Comprehend”. In: arXiv, pp. 1–13. arXiv: arXiv:1506.03340v1.
Hovy, Dirk, Stephen Tratz, and Eduard Hovy (2010). “What’s in a Preposition? Dimensions of Sense Disambiguation for an Interesting Word
Class”. In: Coling, pp. 454–462.
Hovy, Dirk, Ashish Vaswani, et al. (2011). “Models and Training for Unsupervised Preposition Sense Disambiguation”. In: Computational Linguistics, pp. 323–328.
Kok, Stanley and Pedro Domingos (2008). “Extracting semantic networks
from text via relational clustering”. In: Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes
in Bioinformatics) 5211 LNAI.PART 1, pp. 624–639. ISSN: 03029743.
Krishnamurthy, Jayant and Tom M Mitchell (2011). “Which Noun Phrases
Denote Which Concepts ?” In: Acl 15213, pp. 570–580.
Lang, Joel and Mirella Lapata (2010). “Unsupervised Induction of Semantic
Roles”. In: In Human Language Technologies: The 2010 Annual Conference
of the North American Chapter of the ACL, Los Angeles, California, June 2010, pp. 939–947.
Lauer, Mark (1996). “Designing Statistical Language Learners: Experiments
on Noun Compounds”. In: Computing, p. 214. arXiv: 9609008 [cmp-lg].
Levi, Judith N (1978). The syntax and semantics of complex nominals. Academic
Press.
Levy, Omer and Yoav Goldberg (2014). “Neural Word Embedding as Implicit Matrix Factorization”. In: Advances in Neural Information Processing
Systems, pp. 1–9.
Lewis, Mike and Mark Steedman (2013). “Combined Distributional and
Logical Semantics”. In: Transactions of the ACL 1, pp. 179–192.
Liang, Percy, Michael I. Jordan, and Dan Klein (2012). “Learning DependencyBased Compositional Semantics”. In: Computational Linguistics, pp. 1–94.
ISSN : 0891-2017. arXiv: 1109.6841v1.
Lin, Dekang and Patrick Pantel (2001). “DIRT – Discovery of Inference Rules from Text”. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’01),
pp. 323–328.
Litkowski, Ken and Orin Hargraves (2005). “The preposition project”. In:
. . . on the Linguistic Dimensions of Prepositions . . .
— (2007). “SemEval-2007 Task 06: Word-Sense Disambiguation of Prepositions”. In: Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pp. 24–29.
Manning, Christopher D and Hinrich Schütze (1999). Foundations of statistical natural language processing. MIT press.
Miao, Yishu, Lei Yu, and Phil Blunsom (2015). “Neural Variational Inference
for Text Processing”. In: pp. 1–14. arXiv: 1511.06038.
Mikolov, Tomas, Kai Chen, et al. (2013). “Efficient Estimation of Word Representations in Vector Space”. In: pp. 1–12. arXiv: 1301.3781.
Mikolov, Tomas, Ilya Sutskever, et al. (2013). “Distributed Representations
of Words and Phrases and their Compositionality”. In: Advances in Neural Information Processing Systems, pp. 3111–3119. arXiv: arXiv:1310.
4546v1.
Mnih, Andriy (2013). “Learning word embeddings efficiently with noise-contrastive estimation”. In: NIPS, pp. 1–9.
Nakashole, Ndapandula and Tom M Mitchell (2015). “A Knowledge-Intensive
Model for Prepositional Phrase Attachment”. In: Proceedings of the 53rd
Annual Meeting of the Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), pp. 365–375. ISBN: 9781941643723.
Nakov, Preslav (2007). “Using the Web as an Implicit Training Set : Application to Noun Compound Syntax and Semantics”. In: pp. 1–405.
Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel (2012). “Factorizing YAGO”. In: Proceedings of the 21st International Conference on World
Wide Web, pp. 271–280.
O’Hara, Tom and Janyce Wiebe (2003). “Preposition semantic classification
via Penn Treebank and FrameNet”. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4. Association for Computational Linguistics, pp. 79–86.
Pedregosa, F et al. (2011). “Scikit-learn: Machine Learning in Python”. In:
Journal of Machine Learning Research 12, pp. 2825–2830.
Pennington, Jeffrey, Richard Socher, and Christopher D Manning (2014).
“GloVe: Global Vectors for Word Representation”. In: Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing.
Riedel, Sebastian et al. (2013). “Relation Extraction with Matrix Factorization and Universal Schemas”. In: Proceedings of the 2013 Conference of
the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies June, pp. 74–84.
Schneider, Nathan et al. (2015). “A Hierarchy with, of, and for Preposition
Supersenses”.
Séaghdha, Diarmuid Ó (2007). “Designing and evaluating a semantic annotation scheme for compound nouns”. In: Proc. Corpus Linguistics, pp. 1–
17.
Srikumar, Vivek and Christopher D Manning (2014). “Learning Distributed
Representations for Structured Output Prediction”. In: Advances in Neural Information . . . Pp. 1–9.
Srikumar, Vivek and Dan Roth (2013a). “An Inventory of Preposition Relations”. In: arXiv: 1305.5785.
— (2013b). “Modeling semantic relations expressed by prepositions”. In:
1.1, pp. 231–242.
Titov, Ivan and Ehsan Khoddam (2015). “Unsupervised Induction of Semantic Roles within a Reconstruction-Error Minimization Framework”.
In: pp. 1–12. arXiv: arXiv:1412.2812v1.
Titov, Ivan and A. Klementiev (2012). “A Bayesian approach to unsupervised semantic role induction”. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 12–22.
Tratz, Stephen and Dirk Hovy (2009). “Disambiguation of preposition sense
using linguistically motivated features”. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium June, pp. 96–100.
Tratz, Stephen and Eduard Hovy (2010). “A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation”. In: Proceedings
of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 678–687.
— (2011). “A Fast, Accurate, Non-Projective, Semantically-Enriched Parser”.
In: Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP ’11) 2010, pp. 1257–1268.
Turian, Joseph, Lev Ratinov, and Yoshua Bengio (2010). “Word representations: A simple and general method for semi-supervised learning”. In: ACL, pp. 384–394.
Vanderwende, Lucy (1996). The Analysis of Noun Sequences using Semantic
Information Extracted from On-Line Dictionaries. Tech. rep. MSR-TR-95-57.
Microsoft Research, p. 312.
Warren, Beatrice (1978). “Semantic patterns of noun-noun compounds”. In:
Acta Universitatis Gothoburgensis. Gothenburg Studies in English Goteborg
41, pp. 1–266.
Weston, Jason et al. (2013). “Connecting Language and Knowledge Bases
with Embedding Models for Relation Extraction”. In: arXiv: 1307.7973.
Yao, Limin, Sebastian Riedel, and Andrew McCallum (2012). “Unsupervised relation discovery with sense disambiguation”. In: Proceedings of
ACL-2012, pp. 712–720.
Yates, Alexander and Oren Etzioni (2009). “Unsupervised methods for determining object and relation synonyms on the web”. In: Journal of Artificial Intelligence Research 34, pp. 255–296. ISSN: 10769757. arXiv: 1401.
5696.