Version 0.8.7
March 2009
www.proxem.com
Contents

Natural Language Processing .................................................................................. 3
    What is Natural Language Processing? .............................................................. 3
    Some NLP applications ...................................................................................... 3
    Some problems faced by NLP systems ............................................................... 4
Antelope .................................................................................................................. 5
    What is Antelope? ............................................................................................... 5
    Design .................................................................................................................. 6
Lexicon .................................................................................................................... 9
    Introduction to an electronic lexicon ................................................................... 9
    Main interfaces .................................................................................................. 10
    ILexicon interface .............................................................................................. 11
    ISynset interface ................................................................................................ 13
    ILemma interface ............................................................................................... 17
    Lexicon Explorer sample ................................................................................... 19
    User-defined lexicon ......................................................................................... 21
Text analysis ......................................................................................................... 24
    Introduction ....................................................................................................... 24
    Performance vs. precision ................................................................................. 26
Tagging & chunking ............................................................................................. 28
    Introduction ....................................................................................................... 28
    Main interfaces .................................................................................................. 29
    ITagger interface ............................................................................................... 29
    IChunker interface ............................................................................................. 31
    Tagging and chunking sample ........................................................................... 33
Parsing .................................................................................................................. 34
    Introduction to dependencies and constituents .................................................. 34
    Main interfaces .................................................................................................. 35
    Parsing samples ................................................................................................. 35
    Some code ......................................................................................................... 38
Deep syntax and predicates .................................................................................. 41
    Deep syntax ....................................................................................................... 41
    Time and space complements ............................................................................ 42
    Predicates .......................................................................................................... 43
    Main interfaces .................................................................................................. 43
    Sample code ...................................................................................................... 43
Semantic parsing ................................................................................................... 47
    Frames ............................................................................................................... 47
    Thematic roles ................................................................................................... 47
    Selectional restrictions ...................................................................................... 48
    Sample application demo .................................................................................. 48
    Interfaces ........................................................................................................... 49
    Sample code ...................................................................................................... 49
Handling documents ............................................................................................. 52
    Introduction ....................................................................................................... 52
    Documents and Processing resources ................................................................ 52
    Document context detection .............................................................................. 55
    Coreference resolution ...................................................................................... 56
    Word sense disambiguation .............................................................................. 58
    Annotations ....................................................................................................... 59
    Formatters ......................................................................................................... 59
Other samples ....................................................................................................... 60
    NanoProlog ....................................................................................................... 60
    Proxem Web Search .......................................................................................... 60
    Paraphrases detection ........................................................................................ 62

© 2006-2009 Proxem
Natural Language Processing
What is Natural Language Processing?
“Natural Language Processing (NLP) is a subfield of artificial intelligence
and computational linguistics. It studies the problems of automated
generation and understanding of natural human languages. […] Natural
language understanding systems convert samples of human language into
more formal representations that are easier for computer programs to
manipulate.” 1
Some NLP applications
Machine Translation
This is one of the most important applications of Natural Language
Processing. Translation is an activity comprising the interpretation of the
meaning of a text in one language (the source text) and the production, in
another language, of a new and equivalent text (the target text): the
translation. Traditionally, translation has been a human activity, although
attempts have been made to automate and computerize the translation of
natural language texts (machine translation) or to use computers as an aid to
translation (computer-assisted translation).
Information Retrieval
Information retrieval (IR) is the science of searching for information in
documents, searching for documents themselves, searching for metadata that
describe documents, or searching within databases, whether relational standalone databases or hypertext networked databases such as the Internet or
intranets, for text, sound, images or data.
Information Extraction
Information extraction (IE) is a type of information retrieval whose goal is to
automatically extract structured or semi-structured information from
unstructured machine-readable documents. A typical example is the
extraction of information on corporate merger events, whereby instances of
the relation ”MERGE (company1, company2, date)” are extracted
from online news (“Yesterday, New-York based Foo Inc. announced their
acquisition of Bar Corp.”).
1 This citation (and some other text fragments you can read here) comes from Wikipedia, the free encyclopedia.
A typical subtask of IE is Named Entity Recognition: recognition of entity
names (for people and organizations), place names, temporal expressions,
and certain types of numerical expressions (currency amounts…).
Automatic Summarization
Automatic summarization is the creation of a shortened version of a text by a
computer program. The product of this procedure still contains the most
important points of the original text.
Speech Recognition
Speech recognition is the process of converting a speech signal (i.e. voice) to
a set of words, by means of an algorithm implemented as a computer
program.
Some problems faced by NLP systems
Sentence Boundary Disambiguation
What is a sentence? At first glance, it is a sequence of words ending with a
dot. However, for a filename with an extension (foo.pdf) or an IP address
(127.0.0.1), this rule is not accurate enough. We could refine it: a final dot
either is the last character of the input, or is followed by a space.
Nevertheless, this improved rule still fails on “John F. Kennedy”. Therefore,
splitting a text into sentences is a not-so-trivial task.
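These rules are easy to try out. The following stand-alone C# snippet (not part of Antelope) implements the improved “dot followed by a space” rule with a regular expression, and shows that it still cuts “John F. Kennedy” in two:

```csharp
using System;
using System.Text.RegularExpressions;

class NaiveSplitter
{
    // The "improved rule": split after a dot only when it is followed
    // by whitespace (so "foo.pdf" and "127.0.0.1" survive intact).
    public static string[] Split(string text)
    {
        return Regex.Split(text, @"(?<=\.)\s+");
    }

    static void Main()
    {
        // OK: the inner dots are not followed by a space
        foreach (string s in Split("Open foo.pdf at 127.0.0.1 now. Then stop."))
            Console.WriteLine(s);
        // Wrong: the sentence is cut right after the abbreviation "F."
        foreach (string s in Split("John F. Kennedy was a president."))
            Console.WriteLine(s);
    }
}
```

Handling abbreviations correctly requires lexical knowledge (or statistics), which is precisely what NLP toolkits provide.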
Lexical ambiguity
Many words have more than one meaning; we have to select the meaning
that makes the most sense in context. (A “bank” can be a financial institution
or a part of a river.) The task of finding the right sense within a context is
called Word Sense Disambiguation. Lexical ambiguity may also involve one
word having two parts of speech, or homonyms. (“Time flies like an arrow.”)
Syntactic ambiguity
The grammar of natural languages is ambiguous, i.e. there are often
multiple possible parse trees for a given sentence. (“The man saw the boy
with the telescope”: is it the man or the boy who has the telescope?)
Choosing the most appropriate one usually requires contextual information.
Semantic ambiguity
Understanding the context is even more necessary to solve semantic
ambiguities. For instance, incorrect anaphora resolution can lead to
misunderstanding (in “every farmer who owns a donkey beats it”, is the
pronoun “it” a reference to “farmer” or “donkey”?). Identifying the logical
subject is also sometimes difficult (in “John asks his mother to do that”, is it
John or his mother who is supposed to perform the action?).
Antelope
What is Antelope?
The acronym ANTELOPE stands for Advanced Natural Language
Object-oriented Processing Environment. This framework makes the
development of Natural Language Processing software easier than ever.
Download
You can download the Antelope framework from www.proxem.com. You
will also find on this web site a comprehensive list of resources about NLP.
Version
Antelope is currently in beta version 0.8. This document presents what is in
the box.
The medium-term objective of Antelope (version 1.0, expected in early
2010) is to offer a true, easy-to-use semantic parser. It will first be applied to
an encyclopedia (a large subset of Wikipedia seems to be a nice target) to
make complex queries possible at a semantic level.
WHAT IS NEW? Compared to the former version 0.7, the new features are:
• A very versatile user-defined lexicon feature,
• (Partial) support for the French language,
• Tested under Windows XP, Vista 32 bits and Vista 64 bits (x86),
• Multi-threading support,
• An all-new, ribbon-based Graphical User Interface,
• An improved chunker,
• Word Sense Disambiguation (experimental),
• Deep syntax extraction,
• Paraphrase extraction (very experimental).
Design
Development platform
Antelope is designed for the Microsoft .NET framework (version 2.0 and
above). Therefore, you can use it with C#, Visual Basic.NET, Delphi.NET
and many other .NET compliant languages (even COBOL.NET!).
Supported natural languages
Only English is fully supported for the moment, since English has many
available linguistic resources. We also (partially) support French: a parser is
available (the TagParser) and we have integrated WOLF (WordNet libre du
français, a subset of WordNet in French).
WHAT IS COMING NEXT? We plan to fully (i.e. at a semantic level) support
French in late 2009 and to start supporting another European language
(probably Spanish) afterwards.
Levels of complexity
In standard computing (i.e. when no NLP is needed), basic text processing
mainly consists of splitting a text into tokens (using, for instance, the
String.Split method of the .NET framework).
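For instance, with plain .NET (no Antelope involved), such basic tokenization is a one-liner; note that it knows nothing about punctuation:

```csharp
using System;

class TokenizeDemo
{
    static void Main()
    {
        // Basic tokenization: split on spaces, no linguistic knowledge at all
        string[] tokens = "The man saw the boy.".Split(' ');
        Console.WriteLine(tokens.Length); // 5 tokens
        // The final token is "boy." -- the dot sticks to the word,
        // which is one reason real NLP tokenizers do more work
        foreach (string token in tokens)
            Console.WriteLine(token);
    }
}
```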
In the NLP field, more complex processing is required. Typical NLP tasks,
in order of increasing complexity, are:
• Tagging: finding the part of speech of single words (is a word a noun, a verb, an adjective, a determiner…?),
• Chunking: grouping together words belonging to the same constituent (typically separating noun groups and verb groups),
• Parsing: organizing constituents through dependencies or syntactic trees,
• Deep parsing: abstracting away surface syntactic phenomena, for instance finding the logical subject in the case of a relative clause or a passive verbal construction,
• Semantic parsing: finding the true meaning, or semantics, of a sentence or a full text, by identifying the thematic roles in complex constructions and disambiguating the words in a sentence.
The first two levels are often qualified as “shallow parsing”. They are very
fast compared to in-depth parsing, but less accurate. Antelope supports all
these levels of processing.
Overall design
Antelope is fully object-oriented. It supports an interface-based
programming model. Each module (lexicon, tagger, parser, etc.) defines
standard interfaces and many components can implement these interfaces.
The help file Proxem.Antelope.Interfaces.chm (present in the
doc subdirectory) describes all Antelope interfaces.
Antelope consists of the following assemblies (stored in the bin
subdirectory). This diagram shows the dependencies between them:
• Proxem.Antelope.Interfaces: the programmatic interfaces, described in the next chapter,
• Proxem.Antelope.Lexicon: the implementation of the lexicon (mainly a wrapper for accessing WordNet 2.1 data with high performance),
• Proxem.Antelope.Tagging: the implementation of part-of-speech taggers, including a wrapper for accessing the SSTagger (C++ DLL Tagger.dll),
• Proxem.Antelope.LinkGrammar: a .NET wrapper for accessing the Link Grammar parser (DLL LinkGrammar.dll),
• Proxem.Antelope.Stanford: a .NET wrapper for accessing the Stanford Parser (via StanfordParser.dll and Stanford-Parser.jar.dll, which uses the IKVM library: IKVM.Runtime.dll and IKVM.GNU.Classpath.dll),
• Proxem.Antelope.Coreferences: a detector of pronominal and nominal anaphora and a coreference resolver,
• Proxem.Antelope.Semantics: a basic semantic parser using VerbNet (experimental version),
• Proxem.Antelope.Disambiguation: tools for Word Sense Disambiguation,
• Proxem.Antelope.Tools: the implementation of many general-purpose components, including a detector of time and space features (experimental version),
• Proxem.NanoProlog: the implementation of a tiny PROLOG interpreter for .NET,
• Lithium: a graph library producing trees,
• HtmlLib: an HTML parser for .NET.
Multi-threading support
Antelope components now support multi-threading. This means that they
can be used safely on a Web server, and can unleash the power of your
dual-core or quad-core processor, as shown on the following performance graph:
WHAT IS COMING NEXT? We hope to support the Task Parallel Library
(TPL), the task parallelism component of the Parallel Extensions to the .NET
FX.
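The usage pattern is the standard one for shared, thread-safe components: build the (expensive) component once, then let several threads query it concurrently. The sketch below is a stand-alone illustration of that pattern using .NET 2.0 threading primitives; its Analyze method is a hypothetical stand-in (it merely counts tokens) for an Antelope component such as a tagger or parser:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class MultiThreadDemo
{
    // Hypothetical stand-in for a thread-safe Antelope component:
    // here it just counts whitespace-separated tokens.
    public static int Analyze(string sentence)
    {
        return sentence.Split(' ').Length;
    }

    static void Main()
    {
        string[] sentences = {
            "The cat sat on the mat",
            "John saw the boy with the telescope",
            "Time flies like an arrow"
        };
        int total = 0;
        List<Thread> threads = new List<Thread>();
        foreach (string sentence in sentences)
        {
            string s = sentence; // capture a fresh variable per iteration
            Thread t = new Thread(delegate()
            {
                int n = Analyze(s);
                Interlocked.Add(ref total, n); // thread-safe accumulation
            });
            threads.Add(t);
            t.Start();
        }
        foreach (Thread t in threads) t.Join();
        Console.WriteLine(total); // 6 + 7 + 5 = 18 tokens in all
    }
}
```

An actual Antelope component would be substituted for Analyze; the surrounding thread management stays the same.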
Lexicon
Introduction to an electronic lexicon
The Antelope Lexicon’s design is inspired by WordNet, a well-known
semantic lexicon for English. The Antelope Lexicon also includes data that
comes from the SUMO ontology.
Synsets
English words are grouped into sets of synonyms called synsets. Each synset
provides short, general definitions, and records the various semantic
relations between these synonym sets.
Lemmas
A lemma is the canonical form of each word describing a synset. For
instance, the lemma of the plural noun “cars” is “car”. WordNet version 2.1
contains about 150,000 words organized in over 115,000 synsets for a total
of 207,000 word-sense pairs.
Relations
Semantic relations connect most synsets to other synsets. These relations
depend on the type of word, and include:
• Nouns:
  o Y is a hypernym of X if every X is a (kind of) Y; for instance, “cat” is a kind of “feline”,
  o Y is a hyponym of X if every Y is a (kind of) X,
  o Y is a holonym of X if X is a part of Y; for instance, “France” is part of “Europe”,
  o Y is a meronym of X if Y is a part of X,
• Verbs:
  o The verb Y is a hypernym of the verb X if the activity X is a (kind of) Y (“travel” to “movement”),
  o The verb Y is a troponym of the verb X if the activity Y is doing X in some manner (“lisp” to “talk”),
  o The verb Y is entailed by X if by doing X you must be doing Y (“snoring” by “sleeping”),
• Adjectives:
  o Related nouns,
  o Participle of verb,
• Adverbs:
  o Root adjectives.
While semantic relations apply to all members of a synset because they share
a meaning and are all mutual synonyms, lexical relations can also connect
words to other words: for instance, antonyms (opposites of each other) and
derivationally related words.
Performance issue
The Antelope framework does not use WordNet files directly. For
performance reasons, WordNet files have been preprocessed into a single
file Proxem.Lexicon.dat (present in the data subdirectory), which is
designed to be loaded into memory. As all data are cached into memory,
searches are amazingly fast.
WHAT IS COMING NEXT? A future version of Antelope will contain additional
lexicons (for computing, health, finance, companies…). We also plan a full
integration of the SUMO ontology.
Main interfaces
[Class diagram: the ILexicon, ISynset, ILemma, ISynsetRelation and ILemmaRelation interfaces; ISynset and ILemma derive from IConcept.]
ILexicon interface
The single object that must be instantiated is the lexicon itself, which
supports the ILexicon interface. Its standard implementation is in the
assembly Proxem.Antelope.Lexicon. It can be instantiated in the
following manner:
C#
ILexicon lexicon = new Lexicon();
lexicon.LoadDataFromFile("Proxem.Lexicon.dat", null /* no progress bar */);
VB.NET
Dim lexicon As ILexicon = New Lexicon()
lexicon.LoadDataFromFile("Proxem.Lexicon.dat", Nothing) ' no progress bar
Please note that in order to run the following code samples, you should:
• Add a using (C#) or Imports (VB.NET) directive to reference the namespace Proxem.Antelope.Lexicon,
• Install the NUnit test framework, which defines the Assert class (you can download it from http://www.nunit.org).
The lexicon methods are:
• GetBaseForms: find the possible base forms of an inflected word for a given part-of-speech.
C#
IList<IInflectedWord> words;
words = lexicon.GetBaseForms("glasses", PartOfSpeech.Noun);
Assert.AreEqual(2, words.Count);
// The optical instrument, always in plural form
Assert.AreEqual("glasses", words[0].BaseWord);
Assert.AreEqual(EnglishWordForm.BaseForm, words[0].Form);
// The plural form of a singular noun
Assert.AreEqual("glass", words[1].BaseWord);
Assert.AreEqual(EnglishWordForm.Plural, words[1].Form);
VB.NET
Dim words As IList(Of IInflectedWord)
words = lexicon.GetBaseForms("glasses", PartOfSpeech.Noun)
Assert.AreEqual(2, words.Count)
' The optical instrument, always in plural form
Assert.AreEqual("glasses", words(0).BaseWord)
Assert.AreEqual(EnglishWordForm.BaseForm, words(0).Form)
' The plural form of a singular noun
Assert.AreEqual("glass", words(1).BaseWord)
Assert.AreEqual(EnglishWordForm.Plural, words(1).Form)
• GetInflectedForms: generate all inflected forms of a base word of a given part-of-speech.
C#
// The past tense of the verb "to abide" is "abided" or "abode"
IList<string> words;
words = lexicon.GetInflectedForms("abide", EnglishWordForm.PastTense);
Assert.AreEqual(2, words.Count);
Assert.IsTrue(words.IndexOf("abided") >= 0);
Assert.IsTrue(words.IndexOf("abode") >= 0);
// The plural of the noun "cat" is "cats"
words = lexicon.GetInflectedForms("cat", EnglishWordForm.Plural);
Assert.AreEqual(1, words.Count);
Assert.AreEqual("cats", words[0]);
VB.NET
' The past tense of the verb "to abide" is "abided" or "abode"
Dim words As IList(Of String)
words = lexicon.GetInflectedForms("abide", EnglishWordForm.PastTense)
Assert.AreEqual(2, words.Count)
Assert.IsTrue(words.IndexOf("abided") >= 0)
Assert.IsTrue(words.IndexOf("abode") >= 0)
' The plural of the noun "cat" is "cats"
words = lexicon.GetInflectedForms("cat", EnglishWordForm.Plural)
Assert.AreEqual(1, words.Count)
Assert.AreEqual("cats", words(0))
• FindSenses: find all possible senses of a word (given in its base form, obtained with a call to the GetBaseForms method).
C#
IList<ILemma> senses;
senses = lexicon.FindSenses("glass", PartOfSpeech.Noun);
// The noun "glass" has 7 senses
Assert.AreEqual(7, senses.Count);
// We display each sense definition, using the synset of the lemma
foreach (ILemma lemma in senses)
Console.WriteLine(lemma.Synset.Definition);
VB.NET
Dim senses As IList(Of ILemma)
senses = lexicon.FindSenses("glass", PartOfSpeech.Noun)
' The noun "glass" has 7 senses
Assert.AreEqual(7, senses.Count)
' We display each sense definition, using the synset of the lemma
For Each lemma As ILemma In senses
Console.WriteLine(lemma.Synset.Definition)
Next
• NounRoots: this property returns the roots of the noun ontology of the lexicon (a single node, "entity").
C#
IList<ISynset> nounRoots = lexicon.NounRoots;
Assert.AreEqual(1, nounRoots.Count);
Assert.AreEqual("entity", nounRoots[0].Lemma.Text);
VB.NET
Dim nounRoots As IList(Of ISynset) = lexicon.NounRoots
Assert.AreEqual(1, nounRoots.Count)
Assert.AreEqual("entity", nounRoots(0).Lemma.Text)
• VerbRoots: this property returns the roots of the verb ontology of the lexicon (557 nodes grouped by lexical category).
C#
IList<ISynset> verbRoots = lexicon.VerbRoots;
Assert.AreEqual(557, verbRoots.Count);
VB.NET
Dim verbRoots As IList(Of ISynset) = lexicon.VerbRoots
Assert.AreEqual(557, verbRoots.Count)
• GetHeadIndexOfNoun, GetHeadIndexOfAdjective, GetHeadIndexOfVerb: find the head word in a multi-word expression (which can be expressed as a list of strings, or better, as a list of IWord).
C#
// Head word of "big band" is "band"
string[] words = "big band".Split(' ');
int headWord = lexicon.GetHeadIndexOfNoun(words);
Assert.AreEqual(1, headWord);
// Head word of "get up" is "get"
words = "get up".Split(' ');
headWord = lexicon.GetHeadIndexOfVerb(words);
Assert.AreEqual(0, headWord);
VB.NET
' Head word of "big band" is "band"
Dim words As String() = "big band".Split(" "c)
Dim headWord As Integer = lexicon.GetHeadIndexOfNoun(words)
Assert.AreEqual(1, headWord)
' Head word of "get up" is "get"
words = "get up".Split(" "c)
headWord = lexicon.GetHeadIndexOfVerb(words)
Assert.AreEqual(0, headWord)
• FindSynset: find a given synset by its part-of-speech and ID (this method is mainly for internal use).
ISynset interface
Properties and methods
An ISynset has many properties describing it:
• PartOfSpeech: the part-of-speech of the synset; its possible values are noun, verb, adjective or adverb (as synsets are open-class words),
• SynsetId: an integer identifier (from WordNet 3.0) that is unique within a part-of-speech (for instance, the noun “entity” and the adjective “able” both have SynsetId = 1740),
• Definition: a string gloss defining the synset,
• Examples: a string containing example sentences that show usages of the synset,
• Lemmas: a list of lemmas (word base forms) describing the synset,
• Relations: a list of all the relations that the synset has with other synsets; this list can be restricted to a particular relation using the RelatedSynsets method,
• InformationContent: an integer with the cumulated frequency of the synset’s lemmas (recursively including hyponyms). This value gives the “relative importance” of the synset from an ontological point of view,
• Domain: a string describing the ontology domain of the synset. This information comes from SUMO (Suggested Upper Merged Ontology).
You can explore relations around the synset using the following methods:
• RelatedSynsets: this method returns a list of synsets having a relation of a given type with the current synset,
• OrderedHypernyms: this property returns an array in which all hypernyms (ancestors) of the current synset are flattened. As multiple inheritance can be used, OrderedHypernyms gives an easy-to-process flat view of the graph hierarchy,
• InheritsFrom: this method indicates whether the current synset inherits (directly or not) from another synset, or from a well-known ontological node.
The following program sample uses all these properties and methods:
C#
// We find the "cat" senses
IList<ILemma> senses = lexicon.FindSenses("cat", PartOfSpeech.Noun);
// The noun "cat" has 8 senses
Assert.AreEqual(8, senses.Count);
// Let's examine the first sense
ISynset trueCat = senses[0].Synset;
Assert.IsTrue(trueCat.Definition.StartsWith("feline mammal usually having thick soft fur"));
// True cat has one direct hypernym (ancestor)
IList<ISynset> trueCatDirectAncestors = trueCat.RelatedSynsets(RelationType.Hypernym);
Assert.AreEqual(1, trueCatDirectAncestors.Count);
// True cat's ancestor is "feline"
Assert.AreEqual("feline", trueCatDirectAncestors[0].Lemma.Text);
// True cat has 2 direct hyponyms: "domestic cat" and "wild cat"
IList<ISynset> trueCatDirectDescendants = trueCat.RelatedSynsets(RelationType.Hyponym);
Assert.AreEqual(2, trueCatDirectDescendants.Count);
// True cat has 12 ancestors ("feline", "carnivore", "placental", "mammal",
// "vertebrate", "chordate", "animal", "organism", "living thing",
// "object", "physical entity", "entity")
IList<ISynset> trueCatAllAncestors = trueCat.OrderedHypernyms;
Assert.AreEqual(12, trueCatAllAncestors.Count);
foreach (ISynset ancestor in trueCatAllAncestors)
{
    Console.WriteLine(ancestor.Lemma.Text);
}
// True cat inherits from "organism"
Assert.IsTrue(trueCat.InheritsFrom(OntologyNode.OrganismOrBeing));
// True cat does NOT inherit from "person"
Assert.IsFalse(trueCat.InheritsFrom(OntologyNode.Person));
// How similar are "true cat" and "jaguar"?
ILemma jaguar = lexicon.FindSenses("jaguar", PartOfSpeech.Noun)[0];
double similarJag = trueCat.Lemma.GetSimilarity(jaguar,
    SimilarityMeasure.MutualInformationContent);
Assert.Greater(similarJag, 0.75); // ~0.779 => very similar
// How similar are "true cat" and "motorcar"?
ILemma car = lexicon.FindSenses("motorcar", PartOfSpeech.Noun)[0];
double similarCar = trueCat.Lemma.GetSimilarity(car,
    SimilarityMeasure.MutualInformationContent);
// similarCar ~0.20 => not really similar
// "true cat" is much more similar to "jaguar" than to "motorcar"
Assert.Greater(similarJag, similarCar);
// What are the (direct) "parts" of a "motorcar"?
IList<ISynset> carParts = car.RelatedSynsets(RelationType.HasPart);
Assert.AreEqual(29, carParts.Count);
foreach (ISynset carPart in carParts)
{
    // accelerator, air bag, auto accessory...
    Console.WriteLine(carPart.Lemma.Text);
}
VB.NET
' We find the "cat" senses
Dim senses As IList(Of ILemma)
senses = lexicon.FindSenses("cat", PartOfSpeech.Noun)
' The noun "cat" has 8 senses
Assert.AreEqual(8, senses.Count)
' Let's examine the first sense
Dim trueCat As ISynset = senses(0).Synset
Assert.IsTrue(trueCat.Definition.StartsWith( _
"feline mammal usually having thick soft fur"))
' True cat has one direct hypernym (ancestor)
Dim trueCatDirectAncestors As IList(Of ISynset)
trueCatDirectAncestors = trueCat.RelatedSynsets(RelationType.Hypernym)
Assert.AreEqual(1, trueCatDirectAncestors.Count)
' True cat ancestor is "feline"
Assert.AreEqual("feline", trueCatDirectAncestors(0).Lemma.Text)
' True cat has two direct descendants: "domestic cat" and "wild cat"
Dim trueCatDirectDescendants As IList(Of ISynset)
trueCatDirectDescendants = trueCat.RelatedSynsets(RelationType.Hyponym)
Assert.AreEqual(2, trueCatDirectDescendants.Count)
' True cat has 12 ancestors ("feline", "carnivore", "placental", "mammal",
' "vertebrate", "chordate", "animal", "organism", "living thing",
' "object", "physical entity", "entity")
Dim trueCatAllAncestors As IList(Of ISynset)
trueCatAllAncestors = trueCat.OrderedHypernyms
Assert.AreEqual(12, trueCatAllAncestors.Count)
For Each ancestor As ISynset In trueCatAllAncestors
Console.WriteLine(ancestor.Lemma.Text)
Next
' True cat inherits from "organism"
Assert.IsTrue(trueCat.InheritsFrom(OntologyNode.OrganismOrBeing))
' True cat does NOT inherit from "person"
Assert.IsFalse(trueCat.InheritsFrom(OntologyNode.Person))
' How similar are "true cat" and "jaguar"?
Dim jaguar As ILemma = lexicon.FindSenses("jaguar", PartOfSpeech.Noun)(0)
Dim similarJag As Double
similarJag = trueCat.Lemma.GetSimilarity(jaguar, _
SimilarityMeasure.MutualInformationContent)
' similarJag ~0.779 => very similar
Assert.Greater(similarJag, 0.75)
' How similar are "true cat" and "motorcar"?
Dim car As ILemma
car = lexicon.FindSenses("motorcar", PartOfSpeech.Noun)(0)
Dim similarCar As Double
similarCar = trueCat.Lemma.GetSimilarity(car, _
SimilarityMeasure.MutualInformationContent)
' similarCar ~0.20 => not really similar
' "true cat" is much more similar to "jaguar" than to "motorcar"
Assert.Greater(similarJag, similarCar)
' What are the (direct) "parts" of a "motorcar"?
Dim carParts As IList(Of ISynset)
carParts = car.RelatedSynsets(RelationType.HasPart)
Assert.AreEqual(29, carParts.Count)
For Each carPart As ISynset In carParts
' accelerator, air bag, auto accessory...
Console.WriteLine(carPart.Lemma.Text)
Next
Synset nouns and verbs
A noun synset also supports the ISynsetNoun interface (you can use the
as operator to obtain it). This interface defines:
• DescendantCount: this integer property contains the recursive count of nouns that specialize the current noun, including instances (named entities),
• InstanceDescendantCount: this integer property contains the recursive count of instance nouns (named entities) that specialize the current noun,
• IsPartOf: this method indicates whether the current noun synset is part of another concept. It returns the path from the current synset to its owner concept, if such a path exists.
A verb synset also supports the ISynsetVerb interface. This interface
defines:
• SynsetVerbFrames: this property returns an array containing the typical frame constructions of the verb,
• DescendantCount: this integer property contains the recursive count of verbs that specialize the current verb.
IExtendedSynset contains parsed forms of the definition gloss and the
synset examples.
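A short, untested sketch in the style of the samples above (it assumes the same lexicon object as before; the as-operator usage follows the description above, and the printed values are illustrative):

```csharp
// Noun-specific view of the first "cat" sense, obtained with the as operator
ISynset trueCat = lexicon.FindSenses("cat", PartOfSpeech.Noun)[0].Synset;
ISynsetNoun catNoun = trueCat as ISynsetNoun;
// Recursive counts of nouns (and named-entity instances) under "cat"
Console.WriteLine(catNoun.DescendantCount);
Console.WriteLine(catNoun.InstanceDescendantCount);
// Verb-specific view of the first "travel" sense
ISynset travel = lexicon.FindSenses("travel", PartOfSpeech.Verb)[0].Synset;
ISynsetVerb travelVerb = travel as ISynsetVerb;
// Typical frame constructions of the verb
foreach (var frame in travelVerb.SynsetVerbFrames)
    Console.WriteLine(frame);
```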
ILemma interface
Properties and methods
A lemma has many properties describing it:
• PartOfSpeech: the part-of-speech of the lemma (noun, verb, adjective or adverb).
• SynsetId: an integer identifier (from WordNet 3.0) that is unique within a part-of-speech (the noun “entity” and the adjective “able” can both have SynsetId = 1740 because they belong to distinct part-of-speech classes).
• Text: a string property containing the base form of the word or collocation.
• WordNo: an integer property; the current lemma is the nth word describing a given concept. Remark: lemmas are ordered so that lemmas with lower WordNos have higher frequencies.
• SenseNo: an integer property; the nth sense of the word or collocation. Example: for the word “cat”, the true cat (“feline mammal”) is the first sense; the second sense is “an informal term for a youth or man”, etc.
• Frequency: an integer property that contains the mean appearance frequency of the lemma in the WordNet reference corpus. As many lemmas are over-represented (such as “U.S.”, for instance), this quantity must be used with caution.
• Relations: an array of all the relations that the lemma has with other concepts.
• GetSimilarity: this method indicates whether the current lemma is similar to another lemma. The similarity measure is based on either the mutual Information Content or a gloss overlapping (other metrics will be added in the future). The measure varies from 0.0 (absolutely not identical) to 1.0 (identical, i.e. same synset). For the mutual Information Content similarity measure, both synsets compared must be either nouns or verbs; gloss overlapping works for all parts of speech. As there exist many ways of comparing words, a similarity measure is highly dependent on a particular point of view. For instance, while “samurai” and “Japan” are not similar from a mutual Information Content point of view, they are much closer under a gloss overlapping distance.
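For illustration, a similarity check might be sketched as follows. This is only a sketch: the exact GetSimilarity signature, and the way the two lemmas are obtained from the lexicon, are assumptions and are not taken from this manual:

C#
// Sketch only: assumes "catLemma" and "dogLemma" are ILemma instances
// previously obtained from the lexicon, and that GetSimilarity takes the
// other lemma as its argument and returns a double between 0.0 and 1.0.
double similarity = catLemma.GetSimilarity(dogLemma);
if (similarity == 1.0)
{
    // Identical, i.e. both lemmas belong to the same synset
}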
Lemma nouns, verbs and adjectives
A noun lemma also supports the ILemmaNoun interface. This interface
defines:
• PluralForm: this morphology property gives the plural form of the current noun (“box” → “boxes”).
• IsPartOf: this method indicates if the current noun lemma is part of another concept.
A verb lemma also supports the ILemmaVerb interface. This interface
defines:
• GerundForm: this morphology property gives the gerund form of the verb (“go” → “going”).
• ThirdPersonSingularForm: this morphology property gives the third person singular form of the verb in the present tense (“go” → “goes”).
• PastTenseForm: this morphology property gives the past tense form of the verb (“go” → “went”).
• PastParticipleForm: this morphology property gives the past participle form of the verb (“go” → “gone”).
• VerbSentences: this property returns a list of strings with examples of sentences using the verb.
An adjective lemma also supports the ILemmaAdjective interface. This
interface defines:
• ComparativeForm: this morphology property gives the comparative form of the adjective (“fast” → “faster”, “comprehensive” → “more comprehensive”).
• SuperlativeForm: this morphology property gives the superlative form of the adjective (“fast” → “fastest”, “comprehensive” → “most comprehensive”).
• SyntacticMarker: this property indicates a limitation on the syntactic position the adjective may have in relation to the noun that it modifies.
Lexicon Explorer sample
You can launch the Lexicon Explorer sample with the File\Lexicon
menu. It gives an interactive experience of navigating through words.
“Search” tab
With the “Search” tab, you can search for a word. (You can also add a “%” wildcard at the end of the word.) In the following example, the left panel shows some senses of the noun “cat” lemma, while the right panel displays the synset of its first (most usual) sense. Note in the right panel that the display size of the hypernyms (ancestors) depends on their importance.
Using the “Ontology” tab, you can display WordNet nouns or verbs in an ontological view, in which only highly significant synsets are displayed.
Finally, with the “All words” tab, you can browse the whole hierarchy of nouns or verbs from their root.
“Similarity” tab
You can also display similarity between words (nouns or verbs):
“Meronymy” tab
You can also explore holonymy and meronymy (relations between wholes
and their parts).
The following example shows that “Paris” (i.e. Paris, France) belongs to Europe#1 (as a continent) and to Europe#2 (as an economic organization), and that the distance is equal to two (Paris → France → Europe):
“Display” tab
Finally, the Display tab shows various display options that can be changed
interactively:
User-defined lexicon
The Antelope User Lexicon feature uses a versatile XML file format. It makes it possible to add new synsets, new lemmas and new relations to the standard data of the lexicon (coming from WordNet 3.0).
You can use the LoadUserData method of the ILexicon interface to load a user-defined lexicon. This method accepts two arguments: the user-defined lexicon filename, and a boolean indicating whether the data must be validated.
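For instance (a minimal sketch, assuming that lexicon is an already-created ILexicon instance and that the sample file shipped in the data subdirectory is used):

C#
// Load the sample user-defined lexicon, with XML schema validation enabled
lexicon.LoadUserData(@"data\UserLexiconTest.xml", true);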
Optional validation schema
You will find a validation schema (UserLexicon.xsd) in the data
subdirectory. This XML schema can optionally be used to validate the user
XML data.
It is recommended to use validation, at least during the user lexicon
development stage. Loading is 25% faster without validation. The schema
must be in the same directory as the XML data.
User-defined lexicon structure
The user-defined lexicon file must be in XML format. To save disk space, it can be gzipped (.gz extension) instead of plain XML (.xml extension). Reading a gzipped file is only 5% slower than reading an XML file.
The root UserLexicon element defines some default attributes:
• DefaultWordNetVersion (default WN30): in a future release, it will be possible to indicate another (previous) WordNet version, and the mapping will be done automatically.
• DefaultLanguage (default English): lemmas can be added in many languages, not only English. If, for instance, you are creating a French mapping to WN, you can set it to French, and that will be taken into account as the default for all new lemmas.
• DefaultSynsetAction (default Select) and DefaultLemmaAction (default Insert): lemmas and synsets can be selected (to provide navigation) or inserted (to create new elements). These attributes give the default behavior.
You will find a basic sample, UserLexiconTest.xml, in the data
subdirectory. This sample demonstrates how to add new lemmas, new
synsets and new relations, and even new annotations.
Provided user-defined lexicons
To load a user-defined lexicon, you must first open a Lexicon Explorer, and
then choose the “Import user lexicon” option from the File menu. You will
find two (gzipped) user-defined lexicons in the data subdirectory:
• French.Wolf.Lexicon.xml.gz contains 44,200 new French lemmas (loading time: ~3 seconds). This resource is based on WOLF by Benoît Sagot (http://alpage.inria.fr/~sagot/wolf.html) and is under the Cecill-C (LGPL compatible) license. Please note that French lemmas are currently highlighted in green in the Lexicon Explorer.
• Wikipedia.Lexicon.xml.gz contains ~300,000 new synsets (loading time: ~50 seconds) with a (partial) mapping to the English Wikipedia. This resource is based on DBPedia (wiki.dbpedia.org) data (namely, the wordnetlink_en.nt file, licensed under the terms of the GNU Free Documentation License), enriched with new relations (for instance, a software product is linked, when possible, to its editor company with a see-also relation). Please note that terms linked to Wikipedia appear with a hyperlink.
If you develop a user-defined lexicon (in the medical, financial, tech…
field), we will be happy to integrate it in a future Antelope release.
Text analysis
Introduction
Syntactic analysis
Antelope enables many stages of text analysis, from the fastest to the most precise. The following overview presents the main syntactic analyzer interfaces. We will illustrate their action on the simple sentence “the small mice ate cheese”.
A tokenizer (ITokenizer interface) breaks up a text into its constituent tokens. It is quite like using the String.Split method, but with more complex splitting rules:

    {"the", "small", "mice", "ate", "cheese"}

A tagger (ITagger interface, which extends ITokenizer) associates to each token its part of speech, using its context:

    "the": determiner
    "small": adjective
    "mice": noun (plural)
    "ate": verb (past tense)
    "cheese": noun

A chunker (IChunker interface) groups words into chunks:

    {"the", "small", "mice"}: noun chunk
    {"ate"}: verb chunk
    {"cheese"}: noun chunk

A parser (IParser interface, which extends ITokenizer) organizes chunks to show the dependencies that exist between them, and also between their inner words:

    {"the", "small", "mice"}: noun chunk
        is subject of
    {"ate"}: verb chunk
        has as direct object
    {"cheese"}: noun chunk
As you can see, a tagger and a parser are also tokenizers. A parser can also be considered a tagger, since tagging is a subset of parsing, but keep in mind that parsing costs much more CPU time than tagging. When only taggers and chunkers are used, one often speaks of shallow parsing.
As you can see on the following figure:
• Each chunk node puts together one or many consecutive words,
• Chunk nodes are a linear view of a subset of the syntactic nodes produced by a parser.
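The whole chain described above can be sketched as follows (assuming that a tagger, a chunker and a parser have already been instantiated, as shown in the next chapters):

C#
// From the fastest, shallowest analysis to the slowest, most precise one:
IList<IWord> words = tagger.TagText("the small mice ate cheese");
IList<IChunk> chunks = chunker.ChunkText(words); // chunking works on tagged words
ISentence sentence = parser.ParseSentence("the small mice ate cheese");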
In the Antelope sample program, you can choose the syntactic analysis level
and the required component:
Semantic analysis
After the syntactic analysis stage, the Antelope sample lets you choose between several kinds of semantic analysis:
These options will be explained in detail later. For the moment, let’s just introduce them:
• Context: allows detection of the context of a text: what is it talking about?
• Coreferences: when there is a pronoun (“it”, “he”, “they”…) in a text, what antecedent is it referencing?
• Word sense disambiguation (experimental): what is the sense of a particular word, with respect to the lexicon, in the current context?
These three options can be used on a tagged, chunked or parsed text. The other options need a parsed text. To illustrate them, we will use the example sentence “the flowers were given to the girl by the boy”. Its parsed form is:
the flowers were given to the girl by the boy

with the following dependencies: det(flowers, the), nsubjpass(given, flowers), aux(given, were), prep(given, to), det(girl, the), pobj(to, girl), prep(given, by), det(boy, the), pobj(by, boy).
• Deep syntax: for instance, the chunk “the boy” is a prepositional phrase of “to give” from a syntactic point of view; but it is the logical subject of the verb from a deep syntax point of view:

the flowers were given to the girl by the boy

with the deep syntax dependencies: DirectObj(given, flowers), IndirectObj(given, girl), Subject(given, boy) through “by”.
• Predicates: the deep syntactic predicate is given(Subject: boy, DirectObj: flowers, IndirectObj: girl).
• Semantic frames: the associated frame (with identification of the thematic roles) is: “the flowers(Theme) were given(Verb) to the girl(Recipient) by the boy(Agent)” and the detailed semantics is:
  has_possession(start(E), AGENT=boy, THEME=flowers)
  has_possession(end(E), RECIPIENT=girl, THEME=flowers)
  transfer(during(E), THEME=flowers)
  cause(AGENT=boy, E)
• Time and space: some prepositional phrases are time or space complements. The system is smart enough to detect that “it was during World War II” refers to the 1939-45 period; and even better, that “it was during the end of World War II” should be around 1944-45.
• Sentiment analysis: what sentiments does someone feel when he/she reads the text? Is it globally positive or negative? (NB: this analyzer does not belong to the Antelope standard distribution.)
Performance vs. precision
Parsing vs. tagging/chunking
When you play with the parsers, you will notice that parsing a sentence of 15-20 words can take several seconds. Parsers are noticeably slower than taggers; they also have a large RAM footprint and are CPU-intensive. There is no easy way to improve their performance. So:
• Either you need precision and true syntactic parsing: you can consider using a server farm to parallelize parsing over many cores / many processors / many computers. (Recall that Antelope now supports multi-threading, experimentally.)
• Or shallow parsing is enough for your needs: you can then use a tagger or a chunker.
Don’t forget that anaphora / coreference resolution and word sense disambiguation work well from the tagging / chunking stage onwards. If you apply them to parsed texts, you can expect somewhat better results, since some heuristics apply only to the output of a parser, but the gain is generally minor.
Benchmark - Some figures
We parsed a small document (10 sentences, 134 words) on a PC with an
Intel Core2 Quad CPU at 2.40 GHz. These are the results for a basic
syntactic analysis (all times are in milliseconds):
Analysis level      Component              Time      Words/sec
Tagging             SSTagger               78 ms     1720
                    SimpleTagger           35 ms     3830
Tagging+Chunking    SSTagger+chunker       130 ms    1030
                    SimpleTagger+chunker   82 ms     1630
Parsing (2)         Link Grammar Parser    1100 ms   120
                    Stanford Parser        4300 ms   30
The parsing speed depends highly on the sentence length. It goes typically
from 50 words/sec (for short sentences) to 10 words/sec (for long sentences).
On the same document, additional operations require extra time:
Operation                 Previous analysis level       Time      Overhead
Collocation collapsing    Tagging                       +10 ms    +20%
                          Chunking                      +10 ms    +10%
                          Parsing                       +100 ms   +5%
Context extraction        Tag./Chunk./Parsing           +7 ms     +12%
Coreferences extraction   Tagging/Chunking              +9 ms     +2%
                          Parsing                       +90 ms    +650%
Word sense disambig.      Tagging                       +615 ms   +430%
                          Chunking                      +615 ms   +50%
                          Parsing (Stanford Parser)     +615 ms   +15%
Deep syntax               Parsing (Link Grammar)        +100 ms   +3%
Deep syntax+semantics     Parsing (Stanford Parser)     +200 ms   +5%
(2) Note that the Link Grammar Parser is faster than the Stanford Parser in this particular benchmark, essentially because the sentences are short. On longer sentences (30+ words), they achieve comparable performance.
Tagging & chunking
Introduction
Part-of-speech Tagging
Part-of-speech tagging is the process of marking up the words in a text as corresponding to a particular part-of-speech, based on both their definition and their context. A simplified form of this is commonly taught to school-age children, in the identification of words as noun, verb, adjective, preposition, pronoun, adverb, conjunction and interjection.
Some words can represent more than one part-of-speech at different times. This is not rare: in natural languages, a huge percentage of word forms are ambiguous. For example, even “dogs”, which is usually thought of as just a plural noun, can also be a verb: “the sailor dogs the hatch”.
Example
Once tagged, the sentence “Born in a Kentucky log cabin, Abraham Lincoln
was elected as President of the United States” becomes:
Born/VBN in/IN a/DT Kentucky/NNP log/NN cabin/NN ,/, Abraham/NNP
Lincoln/NNP was/VBD elected/VBN as/IN President/NNP of/IN the/DT
United/NNP States/NNPS
The VBN tag means “verb past participle”, DT “determiner”, NNP “proper
noun singular”, NN “noun singular or mass”, etc.
Collocation detection
When CollapseCollocations is activated, the tagger detects
collocations and collapses them as single words (for instance, “President of
the United States”, “log cabin” are collocations).
The system is smart enough to detect proper nouns (“Abraham Lincoln”)
even when the last name does not directly follow the first name. For
instance, in “Pierre and Marie Curie”, “Pierre Curie” will be correctly
identified as a single proper noun.
Plural nouns resulting from the conjunction of two common nouns are transformed into the singular when relevant. For instance, “Mississippi and Coosa rivers” becomes “Mississippi river” and “Coosa river”.
Main interfaces
The main tagging and chunking types are:
• the ITokenizer, ITagger (which extends ITokenizer) and IChunker interfaces;
• the IWord, ISentence and IChunk interfaces (all serializable; ISentence and IChunk also support IMultiwordExpression);
• the ProcessingType enumeration (None, Tagging, Chunking, Parsing);
• the PartOfSpeech enumeration (Noun, Verb, Adjective, Adverb, PronounDeterminer, Article, Preposition, Conjunction, Numeral, Interjection, Residual, Punctuation);
• the TagType and SyntacticNodeType enumerations.
Currently, four classes support the ITagger interface:
• SimpleTagger and SSTagger are true taggers (they are implemented in the Proxem.Antelope.Tagging assembly, which also references Proxem.Antelope.Tools); the first is very fast, and the second is very accurate. Thousands of words can be tagged per second.
• Stanford.Parser and LinkGrammar.Parser are parsers that can be used as (very slow but very accurate) taggers.
Their accuracy is around 95-97% (i.e. 95-97% of the word tokens in arbitrary English text receive the correct tag).
A chunker takes the output of a tagger as its input, and groups together the tagged words that belong to the same constituent (or chunk).
ITagger interface
The ITagger methods are:
• TagText: this method returns a (read-only) list of tagged words (IWord) from an input sentence. It marks up the words in the input text with their corresponding parts of speech.
• CollapseCollocations: this method collapses a tagged text, identifying collocations and merging the words that form them. A collocation is a sequence of words that go together to form a specific meaning, such as “car pool”. We consider as a collocation any multi-word expression defined in the lexicon (i.e. WordNet).
• Clone: this method creates a deep copy of a list of tagged words. Each tagged word is also cloned during the process.
• ModifyTag: this method makes it possible to modify the text and type of a tag (mainly for internal use).
An IWord has many properties, including PartOfSpeech (an Enum), Tag (another Enum) and TagAsString (the string associated with the enumerated Tag property).
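For instance (a small sketch, assuming words is the list returned by TagText):

C#
// Display the text, part of speech and tag of the first word
IWord first = words[0];
Console.WriteLine("{0}: {1} ({2})", first.Text, first.PartOfSpeech, first.TagAsString);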
The following code sample shows how to create a SSTagger instance. You
must have referenced the Proxem.Antelope.Tagging assembly; you
must also add a using (C#) or Imports (VB.NET) directive for the
Proxem.Antelope.Tagging namespace:
C#
string ssTaggerDll = @"bin\Tagger.dll";
string ssTaggerModelFile = @"data\SSTagger\model.bidir.%d";
SSTagger theTagger = new SSTagger(ssTaggerDll);
for (int i = 0; i <= 15; i++)
{
    theTagger.LoadModelFile(ssTaggerModelFile, i);
}
ITagger ssTagger = theTagger;
VB.NET
Dim ssTaggerDll As String = "bin\Tagger.dll"
Dim ssTaggerModelFile As String = "data\SSTagger\model.bidir.%d"
Dim theTagger As New SSTagger(ssTaggerDll)
For i As Integer = 0 To 15
    theTagger.LoadModelFile(ssTaggerModelFile, i)
Next
Dim ssTagger As ITagger = theTagger
The following code sample shows how to create a SimpleTagger
instance (you must have referenced the Proxem.Antelope.Tagging
assembly):
C#
string simpleTaggerFile = @"data\BrillTaggerLexicon.txt";
ITagger simpleTagger = new SimpleTagger(simpleTaggerFile);
VB.NET
Dim simpleTaggerFile As String = "data\BrillTaggerLexicon.txt"
Dim simpleTagger As ITagger = New SimpleTagger(simpleTaggerFile)
The following code sample shows how to use the various methods of the
ITagger interface:
C#
const string sentence = "Born in a Kentucky log cabin, Abraham Lincoln " +
    "was elected as President of the United States";
IList<IWord> words = tagger.TagText(sentence);
Assert.IsNotNull(words, "words should not be null");
// Remark: each punctuation symbol counts for one word
Assert.AreEqual(17, words.Count);
Assert.AreEqual("a", words[2].Text);
Assert.AreEqual(TagType.EnglishDT, words[2].Tag);
Assert.AreEqual("Kentucky", words[3].Text);
Assert.AreEqual(TagType.EnglishNNP, words[3].Tag);
Assert.AreEqual("as", words[11].Text);
Assert.AreEqual(TagType.EnglishIN, words[11].Tag);
// Test cloning
IList<IWord> wordsCopy = tagger.Clone(words);
Assert.AreEqual(words.Count, wordsCopy.Count);
for (int i = 0; i < words.Count; i++)
{
    Assert.AreEqual(words[i].Text, wordsCopy[i].Text);
    Assert.AreEqual(words[i].TagAsString, wordsCopy[i].TagAsString);
}
// Test collapsing collocations (multi-word expressions)
IList<IWord> collapsedWords = tagger.CollapseCollocations(words, lexicon);
// "log_cabin", "Abraham_Lincoln", "President_of_the_United_States"
Assert.AreEqual(11, collapsedWords.Count);
Assert.AreEqual("a", collapsedWords[2].Text);
Assert.AreEqual(TagType.EnglishDT, collapsedWords[2].Tag);
Assert.AreEqual("log_cabin", collapsedWords[4].Text);
Assert.AreEqual(TagType.EnglishNN, collapsedWords[4].Tag);
Assert.AreEqual("President_of_the_United_States", collapsedWords[10].Text);
Assert.AreEqual(TagType.EnglishNNP, collapsedWords[10].Tag);
VB.NET
Dim words As IList(Of IWord)
words = tagger.TagText("Born in a Kentucky log cabin, Abraham Lincoln " + _
    "was elected as President of the United States")
Assert.IsNotNull(words, "words should not be null")
' Remark: each punctuation symbol counts for one word
Assert.AreEqual(17, words.Count)
Assert.AreEqual("a", words(2).Text)
Assert.AreEqual(TagType.EnglishDT, words(2).Tag)
Assert.AreEqual("Kentucky", words(3).Text)
Assert.AreEqual(TagType.EnglishNNP, words(3).Tag)
Assert.AreEqual("as", words(11).Text)
Assert.AreEqual(TagType.EnglishIN, words(11).Tag)
' Test cloning
Dim wordsCopy As IList(Of IWord) = tagger.Clone(words)
Assert.AreEqual(words.Count, wordsCopy.Count)
For i As Integer = 0 To words.Count - 1
    Assert.AreEqual(words(i).Text, wordsCopy(i).Text)
    Assert.AreEqual(words(i).Tag, wordsCopy(i).Tag)
Next i
' Test collapsing collocations (multi-word expressions)
Dim collapsedWords As IList(Of IWord) = _
    tagger.CollapseCollocations(words, lexicon)
' "log_cabin", "Abraham_Lincoln", "President_of_the_United_States"
Assert.AreEqual(11, collapsedWords.Count)
Assert.AreEqual("a", collapsedWords(2).Text)
Assert.AreEqual(TagType.EnglishDT, collapsedWords(2).Tag)
Assert.AreEqual("log_cabin", collapsedWords(4).Text)
Assert.AreEqual(TagType.EnglishNN, collapsedWords(4).Tag)
Assert.AreEqual("President_of_the_United_States", collapsedWords(10).Text)
Assert.AreEqual(TagType.EnglishNNP, collapsedWords(10).Tag)
IChunker interface
Chunking is the step next to tagging. Also called shallow parsing, chunking
is an analysis of a sentence, which identifies its constituents (noun groups,
verbs…), but does not specify their internal structure, or their role in the
main sentence.
Antelope implements a chunker that tries to mark up all the noun and verb
phrases in a text. The expected accuracy of this chunker is around 91-92%.
The only IChunker method is:
• ChunkTags: this method takes a list of tagged words as input parameter, and returns a new list with the chunk nodes.
WHAT IS NEW? In the former version 0.7, the chunker was a simple NP (Noun Phrase) chunker that followed the “IOB” convention (indicating whether a word is inside an NP, outside an NP, or at the border between two NPs). Now, the chunker returns a subset of a syntactic tree, where each chunk node can be an NP, a VP, a PP…
Example: once chunked (with word collapsing), the sentence “Born in a
Kentucky log cabin, Abraham Lincoln was elected as President of the United
States” becomes:
Chunk #0   VP   [Born]
Chunk #1   PP   [in]
Chunk #2   NP   [a Kentucky log_cabin]
Chunk #3        ,
Chunk #4   NP   [Abraham_Lincoln]
Chunk #5   VP   [was elected]
Chunk #6   PP   [as]
Chunk #7   NP   [President_of_the_United_States]
You will remark on chunk #3 that terminals (such as punctuation) are syntactic tree leaves.
C#
IList<IWord> words = tagger.TagText(sentence); // 17 words
// We collapse multi-word expressions
words = tagger.CollapseCollocations(words, lexicon); // 11 words
IChunker chunker = new Tagging.Chunker();
// A chunker takes as input the output of a tagger
IList<IChunk> chunks = chunker.ChunkText(words); // 8 chunks
Assert.AreEqual(8, chunks.Count);
// We verify the type of each chunk
Assert.AreEqual(Parsing.SyntacticNodeType.VP, chunks[0].Type);
Assert.AreEqual(Parsing.SyntacticNodeType.PP, chunks[1].Type);
Assert.AreEqual(Parsing.SyntacticNodeType.NP, chunks[2].Type);
Assert.AreEqual(Parsing.SyntacticNodeType.Leaf, chunks[3].Type);
Assert.AreEqual(Parsing.SyntacticNodeType.NP, chunks[4].Type);
Assert.AreEqual(Parsing.SyntacticNodeType.VP, chunks[5].Type);
Assert.AreEqual(Parsing.SyntacticNodeType.PP, chunks[6].Type);
Assert.AreEqual(Parsing.SyntacticNodeType.NP, chunks[7].Type);
// We verify the word span of the third chunk
// Born in [a] Kentucky log_cabin ...
Assert.AreEqual(2, chunks[2].FirstWordIndex);
// Born in a Kentucky [log_cabin]...
Assert.AreEqual(4, chunks[2].LastWordIndex);
VB.NET
Dim words As IList(Of IWord) = tagger.TagText("Born in a Kentucky log cabin, " + _
    "Abraham Lincoln was elected as President of the United States")
' We collapse multi-word expressions: 17 words -> 11 words
words = tagger.CollapseCollocations(words, lexicon)
' A chunker takes as input the output of a tagger
Dim chunks As IList(Of IChunk) = chunker.ChunkText(words)
Assert.AreEqual(8, chunks.Count)
' We verify the type of each chunk
Assert.AreEqual(SyntacticNodeType.VP, chunks(0).Type)
Assert.AreEqual(SyntacticNodeType.PP, chunks(1).Type)
Assert.AreEqual(SyntacticNodeType.NP, chunks(2).Type)
Assert.AreEqual(SyntacticNodeType.Leaf, chunks(3).Type)
Assert.AreEqual(SyntacticNodeType.NP, chunks(4).Type)
Assert.AreEqual(SyntacticNodeType.VP, chunks(5).Type)
Assert.AreEqual(SyntacticNodeType.PP, chunks(6).Type)
Assert.AreEqual(SyntacticNodeType.NP, chunks(7).Type)
' We verify the word span of the third chunk
' Born in [a] Kentucky log_cabin ...
Assert.AreEqual(2, chunks(2).FirstWordIndex)
' Born in a Kentucky [log_cabin]...
Assert.AreEqual(4, chunks(2).LastWordIndex)
Tagging and chunking sample
The Antelope sample includes a “Tagging” tab and a “Chunking” tab.
Parsing
Introduction to dependencies and constituents
There are two complementary ways of handling syntactic analysis:
dependencies and constituents. Antelope makes it possible to handle both
structures simultaneously. The following image shows both of them:
• A constituent tree structure (above the words); each constituent is a word or a group of words that functions as a single unit within a hierarchical structure. Phrases (noun phrases, verbal phrases, etc.) are usually constituents of a clause, but clauses may also be embedded into a bigger structure. For instance, “the man” is an NP (noun phrase) constituent.
• A dependency graph (under the words); each dependency is a grammatical relation between two words: the headword and the dependent word. For instance, in “the man”, the “determiner” relation links “man” (head) and “the” (dependent).
Main interfaces
The main parsing types are:
• the IParser (which extends ITokenizer) and IRichTextParser interfaces;
• the ISentence, IWord, IDependency, ISyntacticNode, IChunk and IHtmlTag interfaces (all serializable);
• the SyntacticNodeType and ProcessingType enumerations.
A sentence can be parsed either with an IParser (for plain text) or with an IRichTextParser (if it contains HTML tags). The result is an ISentence composed of:
• A list of tagged words: the Words property,
• A dependency graph: the Dependencies list property,
• A constituent tree structure: the Root syntactic node property, recursively owning Children syntactic nodes; in that tree, a terminal node points directly to a tagged word (through the Leaf property, which is null for a non-terminal constituent),
• A list of chunks: the Chunks property; the chunks are a subset of the syntactic nodes,
• Optionally, a list of HTML tags present in the source sentence: the HtmlTags property.
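These properties can be traversed as in the following sketch (assuming sentence is an ISentence returned by a parser):

C#
// Tagged words
foreach (IWord word in sentence.Words)
    Console.WriteLine("{0}/{1}", word.Text, word.TagAsString);
// Dependency graph
foreach (IDependency dep in sentence.Dependencies)
    Console.WriteLine(dep.Type); // e.g. "nsubj" with the Stanford Parser
// Chunks (a subset of the syntactic nodes)
foreach (IChunk chunk in sentence.Chunks)
    Console.WriteLine(chunk.Type);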
Parsing samples
The Stanford Parser and the Link Grammar parser produce a forest of parse trees. Each possible syntactic interpretation of a sentence is called an analysis (a linkage in the Link Grammar jargon).
The Link Grammar and the Stanford parsers share the same interfaces. They
can return as many analyses as needed.
Stanford Parser output
Antelope includes a wrapper for accessing the Stanford Parser, which is based on a statistical approach. This parser produces many analyses.
Link Grammar Parser output
Antelope includes a wrapper for accessing the Link Grammar, a rule-based English parser. Its original documentation can be found at http://www.link.cs.cmu.edu/link/.
Link Grammar is similar to dependency grammar, but includes directionality in the relations between words; contrary to dependency grammar, it lacks a head-dependent relationship. Antelope reconstitutes this head-dependent relationship.
Differences between the two parsers
You can see that both parsers share the same structure (i.e. constituents and dependencies). Nevertheless, you should notice some subtle differences:
• While the constituent labels are normalized (“NP”, “VP”, “PP”…), that is not the case for the dependency labels; for instance, the ‘subject’ dependency between a noun and a verb is labeled ‘S’ in the Link Grammar, and ‘nsubj’ in the Stanford Parser.
• The dependency structure is far from identical in the two systems, as you can see in the following sentence:
Stanford Parser output
Link Grammar Parser output
TagParser for French
Since version 0.8.3, the TagParser for the French language is included with
Antelope. This parser produces a single parse tree (a “best parse” tree).
Some code
Splitting a text into sentences
Taggers and chunkers can be applied to arbitrarily long texts; on the other hand, it is highly recommended to apply a parser to a single sentence at a time (this gives better results).
You can use a sentence splitter to cut a text into sentences (you must have
referenced the Proxem.Antelope.Tools assembly):
C#
ISentenceSplitter splitter = new Tools.MIL_HtmlSentenceSplitter();
splitter.Text = text;
foreach (string sentence in splitter.Sentences)
{
// Process sentence…
}
VB.NET
Dim splitter As ISentenceSplitter = New Tools.MIL_HtmlSentenceSplitter
splitter.Text = text
For i As Integer = 0 To splitter.Sentences.Count - 1
' Process splitter.Sentences(i)…
Next i
Parser instantiation
You can instantiate the Antelope wrapper on the Stanford Parser in the following way (you must have referenced the Proxem.Antelope.Stanford assembly):
C#
// You may choose another parser file (*.txt or *.ser.gz)
string lexicalizedParser = @"Data\StanfordParser\wsjPCFG.txt";
IParser parser = new Stanford.Parser(lexicalizedParser);
VB.NET
' You may choose another parser file (*.txt or *.ser.gz)
Dim lexicalizedParser As String = "Data\StanfordParser\wsjPCFG.txt"
Dim parser As Parsing.IParser = New Stanford.Parser(lexicalizedParser)
You can instantiate the Antelope wrapper on the Link Grammar Parser in the following way (you must have referenced the Proxem.Antelope.LinkGrammar assembly):
C#
string dllPath = @"bin";
string dictPath = @"LinkGrammar\data";
// The third argument (lexicon) is optional but recommended
IParser parser = new LinkGrammar.Parser(dllPath, dictPath, lexicon);
VB.NET
Dim dllPath As String = "bin"
Dim dictPath As String = "LinkGrammar\data"
Dim parser As Parsing.IParser
' The third argument (lexicon) is optional but recommended
parser = New LinkGrammar.Parser(dllPath, dictPath, lexicon)
Handling the parsing results
Once you have instantiated a parser, you can use it to parse sentences, and
detect particular constructions:
C#
string text =
"Atlanta's primary election showed it. " +
"It is the Texas House of Representatives.";
int[] wordCountBeforeCollapse = new int[] { 7, 8 };
int[] wordCountAfterCollapse = new int[] { 6, 5 };
int[] subjectPositions = new int[] { 2, 0 };
ISentenceSplitter splitter = new Tools.MIL_HtmlSentenceSplitter();
splitter.Text = text;
Assert.AreEqual(2, splitter.Sentences.Count);
for (int i = 0; i < splitter.Sentences.Count; i++)
{
    ISentence sentence = parser.ParseSentence(splitter.Sentences[i]);
    Assert.AreEqual(wordCountBeforeCollapse[i], sentence.Words.Count);
    sentence.CollapseCollocations(lexicon);
    Assert.AreEqual(wordCountAfterCollapse[i], sentence.Words.Count);
    // We examine dependencies to detect the subject
    int subjectWordPosition = -1;
    foreach (IDependency dep in sentence.Dependencies)
    {
        if (dep.Type == "nsubj" /* Stanford */
            || dep.Type == "S" /* LinkGrammar */)
        {
            subjectWordPosition = dep.DependentIndex;
            break;
        }
    }
    Assert.AreEqual(subjectPositions[i], subjectWordPosition);
}
VB.NET
Dim text As String = _
"Atlanta's primary election showed it. " + _
"It is the Texas House of Representatives."
Dim wordCountBeforeCollapse As Integer() = New Integer() {7, 8}
Dim wordCountAfterCollapse As Integer() = New Integer() {6, 5}
Dim subjectPositions As Integer() = New Integer() {2, 0}
Dim splitter As ISentenceSplitter = New Tools.MIL_HtmlSentenceSplitter
splitter.Text = text
Assert.AreEqual(2, splitter.Sentences.Count)
For i As Integer = 0 To splitter.Sentences.Count - 1
    Dim sentence As ISentence = parser.ParseSentence(splitter.Sentences(i))
    Assert.AreEqual(wordCountBeforeCollapse(i), sentence.Words.Count)
    sentence.CollapseCollocations(lexicon)
    Assert.AreEqual(wordCountAfterCollapse(i), sentence.Words.Count)
    ' We examine dependencies to detect the subject
    Dim subjectWordPosition As Integer = -1
    Dim dep As IDependency
    For Each dep In sentence.Dependencies
        If (dep.Type = "nsubj") OrElse (dep.Type = "S") Then
            subjectWordPosition = dep.DependentIndex
            Exit For
        End If
    Next
    Assert.AreEqual(subjectPositions(i), subjectWordPosition)
Next i
© 2006-2009 Proxem
Deep syntax and predicates
Deep syntax
Let’s now look at an example of what deep syntax is. The sentences “Eve loves
Adam” and “Adam is loved by Eve” mean roughly the same thing and use
similar words. Some linguists (in particular Noam Chomsky) have tried to
account for this similarity by positing that these two sentences are distinct
surface forms that derive from a common deep structure. The following
figure shows, for each sentence, its syntactic dependencies and its deep
syntax dependencies.
    “Eve loves Adam”
      surface syntax:  nsubj(loves -> Eve), dobj(loves -> Adam)
      deep syntax:     Subject(loves -> Eve), DirectObject(loves -> Adam)

    “Adam is loved by Eve”
      surface syntax:  nsubj(loved -> Adam), aux(loved -> is),
                       prep(loved -> by), pobj(by -> Eve)
      deep syntax:     DirectObject(loved -> Adam), Subject(by)(loved -> Eve)
Each parser used by Antelope knows how to handle a deep syntax layer.
Even better, while each parser defines its own set of labels to name
dependencies, the labels for deep syntax are homogeneous and are the same
for all the parsers.
A (normal) syntax dependency links a dependent word to a governor word.
So as to lose no information, a deep syntax dependency adds an optional
middle word, a third word linking the dependent and the governor. For
example, in the passive sentence above, the (deep syntax) subject “Eve” is
linked to the verb “loved” by means of the (preposition) middle word
“by”. The current set of deep syntax dependencies is defined in the
DeepDependencyType enumeration:
- Subject: a dependency between a subject and a verb. In a
  passive construction, a middle word “by” should be present.
- DirectObject: a dependency between a verb and a direct
  object.
- IndirectObject: a dependency between a verb and an
  indirect object.
- PrepObject: a dependency between a verb and a
  prepositional object. A middle word containing the preposition
  should be present.
- AdjectiveNoun: a dependency between an adjective and a
  noun modified by this adjective.
- NounOfNoun: a dependency between a noun (owner) and
  another noun (part of the owner).
- TimeComplement: a dependency between a verb and a time
  complement.
- SpaceComplement: a dependency between a verb and a
  space complement.
WHAT IS COMING NEXT? (1) The set of deep syntax labels will be enriched in
the next version. (2) For the moment, the deep syntax support is better for
the Stanford Parser than for the Link Grammar parser. (3) The support for
ditransitive constructions will be improved.
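The dependency-with-optional-middle-word structure described above can be sketched as a small stand-alone C# type. This is a simplified illustrative model only; the actual Antelope types are the IDeepDependency interface and the DeepDependencyType enumeration, whose exact members may differ:

```csharp
// Simplified, illustrative model of a deep syntax dependency.
// (The real Antelope type is the IDeepDependency interface.)
enum DeepDependencyType
{
    Subject, DirectObject, IndirectObject, PrepObject,
    AdjectiveNoun, NounOfNoun, TimeComplement, SpaceComplement
}

class DeepDependency
{
    public DeepDependencyType DepType;
    public string Governor;    // e.g. "loved"
    public string Dependent;   // e.g. "Eve"
    public string MiddleWord;  // e.g. "by" in a passive; null when absent

    public override string ToString()
    {
        string middle = (MiddleWord == null) ? "" : "(" + MiddleWord + ")";
        return DepType + middle + ": " + Governor + " -> " + Dependent;
    }
}

// For "Adam is loved by Eve", the deep subject would print as:
//   Subject(by): loved -> Eve
```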
Time and space complements
Beyond deep syntax, Antelope also performs time and space feature detection.
In the following example, notice that the node “World War II” has a
[1939→1945] time interval, while the node “the beginning of World War II”
has a [1939→1940] time interval (some fuzzy logic is applied here).
In deep syntax dependencies, these features result in TimeComplement and
SpaceComplement.
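The fuzzy “beginning of” interval above can be approximated with simple interval arithmetic. The sketch below only illustrates the idea; Antelope’s actual fuzzy-logic rules are not documented here and are certainly more elaborate:

```csharp
// Toy sketch: derive a time interval for "the beginning of X" from the
// interval of X, keeping roughly the first quarter (at least one year).
// Illustrative only - not Antelope's actual rules.
struct YearInterval
{
    public int Start, End;
    public YearInterval(int start, int end) { Start = start; End = end; }

    public YearInterval Beginning()
    {
        int length = System.Math.Max(1, (End - Start) / 4);
        return new YearInterval(Start, Start + length);
    }
}

YearInterval ww2 = new YearInterval(1939, 1945);
YearInterval beginning = ww2.Beginning();  // [1939 -> 1940]
```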
Predicates
A further step beyond deep syntax is predicate extraction. In traditional
grammar, a predicate is one of the two main parts of a sentence; the other
part is the subject, which the predicate modifies. The predicate must contain
a verb; the verb requires or permits other sentence elements to complete the
predicate. These elements are objects (direct, indirect, prepositional),
predicate complements and adverbials (either obligatory or adjuncts). The
predicate provides information about the subject, such as what it is doing or
what it is like.
Antelope knows how to extract predicates from sentences. The following
examples show, for each sentence, the surface syntax dependencies, the deep
syntax dependencies, and finally the predicate, which gives a synthetic
view of the deep syntax.
    “John’s mother gave me a present”
      surface syntax:  poss(mother -> John), possessive(John -> ’s),
                       nsubj(gave -> mother), iobj(gave -> me),
                       dobj(gave -> present), det(present -> a)
      deep syntax:     NounOfNoun(mother -> John), Subject(gave -> mother),
                       IndirectObject(gave -> me),
                       DirectObject(gave -> present)
    gave(Subject: mother,
         DirectObject: present,
         IndirectObject: me)

    “John is reading the book”
      surface syntax:  nsubj(reading -> John), aux(reading -> is),
                       dobj(reading -> book), det(book -> the)
      deep syntax:     Subject(reading -> John), DirectObject(reading -> book)
    reading(Subject: John,
            DirectObject: book)

    “she cries”
      surface syntax:  nsubj(cries -> she)
      deep syntax:     Subject(cries -> she)
    cries(Subject: she)
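A predicate like those above can be modeled as a head word plus a set of typed roles. The sketch below is illustrative only; Antelope’s actual types are the IPredicate and IRole interfaces shown in the sample code further down:

```csharp
// Illustrative sketch: a predicate as a head word plus typed roles.
// (The real Antelope types are IPredicate and IRole.)
using System.Collections.Generic;

class Predicate
{
    public string MainWord;
    public Dictionary<string, string> Roles = new Dictionary<string, string>();

    public override string ToString()
    {
        List<string> parts = new List<string>();
        foreach (KeyValuePair<string, string> role in Roles)
            parts.Add(role.Key + ": " + role.Value);
        return MainWord + "(" + string.Join(", ", parts.ToArray()) + ")";
    }
}

Predicate gave = new Predicate { MainWord = "gave" };
gave.Roles["Subject"] = "mother";
gave.Roles["DirectObject"] = "present";
gave.Roles["IndirectObject"] = "me";
// gave(Subject: mother, DirectObject: present, IndirectObject: me)
```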
Main interfaces
Sample code
The following code shows how to extract deep syntax dependencies, time &
space complements, and predicates from two sentences that share the same
deep syntax representation:
    “this pretty woman bought his car in France in 97”
      surface syntax:  det(woman -> this), amod(woman -> pretty),
                       nsubj(bought -> woman), poss(car -> his),
                       dobj(bought -> car), prep(bought -> in),
                       pobj(in -> France), prep(bought -> in),
                       pobj(in -> 97)
      deep syntax:     AdjectiveNoun(woman -> pretty),
                       Subject(bought -> woman),
                       DirectObject(bought -> car),
                       SpaceComplement(in)(bought -> France),
                       TimeComplement(in)(bought -> 97)

    “In 97, the car was bought in France by the pretty woman”
      surface syntax:  prep(bought -> In), pobj(In -> 97), det(car -> the),
                       nsubjpass(bought -> car), auxpass(bought -> was),
                       prep(bought -> in), pobj(in -> France),
                       prep(bought -> by), pobj(by -> woman),
                       det(woman -> the), amod(woman -> pretty)
      deep syntax:     TimeComplement(In)(bought -> 97),
                       DirectObject(bought -> car),
                       SpaceComplement(in)(bought -> France),
                       Subject(by)(bought -> woman),
                       AdjectiveNoun(woman -> pretty)

In both cases, the predicate is:
    bought(Subject: by woman,
           DirectObject: car,
           TimeComplement: in 97,
           SpaceComplement: in France)
C#
ISentence sentence = parser.ParseSentence(text);
// If we want to distinguish Time & Space complements from other
// prepositional objects, we must compute them BEFORE reading the
// DeepSyntaxDependencies property for the first time by using
// the ComputeTimeAndSpace method of the Utility class (defined in
// Proxem.Antelope.Tools assembly).
Features.Utility.ComputeTimeAndSpace(sentence, lexicon);
IList<IDeepDependency> deepDeps = sentence.DeepSyntaxDependencies;
Assert.GreaterOrEqual(deepDeps.Count, 4);
TestDep(deepDeps, DeepDependencyType.Subject, "bought", "woman");
TestDep(deepDeps, DeepDependencyType.DirectObject, "bought", "car");
TestDep(deepDeps, DeepDependencyType.TimeComplement, "bought", "97");
TestDep(deepDeps, DeepDependencyType.SpaceComplement, "bought", "France");
// More easily, we now extract predicates from the parsed sentence
IPredicateExtractor predExtractor = new PredicateExtractor();
IList<IPredicate> predicates = predExtractor.GetPredicates(sentence);
// In these sentences, we have a single predicate
Assert.AreEqual(1, predicates.Count);
// We verify the base form ("buy") of the predicate main word ("bought")
IList<IInflectedWord> baseForms = lexicon.GetBaseForms(
predicates[0].MainWord.Text,
predicates[0].MainWord.Tag);
Assert.AreEqual(1, baseForms.Count);
Assert.AreEqual("buy", baseForms[0].BaseWord);
// The predicate has a subject, a direct object and time&space complements
Assert.AreEqual(4, predicates[0].Roles.Count);
IRole subject, dirObj, time, space;
subject = predicates[0].FindRoleByType(DeepDependencyType.Subject);
Assert.IsNotNull(subject);
Assert.AreEqual("woman", subject.Word.Text);
dirObj = predicates[0].FindRoleByType(DeepDependencyType.DirectObject);
Assert.IsNotNull(dirObj);
Assert.AreEqual("car", dirObj.Word.Text);
time = predicates[0].FindRoleByType(DeepDependencyType.TimeComplement);
Assert.IsNotNull(time);
Assert.AreEqual("97", time.Word.Text);
space = predicates[0].FindRoleByType(DeepDependencyType.SpaceComplement);
Assert.IsNotNull(space);
Assert.AreEqual("France", space.Word.Text);
private static void TestDep(IList<IDeepDependency> deepDeps,
DeepDependencyType type, string governor, string dependent)
{
// We search for a deep dependency of the given type
IDeepDependency aDep = null;
foreach (IDeepDependency deepDep in deepDeps)
{
if (deepDep.DepType == type)
{
aDep = deepDep;
break;
}
}
// We verify that we found it...
Assert.IsNotNull(aDep);
// ...and that the governor word text and the dependent word text
// are those that were expected
Assert.AreEqual(governor, aDep.Governor.Text);
Assert.AreEqual(dependent, aDep.Dependent.Text);
}
VB.NET
Dim sentence As ISentence = parser.ParseSentence(text)
' If we want to distinguish Time & Space complements from other
' prepositional objects, we must compute them BEFORE reading the
' DeepSyntaxDependencies property for the first time by using
' the ComputeTimeAndSpace method of the Utility class (defined in
' Proxem.Antelope.Tools assembly).
Features.Utility.ComputeTimeAndSpace(sentence, lexicon)
Dim deepDeps As IList(Of IDeepDependency) = sentence.DeepSyntaxDependencies
Assert.GreaterOrEqual(deepDeps.Count, 4)
TestDep(deepDeps, DeepDependencyType.Subject, "bought", "woman")
TestDep(deepDeps, DeepDependencyType.DirectObject, "bought", "car")
TestDep(deepDeps, DeepDependencyType.TimeComplement, "bought", "97")
TestDep(deepDeps, DeepDependencyType.SpaceComplement, "bought", "France")
' More easily, we now extract predicates from the parsed sentence
Dim predExtractor As IPredicateExtractor = New PredicateExtractor
Dim predicates As IList(Of IPredicate)
predicates = predExtractor.GetPredicates(sentence)
' In these sentences, we have a single predicate
Assert.AreEqual(1, predicates.Count)
' We verify the base form ("buy") of the predicate main word ("bought")
Dim baseForms As IList(Of IInflectedWord) = lexicon.GetBaseForms( _
predicates(0).MainWord.Text, predicates(0).MainWord.Tag)
Assert.AreEqual(1, baseForms.Count)
Assert.AreEqual("buy", baseForms(0).BaseWord)
' The predicate has a subject, a direct object and time&space complements
Assert.AreEqual(4, predicates(0).Roles.Count)
Dim subject, obj, time, space As IRole
subject = predicates(0).FindRoleByType(DeepDependencyType.Subject)
Assert.IsNotNull(subject)
Assert.AreEqual("woman", subject.Word.Text)
obj = predicates(0).FindRoleByType(DeepDependencyType.DirectObject)
Assert.IsNotNull(obj)
Assert.AreEqual("car", obj.Word.Text)
time = predicates(0).FindRoleByType(DeepDependencyType.TimeComplement)
Assert.IsNotNull(time)
Assert.AreEqual("97", time.Word.Text)
space = predicates(0).FindRoleByType(DeepDependencyType.SpaceComplement)
Assert.IsNotNull(space)
Assert.AreEqual("France", space.Word.Text)
Private Shared Sub TestDep(ByVal deepDeps As IList(Of IDeepDependency), _
        ByVal type As DeepDependencyType, ByVal governor As String, _
        ByVal dependent As String)
' We search for a deep dependency of the given type
Dim aDep As IDeepDependency = Nothing
Dim deepDep As IDeepDependency
For Each deepDep In deepDeps
If (deepDep.DepType = type) Then
aDep = deepDep
Exit For
End If
Next
' We verify that we found it...
Assert.IsNotNull(aDep)
' ...and that the governor word text and the dependent word text
' are those that were expected
Assert.AreEqual(governor, aDep.Governor.Text)
Assert.AreEqual(dependent, aDep.Dependent.Text)
End Sub
Semantic parsing
Frames
The subcategorization frame of a word is the number and type of arguments
that it co-occurs with (i.e. the number and kind of other words that it selects
when appearing in a sentence). So, in “Indiana Jones ate chilled monkey
brain”, “eat” selects (or subcategorizes for) “Indiana Jones” (as a subject)
and “chilled monkey brain” (as a direct object).
From a programmer’s point of view, you can think of a frame as the signature
of a method: frame_eat(Human, Thing).
Thematic roles
A thematic role is the semantic relationship between a predicate (e.g. a verb)
and an argument (e.g. the noun phrases) of a sentence. Thematic roles
include:
- AGENT: deliberately performs the action (e.g. “Bill ate his soup
  quietly”).
- EXPERIENCER: receives sensory or emotional input (e.g. “The smell
  of lilies filled Jennifer’s nostrils”).
- THEME/PATIENT: undergoes the action (e.g. “The falling rocks
  crushed the car”).
- INSTRUMENT: used to carry out the action (e.g. “Jamie cut the
  ribbon with a pair of scissors”).
- CAUSE: mindlessly performs the action (e.g. “An avalanche
  destroyed the ancient temple”).
- LOCATION: where the action occurs (e.g. “Johnny and Linda played
  carelessly in the park”).
- SOURCE: where the action originated (e.g. “The rocket was
  launched from Central Command”).
From a programmer’s point of view, thematic roles name the arguments of the
method’s signature: frame_eat(Human agent, Thing theme).
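Pushing the signature analogy into actual code, a frame can be written as a strongly typed method whose parameter names are the thematic roles. The types and method below are purely illustrative and are not part of Antelope:

```csharp
// Purely illustrative: a frame as a strongly typed method signature,
// with thematic roles as parameter names. None of these types belong
// to the Antelope framework.
class Human { public string Name; }
class Thing { public string Label; }

static string FrameEat(Human agent, Thing theme)
{
    // eat(AGENT, THEME): the roles constrain what can fill each slot
    return agent.Name + " eats " + theme.Label;
}

string result = FrameEat(new Human { Name = "Indiana Jones" },
                         new Thing { Label = "chilled monkey brain" });
// "Indiana Jones eats chilled monkey brain"
```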
Selectional restrictions
One or many selectional restrictions (such as HUMAN, ANIMATE, ANIMAL,
ORGANIZATION…) can apply to a thematic role, in the context of a given
frame.
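Checking a selectional restriction amounts to a type-hierarchy test: a filler satisfies a restriction if the restriction is one of its ancestors. The restriction names come from the text above; the code itself is an illustrative sketch, not Antelope’s implementation:

```csharp
// Illustrative sketch: selectional restrictions as a small is-a
// hierarchy (e.g. HUMAN is-a ANIMATE is-a ENTITY).
using System.Collections.Generic;

static bool Satisfies(string fillerType, string restriction,
                      IDictionary<string, string> parents)
{
    string t = fillerType;
    while (t != null)
    {
        if (t == restriction) return true;
        // Walk up the hierarchy; null at the top ends the loop.
        t = parents.ContainsKey(t) ? parents[t] : null;
    }
    return false;
}

Dictionary<string, string> parents = new Dictionary<string, string>
{
    { "HUMAN", "ANIMATE" },
    { "ANIMAL", "ANIMATE" },
    { "ANIMATE", "ENTITY" },
    { "ORGANIZATION", "ENTITY" }
};

bool ok = Satisfies("HUMAN", "ANIMATE", parents);        // true
bool ko = Satisfies("ORGANIZATION", "ANIMATE", parents); // false
```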
Sample application demo
Identifying frames and their thematic roles (while disambiguating word
senses) is the basis of semantic parsing.
Antelope includes an experimental semantic parser, using the Stanford
parser or the Link Grammar parser, and the VerbNet data.
As can be seen in the following example, two frames can be identified in the
sentence “the general to whom President Lincoln gave all powers captured
Lee’s troops”:
- Give (AGENT=Lincoln, THEME=all powers, RECIPIENT=the general)
- Capture (AGENT=the general, THEME=Lee’s troops).
Moreover, due to selectional restrictions (“the general” must be an Animate,
and so must “President Lincoln”, etc.), the semantic parser can restrict the
possible senses of these terms in the lexicon.
WHAT IS NEW? In the former version 0.7, the semantic parser worked only on
simplistic phrases (subject, verb, direct object, indirect object… in linear
order). Thanks to the deep syntax layer, the semantic parser has been
improved, making it possible to analyze more complex sentences,
including subject / object inversion.
Interfaces
Sample code
You must add a reference to the Proxem.Antelope.Semantics
assembly:
C#
ISentence sentence = parser.ParseSentence(text);
foreach (Tagging.IWord word in sentence.Words)
{
// Needed to compare senses
word.ComputeSenses(lexicon);
}
// To distinguish Time & Space complements from other prepositional objects
// (Utility class is defined in the Proxem.Antelope.Tools assembly).
Features.Utility.ComputeTimeAndSpace(sentence, lexicon);
IPredicateExtractor predicateExtractor = new PredicateExtractor();
string verbNetPath = @"data\VerbNet\v1.5";
string verbNetPrologPath = @"data\VerbNet\VNProlog";
string verbNetMappings = @"data\VerbNet\VNProlog\WNMapping.xml";
Semantics.IFrameExtractor frameExtractor =
new Semantics.FrameExtractor(lexicon,
verbNetPrologPath, verbNetMappings, verbNetPath);
sentence.ComputeFrames(predicateExtractor, frameExtractor);
Assert.IsNotNull(sentence.Frames);
Assert.AreEqual(1, sentence.Frames.Count);
// "buy" has 5 possible senses...
Assert.AreEqual(5, sentence.Frames[0].Predicate.MainWord.Senses.Count);
// ... but in the context, possible senses are only buy#1 and buy#5
Assert.AreEqual(2, sentence.Frames[0].PossibleSenses.Count);
// frame_get has 5 possible thematic roles...
Assert.AreEqual(5, sentence.Frames[0].ThematicRoles.Count);
Assert.AreEqual("Agent", sentence.Frames[0].ThematicRoles[0].Name);
Assert.AreEqual("Theme", sentence.Frames[0].ThematicRoles[1].Name);
Assert.AreEqual("Source", sentence.Frames[0].ThematicRoles[2].Name);
Assert.AreEqual("Beneficiary", sentence.Frames[0].ThematicRoles[3].Name);
Assert.AreEqual("Asset", sentence.Frames[0].ThematicRoles[4].Name);
// ... but only Agent and Theme are filled here...
Assert.IsNotNull(sentence.Frames[0].ThematicRoles[0].Role);
Assert.AreEqual("woman",
sentence.Frames[0].ThematicRoles[0].Role.Word.Text);
Assert.IsNotNull(sentence.Frames[0].ThematicRoles[1].Role);
Assert.AreEqual("car", sentence.Frames[0].ThematicRoles[1].Role.Word.Text);
// ... other thematic roles are null.
Assert.IsNull(sentence.Frames[0].ThematicRoles[2].Role);
Assert.IsNull(sentence.Frames[0].ThematicRoles[3].Role);
Assert.IsNull(sentence.Frames[0].ThematicRoles[4].Role);
VB.NET
Dim sentence As ISentence = parser.ParseSentence(text)
Dim word As Tagging.IWord
For Each word In sentence.Words
' Needed to compare senses
word.ComputeSenses(lexicon)
Next
' To distinguish Time & Space complements from other prepositional objects
' (Utility class is defined in the Proxem.Antelope.Tools assembly).
Features.Utility.ComputeTimeAndSpace(sentence, lexicon)
Dim predicateExtractor As IPredicateExtractor = New PredicateExtractor
Dim verbNetPath As String = "data\VerbNet\v1.5"
Dim verbNetPrologPath As String = "data\VerbNet\VNProlog"
Dim verbNetMappings As String = "data\VerbNet\VNProlog\WNMapping.xml"
Dim frameExtractor As Semantics.IFrameExtractor = _
New Semantics.FrameExtractor(lexicon, _
verbNetPrologPath, verbNetMappings, verbNetPath)
sentence.ComputeFrames(predicateExtractor, frameExtractor)
Assert.IsNotNull(sentence.Frames)
Assert.AreEqual(1, sentence.Frames.Count)
' "buy" has 5 possible senses...
Assert.AreEqual(5, sentence.Frames(0).Predicate.MainWord.Senses.Count)
' ... but in the context, possible senses are only buy#1 and buy#5
Assert.AreEqual(2, sentence.Frames(0).PossibleSenses.Count)
' frame_get has 5 possible thematic roles...
Assert.AreEqual(5, sentence.Frames(0).ThematicRoles.Count)
Assert.AreEqual("Agent", sentence.Frames(0).ThematicRoles(0).Name)
Assert.AreEqual("Theme", sentence.Frames(0).ThematicRoles(1).Name)
Assert.AreEqual("Source", sentence.Frames(0).ThematicRoles(2).Name)
Assert.AreEqual("Beneficiary", sentence.Frames(0).ThematicRoles(3).Name)
Assert.AreEqual("Asset", sentence.Frames(0).ThematicRoles(4).Name)
' ... but only Agent and Theme are filled here...
Assert.IsNotNull(sentence.Frames(0).ThematicRoles(0).Role)
Assert.AreEqual("woman", _
    sentence.Frames(0).ThematicRoles(0).Role.Word.Text)
Assert.IsNotNull(sentence.Frames(0).ThematicRoles(1).Role)
Assert.AreEqual("car", sentence.Frames(0).ThematicRoles(1).Role.Word.Text)
' ... other thematic roles are null.
Assert.IsNull(sentence.Frames(0).ThematicRoles(2).Role)
Assert.IsNull(sentence.Frames(0).ThematicRoles(3).Role)
Assert.IsNull(sentence.Frames(0).ThematicRoles(4).Role)
Handling documents
Introduction
Until now, we have presented only operations dealing with isolated sentences.
It is time to see how Antelope can process full documents containing many
sentences, in a very easy way.
Documents and Processing resources
Main interfaces
[Class diagram: IDocument (ISerializable, IEnumerable<ISentence>) contains
ISentence items, each with a ProcessingType (None, Tagging, Chunking,
Parsing). IProcessingResources groups the processing resources: ILexicon,
ISentenceSplitter, ITokenizer, ITagger, IChunker, IParser, IRichTextParser,
IPredicateExtractor, IFrameExtractor, IResolver, IHeuristicManager,
IContextExtractor and a SentencePreprocessorEventHandler delegate; the
document also exposes ICoreference and IDocumentContext.]
The IDocument interface
A document is primarily a list of sentences. (It also defines a context, and a
list of coreferences; these two notions will be detailed further.) Each
sentence has a processing type, which indicates whether it was produced by a
tagger, a chunker or a parser.
The IProcessingResources interface
An instance of IDocument is associated with some processing resources.
Each time an operation is needed (due to a request in the user code), the
right processing resource is automatically invoked. This feature makes NLP
programming as simple as possible.
Despite its apparent complexity, the IProcessingResources interface is easy
to use. It centralizes every resource that a document needs to be processed:
- A lexicon (almost mandatory),
- A sentence splitter (also almost mandatory),
- A mandatory syntactic analysis module: either (1) a tagger,
  (2) a tagger and a chunker, or (3) a parser (or a rich text parser),
- An optional (deep syntax) predicate extractor,
- An optional semantic analyzer (a frame extractor),
- Optional semantic resources specialized in context extraction,
  word sense disambiguation, and coreference resolution.
WHAT IS COMING NEXT? To simplify the use of processing resources, some
Inversion of Control / Dependency Injection will be used in a future release.
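The “invoke the right resource on demand” behavior described above can be sketched as lazy evaluation: a sentence is only parsed the first time it is accessed, and the result is cached. This mimics the documented behavior but is not Antelope’s actual implementation; all names here are illustrative:

```csharp
// Illustrative sketch of on-demand processing. The parse delegate
// stands in for a real IParser; this is not Antelope's implementation.
using System;

class LazySentence
{
    private readonly string rawText;
    private readonly Func<string, string[]> parse;
    private string[] words;

    public LazySentence(string rawText, Func<string, string[]> parse)
    {
        this.rawText = rawText;
        this.parse = parse;
    }

    public string[] Words
    {
        get
        {
            if (words == null)
                words = parse(rawText); // invoked only on first access
            return words;
        }
    }
}
```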
Document serialization
Each document can be serialized as XML to a persistent storage.
Sample code
Please refer to the ResourcesManager class of the sample for more code
examples.
C#
ISentenceSplitter splitter = new Tools.MIL_HtmlSentenceSplitter();
IProcessingResources resources = new ProcessingResources(
GetLexicon() /* ILexicon */,
splitter /* ISentenceSplitter */,
null /* ITagger */,
null /* IChunker */,
null /* ISentenceBreaker */,
GetStanfordParser() /* IParser */,
null /* IRichTextParser */,
null /* SentencePreprocessorEventHandler */,
true /* bool collapseCollocations */,
null /* Coreferences.IResolver */,
null /* IHeuristicManager wordSenseDisambiguator */,
null /* IContextExtractor */,
GetFrameExtractor() /* IFrameExtractor */,
true /* bool detectTimeAndSpace */);
string text = "A computer is a machine that stores information in its " +
    "memory, and does automated calculations on that knowledge. Automated " +
    "calculation means that if the machine is given some input, it will " +
    "produce some output. This can be used to do many things, from playing " +
    "games, to solving very hard math. This allows people to use computers " +
    "to do jobs, or just to have fun.";
IDocument document = new Document(null, text, resources);
Assert.AreEqual(4, document.Length);
// Each sentence is automatically parsed once it is accessed
Assert.AreEqual(20, document[0].Words.Count);
Assert.AreEqual(18, document[1].Words.Count);
Assert.AreEqual(19, document[2].Words.Count);
Assert.AreEqual(16, document[3].Words.Count);
// Serialization
string xml;
Tools.XmlFormatter formatter = new Tools.XmlFormatter(typeof(Document));
using (System.IO.StringWriter writer = new System.IO.StringWriter())
{
formatter.Serialize(writer, document);
xml = writer.ToString();
}
// Deserialization
using (System.IO.StringReader reader = new System.IO.StringReader(xml))
{
IDocument clone = (IDocument)formatter.Deserialize(reader);
Assert.AreEqual(4, clone.Length);
    for (int i = 0; i < clone.Length; i++)
    {
        Assert.AreEqual(document[i].Words.Count, clone[i].Words.Count);
        // ...
    }
}
VB.NET
Dim text As String = "A computer is a machine that stores information in " & _
    "its memory, and does automated calculations on that knowledge. " & _
    "Automated calculation means that if the machine is given some input, " & _
    "it will produce some output. This can be used to do many things, " & _
    "from playing games, to solving very hard math. This allows people " & _
    "to use computers to do jobs, or just to have fun."
Dim splitter As ISentenceSplitter = New Tools.MIL_HtmlSentenceSplitter
Dim resources As IProcessingResources=New ProcessingResources(GetLexicon, _
splitter, _
Nothing, _
Nothing, _
Nothing, _
GetStanfordParser, _
Nothing, _
Nothing, _
True, _
Nothing, _
Nothing, _
Nothing, _
GetFrameExtractor, _
True)
Dim document As IDocument = New Document(Nothing, text, resources)
Assert.AreEqual(4, document.Length)
' Each sentence is automatically parsed once it is accessed
Assert.AreEqual(20, document(0).Words.Count)
Assert.AreEqual(18, document(1).Words.Count)
Assert.AreEqual(19, document(2).Words.Count)
Assert.AreEqual(16, document(3).Words.Count)
' Serialization
Dim xml As String
Dim formatter As New Tools.XmlFormatter(GetType(Document))
Using writer As IO.StringWriter = New IO.StringWriter
formatter.Serialize(writer, document)
xml = writer.ToString
End Using
' Deserialization
Using reader As IO.StringReader = New IO.StringReader(xml)
Dim clone As IDocument=CType(formatter.Deserialize(reader), IDocument)
Assert.AreEqual(4, clone.Length)
Dim i As Integer
For i = 0 To clone.Length - 1
Assert.AreEqual(document(i).Words.Count, clone(i).Words.Count)
' ...
Next i
End Using
Document context detection
[Class diagram: an IContextExtractor produces an IDocumentContext, which
exposes IWeightedSynset items; each IWeightedSynset wraps an ISynset
(an IConcept).]
Using the lexicon and at least a tagger, it becomes possible to build a
program that detects the context of a text (what is it about?). An
IContextExtractor produces an IDocumentContext.
Coreference resolution
Coreferences and anaphora
In linguistics, coreference occurs when multiple expressions in an utterance
have the same referent. Coreference can concern nouns as well as verbs. For
instance, in “Oswald killed Kennedy; this assassination was awful”,
“assassination” refers to the former killing.
Pronominal anaphora is a special case of coreference, where a pronoun refers
to an antecedent. For instance, in “Pam went home because she felt sick”,
“she” is an anaphora that refers to “Pam”.
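A very crude pronominal-anaphora heuristic picks the nearest preceding noun with a compatible gender. The sketch below is a toy illustration of that single heuristic; it is not one of the heuristics Antelope’s IResolver actually implements, and all names in it are invented for the example:

```csharp
// Toy heuristic: resolve a pronoun to the nearest preceding noun with
// a compatible gender. Purely illustrative - Antelope's anaphora solver
// combines many more refined heuristics.
using System.Collections.Generic;

class TaggedWord
{
    public string Word;    // surface form, e.g. "Pam"
    public string Tag;     // Penn Treebank tag assumed, e.g. "NNP"
    public string Gender;  // "feminine", "masculine", "any", ...
}

static string ResolvePronoun(IList<TaggedWord> words, int pronounIndex,
                             string pronounGender)
{
    // Scan backwards from the pronoun towards the start of the text.
    for (int i = pronounIndex - 1; i >= 0; i--)
    {
        if (words[i].Tag.StartsWith("NN") &&
            (words[i].Gender == pronounGender || words[i].Gender == "any"))
            return words[i].Word;
    }
    return null; // no antecedent found
}

// "Pam went home because she felt sick":
// resolving "she" (feminine) walks back and returns "Pam".
```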
Main interfaces
[Class diagram: IDocument (IEnumerable<ISentence>, ISerializable) aggregates
ICoreference items. An ICoreference groups IReferringExpression instances
(which extend IMultiwordExpression), connected by IReferringLink objects
whose ReferringLinkType is PronominalAnaphora, SameName or Hypernym. An
IResolver uses an IHeuristicManager and IHeuristic implementations over
IWord and ISentence.]
Anaphora Resolution sample
Antelope includes an anaphora solver that implements many heuristics.
Sample code
C#
(TO BE ADDED)
VB.NET
(TO BE ADDED)
Word sense disambiguation
Introduction
Word sense disambiguation (WSD) is the problem of determining in which
sense a word having a number of distinct senses is used in a given sentence.
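A classic baseline for WSD is the simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the sentence context. The sketch below illustrates that baseline; it is independent of Antelope’s IWsdHeuristic interface and is not what Antelope’s heuristics necessarily do:

```csharp
// Simplified Lesk sketch: choose the sense whose gloss has the largest
// word overlap with the sentence context. Illustrative only.
using System.Collections.Generic;

static int BestSense(string[] contextWords, string[] glosses)
{
    HashSet<string> context = new HashSet<string>(contextWords);
    int best = 0, bestOverlap = -1;
    for (int s = 0; s < glosses.Length; s++)
    {
        int overlap = 0;
        foreach (string w in glosses[s].Split(' '))
            if (context.Contains(w)) overlap++;
        if (overlap > bestOverlap) { bestOverlap = overlap; best = s; }
    }
    return best; // index of the most plausible sense
}
```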
Main interfaces
[Class diagram: an IWsdHeuristic extends IHeuristic and is driven by an
IHeuristicManager over ISentence and IWord. Lexicon.GetWordSenses(IWord)
returns IWordSense items, each linking an ISynset (IConcept) and an
ILemma.]
Sample code
C#
(TO BE ADDED)
VB.NET
(TO BE ADDED)
Annotations
Almost all Antelope interfaces have an Annotation property. This
property makes it possible to attach dynamic new information to
existing object instances. This property is available on ISynset, ILemma,
IDocument, ISentence, IWord, IChunk, ICoreference and
IReferringExpression.
Formatters
The Proxem.Antelope.Formatters namespace defines an
IDocumentFormatter and an ILexiconFormatter interface
(implemented by the DocumentFormatter and LexiconFormatter
classes in the Proxem.Antelope.Tools assembly).
They make it very easy to format information as HTML. See the
DocHtmlFormatter class of the sample for more information.
Other samples
NanoProlog
Antelope also includes a lightweight Prolog interpreter for .NET
(used by the semantic parser). The help file Proxem.NanoProlog.chm
(in the doc subdirectory) contains the full NanoProlog documentation.
Proxem Web Search
This Antelope sample is a standalone Web meta-search utility. It uses
syntactic search (in contrast to classic keyword search). You express
what you are looking for using several paraphrases (in a future version, we
hope to automate this step). If you intend to extract a specific piece of
information, you can use the RESULT keyword in its place.
Note that each search criterion must be syntactically correct;
i.e. it must contain a subject and a verb, as in “company (subject) purchased
(verb) RESULT (optional direct object)”. If you want to specify a noun
phrase only, you must prefix it with a “+”, as in “+company’s acquisition of
RESULT”.
You can then specify the search engines that will be used (Google, Yahoo!,
MSN Search) and press the “>” button.
Proxem Web Search will start collecting as many candidate documents as
requested in the “maximum documents per criterion” value.
Here lies the main difference between keyword search and syntactic
search. Consider the criterion “Microsoft purchased RESULT”: with classic
keyword search, you will obtain sentences like “this administration
purchased Microsoft’s products” as a result, because all the target
keywords appear in the sentence. With syntactic search, you will only obtain
relevant results.
Moreover, Proxem Web Search can deal with surface syntactic variations.
For instance, in a candidate target such as “Microsoft recently purchased
this company for…” the adverb “recently” between the subject and the verb
does not produce any erroneous interference.
Proxem Web Search is far from perfect. For the moment, paraphrases
must be expressed manually, and the syntactic parsing is quite slow.
Nevertheless, this utility provides results of unrivalled relevance and is of
great value for any repetitive search task. We hope to improve it in the next
release of the Antelope framework.
Paraphrase detection
Paraphrase recognition and generation are crucial to creating applications
that approximate our understanding of language. For instance, in an
Information Extraction program, a smart paraphrase generator would be
invaluable.
Antelope includes an experimental sample that tries to find paraphrases in
two comparable texts (i.e. texts about the same subject). This sample
uses the anaphora solver, the predicates produced by a deep syntax parser,
and the lexicon.