WordNet – Structure and use in natural language processing

Jesper Segeblad
[email protected]
Semantic analysis in language technology
HT 13
Abstract
There are several electronic dictionaries, thesauri, lexical databases, and so forth today. WordNet is
one of the largest and most widely used of these. It has been used for many natural language
processing tasks, including word sense disambiguation and question answering. This paper explores
the structure of WordNet, how and for which applications it is used, and where its strengths and
weaknesses lie.
good introduction, nice that you mention the general aim of your work, so the reader is always aware of what
you're explaining. Double use of "and", maybe you can reformulate the last sentence
1. WordNet as a lexical database
1.1 Background
Before the 1990s, most dictionaries of English existed only in paper form. The dictionaries
that were available in electronic form were accessible only to a few groups of researchers. This
hindered work in certain areas of computational linguistics, for
example word sense disambiguation (WSD).
why were CL tasks more difficult without electronic databases (explain advantages of electronic database)
In 1993, WordNet was introduced. It is a lexical database, organized as a semantic network.
Development began in 1985 at Princeton University by a group of psychologists and linguists, and
the university still maintains the database. Even though it was not created with the
intention of serving as a knowledge source for tasks in computational linguistics, it has been widely used as
such. It has served as a lexical resource for many different tasks, has been ported to several
other languages, and has spawned many different subsets. One task for which it has been used extensively
is the previously mentioned WSD.
double use of has been used: reformulate (applicated as lexical resource..)
why do you mention WSD so often? are you especially focusing on that
task? If so name that you focus on WSD or say why it is so important
WordNet consists of three separate databases, one for nouns, one for verbs and one for adjectives
and adverbs. It does not include closed class words. The current version available for download is
WordNet 3.0, which was released in December 2006. It contains 117,097 nouns, 22,141 adjectives,
11,488 verbs and 4,601 adverbs. [2] There is a later release, 3.1, which is available for online usage.
were is your first footnote? :)..
1.2 Structure
The basic units of WordNet are synsets. These are sets of synonyms or, more correctly, near-synonyms, since
there are few, if any, true synonyms. A synset contains a set of lemmas, and each set is tagged
with the sense it represents. These senses can be regarded as concepts: all of the lemmas (or words) in a synset
can be said to express the same concept. Word forms which have different meanings appear in
different synsets. For example, the noun bank has 10 different senses in WordNet, and thus it
appears in 10 different synsets. It also appears as a verb in 8 different synsets. Each of these synsets is
also connected to other synsets through some kind of relation. Which
relations are available depends on the part of speech of the word itself, although the hypernym/hyponym
relationship (the is-a relation) is the most common and appears for both nouns and verbs (hyponyms of
verbs are known as troponyms, and the difference that motivates the separate name will
be expanded on in section 1.2.2).
what do you mean with this sentence? (some of these relations?)
very good explanation of synsets. what do you mean with will be expanded on? that
they will work on that or that you will explain it in detail later?
One thing that WordNet does not take into account is pronunciation, which can be observed by
looking at the noun bass. The pronunciation differs depending on whether bass is used in the sense of the
low tone or the instrument, or in the sense of the fish.
very good that you mention this aspect. maybe you can also state that this is a disadvantage for some CL tasks
1.2.1 Nouns
Nouns have the richest set of relations of all parts of speech represented in WordNet, with 12
different relations. As previously stated, the hyponym/hypernym relation is the most frequently used
one. For example, if we look at the noun bass again (which has 8 different senses), now in the
sense of sea bass, it is a saltwater fish, which is a kind of seafood, which is a kind of solid
food, and so on. These relations are also transitive, which means that sea bass is a type of food, just as
much as it is a type of saltwater fish.
Sense 4
sea bass, bass
   => saltwater fish
      => seafood
         => food, solid food
            => solid
               => matter
                  => physical entity
                     => entity
Table 1: Hypernyms of bass in the sense of sea bass.
good that you show a table for illustrations
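The transitivity of the hypernym relation can be illustrated with a minimal sketch. It does not read the WordNet database: the dictionary below is hand-copied from Table 1, the names DIRECT_HYPERNYM, hypernym_chain and is_a are made up for this illustration, and each synset is assumed to have a single direct hypernym, which is a simplification.

```python
# Toy hypernym graph: each synset name maps to its direct hypernym.
# Hand-copied from Table 1; NOT loaded from the WordNet database.
# Assumption: one direct hypernym per synset (a simplification).
DIRECT_HYPERNYM = {
    "sea bass": "saltwater fish",
    "saltwater fish": "seafood",
    "seafood": "food",
    "food": "solid",
    "solid": "matter",
    "matter": "physical entity",
    "physical entity": "entity",
}


def hypernym_chain(synset):
    """Return all hypernyms of `synset`, nearest first."""
    chain = []
    while synset in DIRECT_HYPERNYM:
        synset = DIRECT_HYPERNYM[synset]
        chain.append(synset)
    return chain


def is_a(synset, candidate):
    """True if `candidate` is a direct or inherited hypernym of `synset`."""
    return candidate in hypernym_chain(synset)


print(hypernym_chain("sea bass"))
print(is_a("sea bass", "food"))   # inherited via saltwater fish and seafood
print(is_a("food", "sea bass"))   # the relation only holds in one direction
```

Following the chain upwards is what makes sea bass a kind of food even though the two synsets are not directly linked.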
WordNet also distinguishes between two kinds of hyponyms: types and instances. A chair is a type of furniture.
Hesse, however, is not a type of author, but an instance of author. An instance is thus a specific form
of hyponym, and instances are usually proper nouns describing a unique entity, such as
persons, cities and companies. The instance relation goes both ways, just like the type relation: an
instance points to its class, and the class points back to its instances.
what do you mean by: they go both ways?
Meronymy, the part-whole relationship, is divided into three different types: member meronymy, part
meronymy and substance meronymy. Just like hyponymy, it has a counterpart, holonymy.
Where meronymy is has-part, holonymy is part-of. And just like hyponymy, meronymy is a
transitive relationship: if a tree has branches, and a branch has leaves, the tree has leaves.
Part meronymy, which is the relationship most commonly associated with the word, describes parts
of an entity.
is the blue sentence an explanation/example for meronyms?
Substance meronymy describes substances contained in an entity. For example, the word
water in the sense of the chemical substance H2O has the substances hydrogen and oxygen.
The last type, member meronymy, describes the relationship of belonging to a larger
group. Looking at the word tree again, we can see that it is a member of the entity forest,
wood and woods. See Table 2 for examples of the different types of meronymy.
Part meronymy:
Sense 1
tree
HAS PART: stump, tree stump
HAS PART: crown, treetop
HAS PART: limb, tree branch
HAS PART: trunk, tree trunk, bole
HAS PART: burl
Substance meronymy:
Sense 1
water, H2O
HAS SUBSTANCE: hydrogen, H, atomic number 1
HAS SUBSTANCE: oxygen, O, atomic number 8
Member meronymy:
Sense 1
forest, wood, woods
HAS MEMBER: underbrush, undergrowth, underwood
HAS MEMBER: tree
Table 2: Different types of meronymy used in WordNet.
very good table, before I didn't understand totally the different
types, but this helps a lot!
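As a rough sketch of how the three meronym types and their holonym counterparts relate, the entries of Table 2 can be stored as triples and indexed in both directions. The data is typed in by hand from the table (first lemma of each synset only), and MERONYM_TRIPLES, meronyms_of and holonyms_of are names invented for this example, not part of any WordNet interface.

```python
from collections import defaultdict

# (whole, relation type, part) triples, hand-copied from Table 2.
# Only the first lemma of each synset is used, to keep the example short.
MERONYM_TRIPLES = [
    ("tree", "part", "stump"),
    ("tree", "part", "crown"),
    ("tree", "part", "limb"),
    ("tree", "part", "trunk"),
    ("tree", "part", "burl"),
    ("water", "substance", "hydrogen"),
    ("water", "substance", "oxygen"),
    ("forest", "member", "underbrush"),
    ("forest", "member", "tree"),
]

# Index every triple twice: whole -> parts (meronyms), part -> wholes (holonyms).
PARTS_OF = defaultdict(list)
WHOLES_OF = defaultdict(list)
for whole, relation, part in MERONYM_TRIPLES:
    PARTS_OF[(whole, relation)].append(part)
    WHOLES_OF[(part, relation)].append(whole)


def meronyms_of(whole, relation):
    """What `whole` has, for one relation type: 'part', 'substance' or 'member'."""
    return list(PARTS_OF.get((whole, relation), []))


def holonyms_of(part, relation):
    """What `part` belongs to, for one relation type."""
    return list(WHOLES_OF.get((part, relation), []))


print(meronyms_of("water", "substance"))
print(holonyms_of("tree", "member"))
print(meronyms_of("tree", "member"))  # a tree has parts, but no members
```

Keeping the relation type in the key is one way of making sure that part, substance and member meronyms are never mixed up when a word such as tree takes part in more than one of them.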
Antonyms are words that are semantically opposed. If you are someone's parent, you cannot at the same time be
that person's child. However, antonyms do not have to cover every possibility between them. Even though poor and rich
are antonyms, saying that someone is not rich does not automatically mean that they are poor.
I am not sure if the example of parent and child is so good. here it sounds as if YOU are a parent
YOU cannot be a child of someone, but of course everyone is a child of someone. I know what you
mean by it, but just because i know what an antonym is. maybe something like: love - hate would be
better, or you should add: they are antonyms because they have opposite characteristics
1.2.2 Verbs
Verbs, just like nouns, have the hypernym relationship. Whereas the counterpart of hypernyms for
nouns is called hyponyms, the counterpart for verbs is called troponyms. These relations go
from an event to a superordinate event, and from an event to a subordinate event, respectively.
Troponyms can also be described as expressing the manner in which something is done, which explains the
difference in names. Antonymy also exists for verbs and functions in the same way: stop is an
antonym of start.
do you mean difference of senses?
The third relation, entails, goes from an event to an event it entails. Entailment is used in semantics
to describe a relationship between two sentences, where the truth of one sentence guarantees
the truth of the other: if A entails B and sentence A is true, then sentence B also has to be true. For example, take
“The criminal was executed” (A) and “The criminal is dead” (B). If A is true, then B
also has to be true. This is the kind of relationship described by entails in WordNet. If you snore,
you are also sleeping, which is represented as an entailment relation between the two words, and thus there
is an entails mapping from snore to sleep.
1.2.3 Adjectives and adverbs
Adjectives are mostly organized in terms of antonymy. As in the case of nouns and verbs, antonyms
are words whose meanings are semantically opposed. Like all words in WordNet, adjectives are
also part of a synset. The other adjectives in a synset can have antonyms of their own, and
these antonyms become indirect antonyms of the remaining words in the synset.
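The idea of indirect antonyms can be sketched in a few lines. The sketch follows the description above rather than WordNet's actual file format: the synset and the antonym pairs are toy data entered by hand, and SYNSET, DIRECT_ANTONYMS and indirect_antonyms are hypothetical names.

```python
# Toy data entered by hand for illustration; not read from WordNet.
SYNSET = {"large", "big"}
DIRECT_ANTONYMS = {
    "large": {"small"},
    "big": {"little"},
    "small": {"large"},
    "little": {"big"},
}


def indirect_antonyms(word, synset, direct):
    """Antonyms that `word` inherits from the other members of its synset."""
    own = direct.get(word, set())
    inherited = set()
    for other in synset - {word}:
        inherited |= direct.get(other, set())
    # Whatever is already a direct antonym is not counted as indirect.
    return inherited - own


print(sorted(indirect_antonyms("large", SYNSET, DIRECT_ANTONYMS)))  # ['little']
```

Here little is not stored as an antonym of large, but it becomes one indirectly because big shares a synset with large.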
The pertainym relation points from an adjective to the noun it was derived from. This is
one of the relations that cross parts of speech, though there are a few rare cases in which it points
to another adjective.
There's an extra paragraph for cross relations, maybe put it there?
The number of adverbs is quite small, because most adverbs in English
are derived from adjectives. Those that do exist are organized mostly in the same way as
adjectives, by antonymy. They also have a relation that resembles the pertainym relation of
adjectives, which is also a cross-part-of-speech pointer and points to the adjective they were
derived from.
there's also the relation "similar to" and "also see" which connects, as it says, "similar"
adjectives that are not synonymous enough to be in one synset. you could explain that as well.
1.2.4 Relations across part of speech
Most of the relations in WordNet hold between words of the same part of speech. There are,
however, some pointers across the parts of speech. One has already been
mentioned: pertainyms, which point from an adjective to the noun it was derived from. Other
than that, there are pointers to semantically similar words which share the same stem,
called derivationally related forms. For many of these pairs of nouns and verbs, the thematic role is
also described. The verb kill has a pointer to the noun killer, and killer would be the agent of kill.
2. Using WordNet for Natural Language Processing
There are several subfields in natural language processing which can benefit from having a large
lexical database, especially one as extensive as WordNet. Many semantic
applications can benefit from using WordNet, including WSD and sentiment analysis. Many
papers have been published on WordNet and WSD, exploring different approaches and
algorithms, and WSD is the main field in which WordNet is used. In fact, WordNet can be said to be the de facto
standard knowledge source for WSD in English.[4] This success depends on several factors: it is
not domain specific, it is very extensive, and it is publicly available.
what is which?
Since WSD is the subfield which has used WordNet most extensively, it is what will be
focused on here. It is worth mentioning, though, that packages for accessing WordNet exist
in several programming languages, including Perl and Python. For Python, the Natural Language
Toolkit (NLTK), a widely used library which offers many modules and tools for analyzing and processing
natural language, has tools for using WordNet, such as finding synsets and relations between
words.
good :)
2.1 WordNet for Word Sense Disambiguation
WSD is a field which has been around for as long as humans have tried to process natural language with
computers. It has been described as an AI-complete problem and is considered to be an
intermediate step in many NLP tasks. The two main approaches to solving this problem are
knowledge-based methods and supervised methods. Supervised methods suffer from a sparseness of
training data, in contrast to syntactic parsing, where there exist many resources of tagged data to
work with. SemCor is a subset of the Brown Corpus, tagged with senses from WordNet. 186 files
out of the 500 that constitute the Brown Corpus have tags for all of the content words (nouns,
verbs, adjectives and adverbs) and another 166 files have tags for the verbs only. Even if this may be
sufficient for evaluation, it is not enough for building a robust system for WSD.[3] The
knowledge-based methods use some kind of knowledge source, such as WordNet, to retrieve word
senses. It is for these methods that WordNet has been used extensively.
why footnote 3 after 4
so are knowledge-based methods better?
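To make the idea of a knowledge-based method more concrete, here is a minimal sketch of one classic example, the simplified Lesk algorithm, which picks the sense whose dictionary gloss shares the most words with the context. This is an illustration chosen for this section, not a method evaluated in the papers cited here. The two glosses are written by hand for the example and are not WordNet's own glosses, and SENSES, STOPWORDS and simplified_lesk are hypothetical names.

```python
# Hypothetical mini sense inventory: word -> {sense label: gloss}.
# The glosses are hand-written for this sketch, NOT taken from WordNet.
SENSES = {
    "bank": {
        "bank (financial institution)":
            "an institution that accepts deposits and lends money",
        "bank (river side)":
            "sloping land beside a body of water such as a river",
    }
}

# A tiny, ad hoc stop word list (an assumption of this sketch).
STOPWORDS = {"a", "an", "and", "as", "at", "by", "in", "of", "on",
             "such", "that", "the", "to"}


def content_words(text):
    """Lowercase, split on whitespace and drop stop words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(word, context, senses=SENSES):
    """Return the sense of `word` whose gloss overlaps most with `context`."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # on a tie, the first listed sense wins
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "she sat on the bank of the river and watched the water"))
print(simplified_lesk("bank", "he went to the bank to deposit money"))
```

No sense-tagged training data is needed here, only the glosses, which is why methods of this kind depend so heavily on the coverage and quality of the knowledge source.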
WordNet keeps occurring in papers on WSD to this date. Because knowledge-based
methods using WordNet perform worse than supervised methods, approaches to extend the
knowledge contained in WordNet have been proposed. They range from semantically tagging the
glosses in WordNet in order to enrich the semantic relations, to extracting knowledge from Wikipedia.
Combining WordNet with ConceptNet, a semantic network of common-sense knowledge, to
improve performance has also been proposed.[5]
first you say supervised is not robust enough, but then you say that knowledge-based
perform worse. I am confused. Maybe you can make two smaller paragraphs: supervised
approach and knowledge-based approach. There you can explain what
supervised/knowledgebased is and how they use WordNet and what advantages/
disadvantages they have/how use of WordNet has to be improved in each approach
3. Discussion
WordNet is an impressive database, with its large number of words and its encoding of the
relations between them. Being freely available also makes it very practical to use for natural language processing,
as it indeed has been. However, there are quite a few things that may speak against it. The very
fine-grained sense distinctions in the database can be problematic for several tasks. Difficulty, for
example, has four different senses in WordNet, all of them very similar, and they can be hard to tell
apart, not just for computers but also for humans. As such, not all sense distinctions may be relevant when
disambiguating a word. Other problems may be that it was mainly annotated and tagged by humans,
which may produce some inconsistencies, and that it was not produced specifically to solve NLP tasks.
WordNet is still widely used by people working on semantics in natural language processing, as is
evident from the literature, particularly on WSD. Recent research
has not abandoned WordNet, but has instead combined it with other
resources or tried to improve it in different ways. Since version 3.0, WordNet also
contains a corpus of semantically annotated, disambiguated glosses, which can itself prove to be
useful.[8] WordNet will continue to be used for WSD for some time to come, mostly because of the sparseness of data
for supervised methods. Improving the lexical knowledge and the algorithms that use it may
be the best way forward for the time being.
and since WordNet also contains a corpus.., itself can prove to be
WordNet will be mainly used for WSD?
may be the best way to find a good solution for the NLP tasks/WSD
sorry now I get your footnote system :) forget about the
footnote comments
Bibliography
[1] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross &
Katherine Miller, Introduction to WordNet: An On-line Lexical Database
(1993)
[2] Daniel Jurafsky & James H. Martin, Speech and Language Processing (Pearson Education International, 2009)
[3] Eneko Agirre & Philip Edmonds, Word Sense Disambiguation: Algorithms and
Applications (Springer, 2006)
[4] Roberto Navigli, Word Sense Disambiguation: A Survey (2009)
[5] Junpeng Chen & Juan Liu, Combining ConceptNet and WordNet for Word Sense
Disambiguation (2011)
[6] Jorge Morato, Miguel Ángel Marzal, Juan Lloréns & José Moreiro, Wordnet
Applications (2003)
[7] Julian Szymański & Włodzisław Duch, Annotating Words Using WordNet
Semantic Glosses (2012)
I liked your discussion a lot, especially what the problems of WordNet are. It would be nice if
you could extend / restructure the "WordNet for WSD" part, so one can understand what
these two methods are, what the difference between them is, how WordNet is used in these
approaches and what method performs better. I also like your tables with examples for the
WordNet relations. Have a merry Christmas :)