Distributional Semantics

Anca Dinu, University of Bucharest
ACED 15, Bucharest 2013
Ack: This work was supported by a grant of the Romanian National Authority for
Scientific Research, CNCS – UEFISCDI, project number PN-II-ID-PCE-2011-3-0959.
What this paper is about:
• Formal versus Distributional Semantics – a complementarity perspective;
• The compositionality problem for Distributional Semantics;
• Our idea – dropping the raw frequencies, using rankings;
• Experimental results;
• Further work;
• Conclusions.




This paper focuses on an alternative method of
measuring word-word semantic relatedness in
the Distributional Semantics (DS) framework.
Distributional Semantics is an approach to
semantics that is based on the contexts of
words (and linguistic expressions) in large
corpora.

Frege: Linguistic signs have a reference and a sense.
• Formal Semantics studies “meaning” as “reference”.
• Distributional Semantics focuses on “meaning” as “sense”, leading to the “language as use” view.
• Focus of FS:
Grammatical words:
• prepositions,
• articles,
• quantifiers,
• coordination,
• auxiliary verbs,
• pronouns,
• negation.
• Focus of DS:
Content words:
• nouns,
• adjectives,
• verbs,
• adverbs.


Formal semantics gives an elaborate and
elegant account of the productive and
systematic nature of language.
Then why bother with Distributional Semantics?
• Because many semantic issues come from the lexicon of content words and not from grammatical terms.
• And because it is a solvable problem.
• And because there are a number of applications for it (such as information retrieval tasks or search engines).

For instance:
“In order for a cow to be brown most of its body's
surface should be brown, though not its udders, eyes,
or internal organs. A brown crystal, on the other hand,
needs to be brown both inside and outside. . .
Furthermore, in order for a cow to be brown, brown
should be the animal's natural color. A table, on the
other hand, is brown even if it is only painted brown.
But while a table or a cow are not brown if covered
with brown sugar, a cookie is. In short, what is to be
brown varies from one linguistic context to another.”
(Lahav 1993:76)


The problem of lexical semantics is primarily a problem of size: even considering the many regularities found in the content lexicon, an item-by-item manual analysis is simply not feasible.
There have been important attempts to tackle
the lexicon problem from the point of view of
formal semantics, like Pustejovsky's (1995)
theory of the Generative Lexicon.


Distributional Semantics is rather like a concrete implementation of a feature-based semantic representation, akin to the Generative Lexicon (Pustejovsky, 1995) or Lexical Conceptual Structures (Jackendoff, 1990).
However, unlike these theories, DS Models can
be induced fully automatically on a large scale,
from corpus data.


DS assumes that the statistical distribution of
words in context plays a key role in
characterizing their semantic behavior.
The idea that word co-occurrence statistics
extracted from text corpora can provide a basis
for semantic representations can be traced back
at least to Firth (1957): ”You shall know a word
by the company it keeps” and to Harris (1954)
”words that occur in similar contexts tend to
have similar meanings”.





For instance, by looking at a word's context, one can infer some of its meaning:
• “tasty tnassiorc”, “greasy tnassiorc”, “tnassiorc with butter”, “tnassiorc for breakfast” – FOOD
• “He filled the wampimuk, passed it around and we all drank some” – DRINK
• “We found a little, hairy wampimuk sleeping behind the tree” – ANIMAL


Traditionally, vector space models have been applied to measure word similarity in the DS framework.
A word is represented by a vector whose elements are derived from the co-occurrences of the word in various contexts, such as:
• windows of words (Lund and Burgess (1996));
• grammatical dependencies (Lin (1998), Pado and Lapata (2007));
• richer contexts consisting of dependency links and selectional preferences on the argument positions (Erk and Pado (2008)).


For instance, with the English Wikipedia as corpus (Hinrich Schütze, Distributional Semantics, 2012) and co-occurrence defined as occurrence within k = 10 words of each other (a small counting sketch follows the table below):






cooc.     silver   disease   society
rich         186        17       143
poor          34       162       228
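As a rough sketch (not the authors' actual extraction pipeline), window-based co-occurrence counts like these can be collected from a tokenised corpus as follows; the token list below is a toy stand-in for the tokenised Wikipedia.

from collections import defaultdict

def cooccurrence_counts(tokens, targets, k=10):
    # Count how often each context word occurs within k tokens of a target word.
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        if word not in targets:
            continue
        window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        for context in window:
            counts[word][context] += 1
    return counts

# toy usage (the real corpus would be the tokenised English Wikipedia)
tokens = "the rich merchant sold silver while the poor farmer fought disease".split()
print({t: dict(c) for t, c in cooccurrence_counts(tokens, {"rich", "poor"}, k=10).items()})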





The similarity between two words is measured as the cosine of the angle between their vectors.
Small angle: silver and gold are similar.
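A minimal sketch of cosine similarity between two such count vectors, in plain Python; the vector for “gold” is made up here for illustration, since its counts are not given above.

def cosine(u, v):
    # cosine of the angle between two count vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# vectors over the dimensions (rich, poor) from the table above;
# the counts for "gold" are hypothetical, chosen only to show a small angle
silver = (186, 34)
gold = (170, 30)
disease = (17, 162)
print(cosine(silver, gold))     # close to 1.0: small angle, similar words
print(cosine(silver, disease))  # much lower: large angle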


Instead of a space with only two dimensions, it is usual to employ a very high-dimensional space with a large number of vectors represented in it.
We can compute the nearest neighbors of any word in this word space.
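A sketch of nearest-neighbour lookup in such a word space; the toy space below uses only the two dimensions from the Wikipedia example, whereas a real space has many thousands.

def nearest_neighbors(word, vectors, n=20):
    # rank all words in the space by cosine similarity to `word`
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0
    target = vectors[word]
    scored = sorted(((cosine(target, v), w) for w, v in vectors.items()), reverse=True)
    return scored[:n]

# toy word space over the dimensions (rich, poor)
space = {"silver": (186, 34), "disease": (17, 162), "society": (143, 228)}
print(nearest_neighbors("silver", space, n=3))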

Nearest neighbors of “silver”:
1.000 silver / 0.865 bronze / 0.842 gold / 0.836
medal / 0.826 medals / 0.761 relay / 0.740
medalist / 0.737 coins / 0.724 freestyle / 0.720
metre / 0.716 coin / 0.714 copper / 0.712 golden /
0.706 event / 0.701 won / 0.700 foil / 0.698 Winter
/ 0.684 Pan / 0.680 vault / 0.675 jump

Nearest neighbors of “disease”:
1.000 disease / 0.858 Alzheimer / 0.852 chronic /
0.846 infectious / 0.843 diseases / 0.823 diabetes
/ 0.814 cardiovascular / 0.810 infection / 0.807
symptoms / 0.805 syndrome / 0.801 kidney /
0.796 liver / 0.788 Parkinson / 0.787 disorders /
0.787 coronary / 0.779 complications / 0.778 cure
/ 0.778 disorder / 0.778 Crohn / 0.773 bowel



So far, the vectors in the space have represented words.
But we can also represent other entities: phrases, sentences, paragraphs, documents, even entire books.
The compositionality problem: how do we obtain the distribution vector of a phrase?

Option 1: The distribution of phrases – even sentences – can be obtained from corpora, but...
• those distributions are very sparse;
• observing them does not account for productivity in language.

Option 2: Use vector addition or the product of two or more word vectors to compute the phrase distribution, but...
• addition and multiplication are commutative in a word-based model (see the sketch below):
▪ [[The cat chases the mouse]] = [[The mouse chases the cat]].
• There is an excellent program that takes the syntax into account: “Frege in Space: A Program for Compositional Distributional Semantics” by Baroni et al., 2012.
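To see why word order is lost, here is a minimal sketch of additive and multiplicative composition over toy word vectors (the numbers are made up); both operations are commutative, so the two sentences receive identical representations.

def compose_add(vectors):
    # component-wise sum of the word vectors
    return [sum(dims) for dims in zip(*vectors)]

def compose_mult(vectors):
    # component-wise product of the word vectors
    result = []
    for dims in zip(*vectors):
        product = 1.0
        for d in dims:
            product *= d
        result.append(product)
    return result

# toy 2-dimensional word vectors (hypothetical values)
cat, chases, mouse = [2, 1], [3, 9], [8, 4]

# the bag of vectors is identical for both sentences,
# so both composed representations are identical as well
assert compose_add([cat, chases, mouse]) == compose_add([mouse, chases, cat])
assert compose_mult([cat, chases, mouse]) == compose_mult([mouse, chases, cat])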



This paper presents an alternative method of measuring word-word semantic relatedness in the DS framework.
Instead of representing words in vector spaces, we represent them as rankings of co-occurring words, ordered by their co-occurrence frequency (see the sketch below).
Advantages:
• Simpler;
• Presumably more robust, since rankings of features are expected to vary less than the raw frequencies with the choice of corpus;
• Opens the perspective of experimenting with new methods of composing (distributional) meaning by aggregating rankings instead of combining (adding, multiplying) vectors.
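A minimal sketch of the ranking representation, assuming (as we read the description above) that each target word is mapped to the list of its context words sorted by decreasing co-occurrence frequency.

def ranking(cooc_row, top_n=None):
    # order the context words by decreasing co-occurrence frequency
    ranked = sorted(cooc_row, key=lambda w: cooc_row[w], reverse=True)
    return ranked[:top_n] if top_n else ranked

# toy co-occurrence row for one target word: {context word: frequency}
print(ranking({"gold": 42, "medal": 30, "coin": 17, "won": 5}))
# -> ['gold', 'medal', 'coin', 'won']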


We used the Wacky corpus (Baroni et al., 2009), which is lemmatized and POS-tagged.
We extracted from the Wacky corpus the co-occurrence vectors for the words in the WS-353 Test (Finkelstein et al. 2002), using a 10-word window.


The WS-353 Test is a semantic relatedness test set consisting of 353 word pairs and a gold standard defined as the mean value of the evaluations of up to 17 human judges.
Completely unrelated words were assigned a value of 0, while identical words were assigned a value of 10.


The WS-353 Test has been widely used in the literature and has become the de facto standard for evaluating semantic relatedness measures.
For each of the 437 distinct target words in the WS-353 Test, we computed its co-occurrence frequencies with the words in the corpus.


This resulted in 437 rankings of words (features), ordered by decreasing co-occurrence frequency.
We then computed four different similarity measures (Jaro, Rank, Inverse Rank and MeanRank) for each pair of rankings representing a pair of words in the WS-353 Test.

The best Spearman correlation with the human judgements (0.46) was achieved by the Jaro distance, defined as
Jaro(S1, S2) = 1/3 · (c/|S1| + c/|S2| + (c − t)/c),
where t is the number of character transpositions between S1 and S2 (i.e. the number of common characters that occur in different positions, divided by 2) and c is the number of common characters (see the sketch below).
• This is close to the baseline Spearman correlation obtained by DSMs, which is around 0.5.
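A sketch of how this measure could be implemented over rankings of context words rather than strings of characters; this is our reading of the definition above, not necessarily the exact implementation used in the experiments.

def jaro(r1, r2):
    # Jaro measure between two rankings, following the definition above:
    # c = number of common items, t = half the number of common items
    # that occupy different positions in the two rankings.
    common = set(r1) & set(r2)
    c = len(common)
    if c == 0:
        return 0.0
    pos1 = {w: i for i, w in enumerate(r1)}
    pos2 = {w: i for i, w in enumerate(r2)}
    t = sum(1 for w in common if pos1[w] != pos2[w]) / 2.0
    return (c / len(r1) + c / len(r2) + (c - t) / c) / 3.0

# toy rankings of context words for two target words
print(jaro(["gold", "medal", "coin", "won"], ["gold", "coin", "bronze"]))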

When inspecting the worst mismatches between human and machine relatedness judgements for pairs of words, we observed that most of them followed a pattern:
• lower values assigned by humans corresponded to much higher values computed by the machine, such as:










Pair                       Humans   Machine
(month, hotel)               1.81   6.239567
(money, operation)           3.31   6.40989
(king, cabbage)              0.23   4.171145
(coast, forest)              3.15   6.409761
(rooster, voyage)            0.62   4.656631
(governor, interview)        3.25   6.08319
(drink, car)                 3.04   5.931482
(day, summer)                3.94   6.576498
(architecture, century)      3.78   5.927852
(morality, marriage)         3.69   5.450308
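For completeness, the Spearman correlation reported above is computed between the full lists of machine scores and gold-standard values; a sketch using scipy.stats.spearmanr with placeholder lists (the real evaluation covers all 353 pairs).

from scipy.stats import spearmanr

# human: gold-standard relatedness values; machine: the corresponding ranking-based scores
human = [1.81, 3.31, 0.23, 3.15, 0.62]
machine = [6.239567, 6.40989, 4.171145, 6.409761, 4.656631]
rho, pvalue = spearmanr(human, machine)
print(rho)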



Our method performs around the baseline for
DSMs, so there is hope for improvement;
It is also less computationally expensive;
It provides a new framework for
experimenting with distributional semantic
compositionality.
Thank you!


Contextual information – bag of words:
The first category of models aims at integrating the
widest possible range of context information without
recourse to linguistic structure. The best-known work
in this category is Schütze (1998). He first computes
“first-order” vector representations for word meaning
by collecting co-occurrence counts from the entire
corpus. Then, he determines “second-order” vectors
for individual word instances in their context, which is
taken to be a simple surface window, by summing up
all first-order vectors of the words in this context. The
resulting vectors form sense clusters.
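A rough sketch of the second-order vector construction described above; first_order is assumed to map each word to its corpus-wide co-occurrence vector, and the sense-clustering step is omitted.

def second_order_vector(context_words, first_order):
    # sum the first-order (corpus-wide co-occurrence) vectors of the
    # words that appear in the surface window of the target instance
    vectors = [first_order[w] for w in context_words if w in first_order]
    if not vectors:
        return []
    return [sum(dims) for dims in zip(*vectors)]

# toy first-order vectors over two dimensions
first_order = {"interest": [10, 2], "rate": [8, 1], "bank": [5, 5]}
print(second_order_vector(["interest", "rate", "bank"], first_order))
# -> [23, 8]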

Contextual information – including syntactic
information (A Structured Vector Space Model for
Word Meaning in Context Katrin Erk and Sebastian
Pado, Proceedings of the 2008 Conference on
Empirical Methods in Natural Language Processing,
pages 897–906, Honolulu, October 2008):
• they encode each word as a combination of one vector that models the lexical meaning of the word, and a set of vectors, each of which represents the selectional preferences for the particular syntactic relations that the word supports (such as subj, obj, mod); a data-structure sketch follows below.

Or:
• (Structured Composition of Semantic Vectors, Stephen Wu and William Schuler, Proceedings of the International Conference on Computational Semantics, Oxford, UK, 2011) proposes a structured vectorial semantic framework, in which semantic vectors are defined and composed in syntactic context.
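A minimal data-structure sketch of the Erk and Pado-style representation described above (our illustration, not their actual implementation): each word carries a lexical vector plus one selectional-preference vector per syntactic relation it supports.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class StructuredMeaning:
    # lexical meaning vector of the word itself
    lexical: List[float]
    # one selectional-preference vector per supported syntactic relation
    preferences: Dict[str, List[float]] = field(default_factory=dict)

# toy example: a verb with preference vectors for its subject and object slots
catch = StructuredMeaning(
    lexical=[0.2, 0.9, 0.1],
    preferences={"subj": [0.7, 0.1, 0.3], "obj": [0.1, 0.8, 0.5]},
)
print(catch.preferences["obj"])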

Cases where simple word space models fail:
• Antonyms are judged to be similar: “disease” and “cure”;
• Ambiguity: “Cambridge”;
• Homonymy: “bank”;
• Non-specificity (the word occurs in a large variety of different contexts and has few or no specific semantic associations): “person”.