slides day 3 (Distributional semantics III: One distributional semantics, many semantic relations)

Distributional semantics III:
One distributional semantics,
many semantic relations
Marco Baroni
UPF Computational Semantics Course
Outline
Attributional and relational similarity
Distributional Memory
Conclusion
Is distributional semantics really “general purpose”?
- Most standard tasks (synonym detection, similarity judgments, categorization, coordinate priming...) evaluate taxonomic similarity
- Taxonomic similarity is fundamental, but a general-purpose model should also:
  - capture other semantic relations
  - distinguish between different semantic relations
Other semantic relations
Baroni & Lenci, in press
[Figure: plot residue; panels SVD and SUBJECTS, relation types TAXO, RELENT, PART, QUALITY, ACTIVITY, FUNCTION, LOCATION]
Different types of semantic relations
Nearest neighbours of book in a ukWaC-based SVD-ed semantic space:
- reading (function)
- story (part?)
- article (coordinate)
- writing (agentive)
- excerpt (part?)
- writer (relent)
- commentary (coordinate)
- to write (agentive)
- interesting (quality)
Finding and distinguishing semantic relations
- Classic distributional semantic models are based on attributional similarity
  - Single words/concepts that tend to share contexts/attributes are similar
- However, corpora (especially large ones) also contain many instances of pairs (or longer tuples) of words/concepts that, by the very fact that they co-occur, will tend to be related, and not only by taxonomic similarity (parts, functions, activities...)
- Pairs that occur in similar contexts (in particular: connected by similar words and structures) will tend to be related, with the shared contexts acting as a cue to the nature of their relation, i.e., measuring their relational similarity (Turney 2006)
Attributional and relational similarity
Turney (2006)
- Policeman is attributionally similar to soldier since they both occur in contexts such as: kill X, with gun, for security
- The pair policeman-gun is relationally similar to teacher-book since they both occur in contexts in which they are connected by with, use, of
- Structuralists might call this: looking for paradigmatic relations of syntagmatic relations
Finding and distinguishing semantic relations
- Find non-taxonomic semantic relations:
  - Look at direct co-occurrences of word pairs in texts (when we talk about a concept, we are likely to also mention its parts, function, etc.)
- Distinguish between different semantic relations:
  - Use the contexts of pairs to measure pair similarity, and group them into coherent relation types by their contexts
  - Analogical reasoning (Turney 2006, 2008)
General purpose?
- It seems like we need to collect statistics to build two different models:
  - word-by-context statistics to measure attributional similarity
  - word-pair-by-context statistics to measure relational similarity
General purpose?
Should we care?
- No?
- Dominating paradigm in computational linguistics is “special purpose”:
  - define the task
  - build appropriate training resources
  - extract generalizations from training data, and use them to tackle the task
General purpose?
Should we care?
- Yes!
- 20 years of statistical methods in semantics have produced no general-purpose resource such as WordNet, FrameNet, ConceptNet, etc.
  - With the possible exception of Dekang Lin’s (1998) thesaurus
- Going back to the corpus to collect ad-hoc statistics is time- and resource-consuming, and leaves little time to think of the “next steps” (what do you do with the resources you collected?)
- Human semantic memory is general purpose!
General purpose?
- It seems like we need to collect statistics to build two different models:
  - word-by-context statistics to measure attributional similarity
  - word-pair-by-context statistics to measure relational similarity
- Baroni and Lenci (2009): if we go for an attributional semantic space where contexts include links, the two models are different views of the same tuple statistics graph, which can be collected once and for all from the corpus
  - The extended example below uses dependency paths as links, but links could also be pattern-based
- A distributional semantic model is a set of weighted concept+link+concept tuples collected from the corpus; different matrices are generated on demand from these tuples, on a task-specific basis (sketched in code below)
- We refer to this approach as Distributional Memory (DM), by analogy with human semantic memory
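
A minimal sketch of this idea in Python (toy tuples and weights from the running example; the function names are ours, not part of any DM release): the tuple list is collected once, and the two matrix views are generated from it on demand.

from collections import defaultdict

# one-off tuple statistics: (concept1, link, concept2, weight)
tuples = [
    ("soldier", "use", "gun", 41.0),
    ("policeman", "with", "gun", 30.5),
    ("kill", "obj", "victim", 915.4),
    ("kill", "subj_tr", "soldier", 1306.9),
]

def cxlc_view(tuples):
    """Concept-by-Link+Concept view: one row per concept, one column per
    (link, other concept) pair -- the basis of attributional similarity."""
    rows = defaultdict(dict)
    for c1, link, c2, w in tuples:
        rows[c1][(link, c2)] = w
        rows[c2][(link + "-1", c1)] = w  # inverse link
    return rows

def ccxl_view(tuples):
    """Concept+Concept-by-Link view: one row per concept pair, one column
    per link -- the basis of relational similarity."""
    rows = defaultdict(dict)
    for c1, link, c2, w in tuples:
        rows[(c1, c2)][link] = w
    return rows

print(cxlc_view(tuples)["soldier"])           # {('use', 'gun'): 41.0, ('subj_tr-1', 'kill'): 1306.9}
print(ccxl_view(tuples)[("soldier", "gun")])  # {'use': 41.0}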
Outline
Attributional and relational similarity
Distributional Memory
Conclusion
One set of tuple statistics from the corpus

concept1    link        concept2    weight
soldier     use         gun         41.0
gun         use−1       soldier     41.0
policeman   with        gun         30.5
gun         with−1      policeman   30.5
kill        obj         victim      915.4
victim      obj−1       kill        915.4
kill        subj_tr     soldier     1306.9
soldier     subj_tr−1   kill        1306.9
The DM graph
[Figure: the DM graph. Concept nodes (die, kill, teacher, handbook, victim, school, policeman, soldier, gun) connected by weighted, labeled edges, e.g. kill −subj_tr→ soldier (1306.9), kill −obj→ victim (915.4), soldier −with→ gun (105.9), teacher −use→ handbook (10.1), teacher −in→ school (11894.4)]
Different matrix views of the DM graph

Concept by Link+Concept (CxLC)

            subj_in−1   subj_tr−1   obj−1    with    use
            die         kill        kill     gun     gun
teacher     109.4       0.0         9.9      0.0     0.0
soldier     4547.5      1306.9      8948.3   105.9   41.0
policeman   68.6        38.2        538.1    30.5    7.4

Concept+Concept by Link (CCxL)

                   in        at       with    use
teacher school     11894.4   7020.1   28.9    0.0
teacher handbook   2.5       0.0      3.2     10.1
soldier gun        2.8       10.3     105.9   41.0
Same underlying corpus model, different similarities
Empirical validation from Baroni & Lenci (2009)
- ukWaC pre-parsed with MINIPAR (Lin 1998)
- Target concepts/words: top 20k most frequent nouns, top 5k most frequent verbs
- Links:
  - the top 30 most frequent direct V-N dependency paths (e.g. kill+obj+victim)
  - the top 30 preposition-mediated N-N or V-N paths (e.g. soldier+with+gun)
  - the top 50 transitive-verb-mediated N-N paths (e.g. soldier+use+gun)
  - all the inverses of the above relations (e.g. victim+obj−1+kill)
- Weighting by Local MI (Evert 2008; formula below)
- 69 million tuples extracted
- No SVD on the attributional/relational matrices, although in principle it would be possible
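
For reference, Local MI scores each tuple by its observed frequency O times the log-ratio of O to the expected frequency E; a standard formulation in LaTeX (the independence-based estimate of E for three-element tuples is our assumption here, not spelled out on the slide):

\mathrm{LocalMI}(c_1, l, c_2) \;=\; O \cdot \log_2 \frac{O}{E},
\qquad E = \frac{f(c_1)\, f(l)\, f(c_2)}{N^2}

where f(·) are marginal frequencies and N is the total tuple count.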
Same underlying corpus model, different similarities
Empirical validation from Baroni & Lenci (2009)
- The emphasis is on the model’s generality and adaptivity
- The goal is to achieve state-of-the-art results (not necessarily the best score), without task-specific parameter tuning
Attributional similarity
The Concept-by-Link+Concept (CxLC) space
- Rows: target concepts
- Columns: nodes linked to the targets in the graph, “typed” with the corresponding edge label (links)
- Standard dependency-based linked model à la Grefenstette, Lin, Curran & Moens, etc.

            subj_in−1   subj_tr−1   obj−1    with    use
            die         kill        kill     gun     gun
teacher     109.4       0.0         9.9      0.0     0.0
victim      1335.2      22.4        915.4    0.0     0.0
soldier     4547.5      1306.9      8948.3   105.9   41.0
policeman   68.6        38.2        538.1    30.5    7.4
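
As a quick illustration (our addition, not on the original slide), the cosine between two rows of the table above shows how attributional similarity is read off this space; the ≈0.94 result simply reflects the heavy context overlap between the two concepts.

import math

# CxLC rows for soldier and policeman, from the example table above
soldier   = [4547.5, 1306.9, 8948.3, 105.9, 41.0]
policeman = [68.6, 38.2, 538.1, 30.5, 7.4]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

print(round(cosine(soldier, policeman), 2))  # 0.94: they share contexts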
Rubenstein & Goodenough results

model         r
ukWaCDV       0.70
DM            0.64
ukWaCWindow   0.63
DV+           0.62
ukWaCHAL      0.61
cosDV+        0.47
Categorization
- 44 concrete nouns (ESSLLI 2008 Distributional Semantics shared task)
  - 24 natural entities
    - 15 animals: 7 birds (eagle), 8 ground animals (lion)
    - 9 plants: 4 fruits (banana), 5 greens (onion)
  - 20 artifacts
    - 13 tools (hammer), 7 vehicles (car)
- Clustering with CLUTO
- 6-way (birds, ground animals, fruits, greens, tools and vehicles), 3-way (animals, plants and artifacts) and 2-way (natural and artificial entities) clustering
- Cluster evaluation again in terms of purity (defined below)
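
For reference, the usual definition of purity (our formulation): each cluster ω_k is matched with its majority gold class c_j, and the proportion of items so accounted for is reported:

\mathrm{purity}(\Omega, C) \;=\; \frac{1}{n} \sum_{k} \max_{j} \, |\omega_k \cap c_j|

where n is the total number of clustered items.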
Categorization
Results

model         6-way   3-way   2-way
Katrenko      89      100     80
Peirsman+     82      84      86
ukWaCDV       80      75      61
DM            77      79      59
ukWaCHAL      75      68      68
Peirsman−     73      71      61
ukWaCWindow   70      68      59
Shaoul        41      52      55
Relational similarity
The Concept+Concept-by-Link (CCxL) space
- Same tuples, different matrix view:
  - Rows: concept pairs
  - Columns: links between the concept pairs

                   in        at       with    use
teacher school     11894.4   7020.1   28.9    0.0
teacher handbook   2.5       0.0      3.2     10.1
soldier gun        2.8       10.3     105.9   41.0
Relational tasks
- Finding the most similar pair to a target pair (recognizing SAT analogies)
- Defining a relation type by example pairs, and deciding whether a target instantiates the relation type (SEMEVAL semantic relation classification)
- Unsupervised clustering of relation types (McRae nominal relation data-set)
Recognizing SAT analogies
- 374 SAT multiple-choice questions (Turney 2006)
- Each question includes 1 target pair (stem) and 5 answer pairs
- The task is to choose the pair most analogous to the stem

  mason : stone
    teacher : chalk
    carpenter : wood
    soldier : gun
    photograph : camera
    book : word

- Relational analogue to the TOEFL task!
- Approach: for each stem pair, select the answer pair with the highest cosine in CCxL space (sketched below)
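
A sketch of the selection step (the link weights here are invented, not real DM rows): represent each pair by its CCxL link vector and pick the answer pair whose vector has the highest cosine with the stem's.

import math

def cosine(u, v):
    dot = sum(u.get(l, 0.0) * v.get(l, 0.0) for l in set(u) | set(v))
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

stem = {"use": 12.0, "with": 8.0, "of": 2.0}  # e.g. mason-stone
answers = {
    "teacher-chalk":  {"use": 5.0, "with": 1.0},
    "carpenter-wood": {"use": 11.0, "with": 7.0, "of": 3.0},
    "soldier-gun":    {"with": 9.0, "use": 2.0},
}
print(max(answers, key=lambda p: cosine(stem, answers[p])))  # carpenter-wood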
Recognizing SAT analogies
Results

model       % correct
LRA         56.1
PERT        53.3
PairClass   52.1
VSM         47.1
DM+         45.3
PairSpace   44.9
k-means     44.0
KnowBest    43.0
DM−         42.3
LSA         42.0
AttrMax     35.0
AttrAvg     31.0
AttrMin     27.3
Random      20.0
Domain analogies
- Turney (2008) extends the relational approach to entire analogical domains:

  solar system → atom
  sun          → nucleus
  planet       → electron
  mass         → charge
  attracts     → attracts
  revolves     → revolves
  gravity      → electromagnetism
Semantic relation classification
- SemEval-2007 Task 04: Semantic Relations between Nominals
- 7 relation types: CAUSE-EFFECT, INSTRUMENT-AGENCY, PRODUCT-PRODUCER, ORIGIN-ENTITY, THEME-TOOL, PART-WHOLE, CONTENT-CONTAINER
- Instances harvested with patterns from the Web, and manually labeled as hits or misses
- For each relation, 140 training examples (about 50% hits), about 80 test cases
Semantic relation classification
- For each relation, we build hit and miss prototype vectors by averaging across vectors of training examples in CCxL space
- For each test pair, we base the hit/miss choice on cosine similarity to the hit and miss average vectors

Prototype vectors
- Simply sum the vectors of the relevant class to obtain the prototype (average, centroid) vector
- No need to rescale the resulting vector (e.g., to mean values) as long as we are only going to look at its angle/cosine with other vectors
[Figure: two plots illustrating the summation of class vectors into a prototype vector]
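
A sketch of the procedure with hypothetical toy vectors (the real vectors are CCxL rows of training pairs): sum per class to get the prototypes, then label each test pair by the prototype with which it has the higher cosine.

import math

def add(vectors):
    proto = {}
    for v in vectors:
        for link, w in v.items():
            proto[link] = proto.get(link, 0.0) + w
    return proto

def cosine(u, v):
    dot = sum(u.get(l, 0.0) * v.get(l, 0.0) for l in set(u) | set(v))
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# hit/miss prototypes summed from (invented) training-pair vectors
hit_proto  = add([{"in": 3.0, "of": 1.0}, {"in": 2.0}])
miss_proto = add([{"with": 4.0}, {"with": 1.0, "of": 2.0}])

test_pair = {"in": 1.0, "of": 0.5}
choice = "hit" if cosine(test_pair, hit_proto) > cosine(test_pair, miss_proto) else "miss"
print(choice)  # hit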
Semantic relation classification

model         precision   recall   F      accuracy
UCD-FC        66.1        66.7     64.8   66.0
UCB           62.7        63.0     62.7   65.4
ILK           60.5        69.5     63.8   63.5
DM+           60.3        62.6     61.1   63.3
UMELB-B       61.5        55.7     57.8   62.7
SemEval avg   59.2        58.7     58.0   61.1
DM−           56.7        58.2     57.1   59.0
UTH           56.1        57.1     55.9   58.8
majority      81.3        42.9     30.8   57.0
probmatch     48.5        48.5     48.5   51.7
UC3M          48.2        40.3     43.1   49.9
alltrue       48.5        100.0    64.8   48.5
Can we discover relation types?
- In both previous tasks, relation types are (implicitly) given by example (bottle-drink exemplifies the “function” relation type, etc.)
- Can we discover relation types in the CCxL space, like we found concept types (animals, fruit, etc.) in the CxLC space, by unsupervised similarity-based clustering?
- At least some evidence that we can, from Baroni et al. (submitted)
  - Using a different underlying graph from the one of the other experiments!
Relation clustering
- 205 prototypically related concept pairs from spontaneously produced concept descriptions (McRae et al.’s 2005 conceptual norms)
- Manual annotation by McRae and colleagues into 4 classes:
  - 85 part relations: motorcycle-engine
  - 31 category relations: pineapple-fruit
  - 25 location relations: pig-farm
  - 64 (nominally expressed) function relations: ship+cargo
- Clustering with CLUTO in CCxL space
Clustering relation types: results

clust   part   cat   loc   funct   descriptive links
1       77     0     1     10      P_of_C, C_’s_P, C_with_P, C_have_P, P_on_C
2       1      31    0     1       P_such_as_C, P_like_C, C_as_P, P_as_C, P_from_C
3       2      0     24    21      C_on_P, P_with_C, C_in_P, C_from_P, C_to_P
4       5      0     0     32      C_of_P, P_from_C, C_for_P, P_in_C, P_by_C
Clustering relation types: results
The problem with function
- Function as annotated in the McRae norms (as recoded by us) is a mixed bag:
  - mushroom+hallucinogen (category?), ship+cargo (part?), cherry+cake (location?)
  - Activities: pen+writing
  - Objects and participants in the functional activity: pen+page, pen+author
  - Fillers of prototypical functions: cup+tea
Ad interim summary
- Attributional and relational similarity can be seen as different views of the same corpus-estimated statistical model
- Empirical evidence that this is a viable strategy
- Empirical evidence that relational similarity can tackle the problems of finding different classes of related words/concepts, and identifying the nature of the relation
The Devil’s Advocate
- Relational similarity works for arbitrary relations
  - Including taxonomic ones: “category” above (=hypo-/hypernym), synonym detection...
  - Counterintuitively, given a sufficiently large corpus you can do fairly well on the TOEFL synonym test (73.75% accuracy) based on direct synonym co-occurrence (Turney 2001)
- Then, why do we need attributional similarity at all?
Taxonomic as relational

soldier         and              policeman
sing            and              dance
artist          such_as          painter
accident        or               incident
Garbanzo_bean   also_known_as    chickpea
Attributional similarity and generalization
- Relational similarity is based on direct co-occurrence of the concepts in a pair
- If a pair does not co-occur in the corpus, we have no relational information about it
- Consequently, methods based on relational similarity need to rely on huge corpora
  - Turney uses a 50-billion-word corpus
  - Rapp and others got better TOEFL synonym results using attributional similarity on a 100M-word corpus than Turney with relational similarity on the WWW
- More importantly, the whole point of human language is to allow us to express new notions by combining words and their meanings
  - He devoured the topinambur (unattested, totally fine once you know topinambur is a vegetable)
  - He devoured the sympathy (unattested, bad)
Attributional similarity and generalization
- Suppose that neither devour+obj+topinambur nor devour+obj+sympathy was seen in the corpus, and thus they are not in the tuple graph
- Extract tuples that contain relational information about devour and its objects (devour+obj+meat, devour+obj+banana, etc.)
- Build an attributional vector in CxLC space for the prototypical object of devour, by summing the meat, banana, etc. vectors
- Measure the attributional similarity of topinambur and sympathy to the devour object prototype
  - The prototype and topinambur vectors will probably share high values on features such as obj−1_cook, obj−1_eat, on_table, etc.
Mixing relational and attributional similarity to model selectional preferences
- Basic idea from Padó et al. (2007)
- Correlation with human acceptability judgments of noun-verb pairs from McRae et al. (1997) (100 pairs, 36 raters) and Padó (2007) (211 pairs, ∼20 raters per pair):
  - how typical is deer as an object of shoot?
- For each verb we build its prototype subj (obj) argument vector:
  - sum the vectors of the 50 nouns with the highest weight on the appropriate link to the verb (e.g., the top 50 nouns connected to shoot by an obj link)
  - NB: if the target noun is in the prototype set, its vector is subtracted from the prototype: we treat each argument as topinambur!
- The cosine between prototype and candidate noun in CxLC space is taken as the model’s “plausibility judgment” about the noun occurring as the relevant verb argument (a sketch follows)
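
A sketch of the scoring step with invented CxLC rows (the noun vectors and feature names are illustrations, not real DM data; the real procedure uses the top 50 nouns per verb slot):

import math

# hypothetical attributional (CxLC) rows for some nouns
cxlc = {
    "meat":       {"obj-1_eat": 9.0, "obj-1_cook": 7.0, "on_table": 3.0},
    "banana":     {"obj-1_eat": 8.0, "obj-1_cook": 1.0, "on_table": 2.0},
    "topinambur": {"obj-1_eat": 2.0, "obj-1_cook": 3.0, "on_table": 1.0},
    "sympathy":   {"obj-1_feel": 6.0, "of_lot": 2.0},
}

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def prototype(vectors):
    proto = {}
    for v in vectors:
        for feat, w in v.items():
            proto[feat] = proto.get(feat, 0.0) + w
    return proto

# relational information tells us these nouns are typical objects of devour
proto = prototype([cxlc["meat"], cxlc["banana"]])
for cand in ["topinambur", "sympathy"]:
    print(cand, round(cosine(cxlc[cand], proto), 2))
# topinambur 0.87, sympathy 0.0: plausible vs. implausible object

The subtraction trick from the slide amounts to removing the candidate's own vector from proto before scoring, whenever the candidate is among the nouns used to build the prototype.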
Modeling selectional preferences
Results
- Performance measured by the Spearman ρ correlation coefficient between the average human ratings and the model predictions (Padó et al. 2007)

              McRae                Padó
model         coverage   ρ        coverage   ρ
Padó          56         41       97         51
DM            96         28       98         50
ParCos        91         21       98         48
ukWaCDV       96         21       98         39
ukWaCHAL      96         12       98         29
ukWaCWindow   96         12       98         27
Resnik        94         3        98         24
Acceptability of some potential objects of kill

object         cosine
kangaroo       0.51
person         0.45
robot          0.15
hate           0.11
flower         0.11
stone          0.05
fun            0.05
book           0.04
conversation   0.03
sympathy       0.01
Acceptability of some potential instruments of kill

instrument   cosine
hammer       0.26
stone        0.25
brick        0.18
smile        0.15
flower       0.12
antibiotic   0.12
person       0.12
heroin       0.12
kindness     0.07
graduation   0.04
Modeling selectional preferences
- Once more, no need to collect ad-hoc statistics
- We have all the necessary relational information (to find typical subjects/objects of a verb) and attributional information (to compare the prototype subject/object to the candidate) in the underlying tuple graph
Other views of the graph
- Concept-by-Link+Concept: attributional similarity
- Concept+Concept-by-Link: relational similarity
- Concept+Link-by-Concept: “slot” similarity, coming next
- Link-by-Concept+Concept: “type” similarity, left to further work
- Direct Concept+(Link+)+Concept ranking: properties of concepts
The Concept+Link-by-Concept (CLxC) space
- Rows: concept+link pairs
- Columns: concepts related to the concept+link pairs in the graph

               teacher   victim   soldier   policeman
kill subj_tr   0.0       22.4     1306.9    38.2
kill obj       9.9       915.4    8948.3    538.1
die subj_in    109.4     1335.2   4547.5    68.6
Argument alternations
- Measuring similarity between verb slots can be used to study syntactic alternations (Joanis et al. 2008)
- Causative/inchoative alternations (Levin 1993)
  - the patient of break can also surface as its (inchoative) subject, whereas this does not happen with mince
    - The cook broke the vase → The vase broke
    - The cook minced the meat → *The meat minced
Argument alternations in the CLxC space
- For alternating verbs, the direct object vector (verb + obj) should be similar to the intransitive subject vector (verb + subj-in)
  - the same things that are broken break
- For non-alternating verbs, the two slots should not be similar
  - the things that are minced are different from those that mince them
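
A sketch of the test with invented CLxC rows: for an alternating verb, the object and intransitive-subject slots range over similar nouns, so their cosine is high; for a non-alternating verb, the two slots share little.

import math

clxc = {
    ("break", "obj"):     {"vase": 9.0, "window": 7.0, "cup": 4.0},
    ("break", "subj_in"): {"vase": 6.0, "window": 8.0, "glass": 3.0},
    ("mince", "obj"):     {"meat": 9.0, "onion": 5.0},
    ("mince", "subj_in"): {"cook": 4.0, "machine": 2.0},
}

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

for verb in ["break", "mince"]:
    sim = cosine(clxc[(verb, "obj")], clxc[(verb, "subj_in")])
    print(verb, round(sim, 2))  # break: high (0.87), mince: 0.0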
Argument alternations
Experiments
- 402 verbs extracted from Levin Classes (Levin 1993)
  - 232 alternating causatives/inchoatives (break)
  - 170 non-alternating transitives (mince)
- Median per-verb pairwise cosines among slots:

                  subj-intr/subj-tr   subj-intr/obj   subj-tr/obj
alternating       0.28                0.31            0.16
non-alternating   0.29                0.09            0.11
Outline
Attributional and relational similarity
Distributional Memory
Conclusion
The main points
Attributional and relational similarity
- Traditional distributional semantics models capture attributional similarity (similarity of words that share many contexts, and thus, presumably, many properties)
- Attributional similarity tends to capture taxonomically related concepts (synonyms, antonyms, etc.), but:
  - it does not capture other important semantic relations (part, function, etc.)
  - it does not distinguish between relations of different types (e.g., coordinates vs. categories)
- Both issues can be tackled by measuring the relational similarity of pairs of co-occurring terms by comparing the contexts (in particular: links) in which they co-occur
  - Strongly related terms will tend to co-occur, no matter what type of relation they instantiate
  - Similarity of pairs in terms of contexts/links will group pairs that instantiate the same relation, allowing supervised or unsupervised relation classification
The main points
One model, many views
- To build link-based attributional similarity models, we need to collect, from the corpus, weighted concept+link+concept tuples (one concept will be represented by the vector of co-occurring link+other-concept pairs, with the corresponding weights as dimensions)
- Instead of seeing the tuple list as a temporary resource in building the concept-by-link+concept matrix, we should see it as the basic data structure of our distributional semantics model
- From the same tuple list (conveniently represented as a graph) we can extract a relational similarity matrix, by representing concept pairs by the vector of links that connect them, with the corresponding weights (same as above) as dimensions
- Thus, attributional and relational similarity are different views of the same distributional data, harvested once and for all from the corpus
The main points
Attributional similarity for generalization
- Relational similarity can capture a wider range of relations than attributional similarity, and provide some form of typing of these relations
  - Taxonomic relations might also be captured by relational similarity
- However, relational similarity can only tell us something about word pairs that co-occur in our source corpus
- We can combine relational and attributional similarity to perform productive semantic generalizations
  - From concepts for which we have direct evidence that they are in a certain relation (relational similarity) to attributionally similar concepts
The main points
Views of the tuple graph
- From weighted concept+link+concept tuples, we can build four kinds of matrices to compare different kinds of objects, all of semantic interest:
  - concept-by-link+concept: attributional similarity
  - concept+concept-by-link: relational similarity
  - concept+link-by-concept: slot similarity
  - link-by-concept+concept: link/type similarity