LEARNING VALID ADVERB-ADJECTIVE PAIRS
CAROLINE SUEN
CS224U WINTER 2013
THE CHALLENGE
We can say:
•  “The glass is half full.”
•  or “Wow, Bob is really tall.”
But can we say:
•  “Wow, Bob is half tall.”
•  or “The glass is really full”?
Goal: develop a model that can learn whether an adverb and
an adjective can be used together and make grammatical
sense.
PRIOR WORK
Syrett and Lidz (2010)
•  Use linguistics to develop patterns
Sentiment analysis
•  Benamara et al. (2007), Liu et al. (2009)
Adjective-noun pairs
•  Hatzivassiloglou et al. (1993)
EXTRACTING DATA
            half    completely    extremely    nearly
full          5         3             3           1
tall          0         0             4           0
smart         0         1             4           0
daylong       0         0             0           1
•  New York Times dataset, ~18000 articles
•  Stanford POS tagger to find valid adverb-adjective pairs
•  1019 adverbs, 4876 adjectives, 19337 pairs
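The extraction step can be sketched as counting adverbs that directly precede adjectives in POS-tagged text. This is a minimal sketch, not the pipeline used here: the tagged sentence is hypothetical, the tags are assumed to be Penn Treebank style (RB* for adverbs, JJ* for adjectives), and requiring the two words to be adjacent is an assumption.

```python
from collections import Counter

# Hypothetical tagger output: (token, tag) pairs, Penn Treebank tags assumed.
tagged = [
    ("The", "DT"), ("glass", "NN"), ("is", "VBZ"),
    ("half", "RB"), ("full", "JJ"), ("and", "CC"),
    ("Bob", "NNP"), ("is", "VBZ"), ("really", "RB"), ("tall", "JJ"),
    (".", "."),
]

def count_pairs(tagged_tokens):
    """Count adverb (RB*) tokens immediately followed by adjective (JJ*) tokens."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1.startswith("RB") and t2.startswith("JJ"):
            counts[(w1.lower(), w2.lower())] += 1
    return counts

pair_counts = count_pairs(tagged)
for (adverb, adjective), n in sorted(pair_counts.items()):
    print(adverb, adjective, n)
```

Summing such counters over all articles would give a count table like the one above.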
BUILDING A GRAPH
[Figure: bipartite graph linking the adverbs half, completely, extremely, nearly to the adjectives full, tall, smart, daylong]
Relatively sparse bipartite graph
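To make "sparse" concrete, the sketch below builds the edge set from the small example table and computes bipartite density (edges present divided by edges possible). Treating any nonzero count as an edge is an assumption. With the reported dataset sizes (19337 pairs, 1019 adverbs, 4876 adjectives) the density comes out to roughly 0.4%.

```python
# Counts from the example table (adjective -> adverb -> count); zeros omitted.
counts = {
    "full":    {"half": 5, "completely": 3, "extremely": 3, "nearly": 1},
    "tall":    {"extremely": 4},
    "smart":   {"completely": 1, "extremely": 4},
    "daylong": {"nearly": 1},
}

# One bipartite edge per (adverb, adjective) pair seen at least once (assumption).
edges = {(adv, adj) for adj, row in counts.items() for adv, c in row.items() if c > 0}
adverbs = {adv for adv, _ in edges}
adjectives = {adj for _, adj in edges}

def bipartite_density(n_edges, n_left, n_right):
    """Fraction of all possible left-right edges that are present."""
    return n_edges / (n_left * n_right)

toy_density = bipartite_density(len(edges), len(adverbs), len(adjectives))
# Dataset sizes as reported: 19337 pairs, 1019 adverbs, 4876 adjectives.
full_density = bipartite_density(19337, 1019, 4876)
print(f"toy: {toy_density:.2f}, full dataset: {full_density:.4%}")
```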
PARTITIONING
[Figure: the adverb-adjective bipartite graph (half, completely, extremely, nearly; full, tall, smart, daylong) split into partitions]
BUILDING A GRAPH: TECHNICAL DETAILS
•  Used Stanford Network Analysis Platform
•  Experimented:
   •  Find dense bipartite subgraphs using the frequent itemset algorithm
   •  Build adverb graphs and adjective graphs and run community detection algorithms on these graphs
      •  Based on common neighbors
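One way to build the adverb and adjective graphs from common neighbors is sketched below in plain Python (not the Stanford Network Analysis Platform API). The edge list is the toy example, and the `min_shared` cutoff and the use of shared-neighbor counts as edge weights are assumptions.

```python
from itertools import combinations

# Toy bipartite edges (adverb, adjective) from the running example.
edges = [
    ("half", "full"), ("completely", "full"), ("extremely", "full"),
    ("nearly", "full"), ("extremely", "tall"), ("completely", "smart"),
    ("extremely", "smart"), ("nearly", "daylong"),
]

def project(edges, side, min_shared=1):
    """Link two nodes on `side` (0 = adverbs, 1 = adjectives) when they
    share at least `min_shared` neighbors on the other side."""
    neighbors = {}
    for edge in edges:
        neighbors.setdefault(edge[side], set()).add(edge[1 - side])
    projected = {}
    for a, b in combinations(sorted(neighbors), 2):
        shared = len(neighbors[a] & neighbors[b])
        if shared >= min_shared:
            projected[(a, b)] = shared  # weight = number of common neighbors
    return projected

adjective_graph = project(edges, side=1)
adverb_graph = project(edges, side=0)
print(adjective_graph)
print(adverb_graph)
```

Raising `min_shared` would thin out the projected graphs, which may matter when a few very common words connect to almost everything.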
[Figure: the bipartite graph (adverbs half, completely, extremely, nearly; adjectives full, tall, smart, daylong) with the derived adjective graph and adverb graph]
CLIQUE PERCOLATION
[Figure: clique percolation illustration, from Wikipedia]
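A compact sketch of clique percolation in plain Python: enumerate maximal cliques, keep those with at least k nodes, and merge cliques that share at least k - 1 nodes. The graph is a hypothetical toy example and k = 3 is an arbitrary choice; this is not the implementation used for the results here.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return cliques

def k_clique_communities(adj, k):
    """Merge maximal cliques of size >= k that overlap in >= k - 1 nodes."""
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())

# Toy graph: triangles a-b-c and b-c-d share an edge; d-e-f only shares node d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("d", "e"), ("d", "f"), ("e", "f")]
communities = k_clique_communities(build_adjacency(edges), k=3)
print(communities)  # [['a', 'b', 'c', 'd'], ['d', 'e', 'f']]
```

Node d lands in both communities, which shows that communities found this way can overlap.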
CLASSIFY: DOES AN EDGE BELONG?
Use the communities that adverb u and adjective v belong to.
If the edge density of the combined communities is sufficiently high, we claim that u and v can be paired.
Harder case:
•  An adverb is in communities C1 and C2. How likely is it to
be connected to an adjective in communities D1, D2, and
D3?
•  Thankfully, this is rare!
•  Larger and more densely connected communities are
given higher weight
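A minimal sketch of the simple case. The density definition (edges present over possible adverb-adjective pairs), the 0.5 threshold, the toy communities, and the rule of accepting when any community pair is dense enough are all assumptions for illustration. The weighting by community size and density is not modeled.

```python
def community_edge_density(adverb_comm, adjective_comm, edges):
    """Fraction of possible adverb-adjective pairs between two communities
    that appear as edges in the training graph."""
    possible = len(adverb_comm) * len(adjective_comm)
    if possible == 0:
        return 0.0
    present = sum((u, v) in edges for u in adverb_comm for v in adjective_comm)
    return present / possible

def can_pair(u, v, adverb_comms, adjective_comms, edges, threshold=0.5):
    """Accept (u, v) if some pair of their communities is dense enough.
    The threshold and the 'any pair' rule are illustrative assumptions."""
    for c in adverb_comms:
        if u not in c:
            continue
        for d in adjective_comms:
            if v in d and community_edge_density(c, d, edges) >= threshold:
                return True
    return False

# Toy training edges and hypothetical communities.
edges = {
    ("half", "full"), ("completely", "full"), ("extremely", "full"),
    ("nearly", "full"), ("extremely", "tall"), ("completely", "smart"),
    ("extremely", "smart"), ("nearly", "daylong"),
}
adverb_comms = [{"completely", "extremely"}, {"half", "nearly"}]
adjective_comms = [{"full", "smart", "tall"}, {"daylong"}]

completely_tall = can_pair("completely", "tall", adverb_comms, adjective_comms, edges)
half_tall = can_pair("half", "tall", adverb_comms, adjective_comms, edges)
print(completely_tall, half_tall)  # True False
```

On this toy data, "completely tall" is accepted (density 5/6) although it never occurs as an edge, while "half tall" is rejected (density 2/6).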
EVALUATION: RECALL
•  Hold out “test data” (1100 edges); the remaining edges are “training data”
•  Find communities based on training data
•  Observe the fraction of test data edges recovered
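The evaluation loop can be sketched as follows. Choosing the held-out edges uniformly at random, the seed, and the placeholder predictors are assumptions. In practice `predict` would be a classifier built from communities found on the training edges only.

```python
import random

def split_edges(edges, n_test, seed=0):
    """Randomly hold out n_test edges as test data; the rest is training data."""
    shuffled = sorted(edges)
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[n_test:]), set(shuffled[:n_test])

def recall(test_edges, predict):
    """Fraction of held-out edges that the predictor recovers."""
    if not test_edges:
        return 0.0
    return sum(predict(u, v) for u, v in test_edges) / len(test_edges)

# Toy edges from the running example.
edges = {
    ("half", "full"), ("completely", "full"), ("extremely", "full"),
    ("nearly", "full"), ("extremely", "tall"), ("completely", "smart"),
    ("extremely", "smart"), ("nearly", "daylong"),
}
train, test = split_edges(edges, n_test=2)

# Placeholder predictors standing in for the community-based classifier.
accept_all = recall(test, lambda u, v: True)
reject_all = recall(test, lambda u, v: False)
print(len(train), len(test), accept_all, reject_all)
```

The accept-all predictor reaches perfect recall, which is why a recall-only evaluation needs a separate precision check.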
EVALUATION: RECALL
•  Not enough connections: 260 (23.6%)
•  Not discovered by community detection algorithm: 129 (11.7%)
•  Correctly discovered by community detection algorithm: 711 (64.6%)
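The shares follow directly from the counts. A quick check, assuming the three categories partition the 1100 test edges:

```python
counts = {
    "not enough connections": 260,
    "not discovered": 129,
    "correctly discovered": 711,
}
total = sum(counts.values())  # 1100 test edges

# Share of each category among all test edges, in percent.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Recall restricted to test edges that had enough connections.
covered = counts["not discovered"] + counts["correctly discovered"]
recall_when_covered = round(100 * counts["correctly discovered"] / covered, 1)

print(total, shares, recall_when_covered)
```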
CHALLENGES + NEXT STEPS
•  Not enough pairings
   •  (recall for test data with enough connections: 84.6%)
•  Clique percolation is slow
   •  priority was building evaluation framework first
   •  next steps: experimenting with clustering
•  Adjective edge connections are much more important than adverb connections
•  Current framework does not test precision
   •  MTurk for crowd-sourced, hand-labeled data
•  Potential next step:
   •  Check Syrett and Lidz’ linguistic results
THE END
THANKS FOR LISTENING! :)