Xiang Ren, Advisor: Jiawei Han,
{xren7,hanj}@Illinois.edu
Entity Recognition and Typing [KDD’15 tutorial]
Synonym Discovery for Structured Entities [WWW’15]
In this work, we study the problem of identifying token spans of entity mentions from
massive text corpora and labeling them into target types of interest, with distant
supervision. We propose a relation phrase-based framework to disambiguate each
entity mention by its surrounding relation phrases and co-occurring entity mentions.
To tackle context sparsity, we design a joint optimization problem to integrate entity
typing with relation phrase clustering, which mutually enhance each other.
In this work, we aim to discover synonyms (alternate strings used to reference an
entity) for entities in knowledge bases from query logs. Unlike previous
work, which takes only a “literal” view of the entity, we take a “structured” view of
each entity by considering not only its surface name, but also other important
structured attributes such as source URLs and existing synonyms. A heterogeneous
graph-based ranking method is designed to incorporate several problem insights.
Challenges in handling a domain-specific corpus
Challenges in synonym discovery
• Traditional NER systems: require additional steps to adapt to new domains/types
• Entity linking: limited coverage (>50% unlinkable mentions) and freshness
Progress in entity recognition for domain-specific corpora
• Weak supervision: needs careful seed selection by humans
• Distant supervision: NO human supervision
• Existing synonyms & redirect links in KBs are manually curated and have limited coverage
Progress in string-based synonym discovery
• Ambiguity of target entity’s surface name/synonym names (“JR Smith”)
• Ignore sub-query synonyms (e.g., “facts on diamond state”)
• Have difficulty handling tailed synonyms (e.g., “first bank”)
Method: A relation phrase-based framework
POS-constrained phrase segmentation
Typical workflow of distant supervision
Method: Ranking on heterogeneous graphs
Idea I: leverage the target entity’s structured attributes to disambiguate the surface name and help enrich signals
[Figure labels: Entity Knowledgebase OR Ad-Hoc Entity Collection]
Phrase mining on a POS-tagged corpus to extract
entity mention candidates and relation phrases
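The extraction step above can be illustrated with a minimal sketch. It assumes the sentence is already POS-tagged with Penn Treebank tags; the tag sets, the rule that a relation phrase must start with a verb, and the function name `segment` are illustrative assumptions, not the patterns used in this work.

```python
from typing import List, Tuple

# Hypothetical tag sets (Penn Treebank style); the actual POS patterns are not given here.
ENTITY_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ"}
RELATION_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "IN", "RP", "TO"}


def segment(tagged: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split a tagged sentence into maximal entity-like and relation-like runs."""
    entities: List[str] = []
    relations: List[str] = []
    run: List[Tuple[str, str]] = []
    kind = None
    # The sentinel pair at the end forces the final run to be emitted.
    for token, tag in list(tagged) + [("", "")]:
        if tag in ENTITY_TAGS:
            new_kind = "entity"
        elif tag in RELATION_TAGS:
            new_kind = "relation"
        else:
            new_kind = None
        if new_kind != kind and run:
            phrase = " ".join(tok for tok, _ in run)
            if kind == "entity":
                entities.append(phrase)
            elif run[0][1].startswith("VB"):
                # Assumption: keep a relation run only if it starts with a verb.
                relations.append(phrase)
            run = []
        kind = new_kind
        if new_kind is not None:
            run.append((token, tag))
    return entities, relations


sentence = [("Obama", "NNP"), ("was", "VBD"), ("born", "VBN"),
            ("in", "IN"), ("Honolulu", "NNP"), (".", ".")]
entity_candidates, relation_phrases = segment(sentence)
```

On this sentence the sketch yields the entity candidates "Obama" and "Honolulu" and the relation phrase "was born in".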
[Figure: Query Click Log connecting a Structured Entity, Web pages (Web page 2, Web page 3), and User queries. Structured Entity attributes: source pages, descriptive text, entity type, existing alias, …, structured relationships. Example user queries: “del dept of revenue”, “delaware revenue division”, “1st state”, “first state in us”, “list of 50 states”.]
Construct a heterogeneous graph to encode our
problem insights in a unified way
Framework Overview
• Collect seeds by linking candidates to KB
• Estimate types for unlinkable candidates by clustering-integrated type propagation
• Relation phrase clustering: (i) StringSim; (ii) ContextSim; (iii) argument types
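To make the seed-then-propagate idea concrete, here is a minimal sketch of plain label propagation from KB-linked seeds to unlinkable candidates through shared relation-phrase argument slots. It is a simplified stand-in, not the joint optimization described in this work: the graph layout, the averaging update, the slot names such as "born in:right", and the function name `propagate_types` are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def propagate_types(edges: List[Tuple[str, str]],
                    seeds: Dict[str, str],
                    types: List[str],
                    iterations: int = 10) -> Dict[str, Optional[str]]:
    """edges: (candidate, relation-phrase slot) pairs; seeds: candidate -> type."""
    neighbors = defaultdict(set)
    candidates = set()
    for cand, slot in edges:
        candidates.add(cand)
        neighbors[cand].add(slot)
        neighbors[slot].add(cand)

    scores = {node: {t: 0.0 for t in types} for node in neighbors}
    for cand, t in seeds.items():
        if cand in scores:
            scores[cand][t] = 1.0

    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = scores[node]  # seed labels stay fixed
            else:
                # Each non-seed node takes the average of its neighbors' scores.
                updated[node] = {t: sum(scores[n][t] for n in nbrs) / len(nbrs)
                                 for t in types}
        scores = updated

    predicted: Dict[str, Optional[str]] = {}
    for cand in candidates - set(seeds):
        best = max(types, key=lambda t: scores[cand][t])
        predicted[cand] = best if scores[cand][best] > 0 else None
    return predicted


edges = [("Chicago", "born in:right"), ("Honolulu", "born in:right"),
         ("Obama", "born in:left"), ("Lincoln", "born in:left")]
seeds = {"Chicago": "location", "Obama": "person"}
predicted = propagate_types(edges, seeds, ["person", "location"])
```

In this toy graph, "Honolulu" shares an argument slot with the seed "Chicago" and is typed as a location, while "Lincoln" shares a slot with "Obama" and is typed as a person.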
[Figure labels: user query “facts on diamond state”; Web page 1]
Candidate synonym generation
1. Generate candidate synonyms and extract keywords based on user queries
2. Construct a heterogeneous graph which encodes different kinds of information
3. Derive entity synonym scores for candidates by synonym discovery on graph
4. Output a subset of candidates with highest synonym scores by: (1) selecting top-K candidates, or (2) automatic cut-off technique on ranked list
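The two output options in the last step can be sketched as follows. The automatic cut-off technique is not specified here, so the largest-score-gap rule below is only one plausible assumption; the scores are made up for illustration and are not results from this work.

```python
from typing import Dict, List


def top_k(scored: Dict[str, float], k: int) -> List[str]:
    """Return the k candidates with the highest synonym scores."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def cut_at_largest_gap(scored: Dict[str, float]) -> List[str]:
    """Assumed cut-off rule: cut the ranked list at its largest score drop."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return [name for name, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [name for name, _ in ranked[:cut]]


# Hypothetical scores for illustration only.
scores = {"1st state": 0.9, "first state in us": 0.85, "diamond state": 0.8,
          "list of 50 states": 0.2, "delaware map": 0.15}
best_two = top_k(scores, 2)
above_gap = cut_at_largest_gap(scores)
```

Top-K needs a budget chosen in advance, while a gap-based cut-off adapts to how many candidates score clearly above the rest.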
Idea II: Explore sub-queries to generate candidate synonyms
- Enables discovery of synonyms that appear only as sub-queries
- Augments tailed synonyms by leveraging information across multiple support queries collectively
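A minimal sketch of sub-query candidate generation, assuming the click log is available as (query, click count) pairs for queries that led to the entity's pages. The whitespace tokenization, the `max_len` limit, the click counts, and the function names are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def subqueries(query: str, max_len: int = 4) -> List[str]:
    """All contiguous token spans of the query, up to max_len tokens long."""
    tokens = query.lower().split()
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(len(tokens), i + max_len) + 1):
            spans.append(" ".join(tokens[i:j]))
    return spans


def candidate_support(click_log: Iterable[Tuple[str, int]],
                      max_len: int = 4) -> Counter:
    """Sum click counts over every query that contains a given sub-query."""
    support: Counter = Counter()
    for query, clicks in click_log:
        # set() ensures a span repeated inside one query is counted once.
        for span in set(subqueries(query, max_len)):
            support[span] += clicks
    return support


# Hypothetical click counts for illustration only.
log = [("facts on diamond state", 3), ("diamond state map", 2),
       ("first state in us", 4)]
support = candidate_support(log)
```

Here "diamond state" never occurs as a full query, yet it collects support from two different queries, which is the effect the two points above describe.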
Construct a heterogeneous graph as the data model
Mutually exclusive relationships between candidate synonyms from the same query (“single entity query” assumption)
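Under the single entity query assumption, any two candidates generated from the same query compete with each other. A minimal sketch of building these mutual exclusion edges follows; the input layout (query mapped to its candidate list) and the function name are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def mutual_exclusion_edges(
        query_candidates: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Link every pair of distinct candidates that come from the same query."""
    edges: Set[Tuple[str, str]] = set()
    for candidates in query_candidates.values():
        # Sorting gives each undirected edge one canonical orientation.
        for a, b in combinations(sorted(set(candidates)), 2):
            edges.add((a, b))
    return edges


query_candidates = {
    "facts on diamond state": ["diamond state", "facts on diamond state"],
    "first state in us": ["first state", "first state in us", "first state"],
}
me_edges = mutual_exclusion_edges(query_candidates)
```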
Five types of relations between objects for modeling object properties:
A mutually enhancing framework:
• Type propagation provides entity argument’s type
information as clustering features
• Relation phrase clustering helps bridge entities
for propagation
Results
Performance breakdown by type on Tweets
Compare our candidate generation with NP chunker
Results
System example output
Entity | Freebase Exist Synonym | New Synonyms by us
volkswagen type 2 | bus transporter kombi camper microbus | combi thesamba type w combi type 2 vw buses type2 sale minibus
bmw | bayerische motoren werke; bavarian motor works | beamer automobile bmwgroup bmv car beamer car bimmer
skydrive | onedrivecom windows live skydrive microsoft skydrive | windows live storage windows drive microsoft storage online microsoft free storage
8th armored division | N/A | thundering herd 8th eighth armored div
Example domains/entities used in evaluation
Compare ClusType with other methods and its variants
Result Analysis:
• vs. baselines: over 45% improvement in F1 on Tweet and Yelp; overcomes domain restriction; resolves context sparsity
• Not very sensitive to #clusters & #seeds; robust to corpus size
1. Candidate-page relation (CU subgraph):
- Interplay between candidate’s entity synonym score & Web page’s entity page score
2. Keyword-page relation (WU subgraph):
- Interplay between keyword’s entity context score & Web page’s entity page score
3. Candidate-keyword relation (CW subgraph):
- Interplay between candidate’s entity synonym score & keyword’s entity context score
4. Candidate mutual exclusion relation (ME subgraph):
- Interplay between candidates in queries helps extract the right entity mention
• vs. variants: modeling mention correlation enables name disambiguation; integrating clustering brings mutual enhancement
Compare with baselines and variants
• vs. variants: each component (which encodes a different insight) helps improve the performance
Influence of entity source web pages
• vs. baselines: performance improvement by leveraging source web pages (StrucSyn-CU); significant enhancement by further jointly modeling the heterogeneous graph (StrucSyn)
Performance study on enriching Freebase synonym lists
Other related publications:
Phrase Mining:
1. Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. Mining Quality Phrases from Massive Text Corpora. SIGMOD, 2015.
2. Marina Danilevsky, Chi Wang, Nihit Desai, Xiang Ren, Jingyi Guo, and Jiawei Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM, 2014.
Recommendation:
3. Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang, and Jiawei Han. ClusCite: Effective Citation Recommendation by Information Network-Based Clustering. KDD, 2014.
4. Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. Personalized Entity Recommendation: A Heterogeneous Information Network Approach. WSDM, 2014.