Wei Shen†, Jianyong Wang†, Ping Luo‡, Min Wang‡
†Tsinghua University, Beijing, China
‡HP Labs China, Beijing, China
WWW 2012
Presented by Tom Chao Zhou
July 17, 2012
7/13/2017
Outline
Motivation
Problem Definition
Previous Methods
LINDEN Framework
Experiments
Conclusion
Motivation
Many large-scale knowledge bases have emerged
e.g., DBpedia, YAGO, Freebase
As the world evolves
New facts come into existence
Digitally expressed on the Web
Maintaining and growing the existing knowledge bases
Integrating the extracted facts with the knowledge base
Challenge
Name variations
“National Basketball Association” ↔ “NBA”
“New York City” ↔ “Big Apple”
Entity ambiguity
“Michael Jordan”: NBA player, Berkeley professor, ……
Problem Definition
Entity linking task
Input:
A textual named entity mention m, already recognized in the
unstructured text
Output:
The corresponding real world entity e in the knowledge base
If the matching entity e for entity mention m does not
exist in the knowledge base, we should return NIL for m
Entity linking task
Example: “German Chancellor Angela Merkel and her husband Joachim Sauer went to Ulm, Germany.” Each mention is linked to its entity in the knowledge base; a mention with no matching entity maps to NIL.
Figure 1: An example of YAGO
Source: From Information to Knowledge: Harvesting Entities and Relationships from Web Sources. PODS’10.
Previous Methods
Essential step of entity linking
Define a similarity measure between the text around the
entity mention and the document associated with the entity
Bag of words model
Represent the context as a term vector
Measure the co-occurrence statistics of terms
Cannot capture the semantic knowledge
Example:
Text: Michael Jordan wins NBA champion.
The bag-of-words model cannot work well here!
LINDEN Framework
Candidate Entity Generation
For each named entity mention m
Retrieve the set of candidate entities Em
Named Entity Disambiguation
For each candidate entity e∈Em
Define a scoring measure
Rank the candidates in Em
Unlinkable Mention Prediction
For the entity etop with the highest score in Em
Validate whether etop is the target entity for mention m
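The three modules above can be sketched as a minimal Python pipeline. The function names, the dictionary structure, and the placeholder scoring below are illustrative assumptions, not the paper's implementation:

```python
# A minimal sketch of the three-stage LINDEN flow described above.
# `dictionary` maps surface forms to candidate entity records; `score`
# and the threshold `tau` are placeholders for the paper's components.

def link_entity(mention, dictionary, tau=0.5):
    """Return the linked entity for `mention`, or None (NIL)."""
    # 1. Candidate Entity Generation: dictionary lookup on the surface form
    candidates = dictionary.get(mention, [])
    if not candidates:
        return None  # unlinkable: no candidate found
    # 2. Named Entity Disambiguation: rank candidates by a scoring measure
    scored = sorted(candidates, key=lambda e: score(mention, e), reverse=True)
    e_top = scored[0]
    # 3. Unlinkable Mention Prediction: validate the top candidate
    if score(mention, e_top) < tau:
        return None
    return e_top

def score(mention, entity):
    # Placeholder: the paper combines four features with learned weights
    return entity.get("link_prob", 0.0)
```
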
Candidate Entity Generation
Intuitively, the candidates in Em should share the surface form of m.
We build a dictionary that contains a vast amount of information about the surface forms of entities
including name variations, abbreviations, confusable names, spelling variations, nicknames, etc.
Leveraging four structures of Wikipedia:
Entity pages
Redirect pages
Disambiguation pages
Hyperlinks in Wikipedia articles
Candidate Entity Generation (Cont’d)
For each mention m
Search for it in the surface-form field of the dictionary
If a hit is found, we add all target entities of that surface form to the set of candidate entities Em
Table 1: An example of the dictionary
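As a toy illustration of the dictionary and the lookup step above (the data structure is a hypothetical sketch, not the paper's actual format), the dictionary can be built from (surface form, target entity) link pairs harvested from entity pages, redirect pages, disambiguation pages, and hyperlink anchors:

```python
from collections import defaultdict

def build_dictionary(link_pairs):
    """Map each surface form to {target entity: link count}."""
    dictionary = defaultdict(lambda: defaultdict(int))
    for surface_form, entity in link_pairs:
        dictionary[surface_form][entity] += 1
    return dictionary

def candidate_entities(dictionary, mention):
    """All target entities whose surface form matches the mention."""
    return set(dictionary.get(mention, {}))
```

The per-entity counts stored here also feed the link-probability feature later in the framework.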
Named Entity Disambiguation
Goal:
Rank the candidate entities according to their scores
Define four features
Feature 1: Link probability
Based on the count information in the dictionary
Semantic network based features
Feature 2: Semantic associativity
Based on the Wikipedia hyperlink structure
Feature 3: Semantic similarity
Derived from the taxonomy of YAGO
Feature 4: Global coherence
Global document-level topical coherence among entities
Link Probability
Table 1: An example of the dictionary, with the link probability (LP) column (e.g., 0.81, 0.05)
Feature 1: link probability LP(e|m) for candidate entity e:
LP(e|m) = count_m(e) / Σ_{e′∈Em} count_m(e′)
where count_m(e) is the number of links which point to entity e and have the surface form m
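This definition transcribes directly into code (the dict-based count structure is an assumption matching the dictionary sketch, not the paper's storage format):

```python
# Feature 1 sketch: LP(e|m) = count_m(e) / sum of count_m(e') over candidates.
def link_probability(counts, entity):
    """counts: {entity: number of links with surface form m pointing to it}."""
    total = sum(counts.values())
    return counts.get(entity, 0) / total if total else 0.0
```
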
Semantic Network Construction
Recognize all the Wikipedia concepts Γd in the document d
Using the open-source toolkit Wikipedia-Miner1
Example:
The Chicago Bulls’ player Michael Jordan won his first NBA championship in 1991.
Set of entity mentions: {Michael Jordan, NBA}
Candidate entities:
Michael Jordan {Michael J. Jordan, Michael I. Jordan}
NBA {National Basketball Association, Nepal Basketball Association}
Γd : {NBA All-Star Game, David Joel Stern, Charlotte Bobcats, Chicago Bulls}
Hyperlink structure of Wikipedia articles
Taxonomy of concepts in YAGO
Figure 2: An example of the constructed semantic network
1http://wikipedia-miner.sourceforge.net/index.htm
Semantic Associativity
Feature 2: semantic associativity SA(e) for each
candidate entity e
Figure 2: An example of the constructed semantic network
Semantic Associativity (Cont’d)
Given two Wikipedia concepts e1 and e2
The Wikipedia Link-based Measure (WLM) [1] defines the semantic associativity between them as
WLM(e1, e2) = 1 − (log(max(|E1|, |E2|)) − log(|E1 ∩ E2|)) / (log(|W|) − log(min(|E1|, |E2|)))
where E1 and E2 are the sets of Wikipedia concepts that hyperlink to e1 and e2 respectively, and W is the set of all concepts in Wikipedia
[1] D. Milne and I. H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of
WIKIAI, 2008.
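The WLM formula can be transcribed directly; representing each concept's in-links as a Python set is an implementation assumption:

```python
import math

# Wikipedia Link-based Measure (WLM) sketch:
# WLM(e1, e2) = 1 - (log max(|E1|,|E2|) - log|E1 ∩ E2|) / (log|W| - log min(|E1|,|E2|))
def wlm(in_links_1, in_links_2, num_concepts):
    """in_links_i: set of concepts hyperlinking to e_i; num_concepts: |W|."""
    overlap = len(in_links_1 & in_links_2)
    if overlap == 0:
        return 0.0  # no shared in-links: treat as unrelated
    big = max(len(in_links_1), len(in_links_2))
    small = min(len(in_links_1), len(in_links_2))
    score = 1 - (math.log(big) - math.log(overlap)) / (math.log(num_concepts) - math.log(small))
    return max(0.0, score)
```

Two concepts with identical in-link sets score 1.0, and the score falls toward 0 as the shared in-links shrink relative to the larger set.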
Semantic Similarity
Feature 3: semantic similarity SS(e) for each candidate entity e:
SS(e) = (1/k) Σ_{c∈Θk} Sim(e, c)
where Θk is the set of k context concepts in Γd which have the highest semantic similarity with entity e
k = 2
Figure 2: An example of the constructed semantic network
Semantic Similarity (Cont’d)
Given two Wikipedia concepts e1 and e2
Assume the sets of their super classes are Φe1 and Φe2
For each class C1 in the set Φe1
Assign a target class ε(C1) in the other set Φe2 as
ε(C1) = argmax_{C2∈Φe2} sim(C1, C2)
where sim(C1, C2) is the semantic similarity between two classes C1 and C2
To compute sim(C1, C2), adopt the information-theoretic approach introduced in [2]:
sim(C1, C2) = 2 · log P(C0) / (log P(C1) + log P(C2))
where C0 is the lowest common ancestor node for class nodes C1 and C2 in the hierarchy, and P(C) is the probability that a randomly selected object belongs to the subtree rooted at C in the taxonomy.
[2] D. Lin. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304, 1998.
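Lin's measure can be sketched as follows, assuming the probabilities P(·) are precomputed from the taxonomy (the function signatures and the pair-list input format are illustrative assumptions):

```python
import math

# Lin's information-theoretic class similarity [2]:
# sim(C1, C2) = 2 * log P(C0) / (log P(C1) + log P(C2)),
# where C0 is the lowest common ancestor of C1 and C2.
def lin_similarity(p_c1, p_c2, p_lca):
    """p_c1, p_c2, p_lca: probabilities of C1, C2, and their lowest common ancestor."""
    denom = math.log(p_c1) + math.log(p_c2)
    if denom == 0:          # both classes are the taxonomy root (P = 1)
        return 1.0
    return 2 * math.log(p_lca) / denom

def target_class_similarity(p_c1, candidate_pairs):
    """Similarity of C1 to its best-matching target class in the other set:
    max over C2 of sim(C1, C2). candidate_pairs: [(P(C2), P(lca(C1, C2))), ...]."""
    return max(lin_similarity(p_c1, p_c2, p_lca) for p_c2, p_lca in candidate_pairs)
```

Identical classes score 1, and classes whose only common ancestor is the root (P = 1) score 0.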
Semantic Similarity (Cont’d)
Calculate the semantic similarity from one set of
classes Φe1 to another set of classes Φe2
Define the semantic similarity between Wikipedia
concepts e1 and e2
Global Coherence
Feature 4: global coherence GC(e) for each candidate entity e
Measured as the average semantic associativity of candidate entity e to the mapping entities of the other mentions
where e_m′ is the mapping entity of mention m′
Since the true mapping entities are unknown at this stage, substitute the most likely assigned entity for the mapping entity in Formula 9
The most likely assigned entity e′_m′ for mention m′ is defined as the candidate entity which has the maximum link probability in Em′
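A sketch of this average, with the pairwise associativity function passed in as a parameter (a placeholder for WLM-style scores; the interface is an assumption):

```python
# Feature 4 sketch: GC(e) is the average semantic associativity of e to
# the most likely assigned entities of the other mentions in the document.
def global_coherence(entity, other_assigned_entities, associativity):
    """other_assigned_entities: most likely entities for the other mentions;
    associativity(e1, e2): pairwise semantic associativity, e.g. WLM."""
    if not other_assigned_entities:
        return 0.0
    total = sum(associativity(entity, e) for e in other_assigned_entities)
    return total / len(other_assigned_entities)
```
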
Global Coherence (Cont’d)
Figure 2: An example of the constructed semantic network
Candidates Ranking
Generate a feature vector Fm(e) for each e ∈ Em
Calculate Scorem(e) for each candidate e:
Scorem(e) = w · Fm(e)
where w is the weight vector which gives different weights to each feature element in Fm(e)
Rank the candidates and pick the top candidate as the predicted mapping entity for mention m
To learn w, we use a max-margin technique based on the training data set
Assume Scorem(e∗) is larger than any other Scorem(e) with a margin
We minimize the objective over w and slack variables ξm ≥ 0
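The dot-product scoring and ranking step can be sketched as follows (the four-feature layout [LP, SA, SS, GC] follows the slides; the weight values in the usage below are illustrative, not learned):

```python
# Candidate ranking sketch: Score_m(e) = w . F_m(e), the dot product of the
# learned weight vector with the feature vector [LP, SA, SS, GC].
def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def rank_candidates(weights, candidate_features):
    """candidate_features: {entity: [LP, SA, SS, GC]}. Returns entities by descending score."""
    return sorted(candidate_features,
                  key=lambda e: score(weights, candidate_features[e]),
                  reverse=True)
```
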
Unlinkable Mention Prediction
Predict mention m as an unlinkable mention if either:
The set Em generated in the Candidate Entity Generation module is empty
Scorem(etop) is smaller than the learned threshold τ
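A minimal sketch of this decision rule (the score-dictionary input format is an assumption):

```python
# Unlinkable mention prediction sketch: return NIL (None) when no candidate
# exists or the top candidate's score falls below a learned threshold tau.
def predict(candidate_scores, tau):
    """candidate_scores: {entity: Score_m(e)}."""
    if not candidate_scores:
        return None
    e_top = max(candidate_scores, key=candidate_scores.get)
    if candidate_scores[e_top] < tau:
        return None
    return e_top
```
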
Experiment Setup
Data sets
CZ data set: newswire data used by Cucerzan [3]
TAC-KBP2009 data set: used in the track of Knowledge
Base Population (KBP) at the Text Analysis Conference
(TAC) 2009
Parameter learning:
10-fold cross validation
[3] S. Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of
EMNLP-CoNLL, pages 708–716, 2007.
Results over the CZ data set
Results on the TAC-KBP2009 data set
Conclusion
LINDEN
A novel framework to link named entities in text with
YAGO
Leveraging the rich semantic knowledge derived from Wikipedia and the taxonomy of YAGO
Significantly outperforms the state-of-the-art methods
in terms of accuracy
Thanks!
Q&A