LINDEN: Linking Named Entities with Knowledge Base via Semantic Knowledge

Wei Shen†, Jianyong Wang†, Ping Luo‡, Min Wang‡
†Tsinghua University, Beijing, China
‡HP Labs China, Beijing, China
WWW 2012
Presented by Tom Chao Zhou
July 17, 2012
Outline
- Motivation
- Problem Definition
- Previous Methods
- LINDEN Framework
- Experiments
- Conclusion
Motivation
- Many large-scale knowledge bases have emerged
  - e.g., DBpedia, YAGO, and Freebase
(Screenshot: www.freebase.com)
Motivation
- Many large-scale knowledge bases have emerged
  - e.g., DBpedia, YAGO, and Freebase
- As the world evolves
  - New facts come into existence
  - They are digitally expressed on the Web
- Maintaining and growing the existing knowledge bases
  - Integrating the extracted facts with the knowledge base
- Challenges
  - Name variations
    - "National Basketball Association" → "NBA"
    - "New York City" → "Big Apple"
  - Entity ambiguity
    - "Michael Jordan" → NBA player, Berkeley professor, ...
Problem Definition
- Entity linking task
  - Input: a textual named entity mention m, already recognized in the unstructured text
  - Output: the corresponding real-world entity e in the knowledge base
- If the matching entity e for entity mention m does not exist in the knowledge base, we should return NIL for m
Entity linking task
- Example: "German Chancellor Angela Merkel and her husband Joachim Sauer went to Ulm, Germany." Each mention is linked to its entity in the knowledge base; a mention with no matching entity is mapped to NIL.
Figure 1: An example of YAGO (source: From Information to Knowledge: Harvesting Entities and Relationships from Web Sources, PODS'10)
Previous Methods
- Essential step of entity linking
  - Define a similarity measure between the text around the entity mention and the document associated with the entity
- Bag-of-words model
  - Represent the context as a term vector
  - Measure the co-occurrence statistics of terms
  - Cannot capture semantic knowledge
- Example (see the snippet below)
  - Text: "Michael Jordan wins an NBA championship." The bag-of-words model cannot work well here!
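For illustration (not from the slides), a minimal bag-of-words comparison in Python; the candidate descriptions and scoring are hypothetical stand-ins. Both descriptions overlap with the context only on "michael jordan", so term statistics alone cannot tell the two entities apart:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Context of the mention vs. descriptions of the two candidate entities.
context = Counter("michael jordan wins an nba championship".split())
player = Counter("michael jordan is a former basketball player".split())
professor = Counter("michael jordan is a professor at berkeley".split())

print(cosine(context, player))     # same score...
print(cosine(context, professor))  # ...as this one: BoW cannot disambiguate
```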
LINDEN Framework
- Candidate Entity Generation
  - For each named entity mention m, retrieve the set of candidate entities Em
- Named Entity Disambiguation
  - For each candidate entity e ∈ Em, define a scoring measure
  - Rank the candidates in Em
- Unlinkable Mention Prediction
  - For the entity etop with the highest score in Em, validate whether etop is the target entity for mention m
Candidate Entity Generation
- Intuitively, the candidates in Em should have the surface form of m among their names
- We build a dictionary that contains a vast amount of information about the surface forms of entities
  - e.g., name variations, abbreviations, confusable names, spelling variations, nicknames
- Leverage four structures of Wikipedia:
  - Entity pages
  - Redirect pages
  - Disambiguation pages
  - Hyperlinks in Wikipedia articles
Candidate Entity Generation (Cont’)
- For each mention m
  - Search it in the surface-form field of the dictionary
  - If a hit is found, add all target entities of that surface form to the set of candidate entities Em (see the sketch below)
Table 1: An example of the dictionary
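A minimal sketch of the lookup step, assuming an in-memory map from surface forms to target entities; the dictionary contents here are hypothetical examples, while the real dictionary is mined from the four Wikipedia structures above:

```python
from typing import Dict, Set

# Hypothetical dictionary: surface form -> set of candidate entity names,
# mined in practice from entity pages, redirect pages, disambiguation
# pages, and hyperlink anchor texts in Wikipedia.
SURFACE_FORMS: Dict[str, Set[str]] = {
    "NBA": {"National Basketball Association", "Nepal Basketball Association"},
    "Michael Jordan": {"Michael J. Jordan", "Michael I. Jordan"},
}

def candidate_entities(mention: str) -> Set[str]:
    """Return Em: all target entities registered for the surface form m."""
    return SURFACE_FORMS.get(mention, set())

print(candidate_entities("NBA"))       # two basketball associations
print(candidate_entities("Unknown"))   # empty set -> NIL candidate
```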
Named Entity Disambiguation
- Goal: rank the candidate entities according to their scores
- Define four features
  - Feature 1: Link probability
    - Based on the count information in the dictionary
  - Semantic-network-based features
    - Feature 2: Semantic associativity, based on the Wikipedia hyperlink structure
    - Feature 3: Semantic similarity, derived from the taxonomy of YAGO
    - Feature 4: Global coherence, the global document-level topical coherence among entities
Link Probability
Table 1: An example of the dictionary (the LP column lists link probabilities, e.g., 0.81 and 0.05)
- Feature 1: link probability LP(e|m) for candidate entity e:
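The formula itself was an image on the slide; a reconstruction consistent with the textual definition, normalizing the counts over all candidates in Em:

```latex
LP(e \mid m) = \frac{count_m(e)}{\sum_{e_i \in E_m} count_m(e_i)}
```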
- where countm(e) is the number of links which point to entity e and have the surface form m
Semantic Network Construction
- Recognize all the Wikipedia concepts Γd in the document d
  - Using the open-source toolkit Wikipedia-Miner (http://wikipedia-miner.sourceforge.net/index.htm)
- Example:
  - Text: "The Chicago Bulls' player Michael Jordan won his first NBA championship in 1991."
  - Set of entity mentions: {Michael Jordan, NBA}
  - Candidate entities:
    - Michael Jordan → {Michael J. Jordan, Michael I. Jordan}
    - NBA → {National Basketball Association, Nepal Basketball Association}
  - Γd: {NBA All-Star Game, David Joel Stern, Charlotte Bobcats, Chicago Bulls}
- The network combines the hyperlink structure of Wikipedia articles with the taxonomy of concepts in YAGO
Figure 2: An example of the constructed semantic network
Semantic Associativity
- Feature 2: semantic associativity SA(e) for each candidate entity e
Figure 2: An example of the constructed semantic network
Semantic Associativity (Cont’)
- Given two Wikipedia concepts e1 and e2, the Wikipedia Link-based Measure (WLM) [1] defines the semantic associativity between them as:
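The formula was an image on the slide; the WLM measure of Milne and Witten takes the standard form:

```latex
SA(e_1, e_2) = 1 - \frac{\log\big(\max(|E_1|, |E_2|)\big) - \log\big(|E_1 \cap E_2|\big)}{\log |W| - \log\big(\min(|E_1|, |E_2|)\big)}
```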
- where E1 and E2 are the sets of Wikipedia concepts that hyperlink to e1 and e2 respectively, and W is the set of all concepts in Wikipedia
[1] D. Milne and I. H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of WIKIAI, 2008.
Semantic Similarity
- Feature 3: semantic similarity SS(e) for each candidate entity e:
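The formula was an image; a plausible reconstruction, averaging the similarity to the top-k context concepts Θk defined below:

```latex
SS(e) = \frac{1}{k} \sum_{c \in \Theta_k} SS(e, c)
```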
- where Θk is the set of k context concepts in Γd which have the highest semantic similarity with entity e (k = 2 in the example)
Figure 2: An example of the constructed semantic network
Semantic Similarity (Cont’)
- Given two Wikipedia concepts e1 and e2, assume the sets of their super classes are Φe1 and Φe2
- For each class C1 in the set Φe1, assign a target class ε(C1) in the other set Φe2 as:
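The assignment rule was an image; from the where-clause that follows, it is the best-matching class:

```latex
\varepsilon(C_1) = \arg\max_{C_2 \in \Phi_{e_2}} sim(C_1, C_2)
```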
- where sim(C1, C2) is the semantic similarity between the two classes C1 and C2
- To compute sim(C1, C2), adopt the information-theoretic approach introduced in [2]:
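The formula was an image; Lin's information-theoretic similarity [2] is:

```latex
sim(C_1, C_2) = \frac{2 \log P(C_0)}{\log P(C_1) + \log P(C_2)}
```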
- where C0 is the lowest common ancestor node for class nodes C1 and C2 in the hierarchy, and P(C) is the probability that a randomly selected object belongs to the subtree rooted at C in the taxonomy
[2] D. Lin. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304, 1998.
Semantic Similarity (Cont’)
- Calculate the semantic similarity from one set of classes Φe1 to another set of classes Φe2
- Define the semantic similarity between Wikipedia concepts e1 and e2 (see the reconstruction below)
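Both formulas were images; a plausible reconstruction, assuming the directional similarity averages each class's best-matching target and the final score symmetrizes the two directions:

```latex
sim(\Phi_{e_1} \to \Phi_{e_2}) = \frac{1}{|\Phi_{e_1}|} \sum_{C_1 \in \Phi_{e_1}} sim\big(C_1, \varepsilon(C_1)\big),
\qquad
SS(e_1, e_2) = \frac{sim(\Phi_{e_1} \to \Phi_{e_2}) + sim(\Phi_{e_2} \to \Phi_{e_1})}{2}
```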
Global Coherence
- Feature 4: global coherence GC(e) for each candidate entity e
- Measured as the average semantic associativity of candidate entity e to the mapping entities of the other mentions:
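The slide's Formula 9 was an image; a reconstruction consistent with the averaging description, where M is the set of mentions in the document:

```latex
GC(e) = \frac{1}{|M| - 1} \sum_{m' \in M \setminus \{m\}} SA(e, e_{m'})
```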
- where em’ is the mapping entity of mention m’
- Substitute the most likely assigned entity for the mapping entity in Formula 9
  - The most likely assigned entity e’m’ for mention m’ is defined as the candidate entity which has the maximum link probability in Em’
Global Coherence (Cont’)
Figure 2: An example of the constructed semantic network
Candidates Ranking
- Generate a feature vector Fm(e) for each e ∈ Em
- Calculate Scorem(e) for each candidate e:
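The scoring formula was an image; given the weight-vector description, it is the linear model:

```latex
Score_m(e) = \vec{w}^{\,T} \cdot F_m(e)
```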
- where w is the weight vector which gives a different weight to each feature element in Fm(e)
- Rank the candidates and pick the top candidate as the predicted mapping entity for mention m
- To learn w, we use a max-margin technique based on the training data set
  - Require Scorem(e∗) to be larger than any other Scorem(e) with a margin
  - Minimize, over ξm ≥ 0 and w, the objective below
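The objective was an image; a reconstruction in the standard max-margin form (the trade-off constant λ is an assumption, as the slide's exact constant is not recoverable):

```latex
\min_{\vec{w},\,\xi} \; \frac{1}{2}\lVert\vec{w}\rVert^{2} + \lambda \sum_{m} \xi_{m}
\quad \text{s.t.} \quad
Score_m(e^{*}) \ge Score_m(e) + 1 - \xi_m, \qquad \xi_m \ge 0
```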
Unlinkable Mention Prediction
- Predict mention m as an unlinkable mention and return NIL (see the sketch below)
  - If the size of Em generated in the Candidate Entity Generation module is zero
  - If Scorem(etop) is smaller than the learned threshold τ
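Tying the three modules together, a minimal end-to-end sketch; the helpers for candidate generation and feature extraction are injected and hypothetical, while the scoring follows the learned linear model and threshold τ from the previous slides:

```python
from typing import Callable, Optional, Sequence, Set

def link(
    mention: str,
    candidates_of: Callable[[str], Set[str]],            # Candidate Entity Generation
    features_of: Callable[[str, str], Sequence[float]],  # assembles [LP, SA, SS, GC]
    weights: Sequence[float],                            # learned max-margin weights
    tau: float,                                          # learned NIL threshold
) -> Optional[str]:
    """LINDEN-style linking of one mention: generate, rank, validate."""
    candidates = candidates_of(mention)
    if not candidates:
        return None  # empty Em -> unlinkable, return NIL
    # Named Entity Disambiguation: linear score w . Fm(e) for each candidate.
    scores = {
        e: sum(w * f for w, f in zip(weights, features_of(mention, e)))
        for e in candidates
    }
    e_top = max(scores, key=scores.get)
    # Unlinkable Mention Prediction: reject a low-confidence top candidate.
    return e_top if scores[e_top] >= tau else None
```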
Experiment Setup
- Data sets
  - CZ data set: newswire data used by Cucerzan [3]
  - TAC-KBP2009 data set: used in the Knowledge Base Population (KBP) track at the Text Analysis Conference (TAC) 2009
- Parameter learning: 10-fold cross validation
[3] S. Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of EMNLP-CoNLL, pages 708–716, 2007.
Results over the CZ data set
Results on the TAC-KBP2009 data set
Conclusion
- LINDEN
  - A novel framework to link named entities in text with YAGO
  - Leverages the rich semantic knowledge derived from Wikipedia and the taxonomy of YAGO
  - Significantly outperforms the state-of-the-art methods in terms of accuracy
Thanks!
Q&A