
COLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT
- Presented by Avinash S Bharadwaj (1000663882)
ABSTRACT

- The aim of the paper: annotating open-domain, unstructured web text with uniquely identified entities from a socially curated catalog like Wikipedia.
- Using these annotations for search and mining tasks.

WHAT IS ENTITY DISAMBIGUATION?
- An entity is something real that has a distinct existence.
- Wikipedia articles can be considered entities.
- Entity disambiguation is the art of resolving the correspondence between mentions of entities in natural language and real-world entities.
- In this paper, disambiguation is carried out between spots in web pages and Wikipedia articles.

ENTITY DISAMBIGUATION EXAMPLE
PREVIOUS WORK IN DISAMBIGUATION

- SemTag:
  - First web-scale disambiguation system.
  - Annotated about 250 million web pages with IDs from the Stanford TAP catalog.
  - SemTag preferred high precision over recall, with an average of two annotations per page.
- Wikify!:
  - Performed both keyword extraction and disambiguation.
  - Could not achieve collective disambiguation across spots.
- Milne and Witten (M&W):
  - A form of collective disambiguation that gives better results than Wikify!.
  - M&W achieves an F1 measure of 0.83, unlike Wikify!, which has an F1 measure of 0.53.
- Cucerzan's algorithm:
  - Each entity is represented as a high-dimensional feature vector.
  - Annotates sparingly: only about 4.5% of all possible tokens are annotated.
TERMINOLOGIES

- Spot: an occurrence of text on a page that can possibly be linked to a Wikipedia article.
- Attachment: a possible entity in Wikipedia to which a spot can be linked.
- Annotation: the process of making an attachment to a spot on a page.
- Gamma list: the list of all possible annotations.
TERMINOLOGIES ILLUSTRATED
(Figure: a sample page with its spots, their candidate attachments, and the resulting Gamma list.)
COLLECTIVE ENTITY DISAMBIGUATION
- Sometimes disambiguation cannot be carried out using a single spot on a page.
- Multiple spots on a page are required to disambiguate an entity.
- All spots in an article are considered to be related.
COLLECTIVE ENTITY DISAMBIGUATION EXAMPLE
CALCULATING RELATEDNESS BETWEEN WIKIPEDIA ENTITIES
- Relatedness between two entities is defined as r(γ, γ') = g(γ) · g(γ').
- Cucerzan's proposal defined relatedness between entities based on a cosine measure.
- Milne et al.'s proposal: with c = number of Wikipedia pages, g(γ) is a c-dimensional binary vector with g(γ)[p] = 1 if page p links to page γ, and 0 otherwise.
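A minimal sketch of the Milne et al. representation, assuming hypothetical inlink data: since g(γ) is a 0/1 vector of inlinks, the dot product r(γ, γ') is simply the number of pages that link to both entities, so Python sets can stand in for the c-dimensional vectors.

# Hypothetical inlink sets: which Wikipedia pages link to each entity.
inlinks = {
    "Michael_Jordan": {"NBA", "Chicago_Bulls", "Basketball"},
    "Chicago_Bulls": {"NBA", "Basketball", "United_Center"},
}

def relatedness(e1, e2):
    # r(γ, γ') = g(γ) · g(γ'): for binary inlink vectors this is the
    # size of the intersection of the two inlink sets.
    return len(inlinks[e1] & inlinks[e2])

print(relatedness("Michael_Jordan", "Chicago_Bulls"))  # 2 shared inlinks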

CONTRIBUTIONS OF THIS PAPER
- The paper poses entity disambiguation as an optimization problem.
- The paper provides a single optimization objective, solved:
  - exactly, using integer linear programs
  - approximately, using heuristics
- The paper also describes rich node features with systematic learning.
- The paper also describes a back-off strategy for controlled annotation.

MODELING COMPATIBILITY BETWEEN WIKIPEDIA ARTICLES
- Entities are modeled using a feature vector fs(γ).
- The feature vector expresses local textual compatibility between the (context of) spot s and a candidate label γ.
- Components of the feature vector:
  - Spot side: context of the spot
  - Wikipedia side: snippet, full text, anchor text, anchor text with context
- Similarity measures: dot product, cosine similarity, Jaccard similarity (see the sketch below)
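A minimal sketch of how one (spot-side, Wikipedia-side) text pair could yield the three similarity features; the whitespace tokenization and field choice are simplifying assumptions, not the paper's exact pipeline.

import math
from collections import Counter

def similarity_features(spot_context, wiki_text):
    # Bag-of-words term-frequency vectors for both sides.
    a = Counter(spot_context.lower().split())
    b = Counter(wiki_text.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    cosine = dot / norm if norm else 0.0
    jaccard = len(a.keys() & b.keys()) / max(len(a.keys() | b.keys()), 1)
    # One slice of fs(γ): [dot product, cosine, Jaccard].
    return [dot, cosine, jaccard]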
METHODS FOR EVALUATING THE MODEL
- The authors use two scores for evaluating a candidate labeling: the node score and the clique score.
- Node score:
  - Defined as a linear function w · fs(γ) of the feature vector.
  - The weight vector w is learned from a training set using a RankSVM-style linear formulation.
- Clique score:
  - Uses the relatedness measure of Milne and Witten.
- Total objective: combines the node scores of all spots with the clique score over their chosen labels.
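A schematic form of the total objective (written here without the paper's exact normalization or trade-off constants): choose a label ys for each spot s so as to maximize the node scores plus the pairwise relatedness of the chosen labels,

\max_{\{y_s\}} \; \sum_{s} w^{\top} f_s(y_s) \;+\; \lambda \sum_{s \neq s'} r(y_s, y_{s'})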
BACK-OFF METHOD
- Not all spots on a web page may be tagged.
- A special label "NA" is used for spots that cannot be tagged.
- Spots on the web page marked "NA" do not contribute to the clique potential.
- A factor called "RNA" defines the aggressiveness of the tagging algorithm.

IMPLEMENTATION

- Integer linear program (ILP) based formulation:
  - Casting the problem as a 0/1 integer linear program
  - Relaxing it to an LP
- Simpler heuristic:
  - Hill climbing for optimization (a sketch follows)
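A minimal hill-climbing sketch for the objective above, assuming node_score and relatedness functions like those defined earlier and a set of candidate entities per spot; greedily relabeling one spot at a time is one simple heuristic, not necessarily the paper's exact procedure. The "NA" back-off is modeled by a fixed na_score and a zero clique contribution.

def hill_climb(spots, candidates, node_score, relatedness, na_score=0.0, max_iters=20):
    # Start with every spot unlabeled.
    labels = {s: "NA" for s in spots}

    def score(s, e):
        if e == "NA":
            return na_score  # "NA" spots add nothing to the clique potential
        return node_score(s, e) + sum(
            relatedness(e, labels[t])
            for t in spots if t != s and labels[t] != "NA")

    for _ in range(max_iters):
        changed = False
        for s in spots:
            # Greedily pick the best label for s given all other current labels.
            best = max(candidates[s] | {"NA"}, key=lambda e: score(s, e))
            if best != labels[s]:
                labels[s], changed = best, True
        if not changed:  # local optimum reached
            break
    return labels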
EVALUATING THE ALGORITHM

- Evaluation measures used:
  - Precision: the number of spots tagged correctly out of the total number of spots tagged.
  - Recall: the number of spots tagged correctly out of the total number of spots in the ground truth.
  - F1: the harmonic mean of the two, F1 = 2 × Precision × Recall / (Precision + Recall).
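A minimal sketch of the three measures, assuming taggings are represented as dicts from spot to entity with "NA" meaning untagged (illustrative names, not the paper's code):

def precision_recall_f1(predicted, gold):
    tagged = {s for s, e in predicted.items() if e != "NA"}
    gold_tagged = {s for s, e in gold.items() if e != "NA"}
    correct = {s for s in tagged if s in gold_tagged and gold[s] == predicted[s]}
    p = len(correct) / max(len(tagged), 1)
    r = len(correct) / max(len(gold_tagged), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1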
DATASETS USED FOR EVALUATION
- The authors use web pages crawled and stored in the IITB dataset.
- Publicly available data from Cucerzan's experiments (CZ).

EXPERIMENTAL RESULTS
NAMED ENTITY DISAMBIGUATION IN WIKIPEDIA
- The name ambiguity problem has created a demand for efficient, high-quality disambiguation methods.
- This is not a trivial task: the application must be able to decide whether a group of name occurrences belongs to the same entity.
- Traditional methods of named entity disambiguation use the Bag of Words (BOW) model.

WIKIPEDIA AS A SEMANTIC NETWORK
- Wikipedia is an open database covering most of the useful topics in the world.
- The title of a Wikipedia article describes the content within the article.
- Titles may sometimes be noisy; these are filtered using rules from Hu et al.

SEMANTIC RELATIONS BETWEEN WIKIPEDIA CONCEPTS
- Wikipedia contains rich relation structures within its pages.
- Relatedness is represented by links between Wikipedia pages.

WORKING OF NAMED ENTITY DISAMBIGUATION USING WIKIPEDIA
- Concept vectors are used to represent each Wikipedia entity.
- Similarity between these vectors is measured for named entity disambiguation.

MEASURING SIMILARITY BETWEEN TWO WIKIPEDIA ENTITIES
- The similarity measure takes into account the full semantic relations indicated by hyperlinks in Wikipedia.
- The algorithm works in three steps, described below.

STEP 1
- To measure the similarity between two vector representations, the correspondence between the concepts of one vector and those of the other has to be defined.
- Semantic relations between articles are used to align the concepts.

STEP 2
- Compute the semantic relatedness from one concept vector representation to the other, using the alignments from the previous step.
- SR(MJ1→MJ2) = (0.42×0.47×0.54 + 0.54×0.51×0.66 + 0.51×0.51×0.65) / (0.42×0.47 + 0.54×0.51 + 0.51×0.51) = 0.62
- SR(MJ2→MJ1) = (0.47×0.42×0.54 + 0.52×0.54×0.58 + 0.52×0.51×0.60 + 0.51×0.54×0.66) / (0.47×0.42 + 0.52×0.54 + 0.52×0.51 + 0.51×0.54) = 0.60
STEP 3
- Compute the similarity between the two concept vector representations as the average of the two directed relatedness scores.
- SIM(MJ1, MJ2) = (0.60 + 0.62) / 2 = 0.61; SIM(MJ2, MJ3) = 0.10; SIM(MJ1, MJ3) = 0.0.
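A minimal sketch of Steps 2 and 3 under the formula the worked numbers imply: each aligned concept pair contributes weight_A × weight_B × relatedness to the numerator and weight_A × weight_B to the denominator, and SIM averages the two directions. The alignment is assumed precomputed (Step 1); the sample data is the MJ1→MJ2 alignment above.

def directed_sr(aligned_pairs):
    # aligned_pairs: (weight in A, weight in B, relatedness) per aligned concept.
    num = sum(wa * wb * rel for wa, wb, rel in aligned_pairs)
    den = sum(wa * wb for wa, wb, _ in aligned_pairs)
    return num / den if den else 0.0

def sim(pairs_ab, pairs_ba):
    # Step 3: symmetric similarity is the mean of both directions.
    return (directed_sr(pairs_ab) + directed_sr(pairs_ba)) / 2

mj1_to_mj2 = [(0.42, 0.47, 0.54), (0.54, 0.51, 0.66), (0.51, 0.51, 0.65)]
print(round(directed_sr(mj1_to_mj2), 2))  # 0.62, matching the slide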
RESULTS
QUESTIONS?