Collectively Representing Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute
Carnegie Mellon University
Paper ID : 02
This work is supported by Google and the Intelligence Advanced Research Projects Activity
(IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

Motivation
• Entities on the Web can be present in multiple datasets, e.g. HTML tables, text documents, etc.
• Traditional systems represent each entity as a sparse vector of the IDs of the documents in which it occurs.
• We propose a low-dimensional representation for such entities.
• This representation helps to perform different tasks efficiently with a small number of primitive operations :
  • Semi-Supervised Learning (SSL)
  • Set Expansion (SE)
  • Automatic Set Instance Acquisition (ASIA)

Entities in HTML tables

Country   Capital City
India     Delhi
USA       Washington DC
Canada    Ottawa
France    Paris

Country   Sports
India     Hockey
UK        Cricket
USA       Tennis

[Figure: Entity-Column bi-partite graph connecting entity nodes (USA, India, Hockey, Cricket, Tennis) to table-column nodes (TC-1, TC-2, TC-3, TC-4).]
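
The construction of the entity-column graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_entity_column_graph` is hypothetical, and the column IDs are assigned in reading order, which is an assumption and need not match the TC labels in the figure.

```python
from collections import defaultdict

def build_entity_column_graph(tables):
    """Build a bipartite graph between cell values (entities) and table columns.

    Each table is a list of rows; the first row is the header.
    Column IDs (TC-1, TC-2, ...) are assigned in reading order (assumption).
    """
    entity_to_cols = defaultdict(set)
    col_to_entities = defaultdict(set)
    next_id = 1
    for table in tables:
        header, rows = table[0], table[1:]
        for col_index in range(len(header)):
            col_id = f"TC-{next_id}"
            next_id += 1
            for row in rows:
                cell = row[col_index].strip()
                if cell:  # skip empty cells
                    entity_to_cols[cell].add(col_id)
                    col_to_entities[col_id].add(cell)
    return dict(entity_to_cols), dict(col_to_entities)

# The two example tables from the slide.
tables = [
    [["Country", "Capital City"], ["India", "Delhi"], ["USA", "Washington DC"],
     ["Canada", "Ottawa"], ["France", "Paris"]],
    [["Country", "Sports"], ["India", "Hockey"], ["UK", "Cricket"], ["USA", "Tennis"]],
]
entity_to_cols, col_to_entities = build_entity_column_graph(tables)
```

With this numbering, India and USA are each linked to both "Country" columns, so they share two table-column neighbors.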

Entities in unstructured text

Example sentences :
• Countries such as India are developing rapidly in terms of infrastructure.
• Outdoor sports include Tennis and Cricket.

[Figure: "Such as" bi-partite graph connecting suchas concept nodes (Country, Location, Sports) to entity nodes (USA, India, Hockey, Cricket, Tennis).]
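
One way to collect concept-entity edges from text is a simple "such as" pattern match. The sketch below is an assumption about how such extraction might look, not the method used in this work: the regular expression, the function name `extract_suchas_edges`, and the second test sentence are all hypothetical. It handles only the literal phrase "such as" followed by capitalized entity names, so a sentence using "include" would not match, and concept names are lowercased without any singular/plural normalization.

```python
import re
from collections import defaultdict

# Separators allowed between listed entities: ", ", ", and ", " and ", " or ".
SEPARATOR = r"(?:,\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+)"
ENTITY = r"[A-Z]\w*"  # naive assumption: entities are single capitalized words
SUCH_AS = re.compile(rf"(\w+)\s+such\s+as\s+({ENTITY}(?:{SEPARATOR}{ENTITY})*)")

def extract_suchas_edges(text):
    """Return a mapping concept -> set of entities found via 'X such as Y'."""
    concept_to_entities = defaultdict(set)
    for match in SUCH_AS.finditer(text):
        concept = match.group(1).lower()
        for entity in re.split(SEPARATOR, match.group(2)):
            concept_to_entities[concept].add(entity)
    return dict(concept_to_entities)

text = (
    "Countries such as India are developing rapidly in terms of infrastructure. "
    "Sports such as Tennis, Cricket and Hockey are popular."  # hypothetical sentence
)
edges = extract_suchas_edges(text)
```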
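
Resultant Tri-partite Graph

[Figure: tri-partite graph combining the "Such as" bi-partite graph and the Entity-Column bi-partite graph through their shared entity nodes. Suchas concept nodes: Country, Location, Sports. Entity nodes: USA, India, Hockey, Cricket, Tennis. Table-column nodes: TC-1, TC-2, TC-3, TC-4.]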

Encoding the graph

Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)

[Figure: "Entity-Column" bi-partite graph (entities USA, India, Hockey, Cricket, Tennis; table-columns TC-1 to TC-4) and its embedding.]

Entity    X1     X2
USA       0.43   0.66
India     0.41   0.69
Hockey    0.36   0.80
Cricket   0.35   0.82
Tennis    0.34   0.79

Entities with similar X1/X2 values should be ontologically similar : the values summarize tabular co-occurrence.
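
A simplified sketch of power iteration on a bipartite graph is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name `pic_embedding` is hypothetical, the entity-entity affinity is taken to be A = F F^T (F being the entity-by-column incidence matrix), one random start vector is used per output dimension, and a fixed iteration count stands in for the early-stopping rule of the original PIC algorithm. The product A v is computed as F (F^T v), so A is never materialized.

```python
import random

def pic_embedding(entity_to_cols, m=2, iterations=10, seed=0):
    """Return entity -> list of m values from truncated power iteration.

    Assumes every entity is linked to at least one column.
    """
    rng = random.Random(seed)
    entities = sorted(entity_to_cols)
    col_to_entities = {}
    for e in entities:
        for c in entity_to_cols[e]:
            col_to_entities.setdefault(c, []).append(e)

    def multiply(v):
        # First u = F^T v (mass per column), then A v = F u (mass per entity).
        col_mass = {c: sum(v[e] for e in members)
                    for c, members in col_to_entities.items()}
        return {e: sum(col_mass[c] for c in entity_to_cols[e]) for e in entities}

    degree = multiply({e: 1.0 for e in entities})  # row sums of A
    embedding = {e: [] for e in entities}
    for _ in range(m):
        v = {e: rng.random() for e in entities}
        for _ in range(iterations):
            av = multiply(v)
            wv = {e: av[e] / degree[e] for e in entities}  # W v with W = D^-1 A
            norm = sum(abs(x) for x in wv.values())        # L1 normalization
            v = {e: x / norm for e, x in wv.items()}
        for e in entities:
            embedding[e].append(v[e])
    return embedding

# Toy graph (hypothetical edges): two groups that share no columns.
graph = {"USA": {"TC-1"}, "India": {"TC-1"}, "Hockey": {"TC-4"}, "Cricket": {"TC-4"}}
emb = pic_embedding(graph, m=2, iterations=5)
```

In this toy graph, entities that share a column end up with the same coordinate in every dimension, which is the behavior the slide describes: similar values summarize co-occurrence.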

Encoding the graph

Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)

[Figure: "Such as" bi-partite graph (suchas concepts Country, Location, Sports; entities USA, India, Hockey, Cricket, Tennis) and its embedding.]

Entity    Y1     Y2
USA       0.23   0.76
India     0.21   0.79
Hockey    0.66   0.35
Cricket   0.16   0.92
Tennis    0.14   0.89

Entities with similar Y1/Y2 values should be ontologically similar : the values summarize "such as" pattern co-occurrence.

Low-dimensional PIC3 embedding

• n * t entity-tableColumn bipartite graph → PIC → n * m PIC embedding (m << t)
• n * s entity-suchas bipartite graph → PIC → n * m PIC embedding (m << s)
• Concatenate the two embeddings → n * 2m PIC3 embedding

Entity    X1     X2     Y1     Y2
USA       0.43   0.66   0.23   0.76
India     0.41   0.69   0.21   0.79
Hockey    0.36   0.80   0.66   0.35
Cricket   0.35   0.82   0.16   0.92
Tennis    0.34   0.79   0.14   0.89
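
The concatenation step can be illustrated with the values from the two embedding tables. This is a minimal sketch: the function name `concatenate_embeddings` is hypothetical, and filling zeros for an entity that appears in only one graph is an assumption, since the slides do not say how such entities are handled.

```python
def concatenate_embeddings(table_emb, suchas_emb, m):
    """Join two m-dimensional embeddings into one 2m-dimensional vector per entity."""
    zeros = [0.0] * m  # assumption: missing entities get a zero vector
    entities = sorted(set(table_emb) | set(suchas_emb))
    return {e: list(table_emb.get(e, zeros)) + list(suchas_emb.get(e, zeros))
            for e in entities}

# X1/X2 and Y1/Y2 values as shown on the slides.
table_emb = {"USA": [0.43, 0.66], "India": [0.41, 0.69], "Hockey": [0.36, 0.80],
             "Cricket": [0.35, 0.82], "Tennis": [0.34, 0.79]}
suchas_emb = {"USA": [0.23, 0.76], "India": [0.21, 0.79], "Hockey": [0.66, 0.35],
              "Cricket": [0.16, 0.92], "Tennis": [0.14, 0.89]}
pic3 = concatenate_embeddings(table_emb, suchas_emb, m=2)
```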
Using PIC3 Representation
• Semi-Supervised Learning : Given a few seed examples for each class, predict class labels for unlabeled data points.
• Set Expansion : Given a set of seed entities, find more entities similar to the seeds.
• Automatic Set Instance Acquisition (ASIA) : Given a concept name, automatically find instances of that concept.

Quantitative Evaluation: Datasets
Dataset                              Toy_Apple   Delicious_Sports
#entities                            14,996      438
#table-columns                       156         925
#entity-table column edges           176,598     9,192
#suchas concepts                     2,348       1,649
#entity-suchas edges                 7,683       4,799
#general entity classes (NELL KB)    11          3
#entities in general classes         419         39
#hand-coded column types             31          30
#columns in labeled types            156         925
Link to dataset: http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online

SSL using PIC3
Input : Few seed examples for each class label
Output : Class-labels for unlabeled data-points
Task : Semi-Supervised Learning
Training : PIC3 + Train SVM classifier
Testing : Predict using learnt SVM model
PIC clusters similar entities together → better SVM classifier on unlabeled data (use of background data)

SSL Task - I
# dimensions : 2504 → 10

SSL Task - II
# dimensions : 2574 → 10

Set Expansion using PIC3
Input : Few seed entities, e.g. Football, Hockey, Tennis
Output : More entities of same type as seeds, e.g. Baseball, Badminton, Cricket, Golf, ...
Task : Set Expansion
Training : PIC3
Testing : Centroid(entity set) + K-NN (centroid)
K-NN operation is extremely efficient using KD-trees.
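
The centroid plus K-NN step can be sketched on the PIC3 values shown earlier. This is a minimal illustration under stated assumptions: the function names are hypothetical, Euclidean distance is assumed, and a brute-force scan is used in place of a KD-tree for brevity, which gives the same neighbors on a toy example but not the same efficiency.

```python
import math

# PIC3 vectors (X1, X2, Y1, Y2) as shown on the embedding slide.
PIC3 = {
    "USA":     [0.43, 0.66, 0.23, 0.76],
    "India":   [0.41, 0.69, 0.21, 0.79],
    "Hockey":  [0.36, 0.80, 0.66, 0.35],
    "Cricket": [0.35, 0.82, 0.16, 0.92],
    "Tennis":  [0.34, 0.79, 0.14, 0.89],
}

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def expand_set(seeds, embedding, k):
    """Return the k non-seed entities closest to the centroid of the seeds."""
    center = centroid([embedding[s] for s in seeds])
    candidates = [e for e in embedding if e not in seeds]
    candidates.sort(key=lambda e: math.dist(center, embedding[e]))
    return candidates[:k]

nearest = expand_set(["USA"], PIC3, k=1)  # ['India']
```

Seeding with USA retrieves India first, since its vector differs from the USA vector by at most 0.03 in each coordinate.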

Query Times
• PIC3 preprocessing : 0.02 sec
• # SE queries = 881

Method          Total Query Time (s)
K-NN + PIC3     12.7
K-NN-Baseline   80.1
MAD             38.2

• Precision Recall Curve : K-NN+PIC3 consistently beats K-NN-Baseline. Modified Adsorption method is better on 2/5 query classes at the expense of larger query time.
• Modified Adsorption (MAD) : Graph based label propagation algorithm

Automatic Set Instance Acquisition (ASIA) : using PIC3
Input : Class label, e.g. Country
Output : Entities belonging to the given class label, e.g. India, China, USA, Canada, Japan, ...
Task : Automatic Set Instance Acquisition
Training : PIC3 + Inverted index (suchasConcept → entities)
Testing : seeds = top-k-entities (lookup concept in index) + Set Expansion (seeds)
Previously described Set Expansion algorithm is used as a subroutine here.
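
The ASIA pipeline (index lookup, seed selection, set expansion) can be sketched as below. Several details are assumptions made for illustration only: the function names, the index contents and co-occurrence counts, ranking seeds by count, and the brute-force Euclidean K-NN. The PIC3 vectors are the ones shown on the embedding slide.

```python
import math

PIC3 = {
    "USA":     [0.43, 0.66, 0.23, 0.76],
    "India":   [0.41, 0.69, 0.21, 0.79],
    "Hockey":  [0.36, 0.80, 0.66, 0.35],
    "Cricket": [0.35, 0.82, 0.16, 0.92],
    "Tennis":  [0.34, 0.79, 0.14, 0.89],
}

# Hypothetical inverted index: suchas concept -> {entity: co-occurrence count}.
SUCHAS_INDEX = {
    "country": {"USA": 3, "India": 2},
    "sports": {"Cricket": 2, "Tennis": 2, "Hockey": 1},
}

def expand_set(seeds, embedding, k):
    """Return the k non-seed entities closest to the centroid of the seeds."""
    vectors = [embedding[s] for s in seeds]
    center = [sum(column) / len(vectors) for column in zip(*vectors)]
    candidates = [e for e in embedding if e not in seeds]
    candidates.sort(key=lambda e: math.dist(center, embedding[e]))
    return candidates[:k]

def asia(concept, index, embedding, num_seeds, num_results):
    """Look up a concept, take its top entities as seeds, then expand the set."""
    counts = index.get(concept.lower(), {})
    # Assumption: "top-k" means highest count, ties broken alphabetically.
    ranked = sorted(counts, key=lambda e: (-counts[e], e))
    seeds = [e for e in ranked[:num_seeds] if e in embedding]
    if not seeds:
        return []
    return seeds + expand_set(seeds, embedding, num_results)

instances = asia("Country", SUCHAS_INDEX, PIC3, num_seeds=1, num_results=1)
```

Here the concept lookup yields USA as the single seed, and the expansion step adds India. A concept that is absent from the index returns an empty list.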

Query Times
• PIC3 preprocessing : 0.02 sec
• # ASIA queries = 25

Method          Total Query Time (s)
K-NN + PIC3     0.5
K-NN-Baseline   1.4
MAD             150.0

• Precision Recall Curve : K-NN+PIC3 consistently beats K-NN-Baseline. Modified Adsorption method is better on 2/4 query classes at the expense of much larger query time.

Conclusions & Future Work
• Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC).
• Simple primitive operations on PIC3 perform the following tasks :
  • Semi-Supervised Learning
  • Set Expansion
  • Automatic Set Instance Acquisition
• Future work : Use PIC3 representation for
  • Named entity disambiguation and
  • Unsupervised class-instance pair acquisition

Thank You !!
Please visit our poster ID : 02

Examples : Set Expansion
Examples : ASIA
Set Expansion
ASIA Task