Web Table

by Michael J. Cafarella, Alon Halevy, and
Jayant Madhavan
Outline
• THE CONCEPTS
• THE PROCESS
• THE BENEFITS
• THE APPLICATIONS
THE CONCEPTS
WebTables:
“The WebTables system is designed to extract relational-style data from the Web expressed using the HTML table tag.”
Relational-Style Data:
An estimated ~154M good relational tables are embedded in HTML.
THE PROCESS
The WebTables system automatically extracts databases from a web crawl.
Two stages for recovering relations:
• Relational filtering
• Metadata detection (in top row of table)
Relational Filtering
<table border="1" align="center" width="100%">
  <tr>
    <th bgcolor="silver" width="220">Cover</th>
    <th bgcolor="silver">AlbumInfo</th>
    <th bgcolor="silver">Number</th>
    <th bgcolor="silver">Song</th>
    <th bgcolor="silver">Genre</th>
    <th bgcolor="silver">Time</th>
  </tr>
  .....
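A minimal sketch of the filtering idea, assuming a hand-written heuristic rather than whatever classifier the WebTables system actually uses. The names `RowCollector` and `looks_relational` and all thresholds are hypothetical; only Python's standard `html.parser` module is used. A table is kept when it has several rows of equal width and few empty cells, which screens out tables used purely for page layout.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects the cell text of every <tr> row in the given HTML."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def looks_relational(html, min_rows=2, min_cols=2, max_empty=0.5):
    """Hypothetical heuristic: rectangular, multi-row, mostly non-empty."""
    parser = RowCollector()
    parser.feed(html)
    rows = [r for r in parser.rows if r]
    if len(rows) < min_rows:
        return False
    widths = {len(r) for r in rows}
    if len(widths) != 1 or widths.pop() < min_cols:
        return False
    cells = [c for r in rows for c in r]
    empty = sum(1 for c in cells if not c)
    return empty / len(cells) <= max_empty
```

A one-cell layout table is rejected, while a table with a header row and at least one data row of the same width passes.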
Metadata Detection
<table border="1" align="center" width="100%">
  <tr>
    <th bgcolor="silver" width="220">Cover</th>
    <th bgcolor="silver">AlbumInfo</th>
    <th bgcolor="silver">Number</th>
    <th bgcolor="silver">Song</th>
    <th bgcolor="silver">Genre</th>
    <th bgcolor="silver">Time</th>
  </tr>
  .....
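A minimal sketch of metadata detection on the top row, assuming the simplest possible rule: the first row is a header if every cell in it is a non-empty `<th>`. This is an illustrative simplification, not the detector the WebTables system uses; `FirstRowCells` and `detect_header` are hypothetical names.

```python
from html.parser import HTMLParser


class FirstRowCells(HTMLParser):
    """Records the tag and text of each cell in the first <tr> only."""

    def __init__(self):
        super().__init__()
        self.cells = []       # [tag, text] pairs for the first row
        self._row = 0         # number of <tr> tags seen so far
        self._open = False    # True while inside a cell of the first row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._open = False
        elif tag in ("td", "th") and self._row == 1:
            self.cells.append([tag, ""])
            self._open = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._open = False

    def handle_data(self, data):
        if self._open:
            self.cells[-1][1] += data.strip()


def detect_header(html):
    """Return the attribute names if the top row looks like a header, else None."""
    parser = FirstRowCells()
    parser.feed(html)
    if parser.cells and all(tag == "th" and text for tag, text in parser.cells):
        return [text for _, text in parser.cells]
    return None
```

Applied to the table on this slide, the rule would return the six attribute names from Cover through Time. Tables that put their header in plain `<td>` cells would need a richer rule, for example comparing the data types of the first row with those of the rows below it.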
Recovery results:
• 271M databases, of which about 125M are good
• 2.6M unique relational schemas
• What can we get from the resulting data?
THE BENEFITS
ACSDb (Attribute Correlation Statistics Database)
• It contains schema information derived from millions of structured tables recovered from a large general web crawl. This work was done at Google.
• Here are two example lines from the ACSDb:
Combo_make_model_year=13
Single_make=3068
• The first line indicates that a schema with exactly three elements (make, model, year) was seen in 13 different tables. The second line indicates that the attribute make was seen in 3068 different tables.
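The two kinds of ACSDb lines can be reproduced with a short counting pass. This is a sketch under assumptions: the input is a list of already-extracted table schemas, attribute names inside a `Combo_` key are sorted so that column order does not matter, and `build_acsdb` is a hypothetical name rather than part of any published tool.

```python
from collections import Counter


def build_acsdb(table_schemas):
    """Count whole-schema and single-attribute occurrences across tables."""
    acsdb = Counter()
    for schema in table_schemas:
        attrs = sorted(set(schema))            # order-insensitive schema key
        acsdb["Combo_" + "_".join(attrs)] += 1
        for attr in attrs:
            acsdb["Single_" + attr] += 1
    return acsdb


# Toy input: three tables, two of which share the same schema.
tables = [
    ["make", "model", "year"],
    ["year", "make", "model"],
    ["make", "price"],
]
acsdb = build_acsdb(tables)
print(acsdb["Combo_make_model_year"])  # 2
print(acsdb["Single_make"])            # 3
```

Joining names with underscores mirrors the example lines above; attribute names that themselves contain underscores would need a different separator.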
Schema                            Freq
{make, model, year}               2
{make, model, year, colour}       1
{name, addr, city, state, zip}    1
{name, size, last-modified}       1
The ACSDb is useful for computing attribute probabilities:
p(“make”), p(“model”), p(“zipcode”)
p(“make”| “model”), p(“make” | “zipcode”)
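As a worked example, the probabilities can be computed from the schema frequency table on this slide. The sketch assumes p(a) is the frequency-weighted share of schemas containing a, and p(a | b) is the share of schemas containing b that also contain a; the function names `p` and `p_given` are illustrative.

```python
# Schema frequencies copied from the table on this slide.
SCHEMA_FREQ = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"make", "model", "year", "colour"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"name", "size", "last-modified"}): 1,
}


def p(attr, freq=SCHEMA_FREQ):
    """Share of schemas (weighted by frequency) that contain attr."""
    total = sum(freq.values())
    hits = sum(n for schema, n in freq.items() if attr in schema)
    return hits / total


def p_given(attr, given, freq=SCHEMA_FREQ):
    """Share of schemas containing `given` that also contain `attr`."""
    denom = sum(n for schema, n in freq.items() if given in schema)
    if denom == 0:
        return 0.0  # `given` was never observed; treated as 0 by assumption
    both = sum(n for schema, n in freq.items()
               if given in schema and attr in schema)
    return both / denom


print(p("make"))                  # 0.6
print(p_given("make", "model"))   # 1.0
print(p_given("make", "zip"))     # 0.0
```

In this toy table "make" appears in 3 of 5 schema occurrences, always alongside "model", and never alongside "zip".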
THE APPLICATIONS
App#1 Schema auto-complete
Problem: given a user’s input attribute, suggest a full schema
Input: make
Output: make, model, year, price
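One way to sketch auto-completion is a greedy loop over co-occurrence counts: keep adding the attribute that most often appears together with everything chosen so far, and stop once the completed schema co-occurs with the input attribute less often than a threshold. The data in `TOY_FREQ`, the threshold value, and the names `cooccurrence` and `autocomplete` are all assumptions made for illustration.

```python
# Hypothetical schema frequencies, chosen to mirror the example on this slide.
TOY_FREQ = {
    frozenset({"make", "model", "year", "price"}): 3,
    frozenset({"make", "model", "year"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 2,
}


def cooccurrence(attrs, freq):
    """Total frequency of schemas that contain every attribute in attrs."""
    return sum(n for schema, n in freq.items() if attrs <= schema)


def autocomplete(seed, freq, threshold=0.5):
    """Greedily extend [seed] while the joint share stays above threshold."""
    chosen = [seed]
    base = cooccurrence({seed}, freq)
    if base == 0:
        return chosen  # unseen attribute: nothing to suggest
    remaining = set().union(*freq) - {seed}
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical)
        best = max(sorted(remaining),
                   key=lambda a: cooccurrence(set(chosen) | {a}, freq))
        if cooccurrence(set(chosen) | {best}, freq) / base < threshold:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen


print(autocomplete("make", TOY_FREQ))  # ['make', 'model', 'year', 'price']
```

Raising the threshold yields shorter, safer suggestions; with `threshold=0.8` the same input stops at make, model, year.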
App#2 Synonyms finder
Problem: given an attribute, automatically compute related attribute synonyms (for schema matching)
Principles:
• Synonyms do not appear in the same schema
• Synonyms appear in schemas with the same attributes
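The two principles translate directly into a filter and a score. The sketch below swaps in a simple Jaccard overlap of co-occurring attributes as the score, which is an assumption made for brevity and not necessarily the scoring function the system uses; `synonym_candidates` and the toy schemas are hypothetical.

```python
from itertools import combinations


def synonym_candidates(schemas):
    """Rank attribute pairs that never co-occur by how similar their contexts are."""
    context = {}  # attribute -> set of attributes it co-occurs with
    for schema in schemas:
        for attr in schema:
            context.setdefault(attr, set()).update(schema - {attr})
    scored = []
    for a, b in combinations(sorted(context), 2):
        if b in context[a]:
            continue  # principle 1: synonyms do not share a schema
        shared = context[a] & context[b]
        if shared:    # principle 2: they appear with the same attributes
            union = context[a] | context[b]
            scored.append((len(shared) / len(union), a, b))
    return sorted(scored, reverse=True)


# Toy schemas (illustrative only).
schemas = [
    frozenset({"make", "model", "year", "colour"}),
    frozenset({"make", "model", "year", "color"}),
    frozenset({"name", "addr", "zip"}),
    frozenset({"name", "addr", "zipcode"}),
]
pairs = {(a, b) for _, a, b in synonym_candidates(schemas)}
print(pairs == {("color", "colour"), ("zip", "zipcode")})  # True
```

Pairs such as make and model are excluded because they occur together, and pairs from unrelated domains are excluded because they share no context.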
App#3 Construct domain ontology
1. Classify schemas according to domain
2. Combine schemas to form a schema ontology
Provide a better semantic service!
Example: Table Search Engine
Conclusion
Disadvantages:
● Automatically extracting relational databases from HTML web pages is difficult.
● The extracted databases are usually incomplete, with missing values or attributes.
Advantages:
● Extracts large-scale databases embedded in HTML.
● Supports data mining operations.
● Supports searching for databases instead of URLs.
● Schema auto-complete, the synonym finder, and domain ontology construction can provide better semantic services.
Thank you!