Data Extraction from HTML Tables - BYU Data Extraction Research

Schema Matching and Data Extraction over HTML Tables
Cui Tao
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Supported by NSF
Introduction


Many tables on the Web
How to integrate data stored in different tables?




Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns
Problem
Detecting The Table of Interest
Problem
Different schemas

Different source table schemas
{Run #, Yr, Make, Model, Tran, Color, Dr}
{Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
{Vehicle, Distance, Price, Mileage}
{Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Target database schema
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Problem
Attribute is Value
Problem
Attribute-Value is Value
Problem
Value is not Value
Problem
Factored Values
Problem
Split Values
Problem
Merged Values
Problem
Information Behind Links
Table extending over several pages
Single-column table (formatted as list)
Solution




Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns
Solution
Detect The Table of Interest

‘Real’ table test
    Same number of values
    Table size
Attribute test
Density measure test
    (# of ontology-extracted values) / (total # of values in the table)
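The ‘real’ table test and the density measure can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the size thresholds, and the toy recognizer standing in for the ontology's extraction patterns are all hypothetical and are not taken from the slides.

```python
import re
from typing import Callable, List

Table = List[List[str]]  # data rows of cell strings (header row excluded)


def looks_like_real_table(rows: Table, min_rows: int = 2, min_cols: int = 2) -> bool:
    """'Real' table test: every row holds the same number of values
    and the table is not too small. Thresholds are assumed defaults."""
    if len(rows) < min_rows:
        return False
    width = len(rows[0])
    return width >= min_cols and all(len(row) == width for row in rows)


def density(rows: Table, recognizes: Callable[[str], bool]) -> float:
    """Density measure: ontology-extracted values / total values in the table."""
    values = [cell for row in rows for cell in row if cell.strip()]
    if not values:
        return 0.0
    return sum(1 for value in values if recognizes(value)) / len(values)


def toy_recognizer(value: str) -> bool:
    """Hypothetical stand-in for the ontology: years, prices, two makes."""
    return bool(re.fullmatch(r"(19|20)\d{2}|\$[\d,]+|Honda|Ford", value.strip()))


rows = [
    ["Honda", "1995", "$6300", "White"],
    ["Ford", "1998", "$7,988", "Red"],
]
table_ok = looks_like_real_table(rows)
table_density = density(rows, toy_recognizer)  # 6 of 8 values are recognized
```

A table would then count as the table of interest when it passes the structural test and its density exceeds some cutoff; the cutoff itself is not given in the slides.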
Solution
Remove Factoring
[Figure: example table with the factored Year value (2001, 2000, 1999) repeated on every row]
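One way to remove factoring is a simple fill-down pass. The sketch below assumes that a factored value appears once at the top of its group and that the cells beneath it are empty; the function name and the sample rows are hypothetical.

```python
from typing import List, Optional


def remove_factoring(rows: List[List[str]], column: int) -> List[List[str]]:
    """Copy the most recent non-empty value of `column` into the empty
    cells below it, so that every row carries its own value."""
    filled = []
    last: Optional[str] = None
    for row in rows:
        row = list(row)  # do not mutate the caller's rows
        if row[column].strip():
            last = row[column]
        elif last is not None:
            row[column] = last
        filled.append(row)
    return filled


factored = [
    ["2001", "Honda"],
    ["", "Ford"],
    ["2000", "Toyota"],
    ["", "Nissan"],
]
unfactored = remove_factoring(factored, column=0)
```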
Solution
Replace Boolean Values
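A minimal sketch of this step, assuming that Boolean cells hold markers such as "Yes" or "No" and that a true marker should be replaced by the column's attribute name (which is consistent with pairs such as <Auto, Auto> elsewhere in the slides). The marker sets and the function name are assumptions.

```python
from typing import List

TRUE_MARKERS = {"yes", "y", "x", "true"}   # assumed spellings of "true"
FALSE_MARKERS = {"no", "n", "false"}       # assumed spellings of "false"


def replace_boolean_values(header: List[str], row: List[str]) -> List[str]:
    """Replace a true marker with the attribute name and a false marker
    with an empty string; leave every other value untouched."""
    result = []
    for attribute, value in zip(header, row):
        marker = value.strip().lower()
        if marker in TRUE_MARKERS:
            result.append(attribute)
        elif marker in FALSE_MARKERS:
            result.append("")
        else:
            result.append(value)
    return result


header = ["Make", "Auto", "Air Cond.", "CD"]
row = ["Honda", "Yes", "Yes", "No"]
replaced = replace_boolean_values(header, row)
```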
Solution
Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>,
<Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
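Forming the pairs amounts to zipping the header with each row. The sketch below reproduces the pairs listed above; the empty CD cell and the function name are assumptions made for illustration.

```python
from typing import List, Tuple


def form_pairs(header: List[str], row: List[str]) -> List[Tuple[str, str]]:
    """Pair each attribute with its value, skipping empty cells."""
    return [(attribute, value) for attribute, value in zip(header, row) if value.strip()]


header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Auto", "Air Cond.", "AM/FM", ""]
pairs = form_pairs(header, row)
```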
Solution
Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>,
<Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
Solution
Information Hidden Behind Links
Unstructured and semi-structured: add (concatenate)
Single attribute-value pairs: pair them together
<Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>
List: mark the beginning and the end
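For the attribute-value case, a linked detail page can be reduced to pairs in much the same way as a table row. The sketch assumes the linked page has already been parsed into (label, value) rows with colon-terminated labels; that input shape, the function name, and the empty "Notes:" row are hypothetical.

```python
from typing import Iterable, List, Tuple


def pairs_from_linked_page(label_value_rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (label, value) rows from a linked page into attribute-value pairs,
    dropping trailing colons and rows without a value."""
    pairs = []
    for label, value in label_value_rows:
        label = label.strip().rstrip(":").strip()
        value = value.strip()
        if label and value:
            pairs.append((label, value))
    return pairs


linked_rows = [
    ("Price:", "$7,988"),
    ("Mileage:", "63,168 miles"),
    ("Doors:", "4"),
    ("Notes:", ""),
]
linked_pairs = pairs_from_linked_page(linked_rows)
```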
Solution
Inferred Mapping Creation
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Solution
Inferred Mapping Creation
Each row is a car.
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
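The idea of inferring mappings from extraction patterns can be illustrated as follows: for each source column, count which target attribute's recognizer extracts its values, and map the column to the winner. This is a sketch, not the presented system; the regular expressions, the `min_share` threshold, and all names are assumptions.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical recognizers for a few attributes of the target schema.
RECOGNIZERS = {
    "Year": re.compile(r"(19|20)\d{2}"),
    "Price": re.compile(r"\$\d[\d,]*"),
    "Mileage": re.compile(r"\d[\d,]*\s*(miles|mi\.?|k)", re.IGNORECASE),
    "Make": re.compile(r"Honda|Ford|Toyota", re.IGNORECASE),
}


def infer_mappings(header: List[str], rows: List[List[str]], min_share: float = 0.5) -> Dict[str, str]:
    """Map each source attribute to the target attribute whose recognizer
    extracts the largest share of the column's values."""
    mappings = {}
    for col, attribute in enumerate(header):
        values = [row[col].strip() for row in rows if row[col].strip()]
        if not values:
            continue
        hits = Counter()
        for value in values:
            for target, pattern in RECOGNIZERS.items():
                if pattern.fullmatch(value):
                    hits[target] += 1
        if hits:
            target, count = hits.most_common(1)[0]
            if count / len(values) >= min_share:
                mappings[attribute] = target
    return mappings


header = ["Yr", "Make", "Colour", "Price"]
rows = [
    ["1995", "Honda", "White", "$6300"],
    ["1998", "Ford", "Red", "$7,988"],
]
mappings = infer_mappings(header, rows)
```

In this toy run the Colour column stays unmapped because no recognizer extracts its values, which mirrors the point that mappings come from what the extraction patterns actually recognize rather than from attribute names alone.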
Experimental Results
Car Advertisement Application domain
 10 “training” tables



100% of the 57 mappings (no false mappings)
94.6% precision of the values in linked pages (5.4% false declarations)
50 test tables


94.7% of the 300 mappings (no false mappings)
On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
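For reference, precision and recall are computed from counts of correct, false, and missed extractions. The counts in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not the experiment's data.

```python
from typing import Tuple


def precision_recall(true_positives: int, false_positives: int, false_negatives: int) -> Tuple[float, float]:
    """precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    extracted = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts: 90 correct values, 10 false declarations, 30 missed values.
precision, recall = precision_recall(90, 10, 30)
```

Under this definition the share of false declarations among extracted values is the complement of precision, which matches the 94.6% and 5.4% figures reported for the training tables.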
Other Applications


Cell Phone Plan Application domain
Soccer Player Application domain
Contribution


Provides an approach to extract information automatically from HTML tables
Suggests a different way to solve the problem of schema matching