Schema Matching and Data Extraction over HTML Tables
Cui Tao
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Supported by NSF
Introduction
Many tables on the Web
How to integrate data stored in different tables?
Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns
Problem
Detecting The Table of Interest
Problem
Different schemas
Different source table schemas
{Run #, Yr, Make, Model, Tran, Color, Dr}
{Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
{Vehicle, Distance, Price, Mileage}
{Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Target database schema
{Car, Year, Make, Model, Mileage, Price, PhoneNr},
{Car, Feature}
Problem
Attribute is Value
Problem
Attribute-Value is Value
Problem
Value is not Value
Problem
Factored Values
Problem
Split Values
Problem
Merged Values
Problem
Information Behind Links
Table extending over several pages
Single-column table (formatted as list)
Solution
Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns
Solution
Detect The Table of Interest
‘Real’ table test
Same number of values
Table size
Attribute test
Density measure test: density = (# of ontology-extracted values) / (total # of values in the table)
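The 'real' table test and the density measure can be sketched as follows. This is only an illustration under stated assumptions: the function names, the minimum-size parameters, and the three toy recognizers (year, price, make) are hypothetical stand-ins for the ontology's extraction patterns, which the slides do not spell out.

```python
import re

# Hypothetical stand-ins for ontology recognizers (not from the slides).
YEAR = re.compile(r"(19|20)\d{2}")
PRICE = re.compile(r"\$\d[\d,]*")
MAKES = {"Honda", "Ford"}

RECOGNIZERS = [
    lambda v: YEAR.fullmatch(v) is not None,
    lambda v: PRICE.fullmatch(v) is not None,
    lambda v: v in MAKES,
]


def is_real_table(rows, min_rows=2, min_cols=2):
    """'Real' table test: same number of values per row, and a minimum size."""
    if len(rows) < min_rows:
        return False
    widths = {len(row) for row in rows}
    return len(widths) == 1 and widths.pop() >= min_cols


def density(rows, recognizers=RECOGNIZERS):
    """# of recognized values divided by total # of non-empty values."""
    values = [cell.strip() for row in rows for cell in row if cell.strip()]
    if not values:
        return 0.0
    hits = sum(1 for v in values if any(r(v) for r in recognizers))
    return hits / len(values)


rows = [["1995", "Honda", "$6300"], ["2000", "Ford", "call"]]
print(is_real_table(rows), density(rows))
```

A table would be accepted as the table of interest when it passes the structural test and its density exceeds some threshold; the threshold value is not given in the slides.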
Solution
Remove Factoring
[Figure: the Year column after factoring is removed; each row carries its own value (2001, 2000, or 1999)]
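A minimal sketch of removing factoring, assuming the factored value appears once and the cells below it in the same column are left blank. The function name and the sample rows are illustrative, not taken from the slides; tables that factor values with a spanning row or a `rowspan` attribute would need a different pre-processing step.

```python
def remove_factoring(rows):
    """Copy each factored value down into the blank cells beneath it."""
    filled = []
    previous = None
    for row in rows:
        if previous is None:
            current = list(row)
        else:
            # A blank cell inherits the value from the row above.
            current = [cell if cell.strip() else prev
                       for cell, prev in zip(row, previous)]
        filled.append(current)
        previous = current
    return filled


factored = [
    ["2001", "Honda"],
    ["", "Ford"],
    ["2000", "Toyota"],
    ["", "Mazda"],
]
for row in remove_factoring(factored):
    print(row)
```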
Solution
Replace Boolean Values
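One way to picture this step, as a hedged sketch: a cell that merely says "yes" under an attribute such as Auto is replaced by the attribute name itself, and a "no" cell is emptied. The token sets and the function name below are assumptions made for illustration.

```python
# Hypothetical token sets for Boolean-style cells.
YES = {"yes", "y", "true", "x"}
NO = {"no", "n", "false"}


def replace_boolean_values(header, row):
    """Replace yes-like cells with the attribute name; blank out no-like cells."""
    result = []
    for attribute, value in zip(header, row):
        token = value.strip().lower()
        if token in YES:
            result.append(attribute)
        elif token in NO:
            result.append("")
        else:
            result.append(value)
    return result


header = ["Make", "Model", "Auto", "Air Cond.", "CD"]
row = ["Honda", "Civic EX", "Yes", "Yes", "No"]
print(replace_boolean_values(header, row))
```

After this replacement the attribute name can be recognized as a feature value by the extraction step.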
Solution
Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>,
<Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
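Forming the pairs above amounts to zipping the header with each row and dropping empty cells. The sketch below assumes Boolean values have already been replaced; the function name is hypothetical.

```python
def form_pairs(header, row):
    """Pair each attribute with its value, skipping empty cells."""
    return [(attribute, value)
            for attribute, value in zip(header, row)
            if value.strip()]


header = ["Make", "Model", "Year", "Colour", "Price",
          "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300",
       "Auto", "Air Cond.", "AM/FM", ""]
for attribute, value in form_pairs(header, row):
    print(f"<{attribute}, {value}>")
```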
Solution
Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>,
<Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
Solution
Information Hidden Behind Links
Unstructured and semi-structured: concatenate
Single attribute-value pairs: pair them together
<Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>
List: mark the beginning (<) and the end (>)
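For linked pages that hold single attribute-value pairs, the pairing step might look like the sketch below. It assumes, purely for illustration, that the linked page has already been reduced to a flat sequence of alternating label and value strings; the function name is hypothetical.

```python
def pair_up(cells):
    """Turn [label, value, label, value, ...] into (label, value) pairs."""
    if len(cells) % 2 != 0:
        raise ValueError("expected an even number of cells (label, value, ...)")
    return list(zip(cells[0::2], cells[1::2]))


cells = ["Price", "$7,988", "Mileage", "63,168 miles", "Doors", "4"]
for attribute, value in pair_up(cells):
    print(f"<{attribute}, {value}>")
```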
Solution
Inferred Mapping Creation
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Each row is a car.
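The idea of inferring mappings from extraction patterns can be sketched as follows: for each source column, count how often a target attribute's recognizer fires on the column's values, and map the column to the best-matching target attribute. The recognizers, the 0.5 threshold, and the function name are assumptions for illustration only; the slides do not give the actual decision rule.

```python
import re
from collections import Counter

# Hypothetical recognizers for a few target attributes.
TARGET_RECOGNIZERS = {
    "Year": re.compile(r"(19|20)\d{2}"),
    "Price": re.compile(r"\$\d[\d,]*"),
    "Make": re.compile(r"Honda|Ford|Toyota"),
}


def infer_mappings(header, rows, recognizers=TARGET_RECOGNIZERS, threshold=0.5):
    """Map each source attribute to the target attribute its values match most."""
    mappings = {}
    for index, attribute in enumerate(header):
        column = [row[index].strip() for row in rows if row[index].strip()]
        if not column:
            continue
        counts = Counter()
        for value in column:
            for target, pattern in recognizers.items():
                if pattern.fullmatch(value):
                    counts[target] += 1
        if counts:
            target, hits = counts.most_common(1)[0]
            # Keep the mapping only if enough of the column was recognized.
            if hits / len(column) >= threshold:
                mappings[attribute] = target
    return mappings


header = ["Yr", "Make", "Color"]
rows = [["1995", "Honda", "White"], ["2000", "Ford", "Red"]]
print(infer_mappings(header, rows))
```

Here the source attribute Yr maps to the target attribute Year even though the names differ, because the mapping is driven by the extracted values rather than by the column labels.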
Experimental Results
Car Advertisement Application domain
10 “training” tables
100% of the 57 mappings (no false mappings)
94.6% precision of the values in linked pages (5.4% false declarations)
50 test tables
94.7% of the 300 mappings (no false mappings)
On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
Other Applications
Cell Phone Plan Application domain
Soccer Player Application domain
Contribution
Provides an approach to extract information automatically from HTML tables
Suggests a different way to solve the problem of schema matching