10 - Large extraction and integration of data

Data Extraction and Integration from Imprecise Web Sources
Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre
(Creative Commons License, see last slide)
Data-intensive websites
[Figure: a target Website, a Database, and the page templates Template1, Template2, Template3]
Flint goal
[Figure: a StockQuote entity with attributes Last, Min, Max, Volume, 52high, Open, …]
System architecture
[Figure: Flint architecture over The Web, with the components Web Search, Data Extraction [WIDM08], and Data Integration]
Novel contribution
Data Extraction
• Unsupervised
• Automatic
• Scalable
• No knowledge available
RoadRunner [Vldb01]
ExAlg [Sigmod03]
TurboWrapper [Vldb07]
Data Integration
• Unsupervised
• Automatic
• Scalable
• Uncertain Data
• No labels available
• No corpus available
WebTables [Vldb08]
Cimple [Vldb07]
MetaQuerier [Cidr05]
PayGo [Cidr07]
Data Extraction
AAPL, GOOG, MSFT, INTC, …
128.09, 439.54, 34.89, 112.37, …
127.81, 439.25, 32.13, 111.01, …
132.43, 443.82, 33.67, 114.32, …
0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …
Add AAPL to Your Portfolio,
Add GOOG to Your Portfolio,
Add MSFT to Your Portfolio,
Add INTC to Your Portfolio, …
…
Data Extraction
HTML fragments taken from two pages belonging to the same website; each absolute XPath rule is applied to both pages:

XPath rule                      Page 1       Page 2
/html/body/table/tr[1]/td[2]    1,132,228    1,735,857
/html/body/table/tr[2]/td[2]    $20.66       $414.58
/html/body/table/tr[3]/td[2]    $11.70       $247.30
/html/body/table/tr[4]/td[2]    $20.72       $414.06
/html/body/table/tr[5]/td[2]    $0.02        99,494,200
/html/body/table/tr[6]/td[2]    4,732,600    null

Extraction error! (rules tr[5] and tr[6])
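The extraction error above can be reproduced with a minimal sketch, assuming two tiny pages where one row is missing from the second page. The page contents, the row labels ("Change", "Avg Volume") and the helper `extract` are hypothetical; only the idea of positional rules such as `tr[i]/td[2]` comes from the slide. The standard-library `xml.etree.ElementTree` is used here in place of a real HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages from the same site; PAGE_B lacks the "Change" row.
PAGE_A = """<html><body><table>
<tr><td>Volume</td><td>1,132,228</td></tr>
<tr><td>Open</td><td>$20.66</td></tr>
<tr><td>Change</td><td>$0.02</td></tr>
<tr><td>Avg Volume</td><td>4,732,600</td></tr>
</table></body></html>"""

PAGE_B = """<html><body><table>
<tr><td>Volume</td><td>1,735,857</td></tr>
<tr><td>Open</td><td>$414.58</td></tr>
<tr><td>Avg Volume</td><td>99,494,200</td></tr>
</table></body></html>"""

# Positional rules, relative to the <html> root element.
RULES = [f"./body/table/tr[{i}]/td[2]" for i in range(1, 5)]


def extract(page, rules):
    """Apply every positional rule to one page; a missing node yields None."""
    root = ET.fromstring(page)
    result = {}
    for rule in rules:
        node = root.find(rule)
        result[rule] = node.text if node is not None else None
    return result


rows = [extract(PAGE_A, RULES), extract(PAGE_B, RULES)]
for rule in RULES:
    # Each line pairs the values that the same rule returns on the two pages.
    print(rule, [r[rule] for r in rows])
```

With these pages, the third rule pairs a price change with a volume, and the fourth rule finds nothing on the second page, which is the kind of misalignment the slide flags.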
Data Integration
[Figure sequence, built up over several slides: attributes extracted from different sources, shown as columns of values aligned on the same stocks, are grouped into mappings.
Attributes and values shown:
  (max)    10, 33, 16
  (min)    4, 25, 10
  (price)  6, 26, 12
  (stock)  AA, GO, MS
Every mapping starts with matching threshold t=0.5. Attributes with identical values score 1.0 and are merged, giving (max)(max), (min)(min), and (stock)(stock). The (price) attribute scores 0.6 and yields the intermediate mapping (min)(min)(price), marked "?". In the last slides two thresholds are raised to t=0.7 and the mappings shown are (max)(max), (min)(min)(min), (price), and (stock)(stock)(stock).]
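The threshold-based grouping pictured in these slides can be sketched as follows. This is a simplified illustration, not the authors' algorithm: the `similarity` function is a hypothetical instance-based measure, the rule that a mapping holds at most one attribute per source is an assumption, and the threshold raising shown in the later slides is deliberately left out.

```python
def similarity(a, b):
    """Hypothetical similarity in [0, 1] between two value columns that are
    aligned on the same objects (not the measure used in the slides)."""
    scores = []
    for x, y in zip(a, b):
        if x is None or y is None:
            scores.append(0.0)
        elif isinstance(x, (int, float)) and isinstance(y, (int, float)):
            top = max(abs(x), abs(y))
            # Relative closeness of two numbers, clamped to [0, 1].
            scores.append(1.0 if top == 0 else max(0.0, 1.0 - abs(x - y) / top))
        else:
            scores.append(1.0 if x == y else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def build_mappings(attributes, t=0.5):
    """Greedy grouping: each attribute joins the best-scoring mapping whose
    first member scores at least t, otherwise it opens a new mapping."""
    mappings = []  # each mapping is a list of (source, name, values)
    for source, name, values in attributes:
        best, best_score = None, t
        for mapping in mappings:
            # Assumption: one attribute per source in each mapping.
            if any(src == source for src, _, _ in mapping):
                continue
            score = similarity(mapping[0][2], values)
            if score >= best_score:
                best, best_score = mapping, score
        if best is None:
            mappings.append([(source, name, values)])
        else:
            best.append((source, name, values))
    return mappings


attributes = [
    ("s1", "max", [10, 33, 16]),
    ("s1", "min", [4, 25, 10]),
    ("s1", "stock", ["AA", "GO", "MS"]),
    ("s2", "max", [10, 33, 16]),
    ("s2", "min", [4, 25, 10]),
    ("s2", "stock", ["AA", "GO", "MS"]),
]
mappings = build_mappings(attributes, t=0.5)
groups = [[f"{src}.{name}" for src, name, _ in m] for m in mappings]
print(groups)
```

A fixed threshold is fragile: under this toy measure, columns such as (min) and (price) are also fairly similar, which is the situation the slides handle by raising t for the affected mappings.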
Wrapper Refinement
[Figure: the mappings (max)(max), (min)(min)(min), (price), and (stock)(stock)(stock), with thresholds t=0.7, t=0.7, and t=0.5. An extracted attribute labeled (min/max), with values 10, null, 10, is marked "?": it scores 0.3 (weak) against the (max) and (min) mappings and 0.0 against (price) and (stock).]
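A weak match of this kind can be detected with a sketch like the one below. The `agreement` measure (fraction of exactly equal values) and the three-way `classify` rule are assumptions made for illustration; the slides do not state the measure actually used. On the slide's values it gives 1/3 against both (max) and (min), which is in line with the "0.3 (weak)" labels.

```python
def agreement(a, b):
    """Fraction of aligned values that are exactly equal; None never agrees."""
    pairs = list(zip(a, b))
    hits = sum(1 for x, y in pairs if x is not None and y is not None and x == y)
    return hits / len(pairs) if pairs else 0.0


def classify(score, t):
    """'match' at or above the threshold, 'weak' if positive, else 'none'."""
    if score >= t:
        return "match"
    if score > 0:
        return "weak"
    return "none"


# One representative column per mapping, with the values from the slides.
representatives = {
    "max": [10, 33, 16],
    "min": [4, 25, 10],
    "price": [6, 26, 12],
    "stock": ["AA", "GO", "MS"],
}
suspect = [10, None, 10]  # the (min/max) attribute

report = {
    name: classify(agreement(values, suspect), t=0.5)
    for name, values in representatives.items()
}
print(report)
```

An attribute that matches several mappings only weakly is a natural candidate for wrapper refinement, since its extraction rule may be mixing values of different attributes.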
Wrapper Refinement
Matching value, located relative to nearby template tokens:
//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…
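The effect of anchoring a rule on a template token instead of a row position can be sketched as follows. The helper `value_in_label_row` is a hypothetical stand-in for a rule of the form `//td[contains(text(),'Open')]/../td[2]`, written as a plain scan with the standard library; the two pages and the labels "Notice" and "Volume" are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: PAGE_B has an extra first row, so row positions shift.
PAGE_A = (
    "<table>"
    "<tr><td>Volume</td><td>1,132,228</td></tr>"
    "<tr><td>Open</td><td>$20.66</td></tr>"
    "</table>"
)
PAGE_B = (
    "<table>"
    "<tr><td>Notice</td><td>Delayed</td></tr>"
    "<tr><td>Volume</td><td>1,735,857</td></tr>"
    "<tr><td>Open</td><td>$414.58</td></tr>"
    "</table>"
)


def value_in_label_row(page, label):
    """Return the text of the second cell of the first row that has a cell
    whose text contains the given label, or None if no such row exists."""
    root = ET.fromstring(page)
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and any(label in (c.text or "") for c in cells):
            return cells[1].text
    return None


def value_at_row(page, index):
    """Positional counterpart: second cell of the row at a fixed 1-based index."""
    node = ET.fromstring(page).find(f"./tr[{index}]/td[2]")
    return node.text if node is not None else None


by_label = [value_in_label_row(p, "Open") for p in (PAGE_A, PAGE_B)]
by_position = [value_at_row(p, 2) for p in (PAGE_A, PAGE_B)]
print(by_label)     # values found next to the "Open" token
print(by_position)  # values found in the second row
```

The label-anchored lookup returns the opening price on both pages, while the positional lookup returns a price on one page and a volume on the other.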
Wrapper Refinement
[Figure sequence: the (min/max) attribute (10, null, 10) is re-extracted with the refined rules
//td[contains(text(),'Max')]/../td[2]
//td[contains(text(),'Min')]/../td[2]
which yield (max) 10, 33, 16 and (min) 4, 25, 10. Both score 1.0 and are added to the (max) and (min) mappings (t=0.7). The (price) attribute 6, 26, 12 and the (stock)(stock)(stock) mapping are shown with t=0.5.]
Experimental Results
(100 websites for each domain)

Soccer domain (45,714 pages)
Attribute        |m|
Name              90
Birth Date        61
Height            54
Nationality       48
Club              43
Position          43
Weight            34
League            14

Videogame domain (49,262 pages)
Attribute        |m|
Title             86
Publisher         59
Developer         45
Genre             28
ESRB rating       40
Release Date       9
Platform           9
# Players          6

Finance domain (57,623 pages)
Attribute        |m|
Stock Symbol      84
Price Change      73
% Change          73
Volume            52
Day Low           43
Day High          41
Last Price        29
Open Price        24
Demo
• Found Websites
• Integrated Data
the end!
http://flint.dia.uniroma3.it
License
• This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.