Data Extraction and Integration from Imprecise Web Sources
Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre
(Creative Commons License, see last slide)

Data-intensive websites
[Figure: a target website whose pages are produced from a database through templates (Template1, Template2, Template3).]

Flint goal
[Figure: a StockQuote relation with attributes Last, Min, Max, Volume, 52high, Open, …]

System architecture
[Figure: components shown are The Web, Web Search, Flint, Data Extraction [WIDM08], and Data Integration.]

Novel contribution
Data Extraction
• Unsupervised
• Automatic
• Scalable
• No knowledge available
RoadRunner [Vldb01], ExAlg [Sigmod03], TurboWrapper [Vldb07]
Data Integration
• Unsupervised
• Automatic
• Scalable
• Uncertain Data
• No labels available
• No corpus available
WebTables [Vldb08], Cimple [Vldb07], MetaQuerier [Cidr05], PayGo [Cidr07]

Data Extraction
AAPL, GOOG, MSFT, INTC, …
128.09, 439.54, 34.89, 112.37, …
127.81, 439.25, 32.13, 111.01, …
132.43, 443.82, 33.67, 114.32, …
0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …
Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, …
…

Data Extraction
HTML fragments taken from two pages belonging to the same website:

Extraction rule                 Page 1       Page 2
/html/body/table/tr[1]/td[2]    1,132,228    1,735,857
/html/body/table/tr[2]/td[2]    $20.66       $414.58
/html/body/table/tr[3]/td[2]    $11.70       $247.30
/html/body/table/tr[4]/td[2]    $20.72       $414.06
/html/body/table/tr[5]/td[2]    $0.02        99,494,200
/html/body/table/tr[6]/td[2]    4,732,600    null

Extraction error!
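The extraction error in the table above can be reproduced with a small sketch. It assumes, purely for illustration, that the second page has one row fewer than the first, so that purely positional rules pick up a value of a different kind in row 5 and nothing in row 6. The page layout, the "field" label, and the helper names (build_page, apply_rule, extract_columns) are hypothetical and not part of Flint; only the cell values come from the slide.

```python
import xml.etree.ElementTree as ET

# Second-column values per page, copied from the slide. The row label
# "field" and the page layout are hypothetical stand-ins.
PAGE_1_VALUES = ["1,132,228", "$20.66", "$11.70", "$20.72", "$0.02", "4,732,600"]
PAGE_2_VALUES = ["1,735,857", "$414.58", "$247.30", "$414.06", "99,494,200"]


def build_page(values):
    """Build a minimal well-formed page with one table row per value."""
    rows = "".join(f"<tr><td>field</td><td>{v}</td></tr>" for v in values)
    return f"<html><body><table>{rows}</table></body></html>"


def apply_rule(page, row):
    """Apply the positional rule /html/body/table/tr[row]/td[2] to one page."""
    root = ET.fromstring(page)  # root is the <html> element
    cell = root.find(f"body/table/tr[{row}]/td[2]")
    return cell.text if cell is not None else None


def extract_columns(pages, n_rows):
    """Map each positional rule to the list of values it extracts per page."""
    return {
        f"/html/body/table/tr[{row}]/td[2]": [apply_rule(p, row) for p in pages]
        for row in range(1, n_rows + 1)
    }


PAGES = [build_page(PAGE_1_VALUES), build_page(PAGE_2_VALUES)]
COLUMNS = extract_columns(PAGES, n_rows=6)

if __name__ == "__main__":
    for rule, values in COLUMNS.items():
        print(rule, values)
```

Under this assumption the rule for row 5 returns "$0.02" for the first page and "99,494,200" for the second, and the rule for row 6 returns "4,732,600" and None: the same mismatch the slide flags as an extraction error.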
Data Integration
[Figure, animated over several slides: attributes extracted from different sources are shown as value vectors, for example (max) 10 33 16, (min) 4 25 10, (stock) AA GO MS, and (price) 6 26 12. Attributes with similar values are grouped into mappings. Thresholds shown: t=0.5 in the early frames, t=0.7 and t=0.5 in the later frames. Matching scores shown: 1.0 and 0.6.]

Wrapper Refinement
[Figure: an attribute with values 10 null 10, labeled (min/max), matches the (max) and (min) mappings only weakly (score 0.3, marked "weak"); scores of 0.0 are shown for (price) and (stock).]

Wrapper Refinement
matching value, nearby template tokens
//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…

Wrapper Refinement
[Figure: the weak (min/max) attribute 10 null 10 is replaced by two attributes, (max) 10 33 16 and (min) 4 25 10, each with score 1.0, extracted by the rules below.]
//td[contains(text(),'Max')]/../td[2]
//td[contains(text(),'Min')]/../td[2]

Wrapper Refinement
[Figure: final frame, with the refined (max) and (min) attributes added to their mappings.]

Experimental Results (100 websites for each domain)

Soccer domain (45,714 pages)
Attribute       |m|
Name            90
Birth Date      61
Height          54
Nationality     48
Club            43
Position        43
Weight          34
League          14

Videogame domain (49,262 pages)
Attribute       |m|
Title           86
Publisher       59
Developer       45
Genre           28
ESRB rating     40
Release Date    9
Platform        9
# Players       6

Finance domain (57,623 pages)
Attribute       |m|
Stock Symbol    84
Price Change    73
% Change        73
Volume          52
Day Low         43
Day High        41
Last Price      29
Open Price      24

Demo
• Found Websites
• Integrated Data

the end!
http://flint.dia.uniroma3.it

License
• This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.