Interactive Query Formulation over Web Service-Accessed Sources Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou ACM SIGMOD, June 2006 Large-Scale Data Integration Systems End User Developer Web Domain Application Domain Application Integration Engineer Integration Domain Source Domain • CNET’s Top Combinations • CNET’s Search Desktops • PCWorld’s Product Finder Web Forms & Reports Mediator Web Service • CNET • PCWorld Application Compatible Combinations of Computers, Routers and Printers Integrated Schema Web Web Service Service … Source Owner Source Schema Data Source Data Source Source Schema Computer Portals … • Dell Computers by CPU • Cisco Routers by Rate • HP Printers by Speed • Dell Computers • Cisco Routers • HP Printers 2 Large-Scale Data Integration Systems End User Developer Web Domain Web Forms & Reports Application Domain Application Integration Engineer Integration Domain Source Domain Mediator What queries can the mediator answer for me? CLIDE Application Integrated Schema Web Service Web Web Service Service Data Source Source Data Schema Source … Source Owner Source Schema … 3 Running Example Parameterized Views Dell Cisco Schema Schema Computers(cid, cpu, ram, price) NetCards(cid, rate, standard, interface) Views V1 ComByCpu(cpu) (Computer)* SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu=cpu Routers(rate, standard, price, type) Views Computers V3 RouWired() (Router)* for a given cpu SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wired' Computers & NetCards (Router)* for a given V4 cpuRouWireless() & rate V2 ComNetByCpuRate(cpu, rate) (Computer, NetCard)* SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, Network Net1 WHERE Com1.cid=Net1.cid AND Com1.cpu=cpu AND Net1.rate=rate Wired Routers Wireless Routers SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wireless' Conjunctive Queries CQ • • Equality & Comparison Conditions Parameters 4 Running Example Integrated Schema Developer Application Mediator V1 V2 Dell Integrated Schema V3 V4 • Integrated schema puts together the Dell and Cisco schemas Attribute Associations • (Computers.cid, NetCards.cid) • (NetCards.rate, Routers.rate) • (NetCards.standard, Routers.standard) Cisco 5 Sophisticated Mediators Make Feasible Queries Hard to Predict Feasible Queries FQ • Equivalent CQ query rewritings using the views • Might involve more than one views • Order might matter Query: Infeasible Feasible Get all Computers ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers Computers.* NetCards.* Routers.* A123 P4 512 400 A123 10 .11b USB 10 .11b 50 Wireless B123 P4 1024 550 B123 54 .11g USB 54 .11g 120 Wireless B Routers.* 10 .11b 50 Wireless 54 .11g 120 Wireless E Computers.* NetCards.* 512 400 A123 10 .11b USB B123 P4 1024 550 B123 54 .11g USB MediatorA123 P4 D Mediator A RouWireless() V4 C ComNetByCpuRate(‘P4’, ‘10’) V1 ComNetByCpuRate(‘P4’, ‘54’) V2 6 Problem 1. Large number of sources 2. Large number of views (web-services) 3. Mediator capabilities Developer formulates an application query Is an application query feasible? If not, how do I know which ones are feasible? Previous options: – – The developer had to browse the view definitions and somehow formulate a feasible query Or formulate queries until a feasible one is found (trial-and-error) No system-provided guidance 7 The CLIDE Solution CLIDE Developer Application Mediator V1 V2 Dell Integrated Schema V3 A query formulation interface, which interactively guides the developer toward feasible queries by employing a coloring scheme V4 Cisco 8 QBE-Like Interfaces Microsoft SQL-Server 9 CLIDE Interface Feasibility Flag Selection Boxes • Table, selection, projection and join actions • Feasibility Flag • Color-based suggestions 10 Example Interaction Snapshot 1 Yellow required action – All feasible queries require this action White optional action – Feasible queries can be formulated w/ or w/o these actions 11 Example Interaction Snapshot 3 Blue required choice of action – At least one of them is required to be taken in order to reach a feasible query Join Lines: • Only yellow and blue are displayed • Must appear in Attribute Associations 13 CLIDE Properties • Completeness of Suggestions – Every feasible query can be formulated by performing yellow and blue actions at every step • Minimality of Suggestions – At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness • Rapid Convergence By Following Suggestions – The shortest sequence of actions from a query to any feasible query consists of suggested actions 14 Example Interaction Snapshot 4 • * any other constant • Red prohibited action – Does not appear in any feasible query – Lead to “Dead End” state 15 Example Interaction Snapshot 5 ram price rate interface price 512 400 10 USB 50 1024 550 54 USB 120 F Mediator A Routers.* 10 .11b 512 Wireless 54 .11g 1024 Wireless RouWireless() V4 B D Computers.* NetCards.* A123 P4 512 400 A123 10 .11b 50 B123 P4 1024 550 B123 54 .11g 120 ComNetByCpuRate(‘P4’, rate) V2 E 16 CLIDE Properties • Completeness of Suggestions – Every feasible query can be formulated by performing yellow and blue actions at every step • Minimality of Suggestions – At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness • Rapid Convergence By Following Suggestions – The shortest sequence of actions from a query to any feasible query consists of suggested actions 17 Interaction Graph s Com1 Com1.ram Com1.price Selection Action Table Action Join Action Com1.cpu=‘P4’ Net1 Com1.cid=Net1.cid Rou1 … … … … … … … … … … • Nodes are queries – One for each qCQ • Edges are actions – Table, selection, projection and join actions • Green nodes are feasible queries • Infinitely big structure – All CQ queries – All possible combinations of actions formulating them 18 Interaction Graph: Colorable Actions Com1.cid Com1.cpu … Current Node … Com1.cid=* … Com1.cpu=* … Com1.ram=* • Colorable actions AC label outgoing edges of the current node … Com1.price=* … Net1 … Rou1 … Com2 … 19 Interaction Graph: Colors • Yellow action – Every path from current node n to a feasible node contains • Blue action – At least one path from current node n to a feasible node nF contains – There is not path from n to nF that contains a feasible node Com1.cid Com1.cpu • Red action … Current Node … – No path to a feasible node contains Com1.cid=* … … Com1.cpu=* Net1 Current Node Com1.cid=Net1.cid Net1.rate=’54Mbps’ … … … … Com1.ram=* … … … Net1 Com1.cpu=* Net1.rate=’54Mbps’ Rou1 Com1.cid=Net1.cid Net1.rate=Rou1.rate … … … … … … Rou1 Com1.cid=Net1.cid … … Com1.price=* Com2 … Com2 Com2.cid=Net1.cid Com2.cpu=‘P4’ Net1.rate=‘54Mbps’ … … … … … 20 CLIDE Properties • Rapid Convergence By Following Suggestions – The shortest path from the current node to any feasible node consists of suggested actions 21 CLIDE Architecture Actions Front-End Current Query Colored Actions + Feasibility Flag User Back-End Color Algorithm Seed Queries SQ Parameters Algorithm Closest Feasible Queries FQC Closest Feasible Queries Algorithm Aliases Collapse Rule Maximally-Contained Rewriter Schemas Views Minimal Feasible Extension Queries Attribute Associations • Back-End invoked every time the user performs an action – i.e., the user arrives at a new node in the interactions graph 22 Color Determined By a Finite Set of Feasible Queries Challenge: Infinitely Many Feasible Queries ? … … n … … … … Closest Feasible Queries FQC Solution: • Finite Set of Feasible Queries • Closest Feasible Queries FQC • FQC is sufficient to color actions Challenge: How far can the Closest Feasible Queries FQC be? Solution: Based on Maximally Contained Queries FQMC 23 Maximally Contained Queries FQMC Maximally Contained Query Query: Q1 Get all Computers Query: Q2 Get all Computers with a given cpu Not Maximally Maximally Contained Contained Query Query: Q4 Q3 Get all Computers with a given ram cpu & ram • Assuming fixed SELECT clause (projection list) • Covered extensively in literature – MiniCon, Bucket, InverseRules Algorithms • FQMC is finite 24 Closest Feasible Queries FQC Algorithm Solution: Based on Maximally Contained Queries FQMC Maximally Contained Queries FQMC … … n … … … … Closest Feasible Queries FQC • Compute maximally contained queries FQMC • Radius pL is the longest path to a maximally contained query • Theorem: All FQC queries are reachable via a path of length p pL Challenge: The set of nodes within pL can be too many to explore 25 Closest Feasible Queries FQC Algorithm Solution: Find the Closest Feasible Queries Directly Maximally Contained Queries FQMC … … n … … … … Closest Feasible Queries FQC More feasible nodes • Theorem: All queries in FQMC are in FQC • But not all queries in FQC are in FQMC 26 Closest Feasible Queries FQC Algorithm Solution: Collapse Aliases Maximally Contained Queries FQMC … … n … … … … Closest Feasible Queries FQC • Collapse Aliases to compute FQC \ FQMC • Check satisfiability 27 Color Algorithm Yellow and Blue • An action is colored based on which closest feasible queries it appear in • Yellow, if appears in all queries in FQC • Blue, if appears in at least one (but not all) query in FQC White and Red • Attach Maximum Projection Lists to Closest Feasible Queries – Projections that can be added to a feasible query, without compromising feasibility • Projection is white if in the maximum projection list • Color selections based on projections 28 CLIDE Implementation Issues • Maximally contained queries need to be minimized (affect CLIDE’s rapid convergence and minimality) Implemented the minimization module from scratch MiniCon computes mappings between the query and the views Exposed these mappings to guide minimization • Optimizations to bring the response time below 3 sec Amortize the computation across multiple interactions steps by observing that query is changed minimally at every step Instead of calling MiniCon from scratch at every step, we incrementally compute the closest feasible queries 29 CLIDE Implementation & Optimizations Maximally-Contained Rewriter Front-End Current Query Minimal Feasible ExtensionColored QueriesActions FQME + Feasibility Flag + Maximum Projection Lists Back-End Redundant Actions Removal Color Algorithm Maximally-Contained Feasible Extension Queries + Maximum Projection Lists Seed Queries SQ Redundant Queries Removal Parameters Algorithm Feasible Extension Queries Closest Feasible Queries FQC + Maximum Projection Lists Views Expansion Closest Feasible Queries Algorithm Aliases Collapse Rule Maximally-Contained Feasible Queries over Views Minimal Feasible + Containment Mappings Extension Queries Maximally-Contained Rewriter MiniCon Containment Mappings Logging Attribute Schemas Views Associations • Views expansion introduce redundancy – Affects CLIDE’s rapid convergence and minimality • Efficient containment test crucial to redundancy removal 30 CLIDE Performance Chains of Stars Time (sec) 7 6 112 168 224 280 5 4 Views Views Views Views 3 2 (2 ,1 ) (2 ,2 ) (4 ,3 ) (4 ,4 ) (6 ,5 ) (6 ,6 ) (8 ,7 ) (8 ,8 (1 ) 0, (1 9) 0, 1 (1 0) 2, 1 (1 1) 2, 1 (1 2) 4, 1 (1 3) 4, 14 ) 1 0 (# of Joins, # of Selections) in Current Query C1 B B11 … … …… A B2Bi1 Bi BiM BK … … … … • Schema Queries Views C C1C i1 i A-span = 7 CiM B-span = 3 C1C L Selections = 4,6,8,10 31 Thank you Play with our demo: http://www.clide.info 32
© Copyright 2025 Paperzz