Optimization of Analytic Data Flows for Next Generation Business Intelligence Applications Umeshwar Dayal, Kevin Wilkinson, Alkis Simitsis, Malu Castellanos, Lupita Paz HP Labs Palo Alto, CA, USA [email protected] 1 © Copyright © Copyright 2010 Hewlett-Packard 2010 Hewlett-Packard Development Development Company, Company, L.P. L.P. OUTLINE – Next Generation Business Intelligence – Analytic Data Flows – Challenges in Analytic Data Flow Design and Optimization – QoX Framework – QoX-driven Optimization of Data Integration Flows – Micro-benchmarks – Open Problems – Summary © Copyright 20101Hewlett-Packard Development Company, L.P. TRADITIONAL BUSINESS INTELLIGENCE ARCHITECTURE BI: Technologies, tools, and practices for collecting, integrating, analyzing, and presenting large volumes of information to enable better decision making • Focus on enterprise OLTP data sources • Batch ETL (Extract-Transform-Load) pipeline • Front-end analytics for strategic decision making © Copyright 20101Hewlett-Packard Development Company, L.P. EMERGENCE OF AN INFLECTION POINT… • Sensors, real time events, unstructured data and web contents have the promise of transforming the way we manage our customers, resources, environments, health,… …The shift to the web makes available unprecedented quantities of structured and unstructured databases about human activities… …and sensing technologies are making available great quantities of data…that provide unprecedented scope and resolution… …the availability of rich streams of data, …create(s) an inflection point .. for generating insights and guiding decision making… …To date, we have only scratched the surface of the potential for learning from these large-scale data sets.. - “From Data to Knowledge to Action – A Global Enabler for the 21st Century”, Computing Community Consortium, August 2010 Analytics, and specifically technologies that enable real-time analysis of large data volumes, social media analytics, and mobile BI will continue to dominate the enterprise agenda - Computer World, Jan 2011 4 © Copyright 20101Hewlett-Packard Development Company, L.P. Live BI: Decisions at any time scale Executives Model Builders Enterprise Operational systems Operations staff Business Control Systems Operational Data Stores Enterprise Data Warehouse Decreasing time scales for response (decision latencies) Increasing need for automation 5 5 © Copyright 20101Hewlett-Packard Development Company, L.P. Live Business Intelligence Traditional Business Intelligence ANALYTICS DATA FLOWS Front-end Analytics • Long development cycles • Little Analysis querying, reporting, data mining, modeling, analytics, visualization operations support for optimization or automation • Possibly many objectives: Correctness, Fault-tolerance, Scalability, Maintainability, Cost • Many Back-end Analytics End-to-end Analytics Data Flow Challenges Loading Streaming Transformation Cleaning, de-noising, entity identification, etc. Information extraction, access types of data sources: Structured/ Unstructured, Stored/ Streaming, Internal/ External • Complex flows, involving ETL, stream processing, queries, analytics, visualization operations • Possibly engines © Copyright 20101Hewlett-Packard Development Company, L.P. executed on multiple NEXT-GENERATION: LIVE BI Execution-time Development & Deployment-time Interaction and Delivery Data Flow Composer and Editor Federation and Orchestration Optimizer Execution Engines Massively parallel, RDBMS Hadoop ETL large memoryStream cluster engine engine Data Stores RDB © Copyright 20101Hewlett-Packard Development Company, L.P. Unstructured File Data Cloud Storage other Analytics Engine Streaming data sources (sensor data, event logs, social media, web data…) EXAMPLE FLOW 1: STRUCTURED DATA -TPC-H SOURCES Store 1 Lineitem Store n Lineitem Union Surrogate key generate Lineitem Orders Compute sales = quantity * unitCost Filter on date range Join on orderKey (get relevant sales for date range) Filter on cmpnKey (get products, date range) Join on prodKey (get relevant sales for products) CmpnDim 9 © Copyright 2010 Hewlett-Packard Development Company, L.P. Rollup sales on prodKey, cmpnKey RptSalesByProdCmpn Sales of products in response to marketing campaign EXAMPLE FLOW 2: UNSTRUCTURED DATA SENTIMENT ANALYSIS 10 © Copyright 2010 Hewlett-Packard Development Company, L.P. EXAMPLE FLOW 3 (STRUCTURED & UNSTRUCTURED DATA) Tweets Filter on any mention of enterprise or its products Relevant tweets Sentiment analysis for products by tweet RptSAbyTwtProd CmpnDim Filter on cmpnKey (get products, date range) Surrogate key generate Filter on date range (get tweets during campaign) Join on prodKey (get tweets for products) Rollup sentimentScore on prodKey, cpmnKey, attribute (aggregate sentiment scores) RptSalesByProdCmpn 11 © Copyright 2010 Hewlett-Packard Development Company, L.P. Join on prodKey, cmpnKey RptSAbyProdCmpn Correlate Sales and Sentiments for products in response to marketing campaign EXAMPLE FLOW 4: EVENT AND STREAM ANALYTICS Sensor data, external event Streams Stream pre-processing Detect Patterns/Events Events Event Stream Event Correlation Complex Events Diagnosis & Root Cause Analysis Diagnoses Prediction Contextual data (structured, unsructured) Event History Event Discovery and Creation 12 © Copyright 20101Hewlett-Packard Development Company, L.P. Predicted Events Recommended Action Assessment Actions Event Analysis QoX Objectives Performance Fault-tolerance Maintainability Freshness Scalability Availability Cost FROM DESIGN TO EXECUTION [Dayal, et al., EDBT’09] [Wilkinson, Simitsis ER’10] Flow design editor Logical Data Flow Graph xLM Optimized Physical Data Flow Graph Optimizer Sources D oc Info = (D oc id, Time Stamp , Source type , Source U R L , Author Profile , GeoTag , Score) Data Sources Author Profile = (id, name , email, location , gender , …) document batch file Pre processing ( Adaptor ) Hadoop (Doc Info, Text body) vertical slice profile info Scripts (Doc Info, Text body) Splitter on D oc Id, hash partition , 3x parallel (Doc Info, Text body) R emove D ocs from Blacklisted Authors (Doc Info, Text body) Sentence D etector [Doc Info, Sentence id , Sentence Loc, Sentence] Tokenizer [Doc Info, Sentence id, Sentence Loc , Sentence , Token id, Token Loc, Token], order < Doc Id, Sentence Id, Token Loc> POS Tagging [Doc Info, Sentence id , Sentence Loc, Sentence, Token id, Token Loc, Token], order <Doc Id, Sentence Id, Token Loc> profile info Enginespecific codegen tool-specific DBMS [Doc Info, Sentence id , Sentence Loc, Sentence, Token2 id , Token Loc, Token, POS Tag, Negation Flag, Attribute Flag, Sentiment Flag] ETL Tool R elate Sentiment to Attribute attribute sentiment value execution instructions 13 Merger on D oc id, 3 x parallel create recovery point recovery point xLM Scripts © Copyright 20101Hewlett-Packard Development Company, L.P. Sentiment W ord D etection [Doc Info, Sentence id , Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag, Negation Flag, Attribute Flag] sentiment word dict. [Doc Info, Sentence id, Sentence Loc, Token2 id, Token Loc, Attribute, Sentiment Value] Attribute Extraction attribute dict . Join on doc Id profile info [Doc Info, Sentence id, Sentence Loc , Token2 id, Token Loc, Attribute, Sentiment Value] [Doc Info, Sentence id, Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag, Negation flag] N egation D etection negation dict . [Doc Info, Sentence id , Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag], order <Doc Id, Sentence Id, Token Loc> C ompound Phrases phrase dict. Target Aggregate by D oc document sentiment value (Doc Info, Attribute, Sentiment Value ]) DASHBOARD (Doc Info, [Sentence id , Sentence Loc, Sentence, Token id, Token Loc, Token, POS Tag]) Quality objectives – functional correctness • the optimization of the initial flow does not affect the flow semantics or correctness – performance • measures the flow completion in terms of time and resources used – freshness • the latency between the occurrence of an event at a source system and the reflection of that event at the target system – reliability • 14 14 the ability of flow to complete successfully within the specified time window despite failures © Copyright 20101Hewlett-Packard Development Company, L.P. QUALITY OBJECTIVES – Performance: usually throughput, total execution time – Reliability: probability that a flow will complete correctly (~ mean time between failures) – Recoverability: ability to restore the flow after a failure (~mean time to recover) – Scalability: ability to process higher volumes of data or higher data rates – Freshness: latency between source data and data in the DW – Consistency: accuracy (correctness) of the data in the DW – Availability: probability that the data is available for queries – Maintainability: ability to operate at specified service levels – Flexibility: ability to respond to evolving requirements – Affordability: cost of integration project – Traceability: ability to track lineage (provenance) of data – Auditability: ability to satisfy legal/regulatory compliance requirements © Copyright 20101Hewlett-Packard Development Company, L.P. TRADEOFFS AMONG QoX OBJECTIVES Fastest Recoverable Perf REC REC Rec Rel Cost w/o RPs + - - ∅ w/ 2 RPs - ++ - + Rec Rel Cost Performance STATE=AZ Perf REC REC STATE=CA REC REC w/o RPs ++ - -- ++ STATE=NV REC REC w/ 6 RPs + ++ -- ++ Rec Rel Cost Reliable Perf w/o RPs © Copyright 20101Hewlett-Packard Development Company, L.P. + n/a ++ +++ QoX Optimization [Simitsis, et al., ICDE’10 Simitsis, et al., SIGMOD’09] ] – Logical Optimization: – Physical optimization Flow restructuring Parallelization, partitioning Collapsing chains Reordering operators Inserting recovery (persistence) points Replication Optimization for performance and faulttolerance Given a data flow F, objective function OFw n: input recordset size w: execution time window k: number of faults to be tolerated © Copyright 20101Hewlett-Packard Development Company, L.P. Implementations of operators Materialize vs. Virtualize Choice of execution engines Data shipping vs.Function shipping STATE SPACE SEARCH initial state min-cost state 18 © Copyright 20101Hewlett-Packard Development Company, L.P. OPTIMIZATION TACTICS: TRANSITIONS IN THE STATE SPACE GRAPH factorize/ distribute partition 19 © Copyright 20101Hewlett-Packard Development Company, L.P. swap replicate addRP OPTIMIZATION FOR MULTI-ENGINE EXECUTION – Characterize the execution of operations on different execution engines • Implement micro-benchmarks for individual operators – Learn cost functions to plug into optimization objective • Interpolate • Compose • Account from micro-benchmarks costs for individual operations into cost of flow for resource contention – Extend transitions considered by the optimizer to include implementation and engine choices – Extend heuristics for state space search © Copyright 20101Hewlett-Packard Development Company, L.P. Example: Sentiment Analysis Flow – Logical Model Data Sources Doc Info = (Doc id, Time Stamp, Source type, Source URL, Author Profile, GeoTag, Score) Author Profile = (id, name, email, location, gender, …) Preprocessing (Adaptor) (Doc Info, Text body) phrase dict. Sentence Detector [Doc Info, Sentence id, Sentence Loc, Sentence] Tokenizer document batch file Other Operators (Doc Info, Attribute, Sentiment Value]) Aggregate by Doc [Doc Info, Sentence id, Sentence Loc, Token2 id, Token Loc, Attribute, Sentiment Value] document sentiment value 21 © Copyright 20101Hewlett-Packard Development Company, L.P. Remove Docs from Blacklisted Authors [Doc Info, Sentence id, Sentence Loc, Sentence, Token id, Token Loc, Token], order <Doc Id, Sentence Id, Token Loc> [Doc Info, Sentence id, Sentence Loc, Token2 id, Token Loc, Attribute, Sentiment Value] attribute sentiment value POS Tagging Relate Sentiment to Attribute (Doc Info, [Sentence id, Sentence Loc, Sentence, Token id, Token Loc, Token, POS Tag]) [Doc Info, Sentence id, Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag, Negation Flag, Attribute Flag, Sentiment Flag] Compound Phrases Sentiment Word Detection sentiment word dict. [Doc Info, Sentence id, Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag], order <Doc Id, Sentence Id, Token Loc> [Doc Info, Sentence id, Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag, Negation Flag, Attribute Flag] negation dict. Negation Detection [Doc Info, Sentence id, Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag, Negation flag] Attribute Extraction attribute dict. Example: Sentiment Analysis Flow – Physical Model Sources Doc Info = (Doc id, Time Stamp, Source type, Source URL, Author Profile, GeoTag, Score) Data Sources Author Profile = (id, name, email, location, gender, …) document batch file Preprocessing (Adaptor) Hadoop (Doc Info, Text body) vertical slice profile info Scripts (Doc Info, Text body) Splitter on Doc Id, hash partition, 3x parallel (Doc Info, Text body) Remove Docs from Blacklisted Authors (Doc Info, Text body) Sentence Detector [Doc Info, Sentence id, Sentence Loc, Sentence] Tokenizer [Doc Info, Sentence id, Sentence Loc, Sentence, Token id, Token Loc, Token], order <Doc Id, Sentence Id, Token Loc> POS Tagging [Doc Info, Sentence id, Sentence Loc, Sentence, Token id, Token Loc, Token], order <Doc Id, Sentence Id, Token Loc> profile info Merger on Doc id, 3x parallel create recovery point recovery point DBMS [Doc Info, Sentence id, Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag, Negation Flag, Attribute Flag, Sentiment Flag] ETL Tool Relate Sentiment to Attribute attribute sentiment value 22 Scripts Sentiment Word Detection [Doc Info, Sentence id, Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag, Negation Flag, Attribute Flag] sentiment word dict. [Doc Info, Sentence id, Sentence Loc, Token2 id, Token Loc, Attribute, Sentiment Value] Attribute Extraction attribute dict. Join on doc Id profile info © Copyright 20101Hewlett-Packard Development Company, L.P. [Doc Info, Sentence id, Sentence Loc, Token2 id, Token Loc, Attribute, Sentiment Value] [Doc Info, Sentence id, Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag, Negation flag] Negation Detection negation dict. [Doc Info, Sentence id, Sentence Loc, Sentence, Token2 id, Token Loc, Token, POS Tag], order <Doc Id, Sentence Id, Token Loc> Compound Phrases phrase dict. Target Aggregate by Doc document sentiment value (Doc Info, Attribute, Sentiment Value]) DASHBOARD (Doc Info, [Sentence id, Sentence Loc, Sentence, Token id, Token Loc, Token, POS Tag]) MICRO-BENCHMARKS FOR CONVENTIONAL (ETL, RELATIONAL) OPERATORS – Experiments on sort, join, surrogate-key, etc. – Example: Sort • os-sort −home-grown distributed sort implementation • hd-sort −Hadoop-based sort (used Pig Latin) • pdb-sort −used a commercial, parallel DBMS – Dataset • TPC-H lineitem © Copyright 20101Hewlett-Packard Development Company, L.P. SF Size (GBs) Rows (x106) 1 0.76 6.1 10 7.3 59.9 100 75 600 Sort (Comparison) – Hadoop vs. Custom script vs. pDB (with and without loading) Pig script (hd-sort): C and shell scripts (os-sort): sf$sf = load 'lineitem.tbl.sf$sf' split file: “filesize/#nodes” (leftovers go using PigStorage('|') as (f1,f2,...,f17); to chunk #1) ord1_$nd = order sf$sf by f1 parallel $nd; transfer data ... sort chunks on nodes ord10_$nd = order ord9_$nd by f10 parallel $nd; transfer back and merge (hierarchical) store ord$op_$nd into 'res_sf$sf_Ord$op-$nd.dat‘ (to be fair with hd-sort and etl-sort, avoid using PigStorage('|'); pipelining) 24 © Copyright 20101Hewlett-Packard Development Company, L.P. (TPC-H data SF:1,10,100) Sort (Tuning) – Hadoop – e.g., #reducers – Shell scripts – e.g., analyze sort internal phases 25 © Copyright 20101Hewlett-Packard Development Company, L.P. MICRO-BENCHMARKS FOR UNCONVENTIONAL OPERATORS – Experiments on Sentiment Analysis operators •individual •the operators whole sentiment analysis flow treated as a single operator – Implementations • Single node • Distributed, scripts • Distributed, Hadoop – Datasets •Small•Large to medium-sized document corpus: up to 200,000 documents corpus: ~30 million documents © Copyright 20101Hewlett-Packard Development Company, L.P. MICRO-BENCHMARKS FOR UNCONVENTIONAL OPERATORS (SENTIMENT ANALYSIS) Single node, small to mediumsized corpus Use linear regression to learn interpolation function 27 © Copyright 2010 Hewlett-Packard Development Company, L.P. Single node, large corpus; effect of limited memory 12-node Hadoop cluster SENTIMENT ANALYSIS (COMPARISON) – Single node vs. Distributed with Hadoop vs. Distributed without Hadoop © Copyright 20101Hewlett-Packard Development Company, L.P. OPEN PROBLEMS: FUTURE WORK – Micro-benchmarks of other operators: ETL, relational, content analytics, front-end analytics, stream processing – Cost functions: interpolation, composition – Benchmarks and objective functions for other QoX measures – Optimization strategies for physical optimization • Assign segments of the flow to execution engines • Where, when, and what to materialize © Copyright 20101Hewlett-Packard Development Company, L.P. SUMMARY – Design of data flows for BI applications is a very expensive, largely manual activity: little support for automation, optimization – Situation gets worse with next gen BI: fuse back-end and front-end operations into end-to-end analytic data flows – Many different types of sources, more complex operators, multiple execution engines – Layered methodology: QoX-driven optimization allows trading off different quality objectives – Need benchmarks and micro-benchmarks to drive the design and optimization – Many challenges remain © Copyright 20101Hewlett-Packard Development Company, L.P. Thank You 31 © Copyright 2010 Hewlett-Packard Development Company, L.P.
© Copyright 2026 Paperzz