Live Business Intelligence: From Data to Insights

Optimization of Analytic Data Flows for Next
Generation Business Intelligence Applications
Umeshwar Dayal, Kevin Wilkinson, Alkis Simitsis, Malu Castellanos, Lupita Paz
HP Labs
Palo Alto, CA, USA
[email protected]
1
© Copyright
© Copyright
2010 Hewlett-Packard
2010 Hewlett-Packard
Development
Development
Company,
Company,
L.P. L.P.
OUTLINE
– Next Generation Business Intelligence
– Analytic Data Flows
– Challenges in Analytic Data Flow Design and Optimization
– QoX Framework
– QoX-driven Optimization of Data Integration Flows
– Micro-benchmarks
– Open Problems
– Summary
© Copyright 20101Hewlett-Packard Development Company, L.P.
TRADITIONAL BUSINESS
INTELLIGENCE ARCHITECTURE
BI: Technologies, tools, and practices for collecting, integrating, analyzing, and
presenting large volumes of information to enable better decision making
• Focus on enterprise OLTP data sources
• Batch ETL (Extract-Transform-Load) pipeline
• Front-end analytics for strategic decision making
© Copyright 20101Hewlett-Packard Development Company, L.P.
EMERGENCE OF AN INFLECTION
POINT…
• Sensors, real time
events, unstructured
data and web
contents have the
promise of
transforming the way
we manage our
customers, resources,
environments,
health,…
…The shift to the web makes available unprecedented
quantities of structured and unstructured databases about
human activities…
…and sensing technologies are making available great
quantities of data…that provide unprecedented scope and
resolution…
…the availability of rich streams of data, …create(s) an
inflection point .. for generating insights and guiding
decision making…
…To date, we have only scratched the surface of the
potential for learning from these large-scale data sets..
- “From Data to Knowledge to Action – A Global Enabler for the 21st
Century”,
Computing Community Consortium, August 2010
Analytics, and specifically technologies that enable real-time analysis of
large data volumes, social media analytics, and mobile BI will continue to
dominate the enterprise agenda - Computer World, Jan 2011 4
© Copyright 20101Hewlett-Packard Development Company, L.P.
Live BI: Decisions at any time scale
Executives
Model Builders
Enterprise
Operational
systems
Operations
staff
Business
Control
Systems
Operational
Data Stores
Enterprise
Data Warehouse
Decreasing time scales for response (decision latencies)
Increasing need for automation
5
5
© Copyright 20101Hewlett-Packard Development Company, L.P.
Live
Business
Intelligence
Traditional
Business
Intelligence
ANALYTICS DATA FLOWS
Front-end Analytics
• Long
development cycles
• Little
Analysis
querying, reporting, data mining,
modeling, analytics, visualization
operations
support for optimization or
automation
• Possibly
many objectives:
Correctness, Fault-tolerance,
Scalability, Maintainability, Cost
• Many
Back-end Analytics
End-to-end Analytics Data Flow
Challenges
Loading
Streaming
Transformation
Cleaning, de-noising, entity identification,
etc.
Information extraction, access
types of data sources:
Structured/ Unstructured,
Stored/ Streaming,
Internal/ External
• Complex
flows, involving ETL,
stream processing, queries,
analytics, visualization operations
• Possibly
engines
© Copyright 20101Hewlett-Packard Development Company, L.P.
executed on multiple
NEXT-GENERATION: LIVE BI
Execution-time
Development &
Deployment-time
Interaction and Delivery
Data Flow Composer and Editor
Federation and Orchestration
Optimizer
Execution Engines
Massively parallel,
RDBMS
Hadoop
ETL large memoryStream
cluster
engine
engine
Data Stores
RDB
© Copyright 20101Hewlett-Packard Development Company, L.P.
Unstructured
File Data
Cloud
Storage
other
Analytics
Engine
Streaming data sources
(sensor data, event logs,
social media, web
data…)
EXAMPLE FLOW 1: STRUCTURED DATA -TPC-H SOURCES
Store 1
Lineitem
Store n
Lineitem
Union
Surrogate key generate
Lineitem
Orders
Compute sales = quantity * unitCost
Filter on date range
Join on orderKey
(get relevant sales for date range)
Filter on cmpnKey
(get products, date range)
Join on prodKey
(get relevant sales for products)
CmpnDim
9
© Copyright 2010 Hewlett-Packard Development Company, L.P.
Rollup sales on prodKey, cmpnKey
RptSalesByProdCmpn
Sales of products
in response to
marketing
campaign
EXAMPLE FLOW 2: UNSTRUCTURED DATA
SENTIMENT ANALYSIS
10
© Copyright 2010 Hewlett-Packard Development Company, L.P.
EXAMPLE FLOW 3 (STRUCTURED &
UNSTRUCTURED DATA)
Tweets
Filter on any mention of enterprise or its products
Relevant tweets
Sentiment analysis for products by tweet
RptSAbyTwtProd
CmpnDim
Filter on cmpnKey
(get products, date range)
Surrogate key generate
Filter on date range (get tweets during campaign)
Join on prodKey (get tweets for products)
Rollup sentimentScore on prodKey, cpmnKey, attribute (aggregate
sentiment scores)
RptSalesByProdCmpn
11
© Copyright 2010 Hewlett-Packard Development Company, L.P.
Join on prodKey, cmpnKey
RptSAbyProdCmpn
Correlate
Sales and
Sentiments for
products in
response to
marketing
campaign
EXAMPLE FLOW 4: EVENT AND STREAM
ANALYTICS
Sensor
data,
external
event
Streams
Stream
pre-processing
Detect Patterns/Events
Events
Event
Stream
Event
Correlation
Complex
Events
Diagnosis &
Root Cause
Analysis
Diagnoses
Prediction
Contextual
data
(structured,
unsructured)
Event
History
Event Discovery
and Creation
12
© Copyright 20101Hewlett-Packard Development Company, L.P.
Predicted
Events
Recommended
Action
Assessment Actions
Event
Analysis
QoX
Objectives
Performance
Fault-tolerance
Maintainability
Freshness
Scalability
Availability
Cost
FROM DESIGN TO EXECUTION
[Dayal, et al., EDBT’09]
[Wilkinson, Simitsis ER’10]
Flow design
editor
Logical Data Flow Graph
xLM
Optimized Physical Data Flow Graph
Optimizer
Sources
D oc Info = (D oc id, Time Stamp ,
Source type , Source U R L ,
Author Profile , GeoTag , Score)
Data Sources
Author Profile = (id, name ,
email, location , gender , …)
document
batch file
Pre processing
( Adaptor )
Hadoop
(Doc Info,
Text body)
vertical slice
profile info
Scripts
(Doc Info,
Text body)
Splitter
on D oc Id,
hash partition ,
3x parallel
(Doc Info,
Text body)
R emove D ocs
from Blacklisted Authors
(Doc Info,
Text body)
Sentence
D etector
[Doc Info,
Sentence id ,
Sentence Loc,
Sentence]
Tokenizer
[Doc Info,
Sentence id,
Sentence Loc ,
Sentence ,
Token id,
Token Loc,
Token],
order < Doc Id,
Sentence Id,
Token Loc>
POS Tagging
[Doc Info,
Sentence id ,
Sentence Loc,
Sentence,
Token id,
Token Loc,
Token],
order <Doc Id,
Sentence Id,
Token Loc>
profile
info
Enginespecific codegen
tool-specific
DBMS
[Doc Info,
Sentence id ,
Sentence Loc,
Sentence,
Token2 id ,
Token Loc,
Token,
POS Tag,
Negation Flag,
Attribute Flag,
Sentiment Flag]
ETL Tool
R elate
Sentiment to
Attribute
attribute
sentiment value
execution
instructions
13
Merger
on D oc id,
3 x parallel
create
recovery
point
recovery
point
xLM
Scripts
© Copyright 20101Hewlett-Packard Development Company, L.P.
Sentiment
W ord
D etection
[Doc Info,
Sentence id ,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag,
Negation Flag,
Attribute Flag]
sentiment
word dict.
[Doc Info,
Sentence id,
Sentence Loc,
Token2 id,
Token Loc,
Attribute,
Sentiment Value]
Attribute
Extraction
attribute
dict .
Join on doc Id
profile
info
[Doc Info,
Sentence id,
Sentence Loc ,
Token2 id,
Token Loc,
Attribute,
Sentiment Value]
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag,
Negation flag]
N egation
D etection
negation
dict .
[Doc Info,
Sentence id ,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag],
order <Doc Id,
Sentence Id,
Token Loc>
C ompound
Phrases
phrase
dict.
Target
Aggregate by
D oc
document
sentiment value
(Doc Info,
Attribute,
Sentiment Value ])
DASHBOARD
(Doc Info,
[Sentence id ,
Sentence Loc,
Sentence,
Token id,
Token Loc,
Token,
POS Tag])
Quality objectives
– functional correctness
•
the optimization of the initial flow does not affect the flow semantics or
correctness
– performance
•
measures the flow completion in terms of time and resources used
– freshness
•
the latency between the occurrence of an event at a source system and the
reflection of that event at the target system
– reliability
•
14
14
the ability of flow to complete successfully within the specified time window
despite failures
© Copyright 20101Hewlett-Packard Development Company, L.P.
QUALITY OBJECTIVES
– Performance: usually throughput, total execution time
– Reliability: probability that a flow will complete correctly
(~ mean time between failures)
– Recoverability: ability to restore the flow after a failure (~mean time to
recover)
– Scalability: ability to process higher volumes of data or higher data rates
– Freshness: latency between source data and data in the DW
– Consistency: accuracy (correctness) of the data in the DW
– Availability: probability that the data is available for queries
– Maintainability: ability to operate at specified service levels
– Flexibility: ability to respond to evolving requirements
– Affordability: cost of integration project
– Traceability: ability to track lineage (provenance) of data
– Auditability: ability to satisfy legal/regulatory compliance requirements
© Copyright 20101Hewlett-Packard Development Company, L.P.
TRADEOFFS AMONG QoX OBJECTIVES
Fastest
Recoverable
Perf
REC
REC
Rec
Rel
Cost
w/o RPs
+
-
-
∅
w/ 2 RPs
-
++
-
+
Rec
Rel
Cost
Performance
STATE=AZ
Perf
REC
REC
STATE=CA
REC
REC
w/o RPs
++
-
--
++
STATE=NV
REC
REC
w/ 6 RPs
+
++
--
++
Rec
Rel
Cost
Reliable
Perf
w/o RPs
© Copyright 20101Hewlett-Packard Development Company, L.P.
+
n/a
++
+++
QoX Optimization
[Simitsis, et al., ICDE’10
Simitsis, et al., SIGMOD’09]
]
– Logical
Optimization:
– Physical optimization
Flow restructuring
Parallelization, partitioning
Collapsing chains
Reordering operators
Inserting recovery (persistence)
points
Replication
Optimization for performance and faulttolerance
Given a data flow F, objective function OFw
n: input recordset size
w: execution time window
k: number of faults to be tolerated
© Copyright 20101Hewlett-Packard Development Company, L.P.
Implementations of operators
Materialize vs. Virtualize
Choice of execution engines
Data shipping vs.Function
shipping
STATE SPACE SEARCH
initial state
min-cost state
18
© Copyright 20101Hewlett-Packard Development Company, L.P.
OPTIMIZATION TACTICS: TRANSITIONS
IN THE STATE SPACE GRAPH
factorize/
distribute
partition
19
© Copyright 20101Hewlett-Packard Development Company, L.P.
swap
replicate
addRP
OPTIMIZATION FOR MULTI-ENGINE
EXECUTION
– Characterize the execution of operations on different
execution engines
• Implement
micro-benchmarks for individual operators
– Learn cost functions to plug into optimization objective
• Interpolate
• Compose
• Account
from micro-benchmarks
costs for individual operations into cost of flow
for resource contention
– Extend transitions considered by the optimizer to include
implementation and engine choices
– Extend heuristics for state space search
© Copyright 20101Hewlett-Packard Development Company, L.P.
Example: Sentiment Analysis Flow –
Logical Model
Data Sources
Doc Info = (Doc id, Time Stamp,
Source type, Source URL,
Author Profile, GeoTag, Score)
Author Profile = (id, name,
email, location, gender, …)
Preprocessing
(Adaptor)
(Doc Info,
Text body)
phrase
dict.
Sentence
Detector
[Doc Info,
Sentence id,
Sentence Loc,
Sentence]
Tokenizer
document
batch file
Other
Operators
(Doc Info,
Attribute,
Sentiment Value])
Aggregate by
Doc
[Doc Info,
Sentence id,
Sentence Loc,
Token2 id,
Token Loc,
Attribute,
Sentiment Value]
document
sentiment value
21
© Copyright 20101Hewlett-Packard Development Company, L.P.
Remove Docs
from Blacklisted Authors
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token id,
Token Loc,
Token],
order <Doc Id,
Sentence Id,
Token Loc>
[Doc Info,
Sentence id,
Sentence Loc,
Token2 id,
Token Loc,
Attribute,
Sentiment Value]
attribute
sentiment value
POS Tagging
Relate
Sentiment to
Attribute
(Doc Info,
[Sentence id,
Sentence Loc,
Sentence,
Token id,
Token Loc,
Token,
POS Tag])
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag,
Negation Flag,
Attribute Flag,
Sentiment Flag]
Compound
Phrases
Sentiment
Word
Detection
sentiment
word dict.
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag],
order <Doc Id,
Sentence Id,
Token Loc>
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag,
Negation Flag,
Attribute Flag]
negation
dict.
Negation
Detection
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag,
Negation flag]
Attribute
Extraction
attribute
dict.
Example: Sentiment Analysis Flow –
Physical Model
Sources
Doc Info = (Doc id, Time Stamp,
Source type, Source URL,
Author Profile, GeoTag, Score)
Data Sources
Author Profile = (id, name,
email, location, gender, …)
document
batch file
Preprocessing
(Adaptor)
Hadoop
(Doc Info,
Text body)
vertical slice
profile info
Scripts
(Doc Info,
Text body)
Splitter
on Doc Id,
hash partition,
3x parallel
(Doc Info,
Text body)
Remove Docs
from Blacklisted Authors
(Doc Info,
Text body)
Sentence
Detector
[Doc Info,
Sentence id,
Sentence Loc,
Sentence]
Tokenizer
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token id,
Token Loc,
Token],
order <Doc Id,
Sentence Id,
Token Loc>
POS Tagging
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token id,
Token Loc,
Token],
order <Doc Id,
Sentence Id,
Token Loc>
profile
info
Merger
on Doc id,
3x parallel
create
recovery
point
recovery
point
DBMS
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag,
Negation Flag,
Attribute Flag,
Sentiment Flag]
ETL Tool
Relate
Sentiment to
Attribute
attribute
sentiment value
22
Scripts
Sentiment
Word
Detection
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag,
Negation Flag,
Attribute Flag]
sentiment
word dict.
[Doc Info,
Sentence id,
Sentence Loc,
Token2 id,
Token Loc,
Attribute,
Sentiment Value]
Attribute
Extraction
attribute
dict.
Join on doc Id
profile
info
© Copyright 20101Hewlett-Packard Development Company, L.P.
[Doc Info,
Sentence id,
Sentence Loc,
Token2 id,
Token Loc,
Attribute,
Sentiment Value]
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag,
Negation flag]
Negation
Detection
negation
dict.
[Doc Info,
Sentence id,
Sentence Loc,
Sentence,
Token2 id,
Token Loc,
Token,
POS Tag],
order <Doc Id,
Sentence Id,
Token Loc>
Compound
Phrases
phrase
dict.
Target
Aggregate by
Doc
document
sentiment value
(Doc Info,
Attribute,
Sentiment Value])
DASHBOARD
(Doc Info,
[Sentence id,
Sentence Loc,
Sentence,
Token id,
Token Loc,
Token,
POS Tag])
MICRO-BENCHMARKS FOR CONVENTIONAL
(ETL, RELATIONAL) OPERATORS
– Experiments on sort, join, surrogate-key, etc.
– Example: Sort
• os-sort
−home-grown distributed sort implementation
• hd-sort
−Hadoop-based sort (used Pig Latin)
• pdb-sort
−used a commercial, parallel DBMS
– Dataset
• TPC-H
lineitem
© Copyright 20101Hewlett-Packard Development Company, L.P.
SF
Size (GBs)
Rows
(x106)
1
0.76
6.1
10
7.3
59.9
100
75
600
Sort (Comparison)
– Hadoop vs. Custom script vs. pDB (with and without loading)
Pig script (hd-sort):
C and shell scripts (os-sort):
sf$sf = load 'lineitem.tbl.sf$sf'
split file: “filesize/#nodes” (leftovers go
using PigStorage('|') as (f1,f2,...,f17);
to chunk #1)
ord1_$nd = order sf$sf by f1 parallel $nd;
transfer data
...
sort chunks on nodes
ord10_$nd = order ord9_$nd by f10 parallel $nd; transfer back and merge (hierarchical)
store ord$op_$nd into 'res_sf$sf_Ord$op-$nd.dat‘ (to be fair with hd-sort and etl-sort, avoid
using PigStorage('|');
pipelining)
24
© Copyright 20101Hewlett-Packard Development Company, L.P.
(TPC-H data SF:1,10,100)
Sort (Tuning)
– Hadoop
– e.g., #reducers
– Shell scripts
– e.g., analyze sort internal phases
25
© Copyright 20101Hewlett-Packard Development Company, L.P.
MICRO-BENCHMARKS FOR
UNCONVENTIONAL OPERATORS
– Experiments on Sentiment Analysis operators
•individual
•the
operators
whole sentiment analysis flow treated as a single operator
– Implementations
• Single node
• Distributed, scripts
• Distributed, Hadoop
– Datasets
•Small•Large
to medium-sized document corpus: up to 200,000 documents
corpus: ~30 million documents
© Copyright 20101Hewlett-Packard Development Company, L.P.
MICRO-BENCHMARKS FOR
UNCONVENTIONAL OPERATORS
(SENTIMENT ANALYSIS)
Single node, small to mediumsized corpus
Use linear regression to learn
interpolation function
27
© Copyright 2010 Hewlett-Packard Development Company, L.P.
Single node, large corpus;
effect of limited memory
12-node Hadoop cluster
SENTIMENT ANALYSIS (COMPARISON)
– Single node vs. Distributed with Hadoop vs. Distributed without Hadoop
© Copyright 20101Hewlett-Packard Development Company, L.P.
OPEN PROBLEMS: FUTURE WORK
– Micro-benchmarks of other operators: ETL, relational, content
analytics, front-end analytics, stream processing
– Cost functions: interpolation, composition
– Benchmarks and objective functions for other QoX measures
– Optimization strategies for physical optimization
• Assign
segments of the flow to execution engines
• Where,
when, and what to materialize
© Copyright 20101Hewlett-Packard Development Company, L.P.
SUMMARY
– Design of data flows for BI applications is a very expensive, largely
manual activity: little support for automation, optimization
– Situation gets worse with next gen BI: fuse back-end and front-end
operations into end-to-end analytic data flows
– Many different types of sources, more complex operators, multiple
execution engines
– Layered methodology: QoX-driven optimization allows trading off
different quality objectives
– Need benchmarks and micro-benchmarks to drive the design and
optimization
– Many challenges remain
© Copyright 20101Hewlett-Packard Development Company, L.P.
Thank You
31
© Copyright 2010 Hewlett-Packard Development Company, L.P.