Efficient Evaluation of XQuery over Streaming Data

A Framework for Optimizing and Parallelizing XQuery
Xiaogang Li
Motivations
 Developing data processing applications is
hard
- Many data formats exist
- Different architectures
- Need independence from data format and architecture
 XML has gained great popularity!
- Now the standard language for the internet
- Already extensively used as part of Grid/Distributed Computing
 High-level declarative languages ease
application development
-Popularity of Matlab for scientific computations
The Whole Picture
[Figure labels: XQuery; XML; data sources HDF5, NetCDF, TEXT, XML, RDMS]
Contributions
 Architectural independence
- Provide compilation support of XQuery for
- Stream processing (VLDB2005)
- Parallel processing on clusters (ICS2003, DBPL2003)
 Data format independence
- Developed techniques to use XML as a logical interface
over physical datasets (LCPC 2003)
 Performance
- Developed a series of optimization techniques for efficient XQuery processing
- Developed static analysis techniques to guide compiler
optimizations and transformations (XIMP2004,IPDPS2003)
Roadmap
Background
- XML, XQuery
- Related work
Stream Processing
Virtual XML
Parallelization
Conclusion
eXtensible Markup Language
 Specification of a syntax for “encoding” data, with strict syntax rules
about how to do so.
 A text-based syntax -- written using printable characters (no
explicit binary data)
 Extensible -- you can define your own tags (essentially data types),
within the constraints of the syntax rules
 Universal -- the syntax rules ensure that all XML processing
software MUST identically handle a given piece of XML.
An ideal data exchange format
XML Example
[Callouts: element tags; units is an attribute of the quantity element]
<order xmlns="http://w3c.org/Spec/">
  <item>
    <code> "30100026266" </code>
    <desc> Viewsonic E90f Monitor, 0.21mm, DELL Outlet </desc>
    <price> 229.99 </price>
    <quantity units="gross"> 2 </quantity>
    <deliveryDate date="20APr2004-12:00h" />
  </item>
  <item>
    <code> "2001234" </code>
    . . . . . .
  </item>
</order>
XQuery
A declarative language for querying XML
- Widely accepted language for querying XML
- Declarative: like SQL, easy to use
- Powerful: types, user-defined functions, binary expressions
- FLWR (for, let, where, return) expressions
 Support XPath as a subset
- A query language that selects particular subsets of nodes
from an XML document
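As a rough illustration of path-based node selection, the sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The function names are hypothetical, and the second item's price and units are made-up sample data, not values from the slides.

```python
import xml.etree.ElementTree as ET

# Sample document modeled on the order example; the second item's
# price and units are invented for illustration only.
ORDER_XML = """
<order>
  <item><code>30100026266</code><price>229.99</price>
        <quantity units="gross">2</quantity></item>
  <item><code>2001234</code><price>15.50</price>
        <quantity units="each">4</quantity></item>
</order>
"""

def select_prices(xml_text):
    """Select every /order/item/price node and return its numeric value."""
    root = ET.fromstring(xml_text)
    return [float(price.text) for price in root.findall("./item/price")]

def codes_with_units(xml_text, units):
    """Return the code of each item whose quantity has the given units attribute."""
    root = ET.fromstring(xml_text)
    codes = []
    for item in root.findall("./item"):
        # A predicate on an attribute, as in the XPath quantity[@units='gross']
        if item.find(f"quantity[@units='{units}']") is not None:
            codes.append(item.find("code").text)
    return codes
```

Calling `select_prices(ORDER_XML)` picks out one subset of nodes by path alone, while `codes_with_units(ORDER_XML, "gross")` adds an attribute predicate.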
VMScope- XQuery Code
unordered (
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  let $p := document("vmscope.xml")
            /data/pixel[(x = $i) and (y = $j) and (scale >= $z1)]
  return
    <pixel>
      <latitude> {$i} </latitude>
      <longitude> {$j} </longitude>
      <sum> {accumulate($p)} </sum>
    </pixel>
)

define function accumulate ($p) as element
{
  if (empty($p))
  then $null
  else
    let $max := accumulate(subsequence($p, 2))
    let $q := item-at($p, 1)
    return
      (: keep $q when the rest of the sequence is empty :)
      if (($max != $null) and ($q/scale < $max/scale))
      then $max
      else $q
}
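The recursive accumulate function above appears to keep the pixel with the largest scale by comparing the head of the sequence with the result for its tail. The Python sketch below is an assumption about that intent, not the compiler's output: it places a recursive version next to an equivalent one-pass loop, which is the kind of form a streaming evaluator needs. Both function names are hypothetical.

```python
def max_scale_recursive(pixels):
    """Recursive form: compare the head with the best element of the tail."""
    if not pixels:
        return None
    best_of_rest = max_scale_recursive(pixels[1:])
    head = pixels[0]
    if best_of_rest is not None and head["scale"] < best_of_rest["scale"]:
        return best_of_rest
    return head

def max_scale_iterative(pixel_stream):
    """One-pass form: constant state, each element is seen once and discarded."""
    best = None
    for pixel in pixel_stream:
        if best is None or pixel["scale"] > best["scale"]:
            best = pixel
    return best
```

The iterative version accepts any iterator, so it never needs the whole sequence in memory, whereas the recursive version slices the list at every call.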
XQuery Example: Apriori
Users can write very complex, flexible programs.
Recursive functions are the only way to express reductions
Roadmap
Background
- XML, XQuery
- Related work
Stream Processing
High-level Abstraction
Parallelization
Conclusion
Query Processing- Related Work
Much of the work focuses on XPath
- XPath expressions are regular expressions, which makes them easy to analyze
Limited work on optimizing XQuery
- Optimizing at a high level using an algebra
- Translating the query into a tree of operators
- Query rewriting based on the algebra
Algebra Approach: Limitations
Cannot handle low-level optimizations
- loop invariants, common subexpressions …
Hard to capture all features using an algebra
- Recursive functions, types, aggregations
XQuery is complex; a simple algebra just does not exist
Our Overall Approach
Using compiler technologies for Query
optimization
- Compiler techniques are well developed
- Data flow analysis, loop transformation,
parallelization
Advanced program analysis, loop transformation
and parallelization techniques can allow efficient
execution of XQuery
Roadmap
Background
- XML, XQuery
- Related work
Stream Processing
Virtual XML
Parallelization
Conclusion
Motivation
 Why Streaming Data
 Data needs to be analyzed in real time
- Stock market, Security, Climate, Network monitoring,
Telecommunication data management etc
 Huge amount of data
- NASA EOS project – 50 GB per hour
 Rapid improvements in networking
technologies
- 101.13 Gbps at SC2004 bandwidth challenge
Motivation
 Why XML
- Standard data exchange format for the Internet
- Widely adopted in web-based, distributed and grid computing
 Why XQuery
- Widely accepted language for querying XML
- Easy to use
XQuery is the ideal language for querying XML streams
Can we compile it correctly and efficiently for streaming data?
Challenges
 For an arbitrary query, can it be evaluated
correctly on unbounded streaming data?
- Single traversal of the data is required
- Decision should be made by the compiler, not the user
 If not, can it be transformed accordingly?
 How to generate efficient code for XQuery?
- The computations involved are nontrivial
- Recursive functions are frequently used
- Efficient memory usage is important
Our Solutions
 For an arbitrary query, can it be evaluated
correctly on unbounded streaming data?
- Construct data-flow graph for a query
- Static analysis based on data-flow graph
 If not, can it be transformed accordingly?
- Query transformation techniques based on static
analysis
 How to generate efficient code for XQuery?
- Techniques based on static analysis to minimize
memory usage and optimize code
- Generating imperative code
- Recursive analysis and aggregation rewrite
Query Evaluation Model
[Figure: chain of linked operators Op1, Op2, Op3, Op4]
 Single input stream
 Internal computations
- Limited memory
- Linked operators
 Pipeline operator and blocking operator
Pipeline and Blocking Operators
 Pipeline Operator:
- Each input element produces an output element independently
- Selection etc
 Blocking Operator:
- Can only generate output after receiving all input elements
- Cannot be processed in a single pass
- Sort, Join etc
 Progressive Blocking Operator:
(1) |output| << |input|: we can buffer the output
(2) Associative and commutative operation: we can discard the input
- count(), sum()
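The three operator classes can be contrasted with a small Python sketch. This is an illustration under assumptions, not the system's generated code: the function names are hypothetical and pixels are modeled as plain dictionaries.

```python
def select(stream, predicate):
    """Pipeline operator: each input element is handled independently."""
    for item in stream:
        if predicate(item):
            yield item

def count_and_sum(stream, key):
    """Progressive blocking operators: the result is only final at end of input,
    but the operation is associative and commutative, so constant-size state
    is enough and every input element can be discarded after use."""
    count, total = 0, 0
    for item in stream:
        count += 1
        total += item[key]
    return count, total

def sort_by(stream, key):
    """Blocking operator: the whole input must be buffered before any output."""
    return sorted(stream, key=lambda item: item[key])
```

`select` and `count_and_sum` work on an iterator that can be read only once, while `sort_by` has to materialize the entire input, which is not possible for an unbounded stream.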
Single Pass?
Input: pixels with x and y
Q1: let $i := …/pixel sortby (x)
Q2: let $i := …/pixel[x < count(/pixel)]
A query cannot be evaluated in a single pass if:
(1) A blocking operator exists (Q1)
(2) A progressive blocking operator is referenced by a pipeline operator (or another progressive blocking operator) (Q2)
Check condition 2 in a query
Single-Pass? Challenges
 Must analyze data dependence
- Something like a data dependence graph may be helpful
 A Query may be flexible and complex
- Need a simplified view of the query to make the decision
Overall Framework
[Figure: compiler pipeline]
Data Flow Graph Construction
High-level Transformation
- Horizontal Fusion
- Vertical Fusion
Single-Pass Analysis
Low-level Transformation
- GNL Generation
- Recursion Analysis
- Aggregation Rewrite
Stream Code Generation
Stream Data Flow Graph (DFG)
 Node: variable
- Sequence
- Atomic
 Edge: dependence relation
- v1->v2 if v2 uses v1
- Aggregate dependence
- Flow dependence
 A DFG is acyclic
[Figure: example DFG with nodes S1, S2, v1, i, b]
S1: stream/pixel[x>0]
S2: stream/pixel
V1: count()
High-level Transformation
 Goals
1: Enable single-pass evaluation
2: Simplify the DFG for single-pass analysis
 Horizontal Fusion and Vertical Fusion
- Based on DFG
Horizontal Fusion
 Enable single-pass evaluation
- Merge sequence nodes with a common prefix
[Figure: DFG before and after horizontal fusion]
Before (nodes S1, S2, v1, v2, b):
S1: stream/pixel[x>0]
S2: stream/pixel/y
V1: count() V2: sum()
After (nodes S0, S1, S2, v1, v2, b):
S0: /stream/pixel
S1: [x>0] S2: /y
V1: count() V2: sum()
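The effect of merging S1 and S2 under the common prefix S0 can be sketched in Python as follows. This is a hand-written illustration that assumes pixels are dictionaries with x and y fields; the function names are hypothetical.

```python
def two_scans(pixels):
    """Before fusion: S1 and S2 each traverse the data, so it must be re-readable."""
    v1 = sum(1 for p in pixels if p["x"] > 0)   # count() over stream/pixel[x>0]
    v2 = sum(p["y"] for p in pixels)            # sum() over stream/pixel/y
    return v1, v2

def fused_scan(pixel_stream):
    """After fusion: one loop over S0 = /stream/pixel feeds both aggregates."""
    v1, v2 = 0, 0
    for p in pixel_stream:      # S0: the shared prefix, traversed once
        if p["x"] > 0:          # S1: residual filter [x>0]
            v1 += 1
        v2 += p["y"]            # S2: residual step /y
    return v1, v2
```

`two_scans` needs a list it can iterate twice, while `fused_scan` also works on a one-shot iterator, which is the situation on a stream.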
Horizontal Fusion with nested loops
 Perform loop unrolling first
 Merge sequence node accordingly
[Figure: before horizontal fusion, producing the output requires three scans of the datasets; after horizontal fusion, it requires just one scan]
Vertical Fusion
 Simplify DFG and single-pass analysis
- Merge a cluster of nodes linked by flow dependence edges
[Figure: DFG before and after vertical fusion; node labels S1, S2, S, i, j, b, v]
Single-pass Analysis
 Can a query be evaluated on the fly?
THEOREM 1. If a DFG contains more than one sequence node after vertical fusion, it cannot be evaluated correctly in a single pass.
Reason: for a single input stream, each sequence node requires one traversal
Single-pass Analysis - Continued
THEOREM 2. For any two atomic nodes n1 and n2, if
(1) n1 and n2 are aggregate dependent on a sequence node, and
(2) there is a path between them,
then the query may not be evaluated in a single pass.
Reason: a progressive blocking operator is referenced by another progressive blocking operator
Example: count(pixel) where /x > 0.01*sum(/pixel/x)
Single-pass Analysis - Continued
THEOREM 3. If there is a cycle in a DFG, the corresponding query may not be evaluated correctly in a single pass.
Reason: a progressive blocking operator is referenced by a pipeline operator
[Figure: example DFG containing a cycle; node labels S1, S2, i, j, b, v]
Single-pass Analysis
 Check the conditions corresponding to Theorems 1, 2 and 3
- Stop further processing if any condition is true
 Completeness of the analysis
- If a query without a blocking operator passes the test, it can be evaluated in a single pass
THEOREM 4. If the results of a progressive blocking operator are
referred to by a pipeline operator or a progressive blocking
operator, then for its DFG, at least one of the three
conditions holds true
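The three conditions can be checked mechanically on a small graph structure. The Python sketch below is one possible reading of Theorems 1 to 3, not the authors' implementation: the DFG encoding (a dict of node kinds plus a list of `(source, target, kind)` edges) and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def successors(edges):
    """Adjacency sets built from (source, target, kind) edges."""
    adj = defaultdict(set)
    for src, dst, _kind in edges:
        adj[src].add(dst)
    return adj

def reachable(adj, start, goal):
    """True if a directed path of at least one edge leads from start to goal."""
    stack, seen = list(adj[start]), set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node])
    return False

def has_cycle(nodes, adj):
    """Depth-first search with three colors to detect a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def visit(n):
        color[n] = GREY
        for m in adj[n]:
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)

def single_pass_violations(nodes, edges):
    """Return which of the three conditions hold for a DFG after fusion."""
    adj = successors(edges)
    violations = []
    # Theorem 1: more than one sequence node remains
    if sum(1 for kind in nodes.values() if kind == "sequence") > 1:
        violations.append("theorem1")
    # Theorem 2: two aggregate-dependent atomic nodes joined by a path
    agg = sorted({dst for src, dst, kind in edges
                  if kind == "aggregate"
                  and nodes[src] == "sequence" and nodes[dst] == "atomic"})
    if any(reachable(adj, a, b) or reachable(adj, b, a)
           for i, a in enumerate(agg) for b in agg[i + 1:]):
        violations.append("theorem2")
    # Theorem 3: the graph contains a cycle
    if has_cycle(nodes, adj):
        violations.append("theorem3")
    return violations
```

For the Theorem 2 example, a graph with one sequence node feeding both a sum node and a count node, plus a flow edge from the sum node to the count node, is reported as violating the second condition only.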
A Review of the High-level Transformation and Analysis
[Figure: an example DFG (nodes S1, S2, v1, i, b) taken through the transformations; result: cannot be evaluated in a single pass!]
Code Generation
 Using SAX XML stream parser
- The XML document is parsed as a stream of events
- Event-driven: need to generate code to handle each event
 Using Java JDK
- Our compiler generates Java source code
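To give a feel for event-driven code of this kind, here is a minimal hand-written sketch. It uses Python's standard `xml.sax` module as a stand-in for the Java SAX parser named on the slide, and it is not output of the compiler. The `PixelStats` handler, the `evaluate` helper and the `<stream><pixel><x/><y/></pixel></stream>` layout are assumptions for illustration.

```python
import xml.sax

class PixelStats(xml.sax.ContentHandler):
    """Counts pixels with x > 0 and sums y in one pass over the event stream."""

    def __init__(self):
        super().__init__()
        self.count_positive_x = 0
        self.sum_y = 0.0
        self._field = None   # name of the x or y element currently open
        self._text = []      # character data collected for that element
        self._pixel = {}     # fields of the pixel currently being read

    def startElement(self, name, attrs):
        if name == "pixel":
            self._pixel = {}
        elif name in ("x", "y"):
            self._field = name
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it until endElement.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._field:
            self._pixel[name] = float("".join(self._text))
            self._field = None
        elif name == "pixel":
            # Update the aggregates, then let the pixel be discarded.
            if self._pixel.get("x", 0.0) > 0:
                self.count_positive_x += 1
            self.sum_y += self._pixel.get("y", 0.0)

def evaluate(xml_bytes):
    handler = PixelStats()
    xml.sax.parseString(xml_bytes, handler)
    return handler.count_positive_x, handler.sum_y
```

Only the current pixel and two accumulators are kept in memory, so the handler's state does not grow with the length of the input.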
Experiment
 Query Benchmark
- Selected Benchmarks from XMARK
- Satellite, Virtual Microscope, Frequent Item
 Systems compared with
- Galax
- Saxon
- Qizx/Open
Performance: XMARK Benchmark
>25% faster on small dataset
Scales well on very large datasets
Performance: Real Applications
>One order of magnitude faster on small dataset
Works well for very large datasets
Summary
 Provide a formal approach for query
evaluation on XML stream
- Query transformation to enable correct execution on
stream
- Formal methods for single-pass analysis
- Strategies for efficient low-level code generation
- Experimental results show an advantage over other well-known systems
Roadmap
Background
- XML, XQuery
- Related work
Stream Processing
Virtual XML
Parallelization
Conclusion
Support High-Level Abstraction
 Understanding the physical details is hard, but
necessary for performance
Logical Schema: A logical view over the data for programmer
Physical Schema: Low level details of physical storage, provided
to compilers
System Architecture
External Schema
XML Mapping Service
logical XML schema
physical XML schema
Compiler
XQuery Sources
C++/C
High-level and low-level XQuery
 High-level query:
- Query based on the logical schema
- Developed by programmers
 Low-level query:
- Query based on the physical schema
- Retrieve data by calling library functions
 High-level Query is transformed to low-level
query by our compiler
- Users can still modify the low-level query if not satisfied
Mapping to low-level Query
 A number of getData functions to retrieve the data stream
- getData($x)
- getData($x,$y)
 getData functions are written in XQuery
- Allows analysis and transformation
 Find the optimal library function to call
unordered (
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  let $p := getData($i,$j)
  return
    <pixel>
      <latitude> {$i} </latitude>
      <longitude> {$j} </longitude>
      <sum> {accumulate($p)} </sum>
    </pixel>
)
Compiler Techniques
 Insert getData functions
- Compatibility: the output should be a superset of the original data stream
- Performance: want the smallest such superset
 Query rewritten based on relational algebra
- Reduce to canonical forms
- Compare canonical forms
Comparison with Manual - VMScope
[Figure: chart with series Xquery and C; x-axis ticks 1, 2, 4, 8; y-axis range 0 to 4500]
Roadmap
Background
- XML, XQuery
- Related work
Stream Processing
Virtual XML
Parallelization
Conclusion
Generalized Nested Loop (GNL)
An intermediate representation that explicitly defines
- iterative structures for retrieving data
- aggregation operations to be performed on the qualified data
Example:
For $b in student/score @t = cis
  sum = sum + b
  count = count + 1
[Figure callouts: index variable ($b), path expression (student/score), filter expression (@t = cis), loop body]
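One way to picture the pieces of a GNL (path expression, filter expression, index variable and loop body) is as a small record that a generic driver loop interprets. The Python sketch below is only an illustration of that idea. The `GNL` class, its field names and the dictionary layout of the student data are all assumptions, not the compiler's actual intermediate representation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

@dataclass
class GNL:
    path: Callable[[dict], Iterable[dict]]            # path expression
    keep: Callable[[dict], bool]                      # filter expression
    body: Callable[[Dict[str, float], dict], None]    # loop body (reductions)

def run_gnl(gnl, document, accumulators):
    """Generic driver: iterate, filter, and apply the aggregation body."""
    for b in gnl.path(document):       # b plays the role of the index variable
        if gnl.keep(b):
            gnl.body(accumulators, b)
    return accumulators

def add_score(acc, b):
    """Loop body of the example: sum = sum + b, count = count + 1."""
    acc["sum"] += b["value"]
    acc["count"] += 1

# The slide's example: for $b in student/score with @t = cis
CIS_SCORES = GNL(
    path=lambda doc: doc["student"]["score"],
    keep=lambda b: b["t"] == "cis",
    body=add_score,
)
```

Because the iteration and the reduction are separate fields, a back end can choose how to run the loop (sequentially, on a stream, or split across processors) without re-analyzing the query.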
Parallelization of XQuery
GNL offers a convenient base for parallelization
- Iterative structure
- Explicitly defined reduction
Use ADR and MPI for parallel code generation
- ADR: a C++ class library and runtime system for building parallel databases of multidimensional datasets
- MPI: a standard communication library (C++)
Parallel Code Generation
 1. From XQuery to C++
- ADR(MPI) is a C++ library
- The type systems of XQuery and C++ are quite different
 2. Generation of Processing functions
- Local reduction
- Global reduction
Global Reduction Function
 Local reduction: process input, update the local copy
 Global reduction: process local copies, update the global copy
 From local to global, how?
1. Extract a program slice from the local reduction
2. Replace data dependences on the input with dependences on the local copy of the output
3. Remove control dependences on the input
Local reduction:
Initialize Output
For $b in //score
  If b in (ee, cis, math)
    output[b]++;
Global reduction:
Initialize global Output
For $b in local copy
  output[b]++;
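The relationship between the two functions can be sketched in Python. This is an interpretation under assumptions rather than generated ADR/MPI code: each processor is modeled as a list of subject names, the function names are hypothetical, and the global step adds the per-subject local counts instead of incrementing by one.

```python
from collections import Counter

SUBJECTS = ("ee", "cis", "math")

def local_reduction(scores):
    """Runs on each processor: reads the input and updates the local copy."""
    output = Counter()
    for b in scores:
        if b in SUBJECTS:          # control dependence on the input
            output[b] += 1         # update driven by the input element
    return output

def global_reduction(local_copies):
    """Derived from the local reduction: the input is replaced by the local
    copies of the output, and the test on the input is removed."""
    output = Counter()
    for local in local_copies:
        for subject, partial in local.items():
            output[subject] += partial
    return output
```

For example, `global_reduction([local_reduction(part) for part in partitions])` gives the same counts as running `local_reduction` over all partitions concatenated, because the counting update is associative and commutative.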
Parallel Performance- Q5,Q20
[Figure: chart with series Q20 and Q5; x-axis ticks 1, 2, 4, 8; y-axis range 0 to 180]
Q20 6GB, Q5 4GB, Good speedup
Parallel Performance- VMScope
[Figure: chart with series COMM and NO COMM; x-axis ticks 1, 2, 4, 8; y-axis range 0 to 4500]
Conclusion
Provided a new framework for processing XQuery based on compiler techniques
Designed new optimization and analysis techniques for XQuery
Support for high-level abstractions to hide low-level details
Experimental results show effectiveness
Thank you !!!
GNL Example
 DFG
 GNL
[Figure: DFG with nodes S0, S1, S2, v1, v2, b and the corresponding GNL]
Facilitates code generation for any desired platform
Conservative analysis
 Our analysis is conservative
- A valid query may be labeled as “cannot be evaluated in a single pass”
Example:
Horizontal Fusion: Side-effect
 May produce incorrect results due to inter-dependence
Before fusion:
let $b := count(stream/pixel)
for $i in stream/pixel
return $i/x idiv $b
After fusion:
for $i in stream/pixel
return $i/x idiv count()
A partial result of count is used to compute the output
Will be dealt with during single-pass analysis
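The problem can be reproduced with a few lines of Python. The sketch below is illustrative only, with hypothetical function names and pixels modeled as dictionaries: the two-pass version divides by the final count, while the naively fused loop divides by whatever the count happens to be when each pixel arrives.

```python
def two_pass(pixels):
    """Original query: the count is complete before any output is computed."""
    b = len(pixels)                       # first traversal: count(stream/pixel)
    return [p["x"] // b for p in pixels]  # second traversal: $i/x idiv $b

def naive_fused(pixel_stream):
    """Fused loop: the count is still partial when each output is produced."""
    count = 0
    out = []
    for p in pixel_stream:
        count += 1
        out.append(p["x"] // count)       # uses the partial count
    return out
```

With x values 10, 20 and 30, the two-pass version divides every value by 3, whereas the fused loop divides by 1, 2 and 3 in turn, so the outputs differ. This is why the fused graph still has to go through the single-pass check.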