The State of the Art in Distributed Query Processing

The State of the Art in
Distributed Query Processing
by Donald Kossmann
Presented by Chris Gianfrancesco
Introduction
Distributed database technology is
becoming an increasingly attractive
enhancement to many database systems
Cost and scalability
Software integration
Legacy systems
New applications
Market forces
Introduction
Topics covered in this paper
Basics of distributed query processing
Client-server distributed DB models
Heterogeneous distributed DB models
Data placement techniques
Other distributed architectures
Client-Server Database Systems
Relationships between distributed nodes
take a client-server form
Client: makes requests of the servers,
usually the source of queries
Server: responds to client requests,
usually the source of data
System architectures: peer-to-peer, strict
client-server, middleware/multitier
Architectures: Peer-to-Peer
 All nodes are
equivalent
 Each can be either a
client or server on
demand (can store
data and/or make
requests)
 Ex: SHORE system
Peer Node
Server or Client
Peer Node
Peer Node
Server or Client
Server or Client
Architectures: Strict Client-Server
 Client or server status
is pre-defined and
can never change
 Clients supply
queries, servers
supply data
 Most common
architecture in
commercial DBMS’s
Client
Query source
Server
Data source
Architectures: Middleware/Multitier
 Multiple levels of
client-server
interaction
 Nodes act as clients
to those below them
and servers to those
above
 SAP R/3, web servers
with DB backends
Node 1
Client to Node 2
Node 2
Server to Node 1, Client to Node 3
Node 3
Server to Node 2
Architectures: Evaluation
Peer-to-Peer
Simplest setup
Equal load sharing
Strict Client-Server
Specialization
Administration for servers only
Middleware/Multitier
Functionality integration
Scalability
Client-Server Query Processing
Queries initiated at clients, data stored at
servers
Where do we execute the query?
Query shipping: move the query down to
the data
Data shipping: move the data up to the
query
Hybrid shipping: combination of both
Query Shipping
 SQL query code is
sent down to the
server
 Server parses and
evaluates query,
returns result
 Used in DB2, Oracle,
MS SQL Server
Data Shipping
 Client parses query
and requests data
from server
 Server provides data,
then client executes
query
 Data can be cached
at client (main
memory or disk)
Hybrid Shipping
 Mix-and-match data
shipping and query
shipping
 Query parts can be
executed at any level
according to query
plan
 Data is cached when
beneficial
Evaluation
Query Shipping
Reliant on server performance
Scales poorly with increasing client load
Data Shipping
Good scalability
High communication costs
Hybrid
Potential to outperform other options
More complex optimizations
Hybrid Shipping Observations
Some observations of optimal
performance using hybrid shipping
Preference to not use a client cache
If network transfer cost < client access cost
Shipping down cached data
If in main memory & execution at server
Multiple small updates
Maintain at client and post to server only when
necessary
Query Optimization
Query plans must also specify where the
query pieces are executed
Data shipping: all execution done at client
Query shipping: all execution done at
server
Hybrid: choice can be made for each
operator
Results display to user is always at client
Distributed Query Plans
Each operator is annotated with a logical
site of execution – plans are shareable
client means an operator is executed from
the client where the query is issued
server means:
for scan operators, execute at a location that
has the necessary data
for updates, execute at all locations with the
relevant data
Query Optimization: Where?
Should optimization occur at the client or
the server?
At client: less load on servers, better
scalability
At server: more information about system
statistics, especially server loads
Potential solution: primary parsing and
query rewriting at client, further
optimization at server
Query Optimization: Statistics
Even when optimization is done at a
server, that server does not usually have
full knowledge of the system
System can either:
Guess the status of other servers – less
accuracy, less cost
Ask other servers their status – fully accurate,
additional communication costs
Query Optimization: When?
Tradeoff of accuracy vs. cost
Traditional-style: optimize once, store plan
No support for changing DB conditions
No incurred cost for query execution
Plan sets: optimize for possible scenarios
Generate a few query plans for diff. conditions
Choose plans based on runtime statistics
On-the-fly: observe intermediate results
Re-optimize query if different from expectations
Query Optimization: Two-Step
Compile-time: generate join order, etc.
Runtime: perform site selection
Reasonable cost at each end
Responds well to changing server loads
Fully utilizes client data caching
Two-Step Optimization: Downside
1.
2.
3.
•
Optimal plan is generated traditional-style
Site selection is performed
True optimal plan was missed
Optimal was missed because first optimization step
was done with no knowledge of the system
Query Execution Techniques
Standard fare: row blocking, multithread
when possible
Issues: transactions with both updates and
retrieval queries using hybrid shipping
We want to wait to propagate updates for
efficiency’s sake
Other option: perform query before update and
temporarily pad results
Questions?
Comments?