Relational Database Internals

Alex Scotti
Bloomberg LP
Outline of talk
• History - origins and background
• Internals - theory and practice
• Internals - brief discussion of real systems
• Future - observations, trends, predictions
History
• The early database systems differed from the
relational ones in 2 main regards
• Data model
• Transactional semantics
• We’ll be more heavily focused on the
transactional issues than the data
modeling issues in this talk
• Pre Relational Systems
• Hierarchical Data Models
• IMS
• Network Data Models
• CODASYL
• IMS
• Each record was typed by a “record type”
(think “table”)
• Relationships between records are
represented as trees (hierarchies) between
records linked by their “keys”
• Writing a “query” consisted of writing a
program to navigate through these links,
traversing records until the right one was
found.
• Data types available were SEQUENTIAL,
HASH, TREE
• Each acted differently - A program written to
use a tree could not have the data structure
changed out from under it
• lack of PHYSICAL INDEPENDENCE
• CODASYL
• A more complex “evolution” of the IMS idea,
standardized by ANSI, implemented by several
vendors
• Honeywell, DEC, Univac
• The idea is that instead of pointers forming a strict
hierarchy, they now form an arbitrarily complex
“network.”
• Able to represent graphs
• Even HARDER to program with than IMS
• In 1970, Ted Codd wrote the foundational
paper “A Relational Model of Data for Large
Shared Data Banks”
• Codd was primarily a mathematician, not
particularly concerned with transaction
processing
• However, the two problems were incredibly
tightly coupled
• Work at IBM began on “relational databases”
• Including locking, logging, all sorts of things
that became the core of an RDBMS
• Codd’s insight
• A database is nothing more than a “fact
store” from which it should be possible to
logically infer “new facts.”
• Simple but amazingly powerful.
• If the goal is to store facts, then there is no
benefit from storing the same fact multiple
times or in multiple forms. A fact does not
become MORE TRUE by repeating it. The
basis of “normalization.”
• Codd’s insight
• If the system knows enough about the data it is
storing, you can ASK IT QUESTIONS rather
than TELL IT WHAT TO DO.
• Declarative vs Procedural programming model
• AKA, The nail in the coffin for all the Pre
Relational systems
• Just a matter of time - If a system is easier
to use, and performs fine, why wouldn’t you
use it?
• Codd’s terminology has (for the most part)
been replaced by the SQL terminology,
which we’ll be using throughout the rest of this
talk
Attribute → Column
Tuple → Row
Relation → Table
• Generalized simplification
• “Logically organize your data into tables”
• Going further, Codd defined 12 “rules” that he
hoped would define what it meant to be “relational”
• Key points are
• All information is represented in tables
• Nulls must be uniformly handled by all
datatypes
• Physical representation must be abstracted
from logical representation
• Physical location and distribution of data
must be invisible to users
• Key points are
• “Set based” operations for insert / update /
delete
• Integrity constraints must be enforceable by
the database system
• There must be no way AROUND the set of
enforced constraints
• There must be support for at least 1
“relational language”
• Codd’s work became the basis for a “next
generation” database product at IBM called
System R.
• System R was treated as a “production proof of
concept.” At the end of the project there were
several commercial customers.
• Around the same time, work was going on at
UC Berkeley on the “Ingres” system, also
based on Codd’s idea.
• Neither system was successful at
commercializing a general purpose database.
• The award goes to Oracle.
• Oracle shipped a working commercial
RDBMS to anyone who would pay before
IBM.
• Based also on Codd’s work.
• No common code between System R,
Ingres, and Oracle - 3 unique lineages all
based on the same idea
• IBM evolved the System R “prototype” into
their second system : DB2.
• Ingres went on to be the basis of numerous
successful commercial products
• Sybase was based on Ingres code
• Informix contains Ingres code (through
Illustra)
• MSSQL contained Ingres code (through
Sybase)
• Newer systems - all inspired by the same ideas
and following the same principles, but without
direct code sharing
• MySQL
• SQLite
• PostgreSQL
Internals
• Buffer Pool
• Log
• Concurrency Control
• Btrees
• Relational Layer
Buffer Pool
• Often known as “the cache”
• A page/block oriented data structure
• A page in the pool conceptually “maps” to a
block on a disk. (not really always true)
• Needs to interface with the systems BELOW
and the systems ABOVE.
• Below - Disks, File systems
• Above - Btrees
• Both page/block oriented interfaces above and
below.
• Conceptually, very similar to the VM subsystem of
any modern UNIX
• “demand paging”
• Eviction policy based on LRU approximations,
often with more “smarts” than VM.
• Higher levels of the system often can pass
down “hints” about intended access patterns all
the way to the buffer pool.
Buffer Pool - Why?
• What’s the story? It’s a cache, we get it.
• Much more than that going on here!
• Basics of transaction management begins with
the buffer pool and the policies and protocols
enforced there
• Terminology
• “pinned” - a page that cannot be evicted
• “dirty” - a page that contains data that DOES
NOT match the data on the disk
• “clean” - the opposite
• A dirty page BECOMES a clean page when the
data in that page is DURABLY written to the
disk
• Can we really just write a page to the disk?
Not really, it usually involves logging
protocols - wait for the next section!
• More terminology
• “forcing” - when a transaction commits, its
dirty pages are FORCED to durable storage
before the commit is considered complete
• “stealing” - A dirty page which is a part of an
UNCOMMITTED transaction can be written
to the disk in an effort to produce usable
space in the buffer pool
• What is the simplest?
• FORCE / NOSTEAL
• What is the highest performing and most
powerful?
• NOFORCE / STEAL
• Not surprisingly, most real world systems today
implement a NOFORCE / STEAL buffer pool
policy
• Support for this policy requires logging
• More terminology
• OVERWRITE / NO OVERWRITE
• Whether or not the buffer pool will write
changes to a page ON TOP of an existing
page, or leave the existing page alone and
write to a NEW page.
• OVERWRITE systems are higher performing
• most real world systems implement an
OVERWRITE buffer pool.
• NO OVERWRITE example: System R,
shadow paging
• How does data actually get written to the disk?
• The “clients” of the buffer pool (the layers
above) never concern themselves with writing
data. They work at a layer of abstraction
where they “get” buffers and “dirty” them.
• Pages get written out (cleaned!) as part of a
background process.
• Goal is to keep some portion of the buffer pool
clean.
• Why are we trying to keep writing out these
pages to disk in the background?
• To make the system more reliable?
• NO! Completely unrelated. Reliability
ensured through other means
• To make sure that a READ doesn’t become a
WRITE!
• Need a page? Can’t get one, all dirty.
• You get to “clean” one (write it) now!
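• A minimal sketch (in Python, with hypothetical names, not any real system’s API) of the bookkeeping above: callers pin and dirty pages, and a miss may have to clean (write out) a dirty victim before it can be satisfied - exactly the “read becomes a write” case the background writer tries to avoid:

    from collections import OrderedDict

    class BufferPool:
        def __init__(self, capacity, disk):
            self.capacity = capacity
            self.disk = disk                  # assumed to expose read_block / write_block
            self.pages = OrderedDict()        # page_no -> bytearray, kept in LRU order
            self.dirty = set()                # pages that no longer match the disk
            self.pins = {}                    # page_no -> pin count

        def get_page(self, page_no):
            if page_no not in self.pages:
                if len(self.pages) >= self.capacity:
                    self._evict_one()         # may turn this read into a write
                self.pages[page_no] = bytearray(self.disk.read_block(page_no))
            self.pages.move_to_end(page_no)   # mark most recently used
            self.pins[page_no] = self.pins.get(page_no, 0) + 1
            return self.pages[page_no]

        def unpin(self, page_no, dirtied=False):
            self.pins[page_no] -= 1
            if dirtied:
                self.dirty.add(page_no)

        def _evict_one(self):
            for victim in self.pages:         # oldest (least recently used) first
                if self.pins.get(victim, 0) == 0:
                    if victim in self.dirty:  # a STEAL: the page may hold uncommitted
                        # data; a real system must flush the log first (see Logging)
                        self.disk.write_block(victim, bytes(self.pages[victim]))
                        self.dirty.discard(victim)
                    del self.pages[victim]
                    return
            raise RuntimeError("no evictable page: everything is pinned")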
Logging
• Basic Idea behind logging
• Before you do something, write down what it
is you intend to do.
• Sounds slow. Why bother with this, just DO IT!
• Nope - The opposite is true. Logging can
make things quicker
• The highest performing buffer pool policy of
NOFORCE/STEAL actually REQUIRES
logging
• Without logging you would compromise with
a lower performing policy
• Logging has the capacity to perform “magic”
• Converts RANDOM (slow) I/O into
SEQUENTIAL (fast) I/O!
• We’ll come back to this idea
• Expanding on the basic idea of logging
• There are really two distinct things that you are
“writing down” here
• Write down what it is you are about to do:
REDO logging - can “do it over”
• Write down the procedure to follow to make
it as if what you did NEVER HAPPENED:
UNDO logging
• Many times both of these pieces of information
are embedded into a single “log record.” Or not.
Conceptually, they are 2 things.
• Mechanics of logging - What’s the data
structure?
• In its basic form, a log is a simple sequential
file. Conceptually it’s not unlike a tape drive.
• Each “record” in the log is identified by a
unique identifier, which is typically just the
physical location of the record in the file.
• Call this the Log Sequence Number (LSN)
• “Log Buffer” - exactly what it sounds like - a
buffer of memory in front of the log.
• An obvious and common optimization to
make it less expensive to “write to the log”
• Recoverability is endangered unless the log
exposes an interface to FLUSH THE
BUFFER. (and it gets called at the right
places)
• All real systems work this way
• Subsystems are said to “generate log records”
(calling APIs provided by log subsystem)
• Buffer pool may need to log the allocation of a
new page
• Btree may need to log a page split
• Relational layer may log an INSERT statement
• Customers of this subsystem all over the
database
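• A sketch of the interface those subsystems call, assuming the log is a simple append-only file where a record’s LSN is just its byte offset (names are illustrative, not a real API):

    import os, struct

    class Log:
        def __init__(self, path):
            self.f = open(path, "ab")         # append-only, sequential writes
            self.buffer = []                  # the in-memory "log buffer"
            self.buffered = 0

        def append(self, payload: bytes) -> int:
            """Generate a log record; cheap, no I/O. Returns its LSN."""
            lsn = self.f.tell() + self.buffered
            self.buffer.append(struct.pack(">I", len(payload)) + payload)
            self.buffered += 4 + len(payload)
            return lsn                        # callers can stamp this on pages (pagelsn)

        def flush(self):
            """Must be called at the right places: before a dirty page is written
            (WAL) and before a commit is acknowledged. A real system would flush
            only up to a requested LSN rather than everything."""
            self.f.write(b"".join(self.buffer))
            self.buffer.clear()
            self.buffered = 0
            self.f.flush()
            os.fsync(self.f.fileno())         # durable, not just in the OS cache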
• 2 approaches to logging
• “Physical logging”
• “Logical logging”
• Physical logging
• Log entire page images
• “redo record” : “log what the page is GOING
TO look like”
• “undo record” : “log what the page LOOKS
LIKE NOW”
• Problems?
• Inefficient, expensive
• Poor concurrency
• Problems
• Inefficiency mess
• Why log 2 copies of a page when I only
changed a few bytes?
• Concurrency mess
• Systems with concurrency control at a
finer granularity than the page cannot log
this way. We’ll come back to that.
• On the other hand
• Physical logging is appealing because it is
simple, and it works because of a nice
property of being “testable”
• We can look at a log record ABOUT a page,
then look at the page, and determine which
state it’s in because we RECORDED the
two possible states
• This turns out to be an essential property
of recovery
• Logical logging
• Log the high level operations only
• SQL
• INSERT INTO TBL(A) VALUES(1)
• REDO
• “INSERT INTO TBL(A) VALUES(1)”
• UNDO
• “DELETE FROM TBL WHERE A = 1”
• Elegant!
• Simple!
• Compact!
• but it doesn’t WORK!
• That SQL INSERT could decompose into
dozens of page writes.
• Some may have been done, then crash.
You can’t look at the pages and tell which
ones were done (UNDO THEM) and which
ones weren’t
• NO FORCE allows us to mark a transaction
“committed” WITHOUT writing all of the pages.
• Some may have been written, then crash
• We can’t tell which one WERE NOT written
(REDO them) and which ones WERE (leave
them alone)
• It’s often UNSAFE to perform actions multiple
times
• Making logical logging work - “Physiological
Logging”
• “Physical ABOUT pages, logical ABOUT the
contents INSIDE the page”
• The idea is to keep the logging centered on
the idea of pages, which works well
• But log less information than a physical
scheme would require
• Example Physiological Operation
• “Add item X to page N”
• Push down the logical concept into the page
level - logical INSIDE the page
• SQL INSERT statement will decompose into
several independent physiological operations
• Each one is INDEPENDENTLY TESTABLE /
UNDOABLE / REDOABLE
• AKA, “it works”
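• A sketch of one physiological record for the example above (“add item X to page N”), with the page kept deliberately simple: the record is physical about WHICH page it touches and logical about what happens INSIDE it, so it is independently testable, redoable, and undoable. Everything here is illustrative:

    from dataclasses import dataclass, field

    @dataclass
    class Page:
        page_no: int
        pagelsn: int = 0                      # LSN of the last record applied here
        items: list = field(default_factory=list)

    @dataclass
    class AddItemRecord:
        lsn: int
        txn: int
        page_no: int                          # physical part: exactly one page
        item: bytes                           # logical part: what to add inside it

        def redo(self, page: Page):
            # Testable: apply only if the page has not already seen this update.
            if page.pagelsn < self.lsn:
                page.items.append(self.item)
                page.pagelsn = self.lsn

        def undo(self, page: Page):
            # Logical inside the page: remove the item, wherever it ended up.
            # (A real system would also log a compensation record here and
            # advance the pagelsn; that detail is omitted.)
            if page.pagelsn >= self.lsn:
                page.items.remove(self.item)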
• Logging for purposes of recovery
• Key technique is based on something called
the “pagelsn”
• Intertwining of the buffer pool and the logger
• Each time you modify a page, store the LSN
of the log record describing that modification
ON THE PAGE ITSELF
• Testability
• Look at the pagelsn to determine state
• Write Ahead Logging (WAL) Protocol
• Tightly integrated with buffer pool
• Before a dirty page is written to disk, the
UNDO information for that page must be
durable
• Before a transaction is considered
committed, the REDO information for that
transaction’s pages must be durable
• And that’s how a NO FORCE / STEAL
system can convert random I/O into
sequential
• Basic idea behind recovery after crash
• REDO all COMMITTED transactions
• Some pages MAY NOT be written
• as allowed by NO FORCE
• UNDO all UNCOMMITTED transactions
• Some pages MAY HAVE BEEN written
• as allowed by STEAL
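• Putting the last two slides together, a compressed recovery sketch using the simplified record and page types sketched earlier. Real systems (notably ARIES) instead redo history for ALL transactions and then undo the losers; this follows the simpler statement above:

    def recover(log_records, pages, committed_txns):
        """log_records: oldest first, each with .lsn/.txn/.page_no and redo()/undo().
        pages: dict page_no -> Page as found on disk after the crash."""
        # Remember what actually reached the disk before we start changing pages.
        on_disk_lsn = {no: page.pagelsn for no, page in pages.items()}

        # REDO all COMMITTED transactions: NO FORCE means some of their pages
        # may never have been written before the crash.
        for rec in log_records:
            if rec.txn in committed_txns and on_disk_lsn[rec.page_no] < rec.lsn:
                rec.redo(pages[rec.page_no])

        # UNDO all UNCOMMITTED transactions: STEAL means some of their pages
        # may HAVE been written before the crash.
        for rec in reversed(log_records):
            if rec.txn not in committed_txns and on_disk_lsn[rec.page_no] >= rec.lsn:
                rec.undo(pages[rec.page_no])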
Concurrency Control
• Let’s talk about ACID now (finally?)
• We’ll use Chris Date’s definition
• Atomic
• A transaction fully completes or no part of it
does.
• Correct (Consistent)
• Transactions transform a database from one
correct state to another, not necessarily
enforcing correctness during the transition
between these two states
• Isolated
• Transactions are isolated from each other in
such a way that a transaction will be
“correct” regardless of what other
transactions may be simultaneously
executing
• Durable
• A “committed” transaction CAN NOT be
“lost” after a system failure
• Concurrency control intertwines with all of these
concepts.
• But mostly the I in ACID
• ISOLATED is really just a layman’s shorthand
for “SERIALIZABLE”
• Basic Serializability Theory
• A system which runs all transactions
sequentially (with no concurrency) produces
a “history” known as a “serial history”
• A serial history is BY DEFINITION correct
• You can’t have concurrency problems
WITHOUT CONCURRENCY!
• A system which allows for concurrency
produces histories comprised of the interleaved
execution of the concurrent transactions
• If that history can be said to be EQUIVALENT
to a serial history (one produced through non
concurrent execution) then the concurrent
system’s history is said to be SERIALIZABLE
• EQUIVALENT - “Produces the same output
and has the same effect on the database”
• Some formal notation
• rn[x] : Transaction n reads object x
• wm[y] : Transaction m writes object y
• cl : Transaction l commits
• “conflicting operations”
• r conflicts with w
• w conflicts with r
• w conflicts with w
• Conflict Serializability Testing
• A history can be considered equivalent to a
serial history if it holds that for all conflicting
operations the ordering of the conflicts is the
same
• r1[x] r1[y] w1[x] r1[z] c1 r2[x] r2[a] c2
• r1[x] r1[y] w1[x] r2[a] r2[x] r1[z] c1 c2
• r2[x] conflicts with w1[x]
• In both histories order of conflict is same
• Serializability Graph Testing
• A technique to analyze any history for
serializability is the “serialization graph”
• For each pair of committed transactions, add a
directed edge from T1 to T2 if any operation of
T1 conflicts with and precedes an operation of T2
• If the resulting graph contains NO CYCLES
then the history is serializable
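• A sketch of that test over a history written in the notation above (r/w/c tuples): build the serialization graph from conflicting pairs, then look for a cycle:

    def is_conflict_serializable(history):
        """history: list of (op, txn, obj) tuples in execution order,
        e.g. ("r", 1, "x"), ("w", 2, "x"), ("c", 1, None)."""
        ops = [(op, txn, obj) for (op, txn, obj) in history if op in ("r", "w")]
        graph = {}                                     # txn -> set of txns it precedes
        for i, (op1, t1, o1) in enumerate(ops):
            for op2, t2, o2 in ops[i + 1:]:
                if o1 == o2 and t1 != t2 and "w" in (op1, op2):
                    graph.setdefault(t1, set()).add(t2)    # conflict: t1 before t2

        def dfs(node, path):                           # any cycle reachable from node?
            path.add(node)
            for nxt in graph.get(node, ()):
                if nxt in path or dfs(nxt, path):
                    return True
            path.discard(node)
            return False

        return not any(dfs(node, set()) for node in graph)

    # The second history from the slide: serializable, since its only conflict
    # (w1[x] before r2[x]) appears in the same order as in the serial history.
    h = [("r", 1, "x"), ("r", 1, "y"), ("w", 1, "x"), ("r", 2, "a"),
         ("r", 2, "x"), ("r", 1, "z"), ("c", 1, None), ("c", 2, None)]
    print(is_conflict_serializable(h))                 # True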
• “Schedulers”
• Histories are said to be “produced” by the
execution of event as determined by the
“scheduler”
• This may or may not be a “real thing” in a
real system.
• As a mental model we consider the
scheduler to be a real thing whose job it is to
schedule the interleaving of transactions in
such a manner to produce serializable
histories
• “Conservative Schedulers”
• Err on the side of delaying execution
(blocking) in the hopes of producing
serializable histories
• Extreme case - no concurrent execution
allowed!
• “Aggressive Schedulers”
• Aim to run with more concurrency with the
understanding that non serializable histories
may be produced and later rejected
• Extreme case – SGT based validating
scheduler
• Locking based schedulers
• The most common real world schedulers all
involve forms of locking as the basic
mechanism
• Serializable histories are produced through
a locking technique called 2 Phase Locking
(2PL)
• 2PL Rules
• Acquire “read locks” on all objects read
• Acquire “write locks” on all objects written
• Only release locks at Commit
• It can be proven mathematically that all
possible histories output from a 2PL scheduler
are serializable
• It’s not that hard to convince yourself of this
intuitively without the math
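• A minimal lock table sketch of those rules (single node, no blocking or deadlock detection; a conflicting request simply returns False so the caller can wait or abort; names are illustrative):

    class TwoPhaseLocking:
        """Growing phase: acquire; shrinking phase: release everything at commit."""
        def __init__(self):
            self.readers = {}     # obj -> set of txn ids holding a read lock
            self.writer = {}      # obj -> txn id holding the write lock
            self.held = {}        # txn -> set of objects it has locked

        def read_lock(self, txn, obj):
            w = self.writer.get(obj)
            if w is not None and w != txn:
                return False                       # conflict: caller must block or abort
            self.readers.setdefault(obj, set()).add(txn)
            self.held.setdefault(txn, set()).add(obj)
            return True

        def write_lock(self, txn, obj):
            w = self.writer.get(obj)
            other_readers = self.readers.get(obj, set()) - {txn}
            if (w is not None and w != txn) or other_readers:
                return False                       # conflict with another reader/writer
            self.writer[obj] = txn
            self.held.setdefault(txn, set()).add(obj)
            return True

        def commit(self, txn):
            # The "2 phase" rule: no lock is released before this point.
            for obj in self.held.pop(txn, set()):
                self.readers.get(obj, set()).discard(txn)
                if self.writer.get(obj) == txn:
                    del self.writer[obj]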
• 2PL drawbacks
• 2PL can be overly conservative in many
cases, blocking transactions needlessly
when serializability would not have been
compromised
• 2PL suffers from deadlocks as it allows for
arbitrary interleaving of concurrent blocking
operations in no defined order
• Serialization Graph Testing (SGT) Schedulers
• At commit time build a serialization graph
and detect cycles.
• No real world system works this way
• Just too computationally expensive
• (fancy term for “slow”)
• Optimistic Concurrency Control (OCC)
Schedulers
• Track “read sets” and “write sets” of all
transactions
• At commit, ensure that no conflict between
these sets has occurred.
• Make sure no transaction that committed
after your BEGIN has any overlap in its
write set with your read set
• OCC Problems
• Trades the deadlock problem of 2PL for
the “rejection” problem of OCC
• Can be very difficult to efficiently track
conflicts.
• Difficult to allow high concurrency - “giant
lock” around “validate” and “commit” phases
• No real world system implements a 100%
pure OCC scheduler
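• A sketch of the validation step described above, in the “backward” style: at commit, the committing transaction’s read set is checked against the write sets of transactions that committed after it began (the structures here are assumptions of the sketch, not a real system’s API):

    def validate(txn, committed):
        """txn: object with .begin_ts, .read_set, .write_set (sets of object ids).
        committed: list of (commit_ts, write_set) for already-committed txns.
        On success the caller commits and appends (commit_ts, txn.write_set) to
        `committed`; the validate-then-commit step must be atomic (the "giant
        lock" problem noted above)."""
        for commit_ts, their_writes in committed:
            if commit_ts > txn.begin_ts and (their_writes & txn.read_set):
                return False        # someone overwrote what we read: reject and retry
        return True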
• Predicate based Concurrency Control
• SQL: “UPDATE X WHERE Y>5 AND Y<10”
• Don’t lock all the rows between 5 and 10
• instead lock the SINGLE PREDICATE of
• “5<y<10”
• Need not be a “lock” - Compatible with
OCC/Validating techniques as well
• Problems with predicates
• Gets very complicated very fast to support
arbitrarily complex predicates
• Gets really really complicated to detect
compatibility/conflicts between arbitrary
predicates - much worse than the basic OCC
problem
• But basic “degenerate” predicates have been
used in real systems. In some systems our
example would have been a “range lock”
• Less than serializable
• Many real world systems either do not fully
implement serializability or offer optional
(typically default) isolation levels that are
WEAKER than serializable
• This is almost ALWAYS done for reasons of
performance
• One very successful model of reduced
isolation in real systems is known as
“Snapshot Isolation”
• Snapshot Isolation (SI)
• An SI scheduler is frequently implemented
as a Multi Version Concurrency Control
(MVCC) system
• MVCC permits the notion of “versions” of
objects
• The notation r1[x] w2[y] is extended to r1[x2]
w2[y4]
• Transaction 1 reads version 2 of x
• Transaction 2 writes version 4 of y
• SI defines 2 rules for an MVCC system to
follow
• Each version of an object x that is READ BY
transaction T is the most recently committed
version of x as of the BEGIN of T
• 2 Transactions that overlap in BEGIN and
COMMIT time do not write to the SAME
OBJECT
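• A sketch of those two rules on top of a simple multi-versioned store: reads come from the snapshot as of BEGIN, and an overlapping write to the same object aborts the later committer. Names are illustrative; the write skew example a few slides down reuses this class:

    class SnapshotStore:
        def __init__(self):
            self.versions = {}        # obj -> list of (commit_ts, value), ascending
            self.clock = 0

        def begin(self):
            self.clock += 1
            return {"begin_ts": self.clock, "writes": {}}

        def read(self, txn, obj):
            # Rule 1: most recently committed version as of the transaction's BEGIN.
            for commit_ts, value in reversed(self.versions.get(obj, [])):
                if commit_ts <= txn["begin_ts"]:
                    return value
            return None

        def write(self, txn, obj, value):
            txn["writes"][obj] = value            # buffered until commit

        def commit(self, txn):
            # Rule 2: abort if someone committed a write to the same object
            # after this transaction began ("first committer wins").
            for obj in txn["writes"]:
                for commit_ts, _ in self.versions.get(obj, []):
                    if commit_ts > txn["begin_ts"]:
                        raise RuntimeError("write-write conflict: abort")
            self.clock += 1
            for obj, value in txn["writes"].items():
                self.versions.setdefault(obj, []).append((self.clock, value))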
• Problems with SI schedulers
• SI can generate histories that are not
serializable
• The main issue is referred to as “write skew”
- the idea is that a “skew” in time sometimes
becomes visible, as your reads and your writes
appear to execute at different points in time
• Write Skew
• Simple example - Imagine trying to enforce
an integrity constraint inside the application
(the db doesn’t know of this constraint)
• In a 2PL system, it’s easy - read all your
conditions before committing
• In an SI system that doesn’t work
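• A concrete example of the skew, reusing the SnapshotStore sketch above. The application constraint is “x + y must stay >= 0”, enforced by reading both values before withdrawing; under SI both withdrawals commit because the write sets are disjoint, whereas under 2PL the read locks on x and y would have forced one transaction to wait:

    store = SnapshotStore()
    setup = store.begin()
    store.write(setup, "x", 50)
    store.write(setup, "y", 50)
    store.commit(setup)                      # x + y == 100, constraint holds

    t1 = store.begin()
    t2 = store.begin()

    # Each transaction checks, against its own snapshot, that withdrawing 100
    # still leaves x + y >= 0 ...
    assert store.read(t1, "x") + store.read(t1, "y") >= 100
    assert store.read(t2, "x") + store.read(t2, "y") >= 100

    # ... then withdraws from a DIFFERENT object, so the write sets never overlap.
    store.write(t1, "x", store.read(t1, "x") - 100)
    store.write(t2, "y", store.read(t2, "y") - 100)
    store.commit(t1)                         # allowed by rule 2
    store.commit(t2)                         # also allowed: no common object written

    check = store.begin()
    print(store.read(check, "x") + store.read(check, "y"))   # -100: constraint broken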
• SI anomaly of write skew can be worked
around in the application if the DBMS provides
explicit “locking” primitives that can be used.
• In previous example, the application would be
responsible for “locking” the items read to
ensure serializability
• Oracle: SELECT FOR UPDATE
• Comdb2: SELECTV
• SI is often “good enough” and can provide
much greater concurrency in many cases than
2PL.
• The SI anomalies are not recognized by ANSI
SQL.
• So strangely, according to ANSI SQL, an SI
system actually IS serializable. (it isn’t)
• ANSI SQL defines isolation only in terms of
three anomalies: Dirty Read, Non Repeatable
Read, Phantom
• An SI scheduler can be implemented as a type of
aggressive, validating scheduler
• Retaining some of the aspects of OCC, deferring
the validation of w-w conflicts (rule 2, no overlap
in writes) until commit time
• An SI scheduler can also be built from a
conservative locking scheduler
• Write locks can be acquired to enforce the second
rule of SI, ensuring blocking or deadlock for non
compliant histories
Btrees
• The workhorse data structure of a Relational
Database System
• Most common choice for implementing an
index. Sometimes a choice for storing data too.
• Key Idea
• Like a binary tree (balanced) but allowing
more than one item on a “node” and more
than 2 siblings per node
• A node becomes a PAGE - out of practical
necessity
• Buffer pool wants pages
• Logging, recovery wants pages
• Concurrency control wants pages
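• A sketch of that key idea, ignoring insertion and splitting: each node (page) holds many sorted keys and one more child pointer than it has keys, and a lookup touches one page per level:

    from bisect import bisect_right

    class Node:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys                  # sorted list of keys on this page
            self.children = children          # None for a leaf page
            self.values = values              # payload, present on leaf pages

    def search(node, key):
        while node.children is not None:      # descend interior pages
            node = node.children[bisect_right(node.keys, key)]
        i = bisect_right(node.keys, key) - 1   # now on a leaf page
        if i >= 0 and node.keys[i] == key:
            return node.values[i]
        return None

    # Example: a 2-level tree; the root separates the leaves at key 10.
    leaf1 = Node([3, 7], values=["a", "b"])
    leaf2 = Node([10, 15], values=["c", "d"])
    root = Node([10], children=[leaf1, leaf2])
    print(search(root, 15))                   # "d"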
• A page in a Btree maps 1 to 1 to a page in the
buffer pool
• Which maps (somehow) into a block on a disk
• Buffer pool could overlay on disk
• Typically it overlays on filesystem
• Filesystem often further abstracted from
disk
• Hardware RAID, etc
• Logging is often physiological
• “Add item X to Btree” (operation) can generate
log record of
• “Insert item X into the array on page 2”
• Forms of logical logging are often used for
internal data structure maintenance
• A “page split” may be a logged event
• Key insight into Btree recovery
• If 2 Btrees ACT the same, then they ARE
the same
• Recovery NEED NOT create a bit for bit
perfect copy of the original data structure,
only one that is indistinguishable from the
original over all operations defined to be
supported by the Btree
• Concurrency control WITHIN the Btree is
typically based on a complex locking protocol
with the goal of allowing maximum concurrency
(reads and writes) to distinct pages in parallel
• Much like concurrency control in general, many
“exotic” non locking variants exist - few if any
are really used
• Simplified locking
• Always access tree from “top” (parent) to
“bottom” (leaf)
• Always hold lock on item ABOVE before
attempting access to item BELOW
• release lock on item ABOVE when you know
it’s “safe” (you won’t be going “up”)
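• A sketch of that descent, sometimes called “lock coupling” or “crabbing”; child_of and the per-page .lock attribute are assumptions of the sketch, not a real API. For reads it is always safe to drop the parent once the child is held; for writes, “safe” also means the child cannot split or merge:

    def find_leaf(root, key, child_of):
        """Descend from root to leaf holding at most a parent/child pair of locks.
        child_of(page, key) returns the next page down, or None at a leaf."""
        page = root
        page.lock.acquire()                  # lock ABOVE first
        while True:
            child = child_of(page, key)
            if child is None:
                return page                  # leaf, returned still locked; caller releases
            child.lock.acquire()             # lock BELOW while still holding ABOVE
            page.lock.release()              # now safe to let go of the parent
            page = child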
• Page level granularity for concurrency in the
Btree structure
• Real systems often provide transactions with
finer (ROW) granularity for concurrency than
pages
• Key insight
• A form of logical logging
• Call the low level (page oriented) work the
“physical” level and the high level (descriptive)
work the “logical level”
• A logical operation to a Btree could be “insert
X” while physical (physiological) is
• “Add X to page 43” or even something like
• “split page 43 into pages 43 and 54, update
page 532 (the parent) to see the new sibling,
update page 87 (to the right of 54) to point
left to 54, add X to page 54”
• Use logical logging on the Btree for undo
• A tree need not be the same if it can ACT
the same!
• In our previous example, we CAN’T
physically undo. If we released our page
locks BEFORE transaction commit (which
we HAVE TO if we want better than page
granularity) then another committed
transaction could have put data into newly
created page 54
• A physical undo would remove page 54.
• It would cause the loss of data from a
subsequent COMMITTED transaction!
• We need to logically undo - leave the tree
structure alone
• Remove X from 54 is all we need to do.
• Modified 2PL protocol for row level concurrency
and serializability
• When reading row X obtain read lock on row
• When writing row Y, obtain write locks on
pages modified by row write, obtain row lock on
row Y, release page locks on pages modified
by row write
• Row locks follow 2PL protocol, always held
until commit
• Page locks are released early
Relational Layer
• Relational Algebra defines 8 primitive
operations
• RESTRICT: Choose rows
• PROJECT: Choose columns
• PRODUCT: Multiply 2 sets of columns
• UNION: Add 2 sets of rows
• INTERSECT: Produce set of rows in common
between 2 sets
• DIFFERENCE: Remove from set 1 the rows it
has in common with set 2
• NATURAL JOIN: Produce a set of rows based
on common values of a column
• DIVIDE: Opposite of PRODUCT
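• A sketch of a few of those operators over tables represented as lists of dicts, just to make the operators concrete (RESTRICT, PROJECT, PRODUCT, NATURAL JOIN):

    def restrict(table, predicate):                 # σ : choose rows
        return [row for row in table if predicate(row)]

    def project(table, columns):                    # π : choose columns
        return [{c: row[c] for c in columns} for row in table]

    def product(t1, t2):                            # × : every pairing of rows
        return [{**r1, **r2} for r1 in t1 for r2 in t2]

    def natural_join(t1, t2):                       # ⋈ : pair rows agreeing on shared columns
        shared = set(t1[0]) & set(t2[0]) if t1 and t2 else set()
        return [{**r1, **r2} for r1 in t1 for r2 in t2
                if all(r1[c] == r2[c] for c in shared)]

    users = [{"uuid": 123, "name": "ann"}, {"uuid": 456, "name": "bob"}]
    print(project(restrict(users, lambda r: r["uuid"] == 123), ["name"]))
    # [{'name': 'ann'}]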
• Relational Algebra is a procedural way of
expressing a problem
• Lay out the “steps” in terms of the “operators”
• Not procedural in terms of “implementation”
• The implementation of each operator is a
procedural operation.
• The algebra has specific rules (in terms of
what is commutative, etc) which can be used
for simplification of expressions
• Relational Languages
• In practice, nobody is writing any math to
run a query! No relational algebra, no
relational calculus
• SQL is the dominant (only?!) Relational
Language. Others have existed.
• Informix: “Informer”
• Ingres: QUEL / PostgreSQL: POSTQUEL
• System R: SEQUEL
• The purpose of a relational language is to
expose enough power to allow one to express
anything that would be possible in the
relational algebra.
• Codd termed this to be “relationally complete”
• SQL is a relationally complete language
• Inspired by parts of the calculus, parts of the
algebra, and a desire to be “english like”
rather than “mathematical”
• SQL is a “compiled” programming language
• The database parses SQL then compiles it into
an intermediary form for execution
• Conceptually, this intermediary form can be
thought of as relational algebra
• This compiled form is often referred to as the
“query plan”
• SELECT * from users WHERE uuid=123;
• σ uuid=123 (users)
• SELECT name, age FROM users WHERE
numchildren > 2 and numcars > 3;
• π name, age (σ numchildren>2 and numcars>3
(users))
• Producing a query plan is HARD WORK!
• It’s the job of a component called the “query
planner”
• A single query can be represented by an infinite
number of query plans
• Most are absurd and would never be
generated by anything other than a defective
or malicious planner
• Some are MUCH LESS WRONG
• But only 1 is THE BEST (for this input!)
• The job of the planner is to quickly prune down
the search space of plans to ones that might
have a chance at being good, then quickly
evaluate the “goodness” of the remaining
choices
• Quick - This is an overwhelming source of
tension in the planner - quick vs correct
• If the system took 1 minute to generate a
plan to run your query in 1 second or 1
second to get a plan to run your query in 5
seconds, which would you choose?
• 2 Main approaches
• “Rules” based optimization
• Follow specific mechanical rules about
the way the SQL was written to produce a
plan
• “Cost” based optimization
• Use heuristics to evaluate multiple plans,
looking for the one with the lowest “cost”
• Most real world systems today are cost based,
with some cases of using rules
• It may be advantageous to employ boolean
algebra to rewrite expressions containing
ANDs and NOTs to contain ORs if your
system allows for OR to be implemented
with multiple indexes (and not AND)
• Called a “query rewrite rule”
• Mechanically followed as considered to be
“always good”
• Many of the early systems were purely rules
based
• A “bad rule” - The order that the tables are
listed in should be the order (inner/outer) of
the tables in the nested loop of a JOIN
• Exactly what the rules based systems did for
years
• (including the first version of Comdb2)
• Cost based optimization is based on the
concept of “statistics.”
• The database keeps internal statistics about
the CONTENTS of the data
• SELECT * from tbl where X=5
• Table scan on tbl, filtering on X=5
• Index lookup on X=5
• Which is better? It depends
• In most real systems, a table scan is faster
than an index scan when the “break even
point” is reached - more than a % of rows
visited
• The system needs to “know” which % of the
rows in the table are likely to contain X=5
• Only with that information can it choose the
fastest plan. (for this input and this data!)
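• A toy version of that decision, assuming the planner keeps a per-(column, value) selectivity estimate and a rough break-even fraction; the numbers and names are illustrative, not from any real system:

    def choose_access_path(stats, table, column, value, break_even=0.10):
        """stats.selectivity(table, column, value) -> estimated fraction of
        matching rows; returns the cheaper plan under a crude cost model."""
        fraction = stats.selectivity(table, column, value)
        if fraction > break_even:
            return ("table_scan", table, f"filter {column}={value}")
        return ("index_lookup", f"{table}.{column}", value)

    class FakeStats:
        # e.g. a histogram says ~2% of the rows in tbl have X=5
        def selectivity(self, table, column, value):
            return 0.02

    print(choose_access_path(FakeStats(), "tbl", "X", 5))
    # ('index_lookup', 'tbl.X', 5)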
• SELECT * from tbl where X=5 and Y=6
• Use index on X=5 and filter for Y=6
• Use index on Y=6 and filter for X=5
• SELECT * from tbl1, tbl2 where tbl1.a=tbl2.a
• For every row in tbl1 look into tbl2 with an
index to find corresponding a
• For every row in tbl2 look into tbl1 with an
index to find corresponding a
• Real systems gather all sorts of statistics about
the data which all feed into the query planner
• Size of table, Size of indexes
• Selectivity of indexes
• Distribution of values in indexes
• Sampling of commonly occurring values
• And more. An open field, still filled with
trade secrets.
• Running a query
• The query planner ultimately generates a
“program” in the form of some internal
intermediary representation of the
procedural execution of the query which is
handed to the “query executor” for
execution.
• The query executor is a “customer” of all the
subsystems
• Query executor
• Uses Btrees for access to indexes
• Uses concurrency control to support the
SQL notion of a transaction
• Uses logging to make modifications Durable
and Atomic
• Uses the buffer pool to retrieve items from
disk
Real world Systems
• DB2
• Oracle
• Postgres
• Comdb2
DB2
• Provides serializable isolation
• Uses a 2PL locking protocol
• Complex Btree locking techniques
• Next-key locking
• Key/Value locking
• Key-range locking
• NO FORCE / STEAL buffer policy
• System R was originally FORCE / NO
STEAL
• System R didn’t even log
• Over time, it became clear that logging + no
force / steal is the key to high performance
systems
• UNDO and REDO logging
• Sophisticated cost based query planner
• Cost based query planning was invented in
the System R project, described by Selinger
in a paper published in 1979
• Oracle sold a Rules based planner until
1992
• Row level locking
• System R was using Row level locking in the
late 70s, early 80s.
• DB2 gained row locking in 1995 (mainframe
only; it took even longer to reach UNIX)
• Oracle gained row locking in 1988
Oracle
• Oracle is its own system - Shares nothing at
all with System R
• Many interesting approaches to solving the
same problems were developed
• Provides Snapshot Isolation
• Does not use 2PL, instead uses a form of
MVCC
• The buffer pool itself in Oracle is versioned
• Objects (rows) are not versioned per se
• The pages they exist on are
• When an update occurs, pages are modified IN
PLACE.
• When a read needs to see an earlier version of
a page, the UNDO logs are consulted to
recreate a prior version of this page and place
it into the buffer pool
• The algorithm is roughly based on usage of the
“pagelsn” (Different terminology in Oracle,
same rough concept)
• When you start a transaction (snapshot) record
the current LSN as your “birthlsn”
• If you are looking at a page, and the page has
a pagelsn LESS THAN the birthlsn of your
transaction, then you know you are meant to
see everything on that page
• Else, use UNDO log records to reconstruct a
version of that page that now has a pagelsn
LESS THAN your birthlsn
• Place the new (old) page in the buffer pool, proceed
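• A pseudocode-style sketch of that check, using the generic terminology of this talk (pagelsn / birthlsn); the undo-record and buffer-pool methods named here are assumptions of the sketch, not Oracle’s actual interfaces:

    def page_for_snapshot(buffer_pool, undo_log, page_no, birthlsn):
        page = buffer_pool.get_page(page_no)
        if page.pagelsn < birthlsn:
            return page                      # everything on the page is visible

        # Otherwise clone it and apply UNDO records for this page, newest first,
        # until the reconstructed version predates the snapshot.
        old = page.copy()
        for rec in undo_log.records_for_page(page_no, newest_first=True):
            if old.pagelsn < birthlsn:
                break
            rec.undo(old)                    # assumed to roll the pagelsn back too
        buffer_pool.cache_version(page_no, old)   # keep the rebuilt page around
        return old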
• UNDO and REDO logging
• NO FORCE / STEAL buffer management
• UNDO and REDO logs are physically “split”
into 2 distinct data structures
• The REDO logs act like a conventional “log”
file in Oracle
• The UNDO logs have a much more complex
organization for performance reasons due to
the unique requirements Oracle places on
UNDO for MVCC
• Oracle’s locking protocol is relatively simple
• MVCC takes care of many of the issues that
DB2 solves with locking
• No long term read locking ever, not even on
rows.
• “first rule of SI” enforced through MVCC
policy of producing most recently committed
data as of BEGIN
• Long term write locks taken on modified rows
• Used to enforce “second rule of SI”
PostgreSQL
• “Second system” developed after Ingres
• Not based on Ingres, at the time meant as a
proving ground for “new ideas”
• Key idea
• NO OVERWRITE
• At the row level. The buffer pool will
overwrite pages.
• Old versions of rows don’t disappear after
an update, they simply become “older
versions” of that row
• Used to implement an SI isolation model on
top of a row based MVCC system
• NO FORCE / STEAL
• REDO logging
• No UNDO logging!
• Able to get away with this because of the “no
overwrite” nature of updates!
• Earlier versions of PostgreSQL attempted to
run without logging. They used a FORCE
policy
• Eventually came to the same conclusions as
everyone: LOG + NO FORCE + STEAL
Comdb2
• “Second system” developed after Comdbg
• Attempt to produce a relational system
maintaining some level of compatibility with
earlier pre relational systems.
• Provides Snapshot Isolation
• Rows are versioned
• Undo logs are used to reconstruct rows (not
pages)
• Does not use 2PL
• Uses a form of OCC
• Aggressive, Validating scheduler
• Attempt to run transactions concurrently in the
hope that the work to back out and retry will be
minimal
Future
• The RDBMS will continue to evolve as our
hardware continues to change
• X86 is dominant - overperforming, underpriced!
• Support that platform and support it well
• Memory is becoming cheap and huge
• Assumptions about what is reasonable to
keep in memory and what is on disk are
changing
• Networks are reaching latency levels
comparable to SMP interconnects
• Distributed systems are more realistic now
• Conversely, HIGH LATENCY, low availability
(“the internet”) networks are becoming another
reality that must be acknowledged
• Research on relaxed isolation levels that scale
across these types of environments will
continue - the last word is far from said there
• Generally speaking, the highly available,
distributed systems will be the most able to
adapt and survive
• The idea of a “disk” is changing
• SSD challenges many assumptions about
“sequential” vs “random” access
• At best, “tuning” may be needed for some
RDBMS
• At worst, a “rewrite” may be in order
• SSD challenges the notion of the OVERWRITE
buffer policy being hands down superior.
• An SSD at heart (under the hood) IS a NO
OVERWRITE system. It’s easy to imagine a
NO OVERWRITE buffer pool manager plugged
DIRECTLY into SSD, bypassing file system
abstractions
• The best ideas from the “post relational” (NoSQL)
camp will converge with the ideas from the
RDBMS producing best of breed systems
• Ease of scaling across commodity hardware
• High availability DESPITE unreliable
hardware
• The two unstoppable ideas from the Relational
Systems will continue to be the reason why
these systems will dominate
• Data Abstraction
• Declarative languages