Slides - Cornell Computer Science

Inclusion of New Types in
Relational Database Systems
Michael Stonebraker
Why ORDBMS?
• Allows the addition of complex data and the
use of a query language to access this data
(e.g. insurance)
• Conveniently supports new applications
(e.g. Internet, photography)
Complete Extended Type System
• Allow the definition of user-defined data types
(e.g. 2-D boxes)
• Allow the definition of new operators for these
data types (e.g. overlaps, contained in)
• Allow the implementation of new access methods
for data types (e.g. R-trees)
• Allow optimized query processing for commands
containing new data types and operators
Motivating Example
Consider a relation consisting of data on two-dimensional
boxes. It can be represented by an identifier and the
coordinates of two corner points as follows:
create box (id = i4, x1 = f8, x2 = f8, y1 = f8, y2 = f8)
Now consider the query to find all the boxes that overlap the
unit square, i.e. the box with coordinates (0, 1, 0, 1). A
representation of this request in QUEL is as follows:
retrieve (box.all) where not (box.x2 <= 0 or box.x1 >= 1 or
box.y2 <= 0 or box.y1 >= 1)
Problems
• The command is too hard to understand.
• The command is too slow because the query
planner will not be able to optimize
something this complex.
• The command is too slow because there are
too many clauses to check.
Solution
Support a box data type so that the box relation and the
resulting user query can be defined as follows:
create box (id = i4, desc = box)
retrieve (box.all) where box.desc !! “0, 1, 0, 1”
Here “!!” is an overlaps operator with two operands of data
type box which returns a boolean.
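As a rough illustration that the “!!” operator and the four-clause QUEL qualification express the same test, here is a minimal Python sketch. The Box class and its overlaps method are illustrative names, not part of the paper's prototype. The field order (x1, x2, y1, y2) follows the create command in the motivating example, and x1 <= x2, y1 <= y2 is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Field order follows the slide; x1 <= x2 and y1 <= y2 are assumed.
    x1: float
    x2: float
    y1: float
    y2: float

    def overlaps(self, other: "Box") -> bool:
        """Generalise the QUEL qualification from the unit square to any box."""
        return not (
            self.x2 <= other.x1 or self.x1 >= other.x2
            or self.y2 <= other.y1 or self.y1 >= other.y2
        )


UNIT_SQUARE = Box(0, 1, 0, 1)
boxes = {1: Box(0.5, 2, 0.5, 2), 2: Box(1, 2, 0, 1), 3: Box(-1, 3, -1, 3)}

# Rough equivalent of: retrieve (box.all) where box.desc !! "0, 1, 0, 1"
hits = [box_id for box_id, b in boxes.items() if b.overlaps(UNIT_SQUARE)]
```

Box 2 only touches the unit square along an edge, so the strict comparisons exclude it, exactly as the original qualification would.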
Intuition
• New user-defined type and operator make
resulting query more readable
• New user-defined access methods can allow
query planner to optimize query
Consequences
• Need ability to define operators for user-defined
types
• Require support for fast access paths for queries
- Extend current access methods (e.g. B-trees for
boxes using ascending area)
- Define new access methods (e.g. R-trees for
boxes using contained in)
• Require support for the query optimizer to
construct an efficient plan
Overview
• What will we discuss:
– ADTs
– New Access Methods
– Query Processing and Access Path Selection
Examples of Operators for Box
Data Type
Definition of New Types
The new type can be implemented as follows:
define type-name length = value,
input = file-name,
output = file-name
Definition of New Operators
Zero or more operators can be implemented
for the new type as follows:
define operator token = value,
left-operand = type-name,
right-operand = type-name,
result = type-name,
precedence-level like operator-2,
file = file-name
Example
define operator token = !!,
left-operand = box,
right-operand = box,
result = boolean,
precedence like *,
file = /usr/foobar
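One way to picture what such a definition records is a small operator catalog. The sketch below is an assumption-laden Python analogy: define_operator, apply_operator, and OPERATORS are hypothetical names, the Python callable stands in for the object file named by "file =", and precedence is left out.

```python
from typing import Callable, NamedTuple


class Operator(NamedTuple):
    token: str
    left: str
    right: str
    result: str
    proc: Callable  # stands in for the code loaded from "file ="


# Catalog keyed by token and operand types (hypothetical structure).
OPERATORS: dict = {}


def define_operator(token, left, right, result, proc):
    """Register an operator under its token and operand types."""
    OPERATORS[(token, left, right)] = Operator(token, left, right, result, proc)


def apply_operator(token, left_type, a, right_type, b):
    """Look up the operator for these operand types and call its procedure."""
    return OPERATORS[(token, left_type, right_type)].proc(a, b)


def box_overlaps(a, b):
    # Boxes are (x1, x2, y1, y2) tuples, as in the create command.
    return not (a[1] <= b[0] or a[0] >= b[1] or a[3] <= b[2] or a[2] >= b[3])


define_operator("!!", "box", "box", "boolean", box_overlaps)
result = apply_operator("!!", "box", (0.5, 2, 0.5, 2), "box", (0, 1, 0, 1))
```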
Comments on the Prototype
• Problem: ADT routines are a security loophole
• Possible solutions:
- Run in separate address space
- Interpret ADT procedure
- Use hardware support for protected procedure calls
• Author’s solution: Provide two environments for ADT
procedures
- Protected environment for debugging
- Unprotected one for performance
Registering New Access Methods
• Basic idea:
- An access method contains a small number of
procedures that define its characteristics
- Replace these by others which operate on
a different data type
Example
• Consider a B-tree and the following generic query:
retrieve (target-list) where relation.key OPR value
- Supports fast access if OPR is in {=, <, <=, >=, >}
- Includes procedure calls to support these
operators for a particular data type
• We just have to write procedures for our new
operators which must have properties P1, …, P7
and the B-tree will function correctly!
Example (cont’d)
• Appropriate information recorded on two access
method templates:
- TEMPLATE-1 describes conditions which must
be true for the operators provided by the access
method (only used by humans)
- TEMPLATE-2 provides necessary information
on the data types of operators
• In AM relation, designer can implement one or
more collections of operators which satisfy the
template
F1 = (value – low-key) / (high-key – low-key)
F2 = (high-key – value) / (high-key – low-key)
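The two formulas above are plain arithmetic over a key range. The short Python sketch below only evaluates them. Reading F1 as the fraction of the key range below value and F2 as the fraction above it is an interpretation, since the slide text does not say how they are used.

```python
def f1(value, low_key, high_key):
    """F1 = (value - low-key) / (high-key - low-key)."""
    return (value - low_key) / (high_key - low_key)


def f2(value, low_key, high_key):
    """F2 = (high-key - value) / (high-key - low-key)."""
    return (high_key - value) / (high_key - low_key)


# Hypothetical key range 0..100 with a probe value of 25.
below = f1(25, 0, 100)
above = f2(25, 0, 100)
```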
Example (cont’d)
• User can modify relations to B-tree using any
class of operators defined in AM relation as
follows:
modify box to B-tree on desc using area-op
• Secondary index can also be constructed as
follows
index on box is box-index (desc) using area-op
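To make the idea of reusing a B-tree with a different operator class more concrete, the sketch below orders boxes by ascending area and answers operator-driven scans. It is a simplification: a sorted Python list stands in for the B-tree, AreaIndex and area are illustrative names, and "AG" (area greater than) is an assumed operator name, while AE and AL appear later in these slides.

```python
import bisect


def area(box):
    x1, x2, y1, y2 = box
    return (x2 - x1) * (y2 - y1)


class AreaIndex:
    """Sorted-list stand-in for a B-tree keyed on box area."""

    def __init__(self, boxes):
        self.entries = sorted(boxes, key=area)
        self.keys = [area(b) for b in self.entries]

    def scan(self, op, value):
        """Return the boxes whose area relates to area(value) as op demands."""
        k = area(value)
        lo = bisect.bisect_left(self.keys, k)
        hi = bisect.bisect_right(self.keys, k)
        if op == "AL":  # area less than
            return self.entries[:lo]
        if op == "AE":  # area equal
            return self.entries[lo:hi]
        if op == "AG":  # area greater than (assumed name)
            return self.entries[hi:]
        raise ValueError(f"operator {op} is not in this operator class")


index = AreaIndex([(0, 1, 0, 1), (0, 2, 0, 2), (0, 3, 0, 1), (0, 1, 0, 4)])
smaller = index.scan("AL", (0, 2, 0, 2))
equal = index.scan("AE", (0, 2, 0, 2))
```

The index code never inspects a box directly. It only calls area and compares the results, which is the sense in which swapping a handful of procedures retargets the access method.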
Implementing New Access
Methods
• Collection of procedure calls that retrieve and update
records.
• Need to construct open, close, get-first, get-next, get-unique, insert, delete, replace, and build
• Open and close are usually universally usable and designer
only needs to construct the remaining procedures
• Replace and delete do not require modification if the same
physical page layout as some existing access method is
used
Implementation Problems
• Interface to transaction management code
• Concurrency control subsystem issues
• Interface to buffer manager
• Only briefly discussed in the paper
Query Processing And Access
Path Selection
• Require four pieces of information when defining an
operator to allow optimization:
- Selectivity factor, Stups, estimates the expected number
of records satisfying the clause:
where rel-name.field-name OPR value
- Selectivity factor, S, is the expected number of records
which satisfy the clause:
where relname-1.field-1 OPR relname-2.field-2
- Feasibility of merge-sort
- Feasibility of hash-join
Example
define operator token = AE,
left-operand = box,
right-operand = box,
result = boolean,
precedence like *,
file = /usr/foobar,
Stups = 1,
S = min (N1, N2),
merge-sort with AL,
hash-join
Generating Query Processing
Plan
• Assumptions:
- Relations stored keyed on one field in a single
file
- Secondary indexes can exist for other fields
• Queries involving a single relation can be
processed as follows:
- Scan of the relation
- Scan of a portion of the primary index
- Scan of a portion of a secondary index
Generating Query Processing
Plan (cont’d)
• Joins can be processed as follows:
- Iterative substitution
- Merge-sort
- Hash-join
• Modify standard query planner to compute best
plan using appropriate rules to generate legal plans
and the selectivities provided
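The rule for generating legal join plans can be sketched as a filter over the three join strategies, driven by the hints supplied when the operator was defined. This is a hypothetical Python outline, not the planner described in the paper. OperatorInfo and legal_join_methods are made-up names, iterative substitution is assumed to be always applicable, and "!!" is assumed to have been defined without any join hints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OperatorInfo:
    token: str
    merge_sort_with: Optional[str] = None  # ordering operator, e.g. "AL"
    hash_join: bool = False


def legal_join_methods(op: OperatorInfo) -> list:
    """List the join strategies that are legal for a join clause using op."""
    methods = ["iterative substitution"]  # assumed to be always applicable
    if op.merge_sort_with is not None:
        methods.append("merge-sort")
    if op.hash_join:
        methods.append("hash-join")
    return methods


# AE as defined in the earlier example: merge-sort with AL, hash-join allowed.
area_equal = OperatorInfo("AE", merge_sort_with="AL", hash_join=True)
# "!!" with no join hints (an assumption for illustration).
overlaps = OperatorInfo("!!")
```

A planner would then cost each legal method using the supplied selectivities and keep the cheapest.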
Summary
• Main contributions of paper:
- Shows how to adapt existing access
methods for new data types
- Explains how to code new access methods
- Demonstrates how to support automatic
generation of optimized query plans
The Postgres Next-Generation
Database Management System
Michael Stonebraker
Greg Kemnitz
Applications of DBMS
• Data management (traditional)
• Object management (new)
• Knowledge management (new)
• An example which requires services in all
three dimensions is an application that
stores and manipulates text and graphics to
facilitate the layout of newspaper copy.
Postgres Data Model and Query
Language
• Orientation toward database access from a query
language
- Emphasis on query language, optimizer, and runtime system
• Orientation toward multilingual access
- No programming language-specific tight
integration
• Small number of concepts
- Classes, inheritance, types, and functions
Classes (Constructed Types,
Relations)
• Named collection of instances (records, tuples) of objects
- Each instance has same collection of named attributes
- Each attribute is a specific type
- Each instance has a unique (never-changing) identifier (OID)
• Can be created as follows:
create EMP (name = c12, salary = float, age = int)
• Can inherit data elements from other classes:
create SALESMAN (quota = float) inherits EMP
• POSTGRES allows a class to inherit from an arbitrary collection
of other parent classes (multiple inheritance)
Classes (cont’d)
• Three kinds of classes in POSTGRES: real
classes, derived classes, and versions.
- Instances of a real (or base) class are stored in
the database.
- Instances of a derived class (also called a view or
virtual class) are not physically stored but are
materialized only when necessary.
- A version of another class is stored as a
differential relative to its parent class.
Types
• Three kinds of types in POSTGRES: base types, arrays of
base types, and composite types.
• Base types include hard-wired types (e.g. integers, floats,
character strings) and constructed ADTs
- Can assign values to attributes of base types in
POSTQUEL by either specifying a constant or a function
which returns the correct type
• Arrays of base types are supported using standard bracket
notation and we could define a class as follows:
create EMP (name = c12, salary = float[12], age = int)
retrieve (EMP.name) where EMP.salary[4] = 4000
Types (cont’d)
• Composite types allow a user to construct complex objects, that is,
attributes which contain other instances as part or all of their value.
- Complex objects have a hierarchical internal structure
- Zero or more instances of any class automatically form a composite
type. For example:
create EMP (name = c12, salary = float[12], age = int, manager = EMP,
coworkers = EMP)
• Note, each time a class is constructed, a type is automatically available
to hold a collection of instances of the class.
• POSTGRES also supports a final constructed type, set, whose value is
a collection of instances from all classes. For example, hobbies
information can be added to the EMP class as follows:
add to EMP (hobbies = set)
Types (cont’d)
• Path expressions:
- Elements of an attribute that are a composite type can be
hierarchically addressed by nested dot notation. For
example, one could write:
retrieve (EMP.manager.age) where EMP.name = “Joe”
• Composite types can have a value that is a function which
returns the correct type. For example:
replace EMP (hobbies = compute-hobbies(“Jones”)) where
EMP.name = “Jones”
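The nested dot notation behaves much like attribute access on nested objects. A minimal Python analogy follows. The Emp class and the sample employees are assumptions made for illustration, and a single manager instance stands in for the composite attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Emp:
    name: str
    age: int
    manager: Optional["Emp"] = None  # composite attribute: another EMP instance


boss = Emp("Ann", 52)
emps = [boss, Emp("Joe", 30, manager=boss)]

# Rough equivalent of: retrieve (EMP.manager.age) where EMP.name = "Joe"
ages = [e.manager.age for e in emps if e.name == "Joe" and e.manager is not None]
```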
Functions
• Three different kinds of functions in
POSTGRES:
- C functions
- Operators
- POSTQUEL functions
C Functions
• To be able to perform complex calculations on objects, POSTGRES
supports C functions.
• Can define an arbitrary number of C functions whose arguments are
base or composite types
• Can have an argument which is a class name. For example:
retrieve (EMP.name) where overpaid (EMP)
- Inherited down the class hierarchy in the standard way
- Can be considered as a new attribute for the class whose type is the
return type of the function. For example:
retrieve (EMP.name) where EMP.overpaid
• Queries with C functions in the qualification cannot be optimized by
the POSTGRES query optimizer. For example, the preceding query
will result in a sequential scan of all instances of the class.
Operators
• To be able to use indexes in processing queries, POSTGRES
supports operators.
• Operators are functions with one or two operands which use the
standard operator notation in the query language. For example:
retrieve (DEPT.dname) where DEPT.floorspace AGT “(0,0), (1,1), (0,2)”
• Only available for operands which are base types
- Access methods support fast access to specific fields in records
- Unclear what an access method should do for a constructed
type
Operators (cont’d)
• To assist the query optimizer, hints such as the
negator of an operator can be included in the
definition of an operator.
• For example, the following query cannot be
optimized as written, but with a negator hint it can be
rewritten as the previous query, which can be:
retrieve (DEPT.dname) where not DEPT.floorspace ALE “(0,0), (1,1), (0,2)”
• Information on available access paths is stored in
the POSTGRES system catalogs.
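A negator hint lets the optimizer remove a leading "not" before it looks for an index. The Python sketch below shows that single rewrite step under simplifying assumptions: clauses are modelled as tuples, NEGATOR and push_down_not are hypothetical names, and only the AGT/ALE pair from the slides is registered.

```python
# Negator hints recorded at operator definition time (AGT/ALE pair from the slides).
NEGATOR = {"AGT": "ALE", "ALE": "AGT"}


def push_down_not(clause):
    """Rewrite ("not", (field, op, value)) to use op's negator when one is known."""
    if clause[0] == "not":
        field, op, value = clause[1]
        if op in NEGATOR:
            return (field, NEGATOR[op], value)
    return clause  # nothing to rewrite, or no negator registered


polygon = "(0,0), (1,1), (0,2)"
rewritten = push_down_not(("not", ("DEPT.floorspace", "ALE", polygon)))
```

After the rewrite the clause has the form field-operator-constant, which is the shape an access method can serve.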
POSTQUEL Functions
• Any collection of commands in the POSTQUEL query language can
be packaged together and defined as a function. For example:
define function high-pay returns EMP as
retrieve (EMP.all)
where EMP.salary > 50000
• POSTQUEL functions can also have parameters. For example:
define function sal-lookup (c12) returns float as
retrieve (EMP.salary)
where EMP.name = $1
• Can be placed in a query or directly executed using the fast path
facility
POSTQUEL Functions (cont’d)
• Attributes of a composite type automatically have values which are
functions that return the correct type. For example, consider the
following function and command:
define function mgr-lookup (c12) returns EMP as
retrieve (EMP.all)
where EMP.name = DEPT.manager and DEPT.name = $1
append to EMP
(name = “Sam”, salary = 1000, age = 40,
manager = mgr-lookup(“shoe”))
• Like C functions, POSTQUEL functions can have a specific class as
an argument and can either be thought of as functions or as new
attributes.
POSTGRES Query Language
• We already saw: User-defined functions and
operators, arrays, path expressions
• Support for nested queries
• Transitive closure
• Support for inheritance
• Support for time travel
Nested Queries
• POSTQUEL allows queries to be nested and
has operators that have sets of instances as
operands. For example:
retrieve (DEPT.dname)
where DEPT.floor NOT-IN
{D.floor from D in DEPT where
D.dname != DEPT.dname}
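Read informally, the query returns departments that share their floor with no other department. The same logic in Python, using a set comprehension for the inner query (the sample departments are made up for illustration):

```python
depts = [
    {"dname": "shoe", "floor": 1},
    {"dname": "toy", "floor": 1},
    {"dname": "candy", "floor": 2},
]

# Outer query: keep a department if its floor is NOT-IN the set of floors
# occupied by every other department (the nested query).
alone = [
    d["dname"]
    for d in depts
    if d["floor"] not in {o["floor"] for o in depts if o["dname"] != d["dname"]}
]
```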
Transitive Closure
• Allows a user to explode an ancestor hierarchy. For
example, consider the class parent (older, younger) and the
following query:
retrieve* into answer (parent.older) from a in answer
where parent.younger = “John” or parent.younger = a.older
- * after retrieve indicates that the associated query should be
run until the answer fails to grow
- * can also be used to indicate that a query should be run
over a class and all classes under it in the inheritance
hierarchy. For example:
retrieve (E.name) from E in EMP* where E.age > 40
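The "run until the answer fails to grow" semantics of retrieve* is a fixpoint iteration. A small Python sketch of the ancestor query follows. The function name and the sample parent pairs are assumptions, and the loop mirrors the query's two disjuncts.

```python
def ancestors(parent, person):
    """parent is a set of (older, younger) pairs; return all ancestors of person."""
    answer = set()
    while True:
        # One pass of the query: direct parents, or parents of known ancestors.
        new = {
            older
            for older, younger in parent
            if younger == person or younger in answer
        }
        if new <= answer:  # the answer failed to grow, so stop
            return answer
        answer |= new


parent = {("Mary", "John"), ("Ann", "Mary"), ("Tom", "Ann"), ("Sue", "Bob")}
result = ancestors(parent, "John")
```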
Time Travel
• Allows a user to run historical queries. For
example (T is a time):
retrieve (EMP.salary) from EMP [T] where
EMP.name = “Sam”
- POSTGRES will find the version of Sam’s
record valid at the correct time and get the
appropriate salary
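One way to picture a historical lookup is to keep every version of a record together with the interval during which it was valid. The Python sketch below assumes hypothetical tmin/tmax fields and integer timestamps. It does not describe how POSTGRES stores or indexes versions.

```python
import math

# Each version carries the half-open interval [tmin, tmax) in which it was valid.
emp_versions = [
    {"name": "Sam", "salary": 1000, "tmin": 0, "tmax": 10},
    {"name": "Sam", "salary": 1200, "tmin": 10, "tmax": math.inf},
]


def salary_at(name, t):
    """Return the salary from the version of `name` valid at time t, if any."""
    for v in emp_versions:
        if v["name"] == name and v["tmin"] <= t < v["tmax"]:
            return v["salary"]
    return None


# Rough equivalent of: retrieve (EMP.salary) from EMP [5] where EMP.name = "Sam"
old_salary = salary_at("Sam", 5)
```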
Fast Path
• Reason for fast path: Application may require direct
access to user-defined or internal POSTGRES function.
• POSTQUEL has been extended with:
function-name (param-list)
• User can execute any function known to POSTGRES.
(e.g. parser, optimizer, executor, access methods, buffer
manager, utility routines)
• Validity of parameters not checked
• Allows user program to call a function in another address
space rather than its own
Rule System
• Reasons for rule system: Users require support for
views, triggers, integrity constraints, referential
integrity, protection, and version control.
• POSTGRES rule system is a general-purpose rules
system that can perform all of these functions.
Rule System (cont’d)
• Rules have the form:
ON event (TO) object
WHERE POSTQUEL-qualification
THEN DO [instead] POSTQUEL-command(s)
- events: retrieve, replace, delete, append, new (replace or append), or
old (delete or replace)
- objects: name of a class or class.column
- POSTQUEL-commands: set of POSTQUEL commands with the
following two changes:
- new or current can appear instead of the name of a class in front of
any attribute
- refuse (target-list) is added as a new POSTQUEL command
Versions
• Innovative application of rule system
• Goal of versions: Create a hypothetical version of a class with the
following properties:
- Initially, the hypothetical class has all the instances of the base class
- The hypothetical class can be freely updated to diverge from the base
class
- Updates to the hypothetical class do not cause physical modifications
to the base class
- Updates to the base class are visible in the hypothetical class, unless
the instance updated has been deleted or modified in the hypothetical
class
Example
• Can create a version of a class as follows:
create version my-EMP from EMP
• This command is supported by two differential classes for EMP:
EMP-MINUS (deleted-OID)
EMP-PLUS (all-fields-in EMP, replaced-OID)
• The retrieve rule installed at the time the version is created is:
on retrieve to my-EMP
then do instead retrieve (EMP-PLUS.all)
retrieve (EMP.all)
where EMP.OID NOT-IN {EMP-PLUS.replaced-OID} and
EMP.OID NOT-IN {EMP-MINUS.deleted-OID}
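The retrieve rule amounts to a union of EMP-PLUS with the base rows that have been neither replaced nor deleted in the version. A Python sketch of that evaluation follows. The dictionary layout, the replaced_oid key, and the sample rows are assumptions for illustration.

```python
def retrieve_my_emp(emp, emp_plus, emp_minus):
    """emp maps OID -> row; emp_plus rows carry 'replaced_oid' (None for new rows)."""
    replaced = {r["replaced_oid"] for r in emp_plus}
    hidden = replaced | set(emp_minus)
    # retrieve (EMP-PLUS.all), dropping the bookkeeping column
    from_plus = [
        {k: v for k, v in r.items() if k != "replaced_oid"} for r in emp_plus
    ]
    # retrieve (EMP.all) where the OID is in neither differential class
    from_base = [row for oid, row in emp.items() if oid not in hidden]
    return from_plus + from_base


emp = {
    1: {"name": "Sam", "salary": 1000},
    2: {"name": "Joe", "salary": 900},
    3: {"name": "Fred", "salary": 800},
}
emp_plus = [{"name": "Sam", "salary": 1500, "replaced_oid": 1}]  # Sam updated
emp_minus = [2]                                                   # Joe deleted
my_emp = retrieve_my_emp(emp, emp_plus, emp_minus)
```

The base class is never touched: the version's updates live only in the two differential classes.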
Forward Chaining
• Generally, rules specify additional actions to be taken as a
result of user updates. These additional actions may
activate other rules, and a forward chaining control flow
results. For example:
on new EMP.salary
where EMP.name = “Fred”
then do replace E (salary = new.salary) from E in EMP
where E.name = “Joe”
Backward Chaining
• Now consider the following rule:
on retrieve to EMP.salary
where EMP.name = “Joe”
then do instead retrieve (EMP.salary) where EMP.name = “Fred”
• In this case, Joe’s salary is not explicitly stored, but it is
derived by activating the above rule. If Fred’s salary is not
explicitly stored, then further rules would be used to find
the ultimate answer and a backward chaining control flow
results.
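Backward chaining can be pictured as following retrieve rules until an explicitly stored value is reached. In the hypothetical Python sketch below, derived_from plays the role of the "do instead" rules, the rule for "Bill" is an invented second link that shows chaining, and the cycle check is an added safeguard rather than something the slides describe.

```python
stored_salary = {"Fred": 2000}

# Retrieve rules: asking for the key's salary is answered by the value's salary.
derived_from = {"Joe": "Fred", "Bill": "Joe"}


def retrieve_salary(name, seen=()):
    """Follow retrieve rules backwards until a stored salary is found."""
    if name in seen:
        raise ValueError("rules form a cycle")
    if name in derived_from:  # "do instead": the rule replaces the stored lookup
        return retrieve_salary(derived_from[name], seen + (name,))
    return stored_salary.get(name)


joe = retrieve_salary("Joe")
bill = retrieve_salary("Bill")
```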
Implementation of Rules
• Two implementations for POSTGRES rules:
- Through record level processing, the rules
system is called when individual records are
accessed, deleted, inserted, or modified.
- The second implementation is through query
rewrite.
Record Level Rule System
• A marker which contains the identifier of a rule is placed on an
attribute of an instance. If the executor touches a marked attribute,
then it calls the rules system before proceeding.
- Efficient if there are a large number of rules and each only covers a
few instances
- No extra overhead will be required unless a marked instance is
actually touched.
• However, consider the following rule and an incoming query:
on replace to EMP.salary then do append to AUDIT (name =
current.name, salary = current.salary, new = new.salary, user = user())
replace EMP (salary = 1.1 * EMP.salary) where EMP.age > 50
- In the record level rules system, the rule fires once for every
employee over 50, which is a large overhead.
Query Rewrite Module
• Solution: Rewrite the user command to the following:
append to AUDIT (name = EMP.name, salary = EMP.salary, new =
1.1 * EMP.salary, user = user()) where EMP.age > 50
replace EMP (salary = 1.1 * EMP.salary) where EMP.age > 50
- Auditing operation is done in bulk as a single command
- Preferable over the record level rule system
• This system will perform well if there are a small number of
large-scope rules and poorly if there are a large number of small-scope rules.
• Note that the two implementations are complementary.
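The trade-off can be illustrated by counting how often the rule machinery is invoked for the audit example. This is a toy Python model with made-up employee data and function names. It compares the number of invocations only and says nothing about real costs in POSTGRES.

```python
emps = [
    {"name": "A", "age": 55, "salary": 100.0},
    {"name": "B", "age": 60, "salary": 200.0},
    {"name": "C", "age": 30, "salary": 300.0},
]


def record_level(emps):
    """Fire the audit rule once per updated record."""
    audit, calls = [], 0
    for e in emps:
        if e["age"] > 50:
            calls += 1  # one rule-system call per marked record touched
            audit.append((e["name"], e["salary"], 1.1 * e["salary"]))
    return audit, calls


def query_rewrite(emps):
    """Run the audit as one bulk command next to the user's replace."""
    audit = [
        (e["name"], e["salary"], 1.1 * e["salary"]) for e in emps if e["age"] > 50
    ]
    return audit, 1  # a single extra command, however many rows match


audit_a, calls_a = record_level(emps)
audit_b, calls_b = query_rewrite(emps)
```

Both strategies produce the same audit rows. They differ in whether the work is paid per record or once per command.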
Storage System
• POSTGRES uses a no-overwrite storage manager.
• Old records remain in the database whenever an update occurs and
serve the purpose normally performed by a write-ahead log.
• POSTGRES, therefore, has no conventional log and only stores two
bits per transaction indicating whether each transaction is committed,
aborted, or in progress.
• This system allows for instantaneous crash recovery and time travel.
• Problem: Database will have committed instances intermixed with
instances that were written by aborted transactions.
• Solution: System must distinguish between these two and ignore the
latter.
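The filtering step can be sketched with a transaction status table and records tagged by the transaction that wrote them. Everything below is an illustrative Python model: the xid field, the status table layout, and the recovery step that treats in-progress transactions as aborted are assumptions, not details taken from the slides.

```python
COMMITTED, ABORTED, IN_PROGRESS = "committed", "aborted", "in progress"

# Transaction status table: three states fit in two bits per transaction.
xact_status = {1: COMMITTED, 2: ABORTED, 3: IN_PROGRESS}

# Records are never overwritten; each remembers the transaction that wrote it.
records = [
    {"name": "Sam", "salary": 1000, "xid": 1},
    {"name": "Sam", "salary": 9999, "xid": 2},
    {"name": "Joe", "salary": 900, "xid": 3},
]


def visible(records, status):
    """Keep only records written by committed transactions."""
    return [r for r in records if status.get(r["xid"]) == COMMITTED]


def recover(status):
    """After a crash, treat in-progress transactions as aborted (assumption)."""
    return {x: (ABORTED if s == IN_PROGRESS else s) for x, s in status.items()}


readable = visible(records, xact_status)
after_crash = recover(xact_status)
```

Under this model recovery touches only the small status table and never the data pages, which is one way to read the claim of instantaneous crash recovery.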
Storage System (cont’d)
• If stable memory is available, a no-overwrite storage manager is
superior to a conventional one.
• However, in the absence of stable memory, a no-overwrite
storage manager must force to disk all pages written by a
transaction at commit time because the effects of a committed
transaction must be durable in case a crash occurs and main
memory is lost. A conventional disk manager only needs to
force the log pages.
• Even if there are as many log pages as data pages, which is
unlikely, the conventional storage manager is performing
sequential I/O versus the no-overwrite storage manager which is
performing random I/O.
Conclusions
• Original development and organization of POSTGRES is
better than that of INGRES.
• Performance:
- POSTGRES is about twice as fast as UCB-INGRES
- On the other hand, it is 3/5 as fast as ASK-INGRES
(commercial version)
• While at the time of the publication, POSTGRES 2.1 was
work in progress and contained inefficiencies, it still
touched on many interesting ideas for an implementation
of an ORDBMS.