Distributed Databases and DDBMS

LU13 – DISTRIBUTED DATABASES AND DDBMS
LEARNING OBJECTIVES
This week we’ll learn about distributed data, distributed databases, and distributed database management systems. Our learning
objectives are:
• Describe various DDBMS implementations
• Explain how database design affects the DDBMS environment
• Apply DDBMS principles to solve problems
In terms of our methodology, shown below, this unit covers activities that take place in the design and implementation
phases:
In this learning unit we will explore the fundamentals of distributed data, distributed databases, and the features of the DBMS that
facilitate data distribution.
PART 1: ALL ABOUT DISTRIBUTED DATABASES
WHAT ARE THE DRIVING FORCES FOR DISTRIBUTING DATA?
There is little question that we are in the information age and firmly entrenched in a global economy. Against this
backdrop, organizations are becoming more and more geographically dispersed. How, then, will we provide decision makers with the
data they need to make business decisions, and how will organizations get that data in a timely manner? One
solution for getting data to where it is needed is to place it closer to where it is used. This week we’ll explore the notion of
distributed databases, which is one of our solutions for getting data closer to the people who need it.
WHAT IS A DISTRIBUTED DATABASE?
A distributed database is a single logical database (as seen by the user) that is physically spread over
multiple computing locations connected by a network. The important thing to remember is that the distributed database is truly a
database and not a loose collection of files and applications. A distributed database requires multiple database management
systems, with at least one DBMS running at each node of the network where part of the database is located.
What is the difference between a distributed database and a decentralized database? A decentralized database is also stored on
multiple computers in multiple locations but it is a collection of independent databases. The databases are not viewed as a single
logical entity, and the data in these individual databases cannot be shared.
In today’s day and age, distributed databases come in many shapes and forms, and serve a wide variety of purposes. Let’s look at
some of the conditions that fostered the growth of distributed databases:
• Distribution and autonomy of the business - as a result of globalization, mergers, and acquisitions, modern organizations have created geographically dispersed, autonomous business units. Each business unit demands local control over its data.
• Data sharing - most modern business decisions, regardless of the organization’s complexity, require cross-functional data. That is, data from other business units has become increasingly important.
• Data communications costs - the cost of continually moving high volumes of data to remote locations can be high. While these costs have decreased, thereby reducing the need for this form of data distribution, they remain a factor in the overall database design.
• Data communications reliability - moving large volumes of data across a network can be risky, so keeping data local reduces dependence on the network.
• Data communications timing - moving large volumes of data takes time. Having the data stored closer to where it is used decreases the delivery time to the user.
• Purchased software - many organizations rely on purchased software packages to help solve their business problems. Often these independent software packages use different databases and different DBMSs, so it is not unusual for an organization to have a variety of databases and DBMSs. The development focus then shifts from design and implementation to how these applications can be integrated with each other.
• Data recovery - many organizations use distributed databases as part of their data recovery strategy. By having duplicate databases geographically distributed, an organization can spread the risk and keep copies of its data available if one of the databases fails.
• Multiple uses - a distributed database gives an organization the option to use its databases for multiple purposes. One database can support the operational decision making and processing needs of the organization while another database supports the decision support needs of the tactical and strategic decision makers.
THREE PRIMARY OBJECTIVES OF THE DISTRIBUTED DATABASE
Before a distributed database can provide the capabilities users need to derive value from it, it must first meet
three objectives. These objectives give users confidence that the database maintains data integrity, meets
high performance thresholds, and provides easy data access. To meet these criteria a distributed database must
have:
1. Location transparency - even though the data is physically dispersed, the database has to perform as if all of the data were stored in one physical location. Location transparency ensures that the applications that use the database will work regardless of where the user and the data are located.
2. Replication transparency - even though a piece of data is physically stored in more than one location, the database treats the data as if it were stored in only one location. Many distributed databases rely on storing copies of data in several locations to improve performance. This fact must be abstracted away so that the end user perceives only one data source.
3. Failure transparency - a transaction that succeeds at one location must also succeed at all locations. Once a transaction commits it must survive all opportunities for failure; if the transaction fails to commit at one site, it must be rolled back at all sites. This is critical for maintaining consistent and accurate data at multiple locations.
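The all-or-nothing behavior behind failure transparency can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular DDBMS implements it: two in-memory SQLite databases stand in for sites, and the names make_site and run_at_all_sites are hypothetical. A production DDBMS relies on a commit protocol that also copes with failures during the commit step itself, which this sketch does not attempt.

```python
import sqlite3


def make_site():
    """Create one in-memory 'site' holding a copy of a small account table (hypothetical schema)."""
    # isolation_level=None lets us issue BEGIN / COMMIT / ROLLBACK ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO account VALUES (1, 100)")
    return conn


def run_at_all_sites(sites, sql, params=()):
    """Apply one update at every site; commit only if every site accepted it."""
    for conn in sites:
        conn.execute("BEGIN")
    try:
        for conn in sites:
            conn.execute(sql, params)
    except sqlite3.Error:
        # One site failed, so the change is undone everywhere.
        for conn in sites:
            conn.execute("ROLLBACK")
        return False
    for conn in sites:
        conn.execute("COMMIT")
    return True
```

Calling run_at_all_sites with an UPDATE statement returns True only when every site applied the change; if any site raises an error, every site is rolled back to its previous state.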
PART 2: OPTIONS FOR DISTRIBUTING DATABASES
DISTRIBUTED DATABASE DESIGN FACTORS
The method chosen depends on how the organization plans to use the data and on the resources it has to support that
method. There are five factors that influence the selection of a distributed database design strategy:
1. Organizational forces - funding availability, autonomy of organizational units, and the need for security.
2. Frequency and location or clustering of reference to data - in general, data should be located close to the applications that use those data.
3. Need for growth and expansion - the availability of processors on the network will influence where data may be located and applications may be run, and may indicate the need for expansion of the network.
4. Technological capabilities - capabilities at each node and for DBMSs, coupled with the costs for acquiring and managing technology, must be considered. Storage costs tend to be low, but the costs for managing complex technology can be great.
5. Need for reliable service - mission-critical applications and very frequently required data encourage replication schemes.
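The second factor, locating data close to the applications that use it, can be illustrated with a toy calculation. The sketch below is only a hypothetical heuristic: it assumes an access log of (site, table) pairs is available and suggests, for each table, the site that references it most often. The function name suggest_home_site and the log format are assumptions made for the example, and a real design decision would weigh the other four factors as well.

```python
from collections import Counter


def suggest_home_site(access_log):
    """Map each table to the site that references it most often.

    access_log is an iterable of (site, table) pairs, one per data reference.
    """
    counts = {}
    for site, table in access_log:
        counts.setdefault(table, Counter())[site] += 1
    # most_common(1) returns the single most frequent (site, count) pair.
    return {table: sites.most_common(1)[0][0] for table, sites in counts.items()}


log = [
    ("New York", "orders"),
    ("New York", "orders"),
    ("London", "orders"),
    ("London", "customers"),
]
placement = suggest_home_site(log)
```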
There are a number of ways that databases can be distributed. The three most common are data replication, horizontal partitioning
and vertical partitioning.
DATA REPLICATION
Data replication is a very popular method of data distribution. This method provides a fault-tolerant way of distributing a copy
(complete or partial) of the database to more than one location. There are many different replication models including push
replication, pull replication and store
Here are some advantages:
• Reliability - in the event that one database node fails, there are other nodes with a copy of the same data available for use.
• Faster response - each site has a copy of the data close to where it is used, minimizing data movement across the network and contention with other user sites for the same data.
• Node decoupling - transactions can be processed without coordination across a network. Each user’s request can be handled independently. When updates occur, data synchronization of all database copies can take place independently.
• Reduced network traffic - replication or synchronization traffic can be scheduled for off-peak times to reduce network congestion.
Some disadvantages:
• Increased storage requirements - each copy of the database requires its own disk space, plus additional space to store “replication metadata”, such as which row is newer and where it was updated from.
• Complexity of synchronization - there are costs and complexities in keeping all copies of the database current, especially when updating and synchronizing in near-real time.
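How “replication metadata” can drive synchronization is sketched below. This is a minimal Python illustration under stated assumptions, not a description of any product: each row carries a hypothetical version counter and origin site, and conflicts are resolved by keeping the highest version, a last-writer-wins rule chosen here only for simplicity. The names Row and synchronize are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    value: str
    version: int  # replication metadata: a higher number means a newer row
    origin: str   # replication metadata: site where the update was made


def synchronize(replicas):
    """Bring every replica (a dict of key -> Row) up to date with the newest rows."""
    merged = {}
    for replica in replicas:
        for key, row in replica.items():
            if key not in merged or row.version > merged[key].version:
                merged[key] = row
    for replica in replicas:
        replica.clear()
        replica.update(merged)
    return merged


new_york = {"c1": Row("Smith", 1, "NY")}
london = {"c1": Row("Smyth", 2, "LDN"), "c2": Row("Jones", 1, "LDN")}
synchronize([new_york, london])
```

After the call, both replicas hold the same rows, and the newer London update to c1 has replaced the older New York value.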
HORIZONTAL PARTITIONING
Imagine a scenario in which the sales team, customer records and orders for each particular location are stored in a DBMS at that
location. This makes sense when the majority of CRUD operations will take place within a specific scope. For example, a North
American sales office would seldom place an order for a European customer, but could do so thanks to distribution transparency.
Horizontal partitioning is a method for distributing data by row. Each database node on the network gets a subset of the rows of the
database, and together the rows from all of the nodes on the network make up the entire database.
Advantages of horizontal partitioning:
• Efficiency - data are stored close to where they are used and separate from other data used by other users or applications.
• Local optimization - data can be stored to optimize performance for local access.
• Security - data not relevant to usage at a particular site are made unavailable.
• Ease of querying - combining data across horizontal partitions is easy since rows are simply merged by unions across the partitions.
• Ease of combining - a horizontally partitioned database can be reassembled by performing a UNION of the partitions, which is much simpler than combining rows from multiple tables with JOINs.
Disadvantages of horizontal partitioning:
• Inconsistent access speed - when data from several partitions are required, the access time can be significantly different from local-only data access.
• Backup vulnerability - since data are not replicated, when data at one site become inaccessible or damaged, usage cannot switch to another site where a copy exists; data may be lost if proper backup is not performed at each site.
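The row-wise split and the UNION that recombines it can be tried out with Python’s built-in sqlite3 module. The sketch below is illustrative only: two attached in-memory databases stand in for the North American and European sites, and the customer table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# One connection with two attached in-memory databases standing in for two sites.
conn = sqlite3.connect(":memory:", isolation_level=None)
for site in ("north_america", "europe"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {site}")
    conn.execute(
        f"CREATE TABLE {site}.customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
    )

# Each site stores only the rows for its own region.
conn.execute("INSERT INTO north_america.customer VALUES (1, 'Acme', 'NA'), (2, 'Globex', 'NA')")
conn.execute("INSERT INTO europe.customer VALUES (3, 'Initech', 'EU')")


def local_customers(site):
    """Query a single partition, as a local application would."""
    return conn.execute(f"SELECT id, name FROM {site}.customer ORDER BY id").fetchall()


def all_customers():
    """Rebuild the full table by taking the union of the partitions."""
    return conn.execute(
        "SELECT id, name, region FROM north_america.customer "
        "UNION ALL "
        "SELECT id, name, region FROM europe.customer "
        "ORDER BY id"
    ).fetchall()
```

UNION ALL is used here because the partitions do not overlap, so no duplicate elimination is needed.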
VERTICAL PARTITIONING
Vertical partitioning is a method for distributing data by column. Subsets of a database’s columns are distributed to multiple sites.
Each subset needs a combination of primary and foreign keys so that the entire database can be recreated. The advantages and
disadvantages of vertical partitioning are identical to those of horizontal partitioning, with the exception that combining data across
vertical partitions is more difficult than across horizontal partitions. This difficulty arises from the need to match primary keys or
other qualifications to join rows across partitions.
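The need to match primary keys across vertical partitions can be seen in a small sketch using Python’s built-in sqlite3 module. Everything named here is hypothetical: two attached in-memory databases stand in for an HR site and a payroll site, each holding different columns of the same employee data along with the shared key emp_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("ATTACH DATABASE ':memory:' AS hr_site")
conn.execute("ATTACH DATABASE ':memory:' AS payroll_site")

# Each site stores a different subset of columns; both keep the primary key.
conn.execute("CREATE TABLE hr_site.employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE payroll_site.employee (emp_id INTEGER PRIMARY KEY, salary INTEGER)")
conn.execute("INSERT INTO hr_site.employee VALUES (1, 'Ada'), (2, 'Grace')")
conn.execute("INSERT INTO payroll_site.employee VALUES (1, 5000), (2, 6000)")


def full_employee_rows():
    """Rebuild complete rows by joining the partitions on the shared primary key."""
    return conn.execute(
        "SELECT h.emp_id, h.name, p.salary "
        "FROM hr_site.employee AS h "
        "JOIN payroll_site.employee AS p ON p.emp_id = h.emp_id "
        "ORDER BY h.emp_id"
    ).fetchall()
```

Unlike the horizontal case, the rows cannot simply be stacked; every complete row requires a key match between the partitions.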
HOW DO YOU SELECT THE RIGHT DATA DISTRIBUTION STRATEGY?
Now that you understand the basic options for distributed databases, you might be wondering how you choose the right strategy for
a given situation. There are a variety of ways to distribute data, and the rationale for making the decision is based on your
organization’s unique data needs and the resources your organization is willing to invest. Of the following methods of distribution
there is no one best approach, and oftentimes a combination of these approaches is used.
Various distribution strategies at a glance:
• Complete centralization - the database is physically stored in one location and is accessed remotely by the geographically dispersed sites.
• Replication w/ Snapshots - each remote location has a complete or partial copy of the database, with each copy periodically updated with snapshots reflecting changes to the data. Snapshots occur on a routine interval.
• Replication w/ Synchronization - each remote site has a complete or partial copy of the database, with each copy receiving near real-time updates via synchronization, much like remote triggers on a table.
• Integrated Partitioning - each site’s database is viewed as one logical piece of the entire database, either through horizontal or vertical partitioning.
• Independent Partitioning - each site’s database is independent and non-integrated with all of the other database segments.
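The difference between the two replication strategies in the list above can be sketched in Python. This is a toy model with hypothetical names (Primary, synced_replicas, snapshot), not a real replication API: a synchronized replica receives every write as it happens, while a snapshot replica only changes when a new snapshot is taken.

```python
import copy


class Primary:
    """Hypothetical primary copy that pushes each write to synchronized replicas."""

    def __init__(self):
        self.data = {}
        self.synced_replicas = []  # plain dicts standing in for remote copies

    def write(self, key, value):
        self.data[key] = value
        # Synchronization: propagate the change immediately, like a remote trigger.
        for replica in self.synced_replicas:
            replica[key] = value

    def snapshot(self):
        # Snapshot: an independent copy of the data as of this moment.
        return copy.deepcopy(self.data)


primary = Primary()
live_copy = {}
primary.synced_replicas.append(live_copy)

primary.write("price", 10)
nightly_copy = primary.snapshot()  # taken on a routine interval
primary.write("price", 12)         # live_copy sees this; nightly_copy does not
```

The snapshot copy stays stale until the next scheduled snapshot, which is the trade-off for generating less network traffic than continuous synchronization.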
The table below contrasts the various database distribution strategies:

taken from "Modern Database Management" 7th ed., Hoffer, Prescott, McFadden, Prentice-Hall 2007.
DISTRIBUTION: HOMOGENEOUS VS. HETEROGENEOUS
When distributing databases it is easier and less complicated to distribute homogeneous databases. The idea is that the same
vendor’s DBMS (with the same major release) will be used to manage the data at each node. Often this is impractical,
especially in a packaged software environment or in a situation where you are distributing databases acquired through mergers or
acquisitions; the result is a heterogeneous environment, in which different DBMSs manage the data at different nodes. The figures
below demonstrate the difference between a homogeneous and a heterogeneous database.