Relation: the mathematical name for a table.
Tuple: a row of a table.
Cardinality: the number of tuples (rows) in a relation.
Attribute: a column of a table.
Degree: the number of columns (attributes).
Domain: the pool of values from which an attribute draws its values.
Supplier table (S):

S#   SName    Status   City
S1   Smith    20       London
S2   John     10       Paris
S3   Angle    30       Brisbane
S4   Blake    20       New York
S5   Whitey   30       Massachusetts
Data Abstraction
Simply put, it is the hiding of the complexity of how data are stored, so that data can be retrieved efficiently.
Physical Level:
The lowest level of abstraction.
This describes how the data are actually stored.
This describes complex low-level data structures.
Logical Level:
The next higher level of abstraction, above the physical level.
This describes what data are stored in the database, and what relationships exist among those
data.
This level helps us to know the entire database in terms of a small number of relatively simple
structures.
Database administrators, who must decide what information to keep in the database, use the
logical level of abstraction.
View Level:
The highest level of abstraction.
This describes only part of the database.
Many users of the database system do not need all this information; instead, they need to
access only part of the database.
The view level of abstraction exists to simplify their interaction with the system.
The system may provide many views for the same database.
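As an illustrative sketch (the table and column names here are hypothetical), a view can be defined so
that a user sees only part of the database:

CREATE VIEW instructor_public AS
SELECT name, dept_name    -- the salary column is deliberately left out of this view
FROM instructor;

A user querying instructor_public sees names and departments but never salaries, which is exactly the
kind of partial view described above.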
Data Models
The data model describes the structure of the database.
It’s the collection of conceptual tools for describing data, data relationships, data semantics, and
consistency constraints.
It’s the tool that describes the design of a database at the physical, logical and view levels.
Data models can be classified into four different categories:
Relational Model
Entity-Relationship Model
Object-Based Data Model
Semi-structured Data Model
Relational Model:
The relational model uses a collection of tables to represent both the data and the relationships
among the data.
Each table has multiple columns, and each column has a unique name.
Tables are also known as relations.
The relational model is an example of a record-based model.
Record-based models are so named because the database is structured in fixed-format records of
several types.
Each table contains records of a particular type.
Each record type defines a fixed number of fields, or attributes.
The columns of the table correspond to the attributes of the record type.
The relational data model is the most widely used data model, and exists commonly in current
database systems.
Entity-Relationship Model:
The entity-relationship (E-R) data model uses a collection of basic objects, called entities, and
relationships among these objects.
An entity is a thing or object in the real world that is distinguishable from other objects.
The entity-relationship model is most widely used in database design.
Object – Based Data Model:
Object-oriented programming languages such as Java, C++ and C# have become the dominant
software development methodology. This led to the development of an object-oriented data model
that can be seen as extending the E-R model with notions of encapsulation, methods, and object
identity.
The object-relational data model combines features of the object – oriented data model and
relational data model.
Semi-structured Data Model:
The semi structured data model permits the specification of data where individual data items of the
same type may have different sets of attributes.
This is in contrast to the data models mentioned earlier, where every data item of a particular type must
have the same set of attributes. The Extensible Markup Language (XML) is widely used to represent
semistructured data.
Relational Model
Relational Model is the model in which database is represented as collection of tables.
This model is a powerful way of representing data and its relationships, and it is now
established as the primary way of representing data for commercial data-processing applications.
The Relation
The Relation is the basic element in a relational data model.
Relations in the Relational Data Model
A relation is subject to the following rules:
1. Relation (file, table) is a two-dimensional table.
2. Attribute (i.e. field or data item) is a column in the table.
3. Each column in the table has a unique name within that table.
4. Each column is homogeneous. Thus the entries in any column are all of the same type
(e.g. age, name, employee-number, etc).
5. Each column has a domain, the set of possible values that can appear in that column.
6. A Tuple (i.e. record) is a row in the table.
7. The order of the rows and columns is not important.
8. Values of a row all relate to something or portion of a thing.
9. Duplicate rows are not allowed (candidate keys are designed to prevent this).
10. Cells must be single-valued (but can be variable length). Single valued means the
following:
o Cannot contain multiple values such as 'A1,B2,C3'.
o Cannot contain combined values such as 'ABC-XYZ' where 'ABC' means one
thing and 'XYZ' another.
A relation may be expressed using the notation R(A,B,C, ...) where:
R = the name of the relation.
(A,B,C, ...) = the attributes within the relation.
A = the attribute(s) which form the primary key.
Keys
1. A simple key contains a single attribute.
2. A composite key is a key that contains more than one attribute.
3. A candidate key is an attribute (or set of attributes) that uniquely identifies a row. A
candidate key must possess the following properties:
o Unique identification - For every row the value of the key must uniquely identify
that row.
o Non redundancy - No attribute in the key can be discarded without destroying the
property of unique identification.
4. A primary key is the candidate key which is selected as the principal unique identifier.
Every relation must contain a primary key. The primary key is usually the key selected to
identify a row when the database is physically implemented. For example, a part number
is selected instead of a part description.
5. A superkey is any set of attributes that uniquely identifies a row. A superkey differs from
a candidate key in that it does not require the non redundancy property.
6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non key
attribute in one relation and as a primary key attribute in another relation. I say usually
because it is possible for a foreign key to also be the whole or part of a primary key:
o A many-to-many relationship can only be implemented by introducing an
intersection or link table which then becomes the child in two one-to-many
relationships. The intersection table therefore has a foreign key for each of its
parents, and its primary key is a composite of both foreign keys.
o A one-to-one relationship requires that the child table has no more than one
occurrence for each parent, which can only be enforced by letting the foreign key
also serve as the primary key.
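As a sketch of how several of these keys look in SQL (the table and column names are hypothetical),
the intersection table described in point 6 has a composite primary key made up of two foreign keys:

CREATE TABLE enrolment
(
student_id int NOT NULL,
course_id int NOT NULL,
PRIMARY KEY (student_id, course_id),
FOREIGN KEY (student_id) REFERENCES student (student_id),
FOREIGN KEY (course_id) REFERENCES course (course_id)
);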
Relationships
One table (relation) may be linked with another in what is known as a relationship.
Relationships may be built into the database structure to facilitate the operation of relational joins
at runtime.
1. A relationship is between two tables in what is known as a one-to-many or parent-child
or master-detail relationship where an occurrence on the 'one' or 'parent' or 'master' table
may have any number of associated occurrences on the 'many' or 'child' or 'detail' table.
To achieve this the child table must contain fields which link back to the primary key on the
parent table. These fields on the child table are known as a foreign key, and the parent
table is referred to as the foreign table (from the viewpoint of the child).
2. It is possible for a record on the parent table to exist without corresponding records on
the child table, but it should not be possible for an entry on the child table to exist
without a corresponding entry on the parent table.
3. A child record without a corresponding parent record is known as an orphan.
4. It is possible for a table to be related to itself. For this to be possible it needs a foreign
key which points back to the primary key. Note that these two keys cannot comprise exactly the
same fields, otherwise a record could only ever point to itself.
5. A table may be the subject of any number of relationships, and it may be the parent in
some and the child in others.
6. Some database engines allow a parent table to be linked via a candidate key, but if this
were changed it could result in the link to the child table being broken.
7. Some database engines allow relationships to be managed by rules known as referential
integrity or foreign key constraints. These will prevent entries on child tables from being
created if the foreign key does not exist on the parent table, or will deal with entries on
child tables when the entry on the parent table is updated or deleted.
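For illustration only (the table and column names are assumptions), a table related to itself as in
point 4, protected by the referential integrity rules of point 7, could be declared like this:

CREATE TABLE employee
(
emp_no int PRIMARY KEY,
emp_name varchar(50) NOT NULL,
manager_no int REFERENCES employee (emp_no)    -- self-referencing foreign key; not the same field as the primary key
);

Here manager_no is a foreign key pointing back at emp_no in the same table, and the database engine
will reject any manager_no value that does not exist as an emp_no.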
Relational Joins
The join operator is used to combine data from two or more relations (tables) in order to satisfy a
particular query. Two relations may be joined when they share at least one common attribute.
The join is implemented by considering each row in an instance of each relation. A row in
relation R1 is joined to a row in relation R2 when the value of the common attribute(s) is equal
in the two relations. The join of two relations is often called a binary join.
The join of two relations creates a new relation. The notation R1 x R2 indicates the join of
relations R1 and R2. For example, consider the following:
Relation R1

A   B   C
1   5   3
2   4   5
8   3   5
9   3   3
1   6   5
5   4   3
2   7   5
Relation R2

B   D   E
4   7   4
6   2   3
5   7   8
7   2   3
3   2   2
Note that the instances of relation R1 and R2 contain the same data values for attribute B. Data
normalization is concerned with decomposing a relation (e.g. R(A,B,C,D,E)) into smaller
relations (e.g. R1 and R2). The data values for attribute B in this context will be identical in R1
and R2. The instances of R1 and R2 are projections of the instances of R(A,B,C,D,E) onto the
attributes (A,B,C) and (B,D,E) respectively. A projection will not eliminate data values: duplicate
rows are removed, but this will not remove a data value from any attribute.
The join of relations R1 and R2 is possible because B is a common attribute. The result of the
join is:
Relation R1 x R2

A   B   C   D   E
1   5   3   7   8
2   4   5   7   4
8   3   5   2   2
9   3   3   2   2
1   6   5   2   3
5   4   3   7   4
2   7   5   2   3
The row (2 4 5 7 4) was formed by joining the row (2 4 5) from relation R1 to the row (4 7 4)
from relation R2. The two rows were joined since each contained the same value for the common
attribute B. The row (2 4 5) was not joined to the row (6 2 3) since the values of the common
attribute (4 and 6) are not the same.
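A sketch of the same binary join in SQL, assuming R1 and R2 exist as tables with the columns shown above:

SELECT R1.A, R1.B, R1.C, R2.D, R2.E
FROM R1
INNER JOIN R2 ON R1.B = R2.B;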
The relations joined in the preceding example shared exactly one common attribute. However,
relations may share multiple common attributes. All of these common attributes must be used in
creating a join. For example, the instances of relations R1 and R2 in the following example are
joined using the common attributes B and C:
Before the join:
Relation R1

A   B   C
6   1   4
8   1   4
5   1   2
2   7   1
Relation R2

B   C   D
1   4   9
1   4   2
1   2   1
7   1   2
7   1   3
After the join:
Relation R1 x R2

A   B   C   D
6   1   4   9
6   1   4   2
8   1   4   9
8   1   4   2
5   1   2   1
2   7   1   2
2   7   1   3
The row (6 1 4 9) was formed by joining the row (6 1 4) from relation R1 to the row (1 4 9) from
relation R2. The join was created since the common set of attributes (B and C) contained
identical values (1 and 4). The row (6 1 4) from R1 was not joined to the row (1 2 1) from R2
since the common attributes did not share identical values - (1 4) in R1 and (1 2) in R2.
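A sketch of the equivalent SQL, again assuming R1 and R2 exist as tables, must use both common columns
in the join condition:

SELECT R1.A, R1.B, R1.C, R2.D
FROM R1
INNER JOIN R2 ON R1.B = R2.B AND R1.C = R2.C;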
The join operation provides a method for reconstructing a relation that was decomposed into two
relations during the normalisation process. The join of two rows, however, can create a new row
that was not a member of the original relation. Thus invalid information can be created during
the join process.
Lossless Joins
A set of relations satisfies the lossless join property if the instances can be joined without creating
invalid data (i.e. new rows). The term lossless join may be somewhat confusing. A join that is not lossless
will contain extra, invalid rows. A join that is lossless will not contain extra, invalid rows. Thus the term
gainless join might be more appropriate.
To give an example of incorrect information created by an invalid join let us take the following data
structure:
R(student, course, instructor, hour, room, grade)
Assuming that only one section of a class is offered during a semester we can define the following
functional dependencies:
1. (HOUR, ROOM) → COURSE
2. (COURSE, STUDENT) → GRADE
3. (INSTRUCTOR, HOUR) → ROOM
4. (COURSE) → INSTRUCTOR
5. (HOUR, STUDENT) → ROOM
Take the following sample data:

STUDENT   COURSE    INSTRUCTOR   HOUR   ROOM   GRADE
Smith     Math 1    Jenkins      8:00   100    A
Jones     English   Goldman      8:00   200    B
Brown     English   Goldman      8:00   200    C
Green     Algebra   Jenkins      9:00   400    A
The following four relations, each in 4th normal form, can be generated from the given and
implied dependencies:
R1(STUDENT, HOUR, COURSE)
R2(STUDENT, COURSE, GRADE)
R3(COURSE, INSTRUCTOR)
R4(INSTRUCTOR, HOUR, ROOM)
Note that the dependencies (HOUR, ROOM) → COURSE and (HOUR, STUDENT) → ROOM
are not explicitly represented in the preceding decomposition. The goal is to develop relations in
4th normal form that can be joined to answer any ad hoc inquiries correctly. This goal can be
achieved without representing every functional dependency as a relation. Furthermore, several
sets of relations may satisfy the goal.
The preceding sets of relations can be populated as follows:

R1
STUDENT   HOUR   COURSE
Smith     8:00   Math 1
Jones     8:00   English
Brown     8:00   English
Green     9:00   Algebra

R2
STUDENT   COURSE    GRADE
Smith     Math 1    A
Jones     English   B
Brown     English   C
Green     Algebra   A

R3
COURSE    INSTRUCTOR
Math 1    Jenkins
English   Goldman
Algebra   Jenkins

R4
INSTRUCTOR   HOUR   ROOM
Jenkins      8:00   100
Goldman      8:00   200
Jenkins      9:00   400
Now suppose that a list of courses with their corresponding room numbers is required. Relations
R1 and R4 contain the necessary information and can be joined using the attribute HOUR. The
result of this join is:
R1 x R4
STUDENT   COURSE    INSTRUCTOR   HOUR   ROOM
Smith     Math 1    Jenkins      8:00   100
Smith     Math 1    Goldman      8:00   200   (invalid)
Jones     English   Jenkins      8:00   100   (invalid)
Jones     English   Goldman      8:00   200
Brown     English   Jenkins      8:00   100   (invalid)
Brown     English   Goldman      8:00   200
Green     Algebra   Jenkins      9:00   400

This join creates the following invalid information (denoted by the rows marked "(invalid)" above):
Smith, Jones, and Brown take the same class at the same time from two different
instructors in two different rooms.
Jenkins (the Maths teacher) teaches English.
Goldman (the English teacher) teaches Maths.
Both instructors teach different courses at the same time.
Another possibility for a join is R3 and R4 (joined on INSTRUCTOR). The result would be:
R3 x R4
COURSE    INSTRUCTOR   HOUR   ROOM
Math 1    Jenkins      8:00   100
Math 1    Jenkins      9:00   400   (invalid)
English   Goldman      8:00   200
Algebra   Jenkins      8:00   100   (invalid)
Algebra   Jenkins      9:00   400
This join creates the following invalid information:
Jenkins teaches Math 1 and Algebra simultaneously at both 8:00 and 9:00.
A correct sequence is to join R1 and R3 (using COURSE) and then join the resulting relation
with R4 (using both INSTRUCTOR and HOUR). The result would be:
R1 x R3
STUDENT   COURSE    INSTRUCTOR   HOUR
Smith     Math 1    Jenkins      8:00
Jones     English   Goldman      8:00
Brown     English   Goldman      8:00
Green     Algebra   Jenkins      9:00
(R1 x R3) x R4
STUDENT   COURSE    INSTRUCTOR   HOUR   ROOM
Smith     Math 1    Jenkins      8:00   100
Jones     English   Goldman      8:00   200
Brown     English   Goldman      8:00   200
Green     Algebra   Jenkins      9:00   400
Extracting the COURSE and ROOM attributes (and eliminating the duplicate row produced for
the English course) would yield the desired result:
COURSE    ROOM
Math 1    100
English   200
Algebra   400
The correct result is obtained since the sequence (R1 x R3) x R4 satisfies the lossless (gainless?)
join property.
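As a sketch in SQL (assuming R1, R3 and R4 exist as tables with the columns shown above), the correct
join sequence and the final projection could be written as:

SELECT DISTINCT R1.course, R4.room
FROM R1
INNER JOIN R3 ON R1.course = R3.course
INNER JOIN R4 ON R3.instructor = R4.instructor AND R1.hour = R4.hour;

The DISTINCT removes the duplicate row produced for the English course.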
A relational database is in 4th normal form when the lossless join property can be used to answer
unanticipated queries. However, the choice of joins must be evaluated carefully. Many different
sequences of joins will recreate an instance of a relation. Some sequences are more desirable
since they result in the creation of less invalid data during the join operation.
Suppose that a relation is decomposed using functional dependencies and multi-valued
dependencies. Then at least one sequence of joins on the resulting relations exists that recreates
the original instance with no invalid data created during any of the join operations.
For example, suppose that a list of grades by room number is desired. This question, which was
probably not anticipated during database design, can be answered without creating invalid data
by either of the following two join sequences:
R1 x R3
(R1 x R3) x R2
((R1 x R3) x R2) x R4
Or
R1 x R3
(R1 x R3) x R4
((R1 x R3) x R4) x R2
The required information is contained with relations R2 and R4, but these relations cannot be
joined directly. In this case the solution requires joining all 4 relations.
The database may require a 'lossless join' relation, which is constructed to assure that any ad hoc
inquiry can be answered with relational operators. This relation may contain attributes that are
not logically related to each other. This occurs because the relation must serve as a bridge
between the other relations in the database. For example, the lossless join relation will contain all
attributes that appear only on the left side of a functional dependency. Other attributes may also
be required, however, in developing the lossless join relation.
Consider relational schema R(A, B, C, D), with A → B and C → D. Relations R1(A, B) and R2(C, D)
are in 4th normal form. A third relation R3(A, C), however, is required to satisfy the lossless join
property. This relation can be used to join attributes B and D. This is accomplished by joining
relations R1 and R3 and then joining the result to relation R2. No invalid data is created during
these joins. The relation R3(A, C) is the lossless join relation for this database design.
A relation is usually developed by combining attributes about a particular subject or entity. The
lossless join relation, however, is developed to represent a relationship among various relations.
The lossless join relation may be difficult to populate initially and difficult to maintain - a result
of including attributes that are not logically associated with each other.
The attributes within a lossless join relation often contain multi-valued dependencies.
Consideration of 4th normal form is important in this situation. The lossless join relation can
sometimes be decomposed into smaller relations by eliminating the multi-valued dependencies.
These smaller relations are easier to populate and maintain.
Determinant and Dependent
The terms determinant and dependent can be described as follows:
1. The expression X → Y means 'if I know the value of X, then I can obtain the value of Y'
(in a table or somewhere).
2. In the expression X → Y, X is the determinant and Y is the dependent attribute.
3. The value X determines the value of Y.
4. The value Y depends on the value of X.
Functional Dependencies (FD)
A functional dependency can be described as follows:
1. An attribute is functionally dependent if its value is determined by another attribute
which is a key.
2. That is, if we know the value of one (or several) data items, then we can find the value of
another (or several).
3. Functional dependencies are expressed as X → Y, where X is the determinant and Y is
the functionally dependent attribute.
4. If A → (B,C) then A → B and A → C.
5. If (A,B) → C, then it is not necessarily true that A → C and B → C.
6. If A → B and B → A, then A and B are in a 1-1 relationship.
7. If A → B then for A there can only ever be one value for B.
Transitive Dependencies (TD)
A transitive dependency can be described as follows:
1. An attribute is transitively dependent if its value is determined by another attribute which
is not a key.
2. If X → Y and X is not a key then this is a transitive dependency.
3. A transitive dependency exists when A → B → C but NOT A → C.
Multi-Valued Dependencies (MVD)
A multi-valued dependency can be described as follows:
1. A table involves a multi-valued dependency if it may contain multiple values for an
entity.
2. A multi-valued dependency may arise as a result of enforcing 1st normal form.
3. X →→ Y, i.e. X multi-determines Y, when for each value of X we can have more than one
value of Y.
4. If A →→ B and A →→ C then we have a single attribute A which multi-determines two other
independent attributes, B and C.
5. If A →→ (B,C) then we have an attribute A which multi-determines a set of associated
attributes, B and C.
Join Dependencies (JD)
A join dependency can be described as follows:
1. If a table can be decomposed into three or more smaller tables, it must be capable of
being joined again on common keys to form the original table.
Modification Anomalies
A major objective of data normalisation is to avoid modification anomalies. These come in two
flavours:
1. An insertion anomaly is a failure to place information about a new database entry into
all the places in the database where information about that new entry needs to be stored.
In a properly normalized database, information about a new entry needs to be inserted
into only one place in the database. In an inadequately normalized database, information
about a new entry may need to be inserted into more than one place, and, human
fallibility being what it is, some of the needed additional insertions may be missed.
2. A deletion anomaly is a failure to remove information about an existing database entry
when it is time to remove that entry. In a properly normalized database, information
about an old, to-be-gotten-rid-of entry needs to be deleted from only one place in the
database. In an inadequately normalized database, information about that old entry may
need to be deleted from more than one place, and, human fallibility being what it is, some
of the needed additional deletions may be missed.
An update of a database involves modifications that may be additions, deletions, or both. Thus
'update anomalies' can be either of the kinds of anomalies discussed above.
All three kinds of anomalies are highly undesirable, since their occurrence constitutes corruption
of the database. Properly normalised databases are much less susceptible to corruption than are
unnormalised databases.
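For a small illustration, suppose enrolment data were held in one unnormalised table
ENROLMENT(STUDENT, COURSE, INSTRUCTOR) (this table is hypothetical), with the instructor repeated on
every row for a course:

STUDENT   COURSE    INSTRUCTOR
Smith     Math 1    Jenkins
Jones     English   Goldman

Recording a new course and its instructor before any student has enrolled has nowhere to go (an
insertion anomaly), and deleting the last student enrolled on Math 1 also deletes the fact that Jenkins
teaches it (a deletion anomaly).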
Types of Relational Join
A JOIN is a method of creating a result set that combines rows from two or more tables
(relations). When comparing the contents of two tables the following conditions may occur:
Every row in one relation has a match in the other relation.
Relation R1 contains rows that have no match in relation R2.
Relation R2 contains rows that have no match in relation R1.
INNER joins contain only matches. OUTER joins may contain mismatches as well.
Inner Join
This is sometimes known as a simple join. It returns all rows from both tables where there is a
match. If there are rows in R1 which do not have matches in R2, those rows will not be listed.
There are two possible ways of specifying this type of join:
SELECT * FROM R1, R2 WHERE R1.r1_field = R2.r2_field;
SELECT * FROM R1 INNER JOIN R2 ON R1.r1_field = R2.r2_field;
If the fields to be matched have the same names in both tables then the ON condition, as in:
ON R1.fieldname = R2.fieldname
ON (R1.field1 = R2.field1 AND R1.field2 = R2.field2)
can be replaced by the shorter USING condition, as in:
USING fieldname
USING (field1, field2)
Natural Join
A natural join is based on all columns in the two tables that have the same name. It is
semantically equivalent to an INNER JOIN or a LEFT JOIN with a USING clause that names all
columns that exist in both tables.
SELECT * FROM R1 NATURAL JOIN R2
The alternative is a keyed join which includes an ON or USING condition.
Left [Outer] Join
Returns all the rows from R1 even if there are no matches in R2. If there are no matches in R2
then the R2 values will be shown as null.
SELECT * FROM R1 LEFT [OUTER] JOIN R2 ON R1.field = R2.field
Right [Outer] Join
Returns all the rows from R2 even if there are no matches in R1. If there are no matches in R1
then the R1 values will be shown as null.
SELECT * FROM R1 RIGHT [OUTER] JOIN R2 ON R1.field = R2.field
Full [Outer] Join
Returns all the rows from both tables even if there are no matches in one of the tables. If there
are no matches in one of the tables then its values will be shown as null.
SELECT * FROM R1 FULL [OUTER] JOIN R2 ON R1.field = R2.field
Self Join
This joins a table to itself. This table appears twice in the FROM clause and is followed by table
aliases that qualify column names in the join condition.
SELECT a.field1, b.field2 FROM R1 a, R1 b WHERE a.field = b.field
Cross Join
This type of join is rarely used as it does not have a join condition, so every row of R1 is joined
to every row of R2. For example, if both tables contain 100 rows the result will be 10,000 rows.
This is sometimes known as a cartesian product and can be specified in either one of the
following ways:
SELECT * FROM R1 CROSS JOIN R2
SELECT * FROM R1, R2
Integrity Constraints
Integrity constraints used in Oracle:
1. Not null
2. Unique key
3. Primary key
4. Foreign key
5. Check
Integrity constraints are those constraints that are used to ensure accuracy and consistency of data in
Relational database.
The constraints ensure that changes made to the database by authorized users do not result in a loss of
data consistency.
Integrity constraints allow us to define certain data quality requirements that the data in the database
needs to meet. If a user tries to insert data that do not meet these requirements, the integrity
constraint will not allow it.
1) Not null:
The NOT NULL constraint enforces a column not to accept null values.
It is not possible to insert a null value into a column that is defined as NOT NULL.
Create table student
(
name char(20) not null,
rollNumber number(10)
);
Insert into student (name, rollNumber) values ('bijaya', 1);   -- succeeds
Insert into student (name, rollNumber) values ('', 1);         -- fails: Oracle treats '' as null, violating NOT NULL
Create table persons
(
P_id int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Address varchar(255),
City varchar(255)
)
2) Unique Key:
It is not possible to insert duplicate values into a column that is defined as a unique key
(a unique key column may, however, contain null values).
Create table student
(
name char(20),
roll number(10),
unique (name)
);
Unique Key     It accepts null values.
Primary Key    Doesn't accept null values. It is the combination of the not null and unique constraints.
3) Primary Key:
A primary key is a special relational database table column (or combination of columns)
designated to uniquely identify all table records.
A primary key’s main features are:
It must contain a unique value for each row of data.
It cannot contain null values.
A primary key is either an existing table column or a column that is specifically generated by the
database according to a defined sequence.
CREATE TABLE PERSONS
(
P_id int not null,
Lastname varchar(255) not null,
Firstname varchar(255),
Address varchar(255),
City varchar(255),
Primary key(p_id)
);
4) Foreign Key: ( Referential Integrity Constraints )
Student_Info
Name       Roll_no   Address
Keshab     302       Jamal
Harearam   309       Jorpati
Ram        322       Naxal

Student_Performance
Roll_no   Percentage
302       79
309       72
A foreign key constraint on a column ensures that every value in that column exists as a primary key
value in another table.
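A sketch of how this could be declared for the tables above (the data types are assumptions):

Create table Student_Info
(
Name varchar(255),
Roll_no int,
Address varchar(255),
Primary key (Roll_no)
);
Create table Student_Performance
(
Roll_no int,
Percentage int,
Foreign key (Roll_no) references Student_Info (Roll_no)
);

With this constraint a row cannot be inserted into Student_Performance for a Roll_no that does not
exist in Student_Info.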
5) Check Constraint:
The check constraint is used to limit the value range that can be placed in a column
If we define the check constraint on a single column it allows only certain values for this column.
If we define the check constraint on a table it can limit the values in certain columns based on
values in other columns in the row.
Create table persons
(
P_id int not null,
Lastname varchar(255) not null,
Firstname varchar(255) ,
Address varchar(255),
City varchar(255),
Check(p_id > 0)
);
Advanced SQL Programming
Function
A PL/SQL function block is created using the CREATE FUNCTION statement. The major difference between a
PL/SQL function and a procedure is that a function always returns a value, whereas a procedure may or
may not return a value.
PL/SQL Functions Syntax
CREATE [OR REPLACE] FUNCTION function_name
[ (parameter [,parameter]) ]
RETURN return_datatype
IS | AS
[
declaration_section
variable declarations;
constant declarations;
]
BEGIN
[executable_section
PL/SQL execute/subprogram body
]
[EXCEPTION]
[exception_section
PL/SQL Exception block
]
END [function_name];
/
Function Example
In this example we create a function that is passed an employee number and returns that employee's
name from the table. We have an emp1 table holding employee information:
EMP_NO   EMP_NAME      EMP_DEPT            EMP_SALARY
1        Forbs ross    Web Developer       45k
2        marks jems    Program Developer   38k
3        Saulin        Program Developer   34k
4        Zenia Sroll   Web Developer       42k
Create Function
CREATE or REPLACE FUNCTION fun1(no in number)
RETURN varchar2
IS
name varchar2(20);
BEGIN
select emp_name into name from emp1 where emp_no = no;
return name;
END;
/
Calling Function
SQL> select fun1(1) from dual;
PL/SQL Drop Function
You can drop PL/SQL function using DROP FUNCTION statements.
Functions Drop Syntax
DROP FUNCTION function_name;
Functions Drop Example
SQL>DROP FUNCTION fun1;
Function dropped.
Example 1: Practical
------------------------- Preparing the table -------------------------
create table emp1
(
emp_no number,
emp_name varchar2(20),
emp_dept varchar2(20),
emp_salary varchar2(20)
);
--------------------------------- Inserting the rows into the table ---------------------------------
insert all
into emp1(emp_no, emp_name, emp_dept, emp_salary) values ('1', 'Forbs ross', 'Web Developer', '45k')
into emp1(emp_no, emp_name, emp_dept, emp_salary) values ('2', 'marks jems', 'Program Developer', '38k')
into emp1(emp_no, emp_name, emp_dept, emp_salary) values ('3', 'Saulin', 'Program Developer', '34k')
into emp1(emp_no, emp_name, emp_dept, emp_salary) values ('4', 'Zenia Sroll', 'Web Developer', '42k')
select * from dual;
------------------------------------------------- Creating the function -------------------------------------------------
CREATE or REPLACE FUNCTION fun1(no in number)
RETURN varchar2
IS
name varchar2(20);
BEGIN
select emp_name into name from emp1 where emp_no = no;
return name;
END;
/
PL/SQL Procedure Syntax
CREATE [OR REPLACE] PROCEDURE procedure_name
[ (parameter [,parameter]) ]
IS
[declaration_section
variable declarations;
constant declarations;
]
BEGIN
[executable_section
PL/SQL execute/subprogram body
]
[EXCEPTION]
[exception_section
PL/SQL Exception block
]
END [procedure_name];
/
PL/SQL Procedure Example
In this example we create a procedure that is passed an employee number argument and returns that
employee's information from the table. We have an emp1 table holding employee information:
EMP_NO   EMP_NAME      EMP_DEPT            EMP_SALARY
1        Forbs ross    Web Developer       45k
2        marks jems    Program Developer   38k
3        Saulin        Program Developer   34k
4        Zenia Sroll   Web Developer       42k
Create PROCEDURE
In this example the IN parameter (no) is passed in, and inside the procedure a SELECT ... INTO
statement fetches the employee's information.
CREATE or REPLACE PROCEDURE pro1(no in number, temp out emp1%rowtype)
IS
BEGIN
SELECT * INTO temp FROM emp1 WHERE emp_no = no;
END;
/
PL/SQL Program to Calling Procedure
This anonymous PL/SQL block calls the procedure (pro1) defined above, passing an employee number and
displaying that employee's information.
DECLARE
temp emp1%rowtype;
no number :=&no;
BEGIN
pro1(no,temp);
dbms_output.put_line(temp.emp_no || ' ' ||
temp.emp_name || ' ' ||
temp.emp_dept || ' ' ||
temp.emp_salary);
END;
/
PL/SQL Drop Procedure
You can drop PL/SQL procedure using DROP PROCEDURE statement,
Procedure Drop Syntax
DROP PROCEDURE procedure_name;
Procedure Drop Example
SQL>DROP PROCEDURE pro1;
Procedure dropped.
Trigger
What is Trigger?
A trigger is a stored PL/SQL block that the Oracle engine invokes automatically whenever a specified event occurs.
The trigger is stored in the database and invoked repeatedly whenever the specified condition matches.
PL/SQL Triggers Syntax
A PL/SQL trigger is defined using the CREATE TRIGGER statement.
CREATE [OR REPLACE] TRIGGER trigger_name
BEFORE | AFTER
[INSERT, UPDATE, DELETE [COLUMN NAME..]
ON table_name
Referencing [ OLD AS OLD | NEW AS NEW ]
FOR EACH ROW | FOR EACH STATEMENT [ WHEN Condition ]
DECLARE
[declaration_section
variable declarations;
constant declarations;
]
BEGIN
[executable_section
PL/SQL execute/subprogram body
]
EXCEPTION
[exception_section
PL/SQL Exception block
]
END;
Syntax Description
1. CREATE [OR REPLACE] TRIGGER trigger_name : Creates a trigger with the given
name. If a trigger with the same name already exists, it is overwritten.
2. BEFORE | AFTER : Indicates when the trigger fires. A BEFORE trigger executes before
the triggering statement; an AFTER trigger executes after the triggering statement.
3. [INSERT, UPDATE, DELETE [COLUMN NAME..]] : Determines the triggering event.
You can define more than one triggering event, separated by the OR keyword.
4. ON table_name : Defines the table on which the trigger event is performed.
5. Referencing [ OLD AS OLD | NEW AS NEW ] : Gives references to the old and new
values of the data. :old refers to the existing row involved in the event and :new refers to
the new row. You can set the referencing names to user-defined names in place of OLD
(or NEW).
You cannot reference old values when inserting a record, or new values when deleting a
record, because they do not exist.
6. FOR EACH ROW | FOR EACH STATEMENT : A row trigger fires once for each row
that is affected; a statement trigger fires only once for the entire SQL statement.
7. WHEN Condition : Optional. Used only for row-level triggers. The trigger fires only
when the specified condition is satisfied.
PL/SQL Triggers Example
You can write your own triggers by following the trigger syntax above. Here are a few trigger examples.
Inserting Trigger
This trigger executes BEFORE an INSERT to convert the emp_name field from lowercase to uppercase.
CREATE or REPLACE TRIGGER trg1
BEFORE
INSERT ON emp1
FOR EACH ROW
BEGIN
:new.emp_name := upper(:new.emp_name);
END;
/
Restricting Deletes with a Trigger
This trigger prevents a particular row from being deleted.
Delete Trigger Code:
CREATE or REPLACE TRIGGER trg1
AFTER
DELETE ON emp1
FOR EACH ROW
BEGIN
IF :old.emp_no = 1 THEN
raise_application_error(-20015, 'You can''t delete this row');
END IF;
END;
/
Delete Trigger Result :
SQL>delete from emp1 where emp_no = 1;
Error Code: 20015
Error Name: You can't delete this row
Query optimization
Query optimization is the process of selecting the most efficient query-evaluation plan from
among the many strategies usually possible for processing a given query, especially if the query
is complex.
We do not expect users to write their queries so that they can be processed efficiently. Rather, we
expect the system to construct a query-evaluation plan that minimizes the cost of query
evaluation. This is where query optimization comes into play. One aspect of optimization occurs
at the relational-algebra level, where the system attempts to find an expression that is equivalent
to the given expression, but more efficient to execute. Another aspect is selecting a detailed
strategy for processing the query, such as choosing the algorithm to use for executing an
operation, choosing the specific indices to use, and so on. The difference in cost (in terms of
evaluation time) between a good strategy and a bad strategy is often substantial, and may be
several orders of magnitude. Hence, it is worthwhile for the system to spend a substantial amount
of time on the selection of a good strategy for processing a query, even if the query is executed
only once.
Overview
Consider the following relational-algebra expression, for the query "Find the names of all
instructors in the Music department together with the course title of all the courses that the
instructors teach":

Π name, title (σ dept_name = "Music" (instructor ⋈ (teaches ⋈ Π course_id, title (course))))

Note that the projection of course on (course_id, title) is required since course shares an attribute
dept_name with instructor; if we did not remove this attribute using the projection, the above
expression using natural joins would return only courses from the Music department, even if
some Music department instructors taught courses in other departments.
The above expression constructs a large intermediate relation, instructor ⋈ teaches ⋈ Π course_id, title (course).
However, we are interested in only a few tuples of this relation (those pertaining to instructors in
the Music department), and in only two of the ten attributes of this relation. Since we are concerned
with only those tuples in the instructor relation that pertain to the Music department, we do not need
to consider those tuples that do not have dept_name = "Music". By reducing the number of tuples of the
instructor relation that we need to access, we reduce the size of the intermediate result. Our query is
now represented by the relational-algebra expression:

Π name, title ((σ dept_name = "Music" (instructor)) ⋈ (teaches ⋈ Π course_id, title (course)))

which is equivalent to our original algebra expression, but which generates smaller intermediate
relations. Figure 13.1 depicts the initial and transformed expressions.
An evaluation plan defines exactly what algorithm should be used for each operation, and how
the execution of the operations should be coordinated. Figure 13.2 illustrates one possible
evaluation plan for the expression from Figure 13.1(b). As we have seen, several different
algorithms can be used for each relational operation, giving rise to alternative evaluation plans.
In the figure, hash join has been chosen for one of the join operations, while the other uses merge
join, after sorting the relations on the join attribute, which is ID. Where edges are marked as
pipelined, the output of the producer is pipelined directly to the consumer, without being written
out to disk. Given a relational-algebra expression, it is the job of the query optimizer to come up
with a query-evaluation plan that computes the same result as the given expression, and is the
least-costly way of generating the result (or, at least, is not much costlier than the least-costly
way).
To find the least-costly query-evaluation plan, the optimizer needs to generate alternative plans
that produce the same result as the given expression, and to choose the least-costly one.
Generation of query-evaluation plans involves three steps: (1) generating expressions that are
logically equivalent to the given expression,
(2) annotating the resultant expressions in alternative ways to generate alternative query-evaluation plans, and (3) estimating the cost of each evaluation plan, and choosing the one whose
estimated cost is the least. Steps (1), (2), and (3) are interleaved in the query optimizer—some
expressions are generated, and annotated to generate evaluation plans, then further expressions
are generated and annotated, and so on. As evaluation plans are generated, their costs are
estimated by using statistical information about the relations, such as relation sizes and index
depths. To implement the first step, the query optimizer must generate expressions equivalent to
a given expression. It does so by means of equivalence rules that specify how to transform an
expression into a logically equivalent one. We describe these rules in Section 13.2. In Section
13.3 we describe how to estimate statistics of the results of each operation in a query plan. Using
these statistics with the cost formulae in Chapter 12 allows us to estimate the costs of individual
operations. The individual costs are combined to determine the estimated cost of evaluating a
given relational-algebra expression, as outlined earlier in Section 12.7. In Section 13.4, we
describe how to choose a query-evaluation plan. We can choose one based on the estimated cost
of the plans. Since the cost is an estimate, the selected plan is not necessarily the least-costly
plan; however, as long as the estimates are good, the plan is likely to be the least-costly one, or
not much more costly than it. Finally, materialized views help to speed up processing of certain
queries. In Section 13.5, we study how to “maintain” materialized views—that is, to keep them
up-to-date—and how to perform query optimization with materialized views.
VIEWING QUERY EVALUATION PLANS
Most database systems provide a way to view the evaluation plan chosen to execute a given
query. It is usually best to use the GUI provided with the database system to view evaluation
plans. However, if you use a command line interface, many databases support variations of a
command “explain <query>”, which displays the execution plan chosen for the specified query
<query>. The exact syntax varies with different databases:
• PostgreSQL uses the syntax shown above.
• Oracle uses the syntax explain plan for. However, the command stores the resultant plan in a
table called plan_table, instead of displaying it. The query "select * from table(dbms_xplan.display);"
displays the stored plan.
• DB2 follows a similar approach to Oracle, but requires the program db2exfmt to be executed to
display the stored plan.
• SQL Server requires the command set showplan text on to be executed before submitting the
query; then, when a query is submitted, instead of executing the query, the evaluation plan is
displayed.
The estimated costs for the plan are also displayed along with the plan. It is worth noting that the
costs are usually not in any externally meaningful unit, such as seconds or I/O operations, but
rather in units of whatever cost model the optimizer uses. Some optimizers such as PostgreSQL
display two cost-estimate numbers; the first indicates the estimated cost for outputting the first
result, and the second indicates the estimated cost for outputting all results.
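For instance, in Oracle the plan chosen for a query could be inspected as sketched below (the query
itself is only an example):

explain plan for
select dept_name, sum(salary) from instructor group by dept_name;

select * from table(dbms_xplan.display);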
Transformation of Relational Expressions
A query can be expressed in several different ways, with different costs of evaluation. In this
section, rather than take the relational expression as given, we consider alternative, equivalent
expressions. Two relational-algebra expressions are said to be equivalent if, on every legal
database instance, the two expressions generate the same set of tuples. (Recall that a legal
database instance is one that satisfies all the integrity constraints specified in the database
schema.) Note that the order of the tuples is irrelevant; the two expressions may generate the
tuples in different orders, but would be considered equivalent as long as the set of tuples is the
same.
In SQL, the inputs and outputs are multisets of tuples, and the multiset version of the relational
algebra (described in the box in page 238) is used for evaluating SQL queries. Two expressions
in the multiset version of the relational algebra are said to be equivalent if on every legal
database the two expressions generate the same multiset of tuples. The discussion in this chapter
is based on the relational algebra.
Concurrency control and Transaction management
When several transactions execute concurrently in the database, however, the isolation property
may no longer be preserved. To ensure that it is, the system must control the interaction among
the concurrent transactions; this control is achieved through one of a variety of mechanisms
called concurrency control schemes.
Lock-Based Protocols
One way to ensure isolation is to require that data items be accessed in a mutually exclusive
manner; that is, while one transaction is accessing a data item, no other transaction can modify
that data item. The most common method used to implement this requirement is to allow a
transaction to access a data item only if it is currently holding a lock on that item.
Locks
There are various modes in which a data item may be locked. In this section, we restrict our
attention to two modes:
1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S) on item Q, then Ti
can read, but cannot write, Q.
2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted by X) on
item Q, then Ti can both read and write Q.
We require that every transaction request a lock in an appropriate mode on data item Q,
depending on the types of operations that it will perform on Q. The transaction makes the request
to the concurrency-control manager. The transaction can proceed with the operation only after
the concurrency-control manager grants the lock to the transaction. The use of these two lock
modes allows multiple transactions to read a data item but limits write access to just one
transaction at a time.
To state this more generally, given a set of lock modes, we can define a compatibility function
on them as follows: Let A and B represent arbitrary lock modes. Suppose that a transaction Ti
requests a lock of mode A on item Q on which transaction Tj (Ti ≠ Tj) currently holds a lock of
mode B. If transaction Ti can be granted a lock on Q immediately, in spite of the presence of the
mode B lock, then we say mode A is compatible with mode B. Such a function can be
represented conveniently by a matrix. The compatibility relation between the two modes of
locking discussed in this section appears in the matrix comp of Figure 15.1. An element comp(A,
B) of the matrix has the value true if and only if mode A is compatible with mode B. Note that
shared mode is compatible with shared mode, but not with exclusive mode. At any time, several
shared-mode locks can be held simultaneously (by different transactions) on a particular data
item. A subsequent exclusive-mode lock request has to wait until the currently held shared-mode
locks are released.
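A reconstruction of the compatibility matrix comp described above, for the two modes S and X only, is:

        S       X
S       true    false
X       false   false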
A transaction requests a shared lock on data item Q by executing the lock- S(Q) instruction.
Similarly, a transaction requests an exclusive lock through the lock-X(Q) instruction. A
transaction can unlock a data item Q by the unlock (Q) instruction.
To access a data item, transaction Ti must first lock that item. If the data item is already locked
by another transaction in an incompatible mode, the concurrency control manager will not grant
the lock until all incompatible locks held by other transactions have been released. Thus, Ti is
made to wait until all incompatible locks held by other transactions have been released.
Transaction Ti may unlock a data item that it had locked at some earlier point. Note that a
transaction must hold a lock on a data item as long as it accesses that item. Moreover, it is not
necessarily desirable for a transaction to unlock a data item immediately after its final access of
that data item, since serializability may not be ensured.
As an illustration, consider again the banking example. Let A and B be two accounts that are
accessed by transactions T1
T1: lock-X(B);
read(B);
B := B − 50;
write(B);
unlock(B);
lock-X(A);
read(A);
A := A + 50;
write(A);
unlock(A).
Figure 15.2 Transaction T1.
and T2. Transaction T1 transfers $50 from account B to account A (Figure 15.2). Transaction T2
displays the total amount of money in accounts A and B—that is, the sum A + B (Figure 15.3).
Suppose that the values of accounts A and B are $100 and $200, respectively. If these two
transactions are executed serially, either in the order T1, T2 or the order T2, T1, then transaction
T2 will display the value $300. If, however, these transactions are executed concurrently, then
schedule 1, in Figure 15.4, is possible.
In this case, transaction T2 displays $250, which is incorrect. The reason for this mistake is that
the transaction T1 unlocked data item B too early, as a result of which T2 saw an inconsistent
state.
The schedule shows the actions executed by the transactions, as well as the points at which the
concurrency-control manager grants the locks. The transaction making a lock request cannot
execute its next action until the concurrency control manager grants the lock. Hence, the lock
must be granted in the interval of time between the lock-request operation and the following
action of the transaction.
Exactly when within this interval the lock is granted is not important; we can safely assume that
the lock is granted just before the following action of the transaction. We shall therefore drop the
column depicting the actions of the concurrency-control manager from all schedules depicted in
the rest of the chapter. We let you infer when locks are granted.
T2: lock-S(A);
read(A);
unlock(A);
lock-S(B);
read(B);
unlock(B);
display(A + B).
Figure 15.3 Transaction T2.
Suppose now that unlocking is delayed to the end of the transaction. Transaction T3 corresponds
to T1 with unlocking delayed (Figure 15.5). Transaction T4 corresponds to T2 with unlocking
delayed (Figure 15.6).
You should verify that the sequence of reads and writes in schedule 1, which lead to an incorrect
total of $250 being displayed, is no longer possible with T3
T3: lock-X(B);
read(B);
B := B − 50;
write(B);
lock-X(A);
read(A);
A := A + 50;
write(A);
unlock(B);
unlock(A).
Figure 15.5 Transaction T3 (transaction T1 with unlocking delayed).
T4: lock-S(A);
read(A);
lock-S(B);
read(B);
display(A + B);
unlock(A);
unlock(B).
Figure 15.6 Transaction T4 (transaction T2 with unlocking delayed).
and T4. Other schedules are possible. T4 will not print out an inconsistent result in any of them;
we shall see why later. Unfortunately, locking can lead to an undesirable situation. Consider the
partial schedule of Figure 15.7 for T3 and T4. Since T3 is holding an exclusive mode lock on B
and T4 is requesting a shared-mode lock on B, T4 is waiting for T3 to unlock B. Similarly, since
T4 is holding a shared-mode lock on A and T3 is requesting an exclusive-mode lock on A, T3 is
waiting for T4 to unlock A. Thus, we have arrived at a state where neither of these transactions
can ever proceed with its normal execution. This situation is called deadlock. When deadlock
occurs, the system must roll back one of the two transactions. Once a transaction has been rolled
back, the data items that were locked by that transaction are unlocked.
These data items are then available to the other transaction, which can continue with its
execution. If we do not use locking, or if we unlock data items too soon after reading or writing
them, we may get inconsistent states. On the other hand, if we do not unlock a data item before
requesting a lock on another data item, deadlocks may occur. There are ways to avoid deadlock
in some situations. However, in general, deadlocks are a necessary evil associated with locking,
if we want to avoid inconsistent states.
Deadlocks are definitely preferable to inconsistent states, since they can be handled by rolling
back transactions, whereas inconsistent states may lead to real-world problems that cannot be
handled by the database system.
We shall require that each transaction in the system follow a set of rules, called a locking
protocol, indicating when a transaction may lock and unlock each of the data items. Locking
protocols restrict the number of possible schedules. The set of all such schedules is a proper
subset of all possible serializable schedules. We shall present several locking protocols that allow
only conflict-serializable schedules, and thereby ensure isolation. Before doing so, we introduce
some terminology.
Let {T0, T1, . . . , Tn} be a set of transactions participating in a schedule S. We say that Ti
precedes Tj in S, written Ti → Tj, if there exists a data item Q such that Ti has held lock mode A
on Q, and Tj has held lock mode B on Q later, and comp(A,B) = false. If Ti →Tj , then that
precedence implies that in any equivalent serial schedule, Ti must appear before Tj . Observe that
this graph is similar to the precedence graph that we used in Section 14.6 to test for conflict
serializability. Conflicts between instructions correspond to non compatibility of lock modes.
We say that a schedule S is legal under a given locking protocol if S is a possible schedule for a
set of transactions that follows the rules of the locking protocol.
We say that a locking protocol ensures conflict serializability if and only if all legal schedules
are conflict serializable; in other words, for all legal schedules the associated→relation is
acyclic.
Performance Tuning
Tuning the performance of a system involves adjusting various parameters and design choices to
improve its performance for a specific application. Various aspects of a database-system
design—ranging from high-level aspects such as the schema and transaction design to database
parameters such as buffer sizes, down to hardware issues such as number of disks—affect the
performance of an application. Each of these aspects can be adjusted so that performance is
improved.
Improving Set Orientation
Tuning of Bulk Loads and Updates
Location of Bottlenecks
Tunable Parameters
Tuning of Hardware
Tuning of the Schema
Tuning of Indices
Using Materialized Views
Automated Tuning of Physical Design
Tuning of Concurrent Transactions
Performance Simulation
1) Improving Set Orientation
When SQL queries are executed from an application program, it is often the case that a query is
executed frequently, but with different values for a parameter. Each call has an overhead of
communication with the server, in addition to processing overheads at the server.
For example, consider a program that steps through each department, invoking an embedded
SQL query to find the total salary of all instructors in the department:
select sum(salary)
from instructor
where dept_name = ?
If the instructor relation does not have a clustered index on dept_name, each such query will
result in a scan of the relation. Even if there is such an index, a random I/O operation will be
required for each dept_name value.
Instead, we can use a single SQL query to find total salary expenses of each department:
select dept_name, sum(salary)
from instructor
group by dept_name;
This query can be evaluated with a single scan of the instructor relation, avoiding random I/O for
each department. The results can be fetched to the client side using a single round of
communication, and the client program can then step through the results to find the aggregate for
each department. Combining multiple SQL queries into a single SQL query as above can reduce
execution costs greatly in many cases–for example, if the instructor relation is very large and has
a large number of departments.
Another technique used widely in client–server systems to reduce the cost of communication and
SQL compilation is to use stored procedures, where queries are stored at the server in the form of
procedures, which may be precompiled. Clients can invoke these stored procedures, rather than
communicate a series of queries.
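As a sketch (the procedure name and the OUT ref cursor parameter are assumptions), the department-salary
query above could be wrapped in a stored procedure like this:

CREATE OR REPLACE PROCEDURE dept_salary_totals (result OUT SYS_REFCURSOR)
IS
BEGIN
-- the aggregate is computed once at the server, rather than one query per department from the client
OPEN result FOR
SELECT dept_name, SUM(salary)
FROM instructor
GROUP BY dept_name;
END;
/

A client then makes a single call to dept_salary_totals and fetches the results, instead of compiling
and sending one query per department.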
Another aspect of improving set orientation lies in rewriting queries with nested sub-queries. In
the past, optimizers on many database systems were not particularly good, so how a query was
written would have a big influence on how it was executed, and therefore on the performance.
Today’s advanced optimizers can transform even badly written queries and execute them
efficiently, so the need for tuning individual queries is less important than it used to be.
However, complex queries containing nested subqueries are not optimized very well by many
optimizers.
2) Tuning of Bulk Loads and Updates
When loading a large volume of data into a database (called a bulk load operation), performance
is usually very poor if the inserts are carried out as separate SQL insert statements. One reason is
the overhead of parsing each SQL query; a more important reason is that performing integrity
constraint checks and index updates separately for each inserted tuple results in a large number
of random I/O operations. If the inserts were done as a large batch, integrity-constraint checking
and index update can be done in a much more set-oriented fashion, reducing overheads greatly;
performance improvements of an order-of-magnitude or more are not uncommon.
To support bulk load operations, most database systems provide a bulk import utility, and a
corresponding bulk export utility. The bulk-import utility reads data from a file, and performs
integrity constraint checking as well as index maintenance in a very efficient manner. Common
input and output file formats supported by such bulk import/export utilities include text files with
characters such as commas or tabs separating attribute values, with each record in a line of its
own (such file formats are referred to as comma-separated values or tab-separated values
formats). Database specific binary formats, as well as XML formats are also supported by bulk
import/export utilities. The names of the bulk import/export utilities differ by database.
In PostgreSQL, the utilities are called pg_dump and pg_restore (PostgreSQL also provides an
SQL command copy which provides similar functionality). The bulk import/export utility in
Oracle is called SQL*Loader, the utility in DB2 is called load, and the utility in SQL Server is
called bcp (SQL Server also provides an SQL command called bulk insert).
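For instance, a minimal sketch of a bulk load using the PostgreSQL copy command mentioned above
(the table name and server-side file path are assumptions):

copy instructor from '/tmp/instructor.csv' with (format csv, header true);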
3) Location of Bottlenecks
The performance of most systems (at least before they are tuned) is usually limited primarily by
the performance of one or a few components, called bottlenecks.
For instance, a program may spend 80 percent of its time in a small loop deep in the code, and
the remaining 20 percent of the time on the rest of the code; the small loop then is a bottleneck.
Improving the performance of a component that is not a bottleneck does little to improve the
overall speed of the system; in the example, improving the speed of the rest of the code cannot
lead to more than a 20 percent improvement overall, whereas improving the speed of the
bottleneck loop could result in an improvement of nearly 80 percent overall, in the best case.
Hence, when tuning a system, we must first try to discover what the bottlenecks are and then
eliminate them by improving the performance of system components causing the bottlenecks.
When one bottleneck is removed, it may turn out that another component becomes the
bottleneck. In a well-balanced system, no single component is the bottleneck. If the system
contains bottlenecks, components that are not part of the bottleneck are underutilized, and could
perhaps have been replaced by cheaper components with lower performance.
4) Tunable Parameters
Database administrators can tune a database system at three levels. The lowest level is at the
hardware level. Options for tuning systems at this level include adding disks or using a RAID
system if disk I/O is a bottleneck, adding more memory if the disk buffer size is a bottleneck, or
moving to a faster processor if CPU use is a bottleneck.
The second level consists of the database-system parameters, such as buffer size and
checkpointing intervals. The exact set of database-system parameters that can be tuned depends on the
specific database system. Most database-system manuals provide information on what
database-system parameters can be adjusted, and how you should choose values for the parameters.
Well-designed database systems perform as much tuning as possible automatically, freeing the
user or database administrator from the burden. For instance, in many database systems the
buffer size is fixed but tunable. If the system automatically adjusts the buffer size by observing
indicators such as page-fault rates, then the database administrator will not have to worry about
tuning the buffer size.
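As a concrete, PostgreSQL-flavoured sketch, a database-system parameter such as the buffer size can be inspected and changed as shown below; parameter names and suitable values differ from system to system, and some settings take effect only after a restart:
show shared_buffers;                       -- inspect the current buffer-pool size
alter system set shared_buffers = '1GB';   -- persist a new value (applied after a restart)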
The third level is the highest level. It includes the schema and transactions. The administrator
can tune the design of the schema, the indices that are created, and the transactions that are
executed, to improve performance. Tuning at this level is comparatively system independent.
The three levels of tuning interact with one another; we must consider them together when
tuning a system. For example, tuning at a higher level may result in the hardware bottleneck
changing from the disk system to the CPU, or vice versa.
5) Tuning of Hardware
Even in a well-designed transaction processing system, each transaction usually has to do at least
a few I/O operations, if the data required by the transaction are on disk. An important factor in
tuning a transaction processing system is to make sure that the disk subsystem can handle the
rate at which I/O operations are required. For instance, consider a disk that supports an access
time of about 10 milliseconds, and average transfer rate of 25 to 100 megabytes per second (a
fairly typical disk today). Such a disk would support a little under 100 random-access I/O
operations of 4 kilobytes each per second. If each transaction requires just 2 I/O operations, a
single disk would support at most 50 transactions per second.
The only way to support more transactions per second is to increase the number of disks. If the
system needs to support n transactions per second, each performing 2 I/O operations, data must
be striped (or otherwise partitioned) across at least n/50 disks (ignoring skew).
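For example, under these assumptions a workload of 1,000 transactions per second, each issuing 2 random I/O operations, would need the data spread across at least 1000/50 = 20 disks.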
Notice here that the limiting factor is not the capacity of the disk, but the speed at which random
data can be accessed (limited in turn by the speed at which the disk arm can move). The number
of I/O operations per transaction can be reduced by storing more data in memory. If all data are
in memory, there will be no disk I/O except for writes. Keeping frequently used data in memory
reduces the number of disk I/Os, and is worth the extra cost of memory. Keeping very
infrequently used data in memory would be a waste, since memory is much more expensive than
disk.
6) Tuning of the Schema
Within the constraints of the chosen normal form, it is possible to partition relations vertically.
For example, consider the course relation, with the schema:
course (course_id, title, dept_name, credits)
for which course_id is a key. Within the constraints of the normal forms (BCNF and third normal
form), we can partition the course relation into two relations:
course_credit (course_id, credits)
course_title_dept (course_id, title, dept_name)
The two representations are logically equivalent, since course_id is a key, but they have different
performance characteristics. If most accesses to course information look at only the course_id
and credits, then they can be run against the course_credit relation, and access is likely to be
somewhat faster, since the title and dept_name attributes are not fetched. For the same reason,
more tuples of course_credit will fit in the buffer than corresponding tuples of course, again
leading to faster performance. This effect would be particularly marked if the title and dept_name
attributes were large. Hence, a schema consisting of course_credit and course_title_dept would be
preferable to a schema consisting of the course relation in this case. On the other hand, if most
accesses to course information require both dept_name and credits, using the course relation
would be preferable, since the cost of the join of course_credit and course_title_dept would be
avoided. Also, the storage overhead would be lower, since there would be only one relation, and
the attribute course_id would not be replicated.
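A minimal SQL sketch of this vertical partitioning is shown below; the column types are assumptions, and the view simply reconstructs the original course relation when both parts are needed:
create table course_credit (
    course_id varchar(8) primary key,
    credits   numeric(2,0)
);

create table course_title_dept (
    course_id varchar(8) primary key,
    title     varchar(50),
    dept_name varchar(20)
);

-- logical equivalence: the original relation is the join of the two parts on the key
create view course as
    select course_id, title, dept_name, credits
    from course_credit natural join course_title_dept;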
7) Tuning of Indices
We can tune the indices in a database system to improve performance. If queries are the
bottleneck, we can often speed them up by creating appropriate indices on relations. If updates
are the bottleneck, there may be too many indices, which have to be updated when the relations
are updated. Removing indices may speed up certain updates.
The choice of the type of index also is important. Some database systems support different kinds
of indices, such as hash indices and B-tree indices. If range queries are common, B-tree indices
are preferable to hash indices. Whether to make an index a clustered index is another tunable
parameter. Only one index on a relation can be made clustered, by storing the relation sorted on
the index attributes. Generally, the index that benefits the greatest number of queries and updates
should be made clustered.
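The sketch below illustrates these choices in PostgreSQL syntax; the index names and the choice of indexed attributes are hypothetical, and the way a clustered (sorted) storage order is requested varies by system:
-- B-tree index (the default), suitable for range queries on salary
create index instructor_salary_idx on instructor (salary);

-- hash index, useful only for equality lookups on dept_name
create index instructor_dept_hidx on instructor using hash (dept_name);

-- in PostgreSQL, CLUSTER physically sorts the relation on the chosen index
-- (a one-time reordering; some other systems maintain clustered indices continuously)
cluster instructor using instructor_salary_idx;

-- if updates are the bottleneck, dropping a little-used index may help
drop index instructor_dept_hidx;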
To help identify what indices to create, and which index (if any) on each relation should be
clustered, most commercial database systems provide tuning wizards. These tools use the past
history of queries and updates (called the workload) to estimate the effects of various indices on
the execution time of the queries and updates in the workload. Recommendations on what
indices to create are based on these estimates.
Distributed Database System
A distributed database system consists of loosely coupled sites that share no physical
component
Database systems that run on each site are independent of each other
Transactions may access data at one or more sites
Homogeneous Distributed Databases
In a homogeneous distributed database
o All sites have identical software
o Are aware of each other and agree to cooperate in processing user requests.
o Each site surrenders part of its autonomy in terms of right to change schemas or
software
o Appears to the user as a single system
In a heterogeneous distributed database
o Different sites may use different schemas and software
Difference in schema is a major problem for query processing
Difference in software is a major problem for transaction processing
o Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing
Distributed Data Storage
Assume relational data model
Replication
o System maintains multiple copies of data, stored in different sites, for faster
retrieval and fault tolerance.
Fragmentation
o Relation is partitioned into several fragments stored in distinct sites
Replication and fragmentation can be combined
Relation is partitioned into several fragments: system maintains several identical replicas
of each such fragment.
Data Replication
A relation or fragment of a relation is replicated if it is stored redundantly in two or more
sites.
Full replication of a relation is the case where the relation is stored at all sites.
Fully redundant databases are those in which every site contains a copy of the entire
database.
Advantages of Replication
o Availability: failure of a site containing relation r does not result in unavailability
of r if replicas exist.
o Parallelism: queries on r may be processed by several nodes in parallel.
o Reduced data transfer: relation r is available locally at each site containing a
replica of r.
Disadvantages of Replication
o Increased cost of updates: each replica of relation r must be updated.
o Increased complexity of concurrency control: concurrent updates to distinct
replicas may lead to inconsistent data unless special concurrency control
mechanisms are implemented.
Data Fragmentation
Division of relation r into fragments r1, r2, …, rn which contain sufficient information to
reconstruct relation r.
Horizontal fragmentation: each tuple of r is assigned to one or more fragments
Vertical fragmentation: the schema for relation r is split into several smaller schemas
o All schemas must contain a common candidate key (or superkey) to ensure
lossless join property.
o A special attribute, the tuple-id attribute may be added to each schema to serve as
a candidate key.
Example: relation account with the following schema
Account-schema = (branch-name, account-number, balance)
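Using this schema, horizontal fragments can be built by selecting tuples by branch, and vertical fragments by projecting subsets of the columns. The sketch below is written with underscores in place of hyphens to form valid SQL identifiers; the branch names are hypothetical, and account_number serves as the common key that keeps the vertical split lossless:
-- horizontal fragmentation: each fragment holds the tuples of one branch
create table account_hillside as
    select * from account where branch_name = 'Hillside';
create table account_valleyview as
    select * from account where branch_name = 'Valleyview';

-- vertical fragmentation: both fragments retain the key account_number
create table account_branch  as select account_number, branch_name from account;
create table account_balance as select account_number, balance from account;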
Advantages of Fragmentation
Horizontal:
o allows parallel processing on fragments of a relation
o allows a relation to be split so that tuples are located where they are most
frequently accessed
Vertical:
o allows tuples to be split so that each part of the tuple is stored where it is most
frequently accessed
o tuple-id attribute allows efficient joining of vertical fragments
o allows parallel processing on a relation
Vertical and horizontal fragmentation can be mixed.
Fragments may be successively fragmented to an arbitrary depth.
Data Transparency
Data transparency: degree to which a system user may remain unaware of the details of
how and where the data items are stored in a distributed system
Consider transparency issues in relation to:
o Fragmentation transparency
o Replication transparency
o Location transparency
Naming of Data Items – Criteria
1. Every data item must have a system-wide unique name.
2. It should be possible to find the location of data items efficiently.
3. It should be possible to change the location of data items transparently.
4. Each site should be able to create new data items autonomously.
Distributed Transactions
Transaction may access data at several sites.
Each site has a local transaction manager responsible for:
o Maintaining a log for recovery purposes
o Participating in coordinating the concurrent execution of the transactions
executing at that site.
Each site has a transaction coordinator, which is responsible for:
o Starting the execution of transactions that originate at the site.
o Distributing subtransactions to appropriate sites for execution.
o Coordinating the termination of each transaction that originates at the site, which may
result in the transaction being committed at all sites or aborted at all sites.
Transaction System Architecture
System Failure Modes
Failures unique to distributed systems:
o Failure of a site.
o Loss of messages
Handled by network transmission control protocols such as TCP/IP
o Failure of a communication link
Handled by network protocols, by routing messages via alternative links
o Network partition
A network is said to be partitioned when it has been split into two or
more subsystems that lack any connection between them
Note: a subsystem may consist of a single node
Network partitioning and site failures are generally indistinguishable.
Database Security
Database system security is more than securing the database; it also requires:
Secure database
Secure DBMS
Secure applications
Secure operating system in relation to database system
Secure web server in relation to database system
Secure network environment in relation to database system