A DATA DEFINITION AND MAPPING LANGUAGE FOR NUMERICAL DATA BASES
Ola-Olu A. Daini
and Peter Scheuermann
Electrical Engineering and Computer Science Department
Northwestern University
Evanston, Illinois 60201
12]. A variety of compact storage schemes have
been developed and facilities for in-core data manipulation using these schemes are available in a
number of the software packages currently in use
at any computing center. However, only a few compact storage schemes are currently
implemented for the manipulation of large dense or
sparse matrices residing on secondary devices, and
these are not readily available [3,8,9]. This is
because some of these methods employ
quite complex data structures, such as threaded
linked lists [11], which require complex programs
for their implementation on secondary devices.
In addition, application users face the added difficulty of
accessing compact matrix data residing on secondary devices.
Abstract
Numerical data bases arise in many scientific
applications to keep track of large sparse and
dense matrices.
Unlike the many matrix data storage techniques available for in-core manipulation,
very large matrices are currently limited to a few
compact storage schemes on secondary devices, due
to the complex underlying data management facilities. This paper proposes an approach for generalized numerical database management that would promote physical data independence by relieving users
from the need for knowledge of the physical data
organization on the secondary devices.
Our approach is to describe each of the storage techniques for dense and sparse matrices by a
physical schema, which encompasses the corresponding access path, the encoding to storage structures, and the file access method.
A generalized
facility for describing any kind of numerical database and its mapping to storage is provided via
nonprocedural Stored-Data Description and Mapping
Languages (SDDL and SDML). The languages are processed by a Generalized Syntax-Directed Translation
Scheme (GSDTS) to automatically generate FORTRAN
conversion programs for creating or translating a numerical database from one compact storage scheme
to another. The feasibility of the generalized approach with regard to our current implementation
is also discussed.
Numerical data bases are data bases needed to process numerical applications that
reside on secondary storage devices in compact matrix storage forms.
A numerical application database may consist of one to three interrelated sets of files, because pseudo data (e.g., the distance from the diagonal and the row beginning in the
data item vector) is usually kept in files separate
from the data item file.
In addition, the files in a set may be processed by different
file access methods, e.g., sequential for the pseudo data file and indexed sequential or direct for the
index and data item files.
1. Introduction
The problem of storage representation for
dense/sparse matrices in main core, in order to
optimize storage costs or processing time, has received considerable attention in literature [6,10,
* This research is supported by the University
of Ife, Ile-Ife, Nigeria.
** On study leave from Computer Science Department, University of Ife, Nigeria.
Permission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct
commercial advantage, the ACM copyright notice and the title of the
publication and its date appear, and notice is given that copying is by
permission of the Association for Computing Machinery. To copy
otherwise, or to republish, requires a fee and/or specific permission.
©1980 ACM 0-89791-028-1/80/1000/0418 $00.75
While there recently have been important advances in the use of very large data bases in commercial applications, little has been done in the
area of numerical applications because the current
facilities of database management systems (DBMS)
are not suitable for processing numerical data
bases in the majority of the matrix compact storage schemes.
To address this problem, a generalized approach to numerical database management is needed, in which
numerical application users have facilities for data definition and mapping, as well as data access, for numerical data bases in any matrix compact storage
scheme. These facilities should take the form of simple high-level nonprocedural
languages that relieve users of the need to know
the low-level details of the physical implementation.
The main advantage of data definition and
mapping facilities is that the information on a
storage structure that usually resides in an application program
is moved into a schema. The schema
provides information on the physical storage organization
and its mapping interface to the operating system, so that the user provides only information
about logical data descriptions.
These facilities are usually provided by a data definition
language (DDL) or by stored-data description and
mapping languages (SDDL and SDML) [7]. Similarly,
data access facilities are provided by a data manipulation language (DML) which promotes physical
data independence.
ally corresponds to dense or sparse matrices, and
any such data necessary to process a numerical application which is residing on secondary storage
is called here a numerical database.
We discuss
matrix features which provide guidelines towards
minimization of storage space and storage data
representation as well as DBMS concepts, such as
schema and the data language facilities, which enable our generalized approach.
Our investigation of data language facilities reported in [2,7,13,17,18] reveals that none
are suitable for numerical data management, which
usually requires different kinds of indexing and
ordering capabilities.
Therefore, we have designed the data language facilities (SDDL, SDML, and
DML) which can provide a generalized approach to
numerical database management.
In view of the
limited number of compact storage schemes currently in use for numerical database management, we
are implementing a generalized data translator
that will automatically restructure any numerical
database from one compact storage scheme to another by means of SDDL and SDML facilities.
This
satisfies an important goal of data portability,
and in addition the methodology developed for the
data translator is an essential part for the support of a DML, which will be implemented in the
second phase of our project.
Two major types of matrices, dense and sparse
matrices, will be considered.
A dense matrix has
a high proportion of nonzero elements, while a
sparse matrix has few nonzero elements.
The two
basic features that permit compact storage of matrix data are symmetry and bandwidth.
Different compact storage schemes for symmetric and band matrices, as well as several other sparse matrix indexing schemes, are identified in the literature [6,
9-12]. These compact storage schemes are described by the corresponding numerical physical
schemas in our generalized approach, as described in Section 3.
Our current approach provides the following
features:
1. Each dense or sparse matrix compact storage scheme can be described by a physical
   schema, which comprises the corresponding
   data access path, the encoding to storage
   structures and the file access method.
2. A generalized facility for describing any
   kind of numerical database and its mapping
   to secondary storage, i.e. the physical
   schema, is provided via nonprocedural
   Stored-Data Description and Mapping Languages (SDDL and SDML).
3. A generalized data translator that will
   enable application users to create or to
   restructure their numerical database from
   one compact storage scheme to another, by
   supplying the SDDL and SDML statements of
   the source and target database descriptions.
2.2. Schema
The term schema was originally coined in connection with the logical database description,
i.e. the definition of the objects, roles and
properties of interest to a given enterprise.
The
term was first brought into usage by the CODASYL
Database Task Group [4,5]. However, it is now
used in a broader sense to stand for data descriptions in database systems at the logical or physical level.
Since for our numerical databases the
logical structure is relatively simple, the role
of the physical schema which describes the mapping
to storage becomes predominant.
Each type of
matrix data organization, such as a square, lower
triangular or band matrix, could be viewed as
corresponding to a logical schema, while any compact storage scheme can be viewed as a storage
model with a corresponding physical schema. The
physical schema completely describes the mapping
to storage in terms of:
(1) access path organization, (2) encoding of storage structures, and
(3) operating system accessing methods [15].
2.3. Data Language Facilities
Data definition and mapping facilities are
important features of a DBMS which support the
concept of data independence.
These facilities
are provided either in the form of a single self-contained language, like the data definition language
(DDL), or as two languages: a stored-data
description language (SDDL) and a stored-data
mapping language (SDML). A DDL is generally a
declarative language for specifying logical data
structures, and a data mapping language specifies
the mapping of the logical data structure to the
storage space.
We begin by describing some relevant concepts
from numerical analysis and DBMS in Section 2.
Next, the numerical physical schemas and the SDDL
and SDML facilities are described in Sections 3
and 4. The feasibility of the SDDL and SDML in
numerical database management and their implementation by a Generalized Syntax-Directed Translation
Scheme (GSDTS) as part of our generalized data
translator is discussed in Section 5.
2. Overview of Numerical Analysis and DBMS Concepts
2.1. Dense and Sparse Matrix Compact Storage Schemes
The Database Task Group in [5] proposed a
schema DDL as a language for defining a data model together with its mapping to storage so that
it would meet the requirements of many distinct
programming languages.
Another CODASYL group,
Numerical data are usually generated in both
quantitative and qualitative problem solving operations in the social sciences, physical sciences,
engineering, etc. Numerical application data usu-
or. The stored-data organization is in row or
column major order and the access path is direct.
Each of these storage schemes requires a single
external file, and those with a nonsymmetric dataset are usually processed by a sequential file access method.
Indexed sequential or direct file
access methods may, however, be appropriate for symmetric
matrices in order to reduce the access time involved in reconstructing the data items for a row/column.
We identify the following storage schemes
in this category (albeit close to their logical
counterparts).
the Stored-Data Definition and Translation Task
Group [7], proposed a stored-data and data translation model and language for describing and translating among a wide class of logical and physical
structures.
Additional data definition and mapping languages have been proposed, with prototype
implementations, for database reorganization, e.g.
[2,13,17,18].
The language facilities are usually
designed for operating on the traditional database
schemas of relational, hierarchical and network
data models.
The matrix compact storage schemes which represent our model cannot be suitably defined using
the data language facilities mentioned above because of the requirements for different kinds of
indexing and data ordering capabilities.
Therefore, we decided to develop nonprocedural stored-data
description and mapping languages (SDDL and
SDML) which provide a generalized approach for
describing and mapping any numerical database to
secondary storage.
The two languages are discussed in Section 4.
An illustration of one of the schemas is shown below in Figure 1.
Example 5 x 5 band matrix from the figure:
    4 2 0 0 0
    3 5 6 0 0
    0 1 4 3 0
    0 0 9 1 7
    0 0 0 2 3
Another important feature of a DBMS is a data
manipulation language (DML) which provides the interface between the application users and the DBMS
via a set of higher-level commands.
We have designed a DML which contains commands embedded in
FORTRAN, corresponding to the operation performed
on numerical databases.
However, the DML will not
be discussed further, since its implementation
will be considered only in a future project.
Figure 1 - Dense nonsymmetric-band matrix data structure (source dataset, logical schema, and storage scheme).
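The band idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `to_band_storage` and `band_lookup` are hypothetical, the layout (one row of `lower + upper + 1` slots per matrix row, padded with zeros at the corners) is a common band layout and not necessarily the one drawn in the figure, and indices are 0-based for brevity.

```python
def to_band_storage(matrix, lower, upper):
    """Pack a square band matrix into n rows of (lower + upper + 1) slots.

    Assumed layout: slot k of row i holds element (i, i - lower + k);
    slots that fall outside the matrix stay zero.
    """
    n = len(matrix)
    width = lower + upper + 1
    band = [[0] * width for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - lower), min(n, i + upper + 1)):
            band[i][j - i + lower] = matrix[i][j]
    return band


def band_lookup(band, lower, upper, i, j):
    """Return element (i, j); anything outside the band is zero."""
    if -lower <= j - i <= upper:
        return band[i][j - i + lower]
    return 0


# A 5 x 5 matrix with one sub-diagonal and one super-diagonal.
M = [
    [4, 2, 0, 0, 0],
    [3, 5, 6, 0, 0],
    [0, 1, 4, 3, 0],
    [0, 0, 9, 1, 7],
    [0, 0, 0, 2, 3],
]
BAND = to_band_storage(M, lower=1, upper=1)
```

Here 15 slots replace the 25 of the full matrix, and every element is still reachable by a computed address, which is why the access path of this group is called direct.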
The group's access path is direct because the
search technique uses computed-access array storage mapping which is defined as follows [14]:
3. Numerical Physical Schemas
As we mentioned previously, the various storage techniques for dense and sparse matrices suggested in literature can be represented by a corresponding physical schema, which depicts not only
the access path, but also the encoding of storage
structures and the file access method.
In order
to generalize the description of the physical
schemas, we investigated their access paths for
similarities.
Our investigation reveals three
groups which have direct, indirect and linked access paths respectively.
The direct access path
corresponds to dense array realization, the indirect to the technique of going through an index to
access a data item (non-zero element) and the
linked to the technique of accessing a data item
through other data items connected to it by pointers. Formal definitions of the access paths will
be presented later.
Definition:
Let $N$ denote the set of positive integers and $A$ be a two-dimensional array scheme.
A computed access storage mapping for $A$ is a total
function $f: N \times N \to N$ such that: (1) $f(1,1) = 1$,
and (2) $f$ is one-to-one on array scheme $A$.
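As a concrete reading of this definition, the sketch below gives two well-known mappings that satisfy both conditions: the row-major address polynomial for a regular m x n matrix and the packed row-major mapping for a lower-triangular matrix. The function names are hypothetical, subscripts are 1-based as in the definition, and the triangular mapping is only defined on the lower triangle (j <= i), which is the array scheme it is one-to-one on.

```python
def address_polynomial(i, j, n_cols):
    """Row-major address of element (i, j) in a regular matrix with n_cols columns."""
    return (i - 1) * n_cols + j


def lower_triangular_address(i, j):
    """Row-major address of element (i, j), j <= i, in packed lower-triangular storage."""
    if j > i:
        raise ValueError("element lies above the diagonal and is not stored")
    return i * (i - 1) // 2 + j


# Enumerate the addresses produced for a 3 x 4 matrix and a 4 x 4 lower triangle.
full_addresses = [address_polynomial(i, j, 4) for i in range(1, 4) for j in range(1, 5)]
tri_addresses = [lower_triangular_address(i, j) for i in range(1, 5) for j in range(1, i + 1)]
```

Both lists come out as consecutive integers starting at 1 with no repeats, which is exactly conditions (1) and (2).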
Since in our case the access paths are closely
related to the actual encodings of the storage
structures, which specify mappings into a linear
address space [15], we identify the groups as direct, indirect and linked encodings respectively.
We shall assume that in our approach the linear
address space refers to storage space on secondary
devices.
1. address-polynomial (regular m x n matrix)
2. lower- or upper-triangular
3. symmetric-band
4. nonsymmetric-band
3.1. Direct Encoding Group
Numerical physical schemas in this group describe compact storage schemes for dense matrices.
Their logical schema comprises the dense m x n,
lower-/upper-triangular or band matrices, and
their storage is either an m x n matrix or a vect-
3.2. Indirect Encoding Group
This group of numerical physical schemas describes the storage structures for all the sparse
matrix indexing techniques whose access paths include reference data stored separately from the data items themselves.
Their logical schema is an m x n
sparse matrix or a lower/upper diagonal matrix.
Their storage scheme consists of vectors of data
items, i.e., the non-zero elements, in row or column major order, with corresponding row and/or
column indices and/or reference data. Reference
data, i.e., pseudo data, refers to the location of
data items within the source matrix, the row/column
beginning in the data item vector, or the distance
from the diagonal.
These schemas usually require
interrelated sets of two or three files, and their choice of file access method depends
on the type of expected row/column retrieval.
For sequential row/column retrieval, a sequential
file access method is adequate; for random row/column retrieval, we can choose either indexed sequential/direct for all files or a combination of sequential for the reference data file and indexed/direct for the index and data item files. The schemas
we identify in this group are:
pointer linkage.
1. single-indexing
2. double-indexing-1 (row-column-1)
3. double-indexing-2 (row-column-2)
4. bit-map
5. address-map
Definition: Let $D = (X, R)$ be a storage structure
with nodes $x_1, \ldots, x_n$ and relations $r_1, r_2 \in R$
such that $r_1$ represents a row equivalence relation
and $r_2$ represents a column equivalence relation.
In addition, let $\eta x_i$ represent the address of node
$x_i$; $\lambda_i x$ the value of the $i$th pointer field of node $x$,
i.e., the row pointer value; $\lambda_j x$ the value of the $j$th pointer
field of node $x$, i.e., the column pointer value; $X/r_1$ the
row equivalence class; and $X/r_2$ the column equivalence
class. A linked mapping is a linked realization
of a relation from the header pointer node, if at
least one of the following holds:
An illustration of one of the schemas is shown
in Figure 2.
Logical schema (4 x 4 sparse matrix M):
    1 0 0 2
    0 0 3 0
    0 4 0 0
    5 0 1 2
Storage scheme:
    Row beginning in data item vector:  1 3 4 5
    Column index vector:                1 4 3 2 1 3 4
    Data item vector:                   1 2 3 4 5 1 2
It may be defined as follows:
1. The relation $r_1$ is realized as a linked
   structure (relative to the $i$th pointer
   field), i.e., for every pair of nodes
   $(x_1, x_p) \in X/r_1$, $\eta x_p \in \lambda_i x_1$ holds, or
   similarly $r_2$ is realized as a linked
   structure.
2. For every ordered triple of nodes $\langle x_1, x_2, x_3 \rangle$
   such that $(x_1, x_2) \in X/r_1$ and $(x_1, x_3) \in X/r_2$,
   $\eta x_2 \in \lambda_i x_1$ and $\eta x_3 \in \lambda_j x_1$ hold.
Figure 2 - Double-indexing-2 (row-column-2)
Their access path is indirect because the
search technique uses a composite storage mapping
which may be defined by the following [11]:
In addition, it is possible that the relation $r$ is
realized as a linked structure and the end node $x_n$
points to the header node $x'$, i.e., $\lambda x_n = \eta x'$.
Definition: Let $i$ and $j$ represent the row and column data item subscripts; $M(i,j)$ the data item location; $\eta_i$ the beginning relative address of indices
for row $i$; and $\eta_j$ the relative address of element $j$ in the
column index vector, as illustrated in Figure 2. Data ordering is assumed rowwise; for columnwise ordering we can just interchange $i$ and $j$.
Let $f$ represent any storage mapping function such
that $f(i) = \eta_i$. A search function, $\sigma_f$, is defined
as follows:
4. Data Language Facilities
A composite mapping function, $h$, on a search function, $\sigma_f$, is defined as follows: $h(\sigma_f(j, f(i))) = M(i,j)$.
The data language facilities provide a generalized approach for describing any numerical database and its mapping to storage. They consist
of a stored-data description language (SDDL) and a
stored-data mapping language (SDML). The two
languages are similar to other data definition and
mapping languages [7,17,18]. We have attempted as
much as possible to make them user friendly, by
including simple, self-explanatory language constructs.
The choice of only one of the alternatives, listed within {} (braces) and separated by |,
is represented by {} (braces) and an optional
phrase by [] (square brackets). Language keywords
appear in capital letters and user-defined words
in lower case. Sample SDDL and SDML statements
of both source and target numerical databases are
shown in Figures 4 and 4.1 respectively. Other
features of the two languages will be revealed as
they are described below.
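$(\sigma_f)(j, \eta_i) = \eta_j$, iff $f(\eta_j) = j$ and $\forall \eta_{j'}$ s.t. $\eta_i \le \eta_{j'} < \eta_j$, $f(\eta_{j'}) \ne j$;
$(\sigma_f)(j, \eta_i) = \emptyset$, iff $\forall \eta_j \in N$, $f(\eta_j) \ne j$.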
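A minimal Python sketch of this composite mapping follows. It assumes a small 4 x 4 example matrix stored rowwise in three vectors (row beginnings, column indices, data items), in the spirit of Figure 2; the names `search` and `lookup` are hypothetical. One assumption goes beyond the definition as printed: the search is bounded by the beginning of the next row so that it cannot run into another row's entries.

```python
# Rowwise double-indexing storage of the matrix
#   1 0 0 2
#   0 0 3 0
#   0 4 0 0
#   5 0 1 2
# All addresses and subscripts are 1-based, as in the definition.
ROW_BEGIN = [1, 3, 4, 5]             # address where each row starts
COL_INDEX = [1, 4, 3, 2, 1, 3, 4]    # column subscript of each stored item
DATA_ITEM = [1, 2, 3, 4, 5, 1, 2]    # the non-zero elements


def search(j, start, end):
    """Return the first address in [start, end) whose column index is j, else None."""
    for address in range(start, end):
        if COL_INDEX[address - 1] == j:
            return address
    return None


def lookup(i, j):
    """Composite mapping: row beginning, then search, then data item (0 if absent)."""
    start = ROW_BEGIN[i - 1]
    # Assumption: a row ends where the next one begins (or at the end of the vector).
    end = ROW_BEGIN[i] if i < len(ROW_BEGIN) else len(COL_INDEX) + 1
    address = search(j, start, end)
    return 0 if address is None else DATA_ITEM[address - 1]
```

The access path is indirect in exactly this sense: no element is reached by a computed address alone, only by first going through the two index vectors.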
3.3. Linked Encoding Group
The linked encoding group consists of numerical physical schemas for all the sparse indexing schemes with linked list data structures.
Their logical schema is the m x n sparse matrix
and their storage scheme consists of lists of nodes.
Each node has a format which might consist of a data
item, row and column indices, and pointer fields.
The schemas usually require a single file with an indexed sequential or direct file access method.
These schemas are further classified as:
1. linear-linked-list
2. doubly-linked-list
3. threaded-linked-list
Figure 3 shows an illustration of one of them.
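To make the node format concrete, here is a small in-memory Python sketch of a doubly-linked (row and column) structure. It is an assumption-laden illustration, not the on-disk format: the class and function names are hypothetical, object references stand in for the node keys and pointer fields that would be stored in a file, and header nodes are reduced to two plain lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    row: int
    col: int
    value: float
    next_in_row: Optional["Node"] = None   # next non-zero in the same row
    next_in_col: Optional["Node"] = None   # next non-zero in the same column


def build_linked_matrix(triples, n_rows, n_cols):
    """Link (row, col, value) triples into per-row and per-column chains.

    Returns (row_heads, col_heads); index 0 of each list is unused so that
    subscripts stay 1-based.
    """
    row_heads = [None] * (n_rows + 1)
    col_heads = [None] * (n_cols + 1)
    row_tails = [None] * (n_rows + 1)
    col_tails = [None] * (n_cols + 1)
    for r, c, v in sorted(triples):        # row-major order keeps both chains sorted
        node = Node(r, c, v)
        if row_tails[r] is None:
            row_heads[r] = node
        else:
            row_tails[r].next_in_row = node
        row_tails[r] = node
        if col_tails[c] is None:
            col_heads[c] = node
        else:
            col_tails[c].next_in_col = node
        col_tails[c] = node
    return row_heads, col_heads


def walk_row(row_heads, r):
    """Collect (col, value) pairs by following the row chain."""
    items, node = [], row_heads[r]
    while node is not None:
        items.append((node.col, node.value))
        node = node.next_in_row
    return items


def walk_col(col_heads, c):
    """Collect (row, value) pairs by following the column chain."""
    items, node = [], col_heads[c]
    while node is not None:
        items.append((node.row, node.value))
        node = node.next_in_col
    return items


TRIPLES = [(1, 1, 1), (1, 4, 2), (2, 3, 3), (3, 2, 4), (4, 1, 5), (4, 3, 1), (4, 4, 2)]
ROWS, COLS = build_linked_matrix(TRIPLES, 4, 4)
```

Each data item sits on two chains at once, so a row or a column can be retrieved by pointer traversal alone.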
Their access path is called linked because the
search technique uses a mapping defined through
4.1. Stored-Data Description Language (SDDL)
The SDDL is intended mainly for the user to
describe the logical characteristics of his numerical database and the associated type of file organization on secondary storage devices, or alternatively the card input format. Therefore, the
language is divided into three parts:
(1) matrix structure, (2) file control, and (3)
input format.
Figure 3. Doubly-linked-list
Logical schema (4 x 4 sparse matrix):
    1 0 0 2
    0 0 3 0
    0 4 0 0
    5 0 1 2
Node format: Node Key | Row Index | Column Index | Data Item | Column Node Pointer | Row Node Pointer
The matrix structure describes the logical
characteristics of the data and it also indicates
if dynamic storage management is required. The
basic matrix format is specified using the self-explanatory keywords {DENSE | SPARSE}, {SYMMETRIC | NONSYMMETRIC}, and
{BANDED | NONBANDED}. If the matrix is symmetric, the
statement will include {UPPER-DIAGONAL | LOWER-DIAGONAL}
in order to specify the partition of the dataset
to be processed.
Similarly, a bandwidth statement,
which specifies the size of the band, is required
for a band matrix, and a density statement giving
the estimated density of a sparse matrix is necessary for creating a database with random file organization.
Some statements in the matrix structure section are shown in the example below.
    MATRIX-STRUCTURE:
        TYPE = SPARSE, BANDED, SYMMETRIC, LOWER-DIAGONAL, STATIC;
        BANDWIDTH = (250, 250);
The file control specifies the file organization of a numerical database already residing
on a secondary device or to be created, by listing
the type of file, device medium, file unit, etc.
The file control statements depend on the device
medium selected for processing, as specified by the
device medium keyword CARD, TAPE, or DISK.
If data is to be processed from a card input stream, only
the file-type, file-unit and device-medium statements are required; in addition to these three
statements, both disk and tape files require record statements.

    DATA-DESCRIPTION:
      MATRIX-STRUCTURE:
        TYPE = SPARSE, NONSYMMETRIC, STATIC;
      FILE-CONTROL:
        TYPE = SOURCE;
        FILE-UNIT = 21, 22, 23;
        MEDIUM = DISK;
        RECORD:
          REC-KEY = integer;
          SIZE = 512, FIXED, UNBLOCKED;
    DATA-MAPPING: (double-indexing-2);
      ACCESS-PATH-ENCODING:
        ACCESS-PATH = INDIRECT-ENCODING (REF-DATA-ORG);
        INDIRECT-ENCODING:
          REF-DATA-ORG: (REF-ORG-1, REF-ORG-2, DATA-ORG);
          REF-ORG-1: SET(LOC);
            LOC: integer, TYPE = ROW BEGINING;
          REF-ORG-2: SET(INDEX);
            INDEX: integer, TYPE = COLUMN INDEX;
          DATA-ORG:
            DIMENSION = (5000,5000);
            ORDERING = ROWWISE;
            SET(DATA-ITEM);
            DATA-ITEM: real, REAL-PRECISION = DOUBLE;
      ENCODED-FILE:
        FILE-NAME = datfile, indfile, locfile;
        ORGANIZATION = RANDOM, RANDOM, SEQUENTIAL;
        ENCODED-DATA = DATA-ORG, REF-ORG-2, REF-ORG-1;

Figure 4. Sample SDDL & SDML statements of a
source numerical database for a
double-indexing-2 schema.

    DATA-DESCRIPTION:
      MATRIX-STRUCTURE:
        TYPE = SPARSE, NONSYMMETRIC, STATIC;
      FILE-CONTROL:
        TYPE = TARGET;
        FILE-UNIT = 4;
        MEDIUM = DISK;
        RECORD:
          REC-KEY = integer;
          SIZE = 1024, FIXED, UNBLOCKED;
    DATA-MAPPING: (doubly-linked-list);
      ACCESS-PATH-ENCODING:
        ACCESS-PATH = LINKED-ENCODING (LINKED-DATA-ORG);
        LINKED-DATA-ORG: (COL-HEAD-NODE, ROW-HEAD-NODE, DATA-ITEM-NODE);
        COL-HEAD-NODE: (PTR-ITEM, FIELD-LINKAGE);
          PTR-ITEM: integer, TYPE = COL PTR;
          FIELD-LINKAGE = FIRST COL NODE;
        ROW-HEAD-NODE: (PTR-ITEM, FIELD-LINKAGE);
          PTR-ITEM: integer, TYPE = ROW PTR;
          FIELD-LINKAGE = FIRST ROW NODE;
        DATA-ITEM-NODE: (KEY-FIELD, ROW-FIELD, COL-FIELD,
                         DATA-FIELD, COL-PTR-FIELD, ROW-PTR-FIELD);
          KEY-FIELD: NODE-KEY = integer;
          ROW-FIELD: REF-ITEM = INDEX;
            INDEX: integer, TYPE = ROW INDEX;
          COL-FIELD: INDEX: integer, TYPE = COL INDEX;
          DATA-FIELD: ORDERING = NONE;
            DATA-ITEM = real, REAL-PRECISION;
            REAL-PRECISION = DOUBLE;
          COL-PTR-FIELD: PTR-ITEM, FIELD-LINKAGE;
            PTR-ITEM: integer, TYPE = COL PTR;
            FIELD-LINKAGE = NEXT COL NODE;
          ROW-PTR-FIELD: PTR-ITEM, FIELD-LINKAGE;
            PTR-ITEM: integer, TYPE = ROW PTR;
            FIELD-LINKAGE = NEXT ROW NODE;
      ENCODED-FILE:
        FILE-NAME = NODFILE;
        ORGANIZATION = RANDOM;
        ENCODED-DATA = SET(LINKED-DATA-ORG);

Figure 4.1. Sample SDDL & SDML statements of a
target numerical database for a
doubly-linked-list schema.

The file-type statement identifies the source/target
file and the file-unit statement gives a
set of FORTRAN READ/WRITE unit numbers for processing the files in the database.
The record statement lists the record properties like record-size,
{FIXED | VARIABLE} and {BLOCKED | UNBLOCKED}. In addition, the file
control section may include any of the following
optional statements: (1) a record-key statement
to specify either an integer or an alphanumeric key
for random file organization; (2) a block-size
statement, required for blocked records; and (3)
a format statement (similar to FORTRAN) for formatted records.
Some of these statements are
illustrated under FILE-CONTROL in Figure 4.
selection of an appropriate mapping subsection and
relates its subsections to the mapping descriptions of the direct, indirect and linked schema
encoding groups. Reference to mapping descriptions defined in one encoding group by another is
a common feature of the language, e.g., the REF-ITEM
definition of pseudo data in the indirect encoding subsection is referenced by the linked encoding subsection.
The input-format section provides facilities
for processing an unstructured database from cards.
The section comprises the dimension, data ordering, and data-format statements.
The dimension statement, shown below, specifies
the numbers of rows and columns in the matrix.
    DIMENSION = {ROW | COLUMN}, integer, {COLUMN | ROW}, integer;
The data ordering
statement specifies a rowwise/columnwise/none ordering. The data-format statement:
The direct encoding, implied by the DATA-ORG:
subsection, describes the data item with its properties like data ordering and type. It also provides for an optional definition of dimension and
bandwidth for a source database description. The
indirect encoding provides a choice of mapping alternatives for encoding pseudo data and data items
to separate encoded files by the mapping descriptions identified by MAP-ORG: and REF-ORG: (see
Figure 4). In addition, an ordered combination
of pseudo data and data items may be mapped to an
encoded file by the MIXED-ORG: mapping description as
follows:
    DATA-FORMAT = {SPARSE-TYPE-1 | SPARSE-TYPE-2 | DENSE};
gives users three choices of format specification.
Both SPARSE-TYPE-1 and SPARSE-TYPE-2 are sparse
matrix input format specifications for only the nonzero
elements, and DENSE is for all the matrix elements.
SPARSE-TYPE-1 is for ordered input data, so
that one row or column input data stream is processed at a time. As shown below,
    MIXED-ORG: SET ORDERED {(REF-ITEM, DATA-ORG) | (REF-ITEM, REF-ITEM, DATA-ORG)};
    SPARSE-TYPE-1: CONTROL-DATA = {ROW | COLUMN}, data-type;
                   FORMAT = SET(data-type, data-type);
it requires control data to specify the row or
column to be processed, so that the format becomes
a set of pairs of column/row and data item data-types. A data-type is any valid FORTRAN format
specification for a spacing, alphanumeric, integer
or real variable, e.g., 5X, I6, F10.4 and E20.12.
The linked encoding enables the mapping of
any set of nodes to an encoded file. Each node
is identified by a user defined node-name and
consists of a set of fields. Each field is described by an optional field-name and a field identifier which may be a node key, pseudo data, or
data item. An example of linked encoding mapping
is illustrated in Figure 4.1.
SPARSE-TYPE-2 is for unordered input data,
so that the format is a set of row, column, and
data item data-types as follows:
SPARSE-TYPE-2 = SET([ROW], data-type,
[COLUMN], data-type,
data-type);
Finally, DENSE = SET(data-type); provides for
a set of regular FORTRAN-type format specifications.
An example of a SPARSE-TYPE-1 input format is shown
below.
The mapping description consists of definitions of both primitive and nonprimitive data
structures. The representation of structures of
primitive type is usually by an assignment statement, while that of nonprimitive is by a descriptive statement consisting of a set or group name,
and a set or group definition [16]. We provide
the following constructs in the language to specify data, ordering and linkage definitions:
INPUT-FORMAT:
    DIMENSION = ROW, 5000, COLUMN, 5000;
    ORDERING = ROWWISE;
    SPARSE-TYPE-1: CONTROL-DATA = ROW, I4;
                   FORMAT = 5(I4,2X,F10.6);
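The following Python sketch shows how one card image in such a format might be decoded. It rests on an assumption the text does not settle: that each record carries the control data (the row number, in I4) at its start, followed by up to five (I4, 2X, F10.6) pairs of column subscript and data item. The function name `parse_sparse_type_1` is hypothetical.

```python
def parse_sparse_type_1(record, pairs_per_record=5):
    """Decode one fixed-width record into (row, column, value) triples.

    Assumed layout: columns 1-4 hold the row number (I4); each following
    16-character group holds a column subscript (I4), two blanks (2X),
    and a data item (F10.6). Blank trailing groups are skipped.
    """
    row = int(record[0:4])
    triples = []
    position = 4
    for _ in range(pairs_per_record):
        group = record[position:position + 16]
        if not group.strip():
            break
        column = int(group[0:4])
        value = float(group[6:16])
        triples.append((row, column, value))
        position += 16
    return triples


# Build a sample record for row 3 holding two non-zero elements.
SAMPLE = f"{3:4d}" + f"{7:4d}  {1.5:10.6f}" + f"{12:4d}  {-2.25:10.6f}"
```

Calling `parse_sparse_type_1(SAMPLE)` yields the two triples for row 3; a generated reader subroutine would do the equivalent with a FORTRAN FORMAT statement.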
4.2. Stored-Data Mapping Language (SDML)
1. ordering definition types--rowwise, columnwise and none;
2. basic data types--integer, real, and alphanumeric;
3. linkage definition types--header, first,
   next, prior, last, row, column, node,
   field, and null.
A valid and meaningful linkage definition,
except the NULL keyword, requires an ordered combination of the following: (I) a pointer linkage
keyword, (2) row or column, and (3) node or field.
The pointer linkage keywords are header, first,
next, prior, and last. An example of a valid
definition is FIRST ROW NODE.
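The rule above is simple enough to check mechanically. The Python sketch below is one possible reading of it; the function name `is_valid_linkage` is hypothetical, and accepting COL as an abbreviation of COLUMN is an assumption based on the spelling used in Figure 4.1.

```python
POINTER_KEYWORDS = {"HEADER", "FIRST", "NEXT", "PRIOR", "LAST"}
AXIS_KEYWORDS = {"ROW", "COLUMN", "COL"}   # COL: abbreviation assumed from Figure 4.1
TARGET_KEYWORDS = {"NODE", "FIELD"}


def is_valid_linkage(phrase):
    """Check that a linkage definition is NULL or an ordered
    (pointer keyword, row/column, node/field) combination."""
    words = phrase.upper().split()
    if words == ["NULL"]:
        return True
    return (
        len(words) == 3
        and words[0] in POINTER_KEYWORDS
        and words[1] in AXIS_KEYWORDS
        and words[2] in TARGET_KEYWORDS
    )
```

For example, FIRST ROW NODE and NEXT COLUMN FIELD pass, while ROW FIRST NODE fails because the three parts are out of order.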
The SDML has two functions: (1) to describe the different types of mapping which the
system can make between a logical schema and a
target storage space, and (2) to describe the encoding to storage structures. The language
comprises two major sections: the access path encoding and the encoded file. The major emphasis
of the language is on the access path encoding,
which represents the most difficult part of the
mapping description. The encoded file section enables the assignment of encoded data (data items
and pseudo data) to the files in the database according to the corresponding definitions of file names and file accessing methods.
An example of a primitive type data structure is:
    DATA-ITEM = {integer | real | alpha};
The access path encoding section enables the
Our first priority then is to develop a generalized data translator for numerical databases
that will isolate the users from the underlying
data management through stored-data description
and mapping language facilities.
An example of a nonprimitive type data structure illustrating a SET definition is:
    DATA-ORG: SET(DATA-ITEM), ORDERING = {ROWWISE | COLUMNWISE | NONE};
5.1.
A primitive type data structure which is semantically ambiguous, e.g., index and pointer, becomes a nonprimitive structure by qualifying the
basic data definition with a semantic phrase definition as follows:
    INDEX: {integer | alpha}, TYPE = {ROW INDEX | COLUMN INDEX | CONCAT(ROW INDEX, COLUMN INDEX)};
We are currently developing a generalized data translator for numerical databases as a first
step towards developing a generalized numerical
database management system. The generalized data
translator is focused on the implementation of our
nonprocedural Stored-Data Description and Mapping
Languages (SDDL and SDML).
Its function is to automatically create or restructure a numerical database from one schema to another in two consecutive processes of compilation and data translation (to be discussed later).
Its input, supplied
by the user, consists of the source and target
SDDL and SDML statements (see Figure 4) and a
source numerical database.
Its output is the target numerical database.
The overall functions are
illustrated in Figure 5.
An access path is described by ORDERING and
LINKAGE phrases.
ORDERING describes the matrix
data access path by row, column or none.
It is assumed that the ORDERING of reference items, i.e.,
indices and locations (within the matrix or from
diagonal elements) corresponds to that of matrix
data items. LINKAGE describes linked list structure connectivity by a combination of linkage keywords as in the following example:
PTR-ORG:
5.
A Generalized Data Translator for Numerical
Databases
During the compilation process, the user-supplied SDDL and SDML statements are converted by a
lexical analyzer into a token stream, which is
translated by a Generalized Syntax-Directed Translation Scheme (GSDTS) into FORTRAN source programs
for the reader, the restructurer, and the writer
subroutines.
After compilation by a FORTRAN compiler, the subroutines become the major components
of the translator subsystem.
The translator subsystem also includes common data table information,
shown in Figure 6, and utility functions and routines to compute mapping functions, e.g., symmetric and band address locations, and to execute
search and reordering algorithms.
SET(PTR-ITEM), LINKAGE=NEXT COLUMN FIELD;
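The mapping functions mentioned for symmetric and band address locations can be illustrated with a short sketch. The paper does not give its formulas in this section, so the two functions below use common textbook layouts (lower triangle stored row by row, and a fixed-width band stored row by row) and are assumptions for illustration, not the authors' FORTRAN utilities. The names `symmetric_packed_address` and `band_address` are hypothetical.

```python
def symmetric_packed_address(i, j):
    """1-based position of a(i, j) when only the lower triangle is stored, row by row.

    Assumed layout: a(1,1), a(2,1), a(2,2), a(3,1), ...
    """
    if i < j:
        i, j = j, i  # symmetry: a(i, j) and a(j, i) share one stored slot
    return i * (i - 1) // 2 + j


def band_address(i, j, half_bandwidth):
    """1-based position of a(i, j) in a row-by-row band store, or None outside the band.

    Assumed layout: every row owns 2 * half_bandwidth + 1 slots, and the
    column offset j - i is shifted so that it is never negative.
    """
    if abs(i - j) > half_bandwidth:
        return None  # element is structurally zero and is not stored
    width = 2 * half_bandwidth + 1
    return (i - 1) * width + (j - i + half_bandwidth) + 1


if __name__ == "__main__":
    print(symmetric_packed_address(3, 1), symmetric_packed_address(1, 3))
    print(band_address(2, 1, 1), band_address(3, 1, 1))
```

With mappings of this kind, a direct encoding needs no stored indices: the position of an element is computed from its row and column alone.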
5.1. The Feasibility of SDDL and SDML in a Numerical Database System

The current approach to numerical database management is restricted to a few matrix compact storage schemes. The most common compact storage scheme for processing sparse matrices residing on secondary devices is the double-indexing (row-column) technique, but this is not the best technique for many applications. A few research groups, e.g., [9], have tried the linked list technique in programs tailored to their applications; however, these programs are not always available for public distribution.

Our investigation of the implementation of a generalized approach to numerical database management reveals two basic requirements. The first requirement is for the numerical database to reside on secondary storage using the storage scheme that is best fitted for its application. The second requirement is to provide tools for data access that will promote physical data independence through the implementation of a DML.

The first requirement is a prerequisite to the second, and there are two options for its realization. The first option is for each user to be responsible for structuring his numerical database according to the physical schema best suited to his application. This option is not practical because a user may not know how to structure his database to suit his objective. The second option is to have a generalized data translator that will automatically restructure any numerical database from one physical schema to another, or convert unstructured raw data, not in a compact storage form, to a form corresponding to a physical schema. It is essential for this option to be integrated into any effective generalized approach to numerical database management.

5.2. Data Translation Process

The data translation process of the translator subsystem starts with the encoding of the record(s) of the source database into a translator internal form (TIF), followed by the decoding of the TIF data into encoded record(s), and ends with the writing of the record(s) on the storage devices. The components of the TIF are (1) the row/column identifier, (2) the index buffer for the column/row indices, and (3) the data item buffer for the row/column data items. The translation process is controlled by the translation supervisor, which activates the reader to encode the source database record(s) into TIF data, followed by the restructurer to decode the TIF data into encoded record(s), and then the writer to convert the encoded record(s) into physical record(s) and write them on the storage device. Each subroutine returns control to the supervisor, which activates the next subroutine accordingly, and the process is repeated until all the records of the source database have been processed. Figure 6.1 illustrates the data translation process from a double-index-2 source database to a doubly-linked-list target database.

5.2.1. Reader Module

The reader encodes both unstructured matrix data, i.e., raw data not in any compact storage form, and the numerical database. In both cases, the information in the source file control table and either the input format or the physical schema table (see Figure 6) is used by the reader to read source data from cards or secondary devices and encode it into translator internal form (TIF) data. The source data is processed by row/column according to the input format or physical schema specification. In order to produce the TIF data, each encode step of the translation iteration (1) fills in the appropriate row/column identifier, and (2) fills in the corresponding index and data buffers for that row/column (see Step 1a of Figure 6.1). For example, with the row identifier equal to 1, we have 1 and 4 in the column index buffer, as well as 1 and 2 in the data item buffer. On completion, control is returned to the supervisor for the next step of the translation iteration, i.e., the decode step by the restructurer.
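The encode step of the reader can be sketched as follows. This is a minimal illustration in Python rather than the generated FORTRAN, assuming a row-ordered source stored as three arrays (row beginning, column index, data item) with 1-based positions. The sample arrays are our reading of the source files in Figure 6.1, and the function name `encode_rows` and the zero padding to the buffer size are assumptions.

```python
def encode_rows(row_begin, col_index, data, buffer_size):
    """Yield one TIF triple (row identifier, index buffer, data buffer) per row.

    row_begin[r - 1] is the 1-based position in col_index/data where row r starts.
    Unused buffer slots are padded with zeros (an assumption of this sketch).
    """
    n_items = len(data)
    for r, start in enumerate(row_begin, start=1):
        # A row ends where the next row begins, or at the end of the data.
        end = row_begin[r] if r < len(row_begin) else n_items + 1
        indices = col_index[start - 1:end - 1]
        values = data[start - 1:end - 1]
        pad = buffer_size - len(indices)
        if pad < 0:
            raise ValueError(f"row {r} does not fit in a buffer of size {buffer_size}")
        yield r, indices + [0] * pad, values + [0] * pad


# Sample source files (assumed reading of Figure 6.1, trailing padding removed).
ROW_BEGIN = [1, 3, 4, 5]
COL_INDEX = [1, 4, 3, 2, 1, 3, 4]
DATA_ITEM = [1, 2, 3, 4, 5, 1, 2]

tif_rows = list(encode_rows(ROW_BEGIN, COL_INDEX, DATA_ITEM, buffer_size=4))

if __name__ == "__main__":
    for row_id, index_buffer, data_buffer in tif_rows:
        print(row_id, index_buffer, data_buffer)
```

The first triple reproduces the example in the text: row identifier 1, with 1 and 4 in the column index buffer and 1 and 2 in the data item buffer.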
5.2.2. Restructurer Module

If the source ordering is different from the target ordering, the TIF data of the entire database is temporarily stored in workfile(s) to be reordered before it is decoded; otherwise, the TIF data is decoded, as received, into encoded data corresponding to the target schema. For a direct encoding group, each decode step of the translation iteration discards the index buffer and reorganizes the data items into the appropriate encoded data. For the indirect encoding group, both the data items and the index, which is converted to the appropriate pseudo data, become the encoded data. The linked encoding group, however, requires the supervisor to create null head nodes during initialization. Data item nodes with the appropriate pointers are created to form the encoded data at each decode step. For example, in Step 1b of Figure 6.1, two data item nodes for the first row are created to correspond to the TIF data in Step 1a. In addition, "1" in the row and column head nodes represents the pointer to the first data item node, and "2" in the column head node and in the first data item node represents the column pointer to the second data item node. At the end of this step, control is returned to the supervisor for the last phase of the translation iteration, i.e., writing the encoded data on the secondary devices by the writer.

[Figure 5. Usage and functions of the generalized data translator. Compilation: the source and target SDDL and SDML statements pass through the lexical analyzer, and the resulting tokens pass through the GSDTS for SDDL and SDML to produce FORTRAN conversion programs for the FORTRAN compiler. Translation: the translator subsystem converts the source numerical database, via internal form data, into the target numerical database.]

[Figure 6. Major components of the translator subsystem. The translator subsystem, including the restructurer and the writer, sits between the source and target numerical databases. Source side: file control table and either the input format (unstructured raw source matrix data) or the physical schema table (source database in compact storage form). Target side: file control table and physical schema table. The legend distinguishes data descriptions, data flow, and processing sequence.]
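The decode step for the linked encoding group can be sketched as below. This is an illustrative Python model, not the generated restructurer: the node fields, the use of key 0 as a null pointer, and the class name `LinkedEncoding` are assumptions, and the sketch assumes rows arrive in increasing order so that each column chain can be extended at its tail.

```python
class LinkedEncoding:
    """Head-node tables plus data item nodes; key 0 stands for a null pointer."""

    def __init__(self, n_rows, n_cols):
        # Null head nodes are created once, during initialization (slot 0 unused).
        self.row_head = [0] * (n_rows + 1)
        self.col_head = [0] * (n_cols + 1)
        self._col_tail = [0] * (n_cols + 1)  # last node seen so far in each column
        self.nodes = []                      # node with key k is stored at nodes[k - 1]

    def decode_row(self, row_id, index_buffer, data_buffer):
        """Decode one row of TIF data into linked data item nodes."""
        previous = 0
        for col, value in zip(index_buffer, data_buffer):
            if col == 0:  # a zero index marks the unused tail of the buffer
                break
            self.nodes.append({"row": row_id, "col": col, "value": value,
                               "next_in_row": 0, "next_in_col": 0})
            key = len(self.nodes)
            if previous == 0:
                self.row_head[row_id] = key          # first node of this row
            else:
                self.nodes[previous - 1]["next_in_row"] = key
            if self.col_head[col] == 0:
                self.col_head[col] = key             # first node of this column
            else:
                self.nodes[self._col_tail[col] - 1]["next_in_col"] = key
            self._col_tail[col] = key
            previous = key


target = LinkedEncoding(n_rows=4, n_cols=4)
target.decode_row(1, [1, 4, 0, 0], [1, 2, 0, 0])

if __name__ == "__main__":
    print(target.row_head[1:], target.col_head[1:], target.nodes)
```

After the first row is decoded, the row head for row 1 and the column head for column 1 both point to node 1, while the column head for column 4 and the row link of node 1 point to node 2, which agrees with the description of Step 1b.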
5.2.3. Writer Module

The writer uses the information in the target file control table to open the file(s) of the target database during initialization and to close them after the entire database has been processed. It performs the last phase of each translation iteration by converting the encoded data into physical record(s) to be written on the secondary devices according to the user-defined target file access method. For example, with regard to the encoded data of Step 1b in Figure 6.1, the head node records are updated records, which are rewritten in place, and each data item node record is written as a new record on the secondary device. On completion, control is returned to the supervisor for another translation iteration, beginning with the reader.
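The control flow of one translation run (reader, then restructurer, then writer, repeated under the supervisor) can be sketched as a simple loop. This is a hedged Python illustration, not the FORTRAN supervisor: the function names are hypothetical, the target here is a dense row record chosen as a stand-in for a direct encoding, and a Python list stands in for the target file.

```python
def reader(tif_rows):
    """Stand-in reader: hands one TIF triple to the supervisor per iteration."""
    for tif in tif_rows:
        yield tif


def to_dense_row(tif, n_cols):
    """Stand-in restructurer for a direct encoding: indices fix positions, then are dropped."""
    row_id, index_buffer, data_buffer = tif
    record = [0] * n_cols
    for col, value in zip(index_buffer, data_buffer):
        if col != 0:  # a zero index marks an unused buffer slot
            record[col - 1] = value
    return row_id, record


def supervise(source, restructure, write):
    """Run encode, decode and write once per source row; return the iteration count."""
    iterations = 0
    for tif in source:                # reader: encode step
        encoded = restructure(tif)    # restructurer: decode step
        write(encoded)                # writer: physical record output
        iterations += 1
    return iterations


target_file = []
sample_tif = [(1, [1, 4, 0, 0], [1, 2, 0, 0]), (2, [3, 0, 0, 0], [3, 0, 0, 0])]
count = supervise(reader(sample_tif), lambda tif: to_dense_row(tif, 4), target_file.append)

if __name__ == "__main__":
    print(count, target_file)
```

Passing the three stages in as callables mirrors the idea that the supervisor stays fixed while the reader, restructurer and writer are generated per schema.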
[Figure 6.1. An illustration of a data translation process. The source database of Figure 2 (logical schema rows 1 0 0 2, 0 0 3 0, 0 4 0 0, 5 0 1 2; stored as row beginning, column index, and data item files; source record size = 4; sequential organization for all files) is translated to the target database of Figure 3 (target record size = 14; 4 rows; 4 columns; random file organization; buffer size = 4; integer record key). Initialization creates null row-head and column-head node records. In the first translation iteration, Step 1a encodes the source data into TIF data (row identifier = 1; index buffer = 1 4 0 0; data buffer = 1 2 0 0), and Step 1b decodes the TIF data into encoded data: the updated row-head and column-head node records and the data item node records.]

5.3. Compilation Process

The compilation process is the sequence of operations necessary to automatically produce the reader, the restructurer, and the writer subroutine programs from the SDDL and SDML statements supplied by the user. Our investigation of automatic data conversion techniques [2,13,17,18] reveals that compiler-compiler techniques are generally used. In order to be able to perform a broad, useful and syntactically valid class of translations, we decided that a generalized syntax-directed translation scheme (GSDTS) is the best model for our application. Because FORTRAN is the programming language of the majority of numerical application users, we decided to write the translation software in portable FORTRAN so that it can be distributed generally, with little or no modification of the source programs from one computer system to another.

A GSDTS requires an underlying LR(k) context-free grammar. Therefore, we had to construct LR(k) grammars for our SDDL and SDML, and in order to minimize the compilation time, we have constructed SLR(1) grammars for the SDDL and SDML such that the terminal symbols are single digits or letters, except for the user-defined variables and constants. The grammars and the LR(1) automatic parser generator, which is used to validate them as part of the system initialization process, are discussed below.

The output from the conversion of the SDDL and SDML statements by the lexical analyzer [1] is a token stream of single digits or letters for keywords, together with the user-defined variables and constants. For example, "TYPE = SOURCE;" is converted to "1", "TYPE = TARGET;" becomes "2", and "FILE-NAME = SAMPLE;" becomes "SAMPLE". The token stream is the input to the GSDTS, which produces the source FORTRAN subroutine programs to be compiled by the FORTRAN compiler into object decks as the final output of the compilation process.
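The conversion of statements to tokens by the lexical analyzer in this section can be illustrated with a small sketch. Only the three conversions quoted in the text ("TYPE = SOURCE;" to "1", "TYPE = TARGET;" to "2", and "FILE-NAME = SAMPLE;" to "SAMPLE") come from the paper; the table name `KEYWORD_TOKENS`, the function `tokenize_statement`, and the rule that any unlisted value passes through unchanged are assumptions made for illustration.

```python
# Keyword phrases with a known single-character token (taken from the text).
KEYWORD_TOKENS = {
    ("TYPE", "SOURCE"): "1",
    ("TYPE", "TARGET"): "2",
}


def tokenize_statement(statement):
    """Convert one 'NAME = VALUE;' statement into a single token.

    Known keyword phrases map to their code; any other value is treated as a
    user-defined variable or constant and is passed through as written.
    """
    body = statement.strip().rstrip(";")
    name, _, value = body.partition("=")
    name, value = name.strip(), value.strip()
    return KEYWORD_TOKENS.get((name.upper(), value.upper()), value)


if __name__ == "__main__":
    for text in ("TYPE = SOURCE;", "TYPE = TARGET;", "FILE-NAME = SAMPLE;"):
        print(text, "->", tokenize_statement(text))
```

Keeping keyword tokens to single characters is what allows the grammars to use single digits or letters as terminal symbols.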
An illustration of the compilation process is shown in Figure 6.2. The SDDL statements of Figure 4 are input to the lexical analyzer. The statements are processed by the lexical analyzer to produce an output token stream, which becomes the input to the GSDTS. The token stream is processed by the GSDTS in a concurrent operation of LR(1) parsing and semantic analysis. If no error is encountered during parsing, then on successful reduction to the final state the Semantic Analyzer outputs the generated FORTRAN statements.

Figure 6.2. An illustration of the Compilation Process.

  Conversion of the SDDL statements of Figure 4 to tokens

  Input statements:
    DATA-DESCRIPTION:
    MATRIX-STRUCTURE:
    TYPE = SPARSE, NONSYMMETRIC, STATIC;
    FILE-CONTROL:
    TYPE = SOURCE;
    FILE-UNIT = 21, 22, 23;
    MEDIUM = DISK;
    RECORD:
    REC-KEY = integer;
    SIZE = 512, FIXED, UNBLOCKED;

  Token stream (output from lexical analyzer, input to GSDTS):
    S  21N  22N  23N  3  1  512  1  2

  GSDTS output: FORTRAN declarative statements for the Translator Subsystem
      INTEGER ROWID, COLID, BUFSZE, SDATOG, UPRCOD
      INTEGER RCOSTA, RECSZE, FLEUNT
      INTEGER DIAGID, DENSTY, FLENAM, BLKSZE
      DIMENSION INDROW(500), INDCOL(500), DATA(500),
     1          INDEX(500), FLEUNT(3)
      DIMENSION DATBUF(500), INDBUF(500)
      DIMENSION FLEUNT(3), FLEID(3), FLENAM(42)
      COMMON/GLOBAL/NOROW, NOCOL, ROWID, COLID, LWRCOD,
     1  BUFSZE, IERROR, SDATOG, UPRCOD, DATBUF, INDBUF
      COMMON/ENCCOM/RCOSTA, INDPTR, KONTRL, RECSZE,
     1  DATA, INDROW, INDCOL, FLEUNT
      DATA BUFSZE/500/
      DATA FLEUNT(1), FLEUNT(2), FLEUNT(3) / 21,22,23/
      DATA RECSZE,BLKSZE,RECKEY /512,0,1/

5.3.1. SLR(1) Grammars for SDDL and SDML

We have constructed one SLR(1) grammar for the SDDL such that terminal symbols for keywords are generally numerical codes, with single letters wherever it is necessary to provide one unique lookahead symbol for consistency resolution. In order to maintain a modular programming approach and provide for execution-time storage overlay should the need arise, we constructed two SLR(1) grammars for the SDML: one for the Direct and Indirect Encoding Sections, and another for the Linked Encoding Section, with the Encoded File Section included in each grammar. The two SLR(1) grammars are similar to that of the SDDL.

We would like to mention that all data declarations are made in the Translator Subsystem so that the routines have access to the common variables, even if there is an overlay operation. This explains why only the Translator Subsystem declarative statements are generated in Figure 6.2: the Reader routine FORTRAN statements of a structured database are generated by processing the SDML statements. On the other hand, since an unstructured source database has no SDML statements, in this case the Reader routine FORTRAN statements are generated along with the Translator Subsystem declarative statements by processing the SDDL statements.

The nonterminals of the grammars are in self-explanatory BNF, e.g., <index-type>, <file-structure>, <sparse-type-1> and <node-name>. One advantage of the modular SLR(1) grammar approach is that new features, like additional pointer linkage definitions, can be added to the language with easy modification of the corresponding grammar. All the grammars have been shown to be SLR(1) by the LR(1) automatic parser generator.

5.3.2. LR(1) Automatic Parser Generator

The LR(1) automatic parser generator, developed by Wetherell and Shannon [19], is written entirely in portable ANSI Standard FORTRAN 66 and has been operating successfully on a number of computers. It generates a space-efficient parser for any LR(1) grammar. It reads a context-free grammar in a modified BNF format and produces tables which describe an LR(1) parsing automaton. It has been used to validate our SDDL and SDML grammars and to produce the corresponding tables describing their LR(1) parsing automata. The tables consist of DIMENSION and DATA statements to be embedded into the LR(1) parser subroutines described later. The procedure is performed once, as part of our system initialization operation, for the development of the GSDTS for the SDDL and the SDML discussed below.

5.3.3. GSDTS for the SDDL and the SDML

Generalized syntax-directed translation schemes (GSDTS) are well defined in the literature, and we chose to implement a bottom-up execution of the GSDTS [1]. The major components of the GSDTS for the SDDL and the SDML are, as illustrated in Figure 7, the following: (1) the LR(1) parser, (2) the LR(1) tables, (3) the Semantic Analyzer, and (4) the SDDL and SDML Semantic Tables. Its input is the SDDL and SDML token stream generated by the lexical analyzer, whose tokens are assigned values from the LR(1) tables by the LR(1) parser's internal scanner. The outputs produced by the GSDTS are the reader, the restructurer and the writer FORTRAN source subroutines, produced from the tokens of the source SDDL and SDML, the target SDML, and the target SDDL respectively.
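A bottom-up execution of a syntax-directed translation, in which a semantic rule fires each time the parser performs a reduction, can be sketched on a toy fragment. The example below is a hand-written shift-reduce loop in Python, not the table-driven LR(1) parser or the semantic tables of the paper. The two-production grammar, the token kinds `NUM` and `END`, and the function name are assumptions; the emitted statement is modeled on the FLEUNT DATA statement shown in Figure 6.2.

```python
def translate_file_units(tokens):
    """Parse `stmt -> units END` and `units -> units NUM | NUM` bottom-up.

    Each reduction runs a semantic action; the final reduction emits one
    FORTRAN DATA statement for the collected file unit numbers.
    """
    stack = []   # entries are (grammar symbol, attribute)
    output = []
    for kind, text in tokens:
        stack.append((kind, text))                       # shift
        if kind == "NUM":
            _, number = stack.pop()
            if stack and stack[-1][0] == "units":
                _, units = stack.pop()
                stack.append(("units", units + [number]))  # units -> units NUM
            else:
                stack.append(("units", [number]))          # units -> NUM
        elif kind == "END":
            if len(stack) != 2 or stack[0][0] != "units":
                raise SyntaxError("expected at least one unit number before END")
            stack.pop()
            _, units = stack.pop()
            names = ", ".join(f"FLEUNT({i})" for i in range(1, len(units) + 1))
            output.append(f"      DATA {names} / {','.join(units)}/")
            stack.append(("stmt", None))                   # stmt -> units END
        else:
            raise SyntaxError(f"unexpected token kind {kind!r}")
    if [symbol for symbol, _ in stack] != ["stmt"]:
        raise SyntaxError("input did not reduce to a single statement")
    return output


generated = translate_file_units(
    [("NUM", "21"), ("NUM", "22"), ("NUM", "23"), ("END", ";")]
)

if __name__ == "__main__":
    print("\n".join(generated))
```

In a table-driven design the reductions would be chosen from generated parsing tables and the actions looked up in semantic tables, so a new language feature needs only new grammar and table entries.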
The LR(1) parser is a set of subroutines which interpret the LR(1) tables to construct a parse of the SDDL and SDML token stream. Some of the subroutines were part of the software developed by Wetherell and Shannon [19], but they have been modified and tested to suit our application. We have developed three LR(1) parsers, for the SDDL, the direct and indirect encodings, and the linked encoding SLR(1) grammars respectively.

The Semantic Analyzer consists of two major routines which perform the semantic analysis and the output production. The SDDL and SDML Semantic Tables contain the semantic rules corresponding to the SLR(1) grammar production rules. However, we are currently restricting our implementation to a few physical schemas which are representative of the three encoding groups. Therefore, the current semantic tables contain semantic rules corresponding to only those physical schemas, with null rules for the others so that they can be easily extended after the completion of the current development process.

[Figure 7. GSDTS for SDDL and SDML. The SDDL and SDML token stream enters the LR(1) parser, which uses the parser tables; the Semantic Analyzer, using the semantic rules, produces the FORTRAN conversion program.]

6. Future Directions and Developments

In this paper, we have provided a model of a generalized approach for describing and mapping any numerical database to secondary storage by nonprocedural Stored-Data Description and Mapping Languages (SDDL and SDML). We have also shown how DBMS concepts like schema and data language facilities are applicable to the databases, residing on secondary devices, that are needed to process numerical applications. In addition, we have discussed the feasibility of our model as a valuable tool in numerical database management, as described in the current implementation of our generalized data translator for numerical databases.

One area for the extension of this research is the implementation of a data manipulation language (DML). As previously mentioned, we have already designed a DML which consists of certain primitive statements that correspond to the operations permitted on the numerical database and are embedded into FORTRAN. The file control and physical schema tables, and some of the conversion utility subroutines of our model, would be of use in the implementation of the DML at a later date.

Another area of research is the performance evaluation of the numerical physical schemas with regard to specific applications or numerical operations. MacVeigh has reported in [10] the effect of data representation on the cost of sparse matrix operations in primary storage. It is desirable to extend this work to secondary storage and to develop a performance evaluation model for matching the numerical database of an application to the best-fit physical schema on secondary storage.

Finally, we would like to identify some physical schemas of our model that have already proved to be of practical use in numerical database management. The threaded-linked-list structure has been successfully implemented in the WARDEN system in use at the University of Warwick [9] for Computer-Aided Design. Besides, secondary storage implementations that are similar to our direct encoding group are identified in EASY, an Engineering Analysis System of Utility Programs [8], while a row-column schema is used in Vectorized General Sparsity Algorithms with Backing Store [3]. Since the need for secondary storage backup is relative to the size of the primary storage, our model will be of great advantage in institutions with small or medium size computing facilities.

REFERENCES

1. Aho, A.V. & Ullman, J.D., "The Theory of Parsing, Translation and Compiling, Volume II: Compiling," Prentice-Hall, Inc., Englewood Cliffs, N.J., 1973.
2. Bach, M.J., et al., "The ADAPT System: A Generalized Approach Towards Data Conversion," Proc. 5th Int. Conf. Very Large Data Bases, ACM, N.Y., Oct. 1979, pp. 183-193.
3. Calahan, D.A., et al., "Vectorized General Sparsity Algorithms with Backing Store," Systems Eng. Lab., University of Michigan, Ann Arbor, SEL Report #96, Jan. 15, 1977.
4. CODASYL Data Base Task Group Report, Conf. Data System Languages, April 1971, ACM, New York.
5. CODASYL Data Description Language Journal of Development, June 1973 Report.
6. Duff, I.S., "A Survey of Sparse Matrix Research," Proc. of the IEEE, Vol. 65, No. 4, April 1977, pp. 500-535.
7. Fry, J.P., et al., "Stored-Data Description and Data Translation: A Model and Language," Information Systems, Vol. 2(3), 1977, pp. 95-147.
8. Jensen, Paul S., "An Engineering Analysis System," Proc. ACM 1978 Annual Conference, Washington, D.C., Vol. 1 of 2, Dec. 4-5-6, 1978, pp. 490-495.
9. Larcombe, M.H.E., "A List Processing Approach to the Solution of Large Sparse Sets of Matrix Equations and the Factorization of the Overall Matrix," Proc. Oxford Conf. on "Large Sparse Sets of Linear Equations," J.K. Reid, Editor, April 1970, Academic Press, New York, 1971, pp. 25-40.
10. MacVeigh, Donald T., "Effect of Data Representation on Cost of Sparse Matrix Operations," Acta Informatica, Vol. 7, 1977, pp. 361-394.
11. Maurer, Herman H., "Data Structures and Programming Techniques," Translated by Camille C. Price, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1977.
12. Pooch, U.W. and Nieder, A., "A Survey of Indexing Techniques for Sparse Matrices," ACM Computing Surveys, Vol. 5, No. 2, June 1973, pp. 109-133.
13. Ramirez, J., "Automatic Generation of Data Conversion-Programs Using a Data Description Language (DDL)," Ph.D. Dissertation, University of Pennsylvania, 1973.
14. Rosenberg, A.L. and Stockmeyer, L., "Storage Schemes for Boundedly Extendible Arrays," Acta Informatica, 7, 1977, pp. 289-303.
15. Scheuermann, Peter, "On the Design and Evaluation of Data Bases," IEEE Computer, Feb. 1978, pp. 46-54.
16. Scheuermann, Peter, "Concepts of a Data Base Simulation Language," Proc. ACM SIGMOD Int'l. Conf. on Management of Data, 1977, pp. 144-156.
17. Shu, N.C., et al., "EXPRESS: A Data EXtraction, Processing and REStructuring System," ACM Trans. Database Systems, Vol. 2, No. 2, June 1977, pp. 134-174.
18. Taylor, Robert W., "Generalized Data Base Management System Data Structures and their Mapping to Physical Storage," Ph.D. Dissertation, Univ. of Michigan, 1971.
19. Wetherell, C. and Shannon, A., "LR Automatic Parser Generator and LR(1) Parser," Lawrence Livermore Lab., University of California, P.O. Box 808, Livermore, CA 94550, June 14, 1979.
