Issues related to 4GB partitioned tables

During research into the partitioned table capability of Ingres it was noted that many of the problems encountered stem from the architectural design. A reworking of the partitioning capabilities of Ingres is needed to address these problems.
4GB+ table - research
Research into 4GB+ tables was done on usl3sd01 with SAN-attached storage (three LUNs, each of 2TB of disk). The Ingres installation is Ingres 9.3 (Build SVN 932).
The hardware is given in appendix 1.
The use of a 4GB+ table was deemed necessary to ensure that wrap-around errors, where integers are used, could be detected. Following several successful creations of tables with 4 billion (4B+) tuples, it was found that there were areas of greater concern than simply the capability to load 4B+ tuples. The results of the study are included in this paper.
Datawarehouse - research
My working title for an Ingres warehouse is IngresWDB. I have a 300 DB star configuration that has views against a federated table (equivalent to spreading a LIST partitioned table over 300 servers). This configuration will be used for functionality testing.
For practical testing of data load and retrieval performance, a 20 DB constellation has been created on the same hardware.
A 3 DB constellation has been used to test the scripts used to build the above database
warehouse examples.
1 billion row table - research

• Creating a 1 billion row table took approx 12 seconds for each 500,000 tuples added.
o Total run time of 6h 32m.
o Spikes were seen when the table extended: as each partition was extended, the load time for 500K tuples was approx 28 seconds.
The CREATE TABLE statement is given in appendix 2.

• MODIFY the 1 billion row table to BTREE
o Total run time 4h 15m.
o Required "SET SESSION WITH ON_LOGFULL = COMMIT", even though the transaction log file was 16 gigabytes. What was causing the very large volume of writes to the transaction log file has yet to be identified.
The MODIFY TABLE statement is given in appendix 3; a minimal sketch of the session setting is shown below.
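A minimal sketch of the sequence used, assuming the MODIFY statement from appendix 3 is run from the terminal monitor immediately after the session setting:

set session with on_logfull = commit
\p\g
/* now run the MODIFY collected_data TO BTREE statement exactly as given in appendix 3 */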

• Creating a composite key index on two i4's on the 1 billion row table ran to completion.
o Total runtime of 3h 08m.
o Resulting index contains 4.3 million 8K pages.
The CREATE INDEX statement is given in appendix 4; a sketch is shown below.
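For orientation only, a sketch of what such a composite-key index creation might look like; the index name cd_time_idx and the choice of sampletime and sampleperiod as the two i4 columns are assumptions here, the actual statement is the one given in appendix 4:

CREATE INDEX cd_time_idx ON collected_data (sampletime, sampleperiod)
WITH STRUCTURE = BTREE,
     PAGE_SIZE = 8192
\p\g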

• Optimizing the table ran to completion.
o Total runtime of 13m 19s.
o optimizedb -zr5000 -zu5000 -zh big4 -rcollected_data -asampleperiod -adcid

• copydb
o copy.out ran to completion
- Required "SET SESSION WITH ON_LOGFULL = COMMIT" to be added to the script
- Time to unload: 3h
o copy.in
- 94 gigabytes of disk and more than 1 billion log writes were required to load the table
- Time to load: 4h 30m
A sketch of the unload/reload sequence follows.
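A minimal sketch of the sequence used, assuming the database is big4 (as in the optimizedb command above) and that the scripts are generated into the current directory; the session setting is added by hand as noted:

copydb big4
# edit copy.out and copy.in to add: set session with on_logfull = commit (followed by \p\g) near the top
sql big4 <copy.out          # unload, approx 3h for the 1 billion row table
sql big4 <copy.in           # reload, approx 4h 30m and approx 94GB of disk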
Changes required to COPYDB





• Changes are required in the generated copy.out and copy.in files to include set session with on_logfull = notify if any partitioned tables are being processed.
• Bulk load is not supported against partitioned tables; this needs to be addressed.
• Features to export partition sets and create the load schema, where a partition set is a range of 1 to n partitions of the partitioned table, in sequence, with any partition as the starting point.
• To unload and load a partitioned table in parallel.
• To unload a partitioned table into individual files (a hypothetical sketch follows this list).
o Unloading the partitions in parallel would speed up the extract of data from a table, as would a parallel load capability. The limitation is that the same partitioning scheme is used for unload and load. The XFERDB syntax would need to be enhanced to add the required SQL to facilitate the parallel load and unload of table data into individual physical files. The naming of the files for each partition would need to be considered to ensure that the partition data is correctly handled irrespective of the number of partitions.
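A hypothetical illustration of the per-partition unload that a generated copy.out could contain; the with partition qualifier is proposed syntax (it does not exist in current Ingres SQL) and the file names are illustrative only:

set session with on_logfull = notify
\p\g
copy table collected_data () into '/vol02/unload/collected_data.p01'
    with partition = p01    /* proposed syntax: one physical file per partition */
\p\g
copy table collected_data () into '/vol02/unload/collected_data.p02'
    with partition = p02
\p\g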
Changes required to USERMOD


• To allow specific partitions, or a range of partitions, to be acted upon independently.
o A rule to automatically restructure a range partition, or a list partition with multiple entries, into new partitions when a global or table partition limit is reached.
- E.g. if 6m tuples are added to a table every day, when the number of tuples in any partition reaches 25m that partition would be split (if possible) or an entry made in the error log to notify the DBA. A sketch of how this condition could be detected today follows this list.
o Negating the automatic re-partitioning action is required, as the DBA may be controlling the table and be aware of the potential issues that may arise. The automatic feature is required for lights-out installations.
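Pending such a rule, the condition can be approximated today with a query on the partitioning column, since each list value maps to one partition; a minimal sketch against the collected_data table of appendix 2, using the 25m threshold from the example above:

select sampletime, count(*) as tuples
from collected_data
group by sampletime
having count(*) > 25000000
\p\g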
Changes required to MODIFY and ALTER

• To DROP partitions from a partition scheme (Sharding)
• To CREATE a new table from one or more consecutive partitions by refactoring the partitioned table and changing the iirelation table
• To add partitions to the beginning or end of a partitioned table (Growing)
• To split a partition (Re-Partitioning)
• To aggregate partitions (Re-Partitioning)
• MODIFY to TRUNCATE any partition sequence, leaving the partitions in place
• MODIFY or ALTER changes to enable partition-level indexing
• MODIFY or ALTER changes to apply updates to global indexes when changes to partitioning move keyed values between partitions or otherwise update the TID information, without needing to DROP or otherwise re-create the global index (TBD)
Changes required to partitioning criteria
Various difficulties have been experienced when loading partitioned tables. Some ideas and questions are provided below for consideration.
• Session defaults: loading tuples into a partitioned table fills the TX log file. Should a session be able to add the set session with on_logfull = notify setting automatically prior to starting the load of tuples into a partitioned table?
• Enable partitioned tables to be bulk loaded under all circumstances. This could be resolved by loading data into single partitions in parallel, with all tuples in a given load file that fail to match the partition scheme being loaded into the default partition or placed in an exception file. To facilitate this, the unload (copy.out) would also need to be able to unload the table into individual load files.
• Being able to treat partitions as if they were ordinary tables for the purposes of loading and unloading via copy.in/copy.out scripts would benefit all implementations of partitioned tables.
Partitioned tables are made up of tables that are treated as special tables forming part of a Master table. These tables share the Master table schema (columns are only held in the iiattribute table for the Master table); however, in every other way they are the physical tables that the DBMS server acts upon, since the Master table has no physical presence.
• Treating partitions as tables that can participate in registered table virtualisation would address some areas of performance by reducing locking. Only SQL statement requests made against the Master table would require high-level locks to be taken on the Master table.
• A registered table is similar to a Master table in that it has no physical presence of its own. An Ingres STAR registered table has a link to one or more tables on a different database that are either local or in remote installations.
• If the partitions of a partitioned table could be registered as if they were a set of coherent tables, created either by the REGISTER TABLE statement or by the CREATE TABLE statement with a new 'register' option, then the distinction between a partitioned table and an Ingres STAR registered table would be removed. In the former case all participating tables would need to be of the same DDL scheme. In the latter case the underlying tables would be created by the CREATE TABLE statement (as per the current standard) with a syntax update to enable the partitions to be created as individual tables under the umbrella of the Master table.
o Example:
CREATE TABLE ptn
(
    id      INTEGER,
    samples INTEGER,
    svalue  FLOAT
)
WITH
    DUPLICATES,
    NOJOURNALING,
    PAGE_SIZE = 16384
\p\g
MODIFY ptn TO HEAP
WITH
    ALLOCATION = 2000,
    EXTEND = 2048,
    PAGE_SIZE = 16384,
    register partition = ((
    list on samples
        partition p01 values (1),
        partition p02 values (2),
        partition p03 values (3),
        partition p04 values (4),
        partition p05 values (5),
        partition p53 values (default))
    SUBPARTITION (
        HASH ON id
        35 partition with location = (iidatabase)))
\p\g
REGISTER TABLE ptn_history FOR (ptn.p01, ptn.p02);
\p\g
REGISTER TABLE ptn_research FOR (ptn.p03, ptn.p04);
\p\g
REGISTER TABLE ptn_current FOR (ptn.p04, ptn.p05);
\p\g
REGISTER TABLE ptn_invalid FOR ptn.p53;
\p\g
/* brackets not required for a single table */
In this example ptn.p04 and ptn.p05 would exist as real tables that can be acted upon by MODIFY and ALTER statements, e.g.
ALTER TABLE ptn.p01 ADD CONSTRAINT UNIQUE INDEX ptn_idx_01 (id, samples);
However, to act on the data with DML statements (SELECT, UPDATE or INSERT), the Master table (ptn) or the registered table (ptn_current) would need to be used.
e.g. both of the following INSERTs would work through the ptn table; through ptn_current the partitioning scheme would still be in effect, so only tuples with samples = 4 or 5 would be accepted:
INSERT INTO ptn values (2, 5, 3.57);
or
INSERT INTO ptn_current values (2, 4, 59.0);

• Addressability of the individual partitions as independent tables would resolve many issues, and enabling views over restricted parts of a single highly partitioned table would bring benefits all round. Un-latching the control/master table from the partitions would bring a big improvement in flexibility, and the capability to manage (register) a set of related tables (same schema) as a partitioned table (similar to a view) would bring its own benefits.

• A re-write of the partitioned table logic is required to address restrictions on indexing and table modification (sharding, splitting, combining and adding new partitions), all of which are necessary when dealing with massive tables.

• The ability to register a table as a partition, or a subset of consecutive partitions, of the main table, either in STARDB or in a standard DB, would bring benefits to the partitioning of a table across multiple installations and, in reverse, to the combining of multiple tables as a distributed partitioned table, facilitating the development of the IngresWDB solution.
• Individual partitions should have their own table structure; at present all partitions must have the same structure and index columns, which may not be an optimal choice.
• A multi-part partition key is required within iirelation to identify partitioned tables, disconnecting the overloading of partition information within reltidx.
o A three-part TID is a necessary requirement (reltid, reltidp - new column, reltidx) which, together with the relnparts column, forms a consistently addressable referencing model and should be added to the next release even if the functionality is not fully migrated from the two-part usage (overloading of the index reltidx column).
reltid   reltidx   reltidp   relnparts   type   Description
N        0         0         0           T      A base table
N        n         0         0           I      An index on a base table
N        0         n         m           P      A partitioned table
N        n         n         m           G      A partitioned index on a partitioned table. Global index.
n        n         0         0           L      An index on a single partition of a partitioned table. Local index.

the value 'N' is the integer reltid of a Base or Master Partition
the value 'n' is the integer reltid of an Index or partition
the value 'm' is the integer partition number
Note: for a non-partitioned table relnparts is 0, and for the base table of a partitioned table it holds the maximum partition number.
Note: 'N' and 'n' should be increased to big integers (integer8)
o With the current reltid and overlaid reltidx structure there is no ability to implement the local index or the partitioned global index.
o The locking implementation that is currently in place is prohibitively costly when multiple partitions are being updated, as a lock is taken at the master table level by each updating statement, which forces the statements to run synchronously.
Indexing
Index Type      Description
Local Index     An index that is against a single partition.
Global Index    An index across the whole partitioned table. This index can be either partitioned or non-partitioned. If the index is partitioned then it should have the same partitioning properties as the partitioning of the partitioned base table.
Current DB Capability (Reltid, reltidx, relnparts)
  Partitioned Table:   183, -2147483464, 0
                       183, -2147483463, 1
                       183, -2147483462, 2
  Index:               183, 184, 0
  Index: All partitions can only be included in a single index, using single or multiple locations. A maximum of 8.2m pages are possible in an index, limiting flexibility.

New DB Capability (1) (Reltid, reltidx, reltidp, relnparts)
  Partitioned Table:   183, 0, 184, 0
                       183, 0, 185, 1
                       183, 0, 186, 2
  Partitioned Index:   183, 187, 188, 0
                       183, 187, 189, 1
                       183, 187, 190, 2
  Partitioned Index: Same scheme as table partitioning, ensuring no coincidental locking is required. Single or multiple locations may be used, though not required.

New DB Capability (2) (Reltid, reltidx, reltidp, relnparts)
  Partitioned Table:   183, 0, 184, 0
                       183, 0, 185, 1
                       183, 0, 186, 2
  Local Index:         184, 187, 0, 0
                       185, 188, 0, 0
                       186, 189, 0, 0
  Local Index: Each partition can be indexed separately as if it were a Base table. No locking of the Master table or any other partition is required. Single or multiple locations may be used.

Sharding and Re-Partitioning

Action            Description
Sharding          Removal of the oldest partitions of a table. A special case of Re-Partitioning, i.e. removal of the lowest range for Range partitioned tables or the first set of a List partitioned table.
Re-Partitioning   Modification of all or some of the parts of a table. Enables partitions of a Range partitioned table to be subdivided and an additional set to be added to List partitioned tables.
Growing           Adding a new partition to the head of a partitioned table. A special case of Re-Partitioning, i.e. re-partitioning the default or last partition to have a range following the last defined value of a Range partition, or a new set of values for a List partitioned table.
Sharding Capability (before)
Reltid, reltidx, reltidp, relnparts
  Partitioned Table:   183, 0, 184, 0
                       183, 0, 185, 1
                       183, 0, 186, 2
                       183, 0, 187, 3
Sharding: Removal of the first identified partition using reltid, reltidp and relnparts. Global indexes are not permitted during Sharding. Any Local indexes for the partition being removed are dropped.

Sharding Capability (after)
Reltid, reltidx, reltidp, relnparts
  Partitioned Table:   183, 0, 185, 0
                       183, 0, 186, 1
                       183, 0, 187, 2
Sharding: The sequencing of the partitions is related to the relnparts value. When the table's first partition is removed, the remaining partitions are re-sequenced. Local indexes on the re-sequenced partitions are unaffected.
Growing Capability (before)
Reltid, reltidx, reltidp, relnparts
  Partitioned Table:   183, 0, 184, 0
                       183, 0, 185, 1
                       183, 0, 186, 2
Growing: Adding a new partition requires that the default partition is made the highest numbered partition (relnpartno) and that the new partition is inserted behind it. Global indexes are not permitted during Growing. Local indexes are unaffected.

Growing Capability (after)
Reltid, reltidx, reltidp, relnparts
  Partitioned Table:   183, 0, 184, 0
                       183, 0, 185, 1
                       183, 0, 186, 2
                       183, 0, 187, 3
Growing: Once the new partition is inserted, all values in the default partition are inserted into the new partition and deleted from the default partition. Local indexes on the default partition are reorganised. Creation of Local indexes on the new partition is the responsibility of the User.
Re-Partitioning Capability (before)
Reltid, reltidx, reltidp, relnparts
  Partitioned Table:   183, 0, 184, 0
                       183, 0, 185, 1
                       183, 0, 186, 2
                       183, 0, 187, 3
Re-Partitioning: The new range or list set for an inserted partition must completely cover the original partitioning scheme for the affected partition. Global indexes are not permitted during re-partitioning. Local indexes are dropped and recreated on the affected partition. The default partition will be checked for matching values, which will be relocated to the new partition.

Re-Partitioning Capability (after)
Reltid, reltidx, reltidp, relnparts
  Partitioned Table:   183, 0, 184, 0
                       183, 0, 185, 1
                       183, 0, 188, 2
                       183, 0, 187, 3
                       183, 0, 187, 4
Re-Partitioning: Once the new partition is inserted, all matching values from the original partition are inserted into the new partition and deleted from the original partition. The range and List set criteria are validated prior to initiating the re-partitioning action. Local indexes on the modified partition are dropped and, once the partitions are re-populated, they are re-created on both the original and new partitions.
APPENDIX 1 – HARDWARE and OPERATING SYSTEM
OPERATING SYSTEM:
uname -a
Linux usl3sd01.ingres.prv 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
FILESYSTEM / SAN
Filesystem               Size   Used   Avail   Use%   Mounted on
/dev/cciss/c0d0p1         62G    58G    1.6G    98%   /
tmpfs                    2.0G      0    2.0G     0%   /dev/shm
/dev/mapper/vg00-vol01   2.0T   1.4T    487G    75%   /vol01
/dev/mapper/vg00-vol02   2.0T   5.6G    1.9T     1%   /vol02
/dev/mapper/vg00-vol03   2.0T   404G    1.5T    22%   /vol03
CPU INFO
4 CPUs (0-3). Details of CPU 0 are given here:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.80GHz
stepping        : 3
cpu MHz         : 2800.000
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl est tm2 cid cx16 xtpr
bogomips        : 7603.80
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:
APPENDIX 2 – CREATE TABLE
\continue
\sql
set autocommit on
\p\g
DROP SEQUENCE big1
\p\g
CREATE SEQUENCE big1 as INTEGER8
START WITH 1 CACHE 500000 NOCYCLE
\p\g
DROP TABLE collected_data
\p\g
CREATE TABLE collected_data
(
    dcid            INTEGER8 NOT NULL NOT DEFAULT,
    sampletime      INTEGER NOT NULL,
    sampleperiod    INTEGER NOT NULL,
    qualifier       INTEGER NOT NULL,
    samples         INTEGER,
    avgval          F4,
    minval          F4,
    maxval          F4,
    stddev          F4
)
WITH
DUPLICATES,
NOJOURNALING,
PAGE_SIZE = 16384
\p\g
MODIFY collected_data TO HEAP
WITH
    ALLOCATION = 200000,
    EXTEND = 51200,
    PAGE_SIZE = 16384,
partition = ((
list on sampletime
partition p01 values (1),
partition p02 values (2),
partition p03 values (3),
partition p04 values (4),
partition p05 values (5),
partition p06 values (6),
partition p07 values (7),
partition p08 values (8),
partition p09 values (9),
partition p10 values (10),
partition p11 values (11),
partition p12 values (12),
partition p13 values (13),
partition p14 values (14),
partition p15 values (15),
partition p16 values (16),
partition p17 values (17),
partition p18 values (18),
partition p19 values (19),
partition p20 values (20),
partition p21 values (21),
partition p22 values (22),
partition p23 values (23),
partition p24 values (24),
partition p25 values (25),
partition p26 values (26),
partition p27 values (27),
partition p28 values (28),
partition p29 values (29),
partition p30 values (30),
partition p31 values (31),
partition p32 values (32),
partition p33 values (33),
partition p34 values (34),
partition p35 values (35),
partition p36 values (36),
partition p37 values (37),
partition p38 values (38),
partition p39 values (39),
partition p40 values (40),
partition p41 values (41),
partition p42 values (42),
partition p43 values (43),
partition p44 values (44),
partition p45 values (45),
partition p46 values (46),
partition p47 values (47),
partition p48 values (48),
partition p49 values (49),
partition p50 values (50),
partition p51 values (51),
partition p52 values (52),
partition p53 values (default))
subpartition (
hash on dcid
/* 100 */ 35 partition with location =
(loc1)))
\p\g
APPENDIX 3 – MODIFY TABLE
MODIFY collected_data TO BTREE
WITH
ALLOCATION = 200000,
EXTEND = 51200,
PAGE_SIZE = 16384,
partition = ((
list on sampletime
partition p01 values (1),
partition p02 values (2),
partition p03 values (3),
partition p04 values (4),
partition p05 values (5),
partition p06 values (6),
partition p07 values (7),
partition p08 values (8),
partition p09 values (9),
partition p10 values (10),
partition p11 values (11),
partition p12 values (12),
partition p13 values (13),
partition p14 values (14),
partition p15 values (15),
partition p16 values (16),
partition p17 values (17),
partition p18 values (18),
partition p19 values (19),
partition p20 values (20),
partition p21 values (21),
partition p22 values (22),
partition p23 values (23),
partition p24 values (24),
partition p25 values (25),
partition p26 values (26),
partition p27 values (27),
partition p28 values (28),
partition p29 values (29),
partition p30 values (30),
partition p31 values (31),
partition p32 values (32),
partition p33 values (33),
partition p34 values (34),
partition p35 values (35),
partition p36 values (36),
partition p37 values (37),
partition p38 values (38),
partition p39 values (39),
partition p40 values (40),
partition p41 values (41),
partition p42 values (42),
partition p43 values (43),
partition p44 values (44),
partition p45 values (45),
partition p46 values (46),
partition p47 values (47),
partition p48 values (48),
partition p49 values (49),
partition p50 values (50),
partition p51 values (51),
partition p52 values (52),
partition p53 values (default))
subpartition (
hash on dcid
/* 100 */ 9 partition with location =
(loc1)))
\p\g