DB-15: Inside The Recovery Subsystem

Plan to commit; Be prepared to rollback.
Richard Banville
Fellow, Technology and Product Architecture
Progress OpenEdge
Recovery Types




Transaction Recovery*
• Before image rollback/undo and crash recovery
Hard Failure Recovery
• Roll forward after images
• Point in time, transaction, retry
Coordinated distributed txn consistency
• OpenEdge® 2PC - Prepare Phase, Commit Phase
Heterogeneous distributed txn consistency (JTA)
• External distributed transaction coordinator
• Requires application changes
• Available for OpenEdge SQL only
* Before Imaging is the focus of this presentation
2 DB-15: Inside the Recovery Subsystem
© 2007 Progress Software Corporation
Agenda
▪ The BI Units of Measure
▪ Some Simple Rules
▪ General Processing (the fun stuff)
▪ Reliability Switches
▪ Summary
BI Layout: Notes and Blocks
Notes are the basis for recording change in the database
The BI is made up of many notes
Notes are variable sized
Notes are organized in order of operation
Notes are stored into BI blocks
BI block size can be customized (1–16 KB)
I/O is performed in units of the BI block size
BI Layout: Clusters
Blocks are grouped to form a cluster
BI cluster size can be customized (16KB – 256MB)
Size affects checkpoint frequency (among other things)
BI Layout: Clusters
Clusters are allocated as needed
Clusters are logically joined and ordered into a ring
Only one cluster accepts BI writes at any time
BI Layout: Storage
The Primary Recovery Area:
BI data is stored in the extents of area #2 of the database
It grows as needed
Space is reused when possible
[Figure: the BI files (extents) that make up the area]
What’s in a note?
Trid: 81180 code = RL_RMCR version = 2
Trid: 81180 area = 8 dbkey = 14528 update counter = 4770
Header
▪ Length & note version
▪ Note code/identifier
▪ Associates action
▪ Transaction Id
▪ Block pointer & area
▪ Block update counter
Note Specific Info
▪ Record #
▪ Table number
▪ Size of record
▪ Note type
▪ Split information
Data Portion (if needed)
▪ Block change data
▪ i.e., record data itself
▪ Only if needed
Rules to live by
 #1 - Write ahead logging (WAL)
• Recovery log notes written BEFORE data
– Assures atomic and durable transactions
– BI, AI - reliable write I/O
– Can relax data write I/O




Write prior to BI-reuse
Cluster close
Missing data applied by redo
Deferring writes allows multiple updates to occur with
a single I/O
 #2 - Write ordering rule (FS and hardware)
• AI, BI writes get to disk in order requested
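A minimal Python sketch of rule #1, under stated assumptions. The class WalStore and its methods are hypothetical names for illustration, not OpenEdge internals. It shows the two points made above: a note is recorded before the data changes, and deferring the data write lets several updates to one block share a single I/O.

```python
class WalStore:
    """Toy write-ahead log (hypothetical): notes reach the log before data reaches disk."""

    def __init__(self):
        self.log_on_disk = []   # notes already forced to the recovery log
        self.log_buffer = []    # notes recorded but not yet flushed
        self.cache = {}         # block id -> in-memory value (may be dirty)
        self.disk = {}          # block id -> value in the data file
        self.data_writes = 0    # number of data-file I/Os performed

    def update(self, block, value):
        # Record the note first, then change the cached block.
        self.log_buffer.append((block, value))
        self.cache[block] = value

    def flush_log(self):
        # Notes are appended to the on-disk log in the order they were recorded.
        self.log_on_disk.extend(self.log_buffer)
        self.log_buffer.clear()

    def write_block(self, block):
        # WAL: force the log before the data block is written.
        self.flush_log()
        self.disk[block] = self.cache[block]
        self.data_writes += 1


store = WalStore()
for value in ("a", "b", "c"):
    store.update(7, value)   # three updates to the same block
store.write_block(7)         # one deferred data write covers all three
```

If the process died before write_block, the three notes would still describe the missing changes, which is the "missing data applied by redo" case.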
Rules to follow

▪ #3 - BI Space Reuse
• Only when cluster is closed
• Cluster closes when its last transaction ends
– Checkpoint DOES NOT close a cluster
– Checkpoint occurs when cluster fills up
▪ #4 - Exclusive Block Access
• When changing data in database
▪ #5 - Atomic Physical Changes
• Such as block chain manipulations
• Enforced by internal TXE mechanism
• SYSTEM ERROR: User 5 died during micro txn.
Rule
 #6 - Without exception:
• All DB changes are recorded in recovery log.
Rules were meant to be broken
 #6 - Without exception:
• All DB changes are recorded in recovery log.
 Exception:
• Control Area (area #1) changes are not logged.
– Why should I care?
– Allows structural changes w/o affecting recovery
 Such as adding space while in roll forward.
– Recovery Mechanism: Builddb
Forward Processing
So you want to perform a database action
▪ Locate/Lock the data block to change
• Not all notes require a block
– Transaction begin, end
• Not all DB changes require a block!
– Acquiring additional space
– Certain index sub-operations
▪ Ensure begin transaction recorded
▪ Record the change in the BI log (via the BI buffer pool)
BI Buffer Pool – Recording a change
[Figure: the BI buffer pool with -bibufs 10, shown for Forward Processing and Rollback Processing. Labels: New Notes (Actions), Current Output Buffer, Modified Queue, Free List (NF - a to NF - e), Current Input Buffer, Backout Buffers, BI file.]
BI Buffer Pool – Recording a change
Forward Processing
[Figure: the BI buffer pool with -bibufs 10. Labels: New Notes (Actions), Current Output Buffer, Modified Queue, BI blocks 29 to 32, Free List (NF - a to NF - e), BI file.]
PROMON: Total BI Writes, Records (notes) written, Partial Writes, Busy buffer waits, Empty buffer waits
Is it OK to buffer dirty BI blocks? YES
Is it OK to buffer committed BI data? Delayed commit is up to you!
Forward Processing (continued)
The BI note has been written…
▪ Finally perform the DB action (make the change)
• Logical, physical or a mix
▪ Data block’s update counter is incremented
• Identifies whether a noted change has made it to disk yet
• Ensures changes are re-applied in order
▪ Dependency counter maintained in ctlr struct
• Ensures the associated BI notes are flushed on a –B eviction
▪ User may be forced to do (expensive) BI I/O
• On -B eviction or when no BI buffers are available
• Avoid with APWs, BIW and -bibufs
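The two counters above can be sketched in Python as follows. This is an illustration under assumptions, not the actual OpenEdge structures: BiLog, DataBuffer, change_block and evict are hypothetical names, the dependency counter is modelled as a position in the BI log, and the note is assumed to carry the counter value the block holds after the change.

```python
class BiLog:
    def __init__(self):
        self.notes = []          # every note recorded so far (buffered or on disk)
        self.flushed = 0         # how many notes are already on disk
        self.forced_flushes = 0  # BI I/Os that an evicting user had to wait for

    def record(self, note):
        self.notes.append(note)
        return len(self.notes)   # log position just past this note

    def flush_to(self, position):
        # Only does I/O when the requested position is not on disk yet.
        if position > self.flushed:
            self.flushed = position
            self.forced_flushes += 1


class DataBuffer:
    def __init__(self, dbkey):
        self.dbkey = dbkey
        self.update_counter = 0
        self.dependency = 0      # log position of the newest note for this block


def change_block(log, buf, trid):
    # Note first, then the change itself.
    buf.dependency = log.record((trid, buf.dbkey, buf.update_counter + 1))
    buf.update_counter += 1


def evict(log, buf):
    # Before the block leaves the buffer pool, its notes must be on disk.
    log.flush_to(buf.dependency)
    return (buf.dbkey, buf.update_counter)  # what would go to the data file


log = BiLog()
buf = DataBuffer(dbkey=14528)
change_block(log, buf, trid=81180)
change_block(log, buf, trid=81180)
written = evict(log, buf)
```

In this model the forced flush is the expensive BI I/O the slide warns about; a background writer that keeps `flushed` ahead of every buffer's dependency would make `flush_to` a no-op for the user.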
Helping avoid OLTP BI I/O
Broker Processing
Helping Avoid OLTP BI I/O
[Figure: the BI buffer pool with -bibufs 10 (New Notes, Current Output Buffer, Modified Queue, Free List) and the Broker writing to the BI file]
Delayed commit (Durability): based on the –Mf value, the Broker may flush BI buffers to disk for aged txn ends
PROMON: Total BI Writes, Records (notes) written, Partial Writes
BIW Processing
Helping Avoid OLTP BI I/O
[Figure: the BI buffer pool with -bibufs 10 (New Notes, Current Output Buffer, Modified Queue, Free List) and the BIW writing to the BI file]
PROMON: Total BI Writes, Records (notes) written, Partial Writes, BIW Writes
APW Processing
Helping Avoid OLTP BI I/O
[Figure: the BI buffer pool with -bibufs 10 beside the Checkpoint Queue of Data Blocks. Labels: Associated BI Note (dependency ctr) with example values 172 and 128, WAL, APW, BI file, db.]
BI Clusters And Checkpointing
The Precious Ring
BI blocks are grouped together to form a cluster of blocks.
The clusters are logically joined together in a ring.
[Figure: BI files holding clusters 1 to 4 (BI Cluster Layout), the -bibufs pool (Current Output Buffer, Modified Queue, BI blocks 29 to 32), the -B buffer pool, and the Database]
Checkpoint – Synchronization point
All Database Changes Halted!
BI buffer pool flushed
Db buffer pool scanned
Db buffers previously marked for chkpt are written out (OUCH!)
Dirty buffers are marked for chkpt & put on checkpoint queue
Fuzzy checkpointing avoids I/O
File system cache is synchronized
No more sync delay
[Figure: BI files with clusters 1 to 4 (BI Cluster Layout), the -bibufs pool, the -B buffer pool, the File System Cache, and the Database]
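The marking scheme in the checkpoint steps can be sketched in Python. This is a simplified model under assumptions: the function names and the three buffer states are hypothetical, and re-dirtying of a marked buffer is ignored. It shows why buffers still marked at the next checkpoint cost I/O during the checkpoint, and why writing them out beforehand avoids it.

```python
def checkpoint(buffers, checkpoint_queue):
    """buffers maps block number -> 'clean', 'dirty' or 'marked' (hypothetical model)."""
    # Buffers still marked from the previous checkpoint must be written now.
    flushed = list(checkpoint_queue)
    for block in flushed:
        buffers[block] = "clean"
    checkpoint_queue.clear()
    # Currently dirty buffers are only marked and queued, not written.
    for block, state in buffers.items():
        if state == "dirty":
            buffers[block] = "marked"
            checkpoint_queue.append(block)
    return flushed


def apw_drain(buffers, checkpoint_queue, limit):
    """Background writer: write up to `limit` marked buffers between checkpoints."""
    written = []
    while checkpoint_queue and len(written) < limit:
        block = checkpoint_queue.pop(0)
        buffers[block] = "clean"
        written.append(block)
    return written


buffers = {128: "dirty", 172: "dirty", 256: "clean"}
queue = []
first = checkpoint(buffers, queue)        # nothing marked yet, so no writes
drained = apw_drain(buffers, queue, 1)    # background writer handles block 128
buffers[512] = "dirty"                    # new change after the checkpoint
second = checkpoint(buffers, queue)       # block 172 is still marked: written now
```

Here `second` corresponds to the buffers flushed at checkpoint; a background writer fast enough to empty the queue would leave it empty.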
Checkpoint (with –directio)
All Database Changes Halted!
BI buffer pool flushed
Db buffer pool scanned
Db buffers marked for chkpt are written out
Dirty buffers are marked for chkpt & put on checkpoint queue
Fuzzy checkpointing avoids I/O
[Figure: BI files with clusters 1 to 4 (BI Cluster Layout), the -B buffer pool using unbuffered I/O, and the Database]
The APW
The APWs help with checkpoints too
PROMON: Buffers Flushed at checkpoint, BIW Writes
[Figure: the APW writing data blocks to the db from the APW Queue, the Checkpoint Queue and the -B Buffer Pool]
Checkpoint – Size Does Matter
▪ Larger cluster sizes
• Fewer checkpoints (sync points)
– Will a crash result in additional lost data?
• Longer recovery time
– Recovery starts at last cluster - 1
• Longer BI format time (runtime)
• Longer BI format time after truncate
– Use at least one fixed length extent
    ▪ Also use a variable length extent
– Use bigrow
Checkpoints and Promon
Seeing is believing…
Ooops!!
Ckpt                            ------ Database Writes ------
No.  Time      Len  Freq  Dirty  CPT Q  Scan  APW Q  Flushes
27   10:23:12    4     0    384     52     0      0        0
26   10:22:46   25    26    381    381     0      0        0
25   10:22:18   27    28    380    380     0      0        0
24   10:21:50   27    28    346    158   201      0        0
23   10:21:21   28    29    372    360   115      0        0
Len: begin to end time; the time the cluster was actively available for writes
Freq: begin time to begin time; the time between checkpoints
Time spent performing checkpoint operation: Freq - Len
Dirty: # data blocks newly updated; not incremented when “made dirtier”
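As a worked example of the Freq - Len arithmetic, the short Python snippet below uses the Len and Freq values of checkpoints 23 to 26 from the table. Checkpoint 27 is left out because its Freq is shown as 0, presumably because the following checkpoint had not begun yet. The names in the snippet are chosen for illustration only.

```python
# (checkpoint number, Len, Freq) copied from the PROMON checkpoint table
checkpoints = [
    (26, 25, 26),
    (25, 27, 28),
    (24, 27, 28),
    (23, 28, 29),
]


def checkpoint_cost(length, freq):
    """Time spent performing the checkpoint operation: Freq - Len."""
    return freq - length


costs = {number: checkpoint_cost(length, freq) for number, length, freq in checkpoints}
print(costs)  # {26: 1, 25: 1, 24: 1, 23: 1}
```

Each of these checkpoints therefore took about one time unit out of a cycle of 26 to 29.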
Checkpoints and Promon: APW Specific Activity…
CPT Q: # data buffers APW wrote from checkpoint queue (from prev chkpt)
Scan: # data buffers APW wrote while scanning -B
APW Q: # data buffers APW wrote from APW Q
Dirty buffers added to APWQ from -B LRU eviction
Checkpoints and Promon: To be avoided…
Flushes: Number of blocks written during checkpoint (marked from previous checkpoint)
Len: Checkpointing too often should be avoided
Reusing space in the BI file
BI Space Reuse
[Figure: BI files whose cluster ring is shown with 4, then 5, then 6 clusters]
When can BI space be reused?
No open transactions in cluster
Changes have been written to data files
No need to “Age” cluster anymore: -G 0 vs –G 60. Thanks fdatasync()
Checkpoint DOES NOT close a cluster!!
Why?? If an outstanding transaction were to roll back, where would the undo action come from?
BI files grow to some working set size
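The reuse rule can be sketched as a ring in Python. This is a toy model under assumptions: the classes Cluster and BiRing are hypothetical, a cluster counts as open while any transaction that began in it is still active, and a new cluster is spliced in right after the current one when the next cluster cannot be reused.

```python
class Cluster:
    def __init__(self, number):
        self.number = number
        self.open_txns = set()  # transactions that began here and have not ended


class BiRing:
    def __init__(self, initial=4):
        self.clusters = [Cluster(n + 1) for n in range(initial)]
        self.current = 0        # index of the one cluster accepting writes

    def begin(self, trid):
        self.clusters[self.current].open_txns.add(trid)

    def end(self, trid):
        for cluster in self.clusters:
            cluster.open_txns.discard(trid)

    def advance(self):
        """Called when the current cluster fills up; returns the next cluster number."""
        nxt = (self.current + 1) % len(self.clusters)
        if self.clusters[nxt].open_txns:
            # The next cluster still holds undo information: grow the ring instead.
            self.clusters.insert(self.current + 1, Cluster(len(self.clusters) + 1))
            nxt = self.current + 1
        self.current = nxt
        return self.clusters[nxt].number


ring = BiRing(initial=4)
ring.begin(81180)                               # long-running txn begins in cluster 1
visited = [ring.advance() for _ in range(4)]    # [2, 3, 4, 5]: cluster 1 is not reusable
ring.end(81180)
reused = ring.advance()                         # 1: cluster 1 can be reused now
```

Once the long transaction ends, the ring stays at five clusters, which matches the idea of the BI growing to a working set size.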
Rollback
Rollback Processing
Read backwards & UNDO until tx begin
ABL sub transaction rollback: ABL requests compensating action
PROMON: Input buffer hits, Output buffer hits, Mod buffer hits, Busy buffer waits, Total BI Reads, Notes read
[Figure: the BI buffer pool with -bibufs 10. Labels: Free List (NF - a to NF - e), Current Output Buffer, Modified Queue, Current Input Buffer, Backout Buffers, BI file, .lbi file.]
What about BOB?
PROMON: Input buffer hits, Output buffer hits, Mod buffer hits, BO Buffer hits
[Figure: the BI buffer pool with -bibufs 10. Labels: Free List (NF - a to NF - e), Current Output Buffer, Modified Queue, Current Input Buffer, Backout Buffers, BI file.]
Crash Recovery
Crash Recovery
▪ Performed on each database startup
• Only needed phases performed
▪ Brings DB up to last known consistent state
• Physically sound
• In-flight transactions rolled back
• Missing committed transactions re-applied
Physical Redo
Bring DB up to point of crash
[Figure: Before-Image Log timeline marking the oldest active txn and the last recorded note; the redo phase is a forward scan]
Find last active cluster and back up one
*** Begin Physical Redo Phase, 4 at 0.
Apply notes based on updctr
No BI notes generated during redo
*** Physical Redo Phase Completed at block, off, upd…
*** At end of Physical Redo, txn table is 128
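A minimal Python sketch of "apply notes based on updctr", under assumptions: physical_redo and the tuple layouts are hypothetical, and each note is assumed to carry the update counter the block holds after the change. A note is applied only when the block on disk is older than the note.

```python
def physical_redo(blocks, notes):
    """blocks: dbkey -> (update counter, value) as found on disk after the crash.
    notes: forward-ordered (dbkey, update counter after the change, new value)."""
    applied = 0
    for dbkey, counter, value in notes:
        on_disk_counter, _ = blocks.get(dbkey, (0, None))
        if on_disk_counter < counter:
            # The change never reached the data file: apply it again.
            blocks[dbkey] = (counter, value)
            applied += 1
        # Otherwise the change already made it to disk and the note is skipped.
    return applied


blocks = {14528: (4770, "v4770")}        # this block was written before the crash
notes = [
    (14528, 4770, "v4770"),              # already on disk, skipped
    (14528, 4771, "v4771"),              # missing, re-applied
    (9000, 1, "new"),                    # block never written, re-applied
]
applied = physical_redo(blocks, notes)
```

Running the same scan a second time applies nothing, which is what makes it safe to restart recovery after a second crash.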
Physical Undo
Backout physical DB changes (if needed)
[Figure: Before-Image Log timeline marking the oldest active txn and the last note; redo phase forward scan followed by physical undo]
*** Begin Physical Undo 10 txns at block 128 offset 1608
Starts at crash point. Undo physical and physiological notes
Causes new BI notes to be generated
Ends when 1st transaction end encountered
*** Physical Undo Completed at 128 (block #)
Logical Undo
Backout all uncommitted transactions
[Figure: Before-Image Log timeline marking the oldest active txn and the last note; redo phase forward scan, physical undo, then the logical undo backward scan]
*** Begin Logical Undo Phase, 10 incomplete txns are being backed out.
*** Logical Undo Phase begin at Block 1136 offset 1608.
Starts where physical undo left off
Undo logical and physiological notes
*** Logical Undo Phase Completed at Block 1135 offset 7743.
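The backward scan can be sketched in Python. This is a simplified model under assumptions: logical_undo and the note layout (trid, kind, key, before, after) are hypothetical, only record-level updates are modelled, and the sketch does not write the new BI notes a real undo would generate.

```python
def logical_undo(notes, records):
    """Back out every transaction that has a begin note but no end note.
    notes: forward-ordered (trid, kind, key, before, after), kind in begin/update/end."""
    began = {trid for trid, kind, *_ in notes if kind == "begin"}
    ended = {trid for trid, kind, *_ in notes if kind == "end"}
    open_txns = began - ended
    for trid, kind, key, before, after in reversed(notes):
        if not open_txns:
            break                        # every incomplete txn reached its begin note
        if trid not in open_txns:
            continue                     # committed work is left alone
        if kind == "update":
            if before is None:
                records.pop(key, None)   # the update created the record: remove it
            else:
                records[key] = before    # restore the before image
        elif kind == "begin":
            open_txns.discard(trid)
    return records


notes = [
    (1, "begin", None, None, None),
    (1, "update", "a", 1, 2),
    (2, "begin", None, None, None),
    (2, "update", "b", None, 5),
    (1, "end", None, None, None),
    (2, "update", "a", 2, 3),            # txn 2 never ended
]
records = logical_undo(notes, {"a": 3, "b": 5})
```

Transaction 1 committed, so its value for "a" survives; both changes made by transaction 2 are backed out.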
Switches: Reliability and Integrity
▪ -I : No longer a valid parameter.
• Never had anything to do with crash recovery
▪ -R : Default - Reliable BI I/O
• Writes bypass the FS cache
• Use for OLTP
*** Before-Image File I/O (-r -R): Reliable.
*** Crash Recovery (-i): Enabled.
Switches: Reliability and Integrity
▪ -r : BI writes are buffered (un-reliable) to FS
• Well tuned system overshadows any gain of -r
• All notes recorded
• Rollback will work
• Crash recovery likely to work
• Recovery from OS crash will most likely fail
*** This session is running with the non-raw (-r) parameter.
*** Before-Image File I/O (-r -R): Not Reliable.
*** Crash Recovery (-i): Enabled.
*** An earlier -r session crashed, the database may be damaged.
Switches: Reliability and Integrity
Why provide it then?
▪ -i : Does not record purely physical notes
• BI I/O is buffered (un-reliable) to FS
• No FS sync at checkpoint
• Rollback will work.
• OS or DB crash, abnormal termination
– Must restore from backup
*** This session is being run with the no-integrity (-i) option.
*** Crash Recovery (-i): Not Enabled.
*** Before-Image File I/O (-r -R): Not Reliable.
Switches: Last Resort
▪ -F (dash Foolish)
• Enter DB without recovery
• Use as a last resort
• Integrity NOT maintained
• Usually need to
– Validate Data Integrity
– Dump and load
Summary
▪ Recovery is a complex thing
▪ You can do things to improve the process
▪ We make it simple for you
Questions?
Thank you for your time!
Other recovery related Switches
▪ -bi
▪ -biblocksize
▪ -directio
• No need for sync at checkpoint time
▪ -bwdelay
▪ -bibufs, -aibufs
▪ -bistall, -bithold
Switches: Transactions
▪ -Mf : Delayed commit
• # seconds a commit note can reside in –bibufs
• Some commits may be lost; integrity is maintained
▪ Group Commit Technique
• –groupdelay only runs w/-Mf 0
• Only in multi user mode
• # milliseconds to sleep at commit time
▪ -G : # seconds to age cluster (use & re-use)
• No longer needed with fdatasync()
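The delayed commit trade-off can be illustrated with a small Python model. Everything here is an assumption made for illustration: the class DelayedCommit, its methods, and the simplification that a value of 0 flushes at commit time are not taken from OpenEdge itself. The point is only that a commit note may sit in the BI buffers for up to the configured number of seconds, so a crash in that window loses recent commits.

```python
class DelayedCommit:
    def __init__(self, mf_seconds):
        self.mf = mf_seconds
        self.buffered = []   # (trid, time the commit note was recorded), oldest first
        self.durable = []    # transactions whose commit note is on disk

    def commit(self, trid, now):
        self.buffered.append((trid, now))
        if self.mf == 0:
            self.flush()     # assumed: no delay means the note is forced at commit

    def broker_tick(self, now):
        # Flush once the oldest buffered commit note has aged mf seconds.
        if self.buffered and now - self.buffered[0][1] >= self.mf:
            self.flush()

    def flush(self):
        self.durable.extend(trid for trid, _ in self.buffered)
        self.buffered.clear()

    def crash(self):
        # Commit notes still in memory are gone; those txns are undone at restart.
        lost = [trid for trid, _ in self.buffered]
        self.buffered.clear()
        return lost


db = DelayedCommit(mf_seconds=3)
db.commit(1, now=0)
db.commit(2, now=2)
db.broker_tick(now=3)   # txn 1 has aged 3 seconds: both notes are flushed together
db.commit(3, now=4)
lost = db.crash()       # txn 3 was committed but never reached disk
```

The database stays consistent in this model because a lost commit is simply treated as an incomplete transaction.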