
Zero Data Loss Recovery Appliance
Best Practices at Work in Your Data Center

Andrew Babb, Consulting Member of Technical Staff, MAA Development, System Technology Group
Dongwook Kim, Engineering System Team, Oracle Korea
JeongRyun Park, IT Planning Team Leader / Information Technology Office, SK Hynix

Modified Version of Oracle OpenWorld Presentation, November 23, 2015
Copyright © 2015, Oracle and/or its affiliates. All rights reserved.
Zero Data Loss Recovery Appliance Overview
Protected databases send a Delta Push to the Recovery Appliance, which offloads tape backup and replicates to a remote Recovery Appliance.
• Delta Push
  – DBs access and send only changes
  – Minimal impact on production
  – Real-time redo transport instantly protects ongoing transactions
• Protects all DBs in the data center
  – Petabytes of data
  – Oracle 10.2-12c, any platform
  – No expensive DB backup agents
• Delta Store
  – Stores validated, compressed DB changes on disk
  – Fast restores to any point in time using deltas
  – Built on Exadata scaling and resilience
  – Enterprise Manager end-to-end control
Program Agenda
1. Architecture
2. Backup Best Practices
3. Restore and Recovery
4. Recovery Appliance and Data Guard
5. Validation, Security and Troubleshooting
6. Best Practices at Work
7. SK Hynix Recovery Appliance Use Case
Architecture - Networks
• There are three networks on a Recovery Appliance (RA)
  – Management network: for RA administrators to log in when patching the appliance
  – Ingest network: for receiving backups from protected databases or restoring backups from the RA
  – Replication network: for replicating protected database backups between Recovery Appliances
• Bond the ingest and replication networks for HA
  – Options are active/passive or 802.3ad
Architecture – Tape Options
• Two options for tape backups
  – Oracle integrated
    • Preconfigured Oracle Secure Backup (OSB)
    • Direct-attach tape library via Recovery Appliance certified Fibre Channel adapters installed in each compute server
    • Can connect to any OSB-certified tape library
  – Third-party tape systems
    • Uses products such as NetBackup
    • Backups sent via 10GigE to the media manager, normally over the ingest network
    • Integration and operational support provided by the third-party vendor
Backup Manageability Best Practices
• Configure the protected database with Enterprise Manager Cloud Control
  – Simplest deployment and configuration for 11g and 12c
• Steps to back up a database
  1. Create a Protection Policy on the Recovery Appliance (RA)
  2. Add the Protected Database to the RA
  3. Configure Backup Settings for the Protected Database
  4. Schedule the "Oracle-Suggested Recovery Appliance Backup"
Backup Best Practices
• The database backup script is simple
  – $ rman target <target string> catalog <catalog string>

    backup
      device type sbt
      cumulative incremental level 1
      filesperset 1
      section size 32g
      database
    plus archivelog
      not backed up
      filesperset 32;
RMAN Best Practices for the Recovery Appliance
• Initially allocate 2 RMAN channels per database node
  – Do not over-allocate RMAN channels; it can result in worse performance
  – Take only one level 0 backup; ensure indexing of this backup has completed before taking another
    • select bp_key from rc_backup_piece where tag = '&tag' and backup_type = 'D' and virtual = 'NO';
  – Take a daily cumulative incremental level 1 backup to reduce database recovery time
• The most common bottleneck is the network or the protected database's I/O system
• Use Transparent Data Encryption (TDE) instead of RMAN encryption
• Use native database compression instead of RMAN compression
• Use change tracking file for all protected databases
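The recovery-time benefit of a daily cumulative level 1 can be sketched with simple piece counting: a cumulative incremental always rolls up every change since the level 0, so recovery applies at most two pieces, while differential incrementals pile up one per day. This is a minimal illustration of that arithmetic, not an Oracle tool; the day counts are made up.

```python
# Why the slide recommends a daily *cumulative* level 1: recovery needs only
# the level 0 plus the single most recent cumulative level 1, whereas with
# differential level 1s every incremental since the level 0 must be applied.

def pieces_to_restore(days_since_level0, cumulative=True):
    """Backup pieces applied during recovery: 1 level 0 + incrementals."""
    if days_since_level0 == 0:
        return 1                        # level 0 only
    return 2 if cumulative else 1 + days_since_level0

print(pieces_to_restore(6, cumulative=True))   # 2
print(pieces_to_restore(6, cumulative=False))  # 7
```

Archived redo generated after the last incremental still has to be applied in either case; the point is that the incremental-apply chain stays constant-length with cumulative backups.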
Backup Best Practices for Bigfile Tablespaces
• Use SECTION SIZE 32G for bigfile or large data files; this benefits level 0 backups for 11g and 12c databases
• Only 12c supports SECTION SIZE for incremental level 1 backups to parallelize across multiple RMAN channels
  – 11g RMAN commands ignore the SECTION SIZE clause if specified for incremental level 1 backups
  – backup device type sbt cumulative incremental level 1 filesperset 1 section size 32g database plus archivelog not backed up filesperset 32;
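The parallelism SECTION SIZE buys can be estimated with back-of-the-envelope math: each datafile is split into independently processed sections, which RMAN then spreads across the allocated channels. The file size and channel count below are hypothetical examples, not values from the slide.

```python
# Sketch (not an Oracle tool): how SECTION SIZE breaks one large datafile
# into sections that RMAN can back up or restore in parallel across channels.
import math

def section_count(file_size_gb, section_size_gb=32):
    """Number of backup sections RMAN creates for one datafile."""
    return math.ceil(file_size_gb / section_size_gb)

def sections_per_channel(file_size_gb, channels, section_size_gb=32):
    """Upper bound on sections any single channel must process."""
    return math.ceil(section_count(file_size_gb, section_size_gb) / channels)

# A hypothetical 2 TB bigfile with SECTION SIZE 32G yields 64 sections; with
# 4 RMAN channels each channel handles at most 16 of them.
print(section_count(2048))            # 64
print(sections_per_channel(2048, 4))  # 16
```

Without SECTION SIZE a bigfile tablespace is a single work unit, so one channel carries the whole file regardless of how many channels are allocated.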
Restore and Recovery Best Practices
How long can your production database be down? What are your RTO requirements?
• When there is no validated disaster recovery plan, this might happen to you
  – Database failure occurs; the database must be restored from backup
    • Backup was not available on disk
    • Backup restored from tape
    • Found some tapes had been expired by mistake; took days to re-scan and re-catalog the pieces
    • Tape library had issues; moved tapes to another library that had only 1GigE connectivity
    • Tape restores were failing after many hours; an init.ora parameter was wrong
  – Database was eventually handed back to customers, but after 8 days and with significant data loss (RPO)
Restore and Recovery Best Practices
• Have a disaster recovery plan and rehearse it
• Use RMAN RESTORE DATABASE / RECOVER DATABASE as you would today
  – No new RMAN commands to learn
  – RMAN is aware of the virtual backups and will make the best decision for you
  – Can restore directly from tape or an RA replica without staging data on the local RA
• Performance considerations
  – Think about parallelism and the impact on the protected database servers receiving the restore
    • Too many channels might impact other databases in a consolidated environment
  – The restore will use the ingest network of the RA
Restore and Recovery Best Practices
• Bigfile tablespace practices and considerations
  – 11g and 12c support SECTION SIZE for restore of level 0 backups
  – Only 12c supports SECTION SIZE for restoring a virtual backup from the Recovery Appliance
    • Without SECTION SIZE support, parallel restore of bigfile data files and tablespaces is not possible
Recovery Appliance and Data Guard
• Follow all MAA recommendations
  – One Recovery Appliance per data center
  – Back up primary and standby databases to the local RA
  – No Recovery Appliance replication is needed if a standby already exists in the targeted data center
  – A restore operation can use any RA in any location
Recovery Appliance and Data Guard
MAA White Paper
• Post Data Guard role transition
  – No change in backup operations; continue to back up both the primary and standby databases to the local RA
• Deploying the Zero Data Loss Recovery Appliance in a Data Guard Configuration
  – Refer to http://www.oracle.com/technetwork/database/availability/recoveryappliance-data-guard-2767512.pdf
Validation, Security and Troubleshooting
Top problems faced in the field
• RTO or RPO SLAs not met
  – Bad tapes
  – Corruption in backups
  – Missing pieces (archive logs, data files or control files)
  – No automation or end-to-end understanding of the restore and recover process
• Problem avoidance:
  – Weekly RMAN crosschecks
  – Weekly or monthly RMAN backup or restore validate
  – Monthly or quarterly end-to-end restore and recovery validation testing and automation
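The problem-avoidance cadence above is easy to lose track of across many databases, so a small scheduling aid helps. This is an illustrative sketch only; the exact intervals (7/30/90 days) are assumptions picked from the slide's "weekly or monthly" and "monthly or quarterly" ranges.

```python
# Illustrative scheduler for the validation cadence: weekly RMAN crosschecks,
# periodic backup/restore validate, and end-to-end restore testing.
import datetime

CADENCE_DAYS = {
    "rman crosscheck": 7,           # weekly
    "restore validate": 30,         # weekly or monthly; monthly assumed here
    "end-to-end restore test": 90,  # monthly or quarterly; quarterly assumed
}

def tasks_due(last_run, today):
    """last_run maps task name -> date it last ran; returns the tasks whose
    cadence interval has elapsed by 'today', alphabetically sorted."""
    return sorted(
        task for task, days in CADENCE_DAYS.items()
        if (today - last_run[task]).days >= days
    )

last = {
    "rman crosscheck": datetime.date(2015, 11, 1),
    "restore validate": datetime.date(2015, 11, 1),
    "end-to-end restore test": datetime.date(2015, 8, 1),
}
print(tasks_due(last, datetime.date(2015, 11, 10)))
# ['end-to-end restore test', 'rman crosscheck']
```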
Validation, Security and Troubleshooting
How Recovery Appliance addresses the issues
• Ingesting backups
  – Data blocks are validated as they are read from the source database and sent to the appliance
• Indexing backups
  – Blocks received are validated and compressed as they are written to the delta store
• Ongoing validation
  – All backup sets are crosschecked daily
  – All data file blocks are optimized weekly (meaning all blocks are read weekly)
  – All backup sets are validated (think RESTORE VALIDATE) bi-weekly
  – These intervals may be modified via a configuration parameter
Validation, Security and Troubleshooting
How Recovery Appliance addresses the issues (continued)
• Built on Exadata
  – Benefits from Exadata disk scrubbing and Exadata checksum checks
    • All checks run on the Recovery Appliance, offloading this work from the protected databases
• A recovery plan still needs to be tested
  – The appliance does not remove the need for periodic end-to-end restore and recovery testing to prepare the operations team and catch issues outside the RA
Validation, Security and Troubleshooting
Customers requiring end-to-end security
• Client to Recovery Appliance, or Recovery Appliance to client
  – Database TDE is recommended
  – HTTPS, SQL*Net encryption, and wallet/certificate integration are underway
  – MAA paper out soon
• Security in the Recovery Appliance
  – Recovery Appliance administrator responsibilities
    • Create a Virtual Private Catalog (VPC) user
    • Assign protected databases to a specific VPC user
    • A protected database administrator can see all databases that share a common VPC user
Validation, Security and Troubleshooting
Troubleshooting Note
• For performance-related issues on the Recovery Appliance, refer to
  – Recovery Appliance Performance Issues Data Gathering Document (Doc ID 2066528.1)
• For network performance-related issues between protected databases and the Recovery Appliance, refer to
  – Recovery Appliance Network Test Throughput script (Doc ID 2022086.1)
TEST CASE #1 – Complete Level 0 (Full Backup) within 24 Hours
• Execute a level 0 backup on 200 protected databases while monitoring throughput and Recovery Appliance virtual full creation (backup indexing) activity
• Passed: completed in 6 hours and 17 minutes, 4X faster than the requirement. Backup rate was 14.7 TB/hour (4.2 GB/sec)
• Exceeded customer expectations; no Recovery Appliance tuning required
• Hardware: Recovery Appliance X5 Full Rack

Timeline (from the original chart):

  Task           Start            End              Duration (hr)
  Backup - L0    5/22/15 16:38    5/22/15 22:54    6.27
  Indexing - L0  5/22/15 17:30    5/23/15 0:17     6.79
Test Case #2 – Copy 200 Database Backups to Tape in 7 Days
• After the level 0 backups, a workload generation script was executed to induce ~12% random block changes in each database
• Upon completion, level 1 (incremental) backups were taken on all databases
• Copy-to-tape job templates on the Recovery Appliance were created, scheduled, and executed
• A total of 6,960 tape backup tasks executed, averaging 258 tasks/hour with throughput of 125 MB/sec per tape drive
• Passed: backup to tape completed in 2 days and 3 hours, 3X faster than the requirement, maximizing the 4 tape drives (tape throughput is expected to increase with additional drives)
• Exceeded customer expectations; no Recovery Appliance tuning required
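The per-drive figure above can be turned into an aggregate rate with simple arithmetic: 4 drives at 125 MB/sec is about 1.8 TB/hour to tape. The helper below is a back-of-the-envelope sketch of that conversion (1 TB taken as 10^6 MB); the 8-drive example only illustrates the slide's note that throughput scales with additional drives.

```python
# Back-of-the-envelope check of the tape numbers on this slide.

def tape_tb_per_hour(drives, mb_per_sec_per_drive):
    """Aggregate tape throughput in TB/hour (1 TB = 1e6 MB here)."""
    mb_per_hour = drives * mb_per_sec_per_drive * 3600
    return mb_per_hour / 1_000_000

print(round(tape_tb_per_hour(4, 125), 2))   # 1.8
print(round(tape_tb_per_hour(8, 125), 2))   # 3.6
```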
TEST CASE #3 – Restore 2 Databases with Concurrent L1 Backups
• Restore 2 databases while level 1 backups are executing on 198 databases, with redo transport enabled on 159 databases
• Level 1 backups must complete in 8 hours and restore operations must complete in 8 hours
• Passed: all incremental backups completed in 2 hours, 4X faster than the requirement. Both databases were restored in 2 hours, 4X faster than the requirement, with a restore rate of 225 GB/hour. Note that restores in the absence of a concurrent backup workload could maximize ingest network bandwidth to achieve 12-14 TB/hour
• Exceeded customer expectations. The customer was more concerned about meeting backup windows and reasonable restore windows than achieving peak restore rates. No Recovery Appliance tuning was required
SK Hynix Recovery Appliance Use Case
2015-10-26
Customer Profile
Company & Customer Profile
SK hynix is a global leader in semiconductors, such as DRAM and NAND flash, and System ICs, including CMOS image sensors. Since pilot production of Korea's first 16Kb SRAM in 1984, SK hynix has consistently led the industry with smaller, faster and lower-power semiconductors. As the second-largest manufacturer of memory semiconductors, SK hynix is at the forefront of the IT industry.

Customer Profile
 Name: JeongRyun Park
 Company: SK hynix
 Role in Organization: IT Planning Team Leader / Information Technology Office

About Oracle in Our Organization
 A total of 40 Exadata units across 4 business areas in 5 FABs
 Hi-Tech Manufacturing Main System
 MES and Manufacturing-Related Systems (Automation)
 HR, ERP, DW and Non-MES Systems (Information)
Background of DR Systems
BCP (Business Continuity Plan) introduction across the entire IT system
• Sep 2013, SK hynix China Wuxi factory fire
  – Problems in production for 3 months
• Necessity of DR systems
  – Manufacturing automation systems: campus DR
  – Information systems: remote DR

From "SK Hynix Inc. Reports Fiscal Year 2013 and Fourth Quarter Results": consolidated fourth quarter revenue was 3.4 trillion won, an 18% decrease from 4.1 trillion won in the previous quarter, due to a decrease in production at the Wuxi fab affected by the fire and the appreciation of the Korean won.
Benefit of Recovery Appliance
SK hynix ZDLRA effect: quantitative measurement
• Recovery Appliance reduced the backup window by 12X
• Recovery Appliance reduced the recovery window by 3X
• Recovery Appliance saved space by 6X
Customer Profile
Characteristics of Hi-Tech MES & Related Systems
• BigData & VLDB: avg. DB size 30~50TB; daily data increment 3~5TB; structured/unstructured business data and sensor data; DW / analysis queries
• Extreme Transaction Processing: 4,000~6,500 sessions; 2,000~3,500 tx/sec; redo generation 80~100 MB/sec; no delay on extreme transactions
• Real-Time: manufacturing control; process control; sensor data collection, real-time analysis, monitoring and order process; no deferred processing
• Zero-Downtime: 24 * 7 * 365, no downtime; powered by Exadata MAA
Flexible management of big data without impacting real-time transactions
EXADATA Systems
Exadata has serviced more than 40 systems in FDC and QA since 2012
• EQP
  – EQP Control Data: line/chamber auto control; MVIN/OUT, remote control; LOT auto reservation
  – Line Interlock: Recipe/Reticle/DCOP/FDC
  – DCOL: production/measuring data (rSPC)
• MES DB (8TB, behind MES AP and EDB/RPT): production history, line history, DCOL
• EAP (5TB): overall production history
• FDC DB (19 units, 80TB, with FDC AP, FDC Interlock and ADG): FDC/SPC data – trace data per area, line tool events per area, specific line sensor data
• FSA / TAS (14 units, 70TB): FDC/SPC data
• FDC main function and practice: real-time data collection, fault detection & response automation, classification function
• FSA / TAS main function and practice: FDC and SPC data collection, QA, reporting and mining
EXADATA Legacy Backup and Issues
Needs fast backup of big data without impacting real-time transactions
AS-WAS backup configuration: each Exadata (e.g. a 45TB system) had been backed up over the network to a legacy backup system
Main issues
• Impact on real-time transactions (FDC system)
  – Backup consumes CPU on the database servers and I/O on the storage servers
  – Exadata resource management (DBRM and IORM) limits backup resource usage to under 40%
• Backup window and performance impact increasing
  – RMAN backup: image copy with incremental update
  – Full backup time on the initial Half Rack was 7 hours → after expansion to a Full Rack, backup took 13 hours
  – 3 hours for an incremental update backup
• Incremental backup validation issue
  – Needs periodic backup validation
  – Validate backup using a snapshot clone
• Backup management and monitoring overhead
  – Manage RMAN scripts for 40 systems
  – Manage scheduling and monitoring for 40 systems
Benefits of Recovery Appliance
EXADATA DR system architecture: the Primary Center runs the FDC and LFDC databases (M14FDC on an Exadata X5-2 Half Rack, 7 EF + 4 HC cells, HIGH redundancy; M14LFDC on an Exadata X5-2 Half Rack, 7 EF cells, HIGH redundancy). The DR Center (M14 DR, Exadata X5-2 Half Rack) hosts Active Data Guard standbys with Flashback enabled and a ZDLRA that receives real-time redo and incremental-forever backups, with restore & bring-up at the DR site.
• Eliminate impact on real-time transactions
  – Maximum Availability Architecture
  – Backup from the standby database to minimize impact
  – Enable Flashback on the standby database and enable MAA parameters for comprehensive data protection
• Simple, low overhead and consistent performance
  – Only changed blocks are backed up → steady backup performance
  – Multiple databases are covered by one ZDLRA
  – Guaranteed restore rate
• Incremental-forever strategy and validation
  – Backup window is consistently very small
  – Backup validation is performed periodically
• Management benefits
  – Easy backup environment configuration using Enterprise Manager
  – Archivelog auto backup
  – No backup configuration needed after failover
  – Intelligent backup space estimation
• Can reduce RTO to under 1 hour!
Recovery Scenario (RTO < 1 Hour)
Recovery options per incident type, showing actual recovery time and priority (1 = preferred):

  Target           Incident Type                 ZDLRA           DR Flashback    Failover
  SPFILE           Loss                          < 10 min (1)    -               (2)
  CONTROLFILE      Loss                          < 10 min (1)    -               (2)
  BLOCK            Corruption                    < 10 min (1)    -               (2)
  REDO LOG         Current redo loss             -               -               (1)
  REDO LOG         Active/inactive redo loss     < 10 min (1)    -               -
  TABLE/PARTITION  Table loss                    < 30 min (1)    < 15 min (2)    -
  TABLE/PARTITION  Partition loss                < 30 min (1)    < 15 min (2)    -
  DATAFILE         Specific datafile loss        < 15 min (1)    -               (2)
  TABLESPACE       Specific tablespace loss      < 30 min (1)    -               (2)
  TABLESPACE       System tablespace loss        < 30 min (1)    -               (2)
  TABLESPACE       Undo tablespace loss          < 30 min (1)    -               (2)
  DATABASE         Fresh DB creation + TPITR 1)  < 60 min (1)    -               (2)
  SITE FAILURE     Site failure                  (2)             (3)             < 30 min (1)

1) Of the 40 TB, the most recent 7 days of data should be recovered first for service open; the rest of the data is recovered after service open.