the Document

SAP HANA – High Availability
FAQ
This document answers some frequently asked Questions regarding SAP HANA’s High
Availability support for Fault and Disaster Recovery. For more information, see the
references at the end of this document.
Updated for SPS11
_________________________
[email protected]
[email protected]
SAP HANA Team
V4.0, January 2016
Legal Disclaimer
THIS DOCUMENT IS PROVIDED FOR INFORMATION PURPOSES ONLY AND DOES NOT MODIFY THE TERMS OF ANY AGREEMENT. THE CONENT OF
THIS DOCUMENT IS SUBJECT TO CHANGE AND NO THIRD PARTY MAY LAY LEGAL CLAIM TO THE CONTENT OF THIS DOCUMENT. IT IS CLASSIFIED AS
“CUSTOMER” AND MAY ONLY BE SHARED WITH A THIRD PARTY IN VIEW OF AN ALREADY EXISTING OR FUTURE BUSINESS CONNECTION WITH SAP.
IF THERE IS NO SUCH BUSINESS CONNECTION IN PLACE OR INTENDED AND YOU HAVE RECEIVED THIS DOCUMENT, WE STRONGLY REQUEST THAT
YOU KEEP THE CONTENTS CONFIDENTIAL AND DELETE AND DESTROY ANY ELECTRONIC OR PAPER COPIES OF THIS DOCUMENT. THIS DOCUMENT
SHALL NOT BE FORWARDED TO ANY OTHER PARTY THAN THE ORIGINALLY PROJECTED ADDRESSEE.
This document outlines our general product direction and should not be relied on in making a purchase decision. This document is not subject to your license
agreement or any other agreement with SAP. SAP has no obligation to pursue any course of business outlined in this presentation or to develop or release any
functionality mentioned in this document. This document and SAP's strategy and possible future developments are subject to change and may be changed by SAP at
any time for any reason without notice. This document is provided without a warranty of any kind, either express or implied, including but not limited to, the implied
warranties of merchantability, fitness for a particular purpose, or non-infringement. SAP assumes no responsibility for errors or omissions in this document and shall
have no liability for damages of any kind that may result from the use of these materials, except if such damages were caused by SAP intentionally or grossly
negligent.
© Copyright 2016 SAP SE. All rights reserved.
What is High Availability?
High Availability is a set of techniques, engineering practices and design principles for Business
Continuity. This is achieved by eliminating single points of failure, and providing the ability to rapidly
resume operations after a system outage with minimal business loss. Fault Recovery is the process of
recovering and resuming operations after an outage due to a fault. Disaster Recovery deals with
recovering operations after an outage due to a prolonged datacenter or site failure.
SAP HANA provides a range of fault and disaster recovery solutions to achieve High Availability.
What is "Enterprise Ready" High Availability?
Enterprises typically have a high level of expectation, regarding business continuity, expressed in the form
of key measures around the recovery from faults. One such key measure is RPO: the maximal permissible
period of time during which operational data may be lost. Enterprises often design their systems for an RPO
of zero; no data should ever be lost. A second common measure is RTO, the maximal permissible time it
takes to recover the system at full performance. Enterprise systems are typically designed for very low RTO
values, ranging from near-zero to at most a few hours for the most critical business processes.
What do I need to do to ensure business continuity?
High Availability consists, broadly speaking, of four stages:




The planning phase
Prevention and Preparation
Failure Detection and Recovery
Fix failure and Failback
During the planning phase, business needs are evaluated against the cost and complexity of the different
solutions. Prevention and preparation includes the selection, installation and configuration of redundant
components, standby systems and replication. Failure detection involves the integration within existing
network/system monitoring and management systems that provide fault indicators via alarms and other
messages. Based on the selected high-availability solution, recovery from failures may be automatic or
manual. Failback is the process of restoring the redundancy/replication after recovery, to be prepared for a
next failure event.
What downtime scenarios must be considered?
The SAP HANA High Availability solutions address all major faults, such as OS crash, software errors,
operator error, data Corruption, disk crash, component failures (e.g. fan breakdown or PSU burnout), host
crash (general hardware fault), intermittent memory errors, power outage, cooling failure, network faults,
severed cables. In addition, SAP HANA provides recovery solutions in case of entire data center failures (aka
disasters). These solutions are also used to manage planned downtimes and upgrades.
What is k-safety?
K-safety is a term used by vendors of databases without reliable two-phase commits, to describe the
number of nodes that can fail before loss of data occurs. SAP HANA provides reliability via redo-log saves
(thus, the problem of data-loss due to local failure never exists), and additionally supports full system
replication to a secondary system.
Does SAP HANA support clustering?
Other vendors sometimes support a concept they call clustering. Clusters are groups of symmetric hosts
that are synchronized on the basis of a single shared disk image of the data. SAP HANA avoids this I/O
performance bottleneck by instead employing in-memory data distribution over multiple collaborating hosts.
This avoids unwanted data sharing and the associated locks, resulting in enhanced scaling and efficiency.
More details can be found in [6] and [7].
© 2016 SAP SE
page 2/9
What solutions does SAP offer to ensure business continuity of SAP HANA?
The following table summarizes the main SAP HANA High Availability solutions. See [1] for more details.
Solution
Supported Downtime Scenarios
System Replication
Business continuity for planned downtime, software fault, host crash or disaster
Storage Replication
Persistence
Business resumption after disaster; recovery from local storage corruption
Restart from software fault, power outage or host crash
Backups
Recovery from storage or data corruption, host crash, or disaster
Host Auto-Failover
Automatic recovery from software fault or host crash
Service Auto-Restart
Automatic recovery from software fault
HANA is an in-memory database. What happens when the power goes down?
Indeed, SAP HANA uses in-memory technology, but of course, it fully persists any transaction that changes
the data, such as row insertions, deletions and updates. Thus, the database can fully resume from an
unexpected power-outage without any loss of data, by rolling back incomplete transactions, and recovering
fully completed transactions.
How does HANA recover from a server (host) hardware failure?
SAP offers several complementary solutions, which involve the use of redundant or standby equipment. The
main solutions are:

Host Auto-failover: this solution involves adding one or more passive standby hosts to a SAP
HANA system. When one of the active hosts fails, due to a hardware failure for instance, the
standby host automatically takes over the role of the failed host.

System Replication: this solution involves the installation of a fully redundant standby system
and continuous replication of data changes to the standby system. When the primary system fails,
e.g. due to a host-crash, the secondary standby system takes over.
These solutions can be combined, and further combined with traditional backups, which provide additional
support for point-in-time recovery in case of data corruption.
Does HANA support Disaster Recovery?
Yes. In addition to recovery from traditional backups, SAP HANA provides two solutions that can be used for
disaster recovery:

Storage Replication: partner-delivered solution is a storage-level replication solution, which
delivers a backup of the volumes or file-system to a remote, networked storage system. Upon
failure of the primary system, a secondary can take over remotely using the secondary storage.

System Replication: this solution involves the installation of a fully redundant standby system
and continuous replication of data changes to the standby system. When the primary system fails,
e.g. due to a full data-center outage, the secondary standby system takes over immediately.
What is the difference between a DB restart and a replicated system takeover?
A restart requires most tables to be reloaded to memory, before the database can effectively respond to
queries. During a takeover, since tables are already loaded into the memory of the secondary system, only
a small delta must be loaded, resulting in a very short RTO.
How does SAP HANA support point-in-time recovery?
SAP HANA provides complete data backups and data snapshots as well as incremental log backups.
Combining these backups together with the redo log entries, the database can be recovered to a precise
point in time. Additionally SAP HANA supports "delta backups" where differential backups are based on the
last full data backup and incremental backups are based on the last full or delta backup.
More details can be found in [1].
© 2016 SAP SE
page 3/9
Can an interrupted recovery be resumed?
Recovering a very large database can be time consuming. Having to repeat the whole recovery process
because for example the connection to the recovery tool was temporarily lost is nobody's definition of fun.
As of SPS 11, it is possible to resume a SAP HANA database recovery.
Details can be found in [2].
Are SAP HANA backups compressed and encrypted?
Yes, SAP HANA backups represent the in-memory data image and therefore are compressed.
With “Backint for SAP HANA” we offer a 3rd party backup tool API, for instance to stream directly to external
backup devices, or to support backup encryption, for example with NetBackup from Symantec. Database
administrators also often use storage-level encryption.
Can a data backup be created in parallel using multiple threads?
As of SPS11 you can configure multiple channels to write the backup data for each service with third-party
backup tools. This will make it possible to considerably speed up the backup time by distributing backup
data in parallel to multiple devices.
For further details please read [2].
Can I use secondary hosts for development/testing?
Yes, if you fully understand the implications of dual usage, then you may use secondary hosts in a system
replication environment for development or testing, on condition that these operations do not interfere with
the regular replication or failover functionality.
In the case of storage-replication, the secondary standby hosts can be used for any other purposes, until
they are required to assume their active SAP HANA roles. Standby hosts must have enough memory for the
intended dual usage. This is particularly true in the case of full system replication, where the memory of the
active standby hosts reflects the corresponding primary hosts. Of course, parallel activities must be shut
down before a potential failover to the standby system.
See also "Using Secondary Servers for Non-Production systems" in [2] or check out [8].
What is the advantage of System Replication without Data Preload?
System Replication without data preload provides an economic option for organizations that require failure
or disaster recover without data-loss, but prefer not to dedicate an identical secondary system. Without data
preload, the secondary system uses a very small amount of memory, and can therefore be used for
additional purposes, such as development, QA or staging, while standing by. The tradeoff, of course, is that
upon take-over, there will be a period of downtime, during which the secondary system is loaded.
See also "Using Secondary Servers for Non-Production systems" in [2] or check out [8].
How about Planned Downtime (near zero-downtime upgrades)?
The System Replication solution can be used to manage planned downtime scenarios with minimal
interruption to business processes. Simply perform the required maintenance on the secondary first, failover to the secondary. Then, perform the maintenance on the primary and - if required - fail back.
See also "Using System Replication for Near Zero Downtime Upgrades" in [2].
What are typical recovery timelines?
This depends on the solution. Fundamentally, database recovery consists of the following aspects: fault
detection, getting the recovery hardware up and running, restarting the database to accept queries, and, to
reach full performance, loading most of the tables back into memory.
SAP HANA's Service Auto-Restart and Host Auto-Failover solutions automatically detect faults in less than a
minute. For other situations, e.g. full power outage, fault detection will be part of the data center operating
procedures.
Using on-line standby hosts avoids the extra effort and time involved in acquiring, connecting, and
rebooting off-line systems, which may take considerable time. Once the hardware is up and running, loading
© 2016 SAP SE
page 4/9
the HANA database is a matter of minutes in the absence of large pre-loading tables, such as row-store
tables. Pre-loading tables may take several minutes or more, depending on the size of these tables.
Once preloading is completed, the HANA database is ready to accept queries, and non-preloading tables (by
default, column-store tables) will load in parallel.
System Replication with data pre-load keeps all columnar tables in memory in the secondary system. Thus,
in addition to a very short recovery time (typically under two minutes for columnar tables), this solution also
provides immediate full performance upon recovery. Note: replicated system recovery took more time in
earlier revisions of SAP HANA if it contains large row tables, which needed to be loaded upon failover; in
earlier revisions this was a SAP HANA limitation, which could be mitigated by converting tables to columnar
structure. Currently the row store tables are kept in shared memory and are immediately available during
takeover. Additionally whith using the “continous log replay” operation mode (available since SPS11), the
redo log is continously replayed on the secondary site improving takeover performance to less than a
minute.
Restoring from a backup proceeds typically at a rate of about 400-500 GB per hour, however this depends
greatly on the storage attributes.
Which HA/DR solution to be used when?
As a first level of defense, employ hardware redundancy where possible. In particular, this means following
the hardware vendors' guidelines that include a well-mirrored (RAID-enabled) storage system with good
auto-detection and recovery from storage failures.
On top of this, SAP HANA System Replication provides the best protection against a variety of failures and
downtime cases, and by far the best recovery solution, when considering recovery time, and disaster
scenarios. With the operation mode “logreplay” introduced with SPS11 the secondary site is considered a
“HotStandby” system. It is therefore the recommended solution for highly critical business systems.
For very large systems, this solution can be further augmented with Host-Auto Failover, using additional
standby hosts at both the primary and secondary site, to ensure that both sites remain operational, even in
the presence of a local host crash. Furthermore, regular backups should be enabled, to further protect
against data-corruptions or operator errors.
However, if a fully redundant, N+N system standby solution with disaster recovery abilities cannot be
justified due to implementation cost or complexity, a simpler Host Auto-Failover only solution could be
implemented, typically in the form of an N+1 configuration on a single site, augmented by a reliable
backup-shipping method to support disaster recovery.
Realistically, the High Availability solution will be further influenced by considerations like timelines, budgets
and customer paradigm-preferences, which should be discussed in more detail with professional
consultants.
Host Auto-Failover: When would I use more than 1 standby host?
Adding more than one standby host in a Host Auto-Failover configuration increases the ability to handle
multiple faults. For instance, if the chance of a host failing is 1 %, the chance of two active hosts
independently going out of order at the same time would be approximately 1 % x 1 % = 0.01 %. Thus, a
decision to deploy more than one standby host is fundamentally driven by what-if availability requirements.
Host Auto-failover: How does HANA avoid "split brain" problems?
One classical problem in auto-failover is a situation called "split brain": rather than a real failure, two hosts
are separated due to a network disconnection. If each host thinks it is in charge, this may lead to data
corruption.
The HANA Host Auto-Failover solution avoids a "split-brain" condition, through the use of exclusive storage
locking. When a host is determined to be lost (either due to a crash, or loss of communication), one of the
standby hosts will attempt to execute a failover, by attempting to take over the storage lock. If this proves
unsuccessful (due to the existing host being alive), the failover will abort.
System Replication: How does distance affect the latency of transactions?
In case of synchronous replication, transactions are delayed by the time of a round-trip network message
from the primary to the secondary site. The longer the distance, the longer the delay. Up to 100 km, delays
© 2016 SAP SE
page 5/9
can be limited to about 5 μs on optical fiber. Over longer distances it is recommended to use asynchronous
replication, which do not add latency, at a slight data-loss risk (data lost in transit) upon failure.
System Replication: What data are transported over the communication channel?
Initially, when setting up system replication, the full set of data contained in the data volumes is sent to the
secondary site. When system replication is actively running, additionally the redo log buffers generated on
the primary site for every write transaction are immediately and continously sent to the secondary site. In
synchronous replication an acknowledgement is sent back to the primary upon arrival of the redo log buffer;
in asynchronous replication this is not done.
If system replication is running in the “classic” operation mode with delta data shipping (as exists since
SPS05) from time to time (per default every 10 minutes) the increment of the data, i.e. data that changed
since the last full or last delta data shipment, is transported from the data area of the primary to the data
area of the secondary. With the continuous log replay operation mode on the secondary site these
periodically executed delta data shippings are not necessary anymore.
For more details please have a look into [1] and [2].
System Replication: Can I use a Secondary System for read-only queries?
Not yet, but this capability, called “active/active”, is on the SAP HANA roadmap.
What is Multitier System Replication?
From SPS07 you can use System Replication to set up a local secondary system, that acts as the primary for
a third (remote) site. One common use is to synchronously replicate to the local secondary system, which
asynchronously replicates to the remote system.
System Replication: What does “Hot Standby” mean?
In system replication this term refers to a highly available secondary site prepared to take over operation
very fast. With the “continous log replay” feature of SAP HANA – available since SPS11 – the secondary site
functions as a hot standby system. While the replication is still active – prior to a takeover – the continously
shipped redo log buffers are replayed on the secondary site. In case of a takeover, the database is
operational in a very short time, because the previously necessary takeover step of replaying the log
disappears.
For more details on the operation mode “logreplay” please have a look at [2] and [8].
Is high availability and disaster recovery possible for SAP HANA multitenant
database containers?
Scale-out and Host Auto-Failover configurations can be used with SAP HANA multitenant database
containers. The tenant databases can be distributed over the host participating in this cluster.
SAP HANA multitenant database containers can be configured in a system replication. The system database
as well as all active tenant databases are replicating their data. The takeover to the seconcary site is done
for the whole system, not on a tenant database level.
For further details please have a look at [2] – section: SAP HANA System Replication with Multititenant
Database Containers.
Do SAP high availability solutions support Dynamic Tiering (extended storage)?
Host Auto-Failover clusters can be setup including this SAP HANA feature of extended storage for data that
need not to reside in memory the whole time. In addition to the extended storage host, a separate standby
host must be available and contained in the same failover group. In case of host auto-failover, you can
initiate the failover of the extended storage host manually.
For more information please have a look at [10].
How to configure clients to be failover-safe?
Clients are typically configured to connect to the SAP HANA system via its domain name. A failover to
another server poses a problem – how do clients re-connect upon a failover?
© 2016 SAP SE
page 6/9
There are several possible approaches to configure HA solutions to allow clients to recover after a failover.
One approach involves client configuration for multiple-connections (supported by the HANA JDBC/ODBC,
and (for instance) reverse-proxy for HTTP-based clients. But addressing the problem at this level is
undesirable, as it is highly client-dependent, and the solution is not uniform. Additionally, it can lead to
unintended results due to "split-brain" scenarios.
Thus, solutions at the network layer are preferred. One approach is to use the IP to domain-name binding,
or DNS. Upon failover, the DNS server is configured to return the IP address of the standby system. This
solution has the advantage that there are no client-specific configurations. Further, it supports DR
confgurations, where the primary and standby systems may be in two different networks (separated by
routers). On the other hand, modifying DNS mappings requires a vendor-specific solution, and secondly,
due to DNS caching in network elements (both clients and intermediate routers), it can take a long time for
DNS changes to propagate.
The alternate approach is at the IP layer, or the binding between IP addresses and MAC addresses, using IP
redirection or Virtual IP (VIP) redirection. Provided that the primary and standby system share the same L2
subnet, this is the most comprehensive and recommended solution.
Read about “HA/DR Provider” in [2] and check out the “client reconnect” chapter in [9].
What happens to "queries in flight" during a failure?
During a failure, the connection to clients is lost. Incomplete in-progress transactions will be rolled back.
This is as it should be. Once the failover is complete, clients reconnect to the newly active system, which
resumes the handling of queries and transactions.
SAP HANA system replication: more details
How does system replication start up?
After setting up system replication, the primary sends its topology information and immediately starts log
shipping to the secondary. It also prepares a full data set, based on snapshots, and transmits it.
Until SPS10 from time to time (default every 10 minutes) the data that has changed since the last data
transport had to be transmitted to the secondary. Since SPS11, whith the newly introduce operation mode
“logreplay” this delta data shippings are not necessary anymore.
What happens, if the connection between primary and secondary is lost?
The primary stops shipping of the redo log buffers, if the secondary does not respond within a configurable
timeout (default 30 sec). The secondary will repeatedly try to reconnect. Once the connection is reestablished, the missing redo logs are automatically shipped.
Can the secondary system perform the takeover automatically?
SAP HANA does not provide automatic failure detection and takeover with System Replication. However,
hardware vendors may provide Cluster Management software which is able to do that. SLES for SAP's
cluster solution may also be part of an integrated solution.
What happens to a BW transaction during a takeover?
For planned downtimes, like upgragdes, the “Connectivity Suspend” feature of the SAP Application server
supports real zero downtime maintenance. Long running transactions are "suspended" and resumed after
the takeover. For more information on the “Connectivity Suspend” feature please refer to SAP note [4].
Additional details regarding connections from database and web clients are described in the SAP HANA
Master Guide [5].
Is system replication secure?
Page-level encryption can be activated separately on the primary and secondary system. SPS07 also
supports SSL-encrypted data transfer between the primary and secondary system. Further details are
contained here [12] – section: Secure Internal Communication Between Sites in System Replication
Scenarios.
What happens during takeover?
Until SPS08 the takeover included these steps:
1. Open data persistence based on last savepoint.
© 2016 SAP SE
page 7/9
2.
3.
4.
Load row store tables to memory
Replay the redo logs since last shipped delta data shippment
Rebuild row store indexes
Since SPS09 the row store tables are kept in shared memory; since SPS10 rebuilding of the secondary
indexes will be decoupled from influencing the takeover time. Steps (2) and (3) do not influence the
takeover time anymore.
With SPS11 with the “continously log replay” feature (also called “HotStandby”) it is possible to replay the
redo logs on the secondary site, while the replication is running; so that step (3) will also become obsolete
during takeover – reducing the RTO to a minimum.
© 2016 SAP SE
page 8/9
Further Reading
[1] Introduction to High Availability for SAP HANA: https://scn.sap.com/docs/DOC-65585
[2] SAP HANA Administration Guide http://help.sap.com/hana/SAP_HANA_Administration_Guide_en.pdf
[3] SAP Note 1730932 – Using backup tools with Backint for HANA
[4] SAP Note 1913302 – Suspend DB connections for short maintenance tasks
[5] SAP HANA Master Guide: http://help.sap.com/hana/SAP_HANA_Master_Guide_en.pdf
[6] Scalability - SAP HANA: http://www.saphana.com/docs/DOC-2022.
[7] SAP HANA Host Auto-Failover: http://scn.sap.com/docs/DOC-62494.
[8] HowTo SAP HANA system replication: https://scn.sap.com/docs/DOC-63221.
[9] SAP HANA Network Requiremtens: https://scn.sap.com/docs/DOC-63221.
[10] SAP HANA Dynamic Tiering Adminstration Guide: http://help.sap.com/Download/Multimedia/ziphana_options_dt/Dynamic_Tiering_Option_Administration_Guide_en.pdf
[11] SAP Note 1999880 - FAQ: SAP HANA System Replication
[12] SAP HANA Security Guide: http://help.sap.com/hana/SAP_HANA_Security_Guide_en.pdf
Find more information about SAP HANA on http://help.sap.com/hana_platform/.
© 2016 SAP SE
page 9/9