HYDRAstor: a Scalable Secondary Storage

HYDRAstor:
a Scalable Secondary Storage
7th TF-Storage Meeting
September 9th 2010
Łukasz Heldt
Largest Japanese IT company
$43 Billion in annual revenue
143,000 staff
www.nec.com
Owns
& sells
Polish R&D company
50 engineers and scientists
www.9livesdata.com
R&D of critical
backend component
Scalable disk based storage for backup with global deduplication
Started in 2003 in NEC Labs by Cezary Dubnicki
2007 Product of the year award by SearchStorage.com
2008 Product innovation award by Network Products Guide
2009/2010 FAST conference publication in San Jose
Sold in US and Japan since 2007
Will be sold in Poland in 2011 by 9LivesData in coop. with NEC
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Backup storage
●
Tapes are most common, despite:
●
Sensitive environment requirements
●
Unreliable restore
●
Low performance
●
Manual labor or expensive robots
●
Problematic replication
3
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
4
Backup storage size
●
Usual backup policy
●
4-12+ full backups
●
7-30+ incremental
●
●
Majority of data does
not change
●
Secondary storage
size:
●
●
Data compression 2:1
●
5x-20x more than
primary storage
Includes many copies
of the same data
Each data chunk
stored 5-10+ times
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
5
Backup storage size
●
Usual backup policy
●
4-12+ full backups
●
7-30+ incremental
●
●
Majority of data does
not change
●
Secondary storage
size:
●
●
Data compression 2:1
●
5x-20x more than
primary storage
Includes many copies
of the same data
Each data chunk
stored 5-10+ times
High potential for the deduplication technology.
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
6
Deduplication
●
Save disk space by eliminating duplicates
●
Sample reduction ratio 10:1
●
Lowers price of gigabyte
(depends on backup policy)
Sub-file level deduplication
File A
A B C
File B
A D E
File A
A B C
Only unique blocks
are stored
Stored blocks
A B C D E
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Global deduplication
●
Prevent silos of deduped data
●
One system to manage
Global vs. siloed dedup
7
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
HYDRAstor product
●
Provides
●
●
global deduplication using DataRedux™
performance, storage scalability
and data resiliency using
Distributed Resilient Data™
8
HYDRAstor deployment
●
Interface: CIFS, NFS, Symantec OST
●
Marker filtering for: Tivoli, Netbackup, Networker, CommVault
9
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
10
HYDRAstor architecture
●
Accelerator Nodes realize performance
●
Storage Nodes realize capacity
NFS / CIFS / OST
over Ethernet
Accelerator Nodes
Internal
Network
Storage Nodes
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
11
HYDRAstor architecture
●
Accelerator Nodes realize performance
●
Storage Nodes realize capacity
NFS / CIFS / OST
over Ethernet
Accelerator Nodes
Internal
Network
Storage Nodes
Non-disruptive
grid expansion
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
HYDRAstor scalability
●
●
●
MiniHYDRA – single server
●
Storage: 12 TB – 240 TB*
●
Performance: 1.3 TB / hour
2AN 4SN
●
Storage: 48 TB – 960 TB*
●
Performance: 3.6 TB / hour
20AN 40SN (4 racks)
●
Storage: 480 TB – 9600 TB*
●
Performance: 36 TB / hour
* - assuming 20x data reduction through DataRedux™
12
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
13
HYDRAstor scalability
●
Slide from Curtis Preston presentation
Curtis Preston is a famous storage analyst owning independent consulting company
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
HYDRAstor other features
●
●
Fully automatic/non-disruptive mgmt
●
Recovery of lost data resiliency
●
Periodic data scrubbing
●
Machine and disk failure recovery
Configurable redundancy level
●
erasure coding – better than RAID6
●
Optimized replication
●
Smart resource management
14
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
HYDRAstor backend
design
Details of the design:
http://www.usenix.org/events/fast09/tech/full_papers/dubnicki/dubnicki.pdf
15
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
16
Programming Model
●
Repository of blocks
●
Content-addressed
●
Immutable
●
Variable-sized
hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
17
Programming Model
●
Repository of blocks
●
Content-addressed
●
Immutable
●
Variable-sized
Exposed pointers to other
blocks
E
011.
.0
●
hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
18
Programming Model
●
●
Repository of blocks
●
Content-addressed
●
Immutable
●
Variable-sized
Exposed pointers to other
blocks
Trees of blocks
hash=010..1
Root1
E
E
E
011.
.0
●
hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
19
Programming Model
●
●
Content-addressed
●
Immutable
●
Variable-sized
Exposed pointers to other
blocks
Trees of blocks
●
DAGs due to deduplication
●
No cycles possible
Root2
E
Root1
E
hash=110..0
E
01
1.
.0
●
Repository of blocks
hash=010..1
E
011.
.0
●
hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
20
Programming Model
●
●
●
Content-addressed
●
Immutable
●
Variable-sized
Exposed pointers to other
blocks
Trees of blocks
●
DAGs due to deduplication
●
No cycles possible
Deletion of whole trees
Root2
E
Root1
E
hash=110..0
E
01
1.
.0
●
Repository of blocks
hash=010..1
E
011.
.0
●
hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
21
Programming Model
●
●
●
Content-addressed
●
Immutable
●
Variable-sized
Exposed pointers to other
blocks
Trees of blocks
●
DAGs due to deduplication
●
No cycles possible
Deletion of whole trees
Root2
E
Root1
E
hash=110..0
E
01
1.
.0
●
Repository of blocks
hash=010..1
E
011.
.0
●
hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
22
Programming Model
●
●
●
Content-addressed
●
Immutable
●
Variable-sized
Exposed pointers to other
blocks
Trees of blocks
●
DAGs due to deduplication
●
No cycles possible
Deletion of whole trees
Root2
E
Root1
E
hash=110..0
E
01
1.
.0
●
Repository of blocks
hash=010..1
E
011.
.0
●
hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
23
Programming Model
●
●
●
Content-addressed
●
Immutable
●
Variable-sized
Root2
E
hash=110..0
Exposed pointers to other
blocks
Trees of blocks
●
DAGs due to deduplication
●
No cycles possible
Deletion of whole trees
01
1.
.0
●
Repository of blocks
E
011.
.0
●
hash=011..0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
24
Original
Fragments
Original
block
Any 3 fragments can be lost
Decode
Encode
Example: N=8, m=5
Redundant
Fragments
Failure tolerance: erasure coding
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
25
Original
Fragments
Original
block
Decode
Encode
Example: N=8, m=5
Redundant
Fragments
Failure tolerance: erasure coding
Any 3 fragments can be lost
Assuming 12 disks array
Mirror
3-copy
RAID6
Erasure coding
Resiliency
1
2
2
2
3
Overhead
100%
200%
20%
20%
33%
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
26
Scalability with DHT: data placement
●
Block location: DHT with prefix routing
empty prefix
0
00
1
01
10
11
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
27
Scalability with DHT: data placement
●
Block location: DHT with prefix routing
●
Block mapped to hash prefix
hash=011..0
empty prefix
Block
0
00
1
01
10
11
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
28
Scalability with DHT: data placement
●
Block location: DHT with prefix routing
●
Block mapped to hash prefix
●
hash=011..0
empty prefix
Block
Prefix components
●
●
0
Hosted on SNs
N components
per prefix
00
Node 1
1
Node 1
2
3
1
Node 3
2
Node 4
1
0
1
01
1
N=4
10
11
0
2
1
3
0
1
Node 5
2
1
Node 6
3
2
1
3
0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
29
Scalability with DHT: data placement
●
Block location: DHT with prefix routing
●
Block mapped to hash prefix
●
Prefix components
●
●
●
Store fragments
Block
0
Hosted on SNs
N components
per prefix
hash=011..0
empty prefix
00
Node 1
1
Node 1
2
3
1
Node 3
2
Node 4
1
0
1
01
1
N=4
10
11
0
2
1
3
0
1
Node 5
2
1
Node 6
3
2
1
3
0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
30
Scalability with DHT: data placement
●
Block location: DHT with prefix routing
●
Block mapped to hash prefix
●
Prefix components
●
●
●
●
Store fragments
Distributed
consensus
Block
0
Hosted on SNs
N components
per prefix
hash=011..0
empty prefix
00
Node 1
1
Node 1
2
3
1
Node 3
2
Node 4
1
0
1
01
1
N=4
10
11
0
2
1
3
0
1
Node 5
2
1
Node 6
3
2
1
3
0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
31
Scalability with DHT: data placement
●
Block location: DHT with prefix routing
●
Block mapped to hash prefix
●
Prefix components
●
●
●
●
Store fragments
Distributed
consensus
Block
0
Hosted on SNs
N components
per prefix
hash=011..0
empty prefix
00
Node 1
1
Node 1
2
3
1
Node 3
2
Node 4
1
0
1
01
1
N=4
10
11
0
2
1
3
0
1
Node 5
2
1
Node 6
3
2
1
3
0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
32
Scalability with DHT: data placement
●
Block location: DHT with prefix routing
●
Block mapped to hash prefix
●
Prefix components
●
●
●
●
Store fragments
Distributed
consensus
Block
0
Hosted on SNs
N components
per prefix
hash=011..0
empty prefix
1
00
01
10
Node 1
2
3
1
1
1
Node 3
2
Node 4
1
0
0
1
Node 5
1
2
N=4
11
Node 1
1
Node 6
3
0
3
2
2
1
3
0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
33
Scalability with DHT: data placement
●
Block location: DHT with prefix routing
●
Block mapped to hash prefix
●
Prefix components
●
●
●
●
Store fragments
Distributed
consensus
Block
0
Hosted on SNs
N components
per prefix
hash=011..0
empty prefix
1
00
01
10
Node 1
2
3
1
1
1
Node 3
2
Node 4
1
0
0
1
Node 5
1
2
N=4
11
Node 1
1
Node 6
3
0
3
2
2
1
3
0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
34
Scalability with DHT: data placement
●
Block location: DHT with prefix routing
●
Block mapped to hash prefix
●
Prefix components
●
●
●
●
●
Store fragments
Distributed
consensus
Load balancing
Block
0
Hosted on SNs
N components
per prefix
hash=011..0
empty prefix
1
00
01
10
Node 1
2
3
1
1
1
Node 3
2
Node 4
1
0
0
1
Node 5
1
2
N=4
11
Node 1
1
Node 6
3
0
3
2
2
1
3
0
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
35
Data organization: synchrun chains
A
B
C
D
E
F
G
●
Data stream split to blocks
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
36
Data organization: synchrun chains
A
Hash
010…
B
Hash
101…
C
Hash
110…
D
Hash
011…
E
Hash
000…
F
Hash
011…
G
Hash
100…
●
Data stream split to blocks
●
Hashes of blocks computed
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
37
Data organization: synchrun chains
A
Hash
010…
B
Hash
101…
C
Hash
110…
D
Hash
011…
E
Hash
000…
F
Hash
011…
G
Hash
100…
●
Data stream split to blocks
●
Hashes of blocks computed
●
Routing through DHT
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
38
Data organization: synchrun chains
A
Hash
010…
B
Hash
101…
Prefix 01
C
Hash
110…
D
Hash
011…
E
Hash
000…
F
Hash
011…
G
Hash
100…
●
Data stream split to blocks
●
Hashes of blocks computed
●
Routing through DHT
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
39
Data organization: synchrun chains
A
Hash
010…
B
Hash
101…
C
Hash
110…
D
Hash
011…
E
F
Hash
000…
Prefix 01
Compression
Erasure Coding
Hash
011…
G
Hash
100…
●
Data stream split to blocks
●
Hashes of blocks computed
●
Routing through DHT
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
40
Data organization: synchrun chains
A
Hash
010…
B
Hash
101…
C
Hash
110…
D
Hash
011…
E
F
Hash
000…
Hash
011…
G
Hash
100…
●
Data stream split to blocks
●
Hashes of blocks computed
●
Routing through DHT
Prefix 01
Compression
Erasure Coding
Component
0
Component
1
Component
2
Component
3
●
Erasure-coded fragments
stored by components
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
41
Data organization: synchrun chains
A
Hash
010…
B
Hash
101…
C
Hash
110…
D
Hash
011…
E
F
Hash
000…
Hash
011…
G
Hash
100…
●
Data stream split to blocks
●
Hashes of blocks computed
●
Routing through DHT
Prefix 01
Compression
Erasure Coding
Component
0
Component
1
Component
2
●
A D F
A D F
A D F
Component
3
A D F
Erasure-coded fragments
stored by components
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
42
Data organization: synchrun chains
A
Hash
010…
B
Hash
101…
C
Hash
110…
D
Hash
011…
E
F
Hash
000…
G
Hash
011…
Hash
100…
●
Data stream split to blocks
●
Hashes of blocks computed
●
Routing through DHT
Prefix 01
Compression
Erasure Coding
Synchrun 1
Synchrun 2
Component
●
Synchrun 3
A D F
0
Component
A D F
1
Component
A D F
2
Component
3
A D F
Synchrun
●
Erasure-coded fragments
stored by components
Grouped into synchruns
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
43
Data organization: synchrun chains
A
Hash
010…
B
Hash
101…
C
Hash
110…
D
Hash
011…
E
F
Hash
000…
G
Hash
011…
Hash
100…
●
Data stream split to blocks
●
Hashes of blocks computed
●
Routing through DHT
Prefix 01
Compression
Erasure Coding
Synchrun 1
Synchrun 2
Component
●
Synchrun 3
A D F
0
Component
A D F
1
Erasure-coded fragments
stored by components
●
Grouped into synchruns
●
Containers stored on disks
●
Component
A D F
2
Component
3
Container
A D F
Synchrun
Fragment metadata
separately from data
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
44
Data organization: synchrun chains
A
Hash
010…
B
Hash
101…
C
Hash
110…
D
Hash
011…
E
F
Hash
000…
G
Hash
011…
Hash
100…
●
Data stream split to blocks
●
Hashes of blocks computed
●
Routing through DHT
Prefix 01
Compression
Erasure Coding
Synchrun 1
Synchrun 2
Component
●
Synchrun 3
A D F
0
Component
●
Grouped into synchruns
●
Containers stored on disks
A D F
1
Erasure-coded fragments
stored by components
●
Component
A D F
2
●
Component
3
Container
Fragment metadata
separately from data
Ordered synchrun chains
A D F
Synchrun
●
Preserve order & locality
●
Manageable
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
45
Data Services:
Identification of data resiliency level
Missing fragments
Component
01:0
Component
01:1
Component
01:2
Component
01:3
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data Services:
Identification of data resiliency level
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Chain scanning
46
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data Services:
Identification of data resiliency level
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Chain scanning
47
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data Services:
Identification of data resiliency level
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Chain scanning
48
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data Services:
Identification of data resiliency level
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Chain scanning
49
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services: reconstruction
Component
01:0
Component
01:1
Component
01:2
Component
01:3
●
Sequential read/write of entire Containers
●
Erasure decoding and re-encoding
50
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services: reconstruction
Component
01:0
Component
01:1
Component
01:2
Component
01:3
●
Sequential read/write of entire Containers
●
Erasure decoding and re-encoding
51
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services: reconstruction
Component
01:0
Component
01:1
Component
01:2
Component
01:3
●
Sequential read/write of entire Containers
●
Erasure decoding and re-encoding
52
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
53
Data services: fast data transfer
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Location of new
node (DHT)
Old
component
01:3
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services: fast data transfer
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Data transfer
Old
component
01:3
54
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services: fast data transfer
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Data transfer
Old
component
01:3
55
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services: fast data transfer
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Data transfer
Old
component
01:3
56
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services: fast data transfer
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Old
component
01:3
57
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
58
Data services for deduplication
hash=011..
Block
Component
01:0
Component
Choose
complete
chain
01:1
Component
01:2
Component
01:3
Completeness: “definitely not a duplicate”
Deletion interaction: wasn't the block scheduled for deletion?
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services for deduplication
hash=011..
Block
Component
01:0
Component
01:1
Component
01:2
Query
Component
01:3
59
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services for deduplication
hash=011..
Block
Component
01:0
Component
01:1
Component
01:2
Component
01:3
Local candidate found
60
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Data services for deduplication
hash=011..
Block
Component
01:0
Component
01:1
Component
01:2
Successful
dedup
Component
01:3
Candidate verification
61
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
On-demand data deletion
●
●
●
Distributed garbage collection
Per-block reference counter stored perfragment
Failure-tolerant
●
●
Block reference counter calculated independently
on peer Container chains
Interference with duplicate elimination:
●
duplicates resurrection after garbage collection
●
space reclamation in background
62
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Resource management
●
Configurable load balancing between:
●
backup/restore
●
background tasks (reconstruction, transfer, etc.)
●
garbage collection
●
Shares depend on system state
●
Assigns priority of tasks automatically
●
●
e.g. reconstruction before transfer or space
reclamation
Maximizes resources utilization
63
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Topics for further discussion
●
Features and technical details of HYDRAstor
●
Sales of HYDRAstor in Poland
●
Cooperation with 9LivesData on other projects
64
HYDRAstor: a Scalable Secondary Storage. 9LivesData, LLC
Questions?
Contact:
[email protected]
www.9livesdata.com
www.hydrastor.com
65