EFFICIENT TIME-SERIES in HBase

Vladimir Rodionov, SMTS Hortonworks
Time Series
• Sequence of data points
• Triplet: [ID][TIME][VALUE] – basic
• Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE]
• Stock Closing Value DJIA
• User behavior (web clicks)
• Credit card transactions
• Health data
• Fitness indicators
• Sensor data (IoT)
• Application and system metrics - ODS
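A minimal sketch of how the basic [ID][TIME][VALUE] triplet might map onto a sortable row key. The 4-byte metric ID, the 8-byte millisecond timestamp, and the helper names encode_key/decode_key are illustrative assumptions, not a layout taken from the slides. Big-endian packing keeps the points of one ID contiguous and time-ordered under byte-wise key comparison.

```python
import struct
from typing import Tuple

def encode_key(metric_id: int, ts_ms: int) -> bytes:
    """Pack [ID][TIME] big-endian so byte order matches (id, time) order."""
    return struct.pack(">IQ", metric_id, ts_ms)

def decode_key(key: bytes) -> Tuple[int, int]:
    """Inverse of encode_key: returns (metric_id, ts_ms)."""
    return struct.unpack(">IQ", key)

# (id, timestamp in ms, value) triplets arriving out of order
points = [
    (7, 1_700_000_060_000, 3.5),
    (7, 1_700_000_000_000, 1.0),
    (2, 1_700_000_030_000, 9.0),
]

# Sorting by the encoded key groups by ID first, then by time.
rows = sorted((encode_key(i, t), v) for i, t, v in points)
ordered = [decode_key(k) for k, _ in rows]
```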
TSDS requirements
• Data Store MUST preserve temporal locality of data for
better in-memory caching
• Facebook ODS: 85% of queries are for the last 26 hours
• Data Store MUST provide efficient compression
• Time series are highly compressible (less than 2 bytes per data
point in some cases)
• Facebook custom compression codec produces less than 1.4 bytes
per data point
• Data Store MUST provide automatic time-based rollup
aggregations: sum, count, avg, min, max, etc., by min,
hour, day and so on – configurable. Most of the time it is
aggregated data we are interested in.
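The rollup requirement can be sketched as grouping points into fixed-width time buckets and computing the listed aggregates per bucket. This is a minimal in-memory illustration; the function name rollup and the bucket_s parameter are assumptions, and a real data store would compute this incrementally rather than over a list.

```python
from collections import defaultdict

def rollup(points, bucket_s=60):
    """Aggregate (ts_seconds, value) pairs into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Bucket start is the timestamp rounded down to a multiple of bucket_s.
        buckets[ts - ts % bucket_s].append(value)
    return {
        start: {
            "sum": sum(vals),
            "count": len(vals),
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for start, vals in sorted(buckets.items())
    }

raw = [(0, 1.0), (30, 3.0), (61, 10.0)]
by_min = rollup(raw, bucket_s=60)     # per-minute aggregates
by_hour = rollup(raw, bucket_s=3600)  # per-hour aggregates
```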
OpenTSDB 2.x
• Data Store MUST preserve temporal locality of data for
better in-memory caching – NO
• Size-Tiered HBase compaction does not preserve temporal locality
of data. Major compaction creates single file, for example, where
recent data is stored with data which is months or years old.
• Compaction trashes the block cache as well, which decreases
read performance and increases latencies.
• Data Store MUST provide efficient compression – NO
• OpenTSDB supports compression, but it is very heavy (runs
externally) and users usually disable it in production.
• Data Store MUST provide automatic time-based rollup
aggregations – NOT IMPLEMENTED
Ideal HBase TSDB
• Keeps raw data for hours
• Does not compact raw data at all
• Preserves raw data in memory cache for periodic
compactions and time-based rollup aggregations
• Stores full resolution data only in compressed form
• Has different TTL for different aggregation resolutions:
• Days for by_min, by_10min etc.
• Months, years for by_hour
• Compaction should preserve temporal locality of both
full resolution data and aggregated data.
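To illustrate why regularly sampled series compress so well, here is a sketch of delta-of-delta encoding of timestamps with zigzag varints. This is a generic technique chosen for illustration, not the Facebook codec or the compressor described in this talk, and it covers timestamps only, not values. All function names are assumptions.

```python
def zigzag(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def varint(n: int) -> bytes:
    """Encode a non-negative int in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more groups follow
        else:
            out.append(byte)
            return bytes(out)

def encode_timestamps(ts):
    """Store each timestamp as the change in delta from the previous one."""
    out = bytearray()
    prev, prev_delta = 0, 0
    for t in ts:
        delta = t - prev
        out += varint(zigzag(delta - prev_delta))
        prev, prev_delta = t, delta
    return bytes(out)

def decode_timestamps(data: bytes):
    """Inverse of encode_timestamps."""
    ts, prev, prev_delta = [], 0, 0
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
            continue
        dod = (n >> 1) if n % 2 == 0 else -((n + 1) >> 1)  # undo zigzag
        prev_delta += dod
        prev += prev_delta
        ts.append(prev)
        n, shift = 0, 0
    return ts

# 1000 points sampled every 10 seconds
ts = [1_700_000_000 + 10 * i for i in range(1000)]
encoded = encode_timestamps(ts)
```

With a fixed sampling interval, every delta-of-delta after the first two is zero and takes a single byte, so the encoded stream comes to roughly one byte per timestamp instead of eight.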
TSDS HBase
[Diagram: a Region Server receives raw events into CF:Raw. A
Compressor Coprocessor (C) produces CF:Compressed and an
Aggregator Coprocessor (A) produces CF:Aggregates. Store files
live on HDFS.]
• CF:Raw – TTL hours
• CF:Compressed – TTL months
• CF:Aggregates – TTL days/months
Exploring (Size-Tiered) Compaction
• Does not preserve temporal locality of data.
• Compaction trashes block cache
• No efficient caching of data is possible
• It hurts most-recent-most-valuable data access pattern.
• Compression/Aggregation is very heavy.
• To read back recent raw data and run it through
compressor, many IO operations are required, because …
• We can’t guarantee recent data in a block cache.
HBASE-14468 FIFO compaction
• First-In-First-Out
• No compaction at all
• TTL-expired data just gets archived
• Ideal for raw data storage
• No compaction – no block cache trashing
• Raw data can be cached on write or on read
• Sustains hundreds of MB/s of write throughput per Region Server
• Patch available
• Can be applied to 1.0/1.1/1.2/1.3/2.0
• 0.98 requires some code changes
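The FIFO idea can be modeled in a few lines: store files are never merged, and a file is dropped only once its newest cell is past the TTL. This is a simplified model of the concept, not the code in the HBASE-14468 patch. The StoreFile class and fifo_select_expired function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoreFile:
    name: str
    max_ts: int  # newest cell timestamp in the file, in seconds

def fifo_select_expired(files, now, ttl):
    """Return files whose newest cell is already past TTL; nothing is merged."""
    return [f for f in files if now - f.max_ts > ttl]

files = [StoreFile("f1", 100), StoreFile("f2", 4_000), StoreFile("f3", 7_000)]

# With a one-hour TTL at time 7200, only f1 is fully expired.
expired = fifo_select_expired(files, now=7_200, ttl=3_600)
live = [f for f in files if f not in expired]
```

Because live files are never rewritten, blocks already in the block cache stay valid, which is the property the slide highlights.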
HBASE-14477 DT compaction
• DateTieredCompactionPolicy
• CASSANDRA-6602
• Works better for time series than ExploringCompactionPolicy
• Adds delayed compaction (not in CASSANDRA)
• Better temporal locality helps with reads
• Good choice for compressed full resolution and
aggregated data.
• Patch will follow shortly.
• Again, for 1.0 and up.
• Can be back-ported to 0.98
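A simplified sketch of date-tiered grouping: files are bucketed by age into tiers whose time span grows geometrically, and only files within the same tier are candidates for being compacted together. This illustrates the general idea only and is not the HBASE-14477 implementation. The names tier_of, group_by_tier, base_s, and factor are assumptions.

```python
from collections import defaultdict

def tier_of(age_s: int, base_s: int = 3600, factor: int = 4) -> int:
    """Tier 0 covers ages below base_s; each older tier's upper bound is factor times larger."""
    tier, upper = 0, base_s
    while age_s >= upper:
        upper *= factor
        tier += 1
    return tier

def group_by_tier(file_ages, base_s=3600, factor=4):
    """Group store files (name -> age in seconds) into per-tier candidate lists."""
    tiers = defaultdict(list)
    for name, age in file_ages.items():
        tiers[tier_of(age, base_s, factor)].append(name)
    return dict(tiers)

ages = {"a": 600, "b": 1800, "c": 5000, "d": 20_000, "e": 30_000}
groups = group_by_tier(ages)
```

Recent files share a narrow window and old files share a wide one, so data of similar age ends up in the same store file after compaction.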
DateTieredCompactionPolicy
[Figure: Temporal Locality. Two panels with axes Size and Age
comparing No compaction, STCP Major, and DTCP; the second panel
marks the Most Recent Data.]
HBASE-14496 Delayed compaction
• Files are eligible for minor compaction if their age > delay
• Good for applications where the most recent data is the most
valuable.
• Prevents block cache trashing for recent data caused by
frequent minor compactions of fresh store files
• Will enable this feature for Exploring Compaction Policy
• DTCP will have it by default.
• DTCP + Delay (1-2 days) is a good option for compressed
full resolution and aggregated data.
• Patch available.
• HBase 1.0+ (can be back-ported to 0.98)
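The eligibility rule (age > delay) amounts to a filter applied before any compaction selection runs. A minimal sketch, with the function name and the example ages chosen purely for illustration:

```python
def eligible_for_minor_compaction(file_ages_s, delay_s):
    """Keep only files older than the delay; fresher files are left untouched."""
    return [name for name, age in file_ages_s.items() if age > delay_s]

ages = {"fresh": 3_600, "yesterday": 90_000, "last_week": 600_000}

# With a one-day delay, the file written an hour ago is not a candidate.
candidates = eligible_for_minor_compaction(ages, delay_s=86_400)
```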
TSDS HBase
[Diagram: a Region Server receives raw events into CF:Raw
(FIFO compaction). A Compressor Coprocessor (C) produces
CF:Compressed (DTCP+Delay) and an Aggregator Coprocessor (A)
produces CF:Aggregates (DTCP+Delay). Store files live on HDFS.]
• CF:Raw – TTL hours
• CF:Compressed – TTL months
• CF:Aggregates – TTL days/months
Summary
• Disable major compaction
• Disable region splits (DisabledRegionSplitPolicy)
• Presplit table in advance.
• Increase hbase.hstore.blockingStoreFiles for raw data
• Have separate column families for raw, compressed and
aggregated data (each aggregate resolution – its own
family)
• FIFO for raw data; ECP + Delay for the others (now), DTCP +
Delay (in the near future)
• Periodically run an internal job (coprocessor) to compress
data and produce time-based rollup aggregations.
Q&A