EFFICIENT TIME-SERIES IN HBASE
Vladimir Rodionov, SMTS, Hortonworks

Time Series
• Sequence of data points
• Triplet: [ID][TIME][VALUE] (basic)
• Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE]
• Stock closing value (DJIA)
• User behavior (web clicks)
• Credit card transactions
• Health data
• Fitness indicators
• Sensor data (IoT)
• Application and system metrics (ODS)

TSDS requirements
• Data Store MUST preserve temporal locality of data for better in-memory caching
  • Facebook ODS: 85% of queries are for the last 26 hours
• Data Store MUST provide efficient compression
  • Time series are highly compressible (less than 2 bytes per data point in some cases)
  • Facebook's custom compression codec produces less than 1.4 bytes per data point
• Data Store MUST provide automatic time-based rollup aggregations: sum, count, avg, min, max, etc., by minute, hour, day and so on (configurable). Most of the time it is the aggregated data we are interested in.

OpenTSDB 2.x
• Data Store MUST preserve temporal locality of data for better in-memory caching: NO
  • Size-tiered HBase compaction does not preserve temporal locality of data. A major compaction, for example, creates a single file in which recent data is stored together with data that is months or years old.
  • Compaction thrashes the block cache as well, which decreases read performance and increases latencies.
• Data Store MUST provide efficient compression: NO
  • OpenTSDB supports compression, but it is very heavy (runs externally) and users usually disable it in production.
• Data Store MUST provide automatic time-based rollup aggregations: NOT IMPLEMENTED

Ideal HBase TSDB
• Keeps raw data for hours
• Does not compact raw data at all
• Preserves raw data in memory cache for periodic compactions and time-based rollup aggregations
• Stores full resolution data only in compressed form
• Has different TTL for different aggregation resolutions:
  • Days for by_min, by_10min etc.
  • Months, years for by_hour
• Compaction should preserve temporal locality of both full resolution data and aggregated data.

TSDS HBase
[Diagram: Region Server; Raw Events; CF:Raw; CF:Compressed; CF:Aggregates; HDFS]
• C: Compressor Coprocessor
• A: Aggregator Coprocessor
• CF:Raw: TTL hours
• CF:Compressed: TTL months
• CF:Aggregates: TTL days/months

Exploring (Size-Tiered) Compaction
• Does not preserve temporal locality of data.
• Compaction thrashes the block cache
  • No efficient caching of data is possible
  • It hurts the most-recent-most-valuable data access pattern.
• Compression/aggregation is very heavy.
  • To read back recent raw data and run it through the compressor, many IO operations are required, because …
  • We can't guarantee that recent data is in the block cache.

HBASE-14468 FIFO compaction
• First-In-First-Out
• No compaction at all
• TTL-expired data just gets archived
• Ideal for raw data storage
• No compaction, so no block cache thrashing
• Raw data can be cached on write or on read
• Sustains 100s of MB/s write throughput per RS
• Patch available
• Can be applied to 1.0/1.1/1.2/1.3/2.0
• 0.98 requires some code changes

HBASE-14477 DT compaction
• DateTieredCompactionPolicy
• CASSANDRA-6602
• Works better for time series than ExploringCompactionPolicy
• Adds delayed compaction (not in Cassandra)
• Better temporal locality helps with reads
• Good choice for compressed full resolution and aggregated data.
• Patch will follow shortly. Again, for 1.0 and up.
• Can be back-ported to 0.98

DateTieredCompactionPolicy
[Figure, two panels: axes Temporal Locality and Age; labels No compaction, STCP, Major, Size, DTCP; the second panel also marks Most Recent Data]

HBASE-14496 Delayed compaction
• Files are eligible for minor compaction if their age > delay
• Good for applications where the most recent data is the most valuable.
• Prevents block cache thrashing on recent data caused by frequent minor compactions of fresh store files
• Will enable this feature for Exploring Compaction Policy
• DTCP will have it by default.
• DTCP + Delay (1-2 days) is a good option for compressed full resolution and aggregated data.
• Patch available.
• HBase 1.0+ (can be back-ported to 0.98)

TSDS HBase
[Diagram: Region Server; Raw Events; CF:Raw (FIFO); CF:Compressed (DTCP+Delay); CF:Aggregates (DTCP+Delay); HDFS]
• C: Compressor Coprocessor
• A: Aggregator Coprocessor
• CF:Raw: TTL hours
• CF:Compressed: TTL months
• CF:Aggregates: TTL days/months

Summary
• Disable major compaction
• Disable region splits (DisabledRegionSplitPolicy)
• Presplit the table in advance.
• Increase hbase.hstore.blockingStoreFiles for raw data
• Have separate column families for raw, compressed and aggregated data (each aggregate resolution gets its own family)
• FIFO for raw; ECP + Delay for others (now), DTCP + Delay (in near future)
• Periodically run an internal job (coprocessor) to compress data and produce time-based rollup aggregations.

Q&A
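
The time-based rollup aggregations named in the requirements and the summary (sum, count, avg, min, max per minute, hour, day) can be illustrated with a minimal sketch. This is plain Python over in-memory [ID][TIME][VALUE] triplets, not HBase coprocessor code; the names Rollup, rollup and bucket_seconds are hypothetical and do not come from the talk or from any HBase API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Rollup:
    """Running aggregate for one (series id, bucket start) pair (hypothetical type)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        # Fold one data point into the running aggregate.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def avg(self) -> float:
        # Average is derived from sum and count, so it never needs storing.
        return self.total / self.count


def rollup(points, bucket_seconds):
    """Aggregate (series_id, epoch_seconds, value) triplets into fixed-width buckets.

    bucket_seconds is the resolution: 60 for by_min, 600 for by_10min,
    3600 for by_hour. Returns {(series_id, bucket_start): Rollup}.
    """
    buckets = defaultdict(Rollup)
    for series_id, ts, value in points:
        # Align the timestamp down to the start of its bucket.
        bucket_start = ts - ts % bucket_seconds
        buckets[(series_id, bucket_start)].add(value)
    return dict(buckets)
```

Calling rollup(points, 60) on raw points yields by_min aggregates, and the same points with 3600 yield by_hour aggregates. In the layout described above, each resolution would be written to its own column family with its own TTL.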