DataDepot Applications

How To Build a Real-time
Warehouse
Theodore Johnson
Vladislav Shkapenyuk
April, 2010
Toolchain
• GS
• Bistro
• Daytona
• DataDepot / Egil
• Update Manager
• Warehouse Dashboard – S. Seidel (Yates)
• Data Auditor
• Birt – C. Lund (Merritt) / Ptolemy – D. Caldwell (North)
Applications
DataDepot Warehouses
• GS/DPI
– Mobility, Uverse, ICDNS, etc.
• PMOSS
– RTPMS, NEMO, VIGMON, OMVT, etc.
• BVOIP
– Business VOIP CDR correlation
• Real-time Darkstar
– Ptolemy, PathMiner, RouterMiner, G-RCA
– 60+ reports, used by tier 2, 3, 4 NOCs.
Real-time Darkstar Cluster
• 314 tables
– 107 feeds
• 340+ million raw records ingested / day
• Inexpensive, redundant cluster
• ETL in warehouse; complex application
logic.
• Serving NOC applications since Nov ’09
• Near Real-time Warehouse Loads
Raw Input Records / Day
Feed          Records / Day
BPS           153,103,805
PPS           135,493,336
INOUTFRAMES    28,255,242
TACACS          8,839,793
SYSLOG          8,626,282
WIPM_UPS        1,458,163
CPU             1,261,039
MEMORY          1,202,163
WIPM_DPS          933,401
CISCOIF           609,365
[Pie chart of daily record volumes; legend also lists WIPM_REPORTS and MPLS_MON.]
Technologies
• Move from:
– Expensive and unreliable large server
– Per-day warehouse loads
– Best-effort feeds
– ETL logic buried in loading scripts
• Move to:
– Inexpensive and high-performance cluster
– RT updates (5 minute, 1 minute periods)
– Push-based feeds
– ETL in the warehouse
• And also app logic
Contributors
• Theodore Johnson
– DataDepot, Egil
• Vladislav Shkapenyuk
– Update Manager, Bistro
• Lukasz Golab
– DataAuditor, database design
• Spence Seidel
– Warehouse Dashboard
• Ken Martau
– Various bug fixes
• Jennifer Yates
– Darkstar
• Carsten Lund
– Birt
• North / Caldwell / Ballance
– Visualization
Update Propagation
• We can build complex apps if we’re confident
that all updates get propagated.
• 1st version: used make-style algorithm
– Not correct for complex configurations
– Requires global analysis and all-at-once updates.
• Developed update propagation theory
– Interaction with scheduler
– Merge tables, partition rollups, etc.
• Implemented correct algorithm
– Has some scheduling restrictions
• Problem: scheduling restrictions cause update
delays in RT tables.
– Move to algorithm without scheduler restrictions.
Incremental Updates
• Only propagate the increment.
• Update only those partitions whose
sources have new data.
• How can we determine if a source partition
has more recent data?
“make” doesn’t work
[Diagram sequence: base partitions B1 and B2 feed derived partitions S1 and S2, which feed D. New timestamps (1–11) arrive and propagate step by step; a make-style newest-timestamp check can report D up to date while one dependence path still carries unpropagated data.]
Correctness Theory
• The warehouse consists of base tables and derived
tables.
• Each partition of a base table has a generation number:
the number of times it has been updated.
• A derived table partition D depends on a collection B(D)
of base table partitions
– One entry for each distinct dependence path.
• GB(D) is the collection of generation numbers of the
base table partitions in B(D) used to compute D.
• Gb(D) is the collection of current generation numbers of the base table partitions in B(D).
• Update D iff Gb(D) > GB(D), i.e., some base partition has a newer generation than the one used to compute D.
Example: base partitions B1 (g=1), B2 (g=2), B3 (g=1) feed S1 and S2, which feed D. GB(S1)=(1,2) and GB(S2)=(1,1). Then Gb(D)=((1,2),(1,1)) > GB(D)=((1,1),(1,1)), so D must be updated.
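A minimal sketch of the generation-number test in Python (names are illustrative, not DataDepot's actual API): each derived partition records GB(D) at compute time and is refreshed whenever the current generations Gb(D) exceed it along any dependence path.

def needs_update(Gb, GB):
    # Gb, GB: dicts keyed by dependence path, valued by generation tuples.
    # Generations only grow, so any difference means a source moved ahead.
    return any(Gb[path] > GB[path] for path in GB)

# The example above: the path through S1 now sees (1, 2), but D was
# computed from (1, 1) on that path, so D must be recomputed.
Gb = {"via_S1": (1, 2), "via_S2": (1, 1)}
GB = {"via_S1": (1, 1), "via_S2": (1, 1)}
assert needs_update(Gb, GB)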
Source-vector Protocol
• Each table D maintains an update counter ut.
• Each partition d of D maintains an updatestamp u, which is assigned the value of ut when the partition is updated.
• Each partition also maintains a source vector mu=(M1,…,Mn), where Mi is the maximum updatestamp among the partitions of source table Si that affect partition d.
• Update partition d iff there is a source table Si with partitions (si1,…,sij) such that max(u(si1),…,u(sij)) > Mi.
[Diagram sequence: base partitions B1, B2, B3 feed S1 and S2, which feed D. Each partition carries its updatestamp u and source vector mu. When B2's updatestamp advances, the maximum updatestamp feeding S2 exceeds S2's recorded mu entry, so S2 is refreshed (u=3, mu=(1,4)), which in turn makes D stale. If B1, B2, B3 are in one table B and S1, S2 in one table S, each source vector collapses to a single entry per source table, e.g. mu=(4).]
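A hedged Python sketch of the source-vector test (illustrative names, not the Update Manager's real interface): partition d is stale iff some source table's feeding partitions carry a larger maximum updatestamp than the Mi recorded in d's source vector.

def partition_stale(source_stamps, mu):
    # source_stamps[i]: updatestamps of the Si partitions that affect d;
    # mu[i]: Mi, the maximum recorded when d was last refreshed.
    return any(max(stamps) > Mi for stamps, Mi in zip(source_stamps, mu))

# E.g., if the feeding partitions of one source table now carry stamps (2, 4)
# but d recorded Mi=2 for that table, d must be refreshed.
assert partition_stale([(2, 4)], (2,))
assert not partition_stale([(2, 4)], (4,))   # up to date after refresh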
Extensions
• Interval-timestamp protocol
– Store start and end timestamp of the update.
– Fixed-size metadata, but it requires synchronized timestamps in a
cluster
• Trailing-edge consistency
– Propagate “no more updates” punctuation
• MERGE tables (Union)
– Treat new partitions as newly loaded.
• Partition Rollups
– Problem: load data once per minute, store for 2 years
– 1-minute partitions => 1,051,200 partitions to cover 2 years
– Rollup older partitions (arithmetic checked in the sketch after this list)
• First 2 days: 1-minute partitions
• Next 728 days: 1-day partitions
• 3,608 partitions to cover 2 years, but with efficient real-time updates.
• Any single-timestamp protocol requires scheduling restrictions
– Scheduling restrictions are bad for a real-time warehouse.
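The rollup partition counts above follow directly; a quick check in Python:

minutes_per_day = 24 * 60            # 1,440
print(2 * 365 * minutes_per_day)     # 1,051,200 one-minute partitions for 2 years
print(2 * minutes_per_day + 728)     # 3,608 after rolling up days 3..730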
Effects of Scheduling Restrictions
[Two plots of update-propagation delay (seconds, y-axis up to 1,400) over time from CPU_RAW to CPU, covering tables AGG_60_C and C_POLL_COUNTS: one under the starting-timestamp update protocol, the other under the interval-timestamp update protocol.]
Data Feed Management
• Previous
– Collection of shell scripts invoked using cron
• about 200 scripts for all the Darkstar feeds
– Heavy usage of rsync and find
– Slow, cumbersome, buggy, unmaintainable
– Not intended to provide real-time data access
• no real-time triggers
• Cron jobs
– Propagation delays
– Can step on previous unfinished scripts
– No prioritized resource management
• rsync
– Lack of notification on destination side (triggers)
– Subscribers must keep identical directory structure and
time window
– No systemic performance and quality monitoring
Bistro
• Bistro
– derived from the Russian word быстро, which means “quickly”
• Structured configuration language
– Defines patterns to classify incoming files into feeds
– Customers subscribe to feeds of interest
– Bistro servers can subscribe to each other
• Maintain database of transferred files
– Avoid repeated scans of millions of files
– Different subscribers can keep differently sized time windows of data
• Intelligent transfer scheduling
– Avoid unnecessary retransmission
– Deal with periods of subscriber unavailability
– Parallel data transfer trying to maximize the locality of file accesses
• Triggers
– Notify subscribers about file delivery; batch notifications
• Logging
– Performance monitoring
– Monitor for new feeds / feed evolution
– Feed quality monitoring
Bistro Feed Architecture
[Diagram: several Bistro sources feed a network of Bistro servers, which relay files to Bistro clients.]
• Similar to distributed caches / CDN
– Each server can keep a different window of data
• Distribute feed delivery workload across several servers
• Minimize the impact of limited pipe bandwidth
Example Configuration
Feeds:

FeedGroup{
    Name         COMPASS
    AutoCompress On
    ScanFreq     5                          // scan every 5 minutes
    LandingDir   /stage/datastore/inbound
    StageDir     /data/dee_1/bistro
    LogDir       /export/home/datastor/log
    Window       14400                      // 10 days window
    StagePath    $y/$m/$d
}
Feed{
    Name         COMPASS_BPS
    Parent       COMPASS
    FilePattern  INOUTOCTS%s.%Y%m%d%h_%n.gz
}
Feed{
    Name         COMPASS_CPU
    Parent       COMPASS
    FilePattern  CPU%s.%Y%m%d%h_%n.gz
}
Subscribers:

Subscriber{
    Name        pride
    HostName    pride-db1.research.att.com
    UserName    darkstar
    TriggerPath /home/darkstar/bin/upd_client
    Feed        COMPASS /export/SNMP/compass
}
Subscriber{
    Name        diamond
    HostName    diamond.research.att.com
    TriggerPath /darkstar/bin/update_manager_client.pl
    Feed        COMPASS /dfeed_2/performance/compass/ddb
}
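A hedged Python sketch of how a FilePattern might be compiled into a regex to classify incoming files. The %-code meanings are inferred from the examples in this deck (%Y/%m/%d/%h are date digits, %n a sequence number, %s a free-form string, %i{k} exactly k digits); Bistro's real matcher may differ.

import re

CODES = {"%Y": r"\d{4}", "%m": r"\d{2}", "%d": r"\d{2}",
         "%h": r"\d{2}", "%n": r"\d+", "%s": r".*?"}

def compile_pattern(pattern):
    regex, i = "", 0
    while i < len(pattern):
        if pattern.startswith("%i{", i):            # %i{k}: exactly k digits
            k_end = pattern.index("}", i)
            regex += r"\d{" + pattern[i + 3:k_end] + "}"
            i = k_end + 1
        elif pattern[i:i + 2] in CODES:
            regex += CODES[pattern[i:i + 2]]
            i += 2
        else:
            regex += re.escape(pattern[i])
            i += 1
    return re.compile(regex + r"\Z")

cpu = compile_pattern("CPU%s.%Y%m%d%h_%n.gz")         # COMPASS_CPU's pattern
print(bool(cpu.match("CPUpoller1.2010030820_7.gz")))  # True (illustrative file)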
Technical Issues
• Real-time feed management
– Need to deliver files to subscribers with well-defined tardiness
– Subscribers can be unavailable for a long time
• potentially huge volumes of historical data need to be sent when they come back up
• similarly for new subscribers
– Avoid reading the same file twice; stream data to several subscribers at the same time
– Constrained resources: pipe bandwidth, CPU, disk, subscriber performance
• Feed Discovery (sketched after this list)
– Which Bistro server has the feed a subscriber needs, with a large enough time window?
– Which is the “best” Bistro server that satisfies these criteria?
• Access control, data acquisition, feed archival and many other
issues
• Feed Performance and Quality Monitoring
– Detect feed outages and performance degradations
– Detect feed changes
– Discover new feeds
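A minimal Python sketch of the feed-discovery choice above. The selection policy here (window coverage first, then least load) is an assumption for illustration, not Bistro's documented behavior; server names and fields are hypothetical.

def pick_server(servers, feed, needed_window):
    # servers: [{"name": ..., "feeds": {feed: window_minutes}, "load": ...}]
    ok = [s for s in servers if s["feeds"].get(feed, 0) >= needed_window]
    return min(ok, key=lambda s: s["load"])["name"] if ok else None

servers = [{"name": "bistro1", "feeds": {"COMPASS": 14400}, "load": 0.7},
           {"name": "bistro2", "feeds": {"COMPASS": 2880},  "load": 0.1}]
print(pick_server(servers, "COMPASS", 10000))  # bistro1: only one with enough window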
Feed Analyzer
• Bistro file matching
– Feeds are defined using a set of simple patterns
• CPU%s.%Y%m%d%h_%n.gz, INOUTOCTS%s.%Y%m%d%h_%n.gz
– Best case – pattern specified by the source (extremely rare)
– Usual case – pattern is a best guess
• Need to be general enough to deal with future changes
• Specific enough to discard files that are clearly outside of the feed
• Data feeds evolve over time
– File format changes, filename format changes
– New data sources for existing feeds
– New feeds quietly introduced
– Feed patterns also need to evolve over time
• Bistro Feed Analyzer
– Analyze unmatched files and identify new feeds
– Identify files resembling existing feeds
– Discover filename structure for already defined feeds
• suggest less generic definitions
– Suggest better feed definitions
Evolving Feed Patterns
• Pattern too generic – false positives
– Example pattern %s%Y%m%d%h%n.gz
– Subscribers don’t have a good understanding of filename
structure
– Trying to use as simple an expression as possible
– Other sources started producing similarly named files
• Pattern too specific – false negatives
– Example pattern - poller1_%Y%m%d%h%n.gz
– Feed file naming convention changed
• poller1_%Y%m%d%h%n.gz changed to poller1_ver1.1_%Y%m%d%h%n.gz
– More sources are contributing to a feed
• poller2_%Y%m%d%h%n.gz in addition to poller1_%Y%m%d%h%n.gz
– Incomplete understanding of the data feeds
• poller%i_%Y%m%d%h%n.gz would have been specified if we had known there were several pollers
False Positives
• Patterns too generic
– Discover the structure of generically described feeds, e.g.
%s%Y%m%d%h%n.gz
– Identify a small set of patterns that describe all matching files
serviceMapping_%Y%m%d%h%n.cvs.gz
UNIFORM-fspz_%Y%m%d%h%n_%i{2}.csv.gz
– Suggest a more refined feed definition
• Related to learning language from a set of positive examples
– Computationally expensive
– Can produce FAs with hundreds of states
• Asking the subscriber or feed provider to refine such a pattern becomes impossible
• Bistro uses rule-based pattern learning
– Break filenames into fixed and mutable parts
TRAP_ 20100308172519 _DCTAGN_rlph03.txt
• Fixed parts – names of objects (e.g. router names, interface names)
• Mutable parts – timestamps, dates, sequence numbers, etc.
– Use file history to figure out which parts are mutable (see the sketch below)
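A minimal sketch of the fixed/mutable split in Python (illustrative, not Bistro's actual learner): tokenize each filename into digit and non-digit runs, then compare token columns across the file history; it assumes every file in the history tokenizes into the same number of runs.

import re

def tokenize(name):
    return re.findall(r"\d+|\D+", name)   # runs of digits / non-digits

def suggest_pattern(history):
    parts = []
    for column in zip(*[tokenize(f) for f in history]):
        if len(set(column)) == 1:         # fixed part: an object name
            parts.append(column[0])
        elif column[0].isdigit():         # mutable digits -> %i{k}
            parts.append("%%i{%d}" % len(column[0]))
        else:                             # mutable string -> %s
            parts.append("%s")
    return "".join(parts)

files = ["TRAP_20100308172519_DCTAGN_rlph03.txt",
         "TRAP_20100308173000_DCTAGN_rlph04.txt"]
print(suggest_pattern(files))   # TRAP_%i{14}_DCTAGN_rlph%i{2}.txt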
False Negatives
• Detecting False Negatives
– Analyze unmatched files and check similarity to existing feeds
– How to define a notion of similarity?
Pattern:        PM_%Y%m%d%h%n%s_ACMEPMSYS.pm.txt
Unmatched file: PM_20100303203801_ACMEPMINT.pm.txt
• Approximate regex matching
– Edit distance between filename and pattern
• Number of modifications that make a filename match existing pattern
– Works well in some cases (see the edit-distance sketch after this list)
• poller1_ver2_20100303203801.gz matches poller1_%Y%m%d%h%n.gz
• edit distance 5
– Frequently, though, the edit distance is large even when filenames share long common parts:
TRAP_20100308172519_DCTAGN_rlph039.txt
and
TRAP_20100308173000_UVIPTV-PER-BAN-DSPS-IPTV_MOM-rcsntxsqlcv002_900SEC_klpi026.txt
– Need a more robust similarity metric
• Use pattern similarity
– Generalize unmatched file names to generate patterns
– Perform comparison between defined and unmatched patterns
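A minimal edit-distance sketch in Python. Comparing the raw filename against a previously matched filename stands in for true approximate regex matching against the pattern, which is what the slide describes.

def edit_distance(a, b):
    # Standard Levenshtein DP with one rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete
                           cur[j - 1] + 1,            # insert
                           prev[j - 1] + (ca != cb))) # substitute
        prev = cur
    return prev[-1]

# The renamed feed above: inserting "ver2_" (5 characters) aligns the names.
print(edit_distance("poller1_ver2_20100303203801.gz",
                    "poller1_20100303203801.gz"))     # 5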
Output of Feed Analyzer
• Discovery of new feeds
Pattern:   PM_%i{14}%s%i{2}ema-%i{14}_rlpi%i{3}.txt
Frequency: 33,520
%s[1] = ('_GSX-SystemStats_SONUS-ff4ca', '_GSX-PNSEnetStats_SONUS-ff4ca', '_IPNSEP_ff4ca', '_GSX-CongestionIntervalStats_SONUS-ff4ca', '_TGIS-GSX_SONUS-ff4ca', '_TGS-GSX_SONUS-ff4ca', '_PSX-CM_SONUS-ff4ca')
Pattern:   %s%i{14}%s%i{3}SEC_klpi%i{3}.txt.gz
Frequency: 20,958
%s[0] = ('TRAP_', 'PM_')
%s[2] = ('_MERCURY-RUM-EUA-baccenter_', '_MERCURY-RUM-EMS-baccenter_')
Pattern:   mokscy3ivmsxa%i{2}_report-%i{10}%s%i{6}.csv.gz
Frequency: 2,078
• Identifying files similar to existing feeds
Pattern: PM_%Y%m%d%h%n%s_ACMEPMSYS.pm.txt
File:    PM_20100303203801_ACMEPMINT.pm.txt
Conclusions
• We thought building an RT warehouse would be easy
– Still learning
• End-to-end toolchain
– from data feeds to end applications
• Contact us if you want to build a
warehouse or manage your data feeds.