
Online System Problem Detection by Mining Console Logs
Wei Xu* Ling Huang†
Armando Fox* David Patterson* Michael Jordan*
*UC Berkeley
†Intel Labs Berkeley
Why console logs?
• Detecting problems in large-scale Internet services often requires detailed instrumentation
• Instrumentation can be costly to insert & maintain
  • High code churn
  • Services often combine open-source building blocks that are not all instrumented
• Can we use console logs in lieu of instrumentation?
  + Easy for developers, so nearly all software has them
  – Imperfect: not originally intended for instrumentation
Problems we are looking for
• The easy case – rare messages
  e.g., "what is wrong with blk_2 ???"
• Harder but useful – abnormal sequences
  e.g., "receiving blk_1, received blk_1" is NORMAL, while "receiving blk_2" with no matching "received" is an ERROR
Overview and Contributions
(Pipeline: free text logs from 200 nodes → parsing* → frequent pattern based filtering → dominant cases marked OK; non-pattern events → PCA detection → normal cases marked OK, real anomalies flagged ERROR.)
• Accurate online detection with small latency
* Large-scale system problem detection by mining console logs (SOSP ’09)
Constructing event traces from console logs
• Parse: message type + variables
• Group messages by identifiers (automatically discovered)
  – Group ≈ event trace
  Example: interleaved messages such as "receiving blk_1", "receiving blk_2", "received blk_1", "received blk_2", "reading blk_1" are grouped into one event trace per block identifier.
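
To make the grouping step concrete, here is a minimal Python sketch (not the paper's actual parser, which derives message templates from the application's source code); the regexes and the blk_ identifier convention are illustrative assumptions.

import re
from collections import defaultdict

# Illustrative message-type patterns; the real parser derives message
# templates from the logging statements in the source code.
MESSAGE_TYPES = [
    ("receiving", re.compile(r"receiving (blk_\w+)")),
    ("received",  re.compile(r"received (blk_\w+)")),
    ("reading",   re.compile(r"reading (blk_\w+)")),
    ("deleting",  re.compile(r"deleting (blk_\w+)")),
    ("deleted",   re.compile(r"deleted (blk_\w+)")),
]

def parse(line):
    """Return (message_type, identifier) or None for unrecognized lines."""
    for msg_type, pattern in MESSAGE_TYPES:
        m = pattern.search(line)
        if m:
            return msg_type, m.group(1)
    return None

def group_by_identifier(lines):
    """Group parsed messages into one event trace per identifier."""
    traces = defaultdict(list)
    for line in lines:
        parsed = parse(line)
        if parsed:
            msg_type, ident = parsed
            traces[ident].append(msg_type)
    return traces

log = ["receiving blk_1", "receiving blk_2", "received blk_1",
       "received blk_2", "reading blk_1"]
print(dict(group_by_identifier(log)))
# {'blk_1': ['receiving', 'received', 'reading'], 'blk_2': ['receiving', 'received']}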
Online detection: when to make the detection decision?
• Cannot wait for the entire trace
  • A trace can last an arbitrarily long time
• How long do we have to wait?
  • Long enough to keep the correlations
  • A wrong cut = a false positive
• Difficulties
  • No obvious boundaries
  • Inaccurate message ordering
  • Variations in session duration
(Timeline of blk_1 messages arriving over time – receiving, received, reading, deleting, deleted – with no obvious point at which to cut the trace.)
Frequent patterns help determine session boundaries
• Key Insight: Most messages/traces are normal
• Strong patterns
• “Make common paths fast”
• Tolerate noise
Two-stage detection overview
(Same pipeline figure as in the overview: free text logs from 200 nodes → parsing → frequent pattern based filtering → dominant cases OK; non-pattern events → PCA detection → normal cases OK, real anomalies flagged ERROR.)
Stage 1 - Frequent patterns (1): Frequent event sets
Repeat until all patterns are found:
  1. Coarse cut by time
  2. Find the frequent item set
  3. Refine the time estimate
Events that match no frequent pattern (e.g., "error blk_1") are passed on to PCA detection.
(Figure: an example blk_1 trace cut along the time axis.)
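
A highly simplified Python sketch of this loop, assuming each trace is a time-sorted list of (timestamp, event_type) pairs; the support threshold, initial window, and the way the duration cutoff is refined are illustrative assumptions, not the exact algorithm from the talk.

from collections import Counter

def mine_frequent_patterns(traces, support=0.2, init_window=30.0):
    """Iteratively mine frequent event sets and refine their duration cutoff.

    traces: dict mapping identifier -> time-sorted list of (timestamp, event_type).
    Returns a list of (event_set, duration_cutoff) patterns; sessions that
    match no pattern are left for PCA detection.
    """
    patterns = []
    window = init_window
    remaining = dict(traces)
    while remaining:
        # 1. Coarse cut by time: keep events within `window` of the session start.
        segments = {k: frozenset(e for t, e in v if t - v[0][0] <= window)
                    for k, v in remaining.items()}
        # 2. Find the most frequent event set among the cut segments.
        event_set, freq = Counter(segments.values()).most_common(1)[0]
        if freq < support * len(traces):
            break  # nothing frequent left; leftovers go to PCA detection
        # 3. Refine the time estimate from the sessions matching the pattern.
        durations = sorted(v[-1][0] - v[0][0] for k, v in remaining.items()
                           if segments[k] == event_set)
        cutoff = durations[int(0.9995 * (len(durations) - 1))]  # ~99.95th pct
        patterns.append((event_set, cutoff))
        remaining = {k: v for k, v in remaining.items()
                     if segments[k] != event_set}
        window = cutoff if cutoff > 0 else window  # reuse the refined estimate
    return patterns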
Stage 1 - Frequent patterns (2): Modeling session duration time
• Assuming Gaussian?
  • The 99.95th percentile estimate is off by half
  • 45% more false alarms
• Mixture distribution
  • Power-law tail + histogram head
(Plots: duration histogram (count vs. duration) and tail probability Pr(X ≥ x) vs. duration.)
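
A sketch of the head-plus-tail duration model, assuming the tail is fit as a Pareto (power-law) distribution above an empirical split point; the split fraction and the maximum-likelihood tail estimator are assumptions rather than the exact model from the talk. Unlike a Gaussian fit, the heavy tail keeps the high-percentile cutoff from being badly underestimated.

import numpy as np

def duration_cutoff(durations, q=0.9995, head_frac=0.95):
    """High quantile of session duration from a histogram head + power-law tail.

    Below the `head_frac` empirical quantile the data itself is used (the
    "histogram head"); above it a Pareto tail is fit by maximum likelihood
    and used to extrapolate the requested quantile q.
    """
    x = np.sort(np.asarray(durations, dtype=float))
    if q <= head_frac:
        return float(np.quantile(x, q))        # the histogram head suffices
    xmin = float(np.quantile(x, head_frac))    # where the power-law tail begins
    tail = x[x >= xmin]
    alpha = 1.0 + len(tail) / (np.sum(np.log(tail / xmin)) + 1e-12)  # Pareto MLE
    # Invert the tail CCDF: Pr(X >= t) = (1 - head_frac) * (t / xmin)^(-alpha)
    p = (1.0 - q) / (1.0 - head_frac)
    return float(xmin * p ** (-1.0 / alpha))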
Stage 2 - Handling noise with PCA detection
(Pipeline figure again, highlighting the PCA detection stage: non-pattern events left over from frequent pattern filtering are classified as normal cases (OK) or real anomalies (ERROR).)
• More tolerant to noise
• Principal Component Analysis (PCA) based detection
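
The slides do not show the detector's internals here; the numpy sketch below illustrates one standard way to do PCA-based detection on per-session message-count vectors (the subspace/residual method). The number of principal components and the empirical threshold quantile are illustrative choices, not the parameters used in the paper.

import numpy as np

class PCADetector:
    """Minimal sketch of PCA-based anomaly detection on event-count vectors.

    Each session is a vector of per-message-type counts. Vectors are projected
    onto the residual subspace (everything beyond the top-k principal
    directions); a large squared prediction error marks an anomaly.
    """

    def fit(self, counts, k=3, quantile=0.999):
        X = np.asarray(counts, dtype=float)
        self.mean = X.mean(axis=0)
        Xc = X - self.mean
        _, _, vt = np.linalg.svd(Xc, full_matrices=False)
        normal_space = vt[:k].T                      # top-k principal directions
        self.residual_proj = np.eye(X.shape[1]) - normal_space @ normal_space.T
        spe = np.sum((Xc @ self.residual_proj) ** 2, axis=1)
        self.threshold = np.quantile(spe, quantile)  # empirical threshold
        return self

    def is_anomaly(self, count_vector):
        r = (np.asarray(count_vector, dtype=float) - self.mean) @ self.residual_proj
        return float(r @ r) > self.threshold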
Frequent pattern matching filters most of the normal events
(Of all events (100%), frequent pattern matching handles 86% as dominant cases (OK); the remaining 14% are non-pattern events sent to PCA detection, which marks 13.97% as normal cases (OK) and 0.03% as real anomalies (ERROR).)
Evaluation setup
• Hadoop file system (HDFS)
• Experiment on Amazon’s EC2 cloud
• 203 nodes x 48 hours
• Running standard map-reduce jobs
• ~24 million lines of console logs
• 575,000 traces
• ~ 680 distinct ones
• Manual labels from previous work
• Normal/abnormal + why it is abnormal
• “Eventually normal” – did not consider time
• For evaluation only
Frequent patterns in HDFS
Frequent Pattern               Duration (99.95th percentile)   % of messages
Allocate, begin write          13 sec                          20.3%
Done write, update metadata    8 sec                           44.6%
Delete                         -                               12.5%
Serving block                  -                               3.8%
Read exception                 -                               3.2%
Verify block                   -                               1.1%
Total                                                          85.6%

• Covers most messages
• Short durations
(Total events ~20 million)
Detection latency
• Detection latency is dominated by the wait time
(Latency CDF broken down by event type: frequent pattern (matched), single event pattern, frequent pattern (timed out), and non-pattern events.)
Detection accuracy
          True Positives   False Positives   False Negatives   Precision   Recall
Online    16,916           2,748             0                 86.0%       100.0%
Offline   16,808           1,746             108               90.6%       99.3%
(Total traces = 575,319)
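
As a quick sanity check, the precision and recall columns follow from the counts via the standard formulas; the snippet below reproduces the table values (up to rounding).

# precision = TP / (TP + FP), recall = TP / (TP + FN)
for name, tp, fp, fn in [("Online", 16916, 2748, 0), ("Offline", 16808, 1746, 108)]:
    print(f"{name}: precision = {tp / (tp + fp):.3f}, recall = {tp / (tp + fn):.3f}")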
• Ambiguity in what counts as "abnormal"
  • Manual labels mark traces as "eventually normal"
  • >600 of the online false positives are sessions with very long latency
    • E.g., a write session that takes >500 sec to complete (the 99.99th percentile is 20 sec)
Future work
– Handle large-scale clusters + partial failures
• Clustering alarms
• Allowing feedback from operators
• Correlating logs from multiple applications / layers
Summary
(Pipeline recap: free text logs from 200 nodes, 24 million lines → parsing → frequent pattern based filtering → dominant cases OK; non-pattern events → PCA detection → normal cases OK, real anomalies flagged ERROR.)
• Online precision: 86.0%
• Online recall: 100.0%
http://www.cs.berkeley.edu/~xuw/
Wei Xu <[email protected]>