Online System Problem Detection by Mining Console Logs
Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan*
*UC Berkeley   †Intel Labs Berkeley

Why console logs?
• Detecting problems in large-scale Internet services often requires detailed instrumentation
• Instrumentation can be costly to insert & maintain
  – High code churn
  – Services often combine open-source building blocks that are not all instrumented
• Can we use console logs in lieu of instrumentation?
  + Easy for developers, so nearly all software has them
  – Imperfect: not originally intended for instrumentation

Problems we are looking for
• The easy case – rare messages (e.g., "what is wrong with blk_2???")
• Harder but useful – abnormal sequences
  – "receiving blk_1 ... received blk_1" is NORMAL; "receiving blk_2" with no matching "received blk_2" is an ERROR

Overview and Contributions
[Figure: detection pipeline – free-text logs from 200 nodes → parsing* → frequent-pattern-based filtering (dominant cases → OK); non-pattern events → PCA detection (normal cases → OK, real anomalies → ERROR)]
• Accurate online detection with small latency
* Large-scale system problem detection by mining console logs (SOSP '09)

Constructing event traces from console logs
• Parse each message into a message type + variables
• Group messages by identifiers (automatically discovered)
  – A group ≈ an event trace
[Example: messages such as "receiving blk_1", "received blk_1", "reading blk_1" are grouped by block identifier into per-block traces]

Online detection: When to make a detection?
• Cannot wait for the entire trace – it can last an arbitrarily long time
• How long do we have to wait?
  – Long enough to keep correlations; a wrong cut = a false positive
• Difficulties
  – No obvious session boundaries
  – Inaccurate message ordering
  – Variations in session duration
[Example: a timeline of "receiving / received / reading / deleting / deleted blk_1" events]

Frequent patterns help determine session boundaries
• Key insight: most messages/traces are normal
  – Strong patterns emerge from normal behavior
  – "Make common paths fast"
  – Tolerate noise

Two-stage detection overview
[Figure: the same pipeline – parsing → frequent-pattern-based filtering → PCA detection]

Stage 1 - Frequent patterns (1): Frequent event sets
• Repeat until all patterns are found (sketched below):
  – Coarse cut by time
  – Find the frequent item set
  – Refine the time estimation
[Example: "receiving / received / reading blk_1" events cut into sessions; an "error blk_1" message falls outside the pattern and is passed to PCA detection]
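To make stage 1 concrete, below is a minimal Python sketch of grouping parsed messages into per-identifier traces, cutting them coarsely by time, and keeping the event sets that occur frequently. The tuple format, function names, and the gap/support thresholds are illustrative assumptions, not the system's actual implementation, and it counts whole sessions as event sets rather than enumerating all subsets the way a full frequent-item-set miner would.

```python
# Minimal sketch: group parsed log messages into per-block event traces,
# cut them coarsely by time, and keep frequently occurring event sets.
# All names and thresholds here are illustrative, not from the original system.
from collections import Counter, defaultdict

def group_by_identifier(events):
    """events: iterable of (timestamp, block_id, event_type) tuples."""
    traces = defaultdict(list)
    for ts, block_id, event_type in events:
        traces[block_id].append((ts, event_type))
    return traces

def cut_sessions(trace, gap=10.0):
    """Coarse cut: start a new session when the gap between
    consecutive events exceeds `gap` seconds."""
    if not trace:
        return []
    trace = sorted(trace)
    sessions, current = [], [trace[0][1]]
    for (prev_ts, _), (ts, ev) in zip(trace, trace[1:]):
        if ts - prev_ts > gap:
            sessions.append(frozenset(current))
            current = []
        current.append(ev)
    sessions.append(frozenset(current))
    return sessions

def frequent_event_sets(traces, min_support=0.2, gap=10.0):
    """Return event sets covering at least `min_support` of all sessions."""
    counts, total = Counter(), 0
    for trace in traces.values():
        for session in cut_sessions(trace, gap):
            counts[session] += 1
            total += 1
    return {s: n / total for s, n in counts.items() if n / total >= min_support}
```

On HDFS-style logs, the common write and delete paths would cross the support threshold quickly, while rare or broken sessions fall outside every pattern and are forwarded to stage 2.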
Stage 1 - Frequent patterns (2): Modeling session duration time
• Assuming a Gaussian distribution?
  – The 99.95th-percentile estimate is off by half, causing 45% more false alarms
• Instead, use a mixture distribution: a histogram for the head + a power-law tail
[Figure: session-duration histogram (count vs. duration) and tail distribution Pr(X >= x) vs. duration]

Stage 2 - Handling noise with PCA detection
[Figure: pipeline – non-pattern events from stage 1 go to PCA detection; normal cases → OK, real anomalies → ERROR]
• More tolerant to noise
• Principal Component Analysis (PCA) based detection (a code sketch appears at the end of this deck)

Frequent pattern matching filters most of the normal events
[Figure: of 100% of events, 86% match frequent patterns (dominant cases, OK); the remaining 14% non-pattern events go to PCA detection, which marks 13.97% as normal and 0.03% as real anomalies]

Evaluation setup
• Hadoop Distributed File System (HDFS)
• Experiment on Amazon's EC2 cloud
  – 203 nodes × 48 hours, running standard MapReduce jobs
  – ~24 million lines of console logs
  – 575,000 traces (~680 distinct ones)
• Manual labels from previous work
  – Normal/abnormal, plus why a trace is abnormal
  – "Eventually normal" – labeling did not consider time
  – Used for evaluation only

Frequent patterns in HDFS
Frequent pattern             | 99.95th-percentile duration | % of messages
Allocate, begin write        | 13 sec                      | 20.3%
Done write, update metadata  | 8 sec                       | 44.6%
Delete                       | -                           | 12.5%
Serving block                | -                           | 3.8%
Read exception               | -                           | 3.2%
Verify block                 | -                           | 1.1%
Total                        |                             | 85.6%
(Total events ~20 million)
• The patterns cover most messages and have short durations

Detection latency
• Detection latency is dominated by the wait time
[Figure: latency distributions for frequent patterns (matched), single-event patterns, frequent patterns (timed out), and non-pattern events]

Detection accuracy (total traces = 575,319)
        | True positives | False positives | False negatives | Precision | Recall
Online  | 16,916         | 2,748           | 0               | 86.0%     | 100.0%
Offline | 16,808         | 1,746           | 108             | 90.6%     | 99.3%
• There is ambiguity in what counts as "abnormal": the manual labels mark traces that are "eventually normal"
• More than 600 false positives in online detection are due to very long latency, e.g., a write session that takes over 500 sec to complete (the 99.99th percentile is 20 sec)

Future work
• Distributed log stream processing – handle large-scale clusters and partial failures
• Clustering alarms
• Allowing feedback from operators
• Correlating logs from multiple applications / layers

Summary
[Figure: pipeline – free-text logs from 200 nodes (24 million lines) → parsing → frequent-pattern-based filtering (dominant cases → OK); non-pattern events → PCA detection (normal cases → OK, real anomalies → ERROR)]
• Online precision: 86.0%; recall: 100.0%
http://www.cs.berkeley.edu/~xuw/
Wei Xu <[email protected]>
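As a supplement to the stage-2 slides, here is a minimal Python/NumPy sketch of PCA-based detection on per-trace event-count vectors. The function names, the number of principal components, and the percentile-based threshold are illustrative assumptions; PCA detectors commonly threshold the squared prediction error with a Q-statistic, and the simple training-set percentile below merely stands in for such a threshold.

```python
# Minimal sketch of PCA-based anomaly detection on event-count vectors.
# The component count k and the percentile threshold are illustrative choices.
import numpy as np

def fit_pca_detector(X, k=4, quantile=99.5):
    """X: (n_traces, n_event_types) count matrix built from mostly-normal data."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal directions from the SVD of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    normal_subspace = Vt[:k].T               # top-k principal components
    # Squared prediction error: energy outside the normal subspace.
    residual = Xc - Xc @ normal_subspace @ normal_subspace.T
    spe = (residual ** 2).sum(axis=1)
    threshold = np.percentile(spe, quantile)
    return mean, normal_subspace, threshold

def is_anomalous(x, mean, normal_subspace, threshold):
    """Flag a single event-count vector whose residual energy exceeds the threshold."""
    xc = x - mean
    residual = xc - normal_subspace @ (normal_subspace.T @ xc)
    return float(residual @ residual) > threshold
```

In this setup the detector would be fit offline on mostly-normal historical traces; online, each session that stage 1 cannot match against a frequent pattern is turned into a count vector and tested as it is cut or times out.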