Reliability testing for Spark Streaming
… from POC to production
What does VideoAnalytics do? (slides 1–3: diagrams only)
Why Spark Streaming?
• It's a textbook example of the kind of app Spark Streaming was designed for.
• We need a blindingly fast dev cycle.
• We need to scale big.
• We need (built-in) resilience and fault tolerance.
Getting ready for production
• use the Kafka direct API
  https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
• activate Spark checkpoints
  http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
• move to Spark 1.4
• deploy it on YARN
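
A minimal Scala sketch of the first two steps, assuming a Kafka topic named "va-events", broker hostnames, and an HDFS checkpoint directory that are all illustrative; it uses the Spark 1.4-era KafkaUtils.createDirectStream plus the StreamingContext.getOrCreate recovery pattern from the linked docs:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object VaStreamingJob {
    // Hypothetical names/paths for illustration only.
    val checkpointDir = "hdfs:///va/checkpoints"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("va-processing")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // enable checkpointing

      val kafkaParams = Map("metadata.broker.list" -> "kafka01:9092,kafka02:9092")
      // Direct (receiver-less) stream: offsets are tracked by Spark itself.
      val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("va-events"))

      events.map(_._2).count().print() // stand-in for the real pipeline
      ssc
    }

    def main(args: Array[String]): Unit = {
      // On restart, rebuild the context from the checkpoint instead of from scratch.
      val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
      ssc.start()
      ssc.awaitTermination()
    }
  }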
Tooling for reliability testing
• Spark framework telemetry via the Spark UI
• Spark app telemetry via OpenTSDB metrics (Grafana as viewer)
• YARN/Spark REST APIs
• traffic shaping in Linux with tc
• automate everything with Ansible
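
A sketch of what pushing an app metric to OpenTSDB can look like, assuming OpenTSDB's standard HTTP /api/put endpoint; the host "tsdb01" and the metric name are placeholders, and a real pipeline may well use a client library instead:

  import java.net.{HttpURLConnection, URL}
  import java.nio.charset.StandardCharsets

  object TsdbReporter {
    // Hypothetical endpoint; OpenTSDB accepts JSON datapoints on /api/put.
    val putUrl = "http://tsdb01:4242/api/put"

    def report(metric: String, value: Long, tags: Map[String, String]): Unit = {
      val tagJson = tags.map { case (k, v) => s""""$k":"$v"""" }.mkString(",")
      val body = s"""{"metric":"$metric","timestamp":${System.currentTimeMillis / 1000},""" +
        s""""value":$value,"tags":{$tagJson}}"""
      val conn = new URL(putUrl).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
      conn.getResponseCode // 204 on success
      conn.disconnect()
    }
  }

  // e.g. once per micro-batch: TsdbReporter.report("va.sc_calls", scCalls, Map("host" -> "driver"))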
Test design
• define the test event
• define the relevant metrics
• identify the test's independent variable
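
One way to pin that structure down in code (a sketch, not the team's actual harness) is a small test descriptor:

  // Illustrative only: a descriptor tying together the three design elements.
  case class ReliabilityTest(
    name: String,                        // e.g. "TEST 001 - Killing Spark executors"
    event: String,                       // the fault to inject
    relevantMetrics: Seq[String],        // e.g. max processing delay, SC calls, recovery time
    independentVariable: Option[String]) // None for one-off events like a driver kill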
Test environment
• the VA-DUB1-DEV cluster
• 6 machines
  ◦ 6 executors / 4 cores / 2 GB per core
  ◦ 1 driver
• inject traffic from 6k simultaneous player sessions
TEST 001 - Killing Spark executors
• event: randomly kill Spark executor processes
• relevant metrics:
  ◦ max total processing delay per micro-batch
  ◦ number of SC calls
  ◦ recovery time
• test independent variable: number of executors being killed
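
A sketch of the kill step; the deck automates everything with Ansible, so this Scala/ssh version is illustrative, and the worker hostnames are placeholders. On YARN, executor JVMs run the class CoarseGrainedExecutorBackend, which gives pkill a handle:

  import scala.sys.process._
  import scala.util.Random

  object KillExecutors {
    // Hypothetical worker hostnames for illustration.
    val workers = Seq("hcd01", "hcd02", "hcd03", "hcd04", "hcd05", "hcd06")
      .map(_ + ".va.dub1.eur.adobe.com")

    def main(args: Array[String]): Unit = {
      val n = args.headOption.map(_.toInt).getOrElse(1) // the independent variable
      Random.shuffle(workers).take(n).foreach { host =>
        // Executor JVMs run org.apache.spark.executor.CoarseGrainedExecutorBackend.
        Seq("ssh", host, "pkill -9 -f CoarseGrainedExecutorBackend").!
      }
    }
  }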
BOOM!!!
• On driver
2015-07-17 09:52:18,446 [task-result-getter-1] WARN
org.apache.spark.scheduler.TaskSetManager - Lost task 4.0 in stage 708.2031 (TID 19426,
hcd07.va.dub1.eur.adobe.com): FetchFailed(null, shuffleId=103, mapId=-1, reduceId=4,
message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 103
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:389)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike
• On executors
2015-07-17 09:53:04,393 [Executor task launch worker-1] ERROR
org.apache.spark.MapOutputTracker - Missing an output location for shuffle 103
TEST 002 - Killing the Spark driver
• event: randomly kill the Spark driver process
• relevant metrics:
  ◦ max total processing delay per micro-batch
  ◦ number of SC calls
  ◦ recovery time
• test independent variable: N/A
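
In yarn-cluster mode the driver runs inside the YARN ApplicationMaster JVM, so the same pkill trick applies. A sketch, with a hypothetical AM host (in practice it would be looked up via the YARN REST API); recovery leans on YARN re-running the attempt and on the getOrCreate/checkpoint pattern shown earlier:

  import scala.sys.process._

  object KillDriver {
    def main(args: Array[String]): Unit = {
      // amHost is a placeholder for the node currently hosting the ApplicationMaster.
      val amHost = "hcd03.va.dub1.eur.adobe.com"
      Seq("ssh", amHost, "pkill -9 -f org.apache.spark.deploy.yarn.ApplicationMaster").!
      // Recovery: YARN restarts the AM (up to its configured max attempts) and
      // StreamingContext.getOrCreate restores state from the checkpoint directory.
    }
  }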
TEST 003 - Brown-out: packet loss
• event: apply a certain rate of packet loss on the ports used by Spark executor processes
• relevant metrics:
  ◦ max total processing delay per micro-batch
  ◦ number of SC calls
  ◦ recovery time
• test independent variable: packet loss rate (in %)
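
The deck injects these network faults with Linux tc; here is a sketch of the corresponding netem disciplines, wrapped in Scala to match the other examples (the interface, host, and rates are placeholders, and the real runs drive this via Ansible). Note this simplified version shapes the whole interface; restricting it to the executors' ports needs additional tc filters, omitted here. The same pattern covers the corruption, delay, and duplication tests that follow:

  import scala.sys.process._

  object Netem {
    val iface = "eth0" // placeholder interface

    // netem specs for TESTs 003-006 (values are the independent variables):
    //   loss:        "loss 5%"
    //   corruption:  "corrupt 5%"
    //   delay:       "delay 200ms"
    //   duplication: "duplicate 5%"
    def apply(spec: String): Int =
      Seq("ssh", "hcd01.va.dub1.eur.adobe.com",
          s"sudo tc qdisc add dev $iface root netem $spec").!

    def clear(): Int =
      Seq("ssh", "hcd01.va.dub1.eur.adobe.com",
          s"sudo tc qdisc del dev $iface root netem").!
  }

  // e.g. Netem("loss 5%"); run the workload; Netem.clear()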
TEST 004 - Brown-out: packet corruption
• event: apply a certain rate of packet corruption on the ports used by Spark executor processes
• relevant metrics:
  ◦ max total processing delay per micro-batch
  ◦ number of SC calls
  ◦ recovery time
• test independent variable: packet corruption rate (in %)
TEST 005 - Congestion: packet delay
• event: apply a certain rate of packet delays on the ports used by Spark executor processes
• relevant metrics:
  ◦ max total processing delay per micro-batch
  ◦ number of SC calls
  ◦ recovery time
• test independent variable: packet delay (in ms)
TEST 006 - Congestion: packet duplication
• event: apply a certain rate of packet duplication on the ports used by Spark executor processes
• relevant metrics:
  ◦ max total processing delay per micro-batch
  ◦ number of SC calls
  ◦ recovery time
• test independent variable: packet duplication rate (in %)
Monitoring and alerting
• monitoring checks must:
  ◦ be necessary (avoid redundant checks)
  ◦ be based on well-defined, relevant metrics
  ◦ have well-chosen threshold values
• we use Nagios/Opsview
What we monitor
• average job duration
• number of active sessions
• number of completed sessions
• failed task rate (in %)
• number of SC calls (video-completes)
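
A sketch of what one such check can look like, assuming the Spark 1.4 monitoring REST API (/api/v1) and Nagios exit-code conventions (0 OK, 1 WARNING, 2 CRITICAL); the endpoint host, the <app-id> placeholder, and the thresholds are all illustrative:

  import scala.io.Source

  object CheckFailedTaskRate {
    // Placeholders: real values come from the check's configuration.
    val stagesUrl = "http://hcd01.va.dub1.eur.adobe.com:4040/api/v1/applications/<app-id>/stages"
    val warnPct = 1.0
    val critPct = 5.0

    def main(args: Array[String]): Unit = {
      val json = Source.fromURL(stagesUrl).mkString
      // Crude field extraction to keep the sketch dependency-free;
      // a real check would use a JSON library.
      def sum(field: String): Long = {
        val re = ("\"" + field + "\"\\s*:\\s*(\\d+)").r
        re.findAllMatchIn(json).map(_.group(1).toLong).sum
      }
      val failed = sum("numFailedTasks")
      val complete = sum("numCompleteTasks")
      val pct = if (failed + complete == 0) 0.0 else 100.0 * failed / (failed + complete)

      val (code, label) =
        if (pct >= critPct) (2, "CRITICAL")
        else if (pct >= warnPct) (1, "WARNING")
        else (0, "OK")
      println(f"$label - failed task rate $pct%.2f%%")
      sys.exit(code)
    }
  }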
Q&A
• reliability testing: https://git.corp.adobe.com/primetime/va-processing/wiki/Reliability-testing-for-the-Spark-processing-pipeline
• cluster configuration: https://git.corp.adobe.com/primetime/va-processing/wiki/Baseline-configuration-for-the-Spark-job
• cluster monitoring and alerting:
  ◦ https://git.corp.adobe.com/primetime/va-processing/wiki/Monitoring-the-Spark-processing-pipeline
  ◦ https://git.corp.adobe.com/primetime/va-processing/wiki/Implementing-monitoring-checks-for-the-Spark-processing-pipeline