Reliability testing for Spark Streaming

… from POC to production
What does VideoAnalytics do? (1)
What does VideoAnalytics do? (2)
What does VideoAnalytics do? (3)
Why Spark Streaming?
It's a textbook example of the type of application Spark Streaming was designed for.
We need a blindingly fast dev cycle.
We need to scale big.
We need (built-in) resilience and fault tolerance.
Getting ready for production
• use Kafka direct APIs
• activate Spark checkpoints
• move to Spark 1.4
• deploy it on YARN
Tooling for reliability testing
Spark framework telemetrics via Spark-UI
Spark app. telemetrics via OpenTSDB metrics (grafana as
traffic shaping in Linux with TC
automate everything with Ansible.
Test design
define the test event
define the relevant metrics
identify the test's independent variable
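The three design steps above can be captured in a small record so that every test is described the same way. This is only a sketch: the `ReliabilityTest` class, its field names and the `describe` helper are hypothetical and not part of the original tooling. The example values are taken from the executor-kill test in this deck.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReliabilityTest:
    """One reliability test: the event, the metrics, the variable."""
    test_id: str
    event: str
    metrics: Tuple[str, ...]
    # None when the test has no independent variable (N/A).
    independent_variable: Optional[str] = None


# Metrics shared by the tests in this deck.
COMMON_METRICS = (
    "max total processing delay per micro-batch",
    "number of SC calls",
    "recovery time",
)

TEST_001 = ReliabilityTest(
    test_id="TEST 001",
    event="randomly kill Spark executor processes",
    metrics=COMMON_METRICS,
    independent_variable="number of executors being killed",
)


def describe(test: ReliabilityTest) -> str:
    """Return a one-line summary of a test definition."""
    variable = test.independent_variable or "N/A"
    return "%s: %s (variable: %s)" % (test.test_id, test.event, variable)


if __name__ == "__main__":
    print(describe(TEST_001))
```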
Test environment
the VA-DUB1-DEV cluster
6 machines
 6 executors / 4 cores / 2 GB per core
 1 driver
inject traffic from 6k simultaneous player sessions
TEST 001 - Killing Spark executors
event: randomly kill Spark executor processes
relevant metrics:
 max total processing delay per micro-batch
 number of SC calls
 recovery time
test independent variable: number of executors being killed
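A minimal sketch of how the kill event could be driven, assuming the executor processes are already known as (host, pid) pairs. The helper names, the host names and the use of `ssh` with `kill -9` are assumptions for illustration, not the actual test harness. The independent variable maps to the `count` argument.

```python
import random


def pick_victims(executors, count, rng=None):
    """Pick `count` distinct executors at random from (host, pid) pairs."""
    rng = rng or random.Random()
    if count > len(executors):
        raise ValueError("cannot kill more executors than are running")
    return rng.sample(executors, count)


def kill_command(host, pid):
    """Build (but do not run) the command that kills one executor process."""
    return ["ssh", host, "kill", "-9", str(pid)]


if __name__ == "__main__":
    # Hypothetical inventory. On a worker host, executor PIDs can be found
    # with something like `pgrep -f CoarseGrainedExecutorBackend`.
    executors = [("worker%d" % i, 4000 + i) for i in range(1, 7)]
    for host, pid in pick_victims(executors, 2, random.Random(42)):
        print(" ".join(kill_command(host, pid)))
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing results across values of the independent variable.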
On driver
2015-07-17 09:52:18,446 [task-result-getter-1] WARN
org.apache.spark.scheduler.TaskSetManager - Lost task 4.0 in stage 708.2031 (TID 19426, FetchFailed(null, shuffleId=103, mapId=-1, reduceId=4,
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for
shuffle 103
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike
On executors
2015-07-17 09:53:04,393 [Executor task launch worker-1] ERROR
org.apache.spark.MapOutputTracker - Missing an output location for shuffle 103
TEST 002 - Killing the Spark driver
event: randomly kill the Spark driver process
relevant metrics:
 max total processing delay per micro-batch
 number of SC calls
 recovery time
test independent variable: N/A
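The deck does not define how recovery time is computed, so the sketch below assumes one possible definition: the time from the event until the first micro-batch after which the total processing delay stays at or below a chosen threshold. The function names, the sample layout (timestamp in seconds, total delay in seconds) and the threshold value are all hypothetical.

```python
def max_total_delay(samples, event_time):
    """Largest total processing delay seen at or after the event."""
    after = [delay for ts, delay in samples if ts >= event_time]
    return max(after) if after else None


def recovery_time(samples, event_time, threshold):
    """Seconds from the event until delay is back under `threshold` for good.

    Returns 0.0 if the threshold was never exceeded, and None if the
    last sample is still above the threshold (not recovered yet).
    """
    after = [(ts, delay) for ts, delay in samples if ts >= event_time]
    last_bad = None
    for index, (_, delay) in enumerate(after):
        if delay > threshold:
            last_bad = index
    if last_bad is None:
        return 0.0
    if last_bad == len(after) - 1:
        return None
    return after[last_bad + 1][0] - event_time


if __name__ == "__main__":
    # Made-up samples: (timestamp_s, total_delay_s), event injected at t=15.
    samples = [(0.0, 1.0), (10.0, 1.2), (20.0, 9.0), (30.0, 14.5),
               (40.0, 6.0), (50.0, 1.5), (60.0, 1.1)]
    print(max_total_delay(samples, 15.0))
    print(recovery_time(samples, 15.0, 2.0))
```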
TEST 003 - Brown-out: packet loss
event: apply a certain rate of packet loss on the ports used by Spark executor processes
relevant metrics:
 max total processing delay per micro-batch
 number of SC calls
 recovery time
test independent variable: packet loss rate (in %)
TEST 004 - Brown-out: packet corruption
event: apply a certain rate of packet corruption on the ports used by Spark
executor processes
relevant metrics:
 max total processing delay per micro-batch
 number of SC calls
 recovery time
test independent variable: packet corruption rate (in %)
TEST 005 - Congestion: packet delay
event: apply a certain packet delay on the ports used by Spark
executor processes
relevant metrics:
 max total processing delay per micro-batch
 number of SC calls
 recovery time
test independent variable: packet delay (in ms)
TEST 006 - Congestion: packet duplication
event: apply a certain rate of packet duplication on the ports used by Spark
executor processes
relevant metrics:
 max total processing delay per micro-batch
 number of SC calls
 recovery time
test independent variable: packet duplication rate (in %)
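The four network tests (loss, corruption, delay, duplication) map onto the `loss`, `corrupt`, `delay` and `duplicate` options of the Linux `tc` netem qdisc. The sketch below only builds the command lines and does not run them. It follows the common recipe of a `prio` root qdisc, a netem qdisc on band 3 and a `u32` filter per destination port. The device name `eth0` and port `7078` are placeholders, and the actual commands used in these tests are not shown in the deck. Note that this shapes egress traffic only, and that with the default priomap some unmatched traffic may also land in band 3.

```python
# netem option name -> unit suffix for its value
NETEM_UNITS = {"loss": "%", "corrupt": "%", "duplicate": "%", "delay": "ms"}


def netem_spec(kind, value):
    """Return the netem arguments for one impairment, e.g. ['loss', '5%']."""
    if kind not in NETEM_UNITS:
        raise ValueError("unsupported impairment: %s" % kind)
    return [kind, "%g%s" % (value, NETEM_UNITS[kind])]


def shaping_commands(dev, ports, kind, value):
    """Build tc commands that impair traffic to the given destination ports."""
    commands = [
        # Root prio qdisc with the default three bands (1:1, 1:2, 1:3).
        ["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:", "prio"],
        # Attach netem to band 3 only.
        ["tc", "qdisc", "add", "dev", dev, "parent", "1:3",
         "handle", "30:", "netem"] + netem_spec(kind, value),
    ]
    for port in ports:
        # Steer packets for this destination port into the impaired band.
        commands.append(
            ["tc", "filter", "add", "dev", dev, "protocol", "ip",
             "parent", "1:0", "prio", "3", "u32",
             "match", "ip", "dport", str(port), "0xffff", "flowid", "1:3"])
    return commands


def reset_command(dev):
    """Build the tc command that removes all shaping from the device."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


if __name__ == "__main__":
    for cmd in shaping_commands("eth0", [7078], "loss", 5):
        print(" ".join(cmd))
    print(" ".join(reset_command("eth0")))
```

Sweeping the independent variable then amounts to calling `shaping_commands` with increasing values and running `reset_command` between runs.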
Monitoring and alerting
monitoring checks must:
 be necessary (avoid redundant checks)
 be based on well-defined relevant metrics
 have well chosen threshold values
we use Nagios/Opsview
What we monitor
average job duration
number of active sessions
number of completed sessions
failed tasks rate (in %)
number of SC calls (video-completes)
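As an illustration of a check with explicit thresholds, here is a sketch of a Nagios-style check for the failed tasks rate. It uses the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the usual `text | perfdata` output line. The function names, the way the task counts are obtained and the threshold values are assumptions, not the checks actually deployed.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def failed_tasks_rate(failed, total):
    """Failed tasks as a percentage of all tasks, or None if no tasks ran."""
    if total <= 0:
        return None
    return 100.0 * failed / total


def check_failed_tasks(failed, total, warn_pct, crit_pct):
    """Return (exit_code, message) for the failed tasks rate check."""
    rate = failed_tasks_rate(failed, total)
    if rate is None:
        return UNKNOWN, "UNKNOWN - no tasks observed"
    if rate >= crit_pct:
        status = CRITICAL
    elif rate >= warn_pct:
        status = WARNING
    else:
        status = OK
    message = "%s - failed tasks rate %.1f%% | failed_rate=%.1f%%;%g;%g" % (
        LABELS[status], rate, rate, warn_pct, crit_pct)
    return status, message


if __name__ == "__main__":
    # Hypothetical counts and thresholds, for illustration only.
    code, text = check_failed_tasks(failed=3, total=200,
                                    warn_pct=1.0, crit_pct=5.0)
    print(text)
    sys.exit(code)
```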
reliability testing:
cluster configuration:
cluster monitoring and alerting: