Reliability testing for Spark Streaming … from POC to production

What does VideoAnalytics do? (1)

What does VideoAnalytics do? (2)

What does VideoAnalytics do? (3)

Why Spark Streaming?
• It's a textbook example of the type of application Spark Streaming was designed for.
• We need a blindingly fast dev cycle.
• We need to scale big.
• We need (built-in) resilience and fault tolerance.

Getting ready for production
• use the Kafka direct APIs
  https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
• activate Spark checkpointing
  http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
• move to Spark 1.4
• deploy it on YARN

Tooling for reliability testing
• Spark framework telemetry via the Spark UI
• Spark application telemetry via OpenTSDB metrics (Grafana as viewer)
• YARN/Spark REST APIs
• traffic shaping in Linux with tc
• automate everything with Ansible

Test design
• define the test event
• define the relevant metrics
• identify the test's independent variable

Test environment
• the VA-DUB1-DEV cluster
• 6 machines: 6 executors / 4 cores / 2 GB per core, 1 driver
• inject traffic from 6k simultaneous player sessions

TEST 001 - Killing Spark executors
• event: randomly kill Spark executor processes
• relevant metrics: max total processing delay per micro-batch, number of SC calls, recovery time
• test independent variable: number of executors being killed

BOOM!!!
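The TEST 001 kill event can be automated with a small script. A minimal sketch, assuming SSH access to the worker nodes; the host names, the KILL_COUNT/DRY_RUN variables, and the dry-run guard are illustrative assumptions, not the deck's actual tooling:

```shell
#!/usr/bin/env bash
# TEST 001 fault injector sketch: randomly pick KILL_COUNT executor hosts
# and kill the Spark executor JVM running there.
# Host names, SSH access and the DRY_RUN guard are assumptions for
# illustration; adapt them to your cluster.
set -euo pipefail

HOSTS=(hcd01 hcd02 hcd03 hcd04 hcd05 hcd06)  # assumed executor hosts
KILL_COUNT=${KILL_COUNT:-1}                  # the test's independent variable
DRY_RUN=${DRY_RUN:-1}                        # print commands instead of running them

# Shuffle the host list and take the first KILL_COUNT entries.
mapfile -t victims < <(printf '%s\n' "${HOSTS[@]}" | shuf -n "$KILL_COUNT")

for host in "${victims[@]}"; do
  # On YARN, Spark executors run as CoarseGrainedExecutorBackend JVMs.
  cmd="ssh $host pkill -9 -f CoarseGrainedExecutorBackend"
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $cmd"
  else
    $cmd
  fi
done
```

Run with DRY_RUN=0 only against a test cluster; this is exactly the kind of step worth wrapping in an Ansible playbook, as the tooling slide suggests. What such a kill looks like from the driver's side: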
• On the driver:
  2015-07-17 09:52:18,446 [task-result-getter-1] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 4.0 in stage 708.2031 (TID 19426, hcd07.va.dub1.eur.adobe.com): FetchFailed(null, shuffleId=103, mapId=-1, reduceId=4, message=
  org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 103
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:389)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike
• On the executors:
  2015-07-17 09:53:04,393 [Executor task launch worker-1] ERROR org.apache.spark.MapOutputTracker - Missing an output location for shuffle 103

TEST 002 - Killing the Spark driver
• event: randomly kill the Spark driver process
• relevant metrics: max total processing delay per micro-batch, number of SC calls, recovery time
• test independent variable: N/A

TEST 003 - Brown-out: packet loss
• event: apply a certain rate of packet loss on the ports used by Spark executor processes
• relevant metrics: max total processing delay per micro-batch, number of SC calls, recovery time
• test independent variable: packet loss rate (in %)

TEST 004 - Brown-out: packet corruption
• event: apply a certain rate of packet corruption on the ports used by Spark executor processes
• relevant metrics: max total processing delay per micro-batch, number of SC calls, recovery time
• test independent variable: packet corruption rate (in %)

TEST 005 - Congestion: packet delay
• event: apply a certain rate of packet delay on the ports used by Spark executor processes
• relevant metrics: max total processing delay per micro-batch, number of SC calls, recovery time
• test independent variable: packet delay (in ms)

TEST 006 - Congestion: packet duplication
• event: apply a certain rate of packet duplication on the
ports used by Spark executor processes
• relevant metrics: max total processing delay per micro-batch, number of SC calls, recovery time
• test independent variable: packet duplication rate (in %)

Monitoring and alerting
• monitoring checks must:
  - be necessary (avoid redundant checks)
  - be based on well-defined, relevant metrics
  - have well-chosen threshold values
• we use Nagios/Opsview

What we monitor
• average job duration
• number of active sessions
• number of completed sessions
• failed-task rate (in %)
• number of SC calls (video-completes)

Q&A
• reliability testing:
  https://git.corp.adobe.com/primetime/va-processing/wiki/Reliability-testing-for-the-Spark-processing-pipeline
• cluster configuration:
  https://git.corp.adobe.com/primetime/va-processing/wiki/Baseline-configuration-for-the-Spark-job
• cluster monitoring and alerting:
  https://git.corp.adobe.com/primetime/va-processing/wiki/Monitoring-the-Spark-processing-pipeline
  https://git.corp.adobe.com/primetime/va-processing/wiki/Implementing-monitoring-checks-for-the-Spark-processing-pipeline
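Appendix: the brown-out and congestion events of TESTs 003 through 006 map onto Linux tc/netem rules. A sketch with assumed interface name and rates; real runs need root, and these rules shape the whole interface, whereas restricting them to the executors' ports would additionally need a classful qdisc plus tc filters (omitted here):

```shell
#!/usr/bin/env bash
# Sketch of the tc/netem commands behind TESTs 003-006.
# IFACE, the rates, and the DRY_RUN guard are assumptions for illustration.
set -euo pipefail

IFACE=${IFACE:-eth0}
DRY_RUN=${DRY_RUN:-1}  # print the commands instead of applying them

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "[dry-run] tc $*"; else tc "$@"; fi
}

run qdisc add    dev "$IFACE" root netem loss 5%       # TEST 003: packet loss rate
run qdisc change dev "$IFACE" root netem corrupt 2%    # TEST 004: packet corruption rate
run qdisc change dev "$IFACE" root netem delay 200ms   # TEST 005: packet delay
run qdisc change dev "$IFACE" root netem duplicate 1%  # TEST 006: packet duplication rate
run qdisc del    dev "$IFACE" root                     # clean up after the test run
```

Set DRY_RUN=0 to apply the rules for real; the final `tc qdisc del … root` restores the interface between test runs, which keeps each test's independent variable isolated.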