
All Things Open - Spark & Storm - Where & When?






Slide 0

Spark & Storm: When & Where?


Slide 1

Mammoth Data, based in downtown Durham (right above Toast)
The Leader in Big Data Consulting
● BI/Data Strategy
  ○ Development of a business intelligence/data architecture strategy.
● Installation
  ○ Installation of Hadoop or relevant technology.
● Data Consolidation
  ○ Load data from diverse sources into a single scalable repository.
● Streaming
  ○ Mammoth will write ingestion and/or analytics which operate on the data as it comes in, as well as design dashboards, feeds or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
  ○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho, etc.). We will also create initial reports and provide training to the employees who will analyze the data.
www.mammothdata.com | @mammothdataco


Slide 2

Me!
● Lead Consultant on all things DevOps and Spark
● @carsondial on Twitter


Slide 3

What This Talk Is About
● Quick overview of Spark Streaming
● Reasons why Spark Streaming can be tricky in practice
● Performance and tuning tips we’ve learnt over the past two years
● …and when to pack it all in and use Storm instead


Slide 4

This IS WEB SCALE!


Slide 5

Beyond Web Scale
● I kid, Rails!
● (mostly)


Slide 6

Beyond Web Scale
● Spark & Storm - millions of requests / second on commodity hardware
● Different problems at different scales!


Slide 7

Spark
● Directed Acyclic Graph Data Processing Engine
● Based around the Resilient Distributed Dataset (RDD) primitive


Slide 8

Spark Streaming: Overview


Slide 9

Spark Streaming: In Production?
● Yes!
● (Alibaba, AutoTrader, Cisco, Netflix, etc.)


Slide 10

Spark Streaming: Overview
● Streaming by running batches very quickly!
● Batch length: can be as low as 0.5s / batch
● Every X seconds, get Y records (a DStream is a sequence of RDDs)
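
The "every X seconds, get Y records" idea can be modelled in a few lines of plain Scala. This is only a toy sketch, not Spark's API: `Record`, `toBatches` and the 500 ms interval are names and values assumed for illustration.

```scala
// Toy model of micro-batching. This is NOT Spark's API, only the idea behind it.
object MicroBatchSketch {
  // A record tagged with its arrival time in milliseconds (hypothetical type).
  final case class Record(arrivalMs: Long, payload: String)

  // Group records into consecutive windows of `batchMs` milliseconds.
  // Returns (batch index, records in that batch), ordered by batch index.
  def toBatches(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] = {
    require(batchMs > 0, "batch interval must be positive")
    records
      .groupBy(r => r.arrivalMs / batchMs)
      .toSeq
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(
      Record(100, "a"), Record(450, "b"),   // falls in batch 0
      Record(700, "c"),                     // falls in batch 1
      Record(1200, "d"), Record(1499, "e")  // falls in batch 2
    )
    // Prints: batch 0: a,b / batch 1: c / batch 2: d,e (one per line)
    toBatches(input, batchMs = 500).foreach { case (idx, batch) =>
      println(s"batch $idx: ${batch.map(_.payload).mkString(",")}")
    }
  }
}
```

Each batch here plays the role of one RDD in the DStream: nothing is processed until its window closes, which is why latency cannot drop below the batch length.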


Slide 11

Spark Streaming: Good Things
● Using same implementation (mostly) for batch and stream processing (Lambda Architecture hipster points ahoy!)
● Access to rest of Spark - DataFrames, MLlib, GraphX, etc.


Slide 12

Spark Streaming: Bad Things!
● What happens if you can’t process Y records in X seconds?
● What happens if you require sub-second latency?


Slide 13

Spark Streaming: I’m so sorry.


Slide 14

Spark Streaming: Bad Things!
● What happens if you can’t process Y records in X seconds?
● Data builds up in executors
● Executors run out of memory…
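
A rough way to see why the data builds up: if each batch interval brings in more records than the job can clear in that interval, the leftover carries into the next one. The sketch below is a back-of-the-envelope model in plain Scala with made-up rates, not a measurement of any real cluster.

```scala
// Toy backlog model: fixed arrivals and fixed capacity per batch interval.
// All numbers are hypothetical.
object BacklogSketch {
  // Records still waiting after each of `batches` intervals.
  def backlog(arrivalsPerBatch: Long, capacityPerBatch: Long, batches: Int): Seq[Long] =
    (1 to batches)
      .scanLeft(0L) { (pending, _) =>
        // Whatever was pending plus new arrivals, minus what we managed to process.
        math.max(0L, pending + arrivalsPerBatch - capacityPerBatch)
      }
      .tail // drop the initial 0 so the result has one entry per batch

  def main(args: Array[String]): Unit = {
    // Arrivals exceed capacity: prints 200, 400, 600, 800, 1000
    println(backlog(arrivalsPerBatch = 1200, capacityPerBatch = 1000, batches = 5).mkString(", "))
    // Capacity exceeds arrivals: prints 0, 0, 0, 0, 0
    println(backlog(arrivalsPerBatch = 800, capacityPerBatch = 1000, batches = 5).mkString(", "))
  }
}
```

The backlog grows without bound in the first case, and in a receiver-based job that backlog sits in executor memory.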


Slide 15

Spark Streaming: Bad Things!
● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”


Slide 16

Spark Streaming: It Will Be Okay


Slide 17

Spark Streaming: Bad Things!
● As a former Ops person:
● WE WILL REMEMBER.


Slide 18

Spark Streaming: Tuning
● Do you need low latency?
● If so, a 10-minute nap is advisable!
● Everybody else, let’s dive in…


Slide 19

Spark Streaming: Tuning


Slide 20

Spark Streaming: Down In The Hole


Slide 21

Spark Streaming: Down In The Hole


Slide 22

Spark Streaming: Down In The Hole
● Easiest method: alter the batch window until it’s all fine!
● Tiny batches provide tight execution times!


Slide 23

Spark Streaming: Tuning
● Use Kafka.
● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in 1.3+)
● (other sources get the features…eventually)


Slide 24

Spark Streaming: Tuning
● Use Scala.
● CPython = slower in execution
● PyPy is much faster…but…
● New features always come to Scala first.


Slide 25

Spark Streaming: Tuning
● (or Java if you really must)


Slide 26

Spark Streaming: Cores
● Spark Streaming = data receivers + Spark
● spark.cores.max = x * number of receivers
● For Great Data Locality and Parallelism!
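
As a worked example of the rule of thumb above, here is a tiny helper in plain Scala. The slide leaves x unspecified, so `multiplier` and the sample values (4 receivers, x = 3) are assumptions for illustration only. The one firm constraint is that a receiver occupies a core for itself, so the total must exceed the number of receivers or nothing is left to process the data.

```scala
// Worked arithmetic for: spark.cores.max = x * number of receivers.
object CoresSketch {
  // `multiplier` is the slide's "x" (hypothetical parameter name).
  def coresMax(receivers: Int, multiplier: Int): Int = {
    require(receivers > 0, "need at least one receiver")
    // Each receiver pins one core, so x = 1 would leave no cores for processing.
    require(multiplier > 1, "multiplier must leave cores free for processing")
    receivers * multiplier
  }

  def main(args: Array[String]): Unit =
    // Prints: spark.cores.max=12
    println(s"spark.cores.max=${coresMax(receivers = 4, multiplier = 3)}")
}
```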


Slide 27

Spark Streaming: Caching
● Are you using a foreachRDD loop?
    dstream.foreachRDD { rdd =>
      rdd.cache()
      …
      rdd.unpersist()
    }


Slide 28

Spark Streaming: Caching
● If routing to multiple stores / iterating over an RDD multiple times, using cache() is a quick win
● It really shouldn’t work so well…
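
Why cache() helps when an RDD is traversed more than once can be shown with an analogy in plain Scala collections. This is not Spark code: a lazy `view` stands in for an uncached RDD (recomputed on every pass) and `toList` stands in for cache() (computed once, then reused). The names are invented for the sketch.

```scala
// Analogy only: a Scala view is lazy and keeps no results, like an uncached RDD.
object CacheSketch {
  // Counts how often the "expensive" parse step runs when the same data
  // is traversed twice, with and without materialising it first.
  def parseCalls(materialise: Boolean): Int = {
    var calls = 0
    val parsed = Seq("1", "2", "3").view.map { s => calls += 1; s.toInt }

    // Materialising computes every element once and keeps the results.
    val data: Iterable[Int] = if (materialise) parsed.toList else parsed

    val total = data.sum    // first pass, e.g. write to store A
    val largest = data.max  // second pass, e.g. write to store B
    require(total == 6 && largest == 3)
    calls
  }

  def main(args: Array[String]): Unit = {
    println(parseCalls(materialise = false)) // prints 6: parsed twice per element
    println(parseCalls(materialise = true))  // prints 3: parsed once per element
  }
}
```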


Slide 29

Spark Streaming: Backpressure
● Hurrah for Spark 1.5!
● spark.streaming.backpressure.enabled = true
● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)
● Works for all data sources (for once!)
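
One way to pass the setting is as `--conf key=value` arguments to spark-submit. The plain-Scala sketch below only builds those arguments. The optional `spark.streaming.kafka.maxRatePerPartition` ceiling is a real Spark setting, but its value here is a placeholder and not a recommendation.

```scala
// Builds spark-submit arguments for the backpressure setting named on the slide.
object BackpressureConfSketch {
  val settings: Seq[(String, String)] = Seq(
    "spark.streaming.backpressure.enabled" -> "true",
    // Optional upper bound for the Kafka direct stream; 10000 is an assumed value.
    "spark.streaming.kafka.maxRatePerPartition" -> "10000"
  )

  // Render each pair as: --conf key=value
  def submitArgs: Seq[String] =
    settings.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }

  def main(args: Array[String]): Unit =
    println(submitArgs.mkString(" "))
}
```

With Spark on the classpath, the same pairs could instead be applied with SparkConf.set(key, value) before the StreamingContext is created.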


Slide 30

Storm
● I really need that low-latency response!


Slide 31

Storm
● Directed Acyclic Graph Data Processing Engine


Slide 32

Spark
“Very Good, Sir”


Slide 33

Storm
“Here you go!”


Slide 34

Storm Concepts
● Stream of tuples
● Bolts
● Spouts
● Topologies


Slide 35

Storm: Streams
● Unbounded stream of tuples
● Tuples are defined via schema (usual base types plus custom serializers)


Slide 36

Storm: Spouts
● Sources of tuples in a topology
● Read from external sources (e.g. Kafka) and emit them
● Can emit multiple streams from a spout!


Slide 37

Storm: Bolts
● Where your processing happens
● Roll your own aggregations / filtering / windowing
● Bolts can feed into other bolts
● Potentially easier to test than Spark Streaming
● Many Bolt connectors for external sources (e.g. Cassandra, Redis, Hive, etc.)


Slide 38

Storm: Topologies
● The DAG of the spouts and bolts
● Built programmatically in code and submitted to the Storm cluster
● Flux - Do It In YAML (and then complain about whitespace)
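
To make the spout, bolt and topology vocabulary concrete, here is a toy word count in plain Scala. It deliberately does not use Storm's API: the spout is just an `Iterator`, a bolt is just a function from one tuple to zero or more tuples, and all names are invented for the sketch. What it does share with Storm is the shape: each tuple flows through the whole chain as it arrives, with no batching.

```scala
// Toy spout -> bolt -> bolt pipeline. NOT Storm's API, only the concepts.
object TopologySketch {
  // A bolt consumes one tuple and emits zero or more tuples.
  type Bolt[A, B] = A => Seq[B]

  // Bolt 1: split a sentence into words (one tuple in, many out).
  val splitBolt: Bolt[String, String] =
    line => line.split("\\s+").filter(_.nonEmpty).toSeq

  // Bolt 2: keep a running count per word and emit the updated count.
  final class CountBolt extends (String => Seq[(String, Int)]) {
    private val counts =
      scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

    def apply(word: String): Seq[(String, Int)] = {
      counts(word) += 1
      Seq(word -> counts(word))
    }
  }

  // "Topology": wire the spout into the split bolt, then into the count bolt.
  def run(spout: Iterator[String]): Seq[(String, Int)] = {
    val countBolt = new CountBolt
    spout.flatMap(splitBolt).flatMap(countBolt).toList
  }

  def main(args: Array[String]): Unit =
    // Emits (to,1) (be,1) (or,1) (not,1) (to,2) (be,2)
    run(Iterator("to be", "or not to be")).foreach(println)
}
```

The "roll your own aggregations" point from the Bolts slide shows up in `CountBolt`: the running state is the bolt author's problem, not the framework's.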


Slide 39

Storm: Tasks
● Each bolt or spout runs 'tasks' across the cluster
● How parallelism works in Storm
● Set in topology submission


Slide 40

Storm: Workers
● Where the topology runs
● 1 worker = 1 JVM
● Tasks run as threads on a worker
● Storm distributes tasks evenly across cluster
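
The relationship between tasks and workers can be pictured with a simple round-robin assignment. This is a simplification in plain Scala, assumed for illustration, and not Storm's actual scheduler. It only shows what "evenly across the cluster" means for 7 tasks on 3 workers.

```scala
// Simplified picture of spreading tasks over workers (one worker = one JVM).
// NOT Storm's scheduler; round-robin is an assumption made for the sketch.
object WorkerSketch {
  // Maps worker index -> ids of the tasks that would run as threads on it.
  def assign(tasks: Int, workers: Int): Map[Int, Seq[Int]] = {
    require(tasks >= 0 && workers > 0, "need a non-negative task count and at least one worker")
    (0 until tasks).groupBy(_ % workers)
  }

  def main(args: Array[String]): Unit =
    // Prints: worker 0: tasks 0,3,6 / worker 1: tasks 1,4 / worker 2: tasks 2,5
    assign(tasks = 7, workers = 3).toSeq.sortBy(_._1).foreach { case (worker, ts) =>
      println(s"worker $worker: tasks ${ts.mkString(",")}")
    }
}
```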


Slide 41

Storm: Good Things
● True Streaming
● Tuples processed as they enter topology - low latency
● Scales far beyond Spark Streaming (currently)


Slide 42

Storm: Good Things
● Battle-tested at Twitter & Yahoo!
● Yahoo! has 300-node clusters and is working to support 1000+ nodes
● Single node clocked at over 1.5m tuples / second at Twitter


Slide 43

Storm: Bad Things
● Very DIY (bring your own aggregations, ML, etc.)
● Your DAG construction may not be optimal
● Operationally more complex (and Storm WebUI is more primitive)
● Where’s Me REPL?


Slide 44

Spark or Storm?


Slide 45

Spark or Storm?
● SLA on latency?


Slide 46

Spark or Storm?
● Storm!
● (though simply because it’s possible doesn’t mean you’ll get it!)


Slide 47

Spark or Storm?
● Insane data needs (e.g. ~100m records/second?)


Slide 48

Spark or Storm?
● Storm!
● (though, again, it’s not a magic bullet!)


Slide 49

Spark or Storm?
● For almost anything else? Spark.
● High-level vs. Low-level
● Each new version of Spark delivers improvements!


Slide 50

Other Listing Magazines Are Available
● Other frameworks that show promise:
  ○ Flink
  ○ Apex
  ○ Samza
  ○ Heron (Twitter’s not-public Storm replacement)


Slide 51

Questions?

