
New Developments in Spark






Slide 0

New Developments in Spark
Matei Zaharia
August 18th, 2015


Slide 1

About Databricks
Founded by the creators of Spark in 2013; remains the top contributor
End-to-end service for Spark on EC2
• Interactive notebooks, dashboards, and production jobs


Slide 2

Our Goal for Spark
Unified engine across data workloads and platforms
[Diagram: Streaming, SQL, ML, Graph, Batch, … as workloads on one engine spanning many platforms]


Slide 3

Past 2 Years
Fast growth in libraries and integration points
• New library for SQL + DataFrames
• 10x growth of ML library
• Pluggable data source API
• R language support
Result: very diverse use of Spark
• Only 40% of users run on Hadoop YARN
• Most users use at least 2 of Spark’s built-in libraries
• 98% of Databricks customers use SQL, 60% use Python


Slide 4

Beyond Libraries
Best thing about basing Spark’s libraries on a high-level API is that we can also make big changes underneath them
Now working on some of the largest changes to Spark Core since the project began


Slide 5

This Talk
• Project Tungsten: CPU and memory efficiency
• Network and disk I/O
• Adaptive query execution


Slide 6

Hardware Trends
Storage
Network
CPU


Slide 7

Hardware Trends
          2010
Storage   50+MB/s (HDD)
Network   1Gbps
CPU       ~3GHz


Slide 8

Hardware Trends
          2010            2015
Storage   50+MB/s (HDD)   500+MB/s (SSD)
Network   1Gbps           10Gbps
CPU       ~3GHz           ~3GHz


Slide 9

Hardware Trends
          2010            2015             Change
Storage   50+MB/s (HDD)   500+MB/s (SSD)   10x
Network   1Gbps           10Gbps           10x
CPU       ~3GHz           ~3GHz            (flat)


Slide 10

Tungsten: Preparing Spark for Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Off-heap memory management
(2) Runtime code generation
(3) Cache-aware algorithms


Slide 11

Interfaces to Tungsten
[Diagram: Spark SQL, DataFrames (Python, Java, Scala, R), RDDs, … produce a data schema + query plan, which runs on Tungsten backends: JVM, LLVM, GPU, NVRAM, …]


Slide 12

DataFrame API
Single-node tabular structure in R and Python, with APIs for:
• relational algebra (filter, join, …)
• math and stats
• input/output (CSV, JSON, …)
[Chart: Google Trends for “data frame”]


Slide 13

DataFrame: lingua franca for “small data”

head(flights)
#> Source: local data frame [6 x 16]
#>
#>   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1 2013     1   1      517         2      830        11      UA  N14228
#> 2 2013     1   1      533         4      850        20      UA  N24211
#> 3 2013     1   1      542         2      923        33      AA  N619AA
#> 4 2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...


Slide 14

Spark DataFrames
Structured data collections with similar API to R/Python
• DataFrame = RDD + schema
Capture many operations as expressions in a DSL
• Enables rich optimizations

df = jsonFile("tweets.json")
df(df("user") === "matei")
  .groupBy("date")
  .sum("retweets")

[Chart: running time in seconds (0–15) for the same query as a Python RDD, a Scala RDD, and a DataFrame]
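As a hedged illustration, here is a self-contained Scala version of the snippet above using the Spark 1.x DataFrame API; the tweets.json path and its user/date/retweets fields are assumptions carried over from the slide:

// Minimal sketch, assuming a local tweets.json with user, date, retweets fields.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TweetCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tweets").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read.json("tweets.json")
    df.filter(df("user") === "matei")   // captured as a DSL expression, not an opaque closure
      .groupBy("date")
      .sum("retweets")
      .show()

    sc.stop()
  }
}

Because the filter and aggregation are expressions rather than arbitrary functions, the optimizer can see through them, which is what enables the speedups in the chart.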


Slide 15

How does Tungsten help?


Slide 16

1. Off-Heap Memory Management
Store data outside JVM heap to avoid object overhead & GC
• For RDDs: fast serialization libraries
• For DataFrames & SQL: binary format we compute on directly
2-10x space saving, especially for strings, nested objects
Can use new RAM-like devices, e.g. flash, 3D XPoint
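To make the idea concrete, here is a minimal sketch of off-heap storage using sun.misc.Unsafe on JDK 8. It is illustrative only and not Spark's actual UnsafeRow format; the record layout and sizes are invented for the example:

// Fixed-width records stored off the JVM heap: no per-object headers, no GC.
import java.lang.reflect.Field

object OffHeapSketch {
  private val f: Field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
  f.setAccessible(true)
  private val unsafe = f.get(null).asInstanceOf[sun.misc.Unsafe]

  val recordSize = 16                            // two 8-byte long fields per record
  val numRecords = 1000
  val base: Long = unsafe.allocateMemory(numRecords.toLong * recordSize)

  def put(i: Int, key: Long, value: Long): Unit = {
    val addr = base + i.toLong * recordSize
    unsafe.putLong(addr, key)                    // write straight into raw memory
    unsafe.putLong(addr + 8, value)
  }

  def key(i: Int): Long = unsafe.getLong(base + i.toLong * recordSize)

  def main(args: Array[String]): Unit = {
    put(0, 42L, 7L)
    println(key(0))                              // prints 42
    unsafe.freeMemory(base)                      // lifetime is managed manually, not by GC
  }
}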


Slide 17

2. Runtime Code Generation
Generate Java code for DataFrame and SQL expressions requested by user
Avoids virtual calls and generics/boxing
Can do same in core, ML and graph
• Code-gen serializers, fused functions, math expressions
[Chart: evaluating “SELECT a+a+a”, time in seconds — interpreted projection: 36.7, code gen: 9.4, hand-written: 9.3]
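A minimal sketch of why this wins, using the “a+a+a” expression from the chart; the class names here are invented for illustration and are not Spark's internals:

// Interpreted evaluation walks an expression tree: a virtual call and
// boxing/unboxing at every node. Generated code fuses it to plain arithmetic.
object CodegenSketch {
  sealed trait Expr { def eval(row: Array[Any]): Any }
  case class Col(i: Int) extends Expr {
    def eval(row: Array[Any]): Any = row(i)                 // returns a boxed value
  }
  case class Add(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Any]): Any =                        // virtual calls + casts
      l.eval(row).asInstanceOf[Int] + r.eval(row).asInstanceOf[Int]
  }

  // Roughly what the generated Java boils down to: no virtual calls, no boxing.
  def generated(a: Int): Int = a + a + a

  def main(args: Array[String]): Unit = {
    val interpreted = Add(Add(Col(0), Col(0)), Col(0))
    println(interpreted.eval(Array[Any](7)))                // 21, via tree walk
    println(generated(7))                                   // 21, via straight-line code
  }
}

In the real system the generated code is emitted as Java source and compiled at runtime, which is how user queries get close to hand-written performance (9.4s vs. 9.3s in the chart).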


Slide 18

3. Cache-Aware Algorithms
Use custom memory layout to better leverage CPU cache
Example: AlphaSort-style prefix sort
• Store prefixes of sort key inside pointer array
• Compare prefixes to avoid full record fetches + comparisons
[Diagram: naïve layout (pointer → record) vs. cache-friendly layout (pointer + key prefix → record)]
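A small sketch of the technique, not Spark's actual sorter: the (prefix, index) pairs live in one compact array, so most comparisons never leave the cache, and full records are fetched only to break ties. The 8-byte ASCII packing is an assumption made for the example:

object PrefixSortSketch {
  // Pack the first 8 bytes of an ASCII key into a long, big-endian, so that
  // comparing the longs matches lexicographic order on those bytes.
  def prefixOf(key: String): Long =
    key.padTo(8, '\u0000').take(8).foldLeft(0L)((acc, c) => (acc << 8) | (c & 0xffL))

  def prefixSort(records: Array[String]): Array[String] = {
    val entries = records.indices.map(i => (prefixOf(records(i)), i))
    val sorted = entries.sortWith { (a, b) =>
      if (a._1 != b._1) a._1 < b._1        // fast path: compare prefixes only
      else records(a._2) < records(b._2)   // tie: fetch the full records
    }
    sorted.map(e => records(e._2)).toArray
  }

  def main(args: Array[String]): Unit = {
    println(prefixSort(Array("banana", "apple", "applesauce", "cherry")).mkString(", "))
  }
}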


Slide 19

Tungsten Performance Results
[Chart: run time in seconds (0–1200) vs. relative data set size (1x–16x) for Default, Code Gen, Tungsten on-heap, and Tungsten off-heap]


Slide 20

This Talk
• Project Tungsten: CPU and memory efficiency
• Network and disk I/O
• Adaptive query execution


Slide 21

Motivation
Network and storage speeds have improved 10x, but this speed isn’t always easy to leverage!
Many challenges with:
• Keeping disk operations large (even on SSDs)
• Keeping network connections busy & balanced across cluster
• Doing all this on many cores and many disks


Slide 22

Sort Benchmark
Started by Jim Gray in 1987 to measure HW+SW advances
• Many entrants use purpose-built hardware & software
Participated in largest category: Daytona GraySort
• Sort 100 TB of 100-byte records in a fault-tolerant manner
Set a new world record (tied with UCSD)
• Saturated 8 SSDs and 10 Gbps network / node
• 1st time public cloud + open source won


Slide 23

On-Disk Sort Record
Time to sort 100 TB:
• 2013 record: Hadoop, 2100 machines, 72 minutes
• 2014 record: Spark, 207 machines, 23 minutes
Also sorted 1 PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org


Slide 24

Saturating the Network
[Chart: per-node network throughput during the sort, sustained at 1.1 GB/sec per node]


Slide 25

This Talk
• Project Tungsten: CPU and memory efficiency
• Network and disk I/O
• Adaptive query execution


Slide 26

Motivation
Query planning is crucial to performance in a distributed setting
• Level of parallelism in operations
• Choice of algorithm (e.g. broadcast vs. shuffle join)
Hard to do well for big data even with cost-based optimization
• Unindexed data => don’t have statistics
• User-defined functions => hard to predict
Solution: let Spark change query plan adaptively


Slide 27

Traditional Spark Scheduling

file.map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortByKey()

[Diagram: map → reduce → sort stages]
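For reference, a hedged, self-contained version of this snippet in Scala; the slides never show how file is defined, so reading words from a hypothetical input.txt is an assumption here:

// Minimal runnable word count; input.txt is a stand-in path.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))
    val file = sc.textFile("input.txt").flatMap(_.split(" "))

    val counts = file.map(word => (word, 1))
      .reduceByKey(_ + _)   // first shuffle: aggregate a count per word
      .sortByKey()          // second shuffle: range-partition and sort by key

    counts.take(10).foreach(println)
    sc.stop()
  }
}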


Slide 28

Adaptive Planning

file.map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortByKey()

[Diagram: only the map stage has been launched]


Slide 30

Adaptive Planning

file.map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortByKey()

[Diagram: map stage complete; reduce stage added]


Slide 33

Adaptive Planning

file.map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortByKey()

[Diagram: map, reduce, and sort stages all planned]


Slide 34

Advanced Example: Join
Goal: Bring together data items with the same key


Slide 35

Advanced Example: Join
Goal: Bring together data items with the same key
Shuffle join (good if both datasets large)


Slide 36

Advanced Example: Join
Goal: Bring together data items with the same key
Broadcast join (good if top dataset small)


Slide 37

Advanced Example: Join
Goal: Bring together data items with the same key
Hybrid join (broadcast popular key, shuffle rest)


Slide 38

Advanced Example: Join
Goal: Bring together data items with the same key
Hybrid join (broadcast popular key, shuffle rest)
More details: SPARK-9850
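As a sketch of the runtime decision these slides describe (not Spark's planner API; all thresholds and names are invented for illustration):

// After the map stage runs, measured sizes and key frequencies are known,
// so the join strategy can be picked adaptively instead of up front.
object AdaptiveJoinSketch {
  sealed trait JoinStrategy
  case object ShuffleJoin extends JoinStrategy                      // both sides large
  case object BroadcastJoin extends JoinStrategy                    // one side small
  case class HybridJoin(hotKeys: Set[String]) extends JoinStrategy  // skew: broadcast hot keys

  def chooseJoin(leftBytes: Long, rightBytes: Long,
                 keyCounts: Map[String, Long]): JoinStrategy = {
    val broadcastThreshold = 10L << 20            // 10 MB, illustrative
    val hotKeyThreshold = 1000000L                // illustrative skew cutoff
    val hotKeys = keyCounts.collect { case (k, n) if n > hotKeyThreshold => k }.toSet

    if (math.min(leftBytes, rightBytes) < broadcastThreshold) BroadcastJoin
    else if (hotKeys.nonEmpty) HybridJoin(hotKeys) // broadcast popular keys, shuffle the rest
    else ShuffleJoin
  }

  def main(args: Array[String]): Unit = {
    println(chooseJoin(50L << 30, 2L << 20, Map.empty))               // BroadcastJoin
    println(chooseJoin(50L << 30, 40L << 30, Map("the" -> 5000000L))) // HybridJoin(Set(the))
  }
}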


Slide 39

Impact of Adaptive Planning
• Level of parallelism: 2-3x
• Choice of join algorithm: as much as 10x
Follow it at SPARK-9850


Slide 40

Effect of Optimizations in Core
Often, when we made one optimization, we saw all of the Spark components get faster
• Scheduler optimization for Spark Streaming => SQL 2x faster
• Network optimizations => speed up all comm-intensive libraries
• Tungsten => DataFrames, SQL and parts of ML
Same applies to other changes in core, e.g. debug tools


Slide 41

Conclusion
Spark has grown a lot, but it remains the most active open source project in big data
Small core + high-level API => can make changes quickly
New hardware => exciting optimizations at all levels


Slide 42

Learn More: sparkhub.databricks.com

