
Big Data Platform at Pinterest






Slide 0

Mao Ye: Big Data Platform at Pinterest


Slide 1


Slide 2

Data Architecture


Slide 3

Data at Pinterest
- 60 billion Pins
- 1 billion boards
- 100M MAU
- 60 PB of data on S3
- 3 PB processed every day
- 2000-node Hadoop cluster
- 250 engineers


Slide 4

Pinterest Data Architecture (diagram): App


Slide 5

Pinterest Data Architecture (diagram): App, events, Kafka, Secor, Singer


Slide 6

Pinterest Data Architecture (diagram): App, events, Kafka, Secor, Singer


Slide 7

Pinterest Data Architecture (diagram): App, events, Kafka, Secor, Skyline, Pinball, Redshift, Pinalytics, Features, Qubole (Hadoop), Singer


Slide 8

Design Choices for Hadoop Platform


Slide 9

Hadoop Platform Requirements
- Isolated multi-tenancy
- Elasticity
- Support multiple clusters
- Ephemeral clusters
- Access control layer
- Shared data store
- Easy deployment


Slide 10

Decoupling compute & storage
- Hadoop Cluster 1: transient HDFS
- Hadoop Cluster 2: transient HDFS
- S3: persistent store


Slide 11

Centralized Hive Metastore
- Pig, Cascading, Hive
- Hive Metastore (metadata)
- HDFS/S3 (data)


Slide 12

Multi-layered Packaging
- Runtime Staging (on S3): Mapreduce Jobs, Hadoop Jars/Libs, Job/User level Configs
- Automated Configuration (Masterless Puppet): Software Packages/Libs, Configs (OS/Hadoop), Misc Sys Admin
- Baked AMI: OS, Bootstrap Script, Core SW


Slide 13

Executor Abstraction Layer (diagram): Pinball, Dev Server, Executor, Qubole Managed Hadoop, EMR, Hive Metastore, HDFS/S3
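
The slide names an executor layer sitting between callers (Pinball, dev server) and the Hadoop backends (Qubole, EMR) but shows no interface. The following is a minimal sketch of what such a layer could look like. Every name in it (`Executor`, `QuboleExecutor`, `EmrExecutor`, `make_executor`, the `run` signature) is hypothetical, and the backends only return a string rather than calling any real service.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical common interface for running a job on a Hadoop backend."""

    @abstractmethod
    def run(self, job_name: str, command: str) -> str:
        """Submit a command and return a backend-specific description."""


class QuboleExecutor(Executor):
    def run(self, job_name: str, command: str) -> str:
        # Placeholder: a real implementation would call the Qubole service.
        return f"qubole:{job_name}:{command}"


class EmrExecutor(Executor):
    def run(self, job_name: str, command: str) -> str:
        # Placeholder: a real implementation would call EMR.
        return f"emr:{job_name}:{command}"


# Callers pick a backend by name and never depend on a concrete class.
EXECUTORS = {"qubole": QuboleExecutor, "emr": EmrExecutor}


def make_executor(kind: str) -> Executor:
    return EXECUTORS[kind]()


result = make_executor("qubole").run("daily_report", "hive -f report.hql")
```

With this shape, switching a workflow from one backend to the other is a one-word configuration change, which is presumably the point of the abstraction.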


Slide 14

Why Qubole?
- Hadoop & Spark as managed services
- Tight integration with Hive
- Graceful cluster scaling
- API for simplified executor abstraction
- Advanced support for spot instances
- Baked AMI customization


Slide 15

Pinball for Workflow Management


Slide 16

Scale of Processing
- Scale: 60 billion Pins; hundreds of workflows; thousands of jobs; 500+ jobs in a workflow; 3 petabytes processed daily
- Support: Hadoop, Cascading, Hive, Spark …
- Diagram labels: job, workflow


Slide 17

Why Pinball?
Requirements:
- Simple abstractions
- Extensible in future
- Reliable stateless computing
- Easy to debug
- Scales horizontally
- Can be upgraded without aborting workflows
- Rich features like auto-retries, per-job emails, overrun policies…
Options:
- Apache Oozie, Azkaban, Luigi


Slide 18

Pinball Design (diagram): Master, Worker, Scheduler, Command Line, Clients, UI


Slide 19

Workflow Model
- Workflow: a directed graph of nodes called jobs
- Node: a job
- Edge: a run-after dependence
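
To make the workflow model concrete, here is a small sketch that stores run-after edges as a mapping from each job to the jobs it must run after, then derives a valid execution order. The job names are invented for illustration, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever Pinball does internally.

```python
from graphlib import TopologicalSorter

# Each key is a job (node); its value is the set of jobs it runs after (edges).
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def run_order(workflow):
    """Return one job order that respects every run-after dependence."""
    return list(TopologicalSorter(workflow).static_order())


order = run_order(workflow)
```

Here `extract` always precedes `transform`, which precedes both `load` and `report`. A graph containing a cycle raises `graphlib.CycleError`.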


Slide 20

Job State
- Job state is captured in a token
- Tokens are named hierarchically

Example job token (held by the master):
  version: 123
  name: /workflow/w1/job
  owner: worker_0
  expiration: 1234567
  data: JobTemplate(....)
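
The token fields on this slide map naturally onto a small record type. The sketch below is illustrative only: the `Token` class and `tokens_under` helper are hypothetical names, the `data` field is kept as an opaque string, and the extra token names are made up. It shows one thing hierarchical names make easy, which is selecting every token under a path prefix.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    name: str                         # hierarchical, e.g. /workflow/w1/job
    version: int = 0
    owner: Optional[str] = None       # worker currently holding the token
    expiration: Optional[int] = None  # when the owner's claim lapses
    data: Optional[str] = None        # opaque payload (a job template on the slide)


def tokens_under(tokens: List[Token], prefix: str) -> List[Token]:
    """Return tokens whose name lies under the given path prefix."""
    if not prefix.endswith("/"):
        prefix += "/"  # so /workflow/w1 does not match /workflow/w10
    return [t for t in tokens if t.name.startswith(prefix)]


tokens = [
    Token(name="/workflow/w1/job", version=123, owner="worker_0", expiration=1234567),
    Token(name="/workflow/w10/job"),
    Token(name="/workflow/w1/other_job"),
]
w1_names = [t.name for t in tokens_under(tokens, "/workflow/w1")]
```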


Slide 21

Job State Machine
States: RUNNABLE, RUNNING, WAITING


Slide 22

Master Worker Interaction
- Master keeps the state
- Workers claim and execute tasks
- Horizontally scalable
- Diagram (Worker, Master, Persistent Store): 1: request, 2: update, 3: ack


Slide 23

Master
- Entire state is kept in memory
- Each state update is synchronously persisted before the master replies to the client
- Master runs on a single thread, so there are no concurrency issues
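
The three properties above, together with the request / update / ack sequence, can be sketched as a toy master. This is not Pinball's implementation: the class, its methods, the JSON file used as a persistent store, and the lease rule (an expired claim may be taken over by another worker) are all assumptions made for illustration.

```python
import json
import os
import tempfile


class Master:
    """Toy master: whole state in memory, persisted before every reply."""

    def __init__(self, store_path):
        self._store_path = store_path
        self._tokens = {}  # token name -> token fields

    def _persist(self):
        # Synchronously write the full state and force it to disk.
        with open(self._store_path, "w") as f:
            json.dump(self._tokens, f)
            f.flush()
            os.fsync(f.fileno())

    def add(self, name, data):
        self._tokens[name] = {"version": 1, "owner": None,
                              "expiration": 0, "data": data}
        self._persist()

    def claim(self, name, worker, now, lease):
        """1: request. Returns the token on success, None otherwise."""
        token = self._tokens.get(name)
        if token is None:
            return None
        if token["owner"] is not None and token["expiration"] > now:
            return None  # another worker still holds an unexpired claim
        token["owner"] = worker
        token["expiration"] = now + lease
        token["version"] += 1
        self._persist()      # 2: update the persistent store
        return dict(token)   # 3: ack, returned only after the update is on disk


store = os.path.join(tempfile.mkdtemp(), "tokens.json")
master = Master(store)
master.add("/workflow/w1/job", "JobTemplate")
first = master.claim("/workflow/w1/job", "worker_0", now=100, lease=60)
second = master.claim("/workflow/w1/job", "worker_1", now=120, lease=60)
third = master.claim("/workflow/w1/job", "worker_1", now=200, lease=60)
```

In this run `worker_0` wins the first claim, `worker_1` is refused while that claim is still valid, and `worker_1` succeeds once the lease has lapsed. Because every method runs to completion on one thread, no locking is needed.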


Slide 24

Worker


Slide 25

Open Source
- Git repo: https://github.com/pinterest/pinball
- Mailing list: https://groups.google.com/forum/#!forum/pinball-users


Slide 26

Thank You

