Big Data Platform at interest

Понравилась презентация – покажи это...

Слайд 0

Mao Ye Big Data Platform at interest 1

Слайд 1

Слайд 2

Data Architecture

Слайд 3

Data at Pinterest 60 Billion Pins 1 Billion boards 100M MAU 60 PB of data on S3 3 PB processed every day 2000 node Hadoop cluster 250 engineers

Слайд 4

Pinterest Data Architecture App

Слайд 5

Pinterest Data Architecture App events Kafka Secor Singer

Слайд 6

Pinterest Data Architecture App events Kafka Secor Singer

Слайд 7

Pinterest Data Architecture App events Kafka Secor Skyline Pinball Redshift Pinalytics Features Qubole (Hadoop) Singer

Слайд 8

Design Choices for Hadoop Platform

Слайд 9

Ephemeral clusters Access control layer Shared data store Easy deployment Hadoop Platform Requirements Isolated multi-tenancy Elasticity Support multiple clusters

Слайд 10

Decoupling compute & storage Hadoop Cluster 1 Transient HDFS Hadoop Cluster 2 Transient HDFS S3 Persistent Store

Слайд 11

Centralized Hive Metastore Hive Metastore Pig Cascading Hive HDFS/S3 Data Metadata

Слайд 12

Multi-layered Packaging Mapreduce Jobs Hadoop Jars/Libs Job/User level Configs Software Packages/Libs Configs (OS/Hadoop) Misc Sys Admin OS Bootstrap Script Core SW Runtime Staging (on S3) Automated Configuration (Masterless Puppet) Baked AMI

Слайд 13

Executor Abstraction Layer Hive Metastore HDFS/S3 Qubole Managed Hadoop EMR Executor Pinball Dev Server

Слайд 14

API for simplified executor abstraction Advanced support for spot instances Baked AMI customization Why Qubole? Hadoop & Spark as managed services Tight integration with Hive Graceful cluster scaling

Слайд 15

Pinball for Workflow Management

Слайд 16

Scale: 60 Billion Pins Hundreds of workflows Thousands of jobs 500+ jobs in a workflow 3 petabytes processed daily Support: Hadoop, Cascading, Hive, Spark … Scale of Processing job workflow

Слайд 17

Why Pinball? Requirements Simple abstractions Extensible in future Reliable stateless computing Easy to debug Scales horizontally Can be upgraded w/o aborting workflows Rich features like auto-retries, per-job emails, overrun policies… Options Apache Oozie, Azkaban, Luigi

Слайд 18

Pinball Design Master Worker Scheduler Command Line Clients UI

Слайд 19

Workflow A directed graph of nodes called jobs Edge Run after dependence Node Job is a node Workflow Model

Слайд 20

Job State Job state is captured in a token Tokens are named hierarchically Master Job Token version: 123 name: /workflow/w1/job owner: worker_0 expiration: 1234567 data: JobTemplate(....)

Слайд 21


Слайд 22

Master keeps the state Workers claim and execute tasks Horizontally scalable Master Worker Interaction Worker Master Persistent Store 1: request 2: update 3: ack

Слайд 23

Master Entire state is kept in memory Each state update is synchronously persisted before master replies to client Master runs on a single thread – no concurrency issues

Слайд 24


Слайд 25

Open Source Git repo: https://github.com/pinterest/pinball Mailing list: https://groups.google.com/forum/#!forum/pinball-users

Слайд 26

Thank You