Big Data Platform at Pinterest

Slide 0

Mao Ye
Big Data Platform at Pinterest

Slide 1

Slide 2

Data Architecture

Slide 3

Data at Pinterest
  60 billion Pins
  1 billion boards
  100M MAU
  60 PB of data on S3
  3 PB processed every day
  2000-node Hadoop cluster
  250 engineers

Slide 4

Pinterest Data Architecture
  Diagram labels: App

Slide 5

Pinterest Data Architecture
  Diagram labels: App, events, Kafka, Secor, Singer

Slide 6

Pinterest Data Architecture
  Diagram labels: App, events, Kafka, Secor, Singer

Slide 7

Pinterest Data Architecture
  Diagram labels: App, events, Kafka, Secor, Skyline, Pinball, Redshift, Pinalytics, Features, Qubole (Hadoop), Singer

Slide 8

Design Choices for Hadoop Platform

Slide 9

Hadoop Platform Requirements
  Ephemeral clusters
  Access control layer
  Shared data store
  Easy deployment
  Isolated multi-tenancy
  Elasticity
  Support multiple clusters

Slide 10

Decoupling compute & storage
  Hadoop Cluster 1: transient HDFS
  Hadoop Cluster 2: transient HDFS
  S3: persistent store

Slide 11

Centralized Hive Metastore
  Diagram labels: Hive Metastore, Pig, Cascading, Hive, HDFS/S3, Data, Metadata

Slide 12

Multi-layered Packaging
  Runtime Staging (on S3): MapReduce jobs, Hadoop jars/libs, job/user-level configs
  Automated Configuration (Masterless Puppet): software packages/libs, configs (OS/Hadoop), misc sys admin
  Baked AMI: OS, bootstrap script, core SW

Slide 13

Executor Abstraction Layer
  Diagram labels: Hive Metastore, HDFS/S3, Qubole Managed Hadoop, EMR, Executor, Pinball, Dev Server
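
The executor abstraction on this slide can be pictured as one small interface with a backend per Hadoop provider. The sketch below is only an illustration under stated assumptions: the class names, the run() method, and the returned strings are hypothetical and are not taken from Pinterest's code or from the Qubole or EMR APIs.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical interface: submit a job command to some Hadoop backend."""

    @abstractmethod
    def run(self, command: str) -> str:
        ...


class QuboleExecutor(Executor):
    def run(self, command: str) -> str:
        # A real backend would call the provider's API here; this stub
        # only records where the command would have been sent.
        return f"qubole: {command}"


class EmrExecutor(Executor):
    def run(self, command: str) -> str:
        return f"emr: {command}"


# Callers pick a backend by name and never depend on a concrete class.
EXECUTORS = {"qubole": QuboleExecutor, "emr": EmrExecutor}


def make_executor(name: str) -> Executor:
    return EXECUTORS[name]()
```

With this shape, a workflow job or a developer on a dev server would call make_executor("qubole").run(...) and could switch providers by changing one string.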

Slide 14

Why Qubole?
  API for simplified executor abstraction
  Advanced support for spot instances
  Baked AMI customization
  Hadoop & Spark as managed services
  Tight integration with Hive
  Graceful cluster scaling

Slide 15

Pinball for Workflow Management

Slide 16

Scale of Processing
  Scale:
    60 billion Pins
    Hundreds of workflows
    Thousands of jobs
    500+ jobs in a workflow
    3 petabytes processed daily
  Support: Hadoop, Cascading, Hive, Spark …
  Diagram labels: job, workflow

Slide 17

Why Pinball?
  Requirements:
    Simple abstractions
    Extensible in future
    Reliable stateless computing
    Easy to debug
    Scales horizontally
    Can be upgraded w/o aborting workflows
    Rich features like auto-retries, per-job emails, overrun policies…
  Options: Apache Oozie, Azkaban, Luigi

Slide 18

Pinball Design
  Diagram labels: Master, Worker, Scheduler, Command Line, Clients, UI

Slide 19

Workflow Model
  Workflow: a directed graph of nodes called jobs
  Node: a job
  Edge: a run-after dependency
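
To make the workflow model concrete, here is a minimal sketch that treats a workflow as a mapping from each job to the jobs it must run after, and derives a valid execution order with Kahn's algorithm. The function name, the job names, and the assumption that every job appears as a key are illustrative only; this is not Pinball's scheduling code.

```python
from collections import deque


def run_order(deps):
    """Return jobs in an order that respects run-after edges.

    deps maps each job name to the list of jobs it must run after.
    Every job is assumed to appear as a key, even with no dependencies.
    """
    pending = {job: set(parents) for job, parents in deps.items()}
    ready = deque(sorted(job for job, parents in pending.items() if not parents))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        # Release every job that was waiting on the one just finished.
        for other in sorted(pending):
            if job in pending[other]:
                pending[other].remove(job)
                if not pending[other]:
                    ready.append(other)
    if len(order) != len(pending):
        raise ValueError("workflow has a cycle or an unknown dependency")
    return order


workflow = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["transform"],
}
```

Calling run_order(workflow) yields "extract" first and "transform" second, with "load" and "report" free to run once "transform" has finished.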

Slide 20

Job State
  Job state is captured in a token
  Tokens are named hierarchically
  Diagram labels: Master, Job Token
  Job Token:
    version: 123
    name: /workflow/w1/job
    owner: worker_0
    expiration: 1234567
    data: JobTemplate(....)
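
A token with the fields shown on this slide, plus a lookup that exploits hierarchical names, might look like the sketch below. The field names come from the slide; the Token class, the tokens_under helper, and the second example token are assumptions made for illustration and are not Pinball's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    name: str                         # hierarchical, e.g. /workflow/w1/job
    version: int
    owner: Optional[str] = None       # worker currently holding the token
    expiration: Optional[int] = None  # time after which ownership lapses
    data: Optional[str] = None        # opaque payload, e.g. a job template


def tokens_under(tokens: List[Token], prefix: str) -> List[Token]:
    """Return tokens whose name lies below the given path prefix."""
    # Force a trailing slash so /workflow/w1 does not match /workflow/w10.
    root = prefix.rstrip("/") + "/"
    return [t for t in tokens if t.name.startswith(root)]


tokens = [
    Token("/workflow/w1/job", 123, "worker_0", 1234567),
    Token("/workflow/w10/job", 1),
]
```

Because names are paths, everything belonging to one workflow can be selected with a single prefix, for example tokens_under(tokens, "/workflow/w1").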

Slide 21


Slide 22

Master Worker Interaction
  Master keeps the state
  Workers claim and execute tasks
  Horizontally scalable
  Diagram labels: Worker, Master, Persistent Store
  Flow: 1: request, 2: update, 3: ack
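
One way to read "workers claim and execute tasks" is as a lease on a token: a worker becomes the owner until an expiration time, after which another worker may take over. The sketch below assumes that reading. The ClaimingMaster class, its method names, and the lease arithmetic are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class ClaimingMaster:
    """Toy master that hands out time-limited ownership of tokens."""

    def __init__(self):
        self.tokens = {}  # token name -> {"version", "owner", "expiration"}

    def add(self, name):
        self.tokens[name] = {"version": 1, "owner": None, "expiration": 0}

    def claim(self, name, worker, now, lease):
        """Give the token to worker unless another owner's lease is still live."""
        token = self.tokens[name]
        if token["owner"] is not None and token["expiration"] > now:
            return False
        token["owner"] = worker
        token["expiration"] = now + lease
        token["version"] += 1  # every accepted update bumps the version
        return True


master = ClaimingMaster()
master.add("/workflow/w1/job")
first = master.claim("/workflow/w1/job", "worker_0", now=100, lease=60)
second = master.claim("/workflow/w1/job", "worker_1", now=120, lease=60)
third = master.claim("/workflow/w1/job", "worker_1", now=200, lease=60)
```

Here the second claim is refused because worker_0 still holds a live lease, while the third succeeds once that lease has expired, which is how a crashed worker's job could be picked up by another worker.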

Slide 23

Master
  Entire state is kept in memory
  Each state update is synchronously persisted before master replies to client
  Master runs on a single thread – no concurrency issues
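
The persist-before-reply rule can be sketched as follows: the update is written and flushed to durable storage first, and only then applied in memory and acknowledged. The PersistingMaster class, the JSON file used as a persistent store, and the "RUNNING" value are assumptions chosen to keep the example self-contained; the slides do not say which store is used.

```python
import json
import os
import tempfile


class PersistingMaster:
    """Toy single-threaded master: in-memory state backed by one JSON file."""

    def __init__(self, path):
        self.path = path
        self.state = {}

    def update(self, name, value):
        new_state = {**self.state, name: value}
        # Persist synchronously; the ack below is sent only after this returns.
        with open(self.path, "w") as f:
            json.dump(new_state, f)
            f.flush()
            os.fsync(f.fileno())
        self.state = new_state
        return "ack"


def demo():
    """Apply one update, then reload the file as a restarted master would."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.json")
        master = PersistingMaster(path)
        reply = master.update("/workflow/w1/job", "RUNNING")
        with open(path) as f:
            return reply, json.load(f)
```

Because the in-memory state changes only after the write succeeds, a failed write leaves memory and disk consistent, and a single thread means no two updates interleave.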

Slide 24


Slide 25

Open Source
  Git repo: https://github.com/pinterest/pinball
  Mailing list: https://groups.google.com/forum/#!forum/pinball-users

Slide 26

Thank You