
Maintaining the Front Door to Netflix






Slide 0

Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson


Slide 1

There are copious notes attached to each slide in this presentation. Please read those notes to get the full context of the presentation.


Slide 2

Global Streaming Video for TV Shows and Movies


Slide 3

More than 44 Million Subscribers More than 40 Countries


Slide 4

Netflix Accounts for ~33% of Peak Internet Traffic in North America. Netflix subscribers are watching more than 1 billion hours a month.


Slide 5


Slide 6


Slide 7

Team Focus: Build the Best Global Streaming Product. Three aspects of the Streaming Product: Non-Member, Discovery, Streaming.


Slide 8

Key Responsibilities: Broker data between services and UIs; Maintain a resilient front-door; Scale the system vertically and horizontally; Maintain high velocity.


Slide 9

But Before Streaming…


Slide 10


Slide 11


Slide 12

Monolithic Application In Netflix Data Centers


Slide 13

The bigger the ship… the slower it turns


Slide 14

Distributed Architecture


Slide 15


Slide 16

1000+ Device Types


Slide 17

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies Reviews A/B Test Engine Dozens of Dependencies


Slide 18

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 19

Dependency Relationships


Slide 20

2,000,000,000 Requests Per Day to the Netflix API


Slide 21

30 Distinct Dependent Services for the Netflix API


Slide 22

~500 Dependency jars Slurped into the Netflix API


Slide 23

14,000,000,000 Netflix API Calls Per Day to those Dependent Services


Slide 24

0 Dependent Services with 100% SLA


Slide 25

99.99%^30 = 99.7%. 0.3% of 2B = 6M failures per day. 2+ Hours of Downtime Per Month.


Slide 26

99.99%^30 = 99.7%. 0.3% of 2B = 6M failures per day. 2+ Hours of Downtime Per Month.


Slide 27

99.9%^30 = 97%. 3% of 2B = 60M failures per day. 20+ Hours of Downtime Per Month.
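The availability arithmetic on the preceding slides can be checked with a short calculation. This is only a sketch under the slides' own premises: 30 dependencies, identical per-service availability, and 2 billion requests per day. Independent failures, the 30-day month, and the function name `compound_availability` are assumptions made for illustration.

```python
def compound_availability(per_service: float, services: int = 30) -> float:
    """Availability of a request that needs every one of `services` dependencies.

    Assumes the dependencies fail independently of each other.
    """
    return per_service ** services


REQUESTS_PER_DAY = 2_000_000_000  # figure taken from the slides
HOURS_PER_MONTH = 30 * 24         # assumed 30-day month

for per_service in (0.9999, 0.999):
    overall = compound_availability(per_service)
    failure_fraction = 1.0 - overall
    failures_per_day = failure_fraction * REQUESTS_PER_DAY
    downtime_hours = failure_fraction * HOURS_PER_MONTH
    print(
        f"{per_service:.2%} per service -> {overall:.2%} overall, "
        f"{failures_per_day / 1e6:.0f}M failures/day, "
        f"{downtime_hours:.1f} h of downtime/month"
    )
```

Dropping one "nine" from each dependency multiplies the failure fraction by roughly ten, which is why the second slide reports about ten times the failures and downtime of the first.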


Slide 28

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 29

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 30

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 31

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 32

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 33


Slide 34

Circuit Breaker Dashboard


Slide 35


Slide 36

Call Volume and Health / Last 10 Seconds


Slide 37

Call Volume / Last 2 Minutes


Slide 38

Successful Requests


Slide 39

Successful, But Slower Than Expected


Slide 40

Short-Circuited Requests, Delivering Fallbacks


Slide 41

Timeouts, Delivering Fallbacks


Slide 42

Thread Pool & Task Queue Full, Delivering Fallbacks


Slide 43

Exceptions, Delivering Fallbacks


Slide 44

Error Rate: (# + # + # + #) / (# + # + # + # + #) = Error Rate
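The error-rate formula on this slide can be sketched as a small calculation. Reading the four numerator counters as the four fallback-delivering categories from the preceding slides (short-circuited, timeouts, thread pool or task queue full, exceptions) and the fifth denominator counter as successful requests is an interpretation of the dashboard legend, not something the slide states outright. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CallCounts:
    """Counters for one dependency over a reporting window (hypothetical names)."""

    success: int
    short_circuited: int
    timeout: int
    rejected: int  # thread pool and task queue full
    failure: int   # exceptions

    def error_rate(self) -> float:
        errors = self.short_circuited + self.timeout + self.rejected + self.failure
        total = self.success + errors
        # With no traffic in the window there is nothing to report as an error.
        return errors / total if total else 0.0


counts = CallCounts(success=9500, short_circuited=200, timeout=150, rejected=50, failure=100)
print(f"error rate: {counts.error_rate():.1%}")
```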


Slide 45

Status of Fallback Circuit


Slide 46

Requests per Second, Over Last 10 Seconds


Slide 47

SLA Information


Slide 48

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 49

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 50

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 51

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback


Slide 52

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback
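The fallback behavior shown in these diagrams follows the general circuit-breaker pattern, which can be sketched as below. This is a minimal illustration, not the API of any Netflix library: it trips on consecutive failures, whereas the dashboard slides suggest the real system works from error rates over a time window. The class name, thresholds, and the `flaky_ratings` function are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures; probes the primary again after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 5.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, calls are let through again as a probe.
        return time.monotonic() - self.opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # short-circuited: the dependency is not called
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()  # degrade instead of propagating the error
        self.consecutive_failures = 0
        self.opened_at = None
        return result


def flaky_ratings() -> str:
    raise TimeoutError("ratings service did not respond")


breaker = CircuitBreaker(failure_threshold=2)
results = [breaker.call(flaky_ratings, lambda: "default ratings") for _ in range(3)]
print(results, "circuit open:", breaker.is_open())
```

In the usage above, the first two calls fail and return the fallback; the third is short-circuited and returns the fallback without calling the failing dependency at all.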


Slide 53

Scaling the Distributed System


Slide 54


Slide 55

AWS Cloud


Slide 56


Slide 57

Autoscaling


Slide 58

Autoscaling


Slide 59

Amazon Auto Scaling Limitations: Hard to fit policies to variable traffic patterns (weekday vs. weekend); Limited control over capacity adjustments (absolute value or %).


Slide 60

The Impact of AAS Limitations: Traffic drop can lead to scale-downs during an outage; Performance degradation between new instance launch and taking traffic; Excess capacity at peak and trough.


Slide 61

Scryer: Predictive Auto Scaling. Not yet…


Slide 62

Typical Traffic Patterns Over Five Days


Slide 63

Predicted RPS Compared to Actual RPS


Slide 64

Scaling Plan for Predicted Workload


Slide 65

What is Scryer Doing? Evaluating needs based on historical data; Week-over-week, month-over-month metrics; Adjusts instance minimums based on algorithms; Relies on Amazon Auto Scaling for unpredicted events.
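The idea on this slide, predicting load from history and raising instance minimums ahead of time, can be sketched as follows. The slides do not describe Scryer's actual algorithms, so this substitutes a plain week-over-week average. The function names, the per-instance capacity, and the 20% headroom are all assumptions made for illustration.

```python
import math
from statistics import mean


def predict_rps(history: list[list[float]], slot: int) -> float:
    """Average the same time slot across past weeks (one inner list per week)."""
    return mean(week[slot] for week in history)


def instance_minimum(predicted_rps: float, rps_per_instance: float, headroom: float = 0.2) -> int:
    """Instances needed to serve the prediction with spare headroom, never fewer than 1."""
    return max(1, math.ceil(predicted_rps * (1 + headroom) / rps_per_instance))


# Hypothetical RPS for two time slots (trough, peak) over three past weeks.
history = [
    [1000, 4000],
    [1200, 4400],
    [1100, 4200],
]
plan = [instance_minimum(predict_rps(history, slot), rps_per_instance=500) for slot in range(2)]
print(plan)
```

In this sketch only the minimum is set from the prediction; reactive scaling above that floor would still be left to the autoscaler, which matches the slide's note that Amazon Auto Scaling handles unpredicted events.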


Slide 66

Results


Slide 67

Results: Load Average, Reactive vs. Predictive


Slide 68

Results: Response Latencies, Reactive vs. Predictive


Slide 69

Results: Outage Recovery


Slide 70

Results: Outage Recovery


Slide 71

Results: AWS Costs


Slide 72

Scaling Globally


Slide 73

More than 44 Million Subscribers More than 40 Countries


Slide 74

Zuul Gatekeeper for the Netflix Streaming Application


Slide 75

Zuul*: Multi-Region Resiliency, Insights, Stress Testing, Canary Testing, Dynamic Routing, Load Shedding, Security, Static Response Handling, Authentication. (* Most closely resembles an API proxy.)


Slide 76

Isthmus


Slide 77


Slide 78

All of these approaches are designed to prevent failures…


Slide 79

But sometimes the best way to prevent failures is to force them!


Slide 80


Slide 81

Chaos Monkey: I randomly terminate instances in production to identify dormant failures.
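The selection step behind random termination can be sketched as below. This is an illustration of the idea only, not the real tool: it just picks candidates, and the actual termination call to the cloud provider is deliberately left out. The function name, group names, and probability parameter are hypothetical.

```python
import random


def pick_victims(
    groups: dict[str, list[str]], probability: float, rng: random.Random
) -> dict[str, str]:
    """For each instance group, maybe choose one random instance to terminate."""
    victims: dict[str, str] = {}
    for group, instances in groups.items():
        # Empty groups are skipped; each other group is hit with the given probability.
        if instances and rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


groups = {
    "api": ["i-api-1", "i-api-2", "i-api-3"],
    "ratings": ["i-rat-1", "i-rat-2"],
    "empty-group": [],
}
victims = pick_victims(groups, probability=1.0, rng=random.Random(7))
print(victims)
```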


Slide 82

Chaos Gorilla: I simulate an outage of an entire Amazon availability zone.


Slide 83

Chaos Kong: I simulate an outage in an AWS region.


Slide 84

Conformity Monkey: I find instances that don’t adhere to best practices.


Slide 85

Security Monkey: I extend Conformity Monkey to find security violations.


Slide 86

Doctor Monkey: I detect unhealthy instances and remove them from service.


Slide 87

Janitor Monkey: I clean up the clutter and waste that runs in the cloud.


Slide 88

Latency Monkey: I induce artificial delays and errors into services to determine how upstream services will respond.
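Injecting artificial delays and errors into a service call can be sketched with a decorator. This illustrates the general fault-injection technique, not the real tool: the decorator name, its parameters, and the `fetch_reviews` function are hypothetical, and the sleep function is passed in so the delay can be replaced in tests.

```python
import functools
import random
import time
from typing import Callable


def inject_faults(
    delay_s: float,
    error_rate: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
):
    """Wrap a function so every call is delayed and some calls fail on purpose."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            sleep(delay_s)  # artificial latency added before the real call
            if rng.random() < error_rate:
                raise RuntimeError("injected failure")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_faults(delay_s=0.01, error_rate=0.0, rng=random.Random(1))
def fetch_reviews(title_id: str) -> list[str]:
    return [f"review for {title_id}"]


print(fetch_reviews("title-1"))
```

Callers of the wrapped function can then be observed to see whether their timeouts and fallbacks behave as intended.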


Slide 89


Slide 90

Deployments in the Cloud


Slide 91

Dependency Relationships


Slide 92


Slide 93

Testing Philosophy: Act Fast, React Fast


Slide 94

That Doesn’t Mean We Don’t Test


Slide 95

Automated Delivery Pipeline


Slide 96

Cloud-Based Deployment Techniques


Slide 97

Current Code In Production API Requests from the Internet


Slide 98

Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet
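Sending roughly 1% of production traffic to a canary can be sketched as a weighted routing decision. This is an assumption about mechanism: the slide only states the traffic share, and a single canary instance behind a load balancer would receive a small share without any explicit routing rule. The function name and target labels are hypothetical.

```python
import random


def choose_target(rng: random.Random, canary_fraction: float = 0.01) -> str:
    """Route one request: a small random fraction goes to the canary instance."""
    return "canary" if rng.random() < canary_fraction else "production"


rng = random.Random(42)  # seeded so the simulation is repeatable
sample = [choose_target(rng) for _ in range(100_000)]
canary_share = sample.count("canary") / len(sample)
print(f"canary share: {canary_share:.2%}")
```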


Slide 99

Canary Analysis Automation
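One simple form of automated canary analysis is to compare the canary's error rate with the baseline's and fail the canary if it is worse by more than a tolerance. The slides do not say which metrics or thresholds are used, so the single metric, the tolerance value, and the function name below are assumptions for illustration.

```python
def canary_passes(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    tolerance: float = 0.005,
) -> bool:
    """Pass if the canary error rate is at most `tolerance` above the baseline rate."""
    if baseline_requests == 0 or canary_requests == 0:
        return False  # no data to compare, so do not promote
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    return canary_rate <= baseline_rate + tolerance


print(canary_passes(100, 100_000, 1, 1_000))   # canary matches the baseline
print(canary_passes(100, 100_000, 50, 1_000))  # canary is clearly worse
```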


Slide 100

Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet Error!


Slide 101

Current Code In Production API Requests from the Internet


Slide 102

Current Code In Production API Requests from the Internet


Slide 103

Current Code In Production API Requests from the Internet Perfect!


Slide 104

Stress Test with Zuul


Slide 105

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 106

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 107

Error! Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 108

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 109

Current Code In Production API Requests from the Internet Perfect!


Slide 110

Stress Test with Zuul


Slide 111

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 112

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 113

API Requests from the Internet New Code Getting Prepared for Production


Slide 114

Brokering Data to 1,000+ Device Types


Slide 115


Slide 116


Slide 117

Screen Real Estate


Slide 118

Controller


Slide 119

Technical Capabilities


Slide 120

One-Size-Fits-All API [diagram with sixteen "Request" labels]


Slide 121

Courtesy of South Florida Classical Review


Slide 122


Slide 123

Resource-Based API vs. Experience-Based API


Slide 124

Resource-Based Requests: /users/<id>/ratings/title, /users/<id>/queues, /users/<id>/queues/instant, /users/<id>/recommendations, /catalog/titles/movie, /catalog/titles/series, /catalog/people


Slide 125

REST API RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Network Border Network Border


Slide 126

RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border SERVER CODE CLIENT CODE


Slide 127

RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border DATA GATHERING, FORMATTING, AND DELIVERY USER INTERFACE RENDERING


Slide 128


Slide 129


Slide 130

Experience-Based Requests: /ps3/homescreen
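The difference from resource-based requests is that one device-specific endpoint gathers data from several services on the server side and returns a payload shaped for that screen. The sketch below illustrates that idea in Python, although the deck itself shows a Groovy layer for this role. The three `get_*` functions are hypothetical stand-ins for backend services, and the response shape is invented for illustration.

```python
# Hypothetical stand-ins for backend services; real calls would cross the network.
def get_recommendations(user_id: str) -> list[str]:
    return ["title-1", "title-2"]


def get_ratings(user_id: str) -> dict[str, int]:
    return {"title-1": 4}


def get_queue(user_id: str) -> list[str]:
    return ["title-3"]


def ps3_homescreen(user_id: str) -> dict:
    """Handler for /ps3/homescreen: one request, several services, one tailored payload."""
    ratings = get_ratings(user_id)

    def row(name: str, title_ids: list[str]) -> dict:
        # Attach the member's rating to each title, or None when unrated.
        titles = [{"id": t, "rating": ratings.get(t)} for t in title_ids]
        return {"name": name, "titles": titles}

    return {
        "rows": [
            row("recommendations", get_recommendations(user_id)),
            row("queue", get_queue(user_id)),
        ]
    }


screen = ps3_homescreen("user-42")
print(screen)
```

With a resource-based API the device would issue separate requests for recommendations, ratings, and the queue and combine them itself; here the combination happens once, next to the data.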


Slide 131

JAVA API Network Border Network Border RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Groovy Layer


Slide 132


Slide 133

RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API SERVER CODE CLIENT CODE CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER) Network Border Network Border


Slide 134

RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API DATA GATHERING DATA FORMATTING AND DELIVERY USER INTERFACE RENDERING Network Border Network Border


Slide 135


Slide 136

https://www.github.com/Netflix


Slide 137

Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson

