Maintaining the Front Door to Netflix

Slide 1

Maintaining the Front Door to Netflix. Daniel Jacobson, @daniel_jacobson, http://www.linkedin.com/in/danieljacobson, http://www.slideshare.net/danieljacobson


Slide 2

There are copious notes attached to each slide in this presentation. Please read those notes to get the full context of the presentation.


Slide 3

Global Streaming Video for TV Shows and Movies


Slide 4

More than 44 Million Subscribers, More than 40 Countries


Slide 5

Netflix Accounts for ~33% of Peak Internet Traffic in North America. Netflix subscribers are watching more than 1 billion hours a month.


Slide 6


Slide 7


Slide 8

Team Focus: Build the Best Global Streaming Product. Three aspects of the Streaming Product: Non-Member, Discovery, Streaming.


Slide 9

Key Responsibilities: broker data between services and UIs; maintain a resilient front door; scale the system vertically and horizontally; maintain high velocity.


Slide 10

But Before Streaming…


Slide 11


Slide 12


Slide 13

Monolithic Application in Netflix Data Centers


Slide 14

The bigger the ship… the slower it turns


Slide 15

Distributed Architecture


Slide 16


Slide 17

1,000+ Device Types


Slide 18

Dozens of Dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 19

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 20

Dependency Relationships


Slide 21

2,000,000,000 Requests Per Day to the Netflix API


Slide 22

30 Distinct Dependent Services for the Netflix API


Slide 23

~500 Dependency JARs Slurped into the Netflix API


Slide 24

14,000,000,000 Netflix API Calls Per Day to those Dependent Services


Slide 25

0 Dependent Services with 100% SLA


Slide 26

(99.99%)^30 = 99.7%. 0.3% of 2B = 6M failures per day. 2+ Hours of Downtime Per Month.


Slide 27

(99.99%)^30 = 99.7%. 0.3% of 2B = 6M failures per day. 2+ Hours of Downtime Per Month.


Slide 28

(99.9%)^30 = 97%. 3% of 2B = 60M failures per day. 20+ Hours of Downtime Per Month.
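
The arithmetic on these slides can be reproduced with a short sketch. It assumes the 30 dependencies fail independently, that every request needs all of them, and that a month has 30 days; the function names are illustrative only.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def compound_availability(per_service: float, services: int) -> float:
    """Availability of a request that needs every one of `services`
    dependencies, assuming their failures are independent."""
    return per_service ** services


def daily_failures(availability: float, requests_per_day: int) -> float:
    """Expected number of failed requests per day."""
    return (1.0 - availability) * requests_per_day


def monthly_downtime_hours(availability: float) -> float:
    """Unavailability expressed as hours per month."""
    return (1.0 - availability) * HOURS_PER_MONTH


if __name__ == "__main__":
    for per_service in (0.9999, 0.999):
        total = compound_availability(per_service, 30)
        failures = daily_failures(total, 2_000_000_000)
        print(f"{per_service:.2%} per service: {total:.2%} overall, "
              f"{failures / 1e6:.0f}M failures/day, "
              f"{monthly_downtime_hours(total):.1f} h downtime/month")
```

The point of the slides is that even very reliable dependencies multiply into noticeable unreliability at the front door unless failures are isolated.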


Slide 29

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 30

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 31

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 32

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 33

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 34


Slide 35

Circuit Breaker Dashboard


Slide 36


Slide 37

Call Volume and Health / Last 10 Seconds


Slide 38

Call Volume / Last 2 Minutes


Slide 39

Successful Requests


Slide 40

Successful, But Slower Than Expected


Slide 41

Short-Circuited Requests, Delivering Fallbacks


Slide 42

Timeouts, Delivering Fallbacks


Slide 43

Thread Pool & Task Queue Full, Delivering Fallbacks


Slide 44

Exceptions, Delivering Fallbacks


Slide 45

Error Rate = (# + # + # + #) / (# + # + # + # + #), where each # is a counter on the dashboard
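
A minimal sketch of the error-rate formula follows. The slide shows only `#` placeholders, so the mapping is an assumption: the four numerator terms are taken to be the fallback-delivering outcomes from the preceding slides (short-circuited, timeouts, thread pool or queue full, exceptions), and the denominator adds successful requests. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request outcomes in one rolling window (field names are hypothetical)."""
    success: int = 0
    short_circuited: int = 0
    timeout: int = 0
    rejected: int = 0   # thread pool or task queue full
    exception: int = 0

    def error_rate(self) -> float:
        """Fraction of requests in the window that ended in a fallback."""
        errors = (self.short_circuited + self.timeout
                  + self.rejected + self.exception)
        total = self.success + errors
        return errors / total if total else 0.0


if __name__ == "__main__":
    window = WindowCounts(success=90, short_circuited=4, timeout=3,
                          rejected=2, exception=1)
    print(f"error rate: {window.error_rate():.1%}")
```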


Slide 46

Status of Fallback Circuit


Slide 47

Requests per Second, Over Last 10 Seconds


Slide 48

SLA Information


Slide 49

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 50

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 51

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine


Slide 52

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine; Fallback


Slide 53

API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine; Fallback
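
The short-circuit and fallback behavior described on the dashboard slides can be sketched as a toy circuit breaker. This is an illustration under stated assumptions, not the implementation Netflix runs: the class, its parameters, and the consecutive-failure trip rule are hypothetical, and timeouts and thread-pool isolation are left out.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Toy breaker: opens after `max_failures` consecutive failures and
    allows a trial call again once `reset_after` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cool-down, report closed so one trial request gets through.
        return self.clock() - self.opened_at < self.reset_after

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()   # short-circuit: do not touch the dependency
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip (or re-trip) the circuit
            return fallback()   # degrade instead of propagating the error
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency in its own breaker, for example `ratings_breaker.call(fetch_ratings, lambda: [])`, so that one failing service degrades a single part of the response rather than the whole request.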


Slide 54

Scaling the Distributed System


Slide 55


Slide 56

AWS Cloud


Slide 57


Slide 58

Autoscaling


Slide 59

Autoscaling


Slide 60

Amazon Auto Scaling Limitations: hard to fit policies to variable traffic patterns (weekday vs. weekend); limited control over capacity adjustments (absolute value or %).


Slide 61

The Impact of AAS Limitations: a traffic drop can lead to scale-downs during an outage; performance degrades between new instance launch and taking traffic; excess capacity at peak and trough.


Slide 62

Scryer: Predictive Auto Scaling. Not yet…


Slide 63

Typical Traffic Patterns Over Five Days


Slide 64

Predicted RPS Compared to Actual RPS


Slide 65

Scaling Plan for Predicted Workload


Slide 66

What is Scryer Doing? Evaluating needs based on historical data (week-over-week, month-over-month metrics); adjusting instance minimums based on algorithms; relying on Amazon Auto Scaling for unpredicted events.
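
The idea of turning historical traffic into instance minimums can be sketched as below. The slides do not describe Scryer's actual algorithms, so everything here is an assumption for illustration: the prediction is a plain average of the same hour in previous weeks, and `rps_per_instance` and `headroom` are hypothetical parameters.

```python
import math
from statistics import mean
from typing import List, Sequence


def predict_rps(same_hour_history: Sequence[float]) -> float:
    """Predict load for an hour as the mean of that hour in prior weeks.
    A stand-in for the real prediction algorithms."""
    return mean(same_hour_history)


def min_instances(predicted_rps: float, rps_per_instance: float,
                  headroom: float = 0.2) -> int:
    """Instance minimum needed for the predicted load plus some headroom."""
    return math.ceil(predicted_rps * (1.0 + headroom) / rps_per_instance)


def scaling_plan(history_by_hour: Sequence[Sequence[float]],
                 rps_per_instance: float,
                 headroom: float = 0.2) -> List[int]:
    """One instance minimum per hour; reactive auto scaling would still
    sit on top of these minimums to absorb unpredicted events."""
    return [min_instances(predict_rps(h), rps_per_instance, headroom)
            for h in history_by_hour]


if __name__ == "__main__":
    history = [[900, 1000, 1100], [4000, 4200, 4400]]  # made-up RPS samples
    print(scaling_plan(history, rps_per_instance=100))
```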


Slide 67

Results


Slide 68

Results: Load Average (Reactive vs. Predictive)


Slide 69

Results: Response Latencies (Reactive vs. Predictive)


Slide 70

Results: Outage Recovery


Slide 71

Results: Outage Recovery


Slide 72

Results: AWS Costs


Slide 73

Scaling Globally


Slide 74

More than 44 Million Subscribers, More than 40 Countries


Slide 75

Zuul: Gatekeeper for the Netflix Streaming Application


Slide 76

Zuul*: Multi-Region Resiliency, Insights, Stress Testing, Canary Testing, Dynamic Routing, Load Shedding, Security, Static Response Handling, Authentication. (* Most closely resembles an API proxy.)
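
A few of the capabilities listed here (load shedding, static responses, dynamic routing) can be pictured as a chain of filters in front of the backend. The sketch below is a generic illustration and does not use Zuul's API; the `Request`, `Response`, filter names, cluster names, and the `x-canary` header are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)
    route: str = "default-cluster"   # hypothetical backend name


@dataclass
class Response:
    status: int
    body: str


# A filter may rewrite the request, or return a Response to stop the chain.
Filter = Callable[[Request], Optional[Response]]


def load_shedding_filter(current_load: Callable[[], float],
                         limit: float) -> Filter:
    def apply(request: Request) -> Optional[Response]:
        if current_load() > limit:
            # Static response: the backend is never called.
            return Response(503, "shedding load")
        return None
    return apply


def canary_routing_filter(request: Request) -> Optional[Response]:
    if request.headers.get("x-canary") == "true":   # hypothetical header
        request.route = "canary-cluster"
    return None


def handle(request: Request, filters: List[Filter],
           backend: Callable[[Request], Response]) -> Response:
    for apply_filter in filters:
        early = apply_filter(request)
        if early is not None:
            return early
    return backend(request)
```

Keeping each concern in its own filter means routing or shedding rules can change without touching the backend services.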


Slide 77

Isthmus


Slide 78


Slide 79

All of these approaches are designed to prevent failures…


Slide 80

But sometimes the best way to prevent failures is to force them!


Slide 81


Slide 82

Chaos Monkey: I randomly terminate instances in production to identify dormant failures.


Slide 83

Chaos Gorilla: I simulate an outage of an entire Amazon availability zone.


Slide 84

Chaos Kong: I simulate an outage in an AWS region.


Slide 85

Conformity Monkey: I find instances that don’t adhere to best practices.


Slide 86

Security Monkey: I extend Conformity Monkey to find security violations.


Slide 87

Doctor Monkey: I detect unhealthy instances and remove them from service.


Slide 88

Janitor Monkey: I clean up the clutter and waste that runs in the cloud.


Slide 89

Latency Monkey: I induce artificial delays and errors into services to determine how upstream services will respond.
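
The fault-injection idea behind Latency Monkey can be sketched as a decorator that adds a delay and, with some probability, an error to a call. This is an illustrative stand-in and not the Simian Army code; `inject_faults`, `InjectedFault`, and the parameters are hypothetical names.

```python
import random
import time
from functools import wraps
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency error."""


def inject_faults(delay_s: float = 0.0, error_rate: float = 0.0,
                  rng: Optional[random.Random] = None,
                  sleep: Callable[[float], None] = time.sleep):
    """Wrap a call so it is delayed by `delay_s` seconds and fails with
    probability `error_rate`."""
    rng = rng or random.Random()

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            sleep(delay_s)                    # artificial latency
            if rng.random() < error_rate:     # artificial error
                raise InjectedFault(func.__name__)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(delay_s=0.05, error_rate=0.1)
def fetch_ratings(user_id: str) -> list:
    # Placeholder for a real dependency call.
    return [{"user": user_id, "title": "t1", "stars": 4}]
```

Calling `fetch_ratings` repeatedly while watching the caller's timeouts and fallbacks shows how the calling service copes with a slow or flaky dependency.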


Slide 90


Slide 91

Deployments in the Cloud


Slide 92

Dependency Relationships


Slide 93


Slide 94

Testing Philosophy: Act Fast, React Fast


Slide 95

That Doesn’t Mean We Don’t Test


Slide 96

Automated Delivery Pipeline


Slide 97

Cloud-Based Deployment Techniques


Slide 98

Current Code In Production; API Requests from the Internet


Slide 99

Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic); Current Code In Production; API Requests from the Internet


Slide 100

Canary Analysis Automation


Slide 101

Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic); Current Code In Production; API Requests from the Internet; Error!


Slide 102

Current Code In Production; API Requests from the Internet


Slide 103

Current Code In Production; API Requests from the Internet


Slide 104

Current Code In Production; API Requests from the Internet; Perfect!
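
The canary sequence above (send about 1% of production traffic to one instance running new code, then compare it with the current code) can be sketched as follows. The pass rule and its `tolerance` threshold are assumptions for illustration; the slides do not say which metrics or thresholds the automated canary analysis uses.

```python
import random
from typing import Optional


def choose_target(canary_fraction: float = 0.01,
                  rng: Optional[random.Random] = None) -> str:
    """Send roughly `canary_fraction` of requests to the canary instance."""
    rng = rng or random.Random()
    return "canary" if rng.random() < canary_fraction else "baseline"


def canary_passes(baseline_errors: int, baseline_requests: int,
                  canary_errors: int, canary_requests: int,
                  tolerance: float = 1.5) -> bool:
    """Pass if the canary's error rate is at most `tolerance` times the
    baseline's. The threshold is an illustrative assumption."""
    if canary_requests == 0:
        return False   # no evidence, so do not promote
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / canary_requests
    return canary_rate <= baseline_rate * tolerance


if __name__ == "__main__":
    rng = random.Random(7)
    picks = [choose_target(0.01, rng) for _ in range(10_000)]
    print("canary share:", picks.count("canary") / len(picks))
    print("promote:", canary_passes(100, 99_000, 1, 1_000))
```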


Slide 105

Stress Test with Zuul


Slide 106

Current Code In Production; API Requests from the Internet; New Code Getting Prepared for Production


Slide 107

Current Code In Production; API Requests from the Internet; New Code Getting Prepared for Production


Slide 108

Error! Current Code In Production; API Requests from the Internet; New Code Getting Prepared for Production


Slide 109

Current Code In Production; API Requests from the Internet; New Code Getting Prepared for Production


Slide 110

Current Code In Production; API Requests from the Internet; Perfect!


Slide 111

Stress Test with Zuul


Slide 112

Current Code In Production; API Requests from the Internet; New Code Getting Prepared for Production


Slide 113

Current Code In Production; API Requests from the Internet; New Code Getting Prepared for Production


Slide 114

API Requests from the Internet; New Code Getting Prepared for Production


Slide 115

Brokering Data to 1,000+ Device Types


Slide 116


Slide 117


Slide 118

Screen Real Estate


Slide 119

Controller


Slide 120

Technical Capabilities


Slide 121

One-Size-Fits-All API, receiving many Requests


Slide 122

Courtesy of South Florida Classical Review


Slide 123


Slide 124

Resource-Based API vs. Experience-Based API


Slide 125

Resource-Based Requests: /users/<id>/ratings/title, /users/<id>/queues, /users/<id>/queues/instant, /users/<id>/recommendations, /catalog/titles/movie, /catalog/titles/series, /catalog/people


Slide 126

REST API; RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border (twice)


Slide 127

OSFA API; RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border (twice); SERVER CODE; CLIENT CODE


Slide 128

OSFA API; RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border (twice); DATA GATHERING, FORMATTING, AND DELIVERY; USER INTERFACE RENDERING


Slide 129


Slide 130


Slide 131

Experience-Based Requests: /ps3/homescreen
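
The contrast between resource-based and experience-based requests can be sketched with a server-side adapter that gathers several resources and returns one device-specific payload. The deck places such adapters in a Groovy layer; Python is used here purely for illustration, and the lookup functions, field names, and payload shape are hypothetical.

```python
from typing import Any, Callable, Dict, List


# Hypothetical resource-style lookups standing in for backend services.
def get_user(user_id: str) -> Dict[str, Any]:
    return {"id": user_id, "name": "demo"}


def get_recommendations(user_id: str) -> List[Dict[str, str]]:
    return [{"title_id": "t1"}, {"title_id": "t2"}]


def get_queue(user_id: str) -> List[Dict[str, str]]:
    return [{"title_id": "t3"}]


def ps3_homescreen(user_id: str) -> Dict[str, Any]:
    """Device-specific adapter: gathers several resources on the server and
    returns only what this (hypothetical) home screen needs."""
    user = get_user(user_id)
    return {
        "greeting": f"Hi, {user['name']}",
        "rows": [
            {"name": "Recommended",
             "titles": [r["title_id"] for r in get_recommendations(user_id)]},
            {"name": "Instant Queue",
             "titles": [q["title_id"] for q in get_queue(user_id)]},
        ],
    }


ROUTES: Dict[str, Callable[[str], Dict[str, Any]]] = {
    "/ps3/homescreen": ps3_homescreen,
}


def handle(path: str, user_id: str) -> Dict[str, Any]:
    """One request from the device replaces several resource requests."""
    return ROUTES[path](user_id)
```

With a resource-based API the device would make each of these lookups itself across the network; here a single call such as `handle("/ps3/homescreen", "42")` returns the assembled screen.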


Slide 132

JAVA API; Groovy Layer; RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border (twice)


Slide 133


Slide 134

JAVA API; RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; SERVER CODE; CLIENT CODE; CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER); Network Border (twice)


Slide 135

JAVA API; RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; DATA GATHERING; DATA FORMATTING AND DELIVERY; USER INTERFACE RENDERING; Network Border (twice)


Slide 136


Slide 137

https://www.github.com/Netflix


Slide 138

Maintaining the Front Door to Netflix. Daniel Jacobson, @daniel_jacobson, http://www.linkedin.com/in/danieljacobson, http://www.slideshare.net/danieljacobson

