Scaling LinkedIn - A Brief History


Slide 0

SCALING LINKEDIN: A BRIEF HISTORY
Josh Clemm
www.linkedin.com/in/joshclemm

Slide 1

“Scaling = replacing all the components of a car while driving it at 100mph.”
Via Mike Krieger, “Scaling Instagram”

Slide 2

LinkedIn started back in 2003 to “connect to your network for better job opportunities.” It had 2,700 members in its first week.

Slide 3

First-week growth guesses from the founding team

Slide 4

Fast forward to today...
[Chart: member count by year, 2003 to 2015, growing from near zero to 400M members]

Slide 5

LINKEDIN SCALE TODAY
● LinkedIn is a global site with over 400 million members
● Web and mobile traffic is served at tens of thousands of queries per second
● Backend systems serve millions of queries per second

Slide 6

How did we get there?

Slide 7

Let’s start from the beginning

Slide 8


Slide 9

LINKEDIN’S ORIGINAL ARCHITECTURE: LEO
● Huge monolithic app called Leo
● Java, JSP, Servlets, JDBC
● Served every page from the same SQL database
Circa 2003

Slide 10

So far so good, but two areas to improve:
1. The growing member-to-member connection graph
2. The ability to search those members

Slide 11

MEMBER CONNECTION GRAPH
● Needed to live in-memory for top performance
● Used graph traversal queries not suitable for the shared SQL database
● Different usage profile than other parts of the site

Slide 12

MEMBER CONNECTION GRAPH
● Needed to live in-memory for top performance
● Used graph traversal queries not suitable for the shared SQL database
● Different usage profile than other parts of the site
So, a dedicated service was created. LinkedIn’s first service.
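To make the traversal argument concrete, here is a stdlib-only toy of the kind of query an in-memory graph service answers: adjacency sets plus a breadth-first search for connection distance. The class and method names are illustrative, not LinkedIn's actual service.

```java
import java.util.*;

// Toy in-memory member graph: adjacency sets plus a BFS that answers
// "how many hops between two members?" -- the traversal query that a
// shared SQL database handled poorly. Illustrative names only.
public class MemberGraph {
    private final Map<Integer, Set<Integer>> adj = new HashMap<>();

    public void connect(int a, int b) {
        adj.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adj.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Breadth-first search: connection distance, or -1 if unreachable.
    public int distance(int from, int to) {
        if (from == to) return 0;
        Map<Integer, Integer> dist = new HashMap<>();
        dist.put(from, 0);
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(from);
        while (!queue.isEmpty()) {
            int cur = queue.poll();
            for (int next : adj.getOrDefault(cur, Set.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(cur) + 1);
                    if (next == to) return dist.get(next);
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        MemberGraph g = new MemberGraph();
        g.connect(1, 2);  // a 1st-degree connection
        g.connect(2, 3);  // 3 is a 2nd-degree connection of 1
        System.out.println(g.distance(1, 3)); // prints 2
    }
}
```

Keeping the whole graph in memory is what makes repeated hop-by-hop expansion cheap; the same query in SQL means one self-join per degree.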

Slide 13

MEMBER SEARCH
● Social networks need powerful search
● Lucene was used on top of our member graph

Slide 14

MEMBER SEARCH
● Social networks need powerful search
● Lucene was used on top of our member graph
LinkedIn’s second service.
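The core structure Lucene provides is an inverted index: each term maps to the set of documents containing it, and multi-term queries intersect those sets. A hedged, stdlib-only sketch of that idea (not the real Lucene API):

```java
import java.util.*;

// Toy inverted index over member profiles: term -> set of member ids.
// Multi-term queries intersect posting sets (an AND query). This
// illustrates the idea behind Lucene, not its actual API.
public class MemberSearch {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addProfile(int memberId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(memberId);
            }
        }
    }

    // AND query: members whose profiles contain every term.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            Set<Integer> postings = index.getOrDefault(term, Set.of());
            if (result == null) result = new TreeSet<>(postings);
            else result.retainAll(postings);
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        MemberSearch s = new MemberSearch();
        s.addProfile(1, "Java engineer, distributed systems");
        s.addProfile(2, "Java developer");
        s.addProfile(3, "Recruiter, enterprise sales");
        System.out.println(s.search("java"));         // prints [1, 2]
        System.out.println(s.search("java systems")); // prints [1]
    }
}
```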

Slide 15

LINKEDIN WITH CONNECTION GRAPH AND SEARCH
[Diagram: LEO calling the Member Graph service over RPC, Lucene on top of the graph, the DB, and connection / profile update flows]
Circa 2004

Slide 16

Getting better, but the single database was under heavy load. Vertically scaling helped, but we needed to offload the read traffic...

Slide 17

REPLICA DBs
● Master/slave concept
● Read-only traffic served from replicas
● Writes go to the main DB
● An early version of Databus kept the DBs in sync
[Diagram: Main DB feeding replica DBs through a Databus relay]

Slide 18

REPLICA DBs TAKEAWAYS
● Good medium-term solution
● We could vertically scale servers for a while
● Master DBs have finite scaling limits
● These days, LinkedIn DBs use partitioning
[Diagram: Main DB feeding replica DBs through a Databus relay]
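The read/write split can be sketched as follows. The maps stand in for real databases and the synchronous copy loop stands in for replication (Databus, in LinkedIn's case), so treat this as a toy model under those assumptions:

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

// Toy read/write router: writes go to the main DB, reads round-robin
// across read-only replicas. The maps are stand-ins for databases, and
// the copy loop is a stand-in for (asynchronous) replication.
public class ReplicaRouter {
    private final Map<String, String> mainDb = new HashMap<>();
    private final List<Map<String, String>> replicas = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    public ReplicaRouter(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) replicas.add(new HashMap<>());
    }

    public void write(String key, String value) {
        mainDb.put(key, value);
        // Stand-in for replication: propagate the write to every replica.
        for (Map<String, String> r : replicas) r.put(key, value);
    }

    public String read(String key) {
        // Round-robin across replicas to offload read traffic.
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i).get(key);
    }

    public static void main(String[] args) {
        ReplicaRouter db = new ReplicaRouter(3);
        db.write("member:42", "Josh");
        System.out.println(db.read("member:42")); // prints Josh, served by a replica
    }
}
```

In a real deployment replication lags, so a reader can briefly see stale data; that staleness is the price of offloading the master.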

Slide 19

LINKEDIN WITH REPLICA DBs
[Diagram: LEO sends R/W traffic to the Main DB and R/O traffic to replica DBs kept in sync by a Databus relay; connection and profile updates feed the Member Graph and Search]
Circa 2006

Slide 20

As LinkedIn continued to grow, the monolithic application Leo was becoming problematic. Leo was difficult to release and debug, and the site kept going down...

Slide 21

Slide 22

Slide 23

Slide 24


Slide 25

SERVICE ORIENTED ARCHITECTURE
Extracting services (Java Spring MVC) from the legacy Leo monolithic application
[Diagram: Recruiter Web App and Public Profile Web App alongside LEO, with a Profile Service and yet another service split out]
Circa 2008 on

Slide 26

SERVICE ORIENTED ARCHITECTURE
● Goal: create vertical stacks of stateless services
● Frontend servers fetch data from many domains and build the HTML or JSON response
● Mid-tier services host APIs and business logic
● Data-tier or back-tier services encapsulate data domains
[Diagram: Profile Web App → Profile Service → Profile DB]

Slide 27

Slide 28

EXAMPLE MULTI-TIER ARCHITECTURE AT LINKEDIN
[Diagram: Browser / App → Frontend Web App → mid-tier content services (Profile, Connections, Groups, Edu) → data services backed by DB, Voldemort, Kafka, and Hadoop]

Slide 29

SERVICE ORIENTED ARCHITECTURE COMPARISON
PROS
● Stateless services easily scale
● Decoupled domains
● Build and deploy independently
CONS
● Ops overhead
● Introduces backwards compatibility issues
● Leads to complex call graphs and fanout

Slide 30

SERVICES AT LINKEDIN
● In 2003, LinkedIn had one service (Leo)
● By 2010, LinkedIn had over 150 services
● Today in 2015, LinkedIn has over 750 services

bash$ eh -e %%prod | awk -F. '{ print $2 }' | sort | uniq | wc -l
756

Slide 31

Getting better, but LinkedIn was experiencing hypergrowth...

Slide 32

Slide 33

CACHING
● Simple way to reduce load on servers and speed up responses
● Mid-tier caches store derived objects from different domains and reduce fanout
● Caches also live in the data layer
● We use memcache, couchbase, even Voldemort
[Diagram: Frontend Web App and Mid-tier Service each backed by a cache in front of the DB]
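A mid-tier cache typically follows the cache-aside pattern: check the cache, load from the source of truth on a miss, then populate the cache. A minimal sketch, with a HashMap standing in for memcache/couchbase and a function standing in for the DB call:

```java
import java.util.*;
import java.util.function.Function;

// Toy cache-aside: hit the cache first, fall back to the loader (a
// stand-in for a DB query) on a miss, then populate the cache. The
// HashMap stands in for an external cache like memcache or couchbase.
public class CacheAside {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> loader; // e.g. a DB lookup
    int misses = 0;

    public CacheAside(Function<String, String> loader) {
        this.loader = loader;
    }

    public String get(String key) {
        String cached = cache.get(key);
        if (cached != null) return cached;   // cache hit
        misses++;
        String value = loader.apply(key);    // cache miss: hit the DB
        cache.put(key, value);               // populate for next time
        return value;
    }

    public static void main(String[] args) {
        CacheAside profiles = new CacheAside(key -> "profile-for-" + key);
        profiles.get("42");   // miss: loaded from the "DB"
        profiles.get("42");   // hit: served from the cache
        System.out.println(profiles.misses); // prints 1
    }
}
```

The hard part is absent from this sketch on purpose: nothing here invalidates the cached copy when the underlying data changes, which is exactly the complexity the takeaways on the next slides describe.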

Slide 34

“There are only two hard problems in Computer Science: cache invalidation, naming things, and off-by-one errors.”
Via Twitter by Kellan Elliott-McCrea and later Jonathan Feinberg

Slide 35

CACHING TAKEAWAYS
● Caches are easy to add in the beginning, but the complexity adds up over time
● Over time, LinkedIn removed many mid-tier caches because of the complexity around invalidation
● We kept caches closer to the data layer

Slide 36

CACHING TAKEAWAYS (cont.)
● Services must handle the full load: caches improve speed, they are not permanent load-bearing solutions
● We’ll use a low-latency solution like Voldemort when appropriate and precompute results

Slide 37

LinkedIn’s hypergrowth was extending to the vast amounts of data it collected. Individual pipelines to route that data weren’t scaling. A better solution was needed...

Slide 38

Slide 39

KAFKA MOTIVATIONS
● LinkedIn generates a ton of data
○ Pageviews
○ Edits on profiles, companies, schools
○ Logging, timing
○ Invites, messaging
○ Tracking
● Billions of events every day
● Separate and independently created pipelines routed this data

Slide 40


Slide 41

A WHOLE LOT OF CUSTOM PIPELINES... As LinkedIn needed to scale, each pipeline needed to scale.

Slide 42

KAFKA
Distributed pub-sub messaging platform as LinkedIn’s universal data pipeline
[Diagram: frontend and backend services publish to Kafka; consumers include the DWH (Oracle), monitoring, analytics, and Hadoop]
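Kafka's central abstraction is an append-only log per topic: producers append, and each consumer tracks its own offset, so many downstream systems can read the same pipeline independently at their own pace. A toy in-process model of that idea (not the real distributed broker):

```java
import java.util.*;

// Toy append-only log for one topic. Producers append events; each
// consumer keeps its own offset and polls independently, which is why
// one pipeline can feed Hadoop, monitoring, and analytics at once.
public class MiniLog {
    private final List<String> log = new ArrayList<>();

    public synchronized int append(String event) {
        log.add(event);
        return log.size() - 1; // offset of the appended event
    }

    // Read everything at or after an offset; the consumer advances it.
    public synchronized List<String> poll(int fromOffset) {
        return new ArrayList<>(log.subList(fromOffset, log.size()));
    }

    public static void main(String[] args) {
        MiniLog topic = new MiniLog();
        topic.append("pageview:member=1");
        topic.append("invite:from=1,to=2");

        int hadoopOffset = 0;                 // one consumer from the start...
        List<String> batch = topic.poll(hadoopOffset);
        hadoopOffset += batch.size();
        System.out.println(batch.size());     // prints 2

        int monitoringOffset = 1;             // ...another at its own offset
        System.out.println(topic.poll(monitoringOffset)); // prints [invite:from=1,to=2]
    }
}
```

Because consumers own their offsets, adding a new downstream system never changes the producers, which is what replaced the tangle of point-to-point pipelines.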

Slide 43

KAFKA AT LINKEDIN BENEFITS
● Enabled near-realtime access to any data source
● Empowered Hadoop jobs
● Allowed LinkedIn to build realtime analytics
● Vastly improved site monitoring capability
● Enabled devs to visualize and track call graphs
● Over 1 trillion messages published per day, 10 million messages per second

Slide 44


Slide 45

Let’s end with the modern years

Slide 46

Slide 47

REST.LI
● Services extracted from Leo or created new were inconsistent and often tightly coupled
● Rest.li was our move to a data-model-centric architecture
● It ensured a consistent, stateless, RESTful API model across the company

Slide 48

REST.LI (cont.)
● By using JSON over HTTP, our new APIs supported non-Java-based clients
● By using Dynamic Discovery (D2), we got load balancing, discovery, and scalability of each service API
● Today, LinkedIn has 1130+ Rest.li resources and over 100 billion Rest.li calls per day
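The Dynamic Discovery idea can be sketched as client-side endpoint resolution: a client resolves a logical service name to the currently registered hosts and picks one per request. The registry below is a made-up stand-in, not the D2 API; real D2 keeps its endpoint membership in ZooKeeper:

```java
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;

// Toy client-side discovery / load balancing: a logical service name
// maps to its live endpoints, and each request picks one at random.
// Illustrative only -- real D2 stores membership in ZooKeeper and
// supports richer load-balancing strategies.
public class ServiceRegistry {
    private final Map<String, List<String>> endpoints = new HashMap<>();

    public void register(String service, String hostPort) {
        endpoints.computeIfAbsent(service, k -> new ArrayList<>()).add(hostPort);
    }

    public void deregister(String service, String hostPort) {
        List<String> live = endpoints.get(service);
        if (live != null) live.remove(hostPort);
    }

    // Resolve a logical name (think "d2://profileService") to one endpoint.
    public String pick(String service) {
        List<String> live = endpoints.get(service);
        if (live == null || live.isEmpty())
            throw new IllegalStateException("no endpoints for " + service);
        return live.get(ThreadLocalRandom.current().nextInt(live.size()));
    }

    public static void main(String[] args) {
        ServiceRegistry registry = new ServiceRegistry();
        registry.register("profileService", "host1:8080");
        registry.register("profileService", "host2:8080");
        System.out.println(registry.pick("profileService")); // one of the two hosts
    }
}
```

Because resolution happens in the client per request, hosts can come and go without any central load balancer reconfiguration.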

Slide 49

REST.LI (cont.)
[Screenshot: Rest.li automatic API documentation]

Slide 50

REST.LI (cont.)
[Diagram: the Rest.li R2/D2 tech stack]

Slide 51

LinkedIn’s success with data infrastructure like Kafka and Databus led to the development of more and more scalable data infrastructure solutions...

Slide 52

DATA INFRASTRUCTURE
● It was clear LinkedIn could build data infrastructure that enables long-term growth
● LinkedIn doubled down on infra solutions like:
○ Storage solutions: Espresso, Voldemort, Ambry (media)
○ Analytics solutions like Pinot
○ Streaming solutions: Kafka, Databus, and Samza
○ Cloud solutions like Helix and Nuage

Slide 53


Slide 54

LinkedIn is a global company and was continuing to see large growth. How else to scale?

Slide 55

MULTIPLE DATA CENTERS
● Natural progression of horizontal scaling
● Replicate data across many data centers using storage technology like Espresso
● Pin users to a geographically close data center
● Difficult but necessary
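Pinning a user to a geographically close data center can be sketched as a nearest-candidate pick. The coordinates and data center names below are made up for illustration, and real routing also weighs health and capacity, not just distance:

```java
import java.util.*;

// Toy user-to-data-center pinning: choose the candidate nearest the
// user's coordinates. DC names and locations are invented; squared
// Euclidean distance on lat/lon is a crude proxy, enough to compare.
public class DataCenterPinning {
    record DataCenter(String name, double lat, double lon) {}

    static final List<DataCenter> DCS = List.of(
        new DataCenter("us-west", 37.4, -122.1),
        new DataCenter("us-east", 38.9, -77.0),
        new DataCenter("eu", 50.1, 8.7));

    static String pin(double userLat, double userLon) {
        return DCS.stream()
            .min(Comparator.comparingDouble((DataCenter dc) ->
                (dc.lat() - userLat) * (dc.lat() - userLat)
              + (dc.lon() - userLon) * (dc.lon() - userLon)))
            .orElseThrow()
            .name();
    }

    public static void main(String[] args) {
        System.out.println(pin(48.8, 2.3));    // user near Paris -> eu
        System.out.println(pin(34.0, -118.2)); // user near LA -> us-west
    }
}
```

The pin has to be sticky per user (not per request) so that a member's writes and subsequent reads land in the same data center while replication catches the others up.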

Slide 56

MULTIPLE DATA CENTERS
● Multiple data centers are imperative to maintain high availability
● You need to avoid any single point of failure, not just for each service but for the entire site
● LinkedIn runs out of three main data centers, with additional PoPs around the globe and more coming online every day...

Slide 57

MULTIPLE DATA CENTERS
[Map: LinkedIn's operational setup as of 2015 (circles represent data centers, diamonds represent PoPs)]

Slide 58

Of course LinkedIn’s scaling story is never this simple, so what else have we done?

Slide 59

WHAT ELSE HAVE WE DONE?
● Each of LinkedIn’s critical systems has undergone its own rich history of scale (graph, search, analytics, profile backend, comms, feed)
● LinkedIn uses Hadoop / Voldemort for insights like People You May Know, Similar Profiles, Notable Alumni, and profile browse maps

Slide 60

WHAT ELSE HAVE WE DONE? (cont.)
● Re-architected the frontend approach using:
○ Client templates
○ BigPipe
○ Play Framework
● LinkedIn added multiple tiers of proxies using Apache Traffic Server and HAProxy
● We improved server performance with new hardware, advanced system tuning, and newer Java runtimes

Slide 61

Scaling sounds easy and quick to do, right?

Slide 62

“Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.”
Via Douglas Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid

Slide 63

THANKS!
Josh Clemm
www.linkedin.com/in/joshclemm

Slide 64

LEARN MORE
● Blog version of this slide deck
https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin
● Visual story of LinkedIn’s history
https://ourstory.linkedin.com/
● LinkedIn Engineering blog
https://engineering.linkedin.com
● LinkedIn Open-Source
https://engineering.linkedin.com/open-source
● LinkedIn’s communication system slides, which include the earliest LinkedIn architecture
http://www.slideshare.net/linkedin/linkedins-communication-architecture
● Slides which include the earliest LinkedIn data infra work
http://www.slideshare.net/r39132/linkedin-data-infrastructure-qcon-london-2012

Slide 65

LEARN MORE (cont.)
● Project Inversion - internal project to enable developer productivity (trunk-based model), faster deploys, unified services
http://www.bloomberg.com/bw/articles/2013-04-10/inside-operation-inversion-the-code-freeze-that-saved-linkedin
● LinkedIn’s use of Apache Traffic Server
http://www.slideshare.net/thenickberry/reflecting-a-year-after-migrating-to-apache-trafficserver
● Multi data center - testing failovers
https://www.linkedin.com/pulse/armen-hamstra-how-he-broke-linkedin-got-promoted-angel-au-yeung

Slide 66

LEARN MORE - KAFKA
● History and motivation around Kafka
http://www.confluent.io/blog/stream-data-platform-1/
● Thinking about streaming solutions as a commit log
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
● Kafka enabling monitoring and alerting
http://engineering.linkedin.com/52/autometrics-self-service-metrics-collection
● Kafka enabling real-time analytics (Pinot)
http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
● Kafka’s current use and future at LinkedIn
http://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future
● Kafka processing 1 trillion events per day
https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin

Slide 67

LEARN MORE - DATA INFRASTRUCTURE
● Open sourcing Databus
https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system
● Samza streams to help LinkedIn view call graphs
https://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza
● Real-time analytics (Pinot)
http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot
● Introducing the Espresso data store
http://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store

Slide 68

LEARN MORE - FRONTEND TECH
● LinkedIn’s use of client templates
○ Dust.js http://www.slideshare.net/brikis98/dustjs
○ Profile http://engineering.linkedin.com/profile/engineering-new-linkedin-profile
● Big Pipe on LinkedIn’s homepage
http://engineering.linkedin.com/frontend/new-technologies-new-linkedin-home-page
● Play Framework
○ Introduction at LinkedIn https://engineering.linkedin.com/play/composable-and-streamable-play-apps
○ Switching to a non-blocking asynchronous model https://engineering.linkedin.com/play/play-framework-async-io-without-thread-pool-and-callback-hell

Slide 69

LEARN MORE - REST.LI
● Introduction to Rest.li and how it helps LinkedIn scale
http://engineering.linkedin.com/architecture/restli-restful-service-architecture-scale
● How Rest.li expanded across the company
http://engineering.linkedin.com/restli/linkedins-restli-moment

Slide 70

LEARN MORE - SYSTEM TUNING
● JVM memory tuning
http://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications
● System tuning
http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
● Optimizing JVM tuning automatically
https://engineering.linkedin.com/java/optimizing-java-cms-garbage-collections-its-difficulties-and-using-jtune-solution

Slide 71

WE’RE HIRING LinkedIn continues to grow quickly and there’s still a ton of work we can do to improve. We’re working on problems that very few ever get to solve - come join us!

Slide 72

Slide 73