Structure, Personalization, Scale: A Deep Dive into LinkedIn Search


Slide 0

Structure, Personalization, Scale: A Deep Dive into LinkedIn Search
Asif Makhani and Daniel Tunkelang

Slide 1

Overview
- What is LinkedIn search and why should you care?
- What are our systems challenges?
- What are our relevance challenges?

Slide 2


Slide 3


Slide 4

Search helps members find and be found.

Slide 5

Search for people, jobs, groups, and more.

Slide 6


Slide 7

A separate product for recruiters.

Slide 8

Search is the core of key LinkedIn use cases.

Slide 9

What’s unique
- Personalized
- Part of a larger product experience
  - Many products
  - Big part
- Task-centric
  - Find a job, hire top talent, find a person, …

Slide 10

Systems Challenges

Slide 11

Evolution of LinkedIn’s Search Architecture
2004: No search engine
- Iterate through your network and filter

Slide 12

2007: Introducing Lucene (single shard, multiple replicas)
(Diagram labels: Updates, Queries, Results, Lucene (Single Shard))

Slide 13

2008: Zoie - real-time search (search without commits/shutdown)
(Diagram labels: Updates, Queries, Results, Updater, Lucene, Zoie)

Slide 14

2008: Content Store (aggregating multiple input sources)
(Diagram labels: Source 1, Source 2, …, Source N, Content Store, Updater, Lucene, Zoie, Queries, Results)

Slide 15

2008: Sharded search
(Diagram labels: Source 1, Source 2, …, Source N, Content Store, Updater, Sharded Broker, Queries, Results)

Slide 16

2009: Bobo – Faceted Search
(Diagram labels: Source 1, Source 2, …, Source N, Content Store, Updater, Sensei Broker, Lucene, Zoie, Bobo, Queries, Results)

Slide 17

2010: SenseiDB (cluster management, new query language, wrapping existing pieces)
(Diagram labels: Updater)

Slide 18

2011: Cleo (instant typeahead results)
(Diagram labels: Updater)

Slide 19

2013: Too many stacks
- Group Search
- Article/Post Search
- And more…
(Diagram labels: Updater)

Slide 20

Challenges
- Index rebuilding is very difficult
- Live updates are at entity granularity
- Scoring is inflexible
- Lucene limitations
- Fragmentation: too many components, too many stacks

Opportunity
- Economic Graph

Slide 21

2014: Introducing Galene
(Diagram labels: Updater)

Slide 22

Life of a Query
(Diagram labels: User Query, Query Rewriter/Planner, Search Shard, Search Shard, Results Merging, Search Results)
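
The flow on this slide (rewrite the user query, send it to every search shard, merge the per-shard results) can be sketched roughly as below. This is an illustrative sketch only, not Galene code: `rewrite_query`, `search_all_shards`, and the shard callables are hypothetical names, and the rewriter here merely normalizes whitespace and case.

```python
import heapq
from typing import Callable, List, Tuple

Hit = Tuple[float, str]                   # (score, document id)
Shard = Callable[[str, int], List[Hit]]   # hypothetical shard interface


def rewrite_query(user_query: str) -> str:
    """Stand-in rewriter: lowercase and collapse whitespace (assumption)."""
    return " ".join(user_query.lower().split())


def search_all_shards(user_query: str, shards: List[Shard], k: int) -> List[Hit]:
    """Send the rewritten query to each shard and keep the k best hits overall."""
    rewritten = rewrite_query(user_query)
    per_shard = [shard(rewritten, k) for shard in shards]  # each shard returns its top k
    all_hits = (hit for hits in per_shard for hit in hits)
    return heapq.nlargest(k, all_hits)                     # merged, best score first
```

Asking every shard for its own top k is enough to guarantee that the merged top k is correct, since no document outside a shard's top k can be in the global top k.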

Slide 23

Life of a Query – Within A Search Shard
(Diagram labels: Rewritten Query, Top Results From Shard)

Slide 24

Life of a Query – Within A Rewriter

Slide 25


Slide 26

Improvements
- Regular full index builds using Hadoop
  - Easier to reshard, add fields
- Improved relevance
  - Offline relevance, query rewriting frameworks
- Partial live updates support
  - Allows efficient updates of high-frequency fields (no sync)
  - Goodbye Content Store, goodbye Zoie
- Early termination
  - Ultra-low latency for instant results
  - Goodbye Cleo
- Indexing and searching across graph entities/attributes
- Single engine, single stack

Slide 27

Galene Deep Dive

Slide 28

Primer on Search

Slide 29

Lucene
An open source API that supports search functionality:
- Add new documents to the index
- Delete documents from the index
- Construct queries
- Search the index using the query
- Score the retrieved documents

Slide 30

The Search Index
- Inverted index: mapping from (search) terms to the list of documents they are present in
- Forward index: mapping from documents to metadata about them
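
The two mappings can be illustrated with a small in-memory sketch. It is a toy under stated assumptions, not how Lucene stores its index: `build_indexes` is a hypothetical helper, terms are whitespace-separated lowercase words, and documents are given as `doc_id -> (text, metadata)`.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def build_indexes(
    docs: Dict[int, Tuple[str, Dict[str, Any]]]
) -> Tuple[Dict[str, List[int]], Dict[int, Dict[str, Any]]]:
    """Build a toy inverted index (term -> doc ids) and forward index (doc id -> metadata)."""
    inverted: Dict[str, List[int]] = defaultdict(list)
    forward: Dict[int, Dict[str, Any]] = {}
    for doc_id in sorted(docs):                  # visit docs in id order so lists stay sorted
        text, metadata = docs[doc_id]
        forward[doc_id] = metadata
        for term in set(text.lower().split()):   # one entry per (term, document)
            inverted[term].append(doc_id)
    return dict(inverted), forward
```

Retrieval consults the inverted index to find candidate documents for a term; the forward index is then read to fetch per-document data for scoring or display.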

Slide 31


Slide 32

The Search Index
- The lists are called posting lists
- Up to hundreds of millions of posting lists
- Up to hundreds of millions of documents
- Posting lists may contain as few as a single hit and as many as tens of millions of hits
- Terms can be
  - words in the document
  - inferred attributes about the document

Slide 33

Lucene Queries
- term:“asif makhani”
- term:asif term:daniel
- +term:daniel +prefix:tunk
- +asif +linkedIn
- +term:daniel connection:50510
- +term:daniel industry:software connection:50510^4

Slide 34

Early termination
- We order documents in the index based on a static rank, from most important to least important
- An offline relevance algorithm assigns a static rank to each document, and the index is sorted by that rank
- This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)
- Also works well with personalized search
  +term:asif +prefix:makh +(connection:35176 connection:418001 connection:1520032)
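
A rough sketch of the idea, assuming document ids are assigned in static-rank order (0 is the most important document) so that every posting list is already sorted by importance. The function name `search_early_terminated` is hypothetical, and the set-based membership test is a simplification; a real engine would advance cursors over the lists instead of materializing sets.

```python
from typing import List


def search_early_terminated(posting_lists: List[List[int]], k: int) -> List[int]:
    """AND together posting lists sorted by static rank, stopping after k matches."""
    if not posting_lists or k <= 0:
        return []
    shortest, *rest = sorted(posting_lists, key=len)
    others = [set(postings) for postings in rest]   # simplification, see lead-in
    results: List[int] = []
    for doc_id in shortest:                         # visited from most to least important
        if all(doc_id in other for other in others):
            results.append(doc_id)
            if len(results) == k:
                break                               # early termination: skip the tail
    return results
```

Because the best-ranked documents are visited first, the scan can stop as soon as k matches are found, which is what keeps latency low for instant results.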

Slide 35

Partial Updates
- Lucene segments are “document-partitioned”
- We have enhanced Lucene with “term-partitioned” segments
- We use 3 term-partitioned segments:
  - Base index (never changed)
  - Live update buffer
  - Snapshot index
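
The slide does not say how the three segments are combined at query time. The sketch below shows one possible reading, in which the newest segment holding a term wins (live update buffer, then snapshot index, then base index) and the buffer is periodically folded into the snapshot. Both the precedence rule and the flush step are assumptions for illustration, as are the names `posting_list` and `flush_live_buffer`.

```python
from typing import Dict, List

Segment = Dict[str, List[int]]  # term -> posting list (toy representation)


def posting_list(term: str, base: Segment, snapshot: Segment, live: Segment) -> List[int]:
    """Return the posting list from the newest segment that has the term (assumed rule)."""
    for segment in (live, snapshot, base):
        if term in segment:
            return segment[term]
    return []


def flush_live_buffer(snapshot: Segment, live: Segment) -> None:
    """Fold buffered updates into the snapshot and empty the buffer (assumed behaviour)."""
    snapshot.update(live)
    live.clear()
```

Under this reading the base index is never written to, so a frequently changing term such as a connection list can be updated without rebuilding the rest of the document.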

Slide 36

(Diagram labels: Base Index, Snapshot Index, Live Update Buffer)

Slide 37

Going Forward
- Consolidation across verticals
- Improved relevance support
  - Machine-learned models, query rewriting, relevant snippets, …
- Improved performance
- Search as a Service (SeaS)
- Exploring the Economic Graph

Slide 38

Quality Challenges

Slide 39

The Search Quality Pipeline
(Diagram labels: spellcheck, query tagging, vertical intent, query expansion)

Slide 40

Spellcheck
PEOPLE NAMES, COMPANIES, TITLES, PAST QUERIES
- n-grams: marissa => ma ar ri is ss sa
- metaphone: mark/marc => MRK
- co-occurrence counts: marissa:mayer = 1000
(Diagram labels: marisa meyer yahoo; marissa, marisa, meyer, mayer, yahoo)

Slide 41

Query Tagging
(Example query: machine learning data scientist brooklyn)

Slide 42

Vertical Intent: Results Blending
(Diagram labels: [company], [employees], [jobs], [name search])

Slide 43

Vertical Intent: Typeahead
- P(mongodb | mon) = 5%
- P(monsanto | mons) = 50%
- P(mongodb | mong) = 80%
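
Probabilities of this shape could be estimated from past query counts, as in the sketch below. Estimating them by simple relative frequency is an assumption; the slide does not say how LinkedIn computes them, and `completion_probabilities` and the `query_counts` input are hypothetical.

```python
from typing import Dict


def completion_probabilities(prefix: str, query_counts: Dict[str, int]) -> Dict[str, float]:
    """Estimate P(query | prefix) as the query's share of all counts matching the prefix."""
    matching = {q: c for q, c in query_counts.items() if q.startswith(prefix)}
    total = sum(matching.values())
    if total == 0:
        return {}
    return {q: c / total for q, c in matching.items()}
```

As the prefix grows, fewer queries match and the probability mass concentrates on one intent, which is the effect the three figures on the slide illustrate.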

Slide 44

Query Expansion

Slide 45

Ranking

Slide 46

Ranking is highly personalized.

Slide 47

Not just for name search.

Slide 48

Relevance Model

Slide 49

Examples of Features
- Search keywords matching title = 3
- Searcher location = Result location
- Searcher network distance to result = 2
- …

Slide 50

Model Training: Traditional Approach

Slide 51

Model Training: LinkedIn’s Approach

Slide 52

Fair Pairs and Easy Negatives
- Sample negatives from bottom results
- But watch out for variable-length result sets.
- Compromise, e.g., sample from page 10.
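
The "sample from page 10" compromise for easy negatives might look like the sketch below. The page size of 10, the number of samples, and the name `sample_easy_negatives` are assumptions; the fair-pairs part of the slide is not covered here.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_easy_negatives(
    ranked_results: Sequence[T],
    page: int = 10,
    page_size: int = 10,
    n: int = 2,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Sample up to n results from a fixed page of the ranking to use as negatives."""
    rng = rng or random.Random()
    start = (page - 1) * page_size
    candidates = list(ranked_results[start:start + page_size])
    # Short result sets may not reach the page at all; then no negatives are drawn.
    return rng.sample(candidates, min(n, len(candidates)))
```

Sampling from a fixed page rather than from the very bottom keeps the negatives comparable across queries whose result sets differ greatly in length.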

Slide 53

Model Selection
- Select model based on user and query features.
  - e.g., person name queries, recruiters making skills queries
- Resulting model is a tree with logistic regression leaves.
- Only one regression model evaluated for each document.
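
A tree with logistic regression leaves can be sketched as a small routing function followed by a single logistic score. The branch conditions mirror the slide's examples; the context keys, feature names, weights, and function names are all hypothetical.

```python
import math
from typing import Dict, Tuple

Model = Tuple[Dict[str, float], float]  # (feature weights, bias)


def logistic_score(weights: Dict[str, float], bias: float, features: Dict[str, float]) -> float:
    """Standard logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def select_model(context: Dict[str, bool], models: Dict[str, Model]) -> Model:
    """Walk a tiny decision tree over user and query features to pick one leaf model."""
    if context.get("is_name_query"):
        return models["name_query"]
    if context.get("is_recruiter") and context.get("is_skills_query"):
        return models["recruiter_skills"]
    return models["default"]


def score_document(context: Dict[str, bool], features: Dict[str, float],
                   models: Dict[str, Model]) -> float:
    """Route once, then evaluate only the selected regression model for the document."""
    weights, bias = select_model(context, models)
    return logistic_score(weights, bias, features)
```

The routing depends only on the searcher and the query, so it can be done once per request; each document then costs a single regression evaluation.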

Slide 54

Summary
- What is LinkedIn search and why should you care?
  LinkedIn search enables the participants in the economic graph to find and be found.
- What are our systems challenges?
  Indexing rich, structured content; retrieving using global and social factors; real-time updates.
- What are our relevance challenges?
  Query understanding, personalized machine-learned ranking models.

Slide 55

Asif Makhani, amakhani@linkedin.com, https://linkedin.com/in/asifmakhani
Daniel Tunkelang, dtunkelang@linkedin.com, https://linkedin.com/in/dtunkelang