
A Journey into Evaluation: From Retrieval Effectiveness to User Engagement






Slide 0

SPIRE 2015 – King's College London
A Journey into Evaluation: From Retrieval Effectiveness to User Engagement
Mounia Lalmas, Yahoo Labs London, mounia@acm.org


Slide 1

This talk
§ Introduction to user engagement
§ Evaluation in information retrieval (retrieval effectiveness)
§ From retrieval effectiveness to user engagement (from intra-session to inter-session evaluation; from small- to large-scale evaluation)


Slide 2

This talk: beyond the click, beyond relevance, towards user engagement


Slide 3

User engagement


Slide 4

What is user engagement?
"User engagement is a quality of the user experience that emphasizes the phenomena associated with wanting to use a technological resource longer and frequently" (Attfield et al., 2011)
It is the emotional, cognitive and behavioural connection that exists, at any point in time and over time, between a user and a technological resource. It shows up in:
§ self-report: happy, sad, enjoyment, …
§ physiology: gaze, body heat, mouse movement, …
§ analytics: click, upload, read, comment, share, …


Slide 5

Why is it important to engage users?
§ In today's wired world, users have enhanced expectations about their interactions with technology, resulting in increased competition amongst the purveyors and designers of interactive systems.
§ In addition to utilitarian factors, such as usability, we must consider the hedonic and experiential factors of interacting with technology, such as fun, fulfillment, play, and user engagement.
(O'Brien, Lalmas & Yom-Tov, 2014)


Slide 6

Online sites differ with respect to their engagement pattern (Lehmann et al., 2012):
§ Games – users spend much time per visit
§ Social media – users come frequently and stay long
§ Search – users come frequently and do not stay long
§ News – users come periodically, e.g. morning and evening
§ Niche – users come on average once a week, e.g. for a weekly post
§ Service – users visit the site when needed, e.g. to renew a subscription


Slide 7

Characteristics of user engagement (O'Brien, Lalmas & Yom-Tov, 2014):
§ Endurability (Read, MacFarlane & Casey, 2002; O'Brien, 2008)
§ Aesthetics (Jacques et al., 1995; O'Brien, 2008)
§ Motivation, interests, incentives, and benefits (Jacques et al., 1995; O'Brien & Toms, 2008)
§ Focused attention (Webster & Ho, 1997; O'Brien, 2008)
§ Novelty (Webster & Ho, 1997; O'Brien, 2008)
§ Reputation, trust and expectation (Attfield et al., 2011)
§ Richness and control (Jacques et al., 1995; Webster & Ho, 1997)
§ Positive affect (O'Brien & Toms, 2008)


Slide 8

Measuring user engagement:
§ Self-report (questionnaire, interview, think-aloud and think-after protocols) – subjective; short- and long-term; lab and field; small scale
§ Physiology (EEG, SCL, fMRI, eye tracking, mouse tracking) – objective; short-term; lab and field; small and large scale
§ Analytics (within- and across-session metrics, data science) – objective; short- and long-term; field; large scale


Slide 9

Attributes of user engagement
§ Scale (small versus large)
§ Setting (laboratory versus field)
§ Objective versus subjective
§ Temporality (short- versus long-term)
We focus on
1. Temporality: from intra- to inter-session
2. Scalability: from small- to large-scale


Slide 10

Evaluation in information retrieval


Slide 11

How to evaluate a search engine (Sec. 8.6)
§ Coverage
§ Speed
§ Query language
§ User interface
§ User happiness
  › Users find what they want and return to the search engine
  › Users complete the search task, where search is a means, not an end
(Manning, Raghavan & Schütze, 2008; Baeza-Yates & Ribeiro-Neto, 2011)


Slide 12

Within an online session (Lehmann et al., 2013)
› July 2012; 2.5M users; 785M page views
› Categorization of the most frequently accessed sites: 11 categories (e.g. news), 33 subcategories (e.g. news finance, news society); 760 sites from 70 countries/regions
› Short sessions: on average 3.01 distinct sites visited, with a revisitation rate of 10%
› Long sessions: on average 9.62 distinct sites visited, with a revisitation rate of 22%


Slide 13

Measuring user happiness (Sec. 8.1)
Most common proxy: relevance of search results
[Venn diagram: retrieved items versus relevant items among all items; their overlap determines precision and recall]
Evaluation measures:
• precision, recall, R-precision, precision@n, mean average precision, F-measure, …
• bpref, cumulative gains, …
§ User information need translated into a query
§ Relevance assessed relative to the information need, not the query
§ Example:
  › Information need: I am looking for a tennis holiday in a country with no rain
  › Query: tennis academy good weather
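
To make the first group of measures concrete, here is a minimal Python sketch (mine, not from the talk) computing precision, recall and average precision for a single ranked result list; the document ids and judgments are made up.

def precision_recall_ap(ranked, relevant):
    """Precision, recall and average precision for one ranked list.

    ranked:   list of document ids in the order the system returned them
    relevant: set of document ids judged relevant for the information need
    """
    hits = 0
    precisions_at_hits = []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions_at_hits.append(hits / k)  # precision at each relevant doc
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    avg_prec = sum(precisions_at_hits) / len(relevant) if relevant else 0.0
    return precision, recall, avg_prec

# Example: 10 results, 3 of the 4 relevant documents retrieved.
p, r, ap = precision_recall_ap(
    ranked=["d3", "d7", "d1", "d9", "d2", "d8", "d4", "d5", "d6", "d0"],
    relevant={"d3", "d9", "d4", "d99"},
)
print(f"P={p:.2f} R={r:.2f} AP={ap:.2f}")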


Slide 14

Measuring user happiness (Sec. 8.1)
Most common proxy: relevance of search results
§ Explicit signals: test collection methodology (TREC, CLEF, …); human-labeled corpora
§ Implicit signals: user behavior in online settings (clicks, skips, …)


Slide 15

Examples of implicit signals in web search
§ Number of clicks
§ Click at a given position
§ Time to first click
§ Skipping
§ Abandonment rate
§ Number of query reformulations
§ Dwell time
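
As an illustration of turning a click log into some of these signals, here is a toy sketch (not from the talk); the log schema is an assumption.

from dataclasses import dataclass

@dataclass
class Click:
    timestamp: float  # seconds since the query was issued (assumed schema)
    position: int     # rank of the clicked result, 1-based

def implicit_signals(clicks: list[Click]) -> dict:
    """Derive simple implicit signals for a single query impression."""
    if not clicks:  # no click at all: the query was abandoned
        return {"num_clicks": 0, "abandoned": True,
                "time_to_first_click": None, "max_position": None}
    ordered = sorted(clicks, key=lambda c: c.timestamp)
    return {
        "num_clicks": len(clicks),
        "abandoned": False,
        "time_to_first_click": ordered[0].timestamp,
        "max_position": max(c.position for c in clicks),  # how deep the user went
    }

print(implicit_signals([Click(4.2, 1), Click(31.0, 3)]))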


Slide 16

What is a happy user in web search?
1. The user information need is satisfied
2. The user has learned about a topic, and even about other topics
3. The system was inviting and even fun to use
USER ENGAGEMENT
§ In-the-moment engagement: users are active on a site or stay long
§ Long-term engagement: users come back frequently and over a long period


Slide 17

Interpreting the signals


Slide 18

Click-through rates (CTR): e.g. for evaluating a new ranking algorithm, a new design of the search result page, …


Slide 19

No clicks: "I just wanted the phone number … I am totally happy ☺"


Slide 20

Dwell time (Lalmas et al., 2015)
Dwell time used as a proxy of the user experience after a click on an ad on a mobile device.
[Figure: dwell times on non-mobile-optimized versus mobile-optimized publisher landing pages]
Dwell time on non-optimized landing pages was comparable to, and even higher than, on mobile-optimized ones … when a page is mobile optimized, do users realize quickly whether they "like" the ad or not?


Slide 21

Relevance in multimedia search: multimedia search activities are often driven by entertainment needs, not by information needs (Slaney, 2011)


Slide 22

Explorative or serendipitous search (Miliaraki, Blanco & Lalmas, 2015)


Slide 23

Objectivity versus subjectivity (Eduardo Graells, 2015): top most popular tweets versus top most popular tweets + geographical diversity
Being from a central or peripheral location makes a difference: peripheral users did not perceive the timeline as being diverse.
It should never be just about the algorithm, but also about how users respond to what the algorithm returns to them → USER ENGAGEMENT


Slide 24

Let us revisit


Slide 25

USER ENGAGEMENT and Interactive Information Retrieval (Ingwersen, Human Aspects in IR, ESSIR 2011)


Slide 26

Beyond clicks and relevance, towards user engagement
§ From intra- to inter-session evaluation
  › Dwell time and absence time
  › Linking strategy
  › Mobile advertising
  › happy users come back
§ From small- to large-scale evaluation
  › Eye tracking and user engagement questionnaire
  › Mouse tracking and user engagement questionnaire
  › we need to properly identify the happy users


Slide 27

From intra- to inter-session evaluation


Slide 28

From short- to long-term engagement: from intra- to inter-session engagement
We monitor intra-session metrics (how do users engage within a session?) as a proxy for inter-session metrics (how do users engage across sessions?), which we know stand for future engagement.


Slide 29

User engagement metrics


Slide 30

User engagement metrics: intra-session metrics
• Dwell time
• Session duration
• Bounce rate
• Play time (video)
• Mouse movement
• Click-through rate (CTR)
• Number of pages viewed (click depth)
• Conversion rate
• Number of UGC items (comments)
• …
Dwell time serves as a proxy of user interest, of relevance, of conversion, of post-click ad quality, …
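
A toy sketch (mine, not from the talk) of how a few of these intra-session metrics could be computed from one session's timestamped page views; the schema is an assumption.

def session_metrics(page_views):
    """Compute simple intra-session metrics from one session's page views.

    page_views: list of (url, timestamp_in_seconds), ordered by time.
    """
    if not page_views:
        return {}
    duration = page_views[-1][1] - page_views[0][1]  # session duration
    return {
        "session_duration_s": duration,
        "click_depth": len(page_views),           # number of pages viewed
        "bounced": len(page_views) == 1,          # single-page session = bounce
        "distinct_pages": len({url for url, _ in page_views}),
    }

print(session_metrics([("/home", 0.0), ("/article/42", 12.5), ("/article/7", 95.0)]))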


Slide 31

Dwell time
§ Definition: the contiguous time spent on a site or web page
§ Similar measures: play time (for video sites)
§ Cons: not clear that the user was actually looking at the site while there → blur/focus events
(O'Brien, Lalmas & Yom-Tov, 2014) [Figure: distribution of dwell times on 50 websites]


Slide 32

Dwell time (O'Brien, Lalmas & Yom-Tov, 2014) [Figure: dwell time on 50 websites]
§ Dwell time varies by site type: leisure sites tend to have longer dwell times than news, e-commerce, etc.
§ Dwell time has a relatively large variance, even for the same site (tourist, VIP, active, … users)


Slide 33

Dwell time across sessions or absence time


Slide 34

The context – search experience


Slide 35

The context – search experience


Slide 36

Absence time and survival analysis
[Figure: survival curves over 0–20 hours for users who read stories 1–9. To survive = not yet having come back (e.g. the share of users who read story 2 but did not come back after 10 hours); to die = returning to the site, so dying quickly means a short absence time.]
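
A minimal sketch of this kind of survival analysis, assuming the lifelines Python package (my choice, not stated in the talk); the absence times below are invented. "Death" is the user returning, and users who never return within the observation window are censored.

from lifelines import KaplanMeierFitter

# Absence time in hours until the user returned to the site ("died").
# Users who did not return within the observation window are censored.
absence_hours = [0.5, 2.0, 3.5, 10.0, 20.0, 20.0, 1.0, 4.0]
returned =      [1,   1,   1,   1,    0,    0,    1,   1]   # 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(absence_hours, event_observed=returned, label="story readers")
print(kmf.survival_function_)      # fraction of users still absent at each time
print(kmf.median_survival_time_)   # median absence time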


Slide 37

Absence time applied to search: ranking functions on Yahoo Answers Japan
› Two weeks of click data on Yahoo Answers Japan search
› One million users
› Six ranking functions
› 30-minute session boundary
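
A sketch (mine, not the paper's code) of how absence time could be derived from one user's event log with the 30-minute session boundary mentioned above; the schema is illustrative.

from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # boundary used in the study

def absence_times(event_times):
    """Split one user's time-ordered events into sessions using a 30-minute
    inactivity boundary, and return the absence time between consecutive
    sessions (end of one session to start of the next)."""
    if not event_times:
        return []
    absences, session_end = [], event_times[0]
    for t in event_times[1:]:
        if t - session_end > SESSION_GAP:      # gap long enough: new session
            absences.append(t - session_end)   # absence between the sessions
        session_end = t
    return absences

ts = [datetime(2013, 1, 1, 9, 0), datetime(2013, 1, 1, 9, 10),
      datetime(2013, 1, 1, 15, 0), datetime(2013, 1, 2, 9, 0)]
print(absence_times(ts))  # [5:50:00, 18:00:00]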


Slide 38

Absence time and number of clicks on the search result page
[Figure: survival curves for the control (no click) versus sessions with 3 and 5 clicks. In survival-analysis terms, a high hazard rate (dying quickly) means a short absence.]


Slide 39

Absence time and the search experience (Dupret & Lalmas, 2013): search session metrics → absence time
1. No click means a bad user experience
2. Clicking between 3 and 5 results leads to the same user experience
3. Clicking on more than 5 results reflects a poorer user experience; users cannot find what they are looking for
4. Clicking lower in the ranking (2nd, 3rd) suggests a more careful choice from the user (compared to 1st)
5. Clicking at the bottom is a sign of a low-quality overall ranking
6. Users finding their answers quickly (time to first click) return sooner to the search application
7. Returning to the same search result page is a worse user experience than reformulating the query


Slide 40

Others


Slide 41

The context – linking strategy in online news (Lehmann et al., In Progress)
[Figure: p(absence 12h) for a news provider, comparing no click, off-site click, and clicks on related off-site content]
Off-site links → absence time: providing links to related off-site content has a positive long-term effect.


Slide 42

The context – mobile advertising (Lalmas et al., 2015)
Dwell time → ad click
[Figure: up to a 600% difference in subsequent ad clicks between short and long ad clicks]
A positive post-click experience ("long" clicks) has an effect on users clicking on ads again.
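
An illustrative sketch, not the paper's method: bucket ad clicks into "short" and "long" by a dwell-time threshold (the 30-second cutoff and the schema are my assumptions) and compare how often users in each bucket click an ad again.

from collections import defaultdict

THRESHOLD_S = 30.0  # assumed short/long boundary

def repeat_click_rate(ad_clicks):
    """ad_clicks: list of (user_id, dwell_seconds, clicked_ad_again: bool)."""
    buckets = defaultdict(list)
    for user, dwell, again in ad_clicks:
        bucket = "long" if dwell >= THRESHOLD_S else "short"
        buckets[bucket].append(again)
    return {b: sum(v) / len(v) for b, v in buckets.items()}

clicks = [("u1", 5, False), ("u2", 90, True), ("u3", 12, False), ("u4", 60, True)]
print(repeat_click_rate(clicks))  # {'short': 0.0, 'long': 1.0}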


Slide 43

Beyond clicks and relevance, towards user engagement
§ From intra- to inter-session evaluation
  › Dwell time and absence time
  › Linking strategy
  › Mobile advertising
  › happy users come back


Slide 44

From small- to large-scale evaluation


Slide 45

Small-scale measurement – focused attention questionnaire (O'Brien & Toms, 2010)
5-point scale (strongly disagree to strongly agree):
1. I lost myself in this news task experience
2. I was so involved in my news tasks that I lost track of time
3. I blocked out things around me when I was completing the news tasks
4. When I was performing these news tasks, I lost track of the world around me
5. The time I spent performing these news tasks just slipped away
6. I was absorbed in my news tasks
7. During the news task experience I let myself go


Slide 46

Small-scale measurement – PANAS questionnaire (Watson, Clark & Tellegen, 1988)
10 positive and 10 negative items: "You feel this way right now, that is, at the present moment" [1 = very slightly or not at all; 2 = a little; 3 = moderately; 4 = quite a bit; 5 = extremely] [randomize items]
§ Negative: distressed, upset, guilty, scared, hostile, irritable, ashamed, nervous, jittery, afraid
§ Positive: interested, excited, strong, enthusiastic, proud, alert, inspired, determined, attentive, active


Slide 47

Small-scale measurement – gaze and self-reporting (Arapakis et al., 2014)
§ News interest: 57 users, 114 reading tasks
§ Three metrics: gaze, focused attention and positive affect
  • questionnaire (qualitative data)
  • eye-tracking recordings (quantitative data)
§ All three metrics align: interesting content promotes all engagement metrics


Slide 48

From small- to large-scale measurement – mouse tracking
§ Navigation and interaction with a digital environment usually involve the use of a mouse (selecting, positioning, clicking)
§ Several works show the mouse cursor to be a weak proxy of gaze (attention)
§ A low-cost, scalable alternative
§ Can be performed in a non-invasive manner, without removing users from their natural setting


Slide 49

Relevance, dwell time & cursor (Guo & Agichtein, 2012): "reading" a relevant long document versus "scanning" a long non-relevant document


Slide 50

“Ugly” vs “Normal” Interface BBC News Wikipedia


Slide 51

Mouse tracking and self-reporting (Warnock & Lalmas, 2015)
§ 324 users from Amazon Mechanical Turk (between-subject design)
§ Two tasks (reading and search)
§ "Normal" vs "ugly" interface
§ Questionnaires (qualitative data)
  › focused attention, positive affect
  › interest, aesthetics
§ Mouse tracking (quantitative data)
  › movement speed, movement rate, click rate, pause length, percentage of time still
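
A toy sketch (not the study's code) of computing two of the listed cursor features, movement speed and percentage of time still, from a raw cursor trace; the 5-pixel stillness threshold is an assumption.

import math

def cursor_features(trace, still_threshold_px=5.0):
    """Summary features for one mouse trace.

    trace: time-ordered (x, y, t_seconds) cursor samples.
    """
    if len(trace) < 2:
        return {}
    total_dist, still_time = 0.0, 0.0
    for (x0, y0, t0), (x1, y1, t1) in zip(trace, trace[1:]):
        step = math.hypot(x1 - x0, y1 - y0)
        total_dist += step
        if step < still_threshold_px:      # barely moved: count as "still"
            still_time += t1 - t0
    elapsed = trace[-1][2] - trace[0][2]
    return {
        "movement_speed_px_s": total_dist / elapsed,
        "percent_time_still": 100.0 * still_time / elapsed,
    }

print(cursor_features([(0, 0, 0.0), (120, 80, 1.0), (121, 80, 2.0)]))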


Slide 52

Mouse tracking could not tell much about:
• focused attention and positive affect
• user interest in the task/topic
• aesthetics – the "ugly" variant did not result in lower user aesthetics scores (although BBC > Wikipedia)
BUT – the comments left …
› Wikipedia: "The website was simply awful. Ads flashing everywhere, poor text colors on a dark blue background."; "The webpage was entirely blue. I don't know if it was supposed to be like that, but it definitely detracted from the browsing experience."
› BBC News: "The website's layout and color scheme were a bitch to navigate and read."; "Comic sans is a horrible font."


Slide 53

Flawed methodology? Non-existent signal? Wrong metric? Wrong measure?
§ Hawthorne effect
§ Design
  › usability versus engagement
  › within- versus between-subject
§ The mouse movement measures were not sophisticated enough


Slide 54

Mouse gestures → features (Arapakis, Lalmas & Valkanas, 2014)
[Diagram: a cursor trajectory through positions x0y0 … x8y8 over time, with clicks and resting-cursor events (500 ms, 1000 ms, 1500 ms) marking gesture boundaries (Δt rest)]
22 users reading two articles; 176,550 cursor positions; 2,913 mouse gestures


Slide 55

Towards a taxonomy of mouse gestures for user engagement measurement
§ The top-ranked clustering configuration was spectral clustering on the original dataset, with a hyperbolic tangent kernel, for k = 38
• certain types of mouse gestures occur more or less often, depending on user interest in the article
• significant correlations between certain types of mouse gestures and self-report measures
• cursor behaviour goes beyond measuring frustration; it informs about both positive and negative interaction
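
A minimal sketch of this clustering configuration using scikit-learn (my choice of library, not necessarily the paper's): sigmoid_kernel is scikit-learn's hyperbolic tangent kernel, the feature vectors are placeholders, and negative similarities are clipped so the affinity matrix stays valid for spectral clustering.

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import sigmoid_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))        # placeholder gesture feature vectors

affinity = np.clip(sigmoid_kernel(X), 0.0, None)   # tanh kernel, clipped at 0
labels = SpectralClustering(
    n_clusters=38, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(np.bincount(labels))           # gesture count per cluster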


Slide 56

Beyond clicks and relevance, towards user engagement
§ From small- to large-scale evaluation
  › Eye tracking and user engagement questionnaire
  › Mouse tracking and user engagement questionnaire
  › we need to properly identify the happy users


Slide 57

Towards user engagement


Slide 58

Towards user engagement: happy users come back, and we need to properly identify the happy users.


Slide 59

Thank you
§ "If you cannot measure it, you cannot improve it" – William Thomson (Lord Kelvin)
§ "You cannot control what you cannot measure" – DeMarco
§ "The way you measure is more important than what you measure" – Art Gust

