
Take Action on Results with Statistics






Slide 0

Take Action on Results with Statistics: An Optimizely Online Workshop. Statistician: Leonid Pekelis


Slide 1

Optimizely’s Stats Engine is designed to work with you, not against you, to provide results which are reliable and accurate, without requiring statistical training. At the same time, by knowing some statistics of your own, you can tune Stats Engine to get the most performance for your unique needs.


Slide 2

After this workshop, you should be able to answer ...
1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?
2. What are the three tradeoffs in an A/B Test? And how are they related?
3. How can you use Optimizely's results page to best tune the tradeoffs to achieve your experimentation goals?


Slide 3

We will also preview how to choose the number of goals and variations for your experiment.


Slide 4

First, some vocabulary (yay!)


Slide 5

Which is the Improvement?
• A) The original, or baseline, version of content that you are testing against a variation.
• B) Metric used to measure the impact of the control and variation.
• C) The control group's expected conversion rate.
• D) The relative percentage difference of your variation from baseline.
• E) The number of visitors in your test.


Slide 6

• A) Control and Variation: the original, or baseline, version of content that you are testing against a variation.
• B) Goal: metric used to measure the impact of the control and variation.
• C) Baseline conversion rate: the control group's expected conversion rate.
• D) Improvement: the relative percentage difference of your variation from baseline.
• E) Sample size: the number of visitors in your test.


Slide 7

Stats Engine corrects the pitfalls of A/B Testing with classical statistics.


Slide 8

A procedure for classical statistics (a.k.a. "T-test", a.k.a. "Traditional Frequentist", a.k.a. "Fixed Horizon Testing"):
Farmer Fred wants to compare the effect of two fertilizers on crop yield.
1. Chooses how many plots to use (sample size).
2. Waits for a crop cycle, collects data once at the end.
3. Asks "What are the chances I'd have gotten these results if there was no difference between the fertilizers?" (a.k.a. the p-value). If the p-value < 5%, his results are significant.
4. Goes on, maybe to test irrigation methods.
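As an aside on step 3: a minimal sketch of the fixed-horizon p-value calculation Farmer Fred runs once at the end, written as a standard two-proportion z-test. The conversion counts are made up for illustration; they are not from the workshop.

```python
# Classical fixed-horizon check: one p-value, computed once at the pre-chosen
# sample size. The counts below are illustrative only.
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both groups share the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)             # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))                           # two-sided tail probability

p = two_proportion_p_value(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"p-value = {p:.3f} -> significant at 5%? {p < 0.05}")
```

The key property, and online the key weakness, is that this number is only valid if it is computed exactly once, at the sample size chosen in advance.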


Slide 9

Classical statistics were designed for an offline world.
1915: Data is expensive. Data is slow. Practitioners are trained.
2015: Data is cheap. Data is real-time. Practitioners are everyone.


Slide 10

The modern A/B Testing procedure is different:
1. Start without a good estimate of sample size.
2. Check results early and often. Estimate ROI as quickly as possible.
3. Ask "How likely is it that my testing procedure gave a wrong answer?"
4. Test many variations on multiple goals, not just one.
5. Iterate. Iterate. Iterate.


Slide 11

Pitfall 1. Peeking


Slide 12

Peeking
[Timeline diagram: Experiment Starts -> p-value > 5%, inconclusive -> Min Sample Size -> p-value > 5%, inconclusive -> p-value > 5%, inconclusive -> p-value < 5%, significant!]


Slide 13

Why is this a problem? There is a ~5% chance of a false positive each time you peek.


Slide 14

Peeking
[Same timeline diagram: Experiment Starts -> inconclusive -> Min Sample Size -> inconclusive -> inconclusive -> significant!]
4 peeks -> ~18% chance of seeing a false positive.
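Where the ~18% comes from: a back-of-the-envelope sketch that treats each peek as an independent 5% chance of a false positive. That independence is a simplification, since repeated looks at the same accumulating data are correlated, but the qualitative conclusion is the same: the more you peek, the more false positives you see.

```python
# Back-of-the-envelope inflation from peeking: chance of at least one
# "significant" result across k peeks, treating peeks as independent 5% draws.
alpha = 0.05
for peeks in (1, 2, 4, 10):
    p_any_false_positive = 1 - (1 - alpha) ** peeks
    print(f"{peeks:2d} peek(s) -> {p_any_false_positive:.1%} chance of a false positive")
# 1 peek -> 5.0%, 4 peeks -> 18.5%, 10 peeks -> 40.1%
```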


Slide 15

Pitfall 2. Mistaking “False Positive Rate” for “Chance of a wrong conclusion”


Slide 16

Say I run an experiment.


Slide 17

1 original page, 5 variations, 6 goals = 30 “A/B Tests”


Slide 18

After I reach my minimum sample size, I stop the experiment and see 2 of my variations beating the control and 1 variation losing to the control.


Slide 19

[Results: Winner, Winner, Loser]
Classical statistics guarantee <= 5% false positives. What % of my 2 winners and 1 loser do I expect to be false positives?


Slide 20

[Grid of all 30 A/B tests, nearly all inconclusive]
2 winners, 1 loser, and 27 inconclusives.


Slide 21

[Same grid of 30 A/B tests]
30 A/B Tests x 5% = 1.5 false positives!


Slide 22

[Results: Winner, Winner, Loser]
Classical statistics guarantee <= 5% false positives. What % of my winners & losers do I expect to be false positives?
Answer: C) With 30 A/B Tests, we can expect a 1.5 / 3 = 50% chance of a wrong conclusion! In general, we can't say without knowing how many other goals & variations were tested.
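The arithmetic behind that 50%, in a form you can reuse for other experiment sizes. The simplifying assumption, labeled in the code, is that most of the 30 comparisons have no real effect, so each contributes roughly its full 5% false positive chance.

```python
# Expected share of false positives among the conclusive results when every
# comparison is run at a 5% per-test false positive rate.
def expected_false_discovery_share(n_tests, n_conclusive, alpha=0.05):
    """Assumes (simplification) that nearly all n_tests are truly null, so each
    contributes about alpha expected false positives."""
    expected_false_positives = n_tests * alpha
    return expected_false_positives / n_conclusive

# The workshop example: 30 A/B tests, 3 conclusive results (2 winners + 1 loser).
share = expected_false_discovery_share(n_tests=30, n_conclusive=3)
print(f"{share:.0%} of the conclusive results are expected to be false positives")  # 50%
```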


Slide 23

After this workshop, you should be able to answer ...
1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?
2. What are the three tradeoffs in an A/B Test? And how are they related?
3. How can you use Optimizely's results page to best tune the tradeoffs to achieve your experimentation goals?


Slide 24

After this workshop, you should be able to answer ...
1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?
   A. Peeking and mistaking "False Positive Rate" for "Chance of a wrong conclusion."


Slide 25

The tradeoffs of A/B Testing


Slide 26

[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR]


Slide 27

[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR]
Error rates: the "Chance of a wrong conclusion."


Slide 28

[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR]
Error rates: the "Chance of a wrong conclusion: calling a non-winner a winner, or a non-loser a loser."


Slide 29

[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR]


Slide 30

Where is the error rate on Optimizely's results page?
Statistical Significance = "Chance of a right conclusion" = 100 x (1 - False Discovery Rate)
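To connect that formula back to the previous example: Statistical Significance reads as the expected share of correct calls among the winners and losers you act on, which is exactly what the classical 30-test example failed to control.

```python
# Statistical Significance as displayed = 100 * (1 - False Discovery Rate).
def significance_from_fdr(fdr):
    return 100 * (1 - fdr)

print(significance_from_fdr(1.5 / 3))  # 50.0 -> the classical 30-test example
print(significance_from_fdr(0.05))     # 95.0 -> a false discovery rate held at 5%
```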


Slide 31

How can you control the error rate?


Slide 32

[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR]


Slide 33

Where is runtime on Optimizely’s results page?


Slide 34

[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR]
Were you expecting a funny picture?


Slide 35

Where is effect size on Optimizely’s results page?


Slide 36

These three quantities are all ... inversely related.
[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR]


Slide 37

[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR (inversely related)]
At any number of visitors, the higher the error rate you allow, the smaller the improvement you can detect.


Slide 38

[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR (inversely related)]
At any error rate threshold, stopping your test earlier means you can only detect larger improvements.


Slide 39

[Tradeoff diagram: Error rates, Runtime, Improvement & Baseline CR (inversely related)]
For any improvement, the lower the error rate you want, the longer you need to run your test.


Slide 40

What does this look like in practice? Baseline conversion rate = 10%.
Average visitors needed to reach significance with Stats Engine:

Significance Threshold (Error Rate) | Improvement (relative): 5% | 10%  | 25%
95 (5%)                             | 62 K                       | 14 K | 1,800
90 (10%)                            | 59 K                       | 12 K | 1,700
80 (20%)                            | 53 K                       | 11 K | 1,500
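The figures in this table (and the traffic-specific versions on the following slides) are Stats Engine averages from Optimizely, but the shape of the relationship is the familiar power-analysis one: required visitors grow roughly with the inverse square of the relative improvement, and shrink as you accept a higher error rate. A rough fixed-horizon sketch of that scaling follows; this is not Stats Engine's sequential calculation, and the sample sizes are per variation and purely illustrative.

```python
# Rough fixed-horizon sample-size approximation for one variation vs. control,
# only to illustrate how required visitors scale with improvement and error rate.
# NOT Stats Engine's sequential calculation; treat the exact numbers loosely.
from scipy.stats import norm

def visitors_per_variation(baseline_cr, rel_improvement, alpha=0.05, power=0.80):
    p1 = baseline_cr
    p2 = baseline_cr * (1 + rel_improvement)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)        # significance + power quantiles
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(z ** 2 * variance / (p1 - p2) ** 2)

for improvement in (0.05, 0.10, 0.25):
    n = visitors_per_variation(baseline_cr=0.10, rel_improvement=improvement)
    print(f"+{improvement:.0%} relative improvement: ~{n:,} visitors per variation")
# Halving the detectable improvement roughly quadruples the visitors you need.
```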


Slide 41

~1 K visitors per day. Baseline conversion rate = 10%.
Average visitors needed to reach significance with Stats Engine:

Significance Threshold (Error Rate) | Improvement (relative): 5% | 10%  | 25%
95 (5%)                             | 62 K                       | 14 K | 1,800
90 (10%)                            | 59 K                       | 12 K | 1,700
80 (20%)                            | 53 K                       | 11 K | 1,500 (1 day)


Slide 42

~10 K visitors per day. Baseline conversion rate = 10%.
Average visitors needed to reach significance with Stats Engine:

Significance Threshold (Error Rate) | Improvement (relative): 5% | 10%          | 25%
95 (5%)                             | 62 K                       | 14 K         | 1,800
90 (10%)                            | 59 K                       | 12 K         | 1,700
80 (20%)                            | 53 K                       | 11 K (1 day) | 1,500


Slide 43

~50 K visitors per day. Baseline conversion rate = 10%.
Average visitors needed to reach significance with Stats Engine:

Significance Threshold (Error Rate) | Improvement (relative): 3% | 5%           | 10%
95 (5%)                             | 190 K                      | 62 K         | 14 K
90 (10%)                            | 180 K                      | 59 K         | 12 K
80 (20%)                            | 160 K                      | 53 K (1 day) | 11 K


Slide 44

> 100 K visitors per day. Baseline conversion rate = 10%.
Average visitors needed to reach significance with Stats Engine:

Significance Threshold (Error Rate) | Improvement (relative): 3% | 5%   | 10%
95 (5%)                             | 190 K                      | 62 K | 14 K
90 (10%)                            | 180 K                      | 59 K | 12 K
80 (20%)                            | 160 K (1 day)              | 53 K | 11 K


Slide 45

After this workshop, you should be able to answer ...
1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?
2. What are the three tradeoffs in an A/B Test? And how are they related?
3. How can you use Optimizely's results page to best tune the tradeoffs to achieve your experimentation goals?


Slide 46

After this workshop, you should be able to answer ...
1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?
2. What are the three tradeoffs in an A/B Test? And how are they related?
   A. Error Rates, Runtime, and Effect Size. They are all inversely related.


Slide 47

Use tradeoffs to align your testing goals


Slide 48

In the beginning, we make an educated guess ...
[Tradeoff diagram: Error rate 5%, Runtime 53 K visitors, Improvement +5% at a 10% baseline CR]


Slide 49

… but after 1 day … Data! How can we update the tradeoffs?


Slide 50

1. Adjust your timeline


Slide 51

Improvement turns out to be better ...
[Tradeoff diagram: Error rate 5%, Runtime 1,600 visitors, Improvement +13% at a 10% baseline CR]
Instead of: 53 K - 10 K = 43 K visitors still to go.


Slide 52

... or worse.
[Tradeoff diagram: Error rate 5%, Runtime 75 K visitors, Improvement +2% at an 8% baseline CR]


Slide 53

2. Accept higher / lower error rate


Slide 54

Improvement turns out to be better ...
[Tradeoff diagram: Error rate 1%, Runtime 43 K visitors, Improvement +13% at a 10% baseline CR]


Slide 55

... or worse.
[Tradeoff diagram: Error rate 30%, Runtime 43 K visitors, Improvement +2% at an 8% baseline CR]


Slide 56

3. Admit it. It’s inconclusive.


Slide 57

... or a lot worse.
[Tradeoff diagram: Error rate > 99%, Runtime > 100 K visitors, Improvement +0.2% at an 8% baseline CR]
Iterate, iterate, iterate!


Slide 58

Seasonality & Time Variation
Your experiments will not always have the same improvement over time. So, run A/B Tests for at least a business cycle appropriate for that test and your company.


Slide 59

After this workshop, you should be able to answer ...
1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?
2. What are the three tradeoffs in an A/B Test? And how are they related?
3. How can you use Optimizely's results page to best tune the tradeoffs to achieve your experimentation goals?


Slide 60

After this workshop, you should be able to answer ...
1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?
2. What are the three tradeoffs in an A/B Test?
3. How can you use Optimizely's results page to best tune the tradeoffs to achieve your experimentation goals?
   A. Adjust your timeline. Accept a higher / lower error rate. Admit an inconclusive result.


Slide 61

Review
1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?
   A. Peeking and mistaking "False Positive Rate" for "Chance of a Wrong Answer."
2. What are the three tradeoffs in an A/B Test?
   B. Error Rates, Runtime, and Effect Size. They are all inversely related.
3. How can you use Optimizely's results page to best tune the tradeoffs to achieve your experimentation goals?
   C. Accept a higher / lower error rate. Adjust your timeline. Admit an inconclusive result.


Slide 62

Preview: How many goals and variations should I use?


Slide 63

Stats Engine is more conservative when there are more goals that are not affected by a variation. So, adding a lot of “random” goals will slow down your experiment.
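Stats Engine's actual correction is part of its sequential procedure, but the effect described here can be illustrated with the classic Benjamini-Hochberg false-discovery-rate rule: the more goals you add that the variation does not move, the tighter the bar every individual result has to clear. The p-values below are made up for illustration.

```python
# Illustration only (not Optimizely's algorithm): Benjamini-Hochberg FDR control
# showing why extra unaffected goals make the procedure more conservative.
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of p-values declared significant at the given false discovery rate."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    n = len(p_values)
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / n * fdr:    # BH step-up threshold grows with rank
            cutoff = rank
    return sorted(order[:cutoff])

strong_goal = [0.004]                           # one goal with a real effect
random_goals = [0.30, 0.55, 0.72, 0.81, 0.95]   # goals the variation doesn't move

print(benjamini_hochberg(strong_goal))                  # [0]: significant on its own
print(benjamini_hochberg(strong_goal + random_goals))   # still [0], but the bar for
# the smallest p-value tightened from 0.05 to 0.05/6; a borderline result such as
# p = 0.02 would pass alone yet fail once the five "random" goals are added.
```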


Slide 64

Tips & Tricks for using Stats Engine with multiple goals and variations
• Ask: Which goal is most important to me? This should be the primary goal (it is not impacted by all the other goals).
• Run large, or large multivariate, tests without fear of finding spurious results, but be prepared for the cost of exploration.
• For maximum velocity, only test goals and variations that you believe will have the highest impact.


Slide 65


Slide 66

Review
1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?
   A. Peeking and mistaking "False Positive Rate" for "Chance of a Wrong Answer."
2. What are the three tradeoffs in an A/B Test?
   B. Error Rates, Runtime, and Effect Size. They are all inversely related.
3. How can you use Optimizely's results page to best tune the tradeoffs to achieve your experimentation goals?
   C. Accept a higher / lower error rate. Adjust your timeline. Admit an inconclusive result.


Slide 67

