
A/B Testing Pitfalls
Slides at http://bit.ly/ABPitfalls
Ronny Kohavi, Distinguished Engineer, General Manager,
Analysis and Experimentation, Microsoft
Joint work with Thomas Crook, Brian Frasca, and Roger Longbotham, A&E Team
A/B Tests in One Slide
Concept is trivial
 Randomly split traffic between two (or more) versions
oA (Control)
oB (Treatment)
 Collect metrics of interest
 Analyze
A/B test is the simplest controlled experiment
 A/B/n refers to multiple treatments (often used and encouraged: try control + two or three treatments)
 MVT refers to multivariable designs (rarely used by our teams)
Must run statistical tests to confirm differences are not due to chance
Best scientific way to prove causality, i.e., that the changes in metrics are caused by the changes introduced in the treatment(s)
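
As a minimal sketch of these steps (the hashing scheme and function names are illustrative assumptions, not the ExP platform's actual code):

```python
import hashlib
from statistics import NormalDist

def assign_variant(user_id: str, experiment_id: str, variants=("A", "B")) -> str:
    """Deterministically bucket a user by hashing (experiment_id, user_id)."""
    h = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return variants[int(h, 16) % len(variants)]

def two_proportion_z_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test: could the conversion-rate difference be due to chance?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Hashing makes the assignment deterministic, so a returning user always sees the same variant.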
ConversionXL Audience Statistics
83% of attendees ran less than 30 experiments last year.
Experimenters at Microsoft use our ExP platform to start ~30 experiments per day
Experiments per year run by attendees, based on a pre-conference survey (N=118):
 None: 13%
 Few: 20%
 12 (one per month): 30%
 15-30: 20%
 Lots and lots: 17%
Experimentation at Scale
I’ve been fortunate to work at an organization that values being data-driven (video)
We finish ~300 experiment treatments per week, mostly on Bing and MSN, but also on
Office, OneNote, Xbox, Cortana, Skype, Exchange, and OneDrive.
(These are “real,” useful treatments, not a 3x10x10 MVT counted as 300.)
Each variant is exposed to between 100K and millions of users, sometimes tens of millions
At Bing, 90% of eligible users are in experiments (10% are a global holdout changed once a year)
There is no single Bing. Since a user is exposed to over 15 concurrent experiments,
they get one of 5^15 ≈ 30 billion variants (debugging takes on a new meaning).
Until 2014, the system was limiting usage as it scaled.
Now the limits come from engineers’ ability to code new ideas.
Two Valuable Real Experiments
What is a valuable experiment?
 Absolute value of delta between expected outcome and actual outcome is large
 If you thought something was going to win and it wins, you have not learned much
 If you thought it was going to win and it loses, it’s valuable (learning)
 If you thought it was “meh” and it was a breakthrough, it’s HIGHLY valuable
See http://bit.ly/expRulesOfThumb for some examples of breakthroughs
These experiments ran at Microsoft’s Bing, with millions of users in each
For each experiment, we provide the OEC, the Overall Evaluation Criterion
Can you guess the winner correctly? Three choices are:
oA wins (the difference is statistically significant)
oFlat: A and B are approximately the same (no stat sig diff)
oB wins
Example: Bing Ads with Site Links
Should Bing add “site links” to ads, which allow advertisers to offer several
destinations on ads?
OEC: revenue, with ads constrained to the same vertical pixels on average
A (control, without site links) vs. B (treatment, with site links) [screenshots]
Pro adding: richer ads; users are better informed about where they will land
Cons: the pixel constraint means on average 4 “A” ads vs. 3 “B” ads,
and variant B is 5 msec slower (compute + higher page weight)
• Raise your left hand if you think A wins (left)
• Raise your right hand if you think B wins (right)
• Don’t raise your hand if you think they are about the same
Bing Ads with Site Links
If you raised your left hand, you were wrong
If you did not raise a hand, you were wrong
Site links generate incremental revenue on the order of tens of millions of dollars
annually for Bing
The above change was costly to implement. By contrast, we made two small changes to Bing,
each of which took days to develop and increased annual revenue by over $100 million
Example: Underlining Links
Does underlining increase or decrease clickthrough-rate?
OEC: Clickthrough Rate on search engine result page (SERP) for a query
A (with underlines)
B (no underlines)
• Raise your left hand if you think A wins (left, with underlines)
• Raise your right hand if you think B wins (right, without underlines)
• Don’t raise your hand if you think they are about the same
Underlines
If you raised your right hand, you were wrong
If you did not raise a hand, you were wrong
Underlines improve clickthrough-rate for both algorithmic results and ads (so more
revenue) and improve time to successful click
Modern web designs do away with underlines, and most sites have adopted this design,
despite data showing that users click less and take more time to click
For search engines (Google, Bing, Yahoo), this is a very questionable industry direction
Pitfall 1: Misinterpreting P-values
NHST = Null Hypothesis Statistical Testing, the “standard” model commonly used
P-value <= 0.05 is the “standard” for rejecting the Null hypothesis
The p-value is often misinterpreted.
Here are some incorrect statements from Steve Goodman’s “A Dirty Dozen”:
1. If P = .05, the null hypothesis has only a 5% chance of being true
2. A non-significant difference (e.g., P >.05) means there is no difference between groups
3. P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis
4. P = .05 means that if you reject the null hyp, the probability of a type I error (false positive) is only 5%
The problem is that p-value gives us Prob (X >= x | H_0), whereas what we want is
Prob (H_0 | X = x)
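To see why these differ, apply Bayes’ rule. A sketch of the decomposition, writing α = Prob(reject | H_0) for the significance level and π = Prob(reject | H_1) for the power:

Prob(H_0 | reject) = α × Prob(H_0) / (α × Prob(H_0) + π × (1 − Prob(H_0)))

A p-value of 0.05 bounds α, but it says nothing about the prior Prob(H_0), so the posterior probability of the null can be far above 5% (see the next pitfall for a concrete computation).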
Pitfall 2: Expecting Breakthroughs
 Breakthroughs are rare after initial optimizations.
 At Bing (well optimized), 80% of ideas fail to show value
 At other products across Microsoft, about 2/3 of ideas fail
Take Sessions/User, a key metric at Bing.
Historically, it improves in only 0.02% of treatments: one in 5,000 of the treatments we try!
Most of the time, we invoke Twyman’s law (http://bit.ly/twymanLaw)
Any figure that looks interesting or different is usually wrong
Note relationship to prior pitfall
 With standard p-value computations, 5% of experiments will show a stat-sig movement in Sessions/User
when there is no real movement (i.e., when the Null Hypothesis is true); half of those movements will be positive
 Combining the two: about 99.6% of the time, a stat-sig positive movement in Sessions/User with p-value = 0.05 will be a false positive (see the sketch below)
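
A back-of-the-envelope check of the 99.6% figure (a sketch: the 1-in-5,000 prior is from this slide; the 50% power for detecting real improvements is an illustrative assumption):

```python
prior_true = 1 / 5000   # Sessions/User truly improves ~1 in 5,000 treatments
alpha = 0.05            # significance level
power = 0.5             # assumed power to detect a real improvement

# Stat-sig *positive* movements: half of the null rejections go in the
# positive direction, plus the true improvements we actually detect.
false_pos = (1 - prior_true) * alpha / 2
true_pos = prior_true * power

print(false_pos / (false_pos + true_pos))  # ~0.996
```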
Pitfall 3: Not Checking for SRM
SRM = Sample Ratio Mismatch
If you run an experiment with equal percentages assigned to Control/Treatment (A/B),
you should have approximately the same number of users in each
Real example from an experiment alert I received this week:
 Control: 821,588 users, Treatment: 815,482 users
 Ratio: 50.2% (should have been 50%)
 Should I be worried?
Absolutely
 The p-value is 1.8e-6, so the probability of this split (or more extreme) happening by chance is less than 1
in 500,000
 Note that the above statement does not commit pitfall #1: by the experiment design, there
should be an equal number of users in control and treatment, so the conditional probability we want is exactly
P(observed split of 50.2% or more extreme | designed split = 50%)
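
A minimal sketch of the SRM check for these counts, using a chi-squared goodness-of-fit test against the designed 50/50 split:

```python
from scipy.stats import chisquare

# Users observed in each arm of a 50/50 experiment
observed = [821_588, 815_482]
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(observed, f_exp=expected)
print(p_value)  # ~1.8e-6: far too unlikely under the 50/50 design
```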
Pitfall 4: Wrong Success Metric (OEC)
 Office Online tested a new design for its homepage
Overall Evaluation Criterion (OEC) was clicks to the Buy Button [shown in red boxes]
Which one was better?
Treatment (price shown) vs. Control (no price shown) [screenshots]
 Objective: increase sales of Office products
Pitfall: Wrong OEC
Treatment had a drop in the OEC (clicks on buy) of 64%!
Not showing the price in the Control led more people to click to determine the price
Lesson: measure what you really need to measure: actual sales
(even though that is sometimes more difficult)
Lesson 2: Focus on long-term customer lifetime value
Peep, in his keynote here, said (he was OK with me mentioning this):
 What’s the goal? More money right now
 Common pitfall: you want to optimize long-term money, NOT money right now.
Raising prices gets you short-term money, but long-term abandonment
Coming up with a good OEC using short-term metrics is REALLY hard
Example: OEC for Search
KDD 2012 Paper: Trustworthy Online Controlled Experiments:
Five Puzzling Outcomes Explained
Search engines (Bing, Google) are evaluated on query share (distinct queries) and
revenue as long-term goals
Puzzle
 A ranking bug in an experiment resulted in very poor search results
 Degraded (algorithmic) search results cause users to search more to complete
their task, and ads appear more relevant
 Distinct queries went up over 10%, and revenue went up over 30%
This problem is now in the book Data Science Interviews Exposed
What metrics should be in the OEC for a search engine?
Puzzle Explained
Analyzing queries per month, we have

Queries/Month = (Queries/Session) × (Sessions/User) × (Users/Month)
where a session begins with a query and ends with 30 minutes of inactivity.
(Ideally, we would look at tasks, not sessions).
Key observation: we want users to find answers and complete tasks quickly,
so queries/session should be smaller
In a controlled experiment, the variants get (approximately) the same
number of users by design, so the last term is about equal
The OEC should therefore include the middle term: sessions/user
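
As a minimal sketch of the sessionization rule above (assuming each user’s query timestamps are available):

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def count_sessions(query_times: list[datetime]) -> int:
    """A session starts with a query and ends after 30 minutes of inactivity."""
    sessions = 0
    last = None
    for t in sorted(query_times):
        if last is None or t - last > TIMEOUT:
            sessions += 1
        last = t
    return sessions
```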
Bad OEC Example
Your data scientist makes an observation:
2% of queries end up with “No results.”
Manager: must reduce.
Assigns a team to minimize the “no results” metric
The metric improves, but the results for the query
brochure paper
are crap (or in this case, paper to clean crap)
Sometimes it *is* better to show “No Results.”
Real example from my Amazon Prime Now search, 3/26/2016
https://twitter.com/ronnyk/status/713949552823263234
Pitfall 5: Combining Data when Treatment Percent Varies with Time
Simplified example: 1,000,000 users per day
Conversion rate for two days:

            Friday (C/T split: 99/1)    Saturday (C/T split: 50/50)    Total
Control     20,000/990,000 = 2.02%      5,000/500,000 = 1.00%          25,000/1,490,000 = 1.68%
Treatment   230/10,000 = 2.30%          6,000/500,000 = 1.20%          6,230/510,000 = 1.22%
For each individual day, the Treatment is much better; however, the cumulative result for the Treatment is worse (Simpson’s paradox)
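
The paradox is easy to reproduce with the numbers from the table (a minimal sketch):

```python
days = {  # (conversions, users) per variant per day
    "Friday":   {"C": (20_000, 990_000), "T": (230, 10_000)},
    "Saturday": {"C": (5_000, 500_000),  "T": (6_000, 500_000)},
}

for day, arms in days.items():
    print(day, {v: c / n for v, (c, n) in arms.items()})  # Treatment wins each day

totals = {v: (sum(days[d][v][0] for d in days), sum(days[d][v][1] for d in days))
          for v in ("C", "T")}
print({v: c / n for v, (c, n) in totals.items()})  # Control wins overall
```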
Pitfall 6: Get the Stats Right
Two very good books on A/B testing (A/B Testing by Optimizely founders
Dan Siroker and Peter Koomen; and You Should Test That by WiderFunnel’s
CEO Chris Goward) get the stats wrong (see Amazon reviews).
Optimizely recently updated their stats in the product to correct for this
Best technique to find issues: run A/A tests (see the simulation sketch after this list)
 Like an A/B test, but both variants are exactly the same
 Are users split according to the planned percentages?
 Is the data collected matching the system of record?
 Are the results showing non-significant results 95% of the time?
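
A minimal sketch of the last check, simulating repeated A/A tests with a two-proportion z-test (the sample size, conversion rate, and trial count are illustrative):

```python
import random
from statistics import NormalDist

def aa_simulation(n=1_000, p=0.05, trials=2_000, alpha=0.05):
    """Simulate A/A tests: both arms draw from the same conversion rate p.
    Returns the fraction of runs declared stat-sig (should be ~= alpha)."""
    rejections = 0
    for _ in range(trials):
        a = sum(random.random() < p for _ in range(n))
        b = sum(random.random() < p for _ in range(n))
        pool = (a + b) / (2 * n)
        se = (pool * (1 - pool) * 2 / n) ** 0.5
        z = abs(a / n - b / n) / se
        if 2 * (1 - NormalDist().cdf(z)) < alpha:
            rejections += 1
    return rejections / trials

print(aa_simulation())  # expect ~0.05; much more or less signals a broken pipeline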
More Pitfalls
See KDD paper: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
(http://bit.ly/expPitfalls)
Incorrectly computing confidence intervals for percent change (see the sketch after this list)
Using standard statistical formulas for computations of variance and power
Neglecting to filter robots/bots
(a lucrative business, as shown in a photo I took)
Instrumentation issues
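
For the first item above: the naive approach plugs the absolute-difference interval into the percent formula, ignoring the variance of the control mean in the denominator. A sketch of a delta-method interval for percent change, assuming independent samples (one standard approach, not necessarily the paper’s exact formula):

```python
import numpy as np

def percent_change_ci(control, treatment, z=1.96):
    """Delta-method ~95% CI for the percent change of treatment mean vs. control mean."""
    c = np.asarray(control, dtype=float)
    t = np.asarray(treatment, dtype=float)
    ratio = t.mean() / c.mean()
    # Var(T_bar / C_bar) ~= Var(T_bar)/C_bar^2 + T_bar^2 * Var(C_bar)/C_bar^4
    var_ratio = (t.var(ddof=1) / t.size) / c.mean() ** 2 \
              + t.mean() ** 2 * (c.var(ddof=1) / c.size) / c.mean() ** 4
    half_width = z * var_ratio ** 0.5
    pct = (ratio - 1) * 100
    return pct, pct - half_width * 100, pct + half_width * 100
```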
The HiPPO
HiPPO = Highest Paid Person’s Opinion
We made thousands of toy HiPPOs and handed them out at Microsoft to help change the culture
Grab one here at ConversionXL
Change the culture at your company
Fact: Hippos kill more humans than any other (non-human) mammal
Listen to the customers and don’t let the HiPPO kill good ideas
Remember this
Getting numbers is easy;
getting numbers you can trust is hard
Slides at http://bit.ly/ABPitfalls
See http://exp-platform.com for papers.
Booklets with selected papers, for reading on the plane, are available outside the room