A/B Testing Pitfalls
Slides at http://bit.ly/ABPitfalls
Ronny Kohavi, Distinguished Engineer, General Manager, Analysis and Experimentation, Microsoft
Joint work with Thomas Crook, Brian Frasca, and Roger Longbotham, A&E Team

A/B Tests in One Slide
The concept is trivial:
o Randomly split traffic between two (or more) versions: A (Control) and B (Treatment)
o Collect metrics of interest
o Analyze
The A/B test is the simplest controlled experiment.
A/B/n refers to multiple treatments (often used and encouraged: try the control plus two or three treatments).
MVT refers to multivariable designs (rarely used by our teams).
You must run statistical tests to confirm that the differences are not due to chance.
This is the best scientific way to prove causality, i.e., that the changes in the metrics are caused by the changes introduced in the treatment(s).
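The statistical test in the "analyze" step can be as simple as a two-proportion z-test on a conversion-style metric. The sketch below is only an illustration of that step with made-up counts, not the ExP platform's actual methodology (which also handles ratio-metric variance, multiple comparisons, and so on):

```python
from math import erfc, sqrt

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for H0: the control and treatment rates are equal."""
    rate_a, rate_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)        # pooled rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return z, erfc(abs(z) / sqrt(2))                          # (z, two-sided p-value)

z, p = two_proportion_ztest(10_000, 500_000, 10_400, 500_000)  # made-up counts
print(f"z = {z:.2f}, p = {p:.4f}")   # stat-sig at the usual 0.05 threshold only if p <= 0.05
```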
ConversionXL Audience Statistics
83% of attendees ran fewer than 30 experiments last year.
Experimenters at Microsoft use our ExP platform to start ~30 experiments per day.
[Chart: experiments per year run by attendees, from the pre-conference survey (N=118). The buckets None, Few, 12 (one per month), and 15-30 together account for 83% of answers; "Lots and lots" accounts for 17%.]

Experimentation at Scale
I've been fortunate to work at an organization that values being data-driven (video).
We finish about ~300 experiment treatments per week, mostly on Bing and MSN, but also on Office, OneNote, Xbox, Cortana, Skype, Exchange, and OneDrive.
(These are "real," useful treatments, not a 3x10x10 MVT counted as 300.)
Each variant is exposed to between 100K and millions of users, sometimes tens of millions.
At Bing, 90% of eligible users are in experiments (10% are a global holdout, changed once a year).
There is no single Bing: since a user is exposed to over 15 concurrent experiments, they get one of 5^15 ≈ 30 billion variants (debugging takes on a new meaning).
Until 2014, the system was limiting usage as it scaled. Now the limits come from engineers' ability to code new ideas.

Two Valuable Real Experiments
What is a valuable experiment? One where the absolute value of the delta between the expected outcome and the actual outcome is large.
If you thought something was going to win and it wins, you have not learned much.
If you thought it was going to win and it loses, it's valuable (learning).
If you thought it was "meh" and it was a breakthrough, it's HIGHLY valuable.
See http://bit.ly/expRulesOfThumb for some examples of breakthroughs.
The two experiments below ran at Microsoft's Bing with millions of users in each.
For each experiment, we give the OEC, the Overall Evaluation Criterion.
Can you guess the winner correctly? The three choices are:
o A wins (the difference is statistically significant)
o Flat: A and B are approximately the same (no stat-sig difference)
o B wins

Example: Bing Ads with Site Links
Should Bing add "site links" to ads, which allow advertisers to offer several destinations on an ad?
OEC: revenue, with ads constrained to the same vertical pixels on average.
A: Control (no site links); B: Treatment (with site links).
Pro of adding site links: richer ads, and users are better informed about where they will land.
Cons: the pixel constraint means on average 4 "A" ads vs. 3 "B" ads, and variant B is 5ms slower (compute + higher page weight).
• Raise your left hand if you think A wins (left)
• Raise your right hand if you think B wins (right)
• Don't raise your hand if you think they are about the same

Bing Ads with Site Links: Results
If you raised your left hand, you were wrong.
If you did not raise a hand, you were wrong.
Site links generate incremental revenue on the order of tens of millions of dollars annually for Bing.
The above change was costly to implement. By contrast, we made two small changes to Bing, each of which took days to develop and increased annual revenue by over $100 million.

Example: Underlining Links
Does underlining increase or decrease clickthrough rate?
OEC: clickthrough rate on the search engine result page (SERP) for a query.
A (with underlines) vs. B (no underlines).
• Raise your left hand if you think A wins (left, with underlines)
• Raise your right hand if you think B wins (right, without underlines)
• Don't raise your hand if you think they are about the same

Underlines: Results
If you raised your right hand, you were wrong.
If you did not raise a hand, you were wrong.
Underlines improve clickthrough rate for both algorithmic results and ads (so more revenue), and they improve time to successful click.
Modern web designs do away with underlines, and most sites have adopted this design, despite data showing that users click less and take more time to click.
For search engines (Google, Bing, Yahoo), this is a very questionable industry direction.

Pitfall 1: Misinterpreting P-values
NHST = Null Hypothesis Statistical Testing, the "standard" model commonly used.
P-value <= 0.05 is the "standard" threshold for rejecting the Null hypothesis.
The p-value is often misinterpreted. Here are some incorrect statements from Steve Goodman's "A Dirty Dozen":
1. If P = .05, the null hypothesis has only a 5% chance of being true.
2. A non-significant difference (e.g., P > .05) means there is no difference between groups.
3. P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis.
4. P = .05 means that if you reject the null hypothesis, the probability of a type I error (false positive) is only 5%.
The problem is that the p-value gives us Prob(X >= x | H_0), whereas what we want is Prob(H_0 | X = x).

Pitfall 2: Expecting Breakthroughs
Breakthroughs are rare after initial optimizations.
At Bing (well optimized), 80% of ideas fail to show value; at other products across Microsoft, about 2/3 of ideas fail.
Take Sessions/User, a key metric at Bing. Historically, it improves 0.02% of the time: that's one in 5,000 treatments we try!
Most of the time, we invoke Twyman's law (http://bit.ly/twymanLaw): any figure that looks interesting or different is usually wrong.
Note the relationship to the prior pitfall: with standard p-value computations, 5% of experiments will show a stat-sig movement in Sessions/User when there is no real movement (i.e., when the Null Hypothesis is true), and half of those spurious movements will be positive.
Combining that 5% false-positive rate with the 1-in-5,000 rate of real improvements means that about 99.6% of the time, a stat-sig movement with p-value = 0.05 will be a false positive (see the sketch below).
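Where does the 99.6% come from? A short Bayes computation makes it concrete. This is a sketch: the 1-in-5,000 prior is taken from the slide above, and it optimistically assumes a real improvement is always detected (power = 1), which is roughly the assumption behind the quoted figure:

```python
# Share of stat-sig Sessions/User movements that are false positives,
# given how rarely the metric truly moves.
alpha = 0.05            # two-sided significance threshold
prior_real = 1 / 5000   # fraction of treatments that truly move Sessions/User (from the slide)
power = 1.0             # assumption: a real movement is always detected

p_sig_and_false = alpha * (1 - prior_real)   # null is true, but the test fires anyway
p_sig_and_real = power * prior_real          # a real movement is detected
false_positive_share = p_sig_and_false / (p_sig_and_false + p_sig_and_real)
print(f"{false_positive_share:.1%}")         # ~99.6%
```

With a more realistic power of 0.8, the share is still about 99.7%, so the conclusion barely changes.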
Pitfall 3: Not Checking for SRM
SRM = Sample Ratio Mismatch.
If you run an experiment with equal percentages assigned to Control and Treatment (A/B), you should have approximately the same number of users in each.
Real example from an experiment alert I received this week:
Control: 821,588 users; Treatment: 815,482 users. Ratio: 50.2% (should have been 50%).
Should I be worried? Absolutely. The p-value is 1.8e-6, so the probability of this split (or a more extreme one) happening by chance is less than 1 in 500,000.
Note that this is not an instance of Pitfall 1: by the experiment design there should be an equal number of users in control and treatment, so the probability we want really is the conditional P(actual split = 50.2% or more extreme | designed split = 50%).
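The SRM check itself is simple: by design, the number of control users is Binomial(n, 0.5), so a two-sided test against that expectation flags the mismatch. A minimal sketch using the normal approximation (an illustration, not the platform's alerting code) reproduces the p-value quoted above:

```python
from math import erfc, sqrt

def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Two-sided p-value for a split this extreme under the designed split."""
    n = control_users + treatment_users
    expected = n * expected_control_share
    std = sqrt(n * expected_control_share * (1 - expected_control_share))
    z = (control_users - expected) / std
    return erfc(abs(z) / sqrt(2))

print(f"p = {srm_p_value(821_588, 815_482):.1e}")   # ~1.8e-06: treat the experiment's results as untrustworthy
```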
Pitfall 4: Wrong Success Metric (OEC)
Office Online tested a new design for its homepage.
The Overall Evaluation Criterion (OEC) was clicks on the Buy button [shown in red boxes on the slide].
Which one was better, the Treatment or the Control? The objective was to increase sales of Office products.
The pitfall: the wrong OEC. The Treatment had a 64% drop in the OEC (clicks on Buy)! The Control did not show the price, which led more people to click just to find out the price.
Lesson: measure what you really need to measure: actual sales (even though that is sometimes more difficult).
Lesson 2: focus on long-term customer lifetime value.
Peep, in his keynote here, said (he was OK with me mentioning this): "What's the goal? More money right now."
Common pitfall: you want to optimize long-term money, NOT money right now. Raising prices gets you short-term money but long-term abandonment.
Coming up with a good OEC using short-term metrics is REALLY hard.

Example: OEC for Search
KDD 2012 paper: Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained.
Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals.
Puzzle: a ranking bug in an experiment resulted in very poor search results. Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant. Distinct queries went up over 10%, and revenue went up over 30%.
(This problem is now in the book Data Science Interviews Exposed.)
What metrics should be in the OEC for a search engine?

Puzzle Explained
Analyzing queries per month, we have
    Queries/Month = Queries/Session × Sessions/User × Users/Month
where a session begins with a query and ends with 30 minutes of inactivity. (Ideally, we would look at tasks, not sessions.)
Key observation: we want users to find answers and complete tasks quickly, so queries/session should be smaller.
In a controlled experiment, the variants get (approximately) the same number of users by design, so the last term is about equal.
The OEC should therefore include the middle term: sessions/user.

Bad OEC Example
Your data scientist makes an observation: 2% of queries end up with "No results." The manager says: we must reduce that, and assigns a team to minimize the "no results" metric.
The metric improves, but the results for the query "brochure paper" are crap (or, in this case, paper to clean crap).
Sometimes it *is* better to show "No results."
(Real example from my Amazon Prime Now search, 3/26/2016: https://twitter.com/ronnyk/status/713949552823263234)

Pitfall 5: Combining Data when the Treatment Percentage Varies over Time
Simplified example: 1,000,000 users per day; conversion rate over two days:

             Friday (C/T split 99/1)     Saturday (C/T split 50/50)   Total
Control      20,000/990,000 = 2.02%      5,000/500,000 = 1.00%        25,000/1,490,000 = 1.68%
Treatment    230/10,000 = 2.30%          6,000/500,000 = 1.20%        6,230/510,000 = 1.22%

On each individual day the Treatment is much better; however, the cumulative result for the Treatment is worse (Simpson's paradox).

Pitfall 6: Get the Stats Right
Two very good books on A/B testing (A/B Testing by Optimizely founders Dan Siroker and Peter Koomen, and You Should Test That by WiderFunnel's CEO Chris Goward) get the stats wrong (see the Amazon reviews). Optimizely recently updated the stats in its product to correct for this.
The best technique to find issues: run A/A tests.
An A/A test is like an A/B test, but both variants are exactly the same.
o Are users split according to the planned percentages?
o Does the data collected match the system of record?
o Are the results non-significant 95% of the time? (A minimal simulation of this check is sketched at the end of this section.)

More Pitfalls
See the KDD paper "Seven Pitfalls to Avoid when Running Controlled Experiments on the Web" (http://bit.ly/expPitfalls):
o Incorrectly computing confidence intervals for percent change
o Using standard statistical formulas for computations of variance and power
o Neglecting to filter robots/bots (a lucrative business, as shown in a photo I took)
o Instrumentation issues

The HiPPO
HiPPO = Highest Paid Person's Opinion.
We made thousands of toy HiPPOs and handed them out at Microsoft to help change the culture. Grab one here at ConversionXL and change the culture at your company.
Fact: hippos kill more humans than any other (non-human) mammal.
Listen to the customers and don't let the HiPPO kill good ideas.

Remember This
Getting numbers is easy; getting numbers you can trust is hard.
Slides at http://bit.ly/ABPitfalls
See http://exp-platform.com for papers. Plane-reading booklets with selected papers are available outside the room.
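As a closing illustration of "numbers you can trust," here is a minimal sketch of the A/A check from Pitfall 6: simulate many A/A tests on an assumed conversion metric and verify that only about 5% of them come out stat-sig at 0.05. The metric, sample sizes, and test are assumptions for the sketch, not the ExP platform's implementation:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(42)
num_tests, users, rate, alpha = 10_000, 50_000, 0.02, 0.05

a = rng.binomial(users, rate, size=num_tests)   # conversions in variant "A"
b = rng.binomial(users, rate, size=num_tests)   # conversions in the identical copy of "A"
pooled = (a + b) / (2 * users)
se = np.sqrt(pooled * (1 - pooled) * (2 / users))
z = (b - a) / users / se
p_values = [erfc(abs(zi) / sqrt(2)) for zi in z]            # two-sided p-values
stat_sig_rate = np.mean([p <= alpha for p in p_values])
print(f"A/A stat-sig rate: {stat_sig_rate:.1%}")            # expect ~5%
```

If an A/A setup produces stat-sig results much more often than 5% of the time, the statistics or the assignment mechanism is broken, and A/B results from the same pipeline should not be trusted.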