Controlled Online Experiments: What They Are and How to Do Them

Bachelor's Thesis
Marius Brand
Spring Term 2014
Advisor: Veronica Valli
Chair of Quantitative Marketing and Consumer Analytics
L5, 2 - 2. OG, 68161 Mannheim
Internet: www.quantitativemarketing.org

Table of Contents

List of Tables
List of Figures
Abstract
1. Introduction
2. Controlled Experiments
2.1 Structure and Terminology of A/B Tests
2.1.1 Technical vocabulary.
2.1.2 Statistical foundation of experiments.
2.2 Multivariable Testing (MVT)
2.2.1 Characteristics.
2.2.2 Variants of MVT.
2.3 Technical Implementation
2.3.1 Randomization algorithm.
2.3.2 Assignment method.
2.3.3 Data path.
2.4 Suggestions for Improving the Test Performance
2.5 Possible Pitfalls and Technical Limitations of A/B Tests
3. Outcomes and Application
3.1 Application in the Software and Online Service Industry
3.2 Technology Provider of A/B Tests
3.3 A/B Testing For Research Purposes
4. Discussion
4.1 Managerial Implications and Guidelines for the Industry
4.2 Limitations and Future Research
5. Conclusion
Appendix
References
Affidavit

List of Tables
Table 1: Advantages and Disadvantages of all Assignment Methods

List of Figures
Figure 1: Structural Overview of Controlled Online Experiments
Figure 2: A/B Testing Software Market: Distribution in May 2014
Figure 3: Sources, Trends and Existing Problems for Controlled Online Experiments

Abstract

Controlled online experiments are an integral component of development for the majority of today's web-based companies. With the establishment of tests on the internet, it is feasible to measure users' reactions and preferences toward one of two alternative versions of the same website. Companies no longer have to debate which design and structure is superior; they simply involve their customers in the decision. Because forecasting web users' behaviour at a corporate level is often imperfect, firms like Microsoft or Amazon invest heavily in experimentation and online testing research. For them, anticipating the behaviour and opinions of their customers is a key element of success. The main focus of this literature review is on scientific research papers and on reports by employees of companies operating in the online market. Besides establishing strategies for the technical implementation, recommendations regarding the evaluation as well as improvement proposals are outlined and connected. Overall, the thesis aims at providing an introductory guideline and revealing the current state of the art in online experimentation.

1. Introduction

For a long period of time, offline controlled experiments have been a highly important part of the development process and of customer integration for new products and services. With the expansion of the internet as a marketplace and a huge global business platform, companies started to implement online experiments of similar purpose and structure. The consumers of the world wide web, a fast-changing conglomerate, demand user-friendly and meaningful content. Therefore, besides fostering innovation processes, the executing companies focus on answering the question of what their customers want or prefer. Historically, controlled online experiments are a consequence of difficulties in estimating customer opinions and desires. Having severe problems in forecasting user reactions to new designs and functions on their websites, companies started to develop tools that were able to test customer responses beforehand. On the other hand, after long periods of research and analysis, the management of a corporation often has a strong opinion about new components and their ability to benefit the company. Through the implementation of experiments, this opinion is contrasted with the actual customer response. Surprisingly, the results of the online trials are often very unexpected and contrary to corporate estimations.
Because forecasting web users' behaviour at a corporate level is mostly inaccurate and imperfect, online experiments became the new standard for website improvement and alteration. There is no consistent term for these testing processes yet. Randomized experiments, online trials and A/B tests usually refer to the same or similar processes and can be consolidated under the label controlled online experiments. If used in this literature review, all named terms refer to the same category of online testing. Generally, A/B tests are the simplest and most widely used type of online experiment. The majority of all relevant papers focuses on the implementation and evaluation of A/B tests; therefore, the thesis will mainly focus on this most popular pattern of testing as well. An exception is Multivariable testing, which can be seen as an advanced and multi-layered approach to A/B testing. As a first step, the paper provides an overview of the common structure of experiments, examining simple A/B tests as well as Multivariable testing. After establishing a technical implementation standard, contemporary and relevant test performance suggestions and limitations from leading authors are presented. Subsequently, the third chapter illustrates examples of online experiments in a corporate environment as well as for research purposes, highlighting the outcomes and individual benefits the executors were able to accomplish. The fourth chapter discusses managerial implications based on results from online experimentation, implications for the online industry as a market and business place itself, and possible limitations researchers may face when delving into this particular topic. The final conclusion summarizes the main findings of the thesis and closes with the current standpoint of controlled online experiments.

2. Controlled Experiments

The origin of controlled experiments on the internet in offline tests from traditional production businesses shows the close relation to consumer behaviour research. The conduct of online trials shares many technical standards and implementation technologies with its offline counterpart. Therefore, various steps described hereafter show similarities to earlier experimentation practices. Frequently, the authors of the papers and research studies cited in this review resort to statistical and technological fundamentals. Hence, important implementation principles also serve as a basis in this thesis for explaining the fundamental IT processes. After expounding the structural foundation of controlled online experiments, this chapter focuses on key indicators of a successful test performance and lists relevant guidelines and recommendations for a smooth test realization.

2.1 Structure and Terminology of A/B Tests

There are different approaches to conducting a controlled online experiment. A/B tests are the simplest form of controlled experiments and are widely used. Visitors are randomly exposed to one of two different variants of the same website: Control (A) or Treatment (B) (Crook et al. 2009, p. 1106). Control means in this case that the original and unchanged website is displayed for the user. The remaining users browsing the website see the Treatment; for them, the new version of the website is presented. The test initiator can choose the split of users confronted with A or B freely, for example allocating one half to each variant (Kohavi and Longbotham 2010, p. 203).
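To make the allocation concrete, the following minimal Python sketch (purely illustrative; the function name, experiment label and user ID are invented for this example) splits users between Control and Treatment according to a configurable share. Hashing the user ID makes the assignment random across users but stable for a returning user, which anticipates the persistence requirement discussed next.

import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to Control (A) or Treatment (B).

    Hashing the user ID together with an experiment label gives every user a
    stable pseudo-random number in [0, 1); users whose number falls below the
    configured share see the Treatment, all others see the Control.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) / 16 ** len(digest)  # map the hash to [0, 1)
    return "B" if bucket < treatment_share else "A"

# Example: a 50/50 split that stays stable across repeated visits of the same user
print(assign_variant("user-4711", "homepage-redesign"))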
It is important to emphasize that no factor is allowed to influence the allocation decision during the test execution; the allocation has to be random and persistent (Kohavi et al. 2014, p. 2). Through this method, the response on the Treatment path can be measured against the original process (Peterson 2004, p. 76).

2.1.1 Technical vocabulary.

Before a precise analysis of the collected observations is possible, an Overall Evaluation Criterion (OEC) has to be derived for each variant. The OEC is defined as a quantitative measure of the overall objective of the experiment (Kohavi et al. 2009, p. 150). Trials with more than one objective can reach a compromise by combining the objectives into one OEC under various criteria (Roy 2001, p. 26). The OEC is therefore the key metric of the experiment that is going to be compared and analysed (Crook et al. 2009, p. 1106). Meaningful examples of an OEC are the conversion rate, the units purchased after exposing the users to one of the variants, the resulting revenue, or a weighted combination of several factors. Specifically for online experiments, selecting a single metric aligns the organization behind a clear objective because the necessary trade-offs have to be made explicit (Kohavi, Henne, and Sommerfield 2007, p. 961). In addition, a reasonable OEC should contain factors focused on long-term goals, for example predicted lifetime value or repeat visits. In this terminology, a factor is a variable that is expected to influence the OEC. In simple A/B tests, one single factor takes two values or variants: A and B (Crook et al. 2009, p. 1106). Before executing a controlled experiment and measuring the changes in the OEC, the relevant experimental unit has to be determined. Assumed to be independent of each other, an experimental unit is defined as the entity on which the observations during the experiment are made (Kohavi et al. 2009, p. 150). In the case of an online trial, the user of the website or application is the usual experimental unit. Nonetheless, tests can also be based on units like sessions or page views (Kohavi, Henne, and Sommerfield 2007, p. 962). Peterson (2004, p. 77) recommends running a Null Test before implementing the actual experiment. Commonly known as an A/A test, it allocates the users to two variants, but in contrast to the actual A/B test, all participants are exposed to the same website. As a result, it can be verified that both variants show the same conversion and abandonment rates. The author argues that this confirms the correct set-up of the measurement tools. The Null Hypothesis assumes that the OECs of the tested variants do not differ and, furthermore, that any observed differences are due to random fluctuations (Kohavi et al. 2009, p. 150). Generally, the hypothesis should not be rejected during a Null Test. If a rejection occurs, the probability of an error is high and the test setting should be revised (Kohavi, Henne, and Sommerfield 2007, p. 962). Kohavi et al. (2012, p. 788) recommend that every experimenter continuously run A/A tests to identify issues in the construction of test systems. In conclusion, if designed and executed properly, the only relevant discrepancy after conducting an A/B experiment should be the change between Control and Treatment. Hence, the differences in the OEC result directly from this assignment and, on that note, a certain degree of causality is established (Weiss 1997, p. 215).

2.1.2 Statistical foundation of experiments.
After the experiment has run, it is essential to perform a statistical test to assess whether the difference between Control and Treatment is meaningful. A Treatment is accepted as being statistically significantly different as soon as the test rejects the Null Hypothesis, i.e., the assumption that the OECs of the variants do not differ (Crook et al. 2009, p. 1106). Several elements influence the outcome of the statistical test. The confidence level, defined as the probability of not rejecting the Null Hypothesis when it is in fact true, is typically set to 95%. This implies that in 5% of all tests a significant difference is declared when there is none (Kohavi et al. 2009, p. 151). For companies like Microsoft, which constantly run a large number of tests, this can mean hundreds of false positive results and therefore corrupted test outcomes (Kohavi et al. 2014, p. 4). The lower the confidence level, the higher the probability of detecting a true difference, in other words the higher the power of the experiment (Kohavi et al. 2009, p. 151). Likewise, the smaller the standard error of the test, the more powerful the test itself. The standard error decreases with a larger sample size, and building the OEC from components of low variability reduces it further. Users who are not exposed to the variants during the test should be filtered out; this lowers the variability of the OEC itself and therefore the standard error of the whole test (Kohavi, Henne, and Sommerfield 2007, p. 962). Mistakes can also be avoided if the executor ensures that the testing time is sufficiently long. Peterson (2004, p. 77) advises against overly short tests because, if a test does not run long enough, it is not ensured that a change in the Treatment really represents an overall significant difference. Inherently, a test should be continued until the result is genuinely representative. In their paper, Kohavi et al. (2009, p. 152) provide useful formulas that define the minimum sample size of a classical A/B test and the decision of rejecting or accepting the Null Hypothesis within the scope of a t-test. A detailed description of the formulas can be found in Appendix A. Taking the mathematical analysis one step further, Deng, Li and Guo (2014, pp. 610-616) discuss statistical dependencies between test variants, hypothesis testing and point estimation. Their empirical results provide bias correction methods which are able to enhance A/B tests in terms of their accuracy and absence of errors. Satisfying all statistical standards and general requirements, an A/B test can eventually provide a valuable insight: the difference in the OEC between the current and the planned set-up of a website, or the different OECs for two alternatively planned set-ups.

2.2 Multivariable Testing (MVT)

Aiming for a broader and multi-layered experimental result, some companies consider testing several factors in a single experiment. Such a trial is called Multivariable or Multivariate testing (MVT) and can be used to estimate the effects of each of the tested factors as well as possible interaction effects between those factors (Kohavi et al. 2009, pp. 158-159).

2.2.1 Characteristics.

Bell (2008, p. 16) claims that with MVT numerous marketing elements can be tested at once with similar accuracy and deeper insights compared to several sequential A/B tests. MVT is described as advantageous because many factors can be tested in a short period of time, hence accelerating general improvements.
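To make the idea concrete, the following minimal Python sketch (purely illustrative; the factor names and the helper function are invented for this example) assigns each user independently to one level of each of two factors, so that all cells of a 2x2 Multivariable design are populated.

import hashlib
from itertools import product

factors = {                       # hypothetical factors and their variants
    "headline": ["A", "B"],
    "button_colour": ["A", "B"],
}

def assign_cell(user_id: str) -> dict:
    """Assign a user to one level of every factor (full-factorial MVT)."""
    cell = {}
    for name, levels in factors.items():
        digest = int(hashlib.md5(f"{name}:{user_id}".encode()).hexdigest(), 16)
        cell[name] = levels[digest % len(levels)]
    return cell

# The four cells of the 2x2 design that the test compares
print(list(product(*factors.values())))  # [('A', 'A'), ('A', 'B'), ('B', 'A'), ('B', 'B')]
print(assign_cell("user-4711"))          # e.g. {'headline': 'B', 'button_colour': 'A'}

Hashing the user ID per factor is only one possible way to keep the factor assignments independent; dedicated experimentation systems use more elaborate designs.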
Interactions between two or more factors are able to alter the overall effect. Kohavi et al. (2009, p. 160) argue that estimating synergistic or antagonistic behaviour between the factors extends the comprehensive dimension of the tested variants. Therefore, a Multivariable test can show results of interactions that would not be visible if each variant were tested on its own. However, MVTs are more expensive and usually require prior expertise in testing (Bell 2008, p. 16). Kohavi et al. (2009, p. 160) justify the more complex testing process with the additional value an MVT can yield. Practically, the gained insights outweigh minor limitations such as the longer preparation time or the risk of a suboptimal user experience. Yet another paper names MVT as an alternative but adheres to A/B testing as the most appropriate experiment design. The rarity of interactions representing statistical interrelations makes the authors believe that the additional value provided by MVTs is not capable of outweighing the negative factors in most cases (Kohavi et al. 2013, pp. 1172-1173). As a further alternative, Deng, Li and Guo (2014, p. 609) propose using two-stage or multi-stage A/B testing as soon as the executor is confident with one-stage testing. Due to the early employment of experiments in recent feature development cycles, tests support most of the design steps. This timing demands several experiment stages to evolve the finished design together with the target users. The authors underline that a single-stage A/B test, on the other hand, only affirms or rejects the finished design; the testing is not a part of the development cycle.

2.2.2 Variants of MVT.

For online experimentation, three ways of executing MVTs are available: a traditional approach, running concurrent tests, or overlapping experiments. Every method has its distinctive benefits; however, due to its lack of interaction estimation, Kohavi et al. (2009, p. 161) clearly recommend against the traditional approach for online tests. During concurrent testing, it is possible to turn off any factor at any time without influencing the other concurrent factors. Overlapping experiments concentrate on testing a factor as a one-factor experiment whenever the factor is ready to be tested. This method enables fast testing of factors and also reveals possible interactions between overlapping factors (Kohavi et al. 2009, p. 163). Further insight into the implementation of overlapping experiments is offered by a paper focusing on trials conducted at Google while also providing general guidelines. The authors claim that an overlapping design keeps the advantages of a single-layer system while improving scalability, flexibility and robustness (Tang et al. 2010, pp. 17-26). Kohavi et al. (2009, p. 163) believe that if a quick testing process is the focus of the trial, using overlapping experiments is the most beneficial. However, if the trial concentrates on the estimation of interactions between factors, the use of concurrent tests is suggested.

2.3 Technical Implementation

The implementation of an experiment on the internet differs from the traditional offline versions. The most comprehensive approach to describing the technical realization comes from the Experimentation Platform at Microsoft. In their first paper, Kohavi, Henne, and Sommerfield (2007, p. 963) establish two necessary components. Later on, Kohavi et al. (2009, p. 163) add a third component to the implementation process.
2.3.1 Randomization algorithm.

At the beginning of the experiment, a randomization algorithm allocates the users to the different variants (Kohavi et al. 2009, p. 163). To ensure the statistical validity discussed above, the algorithm has to fulfil several properties, such as an appropriate user split between Control and Treatment and the avoidance of any possible correlations between parallel experiments (Kohavi, Henne, and Sommerfield 2007, p. 964). In the follow-up to the paper released in 2007, Kohavi et al. (2009, p. 164) add two more desirable but not essential properties. A pseudorandom number generator (PRNG) can be used as the required algorithm if coupled with a form of caching. In the case of an experiment, the assignment of end users is cached once they are exposed to either the Control or the Treatment variant, to prevent any correlations between concurrent experiments (Kohavi, Henne, and Sommerfield 2007, p. 964). The caching can be accomplished on the server side by storing the relevant information in a database. Ali, Shamsuddin and Ismail (2011, pp. 19-29, 37-38) introduce and review several web caching approaches that can be considered as options for the testing implementation. However, it is significantly cheaper to store the data in a cookie, as no database is required in this case (Kohavi, Henne, and Sommerfield 2007, p. 964). Kohavi et al. (2009, p. 164) point out that the users' devices need to have cookies turned on, otherwise the caching does not work. The PRNG can also be replaced by a hash function. However, Kohavi et al. (2009, p. 166) found that many popular hash functions fail to randomize users properly in experiments. Additionally, the overall running-time performance is limited by the running-time performance of the hash function itself, so it can restrict the whole testing process. For these reasons, the authors recommend a hybrid approach with either a small database or the use of cookies.

2.3.2 Assignment method.

After finding the most suitable algorithm, an assignment method has to be determined. The assignment method is the mechanism by which different end users are served the different variants on the experimenting website (Kohavi, Henne, and Sommerfield 2007, p. 964). Relevant papers introduce four different approaches: traffic splitting, page rewriting, server-side assignment and client-side assignment. For an in-depth look, Kohavi et al. (2009, pp. 166-171) provide a detailed description of the different methods, illustrating the technical implementation in several figures. In a nutshell, Table 1 summarizes the main advantages and disadvantages of the described assignment methods (Insert Table 1 about here). The table focuses on the intensity of cost-generating factors as well as on possible impacts for the user.

2.3.3 Data path.

As a final step, a data path has to be established that conducts all necessary actions for obtaining a meaningful and informative outcome of the experiment. Kohavi et al. (2009, pp. 171-172) can be seen as pioneers in describing the precise establishment of the data path. This includes the actual capturing of the raw observation data during the trial, the aggregation of the data, the application of statistics and ultimately the preparation of the overall results. During the data collection and aggregation process, a very large amount of traffic arises. In detail, the website has to record all treatment assignments of end users and collect the arising data such as page views, clicks or the resulting revenue.
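As a simplified illustration of this capture step (the record fields, helper name and file name are assumptions made for the example, not a description of any cited system), the following Python sketch appends each user's treatment assignment together with the raw observations to an experiment log:

import json
import time

def log_event(log_file, user_id: str, variant: str, event: str, value: float = 0.0) -> None:
    """Append one raw observation (page view, click, revenue, ...) to the experiment log."""
    record = {
        "timestamp": time.time(),
        "user_id": user_id,
        "variant": variant,   # the treatment assignment is recorded with every event
        "event": event,       # e.g. "page_view", "click", "purchase"
        "value": value,       # e.g. the revenue attached to a purchase event
    }
    log_file.write(json.dumps(record) + "\n")

with open("experiment_log.jsonl", "a") as f:
    log_event(f, "user-4711", "B", "page_view")
    log_event(f, "user-4711", "B", "purchase", value=29.99)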
Next, the data is converted into metrics that summarize the results and make them comparable to the other variants of the experiment. On this basis, the actual statistical significance of all variants can be analysed (Kohavi et al. 2009, p. 171). Three different methods of raw data collection are available. Kohavi et al. (2009, p. 172) describe existing data collection, local data collection and service-based data collection and name the benefits and difficulties of each method. The authors select the service-based collection approach as the most flexible and therefore the preferred method for data collection.

2.4 Suggestions for Improving the Test Performance

After finishing the theoretical and statistical set-up, the online experiment is executed and the accruing data and insights are collected and analysed. In the process, results and difficulties can arise that were not planned or expected beforehand. Several papers describe situations the executors faced during and after the practical implementation. Based on these experiences, suggestions and guidelines have been developed for subsequent controlled experiments. For the analysis of an implemented trial, Kohavi et al. (2009, p. 173) recommend storing the collected data. Besides the statistical significance of the Control and Treatment variants, the data can also provide insights into the behaviour of different user groups or into problems that only a specific group of users has faced. Based on that information, some populations may be excluded from future tests to enhance the overall quality of the results. The speed of the execution of Control and Treatment is crucial for the whole experiment. A slow performance of the Treatment shown to users may influence the whole test negatively. Linden (2006, p. 10) as well as Kohavi et al. (2009, p. 173) report that a delayed delivery of the Treatment can reduce the end users' willingness to perform actions or even to stay on the website. A certain discrepancy exists regarding the question of how many factors should be tested at one time. Peterson (2004, p. 76) and Eisenberg (2005) agree that testing factor after factor is beneficial because, in this case, the intermixing of different effects is impossible. However, Kohavi et al. (2009, p. 174) believe that this approach limits organizations in terms of the scope of the desired improvements. The authors claim that interactions between different factors are less frequent than executors assume. To avoid interactions, they recommend single-factor experiments when incremental changes are to be implemented. In their opinion, variants that differ strongly in design or implementation, as well as the use of different algorithms, can also minimize interrelations. As already mentioned, continuous A/A tests help validate the correct performance of later A/B tests. Only if a website executes an A/A test faultlessly can a reasonable test set-up be assumed (Peterson 2004, p. 77). Additionally, running A/A tests continuously alongside other experiments supports a frictionless continuation (Kohavi et al. 2009, p. 174). In the case of testing outcomes with only marginally significant results, Kohavi et al. (2013, p. 1174) endorse rerunning the processes to replicate and thereby strengthen the results. The same applies to abnormally positive test results.
The authors caution that instrumentation errors or software bugs are a much more probable reason for such a result than test metrics genuinely showing extraordinary increases (Kohavi et al. 2014, p. 4). Kohavi, Henne and Sommerfield (2007, p. 963) propose increasing the percentage of users assigned to the Treatment variant gradually. This has the advantage that a Treatment that is significantly underperforming compared to the Control can be shut down at the very beginning of the experiment. Therefore, possible damage can be reduced, as well as the risk of exposing many users to unintended errors. One exemplary implementation was installed for testing processes at Microsoft's Bing search engine, where alerts automatically signal a negative user experience or interactions with other experiments (Kohavi et al. 2013, p. 1168). Hence, executors are able to investigate the actual reasons for unexpected testing results immediately. Small or insignificant outcomes can result from the tested scenario content itself and not from the tested variants. In this case, an evaluation of possible problems or obstacles that users face during participation in the test is reasonable (Eisenberg 2005). If everything runs smoothly, an equal allocation of users to Control and Treatment is favoured to equip the experiment with maximal power and minimal required running time. Kohavi et al. (2009, p. 175) estimate the required running time of an experiment with a Control-Treatment allocation of 99% to 1% of all users to be about 25 times longer than that of an experiment with an equal allocation of users. Furthermore, running experiments that are underpowered in terms of participants can lead to distorted results. Before ending a test, the minimum sample size has to be reached for statistical significance (Kohavi et al. 2014, p. 8). An approach at Microsoft utilizes data from pre-test periods to reduce the variability of the used metrics and therefore achieve better sensitivity. As a result, fewer test users and less time are needed to achieve the same statistical power (Deng et al. 2013, pp. 123-124). Summarizing many of the suggestions and guidelines published within the scope of the Microsoft-related papers of the last couple of years, the forthcoming paper by Kohavi et al. provides an extensive insight into long-term testing experience. In addition, all findings are documented with tests conducted at Microsoft and by other testing experts, offering additional cross-references to the whole internet-based industry.

2.5 Possible Pitfalls and Technical Limitations of A/B Tests

Besides identifying areas of improvement, the executor of a controlled online experiment also has to evaluate parts of the test that did not perform as planned. A rather small number of papers illustrate the problems that accrued during and after running online tests. Crook et al. (2009, pp. 1105-1114) describe several difficulties that occurred during testing periods at Microsoft. For instance, the authors caution against a design of the OEC that only focuses on "beating" the Control and showing a significant difference. Mostly, these types of goals ignore the long-term value of the variant or the final effect on generated revenue. Moreover, the disregard of robots in online experiment environments is another focus of the paper. Robots, such as automated search programs or tools that act like a virtual human being, generate traffic on the internet.
The authors advise excluding traffic generated by robots from experiments that focus on human users, as otherwise the overall test results can be very misleading. Additionally, the work covers statistical and metric-related difficulties as well as pitfalls of neglecting preceding A/A tests and Control variants. A common mistake in the analysis of experiment results is ignoring cannibalization effects. For positive test results, Kohavi et al. (2014, p. 7) advise examining whether the increase in the OEC occurred additionally or at the expense of other elements on the website. The authors experienced regular success with local improvements, but global improvements for the website in total were generally harder to achieve. Another possible pitfall is the usage of browser redirects (Kohavi et al. 2009, p. 158). During an A/B test, many companies use the approach that users allocated to the Treatment variant are redirected to the modified page, which is not the same as the one displayed for the Control variant. Kohavi and Longbotham (2010, pp. 31-32) claim that, when tested in an A/A trial, experiments using redirects underperform. Reasons for that can be the added delay for the Treatment group, bots that may block the redirection process, or the fact that users themselves can interfere with the experiment by bookmarking the link to the Treatment website. Therefore, the authors suggest avoiding redirects and recommend a server-side mechanism as previously introduced. The cited paper also characterizes additional pitfalls such as poor exposure control, technical differences between internet browsers and Simpson's Paradox (2010, pp. 33-34). This paradox describes that an effect pointing in the same direction in each of two groups can disappear or even reverse when the groups are combined into one (Malinas and Bigelow 2008). Therefore, the combined test results of two testing days can differ from the results of each single day. Besides possible pitfalls, controlled online experiments have general technical limitations that need to be considered. According to Nielsen (2005), tests provide the answer as to which variant performs better but not the reason for the preference. For example, a test may show that a certain design is highly favoured by the test participants. The executor then knows which design should be implemented but not why users preferred the chosen design over others. Several authors criticise that online experiments only run over a rather short period of time. Therefore, the whole experiment and OEC are short-term focused and do not include possible long-term consequences (Nielsen 2005; Quarto-vonTivadar 2006). Kohavi, Henne, and Sommerfield (2007, p. 963) disagree, stating that long-term considerations can be accommodated by choosing the right OEC. When a new feature or design is introduced on a website, a newness effect can set in and users need more time to navigate the site. This influences the performance of the Treatment variant compared to the Control variant and needs to be considered when measuring, for example, the time spent on the website or the clicks per minute (Kohavi et al. 2009, p. 158). Criticizing the relatively large sample size needed for a feasible A/B test, one paper announces a supposedly better alternative. According to Burk (2006, pp. 1-2), using Control Charts instead of t-tests reduces the needed sample size and delivers superior and well-comprehensible results. In a Control Chart, the Treatment variant is not tested against the Control variant all the time.
After a certain period of Control-only testing, the Treatment is activated and the difference in user behaviour is measured. The author claims that, without adding complexity, Control Charts are very flexible and can be extended to multi-factor testing, thereby overcoming the limitations of A/B testing. Being the only paper declaring Control Charts the superior alternative, this opinion should rather be seen as an exceptional position. Despite several limitations and difficulties, the majority of authors endorse A/B testing as the most efficient and most beneficial method for experimenting (Kohavi et al. 2013, p. 1175; Kirby and Stewart 2007, p. 81; Tang et al. 2010, pp. 17-18). As a comprehensive overview of Chapter Two, Figure 1 shows the central processes and factors of controlled online experiments in an interactive structure (Insert Figure 1 about here). It illustrates the experimentation process as a multi-layered conglomerate, consisting of an essential foundation and the test analysis at the top. The displayed input factors are capable of altering the test structure before, during and even after the implementation.

3. Outcomes and Application

With the rise of the internet as a highly competitive marketplace, appealing website designs with a high degree of usability become more and more important. Particularly companies earning their money with online services need to focus on a high client satisfaction rate for the use of their websites. Therefore, especially web-related firms like Amazon, Microsoft or Yahoo conduct controlled online experiments to identify customer wishes and foster innovation. Kohavi et al. (2013, p. 1168) assess running online experiments as very useful, especially in combination with agile development processes, for example in start-ups. This chapter provides examples of experiments from different areas of business and research and presents the outcomes and usefulness of these tests.

3.1 Application in the Software and Online Service Industry

A straightforward example of an A/B test was conducted at Microsoft for the Office Online website. Crook et al. (2009, p. 1107) tested the current design against a Treatment variant after agreeing on the OEC "clicks on revenue-generating links". These links can be described as areas that have a certain probability of leading to a sale of a Microsoft Office product. The test results showed that the Treatment design had 64% fewer clicks on the revenue-generating links, behaving absolutely contrary to the design team's expectations. This represents a good learning experience for the executors, with end users acting totally differently than predicted. The experiment also revealed serious flaws in the OEC: the Treatment displayed the price of the Office product on the website, leading to a higher overall revenue. Nevertheless, the OEC was still lower, suggesting that the Treatment would be less beneficial for the company than the old Control variant (Crook et al. 2009, p. 1107). Deng et al. (2013, p. 123) depict an experiment at MSN Real Estate that tested six different design versions of a "Find a Home" search box. Users were randomly split between the Control and the five Treatment versions. The company established "visits to linked partner sites" as the main OEC. The winning design of the experiment managed to increase the revenue from transfer fees for MSN by nearly 10% compared to the Control.
In an interview with Harvard Business Review, Amazon founder Jeff Bezos describes the online testing processes inside the company's Web Lab as a very important part of the firm (Kirby and Stewart 2007, p. 81). Besides constantly monitoring customer responses to conducted experiments, the laboratory also researches the cheapest possible way to implement those tests. Bezos gives several examples of helpful online experiments. In one case, the company wanted to implement a feature showing users another Amazon customer with a very similar purchase history. The management was sure that their customers would like the new feature, but an A/B test showed no significant difference in user behaviour. In the end, not many customers wanted to use the new feature (Kirby and Stewart 2007, p. 81). Additional examples of implemented online experiments can be found at various companies: Amatriain and Basilico (2012) explain how the recommendation system at Netflix is based on A/B testing, and McKinley (2012) portrays the A/B analyser implemented at Etsy, a software tool which reports important financial metrics for each conducted A/B test.

3.2 Technology Provider of A/B Tests

In recent years, the market leaders in the internet business established software and tools specialized for testing on their own websites. As mentioned above, even a simple Control/Treatment test needs detailed technical know-how for the implementation. Through the establishment of, for instance, the Bing Experimentation System, which performs online experiments on Microsoft's search engine website, innovation can be accelerated and annual revenues thereby increased considerably (Kohavi et al. 2013, p. 1170). However, the number of small and middle-sized firms offering services or selling products on the internet is large. From their point of view, controlled online experiments are often too expensive to implement independently. Hence, a strong demand for customized and adequately priced experimentation tools emerged. Occupying this new business area, a large number of companies providing A/B testing technology was founded in the last four to five years (Walker 2013). Villano (2013, p. 74) describes that every experiment on the web needs an IT expert to develop the required code and software. Optimizely, the current market leader in providing A/B testing technology, started to undertake these tasks for mostly small and midsize customers. Today, multinational companies make use of the fast and easy testing possibilities, too. Dan Siroker, the founder of Optimizely, defines in the interview his company's core competency as a pop-up editing tool that each customer can adapt to his or her needs. Different variants can be tested without altering the underlying structure of the website. Siroker characterizes this ability, along with the time savings, as one of the main benefits of using the professional tool (Villano 2013, p. 74). A direct competitor is Google Website Optimizer, which Becker (2008, p. 24) describes as free of charge for all Google advertisers and cooperating retailers. The software and internet company has a strong interest in continually improving its online experimentation tools. Therefore, it made the optimization tools available to the public and open to third-party plug-ins and modifications. As a result, Google can foster steady innovation from the inside as well as from the outside.
Additionally, the tool enables Multivariable testing and is currently one of the most advanced online experimentation optimizers available (Becker 2008, p. 24). The internet service company BuiltWith, which monitors the development of the A/B testing market as a whole, records rapid growth in the A/B testing market over the last couple of years (Walker 2013). This implies an increasing number of companies and websites using such trials to test new designs and functions. Figure 2 shows the distribution of A/B testing tools used worldwide at the end of May 2014 (Insert Figure 2 about here). Optimizely, the market leader, captures more than one fourth of the whole market. The figure also shows that there are numerous providers of testing technology at the moment. Walker (2013), author of an article about recent developments in the A/B testing market on builtwith.com, anticipates steady growth rates in the near future as well. However, in his opinion, not all testing providers will assert themselves in the long run, as he estimates an impending market saturation.

3.3 A/B Testing For Research Purposes

Besides companies seeking increases in revenue and attention, controlled online experiments are also conducted to learn more about general consumer behaviour on the internet. One insightful test revolves around the two website design variables "visual complexity" and "source of interactivity control". The online experiment revealed that users can be clustered into two groups: the "seeker" group performs goal-directed search processes on the internet and prefers a consumer-controlled interactive environment and a simple design. The "surfer" group, on the other hand, performs more experiential search processes and therefore prefers marketer-controlled interactivity and a complex visual design (Stanaland and Tan 2010, pp. 569-571). In this case, one A/B test is not sufficient. The users did not decide for or against one new design; the main purpose was to collect information about who preferred which design, draw conclusions from this information and cluster the test participants into different groups. Another online experiment was used to evaluate the differences between continuous and discrete rating scales when questioning survey participants about their level of happiness. The executor established "new analysis possibilities on data quality" and "insights into the distribution of happiness scores through the trial" as the most important OECs. The test showed that, while data quality remained at the same level, additional information could be obtained through continuous rating scales. The experiment showed significant differences between male and female participants, so gender-specific question design effects could be identified. Based on the experiment, the executor recommends the use of continuous rating scales when questioning people about their happiness (Studer 2012, pp. 317, 336). In summary, online experiments are already in use for general research purposes, but compared to corporate testing, the area is underdeveloped. Possible reasons for this condition are discussed in the subsequent section.

4. Discussion

The final chapter summarizes the main findings of the existing literature and reviews them. Advice for managers in the online business is given, as well as possibilities for the whole industry arising from controlled online experiments.
Finally, current limitations of testing on the internet are outlined and areas of future research are proposed.

4.1 Managerial Implications and Guidelines for the Industry

Controlled online experiments provide valuable insights into the minds and desires of the participants. For most companies, succeeding on the internet requires widespread customization. However, the mere ability to estimate customer behaviour correctly does not automatically lead to increasing revenue and high customer satisfaction rates. Several papers describe the importance of deliberate and supportive management. Focusing on long-term value generation for the company, responsible managers should agree on a suitable OEC upfront (Crook et al. 2009, p. 1107). As described, a good OEC should measure the value of the new features for the business objectively and efficiently. By coming to a decision before the experiment is run, the corporation forces itself to weigh the values of various inputs and decide on their relative importance. Kohavi (2012, p. 20) recommends assigning a lifetime value to users and the connected actions. The upfront work therefore aligns the organisation and clarifies the goals targeted in the future. Furthermore, controlled online experiments can simplify the decision-making process as a whole. Different employees may support different design or application improvements. From a corporate point of view, every design may have its advantages. Experiments can function as a referee between two or more parties supporting distinct approaches. Disagreements are time-intensive and burdensome for a positive atmosphere. By asking the customers for their opinion, experiments can settle possible dissent regarding pending changes. Therefore, online tests can contribute to efficient decision making and can foster a harmonious work environment. Jeff Bezos, the CEO of Amazon, claims that "it's important to be stubborn on the vision and flexible on the details" (Kirby and Stewart 2007, p. 81). In his opinion, a company has to decide on the broad and long-term goals it wants to achieve. However, the exact route toward those objectives is to be evolved together with the customers. Regalado (2014, p. 62) reports that customers tend to choose simple and effective schemes, thereby overlooking sophisticated design approaches. Competition-wise, this can create problems for traditional media companies, where editors and designers may still be biased toward the superiority of their own opinion. The author concludes that online testing is capable of reshaping the entire appearance of the internet. Focusing on the user's point of view, Kohavi et al. (2009, p. 176) caution against the implementation of features that did not result in a statistically significant difference. The authors dismiss the thinking that such new features will not have any effect. In their opinion, it is still possible that the change has negative impacts on the user experience even though the test did not show perceptible alterations. Another paper describes that the real challenge is understanding invalid test results, not discarding them (Kohavi et al. 2012, p. 793). This may be the case with underpowered experiments, which fail to project possible pitfalls and actual consequences in a reliable way. Almost every organisation operating in the software and internet industry has a distinctive business culture.
Particularly for software and internet companies such as Google or Apple, intense internal cohesion and a proud external appearance of their employees constitute an important component of the overall strategy. Kohavi, Henne and Sommerfield (2007, p. 966) criticize that many organizations have developed a business culture in which features are fully designed before the actual implementation. The authors recommend changing that approach toward a direct integration of customer feedback through prototypes and experimentation. The culture would thereby shift to a "data-driven" approach. Another recommendation for companies conducting online tests is to be careful when applying offline testing schemes to online experimentation. Corporations testing their product features or designs offline should not simply copy those schemes when designing a new website and testing different variants. Carryover effects and incongruous confidence intervals can be possible pitfalls when copying the testing procedures (Kohavi et al. 2012, p. 793). The online business follows different rules; therefore, the testing approach has to be unique and adjusted.

4.2 Limitations and Future Research

Controlled online experiments are a young and developing field of research. As with many new business areas, cost-effectiveness considerations play a major part in the design phase and can be a powerful limitation. Implementing tests is expensive; therefore, even large corporations like Amazon focus on reducing the costs of experimentation. Consequently, experiments can only be performed with sustainable funding. Through third-party suppliers of testing programs, the general costs of simple A/B testing have decreased, but widespread experimentation processes are still costly (Kirby and Stewart 2007, p. 81). A further possible limitation, but similarly an opportunity for future research, is the way users access the internet: browsing is becoming increasingly multifaceted. Nearly 30% of today's internet users access the internet from a mobile device, for example a tablet or smartphone (Hawthorne 2013, p. 78). Providing this group of users with a smooth online experiment can involve many pitfalls. The author criticises that many websites still are not optimized for mobile use. In that case, buttons may be displaced or colour schemes distorted. In his opinion, this can affect the user experience and therefore also up to one third of the testing results. Currently, the biggest visible limitation of controlled online experiments is the scarcity of publicly available information. Professional papers describing practical experiences with online experiments are hard to find. Especially in the top-ranked marketing journals, there has been no explicit publication about the structure and implementation of controlled online experiments yet. One big stream of literature is available from Microsoft employees who founded the Experimentation Platform and publish papers regarding experimentation at Microsoft and connected entities. On the one hand, this information is very useful because it illustrates the whole technical and statistical process of implementing experiments in detail across numerous papers. On the other hand, the papers are targeted at Microsoft's testing processes and are therefore subjective and only relevant for one part of the potential test users. Especially small and medium-sized firms with fewer financial resources cannot implement testing processes the way global software producers such as Microsoft or Google do.
The scarcity has three reasons. Firstly, online testing is a relatively new method and business area. Not long ago, websites were updated the way the responsible designer or software developer preferred. Relying on user opinions has become more and more important only in recent years. Secondly, there are still many companies confident that they do not need online experiments. In this case, the managers and designers of websites assume that there is no need to evaluate their supposedly perfected ideas. Remaining in this state of hubris, they forgo the chance of a reliable examination of their work as well as a contribution to the online experimentation market (Kohavi 2012, pp. 25-26). Thirdly, many firms conducting online experiments are reluctant to share their insights and findings. The smooth and meaningful implementation of in-house testing processes can be valuable know-how that companies use as a competitive advantage against international contestants. Therefore, most of the global software and internet companies keep their knowledge private. This circumstance will not change in the future as long as competition between the companies remains strong. Legitimately, Kohavi et al. (2014, p. 5) also raise concerns about the quality of conducted research. The authors propose filtering out low-quality testing processes: papers and test result descriptions should be peer reviewed to ensure a proper implementation. Furthermore, some testing outcomes cannot be transferred to other products or websites because they are very specific. In their opinion, outcomes should be checked for possible misinterpretation of the results due to incorrect assumptions about the underlying reasons. In their paper, the authors provide several examples of tests lacking one or more of the described qualities. However, if the focus lies on high-quality works, the whole market could benefit from an open and organized research pool for online experiments. Funding and encouraging skilled scientists to do further research on how experiments can be enhanced and implementation costs lowered could lead to further innovation and improvement. Hence, start-ups and global players could benefit from the gained knowledge equally. Possible synergy effects could serve as a stimulus for further research on experiments. Recognizably, there is still much potential for research and improvement.

5. Conclusion

Controlled online experiments are undoubtedly one of the key measures for understanding one's customers better. After a period in which design and functionality were determined by the developers themselves, experiments ousted the hubris that a company can best estimate what a user desires from its website. Nowadays, controlled online experiments are used to create better and more effective internet presences and therefore generate more value and revenue for the implementing company. The literature review concluded that a technical and statistical standard is needed to obtain significant and relevant results in testing. Not every experiment ends in a meaningful result or has the needed foundation for analysis purposes. The pioneers of experimentation on the web had to experience arising pitfalls as well as achievable improvements to set the standard for testing as it is used today. Through the establishment of basic A/B testing platforms, access to fundamental online testing is enabled for start-ups and middle-sized firms as well.
Nevertheless, the whole process still needs a lot of advancement and improved accessibility to become the norm when changing websites. The literature research showed how limited the available information about online experiments is and that companies prefer to keep know-how in this area locked up. Top-quality journals have yet to publish papers on online experimentation, probably because they do not receive submissions from established scientists. Therefore, more information and insights should be published in the future to allow others an easy and effective adaptation of controlled online experiments to their needs. Figure 3 summarizes the main findings of this literature review and illustrates the present problems (Insert Figure 3 about here). The figure can also be seen as a form of conclusion for the literature research within the framework of this Bachelor's thesis. Removing the illustrated obstacles will be the most significant challenge for researchers and associated companies. Consequently, the experimentation process has to be attuned to contemporary standards and trends. Otherwise, the high levels of customer agreement and user responsiveness companies strive for will not be maintained in the long run. With these factors in mind, controlled online experiments can be one major growth area for firms in the future, representing a worthwhile generator of revenue and better satisfaction rates.

Table 1: Advantages and Disadvantages of all Assignment Methods
Adapted from "Controlled Experiments on the Web: Survey and Practical Guide" (Kohavi et al. 2009, p. 171)

Criterion | Traffic Splitting | Page Rewriting | Client-side Assignment | Server-side Assignment
Changes in website code | No | No | Yes (moderately) | Yes (highly)
Implementation cost of first experiment | ++/+++ | ++ | ++ | +++
Implementation cost of subsequent experiments | ++/+++ | ++ | ++ | +/++
Hardware cost (server) | +++ | ++/+++ | + | +
Flexibility (render time) | +++ | ++ | + | ++++
Negative impact on test performance for user | + | +++ | +++ | +
Signs and symbols: + low, ++ moderate, +++ high, ++++ very high

Figure 1: Structural Overview of Controlled Online Experiments
Contains terms and definitions from several authors (Deng et al. 2013; Kohavi et al. 2009; Kohavi et al. 2012; Kohavi, Henne, and Sommerfield 2007; Peterson 2004)
[Figure labels: Experimentation Structure with (1) Goal Setting (OECs) and Statistical Base Frame (hypothesis testing, sample size), (2) the Testing Process with Implementation (randomization, assignment of variants) and Technical Facilities (server, software, data path), and (3) the Analysis of Test Results for the Control and Treatment variants, leading either to retention of the current design or to introduction of the changed design; inputs: external information, internal testing know-how, financial resources.]

Figure 2: A/B Testing Software Market: Distribution in May 2014
Adapted from BuiltWith.com (available at http://trends.builtwith.com/analytics/a-b-testing; select option "The Entire Internet" on the right)
Market shares in percent: Optimizely 26.47; Other 26.39; Visual Website Optimizer 15.62; SiteSpect 14.62; Google Website Optimizer 10.16; Omniture Adobe Test and Target 6.74.

Figure 3: Sources, Trends and Existing Problems for Controlled Online Experiments
Contains terms and definitions from several authors (Kohavi et al. 2014; Kohavi et al. 2009; Nielsen 2005; Peterson 2004; Quarto-vonTivadar 2006; Tang et al. 2010)
Figure 3: Sources, Trends and Existing Problems for Controlled Online Experiments
Contains terms and definitions from several authors (Kohavi et al. 2014; Kohavi et al. 2009; Nielsen 2005; Peterson 2004; Quarto-vonTivadar 2006; Tang et al. 2010)
[Diagram: knowledge sources for controlled online experiments (testing technology providers, company-specific testing, general research efforts) and the main literature stream (scientific papers in marketing journals, conference proceedings, online articles and weblogs) feed a knowledge and research pool; alternatives and related methods (A/A testing, control charts, segmented testing, multivariable testing, qualitative user behaviour observation) stand alongside A/B testing; three existing problems are highlighted: secrecy of know-how in company-specific testing, an underdeveloped research area, and quality concerns (no peer evaluation).]

Appendix

Appendix A: T-test and Minimum Sample Size

For running a successful and statistically meaningful experiment, several minimum requirements have to be fulfilled. The first is the t-test, also called a single-factor hypothesis test, which is used to determine the statistical significance of an A/B test (Kohavi et al. 2009, p. 152).

t = (O_B − O_A) / σ_d

In the t-test formula, the numerator is the difference between the estimated values of the OEC for the Treatment (B) and the Control (A). The denominator, σ_d, is the estimated standard deviation of the difference between the two OECs. Kohavi et al. (2009, p. 152) emphasize that a threshold has to be established, which is based on the confidence level of the test, very often 95%. If t, the result of the t-test, is larger than this threshold, the null hypothesis is rejected and the Treatment's OEC is statistically significantly different from the Control's OEC. Moreover, the paper describes the effects of different alterations, such as changes in the sample size, on the sensitivity of the test (Kohavi et al. 2009, pp. 152-154). For the minimum sample size, Kohavi et al. (2007, p. 962; 2009, pp. 152-153) introduce two formulas:

n = 16σ² / Δ²   and   n = (4rσ / Δ)²

In both formulas, n is the number of users needed for each variant, σ² the variance of the OEC (σ its standard deviation) and Δ the amount of change that should be detected with the experiment. In the second formula, r is the number of variants in the test. Both formulas can be applied, the second being the more conservative approach.
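To make the t statistic and the two sample size formulas tangible, the following minimal Python sketch works through them with made-up numbers. The function names, variable names and the example figures (a conversion-rate OEC of roughly 5% and a minimum detectable absolute change of 0.5 percentage points) are illustrative assumptions and do not come from the thesis or the cited papers.

```python
import math

def t_statistic(oec_treatment, oec_control, std_of_difference):
    # t = (O_B - O_A) / sigma_d, following Kohavi et al. (2009, p. 152)
    return (oec_treatment - oec_control) / std_of_difference

def min_sample_size(variance_oec, delta):
    # n = 16 * sigma^2 / delta^2: users needed per variant
    return math.ceil(16 * variance_oec / delta ** 2)

def min_sample_size_conservative(variance_oec, delta, num_variants):
    # n = (4 * r * sigma / delta)^2: the more conservative estimate
    sigma = math.sqrt(variance_oec)
    return math.ceil((4 * num_variants * sigma / delta) ** 2)

# Illustrative, assumed values: a binary conversion-rate OEC of about 5%,
# where an absolute change of 0.5 percentage points should be detectable.
p = 0.05
variance = p * (1 - p)   # variance of a binary (converted / not converted) OEC
delta = 0.005
print(min_sample_size(variance, delta))                   # roughly 30,400 users per variant
print(min_sample_size_conservative(variance, delta, 2))   # roughly 121,600 users per variant
```

In a real test, the variance of the chosen OEC would be estimated from historical data, and the resulting n applies to each variant separately.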
Appendix B: Literature Review Tables

Author/s (Year): Amatriain and Basilico (2012)
Title: Netflix Recommendations: Beyond the Stars (Part 2)
Journal/Source: Techblog Netflix (online source)
Theoretical Background: Gives insights into the personalisation technology implemented on the Netflix website; examples are provided of how innovation and research are conducted
Main Findings: The company uses parallel A/B tests after offline tests had significant results; the automatic recommendation system on the website is based on A/B tests
Relevancy: ++ (functions as example in 3.1)

Author/s (Year): Bansal, Buchbinder, and Naor (2012)
Title: Randomized Competitive Algorithms For Generalized Caching
Journal/Source: Society for Industrial and Applied Mathematics
Theoretical Background: Discusses which types of randomized algorithms are suitable for the generalized caching problem
Main Findings: Design of an online rounding procedure that converts an algorithm into a randomized algorithm; provides a framework for caching
Relevancy: + (explanation of PRNG in 2.3)

Author/s (Year): Becker (2008)
Title: Best Bets for Site Tests
Journal/Source: Multichannel Merchant (article)
Theoretical Background: Describes the rise of A/B testing tools, focusing on the Google product Google Website Optimizer
Main Findings: Google's tool offers MVT and is open to third-party developers; therefore, continuous innovation is fostered
Relevancy: ++ (functions as example in 3.2)

Author/s (Year): Bell (2008)
Title: Multivariable Testing: An Illustration
Journal/Source: Circulation Management (article)
Theoretical Background: Describes the implementation process of multivariable testing; provides a comparison with single A/B tests
Main Findings: MVT provides better and faster results than executing several A/B tests; the sample size does not have to be increased for MVT
Relevancy: +++

Author/s (Year): Burk (2006)
Title: A Better Statistical Method for A/B Testing in Marketing Campaigns
Journal/Source: Marketing Bulletin
Theoretical Background: Presents an alternative to the usual A/B testing method
Main Findings: Testing by using control charts is more effective and has more beneficial effects than A/B testing
Relevancy: +++ (used as counter statement in 2.5)

Author/s (Year): Crook, Frasca, Kohavi, and Longbotham (2009)
Title: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
Journal/Source: KDD '09 Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Theoretical Background: Focus on pitfalls the authors have experienced after running numerous experiments at Microsoft; the pitfalls include a wide range of topics, such as assuming that common statistical formulas used to calculate standard deviation and statistical power can be applied, or ignoring robots in the analysis
Main Findings: Flaws in the OEC can damage the whole test outcome; the focus of tests has to be on human users only, otherwise the results can be very misleading; the OEC focus should be long-term and not short-term
Relevancy: +++

Author/s (Year): Deng, Li, and Guo (2014)
Title: Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation
Journal/Source: WWW '14 Proceedings of the 23rd International Conference on World Wide Web
Theoretical Background: The authors propose a general methodology for combining data from the first screening stage with validation-stage data for more sensitive hypothesis testing and more accurate point estimation of the treatment effect; the method is widely applicable to existing online controlled experimentation systems
Main Findings: Using generalized step-down tests to adjust for multiple comparison can be applied to any A/B testing with relatively few treatments; bias correction methods are useful when a point estimate is important
Relevancy: ++

Author/s (Year): Deng, Xu, Kohavi, and Walker (2013)
Title: Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
Journal/Source: WSDM '13 Proceedings of the Sixth ACM International Conference on Web Search and Data Mining
Theoretical Background: Proposal of an approach that utilizes data from the pre-experiment period to reduce metric variability and hence achieve better sensitivity
Main Findings: The technique is applicable to a wide variety of key business metrics, practical and easy to implement
Relevancy: ++

Author/s (Year): Eisenberg (2005)
Title: How to Improve A/B Testing
Journal/Source: Clickz.com (online source)
Theoretical Background: Provides practical examples and experiences of the author on how to get better results from A/B testing
Main Findings: A/B testing is more difficult and complex than it appears to be; only one change should be implemented at a time; small changes can lead to big differences
Relevancy: +++

Author/s (Year): Hawthorne (2013)
Title: 4 Great Ways to Ramp up Your Web Testing
Journal/Source: Response Magazine (article)
Theoretical Background: Starting from the difference between offline and online testing, the author provides recommendations for starting online experiments successfully
Main Findings: A/B testing should always be the basis; if performed smoothly, multivariate testing can be added; the most obvious objects should be tested first; users that browse from mobile devices should be included
Relevancy: +++

Author/s (Year): Hernández-Campos, Jeffay, and Donelson Smith (2003)
Title: Tracking the Evolution of Web Traffic: 1995-2003
Journal/Source: Proceedings of the 11th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Theoretical Background: Provides data suitable for the construction of synthetic web traffic generators and in doing so retrospectively examines the evolution of web traffic
Main Findings: Web traffic has been the dominant traffic type on the Internet, representing more bytes, packets, and flows than any other single traffic class
Relevancy: + (mainly used for the definition of web traffic)

Author/s (Year): Kirby and Stewart (2007)
Title: The Institutional Yes
Journal/Source: Harvard Business Review
Theoretical Background: Interview with Jeff Bezos, the founder of amazon.com, about innovation inside the company and how continuous testing helps the company to understand its customers better
Main Findings: There are big differences between what the company thinks will be successful and what the customers actually like and use; customer focus can foster innovation and improvement of the company from the outside; a main focus of the company is to lower the overall costs of experimenting; firms should agree on a long-term goal but be flexible in terms of how to get there specifically
Relevancy: +++

Author/s (Year): Kohavi (2012)
Title: Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics
Journal/Source: RecSys '12 (presentation slides)
Theoretical Background: Introduces controlled online experiments, shows examples and shares advantages (MSN, Microsoft Office); describes cultural challenges regarding online experiments
Main Findings: Twyman's Law: every figure that looks interesting or different is usually wrong; best practices include A/A tests, ramp-up, and large user samples; cultural challenges have to be evolved in a four-step process
Relevancy: +++

Author/s (Year): Kohavi and Longbotham (2010)
Title: Unexpected Results in Online Controlled Experiments
Journal/Source: ACM SIGKDD Explorations Newsletter
Theoretical Background: Real examples of unexpected results and lessons learned regarding controlled online experiments at Microsoft; the authors explain the reasons for wrong forecasting and share the resulting lessons
Main Findings: Anomalies are expensive to investigate, but some lead to critical insights that have long-term impact; often, incorrect results are due to subtle errors which are hard to anticipate or detect; the authors recommend frequent A/A testing to increase quality; often, unexpected results stem from a lack of understanding of user behaviour
Relevancy: +++

Author/s (Year): Kohavi, Deng, Frasca, Walker, Xu, and Pohlmann (2013)
Title: Online Controlled Experiments at Large Scale
Journal/Source: KDD '13 Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Theoretical Background: Addresses three areas that are challenged when implementing experiments at large scale: cultural/organizational, engineering and trustworthiness; focuses on the number of experiments, i.e. how organizations could evaluate more hypotheses and increase the velocity of validated learning
Main Findings: Negative experiments, which degrade the user experience short-term, should be run given the learning value and long-term benefits; because there are many live variants of one site, alerts should be used to identify issues rather than relying on heavy upfront testing; there is a high occurrence of false positives when running experiments at large scale
Relevancy: +++

Author/s (Year): Kohavi, Deng, Frasca, Longbotham, Walker, and Xu (2012)
Title: Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
Journal/Source: KDD '12 Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Theoretical Background: Explains five outcomes concerning the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects; helps readers to increase the trustworthiness of results coming out of controlled experiments
Main Findings: Instrumentation is not as precise as expected: it interacts with experiments in subtle ways; lessons learned from offline experiments do not always function well online because of carryover effects and confidence intervals that do not shrink
Relevancy: +++

Author/s (Year): Kohavi, Deng, Longbotham, and Xu (2014)
Title: Seven Rules of Thumb for Web Site Experimenters
Journal/Source: To appear in KDD 2014 (forthcoming)
Theoretical Background: Provides principles for experiment implementation with broad applicability in web optimization and analysis; two goals: guide experimenters in terms of optimizing and provide the community with new research challenges
Main Findings: Small changes can lead to large differences in return on investment; it is rather rare to get big changes in key metrics; speed is a main factor of a successful experiment implementation; complex designs overburden users; the more users involved in testing, the better
Relevancy: +++

Author/s (Year): Kohavi, Henne, and Sommerfield (2007)
Title: Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO
Journal/Source: KDD '07 Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Theoretical Background: Practical guide to conducting online experiments, especially where end-users can help in developing possible features; provides examples of controlled experiments and discusses their technical and organizational limitations; introduces important testing areas: statistical power, sample size and techniques for variance reduction; evaluates randomization and caching techniques; the authors share key lessons that help practitioners to run trustworthy controlled experiments, because it is the customers' experience that matters
Main Findings: Controlled experiments neutralize confounding variables by distributing them equally over all values through random assignment; establishment of a causal relationship between the changes made in different variants and the measures of interest (OEC); success depends on the users' perception and not on the Highest Paid Person's Opinion (HiPPO); companies can accelerate innovation through experimentation
Relevancy: +++

Author/s (Year): Kohavi, Longbotham, Sommerfield, and Henne (2008)
Title: Controlled Experiments on the Web: Survey and Practical Guide
Journal/Source: Data Mining and Knowledge Discovery
Theoretical Background: Extended follow-up paper to "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO"; adds section 4, which presents multivariable tests in an online setting; provides additional practical examples; updates the formulas used for sample size and hypothesis testing
Main Findings: Same findings as the underlying paper, with extended formulations in some places; multivariable testing can serve as an alternative to several A/B tests, but there are also accompanying disadvantages such as a poor user experience, a more difficult analysis and interpretation, and a longer preparation time for a test; before conducting MVT, an A/B test should always be performed beforehand
Relevancy: +++

Author/s (Year): Linden (2006)
Title: Make Data Useful
Journal/Source: Stanford Data Mining 2006 (presentation slides)
Theoretical Background: Testing algorithms and techniques used at Amazon; reasons for A/B testing and how to improve it
Main Findings: Speed of testing is very important; bias should be toward new items and recent history/mission; expectations for customers should be set
Relevancy: +++

Author/s (Year): Malinas and Bigelow (2008)
Title: Simpson's Paradox
Journal/Source: Stanford Encyclopaedia of Philosophy (online source)
Theoretical Background: Used for the definition and explanation of Simpson's Paradox
Main Findings: The paradox states that two groups with the same effect do not need to show the same effect if combined into one group
Relevancy: ++

Author/s (Year): McKinley (2012)
Title: Design for Continuous Experimentation: Talk and Slides
Journal/Source: mcfunley.com (online source)
Theoretical Background: Describes the testing procedure implemented on the Etsy.com website; description of the A/B analyser, which automatically generates a dashboard with relevant business metrics for each performed test
Main Findings: Design and product processes must change to accommodate experimentation; changing too many things at once will result in confusion and negative results
Relevancy: ++

Author/s (Year): Nielsen (2005)
Title: Putting A/B Testing in its Place
Journal/Source: nngroup.com (online resource)
Theoretical Background: Describes why measuring the live impact of design changes on key business metrics is valuable but often creates a focus on short-term improvements; the same issues remain for multivariable testing
Main Findings: A/B tests with OECs with a near-term focus neglect bigger issues that only qualitative studies can find; A/B testing should not be the only method for evaluating a project or change
Relevancy: + (the article may not be written from an objective/scientific point of view; it is therefore only used in the thesis as an example that there are divergent opinions about A/B testing)

Author/s (Year): Peterson (2004)
Title: Web Analytics Demystified: A Marketer's Guide to Understanding How Your Web Site Affects Your Business
Journal/Source: Celilo Group Media and CafePress (book)
Theoretical Background: Creates awareness of what web analytics is (from a practical perspective) and what a web analytics program can and cannot tell you; helps to develop an understanding of which tools and statistics are useful to a web analytics program; establishes knowledge of which available statistics are most useful to a particular online business and how these statistics should be used
Main Findings: Ways to get meaningful and measurable results from an A/B test: only change one variable at a time; understand the process for diverting traffic; determine accurate measures of volume; analyse carefully; run a "null test"; run the test until you are sure the results are real; consider segmenting test subjects
Relevancy: +++

Author/s (Year): Quarto-vonTivadar (2006)
Title: A/B Testing: Too Little, Too Soon?
Journal/Source: Future Now, Inc. (article)
Theoretical Background: Evaluation of whether A/B testing delivers what it promises; A/B testing is not as meaningful as the industry assumes
Main Findings: A/B tests can be highly subjective; many A/B tests are not implemented in the most efficient way; A/B testing ignores variance
Relevancy: + (the article may not be written from an objective/scientific point of view; it is therefore only used in the thesis as an example that there are divergent opinions about A/B testing)

Author/s (Year): Regalado, Antonio (2014)
Title: Seeking Edge, Websites Turn to Experiments
Journal/Source: MIT Technology Review (article)
Theoretical Background: Illustrates several examples of companies conducting online experiments; the main topic is how users influence today's web design
Main Findings: Nowadays, users heavily influence the design of a website; website users prefer usability over a good-looking design; provides examples of unexpected outcomes for companies; companies cannot estimate reliably what their customers want
Relevancy: +++

Author/s (Year): Roy (2001)
Title: Design of Experiments using the Taguchi Approach: 16 Steps to Product and Process Improvement
Journal/Source: John Wiley & Sons, Inc. (book)
Theoretical Background: Understanding the Taguchi method through examples and statistical explanations
Main Findings: Why using an Overall Evaluation Criterion is meaningful: it combines several objectives the company wants to achieve through the new feature
Relevancy: ++ (mainly used for the definition of the OEC)

Author/s (Year): Stanaland and Tan (2010)
Title: The Impact of Surfer/Seeker Mode on the Effectiveness of Website Characteristics
Journal/Source: International Journal of Advertising
Theoretical Background: Examines how the effectiveness of two website designs depends on the consumer's purpose for visiting a website; for this, an online experiment was conducted in Singapore
Main Findings: There are two groups of website visitors: seekers (goal-directed, prefer a consumer-controlled interactive environment and a visually simple design) and surfers (experiential, prefer marketer-controlled interactivity and a visually complex layout)
Relevancy: ++

Author/s (Year): Studer (2012)
Title: Does it Matter How Happiness is Measured? Evidence from a Randomized Controlled Experiment
Journal/Source: Journal of Economic and Social Measurement
Theoretical Background: Randomized controlled experiment about the distribution of happiness scores; a discrete single-item Likert scale is tested against a continuous rating scale (visual analogue scale)
Main Findings: Gender happiness inequality depends on gender-specific question design effects; it is beneficial to use a continuous rating scale instead of the widespread discrete scale
Relevancy: ++

Author/s (Year): Tang, Agarwal, O'Brien, and Meyer (2010)
Title: Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
Journal/Source: KDD '10 Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Theoretical Background: Describes the overlapping experiments implemented at Google; discusses associated tools and educational processes to use this variant effectively; describes trends that show the success of this overall experimental environment
Main Findings: Benefits of overlapping experiments: more experiments can be conducted in the same period of time; experiments are better in quality; experiments themselves are faster in implementation and data gathering
Relevancy: +++

Author/s (Year): Villano (2013)
Title: You Can Go with This or You Can Go with That
Journal/Source: Entrepreneur (article)
Theoretical Background: Describes the rise of Optimizely, a web company that facilitates A/B testing for companies
Main Findings: It is very difficult and expensive to implement an A/B test on one's own; the market for testing programs is big; a company specialized in this area should be able to provide quick and user-friendly programs
Relevancy: +++

Author/s (Year): Walker (2013)
Title: What's Happening in the A/B Testing Market
Journal/Source: Builtwith.com (online source)
Theoretical Background: Describes fundamentals of A/B testing; offers an overview of recent A/B testing program providers; evaluates the development of the most important tools in detail
Main Findings: The market for A/B testing programs is increasing and shows no signs of slowing down in the near future; probably not all programs will persist in the long run, a sign of oversupply
Relevancy: +++

Author/s (Year): Walker, Kounavis, Gueron, and Graunke (2009)
Title: Recent Contributions to Cryptographic Hash Functions
Journal/Source: Intel Technology Journal
Theoretical Background: Provides background to understand what a hash function is and what problems it addresses; describes the differences between two designs (Skein and Vortex)
Main Findings: Hash functions are basic building blocks that are central to cryptography's mission; there has been a wave of new research into hash functions
Relevancy: + (used for the explanation of hash functions)
Author/s (Year): Weiss (1997)
Title: Evaluation: Methods for Studying Programs and Policies
Journal/Source: Prentice Hall (book)
Theoretical Background: Establishing causality through differences in the OEC (Control vs. Treatment)
Main Findings: Differences in the OEC are, if the experiment is designed and executed properly, the result of the assignment of Control and Treatment variants
Relevancy: +++

Only for definition purposes:

Author/s (Year): Merriam-Webster (2014)
Definition: "Null Hypothesis"
Journal/Source: m-w.com
Theoretical Background: Online encyclopaedia
Main Findings: "Null Hypothesis is a statistical hypothesis to be tested and accepted or rejected in favour of an alternative"
Relevancy: +

Author/s (Year): Merriam-Webster (2014)
Definition: "Cookie"
Journal/Source: m-w.com
Theoretical Background: Online encyclopaedia
Main Findings: "A cookie is a file that may be added to your computer when you visit a Web site and that contains information about you"
Relevancy: +

References

Ali, Waleed, Siti Mariyam Shamsuddin, and Abdul Samad Ismail (2011), "A Survey of Web Caching and Prefetching," International Journal of Advances in Soft Computing and its Applications, 3 (1), 18-44.

Amatriain, Xavier, and Justin Basilico (2012, June), "Netflix Recommendations: Beyond the Stars (Part 2)," (accessed May 16, 2014), [available at http://techblog.netflix.com/2012/06/netflix-recommendations-beyond-5-stars.html].

Bansal, Nikhil, Niv Buchbinder, and Joseph Naor (2012), "Randomized Competitive Algorithms For Generalized Caching," Society for Industrial and Applied Mathematics, 41 (2), 391-414.

Becker, Larry (2008), "Best Bets for Site Tests," Multichannel Merchant, April edition, 24-25.

Bell, Gordon H. (2008), "Multivariable Testing: An Insight," Circulation Management, May edition, 16-18.

Burk, Scott (2006), "A Better Statistical Method for A/B Testing in Marketing Campaigns," Marketing Bulletin, 17, 1-8.

Crook, Thomas, Brian Frasca, Ron Kohavi, and Roger Longbotham (2009), "Seven Pitfalls to Avoid when Running Controlled Experiments on the Web," International Conference on Knowledge Discovery and Data Mining (KDD), Paris (June 28-July 1), 1105-1114.

Deng, Alex, Tianxi Li, and Yu Guo (2014), "Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation," 2014 International World Wide Web Conference (WWW), Seoul (April 7-11), 609-618.

———, Ya Xu, Ron Kohavi, and Toby Walker (2013), "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data," WSDM 2013, Rome (February 4-8), 123-131.

Eisenberg, Bryan (2005, April), "How to Improve A/B Testing," (accessed May 8, 2014), [available at http://www.clickz.com/clickz/column/1717234/how-improve-a-b-testing].

Hawthorne, Timothy R. (2013), "4 Great Ways to Ramp Up Your Web Testing," Response Magazine, September edition, 78.

Hernández-Campos, Félix, Kevin Jeffay, and F. Donelson Smith (2003), "Tracking the Evolution of Web Traffic: 1995-2003," Proceedings of the 11th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Orlando (October), 16-25.
Kirby, Julia, and Thomas A. Stewart (2007), "The Institutional Yes," Harvard Business Review, October edition, 74-82.

Kohavi, Ron (2012), "Online Controlled Experiments: Introduction, Learnings and Humbling Statistics," RecSys'12, Dublin (September 9-13), 1-46.

———, and Roger Longbotham (2010), "Unexpected Results in Online Controlled Experiments," ACM SIGKDD Explorations Newsletter, 12 (2), 31-35.

———, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu (2012), "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained," KDD 2012, Beijing (August 12-16), 786-794.

———, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann (2013), "Online Controlled Experiments at Large Scale," KDD 2013, Chicago (August 11-14), 1168-1176.

———, Alex Deng, Roger Longbotham, and Ya Xu (2014), "Seven Rules of Thumb for Web Site Experimenters," KDD 2014 (forthcoming), New York City (August 24-27), [available at http://www.exp-platform.com/Documents/2014%20experimentersRulesOfThumb.pdf].

———, Randal M. Henne, and Dan Sommerfield (2007), "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO," Industrial and Government Track Paper (KDD'07), San Jose (August 12-15), 959-967.

———, Roger Longbotham, Dan Sommerfield, and Randal M. Henne (2009), "Controlled Experiments on the Web: Survey and Practical Guide," Data Mining and Knowledge Discovery, 18 (1), 140-181.

Linden, Greg (2006), "Make Data Useful," presented at the 2006 Stanford Data Mining, Stanford University (December 4).

Malinas, Gary, and John Bigelow (2009), "Simpson's Paradox," Stanford Encyclopedia of Philosophy, (accessed May 20, 2014), [available at http://plato.stanford.edu/entries/paradox-simpson/].

McKinley, Dan (2012), "Design for Continuous Experimentation: Talk and Slides," (accessed May 18, 2014), [available at http://mcfunley.com/design-for-continuous-experimentation].

Merriam-Webster (n.d.), "Null Hypothesis," (accessed May 12, 2014), [available at http://www.merriam-webster.com/dictionary/null%20hypothesis].

——— (n.d.), "Cookie," (accessed May 12, 2014), [available at http://www.merriam-webster.com/dictionary/cookie].

Nielsen, Jakob (2005), "Putting A/B Testing in its Place," (accessed April 18, 2014), [available at http://www.nngroup.com/articles/putting-ab-testing-in-its-place/].

Peterson, Eric T. (2004), Web Analytics Demystified: A Marketer's Guide to Understanding How Your Web Site Affects Your Business, s.l.: Celilo Group Media and CafePress.

Quarto-vonTivadar, John (2006), "A/B Testing: Too Little, Too Soon?," published for Future Now Inc., New York City, 1-21.

Regalado, Antonio (2014), "Seeking Edge, Websites Turn to Experiments," MIT Technology Review, 117 (2), 62-63.

Roy, Ranjit K. (2001), Design of Experiments using the Taguchi Approach: 16 Steps to Product and Process Improvement, s.l.: John Wiley & Sons, Inc.

Stanaland, Andrea J. S., and Juliana Tan (2010), "The Impact of Surfer/Seeker Mode on the Effectiveness of Website Characteristics," International Journal of Advertising, 29 (4), 569-595.

Studer, Raphael (2012), "Does it Matter How Happiness is Measured? Evidence from a Randomized Controlled Experiment," Journal of Economic and Social Measurement, 317-336.

Tang, Diane, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer (2010), "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation," KDD'10, Washington D.C. (July 25-28), 17-26.

Villano, Matt (2013), "You Can Go With This, or You Can Go With That," Entrepreneur, August edition, 74.
Walker, Chris (2013), "What's Happening in the A/B Testing Market," (accessed May 23), [available at https://blog.builtwith.com/2013/07/19/whats-happening-in-the-ab-testing-market/].

Walker, Jesse, Michael Kounavis, Shay Gueron, and Gary Graunke (2009), "Recent Contributions to Cryptographic Hash Functions," Intel Technology Journal, 13 (2), 80-95.

Weiss, Carol H. (1997), Evaluation: Methods for Studying Programs and Policies, s.l.: Prentice Hall.

Footnotes

1 "The hypothesis that an observed difference is due to chance alone and not due to a systematic cause" (Merriam Webster 2014).

2 Caching is conducted to accelerate the performance of the system by storing a portion of the websites a user has visited (Bansal, Buchbinder, and Naor 2012, p. 391).

3 "A small file or part of a file stored on a World Wide Web user's computer, created and subsequently read by a Web site server, and containing personal information" (Merriam Webster 2014).

4 "Hash functions are one of cryptography's most fundamental building blocks, even more so than encryption functions. For example, hash functions are used for […] random number generation, as well as for digital signature schemes, stream ciphers, and random oracles" (Walker et al. 2009, p. 80).

5 Traffic describes the data which is sent and received between a client and a server, and therefore between a website and the visitors of this site (Hernández-Campos, Jeffay, and Donelson Smith 2003, p. 16).

6 Experimentation Platform: http://www.exp-platform.com/Pages/default.aspx

Affidavit