Controlled Online Experiments - Chair of Quantitative Marketing

Controlled Online Experiments:
What They Are and How to Do
Them
Bachelor’s Thesis
Marius Brand
Spring Term 2014
Advisor:
Veronica Valli
Chair of Quantitative Marketing and Consumer Analytics
L5, 2 - 2. OG
68161 Mannheim
Internet: www.quantitativemarketing.org
Table of Contents
List of Tables........................................................................................................................... III
List of Figures ......................................................................................................................... IV
Abstract .................................................................................................................................... V
1. Introduction .......................................................................................................................... 1
2. Controlled Experiments ....................................................................................................... 2
2.1 Structure and Terminology of A/B Tests ......................................................................... 3
2.1.1 Technical vocabulary. ............................................................................................... 3
2.1.2 Statistical foundation of experiments. ....................................................................... 5
2.2 Multivariable Testing (MVT)........................................................................................... 6
2.2.1 Characteristics. ......................................................................................................... 6
2.2.2 Variants of MVT. ....................................................................................................... 7
2.3 Technical Implementation ................................................................................................ 8
2.3.1 Randomization algorithm. ......................................................................................... 8
2.3.2 Assignment method. ................................................................................................... 9
2.3.3 Data path. .................................................................................................................. 9
2.4 Suggestions for Improving the Test Performance .......................................................... 10
2.5 Possible Pitfalls and Technical Limitations of A/B Tests .............................................. 13
3. Outcomes and Application ................................................................................................ 15
3.1 Application in the Software and Online Service Industry.............................................. 16
3.2 Technology Provider of A/B Tests................................................................................. 17
3.3 A/B Testing For Research Purposes............................................................................... 18
4. Discussion ............................................................................................................................ 20
4.1 Managerial Implications and Guidelines for the Industry .............................................. 20
4.2 Limitations and Future Research.................................................................................... 22
5. Conclusion ........................................................................................................................... 24
Appendix ................................................................................................................................. 30
References ............................................................................................................................... 59
Affidavit .................................................................................................................................. 64
List of Tables
Table 1: Advantages and Disadvantages of all Assignment Methods ..................................... 26
List of Figures
Figure 1: Structural Overview of Controlled Online Experiments .......................................... 27
Figure 2: A/B Testing Software Market: Distribution in May 2014 ........................................ 28
Figure 3: Sources, Trends and Existing Problems for Controlled Online Experiments .......... 29
Abstract
Controlled Online Experiments are an integral component of the development process for the majority of today’s web-based companies. With tests on the internet, it is feasible to measure users’ reactions to and preferences for one of two alternative versions of the same website. Companies do not have to debate which design and structure is superior; they simply involve their customers in the process. Because forecasting web users’ behaviour at the corporate level is often imperfect, firms like Microsoft or Amazon invest heavily in experimentation and online testing research. For them, anticipating the behaviour and opinion of their customers is one key element of success.
The main focus of this literature review will be on scientific research papers and
reports by employees of companies operating in the online market. Besides establishing
strategies for the technical implementation, recommendations regarding the evaluation as well as improvement proposals are outlined and connected. Overall, the thesis aims at providing an introductory guideline and presenting the current state of the art in online experimentation.
1. Introduction
For a long period of time, offline controlled experiments have been a highly important part of the development process and of customer integration for new products and services. With the expansion of the internet as a marketplace and a huge global business platform, companies started to implement online experiments of similar purpose and structure. Consumers on the world wide web form a fast-changing conglomerate that demands user-friendly and meaningful content. Therefore, besides fostering innovation processes, the executing companies focus on answering the question of what their customers want or prefer.
Historically, controlled online experiments emerged as a consequence of difficulties in estimating customer opinion and desire. Having severe problems in forecasting user reactions to new designs and functions on their websites, companies started to develop tools that were able to test customer responses beforehand. At the same time, after long periods of research and analysis, the management of a corporation often holds a strong opinion about new components and their ability to benefit the company. Through the implementation of experiments, this opinion is contrasted with the actual customer response. Surprisingly, the results of the online trials often turn out contrary to the corporate estimations. Because forecasting web users’ behaviour at the corporate level is mostly inaccurate and imperfect, online experiments became the new standard for website improvement and alteration.
There is no consistent term for these testing processes yet. Randomized experiments, online trials and A/B tests usually refer to the same or similar processes and can be consolidated as controlled online experiments. If used in this literature review, all named terms refer to the same category of online testing. Generally, A/B tests are the simplest and most used type of online experiment. The majority of all relevant papers focuses on the implementation and evaluation of A/B tests; therefore, the thesis will mainly focus on this most popular pattern of testing as well. An exception is Multivariable testing, which can be seen as an advanced, multi-layered extension of A/B testing.
As a first step, the paper provides an overview of the common structure of
experiments, examining simple A/B tests as well as Multivariable testing. After establishing a
technical implementation standard, contemporary and relevant test performance suggestions
and limitations from leading authors are presented. Subsequently, the third chapter illustrates
examples of online experiments in a corporate environment as well as for research purposes,
highlighting the outcomes and individual benefits the executors were able to accomplish. The
fourth chapter discusses managerial implications based on results from online
experimentation, implications for the online industry as a market and business place itself and
possible limitations researchers may face when delving into this particular topic. The final conclusion summarizes the main findings of the thesis and closes with the current standing of controlled online experiments.
2. Controlled Experiments
Controlled experiments on the internet originate from offline tests in traditional production businesses, which shows their close connection to consumer behaviour research. Conducting online trials relies on many of the technical standards and implementation technologies of the offline counterpart. Therefore, various steps described hereafter resemble established experimentation practices. Frequently, the authors of the papers and research studies cited in this review resort to statistical and technological fundamentals. Hence, important implementation principles also serve as a basis in this thesis for explaining the fundamental IT processes. After expounding the structural foundation of controlled online experiments, this chapter focuses on key indicators of a successful test performance and lists relevant guidelines and recommendations for a smooth test realization.
2.1 Structure and Terminology of A/B Tests
There are different approaches to conducting a controlled online experiment. A/B tests are the simplest and most widely used form of controlled experiments. Visitors are randomly exposed to one of two different variants of the same website: Control (A) or Treatment (B) (Crook et al. 2009, p. 1106). Control means that the original, unchanged website is displayed to the user. The remaining users browsing the website see the Treatment; for them, the new version of the website is presented. The test initiator can choose the split of users confronted with A or B freely, for example allocating one half to each variant (Kohavi and Longbotham 2010, p. 203). It is important to emphasize that no factor is allowed to influence the assignment during test execution; the allocation has to be random and persistent (Kohavi et al. 2014, p. 2). Through this method, the response to the Treatment can be measured and compared against the original process (Peterson 2004, p. 76).
2.1.1 Technical vocabulary. Before a precise analysis of the collected observations is possible, an Overall Evaluation Criterion (OEC) has to be derived for each variant. The OEC is defined as a quantitative measure of the overall objective of the experiment (Kohavi et al. 2009, p. 150). Trials with more than one objective can reach a compromise by combining the objectives into one OEC under various criteria (Roy 2001, p. 26). The OEC is therefore the key metric of the experiment that is going to be compared and analysed (Crook et al. 2009, p. 1106). Meaningful examples for an OEC are the conversion rate, the units purchased after exposing the users to one of the variants, the resulting revenue, or a weighted combination of several factors. Specifically for online experiments, selecting a single metric aligns the organization behind a clear objective because it forces trade-offs to be made explicit (Kohavi, Henne, and Sommerfield 2007, p. 961). In addition, a reasonable OEC should contain factors focused on long-term goals, for example predicted lifetime value or repeat visits. A factor, in turn, is defined as a variable with the purpose of influencing the OEC. In simple A/B tests, one single factor embraces two values or variants: A and B (Crook et al. 2009, p. 1106).
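To make the notion of a weighted OEC more tangible, the following short Python sketch combines several session-level metrics into one criterion; the metric names and weights are purely illustrative assumptions and do not stem from the cited papers.

```python
# Minimal sketch: combining several session metrics into one weighted OEC.
# Metric names and weights are illustrative assumptions only.

def weighted_oec(session, weights):
    """Return a single OEC value for one user session."""
    return sum(weights[metric] * session.get(metric, 0.0) for metric in weights)

weights = {"converted": 0.5, "revenue": 0.3, "repeat_visit": 0.2}
session = {"converted": 1, "revenue": 24.90, "repeat_visit": 0}

print(weighted_oec(session, weights))  # one number summarising the session
```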
Before executing a controlled experiment and measuring the changes in the OEC, the
relevant experimental unit has to be determined. Experimental units, assumed to be independent of each other, are defined as the entities on which the observations during the experiment are made (Kohavi et al. 2009, p. 150). In the case of an online trial, the user of the website or
application is the usual experimentation unit. Nonetheless, tests can also be based on units
like sessions or page views (Kohavi, Henne, and Sommerfield 2007, p. 962).
Peterson (2004, p. 77) recommends running a Null Test before implementing the actual experiment. Commonly known as an A/A test, it allocates the users to two variants, but in contrast to the actual A/B test, the participants are all exposed to the same website. As a result, it can be verified that both variants show the same conversion and abandonment rates. The author claims that this secures the correct set-up of the measurement tools. The Null Hypothesis assumes that the OECs of the tested variants do not differ and, furthermore, that any observed differences are due to random fluctuation (Kohavi et al. 2009, p. 150). Generally, the hypothesis should not be rejected during a Null Test. If a rejection occurs, the probability of an error is high and the test setting should be revised (Kohavi, Henne, and Sommerfield 2007, p. 962). Kohavi et al. (2012, p. 788) recommend that every experimenter continuously run A/A tests to identify issues in the construction of test systems.
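The purpose of an A/A test can be illustrated with a small simulation: two groups are drawn from the same conversion rate, and at a 95% confidence level roughly 5% of the runs should nevertheless be flagged as significant. The sketch below uses invented traffic figures and is only meant to demonstrate the principle.

```python
# Sketch: simulated A/A tests; with a 95% confidence level roughly 5% of
# runs should be false positives. Conversion rate, sample size and number
# of runs are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_users, p, runs = 2_000, 0.05, 500

false_positives = 0
for _ in range(runs):
    a = rng.binomial(1, p, n_users)   # Control
    b = rng.binomial(1, p, n_users)   # "Treatment" = identical website
    _, p_value = stats.ttest_ind(a, b, equal_var=False)
    false_positives += p_value < 0.05

print(f"false positive rate: {false_positives / runs:.3f}")  # roughly 0.05
```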
In conclusion, if the experiment is designed and executed properly, the only relevant discrepancy after conducting an A/B experiment should be the change between Control and Treatment. Hence, the differences in the OEC directly result from this assignment and, on that note, a certain degree of causality is established (Weiss 1997, p. 215).
2.1.2 Statistical foundation of experiments. Subsequently, it is essential to perform a statistical test to estimate whether the difference between Control and Treatment is meaningful. A Treatment is accepted as statistically significantly different as soon as the test rejects the Null Hypothesis, which states that the OECs of the variants do not differ (Crook et al. 2009, p. 1106). There are several elements influencing the outcome of the statistical test. The confidence level, defined as the probability of failing to reject the Null Hypothesis when it is true, is typically set to 95%. This implies that in 5% of all tests a significant difference is determined when there is none (Kohavi et al. 2009, p. 151). For companies like Microsoft, which constantly run a high number of tests, this can mean hundreds of false positive results and therefore corrupted test outcomes (Kohavi et al. 2014, p. 4). The lower the confidence level, the higher the probability of detecting a significant difference, in other words the power of the experiment (Kohavi et al. 2009, p. 151). Moreover, the smaller the standard error of the test, the more powerful the test itself. A larger sample size reduces the standard error, as does an OEC composed of components with low variability. Users who are not exposed to the variants during the test should be filtered out. This lowers the variability of the OEC itself and therefore the standard error of the whole test (Kohavi, Henne, and Sommerfield 2007, p. 962).
Mistakes can be avoided if the executor ensures that the testing time is sufficiently long. Peterson (2004, p. 77) advises against overly short tests: if a test does not run long enough, it is not ensured that a change in the Treatment really represents an overall significant difference. Accordingly, a test should be continued until the result is reliably representative. In their paper, Kohavi et al. (2009, p. 152) provide useful formulas that define the minimum sample size of a classical A/B test and the decision of rejecting or accepting the Null Hypothesis within the scope of a t-test. A detailed description of the formulas can be found in Appendix A. Taking the mathematical analysis one step further, Deng, Li and Guo (2014, pp. 610-616) discuss statistical dependencies between test variants, hypothesis testing and point estimation. Their empirical results provide bias correction methods that can enhance the accuracy of A/B tests and reduce errors.
If it satisfies all statistical standards and general requirements, an A/B test can eventually provide a valuable insight: the difference in the OEC between the current and the planned setup of a website, or the difference in the OECs of two alternative planned setups.
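As a hedged illustration of such a significance test, the following sketch compares the conversion OEC of two variants with a two-sample t-test from SciPy; the sample data are invented.

```python
# Sketch: comparing the OEC (here: conversion) of Control and Treatment
# with a two-sample t-test. The arrays are invented example data.
import numpy as np
from scipy import stats

control = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0] * 500)    # variant A
treatment = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0] * 500)  # variant B

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:   # corresponds to a 95% confidence level
    print("Reject the Null Hypothesis: the OECs differ significantly.")
else:
    print("Do not reject the Null Hypothesis.")
```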
2.2 Multivariable Testing (MVT)
Aiming for a broader and multi-layered experimental result, some companies consider testing several factors in a single experiment. This trial is called Multivariable or Multivariate testing (MVT) and can be used to estimate the effects of each of the tested factors as well as possible interaction effects between those factors (Kohavi et al. 2009, pp. 158-159).
2.2.1 Characteristics. Bell (2008, p. 16) claims that through MVT numerous marketing elements can be tested at once with similar accuracy and deeper insights compared to several sequential A/B tests. MVT is described as advantageous because many factors can be tested in a short period of time, hence accelerating general improvements. The interactions between two or more factors can alter the overall effect. Kohavi et al. (2009, p. 160) claim that estimating synergistic or antagonistic behaviour between the factors adds a further dimension to the analysis of the tested variants. Therefore, a Multivariable test can show results of interactions that would not be visible if each variant were tested on its own.
However, MVTs are more expensive and usually require prior expertise in testing (Bell 2008, p. 16). Kohavi et al. (2009, p. 160) justify the more complex testing process with the additional value an MVT can yield. Practically, the gained insights outweigh minor limitations such as the longer preparation time or the risk of a suboptimal user experience. Yet another paper mentions MVT as an alternative but adheres to A/B testing as the most appropriate experimental design. Because interactions that represent real statistical interrelations are rare, the authors believe that in most cases the additional value provided by MVTs cannot outweigh the negative factors (Kohavi et al. 2013, pp. 1172-1173).
As a further alternative, Deng, Li and Guo (2014, p. 609) propose using two-stage or multi-stage A/B testing as soon as the executor is comfortable with one-stage testing. Due to the early employment of experiments in recent feature development cycles, tests support most of the design steps. This timing demands several experiment stages in order to evolve the finished design together with the target users. The authors underline that a single-stage A/B test, on the other hand, only affirms or rejects the finished design; the testing is not part of the development cycle.
2.2.2 Variants of MVT. For online experimentation, three possibilities of executing MVTs are available: a traditional approach, running concurrent tests, or overlapping experiments. Every method has its distinctive benefits; however, due to its lack of interaction estimation, Kohavi et al. (2009, p. 161) clearly recommend against the traditional approach for online tests. During concurrent testing, it is possible to turn off any factor at any
time without influencing the other concurrent factors. Overlapping experiments concentrate
on testing a factor as a one-factor experiment whenever the factor is ready to be tested. This
method enables a fast testing of factors and also illustrates possible interactions between
overlapping factors (Kohavi et al. 2009, p. 163).
Further insight into the implementation of overlapping experiments is offered by a paper focusing on trials conducted at Google that also provides general guidelines. The authors claim that an overlapping design keeps the advantages of a single-layer system while improving properties such as scalability, flexibility and robustness (Tang et al. 2010, pp. 17-26). Kohavi et al. (2009, p. 163) believe that if a quick testing process is the focus of the trial, using overlapping experiments is the most beneficial approach. However, if the trial concentrates on the estimation of interactions between factors, the use of concurrent tests is suggested.
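What testing several factors in one experiment means in practice can be sketched by enumerating a full-factorial set of variants; the factor names and levels below are illustrative assumptions.

```python
# Sketch: enumerating all variant combinations of a Multivariable test
# (full factorial design). Factor names and levels are illustrative.
from itertools import product

factors = {
    "headline": ["current", "new"],
    "button_colour": ["blue", "green"],
    "layout": ["one_column", "two_column"],
}

variants = [dict(zip(factors, combo)) for combo in product(*factors.values())]
for i, variant in enumerate(variants):
    print(i, variant)   # 2 x 2 x 2 = 8 variants to which users are assigned
```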
2.3 Technical Implementation
The implementation of an experiment on the internet differs from the traditional offline versions. The most comprehensive description of the technical realization comes from the Experimentation Platform team at Microsoft. In their first paper, Kohavi, Henne, and Sommerfield (2007, p. 963) establish two necessary components. Later on, Kohavi et al. (2009, p. 163) add one more component to the implementation process.
2.3.1 Randomization algorithm. At the beginning of the experiment, a randomization algorithm allocates the users to the different variants (Kohavi et al. 2009, p. 163). To ensure the statistical significance discussed above, the algorithm has to satisfy several properties, such as an appropriate user split between Control and Treatment and the avoidance of any possible correlations between parallel experiments (Kohavi, Henne, and Sommerfield 2007, p. 964). In the follow-up to the paper released in 2007, Kohavi et al. (2009, p. 164) add two more desirable but not essential properties.
A pseudorandom number generator (PRNG) can be used as the required algorithm if coupled with a form of caching. In the case of an experiment, the assignment of end users is cached once they are exposed to either the Control or the Treatment variant to prevent any correlations between concurrent experiments (Kohavi, Henne, and Sommerfield 2007, p. 964). The caching can be accomplished on the server side by storing the relevant information in a database. Ali, Shamsuddin and Ismail (2011, pp. 19-29, 37-38) introduce and review several web caching approaches that can be considered as an option for the testing implementation. However, it is significantly cheaper to store the data in a cookie, as no database is required in this case (Kohavi, Henne, and Sommerfield 2007, p. 964). Kohavi et al. (2009, p. 164) underline that users’ devices need to have cookies enabled; otherwise the caching does not work.
The PRNG can also be replaced by a hash function. However, Kohavi et al. (2009, p. 166) found that many popular hash functions fail to execute experiments properly. Additionally, the overall running-time performance is limited by the running-time performance of the hash function itself, which can therefore restrict the whole testing process. For that reason, the authors recommend a hybrid approach with either a small database or the use of cookies.
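One common way to realize such a deterministic assignment is sketched below: a salted hash of the user ID maps every user persistently to a variant without requiring a database. This is a generic illustration under the stated assumptions, not the implementation of any particular platform.

```python
# Sketch: deterministic hash-based assignment of users to variants.
# The same user id always maps to the same variant, so no database is
# strictly required; experiment_id salts the hash so that parallel
# experiments are not correlated. Generic illustration only.
import hashlib

def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000          # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user-42", "homepage-test"))        # stable for this user
print(assign_variant("user-42", "checkout-test"))        # independent experiment
```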
2.3.2 Assignment method. After finding the most suitable algorithm, an assignment method has to be determined. One of these methods enables the experimenting website to serve the different variants to different end users (Kohavi, Henne, and Sommerfield 2007, p. 964). Relevant papers introduce four different approaches: Traffic splitting, Page rewriting, Server-side assignment and Client-side assignment. For an in-depth look, Kohavi et al. (2009, pp. 166-171) provide a detailed description of the different methods, illustrating the technical implementation in several figures. In a nutshell, Table 1 summarizes the main advantages and disadvantages of the described assignment methods (Insert Table 1 about here). The table focuses on the intensity of cost-generating factors as well as on possible impacts for the user.
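As a rough illustration of server-side assignment combined with cookie-based caching, the following sketch uses the Python web framework Flask; the route, cookie name and variant split are assumptions, and a production system would add logging, expiry handling and persistence.

```python
# Sketch: server-side assignment with cookie-based caching of the variant.
# Framework choice (Flask), route and cookie names are illustrative assumptions.
import random
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/")
def home():
    variant = request.cookies.get("ab_variant")
    if variant not in ("control", "treatment"):
        variant = random.choice(["control", "treatment"])  # PRNG-based split
    html = "<h1>Old design</h1>" if variant == "control" else "<h1>New design</h1>"
    resp = make_response(html)
    resp.set_cookie("ab_variant", variant)  # cache the assignment on the client
    return resp

if __name__ == "__main__":
    app.run(port=5000)
```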
2.3.3 Data path. As a final step, a data path has to be established that carries out all necessary actions for obtaining a meaningful and informative outcome of the experiment. Kohavi et al. (2009, pp. 171-172) can be seen as pioneers in describing the precise establishment of the data path. This includes the actual capturing of the raw observation data during the trial, the aggregation of the data, the application of statistics and, ultimately, the preparation of the overall results. During the data collection and aggregation process, a very large amount of traffic arises. In detail, the website has to record all treatment assignments of end users and collect
the arising data such as page views, clicks or the resulting revenue. Next, the data is converted
into metrics that summarize the results and make them comparable to other variants of the
experiment. Hence, the actual statistical significance of all variants can be analysed (Kohavi
et al. 2009, p. 171).
There are three different methods of raw data collection available. Kohavi et al. (2009, p. 172) describe existing data collection, local data collection and service-based data collection and name the benefits and difficulties of each method. The authors select the service-based collection approach as the most flexible and therefore the preferred method for data collection.
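A minimal sketch of the aggregation step within such a data path is given below: raw per-user log records are rolled up into per-variant metrics. The record fields are assumed for illustration.

```python
# Sketch: aggregating raw observation logs into per-variant metrics.
# The log record fields are illustrative assumptions.
from collections import defaultdict

raw_log = [
    {"user": "u1", "variant": "control",   "clicks": 3, "revenue": 0.0},
    {"user": "u2", "variant": "treatment", "clicks": 5, "revenue": 19.9},
    {"user": "u3", "variant": "treatment", "clicks": 1, "revenue": 0.0},
    {"user": "u4", "variant": "control",   "clicks": 2, "revenue": 9.9},
]

totals = defaultdict(lambda: {"users": 0, "clicks": 0, "revenue": 0.0})
for record in raw_log:
    t = totals[record["variant"]]
    t["users"] += 1
    t["clicks"] += record["clicks"]
    t["revenue"] += record["revenue"]

for variant, t in totals.items():
    print(variant, "revenue per user:", round(t["revenue"] / t["users"], 2))
```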
2.4 Suggestions for Improving the Test Performance
After finishing the theoretical and statistical set-up, the online experiment is executed and the
accruing data and insights collected and analysed. Thereby, results and difficulties can arise
that were not planned or expected beforehand. Several papers describe situations the executors faced during and after the practical implementation. Based on these
conclusions, suggestions and guidelines have been developed for subsequent controlled
experiments.
During the analysis of an implemented trial, Kohavi et al. (2009, p. 173) recommend storing the collected data. Besides the statistical significance of the Control and Treatment variants, the data can also provide insights into the behaviour of different user groups or into problems that only a specific group of users has faced. Based on that information, some populations may be excluded from future tests to enhance the overall quality of the results.
The speed of the execution of Control and Treatment is crucial for the whole
experiment. A slow performance of the Treatment shown to users may influence the whole
test negatively. Linden (2006, p. 10) as well as Kohavi et al. (2009, p. 173) report that a delayed delivery of the Treatment can negatively affect the end users’ willingness to perform actions or even to stay on the website.
A certain disagreement exists regarding the question of how many factors should be tested at one time. Peterson (2004, p. 76) and Eisenberg (2005) agree that testing factor after factor is beneficial because, in this case, the intermixing of different effects is impossible. However, Kohavi et al. (2009, p. 174) believe that this approach limits organizations in terms of the scope of the desired improvements. The authors claim that interactions between different factors are less frequent than assumed by the executors. To avoid any interactions, they recommend single-factor experiments when incremental changes are to be implemented. In their opinion, variants that differ strongly in design or implementation, as well as the use of different algorithms, can also minimize interrelations.
As already mentioned, continuous A/A tests help validate the correct performance of later A/B tests. Only if a website executes an A/A test faultlessly can a sound test set-up be assumed (Peterson 2004, p. 77). Additionally, running A/A tests continuously alongside other experiments supports frictionless operation (Kohavi et al. 2009, p. 174). In the case of outcomes that are only marginally statistically significant, Kohavi et al. (2013, p. 1174) endorse rerunning the processes to replicate and strengthen the results. The same applies to abnormally positive test results. The authors caution that instrumentation errors or software bugs are a much more probable reason for such a result than a genuinely extraordinary increase in the test metrics (Kohavi et al. 2014, p. 4).
Kohavi, Henne, and Sommerfield (2007, p. 963) propose increasing the percentage of users assigned to the Treatment variant gradually. This has the advantage that a Treatment that significantly underperforms compared to the Control can be shut down near the beginning of the experiment. Therefore, possible damage can be reduced, as well as the risk of exposing many users to unintended errors. One exemplary implementation was installed for testing processes at Microsoft’s Bing search engine, where alerts automatically signal a negative user experience or interactions with other experiments (Kohavi et al. 2013, p. 1168). Hence, executors are able to investigate the actual reasons for unexpected testing results immediately.
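The idea of a gradual ramp-up with an automatic alert can be sketched schematically as follows; the ramp schedule, the guardrail threshold and the metric function are invented placeholders.

```python
# Sketch: gradually ramping up the share of users exposed to the Treatment
# and aborting early if a guardrail metric degrades. Schedule, threshold
# and the metric function are illustrative assumptions.

RAMP_SCHEDULE = [0.01, 0.05, 0.20, 0.50]     # Treatment share per day
GUARDRAIL_MIN_CONVERSION = 0.040             # abort below this value

def observed_treatment_conversion(share: float) -> float:
    """Placeholder for the metric pipeline; replace with real measurements."""
    return 0.048  # pretend the Treatment converts at 4.8%

for day, share in enumerate(RAMP_SCHEDULE, start=1):
    conversion = observed_treatment_conversion(share)
    print(f"day {day}: {share:.0%} of users in Treatment, conversion {conversion:.3f}")
    if conversion < GUARDRAIL_MIN_CONVERSION:
        print("Guardrail violated - shutting the Treatment down.")
        break
```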
Little or insignificant outcomes can result from the content of the tested scenario itself and not from the tested variants. In this case, an evaluation of possible problems or obstacles that users face during participation in the test is reasonable (Eisenberg 2005). If the test runs smoothly, an equal allocation of users to Control and Treatment is favoured to give the experiment maximal power with minimal required running time. Kohavi et al. (2009, p. 175) estimate the required running time of an experiment with a Control-Treatment allocation of 99% to 1% of all users to be about 25 times longer than that of an experiment with an equal allocation of users.
Furthermore, running experiments that are underpowered in terms of participants can lead to distorted results. Before ending a test, the minimum sample size required for statistical significance has to be reached (Kohavi et al. 2014, p. 8). An approach at Microsoft utilizes data from pre-test periods to reduce the variability of the used metrics and therefore achieve better sensitivity. As a result, fewer test users and less time are needed to achieve the same statistical power (Deng et al. 2013, pp. 123-124).
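The variance-reduction idea can be illustrated with a simplified sketch in the spirit of Deng et al. (2013): pre-experiment values of the metric serve as a covariate that is regressed out, which shrinks the variance of the adjusted metric. The data below are simulated.

```python
# Sketch: variance reduction with pre-experiment data (simplified, in the
# spirit of Deng et al. 2013). Data are simulated; the adjusted metric has
# lower variance, so fewer users are needed for the same statistical power.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
pre = rng.normal(10, 3, n)                  # metric in the pre-test period
during = 0.8 * pre + rng.normal(2, 1, n)    # same users during the test

theta = np.cov(during, pre)[0, 1] / np.var(pre)
adjusted = during - theta * (pre - pre.mean())

print("variance before adjustment:", round(during.var(), 2))
print("variance after adjustment: ", round(adjusted.var(), 2))
```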
Summarizing many of the suggestions and guidelines published within the scope of the Microsoft-related papers of the last couple of years, the forthcoming paper by Kohavi et al. provides extensive insight into long-term testing experience. All findings are documented with tests conducted at Microsoft and by other testing experts, offering additional cross-references for the whole internet-based industry.
2.5 Possible Pitfalls and Technical Limitations of A/B Tests
Besides identifying areas of improvement, the executor of a controlled online experiment also has to evaluate parts of the test that did not perform as planned beforehand. A rather small number of papers illustrate problems that arose during and after running online tests. Crook et al. (2009, pp. 1105-1114) describe several difficulties that occurred during testing periods at Microsoft. For instance, the authors caution against a design of the OEC that only focuses on “beating” the Control and showing a significant difference. Such goals mostly ignore the long-term value of the variant or the final effect on generated revenue. Moreover, the disregard of robots in online experiment environments is another focus of the paper. Robots such as automated search programs or tools that act like a virtual human being generate traffic on the internet. The authors advise excluding traffic generated by robots from experiments that focus on human users; otherwise the overall test results can be very misleading. Additionally, the work covers statistical and metric-related difficulties as well as pitfalls arising when preceding A/A tests and Control variants are neglected.
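A very simple sketch of how robot traffic might be filtered out of the raw log before analysis is shown below; the user-agent keywords are illustrative assumptions, and real bot detection is considerably more sophisticated.

```python
# Sketch: removing obvious robot traffic from the raw log before analysis.
# The keyword list is an illustrative assumption; real bot detection is
# considerably more involved.
BOT_KEYWORDS = ("bot", "crawler", "spider")

def is_robot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(keyword in ua for keyword in BOT_KEYWORDS)

raw_log = [
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0)", "variant": "control"},
    {"user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)", "variant": "treatment"},
]

human_log = [r for r in raw_log if not is_robot(r["user_agent"])]
print(len(human_log), "of", len(raw_log), "records kept for analysis")
```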
A common mistake made in the analysis of experiment results is neglecting cannibalization effects. For positive test results, Kohavi et al. (2014, p. 7) advise examining whether the increase in the OEC occurred in addition to, or at the expense of, other elements on the website. The authors experienced regular success for local improvements, but global improvements for the website as a whole were generally harder to achieve.
Another possible pitfall is the usage of browser redirects (Kohavi et al. 2009, p. 158). During an A/B test, many companies use the approach that users allocated to the Treatment variant are redirected to a modified page, which is not the same as the one displayed for the Control variant. Kohavi and Longbotham (2010, pp. 31-32) claim that experiments using redirects underperform when tested in an A/A trial. Reasons for this can be performance differences caused by the delay for the Treatment group, bots that may break the redirection process, or users who interfere with the experiment by bookmarking the link to the Treatment website. Therefore, the authors suggest avoiding redirects and recommend a server-side mechanism as previously introduced. The cited paper also characterizes additional pitfalls such as poor exposure control, technical differences between internet browsers and Simpson’s Paradox (2010, pp. 33-34). This paradox states that a trend that appears in each of two groups separately can disappear or even reverse when the groups are combined (Malinas and Bigelow 2008). Therefore, the combined test results of two testing days can contradict the results of each individual day.
Besides possible pitfalls, controlled online experiments have general technical
limitations that need to be considered. According to Nielsen (2005), tests provide the answer
to which variant performs better but not the reason for the favouritism. For example, a test
shows that a certain design is highly favoured by the testing participants. The executor knows
which design should be implemented but not the reason why users preferred the chosen
design compared to others.
Several authors criticise that online experiments only run over a rather short period of
time. Therefore, the whole experiment and OEC are short-term focused and do not include
possible long-term consequences (Nielsen 2005; Quarto-vonTivadar 2006). Kohavi, Henne,
and Sommerfield (2007, p. 963) disagree, stating that long-term considerations can be incorporated by choosing the right OEC.
When a new feature or design is introduced on a website, a newness effect can set in and users need more time to navigate the site. This influences the performance of the Treatment
variant compared to the Control variant and needs to be considered when measuring for
example the time spent on the website or the clicks per minute (Kohavi et al. 2009, p. 158).
Criticizing the relatively large sample size needed for a feasible A/B test, one paper proposes an alternative. According to Burk (2006, pp. 1-2), using Control Charts instead of t-tests reduces the required sample size and delivers superior and well-comprehensible results. In a Control Chart, the Treatment variant is not tested against the Control variant all the time. After a certain period of Control-only testing, the Treatment is activated and the difference in user behaviour is measured. The author claims that, without adding complexity, Control Charts are very flexible and can be extended to testing multiple factors, thereby overcoming some limitations of A/B testing. As this is the only paper declaring Control Charts the superior alternative, this opinion should rather be seen as an outlier position.
Despite several limitations and difficulties, the majority of authors endorse A/B testing as the most efficient and most beneficial method for experimenting (Kohavi et al. 2013, p. 1175; Kirby and Stewart 2007, p. 81; Tang et al. 2010, pp. 17-18). As a comprehensive overview of Chapter Two, Figure 1 shows the central processes and factors regarding controlled online experiments in an interactive structure (Insert Figure 1 about here). It illustrates the experimenting process as a multi-layered conglomerate, consisting of an essential foundation and the test analysis at the top. The displayed input factors are capable of altering the test structure before, during and even after the implementation.
3. Outcomes and Application
With the rise of the internet as a highly-competitive marketplace, appealing website designs
with a high degree of usability become more and more important. Particularly companies
earning their money with online services need to focus on a high rate of client satisfaction with their websites. Therefore, especially web-related firms like Amazon, Microsoft or
Yahoo conduct controlled online experiments to spot customer wishes and foster innovation.
Kohavi et al. (2013, p. 1168) assess running online experiments as very useful, especially in
combination with agile development processes, for example in start-ups. This chapter
provides examples of experiments from different areas of business and research and presents
the outcomes and usefulness of these tests.
3.1 Application in the Software and Online Service Industry
A straightforward example of an A/B test was conducted at Microsoft for the Office Online website. Crook et al. (2009, p. 1107) tested the existing design against a Treatment variant after agreeing on the OEC “clicks on revenue generating links”. These links can be described as areas that have a certain probability of leading to a sale of a Microsoft Office product. The test results showed that the Treatment design had 64% fewer clicks on the revenue generating links, behaving completely contrary to what the design team had expected. This represents a good learning experience for the executors, with end users acting totally differently than predicted. The experiment also revealed serious flaws in the OEC. The Treatment displayed the price of the Office product on the website, leading to a higher overall revenue. Nevertheless, the OEC was still lower, suggesting that the Treatment would be less beneficial for the company than the old Control variant (Crook et al. 2009, p. 1107).
Deng et al. (2013, p. 123) depict an experiment at MSN Real Estate that tested six different design versions of a “Find a Home” search box. Users were randomly split between the Control and the five Treatment versions. The company established “visits to linked partner sites” as the main OEC. The winning design of the experiment increased the revenue from transfer fees for MSN by nearly 10% compared to the Control.
In an interview with Harvard Business Review, Amazon founder Jeff Bezos describes
the online testing processes inside the company’s Web Lab as a very important part of the
firm (Kirby and Stewart 2007, p. 81). Besides constantly monitoring customer responses to conducted experiments, the laboratory also researches the cheapest possible way to implement those tests. Bezos illustrates several examples of helpful online experiments. In one case, the company wanted to implement a feature showing users another Amazon customer with a very similar purchase history. The management was sure that their customers
would like the new feature of the website but an A/B test showed no significant difference in
user behaviour. In the end not many customers wanted to use the new feature (Kirby and
Stewart 2007, p. 81).
Additional examples of implemented online experiments can be found for various
companies: Amatriain and Basilico (2012) explain how the recommendation system at Netflix
is based on A/B testing, and McKinley (2012) portrays the A/B analyser implemented at Etsy, a software tool that displays important financial metrics for each conducted A/B test.
3.2 Technology Provider of A/B Tests
In recent years, the market leaders in the internet business established software and tools specialized for testing on their own websites. As mentioned above, even a simple Control/Treatment test requires detailed technical know-how for the implementation. Through the establishment of, for instance, the Bing Experimentation System, which performs online experiments at Microsoft’s search engine website, innovation can be accelerated and annual revenues increased substantially (Kohavi et al. 2013, p. 1170). However, the number of small and middle-sized firms offering services or selling products on the internet is large. From their point of view, controlled online experiments are often too expensive to implement independently. Hence, a strong demand for customized and adequately priced experimentation tools emerged. Occupying this new business area, a large number of companies providing A/B testing technology were founded in the last 4-5 years (Walker 2013).
Villano (2013, p. 74) describes that every experiment on the web needs an IT expert to develop the required code and software. Optimizely, the current market leader in providing A/B testing technology, started to undertake these tasks for mostly small and midsize customers. Today, multinational companies make use of the fast and easy testing possibilities, too. In the interview, Dan Siroker, the founder of Optimizely, defines his company’s core competency as a pop-up editing tool that each customer can adapt to his or her needs. Different variants can be tested without altering the underlying structure of the website. Siroker characterizes this ability, along with the time savings, as one of the main benefits of using the professional tool (Villano 2013, p. 74).
A direct competitor is Google Website Optimizer, which Becker (2008, p. 24) describes as free of charge for all Google advertisers and cooperative retailers. The software and internet company has a strong interest in continually improving its online experiment tools. Therefore, it made the optimization tools available to the public and open to third-party plug-ins and modifications. As a result, Google can foster steady innovation from the inside as well as from the outside. Additionally, the tool enables Multivariable testing and is currently one of the most advanced online experimentation optimizers available (Becker 2008, p. 24).
The internet service company BuiltWith, which monitors the development of the A/B testing market as a whole, records rapid growth in the A/B testing market in the last couple of years (Walker 2013). This indicates an increasing number of companies and websites using such trials to test new designs and functions. Figure 2 shows the distribution of A/B testing tools used worldwide at the end of May 2014 (Insert Figure 2 about here). Optimizely, the market leader, captures more than one fourth of the whole market. The figure also shows that there are numerous providers of testing technology at the moment. Walker (2013), author of an article about recent developments in the A/B testing market on builtwith.com, anticipates steady growth rates in the near future as well. However, in his opinion, not all testing providers will prevail in the long run, as he expects impending market saturation.
3.3 A/B Testing For Research Purposes
Besides companies seeking increases in revenue and attention, controlled online experiments are also conducted to learn more about general consumer behaviour on the internet. One insightful test revolves around the two website design variables “visual complexity” and “source of interactivity control”. The online experiment revealed that users can be clustered into two groups: the “seeker” group performs goal-directed search processes on the internet and prefers a consumer-controlled interactive environment and a simple design. The “surfer” group, on the other hand, performs more experimental search processes and therefore prefers marketer-controlled interactivity and a complex visual design (Stanaland and Tan 2010, pp. 569-571). In this case, one A/B test is not sufficient. The users did not decide for or against one new design; the main purpose was to collect information about who preferred which design, draw conclusions from this information and cluster the test participants into different groups.
Another online experiment was used to evaluate the differences in usage of continuous
and discrete rating scales when questioning survey participants about the level of their
happiness. The executor established “new analysis possibilities on data quality” and “insights
in the distribution of happiness scores through the trial” as the most important OECs. As a
result, the test showed that while data quality remained on the same level, additional
information could be gained through continuous rating scales. The experiment showed significant differences between male and female participants, therefore gender-specific question design effects could be identified. With the help of the experiment, the executor recommends the use of continuous rating scales when questioning people about their happiness (Studer 2012, pp. 317, 336).
Summarizing, online experiments are already in use for general research purposes, but
compared to corporate testing, the area is underdeveloped. Possible reasons for this condition
will be discussed in the subsequent section.
4. Discussion
The final chapter summarizes the main findings of the existing literature and reviews them. Advice is given for managers in the online business, as well as on the possibilities that controlled online experiments open up for the whole industry. Finally, current limitations of testing on the internet are outlined and areas of future research are proposed.
4.1 Managerial Implications and Guidelines for the Industry
Controlled online experiments provide valuable insights into the minds and desires of the participants. For most companies, succeeding on the internet means extensive customization. However, merely being able to estimate customer behaviour correctly does not automatically lead to increasing revenue and high customer satisfaction rates. Several papers describe the importance of deliberate and committed management.
Focusing on long-term value generation for the company, responsible managers should agree on a suitable OEC upfront (Crook et al. 2009, p. 1107). As described, a good OEC should measure the value of the new features for the business objectively and efficiently. By coming to a decision before the experiment is run, the corporation forces itself to weigh the values of various inputs and decide on their relative importance. Kohavi (2012, p. 20) recommends assigning a lifetime value to users and their associated actions. Therefore, the upfront work aligns the organisation and clarifies the goals targeted in the future.
Furthermore, controlled online experiments can simplify the decision-making process
as a whole. Different employees may support different design or application improvements.
From a corporate point of view, every design may have its advantages. Experiments can
function as a decision maker between two or more parties supporting distinct approaches.
Disagreements are time-intensive and burdensome for a positive atmosphere. By asking the customers for their opinion, experiments can settle possible disputes regarding pending changes. Therefore, online tests can contribute to efficient decision-making and foster a harmonious work environment. Jeff Bezos, the CEO of Amazon, claims that “it’s important to be stubborn on the vision and flexible on the details” (Kirby and Stewart 2007, p. 81). In his opinion, a company has to decide on the broad, long-term goals it wants to achieve. However, the exact route toward those objectives is developed together with the customers. Regalado (2014, p. 62) reports that customers tend to choose simple and effective schemes, thereby overlooking sophisticated design approaches. In terms of competition, this can cause problems for traditional media companies, where editors and designers may still be biased toward the superiority of their own opinion. The author concludes that online testing is capable of reshaping the entire appearance of the internet.
Focusing on the user’s point of view, Kohavi et al. (2009, p. 176) caution against the implementation of features that did not result in a statistically significant difference. The authors dismiss the assumption that such features will not have any effect. In their opinion, it is still possible that the feature will negatively impact the user experience even though the test did not show perceptible alterations. Another paper describes understanding invalid test results, rather than discarding them, as the real challenge (Kohavi et al. 2012, p. 793). This may be the case with underpowered experiments, which fail to reveal possible pitfalls and actual consequences in a reliable way.
Almost every organisation operating in the software and internet industry has a distinct business culture. Particularly for software and internet companies such as Google or Apple, strong internal cohesion and a proud external appearance of their employees constitute an important component of the overall strategy. Kohavi, Henne and Sommerfield (2007, p. 966) criticize that many organizations have developed a business culture in which features are fully designed before the actual implementation. The authors recommend altering that approach toward a direct integration of customer feedback through prototypes and experimentation, thereby changing the culture to a “data-driven” approach.
Another recommendation for companies conducting online tests is that they have to be careful when applying offline testing schemes to online experimentation. Corporations testing their product features or designs offline should not simply apply those schemes when designing a new website and testing different variants. Carryover effects and incongruous confidence intervals can be possible pitfalls when the testing procedures are copied (Kohavi et al. 2012, p. 793). The online business follows different rules; therefore, the testing approach has to be adjusted accordingly.
4.2 Limitations and Future Research
Controlled Online Experiments are a young and developing field of research. As with many
new business areas, cost-effectiveness considerations play a major part in the design phase
and can be a powerful limitation. Implementing tests is expensive; therefore, even large corporations like Amazon focus on reducing the costs of experimentation. Consequently, experiments can only be performed with sustainable funding. Through third-party suppliers of testing programs, the general costs of simple A/B testing have decreased, but widespread experimentation processes are still costly (Kirby and Stewart 2007, p. 81).
A further possible limitation, but also an opportunity for future research, is the way users access the internet: browsing is becoming increasingly multifaceted. Nearly 30% of today’s internet users access the internet from a mobile device, for example a tablet or smartphone (Hawthorne 2013, p. 78). For this group of users, providing a smooth involvement in online experiments can hold many pitfalls. The author criticises that many websites still are not optimized for mobile use. In that case, buttons may be displaced or colour schemes distorted. In his opinion, this can affect the user experience and therefore also up to one third of the testing results.
Currently, the biggest visible limitation of online controlled experiments is the scarcity of publicly available information. Professional papers describing practical experiences with online experiments are hard to find. Especially in the leading marketing journals, there has been no explicit publication about the structure and implementation of controlled online experiments yet. One big stream of literature is available from the Microsoft employees who founded the Experimentation Platform and publish papers regarding experimentation at Microsoft and connected entities. On the one hand, this information is very useful because it illustrates, in numerous papers, the whole technical and statistical process of implementing experiments in detail. On the other hand, the papers are targeted at Microsoft’s testing processes and are therefore subjective and relevant only for a part of the potential test users. Especially small and medium-sized firms with fewer financial resources cannot implement testing processes the way global software producers such as Microsoft or Google do.
The scarcity has three reasons: Firstly, online testing is a relatively new method and business area. Not long ago, websites were updated the way the responsible designer or software developer preferred. Relying on user opinions became important only in recent years. Secondly, there are still many companies that feel confident they do not need online experiments. In this case, the managers and designers of websites assume that there is no need to evaluate their perfected ideas. By staying in this state of hubris, they decline the chance of a reliable examination of their work and forfeit a contribution to the online experimentation market (Kohavi 2012, pp. 25-26). Thirdly, many firms conducting online experiments are reluctant to share their insights and findings. The smooth and meaningful implementation of their own testing processes can be valuable know-how that companies use as a competitive advantage against international competitors. Therefore, most of the global software and internet companies keep their knowledge private. This circumstance will not change in the future as long as competition between the companies remains strong.
Legitimately, Kohavi et al. (2014, p. 5) also raise concerns about the quality of conducted research. The authors propose filtering out low-quality testing processes: papers and descriptions of test results should be peer-reviewed to ensure proper implementation. Furthermore, some testing outcomes cannot be transferred to other products or websites because they are very specific. In their opinion, outcomes should be checked for possible misinterpretation of the results due to incorrect assumptions about the underlying reasons. In their paper, the authors provide several examples of tests lacking one or more of the described qualities.
However, with a focus on high-quality work, the whole market could benefit from an open and organized research pool for online experiments. Funding and encouraging skilled scientists to do further research on how experiments can be enhanced and implementation costs lowered could lead to further innovation and improvement. Hence, start-ups and global players alike could benefit from the gained knowledge. Possible synergy effects could act as a stimulus for further research on experiments. Evidently, there is still much potential for research and improvement.
5. Conclusion
Controlled Online Experiments are undoubtedly one of the key measures to understand one’s
customers better. After a period in which design and functionality were determined by the developers themselves, experiments ousted the hubris that a company can best estimate what a user desires from its website. Nowadays, controlled online experiments are used to create a better and more effective internet presence and therefore generate more value and revenue for the implementing company. The literature review concluded that a technical and
statistical standard is needed to receive significant and relevant results in testing. Not every
experiment ends in a meaningful result or has the needed foundation for analysis purposes.
The pioneers of experimentation on the web had to experience arising pitfalls as well as
achievable improvements to set the standard for testing as it is used today. Through the
establishment of basic A/B testing platforms, access to fundamental online testing is enabled
for start-ups and middle-sized firms as well. Nevertheless, the whole process still needs a lot
of advancement and improved accessibility to become the norm when changing websites. The
literature research showed how limited available information about online experiments is and
that companies prefer to keep know-how in that area locked up. Top-quality journals have yet to publish papers on online experimentation, probably because they do not receive submissions on the topic from established researchers. Therefore, more information and insights should be published in the future to allow others an easy and effective adaptation of controlled online experiments according to their needs.
Figure 3 summarizes the main findings of this literature review and illustrates the present problems (Insert Figure 3 about here). The figure can also be seen as a form of conclusion of the literature research conducted within the framework of this Bachelor’s thesis. Removing the illustrated obstacles will be the most significant challenge for researchers and associated companies. Consequently, the experimentation process has to be attuned to contemporary standards and trends. Otherwise, the high levels of customer agreement and user responsiveness companies strive for will not be maintained in the long run. With these factors in mind, controlled online experiments can be one major growth area for firms in the future, representing a worthwhile generator of revenue and better satisfaction rates.
Table 1: Advantages and Disadvantages of all Assignment Methods
Adapted from "Controlled Experiments on the Web: Survey and Practical Guide" (Kohavi et al. 2009, p. 171)

                                            Traffic      Page         Client-side        Server-side
                                            Splitting    Rewriting    Assignment         Assignment
Changes in website code                     No           No           Yes (moderately)   Yes (highly)
Implementation cost of first experiment     ++/+++       ++           ++                 +++
Implementation cost of subsequent
experiments                                 ++/+++       ++           ++                 +/++
Hardware cost (server)                      +++          ++/+++       +                  +
Flexibility (render time)                   +++          ++           +                  ++++
Negative impact on test performance
for user                                    +            +++          +++                +

Signs and symbols: + = low, ++ = moderate, +++ = high, ++++ = very high
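As a purely illustrative sketch of the server-side assignment method compared in Table 1 (the experiment name, user identifiers and the 50/50 split below are assumed values, not taken from any specific testing platform), a user could be bucketed deterministically into Control or Treatment by hashing a stable user ID, so that returning visitors always see the same variant:

    import hashlib

    # Illustrative server-side assignment: hashing a stable user ID buckets each
    # user deterministically and approximately uniformly into one of the variants.
    def assign_variant(user_id, experiment="example-experiment", treatment_share=0.5):
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = (int(digest, 16) % 10_000) / 10_000  # map the hash to [0, 1)
        return "Treatment" if bucket < treatment_share else "Control"

    if __name__ == "__main__":
        for uid in ["user-001", "user-002", "user-003"]:
            print(uid, "->", assign_variant(uid))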
Figure 1: Structural Overview of Controlled Online Experiments
Contains terms and definitions from several authors (Deng et al. 2013; Kohavi et al. 2009; Kohavi et al. 2012; Kohavi, Henne, and Sommerfield 2007; Peterson 2004)
[Figure: flow diagram. Inputs (external information, internal testing know-how, financial resources) feed the experimentation structure, consisting of (1) goal setting (OECs), the statistical base frame (hypothesis testing, sample size), implementation (randomization, assignment of variants) and technical facilities (server, software, data path); (2) the testing process with the Control and Treatment variants; and (3) the analysis of test results, leading either to retention of the current design or to introduction of the changed design.]
Figure 2: A/B Testing Software Market: Distribution in May 2014
Adapted from BuiltWith.com (available at http://trends.builtwith.com/analytics/a-b-testing; select option "The Entire Internet" on the right)
[Pie chart; market shares expressed as a percentage: Optimizely 26.47, Other 26.39, Visual Website Optimizer 15.62, SiteSpect 14.62, Google Website Optimizer 10.16, Omniture Adobe Test and Target 6.74]
Figure 3: Sources, Trends and Existing Problems for Controlled Online Experiments
Contains terms and definitions from several authors (Kohavi et al. 2014; Kohavi et al. 2009; Nielsen 2005; Peterson 2004; Quarto-vonTivadar 2006; Tang et al. 2010)
[Figure: overview of the knowledge and research pool for controlled online experiments. Knowledge sources: testing technology providers, general research efforts, and the main literature stream (conference proceedings, scientific papers in marketing journals) alongside online articles and weblogs, the latter raising quality concerns (no peer evaluation). Testing approaches and alternatives: A/B testing, A/A testing, multivariable testing, segmented testing, control charts, qualitative user behaviour observation. Identified problems: (1) company-specific testing, (2) secrecy of know-how, (3) underdeveloped research area.]
Appendix
Appendix A: T-test and Minimum Sample Size
For running a successful and statistically meaningful experiment, several minimum requirements have to be fulfilled. The first is the t-test, also called a single-factor hypothesis test, which is used to determine the statistical significance of an A/B test (Kohavi et al. 2009, p. 152).
$t = \frac{\bar{O}_B - \bar{O}_A}{\hat{\sigma}_d}$
In the t-test formula, the numerator represents the difference between the estimated values of the OEC for the Treatment (B) and the Control (A). The denominator represents the estimated standard deviation of the difference between the two OECs. Kohavi et al. (2009, p. 152) emphasize that a threshold has to be established, which is based on the confidence level of the test, very often 95%. If t, the result of the t-test, exceeds the threshold in absolute value, the null hypothesis is rejected and the Treatment's OEC is statistically significantly different from the Control's OEC.
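As a minimal numerical sketch of this calculation (the OEC values and the estimated standard deviation of the difference below are assumed figures, not data from the cited studies), the t-statistic and the comparison against the threshold for a 95% confidence level could be computed as follows:

    # Illustrative t-test for an A/B test; all input numbers are assumed values.
    mean_control = 0.050    # estimated OEC of the Control variant (A), e.g. a conversion rate
    mean_treatment = 0.056  # estimated OEC of the Treatment variant (B)
    sigma_diff = 0.002      # estimated standard deviation of the difference of the two OECs

    t = (mean_treatment - mean_control) / sigma_diff
    threshold = 1.96        # two-sided threshold corresponding to a 95% confidence level

    print(f"t = {t:.2f}")
    if abs(t) > threshold:
        print("Null hypothesis rejected: the Treatment's OEC differs significantly from the Control's.")
    else:
        print("No statistically significant difference detected.")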
Moreover, the paper describes the effects of different alterations, for example how varying the sample size changes the sensitivity of the test (Kohavi et al. 2009, pp. 152-154). For the minimum sample size, Kohavi et al. (2007, p. 962; 2009, pp. 152-153) introduce two formulas.
$n = \frac{16\sigma^{2}}{\Delta^{2}} \qquad ; \qquad n = \left(\frac{4r\sigma}{\Delta}\right)^{2}$
In both formulas, n describes the number of users needed for each variant, σ² describes the variance of the OEC and Δ the amount of change that should be detected with the experiment. In the second formula, r describes the number of variants in the test. Both formulas can be applied, the second one being the more conservative approach.
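A minimal sketch of both sample size formulas (the variance of the OEC, the change to be detected and the number of variants below are assumed values, not numbers from the cited papers) could look as follows:

    # Illustrative minimum-sample-size calculation; all inputs are assumed values.
    variance = 0.0475  # sigma^2, estimated variance of the OEC (e.g. of a 5% conversion rate)
    delta = 0.005      # Delta, smallest change in the OEC the experiment should detect
    r = 2              # number of variants in the test (Control and Treatment)

    n_simple = 16 * variance / delta ** 2                     # n = 16 * sigma^2 / Delta^2
    n_conservative = (4 * r * variance ** 0.5 / delta) ** 2   # n = (4 * r * sigma / Delta)^2

    print(f"Users needed per variant (first formula):        {n_simple:,.0f}")
    print(f"Users needed per variant (conservative formula): {n_conservative:,.0f}")

With these assumed numbers, the first formula requires roughly 30,000 users per variant and the conservative formula roughly 120,000, which illustrates why detecting small changes in an OEC demands large samples.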
Appendix B: Literature Review Tables
Author/s (Year): Amatriain and Basilico (2012)
Title: Netflix Recommendations: Beyond the Stars (Part 2)
Journal/Source: Techblog Netflix (online source)
Theoretical Background:
• Gives insights into personalisation technology implemented at the Netflix website
• Examples are provided how innovation and research are conducted
Main Findings:
• Company uses parallel A/B tests after offline tests had significant results
• Automatic recommendation system on the website is based on A/B tests
Relevancy: ++ (Functions as example in 3.1)

Author/s (Year): Bansal, Buchbinder, and Naor (2012)
Title: Randomized Competitive Algorithms For Generalized Caching
Journal/Source: Society for Industrial and Applied Mathematics
Theoretical Background: Discusses which type of randomized algorithms are suitable for the generalized caching problem
Main Findings:
• Design of online rounding procedure that converts algorithm into randomized algorithm
• Provide framework for caching
Relevancy: + (2.3, explanation of PRNG)
Author/s (Year): Becker (2008)
Title: Best Bets for Site Tests
Journal/Source: Multichannel Merchant (article)
Theoretical Background: Describes the rise of A/B testing tools, focusing on the Google product Google Website Optimizer
Main Findings:
• Google's tool offers MVT and is open to third-party developers
• Therefore, continuous innovation is fostered
Relevancy: ++ (Functions as example in 3.2)

Author/s (Year): Bell (2008)
Title: Multivariable Testing: An Illustration
Journal/Source: Circulation Management (article)
Theoretical Background:
• Describes implementation process of multivariable testing
• Provides comparison with single A/B tests
Main Findings:
• MVT provides better and faster results than executing several A/B tests
• Sample size doesn't have to be increased for MVT
Relevancy: +++
Author/s (Year): Burk (2006)
Title: A Better Statistical Method for A/B Testing in Marketing Campaigns
Journal/Source: Marketing Bulletin
Theoretical Background: Alternative to the usual A/B testing method presented
Main Findings: Testing by using Control Charts is more effective and has more beneficial effects than A/B testing
Relevancy: +++ (Used as counter statement in 2.5)

Author/s (Year): Crook, Frasca, Kohavi, and Longbotham (2009)
Title: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
Journal/Source: KDD '09 Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Theoretical Background:
• Focus on pitfalls the authors have experienced after running numerous experiments at Microsoft
• The pitfalls include a wide range of topics, such as assuming that common statistical formulas used to calculate standard deviation and statistical power can be applied, or ignoring robots in analysis
Main Findings:
• Flaws in the OEC can damage the whole test outcome
• Focus of tests has to be on human users only, otherwise the results can be very misleading
• OEC focus should be long-term and not short-term
Relevancy: +++
Author/s (Year): Deng, Li, and Guo (2014)
Title: Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation
Journal/Source: WWW '14 Proceedings of the 23rd international conference on World wide web
Theoretical Background:
• The authors propose a general methodology for combining the first screening stage data together with validation stage data for more sensitive hypothesis testing and more accurate point estimation of the treatment effect
• The method is widely applicable to existing online controlled experimentation systems
Main Findings:
• Using generalized step-down tests to adjust for multiple comparison can be applied to any A/B testing with relatively few treatments
• Bias correction methods are useful when a point estimate is important
Relevancy: ++
Author/s (Year): Deng, Xu, Kohavi, and Walker (2013)
Title: Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
Journal/Source: WSDM '13 Proceedings of the sixth ACM international conference on Web search and data mining
Theoretical Background: Proposal of an approach that utilizes data from the pre-experiment period to reduce metric variability and hence achieve better sensitivity
Main Findings: Technique is applicable to a wide variety of key business metrics, practical and easy to implement
Relevancy: ++

Author/s (Year): Eisenberg (2005)
Title: How to Improve A/B Testing
Journal/Source: Clickz.com (online source)
Theoretical Background: Provides practical examples and experiences of the author to get better results from A/B testing
Main Findings:
• A/B testing is more difficult and complex than it appears to be
• Only one change should be implemented at a time
• Small changes can lead to big differences
Relevancy: +++
Author/s (Year): Hawthorne (2013)
Title: 4 Great Ways to Ramp up Your Web Testing
Journal/Source: Response Magazine (article)
Theoretical Background: Resulting from the difference between offline and online testing, the author provides recommendations for starting online experiments successfully
Main Findings:
• A/B testing should always be the basis
• If performed smoothly, multivariate testing can be added
• Most obvious objects should be tested first
• Include users that browse from mobile devices
Relevancy: +++

Author/s (Year): Hernández-Campos, Jeffay, and Donelson Smith (2003)
Title: Tracking the Evolution of Web Traffic: 1995-2003
Journal/Source: Proceedings of the 11th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Theoretical Background: Provides data suitable for the construction of synthetic web traffic generators and in doing so retrospectively examines the evolution of web traffic
Main Findings: Web traffic has been the dominant traffic type on the Internet, representing more bytes, packets, and flows on the Internet than any other single traffic class
Relevancy: + (Mainly used for definition of web traffic)
Author/s (Year): Kirby and Stewart (2007)
Title: The Institutional Yes
Journal/Source: Harvard Business Review
Theoretical Background: Interview with Jeff Bezos, the founder of amazon.com, about innovation inside the company and how continuous testing helps the company to understand their customers better
Main Findings:
• There are big differences between what the company thinks will be successful and what the customers actually like and use
• Customer focus can foster innovation and improvement of the company from the outside
• Main focus of the company is to lower the overall costs of experimenting
• Firms should agree on a long-term goal but be flexible in terms of how to get there specifically
Relevancy: +++
Author/s (Year): Kohavi (2012)
Title: Online Controlled Experiments: Introduction, Learnings, and Humbling Statistics
Journal/Source: RecSys '12 (presentation slides)
Theoretical Background:
• Introduces controlled online experiments, shows examples and shares advantages (MSN, Microsoft Office)
• Describes cultural challenges regarding online experiments
Main Findings:
• Twyman's Law: every figure that looks interesting or different is usually wrong
• Best practices include A/A tests, ramp up, and large user samples
• Cultural challenges have to be evolved in a four-step process
Relevancy: +++
Author/s (Year): Kohavi and Longbotham (2010)
Title: Unexpected Results in Online Controlled Experiments
Journal/Source: ACM SIGKDD Explorations Newsletter
Theoretical Background:
• Real examples of unexpected results and lessons learned regarding controlled online experiments at Microsoft
• Authors explain the reasons for wrong forecasting and share resulting lessons
Main Findings:
• Anomalies are expensive to investigate, but the authors found that some lead to critical insights that have long-term impact
• Often, incorrect results are due to subtle errors which are hard to anticipate or detect
• Recommend frequent A/A testing to increase quality
• Often, unexpected results stem from a lack of understanding of user behaviour
Relevancy: +++
Author/s (Year): Kohavi, Deng, Frasca, Walker, Xu, and Pohlmann (2013)
Title: Online Controlled Experiments at Large Scale
Journal/Source: KDD '13 Proceedings of the 19th ACM SIGKDD international conference on Knowledge Discovery and Data Mining
Theoretical Background:
• Addresses three areas that are challenged when implementing experiments at large scale: cultural/organizational, engineering and trustworthiness
• Focuses on the number of experiments: how organizations could evaluate more hypotheses, increasing the velocity of validated learning
Main Findings:
• Negative experiments, which degrade the user experience short-term, should be run given the learning value and long-term benefits
• Due to the fact that there are many live variants of one site, alerts should be used to identify issues rather than relying on heavy up-front testing
• High occurrence of false positives when running experiments at large scale
Relevancy: +++
Author/s (Year): Kohavi, Deng, Frasca, Longbotham, Walker, and Xu (2012)
Title: Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
Journal/Source: KDD '12 Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Theoretical Background:
• Explains five outcomes about the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects
• Helps readers to increase the trustworthiness of results coming out of controlled experiments
Main Findings:
• Instrumentation is not as precise as it is expected to be: there are interactions with experiments in subtle ways
• Lessons learned from offline experiments don't always function well online because of carryover effects and confidence intervals that do not shrink
Relevancy: +++
Author/s (Year): Kohavi, Deng, Longbotham, and Xu (2014)
Title: Seven Rules of Thumb for Web Site Experimenters
Journal/Source: To appear in KDD 2014 (forthcoming)
Theoretical Background:
• Provides principles for experiment implementation with broad applicability in web optimization and analysis
• Two goals: guide experimenters in terms of optimizing and provide the community with new research challenges
Main Findings:
• Small changes can lead to high differences in return on investment
• It's rather rare to get big changes in key metrics
• Speed is a main factor of a successful experiment implementation
• Complex designs overburden users
• The more users involved in testing, the better
Relevancy: +++
Author/s (Year): Kohavi, Henne, and Sommerfield (2007)
Title: Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO
Journal/Source: KDD '07 Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Theoretical Background:
• Practical guide to conducting online experiments, especially where end-users can help developing possible features
• Provides examples of controlled experiments and discusses their technical and organizational limitations
• Introduces important testing areas: statistical power, sample size and techniques for variance reduction
• Evaluates randomization and caching techniques
• Authors share key lessons helping practitioners in running trustworthy controlled experiments, because it's the customers' experience which is important
Main Findings:
• Controlled experiments neutralize confounding variables by distributing them equally over all values through random assignment
• Establishment of a causal relationship between changes made in different variants and measures of interest (OEC)
• Success depends on the users' perception and not on the Highest Paid Person's Opinion (HiPPO)
• Companies can accelerate innovation through experimentation
Relevancy: +++
Author/s (Year): Kohavi, Longbotham, Sommerfield, and Henne (2009)
Title: Controlled Experiments on the Web: Survey and Practical Guide
Journal/Source: Data Mining and Knowledge Discovery
Theoretical Background:
• Extended follow-up paper to "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO"
• Added section 4, which presents multivariable tests in an online setting
• Provides additional practical examples
• Updated formulas used for sampling size and hypothesis testing
Main Findings:
• Same findings as the underlying paper, extended formulations in some places
• Multivariable testing can serve as an alternative to several A/B tests, but there are also accompanying disadvantages such as a poor user experience, a more difficult analysis and interpretation, and a longer preparation time for a test
• Before conducting MVT, an A/B test should always be performed beforehand
Relevancy: +++
Author/s (Year): Linden (2006)
Title: Make Data Useful
Journal/Source: Stanford Data Mining 2006 (presentation slides)
Theoretical Background:
• Testing algorithms and techniques used at Amazon
• Reasons for A/B testing and how to improve them
Main Findings:
• Speed of testing is very important
• Bias should be toward new items and recent history/mission
• Expectations for customers should be set
Relevancy: +++

Author/s (Year): Malinas and Bigelow (2008)
Title: Simpson's Paradox
Journal/Source: Stanford Encyclopaedia of Philosophy (online source)
Theoretical Background: Used for definition and explanation of the Simpson's Paradox
Main Findings: Paradox states that two groups with the same effect don't need to have the same effect if combined to one group
Relevancy: ++
Author/s (Year): McKinley (2012)
Title: Design for Continuous Experimentation: Talk and Slides
Journal/Source: mcfunley.com (online source)
Theoretical Background:
• Describes the testing procedure implemented on the Etsy.com website
• Description of the A/B analyser, which automatically generates a dashboard with relevant business metrics for each performed test
Main Findings:
• Design and product process must change to accommodate experimentation
• Changing too many things at once will result in confusion and negative results
Relevancy: ++
Author/s (Year): Nielsen (2005)
Title: Putting A/B Testing in its Place
Journal/Source: nngroup.com (online resource)
Theoretical Background:
• Describes why measuring the live impact of design changes on key business metrics is valuable, but often creates a focus on short-term improvements
• Same issues remain for multivariable testing
Main Findings:
• A/B tests with OECs with near-term focus neglect bigger issues that only qualitative studies can find
• A/B testing shouldn't be the only method evaluating a project or change
Relevancy: + (Article may not be written from an objective/scientific point of view. Therefore it is only used in the thesis as an example that there are divergent opinions about A/B testing)
Author/s (Year): Peterson (2004)
Title: Web Analytics Demystified: A Marketer's Guide to Understanding How Your Web Site Affects Your Business
Journal/Source: Celilo Group Media and CafePress (book)
Theoretical Background:
• Creates awareness of what web analytics is (from a practical perspective) and what a web analytics program can and can't tell you
• Helps developing an understanding of which tools and statistics are useful to a web analytics program
• Establishes knowledge of which available statistics are most useful to your particular online business and how these statistics should be used
Main Findings: Ways to get meaningful and measurable results (A/B test):
• Only change one variable at a time
• Understand the process for diverting traffic
• Determine accurate measures of volume
• Analyse carefully
• Run a "Null Test"
• Run your test until you are sure the results are real
• Consider segmenting test subjects
Relevancy: +++
Author/s (Year): Quarto-vonTivadar (2006)
Title: A/B Testing: Too Little, Too Soon?
Journal/Source: Future Now, Inc. (article)
Theoretical Background:
• Evaluation of whether A/B testing delivers what it promises
• A/B testing is not as meaningful as the industry is assuming
Main Findings:
• A/B tests can be highly subjective
• Many A/B tests aren't implemented in the most efficient way
• A/B testing ignores variance
Relevancy: + (Article may not be written from an objective/scientific point of view. Therefore it is only used in the thesis as an example that there are divergent opinions about A/B testing)
Author/s (Year): Regalado (2014)
Title: Seeking Edge, Websites Turn to Experiments
Journal/Source: MIT Technology Review (article)
Theoretical Background:
• Illustrates several examples of companies conducting online experiments
• Main topic is how users influence today's web design
Main Findings:
• Nowadays, users heavily influence the design of a website
• Website users prefer usability over a good-looking design
• Provides examples of unexpected outcomes for companies
• Companies can't estimate reliably what their customers want
Relevancy: +++

Author/s (Year): Roy (2001)
Title: Design of Experiments using the Taguchi Approach: 16 Steps to Product and Process Improvement
Journal/Source: John Wiley & Sons, Inc. (book)
Theoretical Background: Understanding the Taguchi Method by providing examples and statistical explanations
Main Findings: Why using an Overall Evaluation Criterion is meaningful: it combines several objectives the company wants to achieve through the new feature
Relevancy: ++ (Mainly used for definition of OEC)
Author/s (Year): Stanaland and Tan (2010)
Title: The Impact of Surfer/Seeker Mode on the Effectiveness of Website Characteristics
Journal/Source: International Journal of Advertising
Theoretical Background:
• Examines how the effectiveness of two website designs depends on the consumer's purpose of visiting a website
• Therefore, an online experiment was conducted in Singapore
Main Findings: There are two groups of website visitors:
• Seekers: goal-directed, prefer a consumer-controlled interactive environment and a visually simple design
• Surfers: experiential, prefer marketer-controlled interactivity and a visually complex layout
Relevancy: ++
Author/s (Year): Studer (2012)
Title: Does it Matter How Happiness is Measured? Evidence from a Randomized Controlled Experiment
Journal/Source: Journal of Economic and Social Measurement
Theoretical Background:
• Randomized controlled experiment about the distribution of happiness scores
• A discrete single-item Likert scale is tested against a continuous rating scale (visual analogue scale)
Main Findings:
• Gender happiness inequality depends on gender-specific question design effects
• It is beneficial to use a continuous rating scale instead of the widespread discrete scale
Relevancy: ++
Author/s (Year): Tang, Agarwal, O'Brien, and Meyer (2010)
Title: Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
Journal/Source: KDD '10 Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
Theoretical Background:
• Describes overlapping experiments implemented at Google
• Discusses associated tools and educational processes to use this variant effectively
• Describes trends that show the success of this overall experimental environment
Main Findings: Benefits of overlapping experiments:
• More experiments can be conducted in the same period of time
• Experiments are better in quality
• Experiments themselves are faster in implementation and data gathering
Relevancy: +++
Author/s (Year): Villano (2013)
Title: You Can Go with This or You Can Go with That
Journal/Source: Entrepreneur (article)
Theoretical Background: Describes the rise of Optimizely, a web company that facilitates A/B testing for companies
Main Findings:
• It is very difficult and expensive to implement an A/B test on your own
• The market for testing programs is big
• A company that is specialized in this area should be able to provide quick and user-friendly programs
Relevancy: +++
Author/s (Year): Walker (2013)
Title: What's Happening in the A/B Testing Market
Journal/Source: Builtwith.com (online source)
Theoretical Background:
• Describes fundamentals of A/B testing
• Offers an overview of recent A/B testing program providers
Main Findings:
• The market for A/B testing programs is increasing and shows no signs of slowing down in the near future
• Probably not all programs will persist in the long run; there are signs of oversupply
• Evaluates the development of the most important tools in detail
Relevancy: +++

Author/s (Year): Walker, Kounavis, Gueron, and Graunke (2009)
Title: Recent Contributions to Cryptographic Hash Functions
Journal/Source: Intel Technology Journal
Theoretical Background:
• Provide background to understand what a hash function is and what problems it addresses
• Describe differences of two designs (Skein and Vortex)
Main Findings:
• Hash functions are basic building blocks that are central to cryptography's mission
• There has been a wave of new research into hash functions
Relevancy: + (Used for explanation of hash functions)
Author/s (Year): Weiss (1997)
Title: Evaluation: Methods for Studying Programs and Policies
Journal/Source: Prentice Hall (book)
Theoretical Background: Establishing causality through differences in the OEC (Control vs. Treatment)
Main Findings: Differences in the OEC are, if the experiment is designed and executed properly, the result of the assignment of the Control and Treatment variants
Relevancy: +++
Only for definition purposes:

Author/s (Year): Merriam Webster (2014)
Definition: "Null Hypothesis"
Journal/Source: m-w.com
Theoretical Background: Online encyclopaedia
Main Findings: "Null Hypothesis is a statistical hypothesis to be tested and accepted or rejected in favour of an alternative"
Relevancy: +

Author/s (Year): Merriam Webster (2014)
Definition: "Cookie"
Journal/Source: m-w.com
Theoretical Background: Online encyclopaedia
Main Findings: "A cookie is a file that may be added to your computer when you visit a Web site and that contains information about you"
Relevancy: +
References
Ali, Waleed, Siti Mariyam Shamsuddin, and Abdul Samad Ismail (2011), “A Survey of Web
Caching and Prefetching,” International Journal of Advances in Soft Computing and its
Applications, 3 (1), 18-44.
Amatriain, Xavier, and Justin Basilico (2012, June), “Netflix Recommendations: Beyond the
Stars (Part 2),” (accessed May 16, 2014), [available at
http://techblog.netflix.com/2012/06/netflix-recommendations-beyond-5-stars.html].
Bansal, Nikhil, Niv Buchbinder, and Joseph Naor (2012), “Randomized Competitive
Algorithms For Generalized Caching,” Society for Industrial and Applied Mathematics, 41
(2), 391–414.
Becker, Larry (2008), “Best Bets for Site Tests,” Multichannel Merchant, April edition, 24-25.
Bell, Gordon H. (2008), “Multivariable Testing: An Insight,” Circulation Management, May
edition, 16-18.
Burk, Scott (2006), “A Better Statistical Method for A/B Testing in Marketing Campaigns,”
Marketing Bulletin, 17, 1-8.
Crook, Thomas, Brian Frasca, Ron Kohavi, and Roger Longbotham (2009), “Seven Pitfalls to
Avoid when Running Controlled Experiments on the Web,” International Conference on
Knowledge Discovery and Data Mining (KDD), Paris (June 28-July 1), 1105-1114.
Deng, Alex, Tianxi Li, and Yu Guo (2014), “Statistical Inference in Two-Stage Online
Controlled Experiments with Treatment Selection and Validation,” 2014 International
World Wide Web Conference (WWW), Seoul (April 7-11), 609-618.
———, Ya Xu, Ron Kohavi, and Toby Walker (2013), “Improving the Sensitivity of Online
Controlled Experiments by Utilizing Pre-Experiment Data,” WSDM 2013, Rome
(February 4-8), 123-131.
Eisenberg, Bryan (2005, April) “How to Improve A/B Testing,” (accessed May 8, 2014),
[available at http://www.clickz.com/clickz/column/1717234/how-improve-a-b-testing].
Hawthorne, Timothy R. (2013), “4 Great Ways to Ramp Up Your Web Testing,” Response
Magazine, September edition, 78.
Hernández-Campos, Félix, Kevin Jeffay, and F. Donelson Smith (2003), “Tracking the
Evolution of Web Traffic: 1995-2003,” Proceedings of the 11th IEEE/ACM International
Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication
Systems (MASCOTS), Orlando (October), 16-25.
Kirby, Julia, and Thomas A. Stewart (2007), “The Institutional Yes,” Harvard Business Review,
October edition, 74-82.
Kohavi, Ron (2012), “Online Controlled Experiments: Introduction, Learnings and Humbling
Statistics,” RecSys’12, Dublin (September 9-13), 1-46.
——— and Roger Longbotham (2010), “Unexpected Results in Online Controlled
Experiments,” ACM SIGKDD Explorations Newsletter, 12 (2), 31-35.
———, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu (2012),
“Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained,” KDD
2012, Beijing (August 12-16), 786-794.
———, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann (2013), “Online
Controlled Experiments at Large Scale,” KDD 2013, Chicago (August 11-14), 1168-1176.
———, Alex Deng, Roger Longbotham, and Ya Xu (2014), “Seven Rules of Thumb for Web
Site Experimenters,” KDD 2014 (forthcoming), New York City (August 24-27), [available
at http://www.exp-platform.com/Documents/2014%20experimentersRulesOfThumb.pdf].
———, Randal M. Henne, and Dan Sommerfield (2007), “Practical Guide to Controlled
Experiments on the Web: Listen to Your Customers not to the HiPPO,” Industrial and
Government Track Paper (KDD’07), San Jose (August 12-15), 959-967.
———, Roger Longbotham, Dan Sommerfield, and Randal M. Henne (2009), “Controlled
Experiments on the Web: Survey and Practical Guide,” Data Mining and Knowledge
Discovery, 18 (1), 140-181.
Linden, Greg (2006), “Make Data Useful,” presented at the 2006 Stanford Data Mining,
Stanford University (December 4).
Malinas, Gary, and John Bigelow (2009), “Simpson’s Paradox,” Stanford Encyclopedia of
Philosophy, (accessed May 20, 2014), [available at
http://plato.stanford.edu/entries/paradox-simpson/].
McKinley, Dan (2012), “Design for Continuous Experimentation: Talk and Slides,” (accessed
May 18, 2014), [available at http://mcfunley.com/design-for-continuous-experimentation].
Merriam-Webster (n.d.), “Null Hypothesis,” (accessed May 12, 2014), [available at
http://www.merriam-webster.com/dictionary/null%20hypothesis].
——— (n.d.), “Cookie,” (accessed May 12, 2014), [available at
http://www.merriam-webster.com/dictionary/cookie].
Nielsen, Jakob (2005), “Putting A/B Testing in its Place,” (accessed April 18, 2014),
[available at http://www.nngroup.com/articles/putting-ab-testing-in-its-place/].
Peterson, Eric T. (2004), Web Analytics Demystified: A Marketer's Guide to Understanding
How Your Web Site Affects Your Business, s.l.: Celilo Group Media and CafePress.
Quarto-vonTivadar, John (2006), “A/B Testing: Too Little, Too Soon?,” published for Future
Now Inc., New York City, 1-21.
Regalado, Antonio (2014), “Seeking Edge, Websites Turn to Experiments,” MIT Technology
Review, 117 (2), 62-63.
Roy, Ranjit K. (2001), Design of Experiments using the Taguchi Approach: 16 Steps to
Product and Process Improvement, s.l.: John Wiley & Sons, Inc.
Stanaland, Andrea J. S., and Juliana Tan (2010), “The Impact of Surfer/Seeker Mode on the
Effectiveness of Website Characteristics,” International Journal of Advertising, 29 (4),
569-595.
Studer, Raphael (2012), “Does it Matter How Happiness is Measured? Evidence from a
Randomized Controlled Experiment,” Journal of Economic and Social Measurement,
317-336.
Tang, Diane, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer (2010), “Overlapping
Experiment Infrastructure: More, Better, Faster Experimentation,” KDD'10, Washington
D.C. (July 25-28), 17-26.
Villano, Matt (2013), “You Can Go With This, or You Can Go With That,” Entrepreneur,
August edition, 74.
Walker, Chris (2013), “What's Happening in the A/B Testing Market,” (accessed May 23),
[available at https://blog.builtwith.com/2013/07/19/whats-happening-in-the-ab-testingmarket/].
Walker, Jesse, Michael Kounavis, Shay Gueron, and Gary Graunke (2009), “Recent
Contributions to Cryptographic Hash Functions,” Intel Technology Journal, 13 (2), 80-95.
Weiss, Carol H. (1997), Evaluation: Methods for Studying Programs and Policies, s.l.:
Prentice Hall.
Footnotes
1 “The hypothesis that an observed difference is due to chance alone and not due to a systematic cause” (Merriam Webster 2014).
2 Caching is conducted to accelerate the performance of the system by storing a set of the websites a user has visited (Bansal, Buchbinder, and Naor 2012, p. 391).
3 “A small file or part of a file stored on a World Wide Web user's computer, created and subsequently read by a Web site server, and containing personal information“ (Merriam Webster 2014).
4 “Hash functions are one of cryptography’s most fundamental building blocks, even more so than encryption functions. For example, hash functions are used for […] random number generation, as well as for digital signature schemes, stream ciphers, and random oracles” (Walker et al. 2009, p. 80).
5 Traffic describes the data sent and received between a client and a server, that is, between a website and the visitors of that site (Hernández-Campos, Jeffay, and Donelson Smith 2003, p. 16).
6 Experimentation Platform: http://www.exp-platform.com/Pages/default.aspx