Balancing Context Effects by Controlled Experiments in
Within-Subject Design
Optimizing Condition Counterbalancing
Tin de Zeeuw
University of Twente
Enschede, Netherlands
[email protected]
ABSTRACT
To minimize the effects that occur because of condition ordering in within-subject designs, we propose a formula based
on the work of Barbieri and Stinstra [1]. We implement this
formula and create a program that generates an individual ordering for each participant of a within-subject
experiment, such that all orderings together are balanced. This program uses a heuristic approach to overcome
some of the problems that arise from the (almost) exponential increase in the number of permutations of orderings. This
approach is successful: we show that the program can be
used to create multiple sets of orderings for at least 52 participants. With such sets of orderings, the context effects
should be minimized and negligible in the outcome of the
experiment.
Keywords
Usability tests, context effects, within-subject experiment
setup, heuristic solution.
1. INTRODUCTION
In every experiment involving people, there are the effects
we want to find, effects outside our control, and effects that
arise from the way the experiment is set up. We can for example set up an experiment to measure the effect the color
of light has on the perceivability of flicker. We put a participant in front of a lamp of a certain color and turn it on
and off very quickly. The frequency with which the lamp
turns on and off can be varied and when it is more than a
certain frequency, the participant will not see the flickering
anymore and instead perceive the light as continually on.
The variables we change to measure their effect are called
the independent variables (in our example, the frequency of
the flickering). We can then use several conditions (the different colors) in which a
specific frequency is presented to the participants of the test.
If we want to know whether the frequency of
flickering influences the perception of the light, we can
ask the participants whether they can see it flickering. The perception of the participants, whether they see the flickering,
is a dependent variable. The way the independent variables
are presented to the participants in an experiment can introduce several context effects that can influence the outcome.
One such effect could be that people get more tired after
looking at the lamp for a period of time, and will be less
likely to see the flickering after a while. If we then always
show the participants the colors in the same order, first green, then
red and finally blue light, the blue light will always have
a lower threshold frequency up to which the participant
notices the flicker. That is a context effect that can
occur because of the setup of the experiment. This is why,
according to Greenwald [5], most statisticians will opt to
use a between-subject design, rather than a within-subject
design. He also explains why this should not be the case,
as there are ways of guarding and repairing these effects.
In a between-subject design every participant participates
in one condition, which should only result in changes in
the dependent variables because of differences in how the
participants perform on the task. If the participants are
divided equally over the conditions, the outcome can only
be attributed to the different conditions. In our example,
changing it to a between-subject design would have the result that every participant will only get to see one color.
Context effects are more explicit in an experiment in which
each participant repeats the experimental task several times
under different conditions: the within-subject design [8] or
repeated measures test.
In his article, Greenwald describes several types of context effect. A practice effect occurs when the participant learns
from practicing the task and gets better at it each
time he completes it; all subsequent performances of the
same task will be influenced by this. A carry-over effect occurs when
the experience of the first task carries over some mood
or state of being to the second task and thus influences the
performance. Participants can also become sensitized to the variable that is changed, and try to figure out
how they should behave. An example of this would be a test
in which the participant needs to notice a light flickering. If
the flickering is altered at a specific interval, the participant
might notice the difference between two conditions and notice the interval. He can then try to look for the change and
behave differently than when only looking for a flickering
light. Another one is the effect mentioned earlier: participants might become fatigued after a while. The
main problem is thus that participants might change over
time, and this has to be taken into account.
The first two effects (practice and carry-over) can come from
the ordering in which the independent variable is changed
for a participant, and there are several ways to minimize their overall effect on the dependent variables. The effects will
always exist, but for a group of participants we can try to
smooth them out so they influence the results equally in
all conditions. That way they do not influence all results
systematically in one direction. The easiest way to do this
is to randomize the ordering, according to Kirk [6]. This will work
only if there are enough participants, and even then the resulting orderings can be skewed. An example of this would
be if most orderings start with one condition which might
mean that participants will always perform worse on that
condition because they could not have practiced the task
yet. Greenwald explains that counterbalancing is a better
way to minimize these effects. He describes the process of
one-on-one counterbalancing by assigning an opposite ordering of the values of the independent variable for each
included ordering. If the first participant is given the variables ordered in conditions a and then b, then the second
participant is offered the ordering b and then a.
Counterbalancing can also be done with larger groups of
orderings; one of the most used ways to do this is to use
Latin squares [2]. A Latin square is an n × n table filled with
n different conditions such that each condition occurs
exactly once in each row and each column, which ensures
that every participant receives a different ordering. In Table 1 an ordering for three
participants is shown, with three time slots for the three
different conditions. It can be seen that, for instance, condition a occurs at a different time for each participant. This
is also true for the other two conditions.
                 Time 1   Time 2   Time 3
Participant 1      a        c        b
Participant 2      c        b        a
Participant 3      b        a        c
Table 1: Latin square of 3x3
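As an illustration of the Latin square property, the following minimal Python sketch builds a square by cyclic rotation and checks the row and column condition. The function names `cyclic_latin_square` and `is_latin_square` are hypothetical, and cyclic rotation is only one of several ways to construct such a square; it yields a different (but equally valid) square than the one in Table 1.

```python
def cyclic_latin_square(conditions):
    """Return an n x n square in which row r is the condition list rotated left by r."""
    n = len(conditions)
    return [[conditions[(r + t) % n] for t in range(n)] for r in range(n)]


def is_latin_square(square):
    """Check that every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[t] for row in square} == symbols for t in range(n))
    return len(symbols) == n and rows_ok and cols_ok


square = cyclic_latin_square(["a", "b", "c"])
for participant, row in enumerate(square, start=1):
    print(f"Participant {participant}: {' '.join(row)}")

# The square of Table 1 passes the same check.
table_1 = [list("acb"), list("cba"), list("bac")]
print(is_latin_square(square), is_latin_square(table_1))
```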
This approach only works with an n × n design, but can be
extended for other numbers of participants, time slots or
numbers of conditions. If we look at such an extension, care
must be taken with the ordering of variables per participant
and in total. This is similar to what we do in the following
sections.
In their research, Barbieri and Stinstra [1] use a formula
derived from such requirements, to calculate how balanced
a set of orderings is. This formula is a good measure to use
in finding a well balanced set, and the value of this formula
also gives an indication how much work still has to be done
to get to an optimal value. The requirements they use are
meant to equalize the number of successions of the variables
to smooth the effects of specific combinations of conditions
when the participant encounters them in succession. We
use the work of Barbieri and Stinstra and adapt their ideas
slightly to describe a set of requirements to create such a
formula in Section 2. The basic parts for the formulae are
explained here.
They start with an item, which is a specific value of an independent variable or a combination of multiple independent
variables. A run is then a specific ordering of
items, each at a specific index i in the run. These runs can
be collected together in a set that satisfies certain
requirements.
As an example we will use in the rest of this article, we take
two independent variables with each 3 values:
A:   a   b   c
X:   x   y   z
We can combine the values of these two variables to create
9 items which we can number as 1 through 9 in Table 2 for
ease of notation.
       x   y   z
a      1   4   7
b      2   5   8
c      3   6   9
Table 2: All items formed from a 3 × 3 setup
The items we now specified need to be put in an ordering,
so each is included exactly once. This will ensure that each
combination of values for the two variables occurs (exactly
once). If we then make sure that no variable value repeats
itself, we have the requirements for individual runs.
Run Requirements:
1. Within one run each item occurs equally often.
2. Within one run each variable for the couple of
items at location i and i + 1 has a different value.
Requiring that no value repeats itself is a simple measure to make sure
we can compare conditions to each other. Participants are
protected from perceiving two consecutive conditions as the
same, and at any moment during the experiment we can
ask whether they preferred the current condition or the previous one.
An ordering that conforms to these two requirements is for
instance: “486192735”. This ordering is shown in Table 3
with the values of the variables it is made up of. Here it is
clear that no value repeats itself and each item, or combination of values, occurs exactly once.
Item ordering:   4  8  6  1  9  2  7  3  5
Index:           1  2  3  4  5  6  7  8  9
Variable A:      a  b  c  a  c  b  a  c  b
Variable X:      y  z  y  x  z  x  z  x  y
Table 3: Valid ordering with its variables
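The two run requirements can be checked mechanically. The sketch below is a minimal illustration, assuming the item numbering of Table 2 and assuming that every item occurs exactly once (as in the running example); the names `ITEMS` and `is_valid_run` are hypothetical.

```python
# Item numbering from Table 2: item number -> (value of A, value of X).
ITEMS = {
    1: ("a", "x"), 4: ("a", "y"), 7: ("a", "z"),
    2: ("b", "x"), 5: ("b", "y"), 8: ("b", "z"),
    3: ("c", "x"), 6: ("c", "y"), 9: ("c", "z"),
}


def is_valid_run(run, items=ITEMS):
    # Requirement 1 (as used in the example): every item occurs exactly once.
    if sorted(run) != sorted(items):
        return False
    # Requirement 2: items at index i and i + 1 differ in every variable.
    for prev, nxt in zip(run, run[1:]):
        if any(p == q for p, q in zip(items[prev], items[nxt])):
            return False
    return True


run = [int(ch) for ch in "486192735"]
print(is_valid_run(run))                          # True
print(is_valid_run([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # False: items 1 and 2 share x
```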
If we first look at only one independent variable, the first
requirement of Barbieri and Stinstra is the same as the one
Latin squares have: no value can occur more than
once. If we only look at one independent variable with several values, the second requirement is automatically satisfied because each value occurs only once. For more than
one repetition of a variable, the second requirement becomes
more interesting. It will, for instance, allow an ordering of
abacbc, which is a mix of two repetitions of three items
of one variable (with three values). This is also important
when we build up items with more than one variable, because the values of these different variables all need to be included equally
often and none may repeat itself, as can be seen in the example of Table 3. The number of times two values
follow each other is also a measure of balance: because
of the direct influence one condition can have on the next,
this is also taken into account in the requirements. For a set
of N runs, we follow the requirements set by Barbieri and
Stinstra exactly:
Set Requirements:
3. Each value of an independent variable occurs
equally often in location i in all runs.
4. Each value of an independent variable is preceded
by each other value of that variable equally
often in the set of runs.
We look for an implementation of a program that calculates a
set of orderings of the independent variables in such a way
that it minimizes the context effects in the data for a set
of experiments overall. Because calculating all possible balanced sets of orderings would increase the
time needed for calculations exponentially with each extra
ordering in the set, we look to a heuristic solution. A heuristic sacrifices thoroughness
to gain speed: it will usually generate an
acceptable solution, which by no means has to be an
optimal one [7]. A heuristic approach that yields an acceptable set of orderings in a
reasonable time span suits us better, because we do not need an
exhaustive solution; we only need a single balanced set to
be able to run an experiment.
2. BASIC FORMULA
To implement the formula as described by Barbieri and Stinstra, we follow the requirements set by them. These can be
implemented individually into a set of rules for the individual runs, and a formula for a set of runs combined.
For each run, the first two requirements can be checked
while systematically creating the runs. A simple recursive
program can add an item to the end of an ordering and
check if the requirements are not violated. If they are not,
another item can be added until a whole run is formed. If a
requirement is violated, then another option can be tried, or
if this is not possible and all options are exhausted, the previous item can be removed and retried. This way all options
for a single run, valid under the first two requirements, can
be generated systematically.
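The recursive generation described above can be sketched in Python as a backtracking generator. This is a minimal sketch rather than the program used in this work; the name `generate_runs` and its argument `values_per_variable` are hypothetical, and items are represented directly as tuples of variable values instead of the numbers 1 through 9.

```python
from itertools import product


def generate_runs(values_per_variable):
    """Yield every run that satisfies run requirements 1 and 2."""
    items = list(product(*values_per_variable))  # one item per value combination

    def differs_everywhere(p, q):
        # Requirement 2: consecutive items share no variable value.
        return all(a != b for a, b in zip(p, q))

    def extend(run, remaining):
        if not remaining:            # every item placed exactly once (requirement 1)
            yield tuple(run)
            return
        for item in sorted(remaining):
            if not run or differs_everywhere(run[-1], item):
                run.append(item)
                remaining.remove(item)
                yield from extend(run, remaining)
                remaining.add(item)  # backtrack: take the item out and try the next
                run.pop()

    yield from extend([], set(items))


runs = list(generate_runs(["abc", "xyz"]))
print(len(runs))  # number of valid runs for the 3 x 3 example
```

The printed count can be compared with the number of valid runs reported in Section 3.1. For a 2 × 2 setup the same function yields no runs at all, because each item then has only one item it may be followed by.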
For balancing a set of runs we can translate the requirements
into variables which we can use to calculate a measure
of balance, and specify a formula not unlike that of
Barbieri and Stinstra. In this formula we only take into
account the last two requirements and assume the first two
to be satisfied by the run generation.
3. $T_{in}(F)$: the number of times the value n of an
independent variable F is in location i.
4. $P_{vw}(F)$: the number of times a value v precedes
the value w of the same independent variable F.
We can implement this formula in our program in such a
way that it can calculate a value for an arbitrary number
of independent variables per item. Our program essentially
just counts the number of occurrences of the different independent variable values and then fills them into the formula.
If we look back at the example in Tables 2 and 3, we can now
count the number of occurrences for the values of variable A
(in Table 4) and for the values of variable X (in Table 5) in
accordance with the third and fourth requirement. Every count
is either 0 or 1, because a single run places exactly one value of each variable
at each index, and each value is used exactly three times in one run. If we
calculate these values for a set with more than one run, the
number of occurrences can rise. A set of three runs can
theoretically fill out all indices with exactly one occurrence
of each variable's value (the proof of this is left as an exercise
for the reader).
Ordering:   4  8  6  1  9  2  7  3  5
a           1  0  0  1  0  0  1  0  0
b           0  1  0  0  0  1  0  0  1
c           0  0  1  0  1  0  0  1  0
Table 4: Occurrences for variable A
Ordering:   4  8  6  1  9  2  7  3  5
x           0  0  0  1  0  1  0  1  0
y           1  0  1  0  0  0  0  0  1
z           0  1  0  0  1  0  1  0  0
Table 5: Occurrences for variable X
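Variable A (rows: value w; columns: preceding value v)
       a  b  c
a      0  1  1
b      1  0  2
c      2  1  0

Variable X (rows: value w; columns: preceding value v)
       x  y  z
x      0  1  2
y      1  0  1
z      2  1  0
Table 6: Preceding variables for A and X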
The preceding variables for this ordering can be found in
Table 6. The zero counts on the diagonal show that no value
precedes itself. Furthermore, it can be seen
that one run cannot be completely balanced with respect to
the preceding items. This is because, for one variable, only
6 ordered pairs of 2 different values can be made with
3 values (ab, ac, ba, bc, ca and cb), while a run of 9 items contains 8 couples
of values that precede each other, and 8 is not divisible by 6.
Our goal is to minimize the value of the formula V, which sums the differences in
$T_{in}$ and $P_{vw}$ over each independent variable $F_m$.
$$V = \sum_{m=1}^{n} \Bigl[ D(T_{in}(F_m)) + D(P_{vw}(F_m)) \Bigr] \qquad (1)$$
In this formula, D is the difference between the maximum
and minimum values in a matrix G.
$$D(G) = \max(G) - \min(G) \qquad (2)$$
For $T_{in}$, the D-values are calculated over all values in the
matrix. For the selection of the minimum and maximum
values in $P_{vw}$, the program must take into account that the
values on the diagonal will always be 0 and must not be
used: because of the second requirement, every precedence
satisfies $v \neq w$, so no value of v is ever equal
to that of w.
The value of D (the difference function) for each variable is
minimal when all items (or values of independent variables)
occur equally often and each value is preceded by each other
value of the same independent variable equally often. D is
then 0 for every matrix, and this results in a value of 0 for the
whole formula V. Because a (really) unbalanced set of runs
will have a higher value of V and the optimal sets a value
of 0, we can “simply” let the program try to minimize this
value and ignore options that only make the value higher.
$$
\begin{aligned}
V &= \sum D(T_{in}) + \sum D(P_{vw}) \\
  &= D(T_{in}(A)) + D(T_{in}(X)) + D(P_{vw}(A)) + D(P_{vw}(X)) \\
D(T_{in}(A)) &= \max(T_{in}(A)) - \min(T_{in}(A)) \\
D(T_{in}(X)) &= \max(T_{in}(X)) - \min(T_{in}(X)) \\
D(P_{vw}(A)) &= \max(P_{vw}(A)) - \min(P_{vw}(A)) \\
D(P_{vw}(X)) &= \max(P_{vw}(X)) - \min(P_{vw}(X)) \\
V &= (1 - 0) + (1 - 0) + (2 - 1) + (2 - 1) \\
  &= 1 + 1 + 1 + 1 \\
  &= 4
\end{aligned}
$$
Table 7: Balance value for 486192735
Our goal is thus to minimize the value of V for a set
of runs.
In Table 7 the value of V is calculated for our example,
which leads to a value of 4. As we want
this value to be as low as possible, this is not yet a very
good value. For a single run, however, it is the theoretical
minimum. The values for $T_{in}(F)$ can never be higher than
1 and include at least one occurrence of 0. The values for
$D(T_{in}(F))$ will thus always be 1 for a single run. Similarly,
the values of $D(P_{vw}(F))$ will be at least
1, because there are 6 combinations of values (as seen before)
and 8 preceding comparisons, yielding at best the
values 1 and 2 for $P_{vw}(F)$.
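The calculation of Table 7 can be reproduced with a short Python sketch of formula (1). It assumes the item numbering of Table 2; the names `ITEMS` and `balance_value` are hypothetical, and the counting is a straightforward reading of the definitions of $T_{in}$ and $P_{vw}$ rather than the original program.

```python
from collections import Counter
from itertools import permutations

# Item numbering from Table 2: item number -> (value of A, value of X).
ITEMS = {
    1: ("a", "x"), 4: ("a", "y"), 7: ("a", "z"),
    2: ("b", "x"), 5: ("b", "y"), 8: ("b", "z"),
    3: ("c", "x"), 6: ("c", "y"), 9: ("c", "z"),
}


def balance_value(runs, items=ITEMS):
    """Balance value V of a set of runs (each run is a list of item numbers)."""
    n_vars = len(next(iter(items.values())))
    length = len(runs[0])
    total = 0
    for m in range(n_vars):
        values = sorted({combo[m] for combo in items.values()})
        position = Counter()  # T_in: (index i, value n) -> count
        precede = Counter()   # P_vw: (v, w) -> times v directly precedes w
        for run in runs:
            seq = [items[k][m] for k in run]
            position.update(enumerate(seq))
            precede.update(zip(seq, seq[1:]))
        t = [position[(i, v)] for i in range(length) for v in values]
        # Only pairs with v != w: the diagonal of P_vw is skipped.
        p = [precede[pair] for pair in permutations(values, 2)]
        total += (max(t) - min(t)) + (max(p) - min(p))
    return total


print(balance_value([[4, 8, 6, 1, 9, 2, 7, 3, 5]]))  # 4, as in Table 7
```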
3. CALCULATING SOLUTIONS
3.1 Exhaustive Calculation
There are 9! = 362880 different orderings that can be created with our 9 items. While they all follow the first
requirement, most of them do not comply with the second
requirement. If we look at the number of runs that can be
created with 9 items that do satisfy the first two requirements, we find that there are 1512 possible runs.
The generation of these takes just a few seconds on a modern PC. The number of 1512 distinct runs follows from
the second requirement. Because we do not want the
same value to be repeated, the number of options for an
item in a position in the ordering is limited. If we look at
the generation of a run that is started with a “2”, the next
item can only be a 4, 6, 7 or 9 (see Table 8), leaving only 4
choices instead of the 8 that we could choose without the
requirement. The next step has only 3 choices because the
first item has now been taken out. This effect continues for
each new item, with items that have been used taken out.
Because these items are taken out, we do not know how to
calculate the number of permutations short of writing out all options,
but the generation of runs found a total of 1512 different
runs.
       x   y   z
a      -   4   7
b      2   -   -
c      -   6   9
Table 8: All next items when 2 is chosen
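The narrowing of choices shown in Table 8 can be reproduced with a few lines of Python. This is a minimal sketch assuming the item numbering of Table 2; the helper name `successors` is hypothetical.

```python
# Item numbering from Table 2: item number -> (value of A, value of X).
ITEMS = {
    1: ("a", "x"), 4: ("a", "y"), 7: ("a", "z"),
    2: ("b", "x"), 5: ("b", "y"), 8: ("b", "z"),
    3: ("c", "x"), 6: ("c", "y"), 9: ("c", "z"),
}


def successors(item, used=(), items=ITEMS):
    """Items that may directly follow `item`: unused and different in every variable."""
    return [k for k in sorted(items)
            if k not in used
            and all(p != q for p, q in zip(items[item], items[k]))]


print(successors(2))            # [4, 6, 7, 9], the entries of Table 8
print(successors(4, used={2}))  # [3, 8, 9]: three choices once 2 is taken out
```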
The minimum value for a run is 4, as we have seen in the
previous example, and the maximum we find is 7. The next
step is to combine these runs to create sets.
There are already 1512 × 1511 sets of two runs, and none
of them comply with the third and fourth requirements yet,
because there are not enough items to fill out all values of
$T_{in}(F)$ and $P_{vw}(F)$. If we look at the values we get from our
formula, most of the sets do not yet have a “good” ordering
with a low balance-value. The minimal value we find is 5
and the maximum 14. We do not know why the balance-value for these sets
cannot be as low as that of individual runs.
Sets of three runs should theoretically be the first to have
balance-values that are significantly lower than those before,
as it is possible to get values of 0 in $D(T_{in}(F))$. The values
of $D(P_{vw}(F))$ are harder to predict but may be 0 as well,
because the total number of preceding item comparisons is
divisible by the number of possible combinations (3 × 8 =
24 = 4 × 6). The results show that sets of 3 runs have
a minimal value of 2 and a maximum of 20.
The minimal values of V for sets of 3 are partially
as we expected. Both values of $D(T_{in}(F))$ are 0,
while one of the values of $D(P_{vw}(F))$ for A or X is 2 and the
other is 0. We do not know why these values cannot both be 0 at the same
time; for our goal of calculating a set of
runs that suffices for a usability test this does not matter
yet. Apparently it is not possible to get a perfect succession
of items for all independent variables at the same time with
sets of three runs. These sets thus do not completely
fulfill all requirements: they fulfill the first three but not
the fourth. Subsequent examples will leave out the reasoning about requirements, but we will return to them in our
discussion after that.
The number of sets with n runs can be found by calculating the permutations of the 1512 runs:
$1512!/(1512-n)!$. For sets with 3 runs
this is 1512 × 1511 × 1510, and while this is not an
exponential increase by definition, the number of permutations of the 1512 runs grows rapidly. This means that the
computational time and disk space needed for finding the
best balanced sets also increase with this factor. For these
sets of 3 runs, we already need several hours to exhaustively
calculate all balance values. And while our implementation
is not optimal, even with optimizations, exhaustively calculating anything near a useful set for sufficient participants
will require either a supercomputer or a lot of time.
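To give a feeling for this growth, the counts can be evaluated with Python's standard `math` module. The sketch only does the arithmetic: it prints both the number of ordered selections, which matches the product 1512 × 1511 × 1510 used in the text, and the number of unordered selections for comparison.

```python
import math

n_runs = 1512  # number of valid runs reported in the text

for n in (2, 3, 4):
    ordered = math.perm(n_runs, n)    # ordered selections of n distinct runs
    unordered = math.comb(n_runs, n)  # the same selections with order ignored
    print(n, ordered, unordered)
```

For n = 3 the ordered count is 3,449,794,320, which is n! = 6 times the unordered count.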
3.2 Combining Sets
This almost exponential increase in computational complexity is the reason that we look to heuristics for a way of calculating sets containing higher numbers of runs. One of the
ways we do this is to combine sets of runs instead of single
runs. We do this with the sets of 3 runs with the lowest
balance-values (V = 2) we found in the previous step. By
combining whole sets, we skip the intermediate sets of 4 and
5 runs and go on to the sets with 6 runs. And by using
only the sets with the lowest balance-values, we use about
0.0001% of all calculated sets of 3 runs (only 3744 sets).
This also takes time off the calculation of sets of 3 runs:
they still have to be created and their balance values have
to be calculated, but the most time-consuming task of saving
the set to the hard drive is eliminated. This practice also
uses a lot less disk space, which is another limiting factor
for bigger sets.
When we add together two sets containing disjoint runs,
we believe that the balance value for the new set will not
exceed that of the individual sets together. This can be
shown by taking two sets N and M that comply with our
requirements and following the next set of equations for each
value of $T_{in}$ and $P_{vw}$. The values for the different sets N
and M will be denoted with an extra (N) or (M) in the formulas.

$$V(N) = \sum D(N), \qquad V(M) = \sum D(M)$$

What we want to prove is that the balance-value of the combination of N and M,
V(N + M), is less than or equal to the individual balance-values
of the sets added together. We do this by going down to the level of
the difference-value for individual variables and combining the
minimum and maximum values of the variables.

Likewise, for easy representation, we will omit the specifying variables for the difference-function and write
$D(N) = \max(N) - \min(N)$ when we actually mean
$D(T_{in}(F))(N) = \max(T_{in}(F))(N) - \min(T_{in}(F))(N)$, or the same
formula for $P_{vw}$. D(N) denotes the difference value for any of the matrices
$T_{in}$ and $P_{vw}$ in set N.

$$
\begin{aligned}
D(N) &= \max(N) - \min(N) \\
D(M) &= \max(M) - \min(M) \\
D(N) + D(M) &= (\max(N) - \min(N)) + (\max(M) - \min(M)) \\
            &= (\max(N) + \max(M)) - (\min(N) + \min(M))
\end{aligned}
$$

The maximum and minimum of the combination are not necessarily equal to the sums of the
individual values, because of the way these values are picked from
the matrix of the variable. For each variable individually,
the maximum of set N can be increased by at most
the maximum of set M. For instance, $\max(T_{in})$ of set N can
be combined with at most $\max(T_{in})$ of set M. This can best
be seen in the example in Table 10, where the maxima of 2 (set N)
and 3 (set M) combine to a maximum of only 3 rather than 5; as a result,
the balance-values of 4 and 5 combine to 6 rather than 9 (Table 13).
Likewise, the minimum of set N will be increased by at least the minimum of set M. This can also be seen in the example, but less
clearly, because 0 + 0 also adds up to the new value of 0.

$$
\begin{aligned}
\max(N + M) &\le \max(N) + \max(M) \\
\min(N + M) &\ge \min(N) + \min(M)
\end{aligned}
$$

These two rules together make the D value of the combined set equal to or smaller
than the sum of the individual D values. And if we sum all values of D, V(N + M) is also
less than or equal to the two balance values added together.

$$
\begin{aligned}
D(N + M) &\le D(N) + D(M) \\
V(N + M) &\le V(N) + V(M)
\end{aligned}
$$

The complete example of the calculations involved in combining
the two sets shown in Table 9, with balance-values of 4 and 5,
can be found in Tables 10, 11 and 12 and the equations in Table 13.
The equations show the values that are combined in
the formula for V and show that the individual values added are
higher than the value for the combined set. It can be seen in the
equations that the maximum of the two sets together is not always
as high as the individual maxima added together.

Set N:   9  2  4  8  3  5  7  6  1
         1  9  4  2  7  5  3  8  6
         5  7  3  4  2  9  1  6  8
Set M:   3  4  2  9  5  7  6  1  8
         4  9  2  6  7  5  1  8  3
         8  3  5  7  6  1  9  2  4
Table 9: Set N and M

Index:         1  2  3  4  5  6  7  8  9
Set N:      a  1  1  2  1  1  0  2  0  1
            b  1  1  0  2  1  2  0  1  1
            c  1  1  1  0  1  1  1  2  1
Set M:      a  1  1  0  1  1  2  1  1  1
            b  1  0  3  0  1  1  0  2  1
            c  1  2  0  2  1  0  2  0  1
Combined:   a  2  2  2  2  2  2  3  1  2
            b  2  1  3  2  2  3  0  3  2
            c  2  3  1  2  2  1  3  2  2
Table 10: Occurrences for variable A

Index:         1  2  3  4  5  6  7  8  9
Set N:      x  1  1  1  1  2  0  2  0  1
            y  1  0  2  1  0  2  0  2  1
            z  1  2  0  1  1  1  1  1  1
Set M:      x  1  1  2  0  0  1  1  2  1
            y  1  1  1  1  2  1  1  0  1
            z  1  1  0  2  1  1  1  1  1
Combined:   x  2  2  3  1  2  1  3  2  2
            y  2  1  3  2  2  3  1  2  2
            z  2  3  0  3  2  2  2  2  2
Table 11: Occurrences for variable X

Set N:            Set M:            Combined:
     a  b  c           a  b  c           a  b  c
a    0  4  4      a    0  4  4      a    0  8  8
b    4  0  4      b    4  0  4      b    8  0  8
c    4  4  0      c    4  4  0      c    8  8  0

     x  y  z           x  y  z           x  y  z
x    0  4  4      x    0  4  4      x    0  8  8
y    4  0  4      y    4  0  4      y    8  0  8
z    4  4  0      z    4  4  0      z    8  8  0
Table 12: Preceding variables for A and X

V(N) = 4 = (2 − 0) + (2 − 0) + (4 − 4) + (4 − 4)
V(M) = 5 = (3 − 0) + (2 − 0) + (4 − 4) + (4 − 4)
V(N + M) = 6 = (3 − 0) + (3 − 0) + (8 − 8) + (8 − 8)
Table 13: Balance-value for set N + M
From these overviews, it is not hard to see that the values for
$T_{in}$ of sets N and M in our example add together in such
a way that they yield a minimum and maximum as described
above. For the values of $P_{vw}$ of sets N and M this is equally true.
After combining all optimal sets of three (minimal balance value
of 2) and finding sets of six runs with values between 0 and 4, we
find that this reasoning holds for our cases with sets of 3. We
are therefore confident that we can continue adding sets of three to
the newly created sets, and keep doing this to find larger sets of
runs with low balance-values. To be completely sure that we are
allowed to do this for all sets that comply with the requirements,
the theoretical basis of this combining should be examined more
thoroughly.
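The bound can be checked numerically for the example sets. The sketch below is a compact, self-contained reading of formula (1), assuming the six runs as read from Table 9 and the item numbering of Table 2 (variable A cycles fastest, variable X slowest); the names `values_of`, `balance_value` and `parse` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def values_of(item):
    # Table 2 numbering: item 1..9 -> (index of A value, index of X value).
    return ((item - 1) % 3, (item - 1) // 3)


def balance_value(runs):
    total = 0
    for m in range(2):                       # the two variables A and X
        position, precede = Counter(), Counter()
        for run in runs:
            seq = [values_of(k)[m] for k in run]
            position.update(enumerate(seq))    # T_in counts
            precede.update(zip(seq, seq[1:]))  # P_vw counts
        t = [position[(i, v)] for i in range(9) for v in range(3)]
        p = [precede[pair] for pair in permutations(range(3), 2)]  # off-diagonal
        total += (max(t) - min(t)) + (max(p) - min(p))
    return total


def parse(rows):
    return [[int(ch) for ch in row] for row in rows]


N = parse(["924835761", "194275386", "573429168"])
M = parse(["342957618", "492675183", "835761924"])

print(balance_value(N), balance_value(M), balance_value(N + M))  # 4 5 6
```

The combined value of 6 stays below the sum 4 + 5 = 9, in line with Table 13.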
3.3 Pruning
The next obstacle comes up when we continue with this approach.
Because we generate sets with an increasing number of runs, these
sets take up more disk space. And as the total number of possible
sets increases, so does the number of sets with a minimal balance-value. While this is still a small fraction of the total number of
sets, it also increases exponentially. And though the first of these
problems is just a linear increase, together with the sheer number
of sets it adds up to an overload of our systems.
One solution is to prune the sets that are less likely to produce
good results and take just a small number of sets from the previous round to supply combinations for the new sets (so called
“seeding”). But now care must be taken to keep enough diversity
in the chosen sets so we do not try to combine the same sets with
each other the whole time. This indicates that there certainly
is a minimum number of different sets that need to be used for
recombining or that an indicator of diversity must be introduced.
For an idea of how this can be solved, we look at the research that
has already been done on selection schemes for genetic algorithms
[4]. Genetic algorithms work by taking the best “genes” on to the
next iteration of the program. And while we did not set out to
create a genetic algorithm, the setup of our algorithm is quite
similar to one. Our minimal-value sets can all
potentially be included in the end result, and there are no specific
criteria to test these sets against each other. So although the
selection schemes give good indications of what solutions can be
tried, none fits our problem exactly. The nearest we come to
a solution would be to include all generated sets or a random
selection of these best solutions.
This is exactly what we do: we take a random selection of 3000
sets from the best results to continue in our next iteration of
recombination. Care must still be taken to select a diverse
enough group and not to rely on randomized results being balanced, as
this is the trap we have been trying to avoid from the beginning.
How such problems might best be avoided is discussed later. For
now, it suffices to take equally interspaced sets from the best
balance-valued sets.
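Taking equally interspaced sets from a list of candidates can be sketched as follows. This is only one possible reading of the selection step: the function name `equally_spaced` is hypothetical, and the candidates are assumed to be held in a list in the order in which they were generated.

```python
def equally_spaced(candidates, k):
    """Pick k entries at (near-)equal intervals from a list of candidates."""
    if k >= len(candidates):
        return list(candidates)
    step = len(candidates) / k
    return [candidates[int(i * step)] for i in range(k)]


# Stand-in for the 3744 best sets of 3 runs: here simply their positions.
best_sets = list(range(3744))
seeds = equally_spaced(best_sets, 3000)
print(len(seeds), seeds[:5])
```

Because the step is larger than 1, every selected position is distinct, so no seeding set is picked twice.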
With this setup, of only 3000 seeding sets, each iteration still
linearly increases in size because the sets themselves grow. The
time it takes to generate the next set of minimal balance-valued
sets is reduced to a few minutes, while still creating new solutions to use in usability tests. The different solutions all have a
minimal balance-value of 0 or 2 (more on the reason for this in
the discussion) and balanced sets with up to 52 runs have been
found already within a few hours of computation time, which is
very reasonable. These solutions solve our problem of creating
sets of runs that are “good enough” to use in a within-subject
design with enough participants.
4. DISCUSSION
An interesting observation is that in our setup with two variables
of three values each, sets with a number of runs divisible by six
have minimal balance-values of 0 and those that are not (but are
divisible by three) have a minimal value of 2. This means that sets
with a number of runs divisible by 6 comply with all requirements, while those divisible
by 3 and not 6 do not. Because our search is not exhaustive, we
cannot prove this, but we expect that sets of a multiple of six
runs can bring all values of $D(P_{vw}(F_m))$ to 0 alongside those of $D(T_{in}(F_m))$, while
sets that are not divisible by six cannot.
We now define the difference for each variable included in the
items as the minimum subtracted from the maximum. This intuitively feels like a good indicator of how far from the best
solution the ordering still is, and it works. But there might be
better functions that can select the right sets of runs (computationally) quicker. One idea is to still use the minimum and maximum values and their difference, but to square these differences so
that higher differences increase more quickly. This would introduce a
sharper cutoff of the less-than-optimal sets and probably lead to
a lower computation time. Other formulae for defining
the best run or sets of runs need to be researched. An implication of such a change in the formula could be that a combination
of two sets no longer has a balance-value that is less than or equal to
the sum of the values of the two individual sets. Such consequences need to
be checked when changing the formula.
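A small worked example shows why the combination bound needs rechecking for squared differences. The notation $V'$ is hypothetical and stands for the squared variant suggested above; the numbers describe an assumed case in which the maxima of two sets N and M fall in the same cell of a matrix and the minima in another, so that $D(N + M) = D(N) + D(M)$.

```latex
\begin{align*}
V' &= \sum_{m=1}^{n} \Bigl[ D(T_{in}(F_m))^2 + D(P_{vw}(F_m))^2 \Bigr] \\
D(N) = D(M) = 1,\quad D(N + M) = 2
   &\;\Rightarrow\; D(N + M)^2 = 4 \;>\; 2 = D(N)^2 + D(M)^2
\end{align*}
```

In such a case the squared value of the combined set exceeds the sum of the squared values of its parts, so the inequality derived for V does not carry over automatically.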
As we mentioned briefly, our approach to solving the problem is
quite similar to how genetic algorithms are built. This is not
surprising, as genetic algorithms are a form of global search
heuristic [3], and ours is also a global search problem. A genetic
algorithm mimics the ideas of evolution by trying out random
candidate solutions and combining the best of them into new candidates.
This is done over multiple iterations, each time using the best
candidates to create new ones, until a sufficiently good solution is
generated. A genetic algorithm usually follows four phases:
Initialization, Selection, Reproduction and Termination. In our case,
Initialization is the generation of individual runs that comply with
the requirements on runs. Selection is calculating the minimal values
of all sets and selecting a random new group of these sets.
Reproduction is combining these sets to create new ones. These steps
are repeated until we reach sets that are big enough for the group of
participants we have in mind, at which point the process terminates
(Termination).
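The four phases can be sketched as follows. This is a simplified illustration under stated assumptions, not our actual program: the balance function is a stand-in that sums, per position and per variable, the maximum minus the minimum occurrence count, and the names base_runs, balance and evolve as well as the parameters pool_size and keep are hypothetical. A small 2 × 3 design is used so that the sketch runs quickly.

```python
import itertools
import random
from collections import Counter


def base_runs(values_a, values_b):
    """Initialization: all orderings in which consecutive items share no value."""
    items = list(itertools.product(values_a, values_b))
    return [
        perm
        for perm in itertools.permutations(items)
        if all(p[0] != q[0] and p[1] != q[1] for p, q in zip(perm, perm[1:]))
    ]


def balance(run_set):
    """Stand-in balance value (assumption): sum over positions and variables
    of the maximum minus the minimum occurrence count."""
    first = run_set[0]
    total = 0
    for var in range(len(first[0])):
        values = sorted({item[var] for item in first})
        for pos in range(len(first)):
            counts = Counter(run[pos][var] for run in run_set)
            per_value = [counts[v] for v in values]
            total += max(per_value) - min(per_value)
    return total


def evolve(runs, target_size, pool_size=200, keep=50, seed=0):
    """Grow sets of runs by repeated reproduction and selection."""
    rng = random.Random(seed)
    # Start from sets that each contain a single random base run.
    pool = [(rng.choice(runs),) for _ in range(pool_size)]
    # Termination: stop once the sets are big enough.
    while len(pool[0]) < target_size:
        # Reproduction: combine random pairs of sets into larger sets.
        children = [rng.choice(pool) + rng.choice(pool) for _ in range(pool_size)]
        # Selection: keep the sets with the lowest balance value.
        children.sort(key=balance)
        pool = children[:keep]
    return min(pool, key=balance)


runs = base_runs("ab", "xyz")
best = evolve(runs, target_size=6)
print(len(runs), len(best), balance(best))
```

Because each reproduction step doubles the set size, this sketch only produces sets whose size is a power of two. Reaching an arbitrary number of participants would require combining sets of different sizes.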
One optimization that could still be made is to test all sets
chosen in our “Selection” phase to see whether they contain a
sufficiently diverse group of runs. This would help prevent the
solution from being steered too strongly toward a local minimum with
too few different runs. For the number of participants we needed this
is not a problem, but for higher numbers it could cause the program
to fail to add sufficiently diverse sets together. More research from
the domain of genetic algorithms should be reviewed to help improve
upon these findings.
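One possible form of such a diversity test is sketched below. It is an assumption on our part and not something we implemented: the function distinct_fraction and the example runs are hypothetical, and what counts as "sufficiently different" would still have to be determined.

```python
def distinct_fraction(selected_sets):
    """Fraction of distinct runs among all runs in the selected sets."""
    all_runs = [run for run_set in selected_sets for run in run_set]
    return len(set(all_runs)) / len(all_runs)


# Three hypothetical runs of a single-variable design.
r1 = ("a", "b", "c")
r2 = ("b", "c", "a")
r3 = ("c", "a", "b")

# Two selected sets that share the run r1: 3 distinct runs out of 4.
print(distinct_fraction([(r1, r2), (r1, r3)]))
```

A selection step could then discard or resample a group of sets whose fraction falls below a chosen threshold.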
In our examples we showed that our approach works for a 3 × 3
within-subject design: a design with two independent variables
with three values each. While we have tried to keep the explanations
as general as possible, some solutions to the problems we encountered
might not work with other numbers of values per variable. Increasing
the number of values per independent variable should not give any
trouble. If we want to calculate orderings for a 4 × 3 or 4 × 4
design, the computation time will increase, but such an increase
should still be manageable.
Adding an extra independent variable means adding an extra
set of equations to the formula. This in itself is not a big
problem, but the number of calculations again increases
exponentially. A 3 × 3 × 3 design with 27 items, for instance,
already has several million base runs. A 2 × 2 × 2 design will
probably still be feasible, but here other problems arise. Because we
linearize the items, we expect the time and disk space needed to
depend on the number of base runs for a setup. So we do not say it is
impossible to calculate such orderings, but more research would be
needed to adapt our work and make it useful for higher numbers of
values.
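To give a rough sense of this growth, the sketch below computes the number of conditions of a design and the number of unconstrained orderings of those conditions, which is the factorial of the number of conditions. This is only an upper bound: the number of base runs is smaller, because base runs also have to satisfy the requirements on runs. The function names are hypothetical.

```python
import math


def condition_count(levels):
    """Number of conditions: the product of the numbers of values per variable."""
    return math.prod(levels)


def unconstrained_orderings(levels):
    """Upper bound on the number of base runs: all orderings of the conditions."""
    return math.factorial(condition_count(levels))


for design in [(3, 3), (4, 3), (4, 4), (2, 2, 2), (3, 3, 3)]:
    label = " x ".join(str(n) for n in design)
    print(label, condition_count(design), unconstrained_orderings(design))
```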
This means that we cannot use this approach for all n × n × . . .
designs. There is also a lower bound, because a 2 × 2 design cannot
be completely counterbalanced. With a 2 × 2 design there are
four possible items: ax, bx, ay and by. These cannot be ordered
in such a way that none of the variables repeats a value in
consecutive items: after ax only by can follow, and after by only ax,
which has already been used. The lower bound is a 2 × 3 or 3 × 2
design, because these can satisfy the second requirement. If we let
go of the requirement that forbids repetition, then any (not too big)
n × n design can be used.
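The lower bound can be checked by brute force. The sketch below assumes that the second requirement means that two consecutive items may not share a value of either variable. The function name valid_runs is hypothetical.

```python
import itertools


def valid_runs(values_a, values_b):
    """All orderings of the items in which consecutive items share no value."""
    items = list(itertools.product(values_a, values_b))
    return [
        perm
        for perm in itertools.permutations(items)
        if all(p[0] != q[0] and p[1] != q[1] for p, q in zip(perm, perm[1:]))
    ]


runs_2x2 = valid_runs("ab", "xy")
runs_2x3 = valid_runs("ab", "xyz")

print(len(runs_2x2))  # no valid ordering exists for a 2 x 2 design
print(len(runs_2x3))  # a 2 x 3 design does have valid orderings
```

One valid ordering for the 2 × 3 design is ax, by, az, bx, ay, bz.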
For an experimental setup with just one independent variable
our solution also works. Again, care has to be taken not to try
to calculate the runs for too many values, but as most designs
with a single variable have just a few values this should not be a
problem.
We have been calculating large sets of runs, which takes quite a
while. Another solution to our problem is to create several base sets
for our 3 × 3 design with a value of 0 for V, and to use these sets
several times to create a larger set. This would of course multiply
the context effects introduced by the ordering by the number of
times the set is used. But as this happens over all runs, the
net effect will still be 0. This can also be done for other numbers
of independent variables and numbers of values per variable. In this
way a table can be created in which experimenters can look up
a base ordering for their experimental design. They can then take this
base ordering and use it as many times as they have participants.
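That repeating a balanced base set keeps it balanced can be illustrated with a small single-variable sketch. The function position_spread is a hypothetical stand-in for V: it sums, per position, the maximum minus the minimum occurrence count, and it does not reproduce our full formula.

```python
from collections import Counter


def position_spread(run_set, values):
    """Stand-in for V (assumption): per position, max minus min count, summed."""
    total = 0
    for pos in range(len(run_set[0])):
        counts = Counter(run[pos] for run in run_set)
        per_value = [counts[v] for v in values]
        total += max(per_value) - min(per_value)
    return total


# A balanced base set: every value occurs once at every position.
base_set = [("a", "b", "c"), ("b", "c", "a"), ("c", "a", "b")]

# Using the base set four times multiplies every count by four.
repeated = base_set * 4

print(position_spread(base_set, "abc"), position_spread(repeated, "abc"))
```

Every count is multiplied by the same factor, so the maximum and minimum stay equal and the difference remains 0.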
5. CONCLUSION
By using a heuristic approach to calculating a balance value, we
can calculate enough orderings of the independent variables to
create a sufficient set for use in an experimental setting. This
approach works for n × n designs, which need to have at least 2 × 3
values. The most important restriction on the number of variables
and the number of values they encompass is the number of conditions
the combinations create. As long as the number of conditions is
relatively small, this approach can be used. As a rule of thumb,
the number of conditions should be at most about 20.
We hope that the setup as we describe it can help other researchers
to balance the variations introduced by the ordering
of the independent variables, and help them reach stronger
conclusions. To make this possible, it would be good if an
easy-to-use GUI implementation of our approach were created, so
that all researchers can generate orderings for their user
tests.
6. REFERENCES
[1] Barbieri, M., and Stinstra, E. Automatically generated
video previews: user study. In Human Information
Processing Colloquium (2006), Philips Research, Eindhoven,
The Netherlands.
[2] Bradley, J. Complete counterbalancing of immediate
sequential effects in a Latin square design. Journal of the
American Statistical Association 53, 282 (1958), 525–528.
[3] Goldberg, D. Genetic Algorithms in Search and
Optimization. Addison-Wesley, 1989.
[4] Goldberg, D., and Deb, K. A comparative analysis of
selection schemes used in genetic algorithms. Foundations of
genetic algorithms 1 (1991), 69–93.
[5] Greenwald, A. Within-subjects designs: To use or not to
use? Psychological Bulletin 83, 2 (1976), 314–320.
[6] Kirk, R. Experimental design. Sage Publications Ltd, 2009.
[7] Pearl, J. Heuristics: Intelligent search strategies for
computer problem solving. Addison-Wesley Publishing Co.,
Reading, MA, 1984.
[8] Winer, B. J., Brown, D. R., and Michels, K. M.
Statistical Principles In Experimental Design. McGraw-Hill,
1991.