Balancing Context Effects by Controlled Experiments in Within-Subject Design
Optimizing Condition Counterbalancing

Tin de Zeeuw
University of Twente
Enschede, Netherlands
[email protected]

ABSTRACT
To minimize the effects that arise from condition ordering in within-subject designs, we propose a formula based on the work of Barbieri and Stinstra [1]. We implement this formula in a program that creates an individual ordering for each participant of a within-subject experiment, such that all orderings together are balanced. The program uses a heuristic approach to cope with the (almost) exponential growth in the number of possible combinations of orderings. The approach is successful: we show that it can be used to create multiple sets of orderings for at least 52 participants. With such sets of orderings, the context effects should be minimized and negligible in the outcome of the experiment.

Keywords
Usability tests, context effects, within-subject experiment setup, heuristic solution.

1. INTRODUCTION
In every experiment involving people, there are the effects we want to find, effects outside our control, and effects that arise from the way the experiment is set up. We can, for example, set up an experiment to measure the effect the color of light has on the perceivability of flicker. We put a participant in front of a lamp of a certain color and turn it on and off very quickly. The frequency with which the lamp turns on and off can be varied, and above a certain frequency the participant will no longer see the flickering and will instead perceive the light as continually on. The variables we change to measure their effect are called the independent variables (in our example, the frequency of the flickering). We can then use several conditions in which a specific frequency is presented to the participants of the test.
(In our example, the conditions are the different colors.) If we want to know whether the frequency of flickering influences the perception of the light, we can ask the participants whether they can see it flickering. The perception of the participants, whether they see the flickering, is a dependent variable.

The way the independent variables are presented to the participants in an experiment can introduce several context effects that can influence the outcome. One such effect could be that people get more tired after looking at the lamp for a period of time, and become less likely to see the flickering after a while. If we then always show the participants the colors in the same order, first green, then red and finally blue light, the blue light will always have a lower threshold frequency up to which the participant notices the flicker. That is a context effect that occurs because of the setup of the experiment.

That is why, according to Greenwald [5], most statisticians opt for a between-subject design rather than a within-subject design. He also explains why this should not be the case, as there are ways of guarding against and repairing these effects. In a between-subject design every participant participates in one condition, so changes in the dependent variables within a condition should only result from differences in how the participants perform on the task. If the participants are divided equally over the conditions, the outcome can only be attributed to the different conditions. In our example, changing to a between-subject design would mean that every participant gets to see only one color. Context effects are more explicit in an experiment in which each participant repeats the experimental task several times under different conditions: the within-subject design [8] or repeated measures test.
In his article, Greenwald describes several types of context effect. A practice effect occurs when the participant learns from practicing the task and gets better at it each time; all subsequent performances of the same task are influenced by this. A carry-over effect occurs when the experience of the first task carries over some mood or state of being to the second task and thus influences the performance. Participants can also become sensitized to the variable that is changed, and try to figure out how they should behave. An example of this would be a test in which the participant needs to notice a light flickering. If the flickering is altered at a specific interval, the participant might notice the difference between two conditions and notice the interval. The participant can then look for the change and behave differently than when only looking for a flickering light. Finally, there is the effect mentioned earlier: participants might become fatigued after a while. The main problem is thus that participants might change over time, and this has to be taken into account.

The first two effects (practice and carry-over) can come from the ordering in which the independent variable is changed for a participant, and there are several ways to minimize their overall effect on the dependent variables. The effects will always exist, but for a group of participants we can try to smooth them out so that they influence the results equally in all conditions. That way they do not influence the results systematically in one direction.

According to Kirk [6], the easiest way to do this is to randomize the ordering. This works only if there are enough participants, and even then the resulting orderings can be skewed. An example of this would be if most orderings start with the same condition, which might mean that participants perform worse on that condition because they have not yet been able to practice the task.
Greenwald explains that counterbalancing is a better way to minimize these effects. He describes one-on-one counterbalancing, in which every included ordering of the values of the independent variable is paired with its opposite ordering. If the first participant is given condition a and then b, the second participant is given b and then a. Counterbalancing can also be done with larger groups of orderings; one of the most used ways to do this is with Latin squares [2]. A Latin square is an n × n table filled with n different conditions such that every condition occurs exactly once in each row and each column, which ensures an acceptable differentiation of orderings between participants. Table 1 shows an ordering for three participants, with three time slots for the three different conditions. It can be seen that, for instance, condition a occurs at a different time for each participant. This is also true for the other two conditions.

          Participant 1   Participant 2   Participant 3
Time 1    a               c               b
Time 2    c               b               a
Time 3    b               a               c

Table 1: Latin square of 3 × 3

This approach only works with an n × n design, but it can be extended to other numbers of participants, time slots or conditions. With such an extension, care must be taken with the ordering of variables per participant and in total. This is similar to what we do in the following sections.

In their research, Barbieri and Stinstra [1] use a formula derived from such requirements to calculate how balanced a set of orderings is. This formula is a good measure to use in finding a well-balanced set, and its value also indicates how much work still has to be done to reach an optimal value. The requirements they use are meant to equalize the number of successions of the variables, to smooth the effects of specific combinations of conditions when the participant encounters them in succession.
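As an illustration of the Latin square in Table 1, the following minimal sketch builds a square by cyclically rotating a list of conditions and checks the row and column property. The function names `latin_square` and `is_latin_square` are hypothetical, and cyclic rotation is only one of several ways to construct such a square; it happens to reproduce Table 1 when started from the order a, c, b.

```python
def latin_square(conditions):
    """Build an n x n Latin square by cyclic rotation (illustrative assumption).

    Row t is a time slot, column p is a participant.
    """
    n = len(conditions)
    return [[conditions[(t + p) % n] for p in range(n)] for t in range(n)]


def is_latin_square(square):
    """Check that every symbol occurs exactly once in each row and column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[t][p] for t in range(n)} == symbols for p in range(n))
    return len(symbols) == n and rows_ok and cols_ok


square = latin_square(["a", "c", "b"])
for t, row in enumerate(square, start=1):
    print(f"Time {t}:", " ".join(row))
print(is_latin_square(square))
```

Each column of the printed square is the ordering for one participant.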
We use the work of Barbieri and Stinstra and adapt their ideas slightly to describe a set of requirements from which such a formula is created in Section 2. The basic parts of the formulae are explained here. They start with an item, which is a specific value of an independent variable or a combination of values of multiple independent variables. A run is then filled with a specific ordering of items, each at a specific index i in the run. These runs can be collected in a set that satisfies certain requirements. As an example that we will use in the rest of this article, we take two independent variables with 3 values each:

A: a, b, c
X: x, y, z

We can combine the values of these two variables to create 9 items, which we number 1 through 9 in Table 2 for ease of notation.

      x   y   z
a     1   4   7
b     2   5   8
c     3   6   9

Table 2: All items formed from a 3 × 3 setup

The items we have now specified need to be put in an ordering in which each is included exactly once. This ensures that each combination of values of the two variables occurs (exactly once). If we then make sure that no variable value repeats itself, we have the requirements for individual runs.

Run Requirements:
1. Within one run each item occurs equally often.
2. Within one run each variable for the couple of items at location i and i + 1 has a different value.

That no value may repeat itself is a simple measure to make sure we can compare conditions with each other. Participants are protected from perceiving two consecutive conditions as the same, and at any moment during the experiment we can ask whether they preferred the current condition or the previous one. An ordering that conforms to these two requirements is, for instance, “486192735”. This ordering is shown in Table 3 with the values of the variables it is made up of. Here it is clear that no value repeats itself and each item, or combination of values, occurs exactly once.
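The two run requirements can be checked mechanically. The sketch below is only an illustration for the 3 × 3 example: it assumes the item numbering of Table 2 (numbers run down the columns), and the names `decode` and `is_valid_run` are hypothetical helpers, not part of the program described in this article.

```python
A_VALUES = "abc"  # values of variable A (rows of Table 2)
X_VALUES = "xyz"  # values of variable X (columns of Table 2)


def decode(item):
    """Map an item number 1..9 to its (A, X) values, following Table 2."""
    index = item - 1
    return A_VALUES[index % 3], X_VALUES[index // 3]


def is_valid_run(run):
    """Check run requirements 1 and 2 for a sequence of item numbers."""
    # Requirement 1: every item occurs equally often (here: exactly once).
    if sorted(run) != list(range(1, 10)):
        return False
    # Requirement 2: neighbouring items differ in every variable.
    for left, right in zip(run, run[1:]):
        (a1, x1), (a2, x2) = decode(left), decode(right)
        if a1 == a2 or x1 == x2:
            return False
    return True


example = [int(ch) for ch in "486192735"]
print([decode(item) for item in example])
print(is_valid_run(example))                      # True
print(is_valid_run([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # False: items 1 and 2 share x
```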
Index:           1  2  3  4  5  6  7  8  9
Item ordering:   4  8  6  1  9  2  7  3  5
Variable A:      a  b  c  a  c  b  a  c  b
Variable X:      y  z  y  x  z  x  z  x  y

Table 3: Valid ordering with its variables

If we first look at only one independent variable, the first requirement of Barbieri and Stinstra is the same as the one Latin squares have: no value can occur more than once. With only one independent variable with several values, the second requirement is automatically satisfied because every value occurs only once. For more than one repetition of a variable, the second requirement becomes more interesting. It will, for instance, allow an ordering of abacbc, which is a mix of two repetitions of three items of one variable (with three values). This is also important when we build up items from more than one variable, because these different variables all need to be included equally often and none may repeat itself, as can be seen in the example of Table 3.

The number of combinations of two values following each other is also a measure of balance. Because of the direct influence one condition can have on the next, this is also taken into account in the requirements. For a set of N runs, we follow the requirements set by Barbieri and Stinstra exactly:

Set Requirements:
3. Each value of an independent variable occurs equally often in location i in all runs.
4. Each value of an independent variable is preceded by each other value of that variable an equal number of times in the set of runs.

We look for an implementation of a program that calculates a set of orderings of the independent variables in such a way that it minimizes the context effects in the data for a set of experiments overall. Because calculating all possible balanced sets of orderings would increase the time needed for calculations exponentially with each extra ordering in the set, we look to a heuristic solution. A heuristic is a way of implementation that sacrifices thoroughness in a solution to gain speed.
This means it will generate a solution that is usually acceptable, but that by no means has to be an optimal one [7]. Using a heuristic approach to yield an acceptable set of orderings in a reasonable time span is the better choice here, because we do not need an exhaustive solution; we only need a single balanced set to be able to run an experiment.

2. BASIC FORMULA
To implement the formula as described by Barbieri and Stinstra, we follow the requirements set by them. These can be implemented individually as a set of rules for the individual runs, and a formula for a set of runs combined.

For each run, the first two requirements can be checked while systematically creating the runs. A simple recursive program can add an item to the end of an ordering and check that the requirements are not violated. If they are not, another item can be added until a whole run is formed. If a requirement is violated, another option can be tried, or, if all options are exhausted, the previous item can be removed and replaced. This way all options for a single run that are valid under the first two requirements can be generated systematically.

For balancing a set of runs we can translate the requirements into variables with which we calculate a measure of balance, and specify a formula not unlike that of Barbieri and Stinstra. In this formula we only take into account the last two requirements and assume the first two to be satisfied by the run generation.

3. $T_{in}(F)$: the number of times the value n of an independent variable F is in location i.
4. $P_{vw}(F)$: the number of times a value v precedes the value w of the same independent variable F.

We can implement this formula in our program in such a way that it can calculate a value for an arbitrary number of independent variables per item. Our program essentially just counts the number of occurrences of the different independent variable values and then fills them in in the formula.
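The recursive run generation described in this section can be sketched as a depth-first search with backtracking. This is a minimal illustration for the 3 × 3 example only, assuming the item numbering of Table 2; `compatible` and `generate_runs` are hypothetical names and the code is not the original implementation. Section 3.1 reports 1512 valid runs for this design, so the printed count can be compared against that figure.

```python
def compatible(i, j):
    """Items 1..9 may be neighbours only if both variables change (Table 2 layout)."""
    return (i - 1) % 3 != (j - 1) % 3 and (i - 1) // 3 != (j - 1) // 3


def generate_runs():
    """Yield every run of the 9 items that satisfies requirements 1 and 2."""
    run = []
    used = [False] * 10  # index 0 is unused

    def extend():
        if len(run) == 9:
            yield tuple(run)
            return
        for item in range(1, 10):
            # Skip items already placed or repeating a value of the previous item.
            if used[item] or (run and not compatible(run[-1], item)):
                continue
            run.append(item)
            used[item] = True
            yield from extend()
            # Backtrack: remove the item and try the next option.
            run.pop()
            used[item] = False

    yield from extend()


runs = list(generate_runs())
print(len(runs))
print((4, 8, 6, 1, 9, 2, 7, 3, 5) in runs)  # the example run of Table 3
```

Because items are tried in ascending order, the runs come out in lexicographic order, which makes later selection steps reproducible.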
If we look back at the example in Tables 2 and 3, we can now count the number of occurrences of the values of variable A (in Table 4) and of the values of variable X (in Table 5), in accordance with the third and fourth requirements. Every index holds exactly one value of each variable, and each value of a variable is used exactly three times in one run. If we calculate these values for a set with more than one run, the number of occurrences can rise. A set of three runs can theoretically fill out all indices with exactly one occurrence of each variable’s value (the proof of this is left as an exercise for the reader).

Index   Item    a  b  c
1       4       1  0  0
2       8       0  1  0
3       6       0  0  1
4       1       1  0  0
5       9       0  0  1
6       2       0  1  0
7       7       1  0  0
8       3       0  0  1
9       5       0  1  0

Table 4: Occurrences for variable A

Index   Item    x  y  z
1       4       0  1  0
2       8       0  0  1
3       6       0  1  0
4       1       1  0  0
5       9       0  0  1
6       2       1  0  0
7       7       0  0  1
8       3       1  0  0
9       5       0  1  0

Table 5: Occurrences for variable X

      a  b  c            x  y  z
a     0  1  1      x     0  1  2
b     1  0  2      y     1  0  1
c     2  1  0      z     2  1  0

Table 6: Preceding variables for A and X (each row is a value, each column the value that precedes it)

The preceding variables for this ordering can be found in Table 6. The zero counts on the diagonal show that no value precedes itself. Furthermore, it can be seen that one run cannot be completely balanced with respect to the preceding items. This is because, for one variable, only 6 combinations of 2 values can be made with 3 values (ab, ac, ba, bc, ca and cb), while a run has 8 couples of values that precede each other.

Our goal is to minimize the value of the formula V, which sums over each independent variable $F_m$ the differences in $T_{in}$ and $P_{vw}$, where n in the sum is the number of independent variables:

\[ V = \sum_{m=1}^{n} \Big( D(T_{in}(F_m)) + D(P_{vw}(F_m)) \Big) \qquad (1) \]

In this formula, D is the difference between the maximum and minimum values in the matrix of variable G:

\[ D(G) = \max(G) - \min(G) \qquad (2) \]

For $T_{in}$, the D-values are calculated over all values in the matrix. For the selection of the minimum and maximum values in $P_{vw}$, the program must take into account that the values on the diagonal will always be 0, but should not be used. Because of the run requirements, the precedence will always satisfy $v \neq w$: no value of v is equal to that of w.
These diagonal values are therefore left out. The value of D (the difference function) for each variable is minimal when all items (or values of independent variables) occur equally often and each value is preceded by each other value of the same independent variable equally often. D is then 0 for all matrices, which results in a value of 0 for the whole formula V. Because a (really) unbalanced set of runs will have a higher value of V and the optimal sets a value of 0, we can “simply” let the program try to minimize this value and ignore options that only make the value higher.

\[
\begin{aligned}
V &= \sum D(T_{in}) + \sum D(P_{vw}) \\
  &= D(T_{in}(A)) + D(T_{in}(X)) + D(P_{vw}(A)) + D(P_{vw}(X)) \\
D(T_{in}(A)) &= \max(T_{in}(A)) - \min(T_{in}(A)) = 1 - 0 \\
D(T_{in}(X)) &= \max(T_{in}(X)) - \min(T_{in}(X)) = 1 - 0 \\
D(P_{vw}(A)) &= \max(P_{vw}(A)) - \min(P_{vw}(A)) = 2 - 1 \\
D(P_{vw}(X)) &= \max(P_{vw}(X)) - \min(P_{vw}(X)) = 2 - 1 \\
V &= (1 - 0) + (1 - 0) + (2 - 1) + (2 - 1) = 1 + 1 + 1 + 1 = 4
\end{aligned}
\]

Table 7: Balance value for 486192735

Our goal is updated to minimizing the value of V for a set of runs. Table 7 calculates V for our example, which leads to a value of 4. As we want this value to be as low as possible, this is not yet a very good value, although for a single run it is the theoretical minimum. The values in $T_{in}(F)$ can never be higher than 1 and include at least one occurrence of 0, so $D(T_{in}(F))$ will always be 1 for a single run. Likewise, $D(P_{vw}(F))$ will be at least 1, because there are 6 combinations of values (as seen before) and 8 preceding comparisons, which at best yields the values 1 and 2 in $P_{vw}(F)$.

3. CALCULATING SOLUTIONS
3.1 Exhaustive Calculation
There are 9! = 362880 different orderings that can be created with our 9 items. While they all satisfy the first requirement, most of them do not comply with the second requirement.
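Every candidate run or set has to be scored with the balance value V, so it may help to see the counting and formulas (1) and (2) written out. The following is a minimal sketch for the 3 × 3 example, assuming the item numbering of Table 2; `variable_values` and `balance_value` are hypothetical names, not the original program.

```python
from itertools import product

N_VALUES = 3  # both variables have three values in the 3 x 3 example


def variable_values(item):
    """Return the value indices (A, X) of an item 1..9, following Table 2."""
    return (item - 1) % N_VALUES, (item - 1) // N_VALUES


def balance_value(runs):
    """Compute V: the sum over both variables of D(T) + D(P) for a set of runs."""
    length = len(runs[0])
    total = 0
    for var in range(2):  # 0 = variable A, 1 = variable X
        # T[i][n]: how often value n sits at location i (requirement 3).
        T = [[0] * N_VALUES for _ in range(length)]
        # P[v][w]: how often value v directly precedes value w (requirement 4).
        P = [[0] * N_VALUES for _ in range(N_VALUES)]
        for run in runs:
            values = [variable_values(item)[var] for item in run]
            for i, n in enumerate(values):
                T[i][n] += 1
            for v, w in zip(values, values[1:]):
                P[v][w] += 1
        t_cells = [cell for row in T for cell in row]
        # The diagonal of P is always 0 and is left out of the max/min.
        p_cells = [P[v][w] for v, w in product(range(N_VALUES), repeat=2) if v != w]
        total += (max(t_cells) - min(t_cells)) + (max(p_cells) - min(p_cells))
    return total


example = [int(ch) for ch in "486192735"]
print(balance_value([example]))  # 4, as in Table 7
```

Passing a list with several runs scores the whole set at once, which is the quantity minimized in the rest of this section.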
If we look at the number of runs that can be created with 9 items and that do satisfy the first two requirements, we find that there are 1512 possible runs. Generating these takes just a few seconds on a modern PC.

The exact number of 1512 distinct runs has to do with the second requirement. Because we do not want the same value to be repeated, the number of options for an item at a position in the ordering is limited. If we look at the generation of a run that starts with a “2”, the next item can only be a 4, 6, 7 or 9 (see Table 8), leaving only 4 choices instead of the 8 we could choose from without the requirement. The next step has only 3 choices, because the first option has now been taken out. This effect continues for each new item, with items that have been used taken out. Because these items are taken out, we do not know how to calculate the number of permutations short of writing out all options, but the generation of runs found a total of 1512 different runs.

      x   y   z
a     -   4   7
b     2   -   -
c     -   6   9

Table 8: All next items when 2 is chosen

The minimum value of V for a run is 4, as we have seen in the previous example, and the maximum we find is 7. The next step is to combine these runs to create sets. There are already 1512 × 1511 sets of two runs, and none of them comply with the third and fourth requirements yet, because there are not enough items to fill out all values of $T_{in}(F)$ and $P_{vw}(F)$. If we look at the values we get from our formula, most of the sets do not yet have a “good” ordering with a low balance-value. The minimal value we find is 5 and the maximum 14. Why the balance-value for these sets cannot be as low as that of individual runs, we do not know.

Sets of three runs should theoretically be the first to have balance-values that are significantly lower than those before, as it is possible to get values of 0 for $D(T_{in}(F))$.
The values of $D(P_{vw}(F))$ are harder to predict and may well be 0 too, because the total number of preceding item comparisons is divisible by the number of possible combinations (3 × 8 = 24 = 4 × 6). The results show that sets of 3 runs have a minimal value of 2 and a maximum of 20.

The minimal values of V for sets of 3 are partially as we expected. Both values of $D(T_{in}(F))$ are 0, while one of the values of $D(P_{vw})$ for A or X is 2 and the other is 0. Why these values cannot both be 0 at the same time we do not know; for our goal of calculating a set of runs that suffices for a usability test this does not matter yet. Apparently it is not possible to get a perfect succession of items for all independent variables at the same time with sets of three runs. These sets thus do not completely fulfill all requirements: they fulfill the first three but not the fourth. Subsequent examples will leave out the reasoning about requirements, but we will return to them in our discussion.

The number of sets with n runs can be found by calculating the number of ways to choose n of the 1512 runs: $\binom{1512}{n}$. For sets with 3 runs this is of the order of 1512 × 1511 × 1510 (the number of ordered selections; the binomial coefficient itself is a factor 3! = 6 smaller), and while this is not an exponential increase by definition, the number of combinations of the 1512 runs grows rapidly. This means that the computational time and disk space needed for finding the best balanced sets also increase with this factor. For these sets of 3 runs, we already need several hours to exhaustively calculate all balance values. And while our implementation is not optimal, even with optimizations, exhaustively calculating anything near a useful set for sufficient participants will require either a supercomputer or a lot of time.

3.2 Combining Sets
This almost exponential increase in computational complexity is the reason that we look to heuristics for a way of calculating sets containing higher numbers of runs.
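To make the growth concrete, the short sketch below evaluates both ways of counting sets of n runs drawn from the 1512 valid runs: ordered selections and unordered sets. The function name `count_sets` is a hypothetical helper; `math.perm` and `math.comb` are standard library functions (Python 3.8 or later).

```python
import math

N_RUNS = 1512  # number of valid runs reported for the 3 x 3 example


def count_sets(size, n_runs=N_RUNS):
    """Return (ordered selections, unordered sets) of `size` distinct runs."""
    return math.perm(n_runs, size), math.comb(n_runs, size)


for size in (2, 3, 6):
    ordered, unordered = count_sets(size)
    print(f"n = {size}: {ordered:.3e} ordered, {unordered:.3e} unordered")
```

For n = 3 the ordered count is 3,449,794,320 and the unordered count is 574,965,720, which illustrates why exhaustive evaluation already takes hours at this size.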
One of the ways we do this is to combine sets of runs instead of single runs. We do this with the sets of 3 runs with the lowest balance-values (V = 2) that we found in the previous step. By combining whole sets, we skip the intermediate sets of 4 and 5 runs and go on to the sets with 6 runs. By using only the sets with the lowest balance-values, we use about 0.0001% of all calculated sets of 3 runs (only 3744 sets). This also takes time off the calculation of sets of 3 runs: they still have to be created and their balance values have to be calculated, but the most time-consuming task of saving the sets to the hard drive is eliminated. This practice also uses a lot less disk space, which is another limiting factor for bigger sets.

When we add together two sets containing disjoint runs, we believe that the balance value for the new set will not exceed that of the individual sets together. This can be shown by taking two sets N and M that comply with our requirements and following the next set of equations for each value of $T_{in}$ and $P_{vw}$. The values for the different sets N and M will be denoted with an extra (N) or (M) in the formulas.

\[ V(N) = \sum D(N), \qquad V(M) = \sum D(M) \]

What we want to prove is that the balance-value of the combination of N and M, V(N + M), is less than or equal to the individual balance-values of the sets added together. This we do by going down to the level of the difference-value for individual variables and combining the minimum and maximum values of the variables.

Likewise, for easy representation, we will omit the specifying variables for the difference function and write $D(N) = \max(N) - \min(N)$ when we actually mean $D(T_{in}(F))(N) = \max(T_{in}(F))(N) - \min(T_{in}(F))(N)$, or the same formula for $P_{vw}$. D(N) denotes the difference value for all values of $T_{in}$ and $P_{vw}$ in set N.

\[
\begin{aligned}
D(N) &= \max(N) - \min(N) \\
D(M) &= \max(M) - \min(M) \\
D(N) + D(M) &= (\max(N) - \min(N)) + (\max(M) - \min(M)) \\
            &= (\max(N) + \max(M)) - (\min(N) + \min(M))
\end{aligned}
\]

The maximum and minimum of the combined set are not necessarily equal to the sums of the individual maxima and minima. This follows from the way these values are picked from the matrix of the variable: the maximum of N and the maximum of M need not sit in the same cell. For each variable individually, the maximum of set N can at most be increased by the maximum of set M. For instance, $\max(T_{in})$ of set N can be combined with at most $\max(T_{in})$ of set M. This can best be seen in the example: in Table 10 the maxima of 2 and 3 of the two sets combine to a maximum of only 3, so that the balance-values of 4 and 5 are exchanged for a value of 6, not 9, when both sets are combined. The minimum of set N will be increased by at least the minimum of set M. This can also be seen in the example, but less clearly, because 0 + 0 also adds up to the new value of 0.

\[
\begin{aligned}
\max(N + M) &\le \max(N) + \max(M) \\
\min(N + M) &\ge \min(N) + \min(M)
\end{aligned}
\]

These two rules together make the D value of the combination equal to or smaller than the sum of the individual D values. So D(N + M) will be less than or equal to the combined values of the previous two sets. And if we sum all values of D, V(N + M) should also be less than or equal to the two balance values added together.

\[
\begin{aligned}
D(N + M) &\le D(N) + D(M) \\
V(N + M) &\le V(N) + V(M)
\end{aligned}
\]

The complete example of the calculations involved in combining the two sets shown in Table 9, with balance-values of 4 and 5, can be found in Tables 10, 11 and 12 and the equations in Table 13 after them. The equations show the values that are combined in the formula for V and show that the individual values added are higher than the value for the combined set. It can be seen in the equations that the maximum of the two sets together is not always as high as the individual maxima added together.

Location:   1  2  3  4  5  6  7  8  9
Set N:      9  2  4  8  3  5  7  6  1
            1  9  4  2  7  5  3  8  6
            5  7  3  4  2  9  1  6  8
Set M:      3  4  2  9  5  7  6  1  8
            4  9  2  6  7  5  1  8  3
            8  3  5  7  6  1  9  2  4

Table 9: Set N and M (one run per row)

            Set N        Set M        Combined
Location    a  b  c      a  b  c      a  b  c
1           1  1  1      1  1  1      2  2  2
2           1  1  1      1  0  2      2  1  3
3           2  0  1      0  3  0      2  3  1
4           1  2  0      1  0  2      2  2  2
5           1  1  1      1  1  1      2  2  2
6           0  2  1      2  1  0      2  3  1
7           2  0  1      1  0  2      3  0  3
8           0  1  2      1  2  0      1  3  2
9           1  1  1      1  1  1      2  2  2

Table 10: Occurrences for variable A

            Set N        Set M        Combined
Location    x  y  z      x  y  z      x  y  z
1           1  1  1      1  1  1      2  2  2
2           1  0  2      1  1  1      2  1  3
3           1  2  0      2  1  0      3  3  0
4           1  1  1      0  1  2      1  2  3
5           2  0  1      0  2  1      2  2  2
6           0  2  1      1  1  1      1  3  2
7           2  0  1      1  1  1      3  1  2
8           0  2  1      2  0  1      2  2  2
9           1  1  1      1  1  1      2  2  2

Table 11: Occurrences for variable X

            Set N        Set M        Combined
            a  b  c      a  b  c      a  b  c
a           0  4  4      0  4  4      0  8  8
b           4  0  4      4  0  4      8  0  8
c           4  4  0      4  4  0      8  8  0

            x  y  z      x  y  z      x  y  z
x           0  4  4      0  4  4      0  8  8
y           4  0  4      4  0  4      8  0  8
z           4  4  0      4  4  0      8  8  0

Table 12: Preceding variables for A and X

\[
\begin{aligned}
V(N) &= 4 = (2 - 0) + (2 - 0) + (4 - 4) + (4 - 4) \\
V(M) &= 5 = (3 - 0) + (2 - 0) + (4 - 4) + (4 - 4) \\
V(N + M) &= 6 = (3 - 0) + (3 - 0) + (8 - 8) + (8 - 8)
\end{aligned}
\]

Table 13: Balance-value for set N + M

From these overviews, it is not hard to see that the values for $T_{in}(N)$ and $T_{in}(M)$ in our example can be added together in such a way that they yield a minimum and maximum as described above. For the values of $P_{vw}(N)$ and $P_{vw}(M)$ this is equally true.

After combining all optimal sets of three (minimal balance value of 2) and finding sets of six runs with values between 0 and 4, we find that this reasoning holds for our cases with sets of 3. We are confident that we can continue with adding sets of three to the newly created sets, and continue doing this to find larger sets of runs with low balance-values. To be completely sure that we are allowed to do this for all sets that comply with the requirements, the theoretical basis of this combining should be examined more thoroughly.

3.3 Pruning
The next obstacle comes up when we continue with this approach. Because we generate sets of an increasing number of runs, these sets increase in disk space.
And as the total number of possible sets increases, so does the number of sets with a minimal balance-value. While this is still a small fraction of the total number of sets, it also increases exponentially. And though the first of these problems is just a linear increase, together with the sheer number of sets it adds up to an overload of our systems.

One solution is to prune the sets that are less likely to produce good results and take just a small number of sets from the previous round to supply combinations for the new sets (so-called “seeding”). But now care must be taken to keep enough diversity in the chosen sets, so that we do not keep combining the same sets with each other. This indicates that there certainly is a minimum number of different sets that need to be used for recombining, or that an indicator of diversity must be introduced.

For an idea of how this can be solved, we look at the research that has already been done on selection schemes for genetic algorithms [4]. Genetic algorithms work by taking the best “genes” on to the next iteration of the program. And while we did not set out to create a genetic algorithm, the setup of our algorithm is quite similar to one. All our minimal-value sets can potentially be included in the end result, and there are no specific criteria to test these sets against each other. So although the selection schemes give good indications of what solutions can be tried, none fit our problem exactly. The nearest we come to a solution is to include all generated sets or a random number of these best solutions. This is exactly what we do: we take a random selection of 3000 sets from the best results to continue in our next iteration of recombination. But care must still be taken to take a diverse enough selection and not to rely on randomized results being balanced, as this is the trap we have been trying to avoid from the beginning. How such problems might best be avoided is discussed later.
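The seeding step described above can be sketched as follows. This is an illustration under assumptions, not the original program: scored sets are represented as hypothetical `(balance_value, set)` pairs, `select_seeds` draws a random sample from the sets with the lowest balance-value, and `select_evenly_spaced` shows the equally interspaced alternative that the text turns to next.

```python
import random


def select_seeds(scored_sets, n_seeds=3000, rng=None):
    """Keep only the sets with the lowest balance value and sample seeds from them."""
    rng = rng or random.Random()
    best_value = min(value for value, _ in scored_sets)
    best = [runs for value, runs in scored_sets if value == best_value]
    if len(best) <= n_seeds:
        return best
    return rng.sample(best, n_seeds)


def select_evenly_spaced(best_sets, n_seeds):
    """Alternative: take equally interspaced sets instead of a random sample."""
    if len(best_sets) <= n_seeds:
        return list(best_sets)
    step = len(best_sets) / n_seeds
    return [best_sets[int(k * step)] for k in range(n_seeds)]


# Toy data: ten placeholder sets with V = 2 and one with V = 4.
scored = [(2, f"set{i}") for i in range(10)] + [(4, "worse")]
print(select_seeds(scored, n_seeds=3, rng=random.Random(0)))
print(select_evenly_spaced([s for v, s in scored if v == 2], 5))
```

Passing an explicit `random.Random` instance keeps a run reproducible, which is convenient when comparing selection schemes.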
For now, it suffices to take equally interspaced sets from the sets with the best balance-values.

With this setup of only 3000 seeding sets, each iteration still increases linearly in size, because the sets themselves grow. The time it takes to generate the next group of sets with minimal balance-values is reduced to a few minutes, while still creating new solutions to use in usability tests. The different solutions all have a minimal balance-value of 0 or 2 (more on the reason for this in the discussion), and balanced sets with up to 52 runs have already been found within a few hours of computation time, which is very reasonable. These solutions solve our problem of creating sets of runs that are “good enough” to use in a within-subject design with enough participants.

4. DISCUSSION
An interesting observation is that in our setup with two variables of three values each, sets with a number of runs divisible by six have minimal balance-values of 0, and those that are not (but are divisible by three) have a minimal value of 2. This means that sets with a number of runs divisible by 6 comply with all requirements, while those divisible by 3 and not by 6 do not. Because our search is not exhaustive, we cannot prove this, but we expect that sets of a multiple of six runs can minimize all values of $D(P_{vw}(F_m))$, while sets that are not divisible by six cannot.

We now define the difference for each variable included in the items as the minimum subtracted from the maximum. This intuitively feels like a good indicator of how far from the best solution the ordering still is, and it works. But there might be better functions that can select the right sets of runs (computationally) quicker. One idea is to still use the minimum and maximum values and their difference, but to square these differences so that higher differences increase quicker. This would introduce a sharper cutoff of the less-than-optimal sets and probably lead to a lower computation time.
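A small numerical check shows what changes when the difference is squared. The sketch below uses hypothetical occurrence counts for two tiny sets (two locations, two values) that are not taken from the experiments; `spread` is formula (2) and `squared_spread` is the proposed variant. With the plain difference the combined value does not exceed the sum of the parts, whereas with the squared difference it can.

```python
def spread(matrix):
    """D(G) = max(G) - min(G) over all cells of a count matrix."""
    cells = [cell for row in matrix for cell in row]
    return max(cells) - min(cells)


def squared_spread(matrix):
    """Hypothetical variant: the squared difference."""
    return spread(matrix) ** 2


def add(m1, m2):
    """Cell-wise sum of two count matrices, as when two sets are combined."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


# Hypothetical occurrence counts: rows are locations, columns are values.
T_N = [[1, 0], [0, 1]]
T_M = [[1, 0], [0, 1]]
combined = add(T_N, T_M)

print(spread(combined), spread(T_N) + spread(T_M))                          # 2 2
print(squared_spread(combined), squared_spread(T_N) + squared_spread(T_M))  # 4 2
```

Under the squared variant, the bound that justifies combining sets in Section 3.2 would therefore have to be re-derived.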
Other formulae for defining the best run or sets of runs need to be researched. An implication of such a change in the formula could be that a combination of two sets no longer has a balance-value that is equal to or less than the sum of the values of the two individual sets. This needs to be checked whenever the formula is changed.

As we mentioned briefly, our approach to solving the problem is quite similar to how genetic algorithms are built. This is not surprising, as genetic algorithms are also known as a form of global search heuristics [3], and ours is also a global search problem. A genetic algorithm tries to mimic the ideas of evolution by trying out random approaches to a problem and taking the best to combine into a new approach. This is then done in multiple iterations, and each time the best approaches are used to create new approaches, until in the end a sufficiently good solution is generated.

A genetic algorithm usually follows four phases: Initialization, Selection, Reproduction and Termination. For us these are: generating individual runs that comply with the run requirements (Initialization); calculating the balance-values of all sets and selecting a random new group from the sets with minimal values (Selection); and combining these sets to create new ones (Reproduction). The last two steps are repeated until we reach sets that are big enough for the group of participants we have in mind, at which point the program terminates (Termination).

One of the optimizations that could still be made is to test all sets that are chosen in our “Selection” phase, to see whether we have a sufficiently different group of runs. This would help against steering the solution too much towards a local minimum with too few different runs. For the number of participants we needed, this was not a problem; for higher numbers it could cause the program to fail to add diverse enough sets together.
More research from the domain of genetic algorithms should be reviewed to help improve upon these findings.

In our examples we showed that our approach works for a 3 × 3 within-subject design: a design with two independent variables of three values each. While we have tried to keep the explanations as general as possible, some solutions to problems that we encountered might not work with other numbers of values per variable. Extending the number of values per independent variable should not give any trouble. If we want to calculate orderings for a 4 × 3 or 4 × 4 design, the computation time will increase, but such an increase should still be manageable.

Adding an extra independent variable means adding an extra set of equations to the formula. This in itself is not a big problem, but the number of calculations will increase exponentially once more. A 3 × 3 × 3 design with 27 items, for instance, already has several million base runs. A 2 × 2 × 2 design will probably still be possible to use, but here other problems arise. We expect that, because we linearize the items, the time and disk space needed will depend on the number of base runs for a setup. So we do not say it is impossible to calculate such orderings, but more research would be needed to adapt our work and make it useful for higher numbers of values.

This means that we cannot use this approach for all n × n × . . . designs. There is also a lower bound, because a 2 × 2 design cannot be completely counterbalanced. A 2 × 2 design has four possible conditions: ax, bx, ay and by. These cannot be ordered in such a way that none of the variables repeats itself. The lower bound is a 2 × 3 or 3 × 2 design, because these can satisfy the second requirement. If we let go of the requirement that forbids repetition, then any (not too big) n × n design can be used. For an experimental setup with just one independent variable our solution also works.
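The feasibility claims above can be checked by brute force for small designs. The sketch below assumes that the repetition requirement means that two consecutive conditions may not share a value on any variable; the function name `count_base_runs` and this reading of the requirement are assumptions made for illustration.

```python
from itertools import product

def count_base_runs(levels):
    """Count orderings of all conditions without adjacent repetition.

    levels: number of values per independent variable, e.g. (2, 3).
    Two consecutive conditions are allowed only if they differ on
    every variable (assumed reading of the repetition requirement).
    """
    items = frozenset(product(*(range(n) for n in levels)))

    def extend(last, remaining):
        if not remaining:
            return 1  # every condition has been placed
        total = 0
        for item in remaining:
            if last is None or all(a != b for a, b in zip(last, item)):
                total += extend(item, remaining - {item})
        return total

    return extend(None, items)

print(count_base_runs((2, 2)))  # no valid ordering exists
print(count_base_runs((2, 3)))  # valid orderings exist
print(count_base_runs((3,)))    # single variable: every ordering is valid
```

The search backtracks as soon as a repetition would occur, but the number of orderings still grows very quickly with the number of conditions, so this check is only practical for small designs.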
Again, care has to be taken not to try to calculate the runs for too many values, but as most designs with a single variable have just a few values, this should not be a problem.

We have been calculating large sets of runs, which takes quite a while. Another solution to our problem is to create several base sets for our 3 × 3 design with a value of 0 for V, and to use these sets several times to create a larger set. This would of course multiply the context effects introduced by the ordering by the number of times the set is used. But as this would happen over all runs, the net effect will still be 0. This can also be done for other numbers of independent variables and numbers of values per variable. In this way a table can be created in which an experimenter can look up a base ordering for an experimental design. The experimenter can then take this base ordering and reuse it as often as needed to cover all participants.

5. CONCLUSION

By using a heuristic approach to calculating a balance-value, we can calculate enough orderings of the independent variables to create a sufficient set to use in an experimental setting. This approach works for n × n designs and requires a design of at least 2 × 3 values. The most important restriction on the number of variables and the number of values they take is the number of conditions that their combinations create. As long as the number of conditions is relatively small, this approach can be used. As a rule of thumb, the maximum number of conditions is about 20. We hope that the setup as we describe it can help other researchers to balance the variations introduced by the ordering of the independent variables, and so help them reach stronger conclusions. To make this possible, it would be good if an easy-to-use GUI implementation of our approach were created, so that all researchers can generate orderings for their user tests.

6. REFERENCES

[1] Barbieri, M., and Stinstra, E. Automatically generated video previews: user study. In Human Information Processing Colloquium (2006), Philips Research, Eindhoven, The Netherlands.
[2] Bradley, J. Complete counterbalancing of immediate sequential effects in a Latin square design. Journal of the American Statistical Association 53, 282 (1958), 525–528.
[3] Goldberg, D. Genetic Algorithms in Search and Optimization. Addison-Wesley, 1989.
[4] Goldberg, D., and Deb, K. A comparative analysis of selection schemes used in genetic algorithms. Foundations of Genetic Algorithms 1 (1991), 69–93.
[5] Greenwald, A. Within-subjects designs: To use or not to use? Psychological Bulletin 83, 2 (1976), 314–320.
[6] Kirk, R. Experimental design. Sage Publications Ltd, 2009.
[7] Pearl, J. Heuristics–intelligent search strategies for computer problem solving. Addison-Wesley Publishing Co., Reading, MA, 1984.
[8] Winer, B. J., Brown, D. R., and Michels, K. M. Statistical Principles In Experimental Design. McGraw-Hill, 1991.