A study on the equivalence of controlled and uncontrolled visual experiments

Silvia Zuffi (a), Carla Brambilla (b), Reiner Eschbach (c), Alessandro Rizzi (d)*

(a) ITC, Consiglio Nazionale delle Ricerche, Milano, Italy
(b) IMATI, Consiglio Nazionale delle Ricerche, Milano, Italy
(c) Xerox, Webster, USA
(d) Università degli Studi di Milano, Milano, Italy

ABSTRACT

Visual experiments, if performed in the traditional way, require asking participants to go to a laboratory, where displays are calibrated and illumination conditions are set appropriately. In recent years there has been an increasing interest in performing experiments "out of the lab", with the aim of reducing costs, increasing the number of participants, and diversifying the population. The equivalence of the responses of visual experiments performed in controlled and uncontrolled contexts is an open question which calls for research. In this work we analyze more deeply the equivalence between "controlled" and "uncontrolled" experiments that we found in our previous study on the comparison of visual preferences for printed images. In particular, we are interested in understanding whether, and to what degree, uncontrolled experiments actually require more participants in order to average out the effects of the many uncontrolled parameters which may affect the response. In addition, we explore the relationship, if any, between our previous conclusions and the difference in the attributes of the images for which the observers were asked to express their preference.

Keywords: Visual experiments, viewing conditions.

1. INTRODUCTION

Visual experiments require subjects to perform a visual task, for instance matching colors or reading a sequence of words, or to express a subjective evaluation, such as assigning a preference, setting a rank, or detecting differences between images or videos. These experiments, in particular those requiring a judgment, are very important to academia and industry.
At present, subjective evaluation is the only recognized method of determining the perceived quality of displayed images, videos or prints, since replacing a subjective judgment expressed by an observer with a measure of quality is a very complex and still unresolved issue. Indeed, image content plays a decisive role in perceived quality, and no existing quality measure takes it properly into account.

Visual experiments, if traditionally performed, require asking participants to go to a laboratory, where viewing conditions are set appropriately. For experiments involving images or videos, the monitor is calibrated, color reproduction is managed, and the illumination conditions are set accordingly. When prints are being judged, they are observed in suitable illuminating systems, under standardized light sources and against a neutral background.

In recent years there has been a growing interest in performing visual experiments outside laboratories, with the aim of reducing costs, increasing the number of participants, and diversifying the population. This trend emerged in the context of display media, with the recognition of the Web as a very attractive tool. Indeed, the Web allows reaching many participants from all over the world but, on the other hand, it constitutes a highly uncertain environment: not only are image reproduction and viewing conditions uncontrolled, but the investigator is also absent and cannot verify that the visual task is executed correctly. Care is suggested in implementing visual experiments on the Web: for display media, given the lack of control over image reproduction, such experiments are claimed to be suitable for investigations that do not require a careful setting of appearance parameters like color, contrast, size or resolution [1].
In this context, the viewing conditions could play a secondary role with respect to the image reproduction settings, because the observer adapts more to the display than to the environment [2].

Color Imaging XIV: Displaying, Processing, Hardcopy, and Applications, edited by Reiner Eschbach, Gabriel G. Marcu, Shoji Tominaga, Alessandro Rizzi, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 7241, 724102 · © 2009 SPIE-IS&T · CCC code: 0277-786X/09/$18 · doi: 10.1117/12.805854 SPIE-IS&T/ Vol. 7241 724102-1

In the case of print media, on the contrary, visual appearance depends on viewing parameters like illumination conditions, viewing geometry and stimulus surround. However, particularly in the case of subjective evaluations, the effect of viewing conditions, and thus the extent of the required control, is rarely known. In these cases, if the experiments are performed in an uncontrolled manner, the different viewing conditions could act like a source of noise in the recorded data. The effect of this noise could be compensated by a high number of participants, producing even more significant results that are not biased by a specific set of viewing conditions. One may expect this to be a task-dependent effect.

The equivalence of the responses of visual experiments performed in controlled and uncontrolled contexts is an open question which calls for research. There are at least two different aspects that one could consider. The first concerns the possibility of replacing controlled with uncontrolled experiments, with the purpose of reducing time and costs. The second is related to the interpretation of the results of a visual experiment performed in the laboratory. Here the question is: to what extent is a controlled experiment representative of a real context? A lack of equivalence between results obtained in controlled and uncontrolled conditions could indicate that the response of the subjects depends on the viewing settings.
Considering that in the majority of cases one is dealing with images, either displayed or printed, that will be observed in general environments, a dependency on viewing conditions is something one may wish to keep limited. Moreover, it has to be kept in mind that a lack of equivalence could also be a symptom of a flawed experimental design.

Very few works compare results obtained in the laboratory with those obtained from uncontrolled tests. Zuffi et al. [3] investigated the equivalence between controlled and Web-based visual experiments in a study concerning the readability of color combinations. In that case, the uncontrolled parameters most likely to influence the response are the type of monitor and the display and screen settings, including resolution. In a previous study we investigated the equivalence between "controlled" and "uncontrolled" in an experiment of visual preference of prints [4]. In that experiment, which was carried out in parallel on different copies of an image dataset, we carefully controlled the printing of the images, in order to perform the tests on the same source material. Thus, the decisive difference between the controlled and uncontrolled experiments was the viewing conditions, which were highly variable in the uncontrolled experiment. The results of both studies support the hypothesis of equivalence.

In the work presented here we consider again the previously performed experiment of visual preference of prints and address two new questions. The first concerns sample size. More specifically, our aim here is to provide, in the light of our previous results, some indications for the choice of the sample size of the uncontrolled experiment. Since the involvement of subjects may require some sort of reward, often in the form of a small amount of money, the execution of a visual experiment is an expensive task.
Sample size is therefore important for economic reasons, but there are also statistical issues that have to be taken into account.

In addition, we look for the relationship, if any, among the equivalence of the experiments, image preferences and image differences. In the previous study we asked the participants to express a preference between pairs of similar images, obtained by applying some kind of alteration, like contrast enhancement or color shift, to a set of photos, and we designed the dataset at random in order to obtain a generic benchmark. The strength of the equivalence between the preferences given in the controlled and uncontrolled experiments was, in general, different for each image pair. The question then is: is there a relationship between the results we obtained and the difference in the attributes of the images in each pair? In our opinion, an answer to this question would allow the results to be generalized and could be useful for guiding the design of future studies.

2. OUR PREVIOUS WORK

In the experiment of visual preference that we previously performed [4], and to which we refer in the present work, a large set of images randomly selected from photo collections was considered and, for each image, a slightly different version was produced. Visual attributes like contrast and color were changed by applying different algorithms with different parameters. The result of this procedure was a large set of image pairs, each composed of the original image and its modified version. In some cases, the image pair included two modified images, resulting from the application of the same algorithm with different parameters. A set of about twenty of these image pairs was randomly selected and printed, two images per page, in what we called the "Book of Samples". Two pages of the book are displayed in Figure 1.

Figure 1. The Book of Samples.
Each page is composed of two images that differ in color and/or contrast. Some of the images are printed in portrait (left), some in landscape orientation (right).

In the experiment each subject was asked to express a preference for each of the image pairs in the book. The hypothesis was that the random selection of the images, together with the fact that they were altered at random, gave rise to a generic benchmark whose only purpose was to attest the correspondence between visual preferences when the only variable is the viewing condition.

The experiment was conducted in controlled and uncontrolled conditions. The controlled experiment was executed in the laboratory, where the selection was performed in a Macbeth Spectralight viewing booth under a daylight light source. The number of observers who took part in the experiment was 23, in accordance with the ITU-R Rec. BT.500 guidelines [5], which require a minimum of 15 observers for image quality assessment. All the observers were color-normal.

The uncontrolled experiment was similar to the controlled one, but it was performed in generic environments. In this case 144 observers were involved, and the viewing conditions included indoor and outdoor settings, with natural or artificial light at different intensity levels. Since we did not reward the participants, the total cost of the experiment was independent of their number; we therefore did not set a limit on the number of subjects, but considered all the people we were able to involve in a fixed interval of time. The greater number of participants in the uncontrolled experiment is partly due to the fact that we printed many copies of the Book of Samples, so that the uncontrolled experiment could be carried out in parallel. Moreover, we found that it is easier to get people involved by "taking the experiment to the observer" than by taking people to the laboratory, particularly in the absence of a reward.
In the controlled experiment, the images were observed at a fixed viewing geometry, given that the book was placed at the base of the light booth. In the uncontrolled experiment, the subjects were free to observe the images in any viewing geometry. In the Book of Samples some image pairs have very similar images and some have very different ones: participants were explicitly asked not to give a preference in the cases in which they were not sure about it.

The two experiments were compared in terms of the preference given to the top image in the pair, separately for each pair of images. More specifically, for each pair of images we tested the null hypothesis that the probability of preferring the top image in the controlled experiment is equal to the probability of preferring the same image in the uncontrolled one. The hypothesis could not be rejected, with the only exception of two cases, namely pairs 17 and 21, for which the test provided a very significant result in terms of difference in probability. A further analysis that we carried out to understand this last result showed that it was mainly due to the data collected under artificial light in the uncontrolled experiment. In at least one case the equivalence was indeed restored if only data from the uncontrolled experiment under daylight were considered. The results we obtained support the hypothesis of the equivalence of the responses of visual experiments performed in controlled and uncontrolled conditions and encouraged us to refine the study.

Figure 2 shows a plot of the differences between the observed percentages of preferences given to the top image in the controlled and uncontrolled experiments. The mean (m) and standard deviation (sd) of the differences are 0.004 and 0.09, and the first and third quartiles are -0.041 and 0.055, respectively. As can be seen from the plot, all the differences lie between m − 2 sd and m + 2 sd and almost all lie between -0.10 and 0.10.
The plot does not include pairs 17 and 21.

[Figure 2 plot: x-axis, image pair (1 to 16 and 18 to 20); y-axis, difference in preference % (controlled − uncontrolled), from -0.2 to 0.2; horizontal reference lines at m, m − 2 sd and m + 2 sd.]

Figure 2. Plot of the differences between the percentages of preferences given to the top image in each pair in the controlled and uncontrolled experiments.

3. THE SAMPLE SIZE ISSUE

As previously stated, one of the reasons for studying the equivalence of the responses of visual experiments performed under fixed or variable viewing conditions concerns the possibility of replacing controlled with uncontrolled experiments. But what is the advantage of leaving the laboratory if the two kinds of experiments produce similar results? We think that a good reason can be an advantage in time and cost. In terms of time, uncontrolled experiments can indeed be highly advantageous since, in general, uncontrolled tests can be executed in parallel, which reduces the overall execution time of the experiment. As far as cost is concerned, an uncontrolled experiment is preferable as long as the following inequality is satisfied:

$$N_{\mathrm{uncontrolled}}\, C_{\mathrm{uncontrolled}} \le N_{\mathrm{controlled}}\, C_{\mathrm{controlled}}, \tag{1}$$

where $N$ is the sample size of the experiment, that is, the number of tests performed, and $C$ is the cost of each test. We expect an uncontrolled test to be cheaper than one carried out in a laboratory, given the cost of the laboratory equipment and its maintenance, and given that in the uncontrolled experiment the subject does not have to reach the lab. If this hypothesis holds, the uncontrolled experiment turns out to be preferable if the number of participants in the two experiments is equal or if, when the number of participants in the uncontrolled experiment is larger, it is not so large as to reverse the inequality. But cost cannot be the only factor driving the choice of the sample sizes; constraints may derive from statistical issues that have to be taken into account.
For instance, one might want to ensure the same level of "precision" in the results of the two experiments, where by precision we mean here the dispersion of some measure of the observed data. In our study, since we based the comparison between the two kinds of experiments on the comparison between the probabilities of choosing the top image of the pair, a natural choice for measuring "precision" is the variance of the estimators of these probabilities. Hereafter the probabilities are denoted by $p_c$ and $p_u$ and their estimators by $P_c$ and $P_u$. On account of this, we first revisit our results in order to verify whether the equality

$$\mathrm{Var}(P_c) = \mathrm{Var}(P_u) \tag{2}$$

is satisfied or, if not, what a suitable number of tests for that would be. As already pointed out, in our previous work we did not set the number of subjects a priori, but considered the people we were able to involve in a fixed interval of time. Since we consider the controlled experiment as a pilot experiment in which the number of participants is in accordance with standard recommendations, our problem reduces to understanding which number of tests in the uncontrolled experiment satisfies equality (2), in case it is not already satisfied. Given that $\mathrm{Var}(P_c) = \sigma_c^2 / N_c$ and $\mathrm{Var}(P_u) = \sigma_u^2 / N_u$, where $\sigma_c^2 = p_c (1 - p_c)$ and $\sigma_u^2 = p_u (1 - p_u)$ are the population variances in the controlled and uncontrolled settings respectively, equality (2) becomes

$$\frac{p_c (1 - p_c)}{N_c} = \frac{p_u (1 - p_u)}{N_u}, \tag{3}$$

and therefore, in practice, the equality we are faced with is

$$\frac{\hat{p}_c (1 - \hat{p}_c)}{N_c} = \frac{\hat{p}_u (1 - \hat{p}_u)}{N_u}, \tag{4}$$

where $\hat{p}_c$ and $\hat{p}_u$ are the estimates of $p_c$ and $p_u$, namely the observed percentages of preferences for the top image of the pair in the two experiments.
To achieve our aim, we first look at the estimates $s_c^2$ and $s_u^2$ of the population variances $\sigma_c^2$ and $\sigma_u^2$ that we obtained in the two experiments. These values are provided in Table 1, together with the number of tests performed.

              Controlled experiment    Uncontrolled experiment
Image pair    s^2        N             s^2        N
1             0.2379     23            0.2475     133
2             0.2275     20            0.2496     123
3             0.1411     23            0.1411     142
4             0.2275     23            0.2436     139
5             0.2304     22            0.1875     137
6             0.2059     21            0.1875     140
7             0.2356     21            0.2419     117
8             0.1411     23            0.1476     139
9             0.25       20            0.25       129
10            0.1411     23            0.2275     142
11            0.2451     23            0.2496     142
12            0.2475     22            0.25       134
13            0.1716     23            0.1539     136
14            0.2491     17            0.2451     119
15            0.2059     21            0.2275     131
16            0.1659     19            0.1924     129
18            0.2419     22            0.2451     141
19            0.0475     22            0.12       137
20            0.2451     21            0.2379     135

Table 1. Estimates of the population variances in the controlled and uncontrolled experiments and number of tests performed. The number of tests varies within the same type of experiment because in some cases the observers did not express a preference.

Looking at the table we see that for most of the image pairs the two values are very similar, which was expected in the light of the equivalence. With respect to the aim we are pursuing here, this makes immediately clear that:

- for almost all the image pairs, it would be sufficient for the uncontrolled experiment to have a sample size approximately equal to that of the controlled experiment. The exceptions are pairs 10 and 19, where the larger differences between the values imply that the two sample sizes must be different;
- the uncontrolled experiment is over-sized.

The plots in Figure 3 show, for image pairs 10 and 19, the dependence of the estimate of $\mathrm{Var}(P_u)$ on the sample size of the uncontrolled experiment and compare it with the estimate of $\mathrm{Var}(P_c)$. It can be seen that for pair 10 the sample size for which equality (4) is satisfied lies between 35 and 40, whereas for pair 19 it lies between 55 and 60.
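The sample sizes quoted for pairs 10 and 19 can be checked by solving equality (4) for $N_u$. The following is a minimal sketch, not the authors' procedure: the function name `matching_uncontrolled_size` is hypothetical, and the inputs are the Table 1 entries for the two pairs.

```python
import math


def matching_uncontrolled_size(s2_c: float, n_c: int, s2_u: float) -> float:
    """Solve equality (4), s2_c / N_c = s2_u / N_u, for N_u.

    s2_c and s2_u are the variance estimates p_hat * (1 - p_hat)
    for the controlled and uncontrolled experiments.
    """
    return n_c * s2_u / s2_c


# Values copied from Table 1: pair -> (s2_c, N_c, s2_u)
table1 = {
    10: (0.1411, 23, 0.2275),
    19: (0.0475, 22, 0.12),
}

for pair, (s2_c, n_c, s2_u) in table1.items():
    n_u = matching_uncontrolled_size(s2_c, n_c, s2_u)
    print(f"pair {pair}: N_u ~ {n_u:.1f} (round up to {math.ceil(n_u)})")
```

Under these inputs the result is roughly 37 tests for pair 10 and roughly 56 for pair 19, which falls inside the intervals read from the plots.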
The values depicted can easily be derived from the information provided in Table 1, and correspond to those obtainable by randomly extracting a large number of subsamples of the given size, computing the estimate of the variance of interest for each of them, and then averaging.

[Figure 3 plots: two panels, "Image pair 10" and "Image pair 19"; x-axis, N (10 to 80); y-axis, estimate of variance; lines labeled "uncontrolled" and "controlled".]

Figure 3. Dependence of the estimate of $\mathrm{Var}(P_u)$ on the sample size of the uncontrolled experiment for pairs 10 and 19. The straight line refers to the estimate of $\mathrm{Var}(P_c)$.

For our data we can conclude that, in order to satisfy equality (4) for all the image pairs, approximately 60 tests would have been enough.

Let us now derive a general result for an a priori estimate of the sample size of the uncontrolled experiment. From equation (3) we obtain

$$N_u = \frac{p_u (1 - p_u)}{p_c (1 - p_c)}\, N_c, \tag{5}$$

from which we can compute $N_u$ for any given combination of $N_c$, $p_c$ and $p_u$. However, we are not really interested in all the possibilities. First of all, we are interested only in the cases in which $N_u / N_c > 1$, since our final goal is to satisfy inequality (1), in which, as previously observed, we may assume that an uncontrolled test is cheaper than a controlled one. Thus the cases of interest are those in which the preference expressed in the controlled experiment is sharper than the one expressed in the uncontrolled one. Indeed, in equation (5), $N_u / N_c > 1$ when $p_c < p_u$ if $p_u \le 0.5$, and when $p_c > p_u$ if $p_u \ge 0.5$. Secondly, since we are reasoning about equivalent experiments, we are not interested in large differences between the two probabilities, and we can reasonably assume an upper limit equal to the value we observed in our data, namely 0.20.
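Equation (5) can be evaluated directly for probabilities that differ by the 0.20 upper limit. The sketch below is illustrative only: the function name `size_ratio` and the grid of $p_c$ values are choices made for this example.

```python
def size_ratio(p_c: float, p_u: float) -> float:
    """Ratio N_u / N_c from equation (5).

    Undefined when p_c is exactly 0 or 1, because the controlled
    variance p_c * (1 - p_c) is then zero.
    """
    return (p_u * (1.0 - p_u)) / (p_c * (1.0 - p_c))


GAP = 0.20  # assumed difference p_c - p_u, the upper limit taken from the data

for p_c in (0.95, 0.90, 0.85, 0.80, 0.75, 0.70):
    p_u = p_c - GAP
    print(f"p_c={p_c:.2f}  p_u={p_u:.2f}  N_u/N_c={size_ratio(p_c, p_u):.2f}")
```

The ratio grows quickly as $p_c$ approaches 1, because the controlled variance in the denominator shrinks towards zero while the uncontrolled variance does not.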
Table 2 provides some combinations of $p_c$ and $p_u$ whose difference is 0.20, together with the resulting ratios $N_u / N_c$. It can be seen that the highest value for the ratio $N_u / N_c$ is about 4, and we think that this is also the highest value we should expect since, in our opinion, higher values may arise only in unlikely situations.

p_c        0.95    0.90    0.85    0.80    0.75    0.70
p_u        0.75    0.70    0.65    0.60    0.55    0.50
N_u/N_c    3.9     2.33    1.78    1.5     1.32    1.2

Table 2. Values of the ratio $N_u / N_c$ corresponding to some probabilities one could expect in the controlled experiment, assuming that the difference in probability between the two experiments is 0.2.

Figure 4 pictures the content of Table 2 when $N_c$ is 10 or 20.

[Figure 4 plot: x-axis, p_c (0.95 down to 0.7); y-axis, N_u (0 to 90); two curves, N_c = 20 and N_c = 10, for p_u = p_c − 0.2.]

Figure 4. Relationship between $N_u$ and $p_c$ when $p_u = p_c - 0.2$, that is, when the preference in the uncontrolled case is 20 percentage points less pronounced than in the controlled case. The two curves refer to two different sizes of the controlled experiment, 10 and 20 samples.

Clearly the above analysis can be exploited provided that one has some knowledge of $p_c$, which could come from a similar study already carried out in controlled conditions.

In summary, our study indicates that, in the worst-case scenario, satisfying equation (3) requires a number of uncontrolled tests at least 4 times that of the controlled experiment. This factor can be used jointly with inequality (1) to decide, on the basis of the cost associated with the different types of experiments, whether the execution of an uncontrolled experiment is more convenient. Obviously the investigator might want to take into account other statistical issues, first of all that of choosing a sample size that allows a given difference in preference to be detected, if this difference exists.

4. ANALYSIS OF THE IMAGE DIFFERENCES

The Book of Samples contains image pairs in which the images are very different, as well as pairs in which the differences can hardly be perceived. We recall that the design of the image pair collection was arbitrary, and the resulting set includes differences in terms of color, contrast, or both. Four different algorithms were applied to the original images to produce image pairs: a Retinex-like algorithm for contrast enhancement and color correction (A) [6]; spatial gamut mapping (B) [7]; a mild compression of the image color gamut (C); a rotation of the hue angle by 10 degrees (D). In some cases the images in the pair were produced by applying one of these algorithms to an original image, so the pair includes an unprocessed and a processed image. In other cases, limited to the application of the algorithm of type A, the pair includes two processed images; in these cases the same algorithm was applied with two different sets of parameters. Table 3 reports the type of algorithm applied for each of the pairs considered.

IMAGE PAIR 1 2 3 4 TOP IMAGE A D B A 5 BOTTOM IMAGE A A 6 7 A C 8 9 10 A A C A 11 12 13 14 D D A C B 15 16 18 19 D C A B C A 20 D

Table 3. Type of algorithm applied to each image in the Book of Samples (A = Retinex-like algorithm, B = Spatial gamut mapping, C = mild compression, D = hue rotation).

As pointed out in the introduction, one of the aims of the present study is to analyze the results of equivalence we previously obtained in terms of the difference in the attributes of the images in each pair. At first glance, it can be observed that, when the algorithm of type A was applied, the differences were very variable: in some cases the images had apparently the same colors but different contrast, especially in dark areas (4, 8); in other cases there was apparently a slight global color change (1, 6, 18), which in one case was associated with the presence of extra details in one of the images (6).
In one case there was a very strong color difference (10). In the case of algorithm B, the alteration appears as an increased contrast in the image to which the algorithm was applied (3, 13, 19); for algorithms C and D one could notice a small color difference.

Figure 5, like Figure 2, depicts the differences between the observed percentages of preferences given to the top image of the pair in the controlled and uncontrolled experiments.

[Figure 5 plot: x-axis, image pair (1 to 16 and 18 to 20); y-axis, difference in preference % (controlled − uncontrolled), from -0.2 to 0.2; each point is labeled with the algorithms applied to the top and bottom image (A-A, none-A, B-none, none-B, C-none, none-C, D-none, none-D).]

Figure 5. Plot of the differences between the percentages of preferences given to the top image in each pair in the controlled and uncontrolled experiments. The plot also shows which algorithm was applied (A = Retinex-like algorithm, B = Spatial gamut mapping, C = mild compression, D = hue rotation).

Looking at the plot, it can be seen that for the pairs in which the processing produced a change in contrast (3, 4, 8, 13, and 19) the difference in preference between the controlled and uncontrolled experiments is small. The cases in which we recognized a change in color are instead evenly distributed within the range m ± 2 sd. The cases with the highest differences are 1, 10, and 18. We note that, consistently with the information provided by the plot, for these three pairs we obtained a result that, although not statistically significant, was markedly different from the others in terms of p-value when we tested the null hypothesis that the probability of preferring the top image in the controlled experiment is equal to the probability of preferring the same image in the uncontrolled one.
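The comparison between the two experiments for a single pair can be sketched as a two-proportion test. The text does not state which test was used, so the pooled normal-approximation z-test below is an assumption, and `two_proportion_z_test` is a hypothetical helper name. The inputs for pair 10 are also partly assumed: the controlled proportion 0.83 is inferred here from the Table 1 variance 0.1411 = 0.83 × 0.17, while the sample sizes and the uncontrolled proportion 0.65 are taken from Tables 1 and 4.

```python
import math


def two_proportion_z_test(p1: float, n1: int, p2: float, n2: int):
    """Two-sided pooled z-test of H0: the two proportions are equal.

    Returns (z, p_value). Relies on the normal approximation, which is
    rough for samples as small as the controlled experiment.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Pair 10: controlled proportion 0.83 is an ASSUMPTION inferred from Table 1.
z, p_value = two_proportion_z_test(0.83, 23, 0.65, 142)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

With these assumed inputs the p-value comes out a little under 0.10, which is in line with the description of pair 10 as not statistically significant yet markedly different from the other pairs. An exact test might be preferable given the small controlled sample.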
It is interesting to point out that, where the processing applied produced a difference in the spatial attributes, one of the two images in the pair appears blurred or with missing details in dark areas. Attributes like blur and absence of details could be recognized by the observers as defects, thereby producing a strong preference. On the contrary, the small color changes, caused specifically by the algorithms of type C and D, did not produce unnatural colors that could induce the observers to qualify an image as "wrong" and thus lead to a strong preference. This observation, in our opinion, extends also to pair number 10, where the color difference was indeed very strong, but it was not possible to assign an intrinsic color to the image content.

The above can be verified by looking at the effect of the kind of alteration on the preferences between the top and the bottom image in the pairs. Here the question is: is there a relationship between the algorithm applied and the preference expressed? Table 4 provides some insight into the question. The table reports, for each pair, the percentage of observed preferences for the top image in the uncontrolled experiment, the p-value obtained by testing the hypothesis that the probability of preferring the top image is equal to that of preferring the bottom one, and finally the 95% confidence interval for the probability of preferring the top image. Given that a very small p-value means a strong preference for one of the two images, we can observe that the algorithm of type B, that is the spatial gamut mapping, always produced a highly significant preference, in particular for the image to which it was applied in the image pair. Indeed it was applied to the top image in pairs 3 and 19, and to the bottom image in pair 13. For all three of these pairs the reported p-value is 0.
In the two other cases in which we apparently recognized a change in local contrast in dark areas, namely 4 and 8, the p-value is also small.

Image pair    Preference for top image    p-value    95% CI
1             0.45                        >0.10      0.37-0.54
2             0.48                        >>0.10     0.39-0.57
3             0.83                        0          0.76-0.89
4             0.58                        0.09       0.49-0.66
5             0.75                        0          0.67-0.82
6             0.75                        0          0.67-0.82
7             0.59                        0.06       0.49-0.68
8             0.18                        0          0.10-0.26
9             0.50                        >>0.10     0.41-0.59
10            0.65                        0.0003     0.57-0.73
11            0.52                        >>0.10     0.44-0.60
12            0.50                        >>0.10     0.42-0.58
13            0.19                        0          0.13-0.27
14            0.57                        >0.10      0.48-0.66
15            0.65                        0.0009     0.56-0.73
16            0.74                        0          0.66-0.82
18            0.57                        0.09       0.49-0.66
19            0.86                        0          0.73-0.91
20            0.61                        0.02       0.52-0.63

Table 4. The table reports, for each pair, the percentages of preferences given to the top image in the uncontrolled experiment, the p-value obtained by testing the hypothesis that the probability of preferring the top image is equal to that of preferring the bottom one, and the 95% confidence interval for the probability of preferring the top image.

Preference data in Table 4 are plotted in Figure 6. Points above the line indicate a preference for the top image of the pair; points below the line indicate a preference for the bottom image. It is clearly seen that the type D algorithm (rotation of hue angle) produced on the whole the least sharp preferences.

[Figure 6 plot: x-axis, image pair (1 to 16 and 18 to 20); y-axis, % preferences for top image, from 0 to 1; each point is labeled with the algorithms applied to the top and bottom image (A-A, none-A, B-none, none-B, C-none, none-C, D-none, none-D).]

Figure 6. Plot of the percentage of preferences given to the top image in the uncontrolled experiment. The plot also shows which algorithm was applied (A = Retinex-like algorithm, B = Spatial gamut mapping, C = mild compression, D = hue rotation).
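The p-value and confidence interval columns of Table 4 can be approximated from the observed proportion and the number of tests. The text does not specify the method used, so the sketch below assumes a normal-approximation test of $p = 0.5$ and a Wald interval; `preference_test` is a hypothetical name, and the sample size for pair 10 is the uncontrolled N from Table 1.

```python
import math


def preference_test(p_hat: float, n: int, z_crit: float = 1.96):
    """Test H0: p = 0.5 and build an approximate 95% Wald interval.

    Returns (p_value, (lower, upper)). Both rely on the normal
    approximation to the binomial, which is an assumption here.
    """
    se_null = math.sqrt(0.25 / n)  # standard error under H0: p = 0.5
    z = (p_hat - 0.5) / se_null
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided
    half_width = z_crit * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_value, (p_hat - half_width, p_hat + half_width)


# Pair 10: proportion from Table 4, uncontrolled N from Table 1.
p_value, (lower, upper) = preference_test(0.65, 142)
print(f"p-value = {p_value:.4f}, 95% CI = {lower:.2f}-{upper:.2f}")
```

For pair 10 this gives a p-value of a few ten-thousandths and an interval of roughly 0.57 to 0.73, in line with the corresponding row of Table 4. Because the tabulated proportions are rounded, other rows may differ slightly from the published values.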
In summary, we can observe that in our Book of Samples the application of algorithms that produced changes in the image contrast gave rise to pairs of images for most of which the preference for the more contrasted one was clearly stated in both experiments; in these cases the strong preference is associated with a close equivalence between controlled and uncontrolled experiments. On the contrary, we cannot observe the converse behavior if we look at cases of weak preference: indeed, when the probability of choosing one of the images is around 0.5 (1, 2, 9, 11, 12, 14, and 18), the difference in preference between the controlled and uncontrolled experiments varies.

5. CONCLUSION

In this work we have refined our analysis of the equivalence between controlled and uncontrolled experiments that we found in a previous work [4], where the task was to express a preference within each of a set of image pairs printed in what we called the Book of Samples. More specifically, we addressed two questions, one concerning the sample size of the uncontrolled experiment, the other the relationship between the equivalence of the two experiments and the attributes that set the differences in the pair of images under comparison.

The choice of the sample size of an experiment is important for statistical and economic reasons. If the experiment is performed in an uncontrolled setting, it is expected that the effects of the different viewing conditions are compensated by a high number of participants. On the basis of our results, we have found that, under a very conservative strategy, a number of uncontrolled tests at least 4 times the sample size of the controlled experiment is required in order to achieve the same level of precision in the results. This indication could be exploited in the design of an uncontrolled experiment of visual preference of prints, considering that a test performed in a laboratory is presumably more expensive than an equivalent test performed out of the lab.
The analysis of the apparent visible differences between the images in each pair has pointed out that our Book of Samples includes cases with differences in color and contrast. In the cases where a general blur or a locally reduced contrast was observed, which are attributes we suspect were considered defects by the observers, the preference for the more contrasted images was strong and associated with a strong equivalence between the controlled and uncontrolled experiments. Although significant, this result could be considered rather obvious. Of greater relevance, the Book of Samples includes many cases with image differences not ascribable to image defects, for which a less clear preference could be expected. In these cases, the control of the viewing conditions could play a role but, interestingly, the equivalence between the two experiments was verified all the same.

6. REFERENCES

1. H. van Veen, H. Bülthoff, G. Givaty, "Psychophysical Experiments on the Internet", Proceedings of the 2nd Tübinger Conference of Perception, H.H. Bülthoff, M. Fahle, K.R. Gegenfurtner, H.A. Mallot, Editors, Knirsch, Kirchentellinsfurt; Tübingen, Germany, 1999.
2. N. Katoh, K. Nakabayashi, M. Ito, S. Ohno, "Effect of Ambient Light on the Color Appearance of Softcopy Images: Mixed Chromatic Adaptation for Self-luminous Display", Journal of Electronic Imaging, 7(4), 1998.
3. S. Zuffi, P. Scala, C. Brambilla, G. Beretta, "Web-based vs. controlled environment psychophysics experiments", Image Quality and System Performance IV, Proceedings of SPIE 6494, L. C. Cui, Y. Miyake, Editors, San Jose, 2007.
4. S. Zuffi, C. Brambilla, R. Eschbach, A. Rizzi, "Controlled and Uncontrolled Viewing Conditions in the Evaluation of Prints", Color Imaging XIII: Processing, Hardcopy and Applications, Proceedings of SPIE 6807, R. Eschbach, G. G. Marcu, S. Tominaga, Editors, San Jose, 2008.
5. ITU-R Recommendation BT.500-11, "Methodology for the subjective assessment of the quality of television pictures", International Telecommunication Union, Geneva, Switzerland, 2002.
6. A. Rizzi, C. Gatta, D. Marini, "A New Algorithm for Unsupervised Global and Local Color Correction", Pattern Recognition Letters, 24(11), 2003.
7. R. Balasubramanian, R. deQueiroz, R. Eschbach, W. Wu, "Gamut Mapping to Preserve Spatial Luminance Variations", The Eighth Color Imaging Conference, Scottsdale, 2000.