A study on the equivalence of controlled and uncontrolled visual
experiments
Silvia Zuffi (a), Carla Brambilla (b), Reiner Eschbach (c), Alessandro Rizzi (d)*
(a) ITC, Consiglio Nazionale delle Ricerche, Milano, Italy
(b) IMATI, Consiglio Nazionale delle Ricerche, Milano, Italy
(c) Xerox, Webster, USA
(d) Università degli Studi di Milano, Milano, Italy
ABSTRACT
Visual experiments, if performed in a traditional way, require asking participants to go to a laboratory, where displays
are calibrated and illumination conditions are set in a convenient way. In recent years there has been an increasing
interest in performing the experiments "out of the lab", with the aim of reducing costs, increasing the number of
participants, and differentiating the population. The equivalence of the response of visual experiments performed in
controlled and uncontrolled contexts is an open question which calls for research. In this work we aim at analyzing
more deeply the equivalence between "controlled" and "uncontrolled" that we found in a previous study of ours
on the comparison of visual preferences for printed images. In particular, we are interested in understanding if, and to
what degree, the uncontrolled experiments actually require more participants in order to average out the effects of the
many uncontrolled parameters which may affect the response. In addition we aim at exploring the relationship, if any,
between our previous conclusions and the difference in the attributes of the images for which the observers were asked
to express their preference.
Keywords: Visual experiments, viewing conditions.
1. INTRODUCTION
Visual experiments require subjects to perform a visual task, for instance to match colors, to read a sequence of words,
and so on, or to express a subjective evaluation, like to assign a preference, to set a rank, or to detect differences
between images or videos. These experiments, in particular those requiring a judgment, are very important to academia
and industry. At present, subjective evaluation is the only recognized method of determining the perceived quality of
displayed images, videos or prints, since replacing a subjective judgment expressed by an observer with a measure of
quality is a very complex issue, still unresolved. Indeed, the image content plays a determinant role in the perceived
quality and no quality measure exists that can take it properly into account. Visual experiments, if traditionally
performed, require asking participants to go to a laboratory, where viewing conditions are set in a convenient way. In
the case of those involving images or videos, the monitor is calibrated, color reproduction is managed, and the
illumination conditions are set accordingly. Where prints are under judgment, they are observed in suitable illuminating
systems, by selecting standardized light sources, and against a neutral background.
In recent years there has been a growing interest in performing visual experiments out of laboratories, with the aim of
reducing costs, increasing the number of participants, and differentiating the population. This trend emerged in the
context of display media, with the recognition of the Web as a very attractive tool. Indeed, the Web allows reaching
many participants from all over the world but, on the other hand, it constitutes a highly indeterminate environment:
not only are image reproduction and viewing conditions uncontrolled, but the investigator is also absent and cannot
verify that the visual task is executed correctly. Care is suggested in the implementation of visual experiments on the
Web: in the case of display media, given the lack of control over image reproduction, visual experiments are claimed to
be suitable for investigations that do not require a careful setting of appearance parameters like color, contrast, size
or resolution [1]. In this context, the viewing conditions could play a secondary role with respect to the image
reproduction settings, due to the greater level of adaptation of the observer to the display, rather than to the
environment [2].
Color Imaging XIV: Displaying, Processing, Hardcopy, and Applications, edited by Reiner Eschbach
Gabriel G. Marcu, Shoji Tominaga, Alessandro Rizzi, Proc. of SPIE-IS&T Electronic Imaging
SPIE Vol. 7241, 724102 · © 2009 SPIE-IS&T · CCC code: 0277-786X/09/$18 · doi: 10.1117/12.805854
In the case of print media, on the contrary, visual appearance depends on viewing parameters like illumination
conditions, viewing geometry and stimuli surround. However, particularly in the case of subjective evaluations, the
effect of viewing conditions, and thus the extent of the required control, is rarely known. In these
cases, if the experiments are performed in an uncontrolled manner, the different viewing conditions could act like a
source of noise in the recorded data, whose effect could be compensated by a high number of participants, producing
even more significant results, not biased by a specific set of viewing conditions. One may expect this to be a
task-dependent effect.
The equivalence of the response of visual experiments performed in controlled and uncontrolled contexts is an open
question which calls for research. There are at least two different aspects that one could consider. The first one
concerns the possibility of replacing controlled with uncontrolled experiments, with the purpose of reducing time and
costs. The second one is related to the interpretation of the results of a visual experiment performed in the laboratory.
Here the question is: to what extent is a controlled experiment representative of a real context? The lack of equivalence
between results obtained in controlled and uncontrolled conditions could indicate that the response of the subjects
depends on the viewing settings. Considering that in the majority of cases one is dealing with images, either displayed
or printed, that will be observed in general environments, a dependence on viewing conditions is something that one
would want to keep limited. Moreover, it has to be kept in mind that a lack of equivalence could also be a symptom of
a flawed experimental design.
There are very few works which compare the results obtained in the laboratory with those obtained from
uncontrolled tests. Zuffi et al. [3] have investigated the equivalence between controlled and Web-based visual experiments
in a study concerning the readability of color combinations. In this case, the uncontrolled parameters most likely to
influence the response are the type of monitor and the display and screen settings, including resolution. In a
previous study we have investigated the equivalence between “controlled” and “uncontrolled” in an experiment of
visual preference of prints [4]. In that experiment, which was carried out in parallel on different copies of an image dataset,
we carefully controlled the printing of the images, in order to perform the tests on the same source material. Thus, the
decisive difference between the controlled and uncontrolled experiments was the viewing conditions, which, in the
uncontrolled experiment, were the least constrained factor. The results of both studies support the hypothesis of equivalence.
In the work we present here we consider again the experiment of visual preference of prints previously performed and
address a couple of new questions. The first one concerns sample size. More specifically, our aim here is to provide,
in the light of our previous results, some indications for the choice of the sample size of the uncontrolled experiment.
Since the involvement of subjects may require some sort of reward, often in the form of a small amount of money, the
execution of a visual experiment is an expensive task. Sample size is therefore important for economic reasons, but
there are also statistical issues that have to be taken into account. In addition, we look for the relationship, if any,
among the equivalence of the experiments, image preferences and image differences. Indeed, in the previous study we
asked the participants to set a preference between pairs of similar images obtained by applying some kind of alteration,
like contrast enhancement or color shift, to a set of photos and we designed the dataset at random, in order to obtain a
generic benchmark. The result in terms of strength of the equivalence of the preferences given for each image pair in
the controlled and uncontrolled experiment was, in general, different for each image pair. Then the question is: is there
a relationship between the results we obtained and the difference in the attributes of the images in the image pairs? In
our opinion an answer to this question would allow generalizing the results and could be useful for driving the design of
future studies.
2. OUR PREVIOUS WORK
In the experiment of visual preference we previously performed [4], and to which we refer in the present work, a large set
of images randomly selected from photo collections was considered and, for each image, a slightly different version was
produced. Visual attributes like contrast and color were changed applying different algorithms with different
parameters. The result of this procedure was a large set of image pairs, each one composed of the original image and its
modified version. In some cases, the image pair included two modified images, resulting from the application of the
same algorithm with different parameters. A set of about twenty of these image pairs was randomly selected and
printed, two images per page, in what we called the “Book of Samples”. Two pages of the book are displayed in Figure 1.
Figure 1. The Book of Samples. Each page is composed of two images that differ in color and/or contrast. Some of the
images are printed in portrait orientation (left), some in landscape orientation (right).
In the experiment each subject was asked to express a preference for each of the image pairs in the book. The
hypothesis was that the random selection of the images, together with the fact that they were altered at random, gave
rise to a generic benchmark whose only purpose was to attest the correspondence between visual preferences when
the only variable is the viewing condition. The experiment was conducted in controlled and uncontrolled conditions.
The controlled experiment was executed in the laboratory, where the selection was performed in a Macbeth Spectralight
viewing booth under a daylight light source. The number of observers who took part in the experiment was 23, which
is in accordance with the ITU-R Rec. BT.500 guidelines [5], which require a minimum of 15 observers for image quality
assessment. All the observers were color-normal.
The uncontrolled experiment was similar to the controlled one, but it was performed in generic environments. In this
case 144 observers were involved, and the viewing conditions included indoor and outdoor, with natural or artificial
light at different intensity levels. Since we did not reward the participants, the total cost of the experiment was
independent of their number, thus we did not set a limit to the number of subjects, but we considered all the people we
were able to get involved in a fixed interval of time. The greater number of participants in the
uncontrolled experiment is partly due to the fact that we printed many copies of the Book of Samples and thus the
uncontrolled experiment could be carried out in parallel. Moreover, we found that it is easier to get people
involved by "taking the experiment to the observer" than by taking people to the laboratory, particularly in the absence of a
reward. In the case of the controlled experiment, the images were observed at a fixed viewing geometry, given that the
book was placed at the base of the light booth. In the uncontrolled experiment, the subjects were free to observe the
images in any viewing geometry. In the Book of Samples some image pairs have very similar images, some have very
different ones: the participants were explicitly asked not to give a preference in the cases in which they were not
sure about it.
The two experiments were compared in terms of the preference given to the top image in the pair, separately for each
pair of images. More specifically, for each pair of images we tested the null hypothesis that the probability of preferring
the top image in the controlled experiment is equal to the probability of preferring the same image in the uncontrolled
one. The hypothesis could not be rejected, with the only exception of two cases, namely pairs 17 and 21, for which the
test provided a very significant result in terms of difference in probability. A further analysis we carried out to
understand this last result showed that it was mainly due to the data collected under artificial light in the uncontrolled
experiment. At least in one case the equivalence was indeed restored if only data from the uncontrolled experiment
under daylight were considered. The results we obtained support the hypothesis of the equivalence of the response of
visual experiments performed in controlled and uncontrolled conditions and encouraged us to refine the study.
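The per-pair comparison described above can be sketched in code. The text does not state which statistical test was used, so the following is only a minimal illustration that assumes a pooled two-proportion z-test; the function name `two_proportion_z_test` and the input values are hypothetical and are not taken from the study. With controlled samples of about 20 observers, an exact test may be preferable to this normal approximation.

```python
import math


def two_proportion_z_test(p_c, n_c, p_u, n_u):
    """Two-sided pooled z-test of H0: p_c == p_u (normal approximation).

    p_c, p_u: observed proportions preferring the top image.
    n_c, n_u: number of valid tests in each experiment.
    Returns (z statistic, two-sided p-value).
    """
    pooled = (p_c * n_c + p_u * n_u) / (n_c + n_u)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_c + 1.0 / n_u))
    if se == 0.0:
        # Degenerate case: no variability under the pooled estimate.
        return 0.0, 1.0
    z = (p_c - p_u) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical illustration values (assumptions, not data from the paper).
z, p = two_proportion_z_test(0.60, 23, 0.55, 140)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

A large p-value means that the null hypothesis of equal preference probabilities cannot be rejected for that pair, which is the outcome reported for all pairs except 17 and 21.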
Figure 2 shows a plot of the differences between the observed percentages of preferences given to the top image in the
controlled and uncontrolled experiments. The mean (m) and standard deviation (sd) of the differences are 0.004 and
0.09, the first and third quartiles -0.041 and 0.055, respectively. As can be seen from the plot, all the differences lie
between m − 2 sd and m + 2 sd and almost all lie between -0.10 and 0.10. The plot does not include pairs 17 and 21.
[Figure 2 plot. Vertical axis: difference in preference % (controlled − uncontrolled), from −0.2 to 0.2, with reference lines at m, m+2sd and m−2sd. Horizontal axis: image pair (1 to 16 and 18 to 20).]
Figure 2. Plot of the differences between the percentages of preferences given to the top image in each pair in the controlled
and uncontrolled experiments.
3. THE SAMPLE SIZE ISSUE
As previously stated, one of the reasons for studying the equivalence of the response of visual experiments performed
under fixed or variable viewing conditions concerns the possibility of replacing controlled with uncontrolled
experiments. But what is the advantage of leaving the laboratory if the two kinds of experiments produce similar
results? We think that a good reason can be an advantage in time and cost. In terms of time, uncontrolled experiments
can indeed be highly profitable since, in general, uncontrolled tests can be executed in parallel, which reduces the
overall execution time of the experiment. As far as cost is concerned, an uncontrolled experiment is preferable as long
as the following inequality is satisfied:
\[ N_{uncontrolled} \, C_{uncontrolled} \le N_{controlled} \, C_{controlled}, \qquad (1) \]
where N is the sample size of the experiment, that is, the number of tests performed, and C is the cost of each test. We
expect an uncontrolled test to be cheaper than one carried out in a laboratory, given the cost of the laboratory equipment
and its maintenance, and given that in the uncontrolled experiment the subject does not have to reach the lab. If this
hypothesis holds, the uncontrolled experiment turns out to be preferable if the number of participants in the two
experiments is equal or if, in case the number of participants in the uncontrolled experiment is larger, it is not so large
as to reverse the inequality.
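Inequality (1) can be rearranged to make the trade-off explicit. Since all sample sizes and costs are positive, dividing both sides by the controlled sample size and by the uncontrolled per-test cost gives an equivalent bound on the ratio of sample sizes:

```latex
\[
  \frac{N_{uncontrolled}}{N_{controlled}} \;\le\; \frac{C_{controlled}}{C_{uncontrolled}}
\]
```

As a purely illustrative assumption, if a laboratory test cost three times as much as an uncontrolled one, the uncontrolled experiment would remain the cheaper option for up to three times as many tests.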
But cost cannot be the only factor driving the choice of the sample sizes; constraints may derive from statistical issues
that have to be taken into account. For instance, one might want to ensure the same level of
“precision” in the results of the two experiments, where by precision we mean here the dispersion of some measure of
the observed data. In our study, since we based the comparison between the two kinds of experiments on the
comparison between the probabilities of choosing the top image of the pair, a natural choice for measuring “precision”
is the variance of the estimators of these probabilities. Hereafter the probabilities are denoted by $p_c$ and $p_u$ and their
estimators by $P_c$ and $P_u$. On account of this, we first revisit our results in order to verify whether the equality
\[ \mathrm{Var}(P_c) = \mathrm{Var}(P_u) \qquad (2) \]
is satisfied or, if not, what a suitable number of tests would be. As already pointed out, in our previous work
we did not set a priori the number of subjects, but we considered the people we were able to get involved in a fixed
interval of time. As a matter of fact, since we consider the controlled experiment as a pilot experiment where the
number of participants is in accordance with standard recommendations, our problem reduces to determining the
number of tests of the uncontrolled experiment that satisfies equality (2), in case it is not already
satisfied. Given that $\mathrm{Var}(P_c) = \sigma_c^2 / N_c$ and $\mathrm{Var}(P_u) = \sigma_u^2 / N_u$, where $\sigma_c^2 = p_c (1 - p_c)$ and $\sigma_u^2 = p_u (1 - p_u)$ are the
population variances in the controlled and uncontrolled settings respectively, equality (2) becomes
\[ \frac{p_c (1 - p_c)}{N_c} = \frac{p_u (1 - p_u)}{N_u}, \qquad (3) \]
and therefore, in practice, the equality we are faced with is
\[ \frac{\hat{p}_c (1 - \hat{p}_c)}{N_c} = \frac{\hat{p}_u (1 - \hat{p}_u)}{N_u}, \qquad (4) \]
where $\hat{p}_c$ and $\hat{p}_u$ are the estimates of $p_c$ and $p_u$, namely the observed percentages of preferences for the top image of
the pair in the two experiments. To achieve our aim, we first look at the estimates $s_c^2$ and $s_u^2$ of the population
variances $\sigma_c^2$ and $\sigma_u^2$ that we obtained in the two experiments. These values are provided in Table 1, together with the
number of tests performed.
              Controlled experiment    Uncontrolled experiment
Image pair    s²        N              s²        N
1             0.2379    23             0.2475    133
2             0.2275    20             0.2496    123
3             0.1411    23             0.1411    142
4             0.2275    23             0.2436    139
5             0.2304    22             0.1875    137
6             0.2059    21             0.1875    140
7             0.2356    21             0.2419    117
8             0.1411    23             0.1476    139
9             0.25      20             0.25      129
10            0.1411    23             0.2275    142
11            0.2451    23             0.2496    142
12            0.2475    22             0.25      134
13            0.1716    23             0.1539    136
14            0.2491    17             0.2451    119
15            0.2059    21             0.2275    131
16            0.1659    19             0.1924    129
18            0.2419    22             0.2451    141
19            0.0475    22             0.12      137
20            0.2451    21             0.2379    135
Table 1. Estimates of the population variances in the controlled and uncontrolled experiments and number of tests
performed. The number of tests varies within the same type of experiment because in some cases the observers did
not express a preference.
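As a worked illustration of how the entries of Table 1 translate into the variances compared in equality (4), consider pair 1. Dividing each variance estimate by the corresponding number of tests gives:

```latex
\[
  \widehat{\mathrm{Var}}(P_c) = \frac{s_c^2}{N_c} = \frac{0.2379}{23} \approx 0.0103,
  \qquad
  \widehat{\mathrm{Var}}(P_u) = \frac{s_u^2}{N_u} = \frac{0.2475}{133} \approx 0.0019
\]
```

With the sample sizes actually used, the estimate from the uncontrolled experiment is therefore considerably more precise than the one from the controlled experiment, even though the two population variance estimates are nearly equal.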
Looking at the table we see that for most of the image pairs the two values are very similar, which was actually
expected in the light of the equivalence. With respect to the aim we are pursuing here, this makes it immediately clear
that:
- for almost all the image pairs, a sample size for the uncontrolled experiment approximately equal to that of the
  controlled experiment would be sufficient. Exceptions are pairs 10 and 19, where the larger differences between
  the values imply that the two sample sizes must be different;
- the uncontrolled experiment is over-sized.
The plots in Figure 3 show, for image pairs 10 and 19, the dependence of the estimate of $\mathrm{Var}(P_u)$ on the sample
size of the uncontrolled experiment and compare it with the estimate of $\mathrm{Var}(P_c)$. It can be seen that for pair 10 the
sample size for which equality (4) is satisfied lies between 35 and 40, whereas for pair 19 it lies between 55 and 60.
The values depicted can be very easily derived from the information provided in Table 1, and correspond to those
obtainable by randomly extracting a large number of subsamples of the given size, computing the estimate of the
variance of interest for each of them, and then averaging.
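The sample sizes quoted for pairs 10 and 19 can be checked by solving equality (4) for the uncontrolled sample size, using only the values listed in Table 1. The snippet below is a minimal sketch; the function name `required_uncontrolled_size` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
import math

# (s2 controlled, N controlled, s2 uncontrolled), copied from Table 1.
TABLE_1 = {
    10: (0.1411, 23, 0.2275),
    19: (0.0475, 22, 0.12),
}


def required_uncontrolled_size(s2_c, n_c, s2_u):
    """Solve s2_c / n_c == s2_u / n_u for n_u (equality (4))."""
    return s2_u / s2_c * n_c


for pair, (s2_c, n_c, s2_u) in TABLE_1.items():
    n_u = required_uncontrolled_size(s2_c, n_c, s2_u)
    # Round up, since the number of tests must be an integer.
    print(f"pair {pair}: N_u = {n_u:.2f} (at least {math.ceil(n_u)} tests)")
```

The resulting values, about 37 for pair 10 and about 56 for pair 19, fall within the ranges read from Figure 3.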
[Figure 3 plots. Two panels, "Image pair 10" and "Image pair 19". Vertical axis: estimate of variance. Horizontal axis: N, from 10 to 80. Each panel shows one curve labeled "uncontrolled" and one straight line labeled "controlled".]
Figure 3. The figure illustrates the dependence of the estimate of $\mathrm{Var}(P_u)$ on the sample size of the uncontrolled
experiment for the pairs 10 and 19. The straight line refers to the estimate of $\mathrm{Var}(P_c)$.
For our data we can conclude that, in order to satisfy equality (4) for all the image pairs, approximately 60 tests
would have been enough. Let us now derive a general result for an a priori estimate of the
sample size of the uncontrolled experiment. From equation (3) we obtain
\[ N_u = \frac{p_u (1 - p_u)}{p_c (1 - p_c)} \, N_c, \qquad (5) \]
from which we can compute $N_u$ for any given combination of $N_c$, $p_c$ and $p_u$. However, we are not really interested
in all the possibilities. First of all, we are interested only in the cases in which $N_u / N_c > 1$, since our final goal is to satisfy
inequality (1) in which, as previously observed, we may assume that an uncontrolled test is cheaper than a controlled
one. Thus the cases of interest are those in which the preference expressed in the controlled experiment is sharper than
the one expressed in the uncontrolled one. Indeed, in equation (5), $N_u / N_c > 1$ when $p_c < p_u$ if $p_u \le 0.5$, and when
$p_c > p_u$ if $p_u \ge 0.5$. Secondly, since we are reasoning about equivalent experiments, we are not interested in large
differences between the two probabilities and we can reasonably assume an upper limit equal to the value we observed in our
data, namely 0.20. Table 2 provides some combinations of $p_c$ and $p_u$ whose difference is 0.20, together with the
resulting ratios $N_u / N_c$. It can be seen that the highest value of the ratio $N_u / N_c$ is about 4 and we think that this is also the
highest value we should expect since, in our opinion, higher values may arise only in unlikely situations.
p_c        0.95   0.90   0.85   0.80   0.75   0.70
p_u        0.75   0.70   0.65   0.60   0.55   0.50
N_u/N_c    3.9    2.33   1.78   1.5    1.32   1.2
Table 2. Values of the ratio $N_u / N_c$ corresponding to some probabilities one could expect in the controlled experiment,
assuming that the difference in probability between the two experiments is 0.2.
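The ratios in Table 2 follow directly from equation (5). The short sketch below recomputes them; the function name `sample_size_ratio` is an illustrative choice, and the fixed gap of 0.20 between the two probabilities is the assumption stated in the text.

```python
def sample_size_ratio(p_c, p_u):
    """Ratio N_u / N_c from equation (5): p_u(1 - p_u) / (p_c(1 - p_c))."""
    return (p_u * (1.0 - p_u)) / (p_c * (1.0 - p_c))


# Controlled-experiment probabilities considered in Table 2.
for p_c in (0.95, 0.90, 0.85, 0.80, 0.75, 0.70):
    p_u = p_c - 0.20  # preference in the uncontrolled case is 0.20 lower
    ratio = sample_size_ratio(p_c, p_u)
    print(f"p_c = {p_c:.2f}, p_u = {p_u:.2f}, N_u/N_c = {ratio:.2f}")
```

Multiplying each ratio by a chosen controlled sample size gives the corresponding uncontrolled sample size. The computed ratios agree with the tabulated ones up to rounding.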
Figure 4 shows the content of Table 2 when $N_c$ is 10 or 20.
[Figure 4 plot. Vertical axis: $N_u$, from 0 to 90. Horizontal axis: $p_c$, from 0.95 to 0.7. Two curves, labeled Nc=20 and Nc=10, drawn for pu = pc − 0.2.]
Figure 4. The figure illustrates the relationship between $N_u$ and $p_c$ in case $p_u = p_c - 0.2$, that is, when the preference in the
uncontrolled case is 0.2 (20 percentage points) less pronounced than in the controlled case. The two curves refer to two
different sizes of the controlled experiment, 10 and 20 samples.
Clearly the above analysis can be exploited provided that one has some knowledge of $p_c$, which could come from a similar
study already executed in controlled conditions.
In summary, our study indicates that, in the worst-case scenario, satisfying equation (3) requires a number of
uncontrolled tests about 4 times that of the controlled experiment. This is a factor that can be used jointly with
inequality (1) to decide, on the basis of the cost associated with the different types of experiments, whether the
execution of an uncontrolled experiment is more convenient. Obviously the investigator might want to take into account
other statistical issues, first of all that of choosing a sample size that allows detecting a given difference in preference, if
this difference exists.
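The joint use of the worst-case factor and inequality (1) can be sketched as a simple decision rule. The function name `uncontrolled_is_cheaper` and the cost figures in the example are hypothetical; only the factor of 4 and the form of the inequality come from the text.

```python
def uncontrolled_is_cheaper(n_c, cost_c, cost_u, ratio=4.0):
    """Check inequality (1) when N_u = ratio * N_c.

    n_c: sample size of the controlled experiment.
    cost_c, cost_u: per-test cost in the controlled and uncontrolled setting.
    ratio: assumed N_u / N_c needed for equal precision (worst case: about 4).
    """
    n_u = ratio * n_c
    return n_u * cost_u <= n_c * cost_c


# Hypothetical per-test costs in arbitrary units (assumptions for illustration).
print(uncontrolled_is_cheaper(n_c=20, cost_c=10.0, cost_u=2.0))
print(uncontrolled_is_cheaper(n_c=20, cost_c=10.0, cost_u=3.0))
```

Under the worst-case factor, the rule reduces to requiring that an uncontrolled test cost no more than one quarter of a controlled one, independently of the controlled sample size.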
4. ANALYSIS OF THE IMAGE DIFFERENCES
The Book of Samples contains image pairs in which the images are very different, as well as pairs where differences
can hardly be perceived. We recall here that the design of the image pair collection was arbitrary, and the resulting set
includes differences in terms of color, contrast, or both. Four different algorithms were applied to the original images to
produce image pairs: a Retinex-like algorithm for contrast enhancement and color correction (A) [6]; spatial gamut
mapping (B) [7]; a mild compression of the image color gamut (C); a rotation of the hue angle of 10 degrees (D). In some
cases the images in the pair have been produced by applying one of these algorithms to an original image, thus the pair
includes an unprocessed and a processed image. In other cases, limited to the application of the algorithm of type A, the
pair includes two processed images; in these cases the same algorithm was applied with two different sets of parameters.
Table 3 reports the type of algorithm applied for each of the pairs considered.
IMAGE PAIR
1
2
3
4
TOP IMAGE
A
D
B
A
5
BOTTOM IMAGE
A
A
6
7
A
C
8
9
10
A
A
C
A
11
12
13
14
D
D
A
C
B
15
16
18
19
D
C
A
B
C
A
20
D
Table 3. Type of algorithm applied to each image in the Book of Samples (A = Retinex-like algorithm, B = Spatial
gamut mapping, C = mild compression, D = hue rotation).
As pointed out in the introduction, one of the aims of the present study is to analyze the results of equivalence we
previously obtained in terms of the difference in the attributes of the images in each pair. At a first look at the prints, it
can be observed that, when the algorithm of type A was applied, the differences were very variable: in some cases the
images had apparently the same colors, but different contrast, especially in dark areas (4, 8); in other cases there was
apparently a slight global color change (1, 6, 18), which in one case was associated with the presence of extra details in
one of the images (6). In one case there was a very strong color difference (10). In the case of algorithm B, the
alteration appears as an increased contrast for the image to which the algorithm was applied (3, 13, 19); for
algorithms C and D one could notice a small color difference.
Figure 5, like Figure 2, depicts the differences between the observed percentages of preferences given to the top image
of the pair in the controlled and uncontrolled experiments.
[Figure 5 plot. Vertical axis: difference in preference % (controlled − uncontrolled), from −0.2 to 0.2. Horizontal axis: image pair (1 to 16 and 18 to 20). Each point carries a label indicating the processing of the pair, such as A-A, none-A, B-none, none-B, C-none, none-C, D-none, none-D.]
Figure 5. Plots of the differences between the percentages of preferences given to the top image in each pair in the
controlled and uncontrolled experiments. The plot also shows which algorithm was applied (A = Retinex-like
algorithm, B = Spatial gamut mapping, C = mild compression, D = hue rotation).
Looking at the plot it can be seen that for the pairs in which the processing produced a change in contrast (3, 4, 8, 13,
and 19) the difference in preference between the controlled and uncontrolled experiment is small. The cases in which
we recognized a change in color are instead evenly distributed within the range m ± 2 sd. The cases with higher
differences are 1, 10, and 18. We note that, consistently with the information provided by the plot, for these three
differences we obtained a result that, although not statistically significant, was markedly different from the others in
terms of p-value when we tested the null hypothesis that the probability of preferring the top image in the controlled
experiment is equal to the probability of preferring the same image in the uncontrolled one.
What is interesting to point out is that, where the processing applied produced a difference in the spatial attributes,
one of the two images in the pair appears blurred or with missing details in dark areas. Attributes like blur and absence
of details could be recognized by the observers as defects, thereby determining a strong preference. On the contrary, the
small color changes, caused specifically by the algorithms of type C and D, did not produce unnatural colors that could
induce the observers to qualify an image as “wrong” and then determine a strong preference. This observation, in our
opinion, extends also to the case of pair number 10, where the color difference was indeed very strong, but it was not
possible to assign an intrinsic color to the image content.
The above can be verified by looking at the effect of the kind of alteration on the preferences between the top and the
bottom image in the pairs. Here the question is: is there a relationship between the algorithm applied and the preference
expressed? Table 4 provides some understanding of the question. The table reports, for each pair, the percentage of
observed preferences for the top image in the uncontrolled experiment, the p-value obtained by testing the hypothesis
that the probability of preferring the top image is equal to that of preferring the bottom one, and finally the 95%
confidence interval for the probability of preferring the top image. Given that a very small p-value means a strong
preference for one of the two images, we can observe that the algorithm of type B, that is the spatial gamut mapping,
always produced a highly significant preference, in particular for the image to which it was applied in the image pair.
Indeed it was applied to the top image in pairs 3 and 19, and to the bottom image in pair 13. For all three of these pairs
the reported p-value is 0. In the two other cases in which we recognized an apparent change in local contrast in dark areas,
namely 4 and 8, the p-value is also small.
Image pair    % preference top image    p-value    CI 95%
1             0.45                      >0.10      0.37-0.54
2             0.48                      >>0.10     0.39-0.57
3             0.83                      0          0.76-0.89
4             0.58                      0.09       0.49-0.66
5             0.75                      0          0.67-0.82
6             0.75                      0          0.67-0.82
7             0.59                      0.06       0.49-0.68
8             0.18                      0          0.10-0.26
9             0.50                      >>0.10     0.41-0.59
10            0.65                      0.0003     0.57-0.73
11            0.52                      >>0.10     0.44-0.60
12            0.50                      >>0.10     0.42-0.58
13            0.19                      0          0.13-0.27
14            0.57                      >0.10      0.48-0.66
15            0.65                      0.0009     0.56-0.73
16            0.74                      0          0.66-0.82
18            0.57                      0.09       0.49-0.66
19            0.86                      0          0.73-0.91
20            0.61                      0.02       0.52-0.63
Table 4. The table reports, for each pair, the percentages of preferences given to the top image in the uncontrolled
experiment, the p-value obtained by testing the hypothesis that the probability of preferring the top image is equal to
that of preferring the bottom one, and the 95% confidence interval for the probability of preferring the top image.
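The quantities reported in Table 4 can be approximated with elementary formulas. The text does not specify how the p-values and confidence intervals were computed, so the sketch below assumes a normal approximation (a z-test against 0.5 and a Wald interval) and is not expected to reproduce the tabulated digits exactly. The function name `preference_summary` is illustrative; the example inputs for pair 3 are the observed proportion from Table 4 and the number of uncontrolled tests from Table 1.

```python
import math


def preference_summary(p_hat, n, z_crit=1.96):
    """Normal-approximation summary for one image pair.

    p_hat: observed proportion preferring the top image.
    n: number of valid tests.
    Returns (two-sided p-value for H0: p == 0.5, (lower, upper) 95% interval).
    """
    # Standard error under the null hypothesis p = 0.5.
    z = (p_hat - 0.5) / math.sqrt(0.25 / n)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    # Wald interval around the observed proportion, clipped to [0, 1].
    half_width = z_crit * math.sqrt(p_hat * (1.0 - p_hat) / n)
    interval = (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))
    return p_value, interval


# Pair 3 in the uncontrolled experiment: proportion 0.83 over 142 tests.
p_value, (low, high) = preference_summary(0.83, 142)
print(f"p-value = {p_value:.2e}, 95% CI = {low:.2f}-{high:.2f}")
```

For small samples or proportions close to 0 or 1, an exact binomial test and interval would be a more cautious choice than this approximation.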
Preference data in Table 4 are plotted in Figure 6. Points above the line indicate a preference for the top image of the
pair; points below the line indicate a preference for the bottom image. It can clearly be seen that the type D algorithm
(rotation of hue angle) produced, on the whole, the least sharp preferences.
[Figure 6 plot. Vertical axis: % preferences for top image, from 0 to 1. Horizontal axis: image pair (1 to 16 and 18 to 20). Each point carries a label indicating the processing of the pair, such as A-A, none-A, B-none, none-B, C-none, none-C, D-none, none-D.]
Figure 6. Plots of the percentage of preferences given to the top image in the uncontrolled experiment. The plot also shows
which algorithm was applied (A = Retinex-like algorithm, B = Spatial gamut mapping, C = mild compression, D = hue
rotation).
In summary, we can observe that in our Book of Samples the application of algorithms that produced changes in the
image contrast gave rise to pairs of images for most of which the preference for the more contrasted one was clearly
stated in both experiments; in these cases the strong preference is associated with a strong equivalence between
controlled and uncontrolled experiments. On the contrary, we cannot observe a dual behavior if we look at cases of
weak preference: indeed, when the probability of choosing one of the images is around 0.5 (1, 2, 9, 11, 12, 14, and 18),
the difference in preference between the controlled and uncontrolled experiments varies.
5. CONCLUSION
In this work we have refined our analysis of the equivalence between controlled and uncontrolled experiments that we
found in a previous work [4], where the task was to express a preference for each of a set of image pairs printed in what we
called the Book of Samples. More specifically, we addressed two questions, one concerning the sample size of the
uncontrolled experiment, the other the relationship between the equivalence of the two experiments and the attributes
that set the differences in the pair of images under comparison.
The choice of the sample size of an experiment is important for statistical and economic reasons. If the experiment is
performed in an uncontrolled setting, it is expected that the effects of the different viewing conditions are compensated
by a high number of participants. On the basis of our results, we have found that, under a very conservative strategy, a
number of uncontrolled tests about 4 times the sample size of the controlled experiment is required in order to
achieve the same level of precision in the results. This indication could be exploited in the design of an uncontrolled
experiment of visual preference of prints, considering that a test performed in a laboratory is presumably more
expensive than an equivalent test performed out of the lab.
The analysis of the apparent visible differences between images in each pair has pointed out that our Book of Samples
includes cases with differences in color and contrast. In the cases where a general blur or a locally reduced contrast
were observed, which are attributes we suspect were considered defects by the observers, the preference for the more
contrasted images was strong and associated with a strong equivalence between the controlled and uncontrolled
experiment. Although significant, this result could be considered quite obvious. Of greater relevance, the Book of
Samples includes many cases with image differences not ascribable to image defects and for which a less clear
preference could be expected. In these cases, control of the viewing conditions could play a role but, interestingly,
the equivalence between the two experiments was nevertheless verified.
6. REFERENCES
1. H. van Veen, H. Bülthoff, G. Givaty, “Psychophysical Experiments on the Internet,” Proceedings of the 2nd Tübinger
Conference of Perception, H.H. Bülthoff, M. Fahle, K.R. Gegenfurtner, H.A. Mallot. Knirsch, Kirchentellinsfurt,
Editors, Tübinger, Germany, 1999.
2. N. Katoh, K. Nakabayashi, M. Ito, S. Ohno, “Effect of Ambient Light on the Color Appearance of Softcopy Images:
Mixed Chromatic Adaptation for Self-luminous Display”, Journal of Electronic Imaging, 7(4), 1998.
3. S. Zuffi, P. Scala, C. Brambilla, G. Beretta, “Web-based vs. controlled environment psychophysics experiments”,
Image Quality and System Performance IV, Proceedings of SPIE 6494, L. C. Cui, Y. Miyake, Editors, San Jose, 2007.
4. S. Zuffi, C. Brambilla, R. Eschbach, A. Rizzi, “Controlled and Uncontrolled Viewing Conditions in the Evaluation of
Prints”, Color Imaging XIII: Processing, Hardcopy and Applications, Proceedings of SPIE 6807, R. Eschbach, G. G.
Marcu, S. Tominaga, Editors, San Jose, 2008.
5. ITU-R Recommendation BT.500-11, “Methodology for the subjective assessment of the quality of television
pictures”, International Telecommunication Union, Geneva, Switzerland, 2002.
6. A. Rizzi, C. Gatta, D. Marini, “A New Algorithm for Unsupervised Global and Local Color Correction”, Pattern
Recognition Letters, 24(11), 2003.
7. R. Balasubramanian, R. deQueiroz, R. Eschbach, W. Wu, “Gamut Mapping to Preserve Spatial Luminance
Variations”, The Eighth Color Imaging Conference, Scottsdale, 2000.