Visual Fidelity of Image Based Rendering ∗
Hynek Bakstein and Tomáš Pajdla
Center for Machine Perception, Czech Technical University
Karlovo nam. 13, 121 35 Prague, Czech Republic
e-mail: {bakstein,pajdla}@cmp.felk.cvut.cz
Abstract
This paper focuses on the factors which influence the fidelity, or realness, of images created
by image based rendering (IBR) methods. First, we address the issue of the large memory
requirements of IBR techniques and analyze the number of input images necessary for a
scene representation that preserves the visual fidelity of the output images. Then, we investigate
optimal scaling producing output images that are close to a perspective projection. We show
that objects at different depths should be scaled differently. When a view of a scene contains
several depths, the perceptually dominant object should be scaled properly.
1 Introduction
Figure 1: An example omnidirectional X-slits image with positions of details used in Figure 8.
Image based rendering (IBR) is an approach to representing a real scene by a sequence
of images captured at different locations. Novel views, corresponding to viewpoints not covered
by the input sequence, can be generated from these images; see Figure 1 for an example of an
omnidirectional IBR-generated image. IBR techniques do not have to assume anything about
the scene geometry and complexity [6], which makes them ideal for representing complex
real scenes. On the other hand, IBR techniques require many images to represent an
environment. Therefore, some approaches incorporate assumptions about the scene structure.

∗ This work was supported by GACR 102/03/0440, BeNoGo IST-2001-39184, MSMT Kontakt 22-2003-04,
MSMT Kontakt ME 678, and MSM 212300013.
One way to lower the number of input images is to constrain the freedom of movement
in the virtual environment created by IBR. A natural constraint is to allow motion only in a
single plane or a part of it. An example is a region of exploration (REX) limited to a circle. This
limitation may seem unnatural, but it is quite common: imagine a situation where a person is
limited by the room walls or is sitting in a chair at a desk.
This paper analyses how we can reduce the number of input images for IBR even further.
We employ a technique called X-slits rendering [8], which allows for an intuitive representation
of the virtual viewpoint position in a circular REX. We show that a significant reduction of input
images is possible under the assumption of limited variation of the scene depth. Moreover, we also
discuss the rescaling of output images which is inherent to IBR methods. We show that this
rescaling also depends on the depth in the scene.
2 X-slits rendering
We refer the reader to [8] for a detailed description of X-slits rendering. Here, we briefly
summarize the basic concepts and then introduce a novel analysis of the quality of
rendered images. In rendering using the X-slits camera model, novel views are created by pasting
together columns from input images. We assume circular motion of the camera with radius r during
the acquisition of the input images. If we index the input images by γ, the image columns in the
input images by β, and the columns in the novel view by α, we can write
β = arcsin( R sin(α) / r ) ,   (1)

γ = β − α ,   (2)
where R is the position of the virtual camera V inside the circle, see Figure 2(a).
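To make the column lookup concrete, the following Python sketch (not from the paper; the conversion between column indices and the angles α and β, as well as the example values of R and r, are assumptions) evaluates equations (1) and (2) to find, for a novel-view column α, the input camera position γ and the column β to sample.

import math

def xslits_lookup(alpha, R, r):
    """Map a novel-view column angle alpha to the input camera position
    gamma and the input column angle beta, following eqs. (1) and (2)."""
    beta = math.asin(R * math.sin(alpha) / r)   # eq. (1)
    gamma = beta - alpha                        # eq. (2)
    return gamma, beta

# Illustrative use (hypothetical values): a virtual camera at R = 10 cm
# inside a circle of radius r = 30 cm, output columns covering a
# 60-degree field of view with 600 columns.
R, r = 10.0, 30.0
for col in range(0, 600, 100):
    alpha = math.radians(col / 600.0 * 60.0 - 30.0)   # column index -> angle
    gamma, beta = xslits_lookup(alpha, R, r)
    print(f"alpha={math.degrees(alpha):6.1f} deg -> "
          f"camera at gamma={math.degrees(gamma):6.2f} deg, "
          f"column beta={math.degrees(beta):6.2f} deg")

Since R < r, the argument of arcsin in equation (1) stays within [−1, 1] and the lookup is defined for every output column.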
A possible position of the virtual camera depends on the number of input images as well
as on their resolution in pixels (the so-called angular resolution). We want to generate an image
acquired by a virtual camera at position V with some predefined field of view and resolution
in pixels. This determines how many light rays should be captured by this camera, in other
words, how many light rays captured by the input sequence should intersect the vertical line
representing the virtual camera. Since the number of images, as well as their resolution, is
discrete, the possible positions are discrete too, see Figure 2(b).
When creating an image at V, we approximate the light rays forming the virtual image by light
rays captured in the input sequence. In the case of a large number of input images, we can always
find some very close camera position in the input sequence and some light ray in the respective
image, as depicted in Figure 2(c). The large number of required images leads to huge
memory requirements, so it is natural to consider the case when the input sequence is sparse.
Then, we are not able to find an input image for each light ray and we have to approximate
more light rays by rays captured at a single position, see Figure 2(d). Therefore, we do not use
only a single column from each input image while creating the virtual views, but instead we use
stripes of columns. The main issue of this approach is that the approximation holds only at a
certain distance from the virtual camera. We analyse this approximation in the next section.
Figure 2: (a) An X-slits camera with a vertical slit V can be created by sampling columns β from
a camera at γ into columns α of the X-slits image. (b) Possible positions of V are determined by
both the number of images and the number of light rays that each image captures. The same
light rays can be acquired using (c) a large number of images or (d) fewer images, with the
missing light rays approximated by rays captured in a single image.
3 Reducing the number of images
Figure 3: Instead of sampling only a single column from many input images (a), we can sample
stripes of columns (b) from fewer images. The width of the stripes varies and they do not have
to be contiguous.
Instead of using a single column from each input image to compose the virtual views,
as depicted in Figure 3(a), we can use stripes of columns to approximate the missing light rays,
see Figure 3(b). As noted before, this approximation holds only at a certain distance from
the virtual camera. We can think of a so-called clean surface, surrounding the virtual camera
V, on which the scene is approximated well. Composition from slices was proposed in [3] for a clean
surface at infinity.
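As an illustration of this stripe-based composition, the following Python sketch (not the authors' implementation; the uniform angular spacing of the input cameras, the image array layout, and the column-to-angle mapping are assumptions) assigns each output column, via equations (1) and (2), to the nearest available input image and copies the corresponding column. When the input sequence is sparse, neighbouring output columns map to the same input image, and the copied columns form the stripes described above.

import math
import numpy as np

def compose_novel_view(images, R, r, out_cols, out_fov_deg, in_fov_deg):
    """Compose an X-slits novel view from a sparse, uniformly spaced
    circular sequence `images` (a list of HxWx3 arrays).  Each output
    column is taken from the input image whose position on the circle
    is closest to the angle gamma given by eqs. (1)-(2)."""
    n_images = len(images)
    h, w = images[0].shape[:2]
    view = np.zeros((h, out_cols, 3), dtype=images[0].dtype)
    for col in range(out_cols):
        # Output column index -> viewing angle alpha (uniform sampling assumed).
        alpha = math.radians((col / (out_cols - 1) - 0.5) * out_fov_deg)
        beta = math.asin(R * math.sin(alpha) / r)          # eq. (1)
        gamma = beta - alpha                               # eq. (2)
        # Nearest captured camera position on the circle.
        img_idx = int(round(gamma / (2 * math.pi) * n_images)) % n_images
        # Angle beta -> column index in the chosen input image.
        in_col = int(round((beta / math.radians(in_fov_deg) + 0.5) * (w - 1)))
        in_col = min(max(in_col, 0), w - 1)
        view[:, col, :] = images[img_idx][:, in_col, :]    # copy one column
    return view

With a dense sequence, each output column comes from a different input image; with a sparse one, consecutive output columns share the same image index and form stripes of varying width.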
Some special cases of this surface are illustrated in Figure 4. The surface can have a constant
depth (a); then we can use stripes of constant width from all input images to compose a novel view.
Figure 4: The clean surface; see text for a description.
Note that this approximation allows deviations from the constant depth (b). Changing the way
the input images are sampled changes the shape of the surface, for example to a plane (c). Panel (d)
shows that for a planar surface, the approximation of the missing image positions holds only for
one single plane B, while some points on plane A, closer to the camera, may not be imaged at
all, and some points on plane C, more distant than B, can be imaged twice. This observation
is general for all surfaces (e): there will always be zones where some points will not be imaged
(0), a surface where all points will be seen exactly once (1), and zones where some points will
be imaged twice or even more times (2). Figure 5(a) depicts a situation when some points are
not imaged at all; note the silver cylinders in front. The steps on their contours are caused
by the missing points, while more distant objects are imaged correctly. Figure 5(b) illustrates the
second case, where some points are imaged twice; note the double contours of the lines on
the floor.
In practice, real scenes hardly ever have a constant depth. Therefore, an approximation valid
only on some surface seems impractical. Fortunately, images are discrete and a single pixel does
not backproject to a single light ray, but to a cone of light rays. The same applies to the images
generated from the virtual camera at V. As a result, we do not intersect light rays in
one point, but we have an intersection of cones in some volume, see Figure 6. The size of this
volume also depends on how close the approximated and the approximating light rays are. If they
are close enough, like α and β in Figure 6, the resulting volume is big. However, in the case of α′
and β′ we get a smaller volume. This situation, however, never happens when we have a sufficiently
high number of images, and how much exactly is "sufficiently high" can actually be computed.
Let ∆pix be half of the angle spanned by the pixel cone, r the radius of the circular path of
the camera, βγ the angle of the approximating ray with respect to the optical axis of the respective
image I, and ∆γ the angle between that optical axis and the approximated ray α.
Figure 5: (a) Some points closer to the camera than the clean surface may not be imaged;
note the silver cylinders. (b) Some points more distant than the clean surface will be imaged
twice, such as the lines on the floor.
ω1 = βγ + ∆pix ,   ω2 = βγ − ∆pix ,   (3)

where βγ depends on the depth of the clean surface d. Using basic trigonometric identities, we
can write

ϕ1 = π − (ψ + ω1) ,   ϕ2 = π − (ψ + ω2) ,   (4)

where ψ = π − ∆γ. Finally, we can compute the depth range

d1 = l2 tan(ϕ1) ,   (5)

d2 = l2 tan(ϕ2) ,   (6)
where l2 = r sin(∆γ). The remaining distance to V, denoted by l1, can be computed as
l1 = r cos(∆γ). The above equations give us the range of valid depths for a given
number of input images (which influences ∆γ) and a given clean surface (which influences βγ). Due to lack
of space, we do not investigate how to acquire the scene depth estimate. We just note that
dense stereo reconstruction can be used, since even non-central images have stereo geometry.
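A minimal numerical sketch of this computation follows (the code and the parameter values are illustrative only and do not appear in the paper): it evaluates equations (3)-(6) to obtain the depth interval [d1, d2] imaged onto a single pixel for a given pixel cone half-angle ∆pix, camera circle radius r, clean-surface ray angle βγ, and angular offset ∆γ.

import math

def depth_range(delta_pix, beta_gamma, delta_gamma, r):
    """Depth range [d1, d2] imaged onto exactly one pixel, eqs. (3)-(6).

    delta_pix   -- half of the angle spanned by the pixel cone [rad]
    beta_gamma  -- angle of the approximating ray w.r.t. the optical axis [rad]
    delta_gamma -- angle between the optical axis and the approximated ray [rad]
    r           -- radius of the circular camera path
    """
    omega1 = beta_gamma + delta_pix            # eq. (3)
    omega2 = beta_gamma - delta_pix
    psi = math.pi - delta_gamma
    phi1 = math.pi - (psi + omega1)            # eq. (4)
    phi2 = math.pi - (psi + omega2)
    l2 = r * math.sin(delta_gamma)
    l1 = r * math.cos(delta_gamma)             # remaining distance to V
    d1 = l2 * math.tan(phi1)                   # eq. (5)
    d2 = l2 * math.tan(phi2)                   # eq. (6)
    return d1, d2, l1

# Hypothetical example values (chosen only to exercise the formulas):
d1, d2, l1 = depth_range(delta_pix=math.radians(0.5),
                         beta_gamma=math.radians(10.0),
                         delta_gamma=math.radians(85.0),
                         r=30.0)
print(f"valid depth range: {d1:.1f} to {d2:.1f} cm, l1 = {l1:.1f} cm")

Figure 7 shows this range as a function of the normalization depth d for 360 and 720 input images.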
4 Scaling the images
In IBR, we want to represent a real scene and allow a viewer to move in this virtual representation
of the scene. In the case of X-slits rendering, this motion is simulated by changing the
position of the virtual camera V. Novel views corresponding to a given position of V are then
generated by pasting together columns from input images. But since we move towards some
part of the scene, the objects in the image should get bigger and, conversely, when moving away
they should get smaller. Since pasting together columns from input images does not reflect
this size change, the images have to be rescaled. The rescaling function for X-slits rendering is
described in [1], where it is shown that the rescaling of the output image depends on the scene depth.
This section investigates the influence of the precision of the depth estimate on the rescaled images,
as well as how to deal with scenes containing objects at multiple depths.
Figure 6: Due to discrete images, pixels backproject into cones (dashed lines) instead of single
light rays (solid lines). When light ray β is used to approximate ray α, their respective cones
intersect in the shaded area (a volume in 3D).
If the scene depth estimate is not correct in some part of the output image, how severe a
distortion do we get? Since output images from X-slits rendering are non-central [8], we always
get images "distorted" in some sense [7]. Due to the non-central nature of the generated images,
an approach to distortion correction similar to [9] cannot be employed. Instead, we employ
subjective methods used in presence research [2] to evaluate the realness of the images.
The change blindness effect [5] suggests that image perception depends on an unconscious
interpretation of the image. Interestingly, even large changes (measured by image area) may
pass unnoticed if they do not change the image interpretation. On the other hand, relatively local
changes of the image may turn a perfectly possible image into an impossible one.
We suggest that the visual realness of images is related to the interpretation of the perceptually
dominant image feature. In particular, our experience and observations support that
1. image features appear to be "real" if they are not falsified [4] by human experience (i.e.
"real is that which is possible"), and
2. geometrically distorted images are acceptable if the perceptually dominant feature appears to be "real".
According to the above observations, people might tolerate distortions of image geometry
provided that the distortions do not significantly contradict the image interpretation. Such "tolerable"
distortions should therefore not significantly alter the feeling of realness of the image.
The notion of a perceptually dominant object also gives us a rule for rescaling images containing
objects at different depths: the dominant object should have a correct aspect ratio.
5 Experimental results
[Plots: depth range d1 and d2 [cm] versus the normalization depth d [cm].]
Figure 7: Depth range where points will be imaged onto exactly one pixel for (a) 360 images
and (b) 720 images. d denotes the normalization depth. Note different scales of the axes.
We present experiments evaluating both the reduction of the number of input images and
the rescaling of the images with identification of perceptually dominant objects. We have
chosen a scene filled with objects which are familiar to most people: an office with a computer
screen, books, a cup, and a picture. We captured a complete set of input images (5000
images), so we were able to create novel views by pasting together single image columns. This
provided the ground truth for evaluating the reduction of the number of input images.
We tested the reduction of the number of input images to 360 and to 720. The computed depth
ranges, where points in the scene will be imaged into one pixel, are shown in Figure 7(a) and
(b) respectively. Figure 8 summarizes a visual comparison of images generated using the complete
input set, in the first column, and images created using a reduced number of input images with
different scene depth assumptions. The assumed scene depth is marked above each column. The first
three images were created from 360 input images, the bottom three from 720. A clear improvement
in the visual quality of the latter images, as well as a bigger depth range without artifacts, can
be observed. The ground truth images are details from Figure 1.
We created six views of familiar objects in the scene. Some of the views contained only
parts of the objects; in others, the objects were completely visible. For each view, we generated
three images with a different geometrical normalization. One of the images was created so that
it visually matched a real perspective view obtained by a digital camera as closely as possible.
All images are shown in Figure 9. The first column contains the perspective images; each of the
other columns consists of images normalized at the same depth. The images most similar to the
perspective view are in the second column, except for the third row, where it is the third image.
We had six test subjects: three of them had some previous experience with virtual reality and IBR
(experts), and three were shown such images for the first time (non-experts).
Figure 8: Comparison of novel views rendered using a reduced number of input images. The
first column contains the ground truth: views rendered by composing single columns from
input images. The second column was normalized for a clean surface at infinity, the third column
at 2 metres, and the fourth column at 80 centimetres. For each image, the first row uses 360
input images, the second row 720.
We performed an experiment where the subjects were shown all the test images in random
order, one at a time, and were asked to decide whether the image looks "real". The results
are summarized in the tables on the right in Figure 9. Despite the small number of test subjects, we
can draw the following conclusions from the results, since the data are consistent. If the image
was falsified by human experience, that is, if some object such as the computer screen was too
wide, too narrow, or otherwise distorted, the subjects noticed it immediately
and did not classify the image as "real", regardless of their expertise. The experts tended to look
at the images more carefully and to be more critical, while the non-experts were captivated by the
real look of the images and judged more quickly.
On the other hand, the cup is a perceptually dominant feature, but it appears "real" in all
three images. The subjects commented that they did not know that particular cup and that all
three looked possible. The same applies to the picture on the wall in the images in the 3rd row, where
the subjects judged according to the objects in the picture, not the image of the picture itself.
It also applies to the laptop in the 4th row, which is also a familiar, perceptually dominant
object, but because it is not imaged in a front view, some subjects classified even
distorted images as "real".
6 Conclusions
We have shown that the number of input images can be greatly reduced (from 5000 to 720) while
preserving high image fidelity. The reduction introduces the need for a correct depth estimate or, at
least, a reasonable assumption. We investigated how large the scene depth range can be so that
we still obtain images without apparent artifacts. We also pursued another aspect of the visual fidelity, or
realness, of the generated images: the correct aspect ratio of perceptually dominant objects. We
have shown that this issue is also related to the scene depth estimate.
References
[1] H. Bakstein, T. Pajdla, and D. Večerka. Rendering almost perspective views from a sparse
set of omnidirectional images. In R. Harvey and J. Bangham, editors, Proceedings of the
British Machine Vision Conference 2003, pages 241–250, September 2003.
[2] M. Lombard and T. B. Ditton. At the heart of it all: The concept of presence. Journal of
Computer-Mediated Communication, 3(2), 1997.
[3] P. Peer and F. Solina. Towards a real time panoramic depth sensor. In N. Petkov and M. A.
Westenberg, editors, Proceedings of CAIP’03, pages 107–115, 2003.
[4] K. R. Popper. The Logic of Scientific Discovery. Hutchinson, London, 1968.
[5] R. A. Rensink. Change detection. Annual Review of Psychology, 53:245–277, 2002.
[6] H.-Y. Shum and S. B. Kang. A review of image-based rendering techniques. In IEEE/SPIE
Visual Communications and Image Processing (VCIP) 2000, pages 2–13, June 2000.
Column headings of Figure 9: True persp. | d = 150cm | d = 125cm | d = 100cm

The small tables on the right of Figure 9 (one per view, with columns T, E, N and one row per test image, listed in the column order of Figure 9) give the number of subjects who marked the image as "real"; T = all 6 subjects, E = 3 experts, N = 3 non-experts:

View (row of Fig. 9)   Image        T   E   N
1                      d = 150cm    4   1   3
1                      d = 125cm    4   1   3
1                      d = 100cm    4   1   3
2                      d = 150cm    1   1   0
2                      d = 125cm    0   0   0
2                      d = 100cm    0   0   0
3                      d = 150cm    5   2   3
3                      d = 125cm    5   2   3
3                      d = 100cm    5   2   3
4                      d = 150cm    6   3   3
4                      d = 125cm    3   1   2
4                      d = 100cm    1   1   0
5                      d = 150cm    3   1   2
5                      d = 125cm    0   0   0
5                      d = 100cm    0   0   0
6                      d = 150cm    4   2   2
6                      d = 125cm    2   1   1
6                      d = 100cm    0   0   0
Figure 9: Test images together with perspective views. The first column contains the perspective
views; the second to the fourth columns contain non-perspective test images. Each of these
columns corresponds to a different normalization depth. The second column is normalized at
the depth of the computer screen, the third at the picture on the wall, and the last at an object closer
than the computer screen. The tables on the right summarize the experimental results of tests
performed on 6 subjects (T) divided into two groups (experienced (E) and non-experienced (N)
VR users). The numbers in the tables indicate how many subjects marked the image as "real".
Each row of numbers corresponds to one image.
[7] R. Swaminathan, M.D. Grossberg, and S.K. Nayar. A perspective on distortions. In
CVPR03, volume 2, pages 594–601, June 2003.
[8] A. Zomet, D. Feldman, S. Peleg, and D. Weinshall. Mosaicing new views: The crossed-slits projection. IEEE PAMI, 25(6):741–754, June 2003.
[9] D. Zorin and A. H. Barr. Correction of geometric perceptual distortions in pictures. Computer Graphics, 29:257–264, 1995.