
2D to 3D conversion based on object extraction using motion and color segmentation

Iman Kohyarnejadfard* and Mahmoud Fathy
[email protected], Iran University of Science and Technology
[email protected], Associate Professor, Iran University of Science and Technology
Abstract The human visual system perceives depth because the two eyes receive images of the world from slightly different positions. Stereoscopic imaging follows the same principle and uses two cameras separated roughly by the distance between the two eyes. One common method of three-dimensional video coding is "video plus depth"; the coded images are converted back into three-dimensional images at the receiver side by depth-image-based rendering algorithms. In this paper, a method is proposed to estimate the depth information on the basis of a combined color and motion segmentation that creates the different layers of the scene. Each depth layer is assumed to lie at a certain distance from the camera, so the initial depth map is created by determining the order of the layers in the scene; finally, by using pixel motion, the initial depth map is refined and the final depth map is created. A virtual view can then be synthesized from any frame and its corresponding depth map, and the pictures can be viewed as a stereoscopic pair. Evaluation of the results of the proposed method and comparison with other existing methods indicate an improvement in accuracy and quality over other pixel-based and segment-based methods.
Keywords 2D-to-3D conversion, depth extraction, stereoscopic video, object ordering, object motion
1 Introduction
Most three-dimensional display technologies rely on stereoscopic presentation. A stereoscopic video consists of two image sequences that are displayed to the viewer simultaneously, and various methods exist for encoding such three-dimensional content. One of these is the "video plus depth" method. At the transmitter side, the depth map sequence is computed from the left and right image sequences and is sent to the receiver together with one of the two color sequences (either the left or the right one). At the receiver side, the codec reconstructs the other color image sequence from the transmitted color images and their corresponding depth data using one of the numerous DIBR (Depth Image Based Rendering) algorithms, and the two images are then displayed simultaneously. Consequently, if a depth map can be estimated for a monoscopic image, the second image with a small spatial offset can be obtained and a stereoscopic image can be displayed [20, 21].
Each depth sample can be treated as a gray-scale video signal. The depth is bounded between two limits that represent the minimum and maximum distance of a three-dimensional point from the camera, and the depth range is linearly quantized with 8 bits; for example, the nearest point is encoded as 255 and the farthest as 0. The depth map defined in this way is therefore a gray-scale image [24]. Several methods have been proposed to create depth maps, such as depth map generation from limited user input. At the beginning of this procedure, which is not fully automatic, the input image is segmented into smaller parts; the picture is then marked manually, edges and T-junctions are searched, and finally a depth value is estimated for each segment and post-processing is applied [7].
Fast 2D-to-3D conversion using a clustering-based hierarchical search in a machine learning framework infers the 3D structure of a query color image from a training database of color and depth images [1]. In edge-based depth gradient refinement for 2D-to-3D learned prior conversion, an automatic 2D-to-3D image conversion approach based on machine learning principles is presented, using the hypothesis that images with a similar structure are likely to have a similar 3D structure [2]. Another method is foreground-based depth map generation for 2D-to-3D conversion; for a given input image, it determines whether the image is an object-view scene or a non-object-view scene, depending on the existence of foreground objects that are clearly distinguishable from the background [3].
Automatic real-time 2D-to-3D conversion for scenic views is yet another method; it uses three depth cues, haze, vertical edges, and sharpness, to estimate a sparse depth map, and then obtains the full depth map from the sparse map by an edge-aware interpolation method [4, 15].
Object-based 2D-to-3D conversion for producing effective stereoscopic content can also be mentioned. This method tries to estimate the depth of each object. The depth ordering is an important cue for understanding the relationship between the depths of the observed objects. In this method, depth layers are identified from the depth discontinuities in a two-dimensional video sequence; these discontinuities usually arise from inconsistent motion at the edges of moving objects (occlusion). Finding the depth order and detecting occlusions in the video sequence is therefore the central problem [5].
2D-to-3D conversion based on the combination of motion and color is another method. Both motion and color are used to create stereoscopic images from monoscopic ones. Optical flow is used to estimate the two-dimensional motion at the pixel level, which yields finer-grained results; the minimum-difference information is then applied to combine the color information with the optical-flow results and obtain a proper segmentation, after which the depth is estimated automatically. To achieve the segmentation and decide the depth of each region, a set of conditions is imposed on a flood-fill procedure [13].
The simplest and most basic approach is 2D-to-3D video conversion based on inter-frame pixel matching. In this method, the depth map is obtained directly from the motion vectors estimated for the macroblocks. It is fast but of limited quality, and various techniques have been used to improve the quality of the motion estimation [16].
Other methods rest on different principles, such as blur analysis. Here the depth of the scene is inferred from the effect of the variable focus parameters of the image: by measuring the amount of blur in the picture, the focal length is used as a depth cue in an inverse filter. This approach has several drawbacks; the amount of blur is not always available, and blur is influenced not only by focal length but also by other factors such as the atmosphere. Another way to obtain depth is from geometric constraints: the basis of this method is to exploit geometric constraints between two instants with prior knowledge of the camera settings, including the focal length and the camera speed [10-12].
In this paper, we present a method that exploits the advantages of the previous methods while addressing their deficiencies. Examination of real depth data shows that most of the depth information that lets the human brain perceive three-dimensional images lies in the objects of the scene rather than in individual pixels: the depth data of different objects differ considerably, whereas the pixels within an object have similar depth values because they lie at a similar distance from the camera. A depth map close to the real one can therefore be computed by extracting the objects in the image and then estimating their distance from the camera. Color information is available from a single frame, while motion information can be obtained from the sequence of frames. In the method presented in this article, the image is first segmented using color information; motion information is then applied to extract the image objects; and finally, after determining the order of the objects in the image, the final depth map is obtained.
This paper consists of five sections. Section 2 reviews conventional methods of stereo video coding. Section 3 presents the steps of the proposed algorithm. Section 4 evaluates the results of the proposed method and other available methods, and Section 5 states the conclusions of this study.
2 Conventional methods of stereo video coding
Stereo display is the most important special case of multi-view video (N = 2 views). Stereo compression has been studied for a long time and many related standards are available. A stereo pair consists of two images taken from two viewpoints a short distance apart, a distance that corresponds to the separation of the human eyes. In general, the strong similarity between the two images makes them well suited to joint coding, so that one of them can predict the other; for example, one image is compressed without reference to the other, and the second image is then predicted from the first one [14, 22].
Fig. 1 Prediction in stereoscopic coding
Transmitting a video signal together with its corresponding depth map is an alternative to classic stereo video. Using video plus depth information is attractive in terms of compression performance, because the depth data can be treated as a gray-scale video signal that compresses to a small size [17, 18].
The general problem of the video-plus-depth method is content creation. For example, cameras that automatically compute the depth of each pixel are available, but the quality of the recorded depth is limited. Depth estimation algorithms have been studied extensively in computer vision and good results have been obtained, but estimation errors always remain, and these errors affect the quality of the rendered view. Even with accurate depth, artifacts may occur due to mismatches in the rendered view, and this effect grows as the virtual view moves away from the original camera position [21].
Fig. 2 Coding 3D video with 2D plus depth method [8]
3 Proposed Method
Computing the depth of an image consists of two parts of a different nature: the first is to obtain the depth of the different objects, and the second is to compute the depth inside each object. In this paper, to improve the depth estimation, the depth is first calculated from one frame and then refined using consecutive frames.
According to the method proposed in this paper, an appropriate method for separating the objects, referred to as segmentation, is selected first. The frame is then divided into small blocks containing two objects each, and the arrangement of the two objects in each block is determined; after calculating the overall order, the final depth map is computed with the assistance of the initial depth map obtained from the motion vectors. Fig. 3 shows the structure of this method.
Fig. 3 Conversion of 2D to 3D in the proposed method
3.1 Image segmentation
In computer vision, segmentation is the process of dividing a digital image into multiple parts (sets of pixels). The purpose of segmentation is to simplify or change the representation of an image into something more meaningful and easier to analyze. Image segmentation is normally used to locate objects and boundaries (lines, curves, etc.). More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share similar features. Each pixel is assigned to a region according to features such as color, intensity, or texture that differ significantly from those of the adjacent regions.
3.1.1 K-Means segmentation
K-means is an image clustering algorithm that is considered unsupervised. Despite its simplicity, it is a basic building block for many clustering methods (such as fuzzy clustering). Fig. 4 summarizes the progress of the algorithm over several iterations.
Fig. 4 Progress of K-Means algorithm
Performing image segmentation with K-means is very simple for gray-scale images, since each pixel has only one attribute (its gray value): the algorithm groups the parts of the image that are very similar in gray level into the same cluster. Color images are considered here, so if RGB space is used, either the three color values are used as attributes of the objects, or one quantity must be derived from these three values before the algorithm is applied. The approach used here is to first convert each frame into the RGB color space, which consists of three matrices for red, green, and blue; a single matrix must then be obtained from these three. Simply summing the corresponding values of the three matrices is not suitable for segmentation, because the segmentation algorithm may then wrongly place elements with different colors in the same cluster. To solve this problem, three different coefficients can be used for the three channels; because the resulting numbers must lie in the common image range (for instance, 0 to 255), the obtained values are finally normalized. Fig. 5 shows one frame and its segmentation by K-means.
Fig. 5 One frame and its segmented image by K-Means
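To make the step concrete, a minimal sketch of this weighted-channel K-means clustering is given below. The channel weights, the number of clusters, and the function name are illustrative assumptions; the paper does not prescribe a particular implementation.

```python
import numpy as np

def kmeans_segment(frame_rgb, k=20, weights=(1.0, 0.7, 0.4), iters=10, centers=None):
    """Cluster the pixels of an RGB frame into k segments.

    frame_rgb: HxWx3 uint8 array. weights: assumed per-channel coefficients used to
    collapse the three channels into one normalized feature. centers: optional
    cluster centers reused from the previous frame. Returns (labels HxW, centers)."""
    h, w, _ = frame_rgb.shape
    # Combine the three channels with different coefficients and normalize to [0, 255]
    feat = (frame_rgb.astype(np.float64) * np.asarray(weights)).sum(axis=2)
    feat = 255.0 * (feat - feat.min()) / (feat.max() - feat.min() + 1e-9)
    feat = feat.ravel()

    if centers is None:                               # first frame: random initialization
        centers = np.random.choice(feat, size=k, replace=False)

    for _ in range(iters):
        # Assign every pixel to its nearest cluster center
        labels = np.argmin(np.abs(feat[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of the pixels assigned to it
        for c in range(k):
            if np.any(labels == c):
                centers[c] = feat[labels == c].mean()

    return labels.reshape(h, w), centers
```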
3.1.2 Choosing the center of segments
The output of the K-means clustering algorithm depends on the initial selection of the cluster centers, which causes the clustering results to differ between runs of the algorithm. Another difficulty of K-means is that the number of clusters must be fixed before segmentation. Since the proposed algorithm is automatic and the number of objects differs between videos and frames, the number of clusters is not known in advance.
As mentioned above, the number of segments differs between image sequences but is similar across the frames of the same sequence. The number of segments and their centers can therefore be computed for one frame and reused for the following frames. This operation is time-consuming for the first frame, but by reusing the obtained number of clusters and suitable centers, the segmentation of the subsequent frames can be done in less time, so the overall segmentation time is reduced.
Because color is used and an object may contain several colors, the number of clusters used in the K-means algorithm is chosen larger than the actual number of objects in the scene. By selecting a value greater than a certain threshold, even if it is slightly smaller than the number of objects in the frame, the objects in the frame will still be separated; however, parts of different objects may receive the same label, and these can be distinguished by the spatial distance between them (Fig. 6 shows parts with the same label lying in different areas). Note that for the first frame the cluster centers are selected randomly.
Fig. 6 A segmented image using K-Means. Left image shows parts with same labels but
in different areas and right image shows small areas
Segmentation of the first frame is therefore done as follows:
• The K-means algorithm is applied with an arbitrary number of clusters α (in the tests performed, the best value for the dataset was α = 20) and random centers μ.
• Segments that share a label but are spatially separated from each other are assigned different labels. An example of this process can be seen in Fig. 6.
With the above process, the initial segmentation of the first frame is obtained. The resulting number of segments is used as the number of clusters for the subsequent frames, and the midpoint in the sequential arrangement of each segment's pixels is selected as its center (in every region, the points are read row by row from the top, and the midpoint of the read points is selected as the center).
3.1.3 Thresholding
The initial segmentation may contain small regions that are actually parts of a larger object. To remove them, a thresholding operation is applied: segments whose size (here, the number of pixels in the segment) is smaller than β are merged with their largest adjacent segment. In the tests performed, the best value of β lay in the range [9, 16].
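A possible sketch of this thresholding step is shown below; the 4-connected adjacency test and the default value of β are assumptions for illustration.

```python
import numpy as np

def merge_small_segments(labels, beta=12):
    """Merge every segment smaller than beta pixels into its largest adjacent segment."""
    changed = True
    while changed:
        changed = False
        ids, counts = np.unique(labels, return_counts=True)
        sizes = dict(zip(ids.tolist(), counts.tolist()))
        for lab, size in sizes.items():
            if size >= beta:
                continue
            mask = labels == lab
            if not mask.any():                      # already absorbed earlier in this pass
                continue
            # 4-connected dilation of the segment to find its neighboring labels
            dil = np.zeros_like(mask)
            dil[:-1, :] |= mask[1:, :]; dil[1:, :] |= mask[:-1, :]
            dil[:, :-1] |= mask[:, 1:]; dil[:, 1:] |= mask[:, :-1]
            neighbors = np.unique(labels[dil & ~mask])
            if neighbors.size:
                biggest = max(neighbors.tolist(), key=lambda n: sizes.get(n, 0))
                labels[mask] = biggest              # absorb into the largest neighbor
                changed = True
    return labels
```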
Fig. 7 An example of segmentation by proposed schema
3.1.4 Integration of segments by using the motion vector to obtain
final objects
The segmentation obtained in the previous part does not yet have the properties required for 2D-to-3D conversion. As can be seen in Fig. 6, there are regions that belong to the same object but carry different labels (such as regions 1 and 2). Another feature of video that can be used for segmentation is motion: different parts of one object usually move similarly. Section 3.2 describes how the motion vectors from frame I to frame P are obtained; the resulting motion map is used here. The merging of segments using motion proceeds as follows:
1. Starting from the top-left corner of the image (as shown in Fig. 7), the first segment is selected and all of its neighbors are determined.
2. If one of those neighbors has a motion vector whose difference from the segment's motion vector is less than ε, it receives the same label. If all neighbors have a different motion vector, the next segment is checked.
3. Steps 1 and 2 are repeated on the updated segmented image, and this continues until no further change occurs in the segmented image.
At the end of this step, the desired objects are obtained. Fig. 7 shows an example of segmentation with the proposed algorithm, in which the objects are ready to be used for 2D-to-3D conversion.
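A sketch of this motion-based merging loop is given below, assuming that a per-segment motion vector (obtained as in section 3.2) is available in a dictionary; the threshold ε and the 4-connected neighborhood are illustrative.

```python
import numpy as np

def merge_by_motion(labels, motion, eps=1.0):
    """Merge neighboring segments whose motion vectors differ by less than eps,
    repeating until the label image no longer changes.

    labels: HxW integer segment labels; motion: dict label -> (dy, dx)."""
    changed = True
    while changed:
        changed = False
        for lab in np.unique(labels):
            mask = labels == lab
            # 4-connected neighbors of this segment
            dil = np.zeros_like(mask)
            dil[:-1, :] |= mask[1:, :]; dil[1:, :] |= mask[:-1, :]
            dil[:, :-1] |= mask[:, 1:]; dil[:, 1:] |= mask[:, :-1]
            for nb in np.unique(labels[dil & ~mask]).tolist():
                diff = np.hypot(motion[lab][0] - motion[nb][0],
                                motion[lab][1] - motion[nb][1])
                if diff < eps:                      # similar motion: same object
                    labels[labels == nb] = lab      # the neighbor takes the same label
                    changed = True
    return labels
```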
3.2 Motion vector and construction of initial depth map
Motion estimation is based on dividing the frame into a matrix of blocks and matching those blocks. For each block of frame I (the current frame), a search is performed in frame P (the next frame) in the neighborhood of that block, and among the candidate blocks the one with the greatest similarity is selected as the corresponding block. The search area is p pixels around the four sides of the block position in the reference frame; p is known as the search parameter. Larger motions require a larger value of p, but a larger search parameter also increases the computational complexity. Finally, the block with the lowest cost is selected. There are several cost functions, such as the Mean Absolute Difference (MAD) and the Mean Squared Error (MSE), shown in equations (1) and (2).
MAD = (1/N²) Σ_{i=0..N−1} Σ_{j=0..N−1} |C_{ij} − R_{ij}|        (1)

MSE = (1/N²) Σ_{i=0..N−1} Σ_{j=0..N−1} (C_{ij} − R_{ij})²        (2)
where N is the block side length (so the block has N² points), and C_{ij} and R_{ij} are the pixels of the current frame and the reference frame, respectively, that are compared.
Several search strategies exist for finding the corresponding block. The simplest is the full search, which examines all candidates in the neighborhood; other strategies speed up the search. In this paper the four-step search (4SS) is used.
Fig. 8 Patterns of points that should be checked at each step of 4SS: (a) first-step pattern, (b) second-step pattern, (c) third-step pattern, (d) fourth-step pattern
4SS uses a center-based search with a halfway-stop condition. In the first step, regardless of the value of the search parameter, the step size is set to S = 2, meaning that points two pixels away from the current location are searched, so 9 points inside a 5×5 window are examined. If the minimum cost is at the center of the search window, the algorithm jumps to the fourth step. If the minimum is at one of the other 8 points, that point becomes the new search center; in the second step the search window is still 5×5, but depending on the location of the previous minimum only 3 or 5 new locations are examined: as in the patterns shown above, if the minimum was at a corner the search is done at 3 locations, otherwise at 5 locations. If the minimum again falls at the center of the 5×5 window, the algorithm goes to step 4; otherwise the third step is performed, which is exactly like the second step with the same search patterns. In the fourth step a 3×3 window is used, i.e. S = 1, and the location of the minimum cost is the best match for the block (an example is shown in Fig. 9). In the example, the 9 points indicated by large circles are examined first; because the minimum is not at the center of the window, the second step is performed and the three points indicated by squares are examined according to the patterns presented above. This process continues until the end of the fourth step, where the best match is found. In this example the algorithm examines 17 points in the best case and 27 points in the worst case.
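As an illustration, the sketch below implements the block matching with the MAD cost of equation (1) and a simplified four-step search; for clarity it evaluates the full 3×3 pattern of candidate offsets at every step (step sizes 2, 2, 2, 1) rather than only the 3 or 5 new points of the exact 4SS described above.

```python
import numpy as np

def mad(block_a, block_b):
    """Mean absolute difference between two equally sized blocks (equation (1))."""
    return np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).mean()

def four_step_search(cur, ref, top, left, n=16):
    """Return the motion vector (dy, dx) of the n x n block of `cur` at (top, left)
    toward its best match in `ref`, using a simplified four-step search."""
    h, w = ref.shape
    block = cur[top:top + n, left:left + n]

    def cost(dy, dx):
        y, x = top + dy, left + dx
        if y < 0 or x < 0 or y + n > h or x + n > w:
            return np.inf                          # candidate falls outside the frame
        return mad(block, ref[y:y + n, x:x + n])

    cy, cx = 0, 0                                  # current search center (offset)
    for s in (2, 2, 2, 1):
        cands = [(cy + dy, cx + dx) for dy in (-s, 0, s) for dx in (-s, 0, s)]
        best = min(cands, key=lambda c: cost(*c))
        if best == (cy, cx) and s == 2:
            # minimum at the window center: jump directly to the final 3x3 step
            cands = [(cy + dy, cx + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            return min(cands, key=lambda c: cost(*c))
        cy, cx = best
    return cy, cx
```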
Fig. 9 Four-Step Search of a block
The motion magnitude of a block whose center moves from (i, j) to (m, n) is calculated as:

D = √((m − i)² + (n − j)²)        (3)
The initial depth map is computed from the motion of the blocks. The assumption used here is that objects closer to the camera move more and therefore have larger motion vectors. At the end of this step, an initial depth map of the same size as the original frame is created; its values lie between 0 and 255, and points closer to the camera (larger motion vectors) receive larger gray-scale values.
d(i, j) = α·D_b        (4)

To map d(i, j) to the range [0, 255], α can be chosen as 255/max(D).
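Using the four_step_search sketch above, the construction of the initial depth map of equations (3) and (4) could look roughly as follows; the block size and frame layout are assumptions.

```python
import numpy as np

def initial_depth_map(cur, ref, block=16):
    """Initial depth map: per-block motion magnitude scaled into [0, 255],
    so blocks with larger motion (assumed nearer to the camera) appear brighter."""
    h, w = cur.shape
    rows, cols = h // block, w // block
    mags = np.zeros((rows, cols))
    for by in range(rows):
        for bx in range(cols):
            dy, dx = four_step_search(cur, ref, by * block, bx * block, n=block)
            mags[by, bx] = np.hypot(dy, dx)                    # equation (3)
    alpha = 255.0 / max(mags.max(), 1e-9)                      # scale of equation (4)
    depth = np.zeros((h, w))
    for by in range(rows):
        for bx in range(cols):
            depth[by*block:(by+1)*block, bx*block:(bx+1)*block] = alpha * mags[by, bx]
    return depth.astype(np.uint8)
```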
3.3 Selection of the blocks containing two objects
In this section, we offer a simple way to obtain the arrangement of the objects in the scene according to their distance from the camera. The proposed method tries to divide the frame as far as possible into small blocks containing two objects, which significantly reduces the computational and programming complexity.
The first step is to divide the segmented image obtained in the previous section into blocks, each containing exactly two objects. Different methods can be used for this; the method used here is inspired by divide and conquer. A divide-and-conquer algorithm works by recursively partitioning a problem into two or more sub-problems, and the partitioning continues until the resulting sub-problems are simple enough to be solved directly. The answer to the main problem is then assembled from the answers of the sub-problems. This technique is the basis of efficient algorithms for a variety of problems, for example merge sorting.
In any recursive algorithm there is considerable freedom in choosing the base cases (the small sub-problems that are solved directly in order to terminate the recursion). Choosing the smallest or simplest base case usually leads to the simplest program; on the other hand, performance improves when the recursion stops at larger cases, each of which is solved non-recursively. Overall, this strategy avoids recursive calls that do little or no work. Since a divide-and-conquer algorithm ultimately reduces each instance of the problem to a large number of small instances, the total cost of the algorithm is dominated by them, especially when the cost of dividing and combining is low.
Here, divide and conquer is used in a way that is cost-effective in terms of time and memory to select the appropriate blocks containing two objects. The choice of the initial cases strongly affects the final result and the complexity of the problem. Since the video frames used in this domain are likely to contain more than one or two objects, the probability that the whole frame itself is a valid answer is very low, so the main frame is first divided into 4 blocks.
One way to implement a divide-and-conquer method is with a stack, which is used here to store the sub-problems. At the beginning, the 4 parts of the divided frame are pushed onto the stack. A block is then popped from the stack and the number of objects it contains is counted. Three cases arise. In the first case the block contains only one object; it is then not suitable for determining the order of two objects and is discarded. In the second case the block contains exactly two objects and is one of the sub-answers of the problem. The block could be kept at its current size, but large blocks would increase the computational cost later, so a sub-block with a size close to the threshold £ × ᴦ is sought as the target block: if the block is larger than the threshold, it is divided into four sub-blocks, the first sub-block is selected and checked against the threshold, and this process is repeated until the size fits the threshold and cannot shrink further; the remaining sub-blocks are discarded and the selected block is kept as one of the answers. In the third case the block contains three or more objects; it is divided into 4 parts and the 4 sub-blocks are pushed onto the stack instead of the original block. The above process continues until the stack is empty.
Fig. 10 Array of available object pairs and sub pictures stack in proposed method
To avoid selecting blocks that contain a pair of objects already covered by another block, an array with cells corresponding to the elements of the stack is maintained in addition to the stack; each cell records the two objects contained in the corresponding block. When a block is popped from the stack, the array is checked to see whether a block with the same pair of objects is already available; if so, the operation on the block is cancelled and the block is discarded. Fig. 10 shows the stack and this array. Finally, when the stack is drained, a group of blocks each containing two objects has been created. In the next stage, each of these blocks is used and the arrangement of the objects in each block is obtained. The procedure for determining the depth order within each block is discussed next.
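The stack-based block selection could be sketched roughly as below. The shrinking step keeps a quarter that still contains both objects, which is a simplification of the description above, and the size threshold stands in for the unspecified £ × ᴦ.

```python
import numpy as np

def select_two_object_blocks(labels, min_size=32):
    """Collect blocks containing exactly two objects by recursive splitting.

    labels: HxW array of object labels. Returns a list of ((top, left, h, w), pair)."""
    H, W = labels.shape

    def quarters(t, l, bh, bw):
        hh, hw = bh // 2, bw // 2
        return [(t, l, hh, hw), (t, l + hw, hh, bw - hw),
                (t + hh, l, bh - hh, hw), (t + hh, l + hw, bh - hh, bw - hw)]

    stack = quarters(0, 0, H, W)              # start from the four quarters of the frame
    seen_pairs, answers = set(), []
    while stack:
        t, l, bh, bw = stack.pop()
        objs = np.unique(labels[t:t + bh, l:l + bw])
        if len(objs) <= 1:
            continue                          # one object: nothing to order, discard
        if len(objs) > 2:
            stack.extend(quarters(t, l, bh, bw))   # three or more objects: split again
            continue
        pair = frozenset(objs.tolist())
        if pair in seen_pairs:
            continue                          # this pair is already covered elsewhere
        # shrink toward the size threshold while keeping both objects in the block
        while bh > min_size and bw > min_size:
            for q in quarters(t, l, bh, bw):
                qt, ql, qh, qw = q
                if set(np.unique(labels[qt:qt + qh, ql:ql + qw]).tolist()) == set(pair):
                    t, l, bh, bw = q
                    break
            else:
                break                         # the two objects only coexist at this size
        seen_pairs.add(pair)
        answers.append(((t, l, bh, bw), tuple(sorted(pair))))
    return answers
```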
3.4 Determining the order of two objects in a block
For each block, it must be determined which of its two objects is in front; after examining all blocks, the overall order of the objects is estimated. For example, a block like the one in Fig. 11(a) may contain a green and a yellow object. To determine the ordering of the two objects, a set of rules is introduced, and by selecting the appropriate rule for each block the order of its two objects is decided. Before stating these rules, a few definitions are given.
The corresponding object in the next frame: each block contains two objects. Subsequent frames in a video do not differ much from each other, so if an object has not left the frame, its corresponding object can be found in the next frame; how the corresponding object is obtained has been described previously. It is possible that all matching errors are greater than the threshold, in which case it is concluded that the object has disappeared or faded out in the next frame. The following function is therefore defined:

e(O) = { 1,  if the object exists with error less than the threshold
         0,  otherwise        (5)

A value of 1 indicates that object O is present inside the next frame, and 0 indicates that object O has faded out.
Motion vector: a slightly different definition of the motion vector is used here. The goal is to find the motion vector of each object within a block. Because the blocks often contain edges and small parts of objects that may be partially covered in the next frame, obtaining their motion vectors is difficult and error-prone. The geometric center of a segmented object is much less likely to be covered or to leave the frame, so its motion vector is easy to compute, and the motion vector of the whole object is then used instead of the motion vector of its part inside the block. Therefore, once per frame, the motion vector of each object and of each 4 × 4 block is calculated. If b is a block containing two object parts o1 and o2, and O1 and O2 are the objects with centers (c11, c12) and (c21, c22) that include o1 and o2, the motion vector is defined as:

MV(o1) = MV(O1) = √((c11 + α)² + (c12 − β)²)        (6)

subject to:  min error = Σ_{i=c11−2..c11+2} Σ_{j=c12−2..c12+2} |frameI(i, j) − frameP(i + α, j + β)|,   −8 ≤ α ≤ 8,  −8 ≤ β ≤ 8
Mobility and immobility: objects appear at different positions in successive frames of a video sequence. Even objects that are fixed during this period still show a small movement between two consecutive frames, due to various factors such as camera movement. Here such small movements are treated as immobility. The movement is obtained by the method described earlier; if its magnitude is greater than ε the object is considered mobile, otherwise immobile:

s(O) = { 1,  if MV(O) > ε
         0,  otherwise        (7)
For each object in the block the value of s is calculated, so three different cases can occur: 1) one object is mobile and the other is stationary, 2) both are stationary, 3) both are mobile. Each of these cases is handled by its own rule.

Object area: the total number of pixels of the object inside the block; this value is denoted S for frame I and S′ for frame P.
With the above definitions, the rules that help determine the order of two objects in a block can be stated. In Fig. 11(a), two objects (green and yellow) lie in a block. In the first case one of them is mobile and the other is stationary (suppose yellow is stationary and green is mobile). This is the easiest case: if the size of the stationary object remains unchanged, the mobile object has moved beneath it and the stationary object is in front of the moving one; otherwise, when the size of the stationary object changes, the mobile object is in front of the stationary one. This behavior can be written as follows:

Rule 1 (MV(A) = 0, MV(B) > 0):
    order(A) > order(B),  if S(A) = S′(A)
    order(A) < order(B),  otherwise        (8)
Fig. 11 Depiction of two objects in a block: (a) initial state, (b) the yellow object is stationary and its area does not change in the next frame, (c) the yellow object is mobile and its area changes in the next frame
The other possibility is that both objects are mobile. The method recommended for this case is to forecast the area of the object in the next frame from its motion vector, so a formula for predicting the area of object A in frame P is needed first; the predicted area of the object in frame P is denoted S_PA. Since MV(o) = MV(O), as mentioned previously, the prediction of the area of object o can be calculated as follows:

B_IA = frameI( b1 − h/2 − |αA| : b1 + h/2 + |αA| ,  b2 − w/2 − |βA| : b2 + w/2 + |βA| )        (9)

b_PA(i, j) = { frameI(i − αA, j − βA)  |  i ∈ [b1 − h/2 : b1 + h/2],  j ∈ [b2 − w/2 : b2 + w/2] }        (10)

f(i, j) = { 1,  if b_PA(i, j) = A
            0,  otherwise        (11)

g(i, j) = { 1,  if b_IA(i, j) ≠ A and b_IA(i, j) ≠ B
            0,  otherwise        (12)

S_PA = Σ_{i=b1−h/2..b1+h/2} Σ_{j=b2−w/2..b2+w/2} f(i, j)  −  Σ_{i=b1−h/2..b1+h/2} Σ_{j=b2−w/2..b2+w/2} g(i, j)        (13)
Suppose [αA, βA] is the motion vector of object A. Equation (9) defines a super-block B_IA that is larger than the considered block b_I, whose center is (b1, b2), by the extent of the motion vector of A (the index I indicates that the block is taken from frame I). Equation (10) obtains b_PA by shifting the pixels of B_IA by [αA, βA]. In equation (11), f determines whether pixel (i, j) of block b_PA belongs to object A, and in equation (12) g takes the value 1 if pixel (i, j) belongs to neither A nor B. Finally, equation (13) calculates the size of the predicted area of A in frame P.
h(i, j) = { 1,  if b_P(i, j) = A
            0,  otherwise        (14)

S′_A = Σ_{i=b1−h/2..b1+h/2} Σ_{j=b2−w/2..b2+w/2} h(i, j)        (15)

error(A) = |S_PA − S′_A|        (16)
Equation (15) calculates the actual area of object A in block b of frame P by counting its pixels, and equation (16) gives the difference between the predicted area and the real area of A in block b of frame P. If this error is small, object A is in front of B; otherwise part of A is covered by B, so A is located behind B. Rule 2 is thus defined as follows:
Rule 2 (MV(A) > 0 and MV(B) > 0):
    order(A) > order(B),  if error(A) < ε
    order(A) < order(B),  otherwise        (17)
Fig. 12 One block of a segmented image: (a) a part of the segmented image, (b) a block from frame I, (c) the block at the same position in frame P, (d) block b expanded to the extent of the motion vector of A
Fig. 13 Procedure of predicting the area of A: (a) block b in the middle and block B, which contains block b and its surrounding pixels, (b) block b after shifting by the motion vector of the object depicted by •, (c) new objects entering block b, (d) colored version of (c)
When both objects are stationary, their order cannot be obtained from motion, so in this case one of the two objects is randomly considered to be in front.
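A compact sketch of how Rules 1 and 2 and the all-stationary case could be combined for one block is shown below; the dictionary layout, threshold values, and the arbitrary choice in the all-stationary case are assumptions, and the area prediction of equations (9)-(13) is taken as a precomputed input.

```python
def order_pair(A, B, eps_move=1.0, eps_area=10.0):
    """Return 'A' or 'B': which of the two objects in a block is nearer the camera.

    A and B are dicts with keys 'mv' (motion magnitude), 'area_I' and 'area_P'
    (pixel area of the object in frames I and P) and, for A, 'area_pred'
    (area predicted in frame P from A's own motion, equations (9)-(13))."""
    a_moves, b_moves = A['mv'] > eps_move, B['mv'] > eps_move

    if not a_moves and not b_moves:
        return 'A'                            # both stationary: order chosen arbitrarily

    if a_moves != b_moves:
        # Rule 1: inspect the stationary object; if its visible area is unchanged in
        # the next frame, the moving object passed behind it, so it stays in front.
        still, other = ('A', 'B') if not a_moves else ('B', 'A')
        s = A if still == 'A' else B
        return still if s['area_I'] == s['area_P'] else other

    # Rule 2: both mobile; if A's actual area in frame P matches the area predicted
    # from A's own motion, A is not occluded by B and is therefore in front of B.
    return 'A' if abs(A['area_pred'] - A['area_P']) < eps_area else 'B'
```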
3.5 Overall order of objects
Finally, a table can be created by merging all pairs of objects. For example, in the table shown below, in each column the object in the top row is in front of the object in the bottom row:

In front:  1  1  2  2  3  4
Behind:    2  4  5  3  5  2
The pairs in the table above must somehow be combined and arranged into an ordered array. There are many ways to do this; one is to use a tree in which each parent is a member of the top row and its child the corresponding member of the bottom row. The columns of the table are read from left to right: if the nodes do not yet exist, they are added in the appropriate place, and if they already exist, their parent or child links may be changed. Fig. 14 shows the changes while building the tree for this example.
Fig.14 Example of obtaining overall order using order pairs
Finally, the nodes are read from the leaves to the root and pushed into an array; the obtained array gives the overall order of the objects in the scene.
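The combination of the pairwise orders can be sketched as a topological sort of the "in front of" relation, which is equivalent to reading the tree of Fig. 14 from leaves to root (assuming the pairs are mutually consistent):

```python
from collections import defaultdict

def overall_order(pairs):
    """Merge pairwise (front, back) relations into a single order, farthest first."""
    in_front_of = defaultdict(set)        # in_front_of[x]: objects that x occludes
    nodes = set()
    for front, back in pairs:
        in_front_of[front].add(back)
        nodes.update((front, back))

    order, remaining = [], set(nodes)
    while remaining:
        # farthest candidates: objects with nothing left behind them
        layer = [x for x in remaining if not (in_front_of[x] & remaining)]
        if not layer:
            layer = list(remaining)       # inconsistent pairs: emit the rest arbitrarily
        for x in sorted(layer):
            order.append(x)
            remaining.remove(x)
    return order

# Pairs from the example table: (front, back)
print(overall_order([(1, 2), (1, 4), (2, 5), (2, 3), (3, 5), (4, 2)]))
# -> [5, 3, 2, 4, 1], i.e. object 5 is farthest and object 1 is nearest
```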
3.6 Estimation of final depth map
In this section, the depth map obtained in section 3.2 is refined into the final depth map. The initial depth map contains the motion information, which may lead to errors in depth estimation. One error occurs when an object stands in the foreground of the image without moving: because of its near-zero motion vector, a depth value close to zero is wrongly assigned to it. Likewise, if an object behind another one moves faster (and therefore has a larger motion vector), it is considered closer to the camera than the object in front of it. To solve these problems, the ordering array obtained in the previous section is used.
3 5 1 4 6 8 2 10 9 7
Assume that the array of the objects' order has 10 cells, as above. The content of each cell is the object label obtained from the image segmentation, and the order of the objects from left to right represents the order calculated in the previous section. Objects closer to the camera receive larger values in the range 0 to 255. We now define α_k for the k-th object as follows, where n is the number of layers (objects):
α_k = (255/n) · (k − 1)        (18)
So if the initial depth of pixel (i, j) of object k is denoted d_k(i, j), the final depth D_k(i, j) can be calculated with equation (19):
D_k(i, j) = d_k(i, j)/n + α_k = (255·(k − 1) + d_k(i, j)) / n        (19)
3.7 Depth Image Based Rendering (DIBR)
Depth image based rendering is a virtual view synthesis process based on images of a still or moving scene together with information on the depth of each pixel. Conceptually, the new views are generated in two steps: first, the original image points are re-projected into the three-dimensional world using the corresponding depth data; then, this three-dimensional space is projected onto the image plane of a virtual camera located at the position of the view to be reconstructed. The concatenation of this 2D-to-3D re-projection and the subsequent projection is usually called 3D image warping [6, 9].
3.7.1 Three-dimensional image warping
Consider an arbitrary point M of the three-dimensional scene, two cameras, and m and m′ as the projections of M in the first and second view, respectively. Under the assumption that the world coordinate system coincides with the coordinate system of the first camera, the following two equations hold [19]:

m̃ ≅ A·P_n·M̃        (20)

m̃′ ≅ A′·P_n·D·M̃        (21)

Here m̃ and m̃′ denote the two-dimensional image points in homogeneous notation, M̃ the homogeneous three-dimensional point, and ≅ equality up to a non-zero scale factor. The 4×4 matrix D contains the rotation R and the translation t that transfer three-dimensional points from the world coordinate system to the camera coordinate system of the second view, and the matrices A and A′ contain the intrinsic parameters of the first and the second camera, respectively. The point M can then be recovered as:

M = Z·A⁻¹·m̃        (22)
Substituting (22) into (21) leads to the classical disparity equation that defines the depth-dependent relationship between corresponding points in two images of the same three-dimensional scene:

Z′·m̃′ = Z·A′·R·A⁻¹·m̃ + A′·t        (23)
This disparity equation can be viewed as a 3D image warping that generates an arbitrary new view from a known reference image. It only requires the position and orientation of the virtual camera relative to the reference camera and the intrinsic parameters of the virtual camera; then, if the depth Z of every pixel of the original image is known, the virtual image can be synthesized by applying equation (23) to all parts of the original picture.
3.7.2 Construction of stereoscopic image
On a stereoscopic display, two views of a three-dimensional scene with a small difference between them are reconstructed and shown simultaneously on the screen. The difference between the left-eye and right-eye image data, called the disparity, is interpreted by the human brain so that the two images are perceived as a single three-dimensional image.
Fig. 15 Reconstruction of depth on stereoscopic display [19]
3.7.3 Shift sensor algorithm
In real high-quality stereoscopic cameras, one of two different approaches is usually used to adjust the disparity of the scene (the convergence distance Zc in the three-dimensional scene). In the "toed-in" approach, the disparity is adjusted by jointly rotating the left-eye and right-eye cameras toward each other. In the shift-sensor approach, the convergence plane is obtained by a small shift h of the CCD sensors of two parallel cameras (Fig. 16).
Fig.16 Arrangement of shift sensors in stereoscopic camera [19]
In the shift-sensor stereoscopic camera arrangement, the convergence distance Zc is obtained from the shift h of the cameras' CCD sensors. All that needs to be defined is two virtual cameras, one for the left eye and one for the right eye. With respect to the original view, these cameras are moved symmetrically and their CCD sensors are shifted with respect to the lens position. Mathematically, the sensor shift can be formulated as a displacement of the camera's principal point:
A* = A + [ 0 0 h ]
         [ 0 0 0 ]
         [ 0 0 0 ]        (24)
Here and in the following, the superscript * stands for either a prime (′) or a double prime (″); A* refers to both A′ and A″, meaning that the equation describes the intrinsic parameters of both the left and the right virtual camera. Using equation (24), and assuming that the motion of both virtual cameras relative to the reference camera is a pure translation (that is, R = I, where I is the 3×3 identity matrix), the warping takes the following simple form:

A*·R·A⁻¹ = A*·A⁻¹ = I + [ 0 0 h ]
                        [ 0 0 0 ]
                        [ 0 0 0 ]        (25)
Z*·m̃* = Z·(m̃ + [h, 0, 0]ᵀ) + A*·t        (26)
This equation can be simplified further by assuming that the only non-zero motion component needed for the shift-sensor virtual cameras is a horizontal translation t_x in the focal plane of the main camera. With t_z = 0, the depth of a three-dimensional point is the same in the world coordinate system (which was chosen equal to the camera coordinate system of the main view) and in the virtual camera coordinate system, i.e. Z* = Z, so equation (26) reduces to the following equation:
m̃* = m̃ + A*·t/Z + [h, 0, 0]ᵀ ,   with t = [t_x, 0, 0]ᵀ        (27)
In this case, the position of any pixel (u, v) in the warped image can be calculated easily:

u* = u + α_u·t_x/Z + h ,   v* = v        (28)
The horizontal camera translation t_x is equal to half of the selected interaxial distance t_c, with the sign given by the movement direction:

t_x = { −t_c/2,  left-eye view
         t_c/2,  right-eye view        (29)
As previously described, the value of the sensor shift h depends on the selected convergence distance Zc. Knowing that for Z = Zc the horizontal component of the simplified formula (28) must be equal for the right-eye and left-eye views (that is, u* = u), equation (30) is obtained [19]:

h = −t_x · α_u / Z_c        (30)
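To illustrate the shift-sensor rendering, the sketch below warps one color image into a single stereo view using equations (28)-(30). All numeric parameters (focal length α_u, interaxial distance t_c, convergence distance Zc, and the depth range used to invert the 8-bit depth code) are illustrative assumptions, and the simple forward warping leaves disocclusion holes unfilled.

```python
import numpy as np

def render_view(image, depth, alpha_u=1000.0, tc=60.0, zc=500.0,
                z_near=100.0, z_far=2000.0, left=True):
    """Warp `image` into the left- or right-eye view of a shift-sensor stereo pair.

    depth: HxW uint8 map where 255 is nearest (as defined in the introduction)."""
    h_img, w_img = depth.shape
    tx = -tc / 2.0 if left else tc / 2.0              # equation (29)
    # invert the 8-bit depth code: 255 -> z_near, 0 -> z_far
    z = z_far + (depth.astype(np.float64) / 255.0) * (z_near - z_far)
    h_shift = -tx * alpha_u / zc                      # sensor shift h, equation (30)
    disparity = alpha_u * tx / z + h_shift            # u* - u, equation (28)

    out = np.zeros_like(image)
    for v in range(h_img):
        for u in range(w_img):
            u_new = int(round(u + disparity[v, u]))
            if 0 <= u_new < w_img:
                out[v, u_new] = image[v, u]           # forward warping; v* = v
    return out
```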
4 Experimental results
To evaluate the performance of the proposed algorithm, this section presents the results of its implementation and compares them with the results of other similar methods. The data set used in this article consists of the monoscopic image sequences listed in Table 1. All of the videos are in YUV 4:2:2 format.
Table 1 Data sets that have been used

Video sequence    Size
Akko&Kayo         480×680
Hall              288×352
Miss America      144×176
Claire            144×176
Ballet            768×1024
Breakdancing      768×1024
Fig. 17 shows the depth maps obtained for one frame of several sequences:
Fig. 17 Depth map obtained from proposed method for a frame of various sequences
An example of the depth maps obtained by the proposed algorithm was presented above. The results of this method should now be compared with the statistics of other methods. One 2D-to-3D conversion method is inter-frame pixel matching. Automatic 2D-to-3D conversion methods are divided into two types, pixel-based and object-based; since inter-frame pixel matching only uses the motion of blocks between frames, it belongs to the pixel-based methods. Its computational cost is lower than that of methods that perform object extraction, because the only time-consuming operations are block matching and finding the motion vectors. In spite of its good speed, however, its quality is not high, since errors occur in the motion estimation: the motion vectors of edges and textured surfaces are found easily, but the central, smooth, and uniformly colored parts of objects are usually considered stationary, so their motion vectors are hard to compute.
d(i, j) = λ·√(MV_x(i, j)² + MV_y(i, j)²)        (31)
The radical part of the equation is the magnitude of the motion vector of pixel (i, j). As explained above, smooth areas and the central parts of objects may receive a value of zero, causing large errors in the depth map. Another problem is that this method computes depth directly from the motion vector without any further coefficients, so the same depth value is assigned to objects that are at different distances from the camera but have the same motion-vector magnitude. Thus the ability to estimate depth for areas with zero motion vectors, and for areas with similar motion vectors, can be considered as two evaluation criteria.
Fig. 18 The depth map obtained for Akko&Kayo using inter-frame pixel matching
Fig. 19 The depth maps obtained for Akko&Kayo using the 2D-to-3D Conversion Based on Motion and Color Mergence method and using the proposed method
Fig. 20 The depth maps obtained for Hall and Miss America using the object-based 2D-to-3D video conversion for effective stereoscopic content generation method (c and d) and using the proposed method (a and b)
An important property that generated depth maps must have is that the depth values assigned to the pixels of one object should be similar. In addition, the depth values of the pixels of an object should not be smaller than those of an object behind it. This property is very effective in creating the three-dimensional perception, and its accuracy is most important at the edges of objects; it therefore also reflects the accuracy of the segmentation method. To measure it, 100 points were randomly selected from each picture and it was checked whether the depth value assigned to the object is greater than that of the object behind it. Table 2 shows the results.
Table 2 Accuracy of some 2D to 3D conversion methods

Video sequence   Number of objects   Inter-frame pixel matching   Object-based method   Proposed method
Miss America     2                   73%                          90%                   91%
Hall             2-3                 66%                          81%                   85%
Fig. 21 shows the same results in the form of a graph.

Fig. 21 Accuracy comparison of different 2D-to-3D conversion methods (accuracy of object depth and edge depth for Miss America and Hall, for the inter-frame pixel matching, object-based, and proposed methods)
One of the most important criteria for evaluating the different methods is the computational complexity. The total computational cost is the sum of the costs of all steps of the conversion. The following tables analyze the computational complexity of each of the methods mentioned.
Table 3 Computational complexity of the inter-frame method

Step                    Computational complexity
Movement estimation     O(n²/16 × 4S²) = O(n² × S²)
Table 4 Computational complexity of the Motion and Color Mergence method

Step                        Computational complexity
Movement estimation         O(n²/16 × 4S²) = O(n² × S²)
Assigning weight to blocks  O(n²)
Segmentation                O(n²)
Median filter               O(n² × f²)
Color Mergence              O(n²)
Using rules                 O(k)
Assigning depth             O(n²)
Table 5 Computational complexity of the object-based method

Step                    Computational complexity
Movement estimation     O(n²/16 × 4S²) = O(n² × S²)
Segmentation            O(n²)
Determining In          O(n²)
Using rules             O(k)
Ordering                O(k²)
Depth assignment        O(n²)
Table 6 Computational complexity of the proposed method

Step                    Computational complexity
Segmentation            O(n²)
Thresholding            O(n²)
Movement estimation     O(n² × 28/16)
Block selection         O(log(n²))
Ordering                O(h)
Sorting orders          O(h²)
Assigning depth         O(n²)
5 Conclusion
A new method for 2D-to-3D conversion has been presented in this article. Stereoscopy is one way to display 3D content that uses two video sequences, and using video plus depth instead of two video sequences is one of the most popular 3D coding methods. Views usually have certain properties: pixels inside an object have a similar distance from the camera, and objects with larger motion are often closer to the camera. Using these ideas, the proposed method first extracts the objects of the scene with a combined color and motion segmentation, and then performs object ordering and depth assignment. The results show the good performance of this method in comparison with other methods. Similar 2D-to-3D conversion methods also try to estimate the depth map: pixel-based methods are usually faster and well suited to online use but lack accuracy, while object-based methods achieve better accuracy at lower speed. The results of these methods have been analyzed; each of them tries to produce a better depth map, but the refinement of the depth map is directly related to the computational complexity.
References
1. Herrera, J. L., et al. (2014). Fast 2D to 3D conversion using a clustering-based
hierarchical search in a machine learning framework. 2014 3DTV-Conference: The
True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON)
2. Herrera, J. L., et al. (2015). Edge-based depth gradient refinement for 2D to 3D
learned prior conversion. 2015 3DTV-Conference: The True Vision - Capture,
Transmission and Display of 3D Video (3DTV-CON)
3. Lee, H. S., et al. (2015). Foreground-based depth map generation for 2D-to-3D
conversion. 2015 IEEE International Symposium on Circuits and Systems (ISCAS)
4. Wafa, A., et al. (2015). Automatic real-time 2D-to-3D conversion for scenic views.
Quality of Multimedia Experience (QoMEX), 2015 Seventh International Workshop
on
5. Feng, Y., et al. (2011). "Object-Based 2D-to-3D Video Conversion for Effective
Stereoscopic Content Generation in 3D-TV Applications." IEEE TRANSACTIONS
ON BROADCASTING 57(2): 500-509
6. Yeong-Kang, L., et al. (2012). An effective hybrid depth-perception algorithm for
2D-to-3D conversion in 3D display systems. 2012 IEEE International Conference
on Consumer Electronics (ICCE)
7. Xi, Y., et al. (2011). Depth map generation for 2D-to-3D conversion by limited user
inputs and depth propagation. 3DTV Conference: The True Vision - Capture,
Transmission and Display of 3D Video (3DTV-CON), 2011
8. Fengli, Y., et al. (2011). Depth generation method for 2D to 3D conversion. 3DTV
Conference: The True Vision - Capture, Transmission and Display of 3D Video
(3DTV-CON), 2011
9. Caviedes, J. and J. Villegas (2011). Real time 2D to 3D conversion: Technical and
visual quality requirements. Consumer Electronics (ICCE), 2011 IEEE International
Conference on
10. Cao, X., et al. (2011). "Semi-Automatic 2D-to-3D Conversion Using Disparity
Propagation." IEEE TRANSACTIONS ON BROADCASTING 57(2): 491-499.
11. Han, K. and K. Hong (2011). Geometric and texture cue based depth-map
estimation for 2D to 3D image conversion. Consumer Electronics (ICCE), 2011
IEEE International Conference on
12. Jiahong, Z., et al. (2011). A novel 2D-to-3D scheme by visual attention and
occlusion analysis. 3DTV Conference: The True Vision - Capture, Transmission
and Display of 3D Video (3DTV-CON), 2011
13. Xu, F., et al. (2008). 2D-to-3D Conversion Based on Motion and Color Mergence.
2008 3DTV Conference: The True Vision - Capture, Transmission and Display of
3D Video
14. Cao, X., et al. (2011). "Converting 2D Video to 3D: An Efficient Path to a 3D
Experience." IEEE MultiMedia 18(4)
15. Chao-Chung Cheng, Chung-Te Li, Liang-Gee Chen. (2010). A novel 2D-to-3D
conversion system using edge information. Consumer Electronics, 2010 IEEE
Transactions on
16. Zhijie Zhao, Ming Chen, Long Yang, Zhipeng Fan, Li Ma. (2010). 2D to 3D video
conversion based on interframe pixel matching. Information Science and
Engineering (ICISE), 2010 2nd International Conference on, pp. 3380-3383
17. Chao-Chung Cheng. Chung-Te Li. Po-Sen Huang. (2009). A block-based 2d-to-3d
conversion system with bilateral filter. Proc. IEEE Int. Conf. Consumer Electronics,
vol. 0, pp. 1–2
18. Zheng, L., et al. (2009). An efficient 2D to 3D video conversion method based on
skeleton line tracking. 2009 3DTV Conference: The True Vision - Capture,
Transmission and Display of 3D Video
19. W. J. f. Speranza, L. Zhang, R. Renaud, J. Chan, C. Vazquez. (2005). Depth
Image Based Rendering for Multiview Stereoscopic Displays:Role of Information at
Object Boundaries. Three-Dimensional TV,Video, and Display IV, vol. 6016, pp.
75-85
20. Lee, P. J. and X. X. Huang (2011). 3D motion estimation algorithm in 3D video
coding. Proceedings 2011 International Conference on System Science and
Engineering
21. Jin Young Lee, Hochen Wey, and Du-Sik Park. (2011). A Fast and Efficient Multi-View Depth Image Coding Method Based on Temporal and Inter-View Correlations of Texture Images. 2011 IEEE
22. Smolic, A., et al. (2007). Coding Algorithms for 3DTV. IEEE Transactions on
Circuits and Systems for Video Technology 17(11): 1606-1621
23. Saxena, A., et al. (2009). "Make3D: Learning 3D Scene Structure from a Single
Still Image." IEEE Transactions on Pattern Analysis and Machine Intelligence
31(5): 824-840
24. Ishibashi, T. Yendo, T. Tehrani, M.P. Fujii, T. Tanimoto, M. (2011). Global view
and depth format for FTV. Digital Signal Processing (DSP), 2011 17th International
Conference on , vol., no., pp.1-6