2D to 3D conversion based on object extraction using motion and color segmentation

Iman Kohyarnejadfard* ([email protected]), Iran University of Science and Technology
Mahmoud Fathy ([email protected]), Associate Professor, Iran University of Science and Technology

Abstract
Human vision relies on the fact that two images with a small spatial difference allow the brain to perceive the world three-dimensionally. Stereoscopy follows the same idea and uses two cameras separated by roughly the distance between the two eyes. One common approach to three-dimensional video coding is "video plus depth", in which the coded images are converted back into three-dimensional views at the receiver by depth-image-based rendering algorithms. In this paper, a method is proposed for estimating depth information based on a combined color and motion segmentation that separates the scene into different layers. Each depth layer lies at a certain distance from the camera, so an initial depth map is created by determining the order of the layers in the scene; finally, using pixel motion, the initial depth map is refined into the final depth map. A virtual view can then be created from any frame and its corresponding depth map, and the pictures can be displayed as two stereoscopic views. Experiments and comparisons with other existing methods indicate an improvement in accuracy and quality over both pixel-based and segment-based methods.

Keywords: 2D to 3D conversion, depth extraction, stereoscopic video, object ordering, object motion

1 Introduction
Most three-dimensional display technologies use stereoscopic presentation. Stereoscopic images consist of two image sequences displayed to the user simultaneously, and various methods exist for encoding three-dimensional images. One of these methods is "video plus depth". The depth map sequence is computed at the transmitter from the left and right image sequences and is sent to the user together with one of the two sequences. The codec at the user side reconstructs the other color image sequence from the transmitted color images and their corresponding depth data with one of the numerous DIBR (Depth Image Based Rendering) algorithms and then displays the two views simultaneously. Therefore, if we can estimate a depth map for a typical image, the other view with a small spatial difference can be generated and a stereoscopic image can be displayed [20,21].

Each depth data sample is treated as a gray-scale video signal. The depth is limited between two boundaries that represent the minimum and maximum distance of a three-dimensional point from the camera, and the depth range is linearly quantized with 8 bits; for example, the nearest point is assigned 255 and the farthest point 0. The depth map defined in this way is therefore a gray-scale image [24].

Several methods have been proposed to create depth maps, such as creating a depth map from limited user inputs. At the beginning of this procedure, which is not fully automatic, the input image is segmented into smaller parts. Next, the picture is marked manually, and then edges and T-junctions are searched. Finally, depth is estimated for each segment and post-processing is applied [7].
The fast 2D to 3D conversion method based on a clustering-based hierarchical search in a machine learning framework infers the 3D structure of a query color image from a training database of color and depth images [1]. The edge-based depth gradient refinement method for learned-prior 2D to 3D conversion presents an automatic conversion approach based on machine learning principles, using the hypothesis that images with a similar structure are likely to have a similar 3D structure [2]. Another approach is foreground-based depth map generation for 2D to 3D conversion: for a given input image, this method determines whether the image is an object-view scene or a non-object-view scene, depending on whether foreground objects clearly distinguishable from the background exist [3].

Automatic real-time 2D-to-3D conversion for scenic views is another method; it uses three depth cues (haze, vertical edges, and sharpness) to estimate a sparse depth map and then obtains the full depth map from the sparse one using an edge-aware interpolation method [4,15].

Object-based conversion of 2D to 3D for producing effective stereoscopic content can also be mentioned. This method tries to estimate each object's depth; the depth ordering is an important cue for understanding the relationship between the depths of the observed objects. The method detects depth layers from depth discontinuities in a two-dimensional video sequence. These discontinuities usually arise from incompatible motion at the edges of moving objects (occlusion), so finding the depth ordering and detecting occlusion in the video sequence is the key problem [5].

Conversion from two-dimensional to three-dimensional based on the combination of motion and color is another method. It uses both motion and color to create stereoscopic images from monoscopic ones. Optical flow is used to estimate two-dimensional motion at the pixel level, which yields finer results, and the minimum-difference information is then applied to combine color information with the optical flow results to provide a proper segmentation. After that, depth is estimated automatically; to obtain the segmentation and decide the depth of each area, some conditions are imposed on a flood-fill procedure [13].

The simplest and most basic approach is 2D to 3D video conversion based on inter-frame pixel matching. In this method the depth map is obtained directly from the estimated motion of macroblocks using motion vectors. It is fast but not of high quality, and various refinements have been proposed to improve the quality of the motion estimation [16].

Other methods with a different basis, such as blur analysis, also exist. There, the depth of the scene is determined by understanding the effect of the varying focal parameter of the image: by measuring the amount of blur in the picture, the focal length is treated as depth in an inverse filter.
This approach has several drawbacks: the amount of blur is not always available, and the blur is influenced not only by the focal length but also by other factors such as the atmosphere. Another way is to obtain depth from geometric constraints; the basis of this method is to exploit the geometric constraints between two time instants, given prior knowledge of the camera settings, including its focal length and speed [10-12].

In this paper, we try to present a method that uses the advantages of the previous methods while resolving their deficiencies. Examination of real depth data shows that most of the depth information that lets the human brain perceive three-dimensional images lies in the objects of the scene rather than in individual pixels. In other words, depth data differ considerably between different objects in the scene, while the pixels within one object have similar depth values because they lie at a similar distance from the camera. Therefore, a depth map close to the real one can be computed by extracting the objects in the image and then estimating their distance from the camera. Color information is available from a single frame, and motion information can be obtained from a sequence of frames. In the method presented in this paper, the image is first segmented according to color information; then, by applying motion information, the objects of the image are extracted; and finally, after determining the order of the objects in the image, the final depth map is obtained.

This paper consists of five sections. Section 2 reviews conventional methods of stereo video coding. Section 3 presents the steps of the proposed algorithm. Section 4 evaluates the results of the proposed method against other available methods, and Section 5 states the conclusions of this study.

2 Conventional methods of stereo video coding
Stereo display is one of the most important cases of multi-view display (with N = 2 views). Stereo compression has been studied for a long time and many related standards are available. A stereo pair consists of two images taken from two points a short distance apart; this distance corresponds to the distance between the two human eyes. In general, the similarity of these images makes them very suitable for joint coding, so that one of them can predict the other: for example, one image is compressed without reference to the other, and the second image is then predicted from the first one [14,22].

Fig. 1 Prediction in stereoscopic coding

Transmitting a video signal together with its corresponding depth map is an alternative to classic stereo video. Using video plus depth information is attractive in terms of compression performance, because each depth data sample can be treated as a low-rate gray-scale video signal [17,18]. The general problem of the video-plus-depth method is content creation. For example, cameras that automatically compute the depth of each pixel are available, but the quality of the recorded depth is limited. Depth estimation algorithms have been studied in computer vision and good results have been obtained, but estimation errors always remain, and these errors affect the quality of the rendered view. Even with accurate depth, artifacts may appear due to mismatches in the rendered view, and this effect grows as the virtual view moves away from the original camera position [21].
Fig. 2 Coding 3D video with the 2D-plus-depth method [8]

3 Proposed Method
Computing the depth of an image consists of two parts of different nature: the first is obtaining the depth of the different objects, and the second is computing the depth inside each object. In this paper, to improve the depth estimate, the depth is first calculated from one frame and then refined using consecutive frames. According to the method proposed here, an appropriate method for separating the objects, i.e. segmentation, is selected first. The frame is then divided into small blocks containing two objects each, the ordering of the two objects in each block is determined, and after computing the overall order, the final depth map is calculated with the help of the initial depth map obtained from the motion vectors. Fig. 3 shows the workflow of this method.

Fig. 3 Conversion of 2D to 3D in the proposed method

3.1 Image segmentation
In computer vision, segmentation is the process of dividing a digital image into multiple parts (sets of pixels). The purpose of segmentation is to simplify or change an image into something more meaningful and easier to analyze. Image segmentation is normally used to locate objects and boundaries (lines, curves, etc.). More specifically, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share similar features. Each pixel is placed in a given region according to features such as color, intensity, or texture that differ significantly from the adjacent regions.

3.1.1 K-means segmentation
K-means is an image clustering algorithm that is considered unsupervised. Despite its simplicity, it is the basis for many other clustering methods (such as fuzzy clustering). Fig. 4 summarizes the progress of the algorithm over several rounds.

Fig. 4 Progress of the K-means algorithm

Performing image segmentation with K-means is very simple for gray-scale images, which have only one variable (the gray value): the algorithm identifies the parts of the image that are very similar in terms of gray level and puts them in the same cluster. Color images are considered here, so if the RGB space is used, either the three color values are used as attributes of the pixels or one quantity is extracted from the three values and the algorithm is run on it. The approach used here is to first carry each frame from its color space into the RGB space, which consists of three matrices (red, green, and blue), and to derive one matrix from these three. A simple way is to sum the corresponding values of the three matrices, but this is not suitable for segmentation, because the algorithm may then wrongly put elements with different colors in the same cluster. To solve this problem, a different coefficient can be used for each of the three values; since the numbers must lie in the usual image range (for instance, 0 to 255), the obtained values are then normalized. Fig. 5 shows one frame and its segmentation by K-means.

Fig. 5 One frame and its segmented image obtained by K-means
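As a minimal sketch of the color-based K-means step described above, the following Python code clusters the three RGB values of each pixel directly, which is one of the two options mentioned in Section 3.1.1. The cluster count k, the frame file name, and the use of scikit-learn's KMeans are illustrative assumptions and not part of the original description.

```python
import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

def kmeans_segment(frame_rgb: np.ndarray, k: int = 20, seed=None) -> np.ndarray:
    """Cluster the RGB values of a frame into k segments and return
    a label map with the same height and width as the frame."""
    h, w, _ = frame_rgb.shape
    # Each pixel becomes a 3-vector of color features (Section 3.1.1).
    features = frame_rgb.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=k, n_init=5, random_state=seed)
    labels = km.fit_predict(features)
    return labels.reshape(h, w)

if __name__ == "__main__":
    frame = np.asarray(Image.open("frame_0001.png").convert("RGB"))  # hypothetical file name
    label_map = kmeans_segment(frame, k=20)
    print("number of segments:", label_map.max() + 1)
```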
3.1.2 Choosing the centers of segments
The output of the K-means clustering algorithm depends on the initial selection of the cluster centers, so different runs of the algorithm can produce different clusterings. Moreover, one of the problems in using the K-means algorithm is that the number of segments must be specified before segmentation. Because the proposed algorithm is automatic and the number of objects differs between videos and frames, the number of segments is not known in advance. As mentioned above, the number of segments differs between image sequences but is similar across the frames of the same sequence. Therefore, the number of segments and their centers can be computed for one frame and reused for the following frames. The operation is time-consuming for the first frame, but by reusing the number of clusters and suitable centers, segmentation of the subsequent frames can be done in less time, so the overall segmentation time is reduced.

Because color is used and an object may contain several colors, the number of clusters used in the K-means algorithm is chosen larger than the actual number of objects in the scene. By selecting a value above a certain threshold, even one slightly smaller than the number of objects in the frame, the objects in the frame will be separated; there may, however, be parts of different objects that receive the same label, and these can be distinguished by the distance between them (in Fig. 6 there are parts with the same label but in different areas). Note that for the first frame the cluster centers are selected randomly.

Fig. 6 A segmented image obtained with K-means. The left image shows parts with the same label but in different areas, and the right image shows small areas

Segmentation of the first frame is therefore done as follows: the K-means algorithm is applied with α clusters (in our tests, the best value for the dataset was α = 20) and random centers μ; then, segments that share a label but are separated from each other are assigned different labels. An example of this process can be seen in Fig. 6. With the above process, the primary segmentation of the first frame is obtained; the resulting number of segments is used as the number of segments for the subsequent frames, and the midpoint in the sequential ordering of each segment's pixels is selected as its center (in each region, points are read row by row from the top, and the midpoint of the read points is selected as the center).

3.1.3 Thresholding
In the primary segmentation, small parts that actually belong to a larger object can be observed. To remove them, a thresholding operation is performed: segments whose size (here, the number of pixels of the segment) is smaller than β are merged with their largest adjacent segment. In our tests, the best values for β were obtained in the range [9, 16].

Fig. 7 An example of segmentation by the proposed scheme
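Below is a minimal sketch of the thresholding step, assuming a NumPy label map as produced by the K-means stage; the 4-connected neighborhood used to find adjacent segments and the helper name are assumptions, not details given in the paper.

```python
import numpy as np

def merge_small_segments(labels: np.ndarray, beta: int = 12) -> np.ndarray:
    """Merge every segment smaller than beta pixels into its largest
    4-connected neighbouring segment (Section 3.1.3)."""
    labels = labels.copy()
    h, w = labels.shape
    changed = True
    while changed:
        changed = False
        ids, sizes = np.unique(labels, return_counts=True)
        size_of = dict(zip(ids.tolist(), sizes.tolist()))
        for seg_id, size in size_of.items():
            if size >= beta:
                continue
            ys, xs = np.nonzero(labels == seg_id)
            neigh = {}
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = ys + dy, xs + dx
                valid = (ny >= 0) & (ny < h) & (nx >= 0) & (nx < w)
                for lab in labels[ny[valid], nx[valid]]:
                    if lab != seg_id:
                        neigh[lab] = size_of.get(lab, 0)
            if neigh:
                # The largest adjacent segment absorbs the small one.
                target = max(neigh, key=neigh.get)
                labels[labels == seg_id] = target
                changed = True
                break  # segment sizes are stale now; recompute on the next pass
    return labels
```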
3.1.4 Integration of segments using the motion vector to obtain the final objects
The segmentation obtained in the previous part does not yet have the properties required for 2D to 3D conversion. As can be seen in Fig. 6, there are areas that belong to the same object but carry different labels (such as areas 1 and 2). Another video feature that can be used for segmentation is motion: different parts of one object usually have similar motion. Section 3.2 explains how the motion vectors from frame I to frame P are obtained; the resulting motion map is used here. The motion-based merging process is as follows:
1. Starting from the top left corner of the image (as shown in Fig. 7), the first segment is selected and all of its neighbors are determined.
2. If one of the neighbors of that segment has a motion vector that differs by less than ε, it receives the same label. If all the neighbors have different motion vectors, the next segment is checked.
3. The first and second steps are repeated on the updated segmented image, and this continues until no further change occurs in the segmented image.
At the end of this process, the desired objects are obtained. Fig. 7 shows an example of segmentation with the proposed algorithm in which the objects are ready for use in the 2D to 3D conversion.

3.2 Motion vectors and construction of the initial depth map
Motion estimation is based on dividing the frame into a matrix of blocks and matching blocks. For each block of frame I (the current frame), a search is performed in frame P (the next frame) around the neighborhood of that block, and among the candidate blocks the one with the greatest similarity is selected as the corresponding block. The search area for a good block match is p pixels around all four sides of the corresponding block position; p is known as the search parameter. Larger motions require a larger p, but a larger parameter also means higher computational complexity. Finally, the block with the lowest cost is selected. Common cost functions are the Mean Absolute Difference (MAD) and the Mean Square Error (MSE), shown in equations (1) and (2):

MAD = (1/N²) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} |C_ij − R_ij|   (1)

MSE = (1/N²) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} (C_ij − R_ij)²   (2)

where the block is N×N and C_ij and R_ij are the pixels of the current frame and the reference frame, respectively, that are compared.

There are several search methods for finding the corresponding block. The simplest is the full search, which examines all candidate positions in the neighborhood; other schemes exist to speed up the search. In this paper a four-step search (4SS) is used.

Fig. 8 Patterns of points checked at each step of 4SS: (a) first step, (b) second step, (c) third step, (d) fourth step

4SS uses a center-based search with a halfway-stop condition. In the first step, regardless of the value of the search parameter, the step size is set to S = 2; that is, positions two pixels around the desired location are searched, so 9 points are examined in a 5×5 window. If the minimum cost is at the center of the search window, the algorithm jumps to the fourth step. If the minimum cost is at one of the other 8 points, that point becomes the new search center, and in the second step the search window is still 5×5. Depending on the location of the minimum cost, either 3 or 5 new locations are examined: as in the patterns shown in Fig. 8, if the minimum cost lies at a corner, 3 locations are searched, otherwise 5. If the minimum cost is again at the center of the 5×5 window, the algorithm goes to step 4; otherwise the third step, which is exactly the same as the second step with the same search patterns, is performed. In the fourth step a 3×3 window is used, i.e. S = 1, and the location of the minimum cost is the best match for the block. An example of the search is shown in Fig. 9.
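To make the block-matching step concrete, the following sketch computes the MAD cost of equation (1) and finds the motion vector of one block with the simple full search over a ±p window; the paper itself uses the faster 4SS search, and the function names and the NumPy frame representation are assumptions.

```python
import numpy as np

def mad(block_c: np.ndarray, block_r: np.ndarray) -> float:
    """Mean absolute difference between two N x N blocks, equation (1)."""
    n = block_c.shape[0]
    diff = np.abs(block_c.astype(np.float32) - block_r.astype(np.float32))
    return float(diff.sum()) / (n * n)

def full_search_mv(frame_i: np.ndarray, frame_p: np.ndarray,
                   top: int, left: int, n: int = 16, p: int = 8):
    """Return the (dy, dx) displacement of the n x n block at (top, left)
    in frame I that best matches frame P within a +/- p search window."""
    h, w = frame_i.shape
    block = frame_i[top:top + n, left:left + n]
    best_cost, best_mv = np.inf, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > h or x + n > w:
                continue  # candidate block falls outside the frame
            cost = mad(block, frame_p[y:y + n, x:x + n])
            if cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv, best_cost
```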
In the example of Fig. 9, the nine points marked by the large circles are examined first. Because the minimum cost is not at the center of the window, the second step is performed, in which the three points marked by squares are examined according to the pattern presented above. The process continues until the end of the fourth step, where the best match is found. For this algorithm the search examines 17 points in the best case and 27 points in the worst case.

Fig. 9 Four-step search of a block

The displacement of a block centered at (i, j) that moves to (m, n) is calculated as:

D = √((m − i)² + (n − j)²)   (3)

The initial depth map is computed from the motion of the blocks. The assumption used here is that objects closer to the camera move more and thus have larger motion vectors. At the end of this stage, the initial depth map is created with the same size as the original frame; its values lie between 0 and 255, and points closer to the camera (with larger motion vectors) receive larger gray values:

d(i, j) = αD_b   (4)

To map d(i, j) into the range [0, 255], α can be taken as 255 / max(D).

3.3 Selection of blocks containing two objects
In this section we offer a simple way to obtain the ordering of the objects in the scene according to their distance from the camera. The proposed method tries to divide the frame as far as possible into small blocks containing exactly two objects; this significantly reduces the computational and programming complexity. The first step is to divide the segmented image obtained in the previous section into blocks that each contain only two objects. Different methods can be used for this; the method used here is inspired by divide and conquer. A divide and conquer algorithm works by recursively partitioning a problem into two or more sub-problems, continuing until the resulting sub-problems are simple enough to solve directly; the answer to the main problem is then obtained from the answers to the sub-problems. This technique is the basis of efficient algorithms for a variety of problems, for example merge sorting. In any recursive algorithm there is considerable freedom in choosing the base case (the small sub-problems that are solved directly to terminate the recursion). Choosing the smallest or simplest base case usually leads to a simpler program; on the other hand, performance improves when the recursion stops at larger cases that are each solved non-recursively, because this strategy avoids recursive calls that do little or no work. Since a divide and conquer algorithm ultimately reduces each instance of the problem to a large number of base cases, the total cost of the algorithm is dominated by them, especially when the cost of dividing and combining is low. Here, divide and conquer is applied in a way that is cost-effective in terms of time and memory while selecting suitable blocks containing two objects. The choice of the initial cases strongly affects the final result and the complexity of the problem. Since the video frames used here are likely to contain more than one or two objects, the probability that the whole frame itself is a valid answer is very low, so the frame is first divided into 4 blocks. One way to implement divide and conquer is with a stack, and here the stack is used to hold the sub-problems, as sketched below and described in detail in the following.
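The following sketch, under simplifying assumptions, implements the stack-based selection described in the next paragraphs: blocks are popped, classified by how many object labels they contain, split into four when they contain more than two, and recorded when they contain exactly two. The minimum block size used as the recursion floor and the representation of a block as (top, left, height, width) are assumptions, and the additional shrinking of large two-object blocks toward a threshold size is omitted for brevity.

```python
import numpy as np

def select_two_object_blocks(labels: np.ndarray, min_size: int = 8):
    """Divide-and-conquer selection of blocks containing exactly two objects.
    Returns a list of ((top, left, height, width), (label_a, label_b)) entries."""
    h, w = labels.shape
    results, used_pairs = [], set()
    # Start with the four quarters of the frame, as described in Section 3.3.
    stack = [(0, 0, h // 2, w // 2), (0, w // 2, h // 2, w - w // 2),
             (h // 2, 0, h - h // 2, w // 2), (h // 2, w // 2, h - h // 2, w - w // 2)]
    while stack:
        top, left, bh, bw = stack.pop()
        objects = np.unique(labels[top:top + bh, left:left + bw])
        if len(objects) == 1:
            continue                       # only one object: discard the block
        if len(objects) == 2:
            pair = tuple(sorted(int(o) for o in objects))
            if pair in used_pairs:
                continue                   # this object pair was already covered
            used_pairs.add(pair)
            results.append(((top, left, bh, bw), pair))
            continue
        if bh <= min_size or bw <= min_size:
            continue                       # too small to split further
        # More than two objects: push the four sub-blocks instead.
        for t, l, hh, ww in ((top, left, bh // 2, bw // 2),
                             (top, left + bw // 2, bh // 2, bw - bw // 2),
                             (top + bh // 2, left, bh - bh // 2, bw // 2),
                             (top + bh // 2, left + bw // 2, bh - bh // 2, bw - bw // 2)):
            stack.append((t, l, hh, ww))
    return results
```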
At the beginning, the four parts of the divided image are pushed onto the stack. Next, a block is popped from the stack and the number of objects in it is determined. Three cases arise. In the first case the block contains only one object; it is then not eligible for determining the order of two objects, so it is discarded. In the second case the block contains exactly two objects, so it is one of the sub-answers of the problem. The block could be kept at its original size, but large blocks may cause computational complexity later, so a sub-block with a size close to the threshold £×ᴦ is sought as the target block. If the block is larger than the threshold, it is divided into four sub-blocks; the first sub-block is selected and checked against the threshold, and if it is still larger, the process is repeated until the size fits the threshold and cannot shrink further, while the remaining sub-blocks are discarded. The selected block is then kept as one of the answers. The third case occurs when the block contains three or more objects; it is then divided into 4 parts, and the four sub-blocks are pushed onto the stack instead of the original block. The above process continues until the stack is empty.

Fig. 10 Array of available object pairs and the stack of sub-pictures in the proposed method

To avoid selecting blocks containing objects that were already selected together in another block, an array with cells corresponding to the stack elements is maintained in addition to the stack; each cell stores the two objects of the corresponding block. When a block is popped from the stack, the array is checked to see whether a block with the same pair of objects has already been processed; if so, the operation on the block is cancelled and the block is discarded. The stack and this array are shown in Fig. 10. Finally, when the stack has been drained, a set of blocks with two objects each has been created. In the next stage each of these blocks is used and the ordering of the objects within it is obtained; the procedure for determining the depth of each block is discussed later.

3.4 Determining the order of two objects in a block
For each block it must be determined which of its two objects is in front, and finally, after examining all blocks, the overall order of the objects is estimated. One possibility is a block like Fig. 11.a, which contains a green and a yellow object. To determine the ordering of the two objects, a set of rules is introduced, and the appropriate rule for each block decides the order. Before stating these rules, a few definitions are needed.

The corresponding object in the next frame: each block contains two objects. Consecutive frames in a video do not differ much from each other, so if an object does not leave the frame, its corresponding object in the next frame can be found; how the corresponding object is obtained was described earlier. It is also possible that all matching errors are greater than the threshold, in which case we conclude that the object has disappeared from, or faded out of, the next frame.
Accordingly, the following function is defined:

e(O) = 1 if the object is found in the next frame with an error below the threshold, 0 otherwise   (5)

A value of 1 indicates that object O is still inside the next frame, and 0 indicates that it has faded out.

Motion vector: a slightly different definition of the motion vector is used here. The goal is to find the motion vector of each object within the block. Because the blocks often contain edges and small parts of objects that may be partially occluded in the next frame, obtaining their motion vectors directly is difficult and error-prone. The geometric center of a segmented object is less likely to be occluded or to leave the frame, so its motion vector is easy to compute, and the motion vector of the whole object is therefore used instead of the motion vector of its part inside the block. Thus, once per frame, the motion vector of each object is calculated on a small block around its center. If b is a block containing two object parts o1 and o2, and the objects O1 and O2 that contain o1 and o2 have centers (c11, c12) and (c21, c22), the motion vector is defined as

MV(o1) = MV(O1) = √(α² + β²)   (6)

subject to: (α, β) minimizes the matching error Σ_{i=c11−2}^{c11+2} Σ_{j=c12−2}^{c12+2} |frameI(i, j) − frameP(i + α, j + β)|, with −8 ≤ α ≤ 8 and −8 ≤ β ≤ 8.

Mobility and immobility: objects in successive frames of a video sequence appear in different places. Even if an object is fixed during this period, it still moves a little between two consecutive frames, due to various factors such as camera movement. Here such small movements are treated as immobility. The motion is obtained by the method discussed above; if its magnitude is greater than ε the object is considered mobile, otherwise immobile:

s(O) = 1 if MV(O) > ε, 0 otherwise   (7)

For each object in the block the value of s is calculated, so three cases can occur: 1) one object is mobile and the other is stationary, 2) both are stationary, 3) both are mobile. Each of these cases is used in the creation of the rules.

Object area: the total number of pixels of the object in the block; this value is denoted S for frame I and S′ for frame P.

With the above definitions, the rules that determine the order of the two objects in a block can be stated. In Fig. 11.a, two objects (green and yellow) lie in one block. In the first case, one of them is mobile and the other is stationary (suppose yellow is stationary and green is mobile). This is the easiest case: if the area of the stationary object remains unchanged, the mobile object has moved underneath it, so the stationary object is in front of the moving one; otherwise, if the area of the stationary object changes, the mobile object is in front of the stationary one. This behavior can be written as:

Rule 1 (MV(A) = 0, MV(B) > 0):
order(A) > order(B) if S(A) = S′(A)
order(A) < order(B) otherwise   (8)

Fig. 11 Two objects in a block: (a) initial state, (b) the yellow object is stationary and its area does not change in the next frame, (c) the yellow object is mobile and its area changes in the next frame
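A small sketch of how Rule 1 could be applied to one block, assuming that the per-object motion magnitudes and the object areas in frames I and P have already been computed; the equality tolerance on the areas and the return convention (True when A is in front of B) are assumptions.

```python
def rule1_a_in_front(mv_a: float, mv_b: float,
                     area_a_I: int, area_a_P: int,
                     eps_mv: float = 1.0, tol_area: int = 0) -> bool:
    """Rule 1 of Section 3.4: A is stationary, B is mobile.
    If the stationary object's area is unchanged, it occludes the moving
    object and is therefore in front; otherwise it is behind."""
    assert mv_a <= eps_mv < mv_b, "Rule 1 applies only when A is stationary and B is mobile"
    return abs(area_a_I - area_a_P) <= tol_area

# Example: the stationary object's area shrank from 120 to 95 pixels,
# so the moving object must be covering it and is in front.
print(rule1_a_in_front(mv_a=0.5, mv_b=6.0, area_a_I=120, area_a_P=95))  # False
```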
The other possibility is that both objects are mobile. The approach recommended for this case is to forecast the area of each object in the next frame from its motion vector, so a formula for predicting the area of object A in frame P is needed first; the predicted area of the object in frame P is denoted S_PA. Since MV(o) = MV(O), as mentioned above, the predicted area of object o can be calculated as follows. Suppose [α_A, β_A] is the motion vector of object A and the block b of size h × w is centered at (b1, b2):

B_IA = frameI(b1 − h/2 − |α_A| : b1 + h/2 + |α_A| , b2 − w/2 − |β_A| : b2 + w/2 + |β_A|)   (9)

b_PA(i, j) = frameI(i − α_A , j − β_A),  i ∈ [b1 − h/2 : b1 + h/2],  j ∈ [b2 − w/2 : b2 + w/2]   (10)

f(i, j) = 1 if b_PA(i, j) = A, 0 otherwise   (11)

g(i, j) = 1 if b_IA(i, j) ≠ A and b_IA(i, j) ≠ B, 0 otherwise   (12)

S_PA = Σ_{i=b1−h/2}^{b1+h/2} Σ_{j=b2−w/2}^{b2+w/2} f(i, j) − Σ_{i=b1−h/2}^{b1+h/2} Σ_{j=b2−w/2}^{b2+w/2} g(i, j)   (13)

Equation (9) defines a super-block B_IA, larger than the considered block b_I whose center is (b1, b2), by the extent of the motion vector of A (the index I indicates that the reference of the block is frame I). Equation (10) obtains b_PA by shifting the pixels of B_IA by [α_A, β_A]. In equation (11), f determines whether pixel (i, j) of block b_PA belongs to object A, and in equation (12) g takes the value 1 if the pixel belongs to neither A nor B. Finally, equation (13) computes the predicted area of A in frame P.

h(i, j) = 1 if b_P(i, j) = A, 0 otherwise   (14)

S′_A = Σ_{i=b1−h/2}^{b1+h/2} Σ_{j=b2−w/2}^{b2+w/2} h(i, j)   (15)

error(A) = |S_PA − S′_A|   (16)

Equation (15) computes the actual area of object A in block b of frame P from the number of its pixels, and equation (16) gives the difference between the predicted and actual area of A in that block. If this error is small, object A is in front of B; otherwise a part of A is covered by B, so A lies behind B. Rule 2 is therefore defined as:

Rule 2 (MV(A) > 0 and MV(B) > 0):
order(A) > order(B) if error(A) < ε
order(A) < order(B) otherwise   (17)

Fig. 12 One block of the segmented image: (a) a part of the segmented image, (b) a block from frame I, (c) the block at the same position in frame P, (d) block b expanded by the extent of the motion vector of A

Fig. 13 Procedure for predicting the area of A: (a) block b in the middle and block B, which contains block b and its surrounding pixels, (b) block b after shifting by the motion vector of the object marked by •, (c) new objects entering block b, (d) colored version of (c)

When both objects are stationary, their order cannot be obtained from motion, so in this case one of the two objects is randomly considered to be in front.

3.5 Overall order of the objects
Finally, a table can be created by merging all pairs of objects. For example, in the table shown below, in each column the object in the top row is in front of the object in the bottom row:

front:   1  1  2  2  3  4
behind:  2  4  5  3  5  2

The pairs in this table must somehow be combined and arranged into an ordered array. There are several ways to do this; one is to use a tree in which the parent is the member of the top row of the table and the child is the member of the bottom row. The columns of the table are read from left to right; if the nodes do not yet exist they are added in the appropriate place, and if they do exist, the parent or child links may be changed. Fig. 14 shows the changes in building the tree for this example.

Fig. 14 Example of obtaining the overall order from ordered pairs

Finally, the nodes are read from the leaves to the root and pushed into an array; the obtained array represents the overall order of the objects in the scene.
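One way to combine the pairwise decisions into a single farthest-to-closest array is sketched below as a simple topological-style sort over the "in front of" pairs; this replaces the tree construction of Fig. 14 with an equivalent graph formulation, and the function name, the tie-breaking behavior, and the handling of missing or contradictory pairs are assumptions.

```python
from collections import defaultdict, deque

def overall_order(pairs):
    """Given (front, behind) pairs from Section 3.4, return the object labels
    ordered from the farthest object to the closest one (leaves to root)."""
    ahead_of = defaultdict(list)     # front -> objects it is in front of
    n_in_front = defaultdict(int)    # object -> number of objects known to be in front of it
    objects = set()
    for front, behind in pairs:
        ahead_of[front].append(behind)
        n_in_front[behind] += 1
        objects.update((front, behind))
    # Kahn's algorithm: objects with nothing in front of them come out first (closest).
    queue = deque(o for o in objects if n_in_front[o] == 0)
    closest_first = []
    while queue:
        o = queue.popleft()
        closest_first.append(o)
        for b in ahead_of[o]:
            n_in_front[b] -= 1
            if n_in_front[b] == 0:
                queue.append(b)
    return list(reversed(closest_first))   # farthest ... closest

# Pairs taken from the example table above, as (front, behind):
pairs = [(1, 2), (1, 4), (2, 5), (2, 3), (3, 5), (4, 2)]
print(overall_order(pairs))   # e.g. [5, 3, 2, 4, 1]
```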
3.6 Estimation of the final depth map
In this section, the depth map obtained in Section 3.2 is refined into the final depth map. The initial depth map contains only motion information, which can lead to deficiencies in the depth estimate. One such case occurs when an object stands in the foreground without moving: its near-zero motion vector wrongly assigns it a near-zero depth. Likewise, if an object behind another one moves faster (and therefore has a larger motion vector), it will be considered closer to the camera than the object in front of it. To solve these problems, the ordering array obtained in the previous section is used.

3  5  1  4  6  8  2  10  9  7

Assume that the ordering array has 10 cells, as above. The content of each cell is an object label obtained from the image segmentation, and the order of the objects from left to right is the order computed in the previous section; objects closer to the camera receive larger values in the range 0 to 255. For the k-th object we define a_k as follows, where n is the number of layers (objects):

a_k = 255 × (k − 1) / n   (18)

If the initial depth of pixel (i, j) of object k is denoted d_k(i, j), the final depth D_k(i, j) is calculated with equation (19):

D_k(i, j) = d_k(i, j)/n + a_k = (d_k(i, j) + 255 × (k − 1)) / n   (19)

3.7 Depth Image Based Rendering (DIBR)
Depth image based rendering is the process of synthesizing virtual views of a still or moving scene from images and the depth information associated with each pixel. Conceptually, the new views can be understood as the result of a two-step process: first, the original image points are re-projected into the three-dimensional world using the corresponding depth data; then, this three-dimensional space is projected onto the image plane of a virtual camera located at the desired position in the reconstructed scene. This concatenation of a 2D-to-3D re-projection and the subsequent projection back to 2D is usually called 3D image warping [6,9].

3.7.1 Three-dimensional image warping
Consider an arbitrary point M in three-dimensional space and two cameras, and let m̃ be the projection of M in the first view and m̃′ its projection in the second view. Assuming that the world coordinate system coincides with the coordinate system of the first camera, the following two equations hold [19]:

m̃ ≅ A P_n M̃   (20)

m̃′ ≅ A′ P_n D M̃   (21)

Here M̃ and m̃ denote homogeneous three-dimensional and two-dimensional points, and ≅ denotes equality up to a non-zero scale factor. The 4×4 matrix D contains the rotation R and translation t that transfer three-dimensional points from the world coordinate system to the camera coordinate system of the second view, the matrices A and A′ hold the intrinsic parameters of the first and second camera, and P_n is the normalized perspective projection matrix. Given the depth Z of the point, M can be recovered as:

M = Z A⁻¹ m̃   (22)

Substituting (22) into (21) leads to the classical disparity equation that defines the depth-dependent relationship between corresponding points in two images of the same three-dimensional scene:

Z′ m̃′ = Z A′ R A⁻¹ m̃ + A′ t   (23)

This disparity equation can be regarded as a 3D image warping equation that generates an arbitrary new view from a known reference image. All that is needed is to define the position and orientation of the virtual camera relative to the reference camera and to specify the intrinsic parameters of the virtual camera; then, if the depth of each pixel of the original image is known, the virtual image can be synthesized by applying the warping equation (23) to every pixel of the original picture.
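The following sketch applies the warping equation (23) pixel by pixel for the common DIBR special case of a purely translated virtual camera (R = I), which is also the situation used in the shift-sensor setup of the next section; the camera matrices, the simple rounding, and the absence of hole filling or z-buffering are simplifying assumptions.

```python
import numpy as np

def warp_view(image: np.ndarray, depth_z: np.ndarray,
              A: np.ndarray, A_virt: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Warp a reference view into a virtual view with equation (23),
    assuming R = I (pure camera translation). depth_z holds metric depth Z."""
    h, w = depth_z.shape
    out = np.zeros_like(image)
    A_inv = np.linalg.inv(A)
    for v in range(h):
        for u in range(w):
            m = np.array([u, v, 1.0])
            z = depth_z[v, u]
            # Z' m~' = Z * A' R A^-1 m~ + A' t, with R = I
            rhs = z * (A_virt @ A_inv @ m) + A_virt @ t
            u2, v2 = rhs[0] / rhs[2], rhs[1] / rhs[2]
            ui, vi = int(round(u2)), int(round(v2))
            if 0 <= ui < w and 0 <= vi < h:
                out[vi, ui] = image[v, u]   # no hole filling or z-buffering here
    return out
```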
3.7.2 Construction of the stereoscopic image
On a stereoscopic screen, two views with a small difference are reconstructed from a three-dimensional scene and displayed together. The difference between the left-eye and right-eye image data, called the disparity of the scene, is interpreted by the human brain so that the two images are perceived as a single three-dimensional image.

Fig. 15 Reconstruction of depth on a stereoscopic display [19]

3.7.3 Shift-sensor algorithm
Real high-quality stereoscopic cameras usually use one of two different ways of adjusting the scene disparity (the convergence distance Zc in the three-dimensional scene). In the "toed-in" approach, the disparity is adjusted by jointly rotating the left-eye and right-eye cameras inwards. In the shift-sensor approach, the convergence plane is obtained by a small shift h of the CCD sensors of two parallel cameras (Fig. 16).

Fig. 16 Shift-sensor arrangement in a stereoscopic camera [19]

In the shift-sensor stereoscopic camera arrangement, Zc is the convergence distance and is obtained from the shift h of the cameras' CCD sensors. All that has to be defined is two virtual cameras, one for the left eye and one for the right eye. These cameras are moved symmetrically with respect to the main view, and their CCD sensors are shifted with respect to the lens position. Mathematically, the sensor shift can be formulated as a displacement of the camera's principal point:

A* = A + [0 0 h; 0 0 0; 0 0 0]   (24)

where the superscript * stands for either a prime (′) or a double prime (″); that is, A* denotes both A′ and A″, and the equation gives the intrinsic parameters of both the left and the right virtual camera. Using equation (24), and assuming that the motion of both virtual cameras relative to the reference camera is a pure translation, i.e. R = I with I the 3×3 identity matrix, the warping equation takes the simple form:

A* R A⁻¹ = A* A⁻¹ = I + [0 0 h; 0 0 0; 0 0 0]   (25)

Z* m̃* = Z (m̃ + [h, 0, 0]ᵀ) + A* t   (26)

This can be simplified further by noting that the only non-zero component of t needed for the shift-sensor setup is a horizontal translation t_x. With t_z = 0, the depth of a three-dimensional point in the world coordinate system (chosen equal to the camera coordinate system of the main view) and in the virtual camera coordinate system are the same, i.e. Z* = Z, so equation (26) reduces to:

m̃* = m̃ + A* t / Z + [h, 0, 0]ᵀ,  with t = [t_x, 0, 0]ᵀ   (27)

In this case, the position of a pixel (u, v) in the warped image can be calculated easily:

u* = u + α_u t_x / Z + h,  v* = v   (28)

The horizontal camera translation t_x is defined as half of the chosen interaxial distance t_c, with a sign given by the movement direction:

t_x = t_c / 2 for the left-eye view,  t_x = −t_c / 2 for the right-eye view   (29)

As described before, the sensor shift h depends on the chosen convergence distance Zc. Knowing that for Z = Zc the horizontal component of the simplified formula (28) must be equal in the left-eye and right-eye views (i.e. u* = u), equation (30) is obtained [19]:

h = −t_x α_u / Zc   (30)
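As a minimal sketch of equations (28) to (30), the following function computes the horizontal shift of each pixel for one virtual view from the final depth map; the conversion of the 8-bit depth value into a metric depth Z, the parameter names, and the default values are assumptions made purely for illustration.

```python
import numpy as np

def shift_sensor_disparity(depth8: np.ndarray, left_eye: bool,
                           alpha_u: float = 1000.0, t_c: float = 0.06,
                           z_near: float = 1.0, z_far: float = 10.0) -> np.ndarray:
    """Per-pixel horizontal shift u* - u from equations (28)-(30).

    depth8 is the 8-bit depth map (255 = nearest, 0 = farthest); it is mapped
    to a metric depth Z in [z_near, z_far] before applying the formulas."""
    # Assumed linear mapping of the quantized depth to metric depth.
    z = z_far - (depth8.astype(np.float32) / 255.0) * (z_far - z_near)
    t_x = t_c / 2.0 if left_eye else -t_c / 2.0          # equation (29)
    z_c = (z_near + z_far) / 2.0                         # assumed convergence distance
    h = -t_x * alpha_u / z_c                             # equation (30)
    return alpha_u * t_x / z + h                         # equation (28): u* - u

# Example: pixels near the convergence distance get a shift close to zero.
d = np.full((2, 2), 128, dtype=np.uint8)
print(shift_sensor_disparity(d, left_eye=True))
```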
4 Experimental results
To evaluate the performance of the proposed algorithm, this section presents the results of its implementation and compares them with the results of other similar methods. The data set used in this paper consists of the monoscopic image sequences listed in Table 1; all videos are in YUV 4:2:2 format.

Table 1 Data sets used
Video sequence    Size
Akko & Kayo       480×680
Hall              288×352
Miss America      144×176
Claire            144×176
Ballet            768×1024
Breakdancing      768×1024

Fig. 17 shows the depth maps obtained for one frame of various sequences.

Fig. 17 Depth maps obtained with the proposed method for a frame of various sequences

An example of the depth map obtained from the proposed algorithm was presented above; the results should now be compared with the statistics of other methods. One 2D to 3D conversion method is inter-frame pixel matching. Automatic 2D to 3D conversion methods can be divided into two types, pixel-based and object-based; this method uses only inter-frame block motion, so it belongs to the pixel-based methods. It is faster than methods that perform object extraction, because its only time-consuming operations are block matching and motion vector computation, but despite its good speed its quality is not high. Some mistakes occur in the motion estimation: motion vectors at edges and textured surfaces are found easily, but the central parts of objects and smooth, homogeneously colored areas are usually treated as stationary, so their motion vectors are hard to estimate.

d(i, j) = λ √(MV(i, j)_x² + MV(i, j)_y²)   (31)

The radical part of equation (31) is the magnitude of the motion vector of pixel (i, j). As explained above, flat areas and the central parts of objects may receive a value of zero, causing large errors in the depth map. Another problem is that this method computes depth directly from the motion vector without any per-object coefficients, so objects at different distances from the camera but with the same motion vector magnitude are assigned the same depth. The ability to estimate depth for areas with zero motion vectors, and for areas with similar motion vectors, can therefore be considered two evaluation factors.

Fig. 18 Depth map obtained with inter-frame pixel matching for Akko & Kayo

Fig. 19 Depth map for Akko & Kayo obtained with the 2D-to-3D conversion method based on motion and color mergence and with the proposed method

Fig. 20 Depth maps for Hall and Miss America obtained with the object-based 2D to 3D video conversion method for effective stereoscopic content generation (c and d) and with the proposed method (a and b)

An important property that a generated depth map must have is that the depth values assigned to the pixels of one object should be similar. Moreover, the pixels of an object should not receive smaller depth values than the pixels of an object behind it. This property strongly affects the three-dimensional perception, and its correctness is most important at the edges of objects, where it also reflects the accuracy of the segmentation. To measure it, 100 points are randomly selected from each picture and it is checked whether the depth assigned to the front object is greater than the depth of the object behind it. Table 2 shows the results.

Table 2 Accuracy of some 2D to 3D conversion methods
Video sequence   Number of objects   Inter-frame pixel matching   Object-based method   Proposed method
Miss America     2                   73%                          90%                   91%
Hall             2-3                 66%                          81%                   85%
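A sketch of the depth-ordering check described above, under the assumption that a ground-truth front/behind relation is available for randomly sampled pairs of points taken from two different objects; the sampling strategy, the callback `is_in_front`, and the data structures are illustrative and not specified in the paper.

```python
import random
import numpy as np

def depth_order_accuracy(depth_map: np.ndarray, labels: np.ndarray,
                         is_in_front, n_samples: int = 100, seed: int = 0) -> float:
    """Fraction of sampled point pairs whose estimated depths respect the
    known front/behind relation. is_in_front(a, b) must return True when
    object a is in front of object b (ground truth or manual annotation)."""
    rng = random.Random(seed)
    h, w = labels.shape
    correct = total = 0
    while total < n_samples:
        y1, x1 = rng.randrange(h), rng.randrange(w)
        y2, x2 = rng.randrange(h), rng.randrange(w)
        a, b = labels[y1, x1], labels[y2, x2]
        if a == b:
            continue  # the two points must belong to different objects
        total += 1
        if is_in_front(a, b):
            correct += int(depth_map[y1, x1] > depth_map[y2, x2])
        else:
            correct += int(depth_map[y1, x1] < depth_map[y2, x2])
    return correct / n_samples
```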
Fig. 21 shows these results in the form of a graph.

Fig. 21 Accuracy comparison of different 2D to 3D conversion methods (object-depth and edge-depth accuracy for Miss America and Hall with the inter-frame pixel matching, object-based, and proposed methods)

One of the most important criteria for evaluating the different methods is computational complexity. The total computational cost is the sum of the costs of all steps in the conversion. The following tables analyze the computational complexity of each method mentioned.

Table 3 Computational complexity of the inter-frame method
Step                         Computational complexity
Motion estimation            O(n²/16 × 4S²) = O(n² × S²)

Table 4 Computational complexity of the motion and color mergence method
Step                         Computational complexity
Motion estimation            O(n²/16 × 4S²) = O(n² × S²)
Assigning weights to blocks  O(n²)
Segmentation                 O(n²)
Median filter                O(n² × f²)
Color mergence               O(n²)
Applying rules               O(k)
Assigning depth              O(n²)

Table 5 Computational complexity of the object-based method
Step                         Computational complexity
Motion estimation            O(n²/16 × 4S²) = O(n² × S²)
Segmentation                 O(n²)
Determining In               O(n²)
Applying rules               O(k)
Ordering                     O(k²)
Depth assignment             O(n²)

Table 6 Computational complexity of the proposed method
Step                         Computational complexity
Segmentation                 O(n²)
Thresholding                 O(n²)
Motion estimation            O(n² × 28/16)
Block selection              O(log(n²))
Ordering                     O(h)
Sorting orders               O(h²)
Assigning depth              O(n²)

5 Conclusion
A new method for converting 2D video to 3D has been presented in this paper. Stereoscopy, which uses two video sequences, is one way of displaying 3D, and using video plus depth instead of two video sequences is one of the most popular 3D coding methods. Views usually have certain properties: pixels inside an object lie at a similar distance from the camera, and objects with larger motion are often closer to the camera. Using these observations, the proposed method first extracts the objects of the scene with a combined color and motion segmentation and then performs object ordering and depth assignment. The results show good performance compared with other methods. Similar 2D to 3D conversion methods also try to estimate the depth map: pixel-based methods are usually faster and well suited to online use but are not very accurate, while object-based methods are more accurate but slower. The results of these methods were analyzed; each of them tries to produce a better depth map, but the refinement of the depth map is directly related to the computational complexity.

References
1. Herrera, J. L., et al. (2014). Fast 2D to 3D conversion using a clustering-based hierarchical search in a machine learning framework. 2014 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON).
2. Herrera, J. L., et al. (2015). Edge-based depth gradient refinement for 2D to 3D learned prior conversion. 2015 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON).
3. Lee, H. S., et al. (2015). Foreground-based depth map generation for 2D-to-3D conversion. 2015 IEEE International Symposium on Circuits and Systems (ISCAS).
4. Wafa, A., et al. (2015). Automatic real-time 2D-to-3D conversion for scenic views. 2015 Seventh International Workshop on Quality of Multimedia Experience (QoMEX).
5. Feng, Y., et al. (2011).
"Object-Based 2D-to-3D Video Conversion for Effective Stereoscopic Content Generation in 3D-TV Applications." IEEE TRANSACTIONS ON BROADCASTING 57(2): 500-509 6. Yeong-Kang, L., et al. (2012). An effective hybrid depth-perception algorithm for 2D-to-3D conversion in 3D display systems. 2012 IEEE International Conference on Consumer Electronics (ICCE) 7. Xi, Y., et al. (2011). Depth map generation for 2D-to-3D conversion by limited user inputs and depth propagation. 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2011 8. Fengli, Y., et al. (2011). Depth generation method for 2D to 3D conversion. 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2011 34 9. Caviedes, J. and J. Villegas (2011). Real time 2D to 3D conversion: Technical and visual quality requirements. Consumer Electronics (ICCE), 2011 IEEE International Conference on 10. Cao, X., et al. (2011). "Semi-Automatic 2D-to-3D Conversion Using Disparity Propagation." IEEE TRANSACTIONS ON BROADCASTING 57(2): 491-499. 11. Han, K. and K. Hong (2011). Geometric and texture cue based depth-map estimation for 2D to 3D image conversion. Consumer Electronics (ICCE), 2011 IEEE International Conference on 12. Jiahong, Z., et al. (2011). A novel 2D-to-3D scheme by visual attention and occlusion analysis. 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2011 13. Xu, F., et al. (2008). 2D-to-3D Conversion Based on Motion and Color Mergence. 2008 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video 14. Cao, X., et al. (2011). "Converting 2D Video to 3D: An Efficient Path to a 3D Experience." IEEE MultiMedia 18(4) 15. ChaoChung Cheng, ChungTe Li, Liang-Gee Chen. (2010). A novel 2D-to-3D conversion system using edge information. Consumer Electronics, 2010 IEEE Transactions on 16. Zhijie Zhao. Ming Chen. Long Yang. Zhipeng Fan. Li Ma. (2010). 2D to 3D video conversion based on interframe pixel matching. Information Science and Engineering (ICISE), 2010 2nd International Conference on , vol., no., pp.33803383 17. Chao-Chung Cheng. Chung-Te Li. Po-Sen Huang. (2009). A block-based 2d-to-3d conversion system with bilateral filter. Proc. IEEE Int. Conf. Consumer Electronics, vol. 0, pp. 1–2 18. Zheng, L., et al. (2009). An efficient 2D to 3D video conversion method based on skeleton line tracking. 2009 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video 19. W. J. f. Speranza, L. Zhang, R. Renaud, J. Chan, C. Vazquez. (2005). Depth Image Based Rendering for Multiview Stereoscopic Displays:Role of Information at Object Boundaries. Three-Dimensional TV,Video, and Display IV, vol. 6016, pp. 75-85 20. Lee, P. J. and X. X. Huang (2011). 3D motion estimation algorithm in 3D video coding. Proceedings 2011 International Conference on System Science and Engineering 21. Jin Young Lee, Hochen Wey, and Du-Sik Park. (2011). A Fast and Efficient MultiView Depth Image Coding Method Based on Temporal and Inter-View Correlations of Texture Images. 2011 IEEE 22. Smolic, A., et al. (2007). Coding Algorithms for 3DTV. IEEE Transactions on Circuits and Systems for Video Technology 17(11): 1606-1621 23. Saxena, A., et al. (2009). "Make3D: Learning 3D Scene Structure from a Single Still Image." IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5): 824-840 35 24. Ishibashi, T. Yendo, T. Tehrani, M.P. Fujii, T. Tanimoto, M. (2011). Global view and depth format for FTV. 
2011 17th International Conference on Digital Signal Processing (DSP), pp. 1-6.