Compressed Video Indexing Based on Object's Motion

Nevine H. AbouGhazaleh, Yousry El Gamal*
Computer Engineering Department
Arab Academy for Science & Technology (AAST)
* E-mail: [email protected], [email protected]

ABSTRACT

Processing compressed video for content-based retrieval saves the time of expensive decoding. In this paper, we process compressed MPEG video data to analyze the motion of its contents. Two motion components are distinguished from each other: first, the object's motion, i.e., the change in the object's coordinates across consecutive frames; second, the camera motion resulting from camera effects such as zooming in and out and panning right and left. A trajectory is constructed for each object and given a spatio-temporal representation. Video objects are indexed by the actual motion of the objects, independent of the motion of a moving camera.

Keywords: Video retrieval, object tracking, video analysis.

1. INTRODUCTION

Demand for applications that employ video databases is increasing, and this requires efficient techniques for browsing and retrieving video data. Video data is characterized by a huge amount of information, so there is a critical need for compression, and accordingly for processing the compressed data directly to save the time of computationally expensive decompression. Querying video by content has shown its effectiveness in satisfying user needs, describing the content of the data in terms of a set of content-based features. What distinguishes video data from, for example, image data is the presence of motion, so motion is considered a key feature when retrieving video for searching or browsing.

2. BACKGROUND

The Moving Pictures Experts Group (MPEG) standard by the International Organization for Standardization (ISO) is intended for full-motion video compression. The MPEG video compression algorithm relies on two basic techniques: block-based motion compensation for the reduction of temporal redundancy, and transform-domain compression, the Discrete Cosine Transform (DCT), for the reduction of spatial redundancy. The DCT transforms the two-dimensional image (frame) into the frequency domain; reduction acts on the high-frequency components, to which the human eye is less sensitive. Motion compensation divides the frame into small blocks and estimates the translation of each block between consecutive frames. These motion vectors and the DCT coefficients are the major components of an MPEG file.

Frames of a video stream are classified into three types: intracoded frames (I frames), predicted frames (P frames), and bidirectional frames (B frames). An I frame is coded independently of any other frame; a sequence should start with an I frame. I frames provide access points for random access, but with only moderate compression. A P frame is coded using motion compensation from a previous I or P frame; this type of prediction is called forward prediction. A B frame is predicted from a previous frame and a future reference frame, called forward and backward prediction respectively, and B frames achieve the highest rate of compression. Blocks with both forward and backward motion vectors (MVs) are called bidirectionally predicted blocks, as shown in figure (1). The order in which frames are stored in an MPEG coded file differs from the display order: decoding a B frame requires prior knowledge of both its previous and its later reference (I/P) frame, so both reference frames of a B frame must precede it in the storage sequence, while the frames are displayed in their correct logical sequence.
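To make this reordering concrete, the following C++ sketch converts a storage-order sequence of frame labels into display order. It is a minimal illustration, not part of the paper's system; the frame labels and the group-of-pictures pattern in main() are hypothetical.

    #include <iostream>
    #include <optional>
    #include <string>
    #include <vector>

    // Convert MPEG storage (decode) order to display order. A reference
    // frame (I or P) is buffered until the next reference frame arrives,
    // because the B frames stored after it are displayed before it.
    std::vector<std::string> toDisplayOrder(const std::vector<std::string>& storage) {
        std::vector<std::string> display;
        std::optional<std::string> pendingRef;          // last reference frame seen
        for (const auto& f : storage) {
            if (f[0] == 'B') {
                display.push_back(f);                   // B frames display immediately
            } else {                                    // I or P frame
                if (pendingRef) display.push_back(*pendingRef);
                pendingRef = f;
            }
        }
        if (pendingRef) display.push_back(*pendingRef); // flush final reference frame
        return display;
    }

    int main() {
        // Hypothetical group of pictures in storage order.
        std::vector<std::string> storage = {"I0", "P3", "B1", "B2", "P6", "B4", "B5"};
        for (const auto& f : toDisplayOrder(storage)) std::cout << f << ' ';
        std::cout << '\n';                              // prints: I0 B1 B2 P3 B4 B5 P6
    }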
Figure (1): Forward and bidirectional prediction in MPEG.

3. PREVIOUS WORK

In related work, Dimitrova [1] extracted MVs from the MPEG coded video stream, constructed a set of trajectories for each scene, and clustered those trajectories to represent objects. Kobla [2] extracted the motion feature from key frames, avoiding restrictions imposed by the MPEG format, and indexed it after applying a dimensionality-reduction algorithm. Saouria [3] segmented objects from the Discrete Cosine Transform (DCT) coefficients provided by the MPEG stream and then constructed their trajectories. Milanese et al. [4] detected camera motion in the frequency domain for the sake of segmenting video into shots, characterizing each shot by its camera effect; they segmented frames into superblocks and suppressed irregular motion after constructing new MVs to establish continuous motion over time, but this method is not adequate for object tracking.

All of this previous work indexes the motion of objects relative to the motion of the camera, which does not give accurate results when the camera itself moves. At query time, however, queries are mostly posed about the absolute object motion, not the relative one. In this paper, we propose a method for indexing the absolute motion of objects, compensating for the change in apparent motion induced by the movement of the camera.

4. SEGMENTATION

After motion-estimation data has been extracted, video processing takes over to interpret this data, as shown in figure (2). Video segmentation takes two forms: temporal and spatial. Video scenes are temporally segmented into shots, each constituting a semantically related piece of data; thus a single shot mainly consists of the same objects, the same background, and the same camera effect. Spatial segmentation is the detection of the objects contained in a shot. Our work assumes scenes that have already been temporally segmented.

Figure (2): Steps of video indexing based on motion features (camera shooting → video clip → motion estimation → motion vectors → scene segmentation → objects → feature extraction → features → indexing → file).

4.1 Object Detection

In this section, the main objects in the shot are detected for later construction of their trajectories. For each I frame, the DCT DC coefficients can be employed to identify regions with similar colors [3], each region being considered an object. This method is adequate only for objects of a single color, i.e., one object may be painted in only one color, which is not a common situation; even single-colored objects have a shadow, which would be detected as an object separate from the original one.

Our algorithm, shown in figure (3), instead detects objects from the P frames, specifically from the motion vectors of the forward predicted blocks. Each block has a motion vector, and adjacent blocks with similar motion vectors most probably constitute a single object. Similarity between blocks is measured by the vector magnitude and angle [6]:

    |angle(V(x,y)) − angle(Vk)| < angle threshold
    | ||V(x,y)|| − ||Vk|| | < magnitude threshold

where V(x,y) ≡ the vector being examined, and Vk ≡ the mean vector of the neighboring region k.
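A minimal C++ sketch of this similarity test is given below. The Cartesian motion-vector representation and the threshold values are assumptions of ours; the paper does not report the thresholds it used.

    #include <cmath>

    struct MotionVector { double dx, dy; };    // displacement in pixels per frame

    const double kPi = 3.14159265358979;
    const double kAngleThreshold = 0.35;       // radians; hypothetical value
    const double kMagnitudeThreshold = 2.0;    // pixels; hypothetical value

    // A block vector V(x,y) is merged into neighboring region k when both its
    // direction (angle) and its speed (magnitude) are close to the region's
    // mean vector Vk.
    bool isSimilar(const MotionVector& v, const MotionVector& vk) {
        double angleDiff = std::fabs(std::atan2(v.dy, v.dx) -
                                     std::atan2(vk.dy, vk.dx));
        if (angleDiff > kPi) angleDiff = 2.0 * kPi - angleDiff;   // wrap to [0, pi]
        double magDiff = std::fabs(std::hypot(v.dx, v.dy) -
                                   std::hypot(vk.dx, vk.dy));
        return angleDiff < kAngleThreshold && magDiff < kMagnitudeThreshold;
    }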
Magnitude is used as an indicator of speed, and angle specifies the direction. Adjacent regions are merged by a region-growing algorithm to ensure the integrity of each object's region, and an average motion vector is calculated for each object.

    for every key frame (P frame):
        given the MV of each block
        R = {}
        r1 = {b00};  R = R ∪ {r1};  k = 1
        for each block bij in the frame:
            if bij has no MV, skip it
            if MVij is similar to Avg_MVk of a neighboring region rk:
                add bij to rk (increment its size) and recompute Avg_MVk
            else:
                k = k + 1
                rk = {bij};  R = R ∪ {rk}

    where R       = the set of all regions
          bij     = the block at position (i, j)
          MVij    = the motion vector of block (i, j)
          Avg_MVk = the average motion vector of region rk

Figure (3): Object segmentation algorithm.

Only objects of interest are considered for further processing: very small objects are eliminated, as they may result from noise in the decoding process or from undesired illumination effects, and in most cases they are of no interest anyway. Each object is represented by the block that resides at its centroid, called the centroid block.

4.2 Background Detection

After this object detection step, we need to identify the background. The background can be treated as an object, assumed to be the largest object in the shot; this is a valid assumption for shots containing multiple objects. The background object, besides being the dominant object in the shot, may be occluded by other objects, and we must account for this property when merging objects. The background object is therefore taken to be the largest object in the shot irrespective of the spatial locality of the blocks constituting it.

5. MOTION DETECTION AND TRAJECTORY ESTIMATION

The objects produced by the segmentation process need further processing to construct a complete path, called a trajectory, along the shot; the motion of these objects is assumed to be continuous. A trajectory is constructed by aggregating the motion vectors throughout the frames in the proper order. The forward predicted MV of each centroid block in P and B frames is extracted and taken as the displacement of the object from frame i to frame i+1, and MVs are extracted for the block's new positions in consecutive frames [1]. I frames carry no motion vectors, so their displacement is induced from the backward predicted MV of the preceding B frame. The MVs are aggregated to construct the trajectory of each centroid block, respecting the logical sequence of the video.

5.1 Camera Effect Detection

Movement of the camera while recording a shot alters the semantics of the extracted motion feature. For example, the trajectory of a car moving from left to right filmed by a still camera will be very similar to the trajectory of a stationary car filmed by a camera panning left: the car's trajectory appears the same in both shots, although the two shots have different semantics. This is the ambiguity we aim to resolve.

The background object, although detected, is not itself subject to trajectory construction; instead, the motion of the camera is deduced from the background's motion. To reduce processing time, the camera motion need not be computed for every frame, since the camera moves slowly relative to the frame rate. The average motion vector of the background is therefore calculated every h frames, where the value of h is adjusted according to the sensitivity of the algorithm and the nature of the camera motion.
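The following C++ sketch outlines this step under assumptions of ours: a simple region representation (the paper does not specify its data structures), the background taken as the largest region, and the camera motion taken as the negated background average MV, held constant for the next h frames.

    #include <vector>

    struct MotionVector { double dx, dy; };

    struct Region {
        std::vector<int> blockIds;   // blocks belonging to this region
        MotionVector avgMV;          // average motion vector of the region
    };

    // The background is taken to be the largest region in the frame,
    // irrespective of whether its blocks are spatially contiguous.
    // Assumes at least one region has been detected.
    const Region& findBackground(const std::vector<Region>& regions) {
        const Region* largest = &regions.front();
        for (const Region& r : regions)
            if (r.blockIds.size() > largest->blockIds.size()) largest = &r;
        return *largest;
    }

    // Camera motion is the inverse of the detected background motion; it is
    // sampled every h frames and held constant in between.
    MotionVector cameraMotion(const std::vector<Region>& regions) {
        const Region& bg = findBackground(regions);
        return { -bg.avgMV.dx, -bg.avgMV.dy };
    }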
The detected background motion represents the inverse of the camera motion, i.e., the background vector is reversed along the X and Y axes to obtain the camera motion for that frame, as shown in figure (4), and this camera motion is assumed constant until the next background detection. In our algorithm we detect the camera motion at every P frame; for a video file with a frame rate of 30 fps, the camera motion is thus detected every 113 ms, which is frequent enough to detect the motion accurately. This approximation greatly reduces the computation time of the algorithm. The detected camera motion is eliminated from all trajectory segments recorded at that frame by adding the inverted camera component, and the same correction is applied to subsequent trajectory segments until the next camera-motion detection. The resultant vector represents the absolute motion of the object, independent of the camera motion, as shown in figure (4). This method handles objects translating along the X and Y axes; motion along the Z axis is not detected, and an object rotating around itself is treated as a still object.

    AM = MV − BM

    where AM ≡ the absolute motion vector
          MV ≡ the extracted motion vector
          BM ≡ the background motion vector

Figure (4): Absolute motion calculation: the extracted motion vector combined with the inverted background motion yields the absolute block motion.

    for each trajectory j:
        for each trajectory segment tsi:
            if fi = P:              // detect camera motion cm for this time interval k
                extract the average bmk
                cmk = −bmk
            atsi = tsi + cmk        // absolute object motion

    where fi   ≡ the type of frame i
          bmk  ≡ the background motion for interval k, equal to the MV of the largest object
          cmk  ≡ the camera motion for interval k
          tsi  ≡ the trajectory segment for frame i
          atsi ≡ the absolute trajectory segment for frame i

Figure (5): Absolute trajectory calculation.

Other camera effects, such as zooming in or out, need a special means of detection. In these cases the frame is divided into four quadrants and the background motion is detected in each quadrant separately; the motion is then classified according to the angle of the background motion in each quadrant, as shown in table (1).

    Camera effect   Quad. 1   Quad. 2   Quad. 3   Quad. 4
    Zoom in            45°      135°     −135°      −45°
    Zoom out         −135°      −45°       45°      135°

Table 1: Camera effects and the corresponding average motion angle in each quadrant.

5.2 Indexed Trajectory

Each detected object is indexed by the trajectory of its centroid block. The trajectory is represented as a set of points, each being the x-y coordinate of the centroid block in a specific frame, and each trajectory is normalized to start at the point (0,0). Stops of the object appear as a constant location throughout a set of consecutive frames, and the length of the trajectory varies according to how long the object lasts in the shot. In the retrieval process, the user provides a query trajectory and a set of shots containing similar trajectories is returned; the stored trajectories may be translated relative to the query one. The Euclidean distance is employed to measure the similarity between trajectories:

    D(S, Q) = √( Σi (Si − Qi)² )

where Si ≡ stored trajectory segment i and Qi ≡ query trajectory segment i.
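A minimal C++ sketch of this matching step follows, assuming both trajectories are point sequences of the same length (the paper does not specify how trajectories of different lengths are aligned):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Point { double x, y; };

    // Normalize a trajectory so it starts at the origin, matching the paper's
    // convention of indexing trajectories from (0, 0).
    std::vector<Point> normalize(std::vector<Point> t) {
        const Point origin = t.front();
        for (Point& p : t) { p.x -= origin.x; p.y -= origin.y; }
        return t;
    }

    // Euclidean distance between a stored trajectory S and a query Q:
    // D = sqrt( sum_i ((Sx - Qx)^2 + (Sy - Qy)^2) ).
    double trajectoryDistance(const std::vector<Point>& s,
                              const std::vector<Point>& q) {
        double sum = 0.0;
        for (std::size_t i = 0; i < s.size() && i < q.size(); ++i) {
            double dx = s[i].x - q[i].x, dy = s[i].y - q[i].y;
            sum += dx * dx + dy * dy;
        }
        return std::sqrt(sum);
    }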
To enhance the retrieval results, other features are extracted that can also be queried, such as the size of the object (measured in blocks), the object's lifetime (deduced from the number of frames in which it appears), and the average speed of the object (calculated from the resultant distance moved by the object and its duration).

6. EXPERIMENTS AND RESULTS

The proposed method, developed in object-oriented C under DOS, has been tested on several recorded clips with different combinations of camera and object motion; the resulting object trajectories are stored in a sequential file. Video clips were recorded and coded in MPEG-1 format, and the motion is estimated using the MPEG video software decoder ver. 2.2 implemented at the Computer Science Division, University of California, Berkeley. Recorded clips last 2 seconds on average, at a rate of 30 frames per second, and each 2-second clip represents one shot.

The experimental sample clip "Police_car.m1v" was recorded at the AAST computer labs. The clip has a resolution of 320×240 pixels, which constitutes 20×15 macroblocks, and contains a police car moving from right to left, recorded with a camera moving from bottom to top. Some frames are shown in figure (6.a); figure (6.b) illustrates the segmentation process and plots the absolute trajectory of the car object after removal of the camera motion component.

Another sample clip, "ballerina.m1v", contains a dancing ballerina with other ballerinas standing behind her in the background. The ballerina crosses the stage from right to left while the recording camera moves from left to right, as shown in figure (7.a). The clip has a resolution of 320×240 pixels; the segmented object and its constructed trajectory are shown in figure (7.b).

The algorithm is not effective in detecting objects of smooth texture, owing to the lack of motion information for these areas in the MPEG file. Accordingly, the motion of clips with smoothly textured backgrounds, such as the sky, the sea, or a blank wall, is hard to detect; when the shooting camera moves over such a background, the background is not detected.

7. CONCLUSIONS

In this paper we described a subsystem for the feature extraction of video clips, considering motion as the main feature to be indexed for later retrieval at query time. We first proposed a segmentation method based on the MVs extracted from P frames. The background object is detected under the assumption that it dominates the frame. The trajectories of the moving objects are constructed, and the camera effect of the clip is also determined. Finally, we deduce the absolute trajectories of the objects contained in the scene and index them in a file for later access. All of the above algorithms are designed to process MPEG-1 streams without decompression, which is faster than processing decompressed video clips and thus provides faster searching and retrieval. The advantage of the proposed system is that it indexes an object using only the object's own motion, unaffected by the camera motion, which makes the system especially useful when shooting with a moving camera, as in monitoring and surveillance applications.
REFERENCES

1. N. Dimitrova, "Content Classification and Retrieval of Digital Video Based on Motion Recovery", Ph.D. thesis, 1995.
2. V. Kobla, "Extraction of Features for Indexing MPEG-Compressed Video", Technical Report, 1997.
3. E. Saouria, "Video Indexing Based on Object Motion", Technical Report, May 1997.
4. R. Milanese, F. Deguillaume, and A. Jacot-Descombes, "Video Segmentation and Camera Motion Characterization Using Compressed Data", Multimedia Storage & Archiving Systems II, November 1997.
5. W. I. Grosky, "Multimedia Information Systems", IEEE Multimedia, Spring 1994.
6. F. Bartolini, V. Cappellini, and C. Giani, "Motion Estimation and Tracking for Urban Traffic Monitoring", Proceedings of the 6th International Workshop on Time-Varying Image Processing and Moving Object Recognition, 1996.

Figure (6.a): Selected frames from "Police_car.m1v". Figure (6.b): The detected object with its absolute trajectory.
Figure (7.a): Selected frames from "ballerina.m1v". Figure (7.b): The detected object with its absolute trajectory.