
Compressed Video Indexing Based on Object's Motion

Nevine H. AbouGhazaleh, Yousry El Gamal∗
Computer Engineering Department
Arab Academy for Science & Technology (AAST)
∗ E-mail: [email protected], [email protected]
ABSTRACT
Processing compressed video for content-based retrieval saves the time of expensive decoding. In this paper, we process compressed MPEG video data to analyze the motion of its contents. Two motion components are differentiated from each other: first, the object's motion, i.e. the change in the object's coordinates throughout consecutive frames; second, the camera motion resulting from camera effects such as zooming in and out, panning right and left, etc. A trajectory is constructed for each object and given a spatio-temporal representation. Video objects are indexed by the actual motion of the objects, independent of the motion of the camera.
Keywords: Video retrieval, object tracking, video analysis.
1. INTRODUCTION
Demand for applications that employ video databases is increasing, which requires efficient techniques for browsing and retrieving video data. Video data is characterized by a huge amount of information, so there is a critical need for compression, and accordingly for processing the compressed data directly, to save the time of the computationally expensive decompression. Querying video by content has proven effective in satisfying user needs, describing the data in terms of a set of content-based features.

What distinguishes video data from, for example, image data is the presence of motion. Motion is therefore considered a key feature in retrieving video for searching or browsing.
2. BACKGROUND
The Moving Picture Experts Group (MPEG) standard by the International Organization for Standardization (ISO) is intended for full-motion video compression. The MPEG video compression algorithm relies on two basic techniques: block-based motion compensation for the reduction of temporal redundancy, and transform-domain compression, the Discrete Cosine Transform (DCT), for the reduction of spatial redundancy.

The DCT transforms the two-dimensional image (frame) into the frequency domain. Reduction acts upon the high-frequency components, to which the human eye is less sensitive. Motion compensation divides the frame into small blocks and derives the translation of each block between consecutive frames. These motion vectors and the DCT coefficients are the major components of an MPEG file.
Frames of a video stream are classified into three types: intracoded frames (I frames), predicted frames (P frames), and bidirectional frames (B frames). An I frame is coded independently of any other frame, and a sequence should start with an I frame; I frames provide access points for random access, but with moderate compression. A P frame is coded using motion compensation from a previous I or P frame; this type of prediction is called forward prediction. A B frame is predicted from a previous and a future reference frame, called forward and backward prediction respectively, and B frames accomplish the highest rate of compression. Blocks with both forward and backward motion vectors (MVs) are called bidirectionally predicted blocks, as shown in figure 1.
The order in which frames are stored in an MPEG coded file differs from the display order. This is because decoding a B frame requires prior knowledge of both its previous and its later reference I/P frame; both reference frames of a B frame must therefore precede it in the storage sequence, while the frames are displayed in the correct logical sequence.
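To make the reordering concrete, the following minimal Python sketch (our own illustration; the group-of-pictures pattern below is an example, not mandated by the standard) restores display order by sorting on the display index carried with each frame:

    # Minimal sketch: restore display order from MPEG storage (decode) order.
    # In storage order, both reference frames of a B frame appear before it.
    def to_display_order(stored):
        # stored: list of (frame_type, display_index) in bitstream order
        return sorted(stored, key=lambda frame: frame[1])

    # A typical group of pictures as stored in the file:
    stored = [("I", 0), ("P", 3), ("B", 1), ("B", 2), ("P", 6), ("B", 4), ("B", 5)]
    print(to_display_order(stored))
    # -> [('I', 0), ('B', 1), ('B', 2), ('P', 3), ('B', 4), ('B', 5), ('P', 6)]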
Figure 1: Forward and bidirectional prediction in MPEG
3. PREVIOUS WORK
In related work, Dimitrova¹ extracted MVs from the MPEG coded video stream, constructed a set of trajectories for each scene, and clustered those trajectories to represent objects. Kobla² extracted the motion feature from key frames, avoiding restrictions imposed by the MPEG format, and indexed it after a dimensionality reduction algorithm. Saouria³ segmented objects from the Discrete Cosine Transform (DCT) coefficients provided by MPEG, then constructed their trajectories.

In the frequency domain, camera motion was detected for the sake of segmenting video into shots, as in Milanese⁴. Each shot was characterized by its camera effect. Frames were segmented into superblocks, and irregular motion was suppressed after constructing new MVs, to establish continuous motion over time; this method, however, is not adequate for object tracking.

All of the previous work indexes the motion of objects relative to the moving camera, which does not give accurate results: at query time, the user mostly asks about the absolute object motion, not the relative one. In this paper, we propose a method for indexing the absolute motion of objects while compensating for the change in apparent motion induced by the movement of the camera.
4. SEGMENTATION
Once motion estimation data has been extracted, video processing takes over to interpret this data, as shown in figure 2. Video segmentation takes place in two forms: temporal and spatial. Video scenes are temporally segmented into shots, where each shot constitutes a semantically related piece of data; thus a single shot mainly consists of the same objects, the same background, and the same camera effect. Spatial segmentation is the detection of the objects contained in a shot. Our work assumes scenes that have already been temporally segmented.
Figure 2: Steps of video indexing based on motion features: camera shooting produces a video clip; motion estimation yields motion vectors; scene segmentation yields objects; feature extraction yields features; and indexing produces the index file.
4.1 Object Detection
In this section, the main objects in the shot are detected for the later construction of their trajectories. For each I frame, the DCT DC coefficients can be employed to identify regions with similar colors³, and each region is considered an object. This method is only adequate for detecting objects of a single color, which is not a common situation; even single-colored objects have a shadow, which would be detected as an object separate from the original one.
The algorithm, as shown in figure 3, detects objects from the P frames, specifically from the motion vectors of the forward predicted blocks. Each block has a motion vector, and adjacent blocks with the same motion vector most probably constitute a single object. Similarity between blocks is measured by the vector magnitude and angle⁶:
|angle(V(x, y)) − angle(Vk)| < angle threshold
| ||V(x, y)|| − ||Vk|| | < magnitude threshold

where: V(x, y) ≡ the vector to be examined,
       Vk ≡ the mean vector of neighbor region k.
The magnitude is used as an indicator of speed, and the angle specifies the direction. Adjacent regions are merged by a region growing algorithm to ensure the integrity of the object's region, and an average motion vector is calculated for each object.
for every key-frame, given the MV of each block:
    R = {}
    r1 = {b00};  R = R ∪ {r1};  k = 1
    for each block bij in the frame:
        if bij has no MV, skip it
        if MVij is similar to the average MV of a neighbor region rk:
            add bij to rk
            recompute Avg_MVk
        else:
            k = k + 1
            rk = {bij};  R = R ∪ {rk}

where
    R ≡ the set of all regions
    bij ≡ the block at position (i, j)
    MVij ≡ the motion vector of block (i, j)
    Avg_MVk ≡ the average motion vector of region k

Figure 3: Object segmentation algorithm
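For concreteness, the following Python sketch (our own; the thresholds and the raster-scan neighbor test are assumptions, not values from the paper) implements the block-merging idea of figure 3:

    import math

    ANGLE_T = math.radians(15.0)   # assumed angle threshold
    MAG_T = 2.0                    # assumed magnitude threshold

    def similar(v, mean):
        # Similarity test from section 4.1: compare angle and magnitude.
        angle = lambda w: math.atan2(w[1], w[0])
        mag = lambda w: math.hypot(w[0], w[1])
        d = abs(angle(v) - angle(mean))
        d = min(d, 2 * math.pi - d)            # wrap angles around
        return d < ANGLE_T and abs(mag(v) - mag(mean)) < MAG_T

    def segment(mv):
        # mv maps (i, j) -> (dx, dy) forward MV of a block, or None if absent.
        # Returns regions as dicts holding their blocks and average MV.
        regions, owner = [], {}
        for (i, j), v in sorted(mv.items()):   # raster scan order
            if v is None:                      # block has no MV: skip it
                continue
            for n in ((i - 1, j), (i, j - 1)): # already-visited neighbors
                if n in owner and similar(v, regions[owner[n]]["avg"]):
                    r = regions[owner[n]]
                    r["blocks"].append((i, j))
                    k = len(r["blocks"])       # update the running average MV
                    r["avg"] = ((r["avg"][0] * (k - 1) + v[0]) / k,
                                (r["avg"][1] * (k - 1) + v[1]) / k)
                    owner[(i, j)] = owner[n]
                    break
            else:                              # no similar neighbor: new region
                owner[(i, j)] = len(regions)
                regions.append({"blocks": [(i, j)], "avg": v})
        return regions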
Only objects of interest are considered for further processing; very small objects are eliminated, as they may result from noise in the decoding process or from undesired illumination effects, and such objects are most probably of no interest anyway.

Each object is represented by the block that resides at its centroid, called the centroid block.
4.2 Background Detection
After object detection, we need to identify the background. The background can be considered an object, assumed to be the largest object in the shot; this is a valid assumption for shots containing multiple objects. The background object, besides being the dominant object in the shot, may be occluded by other objects in the shot, and we must account for this when merging objects. The background object is therefore taken as the largest object in the shot, irrespective of the spatial locality of the blocks constituting it.
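Given the segmented regions, background selection then reduces to a single comparison; continuing the sketch above (our code, under the paper's largest-object assumption):

    # The background is the largest region by block count, irrespective of
    # whether its blocks are spatially contiguous.
    def background(regions):
        return max(regions, key=lambda r: len(r["blocks"]))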
5. MOTION DETECTION AND TRAJECTORY ESTIMATION
The objects detected by the segmentation process undergo further processing to construct a complete path along the shot, called a trajectory; the motion of these objects is assumed to be continuous. A trajectory is constructed by aggregating the motion vectors throughout the frames in the suitable order.
The forward predicted MV of each centroid block in the P and B frames is extracted and considered the displacement of the object from frame i to frame i+1. MVs are extracted for the block's new positions in consecutive frames¹. I frames have no accompanying motion vectors, so their displacement is induced from the backward predicted MV of the preceding B frame. The MVs are aggregated to construct the trajectory of each centroid block, respecting the logical order of the video sequence.
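A minimal sketch of this aggregation (our own; it assumes the per-frame displacements have already been extracted and put into display order):

    # Accumulate per-frame displacements of the centroid block into a trajectory.
    def build_trajectory(start, displacements):
        # start: (x, y) of the centroid block in the first frame;
        # displacements: one (dx, dy) per frame, in display order.
        x, y = start
        points = [(x, y)]
        for dx, dy in displacements:
            x, y = x + dx, y + dy
            points.append((x, y))
        return points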
5.1 Camera Effect Detection
The movement of the camera while recording a shot alters the semantics of the extracted motion feature. For example, the trajectory of a car moving from left to right filmed with a still camera will be very similar to the trajectory of a stopped car filmed with a camera panning left; the car's trajectory in both shots appears the same, yet the two have different semantics. This is what we are trying to differentiate.
Although the background object is detected, it is not itself a subject of trajectory construction; rather, the motion of the camera is deduced from the background's motion. To reduce processing time, the camera motion need not be computed for every frame, since the camera moves slowly relative to the frame rate. The average motion vector of the background is calculated every h frames, where the value of h is adjusted according to the sensitivity of the algorithm and the nature of the moving camera.

The detected background motion represents the inverse of the camera motion, i.e. the background vector is rotated around the X and Y axes to obtain the camera motion for that frame, as shown in figure 4. The camera motion is assumed constant until the next background detection.
In our algorithm, we detect the camera motion at every P frame; for a video with a frame rate of 30 fps, the camera motion is thus detected every 113 ms, which is sufficient to detect the motion accurately. This approximation greatly reduces the computation time of the algorithm.

The detected camera motion is eliminated from all the trajectory segments occurring in that interval by adding the camera component to each segment until the next camera motion is detected. The resultant vector represents the absolute motion of the object, independent of the camera motion, as shown in figure 4.

This method handles objects translated in the X-Y plane. Motion along the Z axis is not detected, and an object rotating around itself is treated as a still object.
AM = MV − BM

where:
    AM ≡ the absolute motion vector
    MV ≡ the extracted motion vector
    BM ≡ the background motion vector

Figure 4: Absolute motion calculation: the inverse background motion is added to the extracted motion vector to obtain the absolute block motion.
for each trajectory j:
    for each trajectory segment tsi:
        if fi = P:
            // detect the camera motion cm for this time interval k
            extract the average bmk
            cmk = −bmk
        // to get the absolute object motion
        atsi = tsi + cmk

where
    fi ≡ the type of frame i
    bmi ≡ the background motion of frame i, equal to the MV of the largest object in frame i
    cm ≡ the camera motion
    tsi ≡ the trajectory segment for frame i
    atsi ≡ the absolute trajectory segment for frame i

Figure 5: Absolute trajectory calculation
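In Python, figure 5 can be transcribed as follows (a sketch with names of our choosing; the camera motion defaults to zero until the first P frame is seen):

    # Subtract the most recently detected background motion from each segment.
    def absolute_trajectory(segments, frame_types, background_motion):
        # segments[i]: (dx, dy) trajectory segment of frame i;
        # frame_types[i]: 'I', 'P' or 'B';
        # background_motion[i]: average background MV, used at P frames only.
        abs_segments, cm = [], (0.0, 0.0)
        for i, ts in enumerate(segments):
            if frame_types[i] == "P":      # refresh camera motion at P frames
                bm = background_motion[i]
                cm = (-bm[0], -bm[1])      # cm_k = -bm_k
            abs_segments.append((ts[0] + cm[0], ts[1] + cm[1]))  # ats_i = ts_i + cm_k
        return abs_segments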
Other camera effects, such as zooming in or out, need a special method of detection. In these cases the frame is divided into four quadrants and the background motion is detected in each quadrant separately; the motion is then classified according to the angle of the background motion in each quadrant, as shown in table 1.
Camera effect   Quad. 1   Quad. 2   Quad. 3   Quad. 4
Zoom in            45°      135°     −135°      −45°
Zoom out         −135°      −45°       45°      135°

Table 1: Camera effects corresponding to the average motion angle in each quadrant.
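A sketch of this zoom test (our code; the angular tolerance is an assumption, while the expected angles are those of table 1):

    # Classify zoom by matching each quadrant's average background-motion
    # angle (degrees) against the expected directions of table 1.
    ZOOM_IN = (45.0, 135.0, -135.0, -45.0)     # quadrants 1..4
    ZOOM_OUT = (-135.0, -45.0, 45.0, 135.0)
    TOLERANCE = 20.0                           # assumed angular tolerance

    def classify_zoom(quad_angles):
        def matches(expected):
            # signed angular difference, wrapped into [-180, 180)
            return all(abs((a - e + 180.0) % 360.0 - 180.0) < TOLERANCE
                       for a, e in zip(quad_angles, expected))
        if matches(ZOOM_IN):
            return "zoom in"
        if matches(ZOOM_OUT):
            return "zoom out"
        return "no zoom"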
5.2 Indexed Trajectory
Each detected object is indexed by the trajectory of its centroid block. The trajectory is represented as a set of points, each point being the x-y coordinate of the centroid block in a specific frame, and each trajectory is normalized to start at the point (0, 0). Stops of the object appear as a constant location over a set of consecutive frames. The length of the trajectory varies according to how long the object lasts in the shot.

In the retrieval process, the user provides a query trajectory, and a set of shots containing similar trajectories is returned. The stored trajectories may be translated with respect to the query one. The Euclidean distance is employed to measure the similarity between trajectories according to the following equation:
D = √( Σi (Si − Qi)² )

where: Si ≡ stored trajectory segment i,
       Qi ≡ query trajectory segment i.
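As a sketch (ours), with trajectories already normalized to start at (0, 0) and assumed to be of equal length:

    import math

    # Euclidean distance between a stored and a query trajectory, summing
    # the squared differences of corresponding segments.
    def trajectory_distance(stored, query):
        return math.sqrt(sum((sx - qx) ** 2 + (sy - qy) ** 2
                             for (sx, sy), (qx, qy) in zip(stored, query)))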
To enhance the retrieval results, other features are extracted that can also be queried, such as the size of the object (in blocks), the object's lifetime (deduced from the number of frames in which it appears), and the average speed of the object (calculated from the resultant distance moved by the object and its duration).
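These auxiliary features can be computed directly from a segmented region and its absolute trajectory; a rough sketch (names and the frame-rate parameter are ours):

    import math

    def extra_features(region, trajectory, frame_rate=30.0):
        size = len(region["blocks"])                    # object size in blocks
        lifetime = len(trajectory) / frame_rate         # seconds in the shot
        (x0, y0), (x1, y1) = trajectory[0], trajectory[-1]
        speed = math.hypot(x1 - x0, y1 - y0) / lifetime # resultant distance / duration
        return {"size": size, "lifetime": lifetime, "speed": speed}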
6. EXPERIMENTS AND RESULTS
The proposed method, developed in object-oriented C under DOS, has been tested on several recorded clips with different combinations of camera and object motion; the resulting object trajectories are stored in a sequential file. Video clips are recorded and coded in MPEG-1 format, and the motion is estimated using the MPEG video software decoder Ver. 2.2 implemented at the Computer Science Division, University of California, Berkeley. Recorded clips last 2 seconds on average, at a rate of 30 frames per second; each 2-second clip represents a shot.
The experimental sample clip "Police_car.m1v" was recorded at the AAST computer labs. The clip has a resolution of 320×240 pixels, which constitutes 20×15 macroblocks. It contains a police car moving from right to left, recorded with a camera moving from bottom to top. Some frames are shown in figure 6.a; figure 6.b illustrates the segmentation process and plots the absolute trajectory of the car object after removal of the camera motion component.
Another sample clip, "ballerina.m1v", contains a dancing ballerina with other ballerinas standing behind her in the background. The ballerina crosses the stage from right to left while the recording camera moves from left to right, as shown in figure 7.a. The clip has a resolution of 320×240 pixels. The segmented object and its constructed trajectory are shown in figure 7.b.
The algorithm is not efficient at detecting objects of smooth texture, due to the lack of motion information for these areas in the MPEG file. Accordingly, the movement of smooth-textured backgrounds, such as the sky, the sea, or a blank wall, is hard to detect; if the shooting camera is moving over such a background, the background is not detected.
7. CONCLUSIONS
In this paper we described a subsystem for the feature extraction of video clips, considering motion as the main feature to be indexed for later retrieval at query time. We first proposed a segmentation method based on the MVs extracted from the P frames. The background object is detected assuming its dominance in the frame. The trajectories of the moving objects are constructed and the camera effect of the clip is determined. Finally, we deduce the absolute trajectories of the objects contained in the scene and index them in a file for later access.

All of the above algorithms are designed to process MPEG-1 streams without decompression, which is faster than processing decompressed video clips and thus provides faster searching and retrieval.

The advantage of the proposed system is that it indexes each object by the object's own motion, unaffected by the camera motion. The system is useful when shooting with a moving camera, as in monitoring and surveillance applications.
REFERENCES
1. N. Dimitrova, "Content classification and retrieval of digital video based on motion recovery", Ph.D. thesis, 1995.
2. V. Kobla, "Extraction of features for indexing MPEG-compressed video", technical report, 1997.
3. E. Saouria, "Video indexing based on object motion", technical report, May 1997.
4. R. Milanese, F. Deguillaume, and A. Jacot-Descombes, "Video segmentation and camera motion characterization using compressed data", Multimedia Storage and Archiving Systems II, November 1997.
5. W. I. Grosky, "Multimedia information systems", IEEE Multimedia, Spring 1994.
6. F. Bartolini, V. Cappellini, and C. Giani, "Motion estimation and tracking for urban traffic monitoring", Proceedings of the 6th International Workshop on Time-Varying Image Processing and Moving Object Recognition, 1996.
Figure 6: (a) Selected frames from "Police_car.m1v"; (b) the detected object with its absolute trajectory.

Figure 7: (a) Selected frames from "ballerina.m1v"; (b) the detected object with its absolute trajectory.