8. Video databases

Video data representations
• Video = time-ordered sequence of correlated images (frames)
• Video signal representations originate from TV technology;
  different standards in the USA (NTSC) and Europe (PAL, SECAM)
• 25-30 frames/sec
• Interlaced presentation of even/odd lines to avoid flickering
• Frame size levels: 352 x 240, 768 x 576 (PAL), 720 x 576 (CCIR 601),
  720 x 480 (NTSC), 1440 x 1152, 1920 x 1080 (HDTV)
• Aspect ratios: 4:3, 16:9 (widescreen)
• Color videos: decomposition into luminance and chrominance
• Typical sampling rates for SD video:
  720 samples per line for luminance,
  360 samples per line for chrominance signals
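To give a feel for the raw data volume behind these figures, here is a small back-of-the-envelope sketch. It assumes 576 active lines, 25 frames/s and 8-bit samples; the 8-bit depth is an assumption, not stated above.

```python
# Rough raw bit-rate estimate for SD video (PAL-style timing),
# using the sampling figures from the slide; 8 bits/sample is assumed.
LUMA_SAMPLES_PER_LINE = 720
CHROMA_SAMPLES_PER_LINE = 360      # per chrominance component (Cb, Cr)
ACTIVE_LINES = 576
FRAMES_PER_SECOND = 25
BITS_PER_SAMPLE = 8                # assumption: 8-bit samples

samples_per_line = LUMA_SAMPLES_PER_LINE + 2 * CHROMA_SAMPLES_PER_LINE
bits_per_frame = samples_per_line * ACTIVE_LINES * BITS_PER_SAMPLE
bits_per_second = bits_per_frame * FRAMES_PER_SECOND

print(f"{bits_per_second / 1e6:.1f} Mbit/s uncompressed")  # ~165.9 Mbit/s
```

At the compression ratios of 50:1 to 100:1 quoted on the next slide, this comes down to roughly 1.7-3.3 Mbit/s.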
Video compression
• Not just independent coding of a sequence of still images (as in
  Motion-JPEG), because subsequent images are correlated
  (temporal redundancy).
• Motion compensation: blocks (e.g. 8 x 8 pixels) in a frame are
  predicted by blocks in a previously reconstructed frame
  (a minimal search sketch is given after this list).
• Compression artifacts disturbing the human eye may differ from
  those in still images.
• Different techniques for different application areas (TV, DVD/BD,
  internet, videoconferencing).
• Important issues:
  - Speed of compression/decompression
  - Robustness (error sensitivity)
• Most of the standards are based on the DCT (Discrete Cosine Transform).
• Typical compression ratios range from 50:1 to 100:1;
  the decompressed video is almost indistinguishable from the original.
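To make the motion-compensation idea concrete, here is a minimal full-search block-matching sketch in Python/NumPy. It is not part of any standard; the block size, search range and SAD criterion are illustrative assumptions. For one 8 x 8 block of the current frame it finds the best-matching block in the previously reconstructed frame and returns the motion vector and the prediction residual.

```python
import numpy as np

def match_block(cur, ref, y, x, block=8, search=7):
    """Full-search block matching with a SAD criterion (illustrative sketch).

    cur, ref : 2-D numpy arrays (current and previously reconstructed frame)
    y, x     : top-left corner of the block in the current frame
    Returns ((dy, dx), residual block).
    """
    target = cur[y:y + block, x:x + block].astype(np.int32)
    best, best_mv, best_pred = None, (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + block > ref.shape[0] or xx + block > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[yy:yy + block, xx:xx + block].astype(np.int32)
            sad = np.abs(target - cand).sum()   # sum of absolute differences
            if best is None or sad < best:
                best, best_mv, best_pred = sad, (dy, dx), cand
    return best_mv, target - best_pred          # motion vector + residual to be coded
```

Only the motion vector and the (typically small) residual are then coded, which is where the temporal redundancy is removed; real codecs use much faster search strategies than this exhaustive one.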
Standardization of video compression
ISO/IEC MPEG (Moving Pictures Experts Group)
• The standards include both video and audio compression.
• Started in 1988; steps:
  - MPEG-1: rates up to 1.5 Mbit/s (VHS quality)
  - MPEG-2: rates up to 10 Mbit/s (digital TV, DVD, HDTV)
  - MPEG-3: planned but dropped (found to be unnecessary)
  - MPEG-4: object-based (separation from scene, animation,
    3D, face modelling, interactivity, etc.)
ITU-T (International Telecommunication Union):
• H.261: low bit rates (e.g. videoconferencing)
• H.262 = MPEG-2
• H.263: low bit rates (improved)
• H.264 = MPEG-4 Part 10, high compression power
Random access from compressed video
• Broadcasting or accessing video from storage:
  it should be possible to start from (almost) any frame.
• MPEG solution: three kinds of frames (see the sketch below):
  - I-frame: coded without temporal correlation (prediction);
    gives the lowest compression gain.
  - P-frame: motion-compensated prediction from the last
    (closest) I- or P-frame.
  - B-frame: bidirectional prediction from the previous and/or
    the next I- or P-frame;
    - highest compression gain
    - copes with sudden changes
    - errors do not propagate.
• GOP = Group Of Pictures = smallest random-access unit; must
  be decodable independently (usually starts with an I-frame).
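As a small illustration of these dependencies, the sketch below takes a display-order frame-type pattern (the pattern string is an assumed example, not a fixed MPEG requirement) and lists, for each frame, which frames it is predicted from: none for I, the closest preceding I/P for P, and the surrounding I/P anchors for B.

```python
def reference_frames(gop):
    """For a display-order frame-type string, list the indices each frame is predicted from."""
    anchors = [i for i, t in enumerate(gop) if t in "IP"]   # I/P frames act as anchors
    deps = {}
    for i, t in enumerate(gop):
        if t == "I":
            deps[i] = []                                    # intra-coded: no prediction
        elif t == "P":
            deps[i] = [max(a for a in anchors if a < i)]    # closest previous I/P
        else:  # "B"
            prev = [a for a in anchors if a < i]
            nxt = [a for a in anchors if a > i]
            deps[i] = ([prev[-1]] if prev else []) + ([nxt[0]] if nxt else [])
    return deps

print(reference_frames("IBBPBBPBB"))
# {0: [], 1: [0, 3], 2: [0, 3], 3: [0], 4: [3, 6], 5: [3, 6], 6: [3], 7: [6], 8: [6]}
```

Since frame 0 (the I-frame) depends on nothing, a GOP starting with an I-frame can be decoded on its own, which is exactly what makes it the random-access unit.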
Example of frame order in MPEG
[Figure: a GOP shown in display order as I B B B P B B B P B B B I;
forward-prediction arrows run from each I/P-frame to the next P-frame,
and bidirectional-prediction arrows from the surrounding I/P-frames to
each B-frame.]
• Two orders of frames (see the reordering sketch below):
  - Display order
  - Bitstream order
• Buffering is needed to convert from bitstream order into display
  order; a small delay is involved.
• The predictor and the predicted frame need not be adjacent.
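Because a B-frame can only be decoded after both of its anchors, the encoder emits frames out of display order. The sketch below is an illustrative simplification (not a full MPEG multiplexer): it reorders a display-order pattern into bitstream/decoding order by moving each anchor in front of the B-frames that depend on it.

```python
def bitstream_order(gop):
    """Reorder display-order frame types so every B-frame follows both anchors.

    Returns a list of (display_index, frame_type) in transmission/decoding order.
    Simplified sketch: each anchor (I/P) is sent before the B-frames that precede
    it in display order.
    """
    order, pending_b = [], []
    for i, t in enumerate(gop):
        if t == "B":
            pending_b.append((i, t))          # wait until the next anchor is sent
        else:                                 # I or P closes the preceding B-run
            order.append((i, t))
            order.extend(pending_b)
            pending_b = []
    order.extend(pending_b)                   # trailing B-frames (no later anchor)
    return order

print(bitstream_order("IBBPBBPBBI"))
# [(0,'I'), (3,'P'), (1,'B'), (2,'B'), (6,'P'), (4,'B'), (5,'B'), (9,'I'), (7,'B'), (8,'B')]
```

The decoder buffers the received anchors and re-emits the frames in display order, which is the small delay mentioned above.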
Organizing and querying content of a video database
Questions to be answered:
• Which aspects of videos are likely to be of interest?
• How should these aspects be represented and stored?
• What kind of query languages are suitable?
• Is the content extraction process manual or automatic?
Possible aspects of interest:
• Animate objects (people, etc.)
• Inanimate objects (houses, cars, etc.)
• Activities and events (walking, driving, etc.)
Properties of objects (a small data-model sketch follows below):
• Frame-dependent: valid in a subset of frames.
• Frame-independent: valid for the video as a whole.
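One way to make the frame-dependent / frame-independent distinction concrete is a small data model like the sketch below. All names, fields and example values are illustrative assumptions, not from the slides: frame-independent properties hang directly on the object, frame-dependent ones are attached to frame ranges.

```python
from dataclasses import dataclass, field

@dataclass
class FrameDependentProperty:
    name: str
    value: str
    start_frame: int          # property holds in [start_frame, end_frame)
    end_frame: int

@dataclass
class VideoObject:
    object_id: int
    object_type: str                                        # e.g. "person", "car"
    static_properties: dict = field(default_factory=dict)   # frame-independent
    dynamic_properties: list = field(default_factory=list)  # frame-dependent

hero = VideoObject(1, "person", static_properties={"name": "Alice"})
hero.dynamic_properties.append(
    FrameDependentProperty("location", "kitchen", 1000, 2000))
```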
Query types from a video database
(a) Retrieve a complete video by name
(b) Find frame sequences (‘clips’, ‘shots’) containing certain objects or
activities.
(c) Find all videos/sequences containing objects/activities with certain
properties.
(d) Given a frame sequence, find all objects (of a certain type) occurring
in some or all of the frames of the segment.
(e) Given a frame sequence, find all activities (of a certain type)
occurring in it.
NOTE: Video is a multimedia tool: images + audio + possibly text.
The audio channel can be extremely important in detecting events.
Textual components (e.g. subtitles) are invaluable keyword sources.
Indexing of video content
• Content descriptions are not usually built on a frame-by-frame
  basis, due to the high number of frames.
• Compact representations are needed.
• Concepts:
  - Frame sequence:
    a contiguous subset of frames (e.g. a ‘shot’)
  - Well-ordered set of frame sequences:
    temporal order, no overlaps
  - Solid set of frame sequences:
    well-ordered, non-empty gaps between sequences (‘scene’)
  - Frame sequence association map:
    for each object and activity, a solid set of frame sequences
    is attached, showing the frames in which it appears
    (see the sketch below).
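A frame sequence association map can be represented very simply, e.g. as a mapping from each object/activity to its ordered, non-overlapping frame intervals. The sketch below uses half-open intervals and the example data that appears on the following slides; the dictionary format itself is an illustrative assumption, not a prescribed structure.

```python
# Frame sequence association map: object/activity -> solid set of frame
# sequences, stored as ordered, non-overlapping half-open intervals [start, end).
association_map = {
    "obj. 1": [(500, 2000), (3000, 4000), (4500, 5000)],
    "obj. 2": [(0, 2500), (3500, 4500)],
    "act. 1": [(500, 2500), (3000, 3500), (4000, 5000)],
}

def appears_in(name, frame):
    """Check whether an object/activity appears in a given frame."""
    return any(s <= frame < e for s, e in association_map[name])

print(appears_in("obj. 1", 1200))   # True
print(appears_in("obj. 2", 3000))   # False
```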
Frame segment tree
• Binary tree
• Special (1-dimensional) case of the spatial clipping approach.
• Leaves represent basic intervals of the frame sequence:
  - Leaves are well-ordered, and they cover the whole video.
  - Their endpoints include all endpoints of the sequences.
  - An internal node represents the concatenation of its children.
  - The root represents the whole video.
Example of objects and activities:
[Figure: a timeline from frame 0 to frame 5000 showing the occurrence
intervals of obj. 1 (500-2000, 3000-4000, 4500-5000), obj. 2 (0-2500,
3500-4500) and act. 1 (500-2500, 3000-3500, 4000-5000); this data is
used in the frame segment tree example on the next slide.]
Frame segment tree: example
[Figure: a binary frame segment tree over frames 0-5000.
Internal nodes: 1 = [0, 5000), 2 = [0, 3000), 3 = [3000, 5000),
4 = [0, 2000) (o2), 5 = [2000, 3000), 6 = [3000, 4000) (o1),
7 = [4000, 5000) (a1).
Leaves: 8 = [0, 500), 9 = [500, 2000) (o1, a1), 10 = [2000, 2500) (o2, a1),
11 = [2500, 3000), 12 = [3000, 3500) (a1), 13 = [3500, 4000) (o2),
14 = [4000, 4500) (o2), 15 = [4500, 5000) (o1).]
Indexing:
• Obj. 1 → 6, 9, 15
• Obj. 2 → 4, 10, 13, 14
• Act. 1 → 7, 9, 10, 12
Note: Actually the intervals are half-open, e.g. [0, 500) = 0..499
Indexing in the frame segment tree
• For each object and activity record, there is a list of pointers to
  the nodes of the frame segment tree.
• Objects and activities themselves may be indexed in traditional
  ways.
• Each node of the frame segment tree points to a linked list of
  pointers to the objects and activities that appear throughout the
  whole segment that this node represents (but only partially in
  the parent segment); a construction sketch is given below.
  In the previous example:
  node 4  → obj. 2
  node 6  → obj. 1
  node 7  → act. 1
  node 9  → obj. 1, act. 1
  node 10 → obj. 2, act. 1
  node 12 → act. 1
  node 13 → obj. 2
  node 14 → obj. 2
  node 15 → obj. 1
• This can be generalized to a set of videos (common frame
  segment tree, combined object/activity set, extended pointers).
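A minimal sketch of this node-assignment rule (standard segment-tree insertion): each occurrence interval of an object/activity is pushed down the tree and recorded at the maximal nodes it fully covers. The endpoints follow the example above; the node representation and function names are illustrative assumptions.

```python
def build_tree(endpoints):
    """Build a frame segment tree over sorted interval endpoints.
    Each node is a dict {lo, hi, left, right, labels} with a half-open span [lo, hi)."""
    def make(lo_i, hi_i):
        node = {"lo": endpoints[lo_i], "hi": endpoints[hi_i],
                "left": None, "right": None, "labels": []}
        if hi_i - lo_i > 1:                      # internal node: split the endpoint range
            mid = (lo_i + hi_i) // 2
            node["left"] = make(lo_i, mid)
            node["right"] = make(mid, hi_i)
        return node
    return make(0, len(endpoints) - 1)

def insert(node, start, end, label):
    """Attach 'label' to the maximal nodes whose span is fully covered by [start, end)."""
    if end <= node["lo"] or node["hi"] <= start:
        return                                   # no overlap with this subtree
    if start <= node["lo"] and node["hi"] <= end:
        node["labels"].append(label)             # fully covered: record here, stop
        return
    insert(node["left"], start, end, label)      # partial overlap: recurse into children
    insert(node["right"], start, end, label)

endpoints = [0, 500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]
root = build_tree(endpoints)
insert(root, 0, 2500, "obj. 2")                  # one occurrence interval of obj. 2
# -> recorded at the nodes spanning [0, 2000) and [2000, 2500), as in the example
```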
Queries using a frame segment tree
(a) Find segments where a given object/activity occurs
    (trivial: just follow the pointers).
(b) Find objects occurring between frames s and e (see the sketch
    after this list):
    Walk the tree in preorder, denoting the current node interval by I.
    • If I ∩ [s, e) = ∅, then this subtree can be skipped.
    • If I ⊆ [s, e), then walk through the whole subtree (including
      the current node) and report all its objects.
    • Otherwise report the objects and activities of the current
      node, and continue the search in both subtrees.
(c) Find objects/activities occurring together with object x:
    Scan the segments where x occurs, and report the
    objects/activities occurring in these segments and their
    ancestors.
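A direct transcription of query (b) into Python, reusing the node structure and the `root` built in the construction sketch above (illustrative, not from the slides):

```python
def objects_between(node, s, e, result=None):
    """Report all labels attached to segments intersecting the frame range [s, e)."""
    if result is None:
        result = set()
    if e <= node["lo"] or node["hi"] <= s:       # I ∩ [s, e) is empty: skip subtree
        return result
    if s <= node["lo"] and node["hi"] <= e:      # I ⊆ [s, e): take the whole subtree
        return collect_all(node, result)
    result.update(node["labels"])                # partial overlap: report this node ...
    if node["left"]:                             # ... and continue into both subtrees
        objects_between(node["left"], s, e, result)
        objects_between(node["right"], s, e, result)
    return result

def collect_all(node, result):
    result.update(node["labels"])
    if node["left"]:
        collect_all(node["left"], result)
        collect_all(node["right"], result)
    return result

print(objects_between(root, 1000, 2200))
# -> {'obj. 2'}, given only the single insert made in the sketch above
```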
R-segment tree (RS-tree)
• Special case of the R-tree
• Two possible implementations:
  (a) 1-dimensional space (dimension = time)
  (b) 2-dimensional space, where the other dimension is just an
      enumeration of objects/activities, not a true spatial
      dimension (see the sketch after the figure):
[Figure: the occurrence intervals of obj. 1, obj. 2 and act. 1 on a
frame axis from 0 to 5000, grouped into bounding rectangles R1, R2
and R3 of a 2-dimensional RS-tree.]
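To illustrate implementation (b), the sketch below maps each occurrence interval to a 2-D rectangle (time on one axis, an arbitrary object/activity row number on the other); an R-tree such as the RS-tree would then index this set of boxes. The rectangle format and row numbering are assumptions for illustration; no R-tree code is shown.

```python
# Occurrence intervals from the earlier example (half-open frame ranges).
occurrences = {
    "obj. 1": [(500, 2000), (3000, 4000), (4500, 5000)],
    "obj. 2": [(0, 2500), (3500, 4500)],
    "act. 1": [(500, 2500), (3000, 3500), (4000, 5000)],
}

# Implementation (b): enumerate objects/activities on a second, artificial axis
# and turn every occurrence into a rectangle (x_min, y_min, x_max, y_max).
row = {name: i for i, name in enumerate(occurrences)}          # obj./act. -> row number
rectangles = [(start, row[name], end, row[name] + 1)
              for name, intervals in occurrences.items()
              for start, end in intervals]

for rect in rectangles:
    print(rect)   # e.g. (500, 0, 2000, 1) for the first occurrence of obj. 1
```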
Computer-assisted video analysis
Video segmentation:
• Division of videos into homogeneous sequences.
• Typical segments are often so-called shots, filmed without interruption.
• Segmentation = detection of shot boundaries.
• Sharp cuts are easier to detect than gradual transitions (e.g. crossfades).
• Features for automatic segmentation:
  - Similarity of color histograms of subsequent frames:
    simple and effective, but sensitive to varying illumination
    (see the sketch below).
  - Edge features: similarity of shapes.
  - Motion vectors: restricted vector lengths within a shot.
  - Corner points: similarity of landmark points in frames.
• The actual segmentation can be based on thresholds for similarity,
  but machine-learning techniques have also been used widely.
• Higher-level segmentation into scenes, also called story units.
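A minimal sketch of histogram-based cut detection: a shot boundary is reported whenever the color-histogram similarity of two consecutive frames drops below a threshold. The bin count, threshold value and the histogram-intersection measure are illustrative assumptions, not prescribed by the slide.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Normalized joint RGB histogram of a frame (H x W x 3 uint8 array)."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

def detect_cuts(frames, threshold=0.6):
    """Report indices i where a cut is suspected between frames[i-1] and frames[i]."""
    cuts = []
    prev = color_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = color_histogram(frames[i])
        similarity = np.minimum(prev, cur).sum()   # histogram intersection in [0, 1]
        if similarity < threshold:
            cuts.append(i)
        prev = cur
    return cuts
```

Gradual transitions spread the histogram change over many frames, which is why they need more elaborate (e.g. windowed or learned) detectors than this simple pairwise test.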
Computer-assisted video analysis (cont.)
Keyframes:
• Representative frames within shots, containing the essential
  elements for retrieval.
• Scene-level segmentation often uses keyframe features, and
  operates e.g. in a top-down or bottom-up manner.
Choosing keyframes:
• A fuzzy task – no definite optimum.
• Can be based on the same features as segmentation.
• Various algorithmic approaches:
  - Sequential comparison (see the sketch below)
  - Clustering
  - Trajectory-based
  - Decision in the context of object/event detection
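As one example of the sequential-comparison approach (the threshold and the reuse of `color_histogram` and `numpy` from the previous sketch are assumptions for illustration): the first frame of a shot becomes a keyframe, and a new keyframe is added whenever the current frame is no longer similar enough to the most recent keyframe.

```python
def select_keyframes(frames, threshold=0.8):
    """Sequential comparison: keep a frame as a keyframe when it differs
    enough from the previously selected keyframe (reuses color_histogram above)."""
    keyframes = [0]                                   # first frame is always selected
    key_hist = color_histogram(frames[0])
    for i in range(1, len(frames)):
        hist = color_histogram(frames[i])
        if np.minimum(key_hist, hist).sum() < threshold:
            keyframes.append(i)
            key_hist = hist
    return keyframes
```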
Computer-assisted video analysis (cont.)
Object recognition:
• Keyframe-based recognition extracts the same features as for still
  images: color, texture, shape, but also objects and motion.
• Motion compensation techniques can be used to find out the frame
  interval of the occurrence of an object.
Annotations:
• Allocation of semantic concepts to video segments.
• Means roughly the same as segment classification.
• Machine-learning tools have been attempted.
• Human assistance is usually needed in the final recognition, naming
  and classification of segments and of detected objects within them.
Ref: W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank: "A Survey on Visual Content-Based
Video Indexing and Retrieval", IEEE Trans. on Systems, Man, and Cybernetics, Part C,
41(6), Nov. 2011.