Course Design Report for Digital Image Processing
Aug 2008
A Survey of Approaches in Shot Detection and Condensed
Representation
Liang Shi [email protected]
Beijing University of Posts and Telecommunications
Abstract
Shot detection has been a hot topic in video processing, underlying tasks such as shot
boundary detection, scene segmentation, and content-based video retrieval (CBVR). With
the rapid growth of video resources, condensed representations for indexing and retrieval
have attracted increasing attention. Since different algorithms suit different spheres of
application, this paper explores the characteristics of four approaches, namely the Ordinal
Measurement of Greyscale, the Color Histogram, the Image Template (Exemplar), and the
Image Epitome, which analyze the low-level features of color, greyscale, texture, and
structure. Each approach is then implemented and evaluated in shot boundary detection
and CBIR on five types of data set to demonstrate its respective merits and limitations.
Introduction
Digital Video Processing has been an extremely active area over the past decades. The
importance of this area arises from the explosion of video resources from the Internet, e.g.
Google and Yahoo Video, YouTube, Altavista, AOL video etc, along with its numerous
applications such as Video Retrieval, Analysis, Indexing and Categorization. Due to the
nature of video data, which is unsuitable for traditional forms of data access [1], searching
is either text based or follows the query-by-example paradigm. Therefore, techniques
have been sought that organize video data into more compact forms or extract semantically
meaningful information. The granularity of digital video can be listed in a hierarchy of
video, scene, shot, and frame, in which a scene is most commonly regarded as the unit to
be interpreted. Since a scene transition always corresponds with a shot change, shot
boundary detection is indispensable as a first step for scene segmentation. The results of
both shot boundary detection and condensed video representation do not need to be
immediately directed to semantic level applications; they may instead be stored as
metadata and used when they are needed. This can be achieved by the use of standards,
such as MPEG-7 [2], which contain appropriate specifications for the description of shots,
scenes, and various types of condensed representation.[3]
In this paper, instead of focusing on the semantic gap, I explore the common techniques for
extracting basic features such as color, greyscale, structure, and texture. The Ordinal
Greyscale Measure is a modified version of greyscale measurement which considers the
structural layout as well as the illumination histogram. The Color Histogram is the most
common method for extracting content clues from images, but it discards spatial
information. The Template is regularly used in tracking to match patches pixel by pixel, but
it is too rigid to adapt to varied resolution and scaling. A relatively new method, the
Epitome, is a novel approach that integrates both overall features, such as color and
greyscale, and local marks, such as structure and texture, even presenting content in a
visible way. There is no generalized solution for every condition in shot detection and
image retrieval; each method is employed in specific applications, depending on the
content it processes and the required trade-off between precision and complexity.
System Description
[Fig1] System Layout
In shot boundary detection, I first implemented the Ordinal Greyscale Measure and the
RGB color histogram and compared their results. Then the color histogram and the Image
Template are applied to image retrieval. The last experiment applies the Epitome to CBIR
and condensed video representation.
Experiment and Results
Compare Ordinal Greyscale Measure and Color Histogram in Shot Boundary Detection
I first applied the Ordinal Measures for Image Correspondence by Dinkar N. Bhat and
Shree K. Nayar [4] to a set of 5 different types of video content, chosen to represent a wide
variety, in order to show both the strengths and the drawbacks of this method.
Each clip contains various types of transitions between shots. Using an adaptive threshold
with ground-truth data, the results are as follows:
Category                               Length  Shot transitions  Correct detect  Wrong detect
Story film: Crash                      3'46"   126               126             5
Sport: Football World Cup 2002         3'16"   273               134             45
Animation: Backkom Series              4'10"   189               189             38
Newsreel: Topics in Focus              13'     255               245             24
Surrealists film: Love Me If You Dare  2'23"   59                50              61
[Table1] Shot detection results by the Ordinal Measure for Image Correspondence
[Fig2-5] Experiment results of the Ordinal Greyscale Measure (5 separately). Each picture
shows the detection result by the ordinal greyscale measure; one frame is supposed to
stand for a shot. Some results show detection errors, with the causes noted on the right.
[Fig6] From top down: (a) a dissolve between shots with similar color content; (b) a dissolve
between shots with dissimilar color content; (c) a fade; (d) a wipe of the "door" variety;
(e) a wipe of the "grill" variety; and (f) a computer-generated special-effect transition
between shots. [3]
Conclusions on the Greyscale Measure:
1. For abrupt shot transitions, the greyscale measure performs well, since it is not sensitive
to local movement and uses an adaptable grid to extract spatial features.
2. Its major weak point is over-dependence on illumination and object position, as shown in
[Fig2-5] for the Backkom and football clips. When there is a moving light source, or a
jumping object, this method fails.
3. Another problem is the transition between medium and close-up shots, as shown in
[Fig2-5] for the news results. This causes problems for scene segmentation.
4. Some animation effects, gradual transitions, and surrealist camera movements remain
difficult for shot detection, as shown in the film and the samples from TRECVID 2003.
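The ordinal measure above can be sketched briefly. The frame is divided into a grid of blocks, the mean grey level of each block is ranked, and two frames are compared by their rank permutations rather than by raw intensities, which is what makes the measure robust to global illumination scaling. This is an illustrative Python sketch, not the code used in the experiments; the 3x3 grid and the sum-of-rank-differences distance are assumptions.

```python
import numpy as np

def ordinal_signature(frame, grid=(3, 3)):
    """Rank of each grid block's mean grey level (Bhat & Nayar style)."""
    h, w = frame.shape
    gh, gw = grid
    means = []
    for i in range(gh):
        for j in range(gw):
            block = frame[i * h // gh:(i + 1) * h // gh,
                          j * w // gw:(j + 1) * w // gw]
            means.append(block.mean())
    # double argsort converts block means into their ordinal ranks
    return np.argsort(np.argsort(means))

def ordinal_distance(f1, f2, grid=(3, 3)):
    """Sum of absolute rank differences between the two frames' signatures."""
    return int(np.abs(ordinal_signature(f1, grid) -
                      ordinal_signature(f2, grid)).sum())
```

Because only ranks are compared, uniformly brightening a frame leaves its signature unchanged, while rearranging the spatial layout (as a cut does) produces a large distance.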
A color histogram is a representation of the distribution of colors in an image, derived by
counting the number of pixels that fall into each of a given set of color ranges, typically in a
two-dimensional (2D) or three-dimensional (3D) color space. It is commonly used as an
appearance-based signature to classify images in content-based image retrieval (CBIR)
systems. Its main limitation is the loss of structure information: a blue T-shirt may
therefore have a histogram similar to that of the ocean, as shown in [Fig7].
[Fig7] RGB color histogram similarity between pictures with unrelated content
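The histogram signature and the Euclidean distance between frames can be sketched as follows. This is an illustrative Python sketch, not the experiment code; the 16 bins per channel and the per-channel normalization are assumptions.

```python
import numpy as np

def rgb_histogram(frame, bins=16):
    """Concatenated per-channel histograms, each normalized to sum to 1."""
    hists = []
    for c in range(3):
        h, _ = np.histogram(frame[..., c], bins=bins, range=(0, 256))
        hists.append(h / h.sum())
    return np.concatenate(hists)

def histogram_distance(f1, f2, bins=16):
    """Euclidean distance between the two frames' RGB histogram signatures."""
    return float(np.linalg.norm(rgb_histogram(f1, bins) - rgb_histogram(f2, bins)))
```

Note that the signature contains no positional information at all: shuffling a frame's pixels leaves its histogram, and hence its distance to any other frame, unchanged.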
To demonstrate this, I proceeded as before, testing the same data set with the color
histogram in shot boundary detection. [Table2] shows the results.
Category                               Length  Shot transitions  Correct detect  Wrong detect
Story film: Crash                      3'46"   126               39              0
Sport: Football World Cup 2002         3'16"   273               61              2
Animation: Backkom Series              4'10"   189               189             3
Newsreel: Topics in Focus              13'     255               77              0
Surrealists film: Love Me If You Dare  2'23"   59                10              8
[Table2] Shot detection results by the RGB color histogram
It is obvious that the correct-detection rate declines severely, probably because frames in
the same scene usually share a similar RGB histogram no matter how many shot
transitions the scene contains. The rate remains high in the trial on the Backkom series,
because its color contrast is much higher than in the other types of video [Fig8]. On the
other hand, the number of false detections also declines noticeably, to nearly zero, for the
same reason: the disadvantage of neglecting spatial information becomes an advantage to
some extent.
Despite its error rate in shot transition detection, the color histogram is widely applied in
scene segmentation. An average histogram effectively captures the basic features of a
specific scene. It can also be used to identify relationships between frames, as well as to
locate a segment within a long reference video by comparing overall color distributions.
Looking at the correct-detection numbers again, it is interesting to find that they almost
correspond to the scene transitions, except in the sports video (which has no typical scene
structure) and the newsreel, which has a similar background throughout [Fig9].
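Given the per-frame distance signal plotted in [Fig8] and [Fig9], boundaries are declared where the signal stands out from its local neighbourhood. The adaptive threshold used in these experiments is not specified in detail, so the sketch below uses one common scheme as an assumption: flag frame i when its distance exceeds the local mean by k standard deviations.

```python
import numpy as np

def detect_boundaries(distances, window=10, k=3.0):
    """Flag frame i as a shot boundary when its inter-frame distance exceeds
    the mean of its neighbours by k standard deviations."""
    d = np.asarray(distances, dtype=float)
    boundaries = []
    for i in range(len(d)):
        lo = max(0, i - window)
        hi = min(len(d), i + window + 1)
        neighbours = np.delete(d[lo:hi], i - lo)  # exclude the frame itself
        if d[i] > neighbours.mean() + k * neighbours.std():
            boundaries.append(i)
    return boundaries
```

A local window, rather than one global threshold, is what lets the same detector cope with the very different baseline distance levels of, say, the animation clip and the newsreel.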
[Fig8] Normalized Euclidean distance of the RGB color histogram in the animation clip
(marked peak at frame 151, distance 444.7)
[Fig9] Normalized Euclidean distance of the RGB color histogram in the news video clip
(marked point at frame 3338, distance 117.5)
To prove the applicability of the color histogram to scene detection and segmentation,
experiments were run on the 5 types of video again: first segment each video into scenes
and compute the average RGB feature of each scene, then compare the Euclidean distance
between these averages and the histogram of a shot fragment in order to locate the clip.
The result is quite positive, after some adjustment of the clip length. However, similar
problems appear when the video is not color based or when the layout of objects is
indispensable. [Tab3] illustrates the experiment results.
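The two steps just described, averaging per-scene histograms and then locating a clip by nearest average, can be sketched as below. This is an illustrative Python sketch; histograms are assumed to be precomputed feature vectors of equal length.

```python
import numpy as np

def scene_average(frame_histograms):
    """Average histogram over all frames of one scene."""
    return np.mean(frame_histograms, axis=0)

def locate_clip(scene_histograms, clip_histogram):
    """Index of the scene whose average histogram is closest
    (in Euclidean distance) to the query clip's histogram."""
    q = np.asarray(clip_histogram, dtype=float)
    dists = [np.linalg.norm(np.asarray(h, dtype=float) - q)
             for h in scene_histograms]
    return int(np.argmin(dists))
```

Averaging is what makes this robust to within-scene shot changes: the shots of one scene vary, but their mean color distribution stays characteristic of the scene.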
[Table3] Average RGB color histograms (red, green, and blue channels) of the 5 types of
video clip
Conclusions on the Color Histogram:
1. The color histogram fits scene detection rather than shot detection, because its innate
features are shared between consecutive shots in the same scene.
2. The correct-detection rate on shot detection declines significantly for news, film, and
especially sports, though not for animation with its obvious coloration. At the same time,
wrong detections also decrease, because the overall histogram ignores structure
information, which makes the color correspondence between shots easier to identify.
3. Advantages
- invariant to translation and rotation, and varies only slowly with the angle of view
- suited to recognizing an object of unknown position within a scene
- transforming an RGB image into an illumination-invariant form allows the histogram to
operate well under varying light levels
- color information is fast to compute
4. Drawbacks
- depends on object color while ignoring shape and texture
- histogram-based algorithms have no generic concept of objects
- high sensitivity to noise
- high dimensionality
Compare Template and Color Histogram in CBIR
Both being simple appearance models, the color histogram and the template are widely
used in all kinds of CBIR applications. Here is a comparison:
Method          | Description                                                                    | Applications
Template        | a compact summarization of the distribution of data in an image                | Tracking
Color histogram | a statistical distribution of occurrence frequencies of regions in color space | Video shot detection, clustering algorithms, image retrieval
[Table4] Comparison between the color histogram and the template
Since the Template is suited to image tracking, the experiment was made as follows:
[Fig10] Experiment Results of CBIR by Template (Tracking)
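The pixel-by-pixel matching the experiment relies on can be sketched as an exhaustive search minimizing the sum of squared differences (SSD). This is an illustrative Python sketch; SSD is one standard matching criterion, and the report does not state which criterion was actually used.

```python
import numpy as np

def match_template(image, template):
    """Exhaustive SSD template match on greyscale arrays.
    Returns the (row, col) of the best-matching top-left corner."""
    ih, iw = image.shape
    th, tw = template.shape
    best, best_pos = None, None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            ssd = np.sum((image[r:r + th, c:c + tw] - template) ** 2)
            if best is None or ssd < best:
                best, best_pos = ssd, (r, c)
    return best_pos
```

The rigidity noted below is visible here: the window and the template are compared element by element at a fixed size, so any change in resolution or scale between query and target breaks the alignment.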
Conclusions on the Template:
1. Advantages
- high noise tolerance
- insensitive to color changes
2. Drawbacks
- not robust when colors are reversed across the entire image
- resolution dependent; the match is too rigid
Image Epitome in CBIR and Video Condensed Representation
Proposed by Nebojsa Jojic and Brendan Frey at ICCV 2003, Epitome has become another
popularity in compact image representation for it combined features in color, texture, and
structure in an elegant succinct patch. The epitome of an image is its miniature, condensed
version containing the essence of the textural and shape properties of the image. [Fig 11]
Because a collection of images often shares an epitome, e.g., when images are a few
consecutive frames from a video sequence, or when they are photographs of similar objects,
Epitome could be used in image segmentation, motion estimation, object removal and
super-resolution.[5] It could also be adopted in CBIR and CBVR.
As a new way of representation, it performs well in image denoising and super resolution
after learning with a simple graphical model. In shot detection, it avoid the problem caused
by resolution and rigid structure restriction. Furthermore, it is more content-based for it is
derived from the images, and is defined on an image substantially smaller then the modeled
images, but significantly larger then the targeted image patches. An amazing example of
CBIR is shown in [Fig12]. However, Epitome is not as good in reconstruction for when the
traditional Gaussian Filter can reduce image noise; it may also blur image edges, while in
image retrieval its over-dependence on spatial information and manual label make Epitome
limited in practical use.
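To give a flavour of the idea without the full machinery: the real epitome of Jojic et al. is a generative model learned by EM, mapping every image patch probabilistically onto a small learned image. The sketch below is a greatly simplified, non-probabilistic stand-in (a k-means patch codebook) that only illustrates the "many patches summarized by few" principle; the patch size, codebook size, and deterministic initialization are all assumptions, not part of the original method.

```python
import numpy as np

def patch_codebook(image, patch=4, k=8, iters=5):
    """Summarize an image's non-overlapping patches by a small k-means
    codebook. A toy stand-in for an epitome, NOT Jojic et al.'s EM model."""
    h, w = image.shape
    patches = np.array([image[r:r + patch, c:c + patch].ravel()
                        for r in range(0, h - patch + 1, patch)
                        for c in range(0, w - patch + 1, patch)], dtype=float)
    # deterministic, spread-out initialization of the k codewords
    centers = patches[np.linspace(0, len(patches) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign each patch to its nearest codeword, then re-average
        d = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = patches[labels == j].mean(0)
    return centers, labels
```

Even this crude summary shows why a condensed patch model can support retrieval: two images of similar content produce similar codebooks, at a fraction of the original size.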
[Fig11] The idea of Epitome: a compact representation of images
[Fig12] Examples of CBIR by the Epitome. Left: face detection by eyes in the Epitome;
Right: retrieval of smiling faces by the highest total posterior at the "smiling point" in the
Epitome of 295 face images
Conclusions on the Epitome:
Although a novel idea, the Epitome appears unstable in image retrieval, and its over-
dependence on manual separation also impairs its usability in pragmatic applications.
However, it has some advantages over the Template and Histogram (in Jojic's words):
"statistical generative models to mimic the structure of the real world"
"The main requirement is complete adaptivity, and so I am trying to avoid
application-specific initialization parameters."
Low level features in Condensed Video Representation
An important functionality when retrieving information in general is the availability of a
condensed representation of a larger information unit. This can be the summary of a book,
the theme or ground of a musical piece, the thumbnail of an image, or the abstract of a
paper. Condensed representation allows us to assess the relevance or value of the
information before committing time, effort, or computational and communication resources
to process the entire information unit. It can also allow us to extract high-level information
when we are not interested in the whole unit, especially during manual organization,
classification, and annotation tasks. While for most types of information there are one or
more standard forms of condensed representation, this is not the case for video. In the literature,
the term highlighting or storyboarding is used for video representation resulting in distinct
images, while skimming is sometimes used for a representation resulting in a shorter video.
[3] Video skimming is more difficult to perform effectively than video highlighting, because it
is more difficult to extract appropriate features from its atomic elements (video segments
instead of images), since they contain a much greater amount of information.
Low-level features form the basis for scene segmentation and key-frame extraction. Color
and greyscale are better adapted to abrupt scene changes, while structure and texture are
applied in object recognition for video indexing. An advanced, object-based method has
been developed by Kim and Hwang [18]: they use the method described in [19] to extract
objects from each frame, and then use these objects to extract video key frames.
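For the low-level route, a common key-frame heuristic can be sketched as follows: pick the frame whose feature vector lies closest to the shot's average. This is a generic illustration, not the object-based method of Kim and Hwang [18]; the choice of features (e.g. a color histogram per frame) is an assumption.

```python
import numpy as np

def key_frame(frame_features):
    """Index of the frame whose feature vector (e.g. a color histogram)
    is closest to the shot's mean feature vector."""
    feats = np.asarray(frame_features, dtype=float)
    mean = feats.mean(axis=0)
    return int(np.argmin(np.linalg.norm(feats - mean, axis=1)))
```

Choosing the frame nearest the mean, rather than simply the first or middle frame, makes the key frame representative of the shot's dominant appearance.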
Conclusion
In the task of CBVR, two fundamental steps are shot boundary detection and content-based
image retrieval (CBIR). I implemented two algorithms for each. Shot boundary changes can
be detected from variations in color and greyscale, but the color histogram ignores spatial
information while the greyscale measure ignores color, so both prove to have poor
robustness in situations that depend less on color and illumination. Thus I applied the
Template and the Epitome as supplements to enhance detection performance. The
Template also proves quite powerful in image tracking when image resolution and scaling
are identical, and the Epitome is more promising in scene detection, CBIR, and CBVR
thanks to its condensed, visible format.
Owing to the diverse strengths and weaknesses above, there is no universal method for all
situations; each application should combine several features for better performance.
REFERENCES
[1] N. Dimitrova, H.-J. Zhang, B. Shahraray, I. Sezan, T. Huang, and A. Zakhor,
“Applications of video-content analysis and retrieval,” IEEE Multimedia, vol. 9, no. 3, pp.
42–55, July 2002.
[2] P. Salembier and J. Smith, “MPEG-7 multimedia description schemes,” IEEE Trans.
Circuits Syst. Video Technol., vol. 11, no. 6, pp. 748–759, June 2001.
[3] C. Cotsaces, N. Nikolaidis, and I. Pitas, "Video shot detection and condensed
representation [A review]," IEEE Signal Processing Mag., vol. 23, no. 2, pp. 28–37, Mar.
2006.
[4] D.N. Bhat and S.K. Nayar, "Ordinal measures for image correspondence," IEEE Trans.
Pattern Anal. Machine Intell., vol. 20, no. 4, Apr. 1998.
[5] N. Jojic, B.J. Frey, and A. Kannan, "Epitomic analysis of appearance and shape," in
Proc. ICCV 2003.
[6] X. Zhu, J. Fan, A.K. Elmagarmid, and X. Wu, “Hierarchical video summarization and
content description joint semantic and visual similarity,” ACM Multimedia Syst., vol. 9, no.
1, 2003.
[7] D. Lelescu and D. Schonfeld, “Statistical sequential analysis for real-time video scene
change detection on compressed multimedia bitstream,” IEEE Trans. Multimedia, vol. 5,
no. 1, pp. 106–117, Mar. 2003.
[8] A. Hanjalic, "Shot-boundary detection: Unraveled and resolved?" IEEE Trans. Circuits
Syst. Video Technol., vol. 12, no. 2, pp. 90–105, Feb. 2002.
[9] J. Nam and A. Tewfik, "Detection of gradual transitions in video sequences using
B-spline interpolation," IEEE Trans. Multimedia, vol. 7, no. 4, pp. 667–679, Aug. 2005.
[10] H. Zhang and S.S.A. Kankanhalli, “Automatic partitioning of full-motion video,” ACM
Multimedia Syst., vol. 1, no. 1, pp. 10–28, Jan. 1993.
[11] Z. Cernekova, C. Kotropoulos, and I. Pitas, “Video shot segmentation using singular
value decomposition,” in Proc. 2003 IEEE Int. Conf. Multimedia and Expo, Baltimore,
Maryland, July 2003, vol. 2, pp. 301–302.
[12] R. Zabih, J. Miller, and K. Mai, “A feature-based algorithm for detecting and
classification production effects,” ACM Multimedia Syst., vol. 7, no. 1, pp. 119–128, Jan.
1999.
[13] J. Yu and M.D. Srinath, “An efficient method for scene cut detection,” Pattern
Recognit. Lett., vol. 22, no. 13, pp. 1379–1391, Nov. 2001.
[14] R. Lienhart, “Reliable dissolve detection,” in Proc. SPIE, vol. 4315, pp. 219–230, Jan.
2001.
[15] Z.-N. Li, X. Zhong, and M.S. Drew, “Spatialtemporal joint probability images for video
segmentation,” Pattern Recognit., vol. 35, no. 9, pp. 1847–1867, Sep. 2002.
[16] G. Boccignone, A. Chianese, V. Moscato, and A. Picariello, “Foveated shot detection
for video segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp.
365–377, Mar. 2005.
[17] "TREC video retrieval evaluation" [Online]. Available:
http://www-nlpir.nist.gov/projects/trecvid/
[18] C. Kim and J.-N. Hwang, “Object-based video abstraction for video surveillance
systems,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 12, pp. 1128–1138, Dec.
2002.
[19] C. Kim and J.-N. Hwang, “Fast and automatic video object segmentation and tracking
for content-based applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 2,
pp. 122–129, Feb. 2002.
[20] Z. Li, G.M. Schuster, and A.K. Katsaggelos, “Minmax optimal video summarization,”
IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 10, pp. 1245–1256, Oct. 2005.
[21] A. Ferman and A. Tekalp, “Two-stage hierarchical video summary extraction to match
low-level user browsing preferences,” IEEE Trans. Multimedia, vol. 5, no. 3, pp. 244–256,
June 2003.
[22] Y. Gong, “Summarizing audio-visual contents of a video program,” EURASIP J. Appl.
Signal Processing, vol. 2003, pp. 160–169, Feb. 2003.
[23] F. Shipman, A. Girgensohn, and L. Wilcox, “Creating navigable multi-level video
summaries,” in Proc. 2003 IEEE Int. Conf. Multimedia and Expo, Baltimore, Maryland,
July 2003, vol. 2, pp. 753–756.
[24] C.-W. Ngo, Y.-F. Ma, and H.-J. Zhang, “Video summarization and scene detection by
graph modeling,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 2, pp. 296–305,
Feb. 2005.
[25] Z. Xiong, X. Zhou, Q.Tian, Y. Rui, and T.S. Huang, “Semantic Retrieval in Video,”
IEEE Signal Processing Mag., vol. 23, no. 2, pp. 18–27, 2006.
[26] W.K. Li and S.H. Lai, “Storage and retrieval for media databases,” Proc. SPIE, vol.
5021, pp. 264-271, Jan. 2003.