Course Design Report for Digital Image Processing, Aug 2008

A Survey of Approaches in Shot Detection and Condensed Representation

Liang Shi, [email protected]
Beijing University of Posts and Telecommunications

Abstract: Shot detection has been a hot topic in video processing, e.g. in shot boundary detection, segmentation, and content-based video retrieval (CBVR). With the rapid growth of video resources, condensed representation for indexing and retrieval has attracted much more attention. Since different algorithms are adopted in different spheres of application, this paper explores the characteristics of four approaches, namely ordinal measurement of greyscale, color histogram, image template (exemplar), and image epitome, in analyzing the low-level features of color, greyscale, texture, and structure. Each approach is then implemented and evaluated in shot boundary detection and CBIR on five types of data set to demonstrate its merits and limitations.

Introduction

Digital video processing has been an extremely active area over the past decades. Its importance arises from the explosion of video resources on the Internet (e.g. Google and Yahoo Video, YouTube, AltaVista, AOL Video), along with numerous applications such as video retrieval, analysis, indexing, and categorization. Because the nature of video data makes it unsuitable for traditional forms of data access [1], searches are either text-based or follow the query-by-example paradigm. Techniques have therefore been sought that organize video data into more compact forms or extract semantically meaningful information. The granularity of digital video forms a hierarchy of video, scene, shot, and frame, in which a scene is most commonly regarded as the unit to be interpreted. Since a scene transition always coincides with a shot change, shot boundary detection is indispensable as a first step for scene segmentation.
The results of both shot boundary detection and condensed video representation do not need to be immediately directed to semantic-level applications; they may instead be stored as metadata and used when needed. This can be achieved through standards such as MPEG-7 [2], which contains appropriate specifications for the description of shots, scenes, and various types of condensed representation [3]. In this paper, instead of focusing on the semantic gap, I explore common techniques for extracting the basic features of color, greyscale, structure, and texture. The ordinal greyscale measurement is a modified version of greyscale measurement that considers the structural layout as well as the illumination histogram. The color histogram is the most common method for extracting content clues from images, but it discards spatial information. The template is regularly used in tracking to match patches pixel by pixel, but it is too rigid to adapt to varied resolution and scale. A relatively new method, the epitome, is a novel approach that integrates both global features (color, greyscale) and local marks (structure, texture), and even content, in a visible way. There is no solution that generalizes to every condition in shot detection and image retrieval; each method is employed in specific applications depending on the content it processes and the required trade-off between precision and complexity.

System Description

[Fig 1] System layout

In shot boundary detection, I first implemented the ordinal measure of greyscale and the RGB color histogram and compared the results. The color histogram and the image template are then used in image retrieval. The last experiment applies the epitome to CBIR and condensed video representation.

Experiments and Results

Comparing the Ordinal Greyscale Measure and the Color Histogram in Shot Boundary Detection

I first implemented the ordinal measures for image correspondence of Dinkar N. Bhat and Shree K. Nayar [4] on a set of 5 different types of video content, chosen to represent a wide variety and to show both the strong points and the drawbacks of this method. Each contains various types of transition between shots. Using an adaptive threshold with ground-truth data, the results are as follows:

Category                                Length   Shot transitions   Correct detections   Wrong detections
Story film: Crash                       3'46"    126                126                  5
Sport: Football World Cup 2002          3'16"    273                134                  45
Animation: Backkom Series               4'10"    189                189                  38
Newsreel: Topics in Focus               13'      255                245                  24
Surrealist film: Love Me If You Dare    2'23"    59                 50                   61

[Table 1] Shot detection results with the ordinal measure for image correspondence

[Fig 2-5] Experiment results of the ordinal greyscale measure (the 5 data sets separately). Each picture shows the detection result by ordinal greyscale; one frame is supposed to stand for each shot. Some results show detection errors, with causes noted on the right.

[Fig 6] From top down: (a) a dissolve between shots with similar color content; (b) a dissolve between shots with dissimilar color content; (c) a fade; (d) a wipe of the "door" variety; (e) a wipe of the "grill" variety; and (f) a computer-generated special-effect transition between shots. [3]

Conclusions on the greyscale measurement:
1. For abrupt shot transitions, the greyscale measure performs well, since its adaptable grid for extracting spatial features makes it insensitive to local movement.
2. Its major weak point is over-dependence on illumination and object position, as shown in [Fig 2-5] for the Backkom and football videos. When there is a moving light source or a jumping object, the method fails.
3. Another problem is the transition between medium and close-up shots, as shown in [Fig 2-5] for the news results. This causes problems for scene segmentation.
4.
Some animation effects, gradual transitions, and surrealist camera movements remain difficult for shot detection, as shown for the film and the samples from TRECVID 2003.

The color histogram is a representation of the distribution of colors in an image, derived by counting the number of pixels falling into each of a given set of color ranges in a typically two-dimensional (2D) or three-dimensional (3D) color space. It is commonly used as an appearance-based signature to classify images in content-based image retrieval (CBIR) systems. Its main drawback is the loss of structure information: a blue T-shirt may have a histogram similar to that of the ocean, as shown in [Fig 7].

[Fig 7] RGB color histogram similarity between pictures with unrelated content

To demonstrate this, I proceeded as before and tested the same data set with the color histogram in shot boundary detection. [Table 2] shows the results.

Category                                Length   Shot transitions   Correct detections   Wrong detections
Story film: Crash                       3'46"    126                39                   0
Sport: Football World Cup 2002          3'16"    273                61                   2
Animation: Backkom Series               4'10"    189                189                  3
Newsreel: Topics in Focus               13'      255                77                   0
Surrealist film: Love Me If You Dare    2'23"    59                 10                   8

[Table 2] Shot detection results with the RGB color histogram

It is obvious that correctness declines severely, probably because frames in the same scene usually share a similar RGB histogram no matter how many shot transitions the scene contains. Correctness remains high in the trial on the Backkom series, whose color contrast is much higher than in the other types of video [Fig 8]. On the other hand, false detections also decline noticeably, to nearly zero, for the same reason: the disadvantage of neglecting spatial information becomes an advantage to some extent. Despite its error rate in shot transition detection, the color histogram is widely applied in scene segmentation: an average histogram effectively shows the basic features of a specific scene.
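Under the hood, both detectors reduce to computing a per-frame signature and thresholding the distance between consecutive signatures. The following is a minimal sketch of the two signatures compared above, assuming numpy arrays for greyscale and RGB frames; the fixed thresholds are stand-ins for the adaptive threshold used in the experiments, and all names are my own, not from [4]:

```python
import numpy as np

def ordinal_signature(frame, grid=(3, 3)):
    """Rank the mean grey level of each cell in a grid partition
    (the core idea of Bhat and Nayar's ordinal measure)."""
    rows, cols = grid
    h, w = frame.shape
    means = np.array([
        frame[r * h // rows:(r + 1) * h // rows,
              c * w // cols:(c + 1) * w // cols].mean()
        for r in range(rows) for c in range(cols)])
    return np.argsort(np.argsort(means))  # double argsort yields ranks

def ordinal_distance(a, b):
    """L1 distance between two rank permutations; 0 means identical layout."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def rgb_histogram(frame, bins=8):
    """Concatenated per-channel histogram of an (H, W, 3) uint8 frame,
    normalised to unit sum; spatial layout is discarded entirely."""
    h = np.concatenate([np.histogram(frame[..., c], bins=bins,
                                     range=(0, 256))[0] for c in range(3)])
    return h / h.sum()

def detect_cuts(frames, distance, threshold):
    """Flag frame i as a cut when distance(frames[i-1], frames[i])
    exceeds a fixed threshold."""
    return [i for i in range(1, len(frames))
            if distance(frames[i - 1], frames[i]) > threshold]
```

For example, `detect_cuts(grey_frames, lambda a, b: ordinal_distance(ordinal_signature(a), ordinal_signature(b)), 8)` gives the ordinal detector, while `detect_cuts(rgb_frames, lambda a, b: np.linalg.norm(rgb_histogram(a) - rgb_histogram(b)), 0.5)` gives the histogram detector.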
It can also be used to identify relationships between frames, as well as to locate a segment of the video within a long reference video, by computing the overall color distribution. Looking at the correct detection numbers again, it is interesting to find that they almost stand for the scene transitions, except in the sport video (which has no typical scene definition) and the newsreel, which has a similar background throughout [Fig 9].

[Fig 8] Normalized Euclidean distance of the RGB color histogram in the animation clip (marked peak: frame 151, distance 444.7)

[Fig 9] Normalized Euclidean distance of the RGB color histogram in the news video clip (marked point: frame 3338, distance 117.5)

To confirm the usefulness of the color histogram in scene detection and segmentation, experiments were again made on the 5 types of video: first segment the scenes and compute the average RGB feature of each, then compare the Euclidean distances between these and the histogram of a shot fragment in order to locate the clip. The result is quite positive, after some adjustment of the clip length. But the approach proves difficult when the video is not color-based or when the layout of objects is indispensable. [Table 3] illustrates the experiment results.

[Table 3] Average RGB (red, green, blue) color histograms of the 5 types of video clip

Conclusions on the color histogram:
1. The color histogram fits scene detection rather than shot detection, since its innate features are shared between consecutive shots of the same scene.
2. Correctness in shot detection declines significantly for news, film, and especially sports, though not for the animation with its pronounced coloration. At the same time, wrong detections also decrease, because the overall histogram ignores structure information, making the color correspondence between shots easier to identify.
3. Advantages:
- invariant to translation and rotation, and varies slowly with the angle of view, so it suits recognizing an object of unknown position within a scene
- translating an RGB image into an illumination-invariant form allows the histogram to operate well under varying light levels
- color information is fast to compute
4. Drawbacks:
- depends on object color while ignoring shape and texture
- histogram-based algorithms have no generic concept of objects
- high sensitivity to noise
- high dimensionality

Comparing the Template and the Color Histogram in CBIR

Both being simple appearance models, the color histogram and the template are widely used in all kinds of CBIR applications. [Table 4] gives a comparison:

Method            Description                                              Applications
Template          a compact summarization of the distribution              tracking
                  of data in an image
Color histogram   a standard statistical distribution of the occurrence    video shot detection, clustering
                  frequencies of regions in color space                    algorithms, image retrieval

[Table 4] Comparison between the color histogram and the template

Since the template is suited to image tracking, experiments were made as follows:

[Fig 10] Experiment results of CBIR by template (tracking)

Conclusions on the template:
1. Advantages:
- high noise tolerance
- insensitive to color changes
2. Drawbacks:
- not robust when color is reversed over the entire image
- resolution-dependent, with too rigid a match

The Image Epitome in CBIR and Condensed Video Representation

Proposed by Nebojsa Jojic and Brendan Frey at ICCV 2003, the epitome has become another popular compact image representation, since it combines color, texture, and structure features in an elegant, succinct patch.
The epitome of an image is its miniature, condensed version, containing the essence of the textural and shape properties of the image [Fig 11]. Because a collection of images often shares an epitome, e.g. when the images are a few consecutive frames from a video sequence or photographs of similar objects, the epitome can be used in image segmentation, motion estimation, object removal, and super-resolution [5]. It can also be adopted in CBIR and CBVR. As a new form of representation, it performs well in image denoising and super-resolution after learning with a simple graphical model. In shot detection, it avoids the problems caused by resolution and rigid structural restrictions. Furthermore, it is more content-based, since it is derived from the images themselves, and it is defined on an image substantially smaller than the modeled images but significantly larger than the targeted image patches. A striking example of CBIR is shown in [Fig 12]. However, the epitome is not as strong in reconstruction: like the traditional Gaussian filter, it can reduce image noise but may also blur image edges. In image retrieval, its over-dependence on spatial information and on manual labelling limits its practical use.

[Fig 11] The idea of the epitome: a compact representation of images

[Fig 12] Examples of CBIR by epitome. Left: face detection by eyes in the epitome. Right: retrieval of smiling faces by the highest total posterior at the "smiling point" in the epitome of 295 face images.

Conclusions on the epitome:
Although a novel idea, the epitome appears unstable in image retrieval, and its dependence on manual separation also impairs its usefulness in pragmatic applications.
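For concreteness, the epitome's core idea of mapping many image patches onto a much smaller shared image can be sketched as follows. This is only a hard-assignment, k-means-style simplification under my own naming; the actual model of Jojic, Frey, and Kannan [5] is a probabilistic generative model learned with soft EM responsibilities over patch and epitome positions:

```python
import numpy as np

def learn_epitome(image, esize=16, psize=4, iters=5, seed=0):
    """Greatly simplified epitome learner (hard assignments, greyscale).

    Each psize x psize patch of `image` is mapped to the best-matching
    (smallest-SSD) window of a small esize x esize epitome, which is
    then re-estimated as the average of the patches mapped onto each
    of its pixels.
    """
    rng = np.random.default_rng(seed)
    epitome = rng.random((esize, esize))
    patches = [image[i:i + psize, j:j + psize]
               for i in range(0, image.shape[0] - psize + 1, psize)
               for j in range(0, image.shape[1] - psize + 1, psize)]
    for _ in range(iters):
        acc = np.zeros_like(epitome)
        cnt = np.zeros_like(epitome)
        for p in patches:
            # exhaustive search for the best-matching epitome window
            best, best_ij = None, (0, 0)
            for i in range(esize - psize + 1):
                for j in range(esize - psize + 1):
                    ssd = ((epitome[i:i + psize, j:j + psize] - p) ** 2).sum()
                    if best is None or ssd < best:
                        best, best_ij = ssd, (i, j)
            i, j = best_ij
            acc[i:i + psize, j:j + psize] += p
            cnt[i:i + psize, j:j + psize] += 1
        # re-estimate; keep the old value where no patch was mapped
        epitome = np.where(cnt > 0, acc / np.maximum(cnt, 1), epitome)
    return epitome
```

Even this crude version illustrates the behaviour discussed above: repeated content collapses onto small, reusable regions of the epitome, but the mapping depends strongly on spatial layout.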
However, it has some advantages over the template and the histogram, in Jojic's words: "statistical generative models to mimic the structure of the real world"; "The main requirement is complete adaptivity, and so I am trying to avoid application-specific initialization parameters."

Low-Level Features in Condensed Video Representation

An important functionality when retrieving information in general is the availability of a condensed representation of a larger information unit. This can be the summary of a book, the theme or ground of a musical piece, the thumbnail of an image, or the abstract of a paper. A condensed representation allows us to assess the relevance or value of the information before committing time, effort, or computational and communication resources to processing the entire information unit. It can also allow us to extract high-level information when we are not interested in the whole unit, especially during manual organization, classification, and annotation tasks. While for most types of information there are one or more standard forms of condensed representation, this is not the case for video. In the literature, the term highlighting or storyboarding is used for video representation resulting in distinct images, while skimming is sometimes used for a representation resulting in a shorter video [3]. Video skimming is more difficult to perform effectively than video highlighting, because it is harder to extract appropriate features from its atomic elements (video segments instead of images), since they contain a much greater amount of information. Low-level features form the basis for scene segmentation and key-frame extraction. Color and greyscale are better adapted to abrupt scene changes, while structure and texture are applied in object recognition for video indexing. An advanced, object-based method has been developed by Kim and Hwang [18].
They use the method described in [19] to extract objects from each frame, and then use these objects to extract video key frames.

Conclusion

In the task of CBVR, two fundamental steps are shot boundary detection and content-based image retrieval (CBIR). I implemented two algorithms for each. Shot boundary changes can be detected from changes in color and greyscale, but the color histogram ignores spatial information while the greyscale measure ignores color, so both prove poorly robust in situations where color and illumination matter less. I therefore applied the template and the epitome as supplements to enhance detection performance. The template also proved quite powerful in image tracking when the image resolution and scale are identical, while the epitome is more inspiring in scene detection, CBIR, and CBVR, with its condensed, visible format. Owing to the diverse trade-offs above, there is no universal method for all situations; each application will combine several features for better performance.

REFERENCES
[1] N. Dimitrova, H.-J. Zhang, B. Shahraray, I. Sezan, T. Huang, and A. Zakhor, "Applications of video-content analysis and retrieval," IEEE Multimedia, vol. 9, no. 3, pp. 42–55, July 2002.
[2] P. Salembier and J. Smith, "MPEG-7 multimedia description schemes," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, pp. 748–759, June 2001.
[3] C. Cotsaces, N. Nikolaidis, and I. Pitas, "Video shot detection and condensed representation: A review," IEEE Signal Processing Mag., vol. 23, no. 2, Mar. 2006.
[4] D.N. Bhat and S.K. Nayar, "Ordinal measures for image correspondence," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 4, Apr. 1998.
[5] N. Jojic, B.J. Frey, and A. Kannan, "Epitomic analysis of appearance and shape," in Proc. ICCV 2003.
[6] X. Zhu, J. Fan, A.K. Elmagarmid, and X. Wu, "Hierarchical video summarization and content description joint semantic and visual similarity," ACM Multimedia Syst., vol. 9, no. 1, 2003.
[7] D. Lelescu and D. Schonfeld, "Statistical sequential analysis for real-time video scene change detection on compressed multimedia bitstream," IEEE Trans. Multimedia, vol. 5, no. 1, pp. 106–117, Mar. 2003.
[8] A. Hanjalic, "Shot-boundary detection: Unraveled and resolved?" IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 2, pp. 90–105, Feb. 2002.
[9] J. Nam and A. Tewfik, "Detection of gradual transitions in video sequences using B-spline interpolation," IEEE Trans. Multimedia, vol. 7, no. 4, pp. 667–679, Aug. 2005.
[10] H. Zhang and S.S.A. Kankanhalli, "Automatic partitioning of full-motion video," ACM Multimedia Syst., vol. 1, no. 1, pp. 10–28, Jan. 1993.
[11] Z. Cernekova, C. Kotropoulos, and I. Pitas, "Video shot segmentation using singular value decomposition," in Proc. 2003 IEEE Int. Conf. Multimedia and Expo, Baltimore, Maryland, July 2003, vol. 2, pp. 301–302.
[12] R. Zabih, J. Miller, and K. Mai, "A feature-based algorithm for detecting and classifying production effects," ACM Multimedia Syst., vol. 7, no. 1, pp. 119–128, Jan. 1999.
[13] J. Yu and M.D. Srinath, "An efficient method for scene cut detection," Pattern Recognit. Lett., vol. 22, no. 13, pp. 1379–1391, Nov. 2001.
[14] R. Lienhart, "Reliable dissolve detection," in Proc. SPIE, vol. 4315, pp. 219–230, Jan. 2001.
[15] Z.-N. Li, X. Zhong, and M.S. Drew, "Spatial-temporal joint probability images for video segmentation," Pattern Recognit., vol. 35, no. 9, pp. 1847–1867, Sep. 2002.
[16] G. Boccignone, A. Chianese, V. Moscato, and A. Picariello, "Foveated shot detection for video segmentation," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 365–377, Mar. 2005.
[17] "TREC video retrieval evaluation" [Online]. Available: http://www-nlpir.nist.gov/projects/trecvid/
[18] C. Kim and J.-N. Hwang, "Object-based video abstraction for video surveillance systems," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 12, pp. 1128–1138, Dec. 2002.
[19] C. Kim and J.-N. Hwang, "Fast and automatic video object segmentation and tracking for content-based applications," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 2, pp. 122–129, Feb. 2002.
[20] Z. Li, G.M. Schuster, and A.K. Katsaggelos, "Minmax optimal video summarization," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 10, pp. 1245–1256, Oct. 2005.
[21] A. Ferman and A. Tekalp, "Two-stage hierarchical video summary extraction to match low-level user browsing preferences," IEEE Trans. Multimedia, vol. 5, no. 3, pp. 244–256, June 2003.
[22] Y. Gong, "Summarizing audio-visual contents of a video program," EURASIP J. Appl. Signal Processing, vol. 2003, pp. 160–169, Feb. 2003.
[23] F. Shipman, A. Girgensohn, and L. Wilcox, "Creating navigable multi-level video summaries," in Proc. 2003 IEEE Int. Conf. Multimedia and Expo, Baltimore, Maryland, July 2003, vol. 2, pp. 753–756.
[24] C.-W. Ngo, Y.-F. Ma, and H.-J. Zhang, "Video summarization and scene detection by graph modeling," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 2, pp. 296–305, Feb. 2005.
[25] Z. Xiong, X. Zhou, Q. Tian, Y. Rui, and T.S. Huang, "Semantic retrieval in video," IEEE Signal Processing Mag., vol. 23, no. 2, pp. 18–27, 2006.
[26] W.K. Li and S.H. Lai, "Storage and retrieval for media databases," Proc. SPIE, vol. 5021, pp. 264–271, Jan. 2003.