Correlation-Based Retrieval for Heavily Changed Near-Duplicate
Videos
JIAJUN LIU, ZI HUANG, and HENG TAO SHEN, The University of Queensland
BIN CUI, Peking University
The unprecedented and ever-growing number of Web videos nowadays leads to the massive existence of
near-duplicate videos. Very often, near-duplicate videos exhibit great content changes while the user
perceives little information change; for example, color features change significantly when transforming a
color video with a blue filter. These feature changes distort low-level video similarity computations,
making conventional similarity-based near-duplicate video retrieval techniques incapable of accurately capturing the implicit relationship between two near-duplicate videos with fairly large content modifications.
In this paper, we introduce a new dimension for near-duplicate video retrieval. Different from existing near-duplicate video retrieval approaches, which are based on video-content similarity, we explore the correlation
between two videos. The intuition is that near-duplicate videos should preserve strong information correlation in spite of intensive content changes. More effective retrieval with stronger tolerance is achieved by
replacing video-content similarity measures with information correlation analysis. Theoretical justification
and experimental results prove the effectiveness of correlation-based near-duplicate retrieval.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and
Retrieval—Retrieval models; search process
General Terms: Design, Algorithms, Experimentation
Additional Key Words and Phrases: Near-duplicate video, correlation-based retrieval, similarity-based
retrieval
ACM Reference Format:
Liu, J., Huang, Z., Shen, H. T., and Cui, B. 2011. Correlation-based retrieval for heavily changed near-duplicate videos. ACM Trans. Inf. Syst. 29, 4, Article 21 (November 2011), 25 pages.
DOI = 10.1145/2037661.2037666 http://doi.acm.org/10.1145/2037661.2037666
1. INTRODUCTION
Emerging video-related online services such as video sharing, video broadcasting, and
video searching have increasingly brought user interest and participation to
video-related activities like editing, sharing, watching, and searching. According to
a recent report by comScore,1 76.8% of the total U.S. Internet audience viewed online
videos, watching 14.8 billion online videos in January 2009 alone, with an average view
count of 101 videos per user. The report also shows that an ever-rising demand for
online videos is evident.
1 www.comscore.com.
Authors’ addresses: J. Liu, H. T. Shen, and Z. Huang, School of Information Technology and Electrical
Engineering, The University of Queensland, Australia; emails: {jiajun, shenht, huang}@itee.uq.edu.au;
B. Cui, Department of Computer Science, Peking University, China; email: [email protected].
The emergence of video-related applications and services subsequently leads to the
continuous exponential growth of online video content [Shen et al. 2007, 2011; Huang
et al. 2010; Tan et al. 2009; Wu et al. 2007]. This is also supported by comScore statistics,
which show that the total number of Internet videos viewed climbed from about 12.677 billion in
November 2008 to about 14.831 billion in January 2009, a 17% growth
within three months. Increasing video-related activities contribute to an unprecedented
number, as well as a substantial percentage, of near-duplicates among today's online videos,
referred to as Near-duplicate Videos (NVs) [Tan et al. 2009; Wu et al. 2007].
The percentage of NVs among web videos is surprisingly high, up to 93% for some
experimental queries [Wu et al. 2007]. The massive existence of NV data imposes
heavy demands on Near-duplicate Video Retrieval (NVR), which is crucial to many novel
applications such as copyright-violation detection, video monitoring and tracking, video
database cleansing, and video recommendation. To list a few: a content provider who
publishes a copyrighted video in its own YouTube channel wants to enforce its copyright
by detecting and removing illegal editions of the original copy on the same
site; a company that has invested in a TV commercial wants to verify that the
commercial is broadcast the correct number of times during the agreed time period; a
video service provider wants to accurately recommend videos on the same topic to users
according to the content of the videos they are watching or searching for.
Motivated by the above needs, intensive study has been conducted in the NVR domain
in recent years. A variety of approaches have been proposed, most (if not
all) of which exploit content similarity between video features to determine the near-duplicate
relationship, with or without considering temporal information [Law-To et al. 2006,
2007, 2009; Shen et al. 2007; Wu et al. 2007, 2008; Zhao and Ngo 2009; Zhu et al.
2008]. These approaches have reportedly been effective for NVs with moderate content
misalignments. However, a recent study has further shown that users judge whether two
videos are near-duplicates mainly based on the degree of information change they
perceive [Cherubini et al. 2009], even when the videos' features look very different. This
further challenges the content-change tolerance of conventional
similarity-based NVR methods. For example, Figure 1 illustrates that when an original
video is edited with heavy color changes and some additional black frame insertions,
its features change significantly while the information a user can perceive from
it remains more or less the same. Remarkable distortions of the feature
values can easily be observed in their feature visualizations, making them difficult
to detect as near-duplicates. We call this sort of NVs near-duplicates with heavy
content changes.
Heavy content changes, from a video’s perspective, represent great changes to its
visual content and, subsequently, its features, introduced by certain modifications. The
type of modification and the extent of this modification determine whether or not heavy
content changes have been made. For instance, changes brought by video format/quality
changes or mirror/flip operations are usually not considered heavy content changes,
since the video features remain largely close to those of the original after modification.
These modifications naturally have less impact on the color distribution/pattern. On
the other hand, a video’s color distribution/pattern is naturally more sensitive to direct
changes in brightness, color (filtering, warmth), etc. At the same time, some other
types of modification can introduce heavy changes when performed
to a greater extent, such as logo/banner insertion and frame deletion/insertion.
Figure 2 demonstrates how the type and extent of a modification affect the
48-dimensional color moment feature. It shows an original video and nine variants. For
each variant, we visualize the features of the variant and of the original to illustrate
the changes in the color moment features. Modifying encoding quality and video format
results in very little change to the color distribution, even when the encoding quality
in Figure 2(c) is only half of the original. Mirroring does not change the color distribution at
all; however, since color moments capture color values from each region, and the geometric
positions are changed, the color moment features change slightly after the mirror
transformation. Clearly, none of the NVs represented by Figures 2(b), 2(c), and 2(d) can
be considered to have heavy content changes. This is determined by the characteristics
of these modification types. On the other hand, in Figures 2(e) and 2(f), where the brightness
is increased by 30% and the red channel by 100, the features are heavily
distorted from the original ones, because these changes directly affect the color values
underlying the features. For the banner and logo insertions in Figures 2(g), 2(h), 2(i), and 2(j), it is
evident that as the extent of modification increases, the changes to the features become
more and more intense. Among the modifications in Figure 2, those in Figures 2(e), 2(f), 2(h), and 2(j)
constitute heavy content changes, because the types and extents of these modifications
have introduced significant changes to the features.

Fig. 1. Two NVs with different color features: (a) an original video; (b) a near-duplicate with intensive color changes. [Each panel is a surface plot of dimensional value over signature dimension and frame; plot data omitted.]

Fig. 2. Impact of video modifications on color moment features: (a) Original, (b) Format, (c) Quality, (d) Mirror, (e) Brightness, (f) Red filter, (g) Banner (small), (h) Banner (large), (i) Logo (small), (j) Logo (large). [Each panel overlays the original's features with the variant's; plot data omitted.]
The significance of Correlation-based Near-duplicate Video Retrieval (CNVR) is obvious from observations on the composition of Web
video datasets. According to a statistical study in Wu et al. [2007], there are a large number of near-duplicates in real Web videos that are difficult to retrieve with similarity-based approaches using global color features, especially under photometric variations,
scene modification, video length changes, and so on. This presents a challenge to the effectiveness of existing approaches under such circumstances.
In order to obtain stronger tolerance to content changes in NVs, and inspired by
correlation analysis in statistics, we propose a new direction which exploits
video information correlation instead of content similarity. The intuition is that two
near-duplicate videos should always preserve strong information correlation even if
their content has undergone great changes. To differentiate our approach from previous ones,
we use Similarity-based Near-duplicate Video Retrieval (SNVR) to refer to previous
similarity-based retrieval approaches, and Correlation-based Near-duplicate
Video Retrieval (CNVR) for ours. In this paper, we employ Canonical Correlation
Analysis (CCA) for video information correlation discovery. Our main contributions
are summarized as follows:
—We propose Correlation-based Near-duplicate Video Retrieval (CNVR) based on
information correlation analysis. It is designed to have strong tolerance to intensive
content changes. Unlike SNVR, the nature of correlation analysis means that
in CNVR, a non-trivial relationship between two near-duplicate videos can still be
discovered even when their explicit content similarity is low.
—We explain the rationale of using CCA for the task of CNVR, together with its
formulation and solution. Furthermore, the relation between CCA and Mutual
Information in information theory is discussed to justify the suitability of
utilizing CCA in CNVR for computing video information correlation.
—We present an overall CNVR framework based on CCA, followed by a near-duplicate
case study. Additionally, we propose a single-valued measure of the extent to which
two videos are (partial) NVs, the Degree of Correlation (DoC), computed from the
correlation analysis outputs.
—We discuss how and why CNVR can deal with partial NVs without explicit frame-level
pairwise comparison. The near-duplicate frames in the videos can be detected
and located, and the NV relationship is assessed with the same measure as for a
normal NV.
—We conduct an extensive performance study on two datasets from existing works to
demonstrate the performance of CNVR for both NVs and partial NVs.
The rest of the paper is organized as follows. First, we survey existing work in
Section 2. In Section 3, we explain why CCA is used for the CNVR problem, give its
formal formulation and solution, and discuss its relation to Mutual Information. The proposed
CNVR framework is described in Section 4, where we illustrate the components of the
framework and define a mechanism to measure the degree of video information
correlation through the CCA outputs. A case study is presented in Section 5, where detailed
intermediate results and their interpretation are given. Experimental results are reported
in Section 6. Section 7 concludes the paper.
2. RELATED WORK
Much work has been done to address the problem in the past decade in the related domains
of NVR and copy detection [Assent and Kremer 2009; Law-To et al. 2007, 2009;
Liu et al. 2007; Shen et al. 2007; Wu et al. 2007, 2008, 2009; Yeh and Cheng 2009;
Yuan et al. 2004; Zhao and Ngo 2009; Zhu et al. 2008], most of which adopts similarity
measures on global features or local keypoints to determine whether two videos are near-duplicates. Global features are generally summarized from whole videos. In Yuan
et al. [2004], a compact color histogram is used, as well as enhanced spatio-temporal
features which summarize the color pattern distribution along the video time axis. A similar
accumulative color histogram is employed in Wu et al. [2007], but with the assistance
of local keypoints to evaluate unidentified candidates in a two-step, rough-to-fine filtering
framework. Bounded Coordinate System (BCS), a PCA-based summarization approach described in Shen et al. [2007] and Huang et al. [2009], extracts
the global video feature distribution by applying PCA in the feature space. The derived principal components capture the distribution of video frames, and similarity is computed
with an extended Lp-norm. A reference-based histogram is also demonstrated to be used
as a feature [Liu et al. 2007], where each video is compared with a set of reference
videos (seed videos) using 2-dimensional PCA features; the percentages of
video frames closest to each reference video are then recorded and
combined into a histogram. Typically, if the similarity between two videos is greater
than a threshold, the two are regarded as near-duplicates. Some methods also utilize temporal
information [Tan et al. 2009; Law-To et al. 2006; Yeh and Cheng 2009] or semantics
[Min et al. 2009] for NVR.
Approaches that use global features have been observed to perform quite well
in many cases. However, it has also been observed that they may work less accurately
under circumstances where large photometric differences between NVs exist. This is
because global similarity measures largely depend on the coordinate system in
which the features are assessed. Sometimes, even when two videos have very strong
correlation, their relationship may not be obvious because of their low similarity in
the specific coordinate system. Recall Figure 1, which illustrates two videos having
very different color features. Obviously they are near-duplicates; however, the intense
color transformation leads to remarkable feature changes that weaken the relevance
of candidates in feature space. The number of such NVs is observed to be
considerably large [Wu et al. 2007].
On the other hand, local interest points (or keypoints) from frames are widely
used to tackle unstable object detection performance under viewpoint, scale, and
rotation changes. The invention of scale- and rotation-invariant keypoint detectors like the
Hessian-Affine detector [Mikolajczyk and Schmid 2004] has resolved the issue of
detector instability under viewpoint or scale variations. These detectors have therefore
been widely utilized in the recent NVR literature [Law-To et al. 2009; Zhao and
Ngo 2009; Wu et al. 2007; Zhu et al. 2008], with descriptors like SIFT and PCA-SIFT
[Lowe 2004; Ke and Sukthankar 2004]. More sophisticated approaches have been developed
for matching keypoint-based features, such as bag-of-words representation [Wu et al.
2007], one-to-one matching [Zhao et al. 2007; Ngo et al. 2006], spatial entropy measured
via angles [Zhao and Ngo 2009], nonrigid matching [Zhu et al. 2008], and
temporal-aware summarization [Satoh et al. 2007; Wu et al. 2008; Tan et al. 2008].
Existing keypoint-based NVR approaches are known for their capability of detecting
viewpoint, scale, and rotation changes. Nevertheless, they are still quite sensitive to
great changes in illumination, contrast, shadows, and shading [Brown and Lowe 2002].
Additionally, the enormous number of keypoints and their high feature dimensionality
often cause efficiency issues in large-scale video databases. Even with dimension
reduction and indexing techniques [Poullot et al. 2008], keypoint-based methods
are still hardly practical for time-sensitive NVR applications on very large video
collections without reducing the number of keypoints significantly.
Recently, an interest-seam-based method has been proposed for video retrieval
[Zhang et al. 2010]. In this method, a video is represented by an interest-seam image,
constructed by considering both spatial and temporal energy along the
columns of the video. With dynamic programming, the seams with maximum visual
saliency energy are identified and concatenated into a representative image, called
an interest-seam image. It shows great effectiveness for finding NVs with
gradually changing scenes. However, when videos are changed with frame
insertion/deletion/disordering, as is most likely in partial near-duplicates, both the
temporal and spatial information it relies on are disrupted.
In this paper we propose a novel CNVR approach which achieves strong tolerance
to content and temporal changes. The approach exploits the information correlation
between two videos by applying Canonical Correlation Analysis (CCA). To the best of
our knowledge, no previous investigation has been conducted on the use of correlation
analysis for the NVR task.
3. CNVR ESSENTIALS
So far, correlation analysis has not been explored to tackle the NVR problem. In this
section we first introduce the essential background on correlation analysis and explain how
employing it improves NVR quality, followed by the formulation of correlation analysis
and its solution. Finally, the interpretation of CCA outputs on video features and their
relation to Mutual Information are discussed.
Correlation analysis is a well-known family of statistical tools for analyzing associations
between variables or sets of variables. Rather than directly calculating similarity,
it reveals possible relations between data values. Canonical Correlation Analysis (CCA)
is the most widely used correlation technique in the family for its capability of measuring
the correlation between two sets of variables, by constructing cross-covariance
matrices and maximizing the mutual information between the two sets [Clark
1975]. Here we investigate CCA as a method to solve the correlation-based near-duplicate
video retrieval problem.
Intuitively, CCA can be used for the NVR problem because its formulation suits the
problem very well, and its solution sufficiently defines the information relationship
between two near-duplicate video sequences. Compared to SNVR, several advantages
can be observed for CNVR based on CCA.
(1) It is content change-friendly as strong information correlation can still be preserved
for two near-duplicate videos even when there is major content modification, or
dramatic photometric and geometric transformations. CCA recognizes these transformations and reflects them with correlation and canonical weights, by finding
optimal projections. Canonical weights contain the significance of each frame as a
variable, and thus the outputs of CCA are sufficient for the determination of the
overall correlation between two videos, as well as the correlation between two individual frames. Near-duplicates and partial duplicates can be detected using the
same unified framework.
(2) CCA is immune to temporal changes. This is extremely useful in cases where
insertions/deletions occur in the NVs, as well as where the frame rate changes due
to different camera or encoding settings. Partial NV detection can be achieved by
a single video-to-video matching; explicit frame-level one-to-one matching is
avoided.
(3) CCA on the retrieval task is largely feature-independent, as long as transformations
on videos can be linearly reflected in the feature space. Multiple types of features
can be combined in generalized CCA [Hardoon et al. 2004] to increase tolerance for
some specific transformations.
(4) It is very simple to replace CCA with Kernel Canonical Correlation Analysis
(KCCA) to find non-linear correlation in the CNVR framework without major modification to the implementation.
In Sections 4, 5 and 6 we will discuss how (1) and (2) are achieved from both theoretical and experimental aspects. Technical analysis and experiment results will be
presented. We leave (3) and (4) for further analysis and experiments in our future work.
3.1. Formulation and Solution
Introduced by Hotelling [1936], CCA has recently drawn attention in the field of machine learning, where most works focus on recognition and classification problems
[Hardoon et al. 2004; Kim and Cipolla 2009; Kim et al. 2007]. In this subsection we
will formally formulate CCA to tackle the NVR problem.
CCA can be understood as finding pairs of basis vectors for two sets of variables, so
that the correlations between the projected variables on their respective basis vectors
are mutually maximized. Literally, it means to find the optimal linear combinations
of variables in each set to maximize their correlation after the combination. As an
illustration, here we examine the case of two sets of samples representing two
videos. Formally, given two sets of samples denoted as matrices $X = \{x_1, x_2, \ldots, x_D\}^T$
and $Y = \{y_1, y_2, \ldots, y_D\}^T$, with each $x_i$ and $y_j$ representing samples of instance values
of the observed variables, and $D$ representing the number of samples, the
goal of CCA is to find two vectors $w_X$ and $w_Y$ such that the projections of $X$ and $Y$ on
$w_X$ and $w_Y$,

$$p_X = (\langle w_X, x_1 \rangle, \ldots, \langle w_X, x_D \rangle), \qquad p_Y = (\langle w_Y, y_1 \rangle, \ldots, \langle w_Y, y_D \rangle),$$

have mutually maximal correlation,

$$\rho = \max_{w_X, w_Y} \operatorname{corr}(p_X, p_Y) = \max_{w_X, w_Y} \frac{\langle p_X, p_Y \rangle}{\| p_X \| \cdot \| p_Y \|}, \tag{1}$$

where $\rho$ is called the canonical correlation of matrices $X$ and $Y$, and $\langle \alpha, \beta \rangle$ is the inner
product, equal to $\alpha^T \beta$. This can be written in the empirical expectation form as

$$\rho = \max_{w_X, w_Y} \frac{w_X^T \hat{E}[XY^T] w_Y}{\sqrt{w_X^T \hat{E}[XX^T] w_X \; w_Y^T \hat{E}[YY^T] w_Y}}, \tag{2}$$

where the empirical expectation of a function $f$ is denoted $\hat{E}[f]$. If we denote
$(X, Y)$'s within-set covariance matrices as $C_{XX}$, $C_{YY}$ and its between-set covariance
matrices as $C_{XY}$, $C_{YX}$, we have the complete covariance matrix of $(X, Y)$ as

$$C(X, Y) = \hat{E}\left[ \begin{pmatrix} X \\ Y \end{pmatrix} \begin{pmatrix} X \\ Y \end{pmatrix}^T \right] = \begin{pmatrix} C_{XX} & C_{XY} \\ C_{YX} & C_{YY} \end{pmatrix}.$$

Since $C_{XY} = C_{YX}^T$, the function defined in Equation (1) can be revised to

$$\rho = \max_{w_X, w_Y} \frac{w_X^T C_{XY} w_Y}{\sqrt{w_X^T C_{XX} w_X \; w_Y^T C_{YY} w_Y}}. \tag{3}$$

Here the canonical problem is transformed into an optimization problem: find $w_X$ and
$w_Y$ that maximize $\rho$. This suits the problem of finding the maximal correlation between
the matrix-like features of two videos; the correlation obtained by optimizing this
function reflects the potential relationship of the two videos and can be used to determine
whether they are near-duplicates. Note that the constraints

$$w_X^T C_{XX} w_X = 1, \qquad w_Y^T C_{YY} w_Y = 1, \tag{4}$$

can be introduced into the optimization, as $\rho$ is invariant to rescaling of $w_X$ or $w_Y$.
Hence the problem can be written with Lagrange multipliers as

$$L(\lambda, w_X, w_Y) = w_X^T C_{XY} w_Y - \frac{\lambda_X}{2}\left(w_X^T C_{XX} w_X - 1\right) - \frac{\lambda_Y}{2}\left(w_Y^T C_{YY} w_Y - 1\right).$$

With the constraints in Equation (4), taking derivatives of this function shows that $\lambda_X$
actually equals $\lambda_Y$:

$$\frac{\partial L(\lambda, w_X, w_Y)}{\partial w_X} = C_{XY} w_Y - \lambda_X C_{XX} w_X = 0, \tag{5}$$

$$\frac{\partial L(\lambda, w_X, w_Y)}{\partial w_Y} = C_{YX} w_X - \lambda_Y C_{YY} w_Y = 0, \tag{6}$$

$$0 = \lambda_Y w_Y^T C_{YY} w_Y - \lambda_X w_X^T C_{XX} w_X = \lambda_Y - \lambda_X. \tag{7}$$

Letting $\lambda = \lambda_X = \lambda_Y$ and substituting, with the assumption that $C_{YY}$ is invertible,
we get

$$w_Y = \frac{C_{YY}^{-1} C_{YX} w_X}{\lambda}. \tag{8}$$

Substituting Equation (8) into Equation (5), we have

$$C_{XY} C_{YY}^{-1} C_{YX} w_X = \lambda^2 C_{XX} w_X, \tag{9}$$

which is a generalized eigen-problem: $w_X$ can be obtained by solving it, and $w_Y$ can
then be found from $w_X$ via Equation (8). To formulate it as a standard eigen-problem
of the form $Ax = \lambda x$, we assume $C_{XX}$ is also invertible (regularization of $C_{XX}$ and
$C_{YY}$ to ensure invertibility is discussed in Section 3.3). We decompose $C_{XX}$ by the
complete Cholesky decomposition as

$$C_{XX} = R_{XX} R_{XX}^T,$$

given that $C_{XX}$, as a covariance matrix, is symmetric and positive definite. If we denote
$r_X = R_{XX}^T w_X$, we can rewrite Equation (9) as follows:

$$C_{XY} C_{YY}^{-1} C_{YX} R_{XX}^{-T} r_X = \lambda^2 R_{XX} r_X,$$
$$R_{XX}^{-1} C_{XY} C_{YY}^{-1} C_{YX} R_{XX}^{-T} r_X = \lambda^2 r_X. \tag{10}$$

Here we have formulated the problem as a standard eigen-problem of the form $Ax = \lambda x$,
whose eigenvalues are the squared correlations $\lambda^2$; by solving this eigen-problem we
thereby obtain the correlation information $\lambda$. By solving Equations (9) and (8), we can
also obtain the corresponding $w_X$ and $w_Y$.
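To make this solution concrete, the following is a minimal sketch of the eigen-solution in Python with NumPy; the function name is ours, and the small ridge term k anticipates the regularization of Section 3.3 rather than being part of the derivation above.

```python
import numpy as np

def cca(X, Y, k=1e-8):
    """Canonical correlation analysis of two sample matrices.

    X: D x N array (D samples of N variables), Y: D x M array, N >= M.
    Returns canonical correlations lam (descending) and basis matrices
    wX (N x M) and wY (M x M), one column per correlation pattern.
    """
    X = X - X.mean(axis=0)                        # center each variable
    Y = Y - Y.mean(axis=0)
    D = X.shape[0]
    Cxx = X.T @ X / (D - 1) + k * np.eye(X.shape[1])   # within-set covariances,
    Cyy = Y.T @ Y / (D - 1) + k * np.eye(Y.shape[1])   # ridged for invertibility
    Cxy = X.T @ Y / (D - 1)                            # between-set covariance

    Rxx = np.linalg.cholesky(Cxx)                 # Cxx = Rxx Rxx^T
    B = np.linalg.solve(Cyy, Cxy.T)               # Cyy^{-1} Cyx
    A = np.linalg.solve(Rxx, Cxy @ B)             # Rxx^{-1} Cxy Cyy^{-1} Cyx ...
    A = np.linalg.solve(Rxx, A.T).T               # ... Rxx^{-T}, Equation (10)
    lam2, r = np.linalg.eigh((A + A.T) / 2)       # eigenvalues are lambda^2

    order = np.argsort(lam2)[::-1][:Y.shape[1]]   # keep the M largest solutions
    lam = np.sqrt(np.clip(lam2[order], 0.0, 1.0))
    wX = np.linalg.solve(Rxx.T, r[:, order])      # w_X = Rxx^{-T} r_X
    wY = (B @ wX) / np.maximum(lam, 1e-12)        # Equation (8), guarded division
    return lam, wX, wY
```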
3.2. Application and Interpretation
To apply CCA to the NVR problem, where each video is represented as a sequence
of frames and each frame as a high-dimensional feature vector, we
model each frame as a variable and each dimension of the feature space as a
sample for the correlation analysis. To explain the process more clearly, we follow
the notation of Section 3.1. Given two videos X and Y with N and M frames
in a D-dimensional feature space, they can be presented as $D \times N$ and $D \times M$
matrices respectively, denoted as $X = \{x_1, x_2, \ldots, x_D\}^T$ and $Y = \{y_1, y_2, \ldots, y_D\}^T$.
Each column in X and Y represents a frame as a variable, while each row,
namely $x_i$ and $y_i$, represents the values of one dimension over all frames as a
sample. With this setting, there are M and N variables in the criteria and predictors
respectively, with the same number of observations (samples), that is, D. The objective
is to find two sets of frame weights, namely $w_X$ and $w_Y$, such that the correlation
between the two videos is maximized ($\rho$ in Equation (2)).
By solving the problem we obtain three outputs: the correlations $\lambda$, the weights $w_X$ for video
X, and the weights $w_Y$ for video Y.
Fig. 3. CCA on two video feature matrices. [Diagram omitted.]
Table I. Interpretation of CCA Outputs

$\lambda$: correlations of the two sets of variables. Interpretation: the possible overall
dependency of two video features; larger values imply a higher possibility that the
two videos are NVs.

$w_X$, $w_Y$: contribution of each variable in X and Y to the corresponding correlation
value. Interpretation: the frame-level intercorrelation of X and Y; larger values
indicate a higher possibility of being near-duplicate frames. The value and sign of a
weight indicate the degree and direction of a frame's involvement in a specific
correlation pattern. According to our retrieval objective, $w_Y$ is closely examined for
near-duplicate segment identification.
Without loss of generality, we assume $D > N \ge M$. As there are multiple solutions
to this eigen-problem (Equation (10)), we can obtain multiple sets of $w_X$ and $w_Y$ as well
as their corresponding $\lambda$. Note that for a standard eigen-problem of the form $Ax = \lambda x$,
the number of solutions is less than or equal to the rank of $A$. With the implicit constraint
that $\lambda_X = \lambda_Y$, and with full-rank matrices $X$ ($D \times N$) and $Y$ ($D \times M$), it
is clear that the number of solutions for $\lambda$ is $\min(\mathrm{rank}(X), \mathrm{rank}(Y))$, which equals M
in our case because $D > N \ge M$. For each $\lambda$ among the solutions, the frame weights
(projection basis vectors) $w_X$ and $w_Y$, of dimensionality N and M respectively, can be
solved. Finally we obtain $w_X$ ($N \times M$) and $w_Y$ ($M \times M$), in which
each column contains the frame weights (a projection basis vector) for a particular
correlation pattern.
Figure 3 depicts the features and outputs of the above scenario. By applying CCA, we
get an M-dimensional vector $\lambda = (\lambda_1, \ldots, \lambda_M)$ that contains the correlations, and two
matrices $w_X = \{w_{X1}, \ldots, w_{XM}\}$ and $w_Y = \{w_{Y1}, \ldots, w_{YM}\}$ that indicate the significance
of the frames in X and Y contributing to the corresponding correlations. For example, $w_{X1}$
and $w_{Y1}$ represent the contributions of the frames in X and Y respectively for the
correlation $\lambda_1$: X and Y are projected onto $w_{X1}$ and $w_{Y1}$ respectively, and
$\lambda_1$ is the correlation value after the projection. The interpretation of the results
produced by CCA is listed in Table I.
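As a usage sketch of this setting (the sizes are illustrative, and cca() is the helper sketched in Section 3.1), two videos with N = 30 and M = 20 keyframes in a D = 48-dimensional feature space would be processed as:

```python
import numpy as np

# Frames are variables, feature dimensions are samples (D > N >= M assumed).
D, N, M = 48, 30, 20
X = np.random.rand(D, N)            # query video: one column per frame
Y = np.random.rand(D, M)            # candidate video: one column per frame

lam, wX, wY = cca(X, Y)             # cca() as sketched in Section 3.1
print(lam.shape, wX.shape, wY.shape)   # (20,), (30, 20), (20, 20)
```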
3.3. Regularization
Note that several steps of the CCA solution depend on the assumption that the
corresponding covariance matrices, that is, the within-set covariance matrices
$C_{XX}$ and $C_{YY}$, are invertible. Nonetheless, as within-set covariance matrices are not
always of full rank, the invertibility assumption may fail. Obviously,
regularization is necessary here to solve the ill-posed inverse problem. Many well-known
regularization techniques in statistics and machine learning can be used
to approximate the ill-posed inverse problem; a most common solution is to
add a diagonal matrix with extremely small diagonal entries. In our CNVR case, for
computational reasons, simple approaches are preferable. We simply rewrite Equations (8)
and (9) as

$$w_Y = \frac{(C_{YY} + kI)^{-1} C_{YX} w_X}{\lambda}, \qquad C_{XY} (C_{YY} + kI)^{-1} C_{YX} w_X = \lambda^2 C_{XX} w_X, \tag{11}$$

subject to the constraints

$$w_X^T (C_{XX} + kI) w_X = 1, \qquad w_Y^T (C_{YY} + kI) w_Y = 1,$$

where $k$ is a very small real number, for example $10^{-8}$, and $I$ is the identity
matrix with the same dimensions as $C_{XX}$ and $C_{YY}$. The impact
of $k$ on NVR effectiveness is examined in Section 6. In this way we obtain an
approximate solution to the ill-posed inverse problem.
3.4. Relation to Mutual Information
In information theory, Mutual Information (MI) is a commonly recognized and used
quantity that measures the mutual dependence of two variables. The relation
between CCA and MI is clear. As information is additive for statistically independent
variables, and as canonical variables satisfy this condition, the MI between two
one-dimensional variables x and y can be written as

$$\mathrm{MI}(x, y) = \frac{1}{2} \log \frac{\sigma_x^2 \sigma_y^2}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}^2} = \frac{1}{2} \log \frac{1}{1 - \rho_{xy}^2},$$

where $\sigma_x^2$ and $\sigma_y^2$ are the variances of x and y respectively, $\sigma_{xy}$ is the covariance between
x and y, and $\rho_{xy}$ is the correlation between x and y. Observe that this equation
yields the relationship that the mutual information is maximized when the correlation
is maximized. It means that when CCA finds the optimal projection with maximized
correlation, the maximum information dependence between the variables is obtained. This
suggests that the solution of CCA suits the purpose of finding near-duplicate videos
well, and that NVs are expected to have strong information correlation.
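Because the canonical variate pairs produced by CCA are mutually uncorrelated, these per-pair quantities add up; under a Gaussian assumption (our expansion of the cited relation, not spelled out in the text), the total information shared by two videos through their M canonical patterns is

$$\mathrm{MI}(X, Y) = \sum_{i=1}^{M} \frac{1}{2} \log \frac{1}{1 - \lambda_i^2},$$

which grows with every strong canonical correlation $\lambda_i$.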
4. CNVR FRAMEWORK
In this section we present a general CNVR framework based on CCA to retrieve near-duplicate videos. Three major components of the framework are described.
4.1. Feature Extraction
Feature extraction is a fundamental component of this framework. Upon arrival of a
given video (as a sequence of frames), this component extracts frame feature vectors
for the video, namely visual patterns from frame contents. All later manipulations are
based on these feature vectors, so the more distinctive these features are, the better
the results we can obtain by applying correlation analysis to them. For compatibility
with the correlation analysis component, each video is modeled as a
feature matrix where each row is a high-dimensional feature vector representing a
frame of the video. Possible compatible feature types include color histograms,
local keypoint descriptors, and others.
Note that correlation analysis often depends on the number of samples, that is, the
dimensionality of the feature space.
4.2. Correlation Analysis
In this component, CCA is applied to the video feature matrices to generate the outputs
$\lambda$, $w_X$, and $w_Y$. The CCA outputs carry information which can be used
to distinguish near-duplicates from non-near-duplicates. However, due to the mathematical
nature of CCA, some highly random correlation values may be produced by
chance factors. Hence a mechanism to test the statistical significance of the correlation
values in $\lambda$ must be introduced to guarantee the quality of the significant correlations. On
the other hand, as part of the retrieval task, a single value should be generated to quantify
the degree of correlation between two videos. We discuss below how the statistical
significance of correlation values can be tested and how (partial) near-duplicate videos
can be determined by analyzing $\lambda$, $w_X$, and $w_Y$.
4.3. Statistical Significance Test
By solving the eigen-problem described in Equation (10), CCA outputs a vector of
correlation values, $\lambda$. The context of CCA optimization implies that the solution
provides maximally possible correlations rather than an absolute single measurement
like the Euclidean distance function or other similarity measures. True near-duplicates
tend to have high correlations. However, in CCA, insignificant but high correlations
can be included in $\lambda$ as a product of chance factors, which means that high
correlations can also occur for false near-duplicates. This makes $\lambda$ inaccurate for
direct and individual use. We therefore use Bartlett's Chi-square test [Snedecor and Cochran
1989] to test the null hypothesis, that is, that one of the correlations is the product of
chance factors and the two sets of data are actually unrelated, in order to eliminate the
statistically random values in $\lambda$. Instead of operating directly on the resulting $\lambda$, the
Chi-square test is applied iteratively to the statistic Wilks' Lambda [Mardia et al. 1980],
defined as

$$\Lambda = \prod_{j=i}^{M} \left(1 - \lambda_j^2\right),$$

where $i$ denotes the $i$th iteration and $\prod$ denotes the product.
The null hypothesis is then tested against the $i$th value in $\lambda$, that is, $\lambda_i$, using the
statistic below [Clark 1975], which is distributed approximately as Chi-square with
$(M - i + 1)(N - i + 1)$ degrees of freedom:

$$\chi^2 = -\left[(D - 1) - 0.5(M + N + 1)\right] \log_e \Lambda,$$

from which the significance level $s$ is obtained. The value of $\chi^2$ is used to accept
or reject the null hypothesis that the two sets of variables are unrelated. Specifically, after
$\chi^2$ is computed, $s$ presents the level of significance, where a higher level of
significance denotes a higher possibility that the null hypothesis should be accepted. If the
null hypothesis is rejected for $\lambda_i$, the two sets of variables are considered to be significantly
related under $\lambda_i$. We iteratively execute this test from $\lambda_1$ to $\lambda_M$. For each $\lambda_i$, if its $s$ value
is greater than a predefined threshold $\epsilon$, it is considered a random correlation
value. After the Chi-square test, the random correlation values in $\lambda$ are wiped out.
We will test the effect of $\epsilon$ in Section 6.
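A sketch of this iterative test in Python (our reading of the procedure; SciPy's chi-square survival function supplies the significance level s):

```python
import numpy as np
from scipy.stats import chi2

def significant_correlations(lam, D, N, M, eps=0.05):
    """Keep the correlations in lam (descending) whose null hypothesis
    'this and the remaining correlations arise by chance' is rejected."""
    kept = []
    for i in range(1, len(lam) + 1):                   # 1-based iteration index
        wilks = max(np.prod(1.0 - lam[i - 1:] ** 2), 1e-300)  # Wilks' Lambda
        stat = -((D - 1) - 0.5 * (M + N + 1)) * np.log(wilks)
        dof = (M - i + 1) * (N - i + 1)                # degrees of freedom
        s = chi2.sf(stat, dof)                         # significance level
        if s <= eps:                                   # significant: keep lambda_i
            kept.append(lam[i - 1])
    return np.array(kept)
```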
4.4. (Partial) Near-duplicate Video Determination
In this subsection we establish a metric over the correlation values retained in
$\lambda$ after the Chi-square test, forming a single-valued DoC that carries all the significant
correlations and determines whether two videos are (partial) near-duplicates. For any two videos
X and Y, if the updated correlation vector $\tilde{\lambda}$ is $\tilde{M}$-dimensional after removing all
insignificant correlation values, where $\tilde{M} \le M$, we have

$$\mathrm{DoC}(X, Y) = \frac{\sum_{i=1}^{\tilde{M}} \tilde{\lambda}_i}{N},$$

where $N$ is the number of frames in X, and $\tilde{\lambda}_i$ is the $i$th correlation value in $\tilde{\lambda}$. It
is directly used to determine the probability that two videos are near-duplicates.
Videos in the database can be ranked by their DoC values with respect to a query
video for retrieval purposes. The reason for using summation instead of averaging is
that the sum better reflects how many frames are matched. For example, suppose query
video X has six frames, candidate video $Y_1$ has two closely matched frames out
of three, and video $Y_2$ has five closely matched frames out of eight. By averaging the
correlations they might receive very close ranks, but with summation $Y_2$ will
be regarded as the better match of the two. This strategy is flexible, and its current setting simply
reflects our preference for longer videos with more matched frames. Dividing
by $N$ normalizes the score into the range [0, 1].
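A minimal sketch of the DoC computation and ranking (function names are ours):

```python
import numpy as np

def degree_of_correlation(lam_sig, n_query_frames):
    """DoC(X, Y): sum of the significant correlations, normalized by the
    number of frames N in the query so the score falls in [0, 1]."""
    return float(np.sum(lam_sig)) / n_query_frames

# Ranking sketch: score every candidate against the query, highest DoC first.
# ranked = sorted(candidates, key=lambda c: degree_of_correlation(c, N), reverse=True)
```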
Note that before entering this procedure, a preprocessing step is needed to split out the
uninformative frames from X and Y. Here, uninformative frames refer to
black or white screens. Uninformative frames can very often be falsely included in
certain combinations of transformed frames in CCA. As this brings noise to the matching
process, before the main procedure of (partial) near-duplicate determination is started,
these frames are first removed from both X and Y.
CNVR has the capability of detecting partial NVs in the same framework. The DoC
itself is sufficient for reflecting both NV and partial NV relationships. However, if the
exact near-duplicate frames need to be located, a designated subprocedure is invoked.
More specifically, the near-duplicate frames from the candidate video can be detected
and located using $\tilde{w}_Y$ (assuming Y is the candidate and X is the query). The approach
relies on the assumption that, after the Chi-square test, each correlation pattern in $\tilde{\lambda}$
has a proper significance and represents a matching pattern between the two sets of
frames. Here $\tilde{w}_Y$ encodes the degree (value) and direction (sign) of involvement of the
frames of the candidate video in a correlation pattern. The magnitudes of its values can
be regarded as the frames' weights, or their importance in the matching: large weights
in $\tilde{w}_Y$ indicate near-duplicate frames.
Given two videos, the query X and the candidate Y, to identify the near-duplicate
frames involved in a correlation pattern, the first step is to transform $\tilde{w}_Y$ into its
absolute values $|\tilde{w}_Y|$, since the magnitudes determine the involvement of the
frames in a particular correlation. After that, for each correlation value in $\tilde{\lambda}$, we identify
the contributive frames by collecting those with the greatest weights. Multiple frames
may contribute greatly to a correlation value. Naturally, each contributive frame is
considered a near-duplicate frame of its original. All near-duplicate frames from
the candidate video are finally collected from all of its correlation values. Examples are
given next in Section 5.
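The following sketch locates contributive frames from the retained weight columns; the dominance threshold `ratio` is our own illustrative choice, since the text only says to collect the frames with the greatest weights.

```python
import numpy as np

def near_duplicate_frames(wY_sig, ratio=0.5):
    """wY_sig: M x K weight matrix, one column per significant correlation.
    Returns the candidate-video frame indices that dominate any pattern."""
    mags = np.abs(wY_sig)                      # magnitude = degree of involvement
    frames = set()
    for j in range(mags.shape[1]):             # each retained correlation pattern
        col = mags[:, j]
        frames |= set(np.nonzero(col >= ratio * col.max())[0].tolist())
    return sorted(frames)
```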
5. A CASE STUDY
In this section, we present a case study to show the impact of CNVR on the retrieval
task, focusing on significantly changed videos that are relatively difficult to detect. We
investigate an original web video and a set of its variations to gain an intuition of
CNVR's novelty in retrieval. We visually compare their contents and color features,
after which we show the differences in their retrieval ranks under SNVR and CNVR
respectively.
Figure 4 illustrates four near-duplicates, with their color features, of the original
video shown in Figure 1(a). In each of the video-feature pairs, we show the representative
frames and their features, which are represented by high-dimensional color
histograms.
Partial near-duplicate 1 in Figure 4 is heavily changed by inserting descriptive text
frames and adding unrelated content. Of the 16 frames we visualize, only four
are from the original video; hence this video is a partial near-duplicate. Comparing
its feature visualization with that of the original in Figure 1(a), they are clearly very
dissimilar, and the content similarity between it and the original is expected to
be very low. In CNVR, however, this issue is eased, as only the correct near-duplicate
frames are considered for the computation. These near-duplicate frames have
extremely strong correlations with those in the original video, and this can lead to a
very high DoC value.
Near-duplicate 2 has been edited with intensive color changes using blue, purple,
orange, grey, red, and other filters. Also, a couple of unrelated frames are inserted at
the beginning. Huge changes in color patterns and moderate changes in frame content
and order make the features quite different from the original's. Comparing the feature
visualization with that of the original, along the frame axis the values shift up and
down as a result of the different types of color changes. Similarity-based approaches are
unable to effectively reveal the close relevance under such color changes, while in
CNVR, with the assistance of linear projections, the underlying correlations between
the two videos can be discovered.
Near-duplicate 3 is a near-duplicate with repetitive frame insertion. The impact on
its features is obvious: regular dips are present wherever a frame is inserted. As
most similarity-based approaches depend explicitly or implicitly on the order of frames,
this disruption of the normal frame order confuses them; with CNVR, the
effect of frame disorder is reduced to a minimum, since the
temporal order of frames in CCA does not affect the correlation.
Finally, near-duplicate 4 is a typical edited low-quality version with
information changes and loss. Additional content changes with inserted frames are
included in this video. The shape of the feature values from inserted frames is totally
different, while the shape of the values from the inherited frames becomes coarser in some
dimensions and more salient in others. The combination of frame insertion and quality
loss consequently leads to low similarity between the video features. With CNVR, the
frame insertion is almost ignored, and the quality loss is largely recovered by the linear
transformations.
These examples suggest that after heavy changes to video content, features can
be very dissimilar from the original features in their metric space, which makes such NVs
considerably tough for conventional SNVR to detect. Nevertheless, as CNVR computes
correlations through linear projections, it is robust to linear content transformations;
as a result, stronger tolerance to content changes is obtained. We show the intermediate
results of two cases to demonstrate this rationale. The first one illustrates
how the near-duplicate frames are identified, despite temporal disorder and content
misalignment. The second one explains how strong video-level correlation is obtained
on an NV with heavily transformed colors. For simplicity, we only show the six most
representative frames of the original video and the NV.

Fig. 4. Near-duplicates with intensive changes: (a) partial near-duplicate 1, content edition; (b) near-duplicate 2, color transformation; (c) near-duplicate 3, repetitive frame insertion; (d) near-duplicate 4, quality loss and order change. [Each panel pairs representative frames with a surface plot of dimensional value over signature dimension and frame; plot data omitted.]

Table II. Intermediate Results for Near-Duplicate 1

Correlation λ        1.0       1.0       0.99      0.96
Chi-square s         0         0         0.0014    0.084
Frame weight wY1     12.11     23.22     −15.83    −50.34
Frame weight wY2     Ignored   Ignored   Ignored   Ignored
Frame weight wY3     Ignored   Ignored   Ignored   Ignored
Frame weight wY4     0.36      10.56     8.03      10.04
Frame weight wY5     −4.82     −27.82    17.82     49
Frame weight wY6     −1.11     −9.1      −2.9      −29.65

Table III. Intermediate Results for Near-Duplicate 2

Correlation λ        1.0       0.99      0.98      0.91      0.88      0.63
Chi-square s         0         0         0         0.02      0.14      0.60
Frame weight wY1     −0.02     2.58      −6.18     1.26      0.30      6.89
Frame weight wY2     1.21      22.76     34.70     9.35      −3.98     −5.47
Frame weight wY3     −1.53     −33.11    −40.16    −3.91     40.43     −6.82
Frame weight wY4     0.58      5.99      −3.25     −13.39    −76.54    −18.11
Frame weight wY5     −0.09     4.82      28.45     −6.97     43.33     11.76
Frame weight wY6     6.98      2.65      −19.73    13.30     −1.24     12.95
5.1. Intermediate Results: Near-Duplicate 1
In this case we use the 9th to 14th frames of near-duplicate 1 (Figure 4(a)) and the
original video (Figure 1(a)) as the candidate and the query respectively. These frames
of near-duplicate 1 contain two near-black screens and two unrelated frames,
making it a typical partial near-duplicate. We denote the query video as X and the
candidate as Y. Clearly, the second and third frames in Y are considered uninformative
(black screens) and are thus ignored.
After the CCA process, only four correlation values are produced. Table II lists the
outputs of CCA: the correlation values, the Chi-square significance values, and the
weights of the frames of Y. The dominating frames, which contribute the most to the
corresponding correlation value, can be read from the largest weight magnitudes.
Based on the Chi-square test, the correlation values 1.0, 1.0, and 0.99 are found to be
significant by comparing their significance values to the threshold $\epsilon$, whose default
value is 0.05. The correlation value 0.96 more likely happens by chance and is discarded.
Interesting observations can be made by looking into the weights for each significant
correlation value. For the first correlation value 1.0, $w_{Y1}$ stands alone in its magnitude,
meaning that the first frame in Y is a near-duplicate of a frame in X. The second and
third correlation values, 1.0 and 0.99, both indicate that the first and fifth frames in Y
have near-duplicates in X. By collecting the results for all significant correlation values,
it is easy to identify that the first and fifth frames in Y are near-duplicate frames of X.
This exactly matches the ground truth.
5.2. Intermediate Results for Near-Duplicate 2
In this case we use the 5th to 10th frames of near-duplicate 2 (Figure 4(b)) and the
original video (Figure 1(a)) as the candidate Y and the query X respectively. In
near-duplicate 2, these frames represent the most distorted segment.
Table III lists the outputs of CCA. Based on the Chi-square test, the correlation
values 0.88 and 0.63 are very likely to have arisen by chance, since their s values are
at much greater levels than the default threshold. Therefore, we abandon the trailing two
correlations. For the first correlation value 1.0, $w_{Y6}$ obviously indicates that the sixth
frame in Y is a near-duplicate of a frame in X. For the correlation value 0.99, $w_{Y2}$ and
$w_{Y3}$ are greatest, showing that both the second and third frames have close frames in
X. Similarly, from the correlation values 0.91 and 0.88, we can also identify that the
fourth and fifth frames in Y have near-duplicates in X. We may note that the
first frame is not identified as a near-duplicate frame. This is because the modification
is extremely heavy: the distortion is so great that, even with the assistance of other
frames, strong correlation between X and Y cannot be detected when the first frame
is taken into consideration. Nonetheless, CCA sees Y as a close match to X, which is
further reflected in Table IV.

Table IV. Comparison on Near-Duplicate Similarity, Correlation, and Retrieval Results

Near-duplicate     No. Frames   DoC in CNVR   Similarity    Successfully         Successfully
                                              in SNVR       Retrieved in CNVR    Retrieved in SNVR
1 (Figure 4(a))    16           0.66          0.14          Y                    N
2 (Figure 4(b))    40           1             0.23          Y                    N
3 (Figure 4(c))    33           1             0.23          Y                    N
4 (Figure 4(d))    21           0.83          0.38          Y                    N
5.3. Effect on Retrieval
In terms of the ranks of the examples in Figure 4, our experimental results show that
CNVR greatly outperforms SNVR. In the SNVR ranking, these videos receive low ranks
because of their low similarities to the original video, resulting in low retrieval precision. With
CNVR, their strong implicit dependency is discovered via correlations, consequently
leading to high ranks; hence these near-duplicates can be successfully identified.
Table IV compares the DoC, the Euclidean distance-based similarity, and the retrieval
results of CNVR and SNVR respectively to justify this. In all four cases, CNVR
greatly outperforms SNVR, showing superior tolerance to heavy content changes.
Note that in this example, the Euclidean distance-based similarity in SNVR is computed
as the normalized sum of frame similarities in a sliding-window matching process.
Particularly, near-duplicate 1 is a partial NV with a very low similarity, but its DoC
is still very high. This is achieved by successfully identifying the frames that best match
those of the original video. The DoC correctly reflects its degree of
being a partial NV of the original video: for the near-duplicate frames, the correlation is
strong, while for the unrelated frames, the correlation is weak. After normalization, the
DoC comes to 0.66. For near-duplicates 2 and 3, the DoC values reach 1 because
most of the modifications are recognized by CCA. For near-duplicate 4, its length is
close to the original video and there are some unrelated frames; the final DoC comes
to 0.83. Comparing CNVR with SNVR, high DoC values enable near-duplicates to be
ranked high in CNVR, while low similarity values fail to reflect the near-duplicate
relationship in traditional SNVR.
6. EXPERIMENTS
To evaluate our CNVR approach, we conduct extensive experiments for the NVR task
on two datasets which have been previously used in the related literature.
6.1. Dataset
—TV ADS. This dataset is generated from 2,500 diverse TV commercials [Huang et al.
2009] covering categories such as leisure, shopping, sports, travel, movies, and games,
with lengths varying from five seconds to one minute at a resolution of 320 × 240.
For the purpose of the experiment, each video has been modified into 20 NVs with very
strong content changes of various types, for a total of 50,000 videos in the
final dataset. The categories of modifications are listed in Table V. Most of the NVs
in this dataset are video-level, which means only very few partial NVs are included.
We use 100 original videos as our queries to test the accuracy of NVR.

Table V. ND Categories on TV ADS

Index  Modification
1      Brightness −50%
2      Brightness +50%
3      Contrast −50%
4      Contrast +50%
5      Red channel +150
6      Blue channel +150
7      Green channel +150
8      Cropping 50%
9      Embedded logo: 50% of frame size
10     Embedded banner: 50% of frame size
11     Embedded sub-scene: 50% of frame size
12     Brightness −25% & contrast −25%
13     Brightness +25% & contrast +25%
14     Red, green, blue channels +50 respectively
15     Cropping 25% & red channel +75
16     Cropping 25% & brightness −25%
17     Cropping 25% & contrast −25%
18     Embedded logo: 25% of frame size & red, green, blue channels +20 respectively
19     Embedded banner: 25% of frame size & red, green, blue channels +20 respectively
20     Embedded sub-scene: 25% of frame size & red, green, blue channels +20 respectively
—CC WEB VIDEO. This dataset [Wu et al. 2007] consists of 12,970 web videos downloaded
from YouTube, Google Video, and Yahoo!Video in response to the 24 most popular
text queries, with long videos (longer than 10 minutes) removed. Since most
social videos are user-generated, the videos in this dataset have unpredictable
types of NVs, from moderate to heavy transformations, from light
to intensive content editing, and from single types of content change to randomly
combined multiple changes. Another characteristic of this dataset is that it
contains a remarkable proportion of partial NVs: massive frame insertion/deletion and
frame disordering appear in most of the user-edited NVs. For each
text query, five representative videos from the dominant cluster of NVs are selected
as video queries, which makes 120 query videos in total.
For feature extraction in our experiments, each video in the datasets is presented
as a sequence of keyframes, where each keyframe is represented by a feature vector
containing a 27-dimensional YUV color moment. Each keyframe is
equally divided into nine geometrically equal regions, and for each region the YUV
color space is used for summarization. Uninformative frames
are detected and labeled so that, when a candidate video is assessed later, its
uninformative frames can be matched efficiently as a special case instead of being
processed in the CCA.
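A sketch of this keyframe signature, assuming the 27 dimensions are the mean (first moment) of the Y, U, and V channels over a 3 x 3 grid of regions; the exact moments and conversion constants are not specified in the text, so BT.601 is assumed:

```python
import numpy as np

def yuv_color_moment(frame_rgb):
    """27-dim signature: mean Y, U, V over nine equal regions (3 x 3 grid).
    frame_rgb: H x W x 3 array with values in [0, 255]."""
    rgb = np.asarray(frame_rgb, dtype=np.float64) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b      # BT.601 luma
    u = 0.492 * (b - y)                        # BT.601 chroma components
    v = 0.877 * (r - y)
    yuv = np.stack([y, u, v], axis=-1)

    h, w = y.shape
    feat = []
    for i in range(3):                         # 3 x 3 geometrically equal regions
        for j in range(3):
            region = yuv[i * h // 3:(i + 1) * h // 3,
                         j * w // 3:(j + 1) * w // 3]
            feat.extend(region.reshape(-1, 3).mean(axis=0))   # mean Y, U, V
    return np.array(feat)                      # shape (27,)
```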
6.2. Evaluation Criteria
The NVR effectiveness is measured by the precision-recall curve, as used in most NVR
approaches. Since we experiment with multiple queries, we use the average precision
and recall over all queries to assess the overall effectiveness of our approach and the
reference approaches. The essential rule is that if a video has any near-duplicate frame
of the original video, it is treated as an NV and should appear among the positives
in the search ranking. To evaluate the NVR efficiency, the average response
time over all queries is used to show the actual time efficiency of the approaches. All
experiments are run on a Windows XP PC with 2.0GB of RAM and a 2.6GHz
Duo CPU.
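For reference, a minimal per-query precision-recall computation (averaged over all queries afterwards; names are ours):

```python
import numpy as np

def precision_recall(ranked_ids, positive_ids):
    """Precision and recall at every rank of one query's result list."""
    hits = np.cumsum([vid in positive_ids for vid in ranked_ids])
    ranks = np.arange(1, len(ranked_ids) + 1)
    return hits / ranks, hits / max(len(positive_ids), 1)
```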
6.3. Reference Approaches
To comparatively study the performance of our approach, we implemented a global feature-based NVR method and a local keypoint-based NVR method, and applied both, together with our CNVR, to the same datasets. The following subsections present experimental results and analysis demonstrating how our approach differs from existing SNVR methods.
The first reference approach is the Bounded Coordinate System (BCS) [Shen et al. 2007; Huang et al. 2009], a global feature-based SNVR approach. To summarize a video, it computes a coordinate system whose axes are identified by Principal Component Analysis (PCA) and bounded by the range of the data projections along each axis; in essence, it summarizes the feature distribution. Given the coordinate systems of two videos, video similarity is measured by matching the two systems with two operations: translation and rotation. A translation moves one system's origin to the position of the other's; a rotation specifies the angle needed to rotate an axis in one system to match its correspondent in the other. By fusing the two factors, translation and rotation, the similarity of two videos is obtained. In our experiments, BCS features are computed offline and stored prior to the NVR stage. Given a query video, its BCS feature is first generated; then a linear search is performed over the stored BCS features, computing the signature similarity of each candidate video to the query. Candidate videos are finally ranked by their computed similarities.
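The following is a rough sketch of this idea: a PCA-based signature whose axes are bounded by the projection ranges, compared by combining a translation term (between origins) with a rotation-like term (between corresponding bounded axes). It is an illustrative approximation, not the exact BCS formulation of Huang et al. [2009].

import numpy as np

def bcs_signature(keyframes, dims=5):
    # PCA axes of the keyframe features, each bounded by the range of the
    # data projections along it, plus the mean as the system's origin.
    X = np.asarray(keyframes, dtype=np.float64)
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    axes = vt[:dims]
    extents = np.abs(Xc @ axes.T).max(axis=0)   # projection range per axis
    return mean, axes * extents[:, None]        # bounded coordinate axes

def bcs_distance(sig_a, sig_b):
    # Fuse a translation term (between origins) with a rotation-like term
    # (mismatch between corresponding bounded axes).
    (mean_a, axes_a), (mean_b, axes_b) = sig_a, sig_b
    translation = np.linalg.norm(mean_a - mean_b)
    rotation = sum(np.linalg.norm(a - b) for a, b in zip(axes_a, axes_b))
    return translation + rotation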
The second reference approach is the recently proposed Scale-Rotation invariant Pattern Entropy (SR-PE) [Zhao and Ngo 2009], a representative local keypoint-based NVR approach. Keyframes are first processed with the Hessian-Affine detector [Mikolajczyk and Schmid 2004, 2002], and 36-dimensional PCA-SIFT descriptors are used as keypoints. Given two frames, their similarity is measured by a two-stage coarse-to-fine scheme: the first stage filters out irrelevant keypoints in the bag-of-words style, and the second applies the SR-PE measure to the remaining keypoints. Specifically, exhaustive keypoint matching is performed in the pool of remaining keypoints, and the horizontal and vertical motion angles of each matched pair of keypoints are computed. These angles are then clustered, and the SR-PE is calculated from the clusters to determine the frame similarity. In our experiments, keypoints are detected and described offline, and the two-stage retrieval is performed online. Note that this approach is essentially a frame-matching method, since each frame in one video is compared with frames in the other video to find the best match.
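To illustrate the exhaustive matching stage, the sketch below pairs descriptors using Lowe's ratio test [Lowe 2004], a standard filter for ambiguous correspondences; the SR-PE pattern-entropy computation over the motion angles of the matched pairs is not reproduced here.

import numpy as np

def match_keypoints(desc_a, desc_b, ratio=0.8):
    # Exhaustive descriptor matching with Lowe's ratio test: a match is
    # accepted only when the nearest neighbor is clearly closer than the
    # second nearest, which filters out ambiguous correspondences.
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j, k = np.argsort(dists)[:2]
        if dists[j] < ratio * dists[k]:
            matches.append((i, j))
    return matches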
6.4. Parameter Tuning
A critical aspect of the experiments is to ensure that appropriate parameters are used in the comparisons. Two parameters affect CNVR performance: k, in the regularization step of the CCA algorithm, and ε, in the DoC component of the CNVR framework. For the two reference approaches, parameters are carefully selected: the BCS parameter c is set to one, the best setting reported in Huang et al. [2009], and, based on the experiments in Zhao and Ngo [2009], the SR-PE parameter γ is set to five, where it obtains its best effectiveness.
Fig. 5. Effect of k: precision-recall curves on (a) TV ADS and (b) CC WEB VIDEO for k = 10^-10, 10^-8, 10^-5, and 10^-1.
6.4.1. Effect of k. As described in Equation (11), the regularization factor k, which appears in all the matrix invertibility problems, is one parameter of the approach that affects NVR effectiveness. k is usually set to an extremely small value. In Figure 5 we observe that effectiveness is satisfactory when k ≤ 10^-8 for both datasets, peaking at 10^-8. The curves for k = 10^-5 and k = 10^-1 show that very large regularization factors add noise to the solution of the eigenproblem and hence reduce NVR accuracy. On the other hand, a k value that is too small contributes almost no regularization and is effectively neglected. Based on these observations, we use k = 10^-8 in the following experiments.
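As context for this parameter, the sketch below shows one common way a ridge factor k enters the CCA computation, keeping the auto-covariance blocks invertible; the exact form of the paper's Equation (11) may differ, and the alignment of the two sequences to a common length n is assumed.

import numpy as np

def regularized_cca(X, Y, k=1e-8):
    # Canonical correlations between two aligned feature sequences
    # (n rows each). The ridge term k * I keeps the auto-covariance
    # blocks invertible, which is the role the text attributes to k.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Cxx = X.T @ X / n + k * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + k * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Canonical correlations = singular values of the whitened
    # cross-covariance Cxx^{-1/2} Cxy Cyy^{-1/2} (Cholesky whitening).
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    return np.clip(np.linalg.svd(M, compute_uv=False), 0.0, 1.0)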
6.4.2. Effect of ε. In the chi-square significance test, ε is a threshold on the significance values s, which determines how many correlation values in λ are considered in the DoC computation. A larger ε allows more correlation values to pass the significance test and be retained. Figure 6 demonstrates ε's impact on NVR accuracy on both datasets. Figure 6(a) reveals that effectiveness increases as ε grows when ε < 0.05, since more correlations are considered in computing the DoC. It peaks at ε = 0.05 and drops dramatically when ε > 0.05. This shows that for CNVR on TV ADS, most true positives have fairly small s values, most likely between 0.03 and 0.05. When ε becomes larger than 0.05, effectiveness decreases as more false positives are included, since the chance factor increases in the DoC computation. In the case of CC WEB VIDEO (Figure 6(b)), the trends are quite similar, with slight differences. Most true positives also have s values around 0.05. The curves for ε = 0.03 and ε = 0.1 are close to each other, indicating that many true positives are excluded both when the threshold is enlarged to 0.1 and when it is shrunk to 0.03. We therefore choose ε = 0.05 as our default setting.

Fig. 6. Effect of ε: precision-recall curves on (a) TV ADS and (b) CC WEB VIDEO for ε = 0.01, 0.03, 0.05, and 0.1.
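For illustration, the sketch below applies a sequential chi-square test to the canonical correlations and keeps those whose significance value s falls below ε. Bartlett's statistic is a standard choice and an assumption on our part, as the paper's exact test statistic is not restated in this section.

import numpy as np
from scipy.stats import chi2

def significant_correlations(corr, n, p, q, eps=0.05):
    # Sequentially test H0: "correlations i, i+1, ... are all zero" and
    # keep the leading correlations whose significance value s < eps.
    kept = []
    for i in range(len(corr)):
        stat = -(n - 1 - (p + q + 1) / 2.0) \
               * np.sum(np.log(1.0 - corr[i:] ** 2))
        df = (p - i) * (q - i)
        s = chi2.sf(stat, df)   # p-value of Bartlett's chi-square statistic
        if s >= eps:            # H0 cannot be rejected: stop including
            break
        kept.append(corr[i])
    return kept                 # e.g. the DoC could aggregate these values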
6.5. Effectiveness
In this subsection we compare the effectiveness of the proposed CNVR with the two SNVR approaches, BCS and SR-PE, described in Subsection 6.3. Figure 7 shows the precision-recall curves of the different methods on both datasets.
First, we investigate the effectiveness on the TV ADS dataset, shown in Figure 7(a). The challenge of performing NVR on TV ADS is clear: all true NVs are intensively transformed, causing heavy changes to the original video features. A visual impression of the impact on features is given by the examples in Figure 4. From the definitions of BCS and SR-PE, we note that they are not very robust to most of the transformations in Table V, and are especially sensitive to changes that introduce large illumination, shading, and similar misalignments. Specifically, transformations that cause heavy brightness or contrast changes lead to very distinct features for both BCS and SR-PE. SR-PE suffers from reduced keypoint detection accuracy, and even when the corresponding keypoints are discovered, their descriptors often vary considerably. BCS, on the other hand, takes the distance between major principal components into account as one of the two factors that determine feature similarity, so when features change dramatically, the BCS signatures change correspondingly. Hence SR-PE and BCS respond less effectively to some of these changes (e.g., transformations 1 to 7 in Table V).
Fig. 7. CNVR vs. SNVR on effectiveness: precision-recall curves of CNVR, SR-PE, and BCS on (a) TV ADS and (b) CC WEB VIDEO.
Figure 7(a) reflects this observation. Their precisions are close to CNVR's at low recalls, because they perform well on the transformations to which they are less sensitive (e.g., transformation 8 in Table V). However, as recall increases, their precisions drop rapidly, because the candidates with sensitive transformations are ranked relatively low. In contrast, with the assistance of CCA, CNVR is invariant to feature transformations to a considerable extent and hence maintains very good invariance to most of the transformations, except for a few such as transformations 8 and 11 in Table V. The curves in Figure 7(a) reveal that CNVR tolerates intensive content changes much better than BCS and SR-PE.
Next we examine the effectiveness on the CC WEB VIDEO dataset. As shown in Figure 7(b), CNVR mostly outperforms both BCS and SR-PE, especially BCS. The main reason is that this dataset contains many partial NVs, in which unrelated frames are massively inserted and the frame order is changed arbitrarily; this makes video-level summarization (BCS) inadequate for finding accurate frame-level correspondence. For SR-PE, the improvement margin is much smaller than on the TV ADS dataset. This can be explained as follows: in CC WEB VIDEO, near-duplicate variants are much more arbitrary, and the dataset still contains many heavy content changes, but since the videos are randomly modified and uploaded by users, the average intensity of modification is not as heavy as in TV ADS. Nonetheless, CNVR's precision-recall curve is very smooth, showing that it responds effectively to various content changes. Note that the BCS and SR-PE curves sink more quickly at higher recalls, indicating that both are less capable than CNVR at retrieving near-duplicate videos with heavier content changes.
Fig. 8. CNVR vs. SNVR on efficiency: average response time on (a) TV ADS (BCS: 2.7s, CNVR: 28s, SR-PE: 1,724s) and (b) CC WEB VIDEO (BCS: 0.6s, CNVR: 8.3s, SR-PE: 237s).
For partial near-duplicates, SR-PE performs frame-level pairwise matching, which also shows good effectiveness, but this pairwise matching is computationally very expensive. We examine this in the following subsection.
6.6. Efficiency
In the last experiment, we test the efficiency of the different methods. One of the major advantages of BCS, as a video-level summarization, is its competitive time efficiency: the similarity computation is simple and fast. In contrast, long response time has been the major challenge for most keypoint-based approaches, including SR-PE, whose retrieval time increases dramatically with the number of keyframes; the pairwise comparison in the matching process further increases the time complexity. Figure 8 shows the average response time of the three methods when no indexing structure is used for retrieval. As can be seen, CNVR obtains efficiency comparable to BCS and improves upon SR-PE significantly. For large video datasets, SR-PE's response time is extremely long, while CNVR's remains reasonable. In short, CNVR achieves high accuracy in finding near-duplicates with great content changes, at competitive time cost.
7. CONCLUSION
Nowadays NVs show more flexibility in their changes. They often involve multiple categories of changes or more intensive changes, and some of them, especially those uploaded by users, may be partial NVs as well. These facts have made it much more difficult for existing SNVR approaches to achieve high accuracy while maintaining reasonable efficiency. Unlike conventional SNVR approaches, we proposed a novel CNVR framework that facilitates NVR by assessing (partial) near-duplicate relationships between videos more accurately with correlation analysis. We formulated and solved CCA from an NVR perspective and described the major components of the CNVR framework. We also explained CCA's capability to detect (partial) near-duplicates with intensive content changes from an information-theoretic perspective. In the experiment section, we verified our CNVR framework on real video datasets with comprehensive comparative experiments, whose results showed its strong effectiveness and reasonable efficiency.
In the future, we plan to extend this work in two directions. First, nonlinear correlation analysis methods will be studied to discover implicit correlations among video sequences for more accurate detection. Second, to scale to very large datasets, novel indexing structures will be investigated to index videos based on the proposed correlation measure. Using precomputed correlation analysis among the database videos, reference-based indexing methods, such as iDistance [Jagadish et al. 2005] or reference-based histograms [Liu et al. 2007], can be utilized to avoid redundant access to non-near-duplicates.
REFERENCES
ASSENT, I. AND KREMER, H. 2009. Robust adaptable video copy detection. In Proceedings of SSTD. 380–385.
BROWN, M. AND LOWE, D. G. 2002. Invariant features from interest point groups. In Proceedings of BMVC.
CHERUBINI, M., DE OLIVEIRA, R., AND OLIVER, N. 2009. Understanding near-duplicate videos: a user-centric
approach. In Proceedings of ACM Multimedia. 35–44.
CLARK, D. 1975. Understanding Canonical Correlation Analysis. Geo Abstracts Ltd.
HARDOON, D. R., SZEDMÁK, S., AND SHAWE-TAYLOR, J. 2004. Canonical correlation analysis: An overview with
application to learning methods. Neural Computat. 16, 12, 2639–2664.
HOTELLING, H. 1936. Relations between two sets of variates. Biometrika 28, 3/4, 321–377.
HUANG, Z., HU, B., CHENG, H., SHEN, H. T., LIU, H., AND ZHOU, X. 2010. Mining near-duplicate graph for
cluster-based reranking of web video search results. ACM Trans. Inf. Syst. 28, 4.
HUANG, Z., SHEN, H. T., SHAO, J., ZHOU, X., AND CUI, B. 2009. Bounded coordinate system indexing for real-time
video clip search. ACM Trans. Inf. Syst. 27, 3.
JAGADISH, H. V., OOI, B. C., TAN, K.-L., YU, C., AND ZHANG, R. 2005. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst. 30, 2, 364–397.
KE, Y. AND SUKTHANKAR, R. 2004. PCA-SIFT: A more distinctive representation for local image descriptors. In
Proceedings of CVPR. 506–513.
KIM, T.-K. AND CIPOLLA, R. 2009. Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Trans. Patt. Anal. Machine Intell. 31, 8, 1415–1428.
KIM, T.-K., KITTLER, J., AND CIPOLLA, R. 2007. Discriminative learning and recognition of image set classes
using canonical correlations. IEEE Trans. Patt. Anal. Machine Intell. 29, 6, 1005–1018.
LAW-TO, J., BUISSON, O., GOUET-BRUNET, V., AND BOUJEMAA, N. 2006. Robust voting algorithm based on labels of
behavior for video copy detection. In Proceedings of ACM Multimedia. 835–844.
LAW-TO, J., BUISSON, O., GOUET-BRUNET, V., AND BOUJEMAA, N. 2009. ViCopT: a robust system for content-based
video copy detection in large databases. Multimedia Syst. 15, 6, 337–353.
LAW-TO, J., CHEN, L., JOLY, A., LAPTEV, I., BUISSON, O., GOUET-BRUNET, V., BOUJEMAA, N., AND STENTIFORD, F. 2007.
Video copy detection: a comparative study. In Proceedings of CIVR. 371–378.
LIU, L., LAI, W., HUA, X.-S., AND YANG, S.-Q. 2007. Video histogram: A novel video signature for efficient web
video duplicate detection. In Proceedings of MMM. 94–103.
LOWE, D. G. 2004. Distinctive image features from scale-invariant keypoints. Int. J. Computer Vision 60, 2,
91–110.
MARDIA, K. V., KENT, J. T., AND BIBBY, J. M. 1980. Multivariate Analysis. Academic Press.
MIKOLAJCZYK, K. AND SCHMID, C. 2002. An affine invariant interest point detector. In Proceedings of ECCV.
128–142.
MIKOLAJCZYK, K. AND SCHMID, C. 2004. Scale & affine invariant interest point detectors. Int. J. Computer
Vision 60, 1, 63–86.
NGO, C.-W., ZHAO, W., AND JIANG, Y.-G. 2006. Fast tracking of near-duplicate keyframes in broadcast domain
with transitivity propagation. In Proceedings of ACM Multimedia. 845–854.
POULLOT, S., CRUCIANU, M., AND BUISSON, O. 2008. Scalable mining of large video databases using copy detection.
In Proceedings of ACM Multimedia. 61–70.
SATOH, S., TAKIMOTO, M., AND ADACHI, J. 2007. Scene duplicate detection from videos based on trajectories of
feature points. In Proceedings of MIR. 237–244.
SEOK MIN, H., CHOI, J., NEVE, W. D., AND RO, Y. M. 2009. Near-duplicate video detection using temporal patterns
of semantic concepts. In Proceedings of ISM. 65–71.
SHEN, H. T., LIU, J., HUANG, Z., NGO, C.-W., AND WANG, W. 2011. Near-duplicate video retrieval: Current research
and future trends. IEEE Trans. Multimedia.
SHEN, H. T., ZHOU, X., HUANG, Z., AND SHAO, J. 2007. Statistical summarization of content features for fast
near-duplicate video detection. In Proceedings of ACM Multimedia. 164–165.
SHEN, H. T., ZHOU, X., HUANG, Z., SHAO, J., AND ZHOU, E. 2007. Uqlips: A real-time near-duplicate video clip
detection system. In Proceedings of VLDB. 1374–1377.
SNEDECOR, G. W. AND COCHRAN, W. G. 1989. Statistical Methods, 8th Ed. Iowa State University Press.
TAN, H.-K., NGO, C.-W., HONG, R., AND CHUA, T.-S. 2009. Scalable detection of partial near-duplicate videos by
visual-temporal consistency. In Proceedings of ACM Multimedia. 145–154.
TAN, H.-K., WU, X., NGO, C.-W., AND ZHAO, W. 2008. Accelerating near-duplicate video matching by combining
visual similarity and alignment distortion. In Proceedings of ACM Multimedia. 861–864.
WU, X., HAUPTMANN, A. G., AND NGO, C.-W. 2007. Practical elimination of near-duplicates from web video
search. In Proceedings of ACM Multimedia. 218–227.
WU, X., TAKIMOTO, M., SATOH, S., AND ADACHI, J. 2008. Scene duplicate detection based on the pattern of
discontinuities in feature point trajectories. In Proceedings of ACM Multimedia. 51–60.
WU, X., ZHAO, W., AND NGO, C.-W. 2007. Near-duplicate keyframe retrieval with visual keywords and semantic
context. In Proceedings of CIVR. 162–169.
WU, Z., JIANG, S., AND HUANG, Q. 2009. Near-duplicate video matching with transformation recognition. In
Proceedings of ACM Multimedia. 549–552.
YEH, M.-C. AND CHENG, K.-T. 2009. Video copy detection by fast sequence matching. In Proceedings of CIVR.
YUAN, J., DUAN, L.-Y., TIAN, Q., AND XU, C. 2004. Fast and robust short video clip search using an index
structure. In Proceedings of MIR. 61–68.
ZHANG, X., HUA, G., ZHANG, L., AND SHUM, H.-Y. 2010. Interest seam image. In Proceedings of CVPR. 3296–3303.
ZHAO, W. AND NGO, C.-W. 2009. Scale-rotation invariant pattern entropy for keypoint-based near-duplicate
detection. IEEE Trans. Image Process. 18, 2, 412–423.
ZHAO, W., NGO, C.-W., TAN, H.-K., AND WU, X. 2007. Near-duplicate keyframe identification with interest point
matching and pattern learning. IEEE Trans. Multimedia 9, 5, 1037–1048.
ZHU, J., HOI, S. C. H., LYU, M. R., AND YAN, S. 2008. Near-duplicate keyframe retrieval by nonrigid image
matching. In Proceedings of ACM Multimedia. 41–50.
Received April 2010; revised October 2010; April 2011; accepted September 2011