Tracking Concept Drifting
with an Online-Optimized Incremental Learning Framework
Jun Wu*, Dayong Ding
AI Lab, Tsinghua University, Beijing 100084, P. R. China
{wujun01, ddy01}@mails.tsinghua.edu.cn

Xian-Sheng Hua
Microsoft Research Asia, Beijing 100080, P. R. China
[email protected]

Bo Zhang
AI Lab, Tsinghua University, Beijing 100084, P. R. China
[email protected]
ABSTRACT
Concept drifting is an important and challenging research issue in
the field of machine learning. This paper mainly addresses the
issue of semantic concept drifting in time series such as video
streams over a relatively long period of time. An Online-Optimized Incremental Learning framework is proposed as an
example learning system for tracking the drifting concepts.
Furthermore, a set of measures are defined to track the process of
concept drifting in the learning system. These tracking measures
are also applied to determine the corresponding parameters used
for model updating in order to obtain the optimal up-to-date
classifiers. Experiments on the data set of TREC Video Retrieval
Evaluation 2004 not only demonstrate the inside concept drifting
process of the learning system, but also prove that the proposed
learning framework is promising for tackling the issue of concept
drifting.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis
and Indexing-indexing methods; I.2.10 [Artificial Intelligence]:
Vision and Scene Understanding-video analysis.
General Terms
Algorithms, Experimentation.
Keywords
Incremental Learning, Gaussian Mixture Model, Video Content
Analysis, Concept Drifting, TREC Video Retrieval Evaluation
1. INTRODUCTION
In time series, the underlying data distribution, or the concept that
we are trying to learn from the data sequences, typically is
constantly evolving over time. Often these changes make the
models built on old data inconsistent with the new data, thus
instant updating of the models is required [1]. This problem,
known as concept drifting [2], complicates the task of learning
concepts from data. An effective learner should be able to track
such changes and quickly adapt to them [1]. Modeling concept
drifting in time sequences has thus become an important and
challenging task.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
MIR’05, November 10–11, 2005, Singapore.
Copyright 2005 ACM 1-59593-244-5/05/0011…$5.00.
Klinkenberg et al. [3][4] propose a method to recognize and
handle concept changes with support vector machines, which
maintains an automatically adjusted window on the training data
so that the estimated generalization error is minimized. Fan [5]
points out that additional old data do not always help produce a
more accurate hypothesis than using the most recent data only;
they increase the accuracy only in some "lucky" situations. In [6]
Fan also demonstrates a random decision-tree ensemble based
engine, named StreamMiner, to mine concept drifts in data
streams. In StreamMiner, systematic selection of old and new
data chunks is utilized to compute the optimal model that best fits
the changing data streams. Wang et al. [7] propose a general
framework for mining drifting concepts in data streams using
ensemble classifiers weighted by their expected classification
accuracy on the test data under the time-evolving environment.
Though many methods have been proposed to deal with concept
drifting, few researchers have addressed the issue of how to track
concept drifting from a systematic viewpoint. This paper
addresses this issue based on a novel online learning framework,
termed OOIL (Online-Optimized Incremental Learning) [11]. The
evolving processes of the drifting concepts, as well as several
tracking measures related to concept drifting, will be investigated.
The remainder of this paper is organized as follows. Section 2
briefly introduces the issue of concept drifting. The online-optimized
incremental learning framework is presented in Section 3.
Section 4 discusses in detail how to track concept drifting.
Experiments are introduced in Section 5, followed by conclusions
and future work in Section 6.
2. CONCEPT DRIFTING
For incoming data streams, there are two important issues: data
sufficiency and concept drifting [5]. Traditional machine learning
schemes typically do not consider the problem of concept
drifting. Actually, if there is no concept drifting and the training
data set is sufficient, there is no need to update the models. But
when concept drifting occurs, besides the new data, the old data
should also be considered to enhance the performance of the
system. However, how much old data should be used, and how to
use these data, are not trivial issues [5].
* Supported by National NSF of China (No.60135010), National NSF of
China (No.60321002) and the Chinese National Key Foundation
Research Development Plan (2004CB318108).
Most existing research works related to concept drifting in
machine learning mainly concern the final classification results.
A more sophisticated way is to determine, before classification,
whether there is concept drifting in the system and, quantitatively,
how much the concepts are drifting. This paper follows this idea
and defines a set of measures to investigate the intrinsic
properties of the learning systems.
3. OOIL FRAMEWORK - AN EXAMPLE
LEARNING SYSTEM
3.1 Framework Overview
As mentioned above, the scenario we are investigating is a batch
learning problem. That is, data are supposed to arrive over time
in batches. Different from traditional batch learning [16] [17], we
suppose all of the upcoming batches are unlabeled. Only a small
preliminary pre-labeled training data set is required during the
whole learning process.
3.2 Symbol Definitions
To describe our system more clearly, the symbols used in this
paper are listed in Table 1 and Table 2. It should be noted that y^t
and x^t denote the t-th batch and the first portion of the t-th batch,
respectively. For simplicity, when there is no confusion, we
denote by y any element batch in D and by x the first (small)
portion of y (i.e., the superscript t is omitted).
Table 1. Main symbol definitions (1)

symbols                                          meanings
Y = [Y(1), Y(2), …, Y(d)]^T                      feature vector
y = [y(1), y(2), …, y(d)]^T                      an outcome of Y
y_c = {y_{1,c}, y_{2,c}, …, y_{n(c),c}}          pre-labeled training set for concept c
y^t = {y^t_1, y^t_2, …, y^t_{N(t)}}              the t-th batch
D = {y^1, …, y^{t-2}, y^{t-1}, y^t, …}           the set of all batches
x^t = {x^t_1, x^t_2, …, x^t_T} (1 < T < N(t))    the first portion of the t-th batch
Table 2. Main symbol definitions (2)

symbols    meanings
c          semantic concept c (1 ≤ c ≤ C)
C          total number of considered concepts
d          dimension of the feature vector
n(c)       number of labeled samples for concept c
N(t)       number of samples in the t-th batch
y          any element batch in D
x          the first portion of y
3.3 OOIL Framework
As aforementioned, our proposed OOIL framework consists of
three main modules, which will be presented in detail as follows.
Figure 1. Online-Optimized Incremental Learning Framework
Figure 1 shows the flowchart of our proposed online-optimized
incremental learning (OOIL) framework. As illustrated, Video 1,
2, …, K (for the box at top-left) are the pre-labeled training data
set, and batch 1, 2, ..., N (for the white boxes below) are unlabeled
upcoming data batches. The OOIL framework consists of three
primary modules: Global Model Pre-Training (GMPT), Local
Adaptation (LA) and Global Model Incremental Updating
(GMIU). When the concept drifts along the timeline, the
underlying data distribution, which is related to the corresponding
model, changes from batch to batch. Therefore, the key issue of
our proposed framework is how to update the models instantly in
order to adapt to the changing distribution of each newly-upcoming
data batch.
In more detail, in this framework, firstly C (the number of target
concepts) original GMMs (termed Global GMMs) are generated
by training on the pre-labeled training samples using the
so-called Agglomerative EM (AEM) algorithm [8]. Then, for the
under-test t-th batch, its first portion (x^t) is also modeled by a
GMM using the AEM algorithm. Thereafter, this "reference
GMM" is compared with all the Global GMMs to generate C
corresponding locally optimized GMMs, which are then applied
to classify the samples in the under-test batch. At last, all the
Global GMMs are updated by combining the locally optimized
models and the original global models. These updated global
models will be used in the next round of the process, for the
newly-upcoming batch.
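The interplay of the three modules can be sketched as a loop (a hypothetical Python skeleton, not the authors' code; `pretrain`, `adapt`, `classify`, and `update` are placeholder callables standing in for GMPT, LA, classification, and GMIU):

```python
def ooil_run(pretrain, batches, adapt, classify, update, ref_fraction=0.1):
    """Sketch of the OOIL loop: pre-train global models once, then for each
    unlabeled batch adapt locally, classify, and incrementally update."""
    global_models = pretrain()                     # GMPT on pre-labeled data
    results = []
    for batch in batches:
        # LA: build local models from a small leading portion (x^t) of the batch
        reference = batch[: max(1, int(len(batch) * ref_fraction))]
        local_models = adapt(global_models, reference)
        # classify every sample in the batch with the locally adapted models
        results.append([classify(local_models, sample) for sample in batch])
        # GMIU: fold the local models back into the global ones
        global_models = update(global_models, local_models)
    return results
```

The essential design choice this skeleton captures is that classification always uses the locally adapted models, while the global models accumulate information across batches.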
3.3.1 Global Model Pre-Training
Firstly, the pre-labeled training samples are used to train a set of
Global GMMs by the AEM algorithm [8]. Suppose a certain
semantic concept c has a finite mixture distribution in feature
space Y, represented by

f_Y(y_c | θ_{k,c}) = Σ_{m=1}^{k} α_{m,c} N(μ_{m,c}, Σ_{m,c}),    (1)

where k is the number of components, N(μ_{m,c}, Σ_{m,c}) is a
Gaussian component, θ_{m,c} = (μ_{m,c}, Σ_{m,c}) collects its mean and
covariance matrix, and the α_{m,c} are the mixing probabilities
(Σ_{m=1}^{k} α_{m,c} = 1). Let θ_{k,c} = {θ_{1,c}, θ_{2,c}, …, θ_{k,c},
α_{1,c}, α_{2,c}, …, α_{k-1,c}} be the parameter set defining a given
mixture. The typical EM algorithm iteratively maximizes the
likelihood to obtain the maximum-likelihood (ML) estimate of
θ_{k,c} from the training samples y_c:

θ̂_{k,c} = arg max_{θ_{k,c}} L(θ_{k,c}, y_c).    (2)

In this paper, a modified EM algorithm, AEM, based on the
Mixture Minimum Description Length (MMDL) criterion [8], is
adopted to estimate the best θ_{k,c} (the hat "ˆ" is omitted for
simplicity) and the best k = k(c) for the GMMs from the labeled
samples.
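Concretely, the pre-training of Equations (1)-(2) amounts to fitting one GMM per labeled concept with EM. Below is a minimal NumPy sketch of plain EM (an illustration only; the paper's AEM algorithm [8] additionally selects the component number k via the MMDL criterion, which is not reproduced here, and the farthest-point initialization is our own choice):

```python
import numpy as np

def gauss_pdf(X, mu, cov):
    """Density of N(mu, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)) / norm

def farthest_point_init(X, k):
    """Pick k well-separated rows of X as initial means."""
    idx = [0]
    for _ in range(k - 1):
        dist = np.min([np.sum((X - X[i]) ** 2, axis=1) for i in idx], axis=0)
        idx.append(int(np.argmax(dist)))
    return X[idx].copy()

def fit_gmm(X, k, n_iter=100):
    """Plain EM for a k-component Gaussian mixture."""
    n, d = X.shape
    mu = farthest_point_init(X, k)
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    w = np.full(k, 1.0 / k)                    # mixing probabilities α_{m,c}
    for _ in range(n_iter):
        # E-step: responsibilities r[i, m] ∝ w_m · N(x_i | μ_m, Σ_m)
        r = np.stack([w[m] * gauss_pdf(X, mu[m], cov[m]) for m in range(k)],
                     axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances
        nm = r.sum(axis=0)
        w = nm / n
        mu = (r.T @ X) / nm[:, None]
        for m in range(k):
            diff = X - mu[m]
            cov[m] = (r[:, m, None] * diff).T @ diff / nm[m] + 1e-6 * np.eye(d)
    return w, mu, cov
```

For each concept c, `fit_gmm` would be run on the labeled samples y_c to obtain the Global GMM θ_{k(c),c}.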
3.3.2 Local Adaptation (Optimization)
As mentioned above, semantic concepts have a so-called "Time
Local Consistency" property, which suggests improving the
classification accuracy by exploring the characteristics of a certain
amount of the unlabeled samples in the current under-test batch.
Recall that x is the first portion of unlabeled samples in the
under-test batch. Similar to the pre-training process in the above
sub-section, we estimate a GMM for the sample set x. Suppose
the estimated GMM parameters are denoted by
GMM parameters are denoted by
t
kt
=
{
t
t
1, 2,
t
,
kt
, α1t , α 2t ,
,α t t
k −1
}.
(3)
Local adaptation is to find a set of local GMMs which is the
combination of the original global GMMs, represented by
{ k ( c ),c , 1
c
C }, and t t , aiming at optimizing the
k
classification performance on current under-test batch, i.e., y.
As aforementioned, the semantic concepts may drift gradually
over time. GMM’s local adaptation is designed to reduce the
affection caused by concept drifting by locally adapting GMMs on
a small portion of the under-test batch, as following steps.
Step 1: Compute the symmetric Kullback-Leibler (KL) divergence
(distance) [9] (Ds) of every pair of Gaussian components in the
GMMs represented by k ( c ),c and t t , as below form
3.3.3 Global Model Incremental Updating
Since drifting concept accords with the change of parameters for
generative models, in this sub-section, we will investigate how to
update the parameters of GMMs.
As we all know, a GMM is completely determined by a series of
Gaussian components and their corresponding weights. It is
observed that in theory there are three basic operations for
changing GMM components. The first operation is adding, i.e.,
one or more new Gaussian components will be added into the
model. The second is deleting, i.e., some existing components
will be removed in the new model. The third one is drifting, in
which some components trend to drift to a new position in the
parametric space. As a result, we decompose the updating
procedure of the Gaussian components into these three cases.
Based on above observations, we introduce the scheme for
updating global GMMs by combining the original global models
with the localized models. The updated models will be applied as
up-to-date global models for new upcoming batches. For
convenience, the localized GMM for concept c is represented by
l
k l ( c ),c
fX x |
=
k l (c )
j =1
k
(
D s (c, i, j) = D s N (y c |
=
1
tr
2
+
1
2
(
(
i,c
i,c
−
−
i,c ,
t
j
i,c
), N (x(t )|
)( )
t −1
j
)
t T
j
−1
i,c
+
t
j,
t
j
))
(4)
−1
i,c
−
( ) (
t −1
j
i,c
t
j
−
)
k ( c ),c
(
Step 3: Gaussian components N x
t
j,
t
j
)
Gaussian components in the localized GMM. It should be noted
that some concepts possibly are not able to obtain a local model.
For this case, the global model will applied be as the local model
and the global model incremental learning process in the next
subsection will be skipped. Therefore, for a sample yi in the
under-test batch, the classification result is determined by
1≤ s ≤C
{ (
l
j,c ,
l
j,c
) in localized
, find the most “close” component N (
i,c ,
i,c
) in
( (
i = arg min DKL N
1≤m≤ k(c)
l
j,c ,
l
j,c
), N (
m,c ,
m,c
)) .
(10)
Step 2: (Adding New Components) Gaussian component
N ( i,c , i,c ) in global GMM k ( c ),c is replaced by N *i,c , i,*c ,
t
α j , kl (c) = | J t(c) | is the number of
c(y i ) = arg max f X y i |
(
) , j ∈ J (c), are taken as
(
j∈J t ( c )
(9)
(6)
}
a new GMM estimation of semantic concept c for the under-test tth batch. That is, the local GMM for the certain semantic concept
c is
α tj
(7)
fX x | l l
=
N tj , tj ,
k ( c ) ,c
α l (c)
j∈J t (c)
where α l (c) =
).
by comparing Kullback-Leibler divergences (DKL) [9] as
(5)
J t (c ) = j : c = arg min D (s, j ) .
1≤ s ≤C
l
j,c
If the minimum DKL in (10) is larger than a predefined threshold
(denoted as KLDThresh), go to Step 2’ (adding new components).
Otherwise, go to Step 2 (drifting existing components).
Let J t(c) be a subset of {1, …, kt}, defined by
{
l
k l (c),c
GMM
k
1≤ i ≤ k ( c )
l
j,c ,
To update the global GMMs, we combine the components in
original GMMs with the most “close” ones in the localized
GMMs, delete those which no longer appear in the local GMMs,
or add new components to the global models, as follows.
Step 1: For each Gaussian component N
Step 2: Compute the distance between each semantic concept c
and each Gaussian component in t t , defined by
D (c, j ) = min D (c, i , j )
(
α lj,c N
l
k l ( s ),s
)}.
That is, the sample yi is classified to semantic concept c(yi).
(8)
(
)
which is defined by
(
*
i,c ,
*
i,c
)=
(
arg min DKL (1 − α ) N (
( , )
i,c ,
i,c
)+ α N (
l
j,c ,
l
j,c
), N (
*
i,c ,
*
i,c
)) ,
(11)
where is a parameter standing for the “updating speed”, which
will be discussed in detail in Section 4.1.8. According to
reference [10], *i,c , i,*c has a close form as
(
)
*
i,c
*
i,c
(
= (1 − α )
i,c
+
If a component N (
= (1 − α )
i,c
+α
)+ α
l
j,c
+
i,c
i,c ,
T
i,c
i,c
)
in
k ( c ),c
l
j,c
l
j,c
,
( )
l T
j,c
(12)
−
*
i,c
( )
* T
i,c
. (13)
drifts to a new position,
add label i into a label set, JG(c), which is initialized as an empty
set (at the same time, set kG(c) = 0. Note that kG(c) is the number
of elements in label set JG(c)), and kG(c) = kG(c) + 1.
Step 2’: (Drifting Existing Components) N
into
k ( c ),c
(
l
j,c ,
l
j,c
) is added
as a new component in the global model, and updated
′ +1,c
k(c)
global Gaussian model
(
fY yc |
'
k (c )+1,c
)=
k
4.1.1 Component Lifecycle
has the form of
(1 − β )α m,c N (
m =1
N add
+β
N
j =1
(
l
j,c ,
l
j,c
m,c ,
m,c
)
,
)
(14)
where Nadd is the total number of added components from the
*
local model,
= *•Wadd (0
<1.0) is also a parameter
controlling the “updating speed”, which will be discussed in detail
Wjl is the total weight of newlyin Section 4.1.8. Wadd =
added components from local model N
(
l
j,c ,
l
j,c
)
, and kG(c) =
kG(c) + Nadd . Finally, add a label: {kG(c)+j} into the label set
JG(c). The current Global GMM has N1 = k+Nadd components.
Step 3: (Deleting Outdated Components) For 1
i ∉ J G (c) , then delete the i-th component N ( i,c ,
i
i,c
kG(c), if
) in
k ( c ),c
by setting the corresponding weight i,c = 0 and normalizing the
weights of remaining components so the sum equals to 1.0.
Supposing total number of deleted components is Ndelete. And
now the current Global GMM has N2 = k + Nadd Ndelete Gaussian
components.
Finally, the updated GMM
(
f yc |
=
kG ( c ), c
)
kG ( c )
m =1
=
new
α old N (
m,c
m,c
kG ( c )
m =1
α old N (
m,c
m,c
,
,
m,c
m,c
)
)
new
k G ( c ),c
+
N add
j =1
has the form of
α new
N(
j
+
l
j,c
,
α new N (
p
p∈J G ( c ), p ≥ k ( c )
l
j,c
l
j,c
,
)
, (15)
l
j,c
)
new
is the weight of added new ones from
k ( c ),c , and α p
As aforementioned, for certain component in a GMM, it may
disappear at a certain stage. As a result, how long it survives in
the concept model is an important property. For component i
within a GMM, Component Lifecycle (CL) is defined as the
period from its first “birth” to its “death”, i.e.
LC i = T delete − T added .
(16)
The “Maximum Lifecycle Component” (MLC) is the one that has
the longest Component Lifecycle among the whole GMM, which
demonstrates the importance of a component in the current GMMbased learning system. CL and MLC are mainly used to describe
the importance of a certain component. That is to say, the longer
the Component Lifecycle is, the more important the corresponding
component is.
Figure 2 gives an example of a GMM’s evolving process. It
illustrates a GMM with four components (labeled as C-1,…, C-4
in the figure) in time t = 1. When t = 4, C-4 is deleted and C-5 is
added. In time t = 7, C-3 is also deleted and at this time there are
only three components in this GMM. Each column represents a
snapshot of the GMM at a certain moment, and each block means
one component. Furthermore, the height of each block equals its
weight in GMM.
GMM Component Lifecycle and Weights Distribution
t=1
t=2
t=3
t=4
t=5
t=6
t=7
C-1
C-1
C-1
C-1
C-1
C-1
C-1
C-2
C-2
C-2
C-2
C-2
C-3
C-3
C-3
C-4
C-4
C-4
C-3
C-3
C-5
C-5
C-2
C-2
t=8
C-1
C-2
C-3
C-5
C-5
C-5
Figure 2. An Toy Example of a GMM’s Evolving Process.
old
in which, α m,c
is the weight of all remaindered components in
l
k l (c),c
.
4. TRACKING CONCEPT DRIFTING
In this section, we will present how to track the drifting concepts
in the leaning framework we have presented in Section 3. In
addition, we will also discuss in a learning system how to set the
in Equation (14))
two parameters ( in Equation (13), and
which have not been detailed in previous section. Actually they
are related to the “tracking measures” to be introduced in this
section.
4.1 Measure Definitions
semantic concepts in the evolving process to describe how
concepts drift in a learning system in detail.
Practically, when one concept is represented by a generative
model, the concept drifting can be mapped to the changes of
parameters in the parametric space. For easier analysis of concept
drifting, based on the three typical operations in the updating
procedures for GMM components mentioned in Section 3.3.3, in
this section we define several measures in terms of tracking the
4.1.2 Dominant Component
As shown in Figure 2, for a particular component such as C-1, its
weight also changes alone timeline.
Under the formulation of GMM, we can define the significance of
a certain component according to its weight. The Dominant
Component (DC) is the one that has the largest weight among all
currently existing Gaussian components at a certain moment, i.e.,
l DC = arg max(α p ) .
(17)
1≤ p ≤ k (c )
As illustrated in Figure 2, when t = 1, component 1 (labeled as C1 in the figure) is the DC, and when t = 7, C-5 is the DC.
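Two computations recur throughout Sections 3.3.2-3.3.3 and the measures above: the symmetric KL divergence of Equation (4) and the closed-form drifting update of Equations (12)-(13). A NumPy sketch (our own illustration, not the authors' code):

```python
import numpy as np

def sym_kl(mu1, cov1, mu2, cov2):
    """Symmetric KL divergence between two Gaussians, as in Equation (4)."""
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    diff = mu1 - mu2
    term_cov = 0.5 * np.trace((cov1 - cov2) @ (inv2 - inv1))
    term_mean = 0.5 * diff @ (inv1 + inv2) @ diff
    return term_cov + term_mean

def drift_component(mu, cov, mu_l, cov_l, alpha):
    """Moment-matched drift of a global component toward a local one,
    Equations (12)-(13): the single Gaussian closest (in KL) to the mixture
    (1-alpha) N(mu, cov) + alpha N(mu_l, cov_l)."""
    mu_star = (1 - alpha) * mu + alpha * mu_l
    cov_star = ((1 - alpha) * (cov + np.outer(mu, mu))
                + alpha * (cov_l + np.outer(mu_l, mu_l))
                - np.outer(mu_star, mu_star))
    return mu_star, cov_star
```

Note that at alpha = 0 the component is unchanged and at alpha = 1 it coincides with the local component, which is exactly the "updating speed" interpretation of α in Section 4.1.8.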
4.1.3 Component Saliency
As described in Section 4.1.2, the weight of a particular
component in a GMM may keep changing along the timeline.
The average weight of a component over time is defined as its
Component Saliency (CS), which reflects the importance of the
component over time. In Figure 2, though C-1 and C-2 are both
Maximum Lifecycle Components, the Component Saliency of
C-1 is greater than that of C-2. As a result, C-1 is more important
than C-2 according to their Component Saliencies.
4.1.4 Component Saliency Variance and Component
Saliency Drifting Curve
The variance of a component's weight over time is defined as its
Component Saliency Variance (CSV), which describes the
stability of the component over time; it can be visualized by the
Component Saliency Drifting Curve (CSDC).
4.1.5 Dominant Component over Time
As aforementioned, the measure Component Lifecycle only
considers the factor of time, while Dominant Component only
regards the factor of weight. Combining both time and weight,
the Dominant Component over Time (DCT) is defined as

l_DCT = arg max_{1≤i≤I} [AW_i · CL_i],    (18)

where AW_i is the average weight (Component Saliency) of
component i, CL_i is its Component Lifecycle, and I is the
maximum number of possible components in the whole learning
system.
4.1.6 Component Drifting Distance
As mentioned, a GMM component may gradually drift along the
timeline. Supposing the component N(μ_{i,c}, Σ_{i,c}) in θ_{k(c),c}
drifts to N(μ^l_{i,c}, Σ^l_{i,c}) in θ^l_{k^l(c),c}, the Component Drifting
Distance (CDD) is defined as the KL divergence [9] between the
original component and the drifted one,

CDD^c_{t,i} = D_KL[N(μ_{i,c}, Σ_{i,c}), N(μ^l_{i,c}, Σ^l_{i,c})].    (19)

CDD reflects not only the individual drifting magnitude but also
the overall drifting "speed" of the whole learning system. Figure 3
illustrates the CDD values of the toy example given in Figure 2.

Figure 3. A Toy Example - Component Drifting Distance (CDD
distribution along the timeline).
4.1.7 System Stability
The numbers of adding and deleting operations, as well as the
magnitudes of the CDDs, reflect the stability of the concept
tracking system. Suppose that by a certain moment the GMM has
been updated H times in total. We define System Stability (SS) as

SS = Σ_{u=1}^{H} S_u,    (20)

where

S_u = Σ_{j=1}^{k(c)} α_j CDD^c_{u,j} δ(u,j) + γ [N_u(delete) + N_u(add)],    (21)

δ(u,j) = 1 if component j is drifting at time u, and 0 otherwise.    (22)

N_u(delete) (or N_u(add)) is the total number of deleted (or newly-added)
components in the GMM under investigation, and γ is a
predefined parameter balancing the two sum terms. We can also
graphically observe the GMM's evolving process by plotting the
S_u curve (S_u ~ u). The SS value and the S_u curve reflect the
overall property of the whole learning system, and constitute a
very important and helpful measure for judging whether the
concept is drifting in the system. A real-world example will be
provided in Section 5.2.2.

4.1.8 Updating Speed
So far we have defined several useful measures for describing the
concept drifting system. Our main goal is to judge whether there
is concept drifting and how much the concepts are drifting in the
learning system based on these defined measures, especially the
SS values. According to these measures, we are able to set the
corresponding parameters in our proposed OOIL framework
introduced in Section 3. As mentioned above, the parameter α,
appearing in Equation (13), is used to control the updating speed
for the components' drifting, while the parameter β in Equation
(14) is utilized to control the weights of the newly-added
components, which reflects their "adding speed".

The parameter α, ranging from 0 to 1.0, represents the Updating
Intensity (UI) of the Global Model Incremental Updating (GMIU)
procedure. Based on the values of System Stability, when the
dynamic system drifts very quickly, more information from the
newly-trained local model should be considered; on the contrary,
when the system drifts slowly, the global model is considered the
more reliable one. As a result, System Stability guides us in
controlling the updating speed of the original global models.
That is, in the former case the parameter α should be set larger,
and in the latter case smaller.

In addition, the parameter β in Equation (14) controls the
components' "adding speed", and ranges from 0 to W_add (recall
that W_add is the total weight of the newly-added components in
the locally optimized model). Therefore, we name it the Newly-Added
Component Ratio (NACR). When NACR is larger, the
reallocated weights of the newly-added components are also
larger, which means the newly-added components have relatively
higher importance in the updated global model; on the contrary,
when NACR is smaller, the newly-added components play a less
important role in the updated global model.

At the initial stage, these two parameters are pre-set according to
the prior knowledge obtained from analyzing the defined tracking
measures on the preliminary training data. Then, as each new
batch becomes available, existing tracking information such as
the System Stability on all out-dated batches is computed as
reference information for tuning these two parameters. That is, if
the System Stability is getting larger, the two parameters will be
increased; otherwise, they will be decreased.

5. EXPERIMENTS
We execute a series of experiments on the development data set of
the TREC Video Retrieval Evaluation (TRECVID) 2004 [12].
Our goal is to investigate the tracking measures of concept
drifting on this data set, and at the same time to evaluate the
OOIL framework by comparing it with several other similar
schemes.
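As a concrete illustration of the System Stability measure of Section 4.1.7 (Equations (20)-(22)), SS can be accumulated directly from per-update tracking logs; a pure-Python sketch with a record layout of our own choosing:

```python
def system_stability(updates, gamma=1.0):
    """System Stability, Equations (20)-(22): SS = sum_u S_u, where each S_u
    sums the weighted Component Drifting Distances of the components that
    drifted at update u, plus gamma times the add/delete counts."""
    ss = 0.0
    for rec in updates:                       # one record per model update u
        s_u = sum(w * cdd
                  for w, cdd, drifted in zip(rec["weights"], rec["cdd"],
                                             rec["drifted"])
                  if drifted)                 # δ(u, j) = 1 only for drifters
        s_u += gamma * (rec["n_deleted"] + rec["n_added"])
        ss += s_u
    return ss
```

Plotting the per-update `s_u` values over u gives the S_u curve described in Section 4.1.7.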
5.1 Data Set and Feature
The TRECVID 2004 development data [13] used in our
experiments includes 114 news videos (spanning four months in
1998) from ABC and CNN, about 60 hours in total; each news
video is about half an hour. We deal with the ABC and CNN
collections separately, since the properties of the videos from the
two stations (sources) are quite different. For each collection, the
57 ABC/CNN news videos are divided into 11 data batches
(groups) along the timeline, with about 5 videos (roughly 2.5
hours, recorded every other day or so) per batch.
We select two semantic concepts as samples to test our schemes:
basketball and Studio_Setting [12]. All pre-labeled shots for the
concepts are from the annotations provided by TRECVID 2004.
The training data set in each batch is derived by extracting
features from all frames in the positively-labeled shots. In total,
there are 520 positive basketball shots (139 for ABC, 381 for
CNN; 2% of the whole collections) and 6225 positive
Studio_Setting shots (3223 for ABC, 3032 for CNN; 19% of the
whole collections). The test data set comprises all the key-frames
(a shot may have multiple key-frames in the common shot
boundary reference, which is also provided by TRECVID 2004)
in the whole data batch. One video (about half an hour) in each
batch is considered as the reference video, especially for our
OOIL framework.
In addition, the feature used in our experiments is the correlogram
in HSV color space, a 144-dimensional feature vector in total
(denoted as HSVCorrelogram144 hereinafter), as presented in
[19]. Owing to the lack of sufficient training samples in the
current data batch for training such a high-dimensional GMM, the
standard Principal Component Analysis (PCA) method is applied
to reduce the feature dimension to 6. In addition, the mean of
each component has not been normalized to zero in our
experiments.
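The dimension-reduction step can be sketched with a plain SVD-based PCA in NumPy (the random matrix below is only a stand-in for the HSVCorrelogram144 features; note also that textbook PCA centers the data, whereas the paper states the component means were not zero-normalized in its experiments):

```python
import numpy as np

def pca_reduce(X, n_components=6):
    """Project feature vectors onto their top principal components."""
    Xc = X - X.mean(axis=0)                    # center each dimension
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T            # top-k subspace coordinates

features = np.random.default_rng(0).normal(size=(500, 144))
reduced = pca_reduce(features, 6)              # 144-dim -> 6-dim, as in Sec. 5.1
```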
5.2 Insight into Data with Tracking Measures
In this subsection, we examine the tracking measures (defined in
Section 4.1) for the concept tracking system based on the
TRECVID 2004 development data set. From the statistical
properties of the data itself, we survey how to utilize this
information in order to track concept drifting in a more
sophisticated way.

To avoid the unpredictable differences between the performances
of different learning systems, we directly train the models from
the pre-labeled training data on each batch. For a concept under
investigation, 11 GMMs in total are trained from the pre-labeled
samples in the 11 data batches. If there is concept drifting among
these 11 data batches, these models will vary remarkably. Our
goal in this subsection is to track the differences among these
GMMs. For convenience, in the experiments of this section, the
number of components of these GMMs is always set to 5.
Therefore, it should be noted that the measures illustrated in this
subsection are not exactly the same as (but estimations of) the
ones defined in Section 4.1.

5.2.1 Component Drifting Distance and Component
Saliency Drifting Curve
As mentioned in Section 4.1, the corresponding components
between two adjacent GMMs (trained from adjacent batches)
may drift along the timeline. The KL divergence between them
is defined as the Component Drifting Distance (CDD). Figure 3
in Section 4.1 has given a toy example of CDD. In this
subsection, we introduce how to obtain a CDD figure from a
real-world data set such as the TRECVID 2004 development data
set.

After training all the GMMs on all the data batches, we compare
the differences between adjacent model pairs. For simplicity, we
do not consider the case of adding or deleting components here,
and all the GMMs have 5 components. Our goal here is to track
the drifting information between these GMMs and search for the
nearest pair of components between adjacent GMMs. After
comparing pair-wise KL divergences, we regard the matched pair
of components as belonging to the same group. Thereafter, we
label all these GMM components (5 × 11 = 55 in total) into five
groups denoted by C-1, C-2, …, C-5, similar to the case
illustrated in Figure 2. All the components in a group are
considered to keep drifting along the timeline. CDD is then
computed between each adjacent two components belonging to
the same group.

Besides CDD, the Component Saliency Drifting Curve can also
be plotted explicitly, which reflects the stability of one particular
component (the figure is omitted due to space limitation).

5.2.2 System Stability Comparison
As aforementioned, the SS value reflects the system stability
from a global viewpoint. To compare the System Stability
between different data collections, the SS values for concept
Studio_Setting on both the ABC and CNN collections are
illustrated in Figure 4. Since we do not consider the adding and
deleting cases here, the S_u value defined in Equation (21)
simplifies to

S_u = Σ_{j=1}^{k(c)} α_j CDD^c_{u,j} δ(u,j).    (23)

Figure 4. System Stability Comparison between ABC and
CNN - Studio_Setting.
It can be seen that the S_u values differ across data batches within
the ABC/CNN collections, and they also differ between the ABC
and CNN collections. For the Studio_Setting concept, SS_ABC+S =
560.4 and SS_CNN+S = 18132.0; for the basketball concept,
SS_CNN+B = 652.9. As SS_ABC+S is much smaller than SS_CNN+S, it
can be concluded that concept Studio_Setting drifts more
evidently in the CNN collection than in the ABC collection.
Moreover, in the CNN collection, concept basketball is much
more stable than concept Studio_Setting, but it drifts a little more
evidently than Studio_Setting does in the ABC collection. As a
result, we update the models more quickly when tracking concept
Studio_Setting on the CNN collection than on the ABC
collection, and the updating speed for concept basketball is a little
quicker than that for concept Studio_Setting in ABC. In addition,
these results can explain why the schemes' performances on the
ABC collection are a little better than those on the CNN
collection, as detailed in Section 5.3.3.
This section has illustrated some of the defined tracking measures
on the ABC and CNN collections, which helps to find a more
sophisticated way to update the models. Based on these
illustrated figures of the defined tracking measures, we are able to
obtain more reliable knowledge on how to set the initial values of
the corresponding parameters involved in the learning systems.
5.3 OOIL Framework and Other Schemes
To evaluate the proposed OOIL framework in tracking drifting
concepts, we compare it with related schemes and with the same
scheme under different settings.
5.3.1 Evaluation on TRECVID Data
We use the evaluation method applied in the TRECVID 2004 "high-level feature extraction" task. Since a shot commonly contains multiple key-frames, the maximum value among the GMM outputs for all key-frames within one shot is taken as the final score of that shot, and the final classification result is a ranked shot list for each concept. An evaluation tool named trec_eval [15] is used to compute the AP (Average Precision) value [19] as the ultimate evaluation measure. This evaluation method focuses on returning a well-ranked list; there is no need to determine whether each sample is positive or not. To achieve a higher AP value, the system must rank the most likely samples as high as possible (see the appendix of the TREC-10 Proceedings on common evaluation measures for details [20]).
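The per-shot scoring and ranked-list evaluation described above can be sketched as follows. This is a minimal illustration of non-interpolated AP, not the actual trec_eval implementation, and the shot IDs and scores are made up:

```python
def shot_scores(keyframe_scores):
    """Shot score = the maximum GMM output over the shot's key-frames."""
    return {shot: max(scores) for shot, scores in keyframe_scores.items()}

def average_precision(ranked_shots, relevant):
    """Non-interpolated AP over a ranked shot list."""
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical example: key-frame scores per shot; two shots are relevant.
scores = shot_scores({"s1": [0.2, 0.9], "s2": [0.4], "s3": [0.8, 0.1]})
ranking = sorted(scores, key=scores.get, reverse=True)  # ['s1', 's3', 's2']
print(average_precision(ranking, relevant={"s1", "s2"}))  # (1/1 + 2/3)/2 = 5/6
```

Note that the measure rewards placing relevant shots early in the list, which is why a hard positive/negative decision is never needed.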
5.3.2 Schemes for Comparison
As presented above, the OOIL framework has three major features: effectively utilizing unlabeled samples in the under-test batches, GMM Local Adaptation (LA), and Global Model Incremental Updating (GMIU). Experiments are designed to evaluate the effectiveness of each feature. Accordingly, three schemes are compared as follows.
S1 - Our proposed OOIL framework.
S2 - OOIL but using the updated global models for classification: Compared with S1, the online-updated global GMMs are employed for online classification.
S3 - Allowing more "labeled" data: Similar to S1, except that the local models are obtained directly from a manually labeled training set of the current under-test batch, instead of from unlabeled samples via LA as presented in Section 3.3.2. Obviously, this scheme uses more training data than S1 and S2, for which only a preliminary pre-labeled training set is needed and the upcoming batches contain no labeled data at all. We will show that the performance of our proposed scheme (S1) is close to that of this scheme without using any labeled samples from the current under-test batch.
5.3.3 Results
We apply these schemes to the ABC and CNN data collections separately. For Studio_Setting, the scheme comparison is executed on both the ABC and CNN collections, while for basketball it is executed only on the CNN collection, as there are very few basketball shots in the ABC collection.
Table 3. Overall AP Comparisons on ABC and CNN collections

Schemes    ABC Studio_Setting    CNN basketball    CNN Studio_Setting
S1         0.139                 0.039             0.124
S2         0.098                 0.028             0.098
S3         0.203                 0.052             0.194
Table 3 shows the overall performance of the different schemes for each concept on the ABC and CNN collections. For each scheme, the overall Average Precision (AP) is the AP averaged over all batches along the timeline. From this table, it can be concluded that our proposed OOIL framework achieves better overall performance than S2 (though for a few batches its performance is not the best), and is very close to the ideal case (scheme S3) on both the ABC and CNN data collections. That is to say, the OOIL framework is an effective learning system for tracking concept drifting.
Figure 5. Comparisons on CNN Collection - basketball.
For the concept basketball on the CNN collection, the series of Average Precision values of each scheme is represented by a curve, as illustrated in Figure 5. The x-coordinate is the index of the data batch, and the y-coordinate is the Average Precision value for that batch.
In addition, it can be seen that for the concept Studio_Setting, the overall performance on the ABC collection (0.139) is better than that on the CNN collection (0.124). These results can be explained by the comparison of System Stability illustrated in Section 5.2.2.
5.3.4 Parameters Setting
Parameters for GMMs: Based on our experiments, the number of components of the global GMM is preset to 5, and the number of components of the reference GMM (recall that it is trained from the reference video for the OOIL framework) is 30. In the GMM training procedure, the maximum iteration number is set to 100, and the alternative termination condition for the basic EM algorithm is that the improvement between two successive iterations is smaller than a pre-defined threshold, preset to 0.01.
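The EM stopping rule just described (at most 100 iterations, or a log-likelihood improvement below 0.01) can be sketched with a basic 1-D EM loop. This is an illustrative toy implementation, not the authors' code; the quantile-based initialization and the synthetic two-cluster data are assumptions:

```python
import numpy as np

def fit_gmm_1d(x, k, max_iter=100, tol=0.01):
    """Basic EM for a 1-D GMM with the stopping rule described above:
    at most max_iter iterations, or a log-likelihood improvement below tol."""
    w = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # assumed quantile init
    var = np.full(k, x.var() + 1e-6)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each sample
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        ll = np.log(total).sum()
        if ll - prev_ll < tol:  # alternative termination condition
            break
        prev_ll = ll
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
w, mu, var = fit_gmm_1d(x, k=2)
print(np.sort(mu))  # one mean near 0, the other near 5
```

The tolerance check is the "alternative" condition: whichever of the two criteria fires first ends training.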
Parameter of UI: As aforementioned, UI stands for the Updating Intensity of the global model updating procedure. The dynamic system may drift very quickly or relatively slowly; in the former case the System Stability value will be larger than in the latter. As a result, if a large System Stability value is calculated from the pre-labeled data set, this parameter should be set larger at the initial stage of the OOIL system, even close to 1.0. On the contrary, it should be set smaller.
From the analysis of the System Stability on the ABC and CNN data collections in Section 5.2.2, it can be found that for the Studio_Setting concept, the ABC collection is more stable than the CNN collection; that is, the SS value on CNN is larger than that on ABC. Furthermore, the concept Studio_Setting on the ABC collection is more stable than the concept basketball on CNN.
Parameter of NACR: The parameter NACR is applied when a certain component of the locally optimized model is added to the global model. In our experiments, the range of NACR is set to [0, 0.5].
Threshold for GMIU Process: In the Global Model Incremental Updating (GMIU) procedure of the OOIL framework, a pre-defined threshold named KLDThresh controls whether to drift an old component of the global model or to add a new component from the local model to the global model. If the KL divergence defined in Equation (10) is larger than KLDThresh, the corresponding new component is added to the global model; otherwise, the existing component is drifted according to the nearest component in the local model.
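The add-or-drift decision can be sketched as below for 1-D components. The thresholded KL test follows the rule just described; the linear drift toward the local component with intensity UI is an assumption for illustration (the paper's exact update around Equation (10) is not reproduced), and component weights are omitted for brevity:

```python
import numpy as np

def kl_gauss_1d(mu0, var0, mu1, var1):
    """Closed-form KL divergence between two 1-D Gaussians."""
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1 + np.log(var1 / var0))

def gmiu_step(global_comps, local_comp, kld_thresh=5.0, ui=0.1):
    """One GMIU decision (sketch). Components are (mean, variance) pairs.
    If the new local component is farther than kld_thresh from its nearest
    global component, add it; otherwise drift the nearest global component
    toward it with Updating Intensity ui (assumed linear interpolation)."""
    mu_l, var_l = local_comp
    dists = [kl_gauss_1d(mu_l, var_l, mu_g, var_g) for mu_g, var_g in global_comps]
    nearest = int(np.argmin(dists))
    if dists[nearest] > kld_thresh:
        return global_comps + [local_comp]          # add a new component
    mu_g, var_g = global_comps[nearest]             # drift the existing one
    drifted = ((1 - ui) * mu_g + ui * mu_l, (1 - ui) * var_g + ui * var_l)
    return global_comps[:nearest] + [drifted] + global_comps[nearest + 1:]

g = [(0.0, 1.0), (5.0, 1.0)]
print(len(gmiu_step(g, (0.2, 1.0))))   # nearby component: drifted in place, still 2
print(len(gmiu_step(g, (20.0, 1.0))))  # distant component: added as new, now 3
```

With KLDThresh fixed at 5.0 (as in Table 4), only genuinely novel local components grow the global mixture; everything else nudges an existing component.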
In summary, the initial preset parameters of the OOIL framework for the concepts basketball and Studio_Setting on the ABC and CNN collections are listed in Table 4.
Table 4. OOIL Parameter Pre-setting

Parameters                ABC                            CNN
in OOIL       basketball   Studio_Setting   basketball   Studio_Setting
UI            -            0.1              0.15         0.3
NACR          -            0.05             0.05         0.2
KLDThresh     -            5.0              5.0          5.0

Adaptive Threshold Determination: The parameters of the OOIL framework (UI, NACR, and KLDThresh) are first preset as aforementioned at the initial stage. As more and more new batches become available, these parameters are adaptively adjusted according to the existing tracking information, such as the SS values, obtained from the Global Model Incremental Updating (GMIU) process. The adjusted parameters are then applied in the next learning process for the new upcoming batch.

6. CONCLUSIONS AND FUTURE WORKS
In this paper, we have proposed a set of systematic measures to track concept drifting, based on a novel Online-Optimized Incremental Learning framework. A limitation of the proposed concept drifting tracking scheme is that it currently works only for GMM-based learning systems. However, similar methods may easily be applied to other generative-model-based learning algorithms. Future work will focus on applying our framework to more semantic concepts and larger data collections, as well as on finding a better or optimal way to determine the systematic parameters. Furthermore, studying concept drifting for discriminative-model-based learning systems (such as SVM and boosting) may be another direction for future work.

7. REFERENCES
[1] A. Tsymbal. The problem of concept drift: definitions and related work. Available at http://www.cs.tcd.ie.
[2] G. Widmer, et al. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1), 1996.
[3] R. Klinkenberg, et al. Detecting Concept Drift with Support Vector Machines. ICML 2000, pp. 487-494.
[4] R. Klinkenberg. Using Labeled and Unlabeled Data to Learn Drifting Concepts. IJCAI-2001 Workshop on Learning from Temporal and Spatial Data, pp. 16-24.
[5] W. Fan. Systematic Data Selection to Mine Concept-Drifting Data Streams. ACM SIGKDD 2004.
[6] W. Fan. StreamMiner: A Classifier Ensemble-based Engine to Mine Concept-Drifting Data Streams. VLDB 2004.
[7] H.X. Wang, et al. Mining Concept-Drifting Data Streams using Ensemble Classifiers. ACM SIGKDD 2003.
[8] M. Figueiredo, et al. On Fitting Mixture Models. Energy Minimization Methods in Computer Vision and Pattern Recognition, E. Hancock and M. Pellilo (Eds.), Springer-Verlag, 1999.
[9] S. Kullback. Information Theory and Statistics. J. Wiley & Sons, New York, 1959.
[10] M. West, J. Harrison. Bayesian Forecasting and Dynamic Models. Springer-Verlag, New York, 1989.
[11] J. Wu, et al. An Online-Optimized Incremental Learning Framework for Video Semantic Classification. ACM MM'04.
[12] TREC Video Retrieval Evaluation (NIST, USA) Homepage. Available at: http://www-nlpir.nist.gov/projects/trecvid/
[13] TREC Video Retrieval Evaluation Past Data. Available at: http://www-nlpir.nist.gov/projects/trecvid/trecvid.data.html
[14] C.-Y. Lin, B. Tseng, J.R. Smith (IBM T.J. Watson Research Center). Video Collaborative Annotation Forum: Establishing Ground-Truth Labels on Large Multimedia Datasets. http://www-nlpir.nist.gov/projects/tvpubs/tvpapers03/ibm.final2.paper.pdf
[15] TREC Video Retrieval Evaluation Tools. Available at: http://www-nlpir.nist.gov/projects/trecvid/trecvid.tools/
[16] S.H. Clearwater, T.-P. Cheng, H. Hirsh, B.G. Buchanan. Incremental batch learning. Proceedings of the Sixth International Workshop on Machine Learning, Ithaca, New York, USA, pp. 366-370, 1989.
[17] D.A. Pomerleau. Efficient Training of Artificial Neural Networks for Autonomous Navigation. Neural Computation, vol. 3, 1991.
[18] L. Zhang, F.Z. Lin, B. Zhang. A CBIR method based on color-spatial feature. IEEE Region 10 Annual International Conference (TENCON'99), Cheju, Korea, 1999, pp. 166-169.
[19] J. Huang, S.R. Kumar, M. Mitra, et al. Image Indexing Using Color Correlograms. IEEE Conference on CVPR 1997, pp. 762-768.
[20] TREC-10 Proceedings appendix on common evaluation measures. Available at: http://trec.nist.gov/pubs/trec10/appendices/measures.pdf