When Is the Right Time to Refresh
Knowledge Discovered from Data?
Xiao (“Shaw”) Fang
University of Toledo
Olivia R. Liu Sheng
University of Utah
Introduction
[Figure: The KDD (Knowledge Discovery in Databases) process: Data Source → Data Preprocessing → Processed Data → Data Mining → Patterns → Pattern Postprocessing → Knowledge]
Introduction
Prior KDD research focused on
  efficiency and effectiveness improvement of the KDD process;
  innovative applications of KDD.
Prior KDD research overlooked the problem of maintaining currency of knowledge over an evolving data source.
Introduction
[Figure: KDD is run over the data source at time t1 to produce knowledge; new data arrive by time t2, raising the question of whether KDD should be run again.]
The problem of maintaining currency of knowledge over an evolving data source is nontrivial.
Introduction
The problem is fundamental: it impacts every KDD application.
The problem is also critical in practice (Cooper and Giuffrida 2000; King et al. 2002).
Introduction
Knowledge refreshing is a nontrivial process of keeping knowledge discovered using KDD up to date with its dynamic data source.
Related Work
Incremental data mining
Data stream mining
Related analytical research on databases
Related Work
[Figure: representative prior work by data mining model, before and after 2000]
  Association:     Cheung et al. (1996)   |   Manku and Motwani (2002)
  Classification:  Utgoff (1989)          |   Hulten et al. (2001)
  Clustering:      Can (1993)             |   Guha et al. (2003)
Related Work
[Figure: the KDD process: Data Source → Data Preprocessing → Processed Data → Data Mining → Patterns → Pattern Postprocessing → Knowledge]
Related Work
Research gaps:
  how to refresh patterns vs. when to refresh knowledge
  one step in the KDD process vs. the complete KDD process
  one particular data mining model vs. all three major data mining models
  computational efficiency vs. effective decision making
Related Work
This research studies when to refresh knowledge so as to optimize the trade-off between the loss of knowledge and the cost incurred by running KDD.
Related Work
Previous analytical research on the design and implementation of databases
  Chandy et al. (1975)
  Park et al. (1990)
  Segev and Fang (1991)
Related Work
This research is a first study on knowledge refreshing for the KDD process.
The new problem context requires the introduction of new concepts and new model components and, consequently, a new model structure.
Research Questions
Knowledge loss: what is it? how to measure
and estimate it?
The knowledge refreshing problem:
definition and model?
Solution and implementation?
How robust and effective is the solution?
Knowledge Loss: Definition
Knowledge loss refers to the phenomenon that knowledge discovered by a previous run of KDD gradually becomes obsolete as new data are continuously added.
  Type I: part or all of the earlier discovered knowledge becomes invalid due to incoming new data;
  Type II: new knowledge is brought in by incoming new data.
Knowledge Loss: Measurement
[Figure: KDD is run at time t-s, yielding knowledge K_{t-s}; at time t the current knowledge is K_t. Type I knowledge loss: knowledge in K_{t-s} that is no longer in K_t. Type II knowledge loss: knowledge in K_t that is not in K_{t-s}. Valid knowledge: K_{t-s} ∩ K_t.]
  l_t = 1 - |K_{t-s} ∩ K_t| / |K_{t-s} ∪ K_t|    (1)
  l_t ∈ [0, 1]
  l_t : knowledge loss at time t
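Since the knowledge discovered by any of the three major data mining models can be represented as a set (next slide), the measure in (1) reduces to plain set operations. A minimal Python sketch; the function name and the example itemsets are illustrative, not from the paper:

```python
def knowledge_loss(k_old: set, k_new: set) -> float:
    """Knowledge loss per (1): 1 - |K_old intersect K_new| / |K_old union K_new|."""
    union = k_old | k_new
    if not union:
        return 0.0  # convention: two empty knowledge sets mean no loss
    return 1.0 - len(k_old & k_new) / len(union)

# Hypothetical frequent itemsets discovered at time t-s and at time t
k_t_minus_s = {("milk", "bread"), ("beer", "diapers"), ("milk",)}
k_t = {("milk", "bread"), ("milk",), ("bread", "eggs")}
print(knowledge_loss(k_t_minus_s, k_t))  # 0.5: two shared itemsets out of four distinct ones
```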
Knowledge Loss: Measurement
Knowledge discovered by a KDD process employing any of the three major data mining models can be represented as a set.
  Association: a set of itemsets or a set of association rules (Agrawal and Srikant 1994)
  Classification: a set of classification rules (Quinlan 1993)
  Clustering: a set of cluster centers (Jain et al. 1999)
Knowledge Loss: Estimation
[Figure: KDD is run at time t-s; d_t denotes the amount of new data accumulated between t-s and t.]
  d_t : amount of new data accumulated by time t
  Question: l_t = f(d_t)?
Knowledge Loss: Estimation
A two-parameter Weibull function for estimating knowledge loss:
  l_t = 1 - e^{-(d_t/α)^β} + ε_t    (2)
  α, β : parameters
  ε_t : random error term, E[ε_t] = 0, Var(ε_t) = σ²
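Assuming observed (d_t, l_t) pairs are available from experiments like the ones below, α and β can be estimated by nonlinear least squares. A sketch using scipy.optimize.curve_fit; the observations here are made-up numbers, not the paper's data:

```python
import numpy as np
from scipy.optimize import curve_fit

def weibull_loss(d, alpha, beta):
    # Two-parameter Weibull knowledge-loss curve from (2), without the error term
    return 1.0 - np.exp(-(d / alpha) ** beta)

# Hypothetical observations: amount of new data vs. observed knowledge loss
d_obs = np.array([1000, 5000, 10000, 20000, 40000, 60000], dtype=float)
l_obs = np.array([0.14, 0.32, 0.44, 0.57, 0.71, 0.79])

(alpha_hat, beta_hat), _ = curve_fit(weibull_loss, d_obs, l_obs,
                                     p0=(30000.0, 0.6), bounds=(0, np.inf))
print(alpha_hat, beta_hat)
```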
Knowledge Loss: Estimation
Motivation: the Weibull function models product degradation in reliability engineering (Murthy et al. 2003); the analogous phenomenon here is knowledge degradation in KDD.
Validation: empirical experiments using real-world data
Knowledge Loss: Estimation
Experimental design
[Figure: run KDD on the base data to obtain the base knowledge; run KDD on the base data plus the incremental data to obtain the new knowledge; apply (1) to the two knowledge sets to obtain knowledge loss as a function of the amount of new data.]
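A sketch of that experimental loop; mine_patterns stands in for whichever mining algorithm is used (association, classification, or clustering) and loss_fn is the measure in (1). Both are assumed to be supplied by the caller:

```python
def observe_loss_curve(base_data, new_data, increment, mine_patterns, loss_fn):
    """Re-mine after each increment of new data and record (amount of new data, knowledge loss)."""
    base_knowledge = mine_patterns(base_data)
    observations = []
    for end in range(increment, len(new_data) + 1, increment):
        incremental_knowledge = mine_patterns(base_data + new_data[:end])
        observations.append((end, loss_fn(base_knowledge, incremental_knowledge)))
    return observations
```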
Knowledge Loss: Estimation
Experiment set 1
  Panel data: 63,999 online transactions
  Knowledge discovered: co-purchasing knowledge
  Data mining model: Association
Knowledge Loss: Estimation
Weibull estimation: Experiment 1-3
  Base data: 12,500 transactions
  Increment: 100 transactions
  Support: 0.1%
  Number of observations: 515
  Range of knowledge loss: 0-0.70
  Regression results: α = 37911.0, β = 0.60, σ² = 0.0009, R² = 0.99
[Figure: observed knowledge loss versus the amount of new data (# of transactions), with the fitted curve l_t = 1 - e^{-(d_t/α)^β} + ε_t.]
Knowledge Loss: Estimation
  Experiment   Base data   Increment   Support   Num. of observations   Range of knowledge loss
  1-1          10,000      100         0.1%      540                    0-0.77
  1-2          7,500       100         0.1%      565                    0.08-0.77
  1-3          12,500      100         0.1%      515                    0-0.70
  1-4          10,000      50          0.1%      1080                   0-0.77
  1-5          10,000      200         0.1%      270                    0-0.77
  1-6          10,000      100         0.075%    540                    0-0.69
  1-7          10,000      100         0.125%    540                    0-0.73
Knowledge Loss: Estimation
  Experiment   α         β      σ²
  1-1          27102.9   0.56   0.001
  1-2          30710.9   0.51   0.001
  1-3          37911.0   0.60   0.0009
  1-4          27123.2   0.56   0.001
  1-5          27109.3   0.55   0.001
  1-6          36676.3   0.49   0.0007
  1-7          37272.0   0.67   0.002
Knowledge Loss: Estimation
Experiment set 2
  Census data (publicly available at the UCI repository for machine learning research): 45,222 records
  Knowledge discovered: rules for predicting income level based on attributes such as age and education level
  Data mining model: Classification
Knowledge Loss: Estimation
Weibull estimation: Experiment 2-2
  Base data: 30,000 records
  Increment: 50 records
  Number of observations: 305
  Range of knowledge loss: 0.02-0.93
  Regression results: α = 1961.5, β = 0.58, σ² = 0.001, R² = 0.99
[Figure: observed knowledge loss versus the amount of new data (# of records), with the fitted curve l_t = 1 - e^{-(d_t/α)^β} + ε_t.]
The Knowledge Refreshing Problem
[Figure: timeline starting at time 0. The data source evolves continuously; each KDD run (incurring the cost of running KDD) rebuilds the knowledge base; requests submitted to the knowledge base between runs incur the cost of knowledge loss.]
The Knowledge Refreshing Problem
System constraint: when a request is submitted to the knowledge base, a KDD run is required if the current knowledge loss exceeds l_c.
The knowledge refreshing problem
  Determine when to run KDD over a time horizon.
  Objective: minimize the system cost incurred over the time horizon, subject to the system constraint.
Model
Assumptions
  Arrival of new data: follows a Poisson process with intensity µ.
  There are n different types of requests: r_i, i = 1, 2, …, n, n ≥ 1.
  Arrival of type i requests: follows a Poisson process with intensity λ_i, i = 1, 2, …, n.
Model
Decision points: moments when a request arrives.
System time: 0, 1, 2, …, m, …, M
  M : total number of decision points
  m : system time of the mth decision point, 1 ≤ m ≤ M
Action: a_m ∈ {0, 1}
  0 : not running KDD
  1 : running KDD
Model
System state: s_m = (q_m, d_m)
  q_m : type of the request arriving at time m, q_m ∈ {r_i}, i = 1, 2, …, n
  d_m : amount of new data accumulated by time m, d_m ∈ ℕ, where ℕ denotes the nonnegative integers
Model
System constraint
  If l_m ≥ l_c then a_m = 1, where l_m denotes the knowledge loss at time m.
  Applying (2): if 1 - e^{-(d_m/α)^β} ≥ l_c then a_m = 1,
  equivalently: if d_m ≥ d_c then a_m = 1, where d_c = α · (ln(1/(1 - l_c)))^{1/β}.
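Inverting (2) at l_c gives the data threshold d_c directly; a small sketch (the α and β values are taken from experiment 1-1 above, purely for illustration):

```python
import math

def data_threshold(alpha: float, beta: float, l_c: float) -> float:
    """d_c = alpha * (ln(1/(1 - l_c)))**(1/beta): the amount of new data at which estimated loss reaches l_c."""
    return alpha * math.log(1.0 / (1.0 - l_c)) ** (1.0 / beta)

print(data_threshold(alpha=27102.9, beta=0.56, l_c=0.2))  # roughly 1861 new transactions
```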
Model
Transition probability: P^{a_m}_{s_m s_{m+1}}, the probability of moving from state s_m = (q_m, d_m) to state s_{m+1} = (q_{m+1}, d_{m+1}) under action a_m.
Model
Transition probability
[Figure: the data component of the state takes the values 0, 1, …, d_c - 1 individually; all amounts ≥ d_c (d_c, d_c + 1, …) are aggregated into a single state D_c, for both d_m and d_{m+1}.]
Model
Transition probability
Lemma 1. The joint probability mass function of q_{m+1} and y_{m,m+1} (the amount of new data arriving between decision points m and m+1) is
  P{q_{m+1} = r_i, y_{m,m+1} = k} = µ^k · λ_i / (µ + Σ_{j=1}^n λ_j)^{k+1},
  for i = 1, 2, …, n and k ∈ ℕ.
Model
Transition probability
Lemma 2
  P{q_{m+1} = r_i, d_{m+1} = k_2 | q_m = r_j, d_m = k_1, a_m = 0}
    = µ^{k_2 - k_1} · λ_i / (µ + Σ_{j=1}^n λ_j)^{k_2 - k_1 + 1}   if k_2 ≥ k_1
    = 0                                                           if k_2 < k_1
  where i, j = 1, 2, …, n and k_1, k_2 ∈ {0, 1, 2, …, d_c - 1};
Model
Transition probability
Lemma 2 (continued)
  P{q_{m+1} = r_i, d_{m+1} = D_c | q_m = r_j, d_m = k_1, a_m = 0}
    = (λ_i / Σ_{j=1}^n λ_j) · (µ / (µ + Σ_{j=1}^n λ_j))^{d_c - k_1}
  where i, j = 1, 2, …, n and k_1 ∈ {0, 1, 2, …, d_c - 1}.
Model
Transition probability
Lemma 3
  P{q_{m+1} = r_i, d_{m+1} = k_2 | q_m = r_j, d_m = k_1, a_m = 1}
    = µ^{k_2} · λ_i / (µ + Σ_{j=1}^n λ_j)^{k_2 + 1}
  where i, j = 1, 2, …, n and k_1, k_2 ∈ {0, 1, 2, …, d_c - 1};
Model
Transition probability
Lemma 3 (continued)
  P{q_{m+1} = r_i, d_{m+1} = D_c | q_m = r_j, d_m = k_1, a_m = 1}
    = (λ_i / Σ_{j=1}^n λ_j) · (µ / (µ + Σ_{j=1}^n λ_j))^{d_c}
  where i, j = 1, 2, …, n and k_1 ∈ ℕ.
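Lemmas 1-3 reduce to geometric-type probabilities on the merged Poisson stream of data and request arrivals, so the transition probabilities can be tabulated directly. A sketch under the model's assumptions; the array layout and variable names are my own, not the paper's:

```python
import numpy as np

def transition_probs(mu, lambdas, d_c, action):
    """P{(r_i, d') | (r_j, d), a} per Lemmas 2 (a = 0) and 3 (a = 1).
    The d-index runs over 0, ..., d_c - 1 plus the aggregate state D_c at index d_c.
    Returned array P[j, d, i, d'] does not actually depend on j."""
    lambdas = np.asarray(lambdas, dtype=float)
    total = mu + lambdas.sum()
    n, n_d = len(lambdas), d_c + 1
    P = np.zeros((n, n_d, n, n_d))
    for d in range(n_d):
        start = 0 if action == 1 else d  # a refresh resets the accumulated data to 0
        for d_next in range(start, d_c):
            k = d_next - start  # number of data arrivals before the next request (Lemma 1)
            P[:, d, :, d_next] = (mu ** k) * lambdas / total ** (k + 1)
        # aggregate state D_c: at least d_c - start further data arrivals (Lemmas 2/3, continued)
        P[:, d, :, d_c] = (lambdas / lambdas.sum()) * (mu / total) ** (d_c - start)
    return P

# d_c = 5 keeps the example tiny; the system constraint forces a = 1 once d reaches D_c
P0 = transition_probs(mu=348, lambdas=[1, 1/7, 1/14], d_c=5, action=0)
print(P0[0, 2].sum())  # each (j, d) row sums to 1
```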
Model
Cost function: c(s_m, a_m)
  c(s_m, a_m) = c_m^l   if a_m = 0
  c(s_m, a_m) = c_m^k   if a_m = 1
  c_m^l : cost of knowledge loss at time m
  c_m^k : cost of running KDD at time m (a constant, c_m^k = c^k)
Model
Cost function (continued)
  c_{r_i} : cost incurred if a type r_i request is answered from a fully out-of-date knowledge base (i.e., knowledge loss is 1)
  c_m^l = c_{q_m} · (1 - e^{-(d_m/α)^β})
  That is, for the request type q_m in state s_m = (q_m, d_m), the cost of knowledge loss scales with the estimated loss l_m, from 0 up to c_{q_m}.
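The per-decision cost c(s_m, a_m) can then be written down directly from the two slides above. A minimal sketch; the numeric values mirror the simulation design later in the deck and are used only as an example:

```python
import math

def decision_cost(request_cost, d_m, action, alpha, beta, c_k):
    """c(s_m, a_m): knowledge-loss cost c_{q_m} * (1 - exp(-(d_m/alpha)**beta)) if a_m = 0, KDD cost c^k if a_m = 1."""
    if action == 1:
        return c_k
    return request_cost * (1.0 - math.exp(-(d_m / alpha) ** beta))

# Example: a type r1 request (c_{r1} = 1500) arriving after 2,000 units of new data
print(decision_cost(request_cost=1500, d_m=2000, action=0, alpha=27102.9, beta=0.56, c_k=700))
```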
Model
Knowledge refreshing policy: π = (π_1, π_2, …, π_m, …, π_M), with a_m = π_m(s_m)
Expected system cost under π:
  EC_π = E_π[ Σ_{i=1}^M ( π_i(s_i) · c_i^k + (1 - π_i(s_i)) · c_i^l ) ]
The optimal knowledge refreshing policy π* satisfies EC_{π*} = min_π EC_π
Solution
v_{s_m}(m) : the optimal system cost from time m, with state s_m, to the end of the time horizon.
By the principle of dynamic optimality (Bellman and Dreyfus 1962),
  v_{s_m}(m) = min_{a_m ∈ A} { c(s_m, a_m) + Σ_{s_{m+1} ∈ S} P^{a_m}_{s_m s_{m+1}} · v_{s_{m+1}}(m+1) }    (4)
Solution
The value iteration method
  Input: v_{s_{M+1}}(M+1) = 0
  Solve (4) from m = M to m = 1.
  Output: v_{s_1}(1) and the optimal knowledge refreshing policy π*
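A sketch of that backward pass over (4), written generically over a finite state set, a transition function, and a cost function supplied by the caller; nothing below is specific to the paper beyond the recursion itself, and the system constraint (a_m = 1 once d_m ≥ d_c) would be imposed by restricting the action set for the affected states:

```python
def solve_backward(states, actions, transition, cost, M):
    """Backward induction for (4): v_s(M+1) = 0, then solve from m = M down to m = 1.
    transition(s, a) -> dict {s_next: prob}; cost(s, a) -> float.
    Returns (policy, v) with policy[m][s] -> optimal action and v[s] = v_s(1)."""
    v_next = {s: 0.0 for s in states}  # v_s(M+1) = 0
    policy = {}
    for m in range(M, 0, -1):
        v_curr, policy[m] = {}, {}
        for s in states:
            best_value, best_action = min(
                (cost(s, a) + sum(p * v_next[s2] for s2, p in transition(s, a).items()), a)
                for a in actions
            )
            v_curr[s], policy[m][s] = best_value, best_action
        v_next = v_curr
    return policy, v_next
```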
Implementation: the adaptive heuristic
Exponential smoothing method for estimating the future data arrival rate:
  µ_new = ω · µ_a + (1 - ω) · µ_old    (5)
  µ_new : new estimate for µ
  µ_old : old estimate for µ
  µ_a : actually observed µ
  ω : smoothing constant, 0 ≤ ω ≤ 1
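Equation (5) is the standard exponential-smoothing update; as a one-line sketch (the example values are arbitrary):

```python
def smooth_rate(mu_old: float, mu_observed: float, omega: float) -> float:
    """mu_new = omega * mu_observed + (1 - omega) * mu_old, per (5)."""
    return omega * mu_observed + (1.0 - omega) * mu_old

print(smooth_rate(mu_old=348.0, mu_observed=360.0, omega=0.3))  # 351.6
```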
Input:  duration of a forward-looking period, T
        current system state, s
        smoothing constant, ω
        other system parameters
Output: action a ∈ A

Estimate the current knowledge loss l using (2)
If l ≥ lc
    a = 1
    Return a
Else
    Estimate µ_new using (5)
    Estimate λ_i^new using (6) for i = 1, 2, …, n
    Approximate M as M = T · Σ_{i=1}^n λ_i^new
    Derive a_1 by solving (4) from m = M to m = 1, given s_1 = s
    a = a_1
    Return a
End if
Numerical Analysis
Simulation design
  α = 27102.9, β = 0.56, µ = 348
  r1: λ1 = 1,    c_{r1} = 1500
  r2: λ2 = 1/7,  c_{r2} = 4500
  r3: λ3 = 1/14, c_{r3} = 18000
  c^k = 700 (= 20*35)
Numerical Analysis
Robustness analysis
  Generating the optimal knowledge refreshing policy under the Poisson arrival assumption;
  Implementing the policy in a simulation experiment with exponentially distributed interarrival times of data and requests;
  Implementing the policy in a simulation experiment with non-exponentially distributed interarrival times of data and requests.
Numerical Analysis
Robustness analysis
Non-exponential distributions selected:
  Erlang-4 (coefficient of variation (CV) = 0.5)
  Uniform (CV = 0.58)
  Erlang-2 (CV = 0.71)
  Hyperexponential (CV = 1.41)
Numerical Analysis
Robustness analysis
T = 180
lc = 0.2
=1
  Exponential (cost):       62652.88
  Erlang-4 (cost):          63528.29   (1.4%)
  Uniform (cost):           63168.46   (0.8%)
  Erlang-2 (cost):          63144.24   (0.8%)
  Hyperexponential (cost):  63247.58   (1.0%)
Numerical Analysis
Effectiveness analysis: the optimal policy
Benchmark policy: fixed-interval knowledge refreshing policy
T = 180
lc = 0.2
=1
  Optimal policy (cost):               62652.88    Cost saving
  Fixed-interval policy (worst case):  153185.53   59.1%
  Fixed-interval policy (best case):   80840.20    22.5%
  Fixed-interval policy (average):     95362.07    34.3%
Numerical Analysis
Effectiveness analysis: the optimal policy
  T     lc            Cost saving (%)
  180   0.2     1     22.5
  90    0.2     1     19.9
  360   0.2     1     24.0
  180   0.15    1     15.0
  180   0.25    1     23.2
  180   0.2     0.5   24.3
  180   0.2     2     2.1
Numerical Analysis
Effectiveness analysis: the optimal policy
  lc      Cost saving (%)    Percentage of KDD runs enforced by the system constraint (%)
  0.05    0.1                98.5
  0.1     5.3                88.1
  0.15    15.0               64.4
  0.2     22.5               25.6
  0.25    23.2               0.1
  0.3     24.0               0
  0.35    24.0               0
Numerical Analysis
Effectiveness analysis: the optimal policy
[Figure: cost saving (y-axis, 0 to 0.3) plotted against the varied parameter (x-axis, 0 to 2.5), with "boundary 1" and "boundary 2" marked.]
Numerical Analysis
Effectiveness analysis: the adaptive heuristic
[Figure: a changing data arrival rate that switches every 180 time units (360, 240, 240, …); the average data arrival rate is 280.]
Numerical Analysis
Effectiveness analysis: the adaptive heuristic
        lc      EC_h         EC_f         Cost saving (%)
  1     0.2     59773.85     70578.11     15.3
  1     0.15    60623.07     67012.67     9.5
  1     0.25    59451.29     71671.23     17.1
  0.5   0.2     114563.70    140216.50    18.3
  2     0.2     24166.00     24485.87     1.3
Contributions
This research is a first study on knowledge refreshing, a ubiquitous problem faced by every KDD application.
We introduce the concept of knowledge loss. We also propose how to measure and estimate knowledge loss.
We propose a general model for knowledge refreshing and derive from the model an optimal policy and an adaptive heuristic for knowledge refreshing.
Contributions
The measurement and estimating functions of knowledge loss can be employed to assess the quality of a knowledge base.
The adaptive heuristic is readily applicable for solving real-world knowledge refreshing problems.
Future Research
Theoretical foundations for the estimating function of knowledge loss.
Multiple system constraints.
Boundary conditions under which a best-performing fixed-interval policy is as good as an optimal policy.
Comments and Suggestions