When Is the Right Time to Refresh
Knowledge Discovered from Data?
Xiao (“Shaw”) Fang
University of Toledo
Olivia R. Liu Sheng
University of Utah
Introduction
[Figure: The KDD (Knowledge Discovery in Databases) process: Data Source → Data Preprocessing → Processed Data → Data Mining → Patterns → Pattern Postprocessing → Knowledge]
Introduction
Prior KDD research focused on
  efficiency and effectiveness improvement of the KDD process;
  innovative applications of KDD.
Prior KDD research overlooked the problem of maintaining currency of knowledge over an evolving data source.
Introduction
[Figure: KDD is run over the data source at time t1 to produce knowledge; new data arrive by time t2, raising the question of whether KDD should be run again.]
The problem of maintaining currency of knowledge over an evolving data source is nontrivial.
Introduction
The problem is fundamental: it impacts every KDD application.
The problem is also critical in practice (Cooper and Giuffrida 2000; King et al. 2002).
Introduction
Knowledge refreshing is a nontrivial process of keeping knowledge discovered using KDD up to date with its dynamic data source.
Related Work
Incremental data mining
Data stream mining
Related analytical research on databases
Related Work
[Figure: representative prior work by data mining model, before and after 2000]
  Association:     Cheung et al. (1996)   |   Manku and Motwani (2002)
  Classification:  Utgoff (1989)          |   Hulten et al. (2001)
  Clustering:      Can (1993)             |   Guha et al. (2003)
Related Work
[Figure: the KDD process: Data Source → Data Preprocessing → Processed Data → Data Mining → Patterns → Pattern Postprocessing → Knowledge]
Related Work
Research gaps:
  how to refresh patterns vs. when to refresh knowledge
  one step in the KDD process vs. the complete KDD process
  one particular data mining model vs. all three major data mining models
  computational efficiency vs. effective decision making
Related Work
This research studies when to refresh knowledge so as to optimize the trade-off between the loss of knowledge and the cost incurred by running KDD.
Related Work
Previous analytical research on the design and implementation of databases
  Chandy et al. (1975)
  Park et al. (1990)
  Segev and Fang (1991)
Related Work
This research is a first study on knowledge refreshing for the KDD process.
The new problem context requires the introduction of new concepts and new model components and, consequently, a new model structure.
Research Questions
Knowledge loss: what is it? how to measure
and estimate it?
The knowledge refreshing problem:
definition and model?
Solution and implementation?
How robust and effective is the solution?
Knowledge Loss: Definition
Knowledge loss refers to the phenomenon that knowledge discovered by a previous run of KDD gradually becomes obsolete as new data are continuously added.
  Type I: part or all of the earlier discovered knowledge becomes invalid due to incoming new data;
  Type II: new knowledge is brought in by incoming new data.
Knowledge Loss: Measurement
[Figure: KDD is run at time t-s, yielding knowledge K_{t-s}; at time t the current knowledge is K_t. Type I knowledge loss: knowledge in K_{t-s} that is no longer in K_t. Type II knowledge loss: knowledge in K_t that is not in K_{t-s}. Valid knowledge: K_{t-s} ∩ K_t.]
  l_t = 1 - |K_{t-s} ∩ K_t| / |K_{t-s} ∪ K_t|    (1)
  l_t ∈ [0, 1]
  l_t : knowledge loss at time t
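Since the knowledge discovered by any of the three major data mining models can be represented as a set (next slide), the measure in (1) reduces to plain set operations. A minimal Python sketch; the function name and the example itemsets are illustrative, not from the paper:

```python
def knowledge_loss(k_old: set, k_new: set) -> float:
    """Knowledge loss per (1): 1 - |K_old intersect K_new| / |K_old union K_new|."""
    union = k_old | k_new
    if not union:
        return 0.0  # convention: two empty knowledge sets mean no loss
    return 1.0 - len(k_old & k_new) / len(union)

# Hypothetical frequent itemsets discovered at time t-s and at time t
k_t_minus_s = {("milk", "bread"), ("beer", "diapers"), ("milk",)}
k_t = {("milk", "bread"), ("milk",), ("bread", "eggs")}
print(knowledge_loss(k_t_minus_s, k_t))  # 0.5: two shared itemsets out of four distinct ones
```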
Knowledge Loss: Measurement
Knowledge discovered by a KDD process employing any of the three major data mining models can be represented as a set.
  Association: a set of itemsets or a set of association rules (Agrawal and Srikant 1994)
  Classification: a set of classification rules (Quinlan 1993)
  Clustering: a set of cluster centers (Jain et al. 1999)
Knowledge Loss: Estimation
[Figure: KDD is run at time t-s; d_t denotes the amount of new data accumulated between t-s and t.]
  d_t : amount of new data accumulated by time t
  Question: l_t = f(d_t)?
Knowledge Loss: Estimation
A two-parameter Weibull function for estimating knowledge loss:
  l_t = 1 - e^{-(d_t/α)^β} + ε_t    (2)
  α, β : parameters
  ε_t : random error term, E[ε_t] = 0, Var(ε_t) = σ²
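Assuming observed (d_t, l_t) pairs are available from experiments like the ones below, α and β can be estimated by nonlinear least squares. A sketch using scipy.optimize.curve_fit; the observations here are made-up numbers, not the paper's data:

```python
import numpy as np
from scipy.optimize import curve_fit

def weibull_loss(d, alpha, beta):
    # Two-parameter Weibull knowledge-loss curve from (2), without the error term
    return 1.0 - np.exp(-(d / alpha) ** beta)

# Hypothetical observations: amount of new data vs. observed knowledge loss
d_obs = np.array([1000, 5000, 10000, 20000, 40000, 60000], dtype=float)
l_obs = np.array([0.14, 0.32, 0.44, 0.57, 0.71, 0.79])

(alpha_hat, beta_hat), _ = curve_fit(weibull_loss, d_obs, l_obs,
                                     p0=(30000.0, 0.6), bounds=(0, np.inf))
print(alpha_hat, beta_hat)
```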
Knowledge Loss: Estimation
Motivation: the Weibull function models product degradation in reliability engineering (Murthy et al. 2003); the analogous phenomenon here is knowledge degradation in KDD.
Validation: empirical experiments using real-world data
Knowledge Loss: Estimation
Experimental design
[Figure: run KDD on the base data to obtain the base knowledge; run KDD on the base data plus the incremental data to obtain the new knowledge; apply (1) to the two knowledge sets to obtain knowledge loss as a function of the amount of new data.]
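A sketch of that experimental loop; mine_patterns stands in for whichever mining algorithm is used (association, classification, or clustering) and loss_fn is the measure in (1). Both are assumed to be supplied by the caller:

```python
def observe_loss_curve(base_data, new_data, increment, mine_patterns, loss_fn):
    """Re-mine after each increment of new data and record (amount of new data, knowledge loss)."""
    base_knowledge = mine_patterns(base_data)
    observations = []
    for end in range(increment, len(new_data) + 1, increment):
        incremental_knowledge = mine_patterns(base_data + new_data[:end])
        observations.append((end, loss_fn(base_knowledge, incremental_knowledge)))
    return observations
```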
Knowledge Loss: Estimation
Experiment set 1
  Panel data: 63,999 online transactions
  Knowledge discovered: co-purchasing knowledge
  Data mining model: Association
Knowledge Loss: Estimation
Weibull estimation: Experiment 1-3
  Base data: 12,500 transactions
  Increment: 100 transactions
  Support: 0.1%
  Number of observations: 515
  Range of knowledge loss: 0-0.70
  Regression results: α = 37911.0, β = 0.60, σ² = 0.0009, R² = 0.99
[Figure: observed knowledge loss versus the amount of new data (# of transactions), with the fitted curve l_t = 1 - e^{-(d_t/α)^β} + ε_t.]
Knowledge Loss: Estimation
  Experiment   Base data   Increment   Support   Num. of observations   Range of knowledge loss
  1-1          10,000      100         0.1%      540                    0-0.77
  1-2          7,500       100         0.1%      565                    0.08-0.77
  1-3          12,500      100         0.1%      515                    0-0.70
  1-4          10,000      50          0.1%      1080                   0-0.77
  1-5          10,000      200         0.1%      270                    0-0.77
  1-6          10,000      100         0.075%    540                    0-0.69
  1-7          10,000      100         0.125%    540                    0-0.73
Knowledge Loss: Estimation
  Experiment   α         β      σ²
  1-1          27102.9   0.56   0.001
  1-2          30710.9   0.51   0.001
  1-3          37911.0   0.60   0.0009
  1-4          27123.2   0.56   0.001
  1-5          27109.3   0.55   0.001
  1-6          36676.3   0.49   0.0007
  1-7          37272.0   0.67   0.002
Knowledge Loss: Estimation
Experiment set 2
  Census data (publicly available at the UCI repository for machine learning research): 45,222 records
  Knowledge discovered: rules for predicting income level based on attributes such as age and education level
  Data mining model: Classification
Knowledge Loss: Estimation
Weibull estimation: Experiment 2-2
  Base data: 30,000 records
  Increment: 50 records
  Number of observations: 305
  Range of knowledge loss: 0.02-0.93
  Regression results: α = 1961.5, β = 0.58, σ² = 0.001, R² = 0.99
[Figure: observed knowledge loss versus the amount of new data (# of records), with the fitted curve l_t = 1 - e^{-(d_t/α)^β} + ε_t.]
The Knowledge Refreshing Problem
[Figure: timeline starting at time 0. The data source evolves continuously; each KDD run (incurring the cost of running KDD) rebuilds the knowledge base; requests submitted to the knowledge base between runs incur the cost of knowledge loss.]
The Knowledge Refreshing Problem
System constraint: when a request is submitted to the knowledge base, a KDD run is required if the current knowledge loss exceeds l_c.
The knowledge refreshing problem
  Determine when to run KDD over a time horizon.
  Objective: minimize the system cost incurred over the time horizon, subject to the system constraint.
Model
Assumptions
  Arrival of new data: follows a Poisson process with intensity µ.
  There are n different types of requests: r_i, i = 1, 2, …, n, n ≥ 1.
  Arrival of type i requests: follows a Poisson process with intensity λ_i, i = 1, 2, …, n.
Model
Decision points: moments when a request arrives.
System time: 0, 1, 2, …, m, …, M
  M : total number of decision points
  m : system time of the mth decision point, 1 ≤ m ≤ M
Action: a_m ∈ {0, 1}
  0 : not running KDD
  1 : running KDD
Model
System state: s_m = (q_m, d_m)
  q_m : type of the request arriving at time m, q_m ∈ {r_i}, i = 1, 2, …, n
  d_m : amount of new data accumulated by time m, d_m ∈ ℕ, where ℕ denotes the nonnegative integers
Model
System constraint
  If l_m ≥ l_c then a_m = 1, where l_m denotes the knowledge loss at time m.
  Applying (2): if 1 - e^{-(d_m/α)^β} ≥ l_c then a_m = 1,
  equivalently: if d_m ≥ d_c then a_m = 1, where d_c = α · (ln(1/(1 - l_c)))^{1/β}.
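Inverting (2) at l_c gives the data threshold d_c directly; a small sketch (the α and β values are taken from experiment 1-1 above, purely for illustration):

```python
import math

def data_threshold(alpha: float, beta: float, l_c: float) -> float:
    """d_c = alpha * (ln(1/(1 - l_c)))**(1/beta): the amount of new data at which estimated loss reaches l_c."""
    return alpha * math.log(1.0 / (1.0 - l_c)) ** (1.0 / beta)

print(data_threshold(alpha=27102.9, beta=0.56, l_c=0.2))  # roughly 1861 new transactions
```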
Model
Transition probability: P^{a_m}_{s_m s_{m+1}}, the probability of moving from state s_m = (q_m, d_m) to state s_{m+1} = (q_{m+1}, d_{m+1}) under action a_m.
Model
Transition probability
[Figure: the data component of the state takes the values 0, 1, …, d_c - 1 individually; all amounts ≥ d_c (d_c, d_c + 1, …) are aggregated into a single state D_c, for both d_m and d_{m+1}.]
Model
Transition probability
Lemma 1. The joint probability mass function of q_{m+1} and y_{m,m+1} (the amount of new data arriving between decision points m and m+1) is
  P{q_{m+1} = r_i, y_{m,m+1} = k} = µ^k · λ_i / (µ + Σ_{j=1}^n λ_j)^{k+1},
  for i = 1, 2, …, n and k ∈ ℕ.
Model
Transition probability
Lemma 2
  P{q_{m+1} = r_i, d_{m+1} = k_2 | q_m = r_j, d_m = k_1, a_m = 0}
    = µ^{k_2 - k_1} · λ_i / (µ + Σ_{j=1}^n λ_j)^{k_2 - k_1 + 1}   if k_2 ≥ k_1
    = 0                                                           if k_2 < k_1
  where i, j = 1, 2, …, n and k_1, k_2 ∈ {0, 1, 2, …, d_c - 1};
Model
Transition probability
Lemma 2 (continued)
  P{q_{m+1} = r_i, d_{m+1} = D_c | q_m = r_j, d_m = k_1, a_m = 0}
    = (λ_i / Σ_{j=1}^n λ_j) · (µ / (µ + Σ_{j=1}^n λ_j))^{d_c - k_1}
  where i, j = 1, 2, …, n and k_1 ∈ {0, 1, 2, …, d_c - 1}.
Model
Transition probability
Lemma 3
  P{q_{m+1} = r_i, d_{m+1} = k_2 | q_m = r_j, d_m = k_1, a_m = 1}
    = µ^{k_2} · λ_i / (µ + Σ_{j=1}^n λ_j)^{k_2 + 1}
  where i, j = 1, 2, …, n and k_1, k_2 ∈ {0, 1, 2, …, d_c - 1};
Model
Transition probability
Lemma 3 (continued)
  P{q_{m+1} = r_i, d_{m+1} = D_c | q_m = r_j, d_m = k_1, a_m = 1}
    = (λ_i / Σ_{j=1}^n λ_j) · (µ / (µ + Σ_{j=1}^n λ_j))^{d_c}
  where i, j = 1, 2, …, n and k_1 ∈ ℕ.
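Lemmas 1-3 reduce to geometric-type probabilities on the merged Poisson stream of data and request arrivals, so the transition probabilities can be tabulated directly. A sketch under the model's assumptions; the array layout and variable names are my own, not the paper's:

```python
import numpy as np

def transition_probs(mu, lambdas, d_c, action):
    """P{(r_i, d') | (r_j, d), a} per Lemmas 2 (a = 0) and 3 (a = 1).
    The d-index runs over 0, ..., d_c - 1 plus the aggregate state D_c at index d_c.
    Returned array P[j, d, i, d'] does not actually depend on j."""
    lambdas = np.asarray(lambdas, dtype=float)
    total = mu + lambdas.sum()
    n, n_d = len(lambdas), d_c + 1
    P = np.zeros((n, n_d, n, n_d))
    for d in range(n_d):
        start = 0 if action == 1 else d  # a refresh resets the accumulated data to 0
        for d_next in range(start, d_c):
            k = d_next - start  # number of data arrivals before the next request (Lemma 1)
            P[:, d, :, d_next] = (mu ** k) * lambdas / total ** (k + 1)
        # aggregate state D_c: at least d_c - start further data arrivals (Lemmas 2/3, continued)
        P[:, d, :, d_c] = (lambdas / lambdas.sum()) * (mu / total) ** (d_c - start)
    return P

# d_c = 5 keeps the example tiny; the system constraint forces a = 1 once d reaches D_c
P0 = transition_probs(mu=348, lambdas=[1, 1/7, 1/14], d_c=5, action=0)
print(P0[0, 2].sum())  # each (j, d) row sums to 1
```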
Model
Cost function: c(s_m, a_m)
  c(s_m, a_m) = c_m^l   if a_m = 0
  c(s_m, a_m) = c_m^k   if a_m = 1
  c_m^l : cost of knowledge loss at time m
  c_m^k : cost of running KDD at time m (a constant, c_m^k = c^k)
Model
Cost function (continued)
  c_{r_i} : cost incurred if a type r_i request is answered from a fully out-of-date knowledge base (i.e., knowledge loss is 1)
  c_m^l = c_{q_m} · (1 - e^{-(d_m/α)^β})
  That is, for the request type q_m in state s_m = (q_m, d_m), the cost of knowledge loss scales with the estimated loss l_m, from 0 up to c_{q_m}.
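The per-decision cost c(s_m, a_m) can then be written down directly from the two slides above. A minimal sketch; the numeric values mirror the simulation design later in the deck and are used only as an example:

```python
import math

def decision_cost(request_cost, d_m, action, alpha, beta, c_k):
    """c(s_m, a_m): knowledge-loss cost c_{q_m} * (1 - exp(-(d_m/alpha)**beta)) if a_m = 0, KDD cost c^k if a_m = 1."""
    if action == 1:
        return c_k
    return request_cost * (1.0 - math.exp(-(d_m / alpha) ** beta))

# Example: a type r1 request (c_{r1} = 1500) arriving after 2,000 units of new data
print(decision_cost(request_cost=1500, d_m=2000, action=0, alpha=27102.9, beta=0.56, c_k=700))
```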
Model
Knowledge refreshing policy: π = (π_1, π_2, …, π_m, …, π_M), with a_m = π_m(s_m)
Expected system cost under π:
  EC_π = E_π[ Σ_{i=1}^M ( π_i(s_i) · c_i^k + (1 - π_i(s_i)) · c_i^l ) ]
The optimal knowledge refreshing policy π* satisfies EC_{π*} = min_π EC_π
Solution
v_{s_m}(m) : the optimal system cost from time m, with state s_m, to the end of the time horizon.
By the principle of dynamic optimality (Bellman and Dreyfus 1962),
  v_{s_m}(m) = min_{a_m ∈ A} { c(s_m, a_m) + Σ_{s_{m+1} ∈ S} P^{a_m}_{s_m s_{m+1}} · v_{s_{m+1}}(m+1) }    (4)
Solution
The value iteration method
  Input: v_{s_{M+1}}(M+1) = 0
  Solve (4) from m = M to m = 1.
  Output: v_{s_1}(1) and the optimal knowledge refreshing policy π*
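A sketch of that backward pass over (4), written generically over a finite state set, a transition function, and a cost function supplied by the caller; nothing below is specific to the paper beyond the recursion itself, and the system constraint (a_m = 1 once d_m ≥ d_c) would be imposed by restricting the action set for the affected states:

```python
def solve_backward(states, actions, transition, cost, M):
    """Backward induction for (4): v_s(M+1) = 0, then solve from m = M down to m = 1.
    transition(s, a) -> dict {s_next: prob}; cost(s, a) -> float.
    Returns (policy, v) with policy[m][s] -> optimal action and v[s] = v_s(1)."""
    v_next = {s: 0.0 for s in states}  # v_s(M+1) = 0
    policy = {}
    for m in range(M, 0, -1):
        v_curr, policy[m] = {}, {}
        for s in states:
            best_value, best_action = min(
                (cost(s, a) + sum(p * v_next[s2] for s2, p in transition(s, a).items()), a)
                for a in actions
            )
            v_curr[s], policy[m][s] = best_value, best_action
        v_next = v_curr
    return policy, v_next
```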
Implementation: the adaptive heuristic
Exponential smoothing method for estimating the future data arrival rate:
  µ_new = ω · µ_a + (1 - ω) · µ_old    (5)
  µ_new : new estimate for µ
  µ_old : old estimate for µ
  µ_a : actually observed µ
  ω : smoothing constant, 0 ≤ ω ≤ 1
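Equation (5) is the standard exponential-smoothing update; as a one-line sketch (the example values are arbitrary):

```python
def smooth_rate(mu_old: float, mu_observed: float, omega: float) -> float:
    """mu_new = omega * mu_observed + (1 - omega) * mu_old, per (5)."""
    return omega * mu_observed + (1.0 - omega) * mu_old

print(smooth_rate(mu_old=348.0, mu_observed=360.0, omega=0.3))  # 351.6
```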
Input:  duration of a forward-looking period, T
        current system state, s
        smoothing constant, ω
        other system parameters
Output: action a ∈ A

Estimate the current knowledge loss l using (2)
If l ≥ lc
    a = 1
    Return a
Else
    Estimate µ_new using (5)
    Estimate λ_i^new using (6) for i = 1, 2, …, n
    Approximate M as M = T · Σ_{i=1}^n λ_i^new
    Derive a_1 by solving (4) from m = M to m = 1, given s_1 = s
    a = a_1
    Return a
End if
Numerical Analysis
Simulation design
  α = 27102.9, β = 0.56, µ = 348
  r1: λ1 = 1,    c_{r1} = 1500
  r2: λ2 = 1/7,  c_{r2} = 4500
  r3: λ3 = 1/14, c_{r3} = 18000
  c^k = 700 (= 20*35)
Numerical Analysis
Robustness analysis
  Generating the optimal knowledge refreshing policy under the Poisson arrival assumption;
  Implementing the policy in a simulation experiment with exponentially distributed interarrival times of data and requests;
  Implementing the policy in a simulation experiment with non-exponentially distributed interarrival times of data and requests.
Numerical Analysis
Robustness analysis
Non-exponential distributions selected:
  Erlang-4 (coefficient of variation (CV) = 0.5)
  Uniform (CV = 0.58)
  Erlang-2 (CV = 0.71)
  Hyperexponential (CV = 1.41)
Numerical Analysis
Robustness analysis
T = 180
lc = 0.2
=1
  Exponential (cost):       62652.88
  Erlang-4 (cost):          63528.29   (1.4%)
  Uniform (cost):           63168.46   (0.8%)
  Erlang-2 (cost):          63144.24   (0.8%)
  Hyperexponential (cost):  63247.58   (1.0%)
Numerical Analysis
Effectiveness analysis: the optimal policy
Benchmark policy: fixed-interval knowledge refreshing policy
T = 180
lc = 0.2
=1
  Optimal policy (cost):               62652.88    Cost saving
  Fixed-interval policy (worst case):  153185.53   59.1%
  Fixed-interval policy (best case):   80840.20    22.5%
  Fixed-interval policy (average):     95362.07    34.3%
Numerical Analysis
Effectiveness analysis: the optimal policy
  T     lc            Cost saving (%)
  180   0.2     1     22.5
  90    0.2     1     19.9
  360   0.2     1     24.0
  180   0.15    1     15.0
  180   0.25    1     23.2
  180   0.2     0.5   24.3
  180   0.2     2     2.1
Numerical Analysis
Effectiveness analysis: the optimal policy
  lc      Cost saving (%)    Percentage of KDD runs enforced by the system constraint (%)
  0.05    0.1                98.5
  0.1     5.3                88.1
  0.15    15.0               64.4
  0.2     22.5               25.6
  0.25    23.2               0.1
  0.3     24.0               0
  0.35    24.0               0
Numerical Analysis
Effectiveness analysis: the optimal policy
[Figure: cost saving (y-axis, 0 to 0.3) plotted against the varied parameter (x-axis, 0 to 2.5), with "boundary 1" and "boundary 2" marked.]
Numerical Analysis
Effectiveness analysis: the adaptive heuristic
[Figure: a changing data arrival rate that switches every 180 time units (360, 240, 240, …); the average data arrival rate is 280.]
Numerical Analysis
Effectiveness analysis: the adaptive heuristic
        lc      EC_h         EC_f         Cost saving (%)
  1     0.2     59773.85     70578.11     15.3
  1     0.15    60623.07     67012.67     9.5
  1     0.25    59451.29     71671.23     17.1
  0.5   0.2     114563.70    140216.50    18.3
  2     0.2     24166.00     24485.87     1.3
Contributions
This research is a first study on knowledge refreshing, a ubiquitous problem faced by every KDD application.
We introduce the concept of knowledge loss. We also propose how to measure and estimate knowledge loss.
We propose a general model for knowledge refreshing and derive from the model an optimal policy and an adaptive heuristic for knowledge refreshing.
Contributions
The measurement and estimating functions of knowledge loss can be employed to assess the quality of a knowledge base.
The adaptive heuristic is readily applicable for solving real-world knowledge refreshing problems.
Future Research
Theoretical foundations for the estimating function of knowledge loss.
Multiple system constraints.
Boundary conditions under which a best-performing fixed-interval policy is as good as an optimal policy.
Comments and Suggestions