Optimization Methods
in Data Mining
Fall 2004
Overview
Optimization methods used in data mining:
- Mathematical Programming → Support Vector Machines → classification, clustering, etc.
- Steepest Descent Search → Neural Nets, Bayesian Networks (optimize parameters)
- Combinatorial Optimization → Genetic Algorithm → feature selection, classification, clustering
What is Optimization?
- Formulation
  - Decision variables
  - Objective function
  - Constraints
- Solution
  - Iterative algorithm
  - Improving search
(diagram: Problem → Formulation → Model → Algorithm → Solution)
Combinatorial Optimization
- Finitely many solutions to choose from
  - Select the best rule from a finite set of rules
  - Select the best subset of attributes
- Usually too many solutions to consider them all
- Solution approaches
  - Branch-and-bound (better than Weka's exhaustive search)
  - Random search
Random Search
- Select an initial solution x^(0) and let k = 0
- Loop:
  - Consider the neighbors N(x^(k)) of x^(k)
  - Select a candidate x' from N(x^(k))
  - Check the acceptance criterion
  - If accepted, let x^(k+1) = x'; otherwise let x^(k+1) = x^(k)
- Until the stopping criterion is satisfied (see the sketch below)
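A minimal sketch of this loop in Python (not from the slides); the neighborhood function, acceptance rule, and objective are placeholders you would supply for your own problem:

```python
import random

def random_search(x0, neighbors, accept, objective, max_iters=1000):
    """Generic random/local search: repeatedly propose a neighbor and
    keep it whenever the acceptance criterion says so."""
    x, fx = x0, objective(x0)
    for _ in range(max_iters):
        candidate = random.choice(neighbors(x))    # select x' from N(x^(k))
        f_cand = objective(candidate)
        if accept(f_cand, fx):                     # e.g. accept if not worse
            x, fx = candidate, f_cand
    return x, fx

# Example: maximize the number of 1s in a bit string of length 8
flip = lambda bits, i: bits[:i] + [1 - bits[i]] + bits[i + 1:]
neighbors = lambda bits: [flip(bits, i) for i in range(len(bits))]
best, value = random_search([0] * 8, neighbors,
                            accept=lambda new, old: new >= old,
                            objective=sum)
```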
Common Algorithms
- Simulated Annealing (SA)
  - Idea: accept inferior solutions with a given probability that decreases as time goes on
- Tabu Search (TS)
  - Idea: restrict the neighborhood with a list of solutions that are tabu (that is, cannot be visited) because they were visited recently
- Genetic Algorithm (GA)
  - Idea: neighborhoods based on 'genetic similarity'
  - Most used in data mining applications
Genetic Algorithms
- Maintain a population of solutions rather than a single solution
- Members of the population have a certain fitness (usually just the objective)
- Survival of the fittest through
  - selection
  - crossover
  - mutation
GA Formulation
- Use binary strings (or bits) to encode solutions, e.g.:
  011010010
- Terminology
  - Chromosome = solution
  - Parent chromosomes
  - Children or offspring
Problems Solved
Data mining problems that have been addressed using Genetic Algorithms:
- Classification
- Attribute selection
- Clustering
Classification Example
Attribute-value encoding for the weather data:
- Outlook: Sunny = 100, Overcast = 010, Rainy = 001
- Windy: Yes = 10, No = 01
Representing a Rule
- "If windy=yes then play=yes" is encoded as
  outlook = 111, windy = 10, play = 10
  (outlook = 111 covers every outlook value, i.e., outlook does not restrict the rule)
- "If outlook=overcast and windy=yes then play=no" is encoded as
  outlook = 010, windy = 10, play = 01
Single-Point Crossover
Parents:
  outlook = 111, windy = 10, play = 10
  outlook = 010, windy = 01, play = 01
Crossover point after the outlook bits. Offspring:
  outlook = 111, windy = 01, play = 01
  outlook = 010, windy = 10, play = 10
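A small illustration of single-point crossover on bit strings (my own sketch, not from the slides), with the crossover point chosen at random:

```python
import random

def single_point_crossover(parent1, parent2):
    """Swap the tails of two equal-length bit strings at a random point."""
    point = random.randint(1, len(parent1) - 1)   # crossover point
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

p1, p2 = "1111010", "0100101"   # outlook|windy|play bits concatenated
print(single_point_crossover(p1, p2))
```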
Two-Point Crossover
Parents:
  outlook = 111, windy = 10, play = 10
  outlook = 010, windy = 01, play = 01
Crossover points before and after the windy bits, so the middle segment is swapped. Offspring:
  outlook = 111, windy = 01, play = 10
  outlook = 010, windy = 10, play = 01
Uniform Crossover
Parents:
  outlook = 111, windy = 10, play = 10
  outlook = 010, windy = 01, play = 01
Each bit position is exchanged independently at random. Offspring (one possible outcome):
  outlook = 110, windy = 01, play = 00
  outlook = 011, windy = 10, play = 11
Problem?
Mutation
Parent:
  outlook = 010, windy = 01, play = 01
Offspring (one mutated bit in windy):
  outlook = 010, windy = 11, play = 01
Selection
Which strings in the population should be operated on?
- Rank and select the n fittest ones
- Assign probabilities according to fitness and select probabilistically, say
  $P[\text{select } x_i] = \frac{\text{Fitness}(x_i)}{\sum_j \text{Fitness}(x_j)}$
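A sketch of this fitness-proportional (roulette-wheel) selection rule in Python (not from the slides):

```python
import random

def roulette_select(population, fitness):
    """Fitness-proportional selection: P[select x_i] = f(x_i) / sum_j f(x_j)."""
    scores = [fitness(x) for x in population]
    r = random.uniform(0, sum(scores))
    running = 0.0
    for x, s in zip(population, scores):
        running += s
        if running >= r:
            return x
    return population[-1]   # guard against floating-point round-off
```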
Creating a New Population
Create a population P_new with p individuals:
- Survival
  - Allow individuals from the old population to survive intact
  - Rate: (1 - r)% of the population
  - How to select the individuals that survive: deterministic/random
- Crossover
  - Select fit individuals and create new ones
  - Rate: r% of the population. How to select?
- Mutation
  - Slightly modify any of the above individuals
  - Mutation rate: m
  - Fixed number of mutations versus probabilistic mutations
GA Algorithm
- Randomly generate an initial population P
- Evaluate the fitness f(x_i) of each individual in P
- Repeat:
  - Survival: probabilistically select (1 - r)p individuals from P and add them to P_new, according to
    $P[\text{select } x_i] = \frac{f(x_i)}{\sum_j f(x_j)}$
  - Crossover: probabilistically select rp/2 pairs from P, apply the crossover operator, and add the offspring to P_new
  - Mutation: uniformly choose m percent of the members and invert one randomly selected bit in each
  - Update: P ← P_new
  - Evaluate: compute the fitness f(x_i) of each individual in P
- Return the fittest individual from P
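Putting the pieces together, a compact GA sketch over bit strings (my own code, reusing the roulette_select and single_point_crossover helpers sketched earlier; parameter values are arbitrary examples):

```python
import random

def genetic_algorithm(fitness, length, pop_size=20, r=0.6, m=0.05, generations=50):
    """Plain GA over bit strings, following the loop structure on the slide."""
    pop = ["".join(random.choice("01") for _ in range(length)) for _ in range(pop_size)]
    for _ in range(generations):
        # Survival: keep (1 - r)p individuals, selected proportionally to fitness
        new_pop = [roulette_select(pop, fitness) for _ in range(int((1 - r) * pop_size))]
        # Crossover: fill the rest of the population with offspring of selected pairs
        while len(new_pop) < pop_size:
            c1, c2 = single_point_crossover(roulette_select(pop, fitness),
                                            roulette_select(pop, fitness))
            new_pop.extend([c1, c2][: pop_size - len(new_pop)])
        # Mutation: flip one random bit in roughly m percent of the members
        for i in range(pop_size):
            if random.random() < m:
                j = random.randrange(length)
                s = new_pop[i]
                new_pop[i] = s[:j] + ("1" if s[j] == "0" else "0") + s[j + 1:]
        pop = new_pop
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=lambda s: s.count("1"), length=10)
```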
Analysis of GA: Schemas
Does GA converge?
Does GA move towards a good solution?
Local optima?
Holland (1975): Analysis based on schemas
Schema: string combination of 0s, 1s, *s
Example: 0*10 represents {0010,0110}
The Schema Theorem
(all the theory on one slide)

$E[m(s, t+1)] \geq m(s,t) \, \frac{\hat{u}(s,t)}{\bar{f}(t)} \left[ 1 - p_c \frac{d(s)}{l-1} \right] (1 - p_m)^{o(s)}$

where
- m(s, t) = number of instances of schema s at time t
- $\hat{u}(s,t)$ = average fitness of individuals in schema s at time t
- $\bar{f}(t)$ = average fitness of the population at time t
- d(s) = distance between the defined bits in s
- o(s) = number of defined bits in schema s
- $p_c$ = probability of crossover, $p_m$ = probability of mutation
- l = length of the strings
Interpretation
- Fit schemas grow in influence
- What is missing?
  - Crossover?
  - Mutation?
  - How about time t+1?
- Other approaches:
  - Markov chains
  - Statistical mechanics
GA for Feature Selection
- Feature selection: select a subset of attributes (features)
- Reason: too many attributes, redundant or irrelevant attributes
- The set of all subsets of attributes is very large
- Little structure to search
- Random search methods
Encoding
- Need a bit-code representation
- Have some n attributes
- Each attribute is either in (1) or out (0) of the selected set
- Example: outlook = 1, temperature = 0, humidity = 1, windy = 0
Fitness
- Wrapper approach
  - Apply a learning algorithm, say a decision tree, to the individual x = {outlook, humidity}
  - Let fitness equal the error rate (minimize)
- Filter approach
  - Let fitness equal the entropy (minimize)
  - Other diversity measures can also be used
  - Simplicity measure?
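A sketch of the wrapper approach using scikit-learn (an assumption about tooling; the course itself uses Weka). Fitness of a bit string is the cross-validated accuracy of a decision tree trained on the selected attributes:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_fitness(bits, X, y):
    """X is a numeric numpy array; bits is a string like '0101'."""
    selected = [i for i, b in enumerate(bits) if b == "1"]
    if not selected:                       # empty subsets get zero fitness
        return 0.0
    scores = cross_val_score(DecisionTreeClassifier(), X[:, selected], y, cv=3)
    return scores.mean()                   # accuracy = 1 - error rate
```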
Crossover
(figure: two parent attribute subsets, encoded as bit strings over outlook, temperature, humidity and windy, exchange their bits after the crossover point to produce two offspring subsets)
In Weka
Clustering Example
Create two clusters for:

ID  Outlook   Temperature  Humidity  Windy  Play
10  Sunny     Hot          High      True   No
20  Overcast  Hot          High      False  Yes
30  Rainy     Mild         High      False  Yes
40  Rainy     Cool         Normal    False  Yes

Encode a solution as one bit per instance (1 = first cluster, 0 = second cluster).

Parents:
  1 1 0 0  →  {10, 20} and {30, 40}
  0 1 0 1  →  {20, 40} and {10, 30}

Crossover produces the offspring:
  1 1 0 1  →  {10, 20, 40} and {30}
  0 1 0 0  →  {20} and {10, 30, 40}
Discussion
GA is a flexible and powerful random
search methodology
Efficiency depends on how well you can
encode the solutions in a way that will
work with the crossover operator
In data mining, attribute selection is the
most natural application
Attribute Selection in Unsupervised Learning
- Attribute selection typically uses a measure, such as accuracy, that is directly related to the class attribute
- How do we apply attribute selection to unsupervised learning such as clustering?
- Need a measure
  - compactness of clusters
  - separation among clusters
  - multiple measures
Quality Measures
Compactness:

$F_{within} = 1 - \frac{1}{Z_{within}} \cdot \frac{1}{d} \sum_{k=1}^{K} \sum_{i=1}^{n} \omega_{ik} \sum_{j \in J} (x_{ij} - \bar{c}_{kj})^2$

where
- $\omega_{ik} = 1$ if instance $x_i$ belongs to cluster k, and 0 otherwise
- $\bar{c}_{kj} = \sum_{i=1}^{n} \omega_{ik} x_{ij} \, / \, \sum_{i=1}^{n} \omega_{ik}$ is the centroid of cluster k
- J is the set of selected attributes and d = |J| is the number of attributes
- $Z_{within}$ is a normalization constant chosen to keep $F_{within}$ in [0, 1]
More Quality Measures
Cluster separation:

$F_{between} = \frac{1}{Z_{bet}} \cdot \frac{1}{d} \cdot \frac{1}{K-1} \sum_{k=1}^{K} \sum_{i=1}^{n} (1 - \omega_{ik}) \sum_{j \in J} (x_{ij} - \bar{c}_{kj})^2$

i.e., the (normalized) average distance from each instance to the centroids of the clusters it does not belong to.
Final Quality Measures
Adjustment for bias in the number of clusters:

$F_{clusters} = 1 - \frac{K - K_{min}}{K_{max} - K_{min}}$

Complexity (d selected attributes out of D):

$F_{complexity} = 1 - \frac{d - 1}{D - 1}$
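A numpy sketch of the four measures (my own code; it takes the normalization constants Z_within and Z_bet to be 1, which is what the worked example later in the slides appears to do):

```python
import numpy as np

def cluster_quality(X, labels, K, K_min, K_max, D, Z_within=1.0, Z_bet=1.0):
    """X holds the d selected attributes (rows = instances); labels in 0..K-1."""
    n, d = X.shape
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # n x K
    member = np.eye(K)[labels]                                        # omega_ik
    f_within = 1 - (member * sq).sum() / (Z_within * d)
    f_between = ((1 - member) * sq).sum() / (Z_bet * d * (K - 1))
    f_clusters = 1 - (K - K_min) / (K_max - K_min)
    f_complexity = 1 - (d - 1) / (D - 1)
    return f_within, f_between, f_clusters, f_complexity
```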
Wrapper Framework
Loop:
Obtain an attribute subset
Apply k-means algorithm
Evaluate cluster quality
Until stopping criterion satisfied
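A sketch of this wrapper loop for one candidate attribute subset, assuming scikit-learn's KMeans (the course itself uses Weka) and the cluster_quality helper sketched above:

```python
from sklearn.cluster import KMeans

def evaluate_subset(X_full, selected, K, K_min=2, K_max=3):
    """Cluster on the selected attribute columns and score the result."""
    X = X_full[:, selected]
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    return cluster_quality(X, labels, K, K_min, K_max, D=X_full.shape[1])
```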
Problem
What is the optimal attribute subset?
What is the optimal number of clusters?
Try to find simultaneously
Example
Find an attribute subset and optimal number of clusters (K_min = 2, K_max = 3) for:

ID   Sepal Length  Sepal Width  Petal Length  Petal Width
10   5.0           3.5          1.6           0.6
20   5.1           3.8          1.9           0.4
30   4.8           3.0          1.4           0.3
40   5.1           3.8          1.6           0.2
50   4.6           3.2          1.4           0.2
60   6.5           2.8          4.6           1.5
70   5.7           2.8          4.5           1.3
80   6.3           3.3          4.7           1.6
90   4.9           2.4          3.3           1.0
100  6.6           2.9          4.6           1.3
Formulation
Define an individual:
- four bits selecting the attributes: * * * *
- one bit for the number of clusters: * (here 0 means K = 2 and 1 means K = 3)
Initial population:
  0 1 0 1 1
  1 0 0 1 0
Evaluate Fitness
Start with 0 1 0 1 1: three clusters and the attributes {Sepal Width, Petal Width}

ID   Sepal Width  Petal Width
10   3.5          0.6
20   3.8          0.4
30   3.0          0.3
40   3.8          0.2
50   3.2          0.2
60   2.8          1.5
70   2.8          1.3
80   3.3          1.6
90   2.4          1.0
100  2.9          1.3

Apply k-means with k = 3.
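A minimal k-means (Lloyd's algorithm) sketch, not the course's Weka implementation, run on this data with instances 10, 70 and 80 as the initial centroids:

```python
import numpy as np

def kmeans(X, initial_centroids, max_iters=100):
    """Assign points to the nearest centroid, recompute centroids, and stop
    when the assignment no longer changes."""
    centroids = np.array(initial_centroids, dtype=float)
    labels = None
    for _ in range(max_iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                       # no change in assignment, terminate
        labels = new_labels
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

X = np.array([[3.5, 0.6], [3.8, 0.4], [3.0, 0.3], [3.8, 0.2], [3.2, 0.2],
              [2.8, 1.5], [2.8, 1.3], [3.3, 1.6], [2.4, 1.0], [2.9, 1.3]])
labels, centers = kmeans(X, X[[0, 6, 7]])   # start from instances 10, 70, 80
```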
K-Means
Start with random centroids: instances 10, 70, 80.
(figure: the ten instances in the Sepal Width / Petal Width plane with the three initial centroids)
New Centroids
(figure: the recomputed centroids C1, C2, C3 in the Sepal Width / Petal Width plane)
No change in assignment, so terminate the k-means algorithm.
Quality of Clusters
Centers:
- Center at (3.46, 0.34): {10, 20, 30, 40, 50}
- Center at (3.30, 1.60): {80}
- Center at (2.73, 1.28): {60, 70, 90, 100}
Evaluation:
  F_within = 0.55
  F_between = 6.60
  F_clusters = 0.00
  F_complexity = 0.67
Next Individual
Now look at 1 0 0 1 0: two clusters and the attributes {Sepal Length, Petal Width}

ID   Sepal Length  Petal Width
10   5.0           0.6
20   5.1           0.4
30   4.8           0.3
40   5.1           0.2
50   4.6           0.2
60   6.5           1.5
70   5.7           1.3
80   6.3           1.6
90   4.9           1.0
100  6.6           1.3

Apply k-means with k = 2.
K-Means
Say we select instances 20 and 90 as the initial centroids.
(figure: the ten instances in the Sepal Length / Petal Width plane with the two initial centroids)
Recalculate Centroids
(figure: the centroids C1 and C2 after the first reassignment)
43
Recalculate Again
Petal Width
2
80
C2 60
100
1.5
70
1
90
10
C1 20
30
50
40
0.5
0
4
4.5
5
5.5
6
6.5
7
Sepal Width
No change in assignment so
terminate k-means algorithm
Fall 2004
44
Quality of Clusters
Centers
Center 1 at (4.92,0.45): {10,20,30,40,50,90}
Center 3 at (6.28,1.43): {60,70,90,100}
Evaluation
Fwithin 0.39
Fbetween 14.59
Fclusters 1.00
Fcomplexity 0.67
Fall 2004
45
Compare Individuals
0 1 0 1 1
1 0 0 1 0
Fwithin 0.55
Fwithin 0.39
Fbetween 6.60
Fbetween 14.59
Fclusters 0.00
Fclusters 1.00
Fcomplexity 0.67
Fcomplexity 0.67
Which is fitter?
Fall 2004
46
Evaluating Fitness
- Can scale the measures (if necessary), e.g., divide F_within and F_between by their largest values in the population
- Then weight the measures together, e.g., with equal weights:

fitness(0 1 0 1 1) = (0.55/0.55 + 6.60/14.59 + 0.00 + 0.67) / 4 = 0.53
fitness(1 0 0 1 0) = (0.39/0.55 + 14.59/14.59 + 1.00 + 0.67) / 4 = 0.84

- Alternatively, we can use Pareto optimization
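The same combination in code (my own sketch; equal weights are an assumption, but they reproduce the 0.53 and 0.84 above):

```python
def combined_fitness(measures, population_measures, weights=(0.25, 0.25, 0.25, 0.25)):
    """Scale F_within and F_between by their population maxima, then weight."""
    f_within, f_between, f_clusters, f_complexity = measures
    max_within = max(m[0] for m in population_measures)
    max_between = max(m[1] for m in population_measures)
    scaled = (f_within / max_within, f_between / max_between, f_clusters, f_complexity)
    return sum(w * s for w, s in zip(weights, scaled))

pop = [(0.55, 6.60, 0.00, 0.67), (0.39, 14.59, 1.00, 0.67)]
print([round(combined_fitness(m, pop), 2) for m in pop])   # [0.53, 0.84]
```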
Mathematical Programming
- Continuous decision variables
- Constrained versus non-constrained
- Form of the objective function
  - Linear Programming (LP)
  - Quadratic Programming (QP)
  - General Mathematical Programming (MP)
Linear Program
max f(x) = 2 + 0.5x
s.t. 0 ≤ x ≤ 15
(figure: the line f(x) = 2 + 0.5x over the feasible interval)
Optimal solution: x* = 15, at the boundary of the feasible region.
Two-Dimensional Problem

max 12x1 + 9x2
s.t.  x1 ≤ 1000
      x2 ≤ 1500
      x1 + x2 ≤ 1750
      4x1 + 2x2 ≤ 4800
      x1, x2 ≥ 0

(figure: the feasible region bounded by these constraints, the objective contours 12x1 + 9x2 = 6000 and 12x1 + 9x2 = 12000, and the optimal solution at a corner of the region)
The optimum is always at an extreme point.
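A sketch of solving this LP with scipy (an assumption about tooling; the constraints are as reconstructed above). linprog minimizes, so the objective coefficients are negated:

```python
from scipy.optimize import linprog

c = [-12, -9]                                    # maximize 12*x1 + 9*x2
A_ub = [[1, 0], [0, 1], [1, 1], [4, 2]]
b_ub = [1000, 1500, 1750, 4800]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # with these numbers: x = (650, 1100), objective 17700
```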
Simplex Method
(figure: the feasible region of the two-dimensional problem, used to illustrate how the simplex method moves between extreme points)
Quadratic Programming
Example: f(x) = 0.2 + (x - 1)^2
(figure: the parabola plotted over [0, 2])
Setting f'(x) = 2(x - 1) = 0 gives the minimizer x* = 1.
General MP
(figure: a function with several points where f'(x) = 0)
The derivative being zero is a necessary but not sufficient condition for a minimum.
Constrained Problem?
(figure: with a constraint such as x ≤ 10, the optimum can lie on the boundary, where f'(x) ≠ 0)
General MP
We write a general mathematical program in matrix notation as:

min f(x)
s.t. h(x) = 0
     g(x) ≤ 0

where x is the vector of decision variables, h(x) = (h_1(x), h_2(x), ..., h_m(x)), and g(x) = (g_1(x), g_2(x), ..., g_m(x)).
Karush-Kuhn-Tucker (KKT) Conditions
If x* is a relative minimum for
  min f(x)  s.t.  h(x) = 0,  g(x) ≤ 0,
then there exist λ and μ ≥ 0 such that
  $\mu^T g(x^*) = 0$
  $\nabla f(x^*) + \lambda^T \nabla h(x^*) + \mu^T \nabla g(x^*) = 0$
Convex Sets
A set C is convex if any line segment connecting two points in the set lies completely within the set, that is,
$\forall x_1, x_2 \in C, \ \forall \alpha \in (0,1): \ \alpha x_1 + (1-\alpha) x_2 \in C$
(figure: a convex set and a set that is not convex)
Convex Hull
- The convex hull co(S) of a set S is the intersection of all convex sets containing S
- A set V ⊆ R^n is a linear variety if
  $\forall x_1, x_2 \in V, \ \forall \alpha \in \mathbb{R}: \ \alpha x_1 + (1-\alpha) x_2 \in V$
Hyperplane
A hyperplane in R^n is an (n-1)-dimensional variety.
(figures: a hyperplane in R^2 is a line; a hyperplane in R^3 is a plane)
Convex Hull Example
(figure: Play and No Play instances in the Temperature/Humidity plane; c and d are the closest points of the two convex hulls, and the separating hyperplane bisects the segment between them)
Finding the Closest Points
Formulate as a QP:

$\min \ \tfrac{1}{2} \| c - d \|^2$
s.t.
$c = \sum_{i: Play=Yes} \alpha_i x_i, \qquad d = \sum_{i: Play=No} \alpha_i x_i$
$\sum_{i: Play=Yes} \alpha_i = 1, \qquad \sum_{i: Play=No} \alpha_i = 1$
$\alpha_i \geq 0$
Support Vector Machines
(figure: the separating hyperplane between the Play and No Play instances in the Temperature/Humidity plane, with the support vectors highlighted)
Example

ID   Sepal Width  Petal Width
10   3.5          0.6
20   3.8          0.4
30   3.0          0.3
40   3.8          0.2
50   3.2          0.2
60   2.8          1.5
70   2.8          1.3
80   3.3          1.6
90   2.4          1.0
100  2.9          1.3
Separating Hyperplane
(figure: the ten instances in the Sepal Width / Petal Width plane with a separating hyperplane between the two groups)
Assume Separating Planes
Constraints:
$x_i \cdot w + b \geq +1, \quad \forall i: y_i = +1$
$x_i \cdot w + b \leq -1, \quad \forall i: y_i = -1$
Distance to each plane: $\frac{1}{\|w\|}$
Optimization Problem
$\max_{w,b} \ \frac{2}{\|w\|}$
subject to
$x_i \cdot w + b \geq +1, \quad \forall i: y_i = +1$
$x_i \cdot w + b \leq -1, \quad \forall i: y_i = -1$
How Do We Solve MPs?
Improving Search
Direction-step approach:
$x^{k+1} = x^k + \lambda \, \Delta x^k$
where $x^k$ is the current solution, $x^{k+1}$ the new solution, $\Delta x^k$ the search direction, and $\lambda$ the step size.
Steepest Descent
- Search direction equal to the negative gradient:
  $\Delta x = -\nabla f(x^k)$
- Finding the step size $\lambda$ is a one-dimensional optimization problem of minimizing
  $\phi(\lambda) = f(x^k + \lambda \, \Delta x^k)$
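A small steepest-descent sketch (my own code); the exact one-dimensional solve for the step size is replaced by a crude grid search:

```python
import numpy as np

def steepest_descent(f, grad, x0, steps=100, step_grid=np.linspace(0, 1, 101)):
    """Move along the negative gradient, picking the best step on a grid."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        d = -grad(x)                                        # search direction
        lam = min(step_grid, key=lambda a: f(x + a * d))    # 1-D minimization
        x = x + lam * d
    return x

f = lambda x: (x[0] - 1) ** 2 + 3 * (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 6 * (x[1] + 2)])
print(steepest_descent(f, grad, [0.0, 0.0]))   # approaches (1, -2)
```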
Newton's Method
Taylor series expansion:
$f(x) \approx f(x^k) + \nabla f(x^k)^T (x - x^k) + \tfrac{1}{2} (x - x^k)^T F(x^k) (x - x^k)$
The right-hand side is minimized at
$x^{k+1} = x^k - F(x^k)^{-1} \nabla f(x^k)$
where $F(x^k)$ is the Hessian.
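In one dimension the update is x_{k+1} = x_k - f'(x_k)/f''(x_k); a tiny sketch on the quadratic from the QP slide, where Newton's method reaches the minimizer in a single step:

```python
def newton_1d(fprime, fsecond, x0, iters=10):
    """One-dimensional Newton's method."""
    x = x0
    for _ in range(iters):
        x = x - fprime(x) / fsecond(x)
    return x

# f(x) = 0.2 + (x - 1)^2, so f'(x) = 2(x - 1) and f''(x) = 2
print(newton_1d(lambda x: 2 * (x - 1), lambda x: 2, x0=5.0))   # 1.0
```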
Discussion
- Computing the inverse Hessian is difficult
  - Quasi-Newton: $x^{k+1} = x^k - \lambda^k S^k \nabla f(x^k)$, where $S^k$ approximates the inverse Hessian
  - Conjugate gradient methods
- Does not account for constraints
  - Penalty methods
  - Lagrangian methods, etc.
Non-separable
Add an error term to the constraints:
$x_i \cdot w + b \geq +1 - \xi_i, \quad \forall i: y_i = +1$
$x_i \cdot w + b \leq -1 + \xi_i, \quad \forall i: y_i = -1$
$\xi_i \geq 0, \ \forall i$
Wolfe Dual
$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
subject to
$0 \leq \alpha_i \leq C$
$\sum_i \alpha_i y_i = 0$
The dot product $x_i \cdot x_j$ is the only place the data appears, and the constraints are simple.
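A sketch using scikit-learn's SVC, which solves this dual internally (an assumption about tooling; the slides themselves use Weka), on the example data above with the two groups as classes:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.5, 0.6], [3.8, 0.4], [3.0, 0.3], [3.8, 0.2], [3.2, 0.2],
              [2.8, 1.5], [2.8, 1.3], [3.3, 1.6], [2.4, 1.0], [2.9, 1.3]])
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])

model = SVC(kernel="linear", C=1.0).fit(X, y)
print(model.support_)     # indices of the support vectors
print(model.dual_coef_)   # alpha_i * y_i for the support vectors
```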
Extension to Non-Linear
Kernel functions:
$K(x, y) = \Phi(x) \cdot \Phi(y)$
where the mapping $\Phi: \mathbb{R}^n \rightarrow H$ sends the data into a high-dimensional Hilbert space H, and the kernel takes the place of the dot product in the Wolfe dual.
Some Possible Kernels
$K(x, y) = (x \cdot y + 1)^p$
$K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$
$K(x, y) = \tanh(\kappa \, x \cdot y - \delta)$
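The three kernels as short Python functions (my own sketch; the parameter values are arbitrary examples):

```python
import numpy as np

poly_kernel = lambda x, y, p=2: (np.dot(x, y) + 1) ** p
rbf_kernel = lambda x, y, sigma=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
sigmoid_kernel = lambda x, y, kappa=1.0, delta=1.0: np.tanh(kappa * np.dot(x, y) - delta)

x, y = np.array([3.5, 0.6]), np.array([2.8, 1.5])
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```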
In Weka
Weka.classifiers.smo
Support vector machine for nominal data only
Does both linear and non-linear models
Optimization in DM
- Mathematical Programming → Support Vector Machines → classification, clustering, etc.
- Steepest Descent Search → Neural Nets, Bayesian Networks (optimize parameters)
- Combinatorial Optimization → Genetic Algorithm → feature selection, classification, clustering
Bayesian Classification
Naïve Bayes assumes independence between
attributes
Simple computations
Best classifier if assumption is true
Bayesian Belief Networks
Joint probability distributions
Directed acyclic graphs
Nodes are random variables (attributes)
Arcs represent the dependencies
Example: Bayesian Network
Lung Cancer depends on Family History and Smoker.
(figure: a network over the nodes Family History, Smoker, Lung Cancer, Emphysema, Positive X-Ray and Dyspnea)

CPT for Lung Cancer:
      FH,S  FH,~S  ~FH,S  ~FH,~S
LC    0.8   0.5    0.7    0.1
~LC   0.2   0.5    0.3    0.9

Lung Cancer is conditionally independent of Emphysema, given Family History and Smoker.
Conditional Probabilities
Pr(LungCancer = "yes" | FamilyHistory = "yes", Smoker = "yes") = 0.8
Pr(LungCancer = "no" | FamilyHistory = "no", Smoker = "no") = 0.9
The joint probability factors over the network:
$\Pr(z_1, \ldots, z_n) = \prod_{i=1}^{n} \Pr(z_i \mid Parents(Z_i))$
where $z_i$ is the outcome of the random variable $Z_i$. The node representing the class attribute is called the output node.
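A toy sketch of this factorization in Python; the Lung Cancer table is from the slide, while the priors for Family History and Smoker are made-up values for illustration:

```python
p_fh = {"yes": 0.1, "no": 0.9}                    # assumed prior
p_smoker = {"yes": 0.3, "no": 0.7}                # assumed prior
p_lc = {("yes", "yes"): 0.8, ("yes", "no"): 0.5,  # Pr(LC = "yes" | FH, Smoker)
        ("no", "yes"): 0.7, ("no", "no"): 0.1}

def joint(fh, smoker, lc):
    """Pr(FH, Smoker, LC) = Pr(FH) * Pr(Smoker) * Pr(LC | FH, Smoker)."""
    p = p_lc[(fh, smoker)]
    return p_fh[fh] * p_smoker[smoker] * (p if lc == "yes" else 1 - p)

print(joint("yes", "yes", "yes"))   # 0.1 * 0.3 * 0.8 = 0.024
```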
How Do we Learn?
- Network structure
  - Given/known
  - Inferred or learned from the data
- Variables
  - Observable
  - Hidden (missing values / incomplete data)
Case 1: Known Structure and
Observable Variables
Straightforward
Similar to Naïve Bayes
Compute the entries of the conditional
probability table (CPT) of each variable
Case 2: Known Structure and Some Hidden Variables
- Still need to learn the CPT entries
- Let S be a set of s training instances X_1, X_2, ..., X_s
- Let w_ijk be the CPT entry for the variable Y_i = y_ij having parents U_i = u_ik
CPT Example
For the Lung Cancer table: Y_i = LungCancer, y_ij = "yes", U_i = {FamilyHistory, Smoker}, u_ik = {"yes", "yes"}, so w_ijk is the top-left entry 0.8.

      FH,S  FH,~S  ~FH,S  ~FH,~S
LC    0.8   0.5    0.7    0.1
~LC   0.2   0.5    0.3    0.9

The full parameter vector is w = (w_ijk) over all i, j, k.
Objective
Must find the values of w = (w_ijk). The objective is to maximize the likelihood of the data, that is,
$\Pr_w(S) = \prod_{d=1}^{s} \Pr_w(X_d)$
How do we do this?
Non-Linear MP
From the training data, compute the gradients:
$\frac{\partial \Pr_w(S)}{\partial w_{ijk}} = \sum_{d=1}^{s} \frac{\Pr(Y_i = y_{ij}, U_i = u_{ik} \mid X_d)}{w_{ijk}}$
Move in the direction of the gradient:
$w_{ijk} \leftarrow w_{ijk} + l \, \frac{\partial \Pr_w(S)}{\partial w_{ijk}}$
where l is the learning rate.
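A schematic sketch of one gradient-ascent step on the CPT entries; estimate_posterior is a hypothetical inference routine returning Pr(Y_i = y_ij, U_i = u_ik | X_d) for a training instance:

```python
def gradient_step(w, data, estimate_posterior, learning_rate=0.01):
    """One gradient-ascent step on the CPT entries w[(i, j, k)]."""
    new_w = dict(w)
    for (i, j, k), w_ijk in w.items():
        grad = sum(estimate_posterior(i, j, k, x_d) / w_ijk for x_d in data)
        new_w[(i, j, k)] = w_ijk + learning_rate * grad
    # In practice the entries are then renormalized so that, for each parent
    # configuration k, the w_ijk sum to one over j (not shown here).
    return new_w
```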
Case 3: Unknown Network
Structure
Need to find/learn the optimal network
structure for the data
What type of optimization problem is this?
Combinatorial optimization (GA etc.)