
Weight Annealing
Data Perturbation for Escaping Local Maxima in Learning
Gal Elidan, Matan Ninio, Nir Friedman
Hebrew University
{galel,ninio,nir}@cs.huji.ac.il
Dale Schuurmans
University of Waterloo
[email protected]
The Learning Problem
[Diagram: weights + DATA + Hypothesis → Score(h, D, w)]

Learning task: search for  \arg\max_h Score(h, D, w)

Density estimation:   Score(h,D,w) = \sum_m w_m \log P(X[m] \mid h) - Pen(h)
Classification:       Score(h,D,w) = \sum_m w_m P(h(X[m]) = C[m] \mid X[m]) - Pen(h)
Logistic regression:  Score(h,D,w) = -\sum_m w_m \log(1 + \exp(-y[m]\, h(x[m]))) - Pen(h)

Optimization is hard!
Typically resort to local optimization methods:
gradient ascent, greedy hill-climbing, EM
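A minimal sketch (Python/NumPy is an assumption of this write-up, not code from the talk) of how the instance weights w_m enter such a score, using the weighted density-estimation objective above:

```python
import numpy as np

def weighted_log_likelihood_score(log_probs, weights, penalty=0.0):
    """Weighted density-estimation score:
        sum_m w_m * log P(X[m] | h)  -  Pen(h)

    log_probs : per-instance log-likelihoods log P(X[m] | h) under hypothesis h
    weights   : instance weights w_m (all ones recovers the unweighted score)
    penalty   : complexity penalty Pen(h)
    """
    log_probs = np.asarray(log_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights, log_probs)) - penalty

# Example: uniform weights vs. weights that emphasize the last two instances.
lp = [-1.2, -0.7, -3.5, -4.1]
print(weighted_log_likelihood_score(lp, [1, 1, 1, 1]))
print(weighted_log_likelihood_score(lp, [0.5, 0.5, 1.5, 1.5]))
```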
Escaping local maxima
Local methods converge to one of many local optima
[Figure: score landscape over hypotheses h; the search gets stuck at a local maximum]

Standard ways of escaping:
- TABU search
- Random restarts
- Simulated annealing

These methods work by perturbing steps during the local search
Weight Perturbation
Our Idea: Perturbation of instance weights
[Diagram: weighted DATA, perturb the weights W, re-weighted DATA]

- Puts stronger emphasis on a subset of the instances
- Allows the learning procedure to escape local maxima
Iterative Procedure
[Diagram: iterative loop, REWEIGHT perturbs the weights W over the DATA, LOCAL SEARCH optimizes the hypothesis h and its Score]

Benefits:
- Generality: a wide variety of learning scenarios
- Modularity: the search procedure is unchanged
- Effectiveness: allows global changes
Iterative Procedure
Two methods for reweighting:
- Random: sampling random weights
- Adversarial: directed reweighting

To maximize the original objective, slowly diminish the magnitude of the perturbations (sketched below)
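A schematic sketch of the overall loop in Python (an assumption of this write-up, not the talk's code); `local_search`, `reweight`, and the geometric cooling schedule are placeholders for whatever learner and perturbation scheme are being annealed:

```python
import numpy as np

def weight_annealing(data, w_star, local_search, reweight,
                     temp0=1.0, cooling=0.9, n_iters=40, h0=None):
    """Generic weight-annealing loop.

    w_star       : original instance weights (the target distribution)
    local_search : (data, weights, h_init) -> locally optimal hypothesis h
    reweight     : (w_star, h, temp, rng)  -> perturbed weights (Random or Adversarial)

    The temperature, and with it the perturbation magnitude, shrinks each
    iteration so that the search finally optimizes the original objective.
    """
    rng = np.random.default_rng(0)
    temp = temp0
    h = local_search(data, w_star, h0)          # start from the unperturbed problem
    for _ in range(n_iters):
        w = reweight(w_star, h, temp, rng)      # perturb the instance weights
        h = local_search(data, w, h)            # re-optimize from the current hypothesis
        temp *= cooling                         # slowly diminish the perturbations
    return local_search(data, w_star, h)        # final fine-tuning on the original weights
```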
Random Reweighting
- Sample weights with mean equal to the original weight and variance proportional to the temperature
[Figure: distributions P(W) of the sampled weights W_t, W_{t+1}, W_{t+2}, shrinking toward the original weights W* as the temperature drops; x-axis: distance from the original W]
- When hot, the model can "go" almost anywhere and local maxima are bypassed
- When cold, the search fine-tunes to find an optimum with respect to the original data
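One possible way to realize the Random scheme (the Gaussian form, the positivity clamp, and the renormalization are assumptions; the slide only states mean = original weight and variance proportional to temperature):

```python
import numpy as np

def random_reweight(w_star, h, temp, rng):
    """Random reweighting: sample weights whose mean is the original weight
    and whose variance scales with the temperature (h is unused here)."""
    w_star = np.asarray(w_star, dtype=float)
    noise = rng.normal(loc=0.0, scale=np.sqrt(temp), size=w_star.shape)
    w = np.maximum(w_star + noise, 1e-8)        # keep weights positive
    return w * (w_star.sum() / w.sum())         # renormalize the total weight
```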
Adversarial Reweighting
Idea: challenge the model by increasing the weight of "bad" (low-scoring) instances

- A min-max game between the re-weighting step (which minimizes the score using W) and the optimizer (which maximizes it)
- Converge towards the original distribution by constraining the distance from W*
[Figure: weight trajectory W_t → W_{t+1}, pulled back towards the original weights W*]

Update:  w_{t+1} ∝ w^* \exp(-\eta_{temp} \cdot Score(w_t))        [Kivinen & Warmuth]
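A sketch of an update of this exponentiated-gradient flavor (interpreting Score(w_t) as per-instance score contributions under the current hypothesis, and η_temp as a temperature-scaled step size; both readings are assumptions about the slide's compressed formula):

```python
import numpy as np

def adversarial_reweight(w_star, per_instance_scores, eta_temp):
    """Adversarial reweighting: increase the weight of low-scoring ("bad")
    instances; the multiplicative form keeps w close to the original
    weights w* as eta_temp -> 0:
        w_{t+1}[m]  proportional to  w*[m] * exp(-eta_temp * Score_m(h_t))
    """
    w_star = np.asarray(w_star, dtype=float)
    s = np.asarray(per_instance_scores, dtype=float)
    w = w_star * np.exp(-eta_temp * (s - s.max()))   # shift for numerical stability
    return w * (w_star.sum() / w.sum())              # renormalize the total weight
```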
Learning Bayesian Networks
- A Bayesian network (BN) is a compact representation of a joint distribution
- Learning a BN is a density estimation problem

[Figure: the Alarm network, a Bayesian network over 37 clinical variables (MINVOLSET, INTUBATION, VENTLUNG, SAO2, CATECHOL, HR, BP, ...), learned from weighted DATA]
Learning task: find structure + parameters that maximize score
Structure Search Results
- Super-exponential combinatorial search space
- Search uses local ops: add/remove/reverse edge (see the sketch below)
- Optimize the Bayesian Dirichlet (BDe) score

[Figure: log-loss/instance on test data over annealing iterations (HOT → COLD); both Random annealing and Adversary improve over the BASELINE towards the TRUE STRUCTURE score. Alarm network: 37 variables, 1000 samples]

With similar running time:
- Random is superior to random re-starts
- A single Adversary run competes with Random
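For concreteness, a generic sketch of such hill-climbing over add/remove/reverse edge operators (assumed Python; `score_fn` stands in for the weighted BDe score of a candidate structure, which is not implemented here):

```python
def greedy_structure_search(variables, score_fn):
    """Greedy hill-climbing over DAG structures with add/remove/reverse edge ops.
    score_fn(edges) evaluates a candidate structure (e.g. a weighted BDe score)."""
    edges = set()

    def creates_cycle(edge_set, x, y):
        # Adding x -> y creates a cycle iff a path y -> ... -> x already exists.
        stack, seen = [y], set()
        while stack:
            node = stack.pop()
            if node == x:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(b for (a, b) in edge_set if a == node)
        return False

    best = score_fn(edges)
    improved = True
    while improved:
        improved = False
        candidates = []
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                if (x, y) in edges:
                    removed = edges - {(x, y)}
                    candidates.append(removed)                        # remove edge
                    if not creates_cycle(removed, y, x):
                        candidates.append(removed | {(y, x)})         # reverse edge
                elif not creates_cycle(edges, x, y):
                    candidates.append(edges | {(x, y)})               # add edge
        for cand in candidates:
            s = score_fn(cand)
            if s > best:
                best, edges, improved = s, cand, True
    return edges, best
```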
Search with missing values
- Missing values introduce many local maxima
- Structural EM (SEM) combines search and parameter estimation

[Figure: percent of runs at least this good vs. log-loss/instance on test data; the RANDOM and ADVERSARY curves lie between the BASELINE (normal SEM) and the GENERATING MODEL. Alarm network: 37 variables, 1000 samples]

With similar running time:
- Over 90% of the Random runs are better than normal SEM
- The single Adversary run is best
- The distance to the true generating model is halved!
Real-life datasets
- 6 real-life examples, with and without missing values

[Figure: log-loss/instance on test data relative to the BASELINE, per dataset, for Adversary and the 20-80% range of Random runs]

  Variables:    20    36    30    70    36    13
  Samples:    1512   446   300   200   546   100

With similar running time:
- Adversary is efficient and preferable
- Random takes longer for inferior results
Learning Sequence Motifs
DNA Promoter Sequences
ATCTAGCTGAGAATGCACACTGATCGAGCCC
CACCATATTCTTCGGACTGCGCTATATAGAC
TGCAACTAGTAGAGCTCTGCTAGAAACATTA
CTAAGCTCTATGACTGCCGATTGCGCCGTTT
GGGCGTCTGAGCTCTTTGCTCTTGACTTCCG
CTTATTGATATTATCTCTCTTGCTCGTGACT
GCTTTATTGTGGGGGGGACTGCTGATTATGC
TGCTCATAGGAGAGACTGCGAGAGTCGTCGT
AGGACTGCGTCGTCGTGATGATGCTGCTGAT
CGATCGGACTGCCTAGCTAGTAGATCGATGT
GACTGCAGAAGAGAGAGGGTTTTTTCGCGCC
GCCCCGCGCGACTGCTCGAGAGGAAGTATAT
ATGACTGCGCGCGCCGCGCGCCGGACTGCTT
TATCCAGCTGATGCATGCATGCTAGTAGACT
GCCTAGTCAGCTGCGATCGACTCGTAGCATG
CATCGACTGCAGTCGATCGATGCTAGTTATT
GGATGCGACTGAACTCGTAGCTGTAGTTATT
Represent using a motif, a Position Specific Scoring Matrix (PSSM):
[Table: PSSM with rows A, C, G, T giving the probability of each letter at each motif position (e.g. 0.97, 0.99, 0.98 for the dominant letters)]
Segal et al., RECOMB 2002

Score(S, L) = \sum_{n=1}^{N} w_n \, \mathrm{logistic}\left( \log \frac{1}{K} \sum_j \exp\left( \sum_i \theta_{i,\, S_{n,i+j}} \right) \right)

Highly non-linear score: optimization is hard!
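An illustrative sketch of evaluating this weighted PSSM score (the window averaging and the placement of the logistic follow the reconstructed formula above, so treat the details as assumptions):

```python
import numpy as np

def motif_score(sequences, weights, theta, alphabet="ACGT"):
    """Weighted PSSM motif score in the (reconstructed) form:
        Score = sum_n w_n * logistic( log( 1/K * sum_j exp( sum_i theta[i, S[n, i+j]] ) ) )

    theta : (motif_length, 4) PSSM parameters
    K     : number of possible motif start positions in a sequence
    """
    idx = {c: k for k, c in enumerate(alphabet)}
    motif_len = theta.shape[0]
    total = 0.0
    for w, seq in zip(weights, sequences):
        K = len(seq) - motif_len + 1
        # PSSM match score for every window start j
        window_scores = [
            sum(theta[i, idx[seq[j + i]]] for i in range(motif_len))
            for j in range(K)
        ]
        inner = np.log(np.mean(np.exp(window_scores)))   # log of the average match
        total += w / (1.0 + np.exp(-inner))              # logistic transform
    return total

# Tiny usage example with a hypothetical 3-position PSSM.
theta = np.log(np.array([[0.7, 0.1, 0.1, 0.1],
                         [0.1, 0.7, 0.1, 0.1],
                         [0.1, 0.1, 0.7, 0.1]]))
print(motif_score(["ACGTAC", "TTTTTT"], [1.0, 1.0], theta))
```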
Real-life Motifs Results
- Construct the PSSM: find θ that maximizes the score
- Experiments on 9 transcription factors (motifs)

[Figure: log-loss on test data relative to the BASELINE, per motif (ACE2, FKH1, FKH2, MBP1, MCM1, NDD1, SWI4, SWI5, SWI6), for Adversary and the 20-80% range of Random runs. PSSM: 4 letters x 20 positions, 550 samples]

With similar running time:
- Both methods are better than standard ascent
- Adversary is efficient and best in 6 of 9 cases
Simulated annealing
[Figure: score landscape over h; simulated annealing occasionally takes downhill moves]

Simulated annealing: allow "bad" moves with some probability, P(move) ∝ f(temp, Δ)
- Wasteful propose, evaluate, reject cycle
- Needs a long time to escape local maxima
- WORSE than the baseline on Bayesian networks!
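For contrast, the usual Metropolis-style acceptance rule that gives simulated annealing its "bad" moves (a standard textbook form, not the exact f(temp, Δ) used in these experiments):

```python
import math
import random

def accept_move(delta_score, temp, rng=random):
    """Metropolis acceptance: always take improving moves; take a "bad" move
    (delta_score < 0) with probability exp(delta_score / temp)."""
    if delta_score >= 0:
        return True
    return rng.random() < math.exp(delta_score / temp)
```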
Summary and Future Work
- A general method, applicable to a variety of learning scenarios:
  decision trees, clustering, phylogenetic trees, TSP, ...
- Promising empirical results: approach the "achievable" maximum
- The BIG challenge: THEORETICAL INSIGHTS
Adversary ≠ Boosting
            Adversary                           Boosting
Output:     Single hypothesis                   An ensemble
Weights:    Converge to original distribution   Diverge from original distribution
Learning:   h_{t+1} depends on h_t              h_{t+1} depends only on w_{t+1}

The same comparison holds for Random vs. bagging/bootstrap
Other annealing methods
Simulated annealing: allow "bad" moves with some probability, P(move) ∝ f(temp, Δ)
[Figure: score landscape over h with occasional downhill moves]
Not good on Bayesian networks!

Deterministic annealing: change the scenery by changing the family of hypotheses h
[Figure: smooth score landscape for a simple hypothesis class vs. rugged landscape for a complex hypothesis class]
Is not naturally applicable!