
Model-Free Stochastic Perturbative
Adaptation and Optimization
Gert Cauwenberghs
Johns Hopkins University
[email protected]
520.776 Learning on Silicon
http://bach.ece.jhu.edu/gert/courses/776
Model-Free Stochastic Perturbative
Adaptation and Optimization
OUTLINE
• Model-Free Learning
  – Model Complexity
  – Compensation of Analog VLSI Mismatch
• Stochastic Parallel Gradient Descent
  – Algorithmic Properties
  – Mixed-Signal Architecture
  – VLSI Implementation
• Extensions
  – Learning of Continuous-Time Dynamics
  – Reinforcement Learning
• Model-Free Adaptive Optics
  – AdOpt VLSI Controller
  – Adaptive Optics “Quality” Metrics
  – Applications to Laser Communication and Imaging
The Analog Computing Paradigm
• Local functions are efficiently implemented with minimal
circuitry, exploiting the physics of the devices.
• Excessive global interconnects are avoided:
– Currents or charges are accumulated along a single wire.
– Voltage is distributed along a single wire.
Pros:
– Massive Parallelism
– Low Power Dissipation
– Real-Time, Real-World Interface
– Continuous-Time Dynamics
Cons:
– Limited Dynamic Range
– Mismatches and Nonlinearities (WYDINWYG: what you design is not what you get)
Effect of Implementation Mismatches
[Block diagram: a system with adjustable parameters {p_i} maps inputs to outputs; the error ε(p) is evaluated against a reference.]
Associative Element:
– Mismatches can be properly compensated by adjusting the parameters p_i accordingly, provided sufficient degrees of freedom are available to do so.
Adaptive Element:
– Requires precise implementation.
– The accuracy of the implemented polarity (rather than amplitude) of the parameter update increments ∆p_i is the performance-limiting factor.
Example: LMS Rule
A linear perceptron under supervised learning:

$$y_i^{(k)} = \sum_j p_{ij}\, x_j^{(k)}$$

$$\varepsilon = \frac{1}{2} \sum_k \sum_i \left( y_i^{\mathrm{target}(k)} - y_i^{(k)} \right)^2$$

with gradient descent:

$$\Delta p_{ij}^{(k)} = -\eta\, \frac{\partial \varepsilon^{(k)}}{\partial p_{ij}} = \eta\, x_j^{(k)} \left( y_i^{\mathrm{target}(k)} - y_i^{(k)} \right)$$

reduces to an incremental outer-product update rule, with scalable, modular implementation in analog VLSI.
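As a concrete illustration (not part of the original slides), a minimal numerical sketch of the LMS outer-product rule in Python; the layer sizes, learning rate, and synthetic teacher are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes for illustration: 3 outputs, 5 inputs.
P = rng.normal(size=(3, 5)) * 0.1        # perceptron weights p_ij
P_true = rng.normal(size=(3, 5))         # unknown teacher mapping (assumption)
eta = 0.05                               # learning rate

for k in range(200):
    x = rng.normal(size=5)               # input x^(k)
    y_target = P_true @ x                # supervised target y_i^target(k)
    y = P @ x                            # perceptron output y_i^(k)
    # LMS outer-product update: dp_ij = eta * (y_target_i - y_i) * x_j
    P += eta * np.outer(y_target - y, x)

print("residual weight error:", np.linalg.norm(P - P_true))
```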
Incremental Outer-Product Learning in Neural Nets
[Network diagram: units i and j with activations x_i, x_j and error signals e_i, e_j, connected by weight p_ij.]
Multi-Layer Perceptron:

$$x_i = f\Big( \sum_j p_{ij}\, x_j \Big)$$

Outer-Product Learning Update:

$$\Delta p_{ij} = \eta\, x_j\, e_i$$

– Hebbian (Hebb, 1949): $e_i = x_i$
– LMS Rule (Widrow-Hoff, 1960): $e_i = f'_i \left( x_i^{\mathrm{target}} - x_i \right)$
– Backpropagation (Werbos, Rumelhart, LeCun): $e_j = f'_j \sum_i p_{ij}\, e_i$
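A hedged Python sketch of the outer-product update with the LMS and backpropagated error signals above; the two-layer network, tanh nonlinearity, and toy teacher are assumptions for illustration, not the slides' implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
f = np.tanh
df = lambda v: 1 - np.tanh(v) ** 2       # derivative f'

# Assumed toy sizes: 4 inputs -> 6 hidden -> 2 outputs.
W1 = rng.normal(size=(6, 4)) * 0.5
W2 = rng.normal(size=(2, 6)) * 0.5
eta = 0.1

for k in range(500):
    x = rng.normal(size=4)
    target = np.array([np.sin(x[0]), x[1] * x[2]])   # arbitrary teacher
    h = f(W1 @ x)                        # hidden activations
    y = f(W2 @ h)                        # output activations
    e_out = df(W2 @ h) * (target - y)    # LMS error at the output layer
    e_hid = df(W1 @ x) * (W2.T @ e_out)  # backprop: e_j = f'_j * sum_i p_ij e_i
    W2 += eta * np.outer(e_out, h)       # outer-product update dp_ij = eta x_j e_i
    W1 += eta * np.outer(e_hid, x)
```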
Gradient Descent Learning
Minimize ε(p) by iterating:

$$p_i^{(k+1)} = p_i^{(k)} - \eta\, \frac{\partial \varepsilon^{(k)}}{\partial p_i}$$

from calculation of the gradient:

$$\frac{\partial \varepsilon}{\partial p_i} = \sum_l \sum_m \frac{\partial \varepsilon}{\partial y_l} \cdot \frac{\partial y_l}{\partial x_m} \cdot \frac{\partial x_m}{\partial p_i}$$
Implementation Problems:
– Requires an explicit model of the internal network dynamics.
– Sensitive to model mismatches and noise in the implemented
network and learning system.
– Amount of computation typically scales strongly with the
number of parameters.
Gradient-Free Approach to Error-Descent Learning
Avoid the model sensitivity of gradient descent by observing the parameter dependence of the performance error directly on the network, rather than calculating gradient information from a presumed model of the network.
Stochastic Approximation:
– Multi-dimensional Kiefer-Wolfowitz (Kushner & Clark 1978)
– Function Smoothing Global Optimization (Styblinski & Tang 1990)
– Simultaneous Perturbation Stochastic Approximation (Spall 1992)
Hardware-Related Variants:
– Model-Free Distributed Learning (Dembo & Kailath 1990)
– Noise Injection and Correlation (Anderson & Kerns; Kirk et al. 1992-93)
– Stochastic Error Descent (Cauwenberghs 1993)
– Constant Perturbation, Random Sign (Alspector et al. 1993)
– Summed Weight Neuron Perturbation (Flower & Jabri 1993)
Stochastic Error-Descent Learning
Minimize ε(p) by iterating:

$$\mathbf{p}^{(k+1)} = \mathbf{p}^{(k)} - \mu\, \hat{\varepsilon}^{(k)} \boldsymbol{\pi}^{(k)}$$

from observation of the gradient in the direction of π^(k):

$$\hat{\varepsilon}^{(k)} = \frac{1}{2} \left[ \varepsilon\big( \mathbf{p}^{(k)} + \boldsymbol{\pi}^{(k)} \big) - \varepsilon\big( \mathbf{p}^{(k)} - \boldsymbol{\pi}^{(k)} \big) \right]$$

with random uncorrelated binary components of the perturbation vector π^(k):

$$\pi_i^{(k)} = \pm\sigma\,; \qquad E\big( \pi_i^{(k)} \pi_j^{(l)} \big) \approx \sigma^2\, \delta_{ij}\, \delta_{kl}$$
Advantages:
– No explicit model knowledge is required.
– Robust in the presence of noise and model mismatches.
– Computational load is significantly reduced.
– Allows simple, modular, and scalable implementation.
– Convergence properties similar to exact gradient descent (see the sketch below).
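The following Python sketch illustrates the two-sided perturbative update on a toy quadratic error surface; the error function, dimensions, and step sizes are assumptions, and the observed ε can be any black-box measurement.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical error surface: epsilon(p) = |p - p_opt|^2, a stand-in for the
# network's observed performance error (no model knowledge is used below).
p_opt = rng.normal(size=10)
epsilon = lambda p: np.sum((p - p_opt) ** 2)

p = np.zeros(10)
sigma, mu = 0.05, 0.5                    # perturbation amplitude, learning rate

for k in range(2000):
    pi = sigma * rng.choice([-1.0, 1.0], size=10)   # binary perturbation pi^(k)
    # Directional gradient estimate from two complementary observations:
    eps_hat = 0.5 * (epsilon(p + pi) - epsilon(p - pi))
    p -= mu * eps_hat * pi               # p^(k+1) = p^(k) - mu * eps_hat * pi

print("final error:", epsilon(p))
```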
Stochastic Perturbative Learning
Cell Architecture
[Cell architecture: each local cell stores p_i(t) on a register (z⁻¹), adds the perturbation φ(t)·π_i(t), and accumulates the globally broadcast update −η ε̂(t) weighted by its local π_i(t); the network globally evaluates ε(p(t) + φ(t) π(t)).]

$$\mathbf{p}^{(k+1)} = \mathbf{p}^{(k)} - \mu\, \hat{\varepsilon}^{(k)} \boldsymbol{\pi}^{(k)}, \qquad \hat{\varepsilon}^{(k)} = \frac{1}{2} \left[ \varepsilon\big( \mathbf{p}^{(k)} + \boldsymbol{\pi}^{(k)} \big) - \varepsilon\big( \mathbf{p}^{(k)} - \boldsymbol{\pi}^{(k)} \big) \right]$$
Stochastic Perturbative Learning
Circuit Cell
[Circuit cell: the binary perturbation π_i switches levels V_σ+ and V_σ− onto capacitor C_perturb, adding φ(t)·π_i(t) to the parameter stored on C_store; a charge pump biased by V_bp and V_bn, gated by EN_p and EN_n, applies update increments whose polarity POL is set by sign(ε̂).]
Charge Pump Characteristics
[Measured charge-pump characteristics: (a) voltage increments and (b) voltage decrements ∆V_stored (log scale, 10⁻⁵ V to 1 V) versus gate bias V_bn and V_bp (0 to 0.6 V), for adaptation times ∆t of 0, 23 µs, 1 ms, and 40 ms. Inset: the charge pump delivers ∆Q_adapt = I_adapt · ∆t onto storage capacitor C under polarity control POL, gated by EN_p and EN_n.]
Supervised Learning of Recurrent Neural Dynamics
[Schematic: a six-neuron recurrent network (weights W_11 … W_66, thresholds θ_1 … θ_6, offset W_off, reference currents ±I_ref) with binary quantization Q(·) of the probed signals, teacher forcing of the first two states by targets x_1^T(t) and x_2^T(t), and row/column perturbations π_i^H, π_j^V multiplexing the update activation and probing. The network implements the continuous-time dynamical system]

$$\frac{d\mathbf{x}}{dt} = F(\mathbf{p}, \mathbf{x}, \mathbf{y}), \qquad \mathbf{z} = G(\mathbf{x})$$
The Credit Assignment Problem
or How to Learn from Delayed Rewards
[Block diagram: a system with parameters {p_i} maps inputs to outputs; an adaptive critic transforms the external reinforcement r(t) into an internal reinforcement r̂(t) that drives adaptation.]
– External, discontinuous reinforcement signal r(t).
– Adaptive Critics:
  • Discrimination Learning (Grossberg, 1975)
  • Heuristic Dynamic Programming (Werbos, 1977)
  • Reinforcement Learning (Sutton and Barto, 1983)
  • TD(λ) (Sutton, 1988)
  • Q-Learning (Watkins, 1989)
Reinforcement Learning
(Barto and Sutton, 1983)
Locally tuned, address-encoded neurons:

$$\chi(t) \in \{0, \ldots, 2^n - 1\}: \quad n\text{-bit address encoding of the state space}$$

$$y(t) = y_{\chi(t)}: \ \text{classifier output} \qquad q(t) = q_{\chi(t)}: \ \text{adaptive critic}$$

Adaptation of classifier and adaptive critic:

$$y_k(t+1) = y_k(t) + \alpha\, \hat{r}(t)\, e_k(t)\, y_k(t)$$

$$q_k(t+1) = q_k(t) + \beta\, \hat{r}(t)\, e_k(t)$$

– eligibilities:

$$e_k(t+1) = \lambda\, e_k(t) + (1 - \lambda)\, \delta_{k\,\chi(t)}$$

– internal reinforcement:

$$\hat{r}(t) = r(t) + \gamma\, q(t) - q(t-1)$$
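A minimal Python sketch exercising these update equations on a toy address-encoded state space; the random state transitions, sparse penalty signal, and all constants are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n_states = 8                              # 2^n address-encoded states (toy n = 3)
y = rng.uniform(-0.1, 0.1, n_states)      # classifier outputs y_k
q = np.zeros(n_states)                    # adaptive critic values q_k
e = np.zeros(n_states)                    # eligibilities e_k
alpha, beta, lam, gamma = 0.1, 0.2, 0.8, 0.9

q_prev, state = 0.0, rng.integers(n_states)
for t in range(1000):
    delta = np.zeros(n_states)
    delta[state] = 1.0                    # one-hot decoding of address chi(t)
    e = lam * e + (1 - lam) * delta       # eligibility trace update
    r = rng.choice([0.0, -1.0], p=[0.95, 0.05])   # sparse external penalty (toy)
    r_hat = r + gamma * q[state] - q_prev         # internal reinforcement
    y += alpha * r_hat * e * y            # classifier update
    q += beta * r_hat * e                 # critic update
    q_prev = q[state]
    state = rng.integers(n_states)        # toy random state transitions

print("critic values:", np.round(q, 3))
```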
Reinforcement Learning Classifier for Binary Control
[Chip schematic: an array of 64 reinforcement-learning neurons, each combining a state-eligibility trace, an adaptive-critic charge pump, and an action charge pump (controls UPD, HYST, LOCK; biases V_bp, V_bn, V_αp, V_αn, V_δ). The quantized state (x_1(t), x_2(t)) selects a neuron through address lines SEL_hor and SEL_vert, the internal reinforcement r̂ gates the updates, and the action network outputs the binary control y(t) = ±1 driving the plant input u(t).]
A Biological Adaptive Optics System
[Diagram: the human eye as an adaptive optics system: cornea, iris, lens, zonule fibers, retina, optic nerve, and brain in the control loop.]
Wavefront Distortion and Adaptive Optics
• Imaging
  – defocus
  – motion
• Laser beam
  – beam wander/spread
  – intensity fluctuations
Adaptive Optics
Conventional Approach
– Performs phase conjugation
• assumes intensity is unaffected
– Complex
• requires an accurate wavefront phase sensor (Shack-Hartmann; Zernike nonlinear filter; etc.)
• computationally intensive control system
Adaptive Optics
Model-Free Integrated Approach
[Diagram: incoming wavefront impinging on a wavefront corrector with N elements u_1, …, u_n, …, u_N.]
– Optimizes a direct measure J of optical performance (“quality metric”)
– No (explicit) model information is required
• any type of quality metric J, wavefront corrector (MEMS, LC, …)
• no need for wavefront phase sensor
– Tolerates imprecision in the implementation of the updates
• system-level precision limited by the accuracy of the measured J
Adaptive Optics Controller Chip
Optimization by Parallel Perturbative Stochastic Gradient Descent
[Diagram: the distorted wavefront Φ(u) forms an image; a performance metric sensor measures J(u), and the AdOpt VLSI wavefront controller drives the wavefront corrector with controls u.]
Adaptive Optics Controller Chip
Optimization by Parallel Perturbative Stochastic Gradient Descent
[Diagram: the controls are perturbed to u + δu, giving wavefront Φ(u + δu) and measured metric J(u + δu).]
Adaptive Optics Controller Chip
Optimization by Parallel Perturbative Stochastic Gradient Descent
[Diagram: the complementary perturbation u − δu gives Φ(u − δu) and J(u − δu); the differential metric δJ then drives the parallel update]

$$u_j^{(k+1)} = u_j^{(k)} + \gamma\, \delta J^{(k)}\, \delta u_j^{(k)}$$
Parallel Perturbative Stochastic Gradient Descent
Architecture
[Architecture: each channel j stores its control u_j on a register (z⁻¹) and adds a local binary perturbation δu_j, uncorrelated across channels (⟨δu_i δu_j⟩ = σ² δ_ij); the globally measured differential metric γ δJ is broadcast to all channels and multiplied by the local δu_j to form the update.]

$$u_j^{(k+1)} = u_j^{(k)} + \gamma\, \delta J^{(k)}\, \delta u_j^{(k)}$$

$$2\, \delta J^{(k)} = J\big( u_1^{(k)} + \delta u_1^{(k)}, \ldots, u_N^{(k)} + \delta u_N^{(k)} \big) - J\big( u_1^{(k)} - \delta u_1^{(k)}, \ldots, u_N^{(k)} - \delta u_N^{(k)} \big)$$
Parallel Perturbative Stochastic Gradient Descent
Mixed-Signal Architecture
[Mixed-signal architecture: a digital Bernoulli generator supplies the perturbation polarities δu_j = ±1 and the update polarity sgn(δJ) ∈ {−1, 0, +1}; the analog path supplies the update amplitude γ |δJ|, accumulated onto u_j through a register (z⁻¹).]

$$u_j^{(k+1)} = u_j^{(k)} + \gamma\, |\delta J^{(k)}|\, \mathrm{sgn}\big( \delta J^{(k)} \big)\, \delta u_j^{(k)}$$

Amplitude (analog) and polarity (digital) of the update are thus handled separately.
Wavefront Controller
VLSI Implementation
Edwards, Cohen, Cauwenberghs, Vorontsov & Carhart (1999)
• Generate Bernoulli-distributed $\delta u_j^{(k)}$ with $|\delta u_j^{(k)}| = \sigma$ and $\mathrm{sgn}\big(\delta u_j^{(k)}\big) = \pi_j^{(k)} = \pm 1$
• Decompose $\delta J^{(k)}$ into $\mathrm{sgn}\big(\delta J^{(k)}\big) \cdot |\delta J^{(k)}|$
• Update $u_j^{(k+1)} = u_j^{(k)} - \gamma'\, \mathrm{xor}\big( \mathrm{sgn}(\delta J^{(k)}),\, \pi_j^{(k)} \big)$

AdOpt mixed-mode chip: 2.2 mm², 1.2 µm CMOS
• Controls 19 channels
• Interfaces with LC SLM or MEMS mirrors
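A Python sketch of this sign/magnitude-decomposed update loop, assuming a hypothetical quadratic quality metric J(u) as a stand-in for the optical measurement; the channel count matches the chip's 19, and all other constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical quality metric, peaked when all N controls reach u_opt.
N = 19                                   # AdOpt chip controls 19 channels
u_opt = rng.uniform(-1, 1, N)
J = lambda u: -np.sum((u - u_opt) ** 2)  # stand-in for the optical metric

u = np.zeros(N)
sigma, gain = 0.02, 5.0                  # perturbation amplitude, update gain

for k in range(3000):
    pi = rng.choice([-1, 1], size=N)     # Bernoulli polarities pi_j = +/-1
    dJ = 0.5 * (J(u + sigma * pi) - J(u - sigma * pi))
    # Decompose dJ into polarity and amplitude. The chip forms the update
    # direction sgn(dJ)*pi_j with an XOR gate on the two sign bits (hence
    # the minus sign and gamma' in the slide); here the product is used.
    polarity = np.sign(dJ) * pi
    u += gain * abs(dJ) * sigma * polarity

print("metric after adaptation:", J(u))
```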
Wavefront Controller
Chip-level Characterization
Wavefront Controller
System-level Characterization (LC SLM)
[Setup: HEX-127 liquid-crystal spatial light modulator (LC SLM).]
Wavefront Controller
System-level Characterization (SLM)
Maximized and minimized J(u) 100 times.
[Plot: metric traces converging to the maximum and minimum of J(u).]
Wavefront Controller
System-Level Characterization (MEMS)
Weyrauch, Vorontsov, Bifano, Hammer, Cohen & Cauwenberghs
[Measured response function of the MEMS mirror (OKO).]
Wavefront Controller
System-level Characterization
(over 500 trials)
Quality Metrics
imaging:

$$J_{\mathrm{image}} = \int \left| \nabla I(\mathbf{r}, t) \right|^{\upsilon} d^2\mathbf{r}$$

Horn (1968), Delbruck (1999)

laser beam focusing:

$$J_{\mathrm{beam}} = F\{ I(\mathbf{r}, t) \}$$

Muller et al. (1974), Vorontsov et al. (1996)

laser comm:

$$J_{\mathrm{comm}} = \text{bit-error rate}$$
Image Quality Metric Chip
Cohen, Cauwenberghs, Vorontsov & Carhart (2001)
$$\mathrm{IQM} = \frac{ \displaystyle\sum_{i,j}^{N,M} \big( I * K \big)_{i,j} }{ \displaystyle\sum_{i,j}^{N,M} I_{i,j} } \qquad K = \begin{bmatrix} 0 & -1 & 0 \\ -1 & +4 & -1 \\ 0 & -1 & 0 \end{bmatrix}$$

where $I_{i,j}$ is the image and $I * K$ the edge image.

[Chip: 2 mm × 2 mm die, 22 × 22 pixel array, 120 µm × 120 µm pixels.]
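A numpy sketch of the IQM computation with the kernel above; the rectification of the edge image and the synthetic test images are assumptions for illustration.

```python
import numpy as np

# 3x3 Laplacian edge kernel from the slide: center +4, four neighbors -1.
K = np.array([[ 0, -1,  0],
              [-1,  4, -1],
              [ 0, -1,  0]], dtype=float)

def iqm(image):
    """Image quality metric: edge energy normalized by total intensity."""
    I = image.astype(float)
    # 'valid' 2-D convolution of I with K, written as shifted sums.
    edge = (4 * I[1:-1, 1:-1]
            - I[:-2, 1:-1] - I[2:, 1:-1]
            - I[1:-1, :-2] - I[1:-1, 2:])
    # Rectified edge image (an assumption; the slide's numerator is I * K).
    return np.abs(edge).sum() / I.sum()

# Synthetic test: a sharp edge scores higher than a smooth ramp.
sharp = np.full((22, 22), 0.1)
sharp[:, 11:] = 1.0
ramp = np.tile(np.linspace(0.1, 1.0, 22), (22, 1))
print(iqm(sharp), iqm(ramp))
```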
Architecture (3 x 3)
Edge and Corner Kernels
[Diagram: 3 × 3 cell architecture implementing the edge kernel (center +4, four nearest neighbors −1).]
Image Quality Metric Chip Characterization
Experimental Setup
Image Quality Metric Chip Characterization
Experimental Results
[Plot: measured IQM; dynamic range ≈ 20, with the uniformly illuminated image as baseline.]
Image Quality Metric Chip Characterization
Experimental Results
Beam Variance Metric Chip
Cohen, Cauwenberghs, Vorontsov & Carhart (2001)
$$\mathrm{BVM} = N \cdot M \cdot \frac{ \displaystyle\sum_{i,j}^{N,M} I_{i,j}^2 }{ \left( \displaystyle\sum_{i,j}^{N,M} I_{i,j} \right)^2 }$$

[Chip: 22 × 22 pixel array (20 × 20 active, surrounded by a perimeter of dummy pixels), 70 µm pixels.]
Including the photodetector nonlinearity (response ∝ $I^{1+\kappa_u}$), the measured metric is approximately

$$I_{\mathrm{BVM}} \approx N \cdot M \cdot \frac{ \displaystyle\sum_{i,j}^{N,M} I_{i,j}^{2(1+\kappa_u)} }{ \left( \displaystyle\sum_{i,j}^{N,M} I_{i,j}^{1+\kappa_u} \right)^2 }$$
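A numpy sketch of the BVM, with the photodetector nonlinearity κ_u folded in as assumed above; the Gaussian test beams are synthetic.

```python
import numpy as np

def bvm(I, kappa_u=0.0):
    """Beam variance metric N*M * sum(I^2) / (sum I)^2; kappa_u models the
    assumed photodetector nonlinearity (response ~ I^(1 + kappa_u))."""
    Iq = np.asarray(I, float) ** (1.0 + kappa_u)   # per-pixel photocurrent
    N, M = Iq.shape
    return N * M * (Iq ** 2).sum() / Iq.sum() ** 2

# A tightly focused beam yields a larger BVM than a spread-out beam.
x = np.linspace(-1, 1, 20)
X, Y = np.meshgrid(x, x)
tight = np.exp(-(X**2 + Y**2) / 0.02)
spread = np.exp(-(X**2 + Y**2) / 0.5)
print(bvm(tight), bvm(spread))
```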
Beam Variance Metric Chip Characterization
Experimental Setup
Beam Variance Metric Chip Characterization
Experimental Results
[Plot: measured BVM; dynamic range ≈ 5.]
Beam Variance Metric Chip Characterization
Experimental Results
Beam Variance Metric Sensor in the Loop
Laser Receiver Setup
Beam Variance Metric Sensor in the Loop
Conclusions
• Computational primitives of adaptation and learning are naturally implemented in analog VLSI, and make it possible to compensate for inaccuracies in the physical implementation of the system under adaptation.
• Care must still be taken to avoid inaccuracies in the implementation of the adaptive element itself. This is readily achieved by ensuring the correct polarity, rather than amplitude, of the parameter update increments.
• Adaptation algorithms based on physical observation of the “performance” gradient in parameter space are better suited to analog VLSI implementation than algorithms based on a calculated gradient.
• Among the most generally applicable learning architectures are those that operate on reinforcement signals, and those that blindly extract and classify signals.
• Model-free adaptive optics leads to an efficient and robust analog implementation of the control algorithm, using a criterion that can be freely chosen to accommodate different wavefront correctors and different imaging or laser communication applications.