Neural Networks as Universal Approximators
Yotam Amar and Heli Ben Hamu
Advanced Reading in Deep-Learning and Vision
Spring 2017
Motivation
Practical uses:
• Replicating black-box functions (e.g., decryption functions)
• More efficient implementations
• Visualization tools for neural networks
• Optimizing functions by backpropagating through the approximation
Theoretical uses:
• Understanding NNs as a hypothesis class
• What can NNs be used for (besides vision)?
Universal Approximation Theorem
[Diagram: a network G receives an input X and produces an output G(X) ≈ f(X).]
Theorem. A feedforward single hidden layer network with
finite width can approximate continuous functions on
compact subsets of ℝⁿ under mild assumptions on the
activation function.
Single Hidden Layer NN
[Diagram: input layer X = [x_1, x_2, …, x_n], a hidden layer whose j-th neuron computes σ(w_j^T X + b_j), and a single output neuron.]

G(X) = \sum_{j=1}^{N} a_j \, \sigma(w_j^T X + b_j)
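The network G(X) above fits in a few lines of NumPy. The following is a minimal sketch (not from the slides): the hidden weights w_j, b_j are drawn at random and only the output weights a_j are fitted by least squares to a target f; the target, the width N and the weight scales are arbitrary illustrative choices.

# A minimal sketch (not from the slides): G(X) = sum_j a_j * sigmoid(w_j^T X + b_j)
# with random hidden weights (w_j, b_j) and output weights a_j fitted by least
# squares to samples of a target function f on the unit interval I_1 = [0, 1].
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f(x):                                         # target continuous function
    return np.sin(2 * np.pi * x) + 0.5 * x

N = 50                                            # hidden width
W = rng.normal(scale=10.0, size=(N, 1))           # hidden weights w_j
b = rng.normal(scale=10.0, size=N)                # hidden biases b_j

X = np.linspace(0.0, 1.0, 200).reshape(-1, 1)     # grid on I_1
H = sigmoid(X @ W.T + b)                          # hidden activations, shape (200, N)
a, *_ = np.linalg.lstsq(H, f(X[:, 0]), rcond=None)  # output weights a_j

G = H @ a                                         # network output G(X) on the grid
print("max |G(X) - f(X)| on the grid:", np.max(np.abs(G - f(X[:, 0]))))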
Quick Review: Measure Theory
(Ω, ℬ, μ) – a measure space: Ω is the set of all outcomes, ℬ is a σ-algebra on Ω, and μ is a measure.
A measure μ takes a set A ∈ ℬ and returns "the measure of the set":
Measure of A = \int_A d\mu(x)
Measurable function – a function between measurable spaces for which the preimage of every measurable set is measurable.
Borel algebra – the smallest σ-algebra containing all open sets.
Borel measure – a measure defined on a Borel σ-algebra.
Notations
I_n – the n-dimensional unit cube [0, 1]^n
C(I_n) – the space of continuous functions on I_n
M(I_n) – the space of finite, signed, regular Borel measures on I_n
Definitions
Discriminatory function
We say that σ is discriminatory if, for a measure μ ∈ M(I_n),
\int_{I_n} \sigma(w^T X + b) \, d\mu(X) = 0 \quad \text{for all } w \in \mathbb{R}^n, \ b \in \mathbb{R}
implies μ = 0.
Sigmoidal function
We say that σ is sigmoidal if
σ(t) → 1 as t → +∞ and σ(t) → 0 as t → −∞.
Theorem 1 [Cybenko 89’]
Let σ be any continuous discriminatory function. Then finite sums of the form
G(X) = \sum_{j=1}^{N} a_j \, \sigma(w_j^T X + b_j)
are dense in C(I_n).
In other words, given any f ∈ C(I_n) and ε > 0, there is a sum G(X) of the above form for which
\| G - f \|_\infty < \varepsilon.
Theorem 1 [Cybenko 89’]
σ continuous and discriminatory ⇒ G(X) ≈ f(X)
Proof of Theorem 1 [Cybenko 89’]
Let S ⊂ C(I_n) be the set of functions of the form G(X); S is a linear subspace.
Claim: the closure of S is all of C(I_n).
[Diagram: S sitting inside C(I_n).]
Proof of Theorem 1 [Cybenko 89’]
Corollary of Hahn-Banach Theorem
Let (V, ‖·‖) be a normed linear space, U a subspace of V, and a ∈ V. Then
a ∈ cl(U) ⟺ there is no bounded linear functional L: V → ℝ such that L(u) = 0 for all u ∈ U and L(a) ≠ 0.
Proof of Theorem 1 [Cybenko 89’]
In our case V = C(I_n) and U = S.
Suppose, toward a contradiction, that cl(S) ≠ C(I_n), i.e., there is some h ∈ C(I_n) with h ∉ cl(S).
Then, by the corollary, there exists a bounded linear functional L: C(I_n) → ℝ with L(S) = 0 and L(h) ≠ 0; in particular, L ≠ 0.
[Diagram: h lying in C(I_n) outside cl(S) ⊇ S.]
Proof of Theorem 1 [Cybenko 89’]
Riesz Representation Theorem (relaxed version)
Any bounded linear functional F on C(I_n) can be written as integration against a measure μ ∈ M(I_n):
F(g) = \int_{I_n} g(X) \, d\mu(X) \quad \text{for all } g \in C(I_n).
Therefore L can be represented as
L(g) = \int_{I_n} g(X) \, d\mu(X) \quad \text{for all } g \in C(I_n), for some μ ∈ M(I_n).
Proof of Theorem 1 [Cybenko 89’]
In particular, for every w ∈ ℝⁿ and b ∈ ℝ, the function σ(w^T X + b) belongs to S. Therefore, for all w and b,
0 = L\big(\sigma(w^T X + b)\big) = \int_{I_n} \sigma(w^T X + b) \, d\mu(X) \quad \text{(by Riesz)}.
By our assumption that σ is discriminatory, μ = 0, and hence L = 0.
Contradiction!
Therefore S is dense in C(I_n). ∎
Theorem 1 [Cybenko 89’]
Recall: σ continuous and discriminatory ⇒ G(X) ≈ f(X)
Lemma 1 [Cybenko 89’]
Any bounded, measurable sigmoidal function, 𝜎, is
discriminatory. In particular, any continuous sigmoidal
function is discriminatory.
Theorem 2 [Cybenko 89’]
σ continuous and sigmoidal ⇒ G(X) ≈ f(X)
Can We Classify?
The previous result shows that we can approximate continuous functions up to any desired precision.
Classification, however, involves approximating a discontinuous function: a decision function.
Can we still do it? Yes (see Theorem 3).
What is the catch?
• No guarantee on the size of the hidden layer
• The curse of dimensionality: the number of nodes needed usually depends on the dimension
• No optimization guarantees: the theorem says nothing about how to find the weights
Are there functions with better guarantees? What about multilayer networks?
New results for Barron functions:
• The approximation rate does not depend on the dimension
• A single Barron function can be approximated with one hidden layer
• A composition of n Barron functions can be approximated with n hidden layers
• Still no optimization guarantees
Barron Functions
Let B ⊆ ℝⁿ be a bounded set. A function f: B → ℝ is called C-Barron on B if there exists g: ℝⁿ → ℝ such that
g|_B = f, \quad \| g \|^{*}_{B} \le C, \quad g \in \mathcal{F}_B.
(The Barron constant C_{f,B} is defined below.)
Fourier transform of f: ℝⁿ → ℝ:
\hat{f}(\omega) := \frac{1}{(2\pi)^n} \int_{\mathbb{R}^n} f(x) \, e^{-i\langle \omega, x\rangle} \, dx
For f: ℝⁿ → ℝᵐ the Fourier transform is taken component-wise.
Inverse Fourier transform:
\mathcal{F}^{-1}[g](x) := \int_{\mathbb{R}^n} g(\omega) \, e^{i\langle \omega, x\rangle} \, d\omega = (2\pi)^n \, g^{\vee}(x)
Barron functions
Here, for a bounded set B ⊆ ℝⁿ and g: ℝⁿ → ℝ, the quantities ‖g‖*_B and ℱ_B are defined by:
| \omega |_B := \sup_{x \in B} |\langle \omega, x \rangle|
\| g \|^{*}_{B} := \int_{\mathbb{R}^n} | \omega |_B \, |\hat{g}(\omega)| \, d\omega
\mathcal{F}_B := \left\{ g: \mathbb{R}^n \to \mathbb{R} \ : \ g(x) = g(0) + \int_{\mathbb{R}^n} \left( e^{i\langle \omega, x\rangle} - 1 \right) \hat{g}(\omega) \, d\omega \ \text{for all } x \in B \right\}
Barron functions
We denote by Γ_B(C) the set of C-Barron functions on B.
For f: B → ℝ we define its Barron constant
C_{f,B} := \inf_{g|_B = f, \ g \in \mathcal{F}_B} \| g \|^{*}_{B}
Barron functions
Some facts:
• If C_{f,B} < ∞ then f is continuously differentiable.
• C_{f,B} is larger the more "spread out" f is.
• Computing C_{f,B} exactly is, in general, an open problem.
• It is easy to get upper bounds, by choosing some particular extension g (see the numerical sketch below).
• It is hard to get lower bounds, but there are ways.
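As an illustration of the upper-bound-by-extension idea, here is a small numerical sketch (not from the slides). It assumes B = [−1, 1] ⊂ ℝ and the Gaussian extension g(x) = e^{−x²/2}; with the Fourier convention above, ĝ(ω) = e^{−ω²/2}/√(2π), so the Barron norm has the closed form √(2/π), which the script reproduces numerically.

# A small numerical sketch (not from the slides): upper-bounding the Barron
# norm ||g||*_B = ∫ |w|_B |ĝ(w)| dw in one dimension, assuming B = [-1, 1]
# and the Gaussian extension g(x) = exp(-x^2/2). With the convention
# ĝ(w) = (1/(2π)) ∫ g(x) e^{-iwx} dx we get ĝ(w) = exp(-w^2/2)/sqrt(2π).
import numpy as np

w = np.linspace(-20.0, 20.0, 200001)         # frequency grid; tails are negligible
dw = w[1] - w[0]
g_hat = np.exp(-w**2 / 2) / np.sqrt(2 * np.pi)

w_B = np.abs(w)                               # for B = [-1, 1]: |w|_B = sup_{x in B} |w x| = |w|
numeric = np.sum(w_B * np.abs(g_hat)) * dw    # Riemann-sum approximation of the integral

print("numerical   ||g||*_B  :", numeric)             # ≈ 0.7979
print("closed form sqrt(2/pi):", np.sqrt(2.0 / np.pi))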
Barron Theorem [Bar93]
Let B ⊆ ℝⁿ be a bounded set, let f: B → ℝ be C-Barron on B, let μ be a probability measure on B, and let σ: ℝ → ℝ be a sigmoidal function.
Then there exist w_i ∈ ℝⁿ, b_i ∈ ℝ and a_i ∈ ℝ with \sum_{i=1}^{k} |a_i| \le 2C such that, writing
f_k(X) = \sum_{i=1}^{k} a_i \, \sigma(w_i^T X + b_i),
we have
\| f - f_k \|_\mu^2 := \int_B \big( f(X) - f_k(X) \big)^2 \, d\mu(X) \le \frac{(2C)^2}{k}.
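A quick consequence of this bound, worked out here as simple algebra (it is not spelled out on the slide): since \| f - f_k \|_\mu \le 2C / \sqrt{k}, taking
k \ge \frac{4 C^2}{\varepsilon^2}
nodes already guarantees \| f - f_k \|_\mu \le \varepsilon, with no explicit dependence on the input dimension n; the dimension enters only through the Barron constant C. A quantity of this form reappears as the per-layer width in the multilayer theorem below.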
Common Activation Functions
Name         Equation                               Range
Binary step  f(x) = 0 for x < 0,  1 for x ≥ 0       {0, 1}
Logistic     f(x) = 1 / (1 + e^{-x})                (0, 1)
TanH         f(x) = tanh(x)                         (−1, 1)
Arctan       f(x) = tan^{-1}(x)                     (−π/2, π/2)
ReLU         f(x) = 0 for x < 0,  x for x ≥ 0       [0, ∞)
(The original slide also showed a plot of each function.)
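For reference, here is a small sketch (not part of the slides) implementing the activations in the table with NumPy; the comments note which of them are sigmoidal in the sense defined earlier.

# NumPy implementations of the activation functions in the table above
# (a small sketch, not part of the slides).
import numpy as np

def binary_step(x):     # range {0, 1}; bounded, measurable and sigmoidal
    return np.where(x < 0, 0.0, 1.0)

def logistic(x):        # range (0, 1); continuous and sigmoidal
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):            # range (-1, 1); sigmoidal only after rescaling to (0, 1)
    return np.tanh(x)

def arctan(x):          # range (-pi/2, pi/2); likewise needs rescaling
    return np.arctan(x)

def relu(x):            # range [0, inf); unbounded, hence not sigmoidal
    return np.maximum(0.0, x)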
Multilayer Barron Theorem [2017]
For 0 ≤ i ≤ l let m_i ∈ ℕ, and for 1 ≤ i ≤ l let f_i: ℝ^{m_{i−1}} → ℝ^{m_i}.
We want a network g with g ≈ f_l ∘ f_{l−1} ∘ ⋯ ∘ f_1 = f_{l:1} (f_1 is applied first).
[Diagram: g processes X layer by layer; the output of its i-th layer approximates f_{i:1}(X) = (f_i ∘ ⋯ ∘ f_1)(X).]
Multilayer Barron Theorem [2017]
Let ε, s > 0 be parameters.
For 0 ≤ i ≤ l let m_i ∈ ℕ and let K_i ⊂ ℝ^{m_i} be a set; for 1 ≤ i ≤ l let f_i: ℝ^{m_{i−1}} → ℝ^{m_i}; let μ_0 be a probability distribution on ℝ^{m_0}.
Suppose that:
1. supp(μ_0) ⊂ K_0 (support of the initial distribution)
2. each f_i is 1-Lipschitz
3. f_1 ∈ Γ_{K_0}(C_1) and, for 1 < i ≤ l, f_i ∈ Γ_{K_{i−1} + sB_{m_{i−1}}}(C_i), where B_m is the unit ball of ℝ^m (each f_i is Barron on an s-neighborhood of the previous set)
4. f_i(K_{i−1}) ⊆ K_i (each f_i takes its set into the next one)
We also denote the diameter of K_l by D.
Then there exists a neural network g with l hidden layers and \lceil 4 C_i^2 m_i / \varepsilon^2 \rceil nodes on the i-th layer such that
\left( \int_{K_0} \| f_{l:1}(X) - g(X) \|^2 \, d\mu_0(X) \right)^{1/2} \le l \varepsilon \left( 1 + 2 C_l \sqrt{m_l} \right) + D \sqrt{\frac{2 l \varepsilon^2}{3 s^2}}
Multilayer Barron Theorem [2017]
Why is this important?
Because there are functions that are compositions of Barron functions with small Barron constants, while the composition itself has a huge Barron constant:
for n ∈ ℕ there is f = j ∘ k with C_j, C_k ≤ poly(n) but C_f ≥ cⁿ for some c > 1.
Multilayer Barron Theorem [2017]
What’s next?
• Is there a separation between compositions of l functions and compositions of l + 1 functions?
• Which natural transformations can be represented as compositions of Barron functions?
• Can we learn the approximation, and how many examples do we need? (optimization)
Matlab Demo
fitnet vs. parametric curve fitting
Curve fitting
• Advantages: if the model of the function is known, we only need to estimate its parameters.
• Disadvantages: we need a model before we can start fitting.
fitnet
• Advantages: does not need a model of the function in order to approximate it well.
• Disadvantages: sometimes we want to learn something from the parameters of the model, but the NN is a black box.
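The original demo used MATLAB's fitnet; below is a rough Python analogue (an illustrative sketch, not the original demo), assuming scikit-learn and SciPy are available. The curve-fitting side must assume a parametric model; the network side does not.

# A rough Python analogue of the MATLAB fitnet vs. curve-fitting demo
# (not the original demo; assumes scikit-learn and SciPy are installed).
import numpy as np
from scipy.optimize import curve_fit
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # noisy samples

# Parametric curve fitting: we must assume a model, here a * sin(b*x + c).
def model(x, a, b, c):
    return a * np.sin(b * x + c)

(a, b, c), _ = curve_fit(model, x, y, p0=[1.0, 6.0, 0.0])
print("curve_fit parameters:", a, b, c)

# Network fitting (fitnet-style): one hidden layer, no model assumed.
net = MLPRegressor(hidden_layer_sizes=(20,), activation="tanh",
                   max_iter=5000, random_state=0)
net.fit(x.reshape(-1, 1), y)
print("NN train MSE:", np.mean((net.predict(x.reshape(-1, 1)) - y) ** 2))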
Questions
References
• George Cybenko (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems.
• Andrew R. Barron (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory. [Bar93]
• Walter Rudin (1991). Functional Analysis. McGraw-Hill. (Hahn-Banach Theorem)
• Riesz Representation Theorem – Wolfram MathWorld.
• Holden Lee, Rong Ge, Andrej Risteski, Tengyu Ma, Sanjeev Arora (2017). On the ability of neural nets to express distributions.
Theorem 2 [Cybenko 89’]
Let σ be any continuous sigmoidal function. Then finite sums of the form
G(X) = \sum_{j=1}^{N} \alpha_j \, \sigma(Y_j^T X + \theta_j)
are dense in C(I_n).
In other words, given any f ∈ C(I_n) and ε > 0, there is a sum G(X) of the above form for which
| G(X) - f(X) | < \varepsilon \quad \text{for all } X \in I_n.
Proof of Theorem 1 [Cybenko 89’]
Hahn-Banach Theorem
Suppose
1. R is a subspace of a real linear space X.
2. p: X → ℝ is sublinear, i.e., p(x + y) ≤ p(x) + p(y) and p(tx) = t·p(x) for x, y ∈ X and t ≥ 0.
3. f: R → ℝ is linear and f(x) ≤ p(x) on R.
Then there exists a linear functional L: X → ℝ such that
L(x) = f(x) for x ∈ R, and −p(−x) ≤ L(x) ≤ p(x) for x ∈ X.
Theorem 3 [Cybenko 89’]
Let σ be a continuous sigmoidal function, and let f be any decision function of a finite measurable partition of I_n. For any ε > 0, there is a finite sum of the form
G(X) = \sum_{j=1}^{N} \alpha_j \, \sigma(Y_j^T X + \theta_j)
and a set D ⊂ I_n with measure m(D) ≥ 1 − ε such that
| G(X) - f(X) | < \varepsilon \quad \text{for all } X \in D.
Proof idea: use Lusin's theorem together with Theorem 2.
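To give Theorem 3 some concrete flavor, here is a small one-dimensional sketch (not from the slides): the decision function is the indicator of [0.3, 0.7], the approximant is a finite sum of steep sigmoids with random transition points and least-squares output weights, and the script reports on how large a portion of [0, 1] the error stays below ε. All specific choices (the interval, N = 200, the slope scale) are arbitrary illustrative assumptions.

# A small 1-D illustration (not from the slides) of Theorem 3's flavor:
# approximating the decision function f = indicator of [0.3, 0.7] by a finite
# sum of sigmoids, with the error typically small outside a narrow region
# around the two jump points.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -60.0, 60.0)))  # clip avoids overflow warnings

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 2001)
f = ((x >= 0.3) & (x <= 0.7)).astype(float)   # decision function on I_1

N = 200
w = rng.normal(scale=200.0, size=N)           # steep slopes resolve the jumps
c = rng.uniform(0.0, 1.0, size=N)             # random transition locations
b = -w * c

H = sigmoid(np.outer(x, w) + b)               # hidden activations, shape (2001, N)
a, *_ = np.linalg.lstsq(H, f, rcond=None)     # least-squares output weights
G = H @ a

eps = 0.1
print("fraction of grid points with |G - f| < eps:", np.mean(np.abs(G - f) < eps))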
Multilayer Barron Theorem [2017]
Why is this important?
Because there are functions that are compositions of poly(n)-Barron functions but are not poly(n)-Barron themselves:
the article shows an example of a function f: ℝⁿ → ℝ that is a composition of two poly(n)-Barron functions, but whose own Barron constant is at least cⁿ for some c > 1.
A network with multiple hidden layers of poly(n) width can therefore approximate functions that would require an exponential number of nodes with a single hidden layer.