CSE 598 TEL - Homework 1
MJT
October 2, 2016
Instructions.
• All rules on the webpage apply.
• You may work in groups of size at most two; put all the NetIDs clearly on the first page, and submit
through Gradescope.
• Homework is due Wednesday, October 5, at 11:00am; no late homework accepted.
• Please consider using the provided LaTeX file as a template (apologies for the weird indentation), or at
least something vaguely visually similar, since Gradescope tries to automatically locate beginnings and
ends of problems.
1. (Analysis I: missing step from lecture on proof by Hornik et al. (1989).)
Provide a proof of the missing step from lecture 4, restated as follows.
Let σ : R → [0, 1] be given, and suppose it is sigmoidal, meaning continuous, monotone
nondecreasing, and satisfying

lim_{z→−∞} σ(z) = 0,   lim_{z→+∞} σ(z) = 1.

Given any g ∈ H_cos = { x ↦ cos(aᵀx + b) : a ∈ R^d, b ∈ R } and any ε > 0, there exists
f ∈ span(H_σ) with

‖f − g‖_u = sup{ |f(x) − g(x)| : x ∈ [0, 1]^d } ≤ ε.
Easy mode / hint: feel free to instead prove this approximation result for the norm
‖f − g‖_1 = ∫_{[0,1]^d} |f(x) − g(x)| dx (which is less finicky and carries more intuition from lecture),
and moreover with d = 1.
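As a quick plausibility check (purely illustrative, and not a substitute for the proof), here is a small numerical sketch in the spirit of the easy-mode hint with d = 1: a staircase built from steep logistic sigmoids tracks g(x) = cos(ax + b) closely on [0, 1]. The constants a = 3, b = 1, the number of steps m, and the steepness s are all arbitrary choices for this demonstration.

import numpy as np

# one valid choice of a sigmoidal nonlinearity: the logistic function
def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

a, b = 3.0, 1.0                      # illustrative choice of g in H_cos
g = lambda x: np.cos(a * x + b)

m, s = 200, 5000.0                   # number of staircase steps; sigmoid steepness
x = np.linspace(0.0, 1.0, 10001)
ks = np.arange(1, m + 1)
jumps = g(ks / m) - g((ks - 1) / m)  # telescoping increments of g along the grid

# f lies in span(H_sigma): a constant (realizable as sigma(0*x + B) for large B)
# plus one steep sigmoid per step, centered at each subinterval's midpoint
f = g(0.0) + sum(j * sigma(s * (x - (k - 0.5) / m)) for j, k in zip(jumps, ks))

print("sup |f - g| on a fine grid:", np.abs(f - g(x)).max())

Increasing m and s drives the printed error toward 0, mirroring the uniform approximation claim.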
Solution.
2. (Analysis II: a nuisance from the neural net approximation lectures.)
Recall that the lectures on approximation of continuous functions by 2- and 3-layer networks did not
include a nonlinearity on the final output. This exercise will set the record straight: namely, prove the
following, where ‖f − g‖_1 = ∫_{[0,1]^d} |f(x) − g(x)| dx as in lecture.
Suppose a function class F is given so that for any continuous g : R^d → R and any τ > 0,
there exists f ∈ F with ‖f − g‖_1 ≤ τ. Prove that for any sigmoidal σ : R → [0, 1] (as in the
previous problem) the function class F_σ := { σ ◦ f : f ∈ F } can approximate appropriately
restricted continuous functions, meaning for any continuous g : R^d → [0, 1] and any ε > 0,
there exists f ∈ F_σ with ‖f − g‖_1 ≤ ε.
Easy mode / hint: feel free to assume σ is any combination of Lipschitz and bijective (with continuous
inverse, even), and that g : R^d → (0, 1) rather than g : R^d → [0, 1], but please state these assumptions
clearly.
Hard mode: don’t make those assumptions, but be sure to check that your proof doesn’t accidentally
rely on them.
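To build intuition for the easy mode, here is an illustrative numerical sketch (the logistic σ with Lipschitz constant 1/4 and the particular g below are assumptions of this sketch, not part of the problem): if σ is L-Lipschitz and f approximates σ⁻¹ ◦ g in L1, then σ ◦ f approximates g with L1 error inflated by at most a factor of L.

import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic; Lipschitz with L = 1/4
sigma_inv = lambda y: np.log(y / (1.0 - y))  # its inverse on (0, 1)

x = np.linspace(0.0, 1.0, 100001)
g = 0.5 + 0.4 * np.sin(6.0 * x)              # a continuous g mapping into (0, 1)

rng = np.random.default_rng(0)
noise = 0.05 * rng.standard_normal(x.size)   # stand-in for the approximation error of f
f = sigma_inv(g) + noise                     # f plays the role of an L1-approximant

l1 = lambda h: np.abs(h).mean()              # grid proxy for the L1 norm on [0, 1]
print("||f - sigma^{-1}(g)||_1 ~", l1(noise))
print("||sigma(f) - g||_1      ~", l1(sigma(f) - g))  # at most ~1/4 of the above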
Solution.
3. (A negative result for single node neural networks.)
Suppose σ : R → [0, 1] is sigmoidal as in the previous question. This question will develop a negative
result on the representation power of single-layer networks (in particular, networks with exactly one node).
This result makes sense from the perspective of the result presented in class due to Minsky and Papert
(1969); they, however, had some motivation in vision tasks, whereas here the task will be even simpler,
namely univariate.
With this in mind, construct an appropriate continuous function g : R → [0, 1] and use it to (constructively) prove the following:
There exists a continuous function g : R → [0, 1] and a real c > 0 so that

inf_{f ∈ H_σ} ‖f − g‖_1 ≥ c,

where H_σ := { x ↦ σ(ax + b) : a ∈ R, b ∈ R }.
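A brute-force sanity check (illustration only, with an assumed tent-shaped g; it suggests but does not prove the claim): grid-searching over (a, b) never drives the L1 error of a single sigmoid node to zero, since a monotone function cannot follow a rise-then-fall bump.

import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.linspace(0.0, 1.0, 2001)
g = np.maximum(0.0, 1.0 - np.abs(4.0 * x - 2.0))  # tent: rises to 1 at x = 1/2, falls back

best = np.inf
for a in np.linspace(-200.0, 200.0, 201):
    for b in np.linspace(-200.0, 200.0, 201):
        err = np.abs(sigma(a * x + b) - g).mean()  # grid proxy for the L1 error on [0, 1]
        best = min(best, err)
print("smallest L1 error found over the (a, b) grid:", best)  # stays bounded away from 0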
Solution.
4. (Polynomial approximation (Weierstrass, 1885).)
The goal of this problem is to prove the following version of the Weierstrass Approximation Theorem:
For every continuous g : R^d → R and scalar ε > 0, there exists a polynomial f : R^d → R so
that

‖f − g‖_u = sup{ |f(x) − g(x)| : x ∈ [0, 1]^d } ≤ ε.
The proof will proceed in a few steps. First recall (and don’t bother to prove) the following analysis
fact from lecture.
Lemma 1. Given continuous g : R^d → R and scalar ε > 0, there exists δ > 0 so that every x, x′ ∈ [0, 1]^d
with ‖x − x′‖_∞ ≤ δ satisfies |g(x) − g(x′)| ≤ ε/2, and an M < ∞ so that sup{ |g(x)| : x ∈ [0, 1]^d } ≤ M.
In the rest of the proof, fix a continuous g as in the problem statement, and let δ > 0 and M < ∞ be given
according to the preceding lemma. The steps of the proof are as follows.
(a) Let (X_1, . . . , X_d) denote independent random variables, where X_i has binomial distribution B(n, x_i)
corresponding to n flips of an x_i-biased coin. Prove that there exists n such that n > 1/δ^3 and

Pr[ ∃i ∈ {1, . . . , d} : |X_i − nx_i| > n^{2/3} ] < ε/(4M).

(See below^1 for a hint.)
(b) Prove

∑_{i_1=0}^{n} ∑_{i_2=0}^{n} · · · ∑_{i_d=0}^{n} ∏_{j=1}^{d} \binom{n}{i_j} x_j^{i_j} (1 − x_j)^{n−i_j} = 1.
(c) Recalling that e_i denotes the i-th standard basis vector, define the polynomial

f(x) := ∑_{i_1=0}^{n} ∑_{i_2=0}^{n} · · · ∑_{i_d=0}^{n} g( ∑_{j=1}^{d} i_j e_j / n ) ∏_{j=1}^{d} \binom{n}{i_j} x_j^{i_j} (1 − x_j)^{n−i_j},

whose form conveniently relates to the B(n, x_i) above. With this and the other parts of this
question statement in mind, prove ‖f − g‖_u ≤ ε. (See below^2 for a hint.)
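The following is a numerical illustration (not part of the required proof) of the construction in part (c), specialized to d = 1, where f is the classical Bernstein polynomial of g. It assumes scipy is available and uses its binomial pmf for the weights \binom{n}{i} x^i (1 − x)^{n−i}; the particular g is an arbitrary continuous choice.

import numpy as np
from scipy.stats import binom

g = lambda x: np.abs(x - 0.4) + 0.1 * np.sin(10.0 * x)  # any continuous g works here
x = np.linspace(0.0, 1.0, 1001)

for n in (10, 100, 1000):
    i = np.arange(n + 1)
    # binom.pmf(i, n, p) evaluates C(n, i) p^i (1 - p)^(n - i)
    f = np.array([np.dot(g(i / n), binom.pmf(i, n, p)) for p in x])
    print(f"n = {n:4d}   sup|f - g| ~ {np.abs(f - g(x)).max():.4f}")

The printed sup-norm error shrinks as n grows, as the theorem predicts.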
Solution.
^1 Hint: what are some useful general-purpose probability inequalities?
^2 Hint: split the sum defining f into two parts, one part being handled by Lemma 1, and the other by the first part of
the question. . . and surely the middle part of the question is useful too. . .
5. (Convexity calisthenics.)
(a) Given any p ∈ [1, ∞], define f_p : R^d → R as

f_p(v) := { (1/p)‖v‖_p^p = (1/p) ∑_{i=1}^{d} |v_i|^p   when p ∈ [1, ∞),
          { ι_{[−1,+1]^d}(v)                            when p = ∞.
Prove for any conjugate exponents p, q ≥ 1 (meaning 1/p + 1/q = 1 with the convention 1/∞ = 0) that
f_p^* = f_q.
(Note, we're proving Hölder's inequality, so don't use it as a step of this proof.)
(b) Prove f_p is convex for every p ∈ [1, ∞].
(c) Under the notation of (and making use of) the previous part, show for any u, v ∈ R^d with ‖u‖_p =
1 = ‖v‖_q that |⟨u, v⟩| ≤ 1.
(d) Now use the previous part to prove the full version of Hölder's inequality (which implies the
Cauchy–Schwarz inequality): given any u, v ∈ R^d and any conjugate exponents p, q ≥ 1,

|⟨u, v⟩| ≤ ‖u‖_p ‖v‖_q.
(e) Optional: prove (‖ · ‖_p)^* = ‖ · ‖_q, where p, q ≥ 1 are conjugate exponents.
(f) Define f(x) := xᵀQx/2, where Q ∈ R^{d×d} is a symmetric positive definite matrix. Derive and
rigorously prove an explicit form for f^*. (Hint: try to guess the answer first.)
(g) Define f(x) := xᵀQx/2 once again, but now Q ∈ R^{d×d} is merely symmetric positive semi-definite.
Now derive and rigorously prove a new expression for f^*. (Hint: there are many ways here, but
again try to guess the answer, and in times of great need never forget your friend S-V-D.)
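As a numerical companion to part (a) (a brute-force check in dimension d = 1, illustrative only; the exponent p = 3 and the grid are arbitrary choices): the conjugate f_p^*(s) = sup_v ( sv − |v|^p / p ) computed on a grid should match |s|^q / q.

import numpy as np

p = 3.0
q = p / (p - 1.0)                        # conjugate exponent: 1/p + 1/q = 1
v = np.linspace(-50.0, 50.0, 2_000_001)  # crude grid for the supremum

for s in (-2.0, -0.5, 0.0, 1.0, 3.0):
    conj = np.max(s * v - np.abs(v) ** p / p)  # f_p^*(s) via grid search
    print(f"s = {s:5.1f}   grid sup = {conj:9.6f}   |s|^q / q = {abs(s) ** q / q:9.6f}")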
Solution.
6. (Max of random variables; moment generating functions.)
An important object in the study of random variables is the moment generating function (MGF),
MX (t), defined as MX (t) := E(exp(tX)). (MX will in general fail to be finite for all t ≥ 0, but in this
question it is finite for all t ≥ 0.)
Given a family (X_1, . . . , X_d) of i.i.d. random variables drawn according to some distribution, this
question will investigate the behavior of the random variable Z := ‖(X_1, . . . , X_d)‖_∞ = max_i |X_i|.
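As a warm-up example (a standard fact, and not part of the question): if X is a Rademacher sign, meaning Pr(X = +1) = Pr(X = −1) = 1/2, then

M_X(t) = (e^t + e^{−t})/2 = cosh(t) ≤ exp(t^2/2),

which is precisely the subgaussian MGF bound appearing in part (d), with σ = 1.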
(a) Prove the following inequality, which will be convenient in the remainder of the question: for any
t > 0,

E(Z) ≤ (1/t) ln( d E[ exp(tX_1) + exp(−tX_1) ] ).

Note / hint: consider waiting for the "convexity bootcamp" lectures.
Hint: start your proof from the rearranged left-hand side exp(tE(Z)).
(b) Optional: Suppose X_1 is distributed according to a Gumbel distribution with scale parameter σ,
whereby E(exp(sX_1)) = Γ(1 − sσ) for all s < 1/σ, where Γ denotes the gamma function. Prove that

E(Z) ≤ 2σ ln(2d√π).

(See below^3 for a hint.)
(c) Prove that the Gaussian distribution is subgaussian: in particular, if X_1 is Gaussian with mean 0 and
variance σ^2, then E(exp(tX_1)) = exp(t^2 σ^2 / 2).
(d) Prove that if X_1 is subgaussian with parameter σ (meaning E(exp(tX_1)) ≤ exp(t^2 σ^2 / 2)), then

E(Z) ≤ σ √(2 ln(2d)).

(Together with the preceding part, this implies the bound for X_1 a Gaussian with mean 0 and
variance σ^2.)
(e) Was it necessary to assume (X1 , . . . , Xd ) were i.i.d.? Answer this question however you like.
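A Monte-Carlo sanity check of part (d) for Gaussians (illustration only; the values of σ, d, and the number of trials are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
sig, d, trials = 2.0, 1000, 5000

# Z = max_i |X_i| for (X_1, ..., X_d) i.i.d. N(0, sig^2), sampled many times
Z = np.abs(rng.normal(0.0, sig, size=(trials, d))).max(axis=1)

print("empirical E(Z)              ~", Z.mean())
print("bound sig * sqrt(2 ln(2d))  =", sig * np.sqrt(2.0 * np.log(2.0 * d)))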
Solution.
^3 The inequality from the first part holds for all t. . . can you find a particularly nice choice of t?
References
K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators.
Neural Networks, 2(5):359–366, July 1989.
Marvin Minsky and Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press,
1969.
Karl Weierstrass. Über die analytische Darstellbarkeit sogenannter willkürlicher Functionen einer reellen
Veränderlichen. Sitzungsberichte der Akademie zu Berlin, pages 633–639, 789–805, 1885.