Energy Propagation in Deep Convolutional Neural Networks

Thomas Wiatowski¹, Philipp Grohs², and Helmut Bölcskei¹

¹ Dept. IT & EE, ETH Zurich, Switzerland
² Dept. Math., University of Vienna, Austria
[email protected], [email protected], [email protected]

April 13, 2017

arXiv:1704.03636v1 [cs.IT] 12 Apr 2017
Abstract
Many practical machine learning tasks employ very deep convolutional neural networks. Such large
depths pose formidable computational challenges in training and operating the network. It is therefore
important to understand how many layers are actually needed to have most of the input signal’s features
be contained in the feature vector generated by the network. This question can be formalized by asking
how quickly the energy contained in the feature maps decays across layers. In addition, it is desirable
that none of the input signal’s features be “lost” in the feature extraction network or, more formally, we
want energy conservation in the sense of the energy contained in the feature vector being proportional
to that of the corresponding input signal. This paper establishes conditions for energy conservation
for a wide class of deep convolutional neural networks and characterizes corresponding feature map
energy decay rates. Specifically, we consider general scattering networks, and find that under mild
analyticity and high-pass conditions on the filters (which encompass, inter alia, various constructions of
Weyl-Heisenberg filters, wavelets, ridgelets, (α)-curvelets, and shearlets) the feature map energy decays
at least polynomially fast. For broad families of wavelets and Weyl-Heisenberg filters, the guaranteed
decay rate is shown to be exponential. Our results yield handy estimates of the number of layers needed
to have at least ((1 − ε) · 100)% of the input signal energy be contained in the feature vector.
Keywords: Machine learning, deep convolutional neural networks, scattering networks, energy decay
and conservation, frame theory.
This paper will appear in part at the 2017 IEEE International Symposium on Information Theory (ISIT) [1].
I. INTRODUCTION
Feature extraction based on deep convolutional neural networks (DCNNs) has been applied with
significant success in a wide range of practical machine learning tasks [2]–[7]. Many of these applications,
such as, e.g., the classification of images in the ImageNet data set, employ very deep networks with
potentially hundreds of layers [8]. Such network depths entail formidable computational challenges in the
training phase due to the large number of parameters to be learned, and in operating the network due to
the large number of convolutions that need to be carried out. It is therefore paramount to understand how
many layers are needed to have most of the input signal’s features be contained in the feature vector. This
can be formalized by asking how quickly the energy contained in the signals generated in the individual
network layers, a.k.a. feature maps, decays across layers. In addition, it is important in practice that the
feature maps be informative in the sense of none of the input signal’s features being “lost” in the feature
extraction network. This aspect can be formalized by asking for energy conservation in the sense of the
energy in the overall extracted feature vector—obtained by aggregating filtered versions of the feature
maps—being proportional to that of the corresponding input signal. Energy conservation then implies a
trivial kernel for the feature extractor, i.e., the only signal that is mapped to the all-zeros feature vector
is the zero signal.
Scattering networks as introduced in [9] and extended in [10] constitute an important class of feature
extractors based on convolutional transforms with pre-specified or learned filters in each network layer
(e.g., wavelets [9], [11], uniform covering filters [12], or general filters [10]), followed by a non-linearity
(e.g., the modulus [9], [11], [12], or a general Lipschitz non-linearity [10]), and a pooling operation (e.g.,
sub-sampling or average-pooling [10]). Scattering network-based feature extractors were shown to yield
classification performance competitive with the state-of-the-art on various data sets [13]–[18]. Moreover,
a mathematical theory exists, which makes it possible to establish formally that such feature extractors are—under
certain technical conditions—horizontally [9] or vertically [10] translation-invariant, deformation-stable
in the sense of [9], or exhibit limited sensitivity to deformations in the sense of [10] on input signal
classes such as band-limited functions [10], [19], cartoon functions [20], and Lipschitz functions [20].
It was shown recently that the energy in the feature maps generated by scattering networks employing,
in every network layer, the same set of (certain) Parseval wavelets [11, Section 5] or “uniform covering”
[12] frame filters satisfying an analyticity and vanishing moments condition, the modulus non-linearity,
and no pooling, decays at least exponentially fast and “strict” energy conservation for the feature vector
holds. Specifically, the feature map energy decay was shown to be at least of order O(a^{−N}), for some
unspecified a > 1, where N denotes the network depth. Here we note that d-dimensional uniform
covering filters as introduced in [12] are a family of functions whose Fourier transforms' support sets
can be covered by a union of finitely many balls. This covering condition is satisfied by, e.g., Weyl-Heisenberg filters [21] with a band-limited generating function, but fails to hold for multi-scale filters
such as wavelets [22], [23], (α)-curvelets [24]–[26], shearlets [27], [28], or ridgelets [29]–[31], see [12,
Remark 2.2 (b)].
Contributions. The first main contribution of this paper is a characterization of the feature map
energy decay rate in DCNNs employing the modulus non-linearity, no pooling, and general filters that
constitute a frame [22], [32]–[34], but not necessarily a Parseval frame, and are allowed to be different
in different network layers. We find that, under mild analyticity and high-pass conditions on the filters,
the energy decay rate is at least polynomial in the network depth, i.e., the decay is at least of order
O(N^{−α}), and we explicitly specify the decay exponent α > 0. This result encompasses, inter alia, various
constructions of Weyl-Heisenberg filters, wavelets, ridgelets, (α)-curvelets, shearlets, and learned filters
(of course as long as the learning algorithm imposes the analyticity and high-pass conditions that we
require). For broad families of wavelets and Weyl-Heisenberg filters, the guaranteed energy decay rate
is shown to be exponential in the network depth, i.e., the decay is at least of order O(a^{−N}) with the
decay factor given explicitly as a = 5/3 in the wavelet case and a = 3/2 in the Weyl-Heisenberg case. We
hasten to add, however, that our results constitute guaranteed decay rates and do not preclude the energy
from decaying faster in practice. For input signals that belong to the class of Sobolev functions1 , our
energy decay results are shown to yield handy estimates of the number of layers needed to have at least
((1 − ε) · 100)% of the input signal energy be contained in the feature vector. For example, in the case
of exponential energy decay with a = 5/3 and for band-limited input signals, only 8 layers are needed to
absorb 95% of the input signal’s energy.
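These depth estimates are simple to reproduce. Assuming the feature-map energy is bounded as W_N(f) ≤ ‖f‖₂² · a^{−κN}, where κ = (2s+β)/(2s+β+1) ∈ (0, 1] is the effective exponent appearing in our exponential decay results (the symbol κ, the default values below, and the unit constant in front of a^{−κN} are our simplification for illustration, not the paper's notation), the required depth follows from solving a^{−κN} ≤ ε:

```python
import math

def layers_needed(a, eps, kappa=1.0):
    """Smallest depth N with a**(-kappa * N) <= eps, i.e. at least
    (1 - eps)*100% of the input energy captured after N layers,
    under the assumed bound W_N(f) <= ||f||^2 * a**(-kappa * N)."""
    return math.ceil(math.log(1.0 / eps) / (kappa * math.log(a)))

# Wavelet case, a = 5/3, 95% of the input energy (eps = 0.05):
print(layers_needed(5 / 3, 0.05))               # kappa = 1 gives N = 6
print(layers_needed(5 / 3, 0.05, kappa=0.75))   # kappa = 0.75 gives N = 8
# Weyl-Heisenberg case, a = 3/2:
print(layers_needed(3 / 2, 0.05))               # gives N = 8
```

With an effective exponent κ somewhat below one, as induced by the Sobolev term (2s+β)/(2s+β+1), this is consistent with the 8 layers quoted above for a = 5/3.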
Our second main contribution is a general energy conservation result in the sense of energy proportionality between the feature vector and the underlying input signal with the proportionality constant
lower- and upper-bounded by the frame bounds of the filters employed in the different layers. This
energy conservation result implies that the feature extractor exhibits a trivial kernel, i.e., the only signal
that maps to the all-zeros feature vector is the zero signal. We emphasize that, throughout, our energy
decay results pertain to the feature maps, whereas the energy conservation statements apply to the feature
vector, obtained by aggregating filtered versions of the feature maps.

Footnote 1: The class of Sobolev functions contains a wide range of practically relevant signal classes such as square-integrable functions, band-limited functions, and—as established in the present paper—cartoon functions [35]. We note that cartoon functions are widely used in the mathematical signal processing literature [16], [20], [26], [36], [37] as a model for natural images such as, e.g., images of handwritten digits [38].
Notation. The complex conjugate of z ∈ C is denoted by z̄. We write Re(z) for the real, and Im(z)
for the imaginary part of z ∈ C. The Euclidean inner product of x, y ∈ C^d is ⟨x, y⟩ := Σ_{i=1}^d x_i ȳ_i, with
associated norm |x| := √⟨x, x⟩. For x ∈ R, we set (x)₊ := max{0, x} and ⟨x⟩ := (1 + |x|²)^{1/2}. We
denote the open ball of radius r > 0 centered at x ∈ R^d by B_r(x) ⊆ R^d. The first canonical orthant
is H := {x ∈ R^d | x_k ≥ 0, k = 1, ..., d}, and we define the rotated orthant H_A := {Ax | x ∈ H} for
A ∈ O(d), where O(d) stands for the orthogonal group of dimension d ∈ N. The Minkowski sum of sets
A, B ⊆ R^d is (A + B) := {a + b | a ∈ A, b ∈ B}, and A∆B := (A\B) ∪ (B\A) denotes their symmetric
difference. A Lipschitz domain D is a set D ⊆ R^d whose boundary ∂D is "sufficiently regular" to be
thought of locally as the graph of a Lipschitz-continuous function; for a formal definition we refer to [39,
Definition 1.40]. A multi-index α = (α_1, ..., α_d) ∈ N_0^d is an ordered d-tuple of non-negative integers
α_i ∈ N_0.

For functions W : N → R and G : N → R, we say that W(N) = O(G(N)) if there exist C > 0 and
N_0 ∈ N such that W(N) ≤ C·G(N), for all N ≥ N_0. The support supp(f) of a function f : R^d → C
is the closure of the set {x ∈ R^d | f(x) ≠ 0} in the topology induced by the Euclidean norm |·|.
For a Lebesgue-measurable function f : R^d → C, we write ∫_{R^d} f(x) dx for its integral w.r.t. Lebesgue
measure. The indicator function of a set B ⊆ R^d is defined as 1_B(x) = 1, for x ∈ B, and 1_B(x) = 0,
for x ∈ R^d\B. For a measurable set B ⊆ R^d, we let vol^d(B) := ∫_{R^d} 1_B(x) dx = ∫_B 1 dx, and we write
∂B for its boundary. L^p(R^d), with p ∈ [1, ∞), stands for the space of Lebesgue-measurable functions
f : R^d → C satisfying ‖f‖_p := (∫_{R^d} |f(x)|^p dx)^{1/p} < ∞. L^∞(R^d) denotes the space of Lebesgue-measurable functions f : R^d → C such that ‖f‖_∞ := inf{α > 0 | |f(x)| ≤ α for a.e.² x ∈ R^d} < ∞.
For a countable set Q, (L²(R^d))^Q stands for the space of sets S := {f_q}_{q∈Q}, with f_q ∈ L²(R^d) for all
q ∈ Q, satisfying |||S||| := (Σ_{q∈Q} ‖f_q‖₂²)^{1/2} < ∞. We denote the Fourier transform of f ∈ L¹(R^d) by
f̂(ω) := ∫_{R^d} f(x) e^{−2πi⟨x,ω⟩} dx and extend it in the usual way to L²(R^d) [40, Theorem 7.9].
Id : L^p(R^d) → L^p(R^d) stands for the identity operator on L^p(R^d). The convolution of f ∈ L²(R^d)
and g ∈ L¹(R^d) is (f ∗ g)(y) := ∫_{R^d} f(x) g(y − x) dx. We write (T_t f)(x) := f(x − t), t ∈ R^d, for
the translation operator, and (M_ω f)(x) := e^{2πi⟨x,ω⟩} f(x), ω ∈ R^d, for the modulation operator. We set
⟨f, g⟩ := ∫_{R^d} f(x) g̅(x) dx, for f, g ∈ L²(R^d).

H^s(R^d), with s ≥ 0, stands for the Sobolev space of functions f ∈ L²(R^d) satisfying
∫_{R^d} (1 + |ω|²)^s |f̂(ω)|² dω < ∞, see [41, Section 6.2.1]. Here, the index s reflects the degree of smoothness of
f ∈ H^s(R^d), i.e., larger s entails smoother f. For a multi-index α ∈ N_0^d, D^α denotes the differential
operator D^α := (∂/∂x_1)^{α_1} ··· (∂/∂x_d)^{α_d}, with order |α| := Σ_{i=1}^d α_i. The space of functions f : R^d →
C whose derivatives D^α f of order at most k ∈ N_0 are continuous is designated by C^k(R^d, C). Moreover,
we denote the gradient of a function f : R^d → C as ∇f.

Footnote 2: Throughout the paper "a.e." is w.r.t. Lebesgue measure.
II. DCNN-BASED FEATURE EXTRACTORS
Throughout the paper we employ the terminology of [10], consider (unless explicitly stated otherwise)
input signals f ∈ L²(R^d), and employ the module-sequence

    Ω := ((Ψ_n, |·|, Id))_{n∈N},    (1)

i.e., each network layer is associated with (i) a collection of filters Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n∈Λ_n} ⊆
L¹(R^d) ∩ L²(R^d), where χ_n, referred to as the output-generating filter, and the g_{λ_n}, indexed by a countable
set Λ_n, satisfy the frame condition [22], [32], [34]

    A_n‖f‖₂² ≤ ‖f ∗ χ_n‖₂² + Σ_{λ_n∈Λ_n} ‖f ∗ g_{λ_n}‖₂² ≤ B_n‖f‖₂²,    ∀ f ∈ L²(R^d),    (2)

for some A_n, B_n > 0, (ii) the modulus non-linearity |·| : L²(R^d) → L²(R^d), |f|(x) := |f(x)|, and (iii)
no pooling, which, in the terminology of [10], corresponds to pooling through the identity operator with
pooling factor equal to one. Associated with the module (Ψ_n, |·|, Id), the operator U_n[λ_n] defined in
[10, Eq. 12] particularizes to

    U_n[λ_n]f = |f ∗ g_{λ_n}|.    (3)

We extend (3) to paths on index sets q = (λ_1, λ_2, ..., λ_n) ∈ Λ_1 × Λ_2 × ··· × Λ_n =: Λⁿ, n ∈ N, according
to

    U[q]f = U[(λ_1, λ_2, ..., λ_n)]f := U_n[λ_n] ··· U_2[λ_2] U_1[λ_1]f,    (4)

where, for the empty path e := ∅, we set Λ⁰ := {e} and U[e]f := f, for f ∈ L²(R^d). The signals
U[q]f, q ∈ Λⁿ, associated with the n-th network layer, are often referred to as feature maps in the deep
learning literature. The feature vector Φ_Ω(f) is obtained by aggregating filtered versions of the feature
maps. More formally, Φ_Ω(f) is defined as [10, Def. 3]

    Φ_Ω(f) := ⋃_{n=0}^∞ Φ_Ω^n(f),    (5)
Fig. 1: Network architecture underlying the feature extractor (5). The index λ_n^{(k)} corresponds to the k-th filter g_{λ_n^{(k)}} of the collection Ψ_n associated with the n-th network layer. The function χ_{n+1} is the output-generating filter of the n-th network layer. The root of the network corresponds to n = 0.
where Φ_Ω^n(f) := {(U[q]f) ∗ χ_{n+1}}_{q∈Λⁿ} are the features generated in the n-th network layer, see Figure
1. Here, n = 0 corresponds to the root of the network. The function χ_{n+1} is the output-generating filter
of the n-th network layer. The feature extractor

    Φ_Ω : L²(R^d) → (L²(R^d))^{⋃_{n=0}^∞ Λⁿ}

was shown in [10, Theorem 1] to be vertically translation-invariant, provided that pooling is
employed, with pooling factors S_n ≥ 1, n ∈ N, (see [10, Eq. 6] for the definition of the general pooling
operator) such that lim_{N→∞} ∏_{n=1}^N S_n = ∞. Moreover, Φ_Ω exhibits limited sensitivity to certain non-linear
deformations on (input) signal classes such as band-limited functions [10, Theorem 2], cartoon functions
[20, Theorem 1], and Lipschitz functions [20, Corollary 1].
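To make the constructions (1)–(5) concrete, the following sketch implements a tiny discrete analogue of the feature extractor: ideal band-pass indicator filters on the DFT grid whose squared magnitudes sum to one (a discrete Parseval analogue of the frame condition (2) with A_n = B_n = 1), the modulus non-linearity, and no pooling. The discrete periodic setting and all names below are our illustration, not the paper's; in this Parseval setting the output energies of all layers plus the energy left in the N-th-layer feature maps add up exactly to ‖f‖₂².

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
freqs = np.abs(np.fft.fftfreq(n, d=1.0 / n))   # |DFT frequency|, 0..128

# Ideal filters on the DFT grid: chi_hat is low-pass, each g_hat an
# annulus; squared magnitudes sum to 1 (Parseval analogue of (2)).
edges = [4, 16, 64, 129]
chi_hat = (freqs < edges[0]).astype(float)
g_hats = [((freqs >= lo) & (freqs < hi)).astype(float)
          for lo, hi in zip(edges[:-1], edges[1:])]
assert np.allclose(chi_hat**2 + sum(g**2 for g in g_hats), 1.0)

def conv(u, h_hat):                    # circular convolution via the DFT
    return np.fft.ifft(np.fft.fft(u) * h_hat)

def energy(u):
    return float(np.sum(np.abs(u) ** 2))

f = rng.standard_normal(n)
N = 3
maps, out_energy = [f], 0.0            # layer 0: the empty path, U[e]f = f
for _ in range(N):
    out_energy += sum(energy(conv(u, chi_hat)) for u in maps)  # Phi^n(f)
    maps = [np.abs(conv(u, g)) for u in maps for g in g_hats]  # U_n[.]f
W_N = sum(energy(u) for u in maps)     # energy left in layer-N feature maps

# Discrete analogue of the energy decomposition with A = B = 1:
assert np.isclose(out_energy + W_N, energy(f))
```

Because the modulus preserves the L² norm and each Parseval layer redistributes energy without loss, the final assertion holds exactly (up to floating-point error), regardless of the input signal.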
III. ENERGY DECAY AND ENERGY CONSERVATION
The first central goal of this paper is to understand how quickly the energy contained in the feature
maps decays across layers. Specifically, we shall study the decay of

    W_N(f) := Σ_{q∈Λ^N} ‖U[q]f‖₂²,    f ∈ L²(R^d),    (6)
across layers. Moreover, to ensure that none of the features (of the input signal f ) are “lost” in the
feature extraction network, one can ask for bounds of the form
    A_Ω‖f‖₂² ≤ |||Φ_Ω(f)|||²,    ∀ f ∈ L²(R^d),    (7)

for some constant A_Ω > 0 (possibly depending on the module-sequence Ω) and with the feature space
norm |||Φ_Ω(f)|||² := Σ_{n=0}^∞ |||Φ_Ω^n(f)|||², where |||Φ_Ω^n(f)|||² := Σ_{q∈Λⁿ} ‖(U[q]f) ∗ χ_{n+1}‖₂². The existence
of such a lower bound implies a trivial kernel for the feature extractor ΦΩ and thereby ensures that the
only signal f that is mapped to the all-zeros feature vector is f = 0. As Φ_Ω is a non-linear operator
(owing to the modulus non-linearities), characterizing its kernel is non-trivial in general. An upper bound
corresponding to (7), namely |||Φ_Ω(f)|||² ≤ B_Ω‖f‖₂², for all f ∈ L²(R^d), with B_Ω > 0, was established
in [10, Appendix E]. Perhaps surprisingly, while it is shown in [10, Appendix E] that the existence of
an upper bound B_Ω > 0 is implied by the filters Ψ_n, n ∈ N, satisfying the frame property (2), this is
not enough to guarantee A_Ω > 0 (see Appendix A for an illustrative example). Establishing conditions
for energy conservation in the sense of

    A_Ω‖f‖₂² ≤ |||Φ_Ω(f)|||² ≤ B_Ω‖f‖₂²,    ∀ f ∈ L²(R^d),    (8)

constitutes the second central goal of this paper. Note that it is the lower bound in (8) that is of concern;
the upper bound follows directly from the results in [10, Appendix E]. Finally, we emphasize that
throughout the paper energy decay results pertain to the feature maps U[q]f, whereas energy conservation
according to (8) applies to the feature vector Φ_Ω(f).
Previous work on the decay rate of W_N(f) in [11, Section 5] shows that for wavelet-based networks
(i.e., in every network layer, the filters Ψ = {χ} ∪ {g_λ}_{λ∈Λ} in (1) are taken to be (specific) 1-D wavelets
that constitute a Parseval frame, where χ is a low-pass filter) there exist ε > 0 and a > 1 (both constants
unspecified) such that (6) satisfies

    W_N(f) ≤ ∫_R |f̂(ω)|² (1 − r̂_g(ω/(ε a^{N−1})))² dω,    (9)

for real-valued 1-D signals f ∈ L²(R) and N ≥ 2, where r̂_g(ω) := e^{−ω²}. To see that this result
indicates energy decay, Figure 2 illustrates the influence of network depth N on the upper bound in (9).
Specifically, we can see that increasing the network depth results in cutting out increasing amounts of
low-frequency energy of f and thereby making the upper bound in (9) decay as a function of N.
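The behavior of the upper bound (9) is easy to reproduce numerically. The sketch below evaluates the right-hand side of (9), in the reconstructed form above, on a grid for a Gaussian |f̂|², with the example values ε = 1 and a = 2 chosen by us (recall that [11] leaves these constants unspecified), and confirms that the bound decreases monotonically in the depth N:

```python
import numpy as np

omega = np.linspace(-20.0, 20.0, 4001)
d_omega = omega[1] - omega[0]
f_hat_sq = np.exp(-omega**2)            # |f_hat(omega)|^2, Gaussian example
r_hat_g = lambda w: np.exp(-w**2)       # r_g from (9)

def rhs_of_9(N, a=2.0, eps=1.0):
    """Right-hand side of (9), evaluated by a simple Riemann sum."""
    h = (1.0 - r_hat_g(omega / (eps * a ** (N - 1)))) ** 2
    return float(np.sum(f_hat_sq * h) * d_omega)

bounds = [rhs_of_9(N) for N in range(2, 9)]
assert all(b1 > b2 for b1, b2 in zip(bounds, bounds[1:]))  # decays in N
```

As N grows, the multiplier in the integrand flattens to zero over any fixed frequency band, so the bound removes ever more of the low-frequency energy of f, exactly as Figure 2 illustrates.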
For scattering networks that employ, in every network layer, uniform covering filters Ψ = {χ} ∪
Fig. 2: Illustration of the impact of network depth N on the upper bound on W_N(f) in (9), for ε = 1. The function h_N(ω) := (1 − r̂_g(ω/(ε a^N)))², where r̂_g(ω) = e^{−ω²}, is of increasing high-pass nature as N increases, which results in cutting out increasing amounts of low-frequency energy of f and thereby making the upper bound in (9) decay in N.
{g_λ}_{λ∈Λ} ⊆ L¹(R^d) ∩ L²(R^d) forming a Parseval frame (where χ, again, is a low-pass filter), exponential
energy decay according to

    W_N(f) = O(a^{−N}),    ∀ f ∈ L²(R^d),    (10)

for an unspecified a > 1, was established in [12, Proposition 3.3]. Moreover, [11, Section 5] and [12,
Theorem 3.6 (a)] state—for the respective module-sequences—that (8) holds with A_Ω = B_Ω = 1 and
hence

    |||Φ_Ω(f)|||² = ‖f‖₂².    (11)
The first main goal of the present paper is to establish i) for d-dimensional complex-valued input
signals that (6) decays polynomially according to

    W_N(f) ≤ B_Ω^N ∫_{R^d} |f̂(ω)|² (1 − r̂_l(ω/N^α))² dω,    ∀ f ∈ L²(R^d), ∀ N ≥ 1,    (12)

where B_Ω^N := ∏_{k=1}^N max{1, B_k},

    α := 1, for d = 1,    and    α := log₂(√(d/(d − 1/2))), for d ≥ 2,

and r̂_l : R^d → R, r̂_l(ω) = (1 − |ω|)₊^l, with l > ⌊d/2⌋ + 1, for networks
based on general filters {χ_n} ∪ {g_{λ_n}}_{λ_n∈Λ_n} that satisfy mild analyticity and high-pass conditions and
are allowed to be different in different network layers (with the proviso that χ_n, n ∈ N, is of low-pass
nature in a sense to be made precise), and ii) for 1-D complex-valued input signals that (6) decays
exponentially according to

    W_N(f) ≤ ∫_R |f̂(ω)|² (1 − r̂_l(ω/a^{N−1}))² dω,    ∀ f ∈ L²(R), ∀ N ≥ 1,    (13)

for networks that are based, in every network layer, on a broad family of wavelets, with the decay factor
given explicitly as a = 5/3, or on a broad family of Weyl-Heisenberg filters [10, Appendix B], with decay
factor a = 3/2. Particularizing the right-hand side (RHS) in (12) and (13) to Sobolev-class input signals

    f ∈ H^s(R^d) = { f ∈ L²(R^d) | ∫_{R^d} (1 + |ω|²)^s |f̂(ω)|² dω < ∞ },    s ≥ 0,

we show that (12) yields explicit polynomial energy decay according to

    W_N(f) = O(N^{−α(2s+β)/(2s+β+1)}),    ∀ f ∈ H^s(R^d),    (14)

and (13) explicit exponential energy decay according to

    W_N(f) = O(a^{−N(2s+β)/(2s+β+1)}),    ∀ f ∈ H^s(R).    (15)
The constant β > 0 in (14) and (15) depends on the underlying Sobolev function³ f ∈ H^s(R^d). The
parameter s acts as a smoothness index. Specifically, larger s leads to smoother f and is seen to result
in faster decay of W_N(f). Sobolev spaces H^s(R^d) contain a wide range of practically relevant signal
classes such as, e.g.,

Footnote 3: Specifically, β > 0 determines the decay of f̂(ω) as |ω| → ∞ (see, e.g., [41, Sec. 6.2.1]): For every f ∈ H^s(R^d) there exist β, µ, L > 0 such that |f̂(ω)| ≤ µ(1 + |ω|²)^{−(s/2 + d/4 + β/4)}, for |ω| ≥ L, where the constant L > 0 resembles the effective bandwidth of f.
Fig. 3: An image of a handwritten digit is modeled by a 2-D cartoon function.
- the space of square-integrable functions L²(R^d) = H⁰(R^d),
- the space L²_L(R^d) := {f ∈ L²(R^d) | supp(f̂) ⊆ B_L(0)}, L ≥ 0, of L-band-limited signals
  according to L²_L(R^d) ⊆ H^s(R^d), for all L ≥ 0 and all s ≥ 0, which follows from

      ∫_{R^d} (1 + |ω|²)^s |f̂(ω)|² dω = ∫_{B_L(0)} (1 + |ω|²)^s |f̂(ω)|² dω ≤ (1 + |L|²)^s ‖f‖₂² < ∞,

  for f ∈ L²_L(R^d), L ≥ 0, and s ≥ 0, where we used Parseval's formula and the fact that ω ↦
  (1 + |ω|²)^s, ω ∈ R^d, is monotonically increasing in |ω|, for all s ≥ 0,
- the space C^K_CART of cartoon functions of size K, introduced in [35], and widely used in the
  mathematical signal processing literature [16], [20], [26], [36], [37] as a model for natural images
  such as, e.g., images of handwritten digits [38] (see Figure 3). For a formal definition of C^K_CART,
  we refer the reader to Appendix B, where we also show that C^K_CART ⊆ H^s(R^d), for all K > 0
  and all s ∈ (0, 1/2).
Moreover, Sobolev functions are contained in the space of k-times continuously differentiable functions
C^k(R^d, C) according to H^s(R^d) ⊆ C^k(R^d, C), for all s > k + d/2 [42, Section 4].
Our second central goal is to establish energy conservation according to (8) for the network configurations corresponding to the energy decay results (12) and (13). Finally, we derive handy estimates
of the number of layers needed to have at least ((1 − ε) · 100)% of the input signal energy be contained
in the feature vector.
IV. MAIN RESULTS
Throughout the paper, we make the following assumptions on the filters {gλn }λn ∈Λn .
Assumption 1. The {g_{λ_n}}_{λ_n∈Λ_n}, n ∈ N, are analytic in the following sense: For every layer index
n ∈ N, for every λ_n ∈ Λ_n, there exists an orthant H_{A_{λ_n}} ⊆ R^d, for A_{λ_n} ∈ O(d), such that

    supp(ĝ_{λ_n}) ⊆ H_{A_{λ_n}}.    (16)

Moreover, there exists δ > 0 such that

    Σ_{λ_n∈Λ_n} |ĝ_{λ_n}(ω)|² = 0,    a.e. ω ∈ B_δ(0).    (17)
In the 1-D case, i.e., for d = 1, Assumption 1 simply amounts to every filter g_{λ_n} satisfying either
supp(ĝ_{λ_n}) ⊆ (−∞, −δ] or supp(ĝ_{λ_n}) ⊆ [δ, ∞), which constitutes an "analyticity" and "high-pass"
condition. For dimensions d ≥ 2, Assumption 1 requires that every filter g_{λ_n} be of high-pass nature and
have a Fourier transform supported in a (not necessarily canonical) orthant. Since the frame condition
(2) is equivalent to the Littlewood-Paley condition [43]

    A_n ≤ |χ̂_n(ω)|² + Σ_{λ_n∈Λ_n} |ĝ_{λ_n}(ω)|² ≤ B_n,    a.e. ω ∈ R^d,    (18)

(17) implies low-pass characteristics for χ_n to fill the spectral gap B_δ(0) left by the filters {g_{λ_n}}_{λ_n∈Λ_n}.
The conditions (16) and (17) we impose on the Ψ_n, n ∈ N, are not overly restrictive as they encompass,
inter alia, various constructions of Weyl-Heisenberg filters (e.g., a 1-D B-spline as generating function
[44, Section 1]), wavelets (e.g., analytic Meyer wavelets [22, Section 3.3.5] in 1-D, and Cauchy wavelets
[45] in 2-D), and specific constructions of ridgelets [31, Section 2.2], curvelets [25, Section 4.1], α-curvelets [26, Section 3], and shearlets (e.g., cone-adapted shearlets [37, Section 4.3]). We refer the
reader to [10, Appendices B and C] for a brief review of some of these filter structures.
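Conditions (16)–(18) are straightforward to verify numerically for a candidate filter bank. The following sketch does so on a frequency grid for a toy 1-D bank of one-sided ideal band-pass filters (our construction, chosen only so that the analyticity condition (16), the spectral gap (17), and the Littlewood-Paley condition (18) with A_n = B_n = 1 hold by design):

```python
import numpy as np

omega = np.linspace(-64.0, 64.0, 8193)       # frequency grid
delta = 1.0                                  # radius of the spectral gap
edges = delta * 2.0 ** np.arange(0, 7)       # 1, 2, 4, ..., 64

# One-sided ("analytic") ideal band-pass filters: each is supported
# either in [delta, inf) or (-inf, -delta], so (16) holds for d = 1.
g_hats = []
for lo, hi in zip(edges[:-1], edges[1:]):
    g_hats.append(((omega >= lo) & (omega < hi)).astype(float))
    g_hats.append(((omega <= -lo) & (omega > -hi)).astype(float))
chi_hat = (np.abs(omega) < delta).astype(float)  # low-pass, fills the gap

g_sum = sum(g**2 for g in g_hats)

# Spectral gap (17): the g's vanish on B_delta(0).
assert np.all(g_sum[np.abs(omega) < delta] == 0)

# Littlewood-Paley condition (18) with A_n = B_n = 1 on the covered band:
lp = chi_hat**2 + g_sum
assert np.allclose(lp[np.abs(omega) < 64], 1.0)
```

Replacing the indicators by smooth windows whose squares form a partition of unity gives the same checks with A_n, B_n read off as the min/max of `lp`.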
We are now ready to state our main result on energy decay and energy conservation.
Theorem 1. Let Ω be the module-sequence (1) with filters {g_{λ_n}}_{λ_n∈Λ_n} satisfying the conditions in
Assumption 1, and let δ > 0 be the radius of the spectral gap B_δ(0) left by the filters {g_{λ_n}}_{λ_n∈Λ_n}
according to (17). Furthermore, let s ≥ 0, A_Ω^N := ∏_{k=1}^N min{1, A_k}, B_Ω^N := ∏_{k=1}^N max{1, B_k}, and

    α := 1, for d = 1,    and    α := log₂(√(d/(d − 1/2))), for d ≥ 2.    (19)
i) We have

    W_N(f) ≤ B_Ω^N ∫_{R^d} |f̂(ω)|² (1 − r̂_l(ω/(N^α δ)))² dω,    ∀ f ∈ L²(R^d), ∀ N ≥ 1,    (20)

where r̂_l : R^d → R, r̂_l(ω) := (1 − |ω|)₊^l, with l > ⌊d/2⌋ + 1.

ii) For every Sobolev function f ∈ H^s(R^d), there exists β > 0 such that

    W_N(f) = O(B_Ω^N N^{−α(2s+β)/(2s+β+1)}).    (21)

iii) If, in addition to Assumption 1,

    0 < A_Ω := lim_{N→∞} A_Ω^N ≤ B_Ω := lim_{N→∞} B_Ω^N < ∞,    (22)

then we have energy conservation according to

    A_Ω‖f‖₂² ≤ |||Φ_Ω(f)|||² ≤ B_Ω‖f‖₂²,    ∀ f ∈ L²(R^d).    (23)
Proof. For the proofs of i) and ii), we refer to Appendices C and D, respectively. The proof of statement
iii) is based on two key ingredients. First, we establish—in Proposition 1 in Appendix E—that the feature
extractor Φ_Ω satisfies the energy decomposition identity

    A_Ω^N‖f‖₂² ≤ Σ_{n=0}^{N−1} |||Φ_Ω^n(f)|||² + W_N(f) ≤ B_Ω^N‖f‖₂²,    ∀ f ∈ L²(R^d), ∀ N ≥ 1.    (24)

Second, we show—in Proposition 2 in Appendix F—that the integral on the RHS of (20) goes to zero
as N → ∞ which, thanks to lim_{N→∞} B_Ω^N = B_Ω < ∞, implies that W_N(f) → 0 as N → ∞. We note that
while the decomposition (24) holds for general filters {g_{λ_n}}_{λ_n∈Λ_n} satisfying the frame property (2), it is
the upper bound (20) that makes use of the analyticity and high-pass conditions in Assumption 1. The
final energy conservation result (23) is obtained by letting N → ∞ in (24).
The strength of the results in Theorem 1 derives from the fact that the only condition we need to
impose on the filters {g_{λ_n}}_{λ_n∈Λ_n} is Assumption 1, which, as already mentioned, is met by a wide array
of filters. Moreover, condition (22) is easily satisfied by normalizing the filters Ψ_n, n ∈ N, appropriately
(see, e.g., [10, Proposition 3]). We note that this normalization does not have an impact on the properties
needed for Assumption 1 to be satisfied.

The identity (21) establishes, upon normalization [10, Proposition 3] of the Ψ_n to get B_n ≤ 1, n ∈ N,
that the energy decay rate, i.e., the decay rate of W_N(f), is at least polynomial in N. We hasten to add
Fig. 4: Illustration for Weyl-Heisenberg filters in dimension d = 1. The {g_{λ_n}}_{λ_n∈Λ_n} are taken as perfect band-pass filters and hence trivially satisfy the conditions in Assumption 1. The modulus operation in combination with the analyticity of the filters {g_{λ_n}}_{λ_n∈Λ_n} ensures that—in every network layer—the energy contained in the high-frequency components of f is moved to low frequencies, where it is collected by the (low-pass) output-generating filter χ_n.
that (20) does not preclude the energy from decaying faster in practice. Moreover, as already mentioned,
increasing the smoothness index s leads to faster energy decay irrespective of the filters employed in
the network.
Underlying the energy conservation result (23) is a demodulation effect that emerges in every network
layer. Casually speaking, the modulus operation in combination with the analyticity and high-pass
properties of the filters {gλn }λn ∈Λn ensures that—in every network layer—the energy contained in
the high-frequency components of f is moved to low frequencies, where it is collected by the low-pass
output-generating atom χn , see Figure 4. This ensures that the kernel of the feature vector is trivial as
N → ∞ and hence none of the input signal’s features are lost in the network.
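This demodulation effect can be observed directly in a small numerical experiment (all concrete values below are our example, not the paper's): f is a Gaussian-windowed 20 Hz cosine, and ĝ an ideal one-sided band-pass filter selecting the positive-frequency band [15, 25] Hz. After the modulus, essentially all of the energy of |f ∗ g| sits near zero frequency, where a low-pass χ_n can collect it:

```python
import numpy as np

n, T = 4096, 32.0                       # samples, window length in seconds
t = (np.arange(n) - n // 2) * (T / n)
nu = np.fft.fftfreq(n, d=T / n)         # frequency grid in Hz

f = np.exp(-t**2) * np.cos(2 * np.pi * 20 * t)    # high-frequency input
g_hat = ((nu >= 15) & (nu <= 25)).astype(float)   # one-sided band-pass

u = np.abs(np.fft.ifft(np.fft.fft(f) * g_hat))    # one layer: |f * g|

spec = np.abs(np.fft.fft(u)) ** 2
low = spec[np.abs(nu) < 5].sum() / spec.sum()     # fraction below 5 Hz
assert low > 0.99                                  # energy was demodulated
```

Since ĝ keeps only the positive-frequency lobe of f, the convolution f ∗ g is (approximately) the analytic signal of f, and its modulus is the slowly varying Gaussian envelope, i.e., a low-frequency signal, mirroring Figure 4.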
The next result shows that, under additional structural assumptions on the filters {g_{λ_n}}_{λ_n∈Λ_n}, the
guaranteed energy decay rate can be improved from polynomial to exponential. Specifically, we can
get exponential energy decay for networks employing a broad family of 1-D wavelets and 1-D Weyl-Heisenberg filters. For simplicity of exposition, we employ filters that constitute Parseval frames and are
identical across network layers.
Theorem 2. Let r̂_l : R → R, r̂_l(ω) := (1 − |ω|)₊^l, with l > 1.

i) Wavelets: Let the mother and father wavelets ψ, φ ∈ L¹(R) ∩ L²(R) satisfy supp(ψ̂) ⊆ [1/2, 2]
and

    |φ̂(ω)|² + Σ_{j=1}^∞ |ψ̂(2^{−j}ω)|² = 1,    a.e. ω ≥ 0.    (25)

Moreover, let g_j(x) := 2^j ψ(2^j x), for x ∈ R, j ≥ 1, and g_j(x) := 2^{|j|} ψ(−2^{|j|}x), for x ∈ R,
j ≤ −1, and set χ(x) := φ(|x|), for x ∈ R. Let Ω be the module-sequence (1) with filters
Ψ = {χ} ∪ {g_j}_{j∈Z\{0}} in every network layer. Then,

    W_N(f) ≤ ∫_R |f̂(ω)|² (1 − r̂_l(ω/(5/3)^{N−1}))² dω,    ∀ f ∈ L²(R), ∀ N ≥ 1.    (26)

Moreover, for every Sobolev function f ∈ H^s(R), there exists β > 0 such that

    W_N(f) = O((5/3)^{−N(2s+β)/(2s+β+1)}).    (27)

ii) Weyl-Heisenberg filters: For R > 0, let the functions g, φ ∈ L¹(R) ∩ L²(R) satisfy supp(ĝ) ⊆
[−R, R], ĝ(−ω) = ĝ(ω), for ω ∈ R, and

    |φ̂(ω)|² + Σ_{k=1}^∞ |ĝ(ω − R(k + 1))|² = 1,    a.e. ω ≥ 0.    (28)

Moreover, let g_k(x) := e^{2πi(k+1)Rx} g(x), for x ∈ R, k ≥ 1, and g_k(x) := e^{−2πi(|k|+1)Rx} g(x), for
x ∈ R, k ≤ −1, and set χ(x) := φ(|x|), for x ∈ R. Let Ω be the module-sequence (1) with filters
Ψ = {χ} ∪ {g_k}_{k∈Z\{0}} in every network layer. Then,

    W_N(f) ≤ ∫_R |f̂(ω)|² (1 − r̂_l(ω/(R(3/2)^{N−1})))² dω,    ∀ f ∈ L²(R), ∀ N ≥ 1.    (29)

Moreover, for every Sobolev function f ∈ H^s(R), there exists β > 0 such that

    W_N(f) = O((3/2)^{−N(2s+β)/(2s+β+1)}).    (30)

Proof. See Appendix G.
The conditions we impose on the mother and father wavelet ψ, φ in i) are satisfied, e.g., by analytic
Meyer wavelets [22, Section 3.3.5], and those on the generating function g and low-pass filter φ in
ii) by B-splines [44, Section 1]. In addition, we note that in the presence of pooling by sub-sampling,
say with pooling factors S_n := S ∈ [1, a), for all n ∈ N, (see [10, Eq. 9] for the definition of
pooling by sub-sampling in continuous-time) the effective decay factor in (27) and (30) becomes
5/(3S) and 3/(2S), respectively. Hence, exponential energy decay is compatible with vertical translation invariance
according to [10, Theorem 1], albeit at the cost of slower (exponential) decay. The proof of this statement
is structurally very similar to that of Theorem 2 and will therefore not be presented here. We next put
the results in Theorems 1 and 2 into perspective with respect to the literature.
Relation to [11, Section 5]: The basic philosophy of our proof technique for (20), (23), (26), and
(29) is inspired by the proof in [11, Section 5], which establishes (9) and (11) for scattering networks
based on certain wavelet filters and with 1-D real-valued input signals f ∈ L2 (R). Specifically, in
[11, Section 5], in every network layer, the filters ΨW = {χ} ∪ {gj }j∈Z (where gj (ω) := 2j ψ(2j ω),
j ∈ Z, for some mother wavelet ψ ∈ L1 (R) ∩ L2 (R)) are 1-D functions satisfying the frame property
(2) with An = Bn = 1, n ∈ N, a mild analyticity condition4 [11, Eq. 5.5] in the sense of |gbj (ω)|,
j ∈ Z, being larger for positive frequencies ω than for the corresponding negative ones, and a vanishing
b
moments condition [11, Eq. 5.6] which controls the behavior of ψ(ω)
around the origin according to
b
|ψ(ω)|
≤ C|ω|1+ε , for ω ∈ R, for some C, ε > 0. Similarly to the proof of (11) in [11, Section 5], we
base our proof of (23) on the energy decomposition identity (24) and on an upper bound on WN (f ) (see
(9) for the corresponding upper bound established in [11, Section 5]) shown to go to zero as N → ∞.
The exponential energy decay results (21), (27), and (30) for f ∈ H s (Rd ) are entirely new. The major
differences between [11, Section 5] and our results are (i) that (9) (reported in [11, Section 5]) depends
on an unspecified a > 1, whereas our results in (20), (21), (26), (27), (29), and (30) make the decay
factor a and the decay exponent α explicit, (ii) the technical elements employed to arrive at the upper
bounds on WN (f ), specifically, while the proof in [11, Section 5] makes explicit use of the algebraic
structure of the filters, namely, the multi-scale structure of wavelets, our proof of (20) is oblivious to the
algebraic structure of the filters, which is why it applies to general (possibly unstructured) filters that, in
addition, can be different in different network layers, (iii) the assumptions imposed on the filters, namely
the analyticity and vanishing moments conditions in [11, Eq. 5.5–5.6], in contrast to our Assumption 1,
and (iv) the class of input signals f the results apply to, namely 1-D real-valued signals in [11, Section
5], and d-dimensional complex-valued signals in our Theorem 1.
Relation to [12]: For scattering networks that are based on so-called uniform covering filters [12], (10)
and (11) are established in [12] for d-dimensional complex-valued signals f ∈ L2 (Rd ). Specifically, in
⁴ At the time of completion of the present paper, I. Waldspurger kindly sent us a preprint [46] which shows that the analyticity condition [11, Eq. 5.5] on the mother wavelet is not needed for (9) to hold.
[12], in every network layer, the d-dimensional filters {χ} ∪ {gλ }λ∈Λ are taken to satisfy i) the frame
property (2) with A = B = 1 and hence An = Bn = 1, n ∈ N, see [12, Definition 2.1 (c)], ii) a
vanishing moments condition [12, Definition 2.1 (a)] according to gbλ (0) = 0, for all λ ∈ Λ, and iii) a
uniform covering condition [12, Definition 2.1 (b)] which says that the filters’ Fourier transform support
sets can be covered by a union of finitely many balls. The major differences between [12] and our
results are as follows: (i) the results in [12] apply exclusively to filters satisfying the uniform covering
condition such as, e.g., Weyl-Heisenberg filters with a band-limited generating function [12, Proposition
2.3], but do not apply to multi-scale filters such as wavelets, (α)-curvelets, shearlets, and ridgelets (see
[12, Remark 2.2 (b)]), (ii) (10) as established in [12] leaves the decay factor a > 1 unspecified, whereas
our results in (27) and (30) make the decay factor a explicit (namely, a = 5/3 in the wavelet case and
a = 3/2 in the Weyl-Heisenberg case), (iii) the exponential energy decay result in (10) as established in
[12] is universal in f ∈ L2 (Rd ), i.e., the RHS of (10) is independent of f ∈ L2 (Rd ), whereas our decay
results in (21), (27), and (30) depend on the particular f through the parameter β (which determines
the decay of fb(ω) as |ω| → ∞), (iv) the technical elements employed to arrive at the upper bounds
on WN (f ), specifically, while the proof in [12] makes explicit use of the uniform covering property of
the filters, our proof of (20) is completely oblivious to the (algebraic) structure of the filters, (v) the
assumptions imposed on the filters, i.e., the vanishing moments and uniform covering condition in [12,
Definition 2.1 (a)-(b)], in contrast to our Assumption 1, which is less restrictive, and thereby makes our
results in Theorem 1 apply to general (possibly unstructured) filters that, in addition, can be different in
different network layers.
V. NUMBER OF LAYERS NEEDED
DCNNs used in practice employ potentially hundreds of layers [8]. As already mentioned, such
network depths entail formidable computational challenges both in training and in operating the network.
It is therefore important to understand how many layers are needed to have most, say ((1 − ε) · 100)%
of the input signal energy be contained in the feature vector. This can be formalized by considering
Parseval frames in all layers (i.e., frames with frame bounds satisfying An = Bn = 1, n ∈ N) and by
asking for bounds of the form
$$(1-\varepsilon)\,\|f\|_2^2 \;\le\; \sum_{n=0}^{N}|||\Phi_\Omega^n(f)|||^2 \;\le\; \|f\|_2^2. \qquad (31)$$
Sandwiching $\sum_{n=0}^{N}|||\Phi_\Omega^n(f)|||^2$ according to (31) implies
$$(1-\varepsilon) \;\le\; \frac{\sum_{n=0}^{N}|||\Phi_\Omega^n(f)|||^2}{\|f\|_2^2} \;\le\; 1,$$
and hence guarantees that at least $((1-\varepsilon)\cdot 100)\%$ of the input signal energy is contained in the feature vector $\{\Phi_\Omega^n(f)\}_{n=0}^{N}$.
The following results establish handy estimates of the number of layers needed to guarantee (31). For
pedagogical reasons, we start with the case of band-limited input signals and then proceed to a more
general statement that pertains to Sobolev functions H s (Rd ).
Corollary 1.
i) Let Ω be the module-sequence (1) with filters {gλn }λn ∈Λn satisfying the conditions in Assumption
1, and let the corresponding frame bounds be An = Bn = 1, n ∈ N. Let δ > 0 be the radius of the
spectral gap $B_\delta(0)$ left by the filters $\{g_{\lambda_n}\}_{\lambda_n\in\Lambda_n}$ according to (17). Furthermore, let $l > \lfloor d/2\rfloor + 1$, $\varepsilon\in(0,1)$, $\alpha$ as defined in (19), and $f\in L^2(\mathbb{R}^d)$ be $L$-band-limited. If
$$N \;\ge\; \left\lceil\left(\frac{L}{\big(1-(1-\varepsilon)^{\frac{1}{2l}}\big)\,\delta}\right)^{1/\alpha} - 1\right\rceil, \qquad (32)$$
then (31) holds.
ii) Assume that the conditions in Theorem 2 i) and ii) hold. For the wavelet case, let $a = \frac{5}{3}$ and $\delta = 1$, where $\delta$ corresponds to the radius of the spectral gap left by the wavelets $\{g_j\}_{j\in\mathbb{Z}\setminus\{0\}}$. For the Weyl-Heisenberg case, let $a = \frac{3}{2}$ and $\delta = R$, where $\delta$ corresponds to the radius of the spectral gap left by the Weyl-Heisenberg filters $\{g_k\}_{k\in\mathbb{Z}\setminus\{0\}}$. Moreover, let $l > 1$, $\varepsilon\in(0,1)$, and $f\in L^2(\mathbb{R})$ be $L$-band-limited. If
$$N \;\ge\; \left\lceil\log_a\!\left(\frac{L}{\big(1-(1-\varepsilon)^{\frac{1}{2l}}\big)\,\delta}\right)\right\rceil, \qquad (33)$$
then (31) holds in both cases.
Proof. See Appendix H.
Corollary 1 nicely shows how the description complexity of the signal class under consideration,
namely the bandwidth L, the decay exponent α defined in (19), and the decay factor a > 1 determine
the number N of layers needed to guarantee (31). Specifically, (32) and (33) show that larger bandwidths
L and large dimension d render the input signal f more “complex”, which requires deeper networks to
(1 − ε)                    0.25   0.5   0.75   0.9   0.95   0.99
wavelets                    2      3     4      6     8      11
Weyl-Heisenberg filters     2      4     5      8     10     14
general filters             2      3     7      19    39     199

Table I: Number N of layers needed to ensure that ((1 − ε) · 100)% of the input signal energy is contained in the features generated in the first N network layers.
capture most of the energy of f . The dependence of the lower bounds in (32) and (33) on the network
properties (respectively, the module-sequence Ω) is through the radius δ of the spectral gap left by the
filters {gλn }λn ∈Λn only.
The following numerical example provides quantitative insights on the influence of the parameter ε
on (32) and (33). Specifically, we set L = 1, δ = 1, d = 1 (which implies α = 1, see (19)), l = 1.0001,
and show in Table I the number N of layers needed according to (32) and (33) for different values of
ε. The results show that 95% of the input signal energy is contained in the first 8 layers in the wavelet
case and the first 10 layers in the Weyl-Heisenberg case. We can therefore conclude that in practice a
relatively small number of layers is needed to have most of the input signal energy be contained in the
feature vector. In contrast, for general filters, where we can guarantee polynomial energy decay only, at
least N = 39 layers are needed to absorb 95% of the input signal energy. We hasten to add, however,
that (20) simply guarantees polynomial energy decay and therefore does not preclude the energy from
decaying faster in practice.
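The layer estimates (32) and (33) are straightforward to evaluate numerically. The following Python sketch (not part of the paper's formal development) reproduces Table I with the parameters used in the text, namely L = 1, δ = 1, d = 1 (hence α = 1), and l = 1.0001:

```python
import math

# Reproduces Table I: number of layers N needed to capture a (1 - eps)
# fraction of the input signal energy, using the bounds (32) (general
# filters, polynomial decay) and (33) (wavelets / Weyl-Heisenberg
# filters, exponential decay) of Corollary 1.
# Parameters as in the text: L = 1, delta = 1, d = 1 (alpha = 1), l = 1.0001.

L, delta, alpha, l = 1.0, 1.0, 1.0, 1.0001

def layers_general(frac):
    """Bound (32): N >= ceil((L / ((1-(1-eps)^{1/(2l)}) delta))^{1/alpha} - 1)."""
    eps = 1.0 - frac
    x = L / ((1.0 - (1.0 - eps) ** (1.0 / (2 * l))) * delta)
    return math.ceil(x ** (1.0 / alpha) - 1.0)

def layers_exponential(frac, a):
    """Bound (33): N >= ceil(log_a(L / ((1-(1-eps)^{1/(2l)}) delta)))."""
    eps = 1.0 - frac
    x = L / ((1.0 - (1.0 - eps) ** (1.0 / (2 * l))) * delta)
    return math.ceil(math.log(x, a))

fracs = [0.25, 0.5, 0.75, 0.9, 0.95, 0.99]
wavelets = [layers_exponential(f, 5 / 3) for f in fracs]   # a = 5/3
weyl     = [layers_exponential(f, 3 / 2) for f in fracs]   # a = 3/2
general  = [layers_general(f) for f in fracs]
print(wavelets)  # [2, 3, 4, 6, 8, 11]
print(weyl)      # [2, 4, 5, 8, 10, 14]
print(general)   # [2, 3, 7, 19, 39, 199]
```

The three printed rows coincide with the three rows of Table I, illustrating the gap between the exponential and the (guaranteed) polynomial decay regimes.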
We proceed with the estimates on the number of layers for Sobolev-class input signals.
Corollary 2.
i) Let Ω be the module-sequence (1) with filters {gλn }λn ∈Λn satisfying the conditions in Assumption
1, and let the corresponding frame bounds be An = Bn = 1, n ∈ N. Let δ > 0 be the radius of the
spectral gap $B_\delta(0)$ left by the filters $\{g_{\lambda_n}\}_{\lambda_n\in\Lambda_n}$ according to (17). Furthermore, let $l > \lfloor d/2\rfloor + 1$, $\varepsilon\in(0,1)$, $\alpha$ as defined in (19), and $f\in H^s(\mathbb{R}^d)\setminus\{0\}$. If
$$N \;\ge\; \left\lceil\left(\frac{\tau}{\big(1-(1-\varepsilon)^{\frac{1}{4l}}\big)\,\delta}\right)^{1/\alpha} - 1\right\rceil, \qquad (34)$$
where
$$\tau \;=\; \max\left\{L,\ \left(\frac{\mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))}{(2s+\beta)\big(1-\sqrt{1-\varepsilon}\,\big)\|f\|_2^2}\right)^{\frac{1}{2s+\beta}}\right\}, \qquad (35)$$
then (31) holds. Here, the parameters $\beta,\mu,L > 0$ depend on $f$ in the sense of
$$|\widehat{f}(\omega)| \;\le\; \frac{\mu}{(1+|\omega|^2)^{\frac{s}{2}+\frac{\beta}{4}+\frac{d}{4}}}, \qquad \forall\,|\omega|\ge L.$$
ii) Assume that the conditions in Theorem 2 i) and ii) hold. For the wavelet case, let $a = \frac{5}{3}$ and $\delta = 1$, where $\delta$ corresponds to the radius of the spectral gap left by the wavelets $\{g_j\}_{j\in\mathbb{Z}\setminus\{0\}}$. For the Weyl-Heisenberg case, let $a = \frac{3}{2}$ and $\delta = R$, where $\delta$ corresponds to the radius of the spectral gap left by the Weyl-Heisenberg filters $\{g_k\}_{k\in\mathbb{Z}\setminus\{0\}}$. Furthermore, let $l > 1$, $\varepsilon\in(0,1)$, and $f\in H^s(\mathbb{R})\setminus\{0\}$. If
$$N \;\ge\; \left\lceil\log_a\!\left(\frac{\tau}{\big(1-(1-\varepsilon)^{\frac{1}{4l}}\big)\,\delta}\right)\right\rceil, \qquad (36)$$
where
$$\tau \;=\; \max\left\{L,\ \left(\frac{2\mu^2}{(2s+\beta)\big(1-\sqrt{1-\varepsilon}\,\big)\|f\|_2^2}\right)^{\frac{1}{2s+\beta}}\right\}, \qquad (37)$$
then (31) holds. Here, the parameters $\beta,\mu,L > 0$ depend on $f$ in the sense of
$$|\widehat{f}(\omega)| \;\le\; \frac{\mu}{(1+|\omega|^2)^{\frac{s}{2}+\frac{\beta}{4}+\frac{1}{4}}}, \qquad \forall\,|\omega|\ge L.$$
Proof. See Appendix I.
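Corollary 2 can be evaluated in the same way. The Python sketch below uses hypothetical signal parameters (s = 0.4, β = 1, μ = 1, L = 1, ‖f‖₂² = 1, all made up for illustration) together with d = 1, δ = 1, l = 1.0001; the exponent α = 1 used for d = 1 is the sharpened 1-D exponent established in Appendix C, and vol⁰(∂B₁(0)) = 2 (the two-point sphere {−1, 1}) is an assumption of this sketch:

```python
import math

# Hypothetical illustration of Corollary 2 i). All signal parameters
# (mu, s, beta, L, ||f||_2^2) are made-up example values, not taken
# from the paper.

def tau_bound(L, mu, s, beta, f_norm_sq, vol_sphere, eps):
    # (35): tau = max{L, (mu^2 vol^{d-1}(dB_1(0))
    #                / ((2s+beta)(1 - sqrt(1-eps)) ||f||_2^2))^{1/(2s+beta)}}
    t = (mu ** 2 * vol_sphere
         / ((2 * s + beta) * (1.0 - math.sqrt(1.0 - eps)) * f_norm_sq))
    return max(L, t ** (1.0 / (2 * s + beta)))

def layers_sobolev(L, mu, s, beta, f_norm_sq, d, delta, l, eps):
    # (34): N >= ceil((tau / ((1 - (1-eps)^{1/(4l)}) delta))^{1/alpha} - 1)
    # For d = 1 the sharpened exponent alpha = 1 (Appendix C) applies.
    alpha = 1.0 if d == 1 else math.log2(math.sqrt(d / (d - 0.5)))
    vol_sphere = 2.0 if d == 1 else 2.0 * math.pi  # sketch: d = 1 or d = 2 only
    tau = tau_bound(L, mu, s, beta, f_norm_sq, vol_sphere, eps)
    x = tau / ((1.0 - (1.0 - eps) ** (1.0 / (4 * l))) * delta)
    return math.ceil(x ** (1.0 / alpha) - 1.0)

# Example: capture 95% of the energy of a (hypothetical) Sobolev signal
N = layers_sobolev(L=1.0, mu=1.0, s=0.4, beta=1.0, f_norm_sq=1.0,
                   d=1, delta=1.0, l=1.0001, eps=0.05)
print(N)
```

With these made-up parameters the polynomial bound (34) requires several hundred layers; as noted for Table I, this is only an upper estimate and does not preclude faster energy decay in practice.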
As already mentioned in Section III, Sobolev spaces H s (Rd ) contain a wide range of practically
relevant signal classes such as square-integrable functions L2 (Rd ) = H 0 (Rd ), L-band-limited signals
(according to $L^2_L(\mathbb{R}^d)\subseteq H^s(\mathbb{R}^d)$, for all $L\ge 0$ and all $s\ge 0$), and cartoon functions (according to $\mathcal{C}^K_{\text{CART}}\subseteq H^s(\mathbb{R}^d)$, for all $s\in(0,1/2)$, as established in Appendix B). The results in Corollary 2
therefore provide—for a wide variety of input signals—a good picture of how many layers are needed
to have most of the input signal energy be contained in the feature vector.
APPENDIX A
A FEATURE EXTRACTOR WITH A NON-TRIVIAL KERNEL
We show, by way of example, that employing filters Ψn which satisfy the frame property (2) alone
does not guarantee that the feature extractor ΦΩ defined in (5) satisfies
$$A_\Omega\|f\|_2^2 \;\le\; |||\Phi_\Omega(f)|||^2, \qquad \forall f\in L^2(\mathbb{R}^d), \qquad (38)$$
for some AΩ > 0. The existence of such a lower bound AΩ > 0 would imply a trivial kernel for the
feature extractor ΦΩ and thereby ensure that the only signal f that maps to the all-zeros feature vector
is f = 0.
Our example employs, in every network layer, filters $\Psi = \{\chi\}\cup\{g_k\}_{k\in\mathbb{Z}}$ that satisfy the Littlewood-Paley condition (18) with $A_n = B_n = 1$, $n\in\mathbb{N}$, and where $g_0$ is such that $\widehat{g_0}(\omega) = 1$, for $\omega\in B_1(0)$.
We emphasize that no further restrictions are imposed on the filters {χ} ∪ {gk }k∈Z , specifically χ need
not be of low-pass nature and the filters {gk }k∈Z may be structured (such as wavelets [10, Appendix
B]) or unstructured (such as random filters [47], [48]), as long as they satisfy the Littlewood-Paley
condition (18). Now, consider the input signal $f\in L^2(\mathbb{R}^d)$ according to $\widehat{f}(\omega) := (1-|\omega|)_+^l$, $\omega\in\mathbb{R}^d$, with $l > \lfloor d/2\rfloor + 1$. Then $f*g_0 = f$, owing to $\mathrm{supp}(\widehat{f}\,)\subseteq B_1(0)$ and $\widehat{f}\cdot\widehat{g_0} = \widehat{f}$. Moreover, $\widehat{f}$ is a positive definite radial basis function [49, Theorem 6.20] and hence by [49, Theorem 6.18] $f(x)\ge 0$, $x\in\mathbb{R}^d$, which, in turn, implies $|f| = f$. This yields $U[q_0^N]f = \big|\cdots|f*g_0|*g_0\cdots\big|*g_0 = f$, for $q_0^N := (0,0,\dots,0)\in\mathbb{Z}^N$ and $N\in\mathbb{N}$. Owing to the energy decomposition identity (83) in Appendix E,
together with $A_\Omega^N = B_\Omega^N = 1$, $N\in\mathbb{N}$, which, in turn, is by $A_n = B_n = 1$, $n\in\mathbb{N}$, we have
$$\|f\|_2^2 \;=\; \sum_{n=0}^{N-1}|||\Phi_\Omega^n(f)|||^2 + W_N(f) \;=\; \sum_{n=0}^{N-1}|||\Phi_\Omega^n(f)|||^2 + \underbrace{\|U[q_0^N]f\|_2^2}_{=\,\|f\|_2^2} + \sum_{q\,\in\,\mathbb{Z}^N\setminus\{q_0^N\}}\|U[q]f\|_2^2,$$
for $N\in\mathbb{N}$, which implies
$$\sum_{n=0}^{N-1}|||\Phi_\Omega^n(f)|||^2 + \sum_{q\,\in\,\mathbb{Z}^N\setminus\{q_0^N\}}\|U[q]f\|_2^2 \;=\; 0. \qquad (39)$$
As both terms in (39) are nonnegative, we can conclude that $\sum_{n=0}^{N-1}|||\Phi_\Omega^n(f)|||^2 = 0$, $N\in\mathbb{N}$, and thus $|||\Phi_\Omega(f)|||^2 = \sum_{n=0}^{\infty}|||\Phi_\Omega^n(f)|||^2 = 0$. Since $|||\Phi_\Omega(f)|||^2 = 0$ implies $\Phi_\Omega(f) = 0$, we have constructed a non-zero $f$, namely $f(x) = \int_{\mathbb{R}^d}(1-|\omega|)_+^l\,e^{2\pi i\langle x,\omega\rangle}\,d\omega$, that maps to $0$.
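The construction above can be checked numerically. The following Python sketch (a hypothetical discretization with d = 1 and l = 2, not part of the formal argument) verifies that f with f̂(ω) = (1 − |ω|)₊^l is nonnegative, so that |f| = f, and that convolving with g₀ (whose Fourier transform equals 1 on B₁(0)) leaves f unchanged:

```python
import numpy as np

# Numerical sketch of the Appendix A construction (d = 1, l = 2):
# the signal f with hat{f}(w) = (1 - |w|)_+^l is nonnegative, so
# |f| = f, and since hat{g_0} = 1 on B_1(0) while supp(hat{f}) lies
# in B_1(0), convolution with g_0 leaves f unchanged in every layer.

l = 2                                      # l > floor(d/2) + 1 = 1 for d = 1
w = np.linspace(-1.0, 1.0, 4001)           # frequency grid on supp(hat{f})
dw = w[1] - w[0]
f_hat = np.maximum(1.0 - np.abs(w), 0.0) ** l

# hat{f} * hat{g_0} = hat{f}, i.e. f * g_0 = f, since hat{g_0} = 1 on B_1(0)
g0_hat = np.ones_like(w)
assert np.allclose(f_hat * g0_hat, f_hat)

# f(x) = \int hat{f}(w) e^{2 pi i x w} dw, evaluated by quadrature
x = np.linspace(-5.0, 5.0, 501)
kernel = np.exp(2j * np.pi * np.outer(x, w))
f = (f_hat * kernel).sum(axis=1).real * dw

assert f.min() >= -1e-3                          # nonnegative up to quadrature error
assert abs(f[len(x) // 2] - 2.0 / 3.0) < 1e-3    # f(0) = \int hat{f} = 2/3
print(f.min(), f[len(x) // 2])
```

This mirrors the conclusion of the appendix: a non-zero signal whose feature vector is identically zero.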
APPENDIX B
REGULARITY OF CARTOON FUNCTIONS
Cartoon functions, introduced in [35], satisfy mild decay properties and are piecewise continuously
differentiable apart from curved discontinuities along Lipschitz-continuous hypersurfaces. This function
class has been widely adopted in the literature [16], [20], [26], [36], [37] as a standard model for natural
images such as, e.g., images of handwritten digits [38] (see Figure 3).
We proceed to the formal definition of cartoon functions.
Definition 1. The function $f:\mathbb{R}^d\to\mathbb{C}$ is referred to as a cartoon function if it can be written as $f = f_1 + 1_D f_2$, where $D\subseteq\mathbb{R}^d$ is a compact Lipschitz domain with boundary of finite length, i.e., $\mathrm{vol}^{d-1}(\partial D) < \infty$, and $f_i\in L^2(\mathbb{R}^d)\cap C^2(\mathbb{R}^d,\mathbb{C})$, $i=1,2$, satisfies the decay condition
$$|\nabla f_i(x)| \;\le\; C\langle x\rangle^{-d}, \qquad i = 1,2,$$
for some $C > 0$ (not depending on $f_1$, $f_2$). Furthermore, we denote by
$$\mathcal{C}^K_{\text{CART}} := \big\{f_1 + 1_D f_2 \;\big|\; f_i\in L^2(\mathbb{R}^d)\cap C^2(\mathbb{R}^d,\mathbb{C}),\ i = 1,2,\ |\nabla f_i(x)|\le K\langle x\rangle^{-d},\ \mathrm{vol}^{d-1}(\partial D)\le K,\ \|f_2\|_\infty\le K\big\}$$
the class of cartoon functions of size $K > 0$.
Even though cartoon functions are in general discontinuous, they still admit Sobolev regularity. The
following result formalizes this statement.
Lemma 1. Let $K > 0$. Then $\mathcal{C}^K_{\text{CART}}\subseteq H^s(\mathbb{R}^d)$, for all $s\in(0,1/2)$.

Proof. Let $(f_1 + 1_D f_2)\in\mathcal{C}^K_{\text{CART}}$. We first establish $1_D\in H^s(\mathbb{R}^d)$, for all $s\in(0,1/2)$. To this end, we define the Sobolev-Slobodeckij semi-norm [50, Section 2.1.2.]
$$|f|_{H^s} := \left(\int_{\mathbb{R}^d}\int_{\mathbb{R}^d}\frac{|f(x)-f(y)|^2}{|x-y|^{2s+d}}\,dx\,dy\right)^{1/s},$$
and note that, thanks to [50, Section 2.1.2], $1_D\in H^s(\mathbb{R}^d)$ if $|1_D|_{H^s} < \infty$. We have
$$|1_D|_{H^s}^s = \int_{\mathbb{R}^d}\int_{\mathbb{R}^d}\frac{|1_D(x)-1_D(y)|^2}{|x-y|^{2s+d}}\,dx\,dy = \int_{\mathbb{R}^d}\frac{1}{|t|^{2s+d}}\int_{\mathbb{R}^d}|1_D(x)-1_D(x-t)|^2\,dx\,dt,$$
where we employed the change of variables $t = x - y$. Next, we note that the function $h(x) := |1_D(x)-1_D(x-t)|^2$ satisfies $h(x) = 1$, for $x\in S$, where
$$S := \{x\in\mathbb{R}^d\,|\,x\in D\ \text{and}\ x-t\notin D\}\cup\{x\in\mathbb{R}^d\,|\,x\notin D\ \text{and}\ x-t\in D\} = D\,\Delta\,(D+t),$$
and $h(x) = 0$, for $x\in\mathbb{R}^d\setminus S$. Since $S\subseteq\partial D + B_{|t|}(0)$, where $(\partial D + B_{|t|}(0))$ is a tubular neighborhood
Fig. 5: Illustration in dimension d = 2. The set (D + t) (grey) is obtained by translating the set D (white) by t ∈ R². The symmetric difference D∆(D + t) is contained in (∂D + B_{|t|}(0)), which is a tubular neighborhood of diameter 2|t| around the boundary ∂D of D.
of diameter $2|t|$ around the boundary $\partial D$ of $D$ (see Figure 5), we have
$$\int_{\mathbb{R}^d}|1_D(x)-1_D(x-t)|^2\,dx = \int_{\mathbb{R}^d}h(x)\,dx = \int_S 1\,dx = \mathrm{vol}^d(S) \;\le\; \underbrace{\min\{2|t|\,\mathrm{vol}^{d-1}(\partial D),\ 2\,\mathrm{vol}^d(D)\}}_{=:\,g(t)}.$$
Let $R > 0$. We have
$$|1_D|_{H^s}^s \;\le\; \int_{\mathbb{R}^d}\frac{g(t)}{|t|^{2s+d}}\,dt = \int_{B_R(0)}\frac{g(t)}{|t|^{2s+d}}\,dt + \int_{\mathbb{R}^d\setminus B_R(0)}\frac{g(t)}{|t|^{2s+d}}\,dt$$
$$\le\; \int_{B_R(0)}\frac{2\,\mathrm{vol}^{d-1}(\partial D)}{|t|^{2s+d-1}}\,dt + \int_{\mathbb{R}^d\setminus B_R(0)}\frac{2\,\mathrm{vol}^d(D)}{|t|^{2s+d}}\,dt$$
$$=\; 2\,\mathrm{vol}^d(D)\,\mathrm{vol}^{d-1}(\partial B_1(0))\underbrace{\int_R^{\infty} r^{-(2s+1)}\,dr}_{=:\,I_1} + 2\,\mathrm{vol}^{d-1}(\partial D)\,\mathrm{vol}^{d-1}(\partial B_1(0))\underbrace{\int_0^{R} r^{-2s}\,dr}_{=:\,I_2}, \qquad (40)$$
where in the last step we introduced polar coordinates. The first integral $I_1$ is finite for all $s > 0$. The second integral $I_2$ is finite for all $s < 1/2$. We can therefore conclude that (40) is finite for $s\in(0,1/2)$, and hence $1_D\in H^s(\mathbb{R}^d)$, for $s\in(0,1/2)$. To see that $(f_1 + 1_D f_2)\in H^s(\mathbb{R}^d)$, we simply note that
$$|f_1 + 1_D f_2|_{H^s} \;\le\; |f_1|_{H^s} + |1_D f_2|_{H^s} \;\le\; |f_1|_{H^s} + C\,|1_D|_{H^s} \;<\; \infty, \qquad (41)$$
where the first inequality in (41) is thanks to the sub-additivity of the semi-norm $|\cdot|_{H^s}$, the second inequality is due to the operation $h\mapsto hf$, $h\in H^s(\mathbb{R}^d)$, $f\in C^2(\mathbb{R}^d,\mathbb{C})$, constituting a bounded linear operator on $H^s(\mathbb{R}^d)$, for all $s\in(0,1/2)$ (i.e., $|hf|_{H^s}\le C|h|_{H^s}$, for some $C > 0$), and the last inequality follows from $|1_D|_{H^s} < \infty$, for $s\in(0,1/2)$, established above, and $|f_1|_{H^s} < \infty$, for $s\in(0,1/2)$, which is by $f_1\in L^2(\mathbb{R}^d)\cap C^2(\mathbb{R}^d,\mathbb{C})$. This completes the proof.
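For a concrete 1-D instance of Lemma 1 (our own worked example, not from the paper), take D = [0, 1]; then the double integral can be computed in closed form and is finite exactly for s ∈ (0, 1/2):

```python
import numpy as np

# Worked 1-D instance of Lemma 1 (a sketch): for the interval D = [0, 1],
#   |1_D|_{H^s}^s = \int\int |1_D(x)-1_D(y)|^2 / |x-y|^{2s+1} dx dy
#                 = 2 / (s (1 - 2s)),
# which is finite precisely for s in (0, 1/2), matching the claimed
# Sobolev regularity of indicator (and hence cartoon) functions.

def seminorm_exact(s):
    assert 0 < s < 0.5
    return 2.0 / (s * (1.0 - 2.0 * s))

# numerical cross-check at s = 0.25 via the reduced form
# |1_D|_{H^s}^s = (1/s) * \int_0^1 [x^{-2s} + (1-x)^{-2s}] dx:
s, n = 0.25, 1_000_000
x = (np.arange(n) + 0.5) / n          # midpoint grid on (0, 1)
numeric = (1.0 / s) * np.mean(x ** (-2 * s) + (1.0 - x) ** (-2 * s))
assert abs(numeric - seminorm_exact(s)) < 0.05
print(seminorm_exact(0.25), seminorm_exact(0.49))  # 16.0 and ~204 (diverges as s -> 1/2)
```

The blow-up of the closed form as s approaches 1/2 mirrors the divergence of the integral I₂ in (40).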
APPENDIX C
PROOF OF STATEMENT i) IN THEOREM 1
We start by establishing (20) with $\alpha = \log_2\!\big(\sqrt{d/(d-1/2)}\big)$, for all $d\ge 1$. Then, we sharpen our result in the 1-D case by proving that (20) holds for $d = 1$ with $\alpha = 1$. This leads to a significant improvement, in the 1-D case, of the decay exponent from $\log_2\!\big(\sqrt{d/(d-1/2)}\big) = \frac{1}{2}$ to $1$.
The idea for the proof of (20) for $\alpha = \log_2\!\big(\sqrt{d/(d-1/2)}\big)$, for all $d\ge 1$, is to establish that⁵
$$\sum_{q\,\in\,\Lambda_n\times\Lambda_{n+1}\times\cdots\times\Lambda_{n+N-1}}\|U[q]f\|_2^2 \;\le\; C_n^{n+N-1}\int_{\mathbb{R}^d}|\widehat{f}(\omega)|^2\Big(1-\widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big)^2\Big)\,d\omega, \qquad \forall N\in\mathbb{N}, \qquad (42)$$
where
$$C_n^{n+N-1} := \prod_{k=n}^{n+N-1}\max\{1, B_k\}.$$
Setting $n = 1$ in (42) and noting that $C_1^N = B_\Omega^N$ yields the desired result (20). We proceed by induction over the path length $\ell(q) := N$, for $q = (\lambda_n,\lambda_{n+1},\dots,\lambda_{n+N-1})\in\Lambda_n\times\Lambda_{n+1}\times\cdots\times\Lambda_{n+N-1}$. Starting with the base case $N = 1$, we have
$$\sum_{q\,\in\,\Lambda_n}\|U[q]f\|_2^2 = \sum_{\lambda_n\in\Lambda_n}\|f*g_{\lambda_n}\|_2^2 = \int_{\mathbb{R}^d}\sum_{\lambda_n\in\Lambda_n}|\widehat{g_{\lambda_n}}(\omega)|^2\,|\widehat{f}(\omega)|^2\,d\omega \qquad (43)$$
$$\le\; B_n\int_{\mathbb{R}^d\setminus B_\delta(0)}|\widehat{f}(\omega)|^2\,d\omega \qquad (44)$$
$$\le\; \underbrace{\max\{1, B_n\}}_{=\,C_n^n}\int_{\mathbb{R}^d}|\widehat{f}(\omega)|^2\Big(1-\widehat{r_l}\Big(\frac{\omega}{\delta}\Big)^2\Big)\,d\omega, \qquad (45)$$
where (43) is by Parseval's formula, (44) is thanks to (17) and (18), and (45) is due to $\mathrm{supp}(\widehat{r_l})\subseteq B_1(0)$ and $0\le\widehat{r_l}(\omega)\le 1$, for $\omega\in\mathbb{R}^d$. The inductive step is established as follows. Let $N > 1$ and suppose
that (42) holds for all paths $q$ of length $\ell(q) = N-1$, i.e.,
$$\sum_{q\,\in\,\Lambda_n\times\Lambda_{n+1}\times\cdots\times\Lambda_{n+N-2}}\|U[q]f\|_2^2 \;\le\; C_n^{n+N-2}\int_{\mathbb{R}^d}|\widehat{f}(\omega)|^2\Big(1-\widehat{r_l}\Big(\frac{\omega}{(N-1)^\alpha\delta}\Big)^2\Big)\,d\omega. \qquad (46)$$
⁵ We prove the more general result (42) for technical reasons, concretely in order to be able to argue by induction over path lengths with flexible starting index n.
We start by noting that every path $\tilde q\in\Lambda_n\times\Lambda_{n+1}\times\cdots\times\Lambda_{n+N-1}$ of length $\ell(\tilde q) = N$, with arbitrary starting index $n$, can be decomposed into a path $q\in\Lambda_{n+1}\times\cdots\times\Lambda_{n+N-1}$ of length $\ell(q) = N-1$ and an index $\lambda_n\in\Lambda_n$ according to $\tilde q = (\lambda_n, q)$. Thanks to (4) we have $U[\tilde q] = U[(\lambda_n, q)] = U[q]\,U_n[\lambda_n]$, which yields
$$\sum_{q\,\in\,\Lambda_n\times\Lambda_{n+1}\times\cdots\times\Lambda_{n+N-1}}\|U[q]f\|_2^2 = \sum_{\lambda_n\in\Lambda_n}\;\sum_{q\,\in\,\Lambda_{n+1}\times\cdots\times\Lambda_{n+N-1}}\|U[q]\,U_n[\lambda_n]f\|_2^2, \qquad \forall N\in\mathbb{N}. \qquad (47)$$
We proceed by examining the inner sum on the RHS of (47). Invoking the induction hypothesis (46) with $n$ replaced by $(n+1)$ and employing Parseval's formula, we get
$$\sum_{q\,\in\,\Lambda_{n+1}\times\cdots\times\Lambda_{n+N-1}}\|U[q]\,U_n[\lambda_n]f\|_2^2 \;\le\; C_{n+1}^{n+N-1}\int_{\mathbb{R}^d}\big|\widehat{U_n[\lambda_n]f}(\omega)\big|^2\Big(1-\widehat{r_l}\Big(\frac{\omega}{(N-1)^\alpha\delta}\Big)^2\Big)\,d\omega$$
$$=\; C_{n+1}^{n+N-1}\big(\|U_n[\lambda_n]f\|_2^2 - \|(U_n[\lambda_n]f)*r_{l,N-1,\alpha,\delta}\|_2^2\big)$$
$$=\; C_{n+1}^{n+N-1}\big(\|f*g_{\lambda_n}\|_2^2 - \big\||f*g_{\lambda_n}|*r_{l,N-1,\alpha,\delta}\big\|_2^2\big), \qquad (48)$$
for $n\in\mathbb{N}$, where $r_{l,N-1,\alpha,\delta}$ is the inverse Fourier transform of $\widehat{r_l}\big(\frac{\omega}{(N-1)^\alpha\delta}\big)$. Next, we note that $\widehat{r_l}\big(\frac{\omega}{(N-1)^\alpha\delta}\big)$ is a positive definite radial basis function [49, Theorem 6.20] and hence by [49, Theorem 6.18] $r_{l,N-1,\alpha,\delta}(x)\ge 0$, for $x\in\mathbb{R}^d$. Furthermore, it follows from Lemma 2, stated below, that for all $\{\nu_{\lambda_n}\}_{\lambda_n\in\Lambda_n}\subseteq\mathbb{R}^d$, we have
$$\big\||f*g_{\lambda_n}|*r_{l,N-1,\alpha,\delta}\big\|_2^2 \;\ge\; \|f*g_{\lambda_n}*(M_{\nu_{\lambda_n}}r_{l,N-1,\alpha,\delta})\|_2^2. \qquad (49)$$
Here, we note that choosing the modulation factors $\{\nu_{\lambda_n}\}_{\lambda_n\in\Lambda_n}\subseteq\mathbb{R}^d$ appropriately (see (53) below) will be key to establishing the inductive step.
Lemma 2. [9, Lemma 2.7]: Let $f, g\in L^2(\mathbb{R}^d)$ with $g(x)\ge 0$, for $x\in\mathbb{R}^d$. Then,
$$\big\||f|*g\big\|_2^2 \;\ge\; \|f*(M_\omega g)\|_2^2, \qquad \forall\omega\in\mathbb{R}^d.$$
Inserting (48) and (49) into the inner sum on the RHS of (47) yields
$$\sum_{q\,\in\,\Lambda_n\times\Lambda_{n+1}\times\cdots\times\Lambda_{n+N-1}}\|U[q]f\|_2^2 \;\le\; C_{n+1}^{n+N-1}\sum_{\lambda_n\in\Lambda_n}\big(\|f*g_{\lambda_n}\|_2^2 - \|f*g_{\lambda_n}*(M_{\nu_{\lambda_n}}r_{l,N-1,\alpha,\delta})\|_2^2\big)$$
$$=\; C_{n+1}^{n+N-1}\int_{\mathbb{R}^d}|\widehat{f}(\omega)|^2\,h_{n,N,\alpha,\delta}(\omega)\,d\omega, \qquad \forall N\in\mathbb{N}, \qquad (50)$$
where we applied Parseval's formula together with $\widehat{M_\omega f} = T_\omega\widehat{f}$, for $f\in L^2(\mathbb{R}^d)$ and $\omega\in\mathbb{R}^d$, and set
$$h_{n,N,\alpha,\delta}(\omega) := \sum_{\lambda_n\in\Lambda_n}|\widehat{g_{\lambda_n}}(\omega)|^2\Big(1-\widehat{r_l}\Big(\frac{\omega-\nu_{\lambda_n}}{(N-1)^\alpha\delta}\Big)^2\Big). \qquad (51)$$
The key step is now to establish—by appropriately choosing $\{\nu_{\lambda_n}\}_{\lambda_n\in\Lambda_n}\subseteq\mathbb{R}^d$—the upper bound
$$h_{n,N,\alpha,\delta}(\omega) \;\le\; \max\{1, B_n\}\Big(1-\widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big)^2\Big), \qquad \forall\omega\in\mathbb{R}^d, \qquad (52)$$
which upon noting that $C_n^{n+N-1} = \max\{1, B_n\}\,C_{n+1}^{n+N-1}$ yields (42) and thereby completes the proof.
We start by defining $H_{A_{\lambda_n}}$, for $\lambda_n\in\Lambda_n$, to be the orthant supporting $\widehat{g_{\lambda_n}}$, i.e., $\mathrm{supp}(\widehat{g_{\lambda_n}})\subseteq H_{A_{\lambda_n}}$, where $A_{\lambda_n}\in O(d)$ (see Assumption 1). Furthermore, for $\lambda_n\in\Lambda_n$, we choose the modulation factors according to
$$\nu_{\lambda_n} := A_{\lambda_n}\nu\in\mathbb{R}^d, \qquad (53)$$
where the components of $\nu\in\mathbb{R}^d$ are given by $\nu_k := (1+2^{-1/2})\frac{\delta}{d}$, for $k\in\{1,\dots,d\}$. Invoking (16) and (17), we get
$$h_{n,N,\alpha,\delta}(\omega) = \sum_{\lambda_n\in\Lambda_n}|\widehat{g_{\lambda_n}}(\omega)|^2\Big(1-\widehat{r_l}\Big(\frac{\omega-\nu_{\lambda_n}}{(N-1)^\alpha\delta}\Big)^2\Big)$$
$$=\; \sum_{\lambda_n\in\Lambda_n}|\widehat{g_{\lambda_n}}(\omega)|^2\,1_{S_{\lambda_n,\delta}}(\omega)\Big(1-\widehat{r_l}\Big(\frac{\omega-\nu_{\lambda_n}}{(N-1)^\alpha\delta}\Big)^2\Big), \qquad \forall\omega\in\mathbb{R}^d, \qquad (54)$$
where $S_{\lambda_n,\delta} := H_{A_{\lambda_n}}\!\setminus B_\delta(0)$. For the first canonical orthant $H = \{x\in\mathbb{R}^d\,|\,x_k\ge 0,\ k = 1,\dots,d\}$ we show in Lemma 3 below that
$$\widehat{r_l}\Big(\frac{\omega-\nu}{(N-1)^\alpha\delta}\Big) \;\ge\; \widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big), \qquad \forall\omega\in H\setminus B_\delta(0),\ \forall N\ge 2. \qquad (55)$$
This will allow us to deduce
$$\widehat{r_l}\Big(\frac{\omega-\nu_{\lambda_n}}{(N-1)^\alpha\delta}\Big) \;\ge\; \widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big), \qquad \forall\omega\in S_{\lambda_n,\delta},\ \forall\lambda_n\in\Lambda_n,\ \forall N\ge 2, \qquad (56)$$
where $S_{\lambda_n,\delta} = H_{A_{\lambda_n}}\!\setminus B_\delta(0)$, simply by noting that
$$\widehat{r_l}\Big(\frac{\omega-\nu_{\lambda_n}}{(N-1)^\alpha\delta}\Big) = \Big(1-\frac{|A_{\lambda_n}(\omega'-\nu)|}{(N-1)^\alpha\delta}\Big)_+^l \qquad (57)$$
$$=\; \Big(1-\frac{|\omega'-\nu|}{(N-1)^\alpha\delta}\Big)_+^l = \widehat{r_l}\Big(\frac{\omega'-\nu}{(N-1)^\alpha\delta}\Big) \;\ge\; \widehat{r_l}\Big(\frac{\omega'}{N^\alpha\delta}\Big) = \Big(1-\frac{|\omega'|}{N^\alpha\delta}\Big)_+^l \qquad (58)$$
$$=\; \Big(1-\frac{|A_{\lambda_n}\omega'|}{N^\alpha\delta}\Big)_+^l = \widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big), \qquad (59)$$
for $\omega = A_{\lambda_n}\omega'\in H_{A_{\lambda_n}}\!\setminus B_\delta(0)$, where $\omega'\in H\setminus B_\delta(0)$. Here, (57) and (59) are thanks to $|\omega| = |A_{\lambda_n}\omega|$, which is by $A_{\lambda_n}\in O(d)$, and the inequality in (58) is due to (55). Insertion of (56) into (54) then yields
$$h_{n,N,\alpha,\delta}(\omega) \;\le\; \sum_{\lambda_n\in\Lambda_n}|\widehat{g_{\lambda_n}}(\omega)|^2\,1_{S_{\lambda_n,\delta}}(\omega)\Big(1-\widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big)^2\Big) \qquad (60)$$
$$=\; \sum_{\lambda_n\in\Lambda_n}|\widehat{g_{\lambda_n}}(\omega)|^2\Big(1-\widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big)^2\Big) \;\le\; \max\{1, B_n\}\Big(1-\widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big)^2\Big), \qquad \forall\omega\in\mathbb{R}^d, \qquad (61)$$
where in (60) we employed Assumption 1, and (61) is thanks to (18). This establishes (52) and completes the proof of (20) for $\alpha = \log_2\!\big(\sqrt{d/(d-1/2)}\big)$, for all $d\ge 1$.
It remains to show (55), which is accomplished through the following lemma.
Lemma 3. Let $\alpha := \log_2\!\big(\sqrt{d/(d-1/2)}\big)$, $\widehat{r_l}:\mathbb{R}^d\to\mathbb{R}$, $\widehat{r_l}(\omega) := (1-|\omega|)_+^l$, with $l > \lfloor d/2\rfloor + 1$, and define $\nu\in\mathbb{R}^d$ to have components $\nu_k = (1+2^{-1/2})\frac{\delta}{d}$, for $k\in\{1,\dots,d\}$. Then,
$$\widehat{r_l}\Big(\frac{\omega-\nu}{(N-1)^\alpha\delta}\Big) \;\ge\; \widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big), \qquad \forall\omega\in H\setminus B_\delta(0),\ \forall N\ge 2. \qquad (62)$$
Proof. The key idea of the proof is to employ a monotonicity argument. Specifically, thanks to $\widehat{r_l}$ monotonically decreasing in $|\omega|$, i.e., $\widehat{r_l}(\omega_1)\ge\widehat{r_l}(\omega_2)$, for $\omega_1,\omega_2\in\mathbb{R}^d$ with $|\omega_2|\ge|\omega_1|$, (62) can be established simply by showing that
$$\kappa_N(\omega) := |\omega|^2\Big(\frac{N-1}{N}\Big)^{2\alpha} - |\omega-\nu|^2 \;\ge\; 0, \qquad \forall\omega\in H\setminus B_\delta(0),\ \forall N\ge 2. \qquad (63)$$
We first note that for $\omega\in H\setminus B_\delta(0)$ with $|\omega| > N^\alpha\delta$, (62) is trivially satisfied as the RHS of (62) equals zero (owing to $\frac{|\omega|}{N^\alpha\delta} > 1$ together with $\mathrm{supp}(\widehat{r_l})\subseteq B_1(0)$). It hence suffices to prove (63) for $\omega\in H$ with $\delta\le|\omega|\le N^\alpha\delta$. To this end, fix $\tau\in[\delta, N^\alpha\delta]$, and define the spherical segment
Fig. 6: Illustration in dimension d = 2. The mapping ω ↦ |ω − ν|², ω ∈ Ξτ = {ω = (ω₁, ω₂) ∈ R² | |ω| = τ, ω₁ ≥ 0, ω₂ ≥ 0}, computes the squared Euclidean distance between an element ω of the spherical segment Ξτ and the vector ν = (ν₁, ν₂) with components νₖ = (1 + 2^{−1/2}) δ/2, k ∈ {1, 2}. The mapping attains its maxima along the coordinate axes, e.g., for ω* = (τ, 0) ∈ Ξτ.
$\Xi_\tau := \{\omega\in H\,|\,|\omega| = \tau\}$. We then have
$$\kappa_N(\omega) = \tau^2\Big(\frac{N-1}{N}\Big)^{2\alpha} - |\omega-\nu|^2 \;\ge\; \tau^2\Big(\frac{N-1}{N}\Big)^{2\alpha} - |\omega^*-\nu|^2, \qquad (64)$$
for $\omega\in\Xi_\tau$ and $N\ge 2$, where $\omega^* = (\tau, 0,\dots,0)\in\Xi_\tau$. The inequality in (64) holds thanks to the mapping $\omega\mapsto|\omega-\nu|^2$, $\omega\in\Xi_\tau$, attaining its maxima along the coordinate axes (see Figure 6). Inserting
$$|\omega^*-\nu|^2 = \Big(\tau-\frac{\delta(1+2^{-1/2})}{d}\Big)^2 + \frac{(d-1)\,\delta^2(1+2^{-1/2})^2}{d^2} = \tau^2 - \frac{\tau\delta(2+2^{1/2})}{d} + \frac{\delta^2(1+2^{-1/2})^2}{d}$$
into (64) and rearranging terms yields
$$\kappa_N(\omega) \;\ge\; \underbrace{\tau^2\Big(\Big(\frac{N-1}{N}\Big)^{2\alpha}-1\Big) + \frac{\tau\delta(2+2^{1/2})}{d} - \frac{\delta^2(1+2^{-1/2})^2}{d}}_{=:\,p_N(\tau)}, \qquad \forall\omega\in\Xi_\tau,\ \forall N\ge 2.$$
This inequality shows that $\kappa_N(\omega)$ is lower-bounded—for $\omega\in\Xi_\tau$—by the 1-D function $p_N(\tau)$. Now, $p_N(\tau)$ is quadratic in $\tau$, with the highest-degree coefficient $\big(\big(\frac{N-1}{N}\big)^{2\alpha}-1\big)$ negative (owing to $\alpha = \log_2\!\big(\sqrt{d/(d-1/2)}\big) > 0$, for $d\ge 1$). Therefore, thanks to $p_N$, $N\ge 2$, being concave, establishing $p_N(\delta)\ge 0$ and $p_N(N^\alpha\delta)\ge 0$, for $N\ge 2$, implies $p_N(\tau)\ge 0$, for $\tau\in[\delta, N^\alpha\delta]$ and $N\ge 2$ (see Figure 7), and thus (63), which completes the proof. It remains to show that $p_N(\delta)\ge 0$ and $p_N(N^\alpha\delta)\ge 0$, both for $N\ge 2$. We have
$$p_N(\delta) = \delta^2\Big(\Big(\frac{N-1}{N}\Big)^{2\alpha} - 1 + \frac{2+2^{1/2}}{d} - \frac{(1+2^{-1/2})^2}{d}\Big) \;\ge\; \delta^2\Big(2^{-2\alpha} - \frac{d-1/2}{d}\Big) = 0, \qquad (65)$$
where the inequality in (65) is by $N\mapsto\big(\frac{N-1}{N}\big)^{2\alpha}$, for $N\ge 2$, monotonically increasing in $N$, and the
Fig. 7: The function pN(τ) is quadratic in τ, with the coefficient of the highest-degree term negative. Establishing pN(δ) ≥ 0 and pN(N^α δ) ≥ 0 therefore implies pN(τ) ≥ 0, τ ∈ [δ, N^α δ].
equality is thanks to $\alpha = \log_2\!\big(\sqrt{d/(d-1/2)}\big)$, which is by assumption. Next, we have
$$\frac{p_N(N^\alpha\delta)}{\delta^2} = (N-1)^{2\alpha} - N^{2\alpha} + \frac{N^\alpha(2+2^{1/2})}{d} - \frac{(1+2^{-1/2})^2}{d}$$
$$\ge\; 1 - 2^{2\alpha} + \frac{2^\alpha(2+2^{1/2})}{d} - \frac{(1+2^{-1/2})^2}{d} \qquad (66)$$
$$=\; 1 - \frac{d}{d-1/2} + \frac{\sqrt{d}\,(2+2^{1/2})}{d\sqrt{d-1/2}} - \frac{(1+2^{-1/2})^2}{d} \;\ge\; 0, \qquad \forall d\ge 1,\ \forall N\ge 2, \qquad (67)$$
where (66) is by $N\mapsto(N-1)^{2\alpha} - N^{2\alpha} + d^{-1}N^\alpha(2+2^{1/2})$, for $N\ge 2$, monotonically increasing in $N$ (owing to $\alpha = \log_2\!\big(\sqrt{d/(d-1/2)}\big) > 0$, for $d\ge 1$), and the equality in (67) is thanks to $\alpha = \log_2\!\big(\sqrt{d/(d-1/2)}\big)$. The inequality in (67) is established in Lemma 4 below. This completes the proof.
It remains to show (67), which is accomplished through the following lemma.
Lemma 4. For every $d\ge 1$ it holds that
$$1 - \frac{d}{d-1/2} + \frac{\sqrt{d}\,(2+2^{1/2})}{d\sqrt{d-1/2}} - \frac{(1+2^{-1/2})^2}{d} \;\ge\; 0.$$
Proof. We start by multiplying the inequality by $d(d-1/2)$, which (after rearranging terms) yields
$$\sqrt{d(d-1/2)}\,\alpha \;\ge\; (d-1/2)\,\beta + d/2, \qquad d\ge 1, \qquad (68)$$
where $\alpha := (2+2^{1/2})$ and $\beta := (1+2^{-1/2})^2$. Squaring (68) yields (again, after rearranging terms)
$$d^2\underbrace{\Big(\alpha^2-\beta^2-\beta-\frac{1}{4}\Big)}_{=\,0} + d\underbrace{\Big(-\frac{\alpha^2}{2}+\beta^2+\frac{\beta}{2}\Big)}_{\ge\,4} - \underbrace{\frac{\beta^2}{4}}_{\le\,3} \;\ge\; 0, \qquad d\ge 1,$$
which completes the proof.
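Lemmas 3 and 4 lend themselves to a quick numerical sanity check. The following sketch (with arbitrary sampling choices of our own, not a substitute for the proof) verifies that κ_N is nonnegative on randomly sampled points of the relevant region, and that the Lemma 4 expression is nonnegative over a range of dimensions:

```python
import math
import numpy as np

# Spot-check of Lemma 3: kappa_N(w) = |w|^2 ((N-1)/N)^{2 alpha} - |w - nu|^2
# is nonnegative on the first orthant H \ B_delta(0) for |w| <= N^alpha delta,
# and of the Lemma 4 inequality for d = 1, ..., 199.

def lemma4_lhs(d):
    return (1.0 - d / (d - 0.5)
            + math.sqrt(d) * (2.0 + 2.0 ** 0.5) / (d * math.sqrt(d - 0.5))
            - (1.0 + 2.0 ** -0.5) ** 2 / d)

assert all(lemma4_lhs(d) >= -1e-12 for d in range(1, 200))

rng = np.random.default_rng(0)
d, delta = 2, 1.0
alpha = math.log2(math.sqrt(d / (d - 0.5)))
nu = np.full(d, (1.0 + 2.0 ** -0.5) * delta / d)   # nu_k = (1 + 2^{-1/2}) delta / d
for N in range(2, 12):
    # random points in H \ B_delta(0) with delta <= |w| <= N^alpha delta
    r = rng.uniform(delta, N ** alpha * delta, size=2000)
    v = np.abs(rng.normal(size=(2000, d)))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    w = v * r[:, None]
    kappa = (np.sum(w ** 2, axis=1) * ((N - 1) / N) ** (2 * alpha)
             - np.sum((w - nu) ** 2, axis=1))
    assert kappa.min() >= -1e-9                    # (63) holds on all samples
print("kappa_N >= 0 on all sampled points; Lemma 4 holds for d = 1..199")
```

Note that for N = 2 and ω on a coordinate axis with |ω| = δ the bound is tight (κ_N = 0), consistent with equality in (65).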
We proceed to sharpen the exponent $\alpha = \log_2\!\big(\sqrt{d/(d-1/2)}\big)$ to $\alpha = 1$ for $d = 1$. The structure of the corresponding proof is similar to that of the proof for $d\ge 1$ with $\alpha = \log_2\!\big(\sqrt{d/(d-1/2)}\big)$. Specifically, we start by employing the arguments leading to (50) with $N^\alpha$ replaced by $N$. With this replacement, $h_{n,N,\alpha,\delta}$ in (51) becomes $h_{n,N,\alpha,\delta}(\omega) := \sum_{\lambda_n\in\Lambda_n}|\widehat{g_{\lambda_n}}(\omega)|^2\big(1-\widehat{r_l}\big(\frac{\omega-\nu_{\lambda_n}}{(N-1)\delta}\big)^2\big)$, where, again, appropriate choice of the modulation factors $\{\nu_{\lambda_n}\}_{\lambda_n\in\Lambda_n}\subseteq\mathbb{R}^d$ will be key to establishing the inductive step. We start by defining $\Lambda_n^+$ to be the set of indices $\lambda_n\in\Lambda_n$ such that $\mathrm{supp}(\widehat{g_{\lambda_n}})\subseteq[\delta,\infty)$, and take $\Lambda_n^-$ to be the set of indices $\lambda_n\in\Lambda_n$ such that $\mathrm{supp}(\widehat{g_{\lambda_n}})\subseteq(-\infty,-\delta]$ (see Assumption 1). Clearly, $\Lambda_n = \Lambda_n^+\cup\Lambda_n^-$. Moreover, we define the modulation factors according to $\nu_{\lambda_n} := \delta$, for all $\lambda_n\in\Lambda_n^+$, and $\nu_{\lambda_n} := -\delta$, for all $\lambda_n\in\Lambda_n^-$. We then get
$$h_{n,N,\alpha,\delta}(\omega) = \sum_{\lambda_n\in\Lambda_n}|\widehat{g_{\lambda_n}}(\omega)|^2\Big(1-\widehat{r_l}\Big(\frac{\omega-\nu_{\lambda_n}}{(N-1)\delta}\Big)^2\Big)$$
$$=\; \sum_{\lambda_n\in\Lambda_n^+}|\widehat{g_{\lambda_n}}(\omega)|^2\,1_{[\delta,\infty)}(\omega)\Big(1-\widehat{r_l}\Big(\frac{\omega-\delta}{(N-1)\delta}\Big)^2\Big) \qquad (69)$$
$$+\; \sum_{\lambda_n\in\Lambda_n^-}|\widehat{g_{\lambda_n}}(\omega)|^2\,1_{(-\infty,-\delta]}(\omega)\Big(1-\widehat{r_l}\Big(\frac{\omega+\delta}{(N-1)\delta}\Big)^2\Big) \qquad (70)$$
$$\le\; \max\{1, B_n\}\,1_{[\delta,\infty)}(\omega)\Big(1-\widehat{r_l}\Big(\frac{\omega-\delta}{(N-1)\delta}\Big)^2\Big) \qquad (71)$$
$$+\; \max\{1, B_n\}\,1_{(-\infty,-\delta]}(\omega)\Big(1-\widehat{r_l}\Big(\frac{\omega+\delta}{(N-1)\delta}\Big)^2\Big), \qquad (72)$$
where (69) and (70) are thanks to Assumption 1, and for the last step we employed (18). For the set $[\delta,\infty)$, we show in Lemma 5 below that
$$\widehat{r_l}\Big(\frac{\omega-\delta}{(N-1)\delta}\Big) \;\ge\; \widehat{r_l}\Big(\frac{\omega}{N\delta}\Big), \qquad \forall\omega\in[\delta,\infty),\ \forall N\ge 2. \qquad (73)$$
This will allow us to deduce
$$\widehat{r_l}\Big(\frac{\omega+\delta}{(N-1)\delta}\Big) \;\ge\; \widehat{r_l}\Big(\frac{\omega}{N\delta}\Big), \qquad \forall\omega\in(-\infty,-\delta],\ \forall N\ge 2, \qquad (74)$$
simply by noting that
$$\widehat{r_l}\Big(\frac{\omega+\delta}{(N-1)\delta}\Big) = \Big(1-\frac{|\omega+\delta|}{(N-1)\delta}\Big)_+^l = \Big(1-\frac{-\omega-\delta}{(N-1)\delta}\Big)_+^l = \widehat{r_l}\Big(\frac{-\omega-\delta}{(N-1)\delta}\Big)$$
$$\ge\; \widehat{r_l}\Big(\frac{-\omega}{N\delta}\Big) = \Big(1-\frac{-\omega}{N\delta}\Big)_+^l = \widehat{r_l}\Big(\frac{\omega}{N\delta}\Big), \qquad (75)$$
for $\omega\in(-\infty,-\delta]$. Here, the inequality in (75) is due to (73). Insertion of (73) into (71) and of (74)
into (72) then yields
$$h_{n,N,\alpha,\delta}(\omega) \;\le\; \max\{1, B_n\}\,1_{(-\infty,-\delta]\cup[\delta,\infty)}(\omega)\Big(1-\widehat{r_l}\Big(\frac{\omega}{N\delta}\Big)^2\Big) \;\le\; \max\{1, B_n\}\Big(1-\widehat{r_l}\Big(\frac{\omega}{N\delta}\Big)^2\Big),$$
for $\omega\in\mathbb{R}$, where the last inequality is thanks to $0\le\widehat{r_l}(\omega)\le 1$, for $\omega\in\mathbb{R}$. This establishes (20)—in the 1-D case—for $\alpha = 1$ and completes the proof of statement i) in Theorem 1.
It remains to prove (73), which is done through the following lemma.
Lemma 5. Let $\widehat{r_l}:\mathbb{R}\to\mathbb{R}$, $\widehat{r_l}(\omega) := (1-|\omega|)_+^l$, with $l > 1$. Then,
$$\widehat{r_l}\Big(\frac{\omega-\delta}{(N-1)\delta}\Big) \;\ge\; \widehat{r_l}\Big(\frac{\omega}{N\delta}\Big), \qquad \forall\omega\in[\delta,\infty),\ \forall N\ge 2. \qquad (76)$$
Proof. We first note that for $\omega > N\delta$, (76) is trivially satisfied as the RHS of (76) equals zero (owing to $\frac{\omega}{N\delta} > 1$ together with $\mathrm{supp}(\widehat{r_l})\subseteq B_1(0)$). It hence suffices to prove (76) for $\delta\le\omega\le N\delta$. The key idea of the proof is to employ a monotonicity argument. Specifically, thanks to $\widehat{r_l}$ monotonically decreasing in $|\omega|$, i.e., $\widehat{r_l}(\omega_1)\ge\widehat{r_l}(\omega_2)$, for $\omega_1,\omega_2\in\mathbb{R}$ with $|\omega_2|\ge|\omega_1|$, (76) can be established simply by showing that
$$\Big|\frac{\omega-\delta}{(N-1)\delta}\Big| \;\le\; \Big|\frac{\omega}{N\delta}\Big|, \qquad \forall\omega\in[\delta, N\delta],\ \forall N\ge 2,$$
which, by $\omega\in[\delta, N\delta]$, is equivalent to
$$\frac{\omega-\delta}{(N-1)\delta} \;\le\; \frac{\omega}{N\delta}, \qquad \forall\omega\in[\delta, N\delta],\ \forall N\ge 2. \qquad (77)$$
Rearranging terms in (77), we get
$$\omega \;\le\; N\delta, \qquad \forall\omega\in[\delta, N\delta],\ \forall N\ge 2,$$
which completes the proof.
Remark 1. What makes the improved exponent $\alpha$ possible in the 1-D case is the absence of rotated orthants. Specifically, for $d = 1$, the filters $\{g_{\lambda_n}\}_{\lambda_n\in\Lambda_n}$ satisfy either $\mathrm{supp}(\widehat{g_{\lambda_n}})\subseteq(-\infty,-\delta]$ or $\mathrm{supp}(\widehat{g_{\lambda_n}})\subseteq[\delta,\infty)$, i.e., the support sets $\mathrm{supp}(\widehat{g_{\lambda_n}})$ are located in one of the two half-spaces.
APPENDIX D
PROOF OF STATEMENT ii) IN THEOREM 1
We need to show that there exist $C > 0$ (that is independent of $N$) and $N_0\in\mathbb{N}$ such that
$$W_N(f) \;\le\; C\,B_\Omega^N\,N^{-\frac{\alpha(2s+\beta)}{2s+\beta+1}}, \qquad \forall N\ge N_0,$$
for every $f\in H^s(\mathbb{R}^d)$, where $\beta > 0$ depends on $f$. Let us start by noting that for $f\in H^s(\mathbb{R}^d)$, by [41, Sec. 6.2.1] there exist $\beta,\mu,L > 0$ (all depending on $f$) such that
$$|\widehat{f}(\omega)| \;\le\; \frac{\mu}{(1+|\omega|^2)^{\frac{s}{2}+\frac{d}{4}+\frac{\beta}{4}}}, \qquad \forall\,|\omega|\ge L. \qquad (78)$$
The key idea of the proof is to split the integral on the RHS of (20), i.e.,
$$\int_{\mathbb{R}^d}\underbrace{|\widehat{f}(\omega)|^2\Big(1-\widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big)^2\Big)}_{=:\,z(\omega)}\,d\omega,$$
into integrals over the sets $B_{R_N}(0)$ and $\mathbb{R}^d\setminus B_{R_N}(0)$, where $R_N := N^{\frac{\alpha}{2s+\beta+1}}$, for $N\in\mathbb{N}$, and to establish that
$$\int_{B_{R_N}(0)} z(\omega)\,d\omega \;\le\; 2\,l\,\delta^{-1}\|f\|_2^2\,N^{-\frac{\alpha(2s+\beta)}{2s+\beta+1}}, \qquad \forall N\ge 1, \qquad (79)$$
and
$$\int_{\mathbb{R}^d\setminus B_{R_N}(0)} z(\omega)\,d\omega \;\le\; \frac{\mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))}{2s+\beta}\,N^{-\frac{\alpha(2s+\beta)}{2s+\beta+1}}, \qquad (80)$$
for $N\ge N_0 := \big\lceil L^{\frac{2s+\beta+1}{\alpha}}\big\rceil$. This then implies $\int_{\mathbb{R}^d} z(\omega)\,d\omega \le C\,N^{-\frac{\alpha(2s+\beta)}{2s+\beta+1}}$, for $N\ge\max\{1, N_0\} = N_0$, where
$$C := 2\max\big\{2\,l\,\delta^{-1}\|f\|_2^2,\ \mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))(2s+\beta)^{-1}\big\},$$
and thereby completes the proof. It remains to establish (79) and (80). We start with (79) and note that $(1-2l|\omega|)\le(1-|\omega|)_+^{2l}$, for $\omega\in\mathbb{R}^d$, where $l > \lfloor d/2\rfloor + 1$, see Figure 8. This implies
$$1-\widehat{r_l}\Big(\frac{\omega}{N^\alpha\delta}\Big)^2 = 1-\Big(1-\frac{|\omega|}{N^\alpha\delta}\Big)_+^{2l} \;\le\; \frac{2l}{N^\alpha\delta}\,|\omega|, \qquad \forall\omega\in\mathbb{R}^d,$$
Fig. 8: Illustration in dimension d = 1. The functions g1(ω) := (1 − 2l|ω|) (dashed line) and g2(ω) := (1 − |ω|)₊^{2l} (solid line) satisfy g1(ω) ≤ g2(ω), for ω ∈ R. Note that l > ⌊d/2⌋ + 1.
which when employed in $\int_{B_{R_N}(0)} z(\omega)\,d\omega$ yields (79) thanks to
$$\int_{B_{R_N}(0)} z(\omega)\,d\omega \;\le\; \frac{2l}{N^\alpha\delta}\int_{B_{R_N}(0)}|\widehat{f}(\omega)|^2\underbrace{|\omega|}_{\le R_N}\,d\omega \;\le\; \frac{2l\,R_N}{N^\alpha\delta}\,\|f\|_2^2 \;=\; 2\,l\,\delta^{-1}\|f\|_2^2\,N^{-\frac{\alpha(2s+\beta)}{2s+\beta+1}},$$
for $N\ge 1$, where we used Parseval's formula and $R_N = N^{\frac{\alpha}{2s+\beta+1}}$. Next, we establish (80) and start by noting that $1-\widehat{r_l}\big(\frac{\omega}{N^\alpha\delta}\big)^2\le 1$, for $\omega\in\mathbb{R}^d$, which is by $0\le\widehat{r_l}(\omega)\le 1$, for $\omega\in\mathbb{R}^d$. Moreover, we have
$$R_N = N^{\frac{\alpha}{2s+\beta+1}} \;\ge\; N_0^{\frac{\alpha}{2s+\beta+1}} = \Big(L^{\frac{2s+\beta+1}{\alpha}}\Big)^{\frac{\alpha}{2s+\beta+1}} \;\ge\; L, \qquad \forall N\ge N_0,$$
which by (78) implies
$$|\widehat{f}(\omega)| \;\le\; \frac{\mu}{(1+|\omega|^2)^{\frac{s}{2}+\frac{d}{4}+\frac{\beta}{4}}}, \qquad \forall\,|\omega|\ge R_N. \qquad (81)$$
Employing (81) in $\int_{\mathbb{R}^d\setminus B_{R_N}(0)} z(\omega)\,d\omega$ now yields
$$\int_{\mathbb{R}^d\setminus B_{R_N}(0)} z(\omega)\,d\omega \;\le\; \int_{\mathbb{R}^d\setminus B_{R_N}(0)}\frac{\mu^2}{(1+|\omega|^2)^{s+\frac{d}{2}+\frac{\beta}{2}}}\,d\omega \;=\; \mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))\int_{R_N}^{\infty}\frac{r^{d-1}}{(1+r^2)^{s+\frac{d}{2}+\frac{\beta}{2}}}\,dr, \qquad (82)$$
where in (82) we introduced polar coordinates. Moreover, $(1+r^2)^{s+\frac{d}{2}+\frac{\beta}{2}}\ge r^{2s+d+\beta}$, for $r\ge 0$, which when employed in (82) yields (80) owing to
$$\int_{\mathbb{R}^d\setminus B_{R_N}(0)} z(\omega)\,d\omega \;\le\; \mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))\int_{R_N}^{\infty} r^{-2s-\beta-1}\,dr \;=\; \frac{\mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))}{2s+\beta}\,R_N^{-2s-\beta} \;=\; \frac{\mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))}{2s+\beta}\,N^{-\frac{\alpha(2s+\beta)}{2s+\beta+1}},$$
where in the last step we used $R_N = N^{\frac{\alpha}{2s+\beta+1}}$. This completes the proof.
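The two elementary estimates invoked in this appendix can be verified on a grid. The sketch below (with arbitrarily chosen parameter values of our own) checks the Figure 8 bound and the polar-coordinate denominator bound:

```python
import numpy as np

# Spot-check of the two elementary bounds used in Appendix D:
#   (i)  1 - 2l|w| <= (1 - |w|)_+^{2l}  (Figure 8), equivalently
#        1 - r_hat_l(w)^2 <= 2l |w|  after rescaling by N^alpha delta,
#   (ii) (1 + r^2)^{s + d/2 + beta/2} >= r^{2s + d + beta}  for r >= 0.
# Parameter values below are arbitrary illustration choices.

l, s, d, beta = 2.0, 1.0, 1.0, 0.5

w = np.linspace(0.0, 3.0, 10001)
lhs = 1.0 - np.maximum(1.0 - w, 0.0) ** (2 * l)   # 1 - r_hat_l(w)^2
assert np.all(lhs <= 2 * l * w + 1e-12)           # bound (i)

r = np.linspace(0.0, 50.0, 10001)
assert np.all((1.0 + r ** 2) ** (s + d / 2 + beta / 2)
              >= r ** (2 * s + d + beta))          # bound (ii), since 1 + r^2 >= r^2
print("Appendix D bounds verified on a grid")
```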
APPENDIX E
PROPOSITION 1
Proposition 1. Let $\Omega$ be the module-sequence (1). Then,
$$A_\Omega^N\|f\|_2^2 \;\le\; \sum_{n=0}^{N-1}|||\Phi_\Omega^n(f)|||^2 + W_N(f) \;\le\; B_\Omega^N\|f\|_2^2, \qquad \forall f\in L^2(\mathbb{R}^d),\ \forall N\ge 1, \qquad (83)$$
where
$$A_\Omega^N = \prod_{k=1}^{N}\min\{1, A_k\}, \qquad B_\Omega^N = \prod_{k=1}^{N}\max\{1, B_k\}.$$
Proof. We proceed by induction across $N$ and start with the base case $N = 1$, which follows directly from the frame property (2) according to
$$A_\Omega^1\|f\|_2^2 = \min\{1, A_1\}\|f\|_2^2 \le A_1\|f\|_2^2 \le \underbrace{\|f*\chi_1\|_2^2 + \sum_{\lambda_1\in\Lambda_1}\|f*g_{\lambda_1}\|_2^2}_{=\,|||\Phi_\Omega^0(f)|||^2 + W_1(f)} \le B_1\|f\|_2^2 \le \max\{1, B_1\}\|f\|_2^2 = B_\Omega^1\|f\|_2^2, \qquad \forall f\in L^2(\mathbb{R}^d).$$
The inductive step is obtained as follows. Let $N > 1$ and suppose that (83) holds for $N-1$, i.e.,
$$A_\Omega^{N-1}\|f\|_2^2 \;\le\; \sum_{n=0}^{N-2}|||\Phi_\Omega^n(f)|||^2 + W_{N-1}(f) \;\le\; B_\Omega^{N-1}\|f\|_2^2, \qquad \forall f\in L^2(\mathbb{R}^d). \qquad (84)$$
We start by noting that
$$\sum_{n=0}^{N-1}|||\Phi_\Omega^n(f)|||^2 + W_N(f) = \sum_{n=0}^{N-2}|||\Phi_\Omega^n(f)|||^2 + \sum_{q\,\in\,\Lambda^{N-1}}\|(U[q]f)*\chi_N\|_2^2 + \sum_{q\,\in\,\Lambda^{N}}\|U[q]f\|_2^2, \qquad (85)$$
and proceed by examining the third term on the RHS of (85). Every path
$$\tilde q\in\Lambda^N = \underbrace{\Lambda_1\times\cdots\times\Lambda_{N-1}}_{=\,\Lambda^{N-1}}\times\,\Lambda_N$$
of length $N$ can be decomposed into a path $q\in\Lambda^{N-1}$ of length $N-1$ and an index $\lambda_N\in\Lambda_N$ according to $\tilde q = (q,\lambda_N)$. Thanks to (4) we have $U[\tilde q] = U[(q,\lambda_N)] = U_N[\lambda_N]\,U[q]$, which yields
$$\sum_{q\,\in\,\Lambda^N}\|U[q]f\|_2^2 = \sum_{q\,\in\,\Lambda^{N-1}}\sum_{\lambda_N\in\Lambda_N}\|(U[q]f)*g_{\lambda_N}\|_2^2. \qquad (86)$$
Substituting the third term on the RHS of (85) by (86) and rearranging terms, we obtain
$$\sum_{n=0}^{N-1}|||\Phi_\Omega^n(f)|||^2 + W_N(f) = \sum_{n=0}^{N-2}|||\Phi_\Omega^n(f)|||^2 + \sum_{q\,\in\,\Lambda^{N-1}}\underbrace{\Big(\|(U[q]f)*\chi_N\|_2^2 + \sum_{\lambda_N\in\Lambda_N}\|(U[q]f)*g_{\lambda_N}\|_2^2\Big)}_{=:\,\rho_N(U[q]f)}.$$
Thanks to the frame property (2) and $U[q]f\in L^2(\mathbb{R}^d)$, which is by [10, Eq. 16], we have
$$A_N\|U[q]f\|_2^2 \;\le\; \rho_N(U[q]f) \;\le\; B_N\|U[q]f\|_2^2,$$
and thus
$$\min\{1, A_N\}\Big(\sum_{n=0}^{N-2}|||\Phi_\Omega^n(f)|||^2 + W_{N-1}(f)\Big) \qquad (87)$$
$$\le\; \sum_{n=0}^{N-1}|||\Phi_\Omega^n(f)|||^2 + W_N(f) \;\le\; \max\{1, B_N\}\Big(\sum_{n=0}^{N-2}|||\Phi_\Omega^n(f)|||^2 + W_{N-1}(f)\Big), \qquad (88)$$
where we employed the identity $\sum_{q\,\in\,\Lambda^{N-1}}\|U[q]f\|_2^2 = W_{N-1}(f)$. Invoking the induction hypothesis (84) in (87) and (88) and noting that
$$A_\Omega^N = \min\{1, A_N\}\,A_\Omega^{N-1}, \qquad B_\Omega^N = \max\{1, B_N\}\,B_\Omega^{N-1},$$
completes the proof.
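The frame-property step in the base case can be illustrated on a discrete periodic grid (a sketch of our own; the DFT setting replaces L²(ℝ^d), and the Gaussian low-pass mask is an arbitrary choice): if |χ̂|² + |ĝ|² = 1 pointwise, the one-layer energy decomposition holds with A¹_Ω = B¹_Ω = 1.

```python
import numpy as np

# Discrete illustration of the N = 1 case of Proposition 1: a Parseval
# filter pair {chi, g} with |chi_hat|^2 + |g_hat|^2 = 1 pointwise
# satisfies ||f||^2 = ||f * chi||^2 + ||f * g||^2 exactly.

rng = np.random.default_rng(1)
n = 256
f = rng.normal(size=n) + 1j * rng.normal(size=n)

freqs = np.fft.fftfreq(n)                      # frequencies in [-1/2, 1/2)
chi_hat = np.exp(-(freqs / 0.1) ** 2 / 2)      # arbitrary Gaussian low-pass mask
g_hat = np.sqrt(1.0 - chi_hat ** 2)            # high-pass complement

f_hat = np.fft.fft(f)
low = np.fft.ifft(f_hat * chi_hat)             # f * chi
high = np.fft.ifft(f_hat * g_hat)              # f * g

energy = lambda x: np.sum(np.abs(x) ** 2)
assert np.isclose(energy(f), energy(low) + energy(high))
print(energy(f), energy(low) + energy(high))
```

The same identity, telescoped across layers, is exactly what the inductive step of Proposition 1 exploits.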
APPENDIX F
PROPOSITION 2

Proposition 2. Let $\hat r_l : \mathbb{R}^d \to \mathbb{R}$, $\hat r_l(\omega) := (1 - |\omega|)_+^l$, with $l > \lfloor d/2\rfloor + 1$, and let $\alpha := \log_2\big(\sqrt{d/(d - 1/2)}\big)$. Then, we have
$$\lim_{N\to\infty} \int_{\mathbb{R}^d} |\hat f(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega}{N^\alpha\delta}\Big)^2\Big)\,d\omega = 0, \quad \forall\,f \in L^2(\mathbb{R}^d). \tag{89}$$

Proof. We start by setting
$$d_{N,\alpha,\delta}(\omega) := 1 - \hat r_l\Big(\frac{\omega}{N^\alpha\delta}\Big)^2, \quad \omega\in\mathbb{R}^d,\ N\in\mathbb{N}.$$
Let $f \in L^2(\mathbb{R}^d)$. For every ε > 0 there exists R > 0 such that
$$\int_{\mathbb{R}^d\setminus B_R(0)} |\hat f(\omega)|^2\,d\omega \le \varepsilon/2,$$
where $B_R(0)$ denotes the closed ball of radius R centered at the origin. Next, we employ Dini's theorem [51, Theorem 7.3] to show that $(d_{N,\alpha,\delta})_{N\in\mathbb{N}}$ converges to zero uniformly on $B_R(0)$. To this end, we note that (i) $d_{N,\alpha,\delta}$ is continuous as a composition of continuous functions, (ii) $z_0(\omega) = 0$, for $\omega\in\mathbb{R}^d$, is, clearly, continuous, (iii) $d_{N,\alpha,\delta}(\omega) \ge d_{N+1,\alpha,\delta}(\omega)$, for $\omega\in\mathbb{R}^d$ and $N\in\mathbb{N}$, and (iv) $d_{N,\alpha,\delta}$ converges to $z_0$ pointwise on $B_R(0)$, i.e., $\lim_{N\to\infty} d_{N,\alpha,\delta}(\omega) = z_0(\omega) = 0$, for $\omega\in\mathbb{R}^d$. This allows us to conclude that there exists $N_0\in\mathbb{N}$ (that depends on ε) such that $d_{N,\alpha,\delta}(\omega) \le \frac{\varepsilon}{2\|f\|_2^2}$, for $\omega\in B_R(0)$ and $N\ge N_0$, and we therefore get
$$\int_{\mathbb{R}^d} |\hat f(\omega)|^2\, d_{N,\alpha,\delta}(\omega)\,d\omega = \int_{\mathbb{R}^d\setminus B_R(0)} |\hat f(\omega)|^2 \underbrace{d_{N,\alpha,\delta}(\omega)}_{\le 1}\,d\omega + \int_{B_R(0)} |\hat f(\omega)|^2 \underbrace{d_{N,\alpha,\delta}(\omega)}_{\le\,\varepsilon/(2\|f\|_2^2)}\,d\omega \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2\|f\|_2^2}\|\hat f\|_2^2 = \varepsilon,$$
where in the last step we employed Parseval's formula. Since ε > 0 was arbitrary, we have (89), which completes the proof.
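Proposition 2 can be illustrated numerically in one dimension (d = 1, so $\alpha = 1/2$ and $l = 2 > \lfloor 1/2\rfloor + 1$ is admissible): for a Gaussian spectrum, the residual energy outside the dilating bump $\hat r_l$ decreases monotonically in N, in line with condition (iii) of the Dini argument, and becomes negligible for large N. A sketch with hypothetical choices of spectrum and δ:

```python
import numpy as np

# One-dimensional illustration of Proposition 2 (d = 1, so alpha = 1/2 and
# l = 2 > floor(1/2) + 1 is admissible); the Gaussian spectrum |f_hat|^2 and
# the value of delta are hypothetical choices.
l, alpha, delta = 2, 0.5, 1.0
omega = np.linspace(-50, 50, 200001)
d_omega = omega[1] - omega[0]
f_hat2 = np.exp(-omega**2)

def residual(N):
    # integral of |f_hat|^2 * (1 - r_l_hat(omega / (N^alpha * delta))^2)
    r = np.clip(1 - np.abs(omega) / (N**alpha * delta), 0, None) ** l
    return float(np.sum(f_hat2 * (1 - r**2)) * d_omega)

total = float(np.sum(f_hat2) * d_omega)
vals = [residual(N) for N in (1, 100, 10000)]
assert vals[0] > vals[1] > vals[2]        # monotone decay in N
assert vals[2] < 0.05 * total             # residual energy vanishes as N grows
```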
APPENDIX G
PROOF OF THEOREM 2

We start by establishing (26) in statement i). The structure of the proof is similar to that of the proof of statement i) in Theorem 1; specifically, we perform induction over N. Starting with the base case N = 1, we first note that $\mathrm{supp}(\hat\psi) \subseteq [1/2, 2]$, $\hat g_j(\omega) = \hat\psi(2^{-j}\omega)$, for $j \ge 1$, and $\hat g_j(\omega) = \hat\psi(-2^{-|j|}\omega)$, for $j \le -1$, all by assumption, imply
$$\mathrm{supp}(\hat g_j) = \mathrm{supp}(\hat\psi(2^{-j}\,\cdot)) \subseteq [2^{j-1}, 2^{j+1}], \quad j \ge 1, \tag{90}$$
and
$$\mathrm{supp}(\hat g_j) = \mathrm{supp}(\hat\psi(-2^{-|j|}\,\cdot)) \subseteq [-2^{|j|+1}, -2^{|j|-1}], \quad j \le -1. \tag{91}$$
We then get
$$W_1(f) = \sum_{j\in\mathbb{Z}\setminus\{0\}} \|f*g_j\|_2^2 = \int_{\mathbb{R}} \sum_{j\in\mathbb{Z}\setminus\{0\}} |\hat g_j(\omega)|^2|\hat f(\omega)|^2\,d\omega \tag{92}$$
$$= \int_{\mathbb{R}} \sum_{j\ge 1} |\hat\psi(2^{-j}\omega)|^2|\hat f(\omega)|^2\,d\omega + \int_{\mathbb{R}} \sum_{j\le -1} |\hat\psi(-2^{-|j|}\omega)|^2|\hat f(\omega)|^2\,d\omega$$
$$= \int_{1}^{\infty} \sum_{j\ge 1} |\hat\psi(2^{-j}\omega)|^2|\hat f(\omega)|^2\,d\omega + \int_{-\infty}^{-1} \sum_{j\le -1} |\hat\psi(-2^{-|j|}\omega)|^2|\hat f(\omega)|^2\,d\omega \tag{93}$$
$$\le \int_{\mathbb{R}\setminus[-1,1]} |\hat f(\omega)|^2\,d\omega \le \int_{\mathbb{R}} |\hat f(\omega)|^2(1-|\hat r_l(\omega)|^2)\,d\omega, \tag{94}$$
where (92) is by Parseval's formula, and (93) is thanks to (90) and (91). The first inequality in (94) is owing to (25), and the second inequality is due to $\mathrm{supp}(\hat r_l) \subseteq [-1,1]$ and $0 \le \hat r_l(\omega) \le 1$, for $\omega\in\mathbb{R}$.
The inductive step is obtained as follows. Let N > 1 and suppose that (26) holds for N − 1, i.e.,
$$W_{N-1}(f) \le \int_{\mathbb{R}} |\hat f(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega}{(5/3)^{N-2}}\Big)^2\Big)\,d\omega, \quad \forall\,f \in L^2(\mathbb{R}). \tag{95}$$
We start by noting that every path $\tilde q \in (\mathbb{Z}\setminus\{0\})^N$ of length N can be decomposed into a path $q \in (\mathbb{Z}\setminus\{0\})^{N-1}$ of length N − 1 and an index $j \in \mathbb{Z}\setminus\{0\}$ according to $\tilde q = (j, q)$. Thanks to (4) we have $U[\tilde q] = U[(j, q)] = U[q]U_1[j]$, which yields
$$W_N(f) = \sum_{j\in\mathbb{Z}\setminus\{0\}} \sum_{q\in(\mathbb{Z}\setminus\{0\})^{N-1}} \|U[q](U_1[j]f)\|_2^2 = \sum_{j\in\mathbb{Z}\setminus\{0\}} W_{N-1}(U_1[j]f). \tag{96}$$
We proceed by examining the term $W_{N-1}(U_1[j]f)$ inside the sum on the RHS of (96). Invoking the induction hypothesis (95) and employing Parseval's formula, we get
$$W_{N-1}(U_1[j]f) \le \int_{\mathbb{R}} |\widehat{U_1[j]f}(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega}{(5/3)^{N-2}}\Big)^2\Big)\,d\omega = \|U_1[j]f\|_2^2 - \|(U_1[j]f) * r_{l,N-2}\|_2^2$$
$$= \|f * g_j\|_2^2 - \|\,|f * g_j| * r_{l,N-2}\,\|_2^2, \tag{97}$$
where $r_{l,N-2}$ is the inverse Fourier transform of $\hat r_l\big(\frac{\omega}{(5/3)^{N-2}}\big)$. Next, we note that $\hat r_l\big(\frac{\omega}{(5/3)^{N-2}}\big)$ is a positive definite radial basis function [49, Theorem 6.20] and hence by [49, Theorem 6.18] $r_{l,N-2}(x) \ge 0$, for $x \in \mathbb{R}$. Furthermore, it follows from Lemma 2 in Appendix C that
$$\|\,|f * g_j| * r_{l,N-2}\,\|_2^2 \ge \|f * g_j * (M_{\nu_j} r_{l,N-2})\|_2^2, \tag{98}$$
for all $\{\nu_j\}_{j\in\mathbb{Z}\setminus\{0\}} \subseteq \mathbb{R}$. Choosing the modulation factors $\{\nu_j\}_{j\in\mathbb{Z}\setminus\{0\}} \subseteq \mathbb{R}$ appropriately (see (102) below) will be key to establishing the inductive step. Using (97) and (98) to upper-bound the term $W_{N-1}(U_1[j]f)$ inside the sum on the RHS of (96) yields
$$W_N(f) \le \sum_{j\in\mathbb{Z}\setminus\{0\}} \Big(\|f * g_j\|_2^2 - \|f * g_j * (M_{\nu_j} r_{l,N-2})\|_2^2\Big) = \int_{\mathbb{R}} |\hat f(\omega)|^2 h_{l,N-2}(\omega)\,d\omega, \tag{99}$$
where
$$h_{l,N-2}(\omega) := \sum_{j\in\mathbb{Z}\setminus\{0\}} |\hat g_j(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega - \nu_j}{(5/3)^{N-2}}\Big)^2\Big). \tag{100}$$
In (99) we employed Parseval's formula together with $\widehat{M_\omega f} = T_\omega \hat f$, for $f \in L^2(\mathbb{R})$ and $\omega\in\mathbb{R}$. The key step is now to establish—for appropriately chosen $\{\nu_j\}_{j\in\mathbb{Z}\setminus\{0\}} \subseteq \mathbb{R}$—the upper bound
$$h_{l,N-2}(\omega) \le 1 - \hat r_l\Big(\frac{\omega}{(5/3)^{N-1}}\Big)^2, \quad \forall\,\omega\in\mathbb{R}, \tag{101}$$
which then yields (26) and thereby completes the proof. To this end, we set $\eta := 4/5$,
$$\nu_j := 2^j\eta, \quad j \ge 1, \qquad \nu_j := -2^{|j|}\eta, \quad j \le -1, \tag{102}$$
and note that it suffices to prove (101) for ω ≥ 0, as
$$h_{l,N-2}(-\omega) = \sum_{j\in\mathbb{Z}\setminus\{0\}} |\hat g_j(-\omega)|^2\Big(1 - \hat r_l\Big(\frac{-\omega - \nu_j}{(5/3)^{N-2}}\Big)^2\Big)$$
$$= \sum_{j\le -1} |\hat g_j(-\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega + \nu_j}{(5/3)^{N-2}}\Big)^2\Big) \tag{103}$$
$$= \sum_{j\ge 1} |\hat g_{-j}(-\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega + \nu_{-j}}{(5/3)^{N-2}}\Big)^2\Big) \tag{104}$$
$$= \sum_{j\ge 1} |\hat g_j(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega - \nu_j}{(5/3)^{N-2}}\Big)^2\Big) = h_{l,N-2}(\omega), \quad \forall\,\omega \ge 0. \tag{105}$$
Here, (103) is thanks to $\hat g_j(-\omega) = 0$, for $j \ge 1$ and $\omega \ge 0$, which is by (90), and (105) is owing to $\hat g_j(\omega) = 0$, for $j \le -1$ and $\omega \ge 0$, which is by (91). Moreover, in (103) we used that $\hat r_l$ satisfies $\hat r_l(-\omega) = \hat r_l(\omega)$, for $\omega\in\mathbb{R}$, and (104) is thanks to
$$\hat g_{-j}(-\omega) = \hat\psi(2^{-|-j|}\omega) = \hat\psi(2^{-j}\omega) = \hat g_j(\omega), \quad \forall\,\omega\in\mathbb{R},\ \forall\,j \ge 1,$$
as well as $\nu_{-j} = -2^j\eta = -\nu_j$, for $j \ge 1$. Now, let ω ∈ [0, 1], and note that
$$h_{l,N-2}(\omega) = \sum_{j\in\mathbb{Z}\setminus\{0\}} |\hat g_j(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega - \nu_j}{(5/3)^{N-2}}\Big)^2\Big) = 0 \le 1 - \hat r_l\Big(\frac{\omega}{(5/3)^{N-1}}\Big)^2, \quad \forall\,N \ge 2, \tag{106}$$
where the second equality in (106) is simply a consequence of $\hat g_j(\omega) = 0$, for $j\in\mathbb{Z}\setminus\{0\}$ and $\omega\in[0,1]$, which, in turn, is by (90) and (91). The inequality in (106) is thanks to $0 \le \hat r_l(\omega) \le 1$, for $\omega\in\mathbb{R}$.
let ω ∈ [1, 2]. Then, we have
ω − 2η 2 hl,N −2 (ω) = |gb1 (ω)|2 1 − rbl
(5/3)N −2
ω − η 2 ω − 2η 2 2
+
1
−
|
g
b
(ω)|
1
−
≤ |gb1 (ω)|2 1 − rbl
rbl
1
(5/3)N −2
(5/3)N −2
{z
}
|
|
{z
}
≥0
(107)
(108)
≥0
ω − η 2
ω − η 2 ω − 2η 2 2 +
|
g
b
(ω)|
= 1 − rbl
rbl
− rbl
,
1
(5/3)N −2
(5/3)N −2
(5/3)N −2
(109)
where (107) is thanks to gbj (ω) = 0, for j ∈ Z\{0, 1} and ω ∈ [1, 2], which, in turn, is by (90) and (91).
Moreover, (108) is owing to |gb1 (ω)|2 ∈ [0, 1], which, in turn, is by (25) and 0 ≤ rbl (ω) ≤ 1, for ω ∈ R.
Next, fix $j \ge 2$ and let $\omega\in[2^{j-1}, 2^j]$. Then, we have
$$h_{l,N-2}(\omega) = |\hat g_j(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega - 2^j\eta}{(5/3)^{N-2}}\Big)^2\Big) + \underbrace{|\hat g_{j-1}(\omega)|^2}_{=\,1-|\hat g_j(\omega)|^2-|\hat\phi(\omega)|^2\,\ge\,0}\Big(1 - \hat r_l\Big(\frac{\omega - 2^{j-1}\eta}{(5/3)^{N-2}}\Big)^2\Big) \tag{110}$$
$$\le 1 - \hat r_l\Big(\frac{\omega - 2^{j-1}\eta}{(5/3)^{N-2}}\Big)^2 + |\hat g_j(\omega)|^2\Big(\hat r_l\Big(\frac{\omega - 2^{j-1}\eta}{(5/3)^{N-2}}\Big)^2 - \hat r_l\Big(\frac{\omega - 2^j\eta}{(5/3)^{N-2}}\Big)^2\Big), \tag{111}$$
where (110) is thanks to i) $\hat g_{j'}(\omega) = 0$, for $j'\in\mathbb{Z}\setminus\{0, j, j-1\}$ and $\omega\in[2^{j-1}, 2^j]$, which, in turn, is by (90) and (91), and ii)
$$|\hat\phi(\omega)|^2 + |\hat g_{j-1}(\omega)|^2 + |\hat g_j(\omega)|^2 = 1, \quad \forall\,\omega\in[2^{j-1}, 2^j], \tag{112}$$
which is a consequence of the Littlewood-Paley condition (25) and of (90) and (91). It follows from (109) and (111) that for every $j \ge 1$, we have
$$h_{l,N-2}(\omega) \le 1 - \hat r_l\Big(\frac{\omega - 2^{j-1}\eta}{(5/3)^{N-2}}\Big)^2 + |\hat g_j(\omega)|^2\underbrace{\Big(\hat r_l\Big(\frac{\omega - 2^{j-1}\eta}{(5/3)^{N-2}}\Big)^2 - \hat r_l\Big(\frac{\omega - 2^j\eta}{(5/3)^{N-2}}\Big)^2\Big)}_{=:\,s(\omega)},$$
for $\omega\in[2^{j-1}, 2^j]$. Next, we divide the interval $[2^{j-1}, 2^j]$ into two intervals, namely $I_L := [2^{j-1}, \frac{3}{5}2^j]$ and $I_R := [\frac{3}{5}2^j, 2^j]$, and note that $s(\omega) \ge 0$, for $\omega\in I_L$, and $s(\omega) \le 0$, for $\omega\in I_R$, as $\hat r_l$ is monotonically decreasing in $|\omega|$ and $|\omega - 2^j\eta| \ge |\omega - 2^{j-1}\eta|$, for $\omega\in I_L$, and $|\omega - 2^j\eta| \le |\omega - 2^{j-1}\eta|$, for $\omega\in I_R$, respectively (see Figure 9).

Fig. 9: The functions $h_1(\omega) := |\omega - 2^j\eta|$ (solid line), $h_2(\omega) := |\omega - 2^{j-1}\eta|$ (dashed line), and $h_3(\omega) := \frac{3}{5}\omega$ (dotted line) satisfy $h_2 \le h_1 \le h_3$ on $I_L = [2^{j-1}, \frac{3}{5}2^j]$ and $h_1 \le h_2 \le h_3$ on $I_R = [\frac{3}{5}2^j, 2^j]$.

For $\omega\in I_L$, we therefore have
$$h_{l,N-2}(\omega) \le 1 - \hat r_l\Big(\frac{\omega - 2^{j-1}\eta}{(5/3)^{N-2}}\Big)^2 + \underbrace{|\hat g_j(\omega)|^2}_{\in[0,1]}\underbrace{s(\omega)}_{\ge 0} \le 1 - \hat r_l\Big(\frac{\omega - 2^{j-1}\eta}{(5/3)^{N-2}}\Big)^2 + s(\omega)$$
$$= 1 - \hat r_l\Big(\frac{\omega - 2^j\eta}{(5/3)^{N-2}}\Big)^2 \le 1 - \hat r_l\Big(\frac{\omega}{(5/3)^{N-1}}\Big)^2,$$
where $|\hat g_j(\omega)|^2 \in [0,1]$ follows from (112), and the last inequality is a consequence of $|\omega - 2^j\eta| \le \frac{3\omega}{5}$, for $\omega\in I_L$, see Figure 9. For $\omega\in I_R$, we have
$$h_{l,N-2}(\omega) \le 1 - \hat r_l\Big(\frac{\omega - 2^{j-1}\eta}{(5/3)^{N-2}}\Big)^2 + \underbrace{|\hat g_j(\omega)|^2}_{\in[0,1]}\underbrace{s(\omega)}_{\le 0} \le 1 - \hat r_l\Big(\frac{\omega - 2^{j-1}\eta}{(5/3)^{N-2}}\Big)^2 \le 1 - \hat r_l\Big(\frac{\omega}{(5/3)^{N-1}}\Big)^2,$$
where the last inequality now follows from $|\omega - 2^{j-1}\eta| \le \frac{3\omega}{5}$, for $\omega\in I_R$, see Figure 9. This completes the proof of (26).
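The interval split rests only on the elementary ordering of the three functions in Figure 9, which can be verified numerically for $\eta = 4/5$ across a range of scales j; a quick sanity check, not part of the proof:

```python
import numpy as np

# Sanity check of the Figure 9 ordering that determines the sign of s(omega):
# with eta = 4/5 and nu_j = 2^j * eta, h2 <= h1 <= (3/5)*omega on
# I_L = [2^(j-1), (3/5)*2^j] and h1 <= h2 <= (3/5)*omega on I_R = [(3/5)*2^j, 2^j].
eta = 4 / 5
ok = True
for j in range(1, 12):
    left, mid, right = 2.0**(j - 1), (3 / 5) * 2.0**j, 2.0**j
    wL = np.linspace(left, mid, 1001)
    h1, h2 = np.abs(wL - 2.0**j * eta), np.abs(wL - 2.0**(j - 1) * eta)
    ok &= bool(np.all(h2 <= h1 + 1e-9) and np.all(h1 <= (3 / 5) * wL + 1e-9))
    wR = np.linspace(mid, right, 1001)
    h1, h2 = np.abs(wR - 2.0**j * eta), np.abs(wR - 2.0**(j - 1) * eta)
    ok &= bool(np.all(h1 <= h2 + 1e-9) and np.all(h2 <= (3 / 5) * wR + 1e-9))
assert ok
```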
We proceed to the proof of statement ii), again effected by induction over N. Specifically, we first establish (29) by employing the same arguments as those leading to (99), with $(5/3)^{N-2}$ replaced by $(3/2)^{N-2}R$. With this replacement, $h_{l,N-2}$ in (100) becomes
$$h_{l,N-2}(\omega) := \sum_{k\in\mathbb{Z}\setminus\{0\}} |\hat g_k(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega - \nu_k}{(3/2)^{N-2}R}\Big)^2\Big), \tag{113}$$
where, again, appropriate choice of the modulation factors $\{\nu_k\}_{k\in\mathbb{Z}\setminus\{0\}} \subseteq \mathbb{R}$ (see (117) below) will be key in establishing the inductive step. Here, we note that the functions $\hat g_k$ in (113) satisfy $\hat g_k(\omega) = \hat g(\omega - R(k+1))$, for $k \ge 1$, and $\hat g_k(\omega) = \hat g(\omega + R(|k|+1))$, for $k \le -1$, by assumption, as well as
$$\mathrm{supp}(\hat g_k) = \mathrm{supp}(\hat g(\cdot - R(k+1))) \subseteq [Rk, R(k+2)], \quad k \ge 1, \tag{114}$$
and
$$\mathrm{supp}(\hat g_k) = \mathrm{supp}(\hat g(\cdot + R(|k|+1))) \subseteq [-R(|k|+2), -R|k|], \quad k \le -1, \tag{115}$$
where (114) and (115) follow from $\mathrm{supp}(\hat g) \subseteq [-R, R]$, which is by assumption. It remains to establish the equivalent of (101), namely
$$h_{l,N-2}(\omega) \le 1 - \hat r_l\Big(\frac{\omega}{(3/2)^{N-1}R}\Big)^2, \quad \forall\,\omega\in\mathbb{R}. \tag{116}$$
To this end, we set $\eta := \frac{2}{3}R$,
$$\nu_k := Rk + \eta, \quad \forall\,k \ge 1, \qquad \nu_k := -\nu_{|k|}, \quad \forall\,k \le -1, \tag{117}$$
and note that it suffices to establish (116) for ω ≥ 0, thanks to
$$h_{l,N-2}(-\omega) = \sum_{k\in\mathbb{Z}\setminus\{0\}} |\hat g_k(-\omega)|^2\Big(1 - \hat r_l\Big(\frac{-\omega - \nu_k}{(3/2)^{N-2}R}\Big)^2\Big)$$
$$= \sum_{k\le -1} |\hat g_k(-\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega + \nu_k}{(3/2)^{N-2}R}\Big)^2\Big) \tag{118}$$
$$= \sum_{k\ge 1} |\hat g_{-k}(-\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega + \nu_{-k}}{(3/2)^{N-2}R}\Big)^2\Big) \tag{119}$$
$$= \sum_{k\ge 1} |\hat g_k(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega - \nu_k}{(3/2)^{N-2}R}\Big)^2\Big) = h_{l,N-2}(\omega), \quad \forall\,\omega \ge 0. \tag{120}$$
Here, (118) follows from $\hat g_k(-\omega) = 0$, for $k \ge 1$ and $\omega \ge 0$, which, in turn, is by (114), and (120) is owing to $\hat g_k(\omega) = 0$, for $k \le -1$ and $\omega \ge 0$, which is by (115). Moreover, in (118) we used that $\hat r_l$ satisfies $\hat r_l(-\omega) = \hat r_l(\omega)$, for $\omega\in\mathbb{R}$, and (119) is thanks to $\nu_{-k} = -\nu_k$, for $k \ge 1$, and
$$\hat g_{-k}(-\omega) = \hat g(-\omega + R(|-k|+1)) = \hat g(-(\omega - R(k+1))) = \hat g(\omega - R(k+1)) = \hat g_k(\omega), \quad \forall\,\omega\in\mathbb{R},\ \forall\,k \ge 1,$$
where we used $\hat g(-\omega) = \hat g(\omega)$, for $\omega\in\mathbb{R}$, which is by assumption. Now, let ω ∈ [0, R], and note that
$$h_{l,N-2}(\omega) = 0 \le 1 - \hat r_l\Big(\frac{\omega}{(3/2)^{N-1}R}\Big)^2, \quad \forall\,N \ge 2, \tag{121}$$
where the equality in (121) is a consequence of (114) and (115), and the inequality is thanks to $0 \le \hat r_l(\omega) \le 1$, for $\omega\in\mathbb{R}$. Next, let ω ∈ [R, 2R]. Then, we have
$$h_{l,N-2}(\omega) = |\hat g_1(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega - \nu_1}{(3/2)^{N-2}R}\Big)^2\Big) \tag{122}$$
$$\le |\hat g_1(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega - \nu_1}{(3/2)^{N-2}R}\Big)^2\Big) + \underbrace{(1 - |\hat g_1(\omega)|^2)}_{\ge 0}\underbrace{\Big(1 - \hat r_l\Big(\frac{\omega - \eta}{(3/2)^{N-2}R}\Big)^2\Big)}_{\ge 0} \tag{123}$$
$$= 1 - \hat r_l\Big(\frac{\omega - \eta}{(3/2)^{N-2}R}\Big)^2 + |\hat g_1(\omega)|^2\Big(\hat r_l\Big(\frac{\omega - \eta}{(3/2)^{N-2}R}\Big)^2 - \hat r_l\Big(\frac{\omega - \nu_1}{(3/2)^{N-2}R}\Big)^2\Big), \tag{124}$$
where (122) is thanks to $\hat g_k(\omega) = 0$, for $k\in\mathbb{Z}\setminus\{0,1\}$ and $\omega\in[R,2R]$, which, in turn, is by (114) and (115). Moreover, (123) is owing to $|\hat g_1(\omega)|^2 \in [0,1]$, which, in turn, is by (28), and $0 \le \hat r_l(\omega) \le 1$, for $\omega\in\mathbb{R}$. Next, fix $k \ge 2$, and let $\omega\in[Rk, R(k+1)]$. Then, we have
$$h_{l,N-2}(\omega) = |\hat g_k(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega - \nu_k}{(3/2)^{N-2}R}\Big)^2\Big) + \underbrace{|\hat g_{k-1}(\omega)|^2}_{=\,1-|\hat g_k(\omega)|^2-|\hat\chi(\omega)|^2\,\ge\,0}\Big(1 - \hat r_l\Big(\frac{\omega - \nu_{k-1}}{(3/2)^{N-2}R}\Big)^2\Big) \tag{125}$$
$$\le 1 - \hat r_l\Big(\frac{\omega - \nu_{k-1}}{(3/2)^{N-2}R}\Big)^2 + |\hat g_k(\omega)|^2\Big(\hat r_l\Big(\frac{\omega - \nu_{k-1}}{(3/2)^{N-2}R}\Big)^2 - \hat r_l\Big(\frac{\omega - \nu_k}{(3/2)^{N-2}R}\Big)^2\Big), \tag{126}$$
where (125) is thanks to i) $\hat g_{k'}(\omega) = 0$, for $k'\in\mathbb{Z}\setminus\{0, k, k-1\}$ and $\omega\in[Rk, R(k+1)]$, which, in turn, is by (114) and (115), and ii)
$$|\hat\chi(\omega)|^2 + |\hat g_{k-1}(\omega)|^2 + |\hat g_k(\omega)|^2 = 1, \quad \forall\,\omega\in[Rk, R(k+1)], \tag{127}$$
which is a consequence of the Littlewood-Paley condition (28) and of (114) and (115). It follows from (124) and (126) that for $k \ge 1$, we have
$$h_{l,N-2}(\omega) \le 1 - \hat r_l\Big(\frac{\omega - \nu_{k-1}}{(3/2)^{N-2}R}\Big)^2 + |\hat g_k(\omega)|^2\underbrace{\Big(\hat r_l\Big(\frac{\omega - \nu_{k-1}}{(3/2)^{N-2}R}\Big)^2 - \hat r_l\Big(\frac{\omega - \nu_k}{(3/2)^{N-2}R}\Big)^2\Big)}_{=:\,s(\omega)},$$
for $\omega\in[Rk, R(k+1)]$, where $\nu_0 := \eta$. Next, we divide the interval $[Rk, R(k+1)]$ into two intervals, namely $I_L := [Rk, R(k+\frac{1}{6})]$ and $I_R := [R(k+\frac{1}{6}), R(k+1)]$, and note that $s(\omega) \ge 0$, for $\omega\in I_L$, and $s(\omega) \le 0$, for $\omega\in I_R$, as $\hat r_l$ is monotonically decreasing in $|\omega|$ and $|\omega - \nu_k| \ge |\omega - \nu_{k-1}|$, for $\omega\in I_L$, and $|\omega - \nu_k| \le |\omega - \nu_{k-1}|$, for $\omega\in I_R$, respectively (see Figure 10). For $\omega\in I_L$, we therefore have
$$h_{l,N-2}(\omega) \le 1 - \hat r_l\Big(\frac{\omega - \nu_{k-1}}{(3/2)^{N-2}R}\Big)^2 + \underbrace{|\hat g_k(\omega)|^2}_{\in[0,1]}\underbrace{s(\omega)}_{\ge 0} \le 1 - \hat r_l\Big(\frac{\omega - \nu_{k-1}}{(3/2)^{N-2}R}\Big)^2 + s(\omega)$$
$$= 1 - \hat r_l\Big(\frac{\omega - \nu_k}{(3/2)^{N-2}R}\Big)^2 \le 1 - \hat r_l\Big(\frac{\omega}{(3/2)^{N-1}R}\Big)^2,$$
where $|\hat g_k(\omega)|^2 \in [0,1]$ follows from (127), and the last inequality is by $|\omega - \nu_k| \le \frac{2\omega}{3}$, for $\omega\in I_L$ (see Figure 10).

Fig. 10: The functions $h_1(\omega) := |\omega - \nu_k|$ (solid line), $h_2(\omega) := |\omega - \nu_{k-1}|$ (dashed line), and $h_3(\omega) := \frac{2}{3}\omega$ (dotted line) satisfy $h_2 \le h_1 \le h_3$ on $I_L = [Rk, R(k+\frac{1}{6})]$ and $h_1 \le h_2 \le h_3$ on $I_R = [R(k+\frac{1}{6}), R(k+1)]$.

For $\omega\in I_R$, we have
$$h_{l,N-2}(\omega) \le 1 - \hat r_l\Big(\frac{\omega - \nu_{k-1}}{(3/2)^{N-2}R}\Big)^2 + \underbrace{|\hat g_k(\omega)|^2}_{\in[0,1]}\underbrace{s(\omega)}_{\le 0} \le 1 - \hat r_l\Big(\frac{\omega - \nu_{k-1}}{(3/2)^{N-2}R}\Big)^2 \le 1 - \hat r_l\Big(\frac{\omega}{(3/2)^{N-1}R}\Big)^2,$$
where the last inequality is by $|\omega - \nu_{k-1}| \le \frac{2\omega}{3}$, for $\omega\in I_R$ (see Figure 10). This completes the proof of (29).
Next, we establish (30). The proof is very similar to that of statement ii) in Theorem 1 in Appendix D. We start by noting that (30) amounts to the existence of C > 0 (that is independent of N) and $N_0\in\mathbb{N}$ such that
$$W_N(f) \le C\,(3/2)^{-\frac{N(2s+\beta)}{2s+\beta+1}}, \quad \forall\,N \ge N_0,$$
for every $f\in H^s(\mathbb{R})$, where β > 0 depends on f. Now, for $f\in H^s(\mathbb{R})$, by [41, Sec. 6.2.1] there exist β, μ, L > 0 (all depending on f) such that
$$|\hat f(\omega)| \le \frac{\mu}{(1+|\omega|^2)^{\frac{s}{2}+\frac{1}{4}+\frac{\beta}{4}}}, \quad \forall\,|\omega| \ge L. \tag{128}$$
Without loss of generality, we can assume that L ≥ 1 (if L < 1, the upper bound (128) clearly holds for all $\omega\in\mathbb{R}$ with $|\omega| \ge \tilde L := 1 > L$). The key idea of the proof is to split the integral on the RHS of (29), i.e.,
$$\int_{\mathbb{R}} \underbrace{|\hat f(\omega)|^2\Big(1 - \hat r_l\Big(\frac{\omega}{(3/2)^{N-1}R}\Big)^2\Big)}_{=:\,z(\omega)}\,d\omega,$$
into integrals over the sets $B_{R_N}(0)$ and $\mathbb{R}\setminus B_{R_N}(0)$, where $R_N := (3/2)^{\frac{N-1}{2s+\beta+1}}$, for $N\in\mathbb{N}$, and to establish that
$$\int_{B_{R_N}(0)} z(\omega)\,d\omega \le 2l\|f\|_2^2 R^{-1}(3/2)^{-\frac{(N-1)(2s+\beta)}{2s+\beta+1}}, \quad \forall\,N \ge 1, \tag{129}$$
and
$$\int_{\mathbb{R}\setminus B_{R_N}(0)} z(\omega)\,d\omega \le \frac{2\mu^2}{2s+\beta}(3/2)^{-\frac{(N-1)(2s+\beta)}{2s+\beta+1}}, \tag{130}$$
for $N \ge N_0 := \big\lceil(2s+\beta+1)\log_{3/2}(L)+1\big\rceil$. This then implies $\int_{\mathbb{R}} z(\omega)\,d\omega \le C\,(3/2)^{-\frac{N(2s+\beta)}{2s+\beta+1}}$ for $N \ge \max\{1, N_0\} = N_0$, where
$$C := 4\cdot(3/2)^{\frac{2s+\beta}{2s+\beta+1}}\max\big\{l\|f\|_2^2 R^{-1},\ \mu^2(2s+\beta)^{-1}\big\},$$
and thereby completes the proof. It remains to establish (129) and (130). We start with (129) and note that $(1 - 2l|\omega|) \le (1-|\omega|)_+^{2l}$, for $\omega\in\mathbb{R}$, where $l > 1$, see Figure 8 in Appendix D. This implies
$$1 - \hat r_l\Big(\frac{\omega}{(3/2)^{N-1}R}\Big)^2 = 1 - \Big(1 - \frac{|\omega|}{(3/2)^{N-1}R}\Big)_+^{2l} \le \frac{2l}{(3/2)^{N-1}R}\,|\omega|, \quad \forall\,\omega\in\mathbb{R},$$
which when employed in $\int_{B_{R_N}(0)} z(\omega)\,d\omega$ yields (129) thanks to
$$\int_{B_{R_N}(0)} z(\omega)\,d\omega \le \frac{2l}{(3/2)^{N-1}R}\int_{B_{R_N}(0)} |\hat f(\omega)|^2\underbrace{|\omega|}_{\le R_N}\,d\omega \le \frac{2l\,R_N}{(3/2)^{N-1}R}\|f\|_2^2 = \frac{2l\|f\|_2^2}{R}\Big(\frac{3}{2}\Big)^{-\frac{(N-1)(2s+\beta)}{2s+\beta+1}},$$
for N ≥ 1, where we used Parseval's formula and $R_N = (3/2)^{\frac{N-1}{2s+\beta+1}}$. Next, we establish (130) and start
by noting that $1 - \hat r_l\big(\frac{\omega}{(3/2)^{N-1}R}\big)^2 \le 1$, for $\omega\in\mathbb{R}$, which is by $0 \le \hat r_l(\omega) \le 1$, for $\omega\in\mathbb{R}$. Moreover, we have
$$R_N = (3/2)^{\frac{N-1}{2s+\beta+1}} \ge (3/2)^{\frac{N_0-1}{2s+\beta+1}} = (3/2)^{\frac{\lceil(2s+\beta+1)\log_{3/2}(L)+1\rceil-1}{2s+\beta+1}} \ge (3/2)^{\log_{3/2}(L)} = L, \quad \forall\,N \ge N_0,$$
which by (128) implies
$$|\hat f(\omega)| \le \frac{\mu}{(1+|\omega|^2)^{\frac{s}{2}+\frac{1}{4}+\frac{\beta}{4}}}, \quad \forall\,|\omega| \ge R_N. \tag{131}$$
Employing (131) in $\int_{\mathbb{R}\setminus B_{R_N}(0)} z(\omega)\,d\omega$ now yields
$$\int_{\mathbb{R}\setminus B_{R_N}(0)} z(\omega)\,d\omega \le \int_{\mathbb{R}\setminus B_{R_N}(0)} \frac{\mu^2}{(1+|\omega|^2)^{s+\frac{1}{2}+\frac{\beta}{2}}}\,d\omega = 2\mu^2\int_{R_N}^{\infty} \frac{1}{(1+r^2)^{s+\frac{1}{2}+\frac{\beta}{2}}}\,dr, \tag{132}$$
where in the last step we introduced polar coordinates. Moreover, $(1+r^2)^{s+\frac{1}{2}+\frac{\beta}{2}} \ge r^{2s+1+\beta}$, for $r \ge 0$,
which when employed in (132) yields (130) owing to
$$\int_{\mathbb{R}\setminus B_{R_N}(0)} z(\omega)\,d\omega \le 2\mu^2\int_{R_N}^{\infty} r^{-2s-\beta-1}\,dr = \frac{2\mu^2}{2s+\beta}\,R_N^{-2s-\beta} = \frac{2\mu^2}{2s+\beta}(3/2)^{-\frac{(N-1)(2s+\beta)}{2s+\beta+1}},$$
where in the last step we used $R_N = (3/2)^{\frac{N-1}{2s+\beta+1}}$. This completes the proof of statement ii).
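Two elementary ingredients of the estimates (129) and (130) lend themselves to a numerical sanity check: the convexity bound $(1 - 2l|\omega|) \le (1-|\omega|)_+^{2l}$ and the exponent bookkeeping behind $R_N$; the values of l, s, β, and R below are arbitrary samples:

```python
import numpy as np

# Two elementary ingredients of (129)-(130), checked numerically: the convexity
# bound (1 - 2*l*x) <= (1 - x)_+^(2l) for l > 1, and the exponent bookkeeping
# 2*l*R_N / ((3/2)^(N-1) * R) = (2l/R) * (3/2)^(-(N-1)(2s+b)/(2s+b+1)) with
# R_N = (3/2)^((N-1)/(2s+b+1)); l, s, b (= beta), R are arbitrary sample values.
l, s, b, R = 3, 1.0, 0.5, 2.0
x = np.linspace(0, 3, 10001)
assert bool(np.all(1 - 2 * l * x <= np.clip(1 - x, 0, None) ** (2 * l) + 1e-12))

checks = []
for N in range(1, 20):
    R_N = (3 / 2) ** ((N - 1) / (2 * s + b + 1))
    lhs = 2 * l * R_N / ((3 / 2) ** (N - 1) * R)
    rhs = (2 * l / R) * (3 / 2) ** (-(N - 1) * (2 * s + b) / (2 * s + b + 1))
    checks.append(bool(np.isclose(lhs, rhs)))
assert all(checks)
```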
APPENDIX H
PROOF OF COROLLARY 1

We start with statement i) and note that $A_\Omega^N = B_\Omega^N = 1$, $N\in\mathbb{N}$, by assumption. Let $f\in L^2(\mathbb{R}^d)$ with $\mathrm{supp}(\hat f) \subseteq B_L(0)$. Then, by Proposition 1 in Appendix E together with $\lim_{N\to\infty} W_N(f) = 0$, for $f\in L^2(\mathbb{R}^d)$, which follows from Proposition 2 in Appendix F, we have
$$\|f\|_2^2 = |||\Phi_\Omega(f)|||^2 = \sum_{n=0}^{\infty} |||\Phi_\Omega^n(f)|||^2 \ge \sum_{n=0}^{N} |||\Phi_\Omega^n(f)|||^2 = \|f\|_2^2 - W_{N+1}(f) \tag{133}$$
$$\ge \int_{\mathbb{R}^d} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{(N+1)^\alpha\delta}\Big)^2\,d\omega \tag{134}$$
$$= \int_{B_L(0)} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{(N+1)^\alpha\delta}\Big)^2\,d\omega, \tag{135}$$
where (133) is by the lower bound in (83), (134) is thanks to Parseval's formula and (20), and (135) follows from f being L-band-limited. Next, thanks to $\hat r_l$ being monotonically decreasing in $|\omega|$, we get
$$\hat r_l\Big(\frac{\omega}{(N+1)^\alpha\delta}\Big)^2 \ge \hat r_l\Big(\frac{L}{(N+1)^\alpha\delta}\Big)^2, \quad \forall\,\omega\in B_L(0). \tag{136}$$
Employing (136) in (135), we obtain
$$\|f\|_2^2 - W_{N+1}(f) \ge \hat r_l\Big(\frac{L}{(N+1)^\alpha\delta}\Big)^2\|f\|_2^2 = \Big(1 - \frac{L}{(N+1)^\alpha\delta}\Big)_+^{2l}\|f\|_2^2 \tag{137}$$
$$= \Big(1 - \frac{L}{(N+1)^\alpha\delta}\Big)^{2l}\|f\|_2^2 \ge (1-\varepsilon)\|f\|_2^2, \tag{138}$$
where in (137) we used Parseval's formula, the equality in (138) is due to $L \le (N+1)^\alpha\delta$, which, in turn, is by (32), and the inequality in (138) is also by (32) (upon rearranging terms). This establishes (31) and thereby completes the proof.
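Rearranging the chain (137)–(138) gives an explicit depth estimate: the (1 − ε)-capture guarantee holds as soon as $(N+1)^\alpha\delta \ge L/(1 - (1-\varepsilon)^{1/(2l)})$. A sketch with hypothetical sample values of d, l, L, δ, and ε:

```python
import math

# Rearranging (137)-(138): (1 - L/((N+1)^alpha * delta))^(2l) >= 1 - eps holds
# as soon as (N+1)^alpha * delta >= L / (1 - (1-eps)^(1/(2l))). The values of
# d, l, L, delta, eps below are hypothetical samples.
d, l = 2, 3
alpha = math.log2(math.sqrt(d / (d - 0.5)))
L, delta, eps = 10.0, 1.0, 0.05

threshold = L / (delta * (1 - (1 - eps) ** (1 / (2 * l))))
N = math.ceil(threshold ** (1 / alpha)) - 1       # smallest admissible depth

captured = (1 - L / ((N + 1) ** alpha * delta)) ** (2 * l)
assert captured >= 1 - eps - 1e-9
```

For these sample values the guaranteed depth is enormous, a consequence of the polynomial growth of $(N+1)^\alpha\delta$ underlying statement i); the geometric scaling $a^N\delta$ in statement ii) yields far smaller depth estimates.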
The proof of statement ii) is very similar to that of statement i). Again, we start by noting that $A_\Omega^N = B_\Omega^N = 1$, $N\in\mathbb{N}$, by assumption. Let $f\in L^2(\mathbb{R})$ with $\mathrm{supp}(\hat f) \subseteq B_L(0)$. Then, by Proposition 1 in Appendix E together with $\lim_{N\to\infty} W_N(f) = 0$, for $f\in L^2(\mathbb{R})$, we have
$$\|f\|_2^2 = |||\Phi_\Omega(f)|||^2 = \sum_{n=0}^{\infty} |||\Phi_\Omega^n(f)|||^2 \ge \sum_{n=0}^{N} |||\Phi_\Omega^n(f)|||^2 = \|f\|_2^2 - W_{N+1}(f) \tag{139}$$
$$\ge \int_{\mathbb{R}} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{a^N\delta}\Big)^2\,d\omega \tag{140}$$
$$= \int_{B_L(0)} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{a^N\delta}\Big)^2\,d\omega, \tag{141}$$
where (139) is by the lower bound in (83), (140) is thanks to Parseval's formula and (26) and (29), and (141) follows from f being L-band-limited. Next, thanks to $\hat r_l$ being monotonically decreasing in $|\omega|$, we get
$$\hat r_l\Big(\frac{\omega}{a^N\delta}\Big)^2 \ge \hat r_l\Big(\frac{L}{a^N\delta}\Big)^2, \quad \forall\,\omega\in B_L(0). \tag{142}$$
Employing (142) in (141) yields
$$\|f\|_2^2 - W_{N+1}(f) \ge \hat r_l\Big(\frac{L}{a^N\delta}\Big)^2\|f\|_2^2 = \Big(1 - \frac{L}{a^N\delta}\Big)_+^{2l}\|f\|_2^2 \tag{143}$$
$$= \Big(1 - \frac{L}{a^N\delta}\Big)^{2l}\|f\|_2^2 \ge (1-\varepsilon)\|f\|_2^2, \tag{144}$$
where in (143) we used Parseval's formula, the equality in (144) is by $L \le a^N\delta$, which, in turn, is by (33), and the inequality in (144) is also due to (33) (upon rearranging terms). This establishes (31) and thereby completes the proof of ii).
APPENDIX I
PROOF OF COROLLARY 2

The proof is very similar to that of Corollary 1 in Appendix H. We start with statement i) and note that $H^s(\mathbb{R}^d) \subseteq L^2(\mathbb{R}^d)$, $s \ge 0$, which follows from
$$\|f\|_2^2 = \int_{\mathbb{R}^d} |\hat f(\omega)|^2\,d\omega \le \int_{\mathbb{R}^d} |\hat f(\omega)|^2\underbrace{(1+|\omega|^2)^s}_{\ge 1}\,d\omega < \infty, \quad \forall\,f\in H^s(\mathbb{R}^d), \tag{145}$$
where we used Parseval's formula. Next, let $f\in H^s(\mathbb{R}^d)\setminus\{0\}$ and ε ∈ (0, 1) and note that, by [41, Sec. 6.2.1], there exist β, μ, L > 0 (all depending on f) such that
$$|\hat f(\omega)| \le \frac{\mu}{(1+|\omega|^2)^{\frac{s}{2}+\frac{d}{4}+\frac{\beta}{4}}}, \quad \forall\,|\omega| \ge L. \tag{146}$$
Moreover, the constant
$$\tau := \max\Bigg\{L,\ \bigg(\frac{\mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))}{(2s+\beta)(1-\sqrt{1-\varepsilon}\,)\|f\|_2^2}\bigg)^{\frac{1}{2s+\beta}}\Bigg\}, \tag{147}$$
playing the role of an "effective bandwidth", is finite thanks to $0 < \|f\|_2 < \infty$, which, in turn, is by $f \ne 0$ and (145). Then, we have
$$\int_{\mathbb{R}^d\setminus B_\tau(0)} |\hat f(\omega)|^2\,d\omega \le \int_{\mathbb{R}^d\setminus B_\tau(0)} \frac{\mu^2}{(1+|\omega|^2)^{s+\frac{d}{2}+\frac{\beta}{2}}}\,d\omega \tag{148}$$
$$= \mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))\int_{\tau}^{\infty} \frac{r^{d-1}}{(1+r^2)^{s+\frac{d}{2}+\frac{\beta}{2}}}\,dr \tag{149}$$
$$\le \mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))\int_{\tau}^{\infty} r^{-(2s+\beta+1)}\,dr \tag{150}$$
$$= \frac{\mu^2\,\mathrm{vol}^{d-1}(\partial B_1(0))}{2s+\beta}\,\tau^{-(2s+\beta)} \le (1-\sqrt{1-\varepsilon}\,)\|f\|_2^2. \tag{151}$$
Here, (148) follows from (146) and (147), in (149) we changed to polar coordinates, (150) is owing to $(1+r^2)^{s+\frac{d}{2}+\frac{\beta}{2}} \ge r^{2s+d+\beta}$, for $r \ge 0$, and the last step in (151) is by (147). Next, we note that $A_\Omega^N = B_\Omega^N = 1$, $N\in\mathbb{N}$, by assumption. By Proposition 1 in Appendix E together with $\lim_{N\to\infty} W_N(f) = 0$, for $f\in L^2(\mathbb{R}^d)$, which follows from Proposition 2 in Appendix F, we have
$$\|f\|_2^2 = |||\Phi_\Omega(f)|||^2 = \sum_{n=0}^{\infty} |||\Phi_\Omega^n(f)|||^2 \ge \sum_{n=0}^{N} |||\Phi_\Omega^n(f)|||^2 = \|f\|_2^2 - W_{N+1}(f) \tag{152}$$
$$\ge \int_{\mathbb{R}^d} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{(N+1)^\alpha\delta}\Big)^2\,d\omega \tag{153}$$
$$= \int_{B_{(N+1)^\alpha\delta}(0)} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{(N+1)^\alpha\delta}\Big)^2\,d\omega, \tag{154}$$
where (152) is by the lower bound in (83), (153) is thanks to Parseval's formula and (20), and (154) follows from $\mathrm{supp}(\hat r_l) \subseteq [0,1]$. Next, thanks to $(N+1)^\alpha\delta \ge \tau$, by (34) and (35), we have
$$\int_{B_{(N+1)^\alpha\delta}(0)} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{(N+1)^\alpha\delta}\Big)^2\,d\omega \ge \int_{B_\tau(0)} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{(N+1)^\alpha\delta}\Big)^2\,d\omega \ge \hat r_l\Big(\frac{\tau}{(N+1)^\alpha\delta}\Big)^2\int_{B_\tau(0)} |\hat f(\omega)|^2\,d\omega, \tag{155}$$
where the last step is because $\hat r_l$ is monotonically decreasing in $|\omega|$. Employing (155) in (154), we get
$$\|f\|_2^2 - W_{N+1}(f) \ge \hat r_l\Big(\frac{\tau}{(N+1)^\alpha\delta}\Big)^2\int_{B_\tau(0)} |\hat f(\omega)|^2\,d\omega$$
$$= \hat r_l\Big(\frac{\tau}{(N+1)^\alpha\delta}\Big)^2\Big(\int_{\mathbb{R}^d} |\hat f(\omega)|^2\,d\omega - \int_{\mathbb{R}^d\setminus B_\tau(0)} |\hat f(\omega)|^2\,d\omega\Big)$$
$$\ge \hat r_l\Big(\frac{\tau}{(N+1)^\alpha\delta}\Big)^2\sqrt{1-\varepsilon}\,\|f\|_2^2 = \Big(1 - \frac{\tau}{(N+1)^\alpha\delta}\Big)_+^{2l}\sqrt{1-\varepsilon}\,\|f\|_2^2 \tag{156}$$
$$= \Big(1 - \frac{\tau}{(N+1)^\alpha\delta}\Big)^{2l}\sqrt{1-\varepsilon}\,\|f\|_2^2 \ge (1-\varepsilon)\|f\|_2^2, \tag{157}$$
where the inequality in (156) is by Parseval's formula and (151). The equality in (157) is due to $(N+1)^\alpha\delta \ge \tau$, which, in turn, is by (34) and (35), and the inequality in (157) is also due to (34) and (35). This establishes (31) and thereby completes the proof of i).
The proof of statement ii) is very similar to that of statement i). Let $f\in H^s(\mathbb{R})\setminus\{0\}$ and ε ∈ (0, 1) and note that, by [41, Sec. 6.2.1], there exist β, μ, L > 0 (all depending on f) such that
$$|\hat f(\omega)| \le \frac{\mu}{(1+|\omega|^2)^{\frac{s}{2}+\frac{1}{4}+\frac{\beta}{4}}}, \quad \forall\,|\omega| \ge L. \tag{158}$$
Again, the constant
$$\tau := \max\Bigg\{L,\ \bigg(\frac{2\mu^2}{(2s+\beta)(1-\sqrt{1-\varepsilon}\,)\|f\|_2^2}\bigg)^{\frac{1}{2s+\beta}}\Bigg\}, \tag{159}$$
is finite thanks to $0 < \|f\|_2 < \infty$, which, in turn, is by $f \ne 0$ and (145). Then, we have
$$\int_{\mathbb{R}\setminus B_\tau(0)} |\hat f(\omega)|^2\,d\omega \le \int_{\mathbb{R}\setminus B_\tau(0)} \frac{\mu^2}{(1+|\omega|^2)^{s+\frac{1}{2}+\frac{\beta}{2}}}\,d\omega \tag{160}$$
$$= 2\mu^2\int_{\tau}^{\infty} \frac{1}{(1+r^2)^{s+\frac{1}{2}+\frac{\beta}{2}}}\,dr \tag{161}$$
$$\le 2\mu^2\int_{\tau}^{\infty} r^{-(2s+\beta+1)}\,dr \tag{162}$$
$$= \frac{2\mu^2}{2s+\beta}\,\tau^{-(2s+\beta)} \le (1-\sqrt{1-\varepsilon}\,)\|f\|_2^2. \tag{163}$$
Here, (160) follows from (158) and (159), in (161) we introduced polar coordinates, (162) is owing to $(1+r^2)^{s+\frac{1}{2}+\frac{\beta}{2}} \ge r^{2s+1+\beta}$, for $r \ge 0$, and the step leading to (163) is by (159). Next, we note that $A_\Omega^N = B_\Omega^N = 1$, $N\in\mathbb{N}$, by assumption. By Proposition 1 in Appendix E together with $\lim_{N\to\infty} W_N(f) = 0$,
for $f\in L^2(\mathbb{R})$, we have
$$\|f\|_2^2 = |||\Phi_\Omega(f)|||^2 = \sum_{n=0}^{\infty} |||\Phi_\Omega^n(f)|||^2 \ge \sum_{n=0}^{N} |||\Phi_\Omega^n(f)|||^2 = \|f\|_2^2 - W_{N+1}(f) \tag{164}$$
$$\ge \int_{\mathbb{R}} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{a^N\delta}\Big)^2\,d\omega \tag{165}$$
$$= \int_{B_\xi(0)} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{a^N\delta}\Big)^2\,d\omega, \tag{166}$$
where (164) is by the lower bound in (83), (165) is thanks to Parseval's formula and (26) and (29), and (166), where $\xi := a^N\delta$, follows from $\mathrm{supp}(\hat r_l) \subseteq [0,1]$. Next, thanks to $\xi = a^N\delta \ge \tau$, by (36) and (37), we have
$$\int_{B_\xi(0)} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{a^N\delta}\Big)^2\,d\omega \ge \int_{B_\tau(0)} |\hat f(\omega)|^2\,\hat r_l\Big(\frac{\omega}{a^N\delta}\Big)^2\,d\omega \ge \hat r_l\Big(\frac{\tau}{a^N\delta}\Big)^2\int_{B_\tau(0)} |\hat f(\omega)|^2\,d\omega, \tag{167}$$
where the last step is because $\hat r_l$ is monotonically decreasing in $|\omega|$. Employing (167) in (166), we get
$$\|f\|_2^2 - W_{N+1}(f) \ge \hat r_l\Big(\frac{\tau}{a^N\delta}\Big)^2\int_{B_\tau(0)} |\hat f(\omega)|^2\,d\omega = \hat r_l\Big(\frac{\tau}{a^N\delta}\Big)^2\Big(\int_{\mathbb{R}} |\hat f(\omega)|^2\,d\omega - \int_{\mathbb{R}\setminus B_\tau(0)} |\hat f(\omega)|^2\,d\omega\Big)$$
$$\ge \hat r_l\Big(\frac{\tau}{a^N\delta}\Big)^2\sqrt{1-\varepsilon}\,\|f\|_2^2 = \Big(1 - \frac{\tau}{a^N\delta}\Big)_+^{2l}\sqrt{1-\varepsilon}\,\|f\|_2^2 \tag{168}$$
$$= \Big(1 - \frac{\tau}{a^N\delta}\Big)^{2l}\sqrt{1-\varepsilon}\,\|f\|_2^2 \ge (1-\varepsilon)\|f\|_2^2, \tag{169}$$
where the inequality in (168) is by Parseval's formula and (163). The equality in (169) is due to $a^N\delta \ge \tau$, which, in turn, is by (36) and (37), and the inequality in (169) is also due to (36) and (37). This establishes (31) and thereby completes the proof of ii).
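The tail estimate (160)–(163) can be checked numerically in one dimension for a model spectrum that meets (158) with equality everywhere; μ, s, β, and τ below are hypothetical sample values (τ here need not be the constant from (159)):

```python
import numpy as np

# Numerical check of the tail estimate (160)-(163) in one dimension for a model
# spectrum meeting (158) with equality for all omega; mu, s, beta, tau are
# hypothetical sample values.
mu, s, beta = 1.0, 1.0, 0.5
tau = 2.0
r = np.linspace(tau, 200.0, 400001)
f_hat2 = mu**2 / (1 + r**2) ** (s + 0.5 + beta / 2)

tail = float(2 * np.sum(f_hat2) * (r[1] - r[0]))   # ~ integral over R \ B_tau(0)
bound = 2 * mu**2 / (2 * s + beta) * tau ** (-(2 * s + beta))
assert 0 < tail <= bound
```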
REFERENCES
[1] P. Grohs, T. Wiatowski, and H. Bölcskei, “Energy decay in deep convolutional neural networks,” IEEE International
Symposium on Information Theory (ISIT), 2017, to appear.
[2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[4] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
[5] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Handwritten digit recognition with
a back-propagation network,” in Proc. of International Conference on Neural Information Processing Systems (NIPS),
pp. 396–404, 1990.
[6] D. E. Rumelhart, G. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel
distributed processing: Explorations in the microstructure of cognition (J. L. McClelland and D. E. Rumelhart, eds.),
pp. 318–362, MIT Press, 1986.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Proc. of
the IEEE, pp. 2278–2324, 1998.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of IEEE International
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2015.
[9] S. Mallat, “Group invariant scattering,” Comm. Pure Appl. Math., vol. 65, no. 10, pp. 1331–1398, 2012.
[10] T. Wiatowski and H. Bölcskei, “A mathematical theory of deep convolutional neural networks for feature extraction,”
arXiv:1512.06293, 2015.
[11] I. Waldspurger, Wavelet transform modulus: Phase retrieval and scattering. PhD thesis, École Normale Supérieure, Paris,
2015.
[12] W. Czaja and W. Li, “Analysis of time-frequency scattering transforms,” arXiv:1606.08677, 2017.
[13] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35,
no. 8, pp. 1872–1886, 2013.
[14] J. Andén and S. Mallat, “Deep scattering spectrum,” IEEE Trans. Signal Process., vol. 62, no. 16, pp. 4114–4128, 2014.
[15] L. Sifre, Rigid-motion scattering for texture classification. PhD thesis, Centre de Mathématiques Appliquées, École
Polytechnique Paris-Saclay, 2014.
[16] T. Wiatowski, M. Tschannen, A. Stanić, P. Grohs, and H. Bölcskei, “Discrete deep feature extraction: A theory and new
architectures,” in Proc. of International Conference on Machine Learning (ICML), pp. 2149–2158, 2016.
[17] M. Tschannen, T. Kramer, G. Marti, M. Heinzmann, and T. Wiatowski, “Heart sound classification using deep structured
features,” in Proc. of Computing in Cardiology (CinC), pp. 565–568, 2016.
[18] M. Tschannen, L. Cavigelli, F. Mentzer, T. Wiatowski, and L. Benini, “Deep structured features for semantic segmentation,”
arXiv:1609.07916, 2016.
[19] R. Balan, M. Singh, and D. Zou, “Lipschitz properties for deep convolutional networks,” arXiv:1701.05217, 2017.
[20] P. Grohs, T. Wiatowski, and H. Bölcskei, “Deep convolutional neural networks on cartoon functions,” in Proc. of IEEE
International Symposium on Information Theory (ISIT), pp. 1163–1167, 2016.
[21] K. Gröchenig, Foundations of time-frequency analysis. Birkhäuser, 2001.
[22] I. Daubechies, Ten lectures on wavelets. Society for Industrial and Applied Mathematics, 1992.
[23] S. Mallat, A wavelet tour of signal processing: The sparse way. Academic Press, 3rd ed., 2009.
[24] E. J. Candès and D. L. Donoho, “New tight frames of curvelets and optimal representations of objects with piecewise
C² singularities,” Comm. Pure Appl. Math., vol. 57, no. 2, pp. 219–266, 2004.
[25] E. J. Candès and D. L. Donoho, “Continuous curvelet transform: II. Discretization and frames,” Appl. Comput. Harmon.
Anal., vol. 19, no. 2, pp. 198–222, 2005.
[26] P. Grohs, S. Keiper, G. Kutyniok, and M. Schaefer, “Cartoon approximation with α-curvelets,” J. Fourier Anal. Appl.,
pp. 1–59, 2015.
[27] K. Guo, G. Kutyniok, and D. Labate, “Sparse multidimensional representations using anisotropic dilation and shear
operators,” in Wavelets and Splines (G. Chen and M. J. Lai, eds.), pp. 189–201, Nashboro Press, 2006.
[28] G. Kutyniok and D. Labate, eds., Shearlets: Multiscale analysis for multivariate data. Birkhäuser, 2012.
[29] E. J. Candès, Ridgelets: Theory and applications. PhD thesis, Stanford University, 1998.
[30] E. J. Candès and D. L. Donoho, “Ridgelets: A key to higher-dimensional intermittency?,” Philos. Trans. R. Soc. London
Ser. A, vol. 357, no. 1760, pp. 2495–2509, 1999.
[31] P. Grohs, “Ridgelet-type frame decompositions for Sobolev spaces related to linear transport,” J. Fourier Anal. Appl.,
vol. 18, no. 2, pp. 309–325, 2012.
[32] S. T. Ali, J. P. Antoine, and J. P. Gazeau, “Continuous frames in Hilbert spaces,” Annals of Physics, vol. 222, no. 1,
pp. 1–37, 1993.
[33] O. Christensen, An introduction to frames and Riesz bases. Birkhäuser, 2003.
[34] G. Kaiser, A friendly guide to wavelets. Birkhäuser, 1994.
[35] D. L. Donoho, “Sparse components of images and optimal atomic decompositions,” Constructive Approximation, vol. 17,
no. 3, pp. 353–382, 2001.
[36] P. Grohs and G. Kutyniok, “Parabolic molecules,” Foundations of Computational Mathematics, vol. 14, no. 2, pp. 299–337,
2014.
[37] G. Kutyniok and D. Labate, “Introduction to shearlets,” in Shearlets: Multiscale analysis for multivariate data [28],
pp. 1–38.
[38] Y. LeCun and C. Cortes, “The MNIST database of handwritten digits.” http://yann.lecun.com/exdb/mnist/, 1998.
[39] B. Dacorogna, Introduction to the calculus of variations. Imperial College Press, 2004.
[40] W. Rudin, Functional analysis. McGraw-Hill, 2nd ed., 1991.
[41] L. Grafakos, Modern Fourier analysis. Springer, 2nd ed., 2009.
[42] A. R. Adams, Sobolev spaces. Academic Press, 1975.
[43] M. Frazier, B. Jawerth, and G. Weiss, Littlewood-Paley theory and the study of function spaces. American Mathematical
Society, 1991.
[44] K. Gröchenig, A. J. E. M. Janssen, N. Kaiblinger, and G. E. Pfander, “Note on B-Splines, wavelet scaling functions,
and Gabor frames,” IEEE Trans. Inf. Theory, vol. 49, no. 12, pp. 3318–3320, 2003.
[45] P. Vandergheynst, “Directional dyadic wavelet transforms: Design and algorithms,” IEEE Trans. Image Process., vol. 11,
no. 4, pp. 363–372, 2002.
[46] I. Waldspurger, “Exponential decay of scattering coefficients,” Proc. of International Conference on Sampling Theory and
Applications (SampTA), 2017, submitted.
[47] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?,”
in Proc. of IEEE International Conference on Computer Vision (ICCV), pp. 2146–2153, 2009.
[48] M. A. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with
applications to object recognition,” in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 1–8, 2007.
[49] H. Wendland, Scattered data approximation. Cambridge University Press, 2004.
[50] T. Runst and W. Sickel, Sobolev spaces of fractional order, Nemytskij operators, and nonlinear partial differential
equations, vol. 3. Walter de Gruyter, 1996.
[51] E. DiBenedetto, Real analysis. Birkhäuser, 2002.