ROBUST 1-BIT COMPRESSED SENSING AND SPARSE LOGISTIC
REGRESSION: A CONVEX PROGRAMMING APPROACH
YANIV PLAN AND ROMAN VERSHYNIN
Abstract. This paper develops theoretical results regarding noisy 1-bit compressed sensing and
sparse binomial regression. We demonstrate that a single convex program gives an accurate estimate
of the signal, or coefficient vector, for both of these models. We show that an s-sparse signal in
R^n can be accurately estimated from m = O(s log(n/s)) single-bit measurements using a simple
convex program. This remains true even if each measurement bit is flipped with probability nearly
1/2. Worst-case (adversarial) noise can also be accounted for, and uniform results that hold for
all sparse inputs are derived as well. In the terminology of sparse logistic regression, we show
that O(s log(2n/s)) Bernoulli trials are sufficient to estimate a coefficient vector in R^n which is
approximately s-sparse. Moreover, the same convex program works for virtually all generalized
linear models, in which the link function may be unknown. To our knowledge, these are the first
results that tie together the theory of sparse logistic regression to 1-bit compressed sensing. Our
results apply to general signal structures aside from sparsity; one only needs to know the size of
the set K where signals reside. The size is given by the mean width of K, a computable quantity
whose square serves as a robust extension of the dimension.
1. Introduction
1.1. One-bit compressed sensing. In modern data analysis, a pervasive challenge is to recover
extremely high-dimensional signals from seemingly inadequate amounts of data. Research in this
direction is being conducted in several areas including compressed sensing, sparse approximation
and low-rank matrix recovery. The key is to take into account the signal structure, which in essence
reduces the dimension of the signal space. In compressed sensing and sparse approximation, this
structure is sparsity—we say that a vector in Rn is s-sparse if it has s nonzero entries. In low-rank
matrix recovery, one restricts to matrices with low-rank.
The standard assumption in these fields is that one has access to linear measurements of the
form
(1.1)                y_i = ⟨a_i, x⟩,    i = 1, 2, . . . , m
where a1 , a2 , . . . , am ∈ Rn are known measurement vectors and x ∈ Rn is the signal to be recovered.
Typical compressed sensing results state that when ai are iid random vectors drawn from a certain
Date: Submitted February 2012; revised July 2012.
2000 Mathematics Subject Classification. 94A12; 60D05; 90C25.
Y.P. is supported by an NSF Postdoctoral Research Fellowship under award No. 1103909. R.V. is supported by
NSF grants DMS 0918623 and 1001829.
distribution (e.g. Gaussian), m ∼ s log(2n/s) measurements suffice for robust recovery of s-sparse
signals x, see [8].
In the recently introduced problem of 1-bit compressed sensing [5], the measurements are no
longer linear but rather consist of single bits. If there is no noise, the measurements are modeled
as
(1.2)                y_i = sign(⟨a_i, x⟩),    i = 1, 2, . . . , m
where sign(x) = 1 if x ≥ 0 and sign(x) = −1 if x < 0.¹ On top of this, noise may be introduced as
random or adversarial bit flips.
The 1-bit measurements are meant to model quantization in the extreme case. It is interesting
to note that when the signal to noise ratio is low, numerical experiments demonstrate that such
extreme quantization can be optimal [17] when constrained to a fixed bit budget. The webpage
http://dsp.rice.edu/1bitCS/ is dedicated to the rapidly growing literature on 1-bit compressed
sensing. Further discussion of this recent literature will be given in Section 3.1; we note for now
that this paper presents the first theoretical accuracy guarantees in the noisy problem using a
polynomial-time solver (given by a convex program).
1.2. Noisy one-bit measurements. We propose the following general model for noisy 1-bit compressed sensing. We assume that the measurements, or response variables, yi ∈ {−1, 1} are drawn
independently at random satisfying
(1.3)                E y_i = θ(⟨a_i, x⟩),    i = 1, 2, . . . , m
where θ is some function, which automatically must satisfy −1 ≤ θ(z) ≤ 1. A key point in our
results is that θ may be unknown or unspecified; one only needs to know the measurements yi and
the measurement vectors ai in order to recover x. Thus, there is an unknown non-linearity in
the measurements. See [4, 16] for earlier connections between the 1-bit problem and non-linear
measurements.
In compressed sensing it is typical to choose the measurement vectors ai at random, see [8]. In
this paper, we choose ai to be independent standard Gaussian random vectors in Rn . Although
this assumption can be relaxed to allow for correlated coordinates (see Section 3.4), discrete distributions are not permitted. Indeed, unlike traditional compressed sensing, accurate noiseless 1-bit
compressed sensing is provably impossible for some discrete distributions of ai (e.g. for Bernoulli
distribution, see [23]). Summarizing, the model (1.3) has two sources of randomness:
(1) the measurement vectors ai are independent standard Gaussian random vectors;
(2) given {ai }, the measurements yi are independent {−1, 1} valued random variables.
Note that (1.3) is the generalized linear model in statistics, and θ is known as the inverse of
the link function; the particular choice θ(z) = tanh(z/2) corresponds to logistic regression. The
¹ For concreteness, we set sign(0) = 1; this choice is arbitrary and could be replaced with sign(0) = −1.
statisticians may prefer to switch x with β, ai with xi , n with p and m with n, but we prefer to
keep our notation which is standard in compressed sensing.
Notice that in the noiseless 1-bit compressed sensing model (1.2), all information about the
magnitude of x is lost in the measurements. Similarly, in the noisy model (1.3) the magnitude of
x may be absorbed into the definition of θ. Thus, our goal will be to estimate the projection of x
onto the Euclidean sphere, x/ kxk2 . Without loss of generality, we thus assume that kxk2 = 1 in
most of our discussion that follows.
We shall make a single assumption on the function θ defining the model (1.3), namely that
(1.4)                E θ(g)g =: λ > 0
where g is a standard normal random variable. To see why this assumption is natural, notice that ⟨a_i, x⟩ ∼ N(0, 1) since a_i are standard Gaussian random vectors and ‖x‖_2 = 1; thus
E y_i ⟨a_i, x⟩ = E θ(g)g = λ.
Thus our assumption is simply that the 1-bit measurements y_i are positively correlated with the corresponding linear measurements ⟨a_i, x⟩.² Standard 1-bit compressed sensing (1.2) is a special case of model (1.3) with θ(z) = sign(z). In this case λ achieves its maximal value: λ = E |g| = √(2/π).
In general, λ plays a role similar to a signal to noise ratio.
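To make the model concrete, here is a small Python illustration of ours (not from the paper; NumPy only, with arbitrary sizes). It simulates measurements following (1.3) with the logistic choice θ(z) = tanh(z/2) and checks that the empirical correlation E y_i⟨a_i, x⟩ matches a Monte Carlo estimate of λ = E θ(g)g.

import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 20000
x = np.zeros(n)
x[:10] = 1.0
x /= np.linalg.norm(x)              # unit-norm, 10-sparse signal

theta = lambda z: np.tanh(z / 2.0)  # logistic link: E y_i = theta(<a_i, x>)

A = rng.standard_normal((m, n))     # rows a_i ~ N(0, I_n)
z = A @ x                           # linear measurements <a_i, x>
# draw y_i in {-1, 1} with E y_i = theta(z_i)
y = np.where(rng.random(m) < (1.0 + theta(z)) / 2.0, 1.0, -1.0)

# Monte Carlo estimate of lambda = E theta(g) g  (g standard normal)
g = rng.standard_normal(10**6)
lam_mc = np.mean(theta(g) * g)

# empirical correlation E y_i <a_i, x>, which should be close to lambda
lam_emp = np.mean(y * z)

print(f"lambda (Monte Carlo) = {lam_mc:.3f}, empirical E y<a,x> = {lam_emp:.3f}")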
1.3. The signal set. To describe the structure of possible signals, we assume that x lies in some set K ⊂ B_2^n, where B_2^n denotes the Euclidean unit ball in R^n. A key characteristic of the size of the signal set is its mean width w(K), defined as
(1.5)                w(K) := E sup_{x ∈ K−K} ⟨g, x⟩
where g is a standard Gaussian random vector in R^n and K − K denotes the Minkowski difference.³
The notion of mean width is closely related to that of the Gaussian complexity, which is widely
used in statistical learning theory to measure the size of classes of functions, see [3, 19]. An intuitive
explanation of the mean width, its basic properties and simple examples are given in Section 2.
The important point is that w(K)^2 can serve as the effective dimension of K.
The main example of interest is where K encodes sparsity. If K = K_{n,s} is the convex hull of the unit s-sparse vectors in R^n, the mean width of this set, computed in (2.2) and (3.3), is
(1.6)                w(K_{n,s}) ∼ (s log(2n/s))^{1/2}.
² If E θ(g)g < 0, we could replace y_i with −y_i to change the sign; thus our assumption is really that the correlation is non-zero: E θ(g)g ≠ 0.
³ Specifically, K − K = {x − y : x, y ∈ K}.
1.4. Main results. We propose the following solver to estimate the signal x from the 1-bit measurements y_i. It is given by the optimization problem
(1.7)                max ∑_{i=1}^m y_i ⟨a_i, x′⟩   subject to   x′ ∈ K.
This can be described even more compactly as
(1.8)                max ⟨y, Ax′⟩   subject to   x′ ∈ K
where A is the m × n measurement matrix with rows a_i and y = (y_1, . . . , y_m) is the vector of 1-bit measurements.
If the set K is convex, (1.7) is a convex program, and therefore it can be solved in an algorithmically efficient manner. This is the situation we will mostly care about, although our results below
apply for general, non-convex signal sets K as well.
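For readers who wish to experiment, the following Python sketch (ours; it assumes the cvxpy modeling package) states the program (1.8) for a convex signal set K passed as a list of constraints. The choice of K shown is the set K_{n,s} = B_2^n ∩ √s B_1^n introduced in Section 3; a full experiment with noisy measurements is sketched after Corollary 3.1 below.

import cvxpy as cp
import numpy as np

def estimate_signal(A, y, constraints_for):
    """Solve (1.8): maximize <y, A x'> subject to x' in K.

    `constraints_for` maps a cvxpy variable to a list of constraints
    describing the (convex) signal set K.
    """
    n = A.shape[1]
    xp = cp.Variable(n)
    problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(y, A @ xp))),
                         constraints_for(xp))
    problem.solve()
    return xp.value

# example choice of K: the set K_{n,s} = B_2^n ∩ sqrt(s) B_1^n used in Section 3
sparsity = 10
K_constraints = lambda v: [cp.norm(v, 2) <= 1, cp.norm(v, 1) <= np.sqrt(sparsity)]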
Theorem 1.1 (Fixed signal estimation, random noise). Let a_1, . . . , a_m be independent standard Gaussian random vectors in R^n, and let K be a subset of the unit Euclidean ball in R^n. Fix x ∈ K satisfying ‖x‖_2 = 1. Assume that the measurements y_1, . . . , y_m follow the model above.⁴ Then for each β > 0, with probability at least 1 − 4 exp(−2β^2) the solution x̂ to the optimization problem (1.7) satisfies
‖x̂ − x‖_2^2 ≤ (8 / (λ√m)) (w(K) + β).
As an immediate consequence, we see that the signal x ∈ K can be effectively estimated from m = O(w(K)^2) one-bit noisy measurements. The following result makes this statement precise.
Corollary 1.2 (Number of measurements). Let δ > 0 and suppose that
m ≥ Cδ^{-2} w(K)^2.
Then, under the assumptions of Theorem 1.1, with probability at least 1 − 8 exp(−cδ^2 m) the solution x̂ to the optimization problem (1.7) satisfies
‖x̂ − x‖_2^2 ≤ δ/λ.
Here and in the rest of the paper, C and c denote positive absolute constants whose values may
change from instance to instance.
Theorem 1.1 is concerned with an arbitrary but fixed signal x ∈ K, and with a stochastic model
on the noise in the measurements. We will show how to strengthen these results to cover all signals
x ∈ K uniformly, and to allow for a worst-case (adversarial) noise. Such noise can be modeled as
flipping some fixed percentage of arbitrarily chosen bits, and it can be measured using the Hamming distance d_H(ỹ, y) = ∑_{i=1}^m 1{ỹ_i ≠ y_i} between ỹ, y ∈ {−1, 1}^m.
We present the following theorem in the same style as Corollary 1.2.
⁴ Specifically, our assumptions are that y_i are {−1, 1}-valued random variables that are independent given {a_i}, and that (1.3) holds with some function θ satisfying (1.4).
Theorem 1.3 (Uniform estimation, adversarial noise). Let a_1, . . . , a_m be independent standard Gaussian random vectors in R^n, and let K be a subset of the unit Euclidean ball in R^n. Let δ > 0 and suppose that
(1.9)                m ≥ Cδ^{-6} w(K)^2.
Then with probability at least 1 − 8 exp(−cδ^2 m), the following event occurs. Consider a signal x ∈ K satisfying ‖x‖_2 = 1 and its (unknown) uncorrupted 1-bit measurements ỹ = (ỹ_1, . . . , ỹ_m) given as
ỹ_i = sign(⟨a_i, x⟩),    i = 1, 2, . . . , m.
Let y = (y_1, . . . , y_m) ∈ {−1, 1}^m be any (corrupted) measurements satisfying d_H(ỹ, y) ≤ τm. Then the solution x̂ to the optimization problem (1.7) with input y satisfies
(1.10)                ‖x̂ − x‖_2^2 ≤ δ√(log(e/δ)) + 11τ√(log(e/τ)).
This uniform result will follow from a deeper analysis than the fixed-signal result, Theorem 1.1.
Its proof will be based on the recent results from [22] on random hyperplane tessellations of K.
Remark 1.4 (Sparse estimation). A remarkable example is for s-sparse signals in Rn . Recalling
the mean width estimate (1.6), we see that our results above imply that an s-sparse signal in Rn can
be effectively estimated from m = O(s log(2n/s)) one-bit noisy measurements. We will make this
statement precise in Corollary 3.1 and the remark after it.
Remark 1.5 (Hamming cube encoding and decoding). Let us put Theorem 1.3 in the context of
coding in information theory. In the earlier paper [22] we proved that K ∩ S n−1 can be almost
isometrically embedded into the Hamming cube {−1, 1}m , with the same m and same probability
bound as in Theorem 1.3. Specifically, one has
(1.11)                | (1/π) d_G(x, x′) − (1/m) d_H(sign(Ax), sign(Ax′)) | ≤ δ
for all x, x′ ∈ K ∩ S^{n−1}. Above, d_G and d_H denote the geodesic distance in S^{n−1} and the Hamming distance in {−1, 1}^m respectively, see Theorem 6.3 below. Thus the embedding K ∩ S^{n−1} → {−1, 1}^m is given by the map⁵
x ↦ sign(Ax).
This map encodes a given signal x ∈ K into a binary string y = sign(Ax). Conversely, one can
accurately and robustly decode x from y by solving the optimization problem (1.7). This is the
content of Theorem 1.3.
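The almost-isometry (1.11) is easy to probe numerically. The following Python snippet (our illustration, with arbitrarily chosen dimensions) compares the scaled geodesic distance (1/π) d_G(x, x′) with the normalized Hamming distance (1/m) d_H(sign(Ax), sign(Ax′)) for a few random pairs of sparse unit vectors.

import numpy as np

rng = np.random.default_rng(2)
n, m, s = 500, 4000, 10

def random_sparse_unit(rng, n, s):
    # s-sparse vector on the unit sphere
    x = np.zeros(n)
    idx = rng.choice(n, size=s, replace=False)
    x[idx] = rng.standard_normal(s)
    return x / np.linalg.norm(x)

A = rng.standard_normal((m, n))
for _ in range(3):
    x, xp = random_sparse_unit(rng, n, s), random_sparse_unit(rng, n, s)
    geodesic = np.arccos(np.clip(x @ xp, -1.0, 1.0)) / np.pi   # (1/pi) d_G(x, x')
    hamming = np.mean(np.sign(A @ x) != np.sign(A @ xp))       # (1/m) d_H
    print(f"(1/pi) d_G = {geodesic:.3f}   (1/m) d_H = {hamming:.3f}")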
Remark 1.6 (Optimality). While the dependence of m on the mean width w(K) in the results
above seems to be optimal (see [22] for a discussion), the dependence on the accuracy δ in Theorem
1.3 is most likely not optimal. We are not trying to optimize dependence on δ in this paper, but
⁵ The sign function is applied to each coordinate of Ax.
are leaving this as an open problem. Nevertheless, in some cases, the dependence on δ in Theorem
1.1 is optimal; see Section 3.1 below.
Theorem 1.3 can be extended to allow for a random noise together with adversarial noise; this
is discussed in Remark 4.4 below.
1.5. Organization. An intuitive discussion of the mean width along with the estimate (1.6) of
the mean width when K encodes sparse vectors is given in Section 2. In Section 3 we specialize
our results to a variety of (approximately) sparse signal models—1-bit compressed sensing, sparse
logistic regression and low-rank matrix recovery. In Subsection 3.4, we extend our results to allow
for correlations in the entries of the measurement vectors.
The proofs of our main results, Theorems 1.1 and 1.3, are given in Sections 4—6. In Section 4
we quickly reduce these results to the two concentration inequalities that hold uniformly over the
set K—Propositions 4.2 and 4.3 respectively. Proposition 4.2 is proved in Section 5 using standard
techniques of probability in Banach spaces. The proof of Proposition 4.3 is deeper; it is based on
the recent work of the authors [22] on random hyperplane tessellations. The argument is given in
Section 6.
1.6. Notation. We write a ∼ b if ca ≤ b ≤ Ca for some positive absolute constants c, C (a and
b may have dimensional dependence). In order to increase clarity, vectors are written in lower
case bold italics (e.g., g), and matrices are upper case bold italics (e.g, A). We let g denote a
standard Gaussian random vector whose length will be clear from context; g denotes a standard
normal random variable. C, c will denote positive absolute constants whose values may change from
instance to instance. Given a vector v in Rn and a subset T ⊂ {1, . . . , n}, we denote by vT ∈ RT
the restriction of v onto the coordinates in T .
B2n and S n−1 denote the unit Euclidean ball and sphere in Rn respectively, and B1n denotes the
unit ball with respect to ℓ1 norm. The Euclidean and ℓ1 norms of a vector v are denoted kvk2 and
kvk1 respectively. The number of non-zero entries of v is denoted kvk0 . The operator norm (the
largest singular value) of a matrix A is denoted kAk.
2. Mean width and sparsity
2.1. Mean width. In this section we explain the geometric meaning of the mean width of a set
K ⊂ Rn which was defined by the formula (1.5), and discuss its basic properties and examples.
The notion of mean width plays a significant role in asymptotic convex geometry (see e.g. [9]).
The width of K in the direction of η ∈ S n−1 is the smallest width of the slab between two parallel
hyperplanes with normals η that contains K. Analytically, the width can be expressed as
sup_{u∈K} ⟨η, u⟩ − inf_{v∈K} ⟨η, v⟩ = sup_{x∈K−K} ⟨η, x⟩,
see Figure 1. Averaging over η uniformly distributed in S^{n−1}, we obtain the spherical mean width:
w̃(K) := E sup_{x∈K−K} ⟨η, x⟩.
Figure 1. Width of a set K in the direction of η is illustrated by the dashed line.
Instead of averaging using η ∈ S n−1 , it is often more convenient to use a standard Gaussian
random vector g ∈ Rn . This gives the definition (1.5) of the Gaussian mean width of K:
w(K) := E sup_{x∈K−K} ⟨g, x⟩.
In this paper we shall use the Gaussian mean width, which we call the “mean width” for brevity.
Note that the spherical and Gaussian versions of mean width are proportional to each other. Indeed, by rotation invariance we can realize η as η = g/‖g‖_2 and note that η is independent of the magnitude factor ‖g‖_2. It follows that w(K) = E ‖g‖_2 · w̃(K). Further, one can use that E ‖g‖_2 ∼ √n and obtain the useful comparison of Gaussian and spherical versions of mean width:
w(K) ∼ √n · w̃(K).
Let us record some further simple but useful properties of the mean width.
Proposition 2.1 (Mean width). The mean width of a subset K ⊂ Rn has the following properties.
1. The mean width is invariant under orthogonal transformations and translations.
2. The mean width is invariant under taking the convex hull, i.e. w(K) = w(conv(K)).
3. We have
w(K) = E sup_{x∈K−K} |⟨g, x⟩|.
4. Denoting the diameter of K in the Euclidean metric by diam(K), we have
√(2/π) diam(K) ≤ w(K) ≤ n^{1/2} diam(K).
5. We have
w(K) ≤ 2 E sup_{x∈K} ⟨g, x⟩ ≤ 2 E sup_{x∈K} |⟨g, x⟩|.
For an origin-symmetric set K, both these inequalities become equalities.
6. The inequalities in part 5 can be essentially reversed for arbitrary K:
w(K) ≥ E sup_{x∈K} |⟨g, x⟩| − √(2/π) dist(0, K).
Here dist(0, K) = inf_{x∈K} ‖x‖_2 is the Euclidean distance from the origin to K. In particular, if 0 ∈ K then one has w(K) ≥ E sup_{x∈K} |⟨g, x⟩|.
Proof. Parts 1, 2 and 5 are obvious by definition; part 3 follows by the symmetry of K − K.
To prove part 4, note that for every x_0 ∈ K − K one has
(2.1)                w(K) ≥ E |⟨g, x_0⟩| = √(2/π) ‖x_0‖_2.
The equality here follows because ⟨g, x_0⟩ is a normal random variable with variance ‖x_0‖_2^2. This yields the lower bound in part 4. For the upper bound, we can use part 3 along with the bound |⟨g, x⟩| ≤ ‖g‖_2 ‖x‖_2 ≤ ‖g‖_2 · diam(K) for all x ∈ K − K. This gives
w(K) ≤ E ‖g‖_2 · diam(K) ≤ (E ‖g‖_2^2)^{1/2} diam(K) = n^{1/2} diam(K).
To prove part 6, let us start with the special case where 0 ∈ K. Then K − K ⊃ K ∪ (−K), thus w(K) ≥ E sup_{x∈K∪(−K)} ⟨g, x⟩ = E sup_{x∈K} |⟨g, x⟩| as claimed. Next, consider a general case. Fix x_0 ∈ K and apply the previous reasoning to the set K − x_0 ∋ 0. Using parts 1 and 3 we obtain
w(K) = w(K − x_0) ≥ E sup_{x∈K} |⟨g, x − x_0⟩| ≥ E sup_{x∈K} |⟨g, x⟩| − E |⟨g, x_0⟩|.
Finally, as in (2.1) we note that E |⟨g, x_0⟩| = √(2/π) ‖x_0‖_2. Minimizing over x_0 ∈ K completes the proof. □
Example. For illustration, let us evaluate the mean width of some sets K ⊆ B_2^n.
1. If K = B_2^n or K = S^{n−1} then w(K) = E ‖g‖_2 ≤ (E ‖g‖_2^2)^{1/2} = √n (and in fact w(K) ∼ √n).
2. If the linear algebraic dimension dim(K) = k then w(K) ≤ √k.
3. If K is a finite set, then w(K) ≤ C √(log |K|).
Remark 2.2 (Effective dimension). The square of the mean width, w(K)2 , may be interpreted as
the effective dimension of a set K ⊆ B2n . It is always bounded by the linear algebraic dimension
(see the example above), but it has the advantage of robustness—a small perturbation of K leads to
a small change in w(K)2 .
In this light, the invariance of the mean width under taking the convex hull (Proposition 2.1,
part 2) is especially useful in compressed sensing, where a usual tactic is to relax the non-convex
program to a convex program. It is important that in the course of this relaxation, the “effective
dimension” of the signal set K remains the same.
Mean width of a given set K can be computed using several tools from probability in Banach
spaces. These include Dudley’s inequality, Sudakov minoration, the Gaussian concentration inequality, Slepian’s inequality and the sharp technique of majorizing measures and generic chaining
[19, 26].
2.2. Sparse signal set. The quintessential signal structure considered in this paper is sparsity.
Thus for given n ∈ ℕ and 0 < s ≤ n, we consider the set
S_{n,s} = {x ∈ R^n : ‖x‖_0 ≤ s, ‖x‖_2 ≤ 1}.
In words, Sn,s consists of s-sparse (or sparser) vectors with length n whose Euclidean norm is
bounded by 1.
Although the linear algebraic dimension of Sn,s is n (as this set spans Rn ), the dimension of
Sn,s ∩ {x ∈ Rn : kxk0 = s} as a manifold with boundary embedded in Rn is s.6 It turns out that the
“effective dimension” of Sn,s given by the square of its mean width is much closer to the manifold
dimension s than to the linear algebraic dimension n:
Lemma 2.3 (Mean width of the sparse signal set). We have
(2.2)                c s log(2n/s) ≤ w^2(S_{n,s}) ≤ C s log(2n/s).
Proof. Let us prove the upper bound. Without loss of generality we can assume that s ∈ ℕ. By representing S_{n,s} as the union of \binom{n}{s} s-dimensional unit Euclidean balls we see that
w(S_{n,s}) = E max_{|T|=s} ‖g_T‖_2.
For each T, the Gaussian concentration inequality (see Theorem 5.2 below) yields
P {‖g_T‖_2 ≥ E ‖g_T‖_2 + t} ≤ exp(−t^2/2),    t > 0.
Next, E ‖g_T‖_2 ≤ (E ‖g_T‖_2^2)^{1/2} = √s. Thus the union bound gives
P { max_{|T|=s} ‖g_T‖_2 ≥ √s + t } ≤ \binom{n}{s} exp(−t^2/2)
for t > 0. Note that \binom{n}{s} ≤ exp(s log(en/s)); integrating gives the desired upper bound in (2.2).
The lower bound in (2.2) follows from Sudakov minoration (see Theorem 6.1) combined with finding a tight lower bound on the covering number of S_{n,s}. Since the lower bound will not be used in this paper, we leave the details to the interested reader. □
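To make the estimate (2.2) concrete, here is a small Monte Carlo experiment of ours in Python (not part of the paper; NumPy only, arbitrary parameter choices). It estimates E max_{|T|=s} ‖g_T‖_2 by taking the s largest-magnitude coordinates of a Gaussian vector, and compares the square of this quantity with s log(2n/s).

import numpy as np

rng = np.random.default_rng(3)

def mean_width_sparse(n, s, trials=2000):
    """Monte Carlo estimate of E max_{|T|=s} ||g_T||_2 for g ~ N(0, I_n).

    The maximizing support T consists of the s largest |g_i|, so the sup is
    the Euclidean norm of the s largest-magnitude coordinates of g.
    """
    vals = []
    for _ in range(trials):
        g = rng.standard_normal(n)
        top = np.sort(np.abs(g))[-s:]
        vals.append(np.linalg.norm(top))
    return np.mean(vals)

for n, s in [(1000, 5), (1000, 20), (5000, 20)]:
    w = mean_width_sparse(n, s)
    print(f"n={n:5d} s={s:3d}  w^2 ≈ {w**2:7.1f}   s*log(2n/s) = {s*np.log(2*n/s):7.1f}")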
3. Applications to sparse signal models
Our main results stated in the introduction are valid for general signal sets K. Now we specialize
to the cases where K encodes sparsity. It would be ideal if we could take K = Sn,s , but this set
would not be convex and thus the solver (1.7) would not be known to run in polynomial time.
We instead take a convex relaxation of S_{n,s}, an effective tactic from the sparsity literature. Notice that if x ∈ S_{n,s} then ‖x‖_1 ≤ √s by the Cauchy–Schwarz inequality. This motivates us to consider the convex set
K_{n,s} = {x ∈ R^n : ‖x‖_2 ≤ 1, ‖x‖_1 ≤ √s} = B_2^n ∩ √s B_1^n.
⁶ Thus S_{n,s} is the union of s + 1 manifolds with boundary, each of whose dimension is bounded by s.
K_{n,s} is almost exactly the convex hull of S_{n,s}, as is shown in [23]:
(3.1)                conv(S_{n,s}) ⊂ K_{n,s} ⊂ 2 conv(S_{n,s}).
K_{n,s} can be thought of as a set of approximately sparse or compressible vectors.
If the signal is known to be exactly or approximately sparse, i.e. x ∈ Kn,s , we may estimate x
by solving the convex program
(3.2)                max ∑_{i=1}^m y_i ⟨a_i, x′⟩   subject to   ‖x′‖_1 ≤ √s and ‖x′‖_2 ≤ 1.
This is just a restatement of the program (1.7) for the set K_{n,s}. In our convex relaxation, we do not require that x̂ ∈ S^{n−1}; this stands in contrast to many previous programs considered in the literature. Nevertheless, the accuracy of the solution x̂ and the fact that ‖x‖_2 = 1 imply that ‖x̂‖_2 ≈ 1.
Theorems 1.1 and 1.3 should guarantee that x can indeed be estimated by a solution to (3.2). But in order to apply these results, we need to know the mean width of K_{n,s}. A good bound for it follows from (3.1) and Lemma 2.3, which give
(3.3)                w(K_{n,s}) ≤ 2 w(conv(S_{n,s})) ≤ C √(s log(2n/s)).
This yields the following version of Corollary 1.2.
Corollary 3.1 (Estimating a compressible signal). Let a_1, . . . , a_m be independent standard Gaussian random vectors in R^n, and fix x ∈ K_{n,s} satisfying ‖x‖_2 = 1. Assume that the measurements y_1, . . . , y_m follow the model from Section 1.3.⁷ Let δ > 0 and suppose that
m ≥ Cδ^{-2} s log(2n/s).
Then, with probability at least 1 − 8 exp(−cδ^2 m), the solution x̂ to the convex program (3.2) satisfies
‖x̂ − x‖_2^2 ≤ δ/λ.
Remark 3.2. In a similar way, one can also specialize the uniform result, Theorem 1.3, to the
approximately sparse case.
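As an illustration of Corollary 3.1, the following Python sketch (ours; it assumes the cvxpy package and uses arbitrary parameter choices) generates 1-bit measurements of a sparse unit vector, flips each bit independently with probability 0.1, solves the convex program (3.2), and reports the squared estimation error.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, m, s, flip_prob = 300, 2000, 8, 0.1

# s-sparse signal on the unit sphere
x = np.zeros(n)
x[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x /= np.linalg.norm(x)

A = rng.standard_normal((m, n))
y = np.sign(A @ x)
y *= np.where(rng.random(m) < flip_prob, -1.0, 1.0)   # random bit flips

# convex program (3.2)
x_var = cp.Variable(n)
prob = cp.Problem(cp.Maximize(cp.sum(cp.multiply(y, A @ x_var))),
                  [cp.norm(x_var, 1) <= np.sqrt(s), cp.norm(x_var, 2) <= 1])
prob.solve()

print("squared error:", np.linalg.norm(x_var.value - x) ** 2)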
3.1. 1-bit compressed sensing. Corollary 3.1 can be easily specialized to various specific models
of noise. Let us consider some of the interesting models, and compute the correlation coefficient
λ = E θ(g)g in (1.4) for each of them.
Noiseless 1-bit compressed sensing: In the classic noiseless model (1.2), the measurements are given as y_i = sign(⟨a_i, x⟩) and thus θ(z) = sign(z). Thus
λ = E |g| = √(2/π).
Therefore, with high probability we obtain ‖x̂ − x‖_2^2 ≤ δ provided that the number of measurements is m ≥ Cδ^{-2} s log(2n/s). This is similar to the results available in [23].
⁷ Specifically, our assumptions were that the y_i are {−1, 1}-valued random variables that are independent given {a_i}, and that (1.3) holds with some function θ satisfying (1.4).
Random bit flips: Assume that each measurement y_i is only correct with probability p, thus
y_i = ξ_i sign(⟨a_i, x⟩),    i = 1, 2, . . . , m
where ξ_i are independent {−1, 1}-valued random variables with P {ξ_i = 1} = p, which represent random bit flips. Then θ(z) = sign(z) · E ξ_1 = 2 sign(z)(p − 1/2) and
λ = 2(p − 1/2) E |g| = 2√(2/π) (p − 1/2).
Therefore, with high probability we obtain ‖x̂ − x‖_2^2 ≤ δ provided that the number of measurements is m ≥ Cδ^{-2}(p − 1/2)^{-2} s log(2n/s). Thus we obtain a surprising conclusion:
The signal x can be estimated even if each measurement is flipped with probability nearly 1/2.
Somewhat surprisingly, the estimation of x is done by one simple convex program (3.2).
Of course, if each measurement is corrupted with probability 1/2, recovery is impossible by
any algorithm.
Random noise before quantization: Assume that the measurements are given as
y_i = sign(⟨a_i, x⟩ + ν_i),    i = 1, 2, . . . , m
where ν_i are iid random variables representing noise added before quantization. This situation is typical in analog-to-digital converters. It is also the latent variable model from statistics.
Assume for simplicity that ν_i have density f(x). Then θ(z) = 1 − 2 P {ν_i ≤ −z}, and the correlation coefficient λ = E θ(g)g can be evaluated using integration by parts, which gives
λ = E θ′(g) = 2 E f(−g) > 0.
A specific value of λ is therefore not hard to estimate for concrete densities f. For instance, if ν_i are normal random variables with mean zero and variance σ^2, then
λ = E [ √(2/(πσ^2)) exp(−g^2/(2σ^2)) ] = √( 2 / (π(σ^2 + 1)) ).
(A numerical sanity check of this value of λ is sketched below.) Therefore, with high probability we obtain ‖x̂ − x‖_2^2 ≤ δ provided that the number of measurements is m ≥ Cδ^{-2}(σ^2 + 1) s log(2n/s). In other words,
(3.4)                ‖x̂ − x‖_2^2 ≤ C √( (σ^2 + 1) s log(2n/s) / m ).
Thus we obtain an unexpected conclusion:
The signal x can be estimated even when the noise level σ eclipses the magnitude of the linear measurements.
Indeed, the average magnitude of the linear measurements is E |⟨a_i, x⟩| = √(2/π), while the average noise level σ can be much larger.
Let us also compare to the results available in the standard unquantized compressed sensing model
y_i = ⟨a_i, x⟩ + ν_i,    i = 1, 2, . . . , m
where once again we take ν_i ∼ N(0, σ^2). Under the assumption that ‖x‖_1 ≤ √s, the minimax squared error given by [24, Theorem 1] is δ = cσ √( s log n / m ). A slight variation on their proof yields a minimax squared error, under the assumption that m < n and x ∈ K_{n,s} ∩ S^{n−1}, of δ = cσ √( s log(2n/s) / m ). Up to a constant, this matches the upper bound
we have just derived in the 1-bit case in Equation (3.4). Thus we have another surprising
result:
The error in recovering the signal x matches the minimax error for the unquantized
compressed sensing problem (up to a constant): Essentially nothing is lost by
quantizing to a single bit.
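As promised above, here is a short numerical check of ours (NumPy only, arbitrary parameters) that pre-quantization Gaussian noise indeed yields λ = √(2/(π(σ^2 + 1))). It estimates λ = E y_i⟨a_i, x⟩ directly from simulated measurements and compares it with the closed form.

import numpy as np

rng = np.random.default_rng(5)
n, m = 100, 50000
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                    # unit-norm signal

A = rng.standard_normal((m, n))
z = A @ x                                 # <a_i, x> ~ N(0, 1)

for sigma in [0.5, 1.0, 3.0]:
    y = np.sign(z + sigma * rng.standard_normal(m))   # quantize after adding noise
    lam_emp = np.mean(y * z)              # estimates lambda = E theta(g) g
    lam_formula = np.sqrt(2.0 / (np.pi * (sigma**2 + 1.0)))
    print(f"sigma={sigma}: empirical {lam_emp:.3f} vs formula {lam_formula:.3f}")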
Let us put these results in a perspective of the existing literature on 1-bit compressed sensing.
The problem of 1-bit compressed sensing, as introduced by Boufounos and Baraniuk in [5], is
the extreme version of quantized compressed sensing; it is particularly beneficial to consider 1-bit
measurements in analog-to-digital conversion (see the webpage http://dsp.rice.edu/1bitCS/).
Several numerical results are available, and there are a few recent theoretical results as well.
Suppose that x ∈ Rn is s-sparse. Gupta et al. [13] demonstrate that the support of x can
tractably be recovered from either 1) O(s log n) nonadaptive measurements assuming a constant
dynamic range of x (i.e. the magnitude of all nonzero entries of x is assumed to lie between two
constants), or 2) O(s log n) adaptive measurements. Jacques et al. [14] demonstrate that any
consistent estimate of x will be accurate provided that m ≥ O(s log n). Here consistent means that
the estimate x̂ should have unit norm, be at least as sparse as x, and agree with the measurements,
i.e. sign(hai , x̂i) = sign(hai , xi) for all i. These results of Jacques et al. [14] can be extended to
handle adversarial bit flips. The difficulty in applying these results is that the first two conditions
are nonconvex, and thus it is unknown whether there is a polynomial-time solver which is guaranteed
to return a consistent solution. We note that there are heuristic algorithms, including one in [14], which often provide such a solution in simulations.
In a dual line of research, Gunturk et al. [11, 12] analyze sigma-delta quantization. The focus of their results is to achieve an excellent dependence on the accuracy δ while minimizing the number
of bits per measurement. However the measurements yi in sigma-delta quantization are not related
to any linear measurements (unlike those in (1.2) and (1.3)) but are allowed to be constructed in a
judicious fashion (e.g. iteratively). Furthermore, in Gunturk et al. [11, 12] the number of bits per
measurement depends on the dynamic range of the nonzero part of x. Similarly, the recent work
of Ardestanizadeh et al. [1] requires a finite number of bits per measurement.
The noiseless 1-bit compressed sensing given by the model (1.2) was considered by the present
authors in the earlier paper [23], where the following convex program was introduced:
min ‖x′‖_1   subject to   y_i = sign(⟨a_i, x′⟩), i = 1, 2, . . . , m,   and   ∑_{i=1}^m y_i ⟨a_i, x′⟩ = m.
This program was shown in [23] to accurately recover an s-sparse vector x from m = O(s log(n/s)2 )
measurements yi . This result was the first to propose a polynomial-time solver for 1-bit compressed
sensing with provable accuracy guarantees. However, it was unclear how to modify the above convex
program to account for possible noise.
The present paper proposes to overcome this difficulty by considering the convex program (3.2)
(and in the most general case, the optimization problem (1.7)). One may note that the program
(3.2) requires the knowledge of a bound on the (approximate) sparsity level s. In return, it does
not need to be adjusted depending on the kind of noise or level of noise.
3.2. Sparse logistic regression. In order to give concrete results accessible to the statistics
community, we now specialize Corollary 1.2 to the logistic regression model. Further, we drop
the assumption that kxk2 = 1 in this section; this will allow easier comparison with the related
literature (see below).
The simple logistic function is defined as
(3.5)                f(z) = e^z / (e^z + 1).
In the logistic regression model, the observations yi ∈ {−1, 1} are iid random variables satisfying
(3.6)                P {y_i = 1} = f(⟨a_i, x⟩),    i = 1, 2, . . . , m.
Note that this is a special case of the generalized linear model (1.3) with θ(z) = tanh(z/2). We
thus have the following specialization of Corollary 1.2.
Corollary 3.3 (Sparse logistic regression). Let a_1, . . . , a_m be independent standard Gaussian random vectors in R^n, and fix x satisfying x/‖x‖_2 ∈ K_{n,s}. Assume that the observations y_1, . . . , y_m follow the logistic regression model (3.6). Let δ > 0 and suppose that
m ≥ Cδ^{-2} s log(2n/s).
Then, with probability at least 1 − 8 exp(−cδ^2 m), the solution x̂ to the convex program (3.2) satisfies
(3.7)                ‖ x̂ − x/‖x‖_2 ‖_2^2 ≤ δ max(‖x‖_2^{-1}, 1).
Proof. We begin by reducing to the case when ‖x‖_2 = 1 by rescaling the logistic function. Thus, let α = ‖x‖_2 and define the scaled logistic function f_α(x) = f(αx). In particular,
P {y_i = 1} = f_α( ⟨a_i, x/‖x‖_2⟩ ).
To apply Corollary 3.1, it suffices to compute the correlation coefficient λ in (1.4). First, by rescaling f we have also rescaled θ, so we consider θ(z) = tanh(αz/2). We can now compute λ using integration by parts:
λ = E θ(g)g = E θ′(g) = (α/2) E sech^2(αg/2).
To further bound this quantity below, we can use the fact that sech^2(x) is an even and decreasing function for x ≥ 0. This yields
λ ≥ (α/2) P {|αg/2| ≤ 1/2} · sech^2(1/2) ≥ (sech^2(1/2)/2) · α · P {|g| ≤ 1/α} ≥ (1/6) min(α, 1).
The result follows from Corollary 3.1 since α = ‖x‖_2. □
Remark 3.4. Corollary 3.3 allows one to estimate the projection of x onto the unit sphere. One
may ask whether the norm of x may be estimated as well. This depends on the assumptions made
(see the literature described below). However, note that as kxk2 grows, the logistic regression model
quickly approaches the noiseless 1-bit compressed sensing model, in which knowledge of kxk2 is lost
in the measurements. Thus, since we do not assume that kxk2 is bounded, recovery of kxk2 becomes
impossible.
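The lower bound λ ≥ min(‖x‖_2, 1)/6 used in the proof above is easy to check numerically. The following short Python snippet (ours, purely illustrative) Monte Carlo estimates λ = (α/2) E sech^2(αg/2) for several values of α = ‖x‖_2.

import numpy as np

rng = np.random.default_rng(6)
g = rng.standard_normal(10**6)

for alpha in [0.1, 0.5, 1.0, 2.0, 10.0]:
    # (alpha/2) E sech^2(alpha g / 2)
    lam = np.mean((alpha / 2.0) / np.cosh(alpha * g / 2.0) ** 2)
    print(f"alpha={alpha:5.1f}  lambda ≈ {lam:.3f}  lower bound min(alpha,1)/6 = {min(alpha, 1)/6:.3f}")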
For concreteness, we specialized to logistic regression. But as mentioned in the introduction,
the model (1.3) can be interpreted as the generalized linear model, so our results can be readily
used for various problems in sparse binomial regression. Some of the recent work in sparse binomial
regression includes the papers [21, 6, 27, 2, 25, 20, 15]. Let us point to the most directly comparable
results.
In [2, 6, 15, 21] the authors propose to estimate the coefficient vector (which in our notation
is x) by minimizing the negative log-likelihood plus an extra ℓ1 regularization term. Bunea [6]
considers the logistic regression model. She derives an accuracy bound for the estimate (in the ℓ1
norm) under a certain condition called Stabil and under a bound on the magnitude of the entries of x.
Similarly, Bach [2] and Kakade et al. [15] derive accuracy bounds (again in the ℓ1 norm) under
restrictive eigenvalue conditions. The most directly comparable result is given by Negahban et al.
[21]. There the authors show that if the measurement vectors ai have independent subgaussian
entries, kxk0 ≤ s, and kxk2 ≤ 1, then with high probability one has kx̂ − xk22 ≤ δ, provided that
the number of measurements is m ≥ Cδ−1 s log n. Their results apply to the generalized linear
model (1.3) under some assumptions on θ.
One main novelty in this paper is that knowledge of the function θ, which defines the model
family, is completely unnecessary when recovering the coefficient vector. Indeed, the optimization
problems (1.7) and (3.2) do not need to know θ. This stands in contrast to programs based on
maximum likelihood estimation. This may be of interest in non-parametric statistical applications
in which it is unclear which binary model to pick; the logistic model may be chosen somewhat
arbitrarily.
Another difference between our results and those above is in the conditions required. The above
papers allow for more general design matrices than those in this paper, but this necessarily leads
to strong assumptions on ‖x‖_2. As the inner products ⟨a_i, x⟩ grow large, the logistic regression model approaches the 1-bit compressed sensing model. However, as shown in [23], accurate 1-bit compressed sensing is impossible for discrete measurement ensembles (not only is it impossible to recover x, it is also impossible to recover x/‖x‖_2). Thus the above results, all of which do allow for discrete measurement ensembles, necessitate rather strong conditions on
the magnitude of hai , xi, or equivalently, on kxk2 ; these are made explicitly in [6, 15, 21] and
implicitly in [2]. In contrast, our theoretical bounds on the relative error only improve as the
average magnitude of hai , xi increases.
3.3. Low-rank matrix recovery. We quickly mention that our model applies to single bit measurements of a low-rank matrix. Perhaps the closest practical application is quantum state tomography [10], but still, the requirement of Gaussian measurements is somewhat unrealistic. Thus, the
purpose of this section is to give an intuition and a benchmark.
Let X ∈ R^{n_1×n_2} be a matrix of interest with rank r and Frobenius norm ‖X‖_F = 1. Suppose that we have m single-bit measurements following the model in the introduction, so that n = n_1 n_2.
Similarly to sparse vectors, the set of low-rank matrices is not convex, but has a natural convex relaxation as follows. Let
K_{n_1,n_2,r} = {X ∈ R^{n_1×n_2} : ‖X‖_* ≤ √r, ‖X‖_F ≤ 1}
where ‖X‖_* denotes the nuclear norm, i.e., the sum of the singular values of X.
In order to apply Theorems 1.1 and 1.3, we only need to calculate w(K_{n_1,n_2,r}), as follows:
w(K_{n_1,n_2,r}) = 2 E sup_{X ∈ K_{n_1,n_2,r}} ⟨G, X⟩
where G is a matrix with standard normal entries and the inner product above is the standard entrywise inner product, i.e. ⟨G, X⟩ = ∑_{i,j} G_{i,j} X_{i,j}. Since the nuclear norm and operator norm are dual to each other, we have ⟨G, X⟩ ≤ ‖G‖ · ‖X‖_*. Further, for each X ∈ K_{n_1,n_2,r}, ‖X‖_* ≤ √r, and thus
w(K_{n_1,n_2,r}) ≤ 2√r E ‖G‖.
The expected norm of a Gaussian matrix is well studied; one has E ‖G‖ ≤ √n_1 + √n_2 (see e.g., [28, Theorem 5.32]). Thus, w(K_{n_1,n_2,r}) ≤ 2(√n_1 + √n_2)√r. It follows that O((n_1 + n_2)r) noiseless
1-bit measurements are sufficient to guarantee accurate recovery of rank-r matrices. We note that
this matches the number of linear (infinite bit precision) measurements required in the low-rank
matrix recovery literature (see [7]).
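To illustrate, here is a sketch of ours (not from the paper) of the program (1.7) over the set K_{n_1,n_2,r}, written in Python with the cvxpy package and its nuclear-norm atom; the sizes are deliberately small and the measurements are noiseless.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(7)
n1, n2, r, m = 12, 12, 2, 600

# random rank-r matrix, normalized to unit Frobenius norm
X_true = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
X_true /= np.linalg.norm(X_true, "fro")

# Gaussian measurement matrices G_i and 1-bit measurements y_i = sign(<G_i, X>)
G = rng.standard_normal((m, n1, n2))
y = np.sign(np.einsum("ijk,jk->i", G, X_true))

# sum_i y_i <G_i, X> = <sum_i y_i G_i, X>, so precompute M = sum_i y_i G_i
M = np.einsum("i,ijk->jk", y, G)

X = cp.Variable((n1, n2))
constraints = [cp.norm(X, "nuc") <= np.sqrt(r), cp.norm(X, "fro") <= 1]
cp.Problem(cp.Maximize(cp.sum(cp.multiply(M, X))), constraints).solve()

print("Frobenius error:", np.linalg.norm(X.value - X_true, "fro"))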
3.4. Extension to measurements with correlated entries. A commonly used statistical model
would take ai to be Gaussian vectors with correlated entries, namely ai ∼ N (0, Σ) where Σ is a
given covariance matrix. In this section we present an extension of our results to allow such
correlations. Let λmin = λmin (Σ) and λmax = λmax (Σ) denote the smallest and largest eigenvalues
of Σ; the condition number of Σ is then κ(Σ) = λmax (Σ)/λmin (Σ). It will be convenient to choose
the normalization ‖Σ^{1/2} x‖_2 = 1; as before, this may be done by absorbing a constant into the
definition of θ.
We propose the following generalization of the convex program (3.2):
(3.8)                max ∑_{i=1}^m y_i ⟨a_i, x′⟩   subject to   ‖x′‖_1 ≤ √(s/λ_min) and ‖Σ^{1/2} x′‖_2 ≤ 1.
The following result extends Corollary 3.1 to general covariance Σ. For simplicity we restrict
ourselves to exactly sparse signals; however the proof below allows for a more general signal set.
Corollary 3.5. Let a1 , . . . , am be independent random vectors with distribution N (0, Σ). Fix x
satisfying kxk0 ≤ s and kΣ1/2 xk2 = 1. Assume that the measurements y1 , . . . , ym follow the model
from Section 1.3. Let δ > 0 and suppose that
m ≥ Cκ(Σ) δ−2 s log(2n/s).
Then with probability at least 1 − 8 exp(−cδ2 m), the solution x̂ to the convex program (3.8) satisfies
λmin (Σ) · kx̂ − xk22 ≤ kΣ1/2 x̂ − Σ1/2 xk22 ≤ δ/λ.
Remark 3.6. Theorem 1.3 can be generalized in the same way—the number of measurements
required is scaled by κ(Σ) and the error bound is scaled by λmin (Σ)−1 .
Proof of Corollary 3.5. The feasible set in (3.8) is K := {x ∈ R^n : ‖x‖_1 ≤ √(s/λ_min(Σ)), ‖Σ^{1/2}x‖_2 ≤ 1}. Note that the signal x considered in the statement of the corollary is feasible, since
‖x‖_1 ≤ ‖x‖_2 √(‖x‖_0) ≤ ‖Σ^{−1/2}‖ · ‖Σ^{1/2}x‖_2 √s ≤ √(s/λ_min(Σ)).
Define ã_i := Σ^{−1/2} a_i; then ã_i are independent standard normal vectors and ⟨a_i, x⟩ = ⟨Σ^{1/2}ã_i, x⟩ = ⟨ã_i, Σ^{1/2}x⟩. Thus, it follows from Corollary 1.2, applied with Σ^{1/2}x replacing x, that if
m ≥ Cδ^{-2} w(Σ^{1/2}K)^2
then with probability at least 1 − 8 exp(−cδ^2 m)
‖Σ^{1/2}x̂ − Σ^{1/2}x‖_2^2 ≤ δ/λ.
It remains to bound w(Σ^{1/2}K). Since Σ^{1/2}/‖Σ^{1/2}‖ acts as a contraction, Slepian's inequality (see [19, Corollary 3.14]) gives w(Σ^{1/2}K) ≤ ‖Σ^{1/2}‖ · w(K) = λ_max(Σ)^{1/2} · w(K). Further, K ⊆ λ_min(Σ)^{−1/2} K_{n,s}. Thus, it follows from (3.3) that w(Σ^{1/2}K)^2 ≤ κ(Σ) w(K_{n,s})^2 ≤ Cκ(Σ) s log(2n/s). This completes the proof. □
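A brief numerical illustration of the whitening step ã_i = Σ^{−1/2} a_i used in this proof (our own sketch, with an arbitrarily chosen covariance matrix): correlated Gaussian measurements of x carry exactly the same inner products as standard Gaussian measurements of Σ^{1/2}x.

import numpy as np

rng = np.random.default_rng(8)
n, m = 50, 5
x = rng.standard_normal(n)

# an arbitrary covariance matrix and its symmetric square root Sigma^{1/2}
B = rng.standard_normal((n, n))
Sigma = B @ B.T + np.eye(n)
w, V = np.linalg.eigh(Sigma)
S_half = V @ np.diag(np.sqrt(w)) @ V.T

a_tilde = rng.standard_normal((m, n))   # standard Gaussian vectors a~_i
a = a_tilde @ S_half                    # a_i = Sigma^{1/2} a~_i  ~  N(0, Sigma)

# <a_i, x> = <a~_i, Sigma^{1/2} x> for every i
print(np.allclose(a @ x, a_tilde @ (S_half @ x)))   # True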
4. Deducing Theorems 1.1 and 1.3 from concentration inequalities
In this section we show how to deduce our main results, Theorems 1.1 and 1.3, from concentration
inequalities. These inequalities are stated in Propositions 4.2 and 4.3 below, whose proofs are
deferred to Sections 5 and 6 respectively.
4.1. Proof of Theorem 1.1. Consider the rescaled objective function from the program (1.7):
(4.1)                f_x(x′) = (1/m) ∑_{i=1}^m y_i ⟨a_i, x′⟩.
Here the subscript x indicates that f is a random function whose distribution depends on x through
yi . Note that the solution x̂ to the program (1.7) satisfies fx (x̂) ≥ fx (x), since x is feasible. We
claim that for any x′ ∈ K which is far away from x, the value fx (x′ ) is small with high probability.
Thus x̂ must be near to x.
To begin to substantiate this claim, let us calculate E f_x(x′) for a fixed vector x′ ∈ K.
Lemma 4.1 (Expectation). Fix x ∈ S^{n−1}, x′ ∈ B_2^n. Then
E f_x(x′) = λ⟨x, x′⟩
and thus
E[f_x(x) − f_x(x′)] = λ(1 − ⟨x, x′⟩) ≥ (λ/2) ‖x − x′‖_2^2.
Proof. We have
E f_x(x′) = (1/m) ∑_{i=1}^m E y_i ⟨a_i, x′⟩ = E y_1 ⟨a_1, x′⟩.
Now we condition on a_1 to give
E y_1 ⟨a_1, x′⟩ = E E[y_1 ⟨a_1, x′⟩ | a_1] = E θ(⟨a_1, x⟩) ⟨a_1, x′⟩.
Note that ⟨a_1, x⟩ and ⟨a_1, x′⟩ are a pair of normal random variables with covariance ⟨x, x′⟩. Thus, by taking g, h ∼ N(0, 1) to be independent, we may rewrite the above expectation as
E θ(g) [ ⟨x, x′⟩ g + (‖x′‖_2^2 − ⟨x, x′⟩^2)^{1/2} h ] = ⟨x, x′⟩ E θ(g)g = λ⟨x, x′⟩
where the last equality follows from (1.4). Lemma 4.1 is proved. □
Next we show that f_x(x′) does not deviate far from its expectation uniformly for all x′ ∈ K.
Proposition 4.2 (Concentration). For each t > 0, we have
P { sup_{z∈K−K} |f_x(z) − E f_x(z)| ≥ 4w(K)/√m + t } ≤ 4 exp(−mt^2/8).
This result is proved in Section 5 using standard techniques of probability in Banach spaces.
Theorem 1.1 is a direct consequence of this proposition.
Proof of Theorem 1.1. Let t > 0. By Proposition 4.2, the following event occurs with probability at least 1 − 4 exp(−mt^2/8):
sup_{z∈K−K} |f_x(z) − E f_x(z)| ≤ 4w(K)/√m + t.
Suppose the above event indeed occurs. Let us apply this inequality for z = x̂ − x ∈ K − K. By definition of x̂, we have f_x(x̂) ≥ f_x(x). Noting that the function f_x(z) is linear in z, we obtain
(4.2)                0 ≤ f_x(x̂) − f_x(x) = f_x(x̂ − x) ≤ E[f_x(x̂ − x)] + 4w(K)/√m + t ≤ −(λ/2) ‖x̂ − x‖_2^2 + 4w(K)/√m + t.
The last inequality follows from Lemma 4.1. Finally, we choose t = 4β/√m and rearrange terms to complete the proof of Theorem 1.1. □
4.2. Proof of Theorem 1.3. The argument is similar to that of Theorem 1.1 given above. We
consider the rescaled objective functions with corrupted and uncorrupted measurements:
f_x(x′) = (1/m) ∑_{i=1}^m y_i ⟨a_i, x′⟩,
(4.3)                f̃_x(x′) = (1/m) ∑_{i=1}^m ỹ_i ⟨a_i, x′⟩ = (1/m) ∑_{i=1}^m sign(⟨a_i, x⟩) ⟨a_i, x′⟩.
Arguing as in Lemma 4.1 (with θ(z) = sign(z)), we have
(4.4)                E f̃_x(x′) = λ⟨x, x′⟩ = E |g| · ⟨x, x′⟩ = √(2/π) ⟨x, x′⟩.
Similarly to the proof of Theorem 1.1, we now need to show that fx (x′ ) does not deviate far from
the expectation of f˜x (x′ ); but this time the result should hold uniformly over not only x′ ∈ K − K
but also x ∈ K and y with small Hamming distance to ỹ. This is the content of the following
proposition.
Proposition 4.3 (Uniform Concentration). Let δ > 0 and suppose that
m ≥ Cδ^{-6} w(K)^2.
Then with probability at least 1 − 8 exp(−cδ^2 m), we have
(4.5)                sup_{x,z,y} | f_x(z) − E f̃_x(z) | ≤ δ√(log(e/δ)) + 4τ√(log(e/τ))
where the supremum is taken over x ∈ K ∩ S^{n−1}, z ∈ K − K and y satisfying d_H(y, ỹ) ≤ τm.
This result is significantly deeper than Proposition 4.2. It is based on a recent geometric result
from [22] on random tessellations of sets on the sphere. The proof is given in Section 6.
Theorem 1.3 now follows from the same steps as in the proof of Theorem 1.1 above. □
Remark 4.4 (Random noise). The adversarial bit flips allowed in Theorem 1.3 can be combined
with random noise. We considered two models of random noise in Section 3.1. One was random bit
flips where one would take ỹi = ξi sign(hai , xi); here ξi are iid Bernoulli random variables satisfying
P {ξi = 1} = p. The proof of Theorem 1.3 would remain unchanged under this model, aside from
the calculation
E f_x(x′) = √(2/π) (2p − 1) ⟨x, x′⟩.
The end result is that the error bound in (1.10) would be divided by 2p − 1.
Another model considered in Section 3.1 was random noise before quantization. Thus we let
(4.6)                ỹ_i = sign(⟨a_i, x⟩ + g_i)
where gi ∼ N (0, σ 2 ) are iid. Once again, a slight modification of the proof of Theorem 1.3 allows
the incorporation of such noise. Note that the above model is equivalent to ỹi = sign(hãi , x̃i) where
ãi = (ai , gi ) and x̃ = (x, σ) (we have concatenated an extra entry onto ai and x). Thus, by slightly
adjusting the set K, we are back in our original model.
5. Concentration: proof of Proposition 4.2
Here we prove the concentration inequality given by Proposition 4.2.
5.1. Tools: symmetrization and Gaussian concentration. The argument is based on the
standard techniques of probability in Banach spaces—symmetrization and the Gaussian concentration inequality. Let us recall both these tools.
Lemma 5.1 (Symmetrization). Let ε1 , ε2 , . . . , εm be independent Rademacher random variables.8
Then
(5.1)                µ := E sup_{z∈K−K} |f_x(z) − E f_x(z)| ≤ 2 E sup_{z∈K−K} | (1/m) ∑_{i=1}^m ε_i y_i ⟨a_i, z⟩ |.
Furthermore, we have the deviation inequality
(5.2)                P { sup_{z∈K−K} |f_x(z) − E f_x(z)| ≥ 2µ + t } ≤ 4 P { sup_{z∈K−K} | (1/m) ∑_{i=1}^m ε_i y_i ⟨a_i, z⟩ | > t/2 }.
Inequality (5.1) follows e.g., from the proof of [19, Lemma 6.3]. The proof of inequality (5.2) is
contained in [19, Chapter 6.1]. □
Theorem 5.2 (Gaussian concentration inequality). Let (G_x)_{x∈T} be a centered Gaussian process indexed by a finite set T. Then for every r > 0 one has
P { sup_{x∈T} G_x ≥ E sup_{x∈T} G_x + r } ≤ exp(−r^2/(2σ^2))
where σ^2 = sup_{x∈T} E G_x^2 < ∞.
⁸ This means that P{ε_i = 1} = P{ε_i = −1} = 1/2 for each i. The random variables ε_i are assumed to be independent of each other and of any other random variables in question, namely a_i and y_i.
A proof of this result can be found e.g. in [18, Theorem 7.1]. This theorem can be extended to separable sets T in metric spaces by an approximation argument.
In particular, given a set K ⊆ B2n and r > 0, the standard Gaussian random vector g in Rn satisfies
(5.3)                P { sup_{x∈K−K} ⟨g, x⟩ ≥ w(K) + r } ≤ exp(−r^2/2).
5.2. Proof of Proposition 4.2. We apply the first part (5.1) of Symmetrization Lemma 5.1. Note that since a_i have symmetric distributions and y_i ∈ {−1, 1}, the random vectors ε_i y_i a_i have the same (iid) distribution as a_i. Using the rotational invariance and the symmetry of the Gaussian distribution, we can represent the right hand side of (5.1) as
(5.4)                sup_{z∈K−K} (1/m) ∑_{i=1}^m ε_i y_i ⟨a_i, z⟩ =_d sup_{z∈K−K} (1/m) ∑_{i=1}^m ⟨a_i, z⟩ =_d (1/√m) sup_{z∈K−K} |⟨g, z⟩| = (1/√m) sup_{z∈K−K} ⟨g, z⟩
where =_d signifies equality in distribution. So taking the expectation in (5.1) we obtain
(5.5)                E sup_{z∈K−K} |f_x(z) − E f_x(z)| ≤ (2/√m) E sup_{z∈K−K} ⟨g, z⟩ = 2w(K)/√m.
To supplement this expectation bound with a deviation inequality, we use the second part (5.2) of Symmetrization Lemma 5.1 along with (5.5) and (5.4). This yields
P { sup_{z∈K−K} |f_x(z) − E f_x(z)| ≥ 4w(K)/√m + t } ≤ 4 P { (1/√m) sup_{z∈K−K} ⟨g, z⟩ > t/2 }.
Now it remains to use the Gaussian concentration inequality (5.3) with r = t√m/2. The proof of Proposition 4.2 is complete. □
6. Concentration: proof of Proposition 4.3
Here we prove the uniform concentration inequality given by Proposition 4.3. Besides standard
tools in geometric functional analysis such as Sudakov minoration for covering numbers, our argument is based on the recent work [22] on random hyperplane tessellations. Let us first recall the
relevant tools.
6.1. Tools: covering numbers, almost isometries and random tessellations. Consider a
set T ⊂ Rn and a number ε > 0. Recall that an ε-net of T (in the Euclidean norm) is a set Nε ⊂ T
which has the following property: for every x ∈ T there exists x̄ ∈ Nε satisfying kx − x̄k2 ≤ ε.
The covering number of T to precision ε, which we call N (T, ε), is the minimal cardinality of an
ε-net of T . The covering numbers are closely related to the mean width, as shown by the following
well-known inequality:
Theorem 6.1 (Sudakov Minoration). Given a set K ⊂ R^n and a number ε > 0, one has
log N(K, ε) ≤ log N(K − K, ε) ≤ Cε^{-2} w(K)^2.
A proof of this theorem can be found e.g. in [19, Theorem 3.18]. We will also need two results from the recent work [22]. To state them conveniently, let A denote
the m × n matrix with rows a_i. Thus A is a standard Gaussian matrix with iid standard normal entries. The first (simple) result guarantees that A acts as an almost isometric embedding from (K, ‖·‖_2) into (R^m, ‖·‖_1).
Lemma 6.2 ([22] Lemma 2.1). Consider a subset K ⊂ B2n . Then for every u > 0 one has
P { sup_{x∈K−K} | (1/m) ‖Ax‖_1 − √(2/π) ‖x‖_2 | ≥ 4w(K)/√m + u } ≤ 2 exp(−mu^2/2).
x∈K−K m
The second (not simple) result demonstrates that the discrete map x 7→ sign(Ax) acts as an
almost isometric embedding from (K ∩ S n−1 , dG ) into the Hamming cube ({−1, 1}m , dH ). Here dG
and dH denote the geodesic and Hamming metrics respectively; the sign function is applied to all
entries of Ax, thus sign(Ax) is the vector with entries sign(hai , xi), i = 1, . . . , m.
Theorem 6.3 ([22] Theorem 1.5). Consider a subset K ⊂ B2n and let δ > 0. Let
m ≥ Cδ^{-6} w(K)^2.
Then with probability at least 1 − 2 exp(−cδ^2 m), the following holds for all x, x′ ∈ K ∩ S^{n−1}:
| (1/π) d_G(x, x′) − (1/m) d_H(sign(Ax), sign(Ax′)) | ≤ δ.
6.2. Proof of Proposition 4.3. Let us first assume that τ = 0 for simplicity; the general case
will be discussed at the end of the proof. With this assumption, (4.3) becomes
(6.1)                f_x(z) = f̃_x(z) = (1/m) ∑_{i=1}^m sign(⟨a_i, x⟩) ⟨a_i, z⟩.
To be able to approximate x by a net, we use Sudakov minoration (Theorem 6.1) and find a δ-net
Nδ of K ∩ S n−1 whose cardinality satisfies
(6.2)                log |N_δ| ≤ Cδ^{-2} w(K)^2.
Lemma 6.4. Let δ > 0 and assume that m ≥ Cδ−6 w(K)2 . Then we have the following.
1. (Bound on the net.) With probability at least 1 − 4 exp(−cmδ2 ) we have
(6.3)                sup_{x_0, z} |f_{x_0}(z) − E f_{x_0}(z)| ≤ δ
where the supremum is taken over all x_0 ∈ N_δ and all z ∈ K − K.
2. (Deviation of sign patterns.) For x, x_0 ∈ K ∩ S^{n−1}, consider the set
T(x, x_0) := {i ∈ [m] : sign(⟨a_i, x⟩) ≠ sign(⟨a_i, x_0⟩)}.
Then, with probability at least 1 − 2 exp(−cmδ^2) we have
(6.4)                sup_{x, x_0} |T(x, x_0)| ≤ 2mδ
where the supremum is taken over all x, x_0 ∈ K ∩ S^{n−1} satisfying ‖x − x_0‖_2 ≤ δ.
3. (Deviation of sums.) Let s be a natural number. With probability at least 1 − 2 exp(−s log(em/s)/2) we have
(6.5)                sup_{z, T} ∑_{i∈T} |⟨a_i, z⟩| ≤ s ( √(8/π) + 4w(K)/√s + 2√(log(em/s)) )
where the supremum is taken over all z ∈ K − K and all subsets T ⊂ [m] with cardinality |T| ≤ s.
Proof of Proposition 4.3. Let us apply Lemma 6.4, where in part 3 we take s = 2mδ rounded down
to the next smallest integer. Then all of the events (6.3), (6.4), (6.5) hold simultaneously with
probability at least 1 − 8 exp(−cmδ2 ).
Recall that our goal is to bound the deviation of fx (z) from its expectation uniformly over
x ∈ K ∩ S n−1 and z ∈ K − K. To this end, given x ∈ K ∩ S n−1 we choose x0 ∈ Nδ so that
kx − x0 k2 ≤ δ. By (6.1) and definition of the set T (x, x0 ) in Lemma 6.4, we can approximate
fx (z) by fx0 (z) as follows:
(6.6)                |f_x(z) − f_{x_0}(z)| ≤ (2/m) ∑_{i∈T(x,x_0)} |⟨a_i, z⟩|.
Furthermore, (6.4) guarantees that |T(x, x_0)| ≤ 2mδ. It then follows from (6.5), our choice s = 2mδ and the assumption on m that
∑_{i∈T(x,x_0)} |⟨a_i, z⟩| ≤ Cmδ √(log(e/δ)).
Thus
|f_x(z) − f_{x_0}(z)| ≤ 2Cδ √(log(e/δ)).
Combining this with (6.3) we obtain
(6.7)                |f_x(z) − E f_{x_0}(z)| ≤ C_1 δ √(log(e/δ)).
Further, recall from (4.4) that E f_{x_0}(z) = √(2/π) ⟨x_0, z⟩ and thus
(6.8)                | E f_{x_0}(z) − E f_x(z)| = √(2/π) |⟨x_0 − x, z⟩| ≤ 2√(2/π) ‖x_0 − x‖_2 ≤ 2√(2/π) δ.
The last two inequalities in this line follow since z ∈ K − K ⊂ 2B_2^n and ‖x_0 − x‖_2 ≤ δ. Finally, we combine inequalities (6.7) and (6.8) to give
|f_x(z) − E f_x(z)| ≤ C_2 δ √(log(e/δ)).
Note that we can absorb the constant C2 into the requirement m ≥ Cδ−6 w(K)2 . This completes
the proof in the case where τ = 0.
In the general case, we only need to tweak the above argument by increasing the size of s
considered in (6.5). Specifically, it is enough to choose s to be τ m + 2mδ rounded down to the next
smallest integer. This allows one to account for arbitrary τ m bit flips of the numbers sign(hai , xi),
which produce the difference between fx (z) and f˜x (z). The proof of Proposition 4.3 is complete.
It remains to prove Lemma 6.4 that was used in the argument above.
Proof of Lemma 6.4. To prove part 1, we can use Proposition 4.2 combined with the union bound
over the net Nδ . Using the bound (6.2) on the cardinality of Nδ we obtain
P { sup_{x_0, z} |f_{x_0}(z) − E f_{x_0}(z)| ≥ 4w(K)/√m + t } ≤ |N_δ| · 4 exp(−mt^2/8) ≤ 4 exp( −mt^2/8 + Cδ^{-2} w(K)^2 )
where the supremum is taken over all x_0 ∈ N_δ and z ∈ K − K. It remains to choose t = δ/2 and recall that m ≥ Cδ^{-6} w(K)^2 to finish the proof.
We now turn to part 2. First, note that |T (x, x0 )| = dH (sign(Ax), sign(Ax0 )). Theorem 6.3
demonstrates that this Hamming distance is almost isometric to the geodesic distance, which itself
satisfies (1/π) d_G(x, x_0) ≤ ‖x − x_0‖_2 ≤ δ. Specifically, Theorem 6.3 yields that under our assumption that m ≥ Cδ^{-6} w(K)^2, with probability at least 1 − 2 exp(−cδ^2 m) one has
(6.9)                |T(x, x_0)| ≤ 2mδ
for all x, x_0 ∈ K ∩ S^{n−1} satisfying ‖x − x_0‖_2 ≤ δ. This proves part 2.
In order to prove part 3, we may consider the subsets T satisfying |T| = s; there are \binom{m}{s} ≤ exp(s log(em/s)) of them. Now we apply Lemma 6.2 for the |T| × n matrix P_T A, where P_T denotes the coordinate restriction in R^m onto R^T; so in the statement of Lemma 6.2 we replace m by |T| = s. Combined with the union bound over all T, this gives
P { sup_{z,T} (1/s) ∑_{i∈T} |⟨a_i, z⟩| ≥ √(2/π) ‖z‖_2 + 4w(K)/√s + u } ≤ 2 exp( s log(em/s) − su^2/2 ).
Recall that ‖z‖_2 ≤ 2 since z ∈ K − K ⊂ 2B_2^n. Finally, we take u^2 = 4 log(em/s) to complete the proof. □
7. Discussion
Unlike traditional compressed sensing, which has already enjoyed an extraordinary wave of theoretical results, 1-bit compressed sensing is in its early stages. In this paper, we proposed a
polynomial-time solver (given by a convex program) for noisy 1-bit compressed sensing, and we
gave theoretical guarantees on its performance. The discontinuity inherent in 1-bit measurements
led to some unique mathematical challenges. We also demonstrated the connection to sparse binomial regression, and derived novel results for this problem as well.
The problem setup in 1-bit compressed sensing (as first defined in [5]) is quite elegant, allowing
for a theoretical approach. On the other hand, there are many compressed sensing results assuming
substantially finer quantization. It would be of interest to build a bridge between the two regimes;
for example, 2-bit compressed sensing would already open up new questions.
References
[1] Ardestanizadeh, E., Cheraghchi, M., and Shokrollahi, A. Bit precision analysis for compressed sensing.
In International Symposium on Information Theory (ISIT) (2009), IEEE.
[2] Bach, F. Self-concordant analysis for logistic regression. Electronic Journal of Statistics 4 (2010), 384–414.
[3] Bartlett, P., and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results.
The Journal of Machine Learning Research 3 (2003), 463–482.
[4] Boufounos, P. Reconstruction of sparse signals from distorted randomized measurements. In Acoustics Speech
and Signal Processing (ICASSP), 2010 IEEE International Conference on (2010), IEEE, pp. 3998–4001.
[5] Boufounos, P. T., and Baraniuk, R. G. 1-Bit compressive sensing. In 42nd Annual Conference on Information Sciences and Systems (CISS) (Mar. 2008).
[6] Bunea, F. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization.
Electronic Journal of Statistics 2 (2008), 1153–1194.
[7] Candès, E., and Plan, Y. Tight oracle inequalities for low-rank matrix recovery from a minimal number of
noisy random measurements. IEEE Transactions on Information Theory 57, 4 (2011), 2342–2359.
[8] Eldar, Y. C., and Kutyniok, G., Eds. Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
[9] Giannopoulos, A. A., and Milman, V. D. Asymptotic convex geometry: short overview. In Different faces
of geometry, I. M. Series, Ed. Kluwer/Plenum, 2004, pp. 87–162.
[10] Gross, D., Liu, Y., Flammia, S., Becker, S., and Eisert, J. Quantum state tomography via compressed
sensing. Physical Review Letters 105, 15 (2010), 150401.
[11] Güntürk, C., Lammers, M., Powell, A., Saab, R., and Yılmaz, Ö. Sigma delta quantization for compressed sensing. In 44th Annual Conference on Information Sciences and Systems (CISS) (2010), IEEE.
[12] Güntürk, C., Lammers, M., Powell, A., Saab, R., and Yılmaz, Ö. Sobolev duals for random frames and
sigma-delta quantization of compressed sensing measurements. Preprint. Available at http://arxiv.org/abs/
1002.0182.
[13] Gupta, A., Nowak, R., and Recht, B. Sample complexity for 1-bit compressed sensing and sparse classification. In International Symposium on Information Theory (ISIT) (2010), IEEE.
[14] Jacques, L., Laska, J. N., Boufounos, P. T., and Baraniuk, R. G. Robust 1-bit compressive sensing via
binary stable embeddings of sparse vectors. Preprint. Available at http://arxiv.org/abs/1104.3160.
[15] Kakade, S., Shamir, O., Sridharan, K., and Tewari, A. Learning exponential families in high-dimensions:
Strong convexity and sparsity. Preprint. Available at http://arxiv.org/abs/0911.0054.
[16] Laska, J. Regime Change: Sampling Rate vs. Bit-Depth in Compressive Sensing. PhD thesis, Rice University, 2011.
[17] Laska, J., and Baraniuk, R. Regime change: Bit-depth versus measurement-rate in compressive sensing.
Preprint. Available at http://arxiv.org/abs/1110.3450.
[18] Ledoux, M. The concentration of measure phenomenon. American Mathematical Society, Providence, 2001.
[19] Ledoux, M., and Talagrand, M. Probability in Banach Spaces: isoperimetry and processes. Springer-Verlag,
Berlin, 1991.
[20] Meier, L., Van De Geer, S., and Bühlmann, P. The group lasso for logistic regression. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 70, 1 (2008), 53–71.
[21] Negahban, S., Ravikumar, P., Wainwright, M., and Yu, B. A unified framework for high-dimensional
analysis of m-estimators with decomposable regularizers. Preprint. Available at http://arxiv.org/abs/1010.
2731.
[22] Plan, Y., and Vershynin, R. Dimension reduction by random hyperplane tessellations. Preprint. Available at
http://arxiv.org/abs/1111.4452.
[23] Plan, Y., and Vershynin, R. One-bit compressed sensing by linear programming. Preprint. Available at
http://arxiv.org/abs/1109.4299.
[24] Raskutti, G., Wainwright, M., and Yu, B. Minimax rates of convergence for high-dimensional regression
under q-ball sparsity. In Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton
Conference on (2009), IEEE, pp. 251–257.
[25] Ravikumar, P., Wainwright, M., and Lafferty, J. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics 38, 3 (2010), 1287–1319.
[26] Talagrand, M. Majorizing measures: the generic chaining. The Annals of Probability 24, 3 (1996), 1049–1103.
[27] Van De Geer, S. High-dimensional generalized linear models and the lasso. The Annals of Statistics 36, 2
(2008), 614–645.
[28] Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory
and Applications, Y. Eldar and G. Kutyniok, Eds. Cambridge University Press. To appear (2011).
Department of Mathematics, University of Michigan, 530 Church St., Ann Arbor, MI 48109, U.S.A.
E-mail address: {yplan,romanv}@umich.edu