LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements

Ehsan Abbasi, Department of Electrical Engineering, Caltech, [email protected]
Christos Thrampoulidis, Department of Electrical Engineering, Caltech, [email protected]
Babak Hassibi, Department of Electrical Engineering, Caltech, [email protected] *

* This work was supported in part by the National Science Foundation under grants CNS-0932428, CCF-1018927, CCF-1423663 and CCF-1409204, by a grant from Qualcomm Inc., by NASA's Jet Propulsion Laboratory through the President and Directors Fund, by King Abdulaziz University, and by King Abdullah University of Science and Technology.

Abstract

Consider estimating an unknown, but structured (e.g. sparse, low-rank, etc.), signal $x_0 \in \mathbb{R}^n$ from a vector $y \in \mathbb{R}^m$ of measurements of the form $y_i = g_i(a_i^T x_0)$, where the $a_i$'s are the rows of a known measurement matrix $A$, and $g(\cdot)$ is a (potentially unknown) nonlinear and random link function. Such measurement functions could arise in applications where the measurement device has nonlinearities and uncertainties. They could also arise by design; e.g., $g_i(x) = \mathrm{sign}(x + z_i)$ corresponds to noisy 1-bit quantized measurements. Motivated by the classical work of Brillinger, and more recent work of Plan and Vershynin, we estimate $x_0$ by solving the Generalized LASSO, i.e., $\hat{x} := \arg\min_x \|y - Ax\|_2 + \lambda f(x)$, for some regularization parameter $\lambda > 0$ and some (typically non-smooth) convex regularizer $f(\cdot)$ that promotes the structure of $x_0$, e.g. the $\ell_1$-norm, the nuclear norm, etc. While this approach seems to naively ignore the nonlinear function $g(\cdot)$, both Brillinger (in the unconstrained case) and Plan and Vershynin have shown that, when the entries of $A$ are iid standard normal, this is a good estimator of $x_0$ up to a constant of proportionality $\mu$, which only depends on $g(\cdot)$. In this work, we considerably strengthen these results by obtaining explicit expressions for $\|\hat{x} - \mu x_0\|_2$, for the regularized Generalized LASSO, that are asymptotically precise when $m$ and $n$ grow large. A main result is that the estimation performance of the Generalized LASSO with non-linear measurements is asymptotically the same as one whose measurements are linear, $y_i = \mu a_i^T x_0 + \sigma z_i$, with $\mu = \mathbb{E}[\gamma g(\gamma)]$ and $\sigma^2 = \mathbb{E}[(g(\gamma) - \mu\gamma)^2]$, and $\gamma$ standard normal. To the best of our knowledge, the derived expressions on the estimation performance are the first-known precise results in this context. One interesting consequence of our result is that the optimal quantizer of the measurements that minimizes the estimation error of the Generalized LASSO is the celebrated Lloyd-Max quantizer.

1 Introduction

Non-linear Measurements. Consider the problem of estimating an unknown signal vector $x_0 \in \mathbb{R}^n$ from a vector $y = (y_1, y_2, \ldots, y_m)^T$ of $m$ measurements taking the following form:
$$y_i = g_i(a_i^T x_0), \quad i = 1, 2, \ldots, m. \tag{1}$$
Here, each $a_i$ represents a (known) measurement vector. The $g_i$'s are independent copies of a (generically random) link function $g$. For instance, $g_i(x) = x + z_i$, with say $z_i$ being normally distributed, recovers the standard linear regression setup with Gaussian noise. In this paper, we are particularly interested in scenarios where $g$ is non-linear. Notable examples include $g(x) = \mathrm{sign}(x)$ (or $g_i(x) = \mathrm{sign}(x + z_i)$) and $g(x) = (x)_+$, corresponding to 1-bit quantized (noisy) measurements and to the censored Tobit model, respectively. Depending on the situation, $g$ might be known or unspecified.
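For concreteness, a small simulation sketch of the measurement model (1) follows, for a few of the link functions mentioned above; the dimensions and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 300                         # illustrative dimensions
x0 = rng.normal(size=n)
x0 /= np.linalg.norm(x0)                # unit-norm signal (see Section 2.1)
A = rng.normal(size=(m, n))             # rows a_i: i.i.d. standard Gaussian measurement vectors

u = A @ x0                              # the inner products a_i^T x_0

# Example link functions g_i from the introduction:
y_linear = u + 0.3 * rng.normal(size=m)           # g_i(x) = x + z_i  (linear regression)
y_1bit   = np.sign(u + 0.3 * rng.normal(size=m))  # g_i(x) = sign(x + z_i)  (noisy 1-bit)
y_tobit  = np.maximum(u, 0.0)                     # g(x) = (x)_+  (censored Tobit model)
```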
In the statistics and econometrics literature, the measurement model in (1) is popular under the name single-index model, and several aspects of it have been well studied, e.g. [4, 5, 14, 15].^1

Structured Signals. It is typical that the unknown signal $x_0$ obeys some sort of structure. For instance, it might be sparse, i.e., only a few, $k \ll n$, of its entries are non-zero; or, it might be that $x_0 = \mathrm{vec}(X_0)$, where $X_0 \in \mathbb{R}^{\sqrt{n}\times\sqrt{n}}$ is a matrix of low rank $r \ll \sqrt{n}$. To exploit this information it is typical to associate with the structure of $x_0$ a properly chosen function $f: \mathbb{R}^n \to \mathbb{R}$, which we refer to as the regularizer. Of particular interest are convex and non-smooth such regularizers, e.g. the $\ell_1$-norm for sparse signals, the nuclear norm for low-rank ones, etc. Please refer to [1, 6, 13] for further discussions.

An Algorithm for Linear Measurements: The Generalized LASSO. When the link function is linear, i.e. $g_i(x) = x + z_i$, perhaps the most popular way of estimating $x_0$ is via solving the Generalized LASSO algorithm:
$$\hat{x} := \arg\min_x \|y - Ax\|_2 + \lambda f(x). \tag{2}$$
Here, $A = [a_1, a_2, \ldots, a_m]^T \in \mathbb{R}^{m\times n}$ is the known measurement matrix and $\lambda > 0$ is a regularizer parameter. This is often referred to as the $\ell_2$-LASSO or the square-root LASSO [3], to distinguish it from the one solving $\min_x \frac{1}{2}\|y - Ax\|_2^2 + \lambda f(x)$ instead. Our results can be adapted to this latter version, but for concreteness we restrict attention to (2) throughout. The acronym LASSO for (2) was introduced in [22] for the special case of $\ell_1$-regularization; (2) is a natural generalization to other kinds of structures and includes the group-LASSO [25] and the fused-LASSO [23] as special cases. We often drop the term "Generalized" and refer to (2) simply as the LASSO.

One popular measure of estimation performance of (2) is the squared error $\|\hat{x} - x_0\|_2^2$. Recently, there have been significant advances on establishing tight bounds and even precise characterizations of this quantity in the presence of linear measurements [2, 10, 16, 18, 19, 21]. Such precise results have been core to building a better understanding of the behavior of the LASSO, and, in particular, of the exact role played by the choice of the regularizer $f$ (in accordance with the structure of $x_0$), by the number of measurements $m$, by the value of $\lambda$, etc. In certain cases, they even provide us with useful insights into practical matters such as the tuning of the regularizer parameter.

Using the LASSO for Non-linear Measurements? The LASSO is by nature tailored to a linear model for the measurements. Indeed, the first term of the objective function in (2) tries to fit $Ax$ to the observed vector $y$, presuming that this is of the form $y_i = a_i^T x_0 + \text{noise}$. Of course, no one stops us from continuing to use it even in cases where $y_i = g(a_i^T x_0)$ with $g$ being non-linear.^2 But the question then becomes: can there be any guarantees that the solution $\hat{x}$ of the Generalized LASSO is still a good estimate of $x_0$?

The question just posed was first studied back in the early '80s by Brillinger [5], who provided answers in the case of solving (2) without a regularizer term. This, of course, corresponds to standard Least Squares (LS). Interestingly, he showed that when the measurement vectors are Gaussian, the LS solution is a consistent estimate of $x_0$, up to a constant of proportionality $\mu$, which only depends on the link function $g$.
The result is sharp, but only under the assumption that the number of measurements $m$ grows large while the signal dimension $n$ stays fixed, which was the typical setting of interest at the time. In the world of structured signals and high-dimensional measurements, the problem was only very recently revisited by Plan and Vershynin [17]. They consider a constrained version of the Generalized LASSO, in which the regularizer is essentially replaced by a constraint, and derive upper bounds on its performance. The bounds are not tight (they involve absolute constants), but they demonstrate some key features: i) the solution $\hat{x}$ of the constrained LASSO is a good estimate of $x_0$ up to the same constant of proportionality $\mu$ that appears in Brillinger's result; ii) thus, $\|\hat{x} - \mu x_0\|_2^2$ is a natural measure of performance; iii) estimation is possible even with $m < n$ measurements by taking advantage of the structure of $x_0$.

^1 The single-index model is a classical topic and can also be regarded as a special case of what is known as the sufficient dimension reduction problem. There is extensive literature on both subjects; unavoidably, we only refer to the directly relevant works here.
^2 Note that the Generalized LASSO in (2) does not assume knowledge of $g$. All that is assumed is the availability of the measurements $y_i$. Thus, the link function might as well be unknown or unspecified.

[Figure 1: Squared error of the $\ell_1$-regularized LASSO with non-linear measurements and with corresponding linear ones, as a function of the regularizer parameter $\lambda$; both compared to the asymptotic prediction. Here, $g_i(x) = \mathrm{sign}(x + 0.3 z_i)$ with $z_i \sim \mathcal{N}(0,1)$. The unknown signal $x_0$ is of dimension $n = 768$ and has $\lceil 0.15n \rceil$ non-zero entries (see Sec. 2.2.2 for details). The different curves correspond to $\lceil 0.75n \rceil$ and $\lceil 1.2n \rceil$ measurements, respectively. Simulation points are averages over 20 problem realizations.]

1.1 Summary of Contributions

Inspired by the work of Plan and Vershynin [17], and motivated by recent advances on the precise analysis of the Generalized LASSO with linear measurements, this paper extends these latter results to the case of non-linear measurements. When the measurement matrix $A$ has entries i.i.d. Gaussian (henceforth, we assume this to be the case without further reference), and the estimation performance is measured in a mean-squared-error sense, we are able to precisely predict the asymptotic behavior of the error. The derived expression accurately captures the role of the link function $g$, the particular structure of $x_0$, the role of the regularizer $f$, and the value of the regularizer parameter $\lambda$. Further, it holds for all values of $\lambda$ and for a wide class of functions $f$ and $g$.

Interestingly, our result shows in a very precise manner that in large dimensions, modulo the information about the magnitude of $x_0$, the LASSO treats non-linear measurements exactly as if they were scaled and noisy linear measurements, with scaling factor $\mu$ and noise variance $\sigma^2$ defined as
$$\mu := \mathbb{E}[\gamma g(\gamma)], \quad\text{and}\quad \sigma^2 := \mathbb{E}[(g(\gamma) - \mu\gamma)^2], \quad\text{for } \gamma \sim \mathcal{N}(0,1), \tag{3}$$
where the expectation is with respect to both $\gamma$ and $g$. In particular, when $g$ is such that $\mu \neq 0$,^3 the estimation performance of the Generalized LASSO with measurements of the form $y_i = g_i(a_i^T x_0)$ is asymptotically the same as if the measurements were of the form $y_i = \mu a_i^T x_0 + \sigma z_i$, with $\mu, \sigma^2$ as in (3) and $z_i$ standard Gaussian noise.
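Since (3) only involves one-dimensional Gaussian averages, $\mu$ and $\sigma^2$ are easy to evaluate numerically for any given link function. The snippet below is a minimal Monte Carlo sketch for the link $g_i(x) = \mathrm{sign}(x + 0.3 z_i)$ used in Figure 1; the sample size and seed are arbitrary choices.

```python
import numpy as np

def mu_sigma2(g, num_samples=10**6, seed=0):
    """Monte Carlo estimates of mu = E[gamma*g(gamma)] and sigma^2 = E[(g(gamma)-mu*gamma)^2]
    from (3), with gamma ~ N(0,1) and the expectation also over the randomness of g."""
    rng = np.random.default_rng(seed)
    gamma = rng.normal(size=num_samples)
    vals = g(gamma, rng)
    mu = np.mean(gamma * vals)
    sigma2 = np.mean((vals - mu * gamma) ** 2)
    return mu, sigma2

# Link function of Figure 1: g_i(x) = sign(x + 0.3 z_i), z_i ~ N(0,1).
g_sign = lambda x, rng: np.sign(x + 0.3 * rng.normal(size=x.shape))
mu, sigma2 = mu_sigma2(g_sign)
# For this particular g one can also check the closed form mu = sqrt(2/pi)/sqrt(1 + 0.3**2).
```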
Recent analysis of the squared error of the LASSO, when used to recover structured signals from noisy linear observations, provides us with either precise predictions (e.g. [2, 20]) or, in other cases, with tight upper bounds (e.g. [10, 16]). Owing to the established relation between non-linear and (corresponding) linear measurements, such results also characterize the performance of the LASSO in the presence of nonlinearities. We remark that some of the error formulae derived here in the general context of non-linear measurements have not been previously known even under the prism of linear measurements. Figure 1 serves as an illustration; the error with non-linear measurements matches well with the error of the corresponding linear ones, and both are accurately predicted by our analytic expression.

Under the generic model in (1), which allows for $g$ to even be unspecified, $x_0$ can, in principle, be estimated only up to a constant of proportionality [5, 15, 17]. For example, if $g$ is unknown then any information about the norm $\|x_0\|_2$ could be absorbed in the definition of $g$. The same is true when $g(x) = \mathrm{sign}(x)$, even though $g$ might be known here. In these cases, what becomes important is the direction of $x_0$. Motivated by this, and in order to simplify the presentation, we have assumed throughout that $x_0$ has unit Euclidean norm,^4 i.e. $\|x_0\|_2 = 1$.

^3 This excludes, for example, link functions $g$ that are even, but also some other not so obvious cases [11, Sec. 2.2]. For a few special cases, e.g. sparse recovery with binary measurements $y_i$ [24], different methodologies than the LASSO have been recently proposed that do not require $\mu \neq 0$.
^4 In [17, Remark 1.8], they note that their results can be easily generalized to the case when $\|x_0\|_2 \neq 1$ by simply redefining $\bar{g}(x) = g(\|x_0\|_2 x)$ and accordingly adjusting the values of the parameters $\mu$ and $\sigma^2$ in (3). The very same argument is also true in our case.

1.2 Discussion of Relevant Literature

Extending an Old Result. Brillinger [5] identified the asymptotic behavior of the estimation error of the LS solution $\hat{x}_{LS} = (A^TA)^{-1}A^Ty$ by showing that, when $n$ (the dimension of $x_0$) is fixed,
$$\lim_{m\to\infty} \sqrt{m}\,\|\hat{x}_{LS} - \mu x_0\|_2 = \sigma\sqrt{n}, \tag{4}$$
where $\mu$ and $\sigma$ are the same as in (3). Our result can be viewed as a generalization of the above in several directions. First, we extend (4) to the regime where $m/n = \delta \in (1, \infty)$ and both grow large, by showing that
$$\lim_{n\to\infty} \|\hat{x}_{LS} - \mu x_0\|_2 = \frac{\sigma}{\sqrt{\delta - 1}}. \tag{5}$$
Second, and most importantly, we consider solving the Generalized LASSO instead, of which LS is only a very special case. This allows versions of (5) where the error is finite even when $\delta < 1$ (e.g., see (8)). Note the additional challenges faced when considering the LASSO: i) $\hat{x}$ no longer has a closed-form expression; ii) the result needs to additionally capture the role of $x_0$, $f$, and $\lambda$.

Motivated by Recent Work. Plan and Vershynin consider a constrained Generalized LASSO:
$$\hat{x}_{\text{C-LASSO}} = \arg\min_{x \in \mathcal{K}} \|y - Ax\|_2, \tag{6}$$
with $y$ as in (1) and $\mathcal{K} \subset \mathbb{R}^n$ some known set (not necessarily convex). In its simplest form, their result shows that when $m \gtrsim D_{\mathcal{K}}(\mu x_0)$, then with high probability,
$$\|\hat{x}_{\text{C-LASSO}} - \mu x_0\|_2 \lesssim \frac{\sigma\sqrt{D_{\mathcal{K}}(\mu x_0)} + \zeta}{\sqrt{m}}. \tag{7}$$
Here, $D_{\mathcal{K}}(\mu x_0)$ is the Gaussian width, a specific measure of complexity of the constraint set $\mathcal{K}$ when viewed from $\mu x_0$. For our purposes, it suffices to remark that if $\mathcal{K}$ is properly chosen, and if $\mu x_0$ is on the boundary of $\mathcal{K}$, then $D_{\mathcal{K}}(\mu x_0)$ is less than $n$. Thus, estimation is in principle possible with $m < n$ measurements.
The parameters $\mu$ and $\sigma$ that appear in (7) are the same as in (3), and $\zeta := \mathbb{E}[(g(\gamma) - \mu\gamma)^2\gamma^2]$. Observe that, in contrast to (4) and to the setting of this paper, the result in (7) is non-asymptotic. Also, it suggests the critical role played by $\mu$ and $\sigma$. On the other hand, (7) is only an upper bound on the error, and it also suffers from unknown absolute proportionality constants (hidden in $\lesssim$).

Moving the analysis into an asymptotic setting, our work expands upon the result of [17]. First, we consider the regularized LASSO instead, which is more commonly used in practice. Most importantly, we improve the loose upper bounds into precise expressions. In turn, this proves in an exact manner the role played by $\mu$ and $\sigma^2$, of which (7) is only indicative. For a direct comparison with (7) we mention the following result, which follows from our analysis (we omit the proof for brevity). Assume $\mathcal{K}$ is convex, $m/n = \delta \in (0, \infty)$, $D_{\mathcal{K}}(\mu x_0)/n = \rho \in (0, 1]$ and $n \to \infty$. Also, $\delta > \rho$. Then, (7) yields an upper bound $C\sigma\sqrt{\rho/\delta}$ on the error, for some constant $C > 0$. Instead, we show
$$\|\hat{x}_{\text{C-LASSO}} - \mu x_0\|_2 \le \sigma\,\frac{\sqrt{\rho}}{\sqrt{\delta - \rho}}. \tag{8}$$

Precise Analysis of the LASSO With Linear Measurements. The first precise error formulae were established in [2, 10] for the $\ell_2^2$-LASSO with $\ell_1$-regularization. The analysis was based on the Approximate Message Passing (AMP) framework [9]. A more general line of work studies the problem using a recently developed framework termed the Convex Gaussian Min-max Theorem (CGMT) [19], which is a tight version of a classical Gaussian comparison inequality by Gordon [12]. The CGMT framework was initially used by Stojnic [18] to derive tight upper bounds on the constrained LASSO with $\ell_1$-regularization; [16] generalized those to general convex regularizers and also to the $\ell_2$-LASSO; the $\ell_2^2$-LASSO was studied in [21]. Those bounds hold for all values of SNR, but they become tight only in the high-SNR regime. A precise error expression for all values of SNR was derived in [20] for the $\ell_2$-LASSO with $\ell_1$-regularization under a Gaussianity assumption on the distribution of the non-zero entries of $x_0$. When measurements are linear, our Theorem 2.3 generalizes this assumption. Moreover, our Theorem 2.2 provides error predictions for regularizers going beyond the $\ell_1$-norm, e.g. the $\ell_{1,2}$-norm and the nuclear norm, which appear to be novel. When it comes to non-linear measurements, to the best of our knowledge, this paper is the first to derive asymptotically precise results on the performance of any LASSO-type program.

2 Results

2.1 Modeling Assumptions

Unknown structured signal. We let $x_0 \in \mathbb{R}^n$ represent the unknown signal vector. We assume that $x_0 = \tilde{x}_0/\|\tilde{x}_0\|_2$, with $\tilde{x}_0$ sampled from a probability density $p_{\tilde{x}_0}$ in $\mathbb{R}^n$. Thus, $x_0$ is deterministically of unit Euclidean norm (this is mostly to simplify the presentation, see Footnote 4). Information about the structure of $x_0$ (and correspondingly of $\tilde{x}_0$) is encoded in $p_{\tilde{x}_0}$. E.g., to study an $x_0$ which is sparse, it is typical to assume that the entries of $\tilde{x}_0$ are i.i.d. $\tilde{x}_{0,i} \sim (1-\rho)\delta_0 + \rho\, q_{\tilde{X}_0}$, where $\rho \in (0,1)$ becomes the normalized sparsity level, $q_{\tilde{X}_0}$ is a scalar p.d.f. and $\delta_0$ is the Dirac delta function.^5

^5 Such models have been widely used in the relevant literature, e.g. [7, 8, 10]. In fact, the results here continue to hold as long as the marginal distribution of $\tilde{x}_0$ converges to a given distribution (as in [2]).

Regularizer. We consider convex regularizers $f: \mathbb{R}^n \to \mathbb{R}$.

Measurement matrix. The entries of $A \in \mathbb{R}^{m\times n}$ are i.i.d. $\mathcal{N}(0,1)$.

Measurements and Link function. We observe $y = \vec{g}(Ax_0)$, where $\vec{g}$ is a (possibly random) map from $\mathbb{R}^m$ to $\mathbb{R}^m$ with $\vec{g}(u) = [g_1(u_1), \ldots, g_m(u_m)]^T$. Each $g_i$ is drawn i.i.d. from a real-valued random function $g$ for which $\mu$ and $\sigma^2$ are defined in (3).
We assume that $\mu$ and $\sigma^2$ are nonzero and bounded.

Asymptotics. We study a linear asymptotic regime. In particular, we consider a sequence of problem instances $\{\tilde{x}_0^{(n)}, A^{(n)}, f^{(n)}, m^{(n)}\}_{n\in\mathbb{N}}$ indexed by $n$, such that $A^{(n)} \in \mathbb{R}^{m\times n}$ has entries i.i.d. $\mathcal{N}(0,1)$, $f^{(n)}: \mathbb{R}^n \to \mathbb{R}$ is proper convex, and $m := m^{(n)}$ with $m = \delta n$, $\delta \in (0,\infty)$. We further require that the following conditions hold:

(a) $\tilde{x}_0^{(n)}$ is sampled from a probability density $p_{\tilde{x}_0}^{(n)}$ in $\mathbb{R}^n$ with one-dimensional marginals that are independent of $n$ and have bounded second moments. Furthermore, $n^{-1}\|\tilde{x}_0^{(n)}\|_2^2 \xrightarrow{P} \sigma_x^2 = 1$.

(b) For any $n \in \mathbb{N}$ and any $\|x\|_2 \le C$, it holds that $n^{-1/2} f^{(n)}(x) \le c_1$ and $n^{-1/2}\max_{s\in\partial f^{(n)}(x)} \|s\|_2 \le c_2$, for constants $c_1, c_2, C \ge 0$ independent of $n$.

In (a), we used "$\xrightarrow{P}$" to denote convergence in probability as $n \to \infty$. The assumption $\sigma_x^2 = 1$ holds without loss of generality and is only meant to simplify the presentation. In (b), $\partial f(x)$ denotes the subdifferential of $f$ at $x$. The condition itself is no more than a normalization condition on $f$. Every such sequence $\{\tilde{x}_0^{(n)}, A^{(n)}, f^{(n)}\}_{n\in\mathbb{N}}$ generates a sequence $\{x_0^{(n)}, y^{(n)}\}_{n\in\mathbb{N}}$, where $x_0^{(n)} := \tilde{x}_0^{(n)}/\|\tilde{x}_0^{(n)}\|_2$ and $y^{(n)} := \vec{g}(A^{(n)} x_0^{(n)})$. When clear from the context, we drop the superscript $(n)$.

2.2 Precise Error Prediction

Let $\{\tilde{x}_0^{(n)}, A^{(n)}, f^{(n)}, y^{(n)}\}_{n\in\mathbb{N}}$ be a sequence of problem instances that satisfies all the conditions above. With these, define the sequence $\{\hat{x}^{(n)}\}_{n\in\mathbb{N}}$ of solutions to the corresponding LASSO problems for fixed $\lambda > 0$:
$$\hat{x}^{(n)} := \arg\min_x \frac{1}{\sqrt{n}}\left\{ \|y^{(n)} - A^{(n)}x\|_2 + \lambda f^{(n)}(x) \right\}. \tag{9}$$
The main contribution of this paper is a precise evaluation of $\lim_{n\to\infty} \|\mu^{-1}\hat{x}^{(n)} - x_0^{(n)}\|_2^2$, which holds with high probability over the randomness of $A$, of $\tilde{x}_0$, and of $g$.

2.2.1 General Result

To state the result in a general framework, we require a further assumption on $p_{\tilde{x}_0}^{(n)}$ and $f^{(n)}$. Later in this section we illustrate how this assumption can be naturally met. We write $f^*$ for the Fenchel conjugate of $f$, i.e., $f^*(v) := \sup_x x^Tv - f(x)$; also, we denote the Moreau envelope of $f$ at $v$ with index $\tau$ by $e_{f,\tau}(v) := \min_x \{ \frac{1}{2}\|v - x\|_2^2 + \tau f(x) \}$.

Assumption 1. We say Assumption 1 holds if, for all non-negative constants $c_1, c_2, c_3 \in \mathbb{R}$, the pointwise limit of $\frac{1}{n}\, e_{\sqrt{n}(f^*)^{(n)}, c_3}(c_1 h + c_2 \tilde{x}_0)$ exists with probability one over $h \sim \mathcal{N}(0, I_n)$ and $\tilde{x}_0 \sim p_{\tilde{x}_0}^{(n)}$. We then denote the limiting value by $F(c_1, c_2, c_3)$.

Theorem 2.1 (Non-linear = Linear). Consider the asymptotic setup of Section 2.1 and let Assumption 1 hold. Recall $\mu$ and $\sigma^2$ as in (3) and let $\hat{x}$ be the minimizer of the Generalized LASSO in (9) for fixed $\lambda > 0$ and for measurements given by (1). Further let $\hat{x}_{\mathrm{lin}}$ be the solution to the Generalized LASSO when used with linear measurements of the form $y_{\mathrm{lin}} = A(\mu x_0) + \sigma z$, where $z$ has entries i.i.d. standard normal. Then, in the limit of $n \to \infty$, with probability one,
$$\|\hat{x} - \mu x_0\|_2^2 = \|\hat{x}_{\mathrm{lin}} - \mu x_0\|_2^2.$$

Theorem 2.1 relates in a very precise manner the error of the Generalized LASSO under non-linear measurements to the error of the same algorithm when used under appropriately scaled noisy linear measurements. Theorem 2.2 below derives an asymptotically exact expression for the error.
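Before stating the precise error formula, the following minimal simulation sketch illustrates the equivalence of Theorem 2.1: it solves the $\ell_1$-regularized LASSO once with noisy 1-bit measurements and once with the corresponding scaled-and-noisy linear measurements, and compares the two squared errors. The use of cvxpy as a solver, the dimensions, and the value of $\lambda$ are illustrative assumptions; any solver for (9) would do, and at finite $n$ the two errors only match approximately.

```python
import numpy as np
import cvxpy as cp   # assumed available; any convex solver for (9) works

rng = np.random.default_rng(1)
n, m, lam = 256, 384, 1.0                        # illustrative sizes and regularization
x0 = np.zeros(n)
supp = rng.choice(n, size=n // 10, replace=False)
x0[supp] = rng.normal(size=supp.size)
x0 /= np.linalg.norm(x0)                         # unit-norm sparse signal
A = rng.normal(size=(m, n))

# mu and sigma^2 of (3) for g(x) = sign(x + 0.3 z): closed forms for this particular link.
mu = np.sqrt(2 / np.pi) / np.sqrt(1 + 0.3**2)
sigma2 = 1.0 - mu**2                             # E[g(gamma)^2] = 1 here

y_nonlin = np.sign(A @ x0 + 0.3 * rng.normal(size=m))         # measurements as in (1)
y_lin = mu * (A @ x0) + np.sqrt(sigma2) * rng.normal(size=m)  # equivalent linear model

def lasso(y):
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.norm(y - A @ x, 2) + lam * cp.norm1(x))).solve()
    return x.value

err_nonlin = np.linalg.norm(lasso(y_nonlin) - mu * x0) ** 2
err_lin = np.linalg.norm(lasso(y_lin) - mu * x0) ** 2
# Theorem 2.1: the two squared errors agree asymptotically (here only approximately).
```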
Theorem 2.2 (Precise Error Formula). Under the same assumptions as Theorem 2.1 and with $\delta := m/n$, it holds, with probability one,
$$\lim_{n\to\infty} \|\hat{x} - \mu x_0\|_2^2 = \alpha_*^2,$$
where $\alpha_*$ is the unique optimal solution to the convex program
$$\max_{0\le\beta\le1}\ \min_{\substack{\alpha\ge0\\ \tau\ge0}}\ \ \beta\sqrt{\delta}\sqrt{\alpha^2 + \sigma^2} - \frac{\alpha\tau}{2} + \frac{\mu^2\tau}{2\alpha} - \frac{\alpha\lambda^2}{\tau}\, F\!\left(\frac{\beta}{\lambda}, \frac{\mu\tau}{\lambda\alpha}, \frac{\tau}{\lambda\alpha}\right). \tag{10}$$
Also, the optimal cost of the LASSO in (9) converges to the optimal cost of the program in (10).

Under the stated conditions, Theorem 2.2 proves that the limit of $\|\hat{x} - \mu x_0\|_2$ exists and is equal to the unique solution of the optimization program in (10). Notice that this is a deterministic and convex optimization, which only involves three scalar optimization variables. Thus, the optimal $\alpha_*$ can, in principle, be efficiently numerically computed. In many specific cases of interest, with some extra effort, it is possible to obtain simpler expressions for $\alpha_*$, e.g. see Theorem 2.3 below. The roles of the normalized number of measurements $\delta = m/n$, of the regularizer parameter $\lambda$, and of $g$, through $\mu$ and $\sigma^2$, are explicit in (10); the structure of $x_0$ and the choice of the regularizer $f$ are implicit in $F$. Figures 1-2 illustrate the accuracy of the prediction of the theorem in a number of different settings. The proofs of both theorems are deferred to Appendix A. In the next sections, we specialize Theorem 2.2 to the cases of sparse, group-sparse and low-rank signal recovery.

2.2.2 Sparse Recovery

Assume each entry $\tilde{x}_{0,i}$, $i = 1, \ldots, n$, is sampled i.i.d. from a distribution
$$p_{\tilde{X}_0}(x) = (1 - \rho)\cdot\delta_0(x) + \rho\cdot q_{\tilde{X}_0}(x), \tag{11}$$
where $\delta_0$ is the Dirac delta function, $\rho \in (0,1)$ and $q_{\tilde{X}_0}$ is a probability density function with second moment normalized to $1/\rho$ so that condition (a) of Section 2.1 is satisfied. Then, $x_0 = \tilde{x}_0/\|\tilde{x}_0\|_2$ is $\rho n$-sparse on average and has unit Euclidean norm. Letting $f(x) = \|x\|_1$ also satisfies condition (b).

Let us now check Assumption 1. The Fenchel conjugate of the $\ell_1$-norm is simply the indicator function of the $\ell_\infty$ unit ball. Hence, without much effort,
$$\frac{1}{n}\, e_{\sqrt{n}(f^*)^{(n)}, c_3}(c_1 h + c_2 \tilde{x}_0) = \frac{1}{2n}\sum_{i=1}^n \min_{|v_i|\le 1} \big(v_i - (c_1 h_i + c_2 \tilde{x}_{0,i})\big)^2 = \frac{1}{2n}\sum_{i=1}^n \eta^2(c_1 h_i + c_2 \tilde{x}_{0,i}; 1), \tag{12}$$
where we have denoted
$$\eta(x; \tau) := \frac{x}{|x|}\,(|x| - \tau)_+ \tag{13}$$
for the soft-thresholding operator. An application of the weak law of large numbers shows that the limit of the expression in (12) equals $F(c_1, c_2, c_3) := \frac{1}{2}\mathbb{E}\big[\eta^2(c_1 h + c_2 \tilde{X}_0; 1)\big]$, where the expectation is over $h \sim \mathcal{N}(0,1)$ and $\tilde{X}_0 \sim p_{\tilde{X}_0}$. With all these, Theorem 2.2 is applicable. We have put extra effort into obtaining the following equivalent but more insightful characterization of the error, as stated below and proved in Appendix B.

Theorem 2.3 (Sparse Recovery). If $\delta > 1$, then define $\lambda_{\mathrm{crit}} = 0$. Otherwise, let $(\lambda_{\mathrm{crit}}, \kappa_{\mathrm{crit}})$ be the unique pair of solutions $(\lambda, \kappa)$ to the following set of equations:
$$\kappa^2\delta = \sigma^2 + \mathbb{E}\big[(\eta(\kappa h + \mu\tilde{X}_0; \kappa\lambda) - \mu\tilde{X}_0)^2\big], \tag{14}$$
$$\kappa\delta = \mathbb{E}\big[\eta(\kappa h + \mu\tilde{X}_0; \kappa\lambda)\cdot h\big], \tag{15}$$
where $h \sim \mathcal{N}(0,1)$ and is independent of $\tilde{X}_0 \sim p_{\tilde{X}_0}$. Then, for any $\lambda > 0$, with probability one,
$$\lim_{n\to\infty} \|\hat{x} - \mu x_0\|_2^2 = \begin{cases} \delta\kappa_{\mathrm{crit}}^2 - \sigma^2, & \lambda \le \lambda_{\mathrm{crit}}, \\ \delta\kappa_*^2(\lambda) - \sigma^2, & \lambda \ge \lambda_{\mathrm{crit}}, \end{cases}$$
where $\kappa_*^2(\lambda)$ is the unique solution to (14).
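To illustrate how the prediction of Theorem 2.3 can be evaluated in practice, the sketch below solves the scalar equation (14) for $\kappa_*(\lambda)$ by bisection, with the expectation approximated by Monte Carlo, and returns the predicted error $\delta\kappa_*^2(\lambda) - \sigma^2$ for the regime $\lambda \ge \lambda_{\mathrm{crit}}$. The sparse prior, the sample size, and the bracketing strategy are illustrative choices, and no attempt is made to locate $\lambda_{\mathrm{crit}}$.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator eta(x; t) of (13)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def predicted_error(lam, delta, mu, sigma2, rho, N=200_000, seed=2):
    """delta*kappa_*(lam)^2 - sigma^2 from Theorem 2.3 (valid for lam >= lam_crit),
    with kappa_*(lam) obtained from the fixed-point equation (14) by bisection."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=N)
    # Illustrative sparse prior: X0 = +-1/sqrt(rho) w.p. rho/2 each, 0 w.p. 1-rho,
    # so that q_X0 has second moment 1/rho as required in Section 2.2.2.
    X0 = rng.choice([-1.0, 0.0, 1.0], size=N, p=[rho / 2, 1 - rho, rho / 2]) / np.sqrt(rho)

    def gap(kappa):  # left-hand side minus right-hand side of (14)
        r = soft(kappa * h + mu * X0, kappa * lam) - mu * X0
        return kappa**2 * delta - (sigma2 + np.mean(r**2))

    lo, hi = 1e-6, 1.0
    while gap(hi) < 0:            # expand the bracket until the sign changes
        hi *= 2.0
    for _ in range(60):           # bisection for kappa_*
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if gap(mid) < 0 else (lo, mid)
    kappa = 0.5 * (lo + hi)
    return delta * kappa**2 - sigma2

# Example: g(x) = sign(x + 0.3 z), for which mu = sqrt(2/pi)/sqrt(1.09) and sigma^2 = 1 - mu^2.
mu = np.sqrt(2 / np.pi) / np.sqrt(1.09)
print(predicted_error(lam=1.5, delta=0.75, mu=mu, sigma2=1 - mu**2, rho=0.1))
```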
[Figure 2: Squared error of the LASSO as a function of the regularizer parameter, compared to the asymptotic predictions. Simulation points represent averages over 20 realizations. (a) Illustration of Thm. 2.3 for $g(x) = \mathrm{sign}(x)$, $n = 512$, $p_{\tilde{X}_0}(+1) = p_{\tilde{X}_0}(-1) = 0.05$, $p_{\tilde{X}_0}(0) = 0.9$, and two values of $\delta$, namely 0.75 and 1.2. (b) Illustration of Thm. 2.2 for $x_0$ being group-sparse as in Section 2.2.3 and $g_i(x) = \mathrm{sign}(x + 0.3 z_i)$. In particular, $x_0$ is composed of $t = 512$ blocks of block size $b = 3$. Each block is zero with probability 0.95; otherwise its entries are i.i.d. $\mathcal{N}(0,1)$. Finally, $\delta = 0.75$.]

Figures 1 and 2(a) validate the prediction of the theorem for different signal distributions, namely $q_{\tilde{X}_0}$ being Gaussian and Bernoulli, respectively. For the case of compressed ($\delta < 1$) measurements, observe the two different regimes of operation, one for $\lambda \le \lambda_{\mathrm{crit}}$ and the other for $\lambda \ge \lambda_{\mathrm{crit}}$, precisely as they are predicted by the theorem (see also [16, Sec. 8]). The special case of Theorem 2.3 in which $q_{\tilde{X}_0}$ is Gaussian has been previously studied in [20]. Otherwise, to the best of our knowledge, this is the first precise analysis result for the $\ell_2$-LASSO stated in that generality. An analogous result, but via different analysis tools, has only been known for the $\ell_2^2$-LASSO, as appears in [2].

2.2.3 Group-Sparse Recovery

Let $\tilde{x}_0 \in \mathbb{R}^n$ be composed of $t$ non-overlapping blocks of constant size $b$ each, such that $n = t\cdot b$. Each block $[\tilde{x}_0]_i$, $i = 1, \ldots, t$, is sampled i.i.d. from a probability density in $\mathbb{R}^b$:
$$p_{\tilde{X}_0}(x) = (1 - \rho)\cdot\delta_0(x) + \rho\cdot q_{\tilde{X}_0}(x), \quad x \in \mathbb{R}^b,$$
where $\rho \in (0,1)$. Thus, $x_0$ is $\rho t$-block-sparse on average. We operate in the regime of linear measurements $m/n = \delta \in (0,\infty)$. As is common, we use the $\ell_{1,2}$-norm to induce block-sparsity, i.e., $f(x) = \sum_{i=1}^t \|[x]_i\|_2$; with this, (9) is often referred to as the group-LASSO in the literature [25]. It is not hard to show that Assumption 1 holds with $F(c_1, c_2, c_3) := \frac{1}{2b}\mathbb{E}\big[\|\vec{\eta}(c_1 h + c_2 \tilde{X}_0; 1)\|_2^2\big]$, where $\vec{\eta}(x; \tau) = \frac{x}{\|x\|_2}(\|x\|_2 - \tau)_+$, $x \in \mathbb{R}^b$, is the vector soft-thresholding operator and $h \sim \mathcal{N}(0, I_b)$, $\tilde{X}_0 \sim p_{\tilde{X}_0}$ are independent. Thus Theorem 2.2 is applicable in this setting; Figure 2(b) illustrates the accuracy of the prediction.

2.2.4 Low-rank Matrix Recovery

Let $X_0 \in \mathbb{R}^{d\times d}$ be an unknown matrix of rank $r$, in which case $x_0 = \mathrm{vec}(X_0)$ with $n = d^2$. Assume $m/d^2 = \delta \in (0,\infty)$ and $r/d = \rho \in (0,1)$. As usual in this setting, we consider nuclear-norm regularization; in particular, we choose $f(x) = \sqrt{d}\,\|X\|_*$. Each subgradient $S \in \partial f(X)$ then satisfies $\|S\|_F \le d$, in agreement with assumption (b) of Section 2.1. Furthermore, for this choice of regularizer, we have
$$\frac{1}{n}\, e_{\sqrt{n}(f^*)^{(n)}, c_3}(c_1 H + c_2 X_0) = \frac{1}{2d^2}\min_{\|V\|_2\le\sqrt{d}} \|V - (c_1 H + c_2 X_0)\|_F^2 = \frac{1}{2d}\min_{\|V\|_2\le 1} \|V - d^{-1/2}(c_1 H + c_2 X_0)\|_F^2 = \frac{1}{2d}\sum_{i=1}^d \eta^2\big(s_i(d^{-1/2}(c_1 H + c_2 X_0)); 1\big),$$
where $\eta(\cdot;\cdot)$ is as in (13), $s_i(\cdot)$ denotes the $i$th singular value of its argument and $H \in \mathbb{R}^{d\times d}$ has entries $\mathcal{N}(0,1)$. If conditions are met such that the empirical distribution of the singular values of (the sequence of random matrices) $c_1 H + c_2 X_0$ converges asymptotically to a limiting distribution, say $q(c_1, c_2)$, then $F(c_1, c_2, c_3) := \frac{1}{2}\mathbb{E}_{x\sim q(c_1, c_2)}\big[\eta^2(x; 1)\big]$, and Theorems 2.1-2.2 apply. For instance, this will be the case if $d^{-1/2}X_0 = USV^T$, where $U, V$ are unitary matrices and $S$ is a diagonal matrix whose entries have a given marginal distribution with bounded moments (in particular, independent of $d$). We leave the details and the problem of (numerically) evaluating $F$ for future work.
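While the details of evaluating $F$ in this setting are left for future work, the finite-$d$ expression above can at least be approximated by Monte Carlo. The sketch below does exactly that for an illustrative rank-$r$ matrix $X_0$ built from Gaussian factors; it makes no claim about the convergence of the empirical singular-value distribution, and the index $c_3$ drops out here because the conjugate of the (scaled) nuclear norm is an indicator function.

```python
import numpy as np

def F_lowrank(c1, c2, d=400, r=20, trials=5, seed=3):
    """Finite-d Monte Carlo approximation of
        (1/(2d)) * sum_i eta^2( s_i( d^{-1/2} (c1*H + c2*X0) ); 1 )
    from Section 2.2.4. The rank-r X0 below is an illustrative choice."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))  # an illustrative rank-r matrix
    X0 *= d / np.linalg.norm(X0)   # ||X0||_F = d, so n^{-1}||vec(X0)||_2^2 = 1 as in condition (a)
    vals = []
    for _ in range(trials):
        H = rng.normal(size=(d, d))
        s = np.linalg.svd(d**-0.5 * (c1 * H + c2 * X0), compute_uv=False)
        vals.append(np.mean(np.maximum(s - 1.0, 0.0) ** 2) / 2.0)  # (1/(2d)) * sum_i eta^2(s_i; 1)
    return float(np.mean(vals))
```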
2.3 An Application to q-bit Compressive Sensing

2.3.1 Setup

Consider recovering a sparse unknown signal $x_0 \in \mathbb{R}^n$ from scalar q-bit quantized linear measurements. Let $t := \{t_0 = 0, t_1, \ldots, t_{L-1}, t_L = +\infty\}$ represent a (symmetric with respect to 0) set of decision thresholds and $\ell := \{\pm\ell_1, \pm\ell_2, \ldots, \pm\ell_L\}$ the corresponding representation points, such that $L = 2^{q-1}$. Then, quantization of a real number $x$ into $q$ bits can be represented as
$$Q_q(x, \ell, t) = \mathrm{sign}(x)\sum_{i=1}^L \ell_i\, \mathbf{1}_{\{t_{i-1}\le |x| \le t_i\}},$$
where $\mathbf{1}_S$ is the indicator function of a set $S$. For example, 1-bit quantization with level $\ell$ corresponds to $Q_1(x, \ell) = \ell\cdot\mathrm{sign}(x)$. The measurement vector $y = [y_1, y_2, \ldots, y_m]^T$ takes the form
$$y_i = Q_q(a_i^T x_0, \ell, t), \quad i = 1, 2, \ldots, m, \tag{16}$$
where the $a_i^T$'s are the rows of a measurement matrix $A \in \mathbb{R}^{m\times n}$, which is henceforth assumed to have i.i.d. standard Gaussian entries. We use the LASSO to obtain an estimate $\hat{x}$ of $x_0$ as
$$\hat{x} := \arg\min_x \|y - Ax\|_2 + \lambda\|x\|_1. \tag{17}$$
Henceforth, we assume for simplicity that $\|x_0\|_2 = 1$. Also, in our case, $\mu$ is known since $g = Q_q$ is known; thus, it is reasonable to scale the solution of (17) as $\mu^{-1}\hat{x}$ and consider the error quantity $\|\mu^{-1}\hat{x} - x_0\|_2$ as a measure of estimation performance. Clearly, the error depends (among other things) on the number of bits $q$, on the choice of the decision thresholds $t$ and on the quantization levels $\ell$. An interesting question of practical importance becomes how to optimally choose these to achieve a smaller error. As a running example for this section, we seek optimal quantization thresholds and corresponding levels
$$(t^*, \ell^*) = \arg\min_{t,\ell} \|\mu^{-1}\hat{x} - x_0\|_2, \tag{18}$$
while keeping all other parameters, such as the number of bits $q$ and of measurements $m$, fixed.

2.3.2 Consequences of Precise Error Prediction

Theorem 2.1 shows that $\|\mu^{-1}\hat{x} - x_0\|_2 = \|\hat{x}_{\mathrm{lin}} - x_0\|_2$, where $\hat{x}_{\mathrm{lin}}$ is the solution to (17), only this time with a measurement vector $y_{\mathrm{lin}} = Ax_0 + \frac{\sigma}{\mu}z$, where $\mu, \sigma$ are as in (20) and $z$ has entries i.i.d. standard normal. Thus, lower values of the ratio $\sigma^2/\mu^2$ correspond to lower values of the error, and the design problem posed in (18) is equivalent to the following simplified one:
$$(t^*, \ell^*) = \arg\min_{t,\ell} \frac{\sigma^2(t, \ell)}{\mu^2(t, \ell)}. \tag{19}$$
To be explicit, $\mu$ and $\sigma^2$ above can be easily expressed from (3) after setting $g = Q_q$ as follows:
$$\mu := \mu(\ell, t) = \sqrt{\frac{2}{\pi}}\sum_{i=1}^L \ell_i\cdot\big(e^{-t_{i-1}^2/2} - e^{-t_i^2/2}\big) \quad\text{and}\quad \sigma^2 := \sigma^2(\ell, t) := \tau^2 - \mu^2, \tag{20}$$
where $\tau^2 := \tau^2(\ell, t) = 2\sum_{i=1}^L \ell_i^2\cdot\big(Q(t_{i-1}) - Q(t_i)\big)$ and $Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^\infty \exp(-u^2/2)\,du$.

2.3.3 An Algorithm for Finding Optimal Quantization Levels and Thresholds

In contrast to the initial problem in (18), the optimization involved in (19) is explicit in terms of the variables $\ell$ and $t$, but it is still hard to solve in general. Interestingly, we show in Appendix C that the popular Lloyd-Max (LM) algorithm can be an effective algorithm for solving (19), since the values to which it converges are stationary points of the objective in (19). Note that this is not a directly obvious result, since the classical objective of the LM algorithm is minimizing the quantity $\mathbb{E}[\|y - Ax_0\|_2^2]$ rather than $\mathbb{E}[\|\mu^{-1}\hat{x} - x_0\|_2^2]$.
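As an illustration of the design problem (19), the sketch below evaluates $\mu(\ell, t)$ and $\sigma^2(\ell, t)$ directly from (20) and runs a plain Lloyd-Max iteration (centroid and midpoint updates for a standard normal source). The initialization, the iteration count and the scipy dependency are illustrative choices of this sketch.

```python
import numpy as np
from scipy.stats import norm   # norm.pdf / norm.sf for the Gaussian density and tail Q(.)

def mu_sigma2(levels, thresholds):
    """mu(l,t) and sigma^2(l,t) from (20) for the q-bit quantizer Q_q(x,l,t).
    thresholds = [t_0=0, t_1, ..., t_{L-1}, t_L=inf], levels = [l_1, ..., l_L]."""
    t, l = np.asarray(thresholds), np.asarray(levels)
    mu = np.sqrt(2 / np.pi) * np.sum(l * (np.exp(-t[:-1]**2 / 2) - np.exp(-t[1:]**2 / 2)))
    tau2 = 2 * np.sum(l**2 * (norm.sf(t[:-1]) - norm.sf(t[1:])))   # Q(x) = norm.sf(x)
    return mu, tau2 - mu**2

def lloyd_max(q, iters=200):
    """Plain Lloyd-Max iteration for a standard normal source with a symmetric q-bit
    quantizer: levels are conditional means of the cells, thresholds are midpoints."""
    L = 2**(q - 1)
    t = np.concatenate(([0.0], np.linspace(0.5, 2.5, L - 1), [np.inf]))  # arbitrary start
    for _ in range(iters):
        l = (norm.pdf(t[:-1]) - norm.pdf(t[1:])) / (norm.sf(t[:-1]) - norm.sf(t[1:]))
        t = np.concatenate(([0.0], 0.5 * (l[:-1] + l[1:]), [np.inf]))
    return l, t

levels, thresholds = lloyd_max(q=3)
mu, sigma2 = mu_sigma2(levels, thresholds)
ratio = sigma2 / mu**2        # the objective of (19) at the Lloyd-Max quantizer
```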
References

[1] Francis R. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, pages 118–126, 2010.
[2] Mohsen Bayati and Andrea Montanari. The LASSO risk for Gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997–2017, 2012.
[3] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
[4] David R. Brillinger. The identification of a particular nonlinear time series system. Biometrika, 64(3):509–515, 1977.
[5] David R. Brillinger. A generalized linear model with "Gaussian" regressor variables. A Festschrift For Erich L. Lehmann, page 97, 1982.
[6] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[7] David L. Donoho and Iain M. Johnstone. Minimax risk over $\ell_p$-balls for $\ell_p$-error. Probability Theory and Related Fields, 99(2):277–303, 1994.
[8] David L. Donoho, Iain Johnstone, and Andrea Montanari. Accurate prediction of phase transitions in compressed sensing via a connection to minimax denoising. IEEE Transactions on Information Theory, 59(6):3396–3433, 2013.
[9] David L. Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
[10] David L. Donoho, Arian Maleki, and Andrea Montanari. The noise-sensitivity phase transition in compressed sensing. IEEE Transactions on Information Theory, 57(10):6920–6941, 2011.
[11] Alexandra L. Garnham and Luke A. Prendergast. A note on least squares sensitivity in single-index model estimation and the benefits of response transformations. Electronic Journal of Statistics, 7:1983–2004, 2013.
[12] Yehoram Gordon. On Milman's inequality and random subspaces which escape through a mesh in $\mathbb{R}^n$. Springer, 1988.
[13] Marwa El Halabi and Volkan Cevher. A totally unimodular view of structured sparsity. arXiv preprint arXiv:1411.1990, 2014.
[14] Hidehiko Ichimura. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1):71–120, 1993.
[15] Ker-Chau Li and Naihua Duan. Regression analysis under link violation. The Annals of Statistics, pages 1009–1052, 1989.
[16] Samet Oymak, Christos Thrampoulidis, and Babak Hassibi. The squared-error of generalized LASSO: A precise analysis. arXiv preprint arXiv:1311.0830, 2013.
[17] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. arXiv preprint arXiv:1502.04071, 2015.
[18] Mihailo Stojnic. A framework to characterize performance of LASSO algorithms. arXiv preprint arXiv:1303.7291, 2013.
[19] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. Regularized linear regression: A precise analysis of the estimation error. In Proceedings of The 28th Conference on Learning Theory, pages 1683–1709, 2015.
[20] Christos Thrampoulidis, Ashkan Panahi, Daniel Guo, and Babak Hassibi. Precise error analysis of the LASSO. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 3467–3471. IEEE, 2015.
[21] Christos Thrampoulidis, Ashkan Panahi, and Babak Hassibi. Asymptotically exact error analysis for the generalized $\ell_2^2$-LASSO. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 2021–2025. IEEE, 2015.
[22] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[23] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
[24] Xinyang Yi, Zhaoran Wang, Constantine Caramanis, and Han Liu. Optimal linear estimation under unknown nonlinear transform. arXiv preprint arXiv:1505.03257, 2015.
[25] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.