6.893 FINAL PROJECT: QUANTILE ESTIMATION IN MANY DIMENSIONS

JOSH ALMAN AND JON SCHNEIDER

1. Introduction

In our project, we consider the following generalization of the median estimation problem:

Problem 1. Given a stream S of n vectors in R^d, followed by a query vector v, estimate the maximum (or median, or rth quantile) of {a · v | a ∈ S}, the set of projections of the vectors from S onto v.

There is a clear algorithm that takes linear time and space to solve this problem exactly: store all of S, then compute the projections onto v, and find the median using a linear-time median-finding algorithm. We seek an algorithm that estimates the answer using sublinear time and space. In particular, we typically envision n ≫ d, or even d constant, and so we seek time and space complexities that are independent of n. In our project, we focus mainly on space complexity.

In addition to this problem being interesting in its own right, it could have some insightful applications, as comparisons of vector projections are ubiquitous in combinatorial optimization, data analysis, and throughout computer science. The variant of our problem where we find the maximum is particularly significant in the context of optimizing an unknown weight function over a set of given points, a common problem in combinatorial optimization.

1.1. Estimation. The important question to ask before solving our problem is what we mean by "estimate." There are two reasonable notions of estimation: quantile estimation and geometric estimation.

1.1.1. Quantile Estimation. In quantile estimation, we want our algorithm to return a value that is within εn elements of our stream from the actual quantile we want. Hence, if we are estimating the rth quantile, we need to return a value between the (r − ε)th and (r + ε)th quantiles. However, for fixed d, quantile estimation is just as hard in one dimension as it is in d > 1 dimensions.
Indeed, recall our sublinear-time quantile estimation algorithm from class for one dimension in the sampling model: we sample O(log(1/δ)/ε²) points and return the rth quantile of our sample. With probability 1 − δ, this value is within εn elements of the real rth quantile by the Chernoff bound.

Date: May 2013.

This method carries over to d > 1 dimensions in either the streaming or the sampling model. If we sample or remember a uniformly random selection of O(log(1/δ)/ε²) of the vectors of S, then no matter what query vector v we get, the rth quantile of the projections of the remembered vectors onto v is within εn elements of the real rth quantile with probability 1 − δ. For this reason, the quantile estimation variant of our problem is not very interesting. We instead focus on the geometric estimation variant.

1.1.2. Geometric Estimation. In geometric estimation, we want our algorithm to return a value that is within a factor of (1 ± ε) of the rth quantile. In the rest of this paper, we prove the following bounds, when d is fixed:

Theorem 1. Approximating the maximum to within 1 ± ε requires Θ(ε^{-(d-1)/2}) space.

Theorem 2. Approximating the median to within 1 ± ε requires Ω(n) space (when d > 1).

Theorem 3. Approximating the rth quantile (for r ≠ 0, 1/2, 1) to within 1 ± ε requires Ω(√n) space (when d > 1).

2. Approximating the Maximum

We begin with the problem of estimating the maximum of the projections of the vectors of S onto v. We will first give an algorithm for the problem that uses O(ε^{-(d-1)/2}) space, and then prove a matching lower bound of Ω(ε^{-(d-1)/2}) space required for the problem, in order to prove Theorem 1. Both our upper and lower bounds will make use of spherical codes for efficiently packing and covering the sphere with circles. We prove the results about sizes of spherical codes that we use in Appendix A.

2.1. An Algorithm.
To give an upper bound on the space required for approximating the maximum, we provide an algorithm for the problem that uses ideas from core-sets in computational geometry. The algorithm is a simplification of an algorithm for computing ε-kernels that was found independently by Chan [2] and Yu et al. [5].

We first give some intuition for the algorithm. Say that there is some direction v0 for which we know that the vector of S whose projection onto v0 is maximal is s0. Then, in any other direction v, we could project s0 onto v as well. If θ is the angle between v0 and v, then projecting s0 onto v gives at least cos θ times the length of the largest projection of a vector of S onto v. In other words, knowing the maximum projection onto v0 gives a cos θ approximation for any vector with angle θ from v0. The proof of this can be found in [1, Lemma 2.1].

The idea for our algorithm, then, is to pick θ such that cos θ ≥ 1 − ε (namely, pick θ = Θ(√ε), since cos θ ≈ 1 − θ²/2), and then select some directions v_i so that any direction is within angle θ of one of the v_i we pick. If we find a covering of the unit sphere by circles of radius θ (see Figure 1), then picking our v_i to be the vectors to the centers of the circles will give us this property.

Figure 1. Covering the sphere with radius-θ circles. (a) An illustration of two radius-θ circles on a sphere. (b) A circle covering of the sphere.

We know that a sphere covering exists that uses O(θ^{-(d-1)}) = O(ε^{-(d-1)/2}) circles (see Lemma 4). By picking the corresponding v_i vectors, we thus get an algorithm using this much space. In the next section, we prove a lower bound that matches the space bound we just achieved.

2.2. Hardness of Approximating the Maximum. In the previous section we saw that we can approximate the maximum to within 1 + ε using O(ε^{-(d-1)/2}) space. In this section, we will see that this much space is required.
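The algorithm of Section 2.1 is easy to sketch in two dimensions, where the unit sphere is a circle and the covering is just O(ε^{-1/2}) evenly spaced directions. The sketch below is our own illustrative code (the names `build_sketch` and `query_max` are ours); the guarantee it checks is the additive form of the cos θ argument above, namely that the answer is off by at most O(√ε) times the largest point norm.

```python
import math

def build_sketch(stream, eps):
    # Cover the unit circle with directions spaced theta = sqrt(eps) apart
    # (so cos(theta) >= 1 - eps up to constants), and remember, for each
    # direction, the input vector whose projection onto it is largest.
    theta = math.sqrt(eps)
    m = math.ceil(2 * math.pi / theta)        # O(eps^{-1/2}) directions
    dirs = [(math.cos(i * theta), math.sin(i * theta)) for i in range(m)]
    best = [None] * m                         # (projection, point) per direction
    for x, y in stream:
        for i, (ux, uy) in enumerate(dirs):
            proj = x * ux + y * uy
            if best[i] is None or proj > best[i][0]:
                best[i] = (proj, (x, y))
    return [p for _, p in best]               # space independent of n

def query_max(sketch, v):
    # Answer a query by maximizing only over the remembered vectors.
    return max(x * v[0] + y * v[1] for x, y in sketch)
```

The sketch stores O(ε^{-1/2}) vectors regardless of the stream length n, matching the space bound of Section 2.1.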
The key idea in the proof is that the information in the data set is hard to compress; in order to encode answers to all possible queries, we need this amount of space. As a corollary, this hardness result extends to a far more general model of computation than the streaming model: even in the model where we have unlimited computational power while processing the data set, we need to remember at least Ω(ε^{-(d-1)/2}) bits of information to answer the subsequent query.

Theorem 4. Approximating the maximum to within 1 ± ε requires Ω(ε^{-(d-1)/2}) space.

Proof. As mentioned above, we wish to show that it is hard to compress the information in such a way that we can still answer our desired queries. To do this, we will demonstrate a family of possible inputs such that any two inputs differ significantly (i.e., by a factor of at least (1 + ε)² ≈ 1 + 2ε) on some possible query vector.

We will construct this family as follows. Choose a set V = {v_1, v_2, . . . , v_N} of unit vectors in R^d such that any two vectors in V are angle at least θ apart (where, as before, we choose θ ≈ √ε so that cos θ = (1 + ε)^{-2} ≈ 1 − 2ε).

Figure 2. Packing the sphere with radius-θ circles.

Note that this corresponds to a sphere packing of θ-radius circles on the surface of a (d − 1)-dimensional sphere (see Figure 2). It is known that such a set of size N = Ω(θ^{-(d-1)}) exists (see Lemma 3).

Given V, construct our set S of input vectors by, for each v_i, either including v_i in S or including (1 + ε)² v_i in S. Since we have two choices per vector in V, our family of possible inputs contains 2^{|V|} = 2^N possible inputs.

Next, let m_i = max ⟨w, v_i⟩, where w ranges over all vectors in S (in other words, m_i is the answer to the query with query vector v_i). We claim that m_i = 1 if v_i ∈ S and m_i = (1 + ε)² if (1 + ε)² v_i ∈ S.
To show this, it suffices to notice that, since every other element of S is angle at least θ from v_i, the maximum of ⟨w, v_i⟩ over all w ∈ S not proportional to v_i is at most (1 + ε)² cos θ ≤ 1. It follows that v_i or (1 + ε)² v_i is the maximizer of max ⟨w, v_i⟩.

It now follows that for any two of the 2^N possible choices of input, there is a query vector for which the two exact outputs differ by a factor of at least (1 + ε)². Any correct algorithm must present different outputs for these two different inputs, and therefore any correct algorithm must be able to distinguish perfectly between all 2^N possible input sets. The required space complexity for such an algorithm is Ω(N) = Ω(θ^{-(d-1)}) = Ω(ε^{-(d-1)/2}), as desired.

3. Approximating the Median

In the previous section, we saw that we can approximate our generalized notion of maximum in space independent of n. A natural question is whether we can obtain such a result for the median (perhaps by modifying the maximum algorithm to instead store the median answer in a variety of different directions). Unfortunately, we will see that this is impossible (when d > 1). In this section we will show that any algorithm that estimates the median to within 1 ± ε must use at least Ω(n) space (and in particular, no sublinear algorithm can exist for it). A similar yet weaker lower bound of Ω(√n) likewise carries over to the general problem of computing rth quantiles for general r.

As in the previous section, our core approach will be the same. We will show that the required information is hard to compress; in particular, that there are at least 2^N different data sets with O(N) points that we must successfully distinguish between.

Theorem 5. Approximating the median to within 1 ± ε requires Ω(n) space (when d > 1).

Proof. It suffices to prove this for the case where d = 2; in any higher dimension, we can just restrict all our points to lie in a plane.
Choose N vectors V = {v_1, v_2, . . . , v_N} in R² such that no two are parallel, and choose any subset S of V. We will show how to construct an input set with O(N) points such that for any v_i ∈ S, the median is 0, whereas for any v_i ∈ V \ S, the median is nonzero (and hence multiplicatively separated from 0). Since there are 2^N possible choices of S, it follows that we need Ω(N) = Ω(n) space to distinguish between all of these possibilities.

To construct our desired input set, we will incrementally add points, starting with an input set that contains only the origin. For each v ∈ S, draw the line ⟨v, x⟩ = 0. These lines divide R² into 2|S| different regions (see Figure 3). For the median in each of these directions to occur at 0, we would like there to be the same number of points on each side of each line. Currently this condition is satisfied, since there are no points in any of the regions.

Now, consider in turn each v ∈ V \ S. For each v, draw the line ℓ_v given by ⟨v, x⟩ = 0; we would like there to be a different number of points on each side of this line. Each such line intersects two existing regions; label these regions R and −R. If either R or −R contains any points, then we can just move all these points to the same side of ℓ_v while not changing the number of points in any existing region (and hence keeping the medians across all other lines the same). On the other hand, if there are no points in R or −R, we can first add one point to each of these two regions; note that doing so preserves the absolute difference between the number of points on each side of every already existing line. We can then reduce this to the previous case.

Figure 3. Dividing R² into regions.

Altogether, since we consider at most N different directions v, we add at most 2N points to our input set, so our resulting input set has size O(N) = O(n), as desired.
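The incremental construction in the proof above can be carried out explicitly for d = 2. In the sketch below (our own code, not from the paper), all lines pass through the origin, so the regions are circular sectors, each non-origin point can be represented by its angle on the unit circle, and "moving a point within its region" is just changing its angle within a sector. We assume the direction angles are generic (pairwise non-parallel) and that S is nonempty.

```python
import math

TAU = 2 * math.pi

def _sgn(x):
    return 1 if x > 0 else -1

def _in_arc(a, lo, span):
    # is angle a strictly inside the open arc (lo, lo + span)?
    return 0 < (a - lo) % TAU < span

def _sub_arc_midpoint(lo, span, g, b, t):
    # Midpoint of the part of sector (lo, lo + span) lying on side t of the
    # line with normal direction b (the line meets the sector at angle g).
    off = (g - lo) % TAU
    for m in ((lo + off / 2) % TAU, (lo + off + (span - off) / 2) % TAU):
        if _sgn(math.cos(m - b)) == t:
            return m
    raise AssertionError("no sub-arc on the requested side")

def build_hard_instance(betas, S):
    # betas: angles of the query directions v_i = (cos b_i, sin b_i);
    # S: indices whose median must be 0.  Returns O(N) points in R^2.
    pts = []                      # angles of unit-norm points; origin kept implicit
    done = [betas[i] for i in S]  # lines already drawn; their balance is preserved
    for i, b in enumerate(betas):
        if i in S:
            continue
        g1, g2 = (b + math.pi / 2) % TAU, (b - math.pi / 2) % TAU
        bounds = sorted({(d + s * math.pi / 2) % TAU for d in done for s in (1, -1)})
        def sector(a):
            for j, lo in enumerate(bounds):
                span = (bounds[(j + 1) % len(bounds)] - lo) % TAU
                if _in_arc(a, lo, span):
                    return lo, span
            raise AssertionError("angle on a boundary (non-generic input)")
        lo1, span1 = sector(g1)   # region R: the sector the new line enters
        lo2, span2 = sector(g2)   # region -R: its antipodal sector
        if not any(_in_arc(p, lo1, span1) or _in_arc(p, lo2, span2) for p in pts):
            # add one point to each of R and -R; the pair lands on opposite
            # sides of every existing line, so all balances are preserved
            pts.append((lo1 + ((g1 - lo1) % TAU) / 2) % TAU)
            pts.append((lo2 + ((g2 - lo2) % TAU) / 2) % TAU)
        for t in (1, -1):         # move the points in R and -R to one side
            cand = [_sub_arc_midpoint(lo1, span1, g1, b, t) if _in_arc(p, lo1, span1)
                    else _sub_arc_midpoint(lo2, span2, g2, b, t) if _in_arc(p, lo2, span2)
                    else p for p in pts]
            if sum(_sgn(math.cos(p - b)) for p in cand) != 0:  # sides now unequal
                pts = cand
                break
        done.append(b)
    return [(math.cos(p), math.sin(p)) for p in pts] + [(0.0, 0.0)]
```

Because points are only ever moved within a sector of the current arrangement, they never cross an already drawn line, so each line's balance (or imbalance) is fixed once and for all at its own step, exactly as in the proof.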
Remark: In the above proof, we have exploited the fact that to estimate 0 up to a multiplicative factor, we need to output it exactly. Such a proof is not necessary, however; we can straightforwardly adapt the same proof to the case where all vectors v have magnitude at least C (for any fixed C). The main change is that, instead of considering lines through the origin, we consider lines through some arbitrary point p, and our regions are now constrained by additional error margins about each line; see Figure 4.

Figure 4. Dividing R² into regions with ε-margins.

We can adapt the proof above to the case of approximating the rth quantile, unfortunately at the cost of a weaker lower bound.

Theorem 6. Approximating the rth quantile (for r ≠ 0, 1/2, 1) to within 1 ± ε requires Ω(√n) space (when d > 1).

Proof. The main reason the proof of Theorem 5 above fails in this case is that, by adding one point to each of the regions R and −R, we can no longer guarantee that the rth quantile in an existing direction has not changed (whereas we could for the median). Below we fix our proof to address this problem.

Write r = p/(p + q) for positive integers p and q, and without loss of generality assume r > 1/2, so that p > q. When originally choosing the set V of vectors, choose them all so they lie in a common half-plane (e.g., make all of them have positive x-component). Now, each vector v once again defines a line ⟨v, x⟩ = 0 which divides R² into two regions; this time, call the region ⟨v, x⟩ > 0 the "positive" region for v and the region ⟨v, x⟩ < 0 the "negative" region for v. Since all vectors v lie in a common half-plane, it follows that there exists some region R_p that lies in all of the positive regions. Now, we claim that, if there are 2N regions, then adding (p − q)N + q points to the region R_p and q points to each other region does not change, for any direction, whether the rth quantile is 0.
To see why this is true, note that for every line, the side of the line containing R_p gets ((p − q)N + q) + (N − 1)q = pN new points, whereas the other side of the line gets qN new points. It follows that if the ratio of the number of points on the two sides of the line was already p : q (that is, r : (1 − r)), it will stay equal to p : q. Since this operation adds at least one point to every region, we can perform it in lieu of adding one point to each of R and −R (as we did for the median), and our new construction thus works. We now possibly add O(N) points for each of the N directions, so our resulting input set will have at most n = O(N²) points, from which it follows that any approximation algorithm requires at least Ω(√n) space.

4. Further Steps

There are many possible avenues for extending and generalizing the results of this paper. One such avenue is strengthening the lower bound for rth quantile finding. At heart, the rth quantile problem seems very similar to the median problem, so a similar linear bound seems attainable (possibly via an improved construction).

Another is proving lower bounds for median approximation in more restricted settings. Currently, our construction requires that our input points can range across all of R^d. While we can strengthen our construction somewhat (so that we only require input points that have magnitude at least C, for example), our results don't seem to extend to cases where all our input points must lie in a partially bounded region. For example, if all input points and query vectors must lie in (R⁺)^d (i.e., have all positive components), is it still true that we require Ω(n) space? What if all points lie in [1, 2]^d? In this latter case, it seems like we should be able to subdivide this region in such a way that we can answer such queries in space independent of n.

Finally, there are other models for median estimation which we have not considered in this paper.
For example, in [3] Greenwald and Khanna present an algorithm for deterministic median estimation in the quantile estimation model. Likewise, in [4] Guha and McGregor present an algorithm for median estimation in the random-order stream model (where we can choose to receive the elements in the stream in a random order). It would be interesting to adapt either algorithm to this higher-dimensional setting.

References

[1] P. Agarwal, S. Har-Peled, and K. Varadarajan. Geometric Approximation via Coresets. Survey available at http://valis.cs.uiuc.edu/~sariel/research/papers/04/survey/survey.pdf
[2] T. M. Chan. Faster coreset constructions and data stream algorithms in fixed dimensions. In Proc. 20th Annu. ACM Sympos. Comput. Geom., 152–159, 2004.
[3] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD, 58–66, 2001.
[4] S. Guha and A. McGregor. Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams. SIAM Journal on Computing, 38(5): 2044–2059, 2008.
[5] H. Yu, P. K. Agarwal, R. Poreddy, and K. R. Varadarajan. Practical methods for shape fitting and kinetic data structures using core sets. In Proc. 20th Annu. ACM Sympos. Comput. Geom., 263–272, 2004.

Appendix A. Spherical Codes

We prove some lemmas about sizes of spherical codes that are used in our bounds on the maximum estimation problem in Section 2. We will show that the best coverings and the best packings of the (d − 1)-sphere of radius 1 by (d − 2)-spheres of radius θ have size Θ(θ^{-(d-1)}), when d is constant.

Lemma 1. A packing of the (d − 1)-sphere of radius 1 by (d − 2)-spheres of radius θ uses O(θ^{-(d-1)}) (d − 2)-spheres.

Proof. We know from calculus that, if S_d is the surface area of the radius-1 d-sphere, and V_d is its volume, then for all d: S_{d-1} = 2π V_{d-2}. Moreover, if V′_{d-2} is the volume of a (d − 2)-sphere of radius θ, then V′_{d-2} = V_{d-2} × θ^{d-1}.
If we pack (d − 2)-spheres onto a (d − 1)-sphere, then the total volume of the (d − 2)-spheres cannot exceed the surface area of the (d − 1)-sphere, since the (d − 2)-spheres do not overlap. Hence, the maximum number of (d − 2)-spheres we can pack is S_{d-1} / V′_{d-2} = O(θ^{-(d-1)}).

Lemma 2. A covering of the (d − 1)-sphere of radius 1 by (d − 2)-spheres of radius θ uses Ω(θ^{-(d-1)}) (d − 2)-spheres.

Proof. If we use (d − 2)-spheres to cover a (d − 1)-sphere, then the total volume of the (d − 2)-spheres must be at least the surface area of the (d − 1)-sphere. Hence, the minimum number of (d − 2)-spheres we need is again S_{d-1} / V′_{d-2} = Ω(θ^{-(d-1)}).

Lemma 3. A packing of the (d − 1)-sphere of radius 1 by (d − 2)-spheres of radius θ can be found that uses Ω(θ^{-(d-1)}) (d − 2)-spheres.

Proof. We can repeatedly place (d − 2)-spheres on the (d − 1)-sphere in any way we want, so that no two intersect, until we have a maximal packing, meaning it is no longer possible to place a new (d − 2)-sphere without it intersecting one that was already placed. Then, we could increase the radius of each sphere we placed from θ to 2θ. This gives a covering of the (d − 1)-sphere, since if any point were not covered, then we could have centered a (d − 2)-sphere there initially. The result follows from our bound in Lemma 2.

Lemma 4. A covering of the (d − 1)-sphere of radius 1 by (d − 2)-spheres of radius θ can be found that uses O(θ^{-(d-1)}) (d − 2)-spheres.

Proof. We can take any maximal packing of the (d − 1)-sphere by (d − 2)-spheres of radius ½θ and then double the radii of the (d − 2)-spheres to get a covering, as above, of the desired size.
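The doubling argument of Lemmas 3 and 4 is easy to check numerically when d = 2, where the (d − 1)-sphere is the unit circle and the "(d − 2)-spheres of radius θ" are arcs of angular radius θ. The sketch below (function names and the grid discretization are ours) greedily builds a maximal packing and checks that it has Θ(1/θ) arcs and that doubling their radii covers the circle.

```python
import math

TAU = 2 * math.pi

def circ_dist(a, b):
    # angular distance between two directions on the circle
    d = abs(a - b) % TAU
    return min(d, TAU - d)

def maximal_packing(theta, grid=4000):
    # Greedily keep a candidate center whenever its radius-theta arc is
    # disjoint from all kept arcs, i.e. the centers stay at angular
    # distance >= 2*theta from one another.
    centers = []
    for k in range(grid):
        a = TAU * k / grid
        if all(circ_dist(a, c) >= 2 * theta for c in centers):
            centers.append(a)
    return centers
```

By maximality, every rejected candidate is within 2θ of some kept center, so doubling the radii yields a covering; disjointness bounds the number of arcs above by π/θ, and maximality bounds it below by roughly π/(2θ).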