A Linear Lower Bound on the Unbounded Error Probabilistic Communication Complexity

Jürgen Forster
Lehrstuhl Mathematik & Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum, 44780 Bochum, Germany
E-mail: [email protected]

We prove a general lower bound on the complexity of unbounded error probabilistic communication protocols. This result improves on a lower bound for bounded error protocols by Krause. As a simple consequence we obtain the first linear lower bound on the complexity of unbounded error probabilistic communication protocols for the functions defined by Hadamard matrices. We also give an upper bound on the margin of any embedding of a concept class in half spaces.

Key Words: lower bounds, probabilistic communication complexity, Hadamard matrix, operator norm

CONTENTS
0. Introduction.
1. Notation.
2. A lower bound on the complexity of unbounded error probabilistic communication protocols.
3. An upper bound on the margin of arrangements of half spaces.
4. A result from linear algebra.

0. INTRODUCTION

Lower bounds on the complexity of communication protocols have applications in a large field of problems such as in VLSI and distributed computing. For a list of references see, e.g., Krause [12]. In this paper we prove new and improved lower bounds on the communication complexity of distributed functions. In particular we show that the unbounded error probabilistic communication complexity of the functions defined by Hadamard matrices is linear. This solves a long standing open problem stated in Paturi and Simon [14], Krause [12].

In this paper a probabilistic communication protocol is a probabilistic algorithm for two processors $P_0$ and $P_1$ that computes a distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$. Both processors have unbounded computational power. $P_0$ sees only the first part, $x$, and $P_1$ sees only the last part, $y$, of the input $(x,y) \in \{0,1\}^n \times \{0,1\}^n$. Obviously there has to be some communication between the two processors to calculate $f(x,y) \in \{0,1\}$. The processors can communicate by exchanging messages $b \in \{0,1\}$. The computation takes place in rounds. In each round one of the processors is active, in odd rounds it is $P_0$ and in even rounds it is $P_1$. The active processor probabilistically (depending on the part of the input it knows and on the past messages) chooses a message according to the communication protocol. In the final round the active processor probabilistically chooses the result of the computation. We say that a protocol computes the distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ with unbounded error if for all inputs $(x,y) \in \{0,1\}^n \times \{0,1\}^n$ the correct output is calculated with probability greater than $\frac{1}{2}$. The complexity of a communication protocol is $\lceil \log_2 N \rceil$, where $N$ is the number of distinct message sequences that can occur in computations that follow the protocol. The communication complexity $\tilde{C}_f$ of a distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ is the smallest complexity that a communication protocol for $f$ can have.

It is known that the communication complexity for almost all functions $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ is linear in $n$ (see Alon, Frankl and Rödl [1], Paturi and Simon [14]). This was shown by counting arguments that do not give lower bounds on the communication complexity of particular functions. Although it was known that most functions have linear communication complexity, one did not know this for any particular function.
Paturi and Simon [14] give an example of a distributed function with logarithmic communication complexity. They conjecture that the functions defined by Hadamard matrices have linear probabilistic communication complexity. We prove this conjecture in Corollary 2.2.

Our techniques can also be applied to another class of problems. Recently there has been a lot of interest in maximal margin classifiers. Learning algorithms that calculate the hyperplane that separates a sample with the largest margin and use this hyperplane to classify new instances have shown excellent empirical performance (see [4]). Often the instances are mapped (implicitly when a kernel function is used) to some possibly high dimensional space before the hyperplane with maximal margin is calculated. If the norms of the instances are bounded and a hyperplane with large margin can be found, a bound on the VC-dimension can be applied (Vapnik [15]; [4], Theorem 4.16). A small VC-dimension means that a concept class can be learned with a small sample size (Vapnik and Chervonenkis [16]; Blumer et al. [3]; [11], Theorem 3.3). The success of maximal margin classifiers raises the question which concept classes can be embedded in half spaces with a large margin. For every concept class there is a trivial embedding into half spaces. Ben-David, Eiron and Simon [2] show that most concept classes (even of small VC-dimension) cannot be embedded with a margin that is much larger than the trivial margin. They use counting arguments that do not give an upper bound on the margin for a particular concept class. Vapnik [15] also showed an upper bound on the margin that is in terms of the VC-dimension. We prove a stronger general bound on the margin, and show that for the concept classes defined by Hadamard matrices only the trivial margin can be achieved.

Both a distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ and a concept class $C$ over an instance space $X$ can be represented by a matrix with entries $\pm 1$. For a distributed function $f$ we can use the matrix $M_f := (2f(x,y)-1)_{x,y \in \{0,1\}^n}$, and for a concept class $C$ over the instance space $X$ we can use the matrix $M_{X,C} \in \{-1,1\}^{X \times C}$ for which the entry $(M_{X,C})_{x,c}$ is $1$ if $x \in c$ and $-1$ otherwise.

Krause [12] shows that the complexity of any probabilistic communication protocol which computes a function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ with error probability bounded by $\frac{1}{2} - \frac{1}{s}$ is at least $\frac{1}{4}\big(n - \log_2\|M_f\| - \log_2 s - 2\big)$ for all $s \in \mathbb{N}$. Here $\|M_f\|$ is the operator norm of the matrix $M_f$. We improve on this result by showing that the assumption that the error of the protocol is bounded is not really needed: in Section 2 we prove a lower bound of $n - \log_2\|M_f\|$ on the unbounded error communication complexity. A result from linear algebra that we need in the proof is given in Section 4. In Section 3 we show that any concept class $(X, C)$ can only be embedded in homogeneous half spaces with margin at most $\|M_{X,C}\| / \sqrt{|X|\,|C|}$. In the next section we fix some notation for the rest of the paper.

After a preliminary version of this paper [5] was published, some of the techniques and results presented here have been improved and strengthened. It is shown in [6] that the lower bound on the communication complexity of a distributed function given in Section 2 is reasonably good for almost all distributed functions: it is linear in $n$ for the vast majority of distributed functions.
Theorem 2.2 is used in [6] to show lower bounds on the size of depth-2 threshold circuits that compute Hadamard matrices: if the top gate of the circuit is a linear threshold gate with unrestricted weights and there are $s$ linear threshold gates on the bottom level with integer weights of absolute value at most $W$, then $s = \Omega\big(\frac{2^{n/2}}{nW}\big)$. If the top gate of the circuit is a linear threshold gate with unrestricted weights and there are $s$ gates on the bottom level that compute symmetric functions, then $s = \Omega\big(\frac{2^{n/2}}{n}\big)$. Also, Theorem 2.2 is used to give an example of a function which can only be computed by probabilistic ordered binary decision diagrams (OBDDs) of exponential size. Furthermore it is shown in [6] that the lower bound of Theorem 2.2 even holds for all matrices $M \in \mathbb{R}^{X \times Y}$ for which the entries $M_{x,y}$ have absolute value at least $1$. An example of a class of matrices is given for which Theorem 2.2 fails, but for which this generalization of Theorem 2.2 still gives a strong lower bound on the dimension.

A similar generalization of Theorem 3.1 to matrices with arbitrary non-zero entries is given in [7]. There it is shown that the margin of any embedding of a matrix $M \in \mathbb{R}^{X \times Y}$ with non-zero entries is at most

$$\frac{\sqrt{|X|}\,\|M\|}{\sqrt{\sum_{y \in Y}\big(\sum_{x \in X}|M_{x,y}|\big)^{2}}}\,.$$

This result is used to show that the optimal margin for embedding the matrices $M_n \in \{-1,1\}^{n \times n}$ with $M_{i,j} = 1$ if and only if $i \le j$ is of the order $\frac{2}{\pi \ln n} + \Theta\big(\frac{1}{(\ln n)^{2}}\big)$.

1. NOTATION

For a finite set $X$, $\mathbb{R}^X$ is the vector space of real valued functions ("vectors") on $X$. The Euclidean norm of a vector $a \in \mathbb{R}^X$ is $\|a\| := \sqrt{\sum_{x \in X} a_x^2}$. As usual we write $\mathbb{R}^n := \mathbb{R}^{\{1,\dots,n\}}$. The vectors $x \in \mathbb{R}^n$ are column vectors. For two finite sets $X, Y$ we write $\mathbb{R}^{X \times Y}$ for the set of real matrices with rows indexed by the elements of $X$ and columns indexed by the elements of $Y$. The transposition of a matrix $A \in \mathbb{R}^{X \times Y}$ is denoted by $A^\top \in \mathbb{R}^{Y \times X}$. The identity matrix is denoted by $I_X \in \mathbb{R}^{X \times X}$. We define $e_x \in \mathbb{R}^X$, $x \in X$, to be the canonical unit vectors satisfying

$$(e_x)_y = \begin{cases} 1, & x = y, \\ 0, & x \ne y, \end{cases} \qquad \text{for } x, y \in X.$$

The $(k-1)$-dimensional sphere, i.e. the set $\{x \in \mathbb{R}^k \mid \|x\| = 1\}$, is denoted by $S^{k-1}$. For a linear function $A : \mathbb{R}^X \to \mathbb{R}^Y$, the null space is $\mathrm{null}(A) = \{x \in \mathbb{R}^X \mid Ax = 0\}$, and the range of $A$ is $\mathrm{range}(A) = \{Ax \mid x \in \mathbb{R}^X\}$. The trace of a matrix $A \in \mathbb{R}^{X \times X}$ is $\mathrm{trace}(A) = \sum_{x \in X} A_{x,x}$. The group of nonsingular matrices $A \in \mathbb{R}^{k \times k}$ is denoted by $GL(k)$. The operator norm of a matrix $A \in \mathbb{R}^{X \times Y}$ is

$$\|A\| = \sup_{y \in \mathbb{R}^Y,\ \|y\| \le 1} \|Ay\| = \max_{y \in \mathbb{R}^Y,\ \|y\| \le 1} \|Ay\| .$$

The supremum is attained because $\|Ay\|$ is a continuous function of $y$ and the unit ball $\{y \in \mathbb{R}^Y \mid \|y\| \le 1\}$ is compact. It is well known that for any matrix $A \in \mathbb{R}^{X \times Y}$: $\|A\|^2 = \|A^\top A\| = \|A A^\top\|$.

For every symmetric matrix $A \in \mathbb{R}^{X \times X}$ (i.e. if $A = A^\top$) there is an orthonormal basis $d_1, \dots, d_{|X|}$ of $\mathbb{R}^X$ and there are numbers $\lambda_1, \dots, \lambda_{|X|} \in \mathbb{R}$ (the eigenvalues of $A$) such that

$$A = \sum_{i=1}^{|X|} \lambda_i\, d_i d_i^\top .$$

(This follows from the Spectral Theorem.) In this case $\|A\|$ is the largest absolute value of an eigenvalue of $A$. Note that for every orthonormal basis $d_1, \dots, d_{|X|}$ of $\mathbb{R}^X$ the sum $\sum_{i=1}^{|X|} d_i d_i^\top$ is the identity matrix $I_X \in \mathbb{R}^{X \times X}$. A symmetric matrix $A \in \mathbb{R}^{X \times X}$ is said to be positive semi-definite, denoted by $A \succeq 0$, if $x^\top A x \ge 0$ for all $x \in \mathbb{R}^X$. For two symmetric matrices $A, B \in \mathbb{R}^{X \times X}$ we write $A \preceq B$ if $B - A \succeq 0$. In this case the smallest eigenvalue of $A$ is less than or equal to the smallest eigenvalue of $B$. This can for example be seen by noting that for a symmetric matrix $A$ the inequality $\lambda I_X \preceq A$ means that the smallest eigenvalue of $A$ is at least $\lambda \in \mathbb{R}$. For details see, e.g., [13].
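The facts collected in this section are easy to check numerically. The following sketch is only an illustration (it is not part of the paper and assumes that Python with NumPy is available); it computes the operator norm of a small random matrix as its largest singular value and verifies the identities stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))             # a generic matrix A in R^{X x Y} with |X| = 5, |Y| = 3

op_norm = np.linalg.norm(A, 2)              # operator norm = largest singular value
assert np.isclose(op_norm ** 2, np.linalg.norm(A.T @ A, 2))   # ||A||^2 = ||A^T A||
assert np.isclose(op_norm ** 2, np.linalg.norm(A @ A.T, 2))   # ||A||^2 = ||A A^T||

S = A @ A.T                                 # a symmetric positive semi-definite matrix
eigvals, D = np.linalg.eigh(S)              # columns of D form an orthonormal eigenbasis
# spectral decomposition: S = sum_i lambda_i d_i d_i^T
assert np.allclose(S, sum(lam * np.outer(d, d) for lam, d in zip(eigvals, D.T)))
# for any orthonormal basis, sum_i d_i d_i^T is the identity matrix
assert np.allclose(sum(np.outer(d, d) for d in D.T), np.eye(5))
# for a symmetric matrix, ||S|| is the largest absolute value of an eigenvalue
assert np.isclose(np.linalg.norm(S, 2), np.max(np.abs(eigvals)))
```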
In our bounds the operator norm of matrices $M \in \mathbb{R}^{X \times Y}$ with entries $\pm 1$ will appear. For a matrix of this form it is easy to see that $\|M\| = \sqrt{|Y|}$ if the rows of $M$ are pairwise orthogonal and that $\|M\| = \sqrt{|X|}$ if the columns of $M$ are pairwise orthogonal. If $\mathrm{rank}(M) = 1$ then $\|M\| = \sqrt{|X|\,|Y|}$. The Hadamard matrices $H_n \in \mathbb{R}^{2^n \times 2^n}$ are examples of matrices with pairwise orthogonal rows and pairwise orthogonal columns. They are recursively defined by

$$H_0 = (1), \qquad H_{n+1} = \begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix}.$$

The signum function $\mathrm{sign} : \mathbb{R} \to \mathbb{R}$ is given by

$$\mathrm{sign}(x) = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0. \end{cases}$$

2. A LOWER BOUND ON THE COMPLEXITY OF UNBOUNDED ERROR PROBABILISTIC COMMUNICATION PROTOCOLS

We say that a matrix $M \in \mathbb{R}^{X \times Y}$ with no zero entries can be realized by an arrangement of homogeneous half spaces in $\mathbb{R}^k$ if there are vectors $u_x, v_y \in \mathbb{R}^k$ for $x \in X$, $y \in Y$ such that $\mathrm{sign}(M_{x,y}) = \mathrm{sign}\langle u_x, v_y \rangle$ for all $x \in X$, $y \in Y$. A vector $v_y$ (or analogously a vector $u_x$) can be interpreted as a normal vector of the boundary of the homogeneous half space $\{z \in \mathbb{R}^k \mid \langle z, v_y \rangle \ge 0\}$. Then $\mathrm{sign}\langle u_x, v_y \rangle = M_{x,y}$ means that the vector $u_x$ lies in this half space if and only if $M_{x,y} = 1$.

The communication complexity $\tilde{C}_f$ of a distributed function $f$ is strongly related to the smallest dimension of an arrangement of homogeneous half spaces that realizes the matrix $M_f$:

Theorem 2.1 (Paturi and Simon [14], Theorem 2). Let $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ be a distributed function. If $k$ is the smallest dimension of an arrangement of homogeneous half spaces that realizes $M_f$, then

$$\lceil \log_2 k \rceil \le \tilde{C}_f \le \lceil \log_2 k \rceil + 1 .$$

We state our main result now. It immediately implies a lower bound on communication complexities.

Theorem 2.2. If a matrix $M \in \{-1,1\}^{X \times Y}$ can be realized by an arrangement of homogeneous half spaces in $\mathbb{R}^k$, then

$$k \ge \frac{\sqrt{|X|\,|Y|}}{\|M\|} .$$

Proof. Assume that there are vectors $u_x, v_y \in \mathbb{R}^k$ such that $\mathrm{sign}\langle u_x, v_y \rangle = M_{x,y}$ for all $x \in X$, $y \in Y$. We can assume without loss of generality that $\mathrm{span}\{u_x \mid x \in X\} = \mathbb{R}^k$. This implies that $|X| \ge k$. Furthermore, we can assume that any $k$ of the vectors $u_x$, $x \in X$, are linearly independent because the signs of the scalar products $\langle u_x, v_y \rangle$ are not affected by small changes of the $u_x$. For any nonsingular linear mapping $A \in GL(k)$ we can replace $u_x, v_y$ by $A u_x$, $(A^\top)^{-1} v_y$ without changing the scalar products $\langle u_x, v_y \rangle = \langle A u_x, (A^\top)^{-1} v_y \rangle$. Furthermore, normalizing the vectors $u_x, v_y$ does not change the signs of the scalar products $\langle u_x, v_y \rangle$. Because of these observations and because of Theorem 4.1 (which is proven in Section 4) we can assume w.l.o.g. that $u_x, v_y \in S^{k-1}$ and that $\sum_{x \in X} u_x u_x^\top = \frac{|X|}{k} I_k$, i.e. that the vectors $u_x$ are nicely balanced in this sense. Now we have for all $y \in Y$ that

$$\sum_{x \in X} |\langle u_x, v_y \rangle| \;\ge\; \sum_{x \in X} \langle u_x, v_y \rangle^2 \;=\; v_y^\top \Big( \sum_{x \in X} u_x u_x^\top \Big) v_y \;=\; \frac{|X|}{k} , \qquad (1)$$

where the first inequality holds because $1 \ge |\langle u_x, v_y \rangle|$. Inequality (1) means that for all $y \in Y$ the absolute values of the scalar products $\langle u_x, v_y \rangle$ are on the average at least $\frac{1}{k}$, i.e., the vectors $u_x$ cannot lie arbitrarily close to the homogeneous hyperplane with normal vector $v_y$. Lemma 2.1 (which is proven below) gives a corresponding upper bound in terms of the operator norm $\|M\|$ on the absolute values of the scalar products $\langle u_x, v_y \rangle$. It follows that

$$|Y| \Big( \frac{|X|}{k} \Big)^2 \;\overset{(1)}{\le}\; \sum_{y \in Y} \Big( \sum_{x \in X} |\langle u_x, v_y \rangle| \Big)^2 \;\overset{\text{Lemma 2.1}}{\le}\; |X|\, \|M\|^2 ,$$

and solving this for $k$ gives the claimed bound $k \ge \sqrt{|X|\,|Y|} / \|M\|$.

From Theorem 2.1 and Theorem 2.2 (applied to the case $X = Y = \{0,1\}^n$, $M = M_f$) we get the following corollary:

Corollary 2.1. For every distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ the communication complexity is at least

$$\tilde{C}_f \ge n - \log_2 \|M_f\| .$$
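As a quick numerical illustration of Theorem 2.2 and Corollary 2.1 (not part of the paper; it assumes NumPy and uses a randomly chosen function $f$ as an example), one can evaluate the dimension bound $\sqrt{|X|\,|Y|}/\|M\|$ and the communication bound $n - \log_2\|M_f\|$ for a concrete sign matrix. For a random sign matrix the operator norm is typically close to $2 \cdot 2^{n/2}$, so the resulting bound is roughly $n/2$, in line with the remark in the introduction that the bound is linear for the vast majority of distributed functions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
N = 2 ** n
M = rng.choice([-1.0, 1.0], size=(N, N))        # sign matrix M_f of a random distributed function f

op_norm = np.linalg.norm(M, 2)                  # operator norm ||M_f||
dim_bound = np.sqrt(N * N) / op_norm            # Theorem 2.2: k >= sqrt(|X| |Y|) / ||M||
cc_bound = n - np.log2(op_norm)                 # Corollary 2.1: communication complexity >= n - log2 ||M_f||
print(f"||M_f|| = {op_norm:.2f}, k >= {dim_bound:.2f}, communication complexity >= {cc_bound:.2f}")
```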
Recall that for a matrix $M \in \{-1,1\}^{X \times Y}$ with pairwise orthogonal columns the operator norm is $\|M\| = \sqrt{|X|}$. Now we can give an example of a distributed function that has linear communication complexity:

Corollary 2.2. The communication complexity of the distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ for which $M_f$ is the Hadamard matrix $H_n$ is at least $\frac{n}{2}$. The Hadamard matrix $H_n$ can only be realized by an arrangement of homogeneous half spaces in $\mathbb{R}^k$ if $k \ge 2^{n/2}$.

We still have to prove Lemma 2.1. In the proof we will use the following fact that follows from Fejér's Theorem (Corollary 7.5.4, Horn and Johnson [10]): for any two symmetric positive semi-definite matrices $A, B \in \mathbb{R}^{X \times X}$ the sum of the products of the entries, $\sum_{x,\tilde{x} \in X} A_{x,\tilde{x}} B_{x,\tilde{x}}$, is nonnegative.

Lemma 2.1. For finite subsets $X, Y$ of $S^{k-1}$ let

$$M := (\mathrm{sign}\langle x, y \rangle)_{x \in X, y \in Y} \in \mathbb{R}^{X \times Y} .$$

Then

$$\sum_{y \in Y} \Big( \sum_{x \in X} |\langle x, y \rangle| \Big)^2 \le |X|\, \|M\|^2 .$$

Proof. For every $y \in Y$ we have that

$$\sum_{x \in X} |\langle x, y \rangle| = \sum_{x \in X} M_{x,y} \langle x, y \rangle = \Big\langle \sum_{x \in X} M_{x,y}\, x ,\; y \Big\rangle \;\le\; \Big\| \sum_{x \in X} M_{x,y}\, x \Big\| , \qquad (2)$$

where the inequality holds because $\|y\| = 1$. We square this inequality and sum over $y \in Y$:

$$\sum_{y \in Y} \Big( \sum_{x \in X} |\langle x, y \rangle| \Big)^2 \;\overset{(2)}{\le}\; \sum_{y \in Y} \Big\| \sum_{x \in X} M_{x,y}\, x \Big\|^2 = \sum_{y \in Y} \sum_{x, \tilde{x} \in X} M_{x,y} M_{\tilde{x},y} \langle x, \tilde{x} \rangle = \sum_{x, \tilde{x} \in X} (M M^\top)_{x,\tilde{x}} \langle x, \tilde{x} \rangle$$
$$\overset{(*)}{\le}\; \sum_{x, \tilde{x} \in X} \big(\|M\|^2 I_X\big)_{x,\tilde{x}} \langle x, \tilde{x} \rangle = \|M\|^2 \sum_{x \in X} \|x\|^2 = |X|\, \|M\|^2 .$$

Here inequality $(*)$ holds because $A := \|M\|^2 I_X - M M^\top$ and $B := (\langle x, \tilde{x} \rangle)_{x,\tilde{x} \in X}$ are symmetric positive semi-definite matrices, i.e. $\sum_{x,\tilde{x} \in X} A_{x,\tilde{x}} B_{x,\tilde{x}} \ge 0$.

3. AN UPPER BOUND ON THE MARGIN OF ARRANGEMENTS OF HALF SPACES

In this section we are interested in the largest margin (not the smallest dimension) that an arrangement of half spaces that realizes a matrix $M \in \mathbb{R}^{X \times Y}$ can have. We say that the matrix $M \in \mathbb{R}^{X \times Y}$ with no zero entries can be realized by an arrangement of homogeneous half spaces with margin $\gamma$ if there are vectors $u_x, v_y$ for $x \in X$, $y \in Y$ that lie in the unit ball of $\mathbb{R}^k$ (where $k$ can be arbitrarily large) such that $\mathrm{sign}(M_{x,y}) = \mathrm{sign}\langle u_x, v_y \rangle$ and $|\langle u_x, v_y \rangle| \ge \gamma$ for all $x \in X$, $y \in Y$. If we interpret $v_y$ as the normal vector of a homogeneous half space, then $\mathrm{sign}\langle u_x, v_y \rangle = M_{x,y}$ means that the vector $u_x$ lies in the interior of this half space if and only if $M_{x,y} = 1$. The requirement $|\langle u_x, v_y \rangle| \ge \gamma$ means that the point $u_x$ has distance at least $\gamma$ from the boundary of the half space. Analogously we can interpret the vectors $u_x$ as normal vectors of half spaces and the vectors $v_y$ as points. It is crucial that we require the vectors to lie in a unit ball (or that they are bounded) because otherwise we could increase the margin by simply stretching all vectors.

Note that it is not really a restriction to assume that the half spaces are homogeneous: assume we have points $u_x \in \mathbb{R}^k$, $x \in X$, that lie in the unit ball and an arrangement of (not necessarily homogeneous) half spaces given by normal vectors $v_y \in S^{k-1}$ and thresholds $t_y \in [-1,1]$ such that $M = (\mathrm{sign}(\langle u_x, v_y \rangle - t_y))_{x \in X, y \in Y}$. Then the vectors

$$\tilde{u}_x := \frac{1}{\sqrt{2}} \begin{pmatrix} u_x \\ 1 \end{pmatrix} , \qquad \tilde{v}_y := \frac{1}{\sqrt{2}} \begin{pmatrix} v_y \\ -t_y \end{pmatrix}$$

lie in the unit ball of $\mathbb{R}^{k+1}$, $M = (\mathrm{sign}\langle \tilde{u}_x, \tilde{v}_y \rangle)_{x \in X, y \in Y}$, and the margin of the new arrangement is only by a factor of $\frac{1}{2}$ worse than the old margin.

As observed by Ben-David, Eiron and Simon [2], a matrix $M \in \{-1,1\}^{X \times Y}$ can always be realized by an arrangement of homogeneous half spaces with margin

$$\max\big( |X|^{-1/2},\, |Y|^{-1/2} \big) .$$

In the case $|X| \le |Y|$ let, for $x \in X$, $u_x$ be the canonical unit vector $e_x \in \mathbb{R}^X$, and let, for $y \in Y$, $v_y := |X|^{-1/2} (M_{x,y})_{x \in X} \in \mathbb{R}^X$. Then $\|u_x\| = \|v_y\| = 1$, $M_{x,y} = \mathrm{sign}\langle u_x, v_y \rangle$ and $|\langle u_x, v_y \rangle| = |X|^{-1/2}$ for all $x \in X$, $y \in Y$.
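To make the trivial embedding concrete, the following sketch (an illustration only, assuming NumPy; the matrix is a randomly chosen example) builds the vectors $u_x = e_x$ and $v_y = |X|^{-1/2}(M_{x,y})_{x \in X}$ and checks that all signs are realized with margin exactly $|X|^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
nx, ny = 8, 16                                  # |X| <= |Y|
M = rng.choice([-1.0, 1.0], size=(nx, ny))      # an arbitrary sign matrix in {-1,+1}^{X x Y}

U = np.eye(nx)                                  # row x is the vector u_x = e_x
V = M / np.sqrt(nx)                             # column y is the vector v_y = |X|^{-1/2} (M_{x,y})_x

inner = U @ V                                   # inner[x, y] = <u_x, v_y> = M[x, y] / sqrt(|X|)
assert np.allclose(np.sign(inner), M)                       # all signs are realized
assert np.allclose(np.abs(inner), 1.0 / np.sqrt(nx))        # margin is exactly |X|^{-1/2}
assert np.allclose(np.linalg.norm(V, axis=0), 1.0)          # the vectors v_y are unit vectors
```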
A concept class $(X, C)$ consists of a set $X$, called the instance space, and a set $C$ of concepts, where any subset of $X$ is called a concept. For a concept class $(X, C)$, a theorem by Vapnik ([15]; [4], Theorem 4.16) gives an upper bound on the margin of any arrangement of half spaces that realizes $M_{X,C}$: there it is shown that if $d$ is the VC-dimension of $(X, C)$, then at most the margin $d^{-1/2}$ is possible. It is always true that $d \le |X|$, and if $d = |X|$ then this upper bound on the margin meets the lower bound from the trivial embedding given above. However, $d = |X|$ is a very special case: $d = |X|$ means that $C = \mathrm{Pot}(X)$.

Ben-David, Eiron and Simon [2] show that most concept classes (even of small VC-dimension) cannot be realized by an arrangement of half spaces with a margin that is much larger than the margin achieved by the trivial embedding. They use counting arguments that do not give an upper bound on the margin for particular concept classes. We show a result that implies that for concrete concept classes (even if the VC-dimension is much smaller than $|X|$ and $|C|$) the margin cannot be much larger than the margin achieved by the trivial embedding. In particular it will follow from Theorem 3.1 that the trivial embedding gives the best possible margin if the rows or the columns of $M_{X,C}$ are orthogonal, or also if $M_{X,C}$ has $|X|$ pairwise orthogonal columns.

Theorem 3.1. If a matrix $M \in \{-1,1\}^{X \times Y}$ can be realized by an arrangement of homogeneous half spaces with margin $\gamma$, then

$$\gamma \le \frac{\|M\|}{\sqrt{|X|\,|Y|}} .$$

Proof. W.l.o.g. $X, Y \subseteq S^{k-1}$, $M = (\mathrm{sign}\langle x, y \rangle)_{x \in X, y \in Y}$ and $|\langle x, y \rangle| \ge \gamma$ for all $x \in X$, $y \in Y$. From Lemma 2.1 of Section 2 it follows that $|Y|\,(|X|\,\gamma)^2 \le |X|\, \|M\|^2$, which gives the claim.

For orthogonal matrices, in particular for Hadamard matrices, this shows that the trivial embedding gives the optimal margin:

Corollary 3.1. The largest margin $\gamma$ of any arrangement of homogeneous half spaces that realizes the concept class $(X, C)$ for which $M_{X,C}$ is the Hadamard matrix $H_n$ is $\gamma = 2^{-n/2}$.

Proof. The trivial embedding gives this margin, and Theorem 3.1 shows that this margin is optimal.

Note that the VC-dimension $d$ of the concept class of Corollary 3.1 is $n$. Thus for this concept class the upper bound $d^{-1/2} = n^{-1/2}$ on the margin in terms of the VC-dimension is much weaker than our bound from Theorem 3.1.

4. A RESULT FROM LINEAR ALGEBRA

In this section we show a result that was needed in the proof of Theorem 2.2. We consider sets $X \subset \mathbb{R}^k$ with the property that any subset of $X$ with at most $k$ elements is linearly independent. If $X$ has this property, then so do the set $A(X) := \{Ax \mid x \in X\}$ for any nonsingular linear mapping $A \in GL(k)$ and the set $N(X) := \{N(x) \mid x \in X\}$, where $N : \mathbb{R}^k \setminus \{0\} \to S^{k-1}$, $N(x) := \frac{x}{\|x\|}$, normalizes vectors. Furthermore, $|A(X)| = |N(X)| = |X|$. Repeated applications of nonsingular linear functions and of normalizations $N$ can be merged: obviously $N(B(N(A(x)))) = N((BA)(x))$ for any $A, B \in GL(k)$ and any $x \in X$.

For $X \subset \mathbb{R}^k$ we consider the following positive semi-definite and symmetric matrix

$$\mathcal{M}(X) := \sum_{x \in X} x x^\top \in \mathbb{R}^{k \times k} .$$

This matrix and its eigenvalues and eigenvectors can be used to measure in which directions the vectors $x \in X$ point on the average. The range of the matrix $\mathcal{M}(X)$ is $\mathrm{range}(\mathcal{M}(X)) = \mathrm{span}(X)$. (This simple fact is, for example, shown in [8].) For all matrices $A \in GL(k)$ we obviously have that $\mathcal{M}(A(X)) = A\, \mathcal{M}(X)\, A^\top$.
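The following sketch (an illustration only, not part of the paper; it assumes NumPy and uses randomly chosen vectors, which are in general position with probability one) computes $\mathcal{M}(X)$ for a small point set and checks the facts just stated: the transformation rule $\mathcal{M}(A(X)) = A\,\mathcal{M}(X)\,A^\top$, the identity $\mathrm{range}(\mathcal{M}(X)) = \mathrm{span}(X)$, and the merging rule $N(B(N(A(x)))) = N((BA)(x))$.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 3
X = [rng.standard_normal(k) for _ in range(5)]      # five generic vectors in R^3

def N(x):
    return x / np.linalg.norm(x)                    # normalization N(x) = x / ||x||

def M(vectors):
    return sum(np.outer(x, x) for x in vectors)     # M(X) = sum_x x x^T

A = rng.standard_normal((k, k))                     # generic, hence nonsingular, mapping in GL(k)
B = rng.standard_normal((k, k))

assert np.allclose(M([A @ x for x in X]), A @ M(X) @ A.T)   # M(A(X)) = A M(X) A^T
assert np.linalg.matrix_rank(M(X)) == k                     # range(M(X)) = span(X) = R^3 here
assert np.allclose(N(B @ N(A @ X[0])), N(B @ A @ X[0]))     # N(B(N(A(x)))) = N((BA)(x))
```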
We want to prove the following theorem.

Theorem 4.1. Let $X \subset \mathbb{R}^k$, $|X| \ge k$, be such that all subsets of $X$ with at most $k$ elements are linearly independent. Then there is a nonsingular linear mapping $A \in GL(k)$ such that

$$\mathcal{M}(N(A(X))) = \sum_{x \in X} \frac{1}{\|Ax\|^2}\, (Ax)(Ax)^\top = \frac{|X|}{k}\, I_k .$$

Note that it is easy to find a nonsingular linear mapping $A \in GL(k)$ such that $\mathcal{M}(A(X))$ is "optimally balanced", i.e. such that $\mathcal{M}(A(X)) = I_k$. (How this can be done is shown below in the proof of Lemma 4.1.) However, this observation does not suffice to prove Theorem 4.1 because we are considering $\mathcal{M}(N(A(X)))$ and not $\mathcal{M}(A(X))$ there. But we can still use the linear mapping $A$ with $\mathcal{M}(A(X)) = I_k$ to get a little closer to the solution we are looking for. This is formalized in Lemma 4.1. With this result and with a compactness argument (Lemma 4.2) we can finally prove Theorem 4.1.

Lemma 4.1. Let $X \subseteq S^{k-1}$, $|X| \ge k$, be such that all subsets of $X$ with at most $k$ elements are linearly independent. Then either $\mathcal{M}(X) = \frac{|X|}{k} I_k$ or there is some nonsingular linear mapping $A \in GL(k)$ such that the smallest eigenvalue of

$$\mathcal{M}(N(A(X))) = \sum_{x \in X} \frac{1}{\|Ax\|^2}\, (Ax)(Ax)^\top$$

is strictly larger than the smallest eigenvalue of $\mathcal{M}(X)$.

Proof. Let $\lambda$ be the smallest eigenvalue of $\mathcal{M}(X)$ and $V$ its multiplicity. The sum of the $k$ (nonnegative) eigenvalues of $\mathcal{M}(X)$ is $\mathrm{trace}(\mathcal{M}(X)) = \sum_{x \in X} \|x\|^2 = |X|$. Thus the smallest eigenvalue $\lambda$ is at most $\frac{|X|}{k}$. Furthermore, if $\lambda = \frac{|X|}{k}$ then it follows that all eigenvalues of $\mathcal{M}(X)$ are $\frac{|X|}{k}$, i.e. $\mathcal{M}(X) = \frac{|X|}{k} I_k$. Thus we can assume that $\lambda < \frac{|X|}{k}$. Then also $V < k$ must hold. For the case $|X| = k$ we can simply map the elements of $X$ to the canonical unit vectors with a nonsingular linear mapping $A$. Thus we can assume that $|X| > k$.

We only show that there is a nonsingular linear mapping $A \in GL(k)$ such that the smallest eigenvalue of $\mathcal{M}(N(A(X)))$ is at least $\lambda$ and the multiplicity of the eigenvalue $\lambda$ of $\mathcal{M}(N(A(X)))$ is strictly smaller than $V$. This suffices because we can repeat this procedure until the multiplicity of $\lambda$ is zero.

We know that $\lambda \ge 0$ because $\mathcal{M}(X)$ is positive semi-definite. We even know that $\lambda > 0$ because there are $k$ linearly independent elements in $X$, which means that $\mathcal{M}(X)$ is invertible. W.l.o.g. (using an orthogonal transformation)

$$\mathcal{M}(X) = \mathrm{diag}(\lambda_1, \dots, \lambda_k) .$$

We know that $\lambda_1, \dots, \lambda_k > 0$. Let $A := \mathrm{diag}(\lambda_1^{-1/2}, \dots, \lambda_k^{-1/2}) \in GL(k)$. Then $\mathcal{M}(A(X)) = A\, \mathcal{M}(X)\, A^\top = I_k$. For all $x \in X$:

$$\frac{1}{\|Ax\|^2} = \Big( \sum_{i=1}^{k} \frac{x_i^2}{\lambda_i} \Big)^{-1} \ge \lambda \qquad (3)$$

because $\sum_{i=1}^{k} x_i^2 = \|x\|^2 = 1$ and $\lambda_1, \dots, \lambda_k \ge \lambda$. Equality in (3) holds only if $x$ is an eigenvector of $\mathcal{M}(X)$ for the eigenvalue $\lambda$. Because of $1 \le V < k$ this happens for at most $V$ elements $x$ of $X$, i.e. for at least $|X| - V$ elements of $X$ we have strict inequality in (3). It is not hard to see that because of this the rank of the matrix

$$\mathcal{M}(N(A(X))) - \lambda I_k = \sum_{x \in X} \underbrace{\Big( \frac{1}{\|Ax\|^2} - \lambda \Big)}_{\ge 0 \text{ by } (3)} (Ax)(Ax)^\top \qquad (4)$$

is at least $\min(|X| - V, k)$. (The identity in (4) holds because $\lambda I_k = \lambda\, \mathcal{M}(A(X)) = \sum_{x \in X} \lambda\, (Ax)(Ax)^\top$.) We can argue as follows: there is a subset $Y$ of $X$ of cardinality $\min(|X| - V, k)$ such that strict inequality in (3) holds for all $x \in Y$. The elements of $Y$ are linearly independent because $Y$ is a subset of $X$ with at most $k$ elements. It follows that the matrix on the right hand side of (4) has rank at least $|Y| = \min(|X| - V, k)$ because for all $x \in Y$ the coefficients of the summands in (4) are strictly positive and $\mathrm{range}(\mathcal{M}(Z)) = \mathrm{span}(Z)$ holds for any set of vectors $Z \subset \mathbb{R}^k$. Thus the eigenspace of $\mathcal{M}(N(A(X)))$ for $\lambda$ has dimension at most

$$k - \min(|X| - V, k) < V ,$$

where the inequality uses $|X| > k$. We still have to show that the matrix $\mathcal{M}(N(A(X)))$ has no eigenvalues strictly smaller than $\lambda$. This is equivalent to showing that the matrix on the left hand side of (4) is positive semi-definite. This is true because the right hand side of (4) is a sum of positive semi-definite matrices.
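A quick numerical check of Lemma 4.1 (an illustration only, assuming NumPy and randomly chosen points, which are in general position with probability one): one whitening step $A = \mathcal{M}(X)^{-1/2}$ followed by renormalization does not decrease the smallest eigenvalue of the matrix $\mathcal{M}$.

```python
import numpy as np

rng = np.random.default_rng(5)
k, m = 3, 7
X = rng.standard_normal((m, k))
X = X / np.linalg.norm(X, axis=1, keepdims=True)        # X subset of S^{k-1}, one point per row

M0 = X.T @ X                                            # M(X)
w, V = np.linalg.eigh(M0)
A = (V * (1.0 / np.sqrt(w))) @ V.T                      # A = M(X)^{-1/2}, so M(A(X)) = I_k

AX = X @ A.T
AX = AX / np.linalg.norm(AX, axis=1, keepdims=True)     # N(A(X))
M1 = AX.T @ AX                                          # M(N(A(X)))

# the smallest eigenvalue does not decrease (and typically increases strictly)
assert np.linalg.eigvalsh(M1)[0] >= np.linalg.eigvalsh(M0)[0] - 1e-9
```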
Lemma 4.2. Let $X \subset \mathbb{R}^k$, $|X| \ge k$, be such that all subsets of $X$ with at most $k$ elements are linearly independent. Then for every $\varepsilon > 0$ the subset

$$\mathcal{A} := \{ A \in GL(k) \mid \|A\| = 1,\; \mathcal{M}(N(A(X))) \succeq (1 + \varepsilon) I_k \}$$

of $\mathbb{R}^{k \times k}$ is compact.

Proof. Obviously $\mathcal{A}$ is a bounded subset of $\mathbb{R}^{k \times k}$ and we have to show that $\mathcal{A}$ is closed in $\mathbb{R}^{k \times k}$. To show this let $A_1, A_2, \dots \in \mathcal{A}$ be a sequence that converges to $A \in \mathbb{R}^{k \times k}$. Clearly $\|A\| = 1$ holds because the operator norm is continuous. We have to show that $A \in GL(k)$. Then it easily follows that

$$\mathcal{M}(N(A(X))) = \lim_{l \to \infty} \underbrace{\mathcal{M}(N(A_l(X)))}_{\succeq\, (1+\varepsilon) I_k} \;\succeq\; (1 + \varepsilon) I_k .$$

We assume that $A \notin GL(k)$ and show that this leads to a contradiction. The idea of the proof is to show that if $A \notin GL(k)$ then most vectors in $N(A_l(X))$ (namely those that do not lie in $\mathrm{null}(A)$) get arbitrarily close to $\mathrm{range}(A)$ as $l \to \infty$. This means that not enough "weight" remains in $\mathrm{range}(A)^\perp$ for $\mathcal{M}(N(A_l(X))) \succeq (1 + \varepsilon) I_k$ to hold.

Let $d := \dim(\mathrm{null}(A))$. We know that $0 < d < k$ because of $\|A\| = 1$ and because of $A \notin GL(k)$. Thus $|X \cap \mathrm{null}(A)| \le d$. For $x \in X \setminus \mathrm{null}(A)$ we have that $\|Ax\| > 0$, thus

$$\frac{A_l x}{\|A_l x\|} \;\xrightarrow{\,l \to \infty\,}\; \frac{Ax}{\|Ax\|} \in \mathrm{range}(A) .$$

We also have that

$$\dim(\mathrm{range}(A)^\perp) = k - \dim(\mathrm{range}(A)) = \dim(\mathrm{null}(A)) = d .$$

Let $e_1, \dots, e_d$ be an orthonormal basis of $\mathrm{range}(A)^\perp$. Then

$$\limsup_{l \to \infty} \sum_{i=1}^{d} e_i^\top\, \mathcal{M}(N(A_l(X)))\, e_i = \limsup_{l \to \infty} \sum_{x \in X} \sum_{i=1}^{d} \underbrace{\Big\langle e_i, \frac{A_l x}{\|A_l x\|} \Big\rangle^2}_{\to\, 0 \text{ if } x \notin \mathrm{null}(A)} = \limsup_{l \to \infty} \sum_{x \in X \cap \mathrm{null}(A)} \underbrace{\sum_{i=1}^{d} \Big\langle e_i, \frac{A_l x}{\|A_l x\|} \Big\rangle^2}_{\le\, 1} \;\le\; |X \cap \mathrm{null}(A)| \;\le\; d .$$

Thus there exist $l$ and $i$ such that $e_i^\top\, \mathcal{M}(N(A_l(X)))\, e_i < 1 + \varepsilon$. This is a contradiction to $\mathcal{M}(N(A_l(X))) \succeq (1 + \varepsilon) I_k$.

Proof (of Theorem 4.1). We start with a linear mapping $A$ that maps $k$ arbitrarily chosen elements of $X$ to the canonical unit vectors of $\mathbb{R}^k$. In the case $|X| = k$ we are already done. Now assume that $|X| > k$. Then $\mathcal{M}(N(A(X))) \succeq I_k$, i.e. the smallest eigenvalue of $\mathcal{M}(N(A(X)))$ is at least $1$. Lemma 4.1 says that if $\mathcal{M}(N(A(X))) = \frac{|X|}{k} I_k$ does not already hold we can modify $A$ such that the smallest eigenvalue of $\mathcal{M}(N(A(X)))$ increases. Thus we can find an $A$ and an $\varepsilon > 0$ such that the smallest eigenvalue of $\mathcal{M}(N(A(X)))$ is at least $1 + \varepsilon$. By Lemma 4.2 the set of all $A \in GL(k)$ with $\|A\| = 1$ for which the smallest eigenvalue of $\mathcal{M}(N(A(X)))$ is at least $1 + \varepsilon$ is compact. The smallest eigenvalue of $\mathcal{M}(N(A(X)))$ is a continuous function of $A \in GL(k)$. (The smallest eigenvalue of a symmetric positive semi-definite matrix is equal to the smallest singular value, and the singular values depend continuously on the matrix, see, e.g., Golub and Van Loan [9], Theorem 8.3.4.) This means that there is an $A \in GL(k)$ for which the smallest eigenvalue of $\mathcal{M}(N(A(X)))$ is maximal. For this $A$ we must have $\mathcal{M}(N(A(X))) = \frac{|X|}{k} I_k$, because otherwise we could apply Lemma 4.1 and increase the smallest eigenvalue of $\mathcal{M}(N(A(X)))$. But this would contradict the fact that the smallest eigenvalue of $\mathcal{M}(N(A(X)))$ is already maximal.
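The proof of Theorem 4.1 suggests a simple iterative procedure: repeatedly whiten the current point set with $\mathcal{M}(\cdot)^{-1/2}$ and renormalize, as in Lemma 4.1. The following sketch (an illustration only, not an algorithm from the paper; it assumes NumPy and points in general position, and it relies on the observation that this fixed-point iteration typically converges in practice, whereas the paper argues existence via compactness) usually produces a mapping $A$ with $\mathcal{M}(N(A(X))) \approx \frac{|X|}{k} I_k$.

```python
import numpy as np

def balance(X, iters=2000, tol=1e-10):
    """Heuristically search for A in GL(k) with sum_x (Ax)(Ax)^T / ||Ax||^2 = (|X|/k) I_k
    by iterating the rebalancing step of Lemma 4.1 (whiten, then renormalize)."""
    X = np.asarray(X, dtype=float)                          # one point per row
    m, k = X.shape
    A = np.eye(k)
    for _ in range(iters):
        U = X @ A.T                                         # images A x
        U = U / np.linalg.norm(U, axis=1, keepdims=True)    # normalized images N(A x)
        S = U.T @ U                                         # M(N(A(X)))
        if np.linalg.norm(S - (m / k) * np.eye(k)) < tol:
            break
        w, V = np.linalg.eigh(S)
        A = (V * (1.0 / np.sqrt(w))) @ V.T @ A              # replace A by S^{-1/2} A
    return A, S

rng = np.random.default_rng(4)
X = rng.standard_normal((7, 3))                             # 7 points in R^3, in general position
A, S = balance(X)
print(np.round(S, 6))                                       # typically close to (7/3) * I_3
```

The merging rule $N(B(N(A(x)))) = N((BA)(x))$ from the beginning of this section is what allows the accumulated mapping $A$ to be updated multiplicatively in each step.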
ACKNOWLEDGMENT

The author wants to thank Hans Ulrich Simon for a lot of helpful comments and for telling him about a nice idea of how random projections and results from communication complexity theory can be used to prove upper bounds on margins of arrangements of half spaces. The author is also grateful to Shai Ben-David, Niels Schmitt, Eike Kiltz, Ingo Wegener, Satyanarayana Lokam and Sanjoy Dasgupta for helpful discussions. We thank Dietrich Braess for simplifying the proof of Lemma 2.1. The author was supported by the Deutsche Forschungsgemeinschaft grant SI 498/4-1 and by a grant from the G.I.F., the German-Israeli Foundation for Scientific Research and Development.

REFERENCES

1. Alon, N., Frankl, P., and Rödl, V. (1985). Geometrical realization of set systems and probabilistic communication complexity. Proceedings of the 26th Annual Symposium on Foundations of Computer Science. IEEE Computer Society.
2. Ben-David, S., Eiron, N., and Simon, H. U. (2001). Limitations of learning via embeddings in Euclidean half-spaces. Proceedings of the 14th Annual Workshop on Computational Learning Theory. Berlin, Heidelberg: Springer.
3. Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36, 929–965.
4. Cristianini, N., and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge, United Kingdom: Cambridge University Press.
5. Forster, J. (2001). A linear lower bound on the unbounded error probabilistic communication complexity. Proceedings of the 16th Annual Conference on Computational Complexity. IEEE Computer Society.
6. Forster, J., Krause, M., Lokam, S. V., Mubarakzjanov, R., Schmitt, N., and Simon, H. U. (2001). Relations between communication complexity, linear arrangements, and computational complexity. Unpublished manuscript.
7. Forster, J., Schmitt, N., and Simon, H. U. (2001). Estimating the optimal margins of embeddings in Euclidean half spaces. Proceedings of the 14th Annual Workshop on Computational Learning Theory. Berlin, Heidelberg: Springer.
8. Forster, J., and Warmuth, M. K. (2000). Relative loss bounds for temporal-difference learning. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 295–302). San Francisco: Morgan Kaufmann.
9. Golub, G. H., and Van Loan, C. F. (1991). Matrix Computations. Baltimore and London: The Johns Hopkins University Press.
10. Horn, R. A., and Johnson, C. R. (1985). Matrix Analysis. Cambridge, United Kingdom: Cambridge University Press.
11. Kearns, M. J., and Vazirani, U. V. (1994). An Introduction to Computational Learning Theory. Cambridge, Massachusetts: MIT Press.
12. Krause, M. (1996). Geometric arguments yield better bounds for threshold circuits and distributed computing. Theoretical Computer Science, 156, 99–117.
13. Meise, R., and Vogt, D. (1992). Einführung in die Funktionalanalysis. Braunschweig/Wiesbaden: Friedrich Vieweg & Sohn Verlagsgesellschaft mbH.
14. Paturi, R., and Simon, J. (1986). Probabilistic communication complexity. Journal of Computer and System Sciences, 33, 106–123.
15. Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley & Sons, Inc.
16. Vapnik, V. N., and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264–280.