
A Linear Lower Bound on the Unbounded Error
Probabilistic Communication Complexity
Jürgen Forster
Lehrstuhl Mathematik & Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum,
44780 Bochum, Germany
E-mail: [email protected]
We prove a general lower bound on the complexity of unbounded error
probabilistic communication protocols. This result improves on a lower
bound for bounded error protocols by Krause. As a simple consequence
we obtain the first linear lower bound on the complexity of unbounded
error probabilistic communication protocols for the functions defined by
Hadamard matrices. We also give an upper bound on the margin of any
embedding of a concept class in half spaces.
Key Words: lower bounds, probabilistic communication complexity, Hadamard matrix, operator norm

CONTENTS
0. Introduction.
1. Notation.
2. A lower bound on the complexity of unbounded error probabilistic communication protocols.
3. An upper bound on the margin of arrangements of half spaces.
4. A result from linear algebra.
0. INTRODUCTION
Lower bounds on the complexity of communication protocols have applications
in a large field of problems such as VLSI and distributed computing. For a list
of references see, e.g., Krause [12]. In this paper we prove new and improved lower
bounds on the communication complexity of distributed functions. In particular
we show that the unbounded error probabilistic communication complexity of the
functions defined by Hadamard matrices is linear. This solves a long-standing open
problem stated in Paturi and Simon [14] and in Krause [12].
In this paper a probabilistic communication protocol is a probabilistic algorithm
for two processors $P_0$ and $P_1$ that computes a distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$. Both processors have unbounded computational power. $P_0$ sees
only the first part, $x$, and $P_1$ sees only the last part, $y$, of the input $(x,y) \in \{0,1\}^n \times \{0,1\}^n$. Obviously there has to be some communication between the
two processors to calculate $f(x,y) \in \{0,1\}$. The processors can communicate by
exchanging messages $b \in \{0,1\}$. The computation takes place in rounds. In each
round one of the processors is active; in odd rounds it is $P_0$ and in even rounds it
is $P_1$. The active processor probabilistically (depending on the part of the input it
knows and on the past messages) chooses a message according to the communication
protocol. In the final round the active processor probabilistically chooses the result
of the computation.
We say that a protocol computes the distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ with unbounded error if for all inputs $(x,y) \in \{0,1\}^n \times \{0,1\}^n$ the correct
output is calculated with probability greater than $\frac{1}{2}$. The complexity of a communication protocol is $\lceil \log_2 N \rceil$, where $N$ is the number of distinct message sequences
that can occur in computations that follow the protocol. The communication complexity $\tilde{C}_f$ of a distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ is the smallest
complexity that a communication protocol for $f$ can have.
It is known that the communication complexity for almost all functions $f :
\{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ is linear in $n$ (see Alon, Frankl and Rödl [1], Paturi and
Simon [14]). This was shown by counting arguments that do not give lower bounds
on the communication complexity of particular functions. Although it was known
that most functions have linear communication complexity, one did not know this
for any particular function. Paturi and Simon [14] give an example of a distributed
function with logarithmic communication complexity. They conjecture that the
functions defined by Hadamard matrices have linear probabilistic communication
complexity. We prove this conjecture in Corollary 2.2.
Our techniques can also be applied to another class of problems. Recently there
has been a lot of interest in maximal margin classifiers. Learning algorithms that
calculate the hyperplane that separates a sample with the largest margin and use
this hyperplane to classify new instances have shown excellent empirical performance (see [4]). Often the instances are mapped (implicitly when a kernel function is used) to some possibly high-dimensional space before the hyperplane with
maximal margin is calculated. If the norms of the instances are bounded and a
hyperplane with large margin can be found, a bound on the VC-dimension can
be applied (Vapnik [15]; [4], Theorem 4.16). A small VC-dimension means that a
concept class can be learned with a small sample size (Vapnik and Chervonenkis
[16]; Blumer et al. [3]; [11], Theorem 3.3).
The success of maximal margin classifiers raises the question of which concept
classes can be embedded in half spaces with a large margin. For every concept
class there is a trivial embedding into half spaces. Ben-David, Eiron and Simon [2]
show that most concept classes (even of small VC-dimension) cannot be embedded
with a margin that is much larger than the trivial margin. They use counting arguments that do not give an upper bound on the margin for a particular concept
class. Vapnik [15] also showed an upper bound on the margin that is in terms of
the VC-dimension. We prove a stronger general bound on the margin, and show
that for the concept classes defined by Hadamard matrices only the trivial margin
can be achieved.
Both a distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ and a concept class $C$
over an instance space $X$ can be represented by a matrix with entries $\pm 1$. For a
distributed function $f$ we can use the matrix $M_f := (2f(x,y)-1)_{x,y \in \{0,1\}^n}$, and for a
concept class $C$ over the instance space $X$ we can use the matrix $M_{X,C} \in \{-1,1\}^{X\times C}$
for which the entry $(M_{X,C})_{x,c}$ is $1$ if $x \in c$ and $-1$ otherwise.
Krause [12] shows that the complexity of any probabilistic communication protocol which computes a function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ with error probability
bounded by $\frac{1}{2} - \frac{1}{s}$ is at least $\frac{1}{4}\bigl(n - \log_2 \sqrt{\|M_f\|} - \log_2 s - 2\bigr)$ for all $s \in \mathbb{N}$. Here
$\|M_f\|$ is the operator norm of the matrix $M_f$. We improve on this result by showing
that the assumption that the error of the protocol is bounded is not really needed:
in Section 2 we prove a lower bound of $n - \log_2 \|M_f\|$ on the unbounded error
communication complexity. A result from linear algebra that we need in the proof
is given in Section 4. In Section 3 we show that any concept class $(X, C)$ can only
be embedded in homogeneous half spaces with margin at most $\frac{\|M_{X,C}\|}{\sqrt{|X|\,|C|}}$. In the next
section we fix some notation for the rest of the paper.
After a preliminary version of this paper [5] was published, some of the techniques
and results presented here have been improved and strengthened.
It is shown in [6] that the lower bound on the communication complexity of a
distributed function given in Section 2 is reasonably good for almost all distributed
functions: It is linear in n for the vast majority of distributed functions.
Theorem 2.2 is used in [6] to show lower bounds on the size of depth-2 threshold
circuits that compute Hadamard matrices: if the top gate of the circuit is a linear
threshold gate with unrestricted weights and if there are $s$ linear threshold gates
on the bottom level with integer weights of absolute value at most $W$, then
$s = \Omega\!\left(\frac{2^{n/2}}{nW}\right)$. If the top gate of the circuit is a linear threshold gate with unrestricted
weights and there are $s$ gates on the bottom level that compute symmetric functions,
then $s = \Omega\!\left(\frac{2^{n/2}}{n}\right)$. Also, Theorem 2.2 is used to give an example of a function
which can only be computed by probabilistic ordered binary decision diagrams
(OBDDs) of exponential size.
Furthermore it is shown in [6] that the lower bound of Theorem 2.2 even holds for
all matrices $M \in \mathbb{R}^{X\times Y}$ for which the entries $M_{x,y}$ have absolute value at least 1.
An example of a class of matrices is given for which Theorem 2.2 fails, but for which
the generalization of Theorem 2.2 still gives a strong lower bound on the dimension.
A similar generalization of Theorem 3.1 to matrices with arbitrary non-zero entries
is given in [7]. There it is shown that the margin of any embedding of a matrix
$M \in \mathbb{R}^{X\times Y}$ with non-zero entries is at most
$$\frac{\sqrt{|X|}\,\|M\|}{\sqrt{\sum_{y\in Y}\bigl(\sum_{x\in X}|M_{x,y}|\bigr)^{2}}}\,.$$
This result is used to show that the optimal margin for embedding the matrices $M_n \in \{-1,1\}^{n\times n}$
with $M_{i,j} = 1$ iff $i \le j$ is of the order $\frac{2}{\pi\ln n} + \Theta\!\left(\frac{1}{(\ln n)^{2}}\right)$.
1. NOTATION
For a finite set $X$, $\mathbb{R}^X$ is the vector space of real valued functions ("vectors")
on $X$. The Euclidean norm of a vector $a \in \mathbb{R}^X$ is $\|a\| := \sqrt{\sum_{x\in X} a_x^2}$. As usual
we write $\mathbb{R}^n := \mathbb{R}^{\{1,\dots,n\}}$. The vectors $x \in \mathbb{R}^n$ are column vectors. For two finite
sets $X, Y$ we write $\mathbb{R}^{X\times Y}$ for the set of real matrices with rows indexed by the
elements of $X$ and columns indexed by the elements of $Y$. The transpose of a
matrix $A \in \mathbb{R}^{X\times Y}$ is denoted by $A^\top \in \mathbb{R}^{Y\times X}$. The identity matrix is denoted by
$I_X \in \mathbb{R}^{X\times X}$. We define $e_x \in \mathbb{R}^X$, $x \in X$, to be the canonical unit vectors satisfying
$$(e_x)_y = \begin{cases} 1, & x = y, \\ 0, & x \ne y, \end{cases}$$
for $x, y \in X$. The $(k-1)$-dimensional sphere, i.e. the set $\{x \in \mathbb{R}^k \mid \|x\| = 1\}$, is
denoted by $S^{k-1}$.
For a linear function $A : \mathbb{R}^X \to \mathbb{R}^Y$, the null space is
$$\mathrm{null}(A) = \{x \in \mathbb{R}^X \mid Ax = 0\}\,,$$
and the range of $A$ is
$$\mathrm{range}(A) = \{Ax \mid x \in \mathbb{R}^X\}\,.$$
The trace of a matrix $A \in \mathbb{R}^{X\times X}$ is
$$\mathrm{trace}(A) = \sum_{x\in X} A_{x,x}\,.$$
The group of nonsingular matrices $A \in \mathbb{R}^{k\times k}$ is denoted by $GL(k)$. The operator
norm of a matrix $A \in \mathbb{R}^{X\times Y}$ is
$$\|A\| = \sup_{x \in \mathbb{R}^Y,\ \|x\| \le 1} \|Ax\| = \max_{x \in \mathbb{R}^Y,\ \|x\| \le 1} \|Ax\| \,.$$
The supremum is attained because $\|Ax\|$ is a continuous function of $x$ and the
unit ball $\{x \in \mathbb{R}^Y \mid \|x\| \le 1\}$ is compact. It is well known that for any matrix
$A \in \mathbb{R}^{X\times Y}$: $\|A\|^2 = \|A^\top A\| = \|AA^\top\|$. For every symmetric matrix $A \in \mathbb{R}^{X\times X}$
(i.e. if $A = A^\top$) there is an orthonormal basis $d_1, \dots, d_{|X|}$ of $\mathbb{R}^X$ and there are
numbers $\lambda_1, \dots, \lambda_{|X|} \in \mathbb{R}$ (the eigenvalues of $A$) such that
$$A = \sum_{i=1}^{|X|} \lambda_i\, d_i d_i^\top \,.$$
(This follows from the Spectral Theorem.) In this case $\|A\|$ is the largest absolute value of
an eigenvalue of $A$. Note that for every orthonormal basis $d_1, \dots, d_{|X|}$ of $\mathbb{R}^X$,
$$\sum_{i=1}^{|X|} d_i d_i^\top$$
is the identity matrix $I_X \in \mathbb{R}^{X\times X}$.

A symmetric matrix $A \in \mathbb{R}^{X\times X}$ is said to be positive semi-definite, denoted by
$A \succeq 0$, if $x^\top A x \ge 0$ for all $x \in \mathbb{R}^X$. For two symmetric matrices $A, B \in \mathbb{R}^{X\times X}$ we
write $A \preceq B$ if $B - A \succeq 0$. In this case the smallest eigenvalue of $A$ is less than
or equal to the smallest eigenvalue of $B$. This can for example be seen by noting
that for a symmetric matrix $A$ the inequality $\lambda I_X \preceq A$ means that the smallest
eigenvalue of $A$ is at least $\lambda \in \mathbb{R}$. For details see, e.g., [13].
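These facts are easy to check numerically. The following Python/NumPy sketch (purely illustrative; the variable names are ours) verifies for a random symmetric matrix that $A = \sum_i \lambda_i d_i d_i^\top$, that $\sum_i d_i d_i^\top$ is the identity matrix, and that $\|A\|$ is the largest absolute value of an eigenvalue.

```python
import numpy as np

# Illustration of the facts above for a random symmetric 5 x 5 matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                       # a symmetric matrix A = A^T

lam, D = np.linalg.eigh(A)              # columns of D: an orthonormal basis of eigenvectors d_i
# A equals the sum of the rank-one matrices lambda_i * d_i d_i^T ...
print(np.allclose(sum(l * np.outer(d, d) for l, d in zip(lam, D.T)), A))
# ... the d_i d_i^T sum to the identity matrix ...
print(np.allclose(D @ D.T, np.eye(5)))
# ... and the operator norm is the largest absolute value of an eigenvalue.
print(np.isclose(np.linalg.norm(A, 2), np.abs(lam).max()))
```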
In our bounds the operator norm of matrices $M \in \mathbb{R}^{X\times Y}$ with entries $\pm 1$ will
appear. For a matrix of this form it is easy to see that $\|M\| = \sqrt{|Y|}$ if the rows of
$M$ are pairwise orthogonal and that $\|M\| = \sqrt{|X|}$ if the columns of $M$ are pairwise
orthogonal. If $\mathrm{rank}(M) = 1$ then $\|M\| = \sqrt{|X|\,|Y|}$. The Hadamard matrices
$H_n \in \mathbb{R}^{2^n \times 2^n}$ are examples of matrices with pairwise orthogonal rows and pairwise
orthogonal columns. They are recursively defined by
$$H_0 = (1)\,, \qquad H_{n+1} = \begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix} .$$
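These norm identities can be checked numerically. The following Python/NumPy sketch (the function name `hadamard` is ours, for illustration) builds $H_n$ by the recursion above and confirms that its rows are pairwise orthogonal, that $\|H_n\| = 2^{n/2}$, and that a rank-one $\pm 1$ matrix has operator norm $\sqrt{|X|\,|Y|}$.

```python
import numpy as np

def hadamard(n):
    """Build H_n in R^{2^n x 2^n} by the recursion H_{n+1} = [[H_n, H_n], [H_n, -H_n]]."""
    H = np.array([[1.0]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

H = hadamard(4)                                    # 16 x 16 matrix with entries +-1
print(np.allclose(H @ H.T, 16 * np.eye(16)))       # rows (and columns) pairwise orthogonal
print(np.linalg.norm(H, 2), 2 ** (4 / 2))          # operator norm ||H_n|| = 2^{n/2} = 4

M1 = np.ones((8, 4))                               # a rank-one matrix with entries +-1
print(np.linalg.norm(M1, 2), np.sqrt(8 * 4))       # ||M|| = sqrt(|X| |Y|)
```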
The signum function $\mathrm{sign} : \mathbb{R} \to \mathbb{R}$ is given by
$$\mathrm{sign}(x) = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0. \end{cases}$$
2. A LOWER BOUND ON THE COMPLEXITY OF UNBOUNDED
ERROR PROBABILISTIC COMMUNICATION PROTOCOLS
We say that a matrix $M \in \mathbb{R}^{X\times Y}$ with no zero entries can be realized by an
arrangement of homogeneous half spaces in $\mathbb{R}^k$ if there are vectors $u_x, v_y \in \mathbb{R}^k$ for
$x \in X$, $y \in Y$ such that $\mathrm{sign}(M_{x,y}) = \mathrm{sign}\langle u_x, v_y\rangle$ for all $x \in X$, $y \in Y$. A vector $v_y$
(or analogously a vector $u_x$) can be interpreted as a normal vector of the boundary
of the homogeneous half space $\{z \in \mathbb{R}^k \mid \langle z, v_y\rangle \ge 0\}$. Then $\mathrm{sign}\langle u_x, v_y\rangle = M_{x,y}$
means that the vector $u_x$ lies in this half space if and only if $M_{x,y} = 1$.

The communication complexity $\tilde{C}_f$ of a distributed function $f$ is strongly related to the smallest dimension of an arrangement of homogeneous half spaces that
realizes the matrix $M_f$:
Theorem 2.1 (Paturi and Simon [14], Theorem 2). Let $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ be a distributed function. If $k$ is the smallest dimension of an arrangement
of homogeneous half spaces that realizes $M_f$, then
$$\lceil \log_2 k \rceil \;\le\; \tilde{C}_f \;\le\; \lceil \log_2 k \rceil + 1 \,.$$
We state our main result now. It immediately implies a lower bound on communication complexities.
Theorem 2.2. If a matrix $M \in \{-1,1\}^{X\times Y}$ can be realized by an arrangement
of homogeneous half spaces in $\mathbb{R}^k$, then
$$k \;\ge\; \frac{\sqrt{|X|\,|Y|}}{\|M\|}\,.$$
Proof. Assume that there are vectors $u_x, v_y \in \mathbb{R}^k$ such that $\mathrm{sign}\langle u_x, v_y\rangle = M_{x,y}$
for all $x \in X$, $y \in Y$. We can assume without loss of generality that $\mathrm{span}\{u_x \mid x \in X\} = \mathbb{R}^k$. This implies that $|X| \ge k$. Furthermore, we can assume that any $k$ of the
vectors $u_x$, $x \in X$, are linearly independent because the signs of the scalar products
$\langle u_x, v_y\rangle$ are not affected by small changes of the $u_x$. For any nonsingular linear
mapping $A \in GL(k)$ we can replace $u_x, v_y$ by $Au_x$, $(A^\top)^{-1}v_y$ without changing the
scalar products $\langle u_x, v_y\rangle = \langle Au_x, (A^\top)^{-1}v_y\rangle$. Furthermore, normalizing the vectors
$u_x$, $v_y$ does not change the signs of the scalar products $\langle u_x, v_y\rangle$. Because of these
observations and because of Theorem 4.1 (which is proven in Section 4) we can
assume w.l.o.g. that $u_x, v_y \in S^{k-1}$ and that $\sum_{x\in X} u_x u_x^\top = \frac{|X|}{k} I_k$, i.e. that the
vectors $u_x$ are nicely balanced in this sense.

Now we have for all $y \in Y$ that
$$\sum_{x\in X} |\langle u_x, v_y\rangle| \;\overset{1 \ge |\langle u_x, v_y\rangle|}{\ge}\; \sum_{x\in X} \langle u_x, v_y\rangle^{2} \;=\; v_y^\top \Bigl(\sum_{x\in X} u_x u_x^\top\Bigr) v_y \;=\; \frac{|X|}{k}\,. \qquad (1)$$

Inequality (1) means that for all $y \in Y$ the absolute values of the scalar products
$\langle u_x, v_y\rangle$ are on the average at least $\frac{1}{k}$, i.e., the vectors $u_x$ cannot lie arbitrarily close
to the homogeneous hyperplane with normal vector $v_y$.

Lemma 2.1 (which is proven below) gives a corresponding upper bound in terms
of the operator norm $\|M\|$ on the absolute values of the scalar products $\langle u_x, v_y\rangle$.
It follows that
$$|Y| \Bigl(\frac{|X|}{k}\Bigr)^{2} \;\overset{(1)}{\le}\; \sum_{y\in Y} \Bigl(\sum_{x\in X} |\langle u_x, v_y\rangle|\Bigr)^{2} \;\overset{\text{Lemma 2.1}}{\le}\; |X|\,\|M\|^{2}\,,$$
and rearranging this inequality gives $k \ge \sqrt{|X|\,|Y|}\,/\,\|M\|$.
From Theorem 2.1 and Theorem 2.2 (applied to the case $X = Y = \{0,1\}^n$,
$M = M_f$) we get the following corollary:

Corollary 2.1. For every distributed function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$
the communication complexity satisfies
$$\tilde{C}_f \;\ge\; n - \log_2 \|M_f\| \,.$$
Recall that for a matrix $M \in \{-1,1\}^{X\times Y}$ with pairwise orthogonal columns the
operator norm is $\|M\| = \sqrt{|X|}$. Now we can give an example of a distributed
function that has linear communication complexity:

Corollary 2.2. The communication complexity of the distributed function $f :
\{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ for which $M_f$ is the Hadamard matrix $H_n$ is at least $\frac{n}{2}$.
The Hadamard matrix $H_n$ can only be realized by an arrangement of homogeneous
half spaces in $\mathbb{R}^k$ if $k \ge 2^{n/2}$.
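Numerically, the bound of Corollary 2.1 is easy to evaluate. The Python/NumPy sketch below (the helper `comm_lower_bound` and the inline Hadamard construction are ours, for illustration) computes $n - \log_2\|M_f\|$ for the Hadamard matrix, where it equals $n/2$, and for a random $\pm 1$ matrix, where it typically comes out close to $n/2 - 1$; the latter behaviour matches the remark in the introduction that the bound is linear for the vast majority of distributed functions.

```python
import numpy as np

def comm_lower_bound(M):
    """The lower bound n - log2 ||M_f|| of Corollary 2.1 for a 2^n x 2^n matrix with entries +-1."""
    n = int(np.log2(M.shape[0]))
    return n - np.log2(np.linalg.norm(M, 2))

n = 6
H = np.array([[1.0]])
for _ in range(n):                                  # Hadamard matrix H_n (as in Section 1)
    H = np.block([[H, H], [H, -H]])
print(comm_lower_bound(H))                          # = n/2 = 3.0

rng = np.random.default_rng(1)
R = np.where(rng.random((2 ** n, 2 ** n)) < 0.5, 1.0, -1.0)
print(comm_lower_bound(R))                          # typically about n/2 - 1 for a random matrix
```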
We still have to prove Lemma 2.1. In the proof we will use the following fact
that follows from Fejér's Theorem (Corollary 7.5.4, Horn and Johnson [10]): for
any two symmetric positive semi-definite matrices $A, B \in \mathbb{R}^{X\times X}$ the sum of the
products of the entries is nonnegative.
Lemma 2.1. For finite subsets $X, Y$ of $S^{k-1}$ let
$$M := (\mathrm{sign}\langle x, y\rangle)_{x\in X,\, y\in Y} \in \mathbb{R}^{X\times Y} .$$
Then
$$\sum_{y\in Y} \Bigl(\sum_{x\in X} |\langle x, y\rangle|\Bigr)^{2} \;\le\; |X|\,\|M\|^{2} \,.$$
Proof. For every $y \in Y$ we have that
$$\sum_{x\in X} |\langle x, y\rangle| \;=\; \sum_{x\in X} M_{x,y}\,\langle x, y\rangle \;=\; \Bigl\langle \sum_{x\in X} M_{x,y}\,x,\; y\Bigr\rangle \;\overset{\|y\|=1}{\le}\; \Bigl\| \sum_{x\in X} M_{x,y}\,x \Bigr\| \,. \qquad (2)$$
We square this inequality and sum over $y \in Y$:
$$\sum_{y\in Y} \Bigl(\sum_{x\in X} |\langle x, y\rangle|\Bigr)^{2} \;\overset{(2)}{\le}\; \sum_{y\in Y} \Bigl\| \sum_{x\in X} M_{x,y}\,x \Bigr\|^{2} \;=\; \sum_{y\in Y} \sum_{x,\tilde{x}\in X} M_{x,y} M_{\tilde{x},y}\,\langle x, \tilde{x}\rangle \;=\; \sum_{x,\tilde{x}\in X} \bigl(MM^\top\bigr)_{x,\tilde{x}}\,\langle x, \tilde{x}\rangle$$
$$\overset{(*)}{\le}\; \sum_{x,\tilde{x}\in X} \bigl(\|M\|^{2} I_X\bigr)_{x,\tilde{x}}\,\langle x, \tilde{x}\rangle \;=\; \|M\|^{2} \sum_{x\in X} \|x\|^{2} \;=\; |X|\,\|M\|^{2} \,.$$
Here inequality $(*)$ holds because $A := \|M\|^{2} I_X - MM^\top$ and $B := (\langle x, \tilde{x}\rangle)_{x,\tilde{x}\in X}$ are
symmetric positive semi-definite matrices, i.e. $\sum_{x,\tilde{x}\in X} A_{x,\tilde{x}} B_{x,\tilde{x}} \ge 0$.
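The inequality of Lemma 2.1 can also be tested numerically for random point sets on the sphere. The following Python/NumPy sketch (purely illustrative) draws finite sets $X, Y \subseteq S^{k-1}$, forms $M = (\mathrm{sign}\langle x, y\rangle)$ and compares both sides.

```python
import numpy as np

rng = np.random.default_rng(2)
k, nx, ny = 4, 20, 15
X = rng.standard_normal((nx, k)); X /= np.linalg.norm(X, axis=1, keepdims=True)   # X subset of S^{k-1}
Y = rng.standard_normal((ny, k)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # Y subset of S^{k-1}

G = X @ Y.T                                   # the scalar products <x, y>
M = np.sign(G)                                # M = (sign<x, y>)
lhs = np.sum(np.abs(G).sum(axis=0) ** 2)      # sum over y of ( sum over x |<x, y>| )^2
rhs = nx * np.linalg.norm(M, 2) ** 2          # |X| * ||M||^2
print(lhs <= rhs, lhs, rhs)                   # Lemma 2.1: lhs <= rhs
```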
3. AN UPPER BOUND ON THE MARGIN OF ARRANGEMENTS
OF HALF SPACES
In this section we are interested in the largest margin (not the smallest dimension)
that an arrangement of half spaces that realizes a matrix $M \in \mathbb{R}^{X\times Y}$ can have.
We say that the matrix $M \in \mathbb{R}^{X\times Y}$ with no zero entries can be realized by an
arrangement of homogeneous half spaces with margin $\gamma$ if there are vectors $u_x$, $v_y$
for $x \in X$, $y \in Y$ that lie in the unit ball of $\mathbb{R}^k$ (where $k$ can be arbitrarily large)
such that $\mathrm{sign}(M_{x,y}) = \mathrm{sign}\langle u_x, v_y\rangle$ and $|\langle u_x, v_y\rangle| \ge \gamma$ for all $x \in X$, $y \in Y$. If we
interpret $v_y$ as the normal vector of a homogeneous half space, then $\mathrm{sign}\langle u_x, v_y\rangle =
M_{x,y}$ means that the vector $u_x$ lies in the interior of this half space if and only if
$M_{x,y} = 1$. The requirement $|\langle u_x, v_y\rangle| \ge \gamma$ means that the point $u_x$ has distance
at least $\gamma$ from the boundary of the half space. Analogously we can interpret the
vectors $u_x$ as normal vectors of half spaces and the vectors $v_y$ as points.

It is crucial that we require the vectors to lie in a unit ball (or that they are
bounded) because otherwise we could increase the margin by simply stretching all
vectors.
Note that it is not really a restriction to assume that the half spaces are homogeneous: assume we have points $u_x \in \mathbb{R}^k$, $x \in X$, that lie in the unit ball and an
arrangement of (not necessarily homogeneous) half spaces given by normal vectors
$v_y \in S^{k-1}$ and thresholds $t_y \in [-1, 1]$ such that
$$M = \bigl(\mathrm{sign}(\langle u_x, v_y\rangle - t_y)\bigr)_{x\in X,\, y\in Y} \,.$$
Then the vectors
$$\tilde{u}_x := \frac{1}{\sqrt{2}} \begin{pmatrix} u_x \\ 1 \end{pmatrix}, \qquad \tilde{v}_y := \frac{1}{\sqrt{2}} \begin{pmatrix} v_y \\ -t_y \end{pmatrix}$$
lie in the unit ball of $\mathbb{R}^{k+1}$, $M = (\mathrm{sign}(\langle \tilde{u}_x, \tilde{v}_y\rangle))_{x\in X,\, y\in Y}$, and the margin of the
new arrangement is only by a factor of $\frac{1}{2}$ worse than the old margin.
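The reduction above is easily verified numerically: the scalar products $\langle \tilde{u}_x, \tilde{v}_y\rangle$ are exactly $\frac{1}{2}(\langle u_x, v_y\rangle - t_y)$, so the signs are unchanged and the margin shrinks by at most a factor of $\frac{1}{2}$. A small Python/NumPy sketch (illustrative only, with arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(3)
k, nx, ny = 3, 10, 8
U = rng.standard_normal((nx, k)); U /= 2 * np.linalg.norm(U, axis=1, keepdims=True)  # points in the unit ball
V = rng.standard_normal((ny, k)); V /= np.linalg.norm(V, axis=1, keepdims=True)      # normal vectors in S^{k-1}
t = rng.uniform(-1, 1, ny)                                                            # thresholds t_y in [-1, 1]

U_new = np.hstack([U, np.ones((nx, 1))]) / np.sqrt(2)      # u~_x = (u_x, 1) / sqrt(2)
V_new = np.hstack([V, -t[:, None]]) / np.sqrt(2)           # v~_y = (v_y, -t_y) / sqrt(2)

old = U @ V.T - t[None, :]                                 # <u_x, v_y> - t_y
new = U_new @ V_new.T                                      # <u~_x, v~_y>
print(np.allclose(new, old / 2))                           # same signs, margin halved
print(np.all(np.linalg.norm(U_new, axis=1) <= 1 + 1e-12),
      np.all(np.linalg.norm(V_new, axis=1) <= 1 + 1e-12))  # still in the unit ball of R^{k+1}
```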
As observed by Ben-David, Eiron and Simon [2], a matrix $M \in \{-1,1\}^{X\times Y}$
can always be realized by an arrangement of homogeneous half spaces with margin
$\max\bigl\{|X|^{-1/2}, |Y|^{-1/2}\bigr\}$: in the case $|X| \le |Y|$ let, for $x \in X$, $u_x$ be the canonical
unit vector $e_x \in \mathbb{R}^X$, and let, for $y \in Y$, $v_y := |X|^{-1/2} (M_{x,y})_{x\in X} \in \mathbb{R}^X$. Then
$\|u_x\| = \|v_y\| = 1$, $M_{x,y} = \mathrm{sign}\langle u_x, v_y\rangle$ and $|\langle u_x, v_y\rangle| = |X|^{-1/2}$ for all $x \in X$, $y \in Y$.
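The trivial embedding is straightforward to write down explicitly. The following Python/NumPy sketch (illustrative, for the case $|X| \le |Y|$ and a random matrix) constructs it and checks that it realizes $M$ with margin exactly $|X|^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(4)
nx, ny = 8, 12                                      # |X| <= |Y|
M = np.where(rng.random((nx, ny)) < 0.5, 1.0, -1.0)

U = np.eye(nx)                                      # u_x = e_x, the canonical unit vectors of R^X
V = M.T / np.sqrt(nx)                               # v_y = |X|^(-1/2) * (M_{x,y})_{x in X}

S = U @ V.T                                         # the scalar products <u_x, v_y>
print(np.array_equal(np.sign(S), M))                # the arrangement realizes M ...
print(np.allclose(np.abs(S), 1 / np.sqrt(nx)))      # ... with margin exactly |X|^(-1/2)
print(np.allclose(np.linalg.norm(V, axis=1), 1))    # and the v_y are unit vectors
```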
A concept class $(X, C)$ consists of a set $X$, called the instance space, and a set $C$
of concepts, where a concept is any subset of $X$. For a concept class $(X, C)$,
a theorem by Vapnik ([15]; [4], Theorem 4.16) gives an upper bound on the margin
of any arrangement of half spaces that realizes $M_{X,C}$: there it is shown that if $d$ is
the VC-dimension of $(X, C)$, then at most the margin $d^{-1/2}$ is possible. It is always
true that $d \le |X|$, and if $d = |X|$ then the upper bound on the margin meets the
lower bound from the trivial embedding given above. However, $d = |X|$ is a very
special case: $d = |X|$ means that $C = \mathrm{Pot}(X)$, the power set of $X$.
Ben-David, Eiron and Simon [2] show that most concept classes (even of small
VC-dimension) cannot be realized by an arrangement of half spaces with a margin
that is much larger than the margin achieved by the trivial embedding. They use
counting arguments that do not give an upper bound on the margin for particular
concept classes.
We show a result that implies that for concrete concept classes (even if the VC-dimension is much smaller than $|X|$ and $|C|$) the margin cannot be much larger
than the margin achieved by the trivial embedding. In particular it will follow from
Theorem 3.1 that the trivial embedding gives the best possible margin if the rows
or the columns of $M_{X,C}$ are orthogonal, or also if $M_{X,C}$ has $|X|$ pairwise orthogonal
columns.
Theorem 3.1. If a matrix $M \in \{-1,1\}^{X\times Y}$ can be realized by an arrangement
of homogeneous half spaces with margin $\gamma$, then
$$\gamma \;\le\; \frac{\|M\|}{\sqrt{|X|\,|Y|}}\,.$$

Proof. W.l.o.g. $X, Y \subseteq S^{k-1}$, $M = (\mathrm{sign}\langle x, y\rangle)_{x\in X,\, y\in Y}$ and $|\langle x, y\rangle| \ge \gamma$ for all
$x \in X$, $y \in Y$. From Lemma 2.1 of Section 2 it follows that $|Y|\,(|X|\,\gamma)^{2} \le |X|\,\|M\|^{2}$,
i.e. $\gamma \le \|M\| / \sqrt{|X|\,|Y|}$.
For orthogonal matrices, in particular for Hadamard matrices, this shows that
the trivial embedding gives the optimal margin:
Corollary 3.1. The largest margin of any arrangement of homogeneous half
spaces that realizes the concept class $(X, C)$ for which $M_{X,C}$ is the Hadamard matrix
$H_n$ is $\gamma = 2^{-n/2}$.

Proof. The trivial embedding gives this margin, and Theorem 3.1 shows that this
margin is optimal.

Note that the VC-dimension $d$ of the concept class of Corollary 3.1 is $n$. Thus
for this concept class the upper bound $d^{-1/2} = n^{-1/2}$ on the margin in terms of the
VC-dimension is much weaker than our bound from Theorem 3.1.
4. A RESULT FROM LINEAR ALGEBRA
In this section we show a result that was needed in the proof of Theorem 2.2.
We consider sets $X \subseteq \mathbb{R}^k$ with the property that any subset of $X$ with at most $k$
elements is linearly independent. If $X$ has this property then so do the set $A(X) :=
\{Ax \mid x \in X\}$, for any nonsingular linear mapping $A \in GL(k)$, and the set $N(X) :=
\{N(x) \mid x \in X\}$, where $N : \mathbb{R}^k \setminus \{0\} \to S^{k-1}$, $N(x) := \frac{x}{\|x\|}$, normalizes vectors.
Furthermore, $|A(X)| = |N(X)| = |X|$. Repeated applications of nonsingular linear
functions and of normalizations $N$ can be merged: obviously $N(B(N(A(x)))) =
N((BA)(x))$ for any $A, B \in GL(k)$ and any $x \in X$.
For $X \subseteq \mathbb{R}^k$ we consider the following positive semi-definite and symmetric
matrix:
$$M(X) := \sum_{x\in X} x x^\top \;\in\; \mathbb{R}^{k\times k} \,.$$
This matrix and its eigenvalues and eigenvectors can be used to measure in which
directions the vectors $x \in X$ point on the average. The range of the matrix $M(X)$
is $\mathrm{range}(M(X)) = \mathrm{span}(X)$. (This simple fact is, for example, shown in [8].) For
all matrices $A \in GL(k)$ we obviously have that $M(A(X)) = A\,M(X)\,A^\top$.
We want to prove the following theorem.

Theorem 4.1. Let $X \subseteq \mathbb{R}^k$, $|X| \ge k$, such that all subsets of $X$ with at most
$k$ elements are linearly independent. Then there is a nonsingular linear mapping
$A \in GL(k)$ such that
$$M(N(A(X))) \;=\; \sum_{x\in X} \frac{1}{\|Ax\|^{2}}\,(Ax)(Ax)^\top \;=\; \frac{|X|}{k}\, I_k \,.$$
Note that it is easy to find a nonsingular linear mapping $A \in GL(k)$ such that
$M(A(X))$ is "optimally balanced", i.e. such that $M(A(X)) = I_k$. (How this can be
done is shown below in the proof of Lemma 4.1.) However, this observation does
not suffice to prove Theorem 4.1 because we are considering $M(N(A(X)))$ and not
$M(A(X))$ there. But we can still use the linear mapping $A$ with $M(A(X)) = I_k$ to
get a little closer to the solution we are looking for. This is formalized in Lemma
4.1. With this result and with a compactness argument (Lemma 4.2) we can finally
prove Theorem 4.1.
Lemma 4.1. Let $X \subseteq S^{k-1}$, $|X| \ge k$, such that all subsets of $X$ with at most
$k$ elements are linearly independent. Then either $M(X) = \frac{|X|}{k} I_k$ or there is some
nonsingular linear mapping $A \in GL(k)$ such that the smallest eigenvalue of
$$M(N(A(X))) \;=\; \sum_{x\in X} \frac{1}{\|Ax\|^{2}}\,(Ax)(Ax)^\top$$
is strictly larger than the smallest eigenvalue of $M(X)$.
Proof. Let $\lambda$ be the smallest eigenvalue of $M(X)$ and $V$ its multiplicity. The sum
of the $k$ (nonnegative) eigenvalues of $M(X)$ is $\mathrm{trace}(M(X)) = \sum_{x\in X} \|x\|^{2} = |X|$.
Thus the smallest eigenvalue $\lambda$ is at most $\frac{|X|}{k}$. Furthermore, if $\lambda = \frac{|X|}{k}$ then it
follows that all eigenvalues of $M(X)$ are $\frac{|X|}{k}$, i.e. $M(X) = \frac{|X|}{k} I_k$. Thus we can
assume that $\lambda < \frac{|X|}{k}$. Then also $V < k$ must hold.

For the case $|X| = k$ we can simply map the elements of $X$ to the canonical unit
vectors with a nonsingular linear mapping $A$. Thus we can assume that $|X| > k$.

We only show that there is a nonsingular linear mapping $A \in GL(k)$ such that
the smallest eigenvalue of $M(N(A(X)))$ is at least $\lambda$ and the multiplicity of the
eigenvalue $\lambda$ of $M(N(A(X)))$ is strictly smaller than $V$. This suffices because we
can repeat this procedure until the multiplicity of $\lambda$ is zero.

We know that $\lambda \ge 0$ because $M(X)$ is positive semi-definite. We even know that
$\lambda > 0$ because there are $k$ linearly independent elements in $X$, which means that
$M(X)$ is invertible. W.l.o.g. (using an orthogonal transformation)
$$M(X) = \mathrm{diag}(\lambda_1, \dots, \lambda_k) \,.$$
We know that $\lambda_1, \dots, \lambda_k > 0$. Let $A := \mathrm{diag}(\lambda_1^{-1/2}, \dots, \lambda_k^{-1/2}) \in GL(k)$. Then
$M(A(X)) = A\,M(X)\,A^\top = I_k$. For all $x \in X$:
$$\frac{1}{\|Ax\|^{2}} \;=\; \Bigl(\sum_{i=1}^{k} \frac{x_i^{2}}{\lambda_i}\Bigr)^{-1} \;\ge\; \lambda \qquad (3)$$
because $\sum_{i=1}^{k} x_i^{2} = \|x\|^{2} = 1$ and $\lambda_1, \dots, \lambda_k \ge \lambda$. Equality in (3) holds only if $x$ is
an eigenvector of $M(X)$ for the eigenvalue $\lambda$. Because of $1 \le V < k$ this happens
for at most $V$ elements $x$ of $X$, i.e. for at least $|X| - V$ elements of $X$ we have strict
inequality in (3). It is not hard to see that because of this the rank of the matrix
$$M(N(A(X))) - \lambda I_k \;=\; \sum_{x\in X} \underbrace{\Bigl(\frac{1}{\|Ax\|^{2}} - \lambda\Bigr)}_{\overset{(3)}{\ge}\, 0} (Ax)(Ax)^\top \qquad (4)$$
is at least $\min(|X| - V,\, k)$. We can argue as follows: there is a subset $Y$ of $X$ of
cardinality $\min(|X| - V,\, k)$ such that strict inequality in (3) holds for all $x \in Y$.
The elements of $Y$ are linearly independent because $Y$ is a subset of $X$ with at most
$k$ elements. It follows that the matrix on the right hand side of (4) has rank at
least $|Y| = \min(|X| - V,\, k)$ because for all $x \in Y$ the coefficients of the summands
in (4) are strictly positive and $\mathrm{range}(M(Z)) = \mathrm{span}(Z)$ holds for any set of vectors
$Z \subseteq \mathbb{R}^k$. Thus the eigenspace of $M(N(A(X)))$ for $\lambda$ has dimension at most
$$k - \min(|X| - V,\, k) \;\overset{|X| > k}{<}\; V \,.$$
We still have to show that the matrix $M(N(A(X)))$ has no eigenvalues strictly
smaller than $\lambda$. This is equivalent to showing that the matrix on the left hand side of
(4) is positive semi-definite. This is true because the right hand side of (4) is a sum of
positive semi-definite matrices.
Lemma 4.2. Let $X \subseteq \mathbb{R}^k$, $|X| \ge k$, such that all subsets of $X$ with at most $k$
elements are linearly independent. Then for every $\varepsilon > 0$ the subset
$$\mathcal{A} := \{A \in GL(k) \mid \|A\| = 1,\; M(N(A(X))) \succeq (1 + \varepsilon) I_k\}$$
of $\mathbb{R}^{k\times k}$ is compact.
Proof. Obviously $\mathcal{A}$ is a bounded subset of $\mathbb{R}^{k\times k}$ and we have to show that $\mathcal{A}$
is closed in $\mathbb{R}^{k\times k}$. To show this let $A_1, A_2, \dots \in \mathcal{A}$ be a sequence that converges
to $A \in \mathbb{R}^{k\times k}$. Clearly $\|A\| = 1$ holds because the operator norm is continuous. We
have to show that $A \in GL(k)$. Then it easily follows that
$$M(N(A(X))) = \lim_{l\to\infty} \underbrace{M(N(A_l(X)))}_{\succeq\, (1+\varepsilon) I_k} \;\succeq\; (1 + \varepsilon) I_k \,.$$

We assume that $A \notin GL(k)$ and show that this leads to a contradiction. The idea of
the proof is to show that if $A \notin GL(k)$ then most vectors in $N(A_l(X))$ (namely those
that do not lie in $\mathrm{null}(A)$) get arbitrarily close to $\mathrm{range}(A)$ as $l \to \infty$. This means
that not enough "weight" remains in $\mathrm{range}(A)^{\perp}$ for $M(N(A_l(X))) \succeq (1 + \varepsilon) I_k$ to
hold.

Let $d := \dim(\mathrm{null}(A))$. We know that $0 < d < k$ because of $\|A\| = 1$ and because
of $A \notin GL(k)$. Thus $|X \cap \mathrm{null}(A)| \le d$. For $x \in X \setminus \mathrm{null}(A)$ we have that $\|Ax\| > 0$,
thus
$$\frac{A_l x}{\|A_l x\|} \;\xrightarrow{\;l\to\infty\;}\; \frac{Ax}{\|Ax\|} \in \mathrm{range}(A) \,.$$
We also have that
$$\dim(\mathrm{range}(A)^{\perp}) = k - \dim(\mathrm{range}(A)) = \dim(\mathrm{null}(A)) = d \,.$$
Let $e_1, \dots, e_d$ be an orthonormal basis of $\mathrm{range}(A)^{\perp}$. Then
$$\limsup_{l\to\infty} \sum_{i=1}^{d} e_i^\top\, M(N(A_l(X)))\, e_i \;=\; \limsup_{l\to\infty} \sum_{x\in X} \sum_{i=1}^{d} \underbrace{\Bigl\langle e_i, \frac{A_l x}{\|A_l x\|}\Bigr\rangle^{2}}_{\to\, 0 \text{ if } x \notin \mathrm{null}(A)}
\;=\; \limsup_{l\to\infty} \sum_{x\in X\cap\,\mathrm{null}(A)} \underbrace{\sum_{i=1}^{d} \Bigl\langle e_i, \frac{A_l x}{\|A_l x\|}\Bigr\rangle^{2}}_{\le\, 1} \;\le\; |X \cap \mathrm{null}(A)| \;\le\; d \,.$$
Thus there exist $l, i$ such that $e_i^\top\, M(N(A_l(X)))\, e_i < 1 + \varepsilon$. This is a contradiction to
$M(N(A_l(X))) \succeq (1 + \varepsilon) I_k$.
Proof (of Theorem 4.1). We start with a linear mapping $A$ that maps $k$ arbitrarily chosen elements of $X$ to the canonical unit vectors of $\mathbb{R}^k$. In the case $|X| = k$
we are already done.

Now assume that $|X| > k$. Then $M(N(A(X))) \succeq I_k$, i.e. the smallest eigenvalue
of $M(N(A(X)))$ is at least 1. Lemma 4.1 says that if $M(N(A(X))) = \frac{|X|}{k} I_k$ does
not already hold we can modify $A$ such that the smallest eigenvalue of $M(N(A(X)))$
increases. Thus we can find an $A$ and an $\varepsilon > 0$ such that the smallest eigenvalue
of $M(N(A(X)))$ is at least $1 + \varepsilon$.

By Lemma 4.2 the set of all $A \in GL(k)$, $\|A\| = 1$, for which the smallest eigenvalue of $M(N(A(X)))$ is at least $1 + \varepsilon$ is compact. The smallest eigenvalue of
$M(N(A(X)))$ is a continuous function of $A \in GL(k)$. (The smallest eigenvalue of a
symmetric positive semi-definite matrix is equal to the smallest singular value, and
the singular values depend continuously on the matrix, see, e.g., Golub and Van
Loan [9], Theorem 8.3.4.) This means that there is an $A \in GL(k)$ for which the
smallest eigenvalue of $M(N(A(X)))$ is maximal.

For this $A$ we must have $M(N(A(X))) = \frac{|X|}{k} I_k$, because otherwise we could apply
Lemma 4.1 and increase the smallest eigenvalue of $M(N(A(X)))$. But this would
contradict the fact that the smallest eigenvalue of $M(N(A(X)))$ is already maximal.
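The proofs of Lemma 4.1 and Theorem 4.1 suggest a simple fixed-point iteration for computing a balancing map $A$ in practice: repeatedly whiten the current normalized point set with $M(\cdot)^{-1/2}$ and renormalize. The following Python/NumPy sketch is a heuristic illustration of this update step, not an algorithm analyzed in this paper; the function name, tolerance and iteration limit are ours, and convergence speed is not claimed. In experiments with points in general position it usually converges quickly.

```python
import numpy as np

def balance(X, tol=1e-8, max_iter=10000):
    """Search for A in GL(k) with sum_x (Ax)(Ax)^T / ||Ax||^2 = (|X|/k) I_k (Theorem 4.1).

    X: array of shape (m, k) whose rows are the points x, with m >= k and any k rows
    linearly independent. Returns the accumulated transformation A. Each iteration is
    the whitening step from the proof of Lemma 4.1, applied to the normalized points.
    """
    m, k = X.shape
    A = np.eye(k)
    for _ in range(max_iter):
        Y = X @ A.T                                          # the images A x (as rows)
        Y /= np.linalg.norm(Y, axis=1, keepdims=True)        # normalize: N(A x)
        M = Y.T @ Y                                          # M(N(A(X)))
        if np.allclose(M, (m / k) * np.eye(k), atol=tol):
            break
        lam, U = np.linalg.eigh(M)                           # M = U diag(lam) U^T
        A = (U @ np.diag(lam ** -0.5) @ U.T) @ A             # whiten: new M(A(X)) = I_k
    return A

# usage sketch: random points are in general position with probability 1
rng = np.random.default_rng(5)
X = rng.standard_normal((12, 3))
A = balance(X)
Y = X @ A.T
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
print(np.round(Y.T @ Y, 4))                                  # approximately (12/3) * I_3
```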
ACKNOWLEDGMENT
The author wants to thank Hans Ulrich Simon for a lot of helpful comments and for telling
him about a nice idea of how random projections and results from communication complexity theory
can be used to prove upper bounds on margins of arrangements of half spaces. The author is
also grateful to Shai Ben-David, Niels Schmitt, Eike Kiltz, Ingo Wegener, Satyanarayana Lokam
and Sanjoy Dasgupta for helpful discussions. We thank Dietrich Braess for simplifying the proof
of Lemma 2.1. The author was supported by the Deutsche Forschungsgemeinschaft grant SI
498/4-1 and by a grant from the G.I.F., the German-Israeli Foundation for Scientific Research
and Development.
REFERENCES
1. Alon, N., Frankl, P., and Rödl, V. (1985). Geometrical realization of set systems and probabilistic communication complexity. Proceedings of the 26th Annual Symposium on Foundations
of Computer Science. IEEE Computer Society.
2. Ben-David, S., Eiron, N., and Simon, H. U. (2001). Limitations of learning via embeddings in
Euclidean half-spaces. Proceedings of the 14th Annual Workshop on Computational Learning
Theory. Berlin, Heidelberg: Springer.
3. Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the
Vapnik-Chervonenkis dimension. Journal of the ACM, 36, 929–965.
4. Cristianini, N., and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines.
Cambridge, United Kingdom: Cambridge University Press.
5. Forster, J. (2001). A linear lower bound on the unbounded error probabilistic communication
complexity. Proceedings of the 16th Annual Conference on Computational Complexity. IEEE
Computer Society.
6. Forster, J., Krause, M., Lokam, S. V., Mubarakzjanov, R., Schmitt, N., and Simon, H. U.
(2001). Relations between Communication Complexity, Linear Arrangements, and Computational Complexity. Unpublished manuscript.
7. Forster, J., Schmitt, N., and Simon, H. U. (2001). Estimating the optimal margins of embeddings in Euclidean half spaces. Proceedings of the 14th Annual Workshop on Computational
Learning Theory. Berlin, Heidelberg: Springer.
8. Forster, J., and Warmuth, M. K. (2000). Relative loss bounds for temporal-difference learning.
Proceedings of the Seventeenth International Conference on Machine Learning (pp. 295–302).
San Francisco: Morgan Kaufmann.
9. Golub, G. H., and Van Loan, C. F. (1991). Matrix Computations. Baltimore and London: The
Johns Hopkins University Press.
10. Horn, R. A., and Johnson, C. R. (1985). Matrix Analysis. Cambridge, United Kingdom: Cambridge University Press.
11. Kearns, M. J., and Vazirani, U. V. (1994). An Introduction to Computational Learning Theory.
Cambridge, Massachusetts: Massachusetts Institute of Technology.
12. Krause, M. (1996). Geometric arguments yield better bounds for threshold circuits and distributed computing. Theoretical Computer Science, 156, 99–117.
13. Meise, R., and Vogt, D. (1992). Einführung in die Funktionalanalysis. Braunschweig/Wiesbaden: Friedrich Vieweg & Sohn Verlagsgesellschaft mbH.
14. Paturi, R., and Simon, J. (1986). Probabilistic communication complexity. Journal of Computer and System Sciences, 33, 106–123.
15. Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley & Sons, Inc.
16. Vapnik, V. N., and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16, 264–280.