On Weight Ratio Estimation for Covariate Shift
Ruth Urner
Department of Empirical Inference
MPI for Intelligent Systems
Tübingen, Germany
[email protected]
Shai Ben-David
Cheriton School of Computer Science
University of Waterloo
Waterloo, Canada
[email protected]
Abstract
Covariate shift is a common assumption for various Domain Adaptation (DA)
tasks. Many of the common DA algorithms for that setup rely on density ratio
estimation applied to the source training sample. In this work, we analyze the
sample complexity of reliable density ratio estimation and its relation to classification under covariate shift. We provide a strong lower bound on the number
of samples needed, even for an extremely simple version of the problem. Aiming to shed light on the practical success of reweighing paradigms, we present a
novel reweighing scheme for which we prove finite sample size success guarantees under some natural conditions on the learning problems involved. Notably,
our reweighing algorithm for learning under covariate shift does not rely on first
approximating the density ratio of the two distributions.
1 Introduction
Sample reweighing (also known as importance weighing or density ratio estimation) is a paradigm
that plays a major role in domain adaptation learning under the covariate shift assumption ([1], [2],
[3]). Indeed, provided that the covariate shift assumption holds and the target marginal distribution is absolutely continuous w.r.t. the source marginal, it can be readily shown that reweighing the
source labeled sample, so that each sample point gets a weight proportional to the ratio of its target
density to its source density, turns it into a sample that can be used as if it were target generated [4].
This approach therefore provides a full solution to the domain adaptation learning problem once
the density ratio between the source and target distributions can be reliably evaluated at the sample
points. However, such knowledge is rare. Instead, what a learner typically has access to are finite
samples generated by those distributions.
Consequently, algorithms that employ sample reweighing rely on estimating those ratios from finite
source and target samples. Many sample based reweighing algorithms have been proposed and
are repeatedly reported to perform well on domain adaptation and other learning tasks [5, 2, 6, 3].
However, theoretical analysis of such estimation algorithms has not yielded finite sample size guarantees
on the quality of approximation (see also the related work in Appendix, Section A).
Our first result is a strong lower bound. We show that even a very simple version of the weight ratio
estimation problem already requires unreasonably large sample sizes. More specifically, we show
that even distinguishing between the cases that two distributions are identical or very different in
terms of their density ratio requires sample sizes on the order of the square root of the domain size.
The main contribution of this work is on the positive side, though. We propose a novel reweighing
scheme for learning under covariate shift and show that it enjoys strong performance guarantees. We
prove finite sample error bounds for our algorithm under some natural niceness assumptions about
the relevant distributions that are likely to hold for many practical learning tasks. The properties that
we rely on are a relaxed version of Lipschitzness of the labeling that takes the marginal distribution
into account, and a novel notion of simplicity of the marginal distribution, called concentratedness,
which reflects structure likely to exist in many real data sets and may prove useful for other learning
tasks. It is worth noting that our reweighing algorithm does not rely on first approximating the
density ratio between the two distributions involved.
Notation We employ standard notation; see Appendix, Section B, for more details. We let P^S and
P^T be two distributions over a labeled domain X × {0, 1}. We denote their marginal distributions
over X by P^S_X and P^T_X respectively. We assume covariate shift, that is, we assume that P^S and P^T
have the same labeling function l : X → {0, 1}. A function h : X → {0, 1} is a classifier and a set
of classifiers H ⊆ {0, 1}^X a hypothesis class. We let err_P(h) = Pr_{(x,y)∼P}[h(x) ≠ y] denote the
error of classifier h with respect to the distribution P. An empirical error with respect to a sample
S is denoted by êrr_S(h). We let the best achievable error by hypothesis class H with respect to
distribution P be denoted by opt_P(H) = inf_{h∈H} err_P(h).
2 The difficulty of density ratio estimation
Density ratio estimation has been advocated as a key tool for domain adaptation in the covariate shift
setup ([5, 2, 6, 3]). However, note that density ratio estimation is at least as difficult a task as density
estimation (just consider the case where one of the densities is a constant function). Furthermore, we
derive a strong lower bound even for a particularly simple version of density ratio estimation. Our
lower bound implies that this task cannot be reliably achieved with finite, task-independent sample
sizes as long as no further assumptions about the densities involved are made. It is shown via a
reduction from hardness results for domain adaptation which appeared in [7]. For more details on
this reduction and our lower bound, see Appendix, Section C.
Theorem 1. Given a finite set X, no algorithm that is based on samples from the two distributions
can distinguish between the case that two distributions over X have density ratio 1 for all x ∈ X
and the case where they have density ratio 0 for all x ∈ X, as long as the sizes of those samples sum
up to less than Ω(√|X|).
3 A new reweighing algorithm
We introduce a DA learning algorithm that is based on a novel reweighing mechanism, ReWeigh(S, T, ε, B)
(where S is a source sample, T an unlabeled target sample, ε some accuracy parameter and B a partition of the domain set). Our learning algorithm uses that mechanism to reweigh
a labeled sample S from the source according to an (unlabeled) sample T from the
target. Then it chooses a weighted ERM classifier among the members of H that
satisfy some margin requirement (see definition below for the notion of margin we use).
Algorithm Learn
Input Finite sets S, T ⊆ X, partition B, accuracy ε, margin ρ
Step 1 Get weights for S by ReWeigh(S, T, 3ε/4, B)
Step 2 Ĥ = {h ∈ H : h has a (ρ, 0) margin w.r.t. w(S)}
Output ERM_Ĥ(w(S))
Empirical Risk minimization (ERM) can be readily generalized to a weighted sample S as follows:
Let w(S) = ((x_1, y_1, w_1), . . . , (x_n, y_n, w_n)) be a sample where w_i is a weight associated with example (x_i, y_i). Then we define the weighted empirical error with respect to the weighted sample
as

êrr_w(S)(h) = ( Σ_{(x_i,y_i,w_i)∈S} w_i )^{-1} · Σ_{(x_i,y_i,w_i)∈S} w_i · 1[h(x_i) ≠ y_i] ,

and ERM_H(w(S)) is defined as

ERM_H(w(S)) ∈ argmin_{h∈H} êrr_w(S)(h)
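To make the weighted ERM rule concrete, the following minimal Python sketch computes the weighted empirical error of a classifier and selects a weighted ERM hypothesis from a finite candidate list. The function names (weighted_empirical_error, weighted_erm) and the representation of a weighted sample as (x, y, w) triples are illustrative choices of ours, not part of the paper's formal setup.

    from typing import Callable, List, Sequence, Tuple

    WeightedExample = Tuple[float, int, float]  # (x, y, w): point, label, weight

    def weighted_empirical_error(h: Callable[[float], int],
                                 weighted_sample: Sequence[WeightedExample]) -> float:
        """Weighted empirical error: weight-proportional 0/1 loss, normalized by total weight."""
        total_weight = sum(w for _, _, w in weighted_sample)
        if total_weight == 0:
            return 0.0  # convention: no surviving weight means no measurable error
        loss = sum(w for x, y, w in weighted_sample if h(x) != y)
        return loss / total_weight

    def weighted_erm(hypotheses: List[Callable[[float], int]],
                     weighted_sample: Sequence[WeightedExample]) -> Callable[[float], int]:
        """Return a hypothesis minimizing the weighted empirical error."""
        return min(hypotheses, key=lambda h: weighted_empirical_error(h, weighted_sample))

In the paper, the minimization in the Learn algorithm is over the margin-restricted class Ĥ rather than over all of H; the sketch simply treats whichever hypothesis class is passed in as a finite list of callables.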
For our reweighing scheme, assume that B is a partition of the space X into sets of diameter at most
ρ. For (sample) points s and t we let b_s ∈ B and b_t ∈ B denote the cells of the partition that contain
s and t respectively.
Given samples S and T, a parameter η and a partition B, our reweighing procedure focuses on cells
that are η-heavy from the point of view of the sample T, that is, cells b with |T ∩ b|/|T| ≥ η. Let T_η be
the set of all elements of T that reside in such η-heavy cells; we restrict attention to the heavy cells and to this subset T_η of the sample.
Let w_(B,T,η)(S) be the reweighing of S that assigns to every s ∈ S the weight
w(s) = |T ∩ b_s| / (|T| · |S ∩ b_s|) if |T ∩ b_s|/|T| ≥ η, and 0 otherwise. Namely, this procedure uniformly divides the empirical T-weight of
every cell among all its S members, after T is restricted to the η-heavy cells that are occupied by
members of S.
Algorithm ReWeigh
Input Finite sets S, T ⊆ X, parameter η, partition B
Step 1 For all s ∈ S set w(s) = |T ∩ b_s| / (|T| · |S ∩ b_s|) if |T ∩ b_s|/|T| ≥ η, and w(s) = 0 otherwise
Output The weighted set w_(B,T,η)(S) = w(S) = {(s, w(s)) : s ∈ S}
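A minimal Python sketch of the ReWeigh procedure is given below, assuming the partition B is represented by a function cell_of that maps a point to (an identifier of) its cell; that representation, and the function names, are illustrative choices rather than anything prescribed by the paper.

    from collections import Counter
    from typing import Callable, Hashable, List, Sequence, Tuple

    def reweigh(S: Sequence[Tuple[float, int]],      # labeled source sample: (x, y)
                T: Sequence[float],                   # unlabeled target sample
                eta: float,                           # heaviness threshold
                cell_of: Callable[[float], Hashable]  # partition B, as a point -> cell-id map
                ) -> List[Tuple[float, int, float]]:
        """Give each source point the empirical T-weight of its cell, divided uniformly
        among the source points in that cell; cells that are not eta-heavy under T get weight 0."""
        t_counts = Counter(cell_of(t) for t in T)    # |T ∩ b| per cell
        s_counts = Counter(cell_of(x) for x, _ in S) # |S ∩ b| per cell
        n_t = len(T)
        weighted = []
        for x, y in S:
            b = cell_of(x)
            t_frac = t_counts[b] / n_t                           # |T ∩ b| / |T|
            w = t_frac / s_counts[b] if t_frac >= eta else 0.0   # per-point share, or 0
            weighted.append((x, y, w))
        return weighted

The Learn algorithm of the box above would then run weighted ERM (as in the earlier sketch) over those hypotheses in H that satisfy the (ρ, 0)-margin requirement with respect to the weighted sample returned here.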
4 Distributional assumptions
In this section, we discuss the assumptions on the data generation underlying our analysis. Note that,
by our lower bound, no positive results are possible in a distribution-free framework, i.e. without
such assumptions. Our work can therefore also be viewed as identifying a first set of properties of
the data generation that allow for domain adaptation learning based on sample reweighing.
Weight ratio A common way of restricting the divergence between source and target marginal
distributions is to assume some non-zero lower bound on the density ratio between the two distributions. The strongest such assumption (which is nevertheless often employed) is a bound on the
pointwise weight-ratio. However, this is rather unrealistic [1]. The following relaxation of a density
ratio, the η-weight ratio, has been introduced in [8].
Definition (Weight ratio). Let B ⊆ 2^X be a collection of subsets of the domain X measurable with
respect to both P^S_X and P^T_X. For some η > 0 we define the η-weight ratio of the source distribution
and the target distribution with respect to B as

C_B,η(P^S_X, P^T_X) = inf_{b ∈ B : P^T_X(b) ≥ η} P^S_X(b) / P^T_X(b) .

This quantity becomes relevant for domain adaptation when it is bounded away from zero. Note that the
pointwise weight ratio mentioned above can be obtained by setting B = {{x} : x ∈ X}.
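As a small worked example of this definition (our own illustration, not from the paper): let X = [0, 1], let B consist of the two cells b_1 = [0, 1/2) and b_2 = [1/2, 1], let P^T_X be uniform on [0, 1], and let P^S_X place mass 1/4 on b_1 and 3/4 on b_2. Then, for every η ≤ 1/2,

C_B,η(P^S_X, P^T_X) = min{ P^S_X(b_1)/P^T_X(b_1) , P^S_X(b_2)/P^T_X(b_2) } = min{ (1/4)/(1/2) , (3/4)/(1/2) } = 1/2 .

Cells with target weight below η are ignored by the infimum, so, for instance, P^S_X could vanish entirely on a region of P^T_X-weight smaller than η without driving C_B,η to zero, whereas the pointwise weight ratio would be zero in that case.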
Distribution concentratedness We introduce an abstract notion of niceness of probability distributions that measures how "concentrated" the distribution is. Similar to notions of intrinsic dimension, it reflects the intuition of the distribution not occupying the full high-dimensional space (or unit
cube, or ball) over which it is defined. However, while common notions of intrinsic dimension focus
on having only a few degrees of freedom in the distribution's support, or on being coverable by a small
number of balls (as a function of their radius), we consider a different aspect: being coverable by
dense subsets. Concentratedness states that most of the domain (with respect to a distribution) can
be covered by relatively heavy sets of bounded diameter. That is, intuitively, the distribution cannot
"spread very thinly over a large area". For example, a mixture of well-separated, full-dimensional
Gaussians has high dimension when viewed from the perspective of common intrinsic dimension
notions, but is simple with respect to our notion.
Definition (Concentratedness). Let X be a subset of some Euclidean space, say [0, 1]^d, and B ⊆ 2^X
a collection of subsets of the domain, usually of some small diameter, that covers X.
We say that a probability distribution P over X is (α, β) concentrated with respect to B if

P [ ⋃ {b ∈ B : P_X(b) ≥ α} ] > 1 − β
When B is the set of all balls of radius ρ, we say that P is (ρ, α, β) concentrated.
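As a simple illustration (ours, not from the paper): suppose P_X is a uniform mixture of k components, each of which places all of its mass inside some ball of radius ρ, and B is the set of all balls of radius ρ. Each mixture component then lies inside some b ∈ B with P_X(b) ≥ 1/k, so

P_X [ ⋃ {b ∈ B : P_X(b) ≥ 1/k} ] = 1 > 1 − β for every β > 0,

i.e. P_X is (ρ, 1/k, β) concentrated for every β > 0, regardless of the ambient dimension or of how far apart the components are.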
Note that we do not care how many sets from B are required for such a majority cover. Our notion becomes particularly useful for classification prediction when paired with some Lipschitzness
assumption governing the labeling rule. This is the way we will be applying it here.
Probabilistic Lipschitzness Probabilistic Lipschitzness (PL) is a relaxation of standard Lipschitzness, introduced in [9]. Loosely speaking, for PL we require Lipschitzness to hold only with
some (high) probability. A Lipschitz constant λ for a distribution with a deterministic labeling function forces a 1/λ gap between differently labeled points. Thus, the standard Lipschitz condition for
deterministic labeling functions implies that the data lies in label-homogeneous regions (clusters)
that are separated by 1/λ-margins of weight zero with respect to the distribution. PL weakens this
assumption by allowing the margins to "smoothen out". The relaxation from Lipschitzness to Probabilistic Lipschitzness is thus especially relevant in the deterministic labeling regime: it allows
modeling the marginal-label relatedness without trivializing the setup.
Definition (Probabilistic Lipschitzness). Let X be some Euclidean domain and let φ : R → [0, 1].
We say that f : X → R is φ-Lipschitz with respect to a distribution P_X over X if, for all λ > 0:

Pr_{x∼P_X} [ Pr_{y∼P_X} [ |f(x) − f(y)| > (1/λ) ‖x − y‖ ] > 0 ] ≤ φ(λ)

If, for a distribution P = (P_X, l), the labeling function l is φ-Lipschitz, then we also say that P
satisfies φ-Probabilistic Lipschitzness. We then let φ^{-1}(ε) denote the smallest λ such that φ(λ) ≥ ε.
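For intuition, here is a small example of our own (not from the paper): let X = [0, 1], let P_X be uniform, and let l(x) = 1[x ≥ 1/2]. For a binary l, the inner event |l(x) − l(y)| > (1/λ)‖x − y‖ can only occur when l(x) ≠ l(y) and ‖x − y‖ < λ, so the outer probability is at most the mass of the points within distance λ of the decision threshold:

Pr_{x∼P_X} [ Pr_{y∼P_X} [ |l(x) − l(y)| > (1/λ) ‖x − y‖ ] > 0 ] ≤ 2λ ,

so l is φ-Lipschitz with φ(λ) = min{2λ, 1}, even though l is not Lipschitz in the standard sense.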
Classifiers with margins If a distribution P = (P_X, l) with a deterministic labeling function l
is φ-Lipschitz, then the weight of points x that have a label-heterogeneous λ-ball around them is
bounded by φ(λ). In other words, l has a (λ, φ^{-1}(λ/2)) margin with the following notion of margin:

Definition ((γ, ε) margin classifier). Given a probability distribution P over X and parameters γ, ε ∈
(0, 1), we say that a classifier h : X → {0, 1} has margin (γ, ε) with respect to P if

Pr_{x∼P_X} [ {x ∈ X : ∃y ∈ X such that ‖x − y‖ ≤ 2γ and h(x) ≠ h(y)} ] < ε .

For a class H, let H^P_{γ,ε} denote the class of all h in H that have a (γ, ε) margin with respect to P.
5 Finite sample error bound
We prove that under the above assumptions about the distributions in a class of pairs of distributions
W, our proposed algorithm is a successful DA learner for W. See Appendix, Section D, for details
on the proof.
Theorem 2. Let X be some domain and let H be a class over X of finite VC dimension. Let
φ : [0, 1] → [0, 1]. For ε > 0, let ρ = φ^{-1}(ε/2) and let B be a partition of X into disjoint subsets
each of diameter at most ρ. Further, for some constant C, let W = W(H, ε, ρ, C) be the class of all
pairs of probability distributions (P^S, P^T) satisfying φ-Probabilistic Lipschitzness, the covariate
shift assumption as well as

• (Concentration of the marginal) P^T [ ⋃ {b ∈ B : P^T(b) ≥ ε/2} ] > 1 − ε.
• (Margin w.r.t. H) There exists some h* ∈ H that has a (2ρ, ε/2)-margin w.r.t. P^T_X and
err_{P^T}(h*) ≤ opt_{P^T}(H) + ε.
• (Weight-ratio for cells) C_{B,ε/4}(P^S, P^T) ≥ C.

Let η = 3ε/4. Then there exist constants C_1, C_2 such that ERM_Ĥ(w_(T,B,η)(S)) is a 3ε-successful
DA learner for W, for sample sizes

|S| ≥ C_1 · ( VC(H∆H) log(d/(Cε)) + log(1/δ) ) / (Cε)

and

|T| ≥ C_2 · ( VC(H∆H) + log(1/δ) ) / ε^4 ,

where H∆H = {h_1∆h_2 | h_1, h_2 ∈ H}, and h_1∆h_2 = {x ∈ X | h_1(x) ≠ h_2(x)}.
References
[1] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance
weighting. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems (NIPS), pages 442–450. 2010.
[2] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance
weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.
[3] Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama. Direct density ratio estimation for
large-scale covariate shift adaptation. Journal of Information Processing, 17:138–155, 2009.
[4] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection
bias correction theory. In Proceedings of the Conference on Algorithmic Learning Theory
(ALT), pages 38–53, 2008.
[5] M. Sugiyama and K.-R. Müller. Generalization error estimation under covariate shift. In Workshop on Information-Based Induction Sciences, 2005.
[6] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and
Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the
Institute of Statistical Mathematics, 60(4):699–746, 2008.
[7] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of
unlabeled target samples. In Proceedings of the Conference on Algorithmic Learning Theory
(ALT), pages 139–153, 2012.
[8] Shai Ben-David, Shai Shalev-Shwartz, and Ruth Urner. Domain adaptation–can quantity compensate for quality? In International Symposium on Artificial Intelligence and Mathematics
(ISAIM), 2012.
[9] Ruth Urner, Shai Ben-David, and Shai Shalev-Shwartz. Unlabeled data can speed up prediction
time. In Proceedings of the International Conference on Machine Learning (ICML), pages
641–648, 2011.
[10] Takafumi Kanamori and Masashi Sugiyama. Statistical analysis of distance estimators with
density differences and density ratios. Entropy, 16(2):921–942, 2014.
[11] Qichao Que and Mikhail Belkin. Inverse density as an inverse problem: the fredholm equation
approach. In Advances in Neural Information Processing Systems (NIPS), pages 1484–1492,
2013.
[12] Assaf Glazer, Michael Lindenbaum, and Shaul Markovitch. Learning high-density regions
for a generalized kolmogorov-smirnov test in high-dimensional data. In Advances in Neural
Information Processing Systems (NIPS), pages 737–745, 2012.
[13] Benjamin G. Kelly, Thitidej Tularak, Aaron B. Wagner, and Pramod Viswanath. Universal
hypothesis testing in the learning-limited regime. In IEEE International Symposium on Information Theory (ISIT), pages 1478–1482, 2010.
[14] D. Haussler and E. Welzl. Epsilon-nets and simplex range queries. In Proceedings of the Second
Annual Symposium on Computational Geometry (SCG), pages 61–71, 1986.
[15] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning. Cambridge University Press, 2014.
[16] Vladimir N. Vapnik and Alexey J. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–
280, 1971.
A Related work
Due to the wide range of problems to which sample based importance reweighing is relevant, there
is a large body of previous work addressing this topic. We aim to refer to some of the most relevant
works below. However, there are some high level differences between the previously published
papers that we are aware of and our work. In that respect, we believe the particular works we
address below are representative of other papers on the topic as well. The first point to emphasize is
that in this paper we are focusing on finite sample size guarantees. Furthermore, we are interested in
sample size guarantees that are distribution free, or at least hold for a large non-parameterized family
of distributions, and are independent of the size of the domain (or distribution support).
Naturally, much of the work on this topic is aiming to provide practical algorithms supported by
experimental results on some concrete data, rather than provide performance guarantees. This is,
for example, the case with [5, 2, 6, 3]. Another relevant line of work aims to estimate the function
f (x) = p(x)/q(x) (where p and q are the density functions of two distributions). Examples of such
works are [10] and [11]. The difference between this line of work and ours is that it focuses on
average norms, l1 and l2 , of the difference between that function and its estimate by the algorithm.
This measure is too crude for implying domain adaptation error guarantees as in the setting discussed
here.
[4] discuss the effect of bias in the sample weighting estimation on the accuracy of hypotheses returned by a learning algorithm that is based on those reweighed samples. However, the
accuracy guarantees (Theorem 2 in that work) grow unboundedly with the parameter B =
max_{x∈S}(1/p(x), 1/p̂(x)), which, in turn, grows unboundedly with the domain size. In contrast,
our focus is on error guarantees that are independent of the domain size (and thus also apply to
probability distributions with infinite support).
Lower bounds for domain adaptation have appeared in [7]. In fact, our lower bound for weight ratio
estimation is based on a reduction from the latter lower bound for domain adaptation. Another interesting work is [12], focusing on the closely related two-sample problem. However, the theoretical
bounds that they provide (Theorem 2 there) are only asymptotic in the sample sizes.
B Notation
We use the following notation: We let X be some domain set and {0, 1} denote the label set. Let P^S
and P^T be two distributions over X × {0, 1}. We call P^S the source distribution and P^T the target
distribution. We denote the marginal distribution of P^S over X by P^S_X and the marginal of P^T by
P^T_X, and their labeling functions by l^S : X → {0, 1} and l^T : X → {0, 1}, respectively (where,
for a probability distribution P over X × {0, 1}, the associated labeling function is the conditional
probability of label 1 at any given point: l(x) = Pr_{(X,Y)∼P}(Y = 1 | X = x)).
A function h : X → {0, 1} is a classifier and a set of classifiers H ⊆ {0, 1}^X a hypothesis class.
In this work, we analyze learning with respect to the binary loss (0/1-loss). We let err_P(h) =
Pr_{(x,y)∼P}[h(x) ≠ y] denote the error of classifier h with respect to the distribution P. An empirical
error with respect to a sample S is denoted by êrr_S(h). We let the best achievable error by hypothesis
class H with respect to distribution P be denoted by opt_P(H) = inf_{h∈H} err_P(h).
Empirical Risk minimization An Empirical Risk Minimizer is a classifier in H that minimizes
the empirical error on a sample S:

ERM_H(S) ∈ argmin_{h∈H} êrr_S(h)

It is straightforward to generalize this to a weighted sample S. Let S = ((x_1, y_1, w_1), . . . , (x_n, y_n, w_n))
be a sample where w_i is a weight associated with example (x_i, y_i). Then we define the weighted
empirical error with respect to the weighted sample as

êrr_w(S)(h) = ( Σ_{(x_i,y_i,w_i)∈S} w_i )^{-1} · Σ_{(x_i,y_i,w_i)∈S} w_i · 1[h(x_i) ≠ y_i] ,

and ERM_H(w(S)) is defined as above.
Domain Adaptation learnability A Domain Adaptation learner (DA learner) takes as input a
labeled i.i.d. sample S drawn according to P^S and an unlabeled i.i.d. sample T drawn according to
P^T_X and aims to generate a good label predictor h : X → {0, 1} for P^T. Formally, a DA learner is a
function

A : ⋃_{m=1}^∞ ⋃_{n=1}^∞ ( (X × {0, 1})^m × X^n ) → {0, 1}^X .
Clearly, the success of domain adaptation learning cannot be achieved for every source-target pair
of learning tasks. Therefore, we state the definition of successful learning in relation to a restricted
class of pairs of distributions.
Definition (DA Learnability). Let X be some domain, W a class of pairs (P^S, P^T) of distributions
over X × {0, 1}, H ⊆ {0, 1}^X a hypothesis class and A a DA learner. We say that A solves
DA for H with respect to the class W if there exist functions m : (0, 1) × (0, 1) → N and
n : (0, 1) × (0, 1) → N such that for all pairs (P^S, P^T) ∈ W, for all ε > 0 and δ > 0, when
given access to a labeled sample S of size m(ε, δ), generated i.i.d. by P^S, and an unlabeled sample
T of size n(ε, δ), generated i.i.d. by P^T_X, then, with probability at least 1 − δ (over the choice of the
samples S and T), A outputs a function h with

err_{P^T}(h) ≤ opt_{P^T}(H) + ε .

For s ≥ m(ε, δ) and t ≥ n(ε, δ), we also say that the learner A (ε, δ, s, t)-solves DA for H with
respect to the class W.
We are interested in finding pairs of functions m and n for the labeled source and unlabeled target
samples sizes respectively, that satisfy the definitions of DA learnability for some DA learner A.
Covariate shift The first property we introduce is often assumed in domain adaptation analysis
(for example by [5]). In this work, we assume this property throughout.
Definition (Covariate shift). We say that the source and target distributions satisfy the covariate shift
property if they have the same labeling function. Namely, if we have l^S(x) = l^T(x) for all x ∈ X.
We then denote this common labeling function of P^S and P^T by l.
The covariate shift assumption is realistic for many DA tasks. For example, it is a reasonable assumption in many natural language processing (NLP) learning problems, such as part-of-speech
tagging, where a learner that trains on documents from one domain is applied to a different domain.
For such tasks, it is reasonable to assume that the difference between the two tasks is only in their
marginal distributions over English words rather than in the tagging of each word (an adjective is an
adjective independently of the type of text it occurs in). While, at first glance, it may seem that DA
becomes easy under this assumption, the lower bound in [7] implies that DA remains a very hard
learning problem even under covariate shift.
C Details on lower bound
Density ratio estimation has been advocated as a key tool for domain adaptation in the covariate
shift setup ([5, 2, 6, 3]). However, note that density ratio estimation is at least as difficult a task
as density estimation (just consider the case where one of the densities is a constant function).
Furthermore, now we present a strong lower bound even for a particularly simple version of density
ratio estimation. Our lower bound implies that this task cannot be reliably achieved with finite,
task-independent sample sizes as long as no further assumptions about the densities involved are
made.
Formally, we reduce a certain statistical task, the Left/Right problem (see Definition C.1 below),
to an extremely simple case of density ratio estimation. The Left/Right problem has been shown to
require large sample sizes (in the order of the square-root of the domain size) to be reliably solved
[7]. Thus, this lower bound on the sample complexity of the Left/Right problem in Lemma 1 implies
that this particular case of density ratio estimation is hard. This, in turn, yields a lower bound for the
general problem of density ratio estimation, assuming that any “reasonable” notion of successfully
estimating density ratios includes, in particular, solving the simple case of Corollary 1. That is, if
an algorithm is claimed to be able to estimate density ratios, then it should be able to successfully
distinguish between the cases of constant density ratio either 0 or 1.
C.1 The Left/Right problem
The Left/Right Problem was introduced in [13] and has been used in [7] to obtain lower bounds for
domain adaptation learning under covariate shift.
Definition (Left/Right problem).
Input: Three finite samples, L, R and M, of points from some domain set X.
Output: Assuming that L is an i.i.d. sample from some distribution P over X, that R is an
i.i.d. sample from some distribution Q over X, and that M is an i.i.d. sample generated by
one of these two distributions, was M generated by P or by Q?
Intuitively, as long as the sample M does not intersect either the sample L or the sample R,
no algorithm can successfully solve the Left/Right problem. This yields a lower bound on the
required sample sizes that is of the order of the sizes one needs to guarantee a "collision"
(hitting some point more than once). It is well known, e.g. from the "Birthday paradox", that this
requires samples on the order of the square root of the domain size.
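The following small Python simulation (our own illustration, not part of the paper) makes this intuition concrete: it estimates how often two uniform samples of size m from a domain of size n share any point, which is the event that would give an algorithm something to work with in the Left/Right problem.

    import random

    def collision_probability(domain_size: int, sample_size: int, trials: int = 2000) -> float:
        """Estimate the probability that two i.i.d. uniform samples of the given
        size over {0, ..., domain_size - 1} share at least one point."""
        hits = 0
        for _ in range(trials):
            a = {random.randrange(domain_size) for _ in range(sample_size)}
            b = {random.randrange(domain_size) for _ in range(sample_size)}
            if a & b:
                hits += 1
        return hits / trials

    # With m well below sqrt(n), collisions are rare, so the samples carry almost
    # no information for distinguishing the two cases; around m ~ sqrt(n) they
    # become frequent.
    for m in (10, 100, 1000):
        print(m, collision_probability(domain_size=1_000_000, sample_size=m))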
Definition. We say that a (possibly randomized) algorithm, A, (δ, l, r, m)-solves the Left/Right
problem over a class W of pairs of probability distributions, if, for every (P1 , P2 ) ∈ W, given
samples L ∼ P1l , R ∼ P2r and M ∼ Pbm , where b ∈ {1, 2}, A(L, R, M ) = b with probability at
least 1−δ (over the choice of the samples L, R, M and possibly also over the internal randomization
of the algorithm A).
The following lower bound for the Left/Right problem was formally shown in [7]. We consider the
following classes: W_n^halves = {(U_A, U_B) : A ∪ B = {1, . . . , n}, A ∩ B = ∅, |A| = |B|} (recall
that, for a finite set Y, U_Y denotes the uniform distribution over Y).

Lemma 1 ([7]). For any given sample sizes l for L, r for R and m for M, and any 0 < γ < 1/2, if
k = max{l, r} + m, then for

n > max{ k^2 / ln(2), k^2 / ln(1/(2γ)) }

no algorithm has probability of success greater than 1 − γ over the class W_n^halves.
C.2 Reducing the Left/Right problem to weight ratio estimation
Definition (The extreme two sample problem (ETSP)). Given some finite domain set X, let
W^DRE = {(U_A, U_B) : A, B are subsets of X, |A| = |B| = |X|/2, and either A = B or A ∩ B = ∅}.
We say that a (possibly randomized) algorithm A (δ, l, r, m)-solves the Extreme Density Ratio
Estimation problem if, for every (U_A, U_B) ∈ W^DRE, given samples L ∼ U_A^l and R ∼ U_B^r, A(L, R) =
1 if A = B and A(L, R) = 0 if A ∩ B = ∅, with probability at least 1 − δ (over the choice of the
samples L, R and possibly also over the internal randomization of the algorithm A).
Lemma 2. The Extreme Density Ratio Estimation problem is as hard as the Left/Right problem.

Proof. Let A be an algorithm that solves the ETSP. Define an algorithm A' that solves the Left/Right
problem for W_X^halves, using A as a subroutine, with the same sample complexity. When given a
triplet of samples (L, R, M) from a set X, the algorithm A' just applies the algorithm A to the pair
of samples (L, M).
This implies the lower bound on the simple case of density ratio estimation as stated in Theorem 1.
D Proof of Theorem 2
Proof. Let ε < 1/4 be given. First, note that with the sample sizes stated, according to Corollary
1, we can assume that S is an ε-net with respect to P^T for the set H∆H and for the set of cells
(note that a partition always has VC-dimension 1). Similarly, we can assume that T is an ε^2-approximation
for those same collections of sets (see Appendix E and F respectively). Note that ε^2 < ε/4.
Let B_ST denote the union of all cells that contain weighted points of w(S) (that is, the union of
all cells the algorithm "keeps"). Further, let B_ε denote the union of all ε-heavy cells (that is, cells
that have P^T-weight at least ε). Since S is an ε-net, S hits every ε-heavy cell. Since T is an ε^2-
approximation of the cells, and ε^2 < ε/4, for every ε-heavy cell b we have |T ∩ b|/|T| > 3ε/4. Thus we
"catch" every ε-heavy cell, that is, B_ε ⊆ B_ST. Now, the concentratedness assumption implies

P^T_X(B_ST) ≥ P^T_X(B_ε) ≥ 1 − ε     (1)

Now, since T is an ε^2-approximation, |T ∩ b|/|T| > 3ε/4 implies that P^T_X(b) > ε/2. Thus,
P^T_X(b) ≥ ε/2 holds for all b ∈ B_ST. Since ρ = φ^{-1}(ε/2), the ρ-margin around the labeling function
l has total P^T-weight at most ε/2 and can therefore not contain any cells from B_ST. This implies
that every cell in B_ST is label homogeneous.
Recall that Ĥ = {h ∈ H : h has a (ρ, 0) margin w.r.t. w(S)}.

Claim 1. For h, h' ∈ Ĥ, if êrr_w(S)(h) ≤ êrr_w(S)(h') then err_{P^T}(h) ≤ err_{P^T}(h') + 2ε.

Before proving the claim itself, we argue that the claim implies the statement of the theorem.
Note that since the 2ρ-margin around h* has weight at most ε/2, it cannot fully contain any of the
cells in B_ST. Therefore, the ρ-margin around h* does not contain any reweighed sample points from
w(S). That is, h* ∈ Ĥ.

Consider some h ∈ argmin_{h∈Ĥ} êrr_w(S). By the claim, we get

err_{P^T}(h) ≤ err_{P^T}(h*) + 2ε ≤ opt_{P^T}(H) + 3ε
Proof of Claim 1. Let êrr_w(S)(h) ≤ êrr_w(S)(h'). Note that since both h and h' have a (ρ, 0)-
margin with respect to w(S), the symmetric difference h∆h' cannot properly intersect any of the
cells in B_ST.

Now, if the set h∆h' does not fully contain any cell from B_ST, then err_{P^T}(h) ≤ err_{P^T}(h') + ε,
due to Equation 1. If the set h∆h' does fully contain cells from B_ST, then all these cells are label
homogeneous, since, as shown above, the ρ-margin around the labeling function l cannot contain
any cells from B_ST.

For the case that h∆h' contains cells from B_ST, note that there are at most 1/ε heavy cells (cells
that have weight at least ε). Thus, since every cell is ε^2-approximated by T, any union of heavy
cells is (ε = ε^2 · (1/ε))-approximated by T and thus also by w(S). Thus, again invoking Equation 1,
we get that the symmetric difference h∆h' is 2ε-approximated by w(S). This implies err_{P^T}(h) ≤
err_{P^T}(h') + ε + ε ≤ err_{P^T}(h') + 2ε.
E ε-nets
E.1 ε-nets
The notion of an ε-net was introduced by [14] and has many applications in computational geometry
and machine learning.

Definition. Let X be some domain, W ⊆ 2^X a collection of subsets of X and P a distribution over
X. An ε-net for W with respect to P is a subset N ⊆ X that intersects every member of W that has
P-weight at least ε.
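As a small illustration of our own (not from the paper): take X = [0, 1], let P be the uniform distribution and let W be the collection of all intervals. A member of W with P-weight at least ε is an interval of length at least ε, and every such interval contains a point of the grid N = {0, ε, 2ε, . . .}, so this grid of roughly 1/ε points is an ε-net for W with respect to P.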
The following key result concerning ε-nets states that they are easy to come by. A version of it for
uniform distributions appeared first in [14]. The general version we employ here can be found as
Theorem 28.3 in [15].
Lemma 3 ([14, 15]). Let X be some domain, W ⊆ 2^X a collection of subsets of X of some finite
VC-dimension d. Then, for every probability distribution P over X, for every ε > 0 and δ > 0,
a set of size

O( (d log(d/ε) + log(1/δ)) / ε )

sampled i.i.d. from P is an ε-net for W with respect to P with probability at least 1 − δ.
We now relate ε-nets for a source distribution to ε-nets for a target distribution:

Lemma 4 (Slight variation of a lemma in [7]). Let X be some domain, W ⊆ 2^X a collection of
subsets of X, and P^S and P^T a source and a target distribution over X with C := C_{W,ε}(P^S, P^T) ≥
0. Then every (Cε)-net for W with respect to P^S is an ε-net for W with respect to P^T.

Proof. Let N ⊆ X be a (Cε)-net for W with respect to P^S. Consider a U ∈ W that has target-weight
at least ε, i.e. P^T(U) ≥ ε. Then we have P^S(U) ≥ C · P^T(U) ≥ Cε. As N is a (Cε)-net
for W with respect to P^S, we have N ∩ U ≠ ∅.
Combining the above lemmas, we get the following result for sample sizes sufficient for obtaining
ε-nets under a weight ratio assumption:

Corollary 1. Let X be some domain, W ⊆ 2^X a collection of subsets of X of some finite
VC-dimension d, and let P^S and P^T be source and target distributions over X with C :=
C_{W,ε}(P^S, P^T) ≥ 0. Then, for every ε > 0 and δ > 0, a set of size

O( (d log(d/(Cε)) + log(1/δ)) / (Cε) )

sampled i.i.d. from P^S is an ε-net for W with respect to P^T with probability at least 1 − δ.
F ε-approximations

The notion of ε-approximations plays a key role in deriving finite sample bounds for VC-classes [16].
Definition (ε-approximation). Let X be some domain, B ⊆ 2^X a collection of subsets of X and P
a distribution over X. An ε-approximation for B with respect to P is a finite subset S ⊆ X with

|Ŝ(b) − P(b)| ≤ ε for all sets b ∈ B,

where Ŝ(b) = |S ∩ b|/|S| denotes the empirical weight of b under S.
It is shown in [14] (and already in [16]) that, for a collection B of subsets of some domain set X
with finite VC-dimension and any distribution P over X, an i.i.d. sample of size

(16/ε^2) · ( VC(B) ln( 16 VC(B) / ε^2 ) + ln(4/δ) )

is an ε-approximation for B with respect to P with probability at least 1 − δ.