Learning Real-valued Functions
Nima Mousavi
CS 698 course project
1 Introduction
M. Balcan and N. Harvey [1] introduced a model for learning real-valued functions. The model is like the PAC model, but it tolerates a small error on the predicted values. They coined the name PMAC, which stands for "Probably Mostly Approximately Correct", for it. More formally,
Definition 1 (PMAC). Let F be a family of real-valued functions on {0, 1}^n. An algorithm A is a PMAC learner with approximation factor α if for any distribution D on {0, 1}^n, any target real-valued function f* ∈ F, and every ε, δ > 0, there exists m(ε, δ) = poly(n, 1/ε, log(1/δ)) such that if m ≥ m(ε, δ), then

    Pr_{S_1,...,S_m ∼ D^m} [ Pr_{S∼D} [ f(S) ≤ f*(S) ≤ α f(S) ] ≥ 1 − ε ] ≥ 1 − δ

where f ∈ F is the output of A on the training set S = {(S_1, f*(S_1)), . . . , (S_m, f*(S_m))}.
They showed that if f* is non-negative, monotone, 1-Lipschitz and submodular, and D is a product distribution, then there exists a PMAC learner with approximation factor O(log(1/ε)). They also showed that if we do not assume that D is a product distribution, then no algorithm can learn f* to within a factor of Õ(n^{1/3}).
Here, we first prove a similar result for a slightly different learning model (called PMAC-2), which measures the error as |f*(S) − f(S)| relative to f*(S) (see Definition 4), and then present the result of [1]. The main idea of the results presented here is as follows. The function f* is concentrated around its expectation E(f*) (under some assumptions), and the empirical average of f* is also concentrated around E(f*). Therefore, f* can be well-approximated by the constant function that equals the empirical average. Hence, we first review a couple of concentration results in Section 2 and then present the main results in Section 3.
2 Concentration of Measure for real-valued functions
Let f be a real-valued function with domain X = {0, 1}^n and let D be a product distribution on X. We assume that f is bounded, that is, f(S) ≤ c for all S. The following lemma shows that the empirical average of a bounded real-valued function f over m independent samples is concentrated around its expectation, with standard deviation O(c/√m).
Lemma 1 (Hoeffding Inequality). Let Z_1, Z_2, . . . , Z_m be independent random variables with Z_i ∈ [0, c] for 1 ≤ i ≤ m. Then the empirical mean Z̄ = (Z_1 + Z_2 + · · · + Z_m)/m satisfies

    Pr[ |Z̄ − E(Z̄)| > t ] ≤ 2 exp(−2t²m/c²).    (1)
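As a quick numerical illustration of Lemma 1 (a minimal sketch; the uniform [0, 1] samples and the values of m and t are arbitrary choices made for this example), the following Python snippet compares the observed frequency of a large deviation of the empirical mean with the Hoeffding bound 2 exp(−2t²m/c²):

```python
import numpy as np

rng = np.random.default_rng(0)
c, t, m, trials = 1.0, 0.1, 200, 10_000  # illustrative values, not from the text

# Z_i uniform on [0, c], so E(Z_bar) = c / 2.
samples = rng.uniform(0.0, c, size=(trials, m))
empirical_means = samples.mean(axis=1)

observed = np.mean(np.abs(empirical_means - c / 2) > t)
hoeffding_bound = 2 * np.exp(-2 * t**2 * m / c**2)

print(f"observed Pr[|Z_bar - E(Z_bar)| > t] ~ {observed:.4f}")
print(f"Hoeffding bound (1)                 = {hoeffding_bound:.4f}")
```

The observed deviation frequency never exceeds the bound, as the lemma guarantees.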
Note that every f : {0, 1}^n → R+ can be viewed as a function of n variables x_i ∈ {0, 1}, i ∈ [n]. Also, any set function f : 2^[n] → R can be identified with a function on {0, 1}^n in a natural way. Lemma 2 proves a concentration result for a more restricted class of real-valued functions, namely the class of bounded-difference, or Lipschitz, functions. Let us first define Lipschitz functions formally.
Definition 2 (Lipschitz functions [3]). A function f(x_1, . . . , x_n) satisfies the Lipschitz property, or the bounded differences condition, with constants d_i, i ∈ [n], if

    |f(a) − f(a′)| ≤ d_i

whenever a and a′ differ in just the i-th coordinate, i ∈ [n]. We say a function is 1-Lipschitz if d_i = 1 for all i ∈ [n].
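For a concrete reading of Definition 2, here is a small Python sketch (the example function and the brute-force approach are illustrative assumptions, only feasible for small n) that computes the bounded-differences constants d_i of a function on {0, 1}^n by flipping one coordinate at a time:

```python
from itertools import product

def bounded_difference_constants(f, n):
    """d_i = max over a in {0,1}^n of |f(a) - f(a with coordinate i flipped)|."""
    d = [0.0] * n
    for a in product((0, 1), repeat=n):
        for i in range(n):
            flipped = a[:i] + (1 - a[i],) + a[i + 1:]
            d[i] = max(d[i], abs(f(a) - f(flipped)))
    return d

# Example: f(a) = number of ones in a; flipping any coordinate changes f by 1,
# so every d_i equals 1 and f is 1-Lipschitz.
print(bounded_difference_constants(lambda a: sum(a), 4))  # [1.0, 1.0, 1.0, 1.0]
```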
Lemma 2 ([3]). If f(x_1, . . . , x_n) satisfies the Lipschitz property with constants d_i, i ∈ [n], and x_1, . . . , x_n are independent random variables, then

    Pr[ |f − E(f)| > t ] ≤ 2 exp(−2t²/d),    (2)

where d = Σ_{i=1}^n d_i².
Lemma 2 implies that any 1-Lipschitz function f of n variables is concentrated around E(f) with standard deviation O(√n). This concentration result can be improved if we restrict our attention to monotone, submodular, Lipschitz functions.
Definition 3 (Submodular and monotone functions). Let [n] = {1, . . . , n} be a ground set and let f : 2^[n] → R be a set function. Then f is

1. submodular, if f(A ∪ B) + f(A ∩ B) ≤ f(A) + f(B) for all A, B ⊆ [n];
2. monotone, if f(A) ≤ f(B) for all A ⊆ B ⊆ [n].
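As a concrete instance of Definition 3, the following Python sketch builds a small coverage function f(A) = |∪_{i∈A} T_i| (the ground set and the covering sets T_i are made-up illustrative data) and verifies both conditions by brute force over all pairs of subsets:

```python
from itertools import combinations

# Hypothetical coverage instance: element i of the ground set covers the items in T[i].
T = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"d"}}
n = len(T)

def f(A):
    """Coverage function: number of items covered by the sets indexed by A."""
    return len(set().union(*(T[i] for i in A))) if A else 0

def all_subsets(ground):
    elems = list(ground)
    return [frozenset(c) for r in range(len(elems) + 1) for c in combinations(elems, r)]

subsets = all_subsets(range(n))
is_submodular = all(f(A | B) + f(A & B) <= f(A) + f(B) for A in subsets for B in subsets)
is_monotone = all(f(A) <= f(B) for A in subsets for B in subsets if A <= B)
print(is_submodular, is_monotone)  # True True: coverage functions satisfy Definition 3
```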
Lemma 3 ([2]). If f(x_1, . . . , x_n) is a non-negative, monotone, submodular function, where the x_i ∈ {0, 1} are independent random variables, then

    Pr[ |f − E(f)| > t ] ≤ 2 exp(−2t²/E(f)).    (3)
The concentration bound of Lemma 3 is stronger than the one of Lemma 2 in the sense that it is dimension-free and implies a standard deviation of O(√E(f)) rather than O(√n).
3 Learning real-valued functions
In this section, we first define PMAC-2 learning. Then we show that any positive, Lipschitz function f on X = {0, 1}^n can be learned (in either the PMAC or the PMAC-2 model) with approximation factor O(√(n log(1/ε))) if the distribution on X is a product distribution. Finally, we show how the approximation factor can be improved to O(log(1/ε)) in both the PMAC and the PMAC-2 model if f is monotone and submodular as well.
Definition 4 (PMAC-2). Let F be a family of real-valued functions with domain {0, 1}^n. An algorithm A is a PMAC-2 learner with approximation factor α if for any distribution D on {0, 1}^n, any target real-valued function f* ∈ F on {0, 1}^n, and every ε, δ > 0, there exists m(ε, δ) = poly(n, 1/ε, log(1/δ)) such that if m ≥ m(ε, δ), then

    Pr_{S_1,...,S_m ∼ D^m} [ Pr_{S∼D} ( |f*(S) − f(S)| ≤ α f*(S) ) ≥ 1 − ε ] ≥ 1 − δ    (4)

where f ∈ F is the output of A on the training set S = {(S_1, f*(S_1)), . . . , (S_m, f*(S_m))}.
Algorithm 1 A1: An algorithm for PMAC-2 learning a positive, Lipschitz function f* when the training examples S_1, S_2, . . . , S_m come from a product distribution D.
Input: (S_1, f*(S_1)), (S_2, f*(S_2)), . . . , (S_m, f*(S_m))
Output: f : {0, 1}^n → R+
  Let µ = (1/m) Σ_{i=1}^m f*(S_i).
  Return the constant function f(S) = µ.
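A minimal Python sketch of Algorithm A1 (the representation of the training set as a list of (S_i, f*(S_i)) pairs and the function name are assumptions made for illustration):

```python
def pmac2_learn_A1(training_set):
    """Algorithm A1: return the constant hypothesis equal to the empirical average.

    training_set: list of pairs (S_i, value_i) with value_i = f*(S_i).
    """
    mu = sum(value for _, value in training_set) / len(training_set)
    return lambda S: mu  # the constant function f(S) = mu
```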
If f* is a Lipschitz function, then Lemma 2 implies that f* is concentrated around its expectation. On the other hand, the empirical average is also concentrated around the expectation of f* by the Hoeffding Inequality (1). Therefore, f* is well-approximated by the constant function that equals the empirical average. This is proved formally in the following theorem.
Theorem 1. Let F be a family of positive, 1-Lipschitz functions on {0, 1}^n. Let η, c be such that f(S) ∈ [η, c] for all S and all f ∈ F. Let D be a product distribution. For any sufficiently small ε, algorithm A1 PMAC-2 learns F with approximation factor O(√(n log(1/ε))/η). The number of required training samples is O(c² log(1/δ)) for sufficiently small ε.
Proof. By Lemma 1, |µ − E(f*)| ≤ √(log(1/δ)) · c/√m with probability at least 1 − δ. Define

    P = {A : |f*(A) − E(f*)| ≤ √(log(1/ε)) · √n}.

For any S ∈ P,

    |f*(S) − µ| ≤ |f*(S) − E(f*)| + |E(f*) − µ|
               ≤ √(log(1/ε)) · √n + √(log(1/δ)) · c/√m.

For m = c² log(1/δ)/(n log(1/ε)),

    |f*(S) − µ| = O(√(n log(1/ε)));

for sufficiently small ε, this number of training samples is m = O(c² log(1/δ)). Thus, the approximation factor for all S ∈ P is O(√(n log(1/ε))/η). By Lemma 2, P has measure at least 1 − ε with respect to D. Therefore, with probability at least 1 − δ over the training samples, µ approximates f* on a set of measure at least 1 − ε.
Similarly, one can prove that algorithm A1 is a PMAC-2 learner with approximation factor O(√(d log(1/ε))/η), with d as in Lemma 2, for any positive, Lipschitz function f*. If we assume that the target function f* is also monotone and submodular, we can improve the approximation factor to O(log(1/ε)/η) using Lemma 3 in a similar way. Algorithm A1 is not suitable for the case where f* can take the value zero, because allowing an error of αf* is of no help in estimating the zeros of f*. Using the fact that the zeros of a monotone, submodular function have structure (they are both union-closed and downward-closed), we can estimate them well.
Algorithm 2 A2: An algorithm for PMAC-2 learning a non-negative, Lipschitz, monotone, submodular function f* when the training examples S_1, S_2, . . . , S_m come from a product distribution D.
Input: (S_1, f*(S_1)), (S_2, f*(S_2)), . . . , (S_m, f*(S_m))
Output: f : 2^[n] → R+
  Let µ = (1/m) Σ_{i=1}^m f*(S_i).
  Compute the null set U = ∪_{i : f*(S_i)=0} S_i.
  Return f(A) = 0 if A ⊆ U, and f(A) = µ otherwise.
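A minimal Python sketch of Algorithm A2 (again, the data representation with each training example as a pair of a set S_i and the value f*(S_i), and the function name, are illustrative assumptions):

```python
def pmac2_learn_A2(training_set):
    """Algorithm A2: empirical average plus a null set for the zeros of f*.

    training_set: list of pairs (S_i, value_i), where S_i is a subset of the
    ground set [n] and value_i = f*(S_i).
    """
    mu = sum(value for _, value in training_set) / len(training_set)

    # Union of all training sets on which f* is zero; by submodularity and
    # monotonicity, every subset of this union is also a zero of f*.
    null_set = set()
    for S, value in training_set:
        if value == 0:
            null_set |= set(S)

    return lambda A: 0 if set(A) <= null_set else mu
```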
Theorem 2. Let F be a family of non-negative, 1-Lipschitz, monotone, submodular functions with ground set [n] and minimum non-zero value η. Let D be a product distribution on {0, 1}^n. For any sufficiently small ε, algorithm A2 PMAC-2 learns F with approximation factor O(log(1/ε)/η). The number of required training samples is O(n² log(1/δ) + n log(n/δ)/ε).
Proof. Submodularity of f* implies that f*(U) ≤ Σ_{i : f*(S_i)=0} f*(S_i) = 0. Let L = {S : S ⊆ U}. Monotonicity then implies that f*(T) = 0 for all T ∈ L. Let Z = {S : f*(S) = 0}. So Z \ L is the set of zeros of f* that are estimated incorrectly by algorithm A2.

Claim 1. If m ≥ 2n log(n/δ)/ε, then Z \ L has measure at most ε/2 with respect to D, with probability at least 1 − δ over the training samples.
Proof. Let U_k be the null set computed by the algorithm after seeing the first kq samples, where q = 2 log(n/δ)/ε. Formally, U_k = ∪_{i ≤ kq : f*(S_i)=0} S_i. Let L_k = {S : S ⊆ U_k}. For i ≤ n, define

    E_i = the event that Z \ L_{i−1} has measure at least ε/2 and no S ∈ Z \ L_{i−1} appears among the samples S_{(i−1)q}, . . . , S_{iq−1}.

The probability P that Z \ L has measure less than ε/2 satisfies

    P ≥ Pr( ∩_{i=1}^n ¬E_i )        (a)
      ≥ 1 − Σ_{i=1}^n Pr(E_i)
      ≥ 1 − Σ_{i=1}^n (1 − ε/2)^q
      ≥ 1 − δ.

(a) If, for every i, either Z \ L_{i−1} has measure less than ε/2 or some S ∈ Z \ L_{i−1} appears among the samples S_{(i−1)q}, . . . , S_{iq−1}, then Z \ L has measure less than ε/2: each such sample adds at least one new element of [n] to the null set, and this can happen at most n times.
Let P = {S : |f*(S) − E(f*)| ≤ √(log(2/ε) · E(f*))}. By the same argument as in Theorem 1, |µ − f*(S)| ≤ 2√(log(2/ε) · E(f*)) for all S ∈ P, with probability at least 1 − δ over the training samples, if m = O(n² log(1/δ)). Now consider two cases.
Case 1: If E(f*) ≥ 100 log(2/ε), then for S ∈ P,

    |f*(S) − E(f*)| / f*(S) ≤ √(log(2/ε) · E(f*)) / (E(f*) − √(log(2/ε) · E(f*))) ≤ 1/9.
Case 2: If E(f*) ≤ 100 log(2/ε), then for S ∈ P \ Z,

    |f*(S) − E(f*)| / f*(S) ≤ 10 log(2/ε)/η.
Therefore, µ approximates f*(S) for S ∈ P \ Z to within O(log(1/ε)/η) with probability at least 1 − δ. By Lemma 3, P has measure at least 1 − ε/2 with respect to D. This completes the proof, because (P \ Z) ∪ L has measure at least 1 − ε.
We proved two results for learning real-valued functions in the PMAC-2 model. We can extend the results to the PMAC model as well. Algorithm A3 PMAC-learns a non-negative, Lipschitz, monotone, submodular function f*. Theorem 3 proves that the approximation factor of A3 is O(log(1/ε)/η).
Algorithm 3 A3: An algorithm for PMAC learning a non-negative, Lipschitz, monotone, submodular function f* when the training examples S_1, S_2, . . . , S_m come from a product distribution D.
Input: (S_1, f*(S_1)), (S_2, f*(S_2)), . . . , (S_m, f*(S_m))
Output: f : 2^[n] → R+
  Let µ = (1/m) Σ_{i=1}^m f*(S_i).
  Compute the null set U = ∪_{i : f*(S_i)=0} S_i.
  Case 1: if µ > 200 log(1/ε), return f(A) = µ/3.
  Case 2: if µ ≤ 200 log(1/ε), return f(A) = 0 if A ⊆ U, and f(A) = η otherwise.
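A minimal Python sketch of Algorithm A3 (the function name, the data representation, and passing ε and η as explicit parameters are illustrative assumptions):

```python
import math

def pmac_learn_A3(training_set, eps, eta):
    """Algorithm A3: PMAC learner for a non-negative, 1-Lipschitz, monotone,
    submodular f*.  eta is the minimum non-zero value of the functions in F.

    training_set: list of pairs (S_i, value_i) with value_i = f*(S_i).
    """
    mu = sum(value for _, value in training_set) / len(training_set)

    null_set = set()
    for S, value in training_set:
        if value == 0:
            null_set |= set(S)

    if mu > 200 * math.log(1 / eps):
        # Case 1: mu/3 under-estimates f*, and 6 * (mu/3) over-estimates it, on most of D.
        return lambda A: mu / 3
    # Case 2: predict 0 on subsets of the null set and eta elsewhere.
    return lambda A: 0 if set(A) <= null_set else eta
```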
Theorem 3 ([1]). Let F be a family of non-negative, 1-Lipschitz, monotone, submodular functions with ground set [n] and minimum non-zero value η. Let D be a product distribution on {0, 1}^n. For any sufficiently small ε, algorithm A3 PMAC learns F with approximation factor O(log(1/ε)/η). The number of required training samples is O(n² log(1/δ) + n log(n/δ)/ε).
Proof. Case 1: µ > 200 log(1/ε).
By the Hoeffding Inequality, with probability at least 1 − δ,

    µ > 200 log(1/ε) =⇒ µ/2 ≤ E(f*) ≤ 3µ/2,

provided m ≥ n² log(1/δ)/(100² log²(1/ε)); note that f(S) ≤ n for all S and all f ∈ F, because f is a 1-Lipschitz function. So, by Lemma 3,

    Pr[ µ/3 ≤ f*(S) ≤ 2µ ] ≥ Pr[ (2/3)E(f*) ≤ f*(S) ≤ (4/3)E(f*) ]
                            ≥ 1 − 2 exp(−2E(f*)/9)
                            ≥ 1 − ε.

For sufficiently small ε, the number of required samples, n² log(1/δ)/(100² log²(1/ε)), is O(n² log(1/δ)). Therefore, with confidence at least 1 − δ, the algorithm achieves an approximation factor of 6 on all but an ε fraction of the distribution.
Case 2: µ ≤ 200 log(1/ε).
Let N = {S : f*(S) > 0} and Z = {S : f*(S) = 0}. Let E be the event that a random sample S violates

    f(S) ≤ f*(S) ≤ (400 log(2/ε)/η) f(S).

Clearly,

    Pr[E] = Pr[E ∧ S ∈ N] + Pr[E ∧ S ∈ Z].

By Claim 1, Pr[E ∧ S ∈ Z] is at most ε/2. Now we prove that Pr[E ∧ S ∈ N] is also at most ε/2. For S ∈ N we have f(S) = η, so

    Pr[E ∧ S ∈ N] ≤ Pr[ f*(S) > 400 log(2/ε) ]
                   ≤ Pr[ f*(S) > E(f*) + 200 log(2/ε) ]
                   ≤ ε/2.
In summary, we presented the following results.

1. If F is a family of positive, Lipschitz functions on X = {0, 1}^n with minimum and maximum values η and c respectively, and the distribution on X is a product distribution, then F can be PMAC-2 learned with approximation factor O(√(n log(1/ε))/η) using O(c² log(1/δ)) training samples.

2. If F is a family of non-negative, 1-Lipschitz, monotone, submodular functions on X = {0, 1}^n with minimum non-zero value η and the distribution on X is a product distribution, then F can be PMAC (PMAC-2) learned with approximation factor O(log(1/ε)/η) using O(n² log(1/δ) + n log(n/δ)/ε) training samples [1].
References

[1] Maria-Florina Balcan, Nicholas J. A. Harvey. Learning Submodular Functions. In 43rd Annual ACM Symposium on Theory of Computing (STOC), San Jose, CA, June 2011.

[2] Jan Vondrak. A note on concentration of submodular functions. arXiv:1005.2791, May 2010.

[3] Devdatt P. Dubhashi, Alessandro Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.