
Mathematical Theories of
Interaction with Oracles
Liu Yang
Carnegie Mellon University
Thesis Committee
Avrim Blum (co-chair)
Jaime Carbonell (co-chair)
Manuel Blum
Sanjoy Dasgupta (UC San Diego)
Yishay Mansour (Tel Aviv University)
Joel Spencer (Courant Institute, NYU)
Outline
• Active Property Testing
- Do we need to imitate humans to advance AI?
- I see airplanes can fly without flapping their wings.
Property Testing
• Given access to a massive dataset: want to quickly determine whether a given fn f has some given property P or is far from having it
• Goal: test using a very small number of queries.
• One motivation: preprocessing step before
learning
Property Testing
• Instance space X = R^n (distribution D over X)
• Tested function f: X -> {-1, 1}
• A property P of Boolean fns is a subset of all Boolean fns h: X -> {-1, 1} (e.g., LTFs)
• dist_D(f, P) := min_{g ∈ P} Pr_{x~D}[f(x) ≠ g(x)]
• Standard type of query: membership query (ask for f(x) at an arbitrary point x)
Property Testing: An Example
If fP should accept w/ prob  2/3
If dist(f,P)>ε should reject w/ prob  2/3
• E.g. Union of d Intervals
0----++++----+++++++++-----++---+++--------1
- UINT4 ? Accept! UINT3 ? Depend on ε
- Model selection: testing can tell us how big
d need be to be close to target
(double and guess, d = 2, 4, 8, 16, ….)
Property Testing and Learning: Motivation
• What is Property Testing for?
- Quickly tell whether a given fn class is the right one to use
- Estimate the complexity of a fn without actually learning it
• Want to do it with fewer
queries than learning
Standard Model Uses Membership Queries
• Results on testing basic Boolean fns using MQs:
• Constant query complexity (QC) for UINTd, dictators, LTFs, …
However …
Membership Queries are Unrealistic for ML Problems:
An Object Recognition Example
Recognizing cats vs. dogs? An MQ gives …
Is this a dog
or a cat?
An example: movie reviews
Is this a positive or negative review ?
Typical representation
in ML (bag-of-words):
• {fell, holding, interest,
movie, my, of, short,
this}
The original review
(human labelers see):
• “This movie fell short of
holding my interest.”
- The object a human expert labels has more structure than the internal representation used by the alg.
- MQs construct examples in the internal representation.
- It can be very difficult to order a constructed example's words so that a human can label it (esp. for long reviews).
Passive: Wastes Too Many Queries
• ML people move on
• Passive Model (sample from D): queried samples exist in NATURE; but quite wasteful (many examples are uninformative)
• Can we SAVE on #queries?
Active Testing: The NEW Model of Property Testing
- Pool of unlabeled data (poly-size)
- Alg can ask for labels, but only for pts in the pool
- Goal: small #queries
Property Tester
• Definition. An s-sample, q-query ε-tester for P over the distribution D is a randomized algorithm A that draws s samples from D (cheap), sequentially queries for the value of f on q of those samples (expensive), and then
1. Accepts w.p. at least 2/3 when f ∈ P
2. Rejects w.p. at least 2/3 when dist_D(f, P) > ε
Special cases: Active tester: s = poly(n); Passive tester: s = q; MQ tester: s = ∞ (D = Unif).
Active Property Testing
• Testing as a preprocessing step for learning
• Need an example where Active Testing:
- gets the same QC savings as MQ
- is better in QC than Passive
- needs fewer queries than Learning
• Union of d Intervals: active testing helps!
0----++++----+++++++++-----++---+++--------1
- Testing tells how big d needs to be to be close to the target
- #Labels: Active Testing needs O(1), Passive Testing needs Θ(√d), Active Learning needs Θ(d)
NEW!! Our Results (Has Profound Implications)

                        Active Testing    Passive Testing    Active Learning
Union of d Intervals    O(1)              Θ(d^{1/2})         Θ(d)
Dictator                Θ(log n)          Θ(log n)           Θ(log n)
Linear Threshold Fn     O(n^{1/2})        ~Θ(n^{1/2})        Θ(n)
Cluster Assumption      O(1)              Ω(N^{1/2})         Θ(N)

Active testing is MQ-like on testing UINTd, and Passive-like on testing Dictator.
Testing Unions of Intervals
0----++++----+++++++++-----++---+++-------- 1
• Theorem. Testing UINTd in the active testing
model can be done using O(1) queries.
Recall: Learning requires Ω(d) examples.
Testing Unions of Intervals (cont.)
• Suppose the uniform distribution on [0, 1]
• Definition: Fix δ > 0. The local δ-noise sensitivity of a fn f: [0, 1] -> {0, 1} at x ∈ [0, 1] is NS_δ(f, x) = Pr_{y ~ U[x-δ, x+δ]}[f(x) ≠ f(y)]. The noise sensitivity of f is NS_δ(f) = E_{x ~ U[0,1]}[NS_δ(f, x)].
• Proposition (easy): Fix δ > 0. Let f: [0, 1] -> {0, 1} be a union of d intervals. Then NS_δ(f) ≤ dδ.
• Lemma (hard): Fix δ = ε^2/(32d). Let f: [0, 1] -> {0, 1} be a fn with noise sensitivity bounded by NS_δ(f) ≤ dδ(1 + ε/4). Then f is ε-close to a union of d intervals.
Easy Lemma
• Lemma. If f is a union of ≤ d intervals,
NSδ(f) ≤ dδ.
Proof sketch:
- The probability that x lands within distance δ
of any of the boundaries is at most 2d*2δ.
- The probability that y crosses a boundary, given that x is within distance δ of it, is (on average) 1/4.
- So NS_δ(f) = P(f(x) ≠ f(y)) ≤ (2d·2δ)·(1/4) = dδ.
Hard Lemma
• Lemma. Fix δ = ε^2/(32d). If f is ε-far from a union of d intervals, then NS_δ(f) > (1 + ε/4)dδ.
Proof strategy:
If NS_δ(f) is small, do "self-correction":
g(x) = E[f(y) | y ~ U[x-δ, x+δ]],
f'(x) = round g(x) to 0 if g(x) ≤ τ, or to 1 if g(x) ≥ 1 - τ (for a suitable small threshold τ).
Hard Lemma
• Lemma. Fix δ = ε^2/(32d). If f is ε-far from a union of d intervals, then NS_δ(f) > (1 + ε/4)dδ.
Proof strategy:
- Argue dist(f,f’) ≤ε/2.
- Show f’ is union of ≤ d(1 + ε/2) intervals.
- Implies dist(f’,P) ≤ ε/2.
[Figure: points drawn uniformly from [0, 1], each with a δ-neighborhood, over the interval 0----++++----+++++++++-----++---+++--------1, illustrating the noise-sensitivity tester.]
Testing Unions of Intervals
• Theorem. Testing UINTd in the active
testing model can be done using O(1)
queries.
• If the distribution is non-uniform, use data to stretch/squash the axis, making the distribution near-uniform.
• Total number of unlabeled samples: O(d^{1/2}).
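To make the O(1)-query claim concrete, here is a minimal sketch (in Python) of a noise-sensitivity-based active tester, assuming a `query_label` oracle and a near-uniform unlabeled pool; the pairing rule, the constants, and the acceptance threshold are illustrative assumptions rather than the thesis's exact algorithm.

```python
import random

def active_test_uintd(query_label, pool, d, eps, n_pairs=200):
    """Sketch of an active tester for unions of d intervals on [0, 1].

    query_label(x) returns f(x) in {0, 1}; pool is an unlabeled sample from a
    (near-)uniform distribution.  We estimate the noise sensitivity NS_delta(f)
    from the labels of close-by pairs and accept iff it is small enough.
    """
    delta = eps ** 2 / (32 * d)
    pts = sorted(pool)
    # Pairs of pool points at distance <= delta approximate (x, y ~ U[x-delta, x+delta]).
    close_pairs = [(a, b) for a, b in zip(pts, pts[1:]) if b - a <= delta]
    if len(close_pairs) < n_pairs:
        return True   # pool too small to witness far-ness; enlarge the pool in practice
    sample = random.sample(close_pairs, n_pairs)   # only O(1) label queries
    ns_hat = sum(query_label(a) != query_label(b) for a, b in sample) / n_pairs
    # A union of d intervals has NS_delta <= d*delta; eps-far functions exceed (1 + eps/4)*d*delta.
    return ns_hat <= d * delta * (1 + eps / 8)
```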
Testing Linear Threshold Fns
• Linear Threshold Functions (LTF):
f(x) = sign(<w, x>), for w, x ∈ R^n
Testing Linear Threshold Fns
• Theorem. We can efficiently test LTFs under the Gaussian distribution with Õ(n^{1/2}) labeled examples in both the active and passive testing models.
• We have lower bounds of ~Ω(n^{1/3}) for active testing and ~Ω(n^{1/2}) for passive testing.
• Learning LTFs needs Ω(n) examples under the Gaussian, so testing is better than learning in this case.
Testing Linear Threshold Fns
• [MORS’10] => suffices to estimate
E[f(x) f(y) <x,y>] up to ± poly(ε).
• Intuition: LTF is characterized by a nice
linear relation between angle (<x,y>) and
probability of having same label
(f(x)f(y)=1).
Testing Linear Threshold Fns
• [MORS’10] => suffices to estimate
E[f(x) f(y) <x,y>] up to ± poly(ε).
• Could take m random pairs and use
empirical average.
- But most pairs x, y would have <x,y> ≈ n^{1/2} (CLT), so we would need m = Ω(n) pairs to get within ± poly(ε).
• Solution: take O(n^{1/2}) random points and average f(x)f(y)<x,y> over all O(n) pairs x, y.
- Concentration inequalities for U-statistics
[Arcones,95] imply this works.
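A minimal sketch of that pairwise estimator, assuming a `query_label` oracle returning f(x) in {-1, +1} under the standard Gaussian; the sample size m and the constants are illustrative.

```python
import numpy as np

def ltf_pair_statistic(query_label, n, eps, rng=None):
    """Estimate E[f(x) f(y) <x, y>] with roughly sqrt(n) labeled Gaussian points
    by averaging over all pairs (a U-statistic), as on the slide."""
    rng = rng or np.random.default_rng()
    m = int(np.ceil(np.sqrt(n) / eps ** 2))          # number of labeled points (illustrative)
    X = rng.standard_normal((m, n))                  # x_1, ..., x_m ~ N(0, I_n)
    y = np.array([query_label(x) for x in X])        # f(x_i) in {-1, +1}
    G = X @ X.T                                      # Gram matrix of inner products
    W = np.outer(y, y) * G                           # f(x_i) f(x_j) <x_i, x_j>
    iu = np.triu_indices(m, k=1)                     # all O(m^2) = O(n) distinct pairs
    return W[iu].mean()
```

The tester would then accept iff this estimate is within ± poly(ε) of the value that an LTF would produce, as licensed by [MORS'10].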
General Testing Dimension
• Testing dimension characterizes (up to constant factors) the intrinsic #label requests needed to test a given property w.r.t. a given distribution
• All our lower bounds are proved via the testing dimension
Minimax Argument
• min_Alg max_f P(Alg mistaken) = max_{π0} min_Alg P(Alg mistaken)
• w.l.o.g., π0 = α·π + (1-α)·π', with π ∈ Π_0 and π' ∈ Π_ε
• Let π_S, π'_S be the induced distributions on the labels of S.
• For a given π0, min_Alg P(Alg makes a mistake | S) ≤ 1 - d_S(π, π')
Passive Testing Dim
• Define d_passive: the largest q ∈ N s.t. some π ∈ Π_0 and π' ∈ Π_ε induce nearly indistinguishable label distributions on a sample S ~ D^q.
• Theorem: The sample complexity of passive testing is Θ(d_passive).
Compare with VC dimension: there one wants a set S s.t. all labelings occur at least once.
Active Testing Dim
• Fair(π, π', U): distrib. of labeled pairs (y, ℓ): w.p. ½ choose y ~ π_U, ℓ = 1; w.p. ½ choose y ~ π'_U, ℓ = 0.
• err*(H; P): error of the optimal fn in H w.r.t. data drawn from distrib. P over labeled examples.
• Given u = poly(n) unlabeled examples, d_active(u): the largest q ∈ N s.t. some π ∈ Π_0 and π' ∈ Π_ε make Fair(π, π', U) nearly impossible to predict from q labeled examples (err* close to ½).
• Theorem: Active testing w/ failure prob 1/8 using u unlabeled examples needs Ω(d_active(u)) label queries; it can be done w/ O(u) unlabeled examples and O(d_active(u)) label queries.
Application: Dictator fns
• Theorem: For dictator functions under the
uniform distribution, dactive(u)=Θ(log n) (for any
large-enough u=poly(n)).
• Corollary: Any class that contains dictator functions requires Ω(log n) queries to test in the active model, including poly-size decision trees, functions of low Fourier degree, juntas, DNFs, etc.
Application: Dictator fns
• Theorem: For dictator functions under the
uniform distribution, dactive(u)=Θ(log n) (for any
large-enough u=poly(n)).
• π = unif over dictator fns
• π’ = unif over all Boolean fns
Application: LTFs
• Theorem. For LTFs under the standard n-dim Gaussian distrib, d_passive = Ω((n/log n)^{1/2}) and d_active(u) = Ω((n/log n)^{1/3}) (for any u = poly(n)).
- π: distrib over LTFs obtained by choosing w ~ N(0, I_{n×n}) and outputting f(x) = sgn(w·x).
- π': uniform distrib over all functions.
- Obtain d_passive: bound tvd(distrib of Xw/√n, N(0, I_{q×q})).
- Obtain d_active: similar to the dictator LB, but relies on strong concentration bounds on the spectrum of random matrices.
Open Problem
• Matching lb/ub for active testing LTF: √n?
• Tolerant Testing ε/2 vs. ε (UINTd, LTF)
• Testing LTF under general distrib.
Outline
• Learnability of DNF
with Representation
Specific Queries
- Liu: We do statistical learning for …
- Marvin: But we haven't done well at the fundamentals, e.g. knowledge representation.
Learning DNF formulas
• Poly-sized DNF: #terms = n^{O(1)}, e.g. f = (x1∧x2)∨(x1∧x4)
- Natural form of knowledge representation
- PAC-learning DNF appears to be very hard: the best known alg in the standard model is exponential over arbitrary distributions; over Unif, no poly-time alg is known.
Your ticket:
n: number of variables
Concept space C: collection of fns h: {0, 1}^n -> {0, 1}
Unknown target fn f*: the true labeling fn
Err(h) = P_{x~D}[h(x) ≠ f*(x)] (distrib. D over X)
New Models: Interaction with
Oracles
Imagine …
Hi, Tim, do x and y have some term in common?
Yes!
- Boolean queries: K(x, y) = 1 if x and y share some term
- Numerical queries: K(x, y) = #terms x and y share
Query: Similarity about TYPE
What if we have similarity info about TYPE? Fraud detection: are two frauds of the same type?
Example: x has terms {Identity theft, Stolen cards, Skimming}; y has terms {Stolen cards, BIN attack}. YES! x and y share a term (Stolen cards).
Type of query: for a pair of POSITIVE ex.s from a random dataset, the teacher says YES if they share some term, or reports how many terms they share.
Question: can we efficiently learn DNF
with this type of query?
Warm Up: Disjoint DNF
w/Boolean Queries
• Use similarity queries to partition positive ex.s
into t buckets, one per term.
• Separately learn a conjunction for each bucket
(intersect the pos ex.s in it)
• OR the results
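A minimal sketch of this warm-up, assuming positive examples are 0/1 tuples and `same_term(x, y)` answers the Boolean similarity query; the names and details are illustrative.

```python
def learn_disjoint_dnf(positives, n, same_term):
    """Partition positive examples into per-term buckets using Boolean
    similarity queries, intersect each bucket into a conjunction, OR them."""
    buckets = []
    for x in positives:
        for bucket in buckets:
            if same_term(x, bucket[0]):      # disjoint terms: one query per bucket suffices
                bucket.append(x)
                break
        else:
            buckets.append([x])
    terms = []
    for bucket in buckets:
        # Keep exactly the literals on which every example in the bucket agrees.
        term = {i: bucket[0][i] for i in range(n)
                if all(x[i] == bucket[0][i] for x in bucket)}
        terms.append(term)
    return lambda z: any(all(z[i] == v for i, v in t.items()) for t in terms)
```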
Pos Result 1: Weakly Disjoint
DNF w/Boolean Queries
- Distinguishing ex. for T1: an ex. satisfying T1 and no other term
- Weakly disjoint: for each term, at least a 1/poly(n, 1/ε) fraction of random ex.s satisfy it and no other term.
- Neighbor-method: for each positive ex., get all its neighbors in the graph and learn a conjunction (see the sketch below).
- The neighbor-method, w.p. 1-δ, produces an ε-accurate DNF if the target is weakly disjoint.
Graph:
- Nodes: pos examples
- Edge exists if K(.) = 1
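A minimal sketch of the neighbor-method under the same assumptions; the pruning of candidate conjunctions against negative examples is an illustrative safeguard, not necessarily the thesis's exact step.

```python
def neighbor_method(positives, negatives, n, same_term):
    """For each positive example, intersect it with its graph neighbors
    (K(x, y) = 1) to form a candidate conjunction; keep the consistent ones."""
    def satisfies(z, term):
        return all(z[i] == v for i, v in term.items())

    candidates = []
    for x in positives:
        neighborhood = [y for y in positives if y is x or same_term(x, y)]
        term = {i: x[i] for i in range(n)
                if all(y[i] == x[i] for y in neighborhood)}
        candidates.append(term)
    # Discard candidates that wrongly cover a negative example.
    terms = [t for t in candidates if not any(satisfies(z, t) for z in negatives)]
    return lambda z: any(satisfies(z, t) for t in terms)
```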
Hardness Results
Boolean Queries
Thm. Learning DNF from random data under
arb. distri. w/ Boolean queries is as hard as
learning DNF from random data under arb.
distri. w/ only labels (no queries).
- Group-learning: tell whether data came from D+ or D-.
- Reduction from group-learning DNF in the std. model to our model.
- How do we use our alg A to group-learn?
- Simulate the oracle by always saying YES whenever a query is made on two pos. ex.s; given the output of A, we obtain a group-learning alg for the original problem.
[Figure: many n-var. examples concatenated into two 'giant' examples; the simulated oracle answers K(giant 1, giant 2) = 1.]
Hardness Results
Approx Numerical Queries
Thm. Learning DNF from random data under arbitrary distrib. w/ approx-numerical queries is as hard as learning DNF from random data under arb. distrib. w/ only labels. (Approx-numerical query: if C is the #terms x_i and x_j satisfy in common, the oracle returns a value in [(1 - τ)C, (1 + τ)C].)
Pos Result 3: learn DNF w/
Numerical Queries
Thm. Under unif distri., w/ numerical
queries, can learn any poly(n)-term DNF.
- Sample m = O((t/ε) log(t/(εδ))) landmark points
- Landmark F_i(x) is a sum-of-monotone-terms fn (remove the terms not satisfied by the positive ex. x_i); F_i(·) = K(x_i, ·), where K is the numerical query
- Use a subroutine to learn a hypothesis h_i(x) that is ε/(2m)-accurate w.r.t. F_i.
• Subroutine: learn a sum of t monotone terms over Unif, using time & samples poly(t, n, 1/ε):
f(x) = T1(x) + T2(x) + … + Tt(x)
- Combine all hypotheses h_i into h: h(x) = 0 if h_i(x) = 0 for all i, else h(x) = 1.
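Schematically, the landmark construction looks as follows; `draw_positive`, `K`, and `learn_sum_of_terms` are placeholders for the oracle access and the subroutine described on the next slides, and the constants are illustrative.

```python
import math

def learn_dnf_numerical(draw_positive, K, t, eps, delta, learn_sum_of_terms):
    """Learn a poly(n)-term DNF under Unif with numerical queries by learning
    one sum-of-monotone-terms function per landmark and OR-ing the results."""
    m = math.ceil((t / eps) * math.log(t / (eps * delta)))    # number of landmark points
    landmarks = [draw_positive() for _ in range(m)]
    hypotheses = []
    for xi in landmarks:
        F_i = lambda x, xi=xi: K(xi, x)      # = number of terms that both xi and x satisfy
        hypotheses.append(learn_sum_of_terms(F_i, eps / (2 * m)))
    # h(x) = 0 iff every landmark hypothesis outputs 0.
    return lambda x: 1 if any(h(x) for h in hypotheses) else 0
```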
Learn Sum of Monotone Terms
[Figure: greedy search over subsets S of {x1, …, x9}: estimate the Fourier coefficient of S, run the inclusion check (magnitude ≥ ε/(16t)?), extend the sets that pass, and output the recovered term, e.g. x1∧x3∧x4∧x9.]
Learn Sum of Monotone Terms :
Greedy Alg
• Examine each parity fn of size 1 & estimate its Fourier coefficient (up to θ/4 accuracy). Set θ = ε/(8t).
• Place all coefficients of magnitude ≥ θ/2 into a list L_1 (inclusion check).
• For j = 2, 3, … repeat:
- For each parity fn Φ_S in list L_{j-1} and each x_i not in S, estimate the Fourier coefficient of Φ_{S∪{i}}
- If the estimate is ≥ θ/2 in magnitude, add it to list L_j (if not already in)
- Maintain list L_j: size-j parity fns w/ coefficient magnitude ≥ θ.
• Construct fn g: the weighted sum of the parities with identified coefficients.
• Output fn h(x) = [g(x)] (g rounded).
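A minimal sketch of the greedy coefficient search under the uniform distribution, assuming query access to F (for instance via the numerical query K(x_i, ·)); the sampling-based coefficient estimates and the expansion loop are illustrative.

```python
import random

def estimate_fourier(F, S, n, num_samples=2000):
    """Estimate the Fourier coefficient F_hat(S) = E_x[F(x) * chi_S(x)] under the
    uniform distribution on {0,1}^n, where chi_S(x) = (-1)^{sum_{i in S} x_i}."""
    total = 0.0
    for _ in range(num_samples):
        x = [random.randint(0, 1) for _ in range(n)]
        chi = 1 - 2 * (sum(x[i] for i in S) % 2)
        total += F(x) * chi
    return total / num_samples

def greedy_heavy_sets(F, n, t, eps, max_size=20):
    """Grow the lists L_1, L_2, ... of variable sets whose estimated Fourier
    coefficient passes the inclusion check (magnitude >= theta/2, theta = eps/(8t))."""
    theta = eps / (8 * t)
    frontier = {frozenset([i]) for i in range(n)
                if abs(estimate_fourier(F, [i], n)) >= theta / 2}
    heavy = set(frontier)
    for _ in range(2, max_size + 1):
        nxt = set()
        for S in frontier:
            for i in range(n):
                T = S | {i}
                if i in S or T in heavy:
                    continue
                if abs(estimate_fourier(F, sorted(T), n)) >= theta / 2:
                    nxt.add(T)
        heavy |= nxt
        frontier = nxt
        if not frontier:
            break
    return heavy   # g is then the weighted sum of these parities, and h = [g] rounded
```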
Other Positive Results
Learnable classes (with Boolean and/or numerical queries):
- O(log n)-term DNF (any distrib.)
- 2-term DNF (any distrib.)
- DNF where each var appears in at most O(log n) terms (Unif)
- log(n)-Juntas (Unif)
- log(n)-Juntas (any distrib.)
- DNF with ≤ 2^{O(√log n)} terms (Unif)
Open problems:
- Learn arbitrary DNF (Unif, Boolean queries)?
- Learn arbitrary DNF (any distrib., numerical queries)?
Outline
• Active Learning with a Drifting
Distribution
- If not every poem has a proof, can we at least try to make every theorem we prove beautiful like a poem?
Active Learning with
a Drifting Distrib: Model
• Scenario:
- Unobservable seq. of distrib.s D_1, D_2, …, with each D_t in the distribution space 𝒟
- Unobservable time-indep. regular cond. distrib., represented by a fn η
- X_1, X_2, …: an infinite seq. of indep. r.v.s, s.t. X_t ~ D_t, and the cond. distrib. of Y_t given X_t is determined by η
• Active learning protocol:
At each time t, the alg is presented with X_t and is required to predict a label Ŷ_t; then it may optionally request to see the true label value Y_t
• Interested in cumulative #mistakes up to time T and total
#labels requested up to time T
[Figure: the distribution space, containing D1, D2, D3, D4, …, Dt, and the data space, containing x1, x2, x3, x4, …, xt, with each xt drawn from Dt.]
Definition and Notations
• Instance space X = R^n
• Distribution space 𝒟 of distributions on X
• Concept space C of classifiers h: X -> {-1, 1}
- Assume C has VC dimension vc < ∞
• D_t: data distrib. on X at time t
• Unknown target fn h*: the true labeling fn
• Err_t(h) = P_{x~D_t}[h(x) ≠ h*(x)]
• In the realizable case, h* ∈ C and err_t(h*) = 0.
Def: disagreement coefficient, tvd
• The disagreement coefficient of h* under a distrib. P on X is defined as (over r > 0): θ = sup_{r>0} P(DIS(B_P(h*, r)))/r, where B_P(h*, r) = {h ∈ C : P(h(x) ≠ h*(x)) ≤ r} and DIS(V) = {x : ∃ h, g ∈ V with h(x) ≠ g(x)}.
• The total variation distance of probability measures P and Q on a sigma-algebra ℱ of subsets of the sample space is defined via ‖P - Q‖ = sup_{A ∈ ℱ} |P(A) - Q(A)|.
Assumptions
• Independence of the X_t variables
• VC dimension vc < ∞
• Assumption 1 (totally bounded): 𝒟 is totally bounded (i.e., for every ε > 0 it admits a finite ε-cover in total variation)
- For each ε > 0, let 𝒟_ε denote a minimal subset of 𝒟 s.t. every D ∈ 𝒟 has some D' ∈ 𝒟_ε with ‖D - D'‖ < ε (i.e. a minimal ε-cover of 𝒟)
• Assumption 2 (poly-covers): |𝒟_ε| ≤ c·ε^{-m}, where c, m ≥ 0 are constants.
Realizable-case Active Learning
CAL
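A minimal sketch of CAL's query rule in the streaming protocol above, using a finite list of hypotheses to stand in for the VC class C and a `query_label` oracle for Y_t; everything here is illustrative.

```python
def cal(stream, hypotheses, query_label):
    """Realizable-case CAL: predict with any hypothesis consistent so far, and
    request the label only when the surviving hypotheses disagree on X_t."""
    version_space = list(hypotheses)
    mistakes = queries = 0
    for x in stream:
        preds = [h(x) for h in version_space]
        y_hat = preds[0]                       # prediction of some consistent hypothesis
        if len(set(preds)) > 1:                # x lies in the disagreement region DIS(V)
            queries += 1
            y = query_label(x)                 # the only place a label is requested
            if y_hat != y:
                mistakes += 1                  # in the realizable case, mistakes only occur here
            version_space = [h for h in version_space if h(x) == y]
    return mistakes, queries
```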
Sublinear Result: Realizable Case
Theorem. If 𝒟 is totally bounded, then CAL achieves an expected mistake bound o(T); and if additionally the disagreement coefficient of h* is finite, then CAL makes an expected number of queries o(T).
[Proof Sketch]:
Partition 𝒟 into buckets of diameter < ε.
Pick a time T_ε past all indices belonging to the finite buckets, after which every infinite bucket has at least L(ε) samples.
Number of Mistakes
• Alternative scenario:
- Let P_i be a distribution in bucket i
- Swap the L(ε) samples for bucket i with L(ε) samples from P_i
- Take L(ε) large enough so that E[diam(V)]_alternative < √ε.
- Note: E[diam(V)] ≤ E[diam(V)]_alternative + Σ (over the L(ε) swapped values of t) ‖P_i - D_t‖ < √ε + L(ε)·ε,
so E[diam(V)] -> 0 as T -> ∞.
- Hence E[#mistakes] ≤ Σ_t E[diam(V_{t-1})] = o(T), since P(mistake at time t) ≤ E[diam_{D_t}(V_{t-1})].
Number of Queries
• E[#queries] = Σ_t P(make a query at time t)
• P(make a query at t) = E[P_{D_t}(DIS(V_{t-1}))]
• With a finite disagreement coefficient θ, P_{D_t}(DIS(V_{t-1})) ≤ θ·diam(V_{t-1}), so
E[#queries] ≤ θ·Σ_t E[diam(V_{t-1})] = o(T).
Explicit Bound: Realizable Case
Theorem. If the poly-covers assumption is satisfied, then CAL achieves an expected mistake bound and an E[#queries] that are sublinear in T, with explicit rates depending on the cover parameters c and m.
[Proof Sketch]
Fix any ε > 0, and enumerate the ε-cover 𝒟_ε.
For t ∈ N, let K(t) be the index k of the element of 𝒟_ε closest to D_t.
Alternative data sequence: let X'_1, X'_2, … be indep., with X'_t drawn from the cover element indexed by K(t).
This way all samples corresponding to distrib.s in a given bucket come from the same distrib.
Let V'_t be the corresponding version spaces.
E[#mistakes]:
- A classic PAC bound controls the version-space diameter in terms of the number of previous distrib.s in D_t's bucket.
- Summing over t (each bucket has at most T samples) bounds E[#mistakes]; taking ε appropriately as a function of T gives the stated theorem.
- To bound E[#queries], the same argument applies; again, taking the same ε gives the stated result.
Learning with Noise
Noise conditions
• Strictly benign noise condition: h*(x) = sign(η(x) - ½) ∈ C and η(x) ≠ ½ (almost surely)
• Special case: Tsybakov's noise conditions:
η satisfies the strictly benign noise condition and, for some c > 0 and α ≥ 0, a Tsybakov-type margin condition holds (the probability mass of points with η(x) near ½ is polynomially small).
• Unif Tsybakov assumption: the Tsybakov assumption is satisfied for all D ∈ 𝒟 with the same c and α values.
Agnostic CAL [DHM]
Based on a consistency-check subroutine from [DHM]:
Tsybakov Noise: Sublinear Results
& Explicit Bound
Theorem. If 𝒟 is totally bounded and η satisfies the strictly benign noise condition, then ACAL achieves an excess expected mistake bound o(T); and if additionally the disagreement coefficient is finite, then ACAL makes an expected number of queries o(T).
Theorem. If the poly-covers assumption and the Unif Tsybakov assumption are satisfied, then ACAL achieves an expected excess #mistakes and an expected #queries with explicit sublinear rates (depending on c, m, and α).
Outline
• Transfer Learning
- Do not ask what Bayesians can do for
Machine Learning, ask what Machine
Learning can do for Bayesians
Transfer Learning
• Principle: solving a new learning problem is easier given
that we’ve solved several already !
• How does it help?
- New task directly "related" to previous tasks
[e.g., Ben-David & Schuller 03; Evgeniou, Micchelli, & Pontil 2005]
- Previous tasks give us useful sub-concepts [e.g., Thrun 96]
- Can gather statistical info on the variety of concepts
[e.g., Baxter 97; Ando & Zhang 04]
• Example: Speech Recognition
- After training a few times, figured out the dialects.
- Next time, just identify the dialect.
- Much easier than training a recognizer from scratch
Model of Transfer Learning
Motivation: Learners often Not Too Altruistic
[Figure: two-layer model. Layer 1: tasks drawn i.i.d. from an unknown prior, giving targets h_1*, h_2*, …, h_T*. Layer 2: for each task t, data (x_t1, y_t1), …, (x_tk, y_tk) drawn i.i.d. with labels from h_t*. Over tasks, the learner obtains a better estimate of the prior.]
- Marvin: so you assume learning French is similar
to learning English?
- Liu: It indeed seems many English words have a
French counterpart …
Identifiability of priors from joint
distribs
• Let prior π be any distribution on C
- example: (w, b) ~ multivariate normal
• Target h*_π ~ π
• Data X = (X1, X2, …) i.i.d. D, indep. of h*_π
• Z(π) = ((X1, h*_π(X1)), (X2, h*_π(X2)), …)
• Let [m] = {1, …, m}
• Denote X_I = {X_i}_{i ∈ I} (I: a subset of the natural numbers)
• Z_I(π) = {(X_i, h*_π(X_i))}_{i ∈ I}
• Theorem: Z_[vc](π1) =_d Z_[vc](π2) iff π1 = π2.
Identifiability of priors by
VC-dim joint distri.
• Threshold fns on [0, 1]:
0---------------------++++++++++++++++1
- For two points x1 < x2:
Pr(+,+) = Pr(+,.), Pr(-,-) = Pr(.,-), Pr(+,-) = 0,
so Pr(-,+) = Pr(.,+) - Pr(+,+) = Pr(.,+) - Pr(+,.)
- For any k > 1 points, we can iteratively reduce the number of specified labels in the joint probability from k down to 1:
P(-----------(-+)+++++++++++++++++)
= P(-----------(.+)+++++++++++++++++) - P(-----------(++)+++++++++++++++++)
= P(-----------(.+)+++++++++++++++++) - P(-----------(+.)+++++++++++++++++) + P(-----------(+-)+++++++++++++++++)   (unrealized labeling!)
= P(-----------(.+)+++++++++++++++++) - P(-----------(+.)+++++++++++++++++)
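A tiny Monte-Carlo check of these two-point identities (the simulation itself is only an illustration):

```python
import numpy as np

def threshold_joint_probs(theta_samples, x1, x2):
    """For thresholds theta ~ prior, with labels -,+ below/above theta, the joint
    label probabilities on x1 < x2 reduce to single-point marginals."""
    theta = np.asarray(theta_samples)
    p_pp = np.mean(theta <= x1)                    # Pr(+,+) = Pr(+, .)
    p_mm = np.mean(theta > x2)                     # Pr(-,-) = Pr(., -)
    p_mp = np.mean((theta > x1) & (theta <= x2))   # Pr(-,+)
    # Pr(-,+) = Pr(., +) - Pr(+, .), exactly as derived above.
    assert abs(p_mp - (np.mean(theta <= x2) - p_pp)) < 1e-9
    return p_pp, p_mm, p_mp
```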
• Theorem: Z[VC] (π1) =d Z[VC] (π2) iff π1 = π2.
Proof Sketch
• Let ρ_m(h, g) = (1/m) Σ_{i=1}^m 1{h(X_i) ≠ g(X_i)}.
Then vc < ∞ implies that, w.p. 1, for all h, g ∈ C with h ≠ g,
lim_{m -> ∞} ρ_m(h, g) = ρ(h, g) > 0
• ρ is a metric on C by assumption, so w.p. 1 each h in C labels the ∞-seq (X1, X2, …) distinctly as (h(X1), h(X2), …)
• => w.p. 1 the conditional distribution of the label seq Z(π)|X identifies π
=> the distrib of Z(π) identifies π,
i.e. Z_∞(π1) =_d Z_∞(π2) implies π1 = π2
Identifiability of Priors from Joint Distributions
[Figures: lower-dimensional conditional distributions; y' closer to ỹ.]
Transfer Learning Setting
• Collection Π of distribs on C (known)
• Target distrib π* ∈ Π (unknown)
• Indep. target fns h1*, …, hT* ~ π* (unknown)
• Indep. i.i.d.-D data sets X(t) = (X1(t), X2(t), …), t ∈ [T]
• Define Z(t) = ((X1(t), ht*(X1(t))), (X2(t), ht*(X2(t))), …)
• The learning alg. "gets" Z(1), then produces ĥ1, then "gets" Z(2), then produces ĥ2, etc., in sequence.
• Interested in: the values of ρ(ĥt, ht*), and the number of ht*(Xj(t)) values the alg. needs to access.
Estimating the prior
• Principle: learning would be easier if we knew π*
• Fact: π* is identifiable from the distrib of Z_[vc](t)
• Strategy: take samples Z_[vc](i) from past tasks i = 1, …, t-1, use them to estimate the distrib of Z_[vc](i), and convert that into an estimate π'_{t-1} of π*
• Use π'_{t-1} in a prior-dependent learning alg for the new task h_t*
• Assume Π is totally bounded in total variation
• Can estimate π* at a bounded rate: ‖π* - π'_{t-1}‖ < δ_t, where δ_t converges to 0 (holds w.h.p.)
Transfer Learning
• Given a prior-dependent learning alg A(ε, π), with E[#labels accessed] = Λ(ε, π), producing ĥ with E[ρ(ĥ, h*)] ≤ ε:
For t = 1, …, T:
If δ_{t-1} > ε/4, run prior-indep. learning on Z_[vc/ε](t) to get ĥ_t
Else let π''_t = argmin_{π ∈ B(π'_{t-1}, δ_{t-1})} Λ(ε/2, π) and run A(ε/2, π''_t) on Z(t) to get ĥ_t
Theorem: For all t, E[ρ(ĥ_t, h_t*)] ≤ ε, and limsup_{T -> ∞} E[#labels accessed]/T ≤ Λ(ε/2, π*) + vc.
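In code form, the reduction reads roughly as below; every callable is a placeholder for an object defined above (the prior estimator, the prior-independent learner, and A), and the argmin over the δ-ball and the exact label accounting are omitted.

```python
def transfer_learn(tasks, vc, eps, estimate_prior, prior_free_learner, prior_dep_learner):
    """Sketch of the transfer-learning reduction: estimate the prior from past
    tasks and switch to the prior-dependent learner A once the estimate is good."""
    history = []                       # Z_[vc] samples retained from past tasks
    hypotheses = []
    for Z_t in tasks:                  # Z_t: the labeled data sequence of task t
        pi_hat, delta = estimate_prior(history)
        if delta > eps / 4:
            h_t = prior_free_learner(Z_t, eps)             # prior-independent fallback
        else:
            h_t = prior_dep_learner(Z_t, pi_hat, eps / 2)  # A(eps/2, pi'') on Z_t
        hypotheses.append(h_t)
        history.append(Z_t[:vc])       # first vc labeled points feed the next prior estimate
    return hypotheses
```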
- Yonatan: I’ll send you an email to summarize
what we just discussed.
- Liu: Thank you but I now invented a model to
transfer knowledge with provable guarantees;
so I use that all the time.
- Yonatan: But that's an asymptotic guarantee. My life span is finite. So I'm still gonna send you an email.
Outline
• Online Allocation and Pricing with
Economies of Scale
- Jamie Dimon: Economies of scale are a good thing.
If we didn't have them, we'd still be living in tents
and eating buffalo.
Setting
• Christmas season
- Nov: customer survey
- Dec: purchasing and selling
• Buyers arrive online one at a time w/ val.s on
items sampled iid from some unknown distri.
Thrifty Santa Claus
• Each shopper wants only one item, though it might prefer some items to others
• Minimize the total cost to the seller
• Buyers: binary valuations
• Goal of the seller: satisfy everyone
Hardness: Set-Cover
• If costs decrease much more rapidly, then even if all customers' val.s were known up front, this would be (roughly) a set-cover problem, and we could not hope to achieve cost o(log n) times optimal.
• Natural case: for each good, cost (to the
seller) for ordering T copies is sublinear in T.
[Figure: production cost and marginal cost vs. #copies, for α = 0, α in (0, 1), and α = 1.]
Thrifty Santa Claus: Results
• Marginal cost non-increasing: does an optimal strategy of a simple form exist?
- Order items by some permutation; give each new buyer the earliest item it desires in the perm.
• What if n (#buyers) >> k (#items) AND the marginal cost does not decrease too rapidly (rate 1/T^α for 0 ≤ α < 1)?
- Can efficiently perform allocation w/ cost ≤ a const. factor greater than OPT
Algorithm
• Alg: use the initial buyers to learn about the distrib. and determine how best to allocate to new buyers.
• If the cost fn is c(x) = Σ_{i=1}^{x} 1/i^α, for α in [0, 1):
- run greedy weighted set cover => total cost ≤ (1/(1-α))·OPT.
• Essentially a smooth variant of set cover
• If the average cost is within some factor of the marginal cost, we have a greedy alg w/ a const. approx. ratio
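One way to picture the greedy step is the sketch below, which in each round serves, at the cheapest marginal cost per buyer, all remaining buyers who want some item; this particular greedy rule and the data layout are illustrative assumptions, not necessarily the thesis's exact algorithm.

```python
def greedy_allocation(buyers, items, alpha):
    """Greedy, weighted-set-cover style allocation for binary valuations with
    per-item production cost c(x) = sum_{i=1..x} 1/i^alpha."""
    def cost(x):
        return sum(1.0 / (i ** alpha) for i in range(1, x + 1))

    uncovered = set(buyers)                 # buyers maps buyer -> set of desired items
    copies = {item: 0 for item in items}
    assignment = {}
    while uncovered:
        best = None
        for item in items:
            want = [b for b in uncovered if item in buyers[b]]
            if not want:
                continue
            extra = cost(copies[item] + len(want)) - cost(copies[item])
            ratio = extra / len(want)       # marginal cost per newly served buyer
            if best is None or ratio < best[0]:
                best = (ratio, item, want)
        if best is None:
            break                           # some buyer desires no item at all
        _, item, served = best
        for b in served:
            assignment[b] = item
        uncovered -= set(served)
        copies[item] += len(served)
    return assignment, sum(cost(c) for c in copies.values())
```

For instance, `greedy_allocation({'b1': {'toy'}, 'b2': {'toy', 'book'}}, ['toy', 'book'], 0.5)` serves both buyers with two copies of 'toy', since the second copy is cheaper than a first copy of 'book'.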
Sample Complexity Analysis
• How complicated does the allocation rule need to be to achieve good performance?
Theorem
Outline
• Factor Models for Correlated
Auctions
The Problem
• Auctioneer sells good to a group of n buyers.
• Seller wants to maximize his revenue.
• Each buyer maximizes his utility of getting the good: val. - price
• Seller doesn’t know exact val.s of players
• He knows distri D from which vec. of val.s
(v1, …, vn) is drawn.
Our Contribution
• When D is a product distri.,
- Myerson gives dominant strategy
truthful auction
• For general correlated distrib.s, it is not known:
- how to create truthful auctions
- how to use player j's bid to capture info about player i.
• What if the correlation between buyer val.s is driven by common factors?
Example
• Two firms produce the same type of good
• Each firm's "value": its production cost
• Each needs to hire workers (W) & rent capital (Z)
• l_i: #workers firm i needs to produce one unit
• k_i: amount of capital firm i needs
• ε_i: fixed costs unique to firm i
• Firm i's cost: C_i = l_i·W + k_i·Z + ε_i
• Firms' costs are correlated: they hire workers & rent capital from the same pool.
The Factor Model
• Factor model: V = λF + U, where
- V: vec. of observations
- λ: matrix of coefficients
- F : vec. of factors
- U: vec. of idiosyncratic components ind. of each
other & ind. of the factors
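A small numerical illustration of V = λF + U and of inferring the common factors from a subset of observed bids; the distributions, the known-loadings assumption, and the least-squares step are my illustrative assumptions, not the thesis's estimator.

```python
import numpy as np

def simulate_valuations(n_bidders, n_factors, rng=None):
    """Draw valuations from a factor model V = lambda @ F + U (toy parameters)."""
    rng = rng or np.random.default_rng(0)
    lam = rng.uniform(0.5, 1.5, size=(n_bidders, n_factors))   # loadings (lambda)
    F = rng.normal(size=n_factors)                              # common factors
    U = rng.normal(scale=0.1, size=n_bidders)                   # idiosyncratic parts
    return lam @ F + U, lam, F

def estimate_factors(bids_S, lam_S):
    """Least-squares estimate of the common factors from the bids of a sampled
    bidder set S, assuming their loadings lam_S are known."""
    f_hat, *_ = np.linalg.lstsq(lam_S, bids_S, rcond=None)
    return f_hat
```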
Discussions
• Possible that:
- The designer & bidders might not know the common factors
- Bidders might only know their own val.s
- The seller only knows the joint distrib. of bidders' val.s
• The seller can RECOVER the factor model by making inferences over observed bids.
• Aggregate info.: common factors inferred
from collective knowledge of all players.
The Auction
The Auction (cont.)
• Thm: When correlation follows this
factor model, this auction is dominant
strategy truthful, ex-post individually
rational, and asymptotically optimal.
Dominant Strategy
Truthfulness
• Toss a coin & choose between:
- 2nd price auction: truthful
- mechanism M estimates factors from a
random set of bidders S:
bidders in S receive utility 0 regardless of
allocation & price output by M
• Players in S are incentivized to be truthful by the small incentive they get from participating in the 2nd price auction.
Dominant Strategy Truthfulness
(Cont.)
• The remaining bidders, R = {1, …, n} \ S, receive incentives from both the 2nd price auction and mechanism M.
• M offers them allocation and price vec.s x(b_R), p(b_R) by running Myerson(b_R, V_R | f̂) on these players' bids and on the cond. distrib.s estimated for these players.
• No player in R can influence the estimated conditional distrib. V_R | f̂, and Myerson's optimal auction is truthful.
Thanks !
Hofstadter's Law: It always
takes longer than you expect,
even when you take into account
Hofstadter's Law.