Estimating Accuracy from Unlabeled Data
Emmanouil Antonios Platanios, Avrim Blum, Tom Mitchell
[email protected]
1. Problem

There exists a binary function f that we do not know (e.g., is "Pittsburgh" a city?), with output $Y \in \{0, 1\}$ and input $X \sim D$. Instead, we have a set of function approximations $\hat{f}_1, \dots, \hat{f}_N$ to that function, and we want to know how accurate they are.

Using only unlabeled data we can measure consistency but not correctness. So: does the implication "consistency implies correctness" hold? If yes, under what conditions?

• Consistency definition: agreement rates (both are right, or both are wrong).
• Correctness definition: error rates.

2. Approach

Error Rate: the probability, over $X \sim D$, of disagreeing with the correct output label:
$$e_A = P_D\Big(\bigcap_{i \in A} \big[\hat{f}_i(X) \neq Y\big]\Big)$$

Error Event: $E_A = \bigcap_{i \in A} E_i$, where $E_{\{i\}}$ is the probability that $\hat{f}_i$ makes an error.

Agreement rate between $\hat{f}_i$ and $\hat{f}_j$ (both are right, or both are wrong):
$$a_{\{i,j\}} = P_D\big(E_{\{i\}} \cap E_{\{j\}}\big) + P_D\big(\bar{E}_{\{i\}} \cap \bar{E}_{\{j\}}\big) = 1 - e_{\{i\}} - e_{\{j\}} + 2e_{\{i,j\}}$$

Given unlabeled input data, $X_1, \dots, X_S$, we observe the sample agreement rates:
$$\hat{a}_A = \frac{1}{S} \sum_{s=1}^{S} \mathbb{I}\big[\hat{f}_i(X_s) = \hat{f}_j(X_s),\ \forall i, j \in A : i \neq j\big]$$
3 functions that make independent errors: the probability that both $\hat{f}_i$ and $\hat{f}_j$ make an error factorizes as $e_{\{i,j\}} = e_{\{i\}} e_{\{j\}}$, so the pairwise agreement rates equations become
$$\hat{a}_{\{i,j\}} = 1 - e_{\{i\}} - e_{\{j\}} + 2\,e_{\{i\}} e_{\{j\}},$$
giving $\binom{3}{2} = 3$ equations in 3 unknowns, with solution
$$\hat{e}_{\{i\}} = \frac{1}{2}\left(1 - \frac{c}{2\hat{a}_{\{j,k\}} - 1}\right), \quad c = \pm\sqrt{\big(2\hat{a}_{\{1,2\}} - 1\big)\big(2\hat{a}_{\{1,3\}} - 1\big)\big(2\hat{a}_{\{2,3\}} - 1\big)},$$
for $i \in \{1,2,3\}$ and $j, k \in \{1,2,3\} \setminus i$.

Independent errors is too strong an assumption, and we do not make it. Without it, however, we end up with more unknowns than equations.
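The closed-form solution above can be checked numerically. A sketch (function name hypothetical), assuming all three error rates are below 0.5, which selects the positive sign for $c$:

```python
import math

def triangulate_error_rates(a12, a13, a23):
    """Error rates of 3 functions with independent errors, recovered
    from their pairwise agreement rates. Assumes e_i < 0.5 for all i,
    which picks the positive root for c."""
    c = math.sqrt((2 * a12 - 1) * (2 * a13 - 1) * (2 * a23 - 1))
    e1 = 0.5 * (1 - c / (2 * a23 - 1))  # uses the pair not containing 1
    e2 = 0.5 * (1 - c / (2 * a13 - 1))
    e3 = 0.5 * (1 - c / (2 * a12 - 1))
    return e1, e2, e3

# Agreement rates generated from true error rates (0.1, 0.2, 0.3)
# via a_ij = 1 - e_i - e_j + 2 e_i e_j:
estimates = triangulate_error_rates(0.74, 0.66, 0.62)
```

Plugging in agreement rates generated from error rates (0.1, 0.2, 0.3) returns exactly those error rates.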
Agreement rate for a bigger set of functions $A$:
$$a_A = P_D\Big(\bigcap_{i \in A} E_i\Big) + P_D\Big(\bigcap_{i \in A} \bar{E}_i\Big) = e_A + 1 + \sum_{k=1}^{|A|} (-1)^k \sum_{\substack{I \subseteq A \\ |I| = k}} e_I$$

Objective Function: minimize the dependence between the error rates (this relaxes the independence assumption instead of making it):
$$c(e) = \sum_{\substack{i, j \in \{1,\dots,N\} \\ i < j}} \big(e_{\{i,j\}} - e_{\{i\}} e_{\{j\}}\big)^2$$
and, for a bigger set of functions:
$$c(e) = \sum_{A : |A| \geq 2} \Big(e_A - \prod_{i \in A} e_{\{i\}}\Big)^2$$

Inequality constraints: $e_{\{i,j\}} \leq \min\big(e_{\{i\}}, e_{\{j\}}\big)$, and for a bigger set of functions: $e_A \leq \min_{i \in A} e_{A \setminus i}$.

Constrained Optimization Problem (AR): minimize $c(e)$ subject to the agreement rates equations and the inequality constraints.
Apply the maximum likelihood method as well:

Likelihood definition: $L(e) = P_D\big(\hat{Y}^1, \dots, \hat{Y}^S \mid e\big) = \prod_{s=1}^{S} P_D\big(\hat{Y}^s \mid e\big)$, where $e$ contains all of the function error rates, and $\hat{Y}^s = \big[\hat{f}_1(X_s), \dots, \hat{f}_N(X_s)\big]$.

For each sample, $\hat{Y}^s$, we have two cases:

1. One Clique Case: all of the functions agree with each other, so either all are right or all are wrong:
$$P_D\big(\hat{Y}^s \mid e\big) = e_C + 1 + \sum_{k=1}^{|C|} (-1)^k \sum_{\substack{I \subseteq C \\ |I| = k}} e_I,$$
where $C = \{1, \dots, N\}$ contains all of the functions. The probabilities here are computed in the same way as for the agreement rates.

2. Two Cliques Case: the functions can be split into two non-empty groups: those that output 1 (clique $C_1$) and those that output 0 (clique $C_2$). Either all functions in clique 1 ($C_1$) make an error and all functions in clique 2 ($C_2$) are correct, or the opposite. The two events are mutually exclusive and so:
$$P_D\big(\hat{Y}^s \mid e\big) = e_{C_1} + \sum_{k=1}^{|C_2|} (-1)^k \sum_{\substack{I \subseteq C_2 \\ |I| = k}} e_{I \cup C_1} + e_{C_2} + \sum_{k=1}^{|C_1|} (-1)^k \sum_{\substack{I \subseteq C_1 \\ |I| = k}} e_{I \cup C_2}$$

Regularization (MAP): we add a prior, minimizing
$$c(e) = -\log L(e) + \sum_{A : |A| \geq 2} \Big(e_A - \prod_{i \in A} e_{\{i\}}\Big)^2$$
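The per-sample probabilities in the two cases above can be sketched via inclusion-exclusion (function names hypothetical; `e` holds a joint error rate for every subset of functions, with the empty set mapped to 1):

```python
import itertools

def p_sample(clique1, clique2, e):
    """P(Y_hat_s | e) for one sample, where the functions in clique1
    output 1 and those in clique2 output 0 (one clique may be empty).
    `e` maps frozenset A -> e_A, and must contain e[frozenset()] = 1."""
    def all_err_all_correct(wrong, right):
        # P(every function in `wrong` errs and every function in `right`
        # is correct), by inclusion-exclusion over subsets of `right`.
        return sum((-1) ** k * e[frozenset(wrong) | frozenset(I)]
                   for k in range(len(right) + 1)
                   for I in itertools.combinations(right, k))
    if not clique1 or not clique2:      # one clique case: all agree
        c = tuple(clique1) + tuple(clique2)
        return e[frozenset(c)] + all_err_all_correct((), c)
    # two cliques case: the two events are mutually exclusive
    return (all_err_all_correct(clique1, clique2)
            + all_err_all_correct(clique2, clique1))

# Two functions with independent error rates 0.1 and 0.2:
e = {frozenset(): 1.0, frozenset({0}): 0.1,
     frozenset({1}): 0.2, frozenset({0, 1}): 0.02}
```

For two functions, the one-clique value `p_sample((0, 1), (), e)` is 0.74, which is exactly their pairwise agreement rate, and the two-clique value `p_sample((0,), (1,), e)` is the disagreement probability 0.26.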
3. Experiments

We report the mean absolute deviation (MAD) between: true error rates (estimated from labeled data) and error rates estimated from unlabeled data.

NELL Data Set
Task: Predict whether a noun phrase (NP) belongs to a category (e.g. "city"): is "Pittsburgh" a city?

4 logistic regression classifiers using different features:
• ADJ: Adjectives that occur with the NP
• CMC: Orthographic features of the NP
• CPL: Phrases that occur with the NP (context of the noun phrase)
• VERB: Verbs that appear with the NP

15 categories (animal, beverage, bird, bodypart, city, disease, drug, fish, food, fruit, muscle, person, protein, river, vegetable), each with roughly 18,826 to 21,840 examples.

MAD (×10⁻²) between true and estimated error rates:

            All Data Samples           50 Data Samples
        Ind.    Pair.   All        Ind.    Pair.   All
AR      0.49    2.77    1.54       0.82    20.06   13.11
MLE     0.31    2.19    1.30       0.39    19.96   15.17
MAP     0.29    1.84    1.08       0.40    15.42   11.14

[Bar charts: true error rates (estimated from labeled data) vs. error rates estimated from unlabeled data, for the ADJ, CMC, CPL, and VERB classifiers.]
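The MAD reported in these tables can be computed as follows (a sketch; the example error rates are made up):

```python
import numpy as np

def mad(true_rates, estimated_rates):
    """Mean absolute deviation between error rates estimated from
    labeled data and error rates estimated from unlabeled data."""
    return float(np.mean(np.abs(np.asarray(true_rates)
                                - np.asarray(estimated_rates))))

# 4 hypothetical classifiers:
deviation = mad([0.10, 0.20, 0.15, 0.30], [0.12, 0.18, 0.15, 0.26])  # 0.02
```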
Brain Data Set
Task: Find which of two 40 second long story passages (Passage #1 or Passage #2) corresponds to an unlabeled 40 second time series of fMRI neural activity: "Which passage corresponds to this fMRI recording?"

8 classifiers using different text representations:
• Number of letters in each word:
  "… They were hoping for a reason to fight Malfoy …" → "… 4 4 6 3 1 6 2 5 6 …"
  "… Harry had heard Fred and George Weasley complain …" → "… 5 3 5 4 3 6 7 8 …"
• Parts of speech:
  "… They were hoping for a reason to fight Malfoy …" → "… PRP VBD VBG IN DT NN TO VB NNP …"
  "… Harry had heard Fred and George Weasley complain …" → "… NNP VBD VBN NNP CC NNP NNP VB …"
• …

1,000 labeled samples for 11 brain regions.

MAD (×10⁻²) between true and estimated error rates:

            4 Classifiers              8 Classifiers
        Ind.    Pair.   All        Ind.    Pair.   All
AR      10.97   10.60   9.61       4.36    32.02   27.95
MLE     6.60    8.34    18.19      4.14    12.33   18.60
MAP     6.50    7.64    11.16      2.01    4.50    7.26

For comparison, 3 functions under the independence assumption achieve a MAD of 2.82×10⁻²; for N functions without that assumption, see the tables above.

[Plots: true error rates (estimated from labeled data) vs. error rates estimated from unlabeled data, for classifiers 1 to 8 in brain regions #1, #2, and #3, for the AR, MLE, and MAP methods.]

Using only the pairwise agreement rates equations instead of the rest of the agreement rates equations runs 8 times faster for 8 classifiers and performs better!

[Table: MAD (×10⁻²) for 8, 10, and 15 classifiers, using only the pairwise agreement rates equations vs. all of the agreement rates equations, with values 4.40, 4.36, 4.06 and 1.90, 4.14, 2.01.]
Findings:
• The more functions we have, the better the estimates we obtain!
• High order sample agreement rates are often bad estimates of the actual agreement rates.