Estimating Accuracy from Unlabeled Data

Emmanouil Antonios Platanios, Avrim Blum, Tom Mitchell
[email protected], [email protected], [email protected]

1. Problem

There exists a binary function f that we do not know (e.g., is "Pittsburgh" a city?). Instead, we have a set of approximations \hat{f}_1, ..., \hat{f}_N to that function, and we want to know how accurate they are.

Using only unlabeled data we can measure consistency but not correctness:

• Consistency: all of the functions agree with each other, i.e., either all are right or all are wrong.
• Correctness: all of the functions are right.

Does consistency imply correctness? If yes, under what conditions?

2. Approach

Definitions. Let Y ∈ {0, 1} be the correct output label and D the data distribution.

• Error rate: the probability of disagreeing with the correct output label,

    e_A = P_{X~D}( \bigcap_{i∈A} [ \hat{f}_i(X) ≠ Y ] ),

  where E_i denotes the error event of \hat{f}_i and \bar{E}_i its complement.

• Agreement rate between \hat{f}_i and \hat{f}_j (both are right or both are wrong):

    a_{i,j} = P_D( E_i ∩ E_j ) + P_D( \bar{E}_i ∩ \bar{E}_j ) = 1 − e_i − e_j + 2 e_{i,j}.

• Given unlabeled input data X_1, ..., X_S, we observe the sample agreement rates:

    \hat{a}_A = (1/S) \sum_{s=1}^{S} I[ \hat{f}_i(X_s) = \hat{f}_j(X_s), ∀ i, j ∈ A : i ≠ j ].

Agreement rates (AR)

For 3 functions that make independent errors we have e_{i,j} = e_i e_j, so the pairwise agreement equations form a system of 3 equations in 3 unknowns with the closed-form solution

    \hat{e}_i = (1/2) ( 1 − c / (2 \hat{a}_{j,k} − 1) ),   i ∈ {1, 2, 3},  {j, k} = {1, 2, 3} \ {i},

where c = ± \sqrt{ \prod_{j<k} (2 \hat{a}_{j,k} − 1) } and the sign is chosen so that the error rates are below 1/2 (i.e., the functions are better than chance).

Independence is too strong an assumption, so we do not make it. But then we end up with more unknowns than equations. We instead solve a constrained optimization problem that minimizes the dependence between the error rates:

    minimize   c(e) = \sum_{i,j ∈ {1,...,N}} ( e_{i,j} − e_i e_j )²
    subject to the equality constraints   \hat{a}_{i,j} = 1 − e_i − e_j + 2 e_{i,j}
    and the inequality constraints        e_{i,j} ≤ min( e_i, e_j ).

For a bigger set of functions A, the agreement rate satisfies

    a_A = e_A + P_D( \bigcap_{i∈A} \bar{E}_i ) = e_A + 1 + \sum_{k=1}^{|A|} (−1)^k \sum_{I⊆A, |I|=k} e_I,

the objective becomes c(e) = \sum_{A : |A|≥2} ( e_A − \prod_{i∈A} e_i )², and the inequality constraints become e_A ≤ min_{i∈A} e_{A\i}.

Maximum likelihood (MLE)

Let \hat{Y}^s = ( \hat{f}_1(X_s), ..., \hat{f}_N(X_s) ) and let e contain all of the function error rates. The likelihood is

    L(e) = P_D( \hat{Y}^1, ..., \hat{Y}^S | e ) = \prod_{s=1}^{S} P_D( \hat{Y}^s | e ).

For each sample \hat{Y}^s we have two cases:

1. One clique: all of the functions agree. With C = {1, ..., N} containing all of the functions,

    P_D( \hat{Y}^s | e ) = e_C + 1 + \sum_{k=1}^{|C|} (−1)^k \sum_{I⊆C, |I|=k} e_I.

  The probabilities here are computed in the same way as for the agreement rates.

2. Two cliques: the functions can be split into two non-empty groups, those that output 1 (clique C_1) and those that output 0 (clique C_2). Either all functions in C_1 make an error and all functions in C_2 are correct, or the opposite. The two events are mutually exclusive, and so:

    P_D( \hat{Y}^s | e ) = e_{C_1} + \sum_{k=1}^{|C_2|} (−1)^k \sum_{I⊆C_2, |I|=k} e_{I∪C_1}
                         + e_{C_2} + \sum_{k=1}^{|C_1|} (−1)^k \sum_{I⊆C_1, |I|=k} e_{I∪C_2}.

Maximum a posteriori (MAP)

We add a prior that minimizes the dependence between the error rates, maximizing the regularized log-likelihood

    log L(e) − λ \sum_{A : |A|≥2} ( e_A − \prod_{i∈A} e_i )².

3. Experiments

We report the mean absolute deviation (MAD) between the true error rates (estimated from labeled data) and the error rates estimated from unlabeled data. "Ind.", "Pair.", and "All" denote estimation assuming independent errors, modeling pairwise error dependencies, and modeling all error dependencies, respectively. All MAD values are ×10⁻².

NELL Data Set

Task: predict whether a noun phrase (NP) belongs to a category (e.g., "city"). We use 4 logistic regression classifiers, each using different features:

• ADJ: adjectives that occur with the NP
• CMC: orthographic features of the NP
• CPL: phrases that occur with the NP
• VERB: verbs that appear with the NP

    Category    # Examples
    animal      20,733
    beverage    18,932
    bird        19,263
    bodypart    21,840
    city        21,778
    disease     21,827
    drug        20,452
    fish        19,162
    food        19,566
    fruit       18,911
    muscle      21,606
    person      21,700
    protein     21,811
    river       21,723
    vegetable   18,826

MAD, all data samples:

          Ind.   Pair.   All
    AR    0.49   2.77    1.54
    MLE   0.31   2.19    1.30
    MAP   0.29   1.84    1.08

MAD, 50 data samples:

          Ind.   Pair.   All
    AR    0.82   20.06   13.11
    MLE   0.39   19.96   15.17
    MAP   0.40   15.42   11.14

[Figure: true error rates (estimated from labeled data) vs. error rates estimated from unlabeled data for the ADJ, CMC, CPL, and VERB classifiers.]

Brain Data Set

Task: find which of two 40-second-long story passages corresponds to an unlabeled 40-second time series of fMRI neural activity ("Which passage corresponds to this fMRI recording?"). We use 1,000 labeled samples for 11 brain regions, and 8 classifiers that each use a different text representation of the passages:

• Number of letters in each word:
  "... Harry had heard Fred and George Weasley complain ..." → "... 5 3 5 4 3 6 7 8 ..."
  "... They were hoping for a reason to fight Malfoy ..." → "... 4 4 6 3 1 6 2 5 6 ..."
• Parts of speech:
  "... NNP VBD VBN NNP CC NNP NNP VB ..." and "... PRP VBD VBG IN DT NN TO VB NNP ..."
• ...

MAD, 4 classifiers:

          Ind.    Pair.   All
    AR    10.97   10.60   9.61
    MLE   6.60    8.34    18.19
    MAP   6.50    7.64    11.16

MAD, 8 classifiers:

          Ind.   Pair.   All
    AR    4.36   32.02   27.95
    MLE   4.14   12.33   18.60
    MAP   2.01   4.50    7.26

For 3 functions under the independence assumption, the MAD is 2.82 × 10⁻². The more functions we have, the better the estimates we obtain!

[Figure: estimated vs. true error rates (AR, MLE, MAP) for brain regions #1, #2, and #3.]

Pairwise vs. all agreement rates

High-order sample agreement rates are often bad estimates of the actual agreement rates. Using only the pairwise agreement rates equations runs 8 times faster for 8 classifiers and performs better!

MAD, 8 classifiers:

                                Ind.   Pair.   All
    Pairwise agreement rates    4.40   4.36    4.06
    All agreement rates         1.90   4.14    2.01

[Figure: pairwise agreement rates equations vs. the rest of the agreement rates equations, for 8, 10, and 15 classifiers.]
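The closed-form three-classifier solution can be sketched numerically. This is a minimal illustration, not the authors' code: the helper `estimate_errors_3` and the simulation setup are my own. It uses the identity (1 − 2e_j)(1 − 2e_k) = 2a_{j,k} − 1, which follows from a_{j,k} = 1 − e_j − e_k + 2 e_j e_k under independent errors.

```python
import itertools
import math
import random

def estimate_errors_3(outputs):
    """Closed-form error-rate estimates for 3 classifiers with (assumed)
    independent errors, from pairwise sample agreement rates.
    `outputs` is a list of (y1, y2, y3) prediction tuples."""
    S = len(outputs)
    a = {}
    for j, k in itertools.combinations(range(3), 2):
        a[(j, k)] = sum(o[j] == o[k] for o in outputs) / S
    e = []
    for i in range(3):
        j, k = [x for x in range(3) if x != i]
        # (1 - 2e_i)^2 = (2a_ij - 1)(2a_ik - 1) / (2a_jk - 1)
        num = (2 * a[tuple(sorted((i, j)))] - 1) * (2 * a[tuple(sorted((i, k)))] - 1)
        den = 2 * a[(j, k)] - 1
        # Sign chosen so that e_i < 1/2 (better-than-chance assumption).
        e.append(0.5 * (1 - math.sqrt(num / den)))
    return e

# Simulate three classifiers that make independent errors on binary labels.
random.seed(0)
true_e = [0.1, 0.2, 0.3]
outputs = []
for _ in range(200000):
    y = random.random() < 0.5
    outputs.append(tuple((not y) if random.random() < ei else y for ei in true_e))

est = estimate_errors_3(outputs)
```

With 200,000 unlabeled samples the recovered error rates land close to the true ones, using no labels at all.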
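The pairwise constrained optimization can be sketched in pure Python under simplifying assumptions: I substitute the equality constraint e_{i,j} = (a_{i,j} − 1 + e_i + e_j)/2 into the objective to eliminate the joint rates, drop the inequality constraints, and run plain gradient descent with box clipping. The function name `estimate_errors` and all hyperparameters are hypothetical choices of mine, not the paper's solver.

```python
from itertools import combinations

def estimate_errors(a_hat, n, lr=0.3, iters=100000):
    """Minimize c(e) = sum_{i<j} (e_ij - e_i e_j)^2 over individual rates,
    with e_ij eliminated via the agreement equation
    a_ij = 1 - e_i - e_j + 2 e_ij."""
    e = [0.25] * n  # start at the midpoint of the feasible range
    pairs = list(combinations(range(n), 2))
    for _ in range(iters):
        grad = [0.0] * n
        for i, j in pairs:
            # Residual of the dependence penalty for this pair.
            r = (a_hat[(i, j)] - 1 + e[i] + e[j]) / 2 - e[i] * e[j]
            grad[i] += 2 * r * (0.5 - e[j])
            grad[j] += 2 * r * (0.5 - e[i])
        # Clip to (0, 1/2): better-than-chance error rates.
        e = [min(0.499, max(0.001, ei - lr * g)) for ei, g in zip(e, grad)]
    return e

# Exact agreement rates for 4 classifiers with independent errors.
true_e = [0.1, 0.2, 0.3, 0.15]
a_hat = {(i, j): 1 - true_e[i] - true_e[j] + 2 * true_e[i] * true_e[j]
         for i, j in combinations(range(4), 2)}
est = estimate_errors(a_hat, 4)
```

Clipping the rates below 1/2 also rules out the mirrored solution e_i → 1 − e_i, which satisfies the same agreement equations.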
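The inclusion-exclusion expression for the agreement rate of a larger set A can be checked directly. In this sketch (helper names are mine) I instantiate the joint error rates under independence, e_I = ∏ e_i, and compare the formula against the direct probability that all functions agree, which for binary outputs is "all wrong or all right".

```python
from itertools import combinations
from math import prod

def agreement_rate(joint_e, A):
    """a_A = e_A + P(no function in A errs), with the second term expanded
    by inclusion-exclusion: 1 + sum_{k>=1} (-1)^k sum_{|I|=k, I in A} e_I.
    `joint_e` maps frozensets of indices to joint error rates e_I."""
    a = joint_e[frozenset(A)] + 1.0
    for k in range(1, len(A) + 1):
        for I in combinations(A, k):
            a += (-1) ** k * joint_e[frozenset(I)]
    return a

# Under independent errors, e_I is the product of individual error rates.
e = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.25}
joint_e = {frozenset(I): prod(e[i] for i in I)
           for k in range(1, 5) for I in combinations(range(4), k)}

A = (0, 1, 2, 3)
a_A = agreement_rate(joint_e, A)
# Direct computation: all agree iff all err or none errs.
direct = prod(e[i] for i in A) + prod(1 - e[i] for i in A)
```

The two values agree exactly, confirming the expansion term by term.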
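The one-clique and two-clique sample probabilities should sum to one over all possible agreement patterns, since the patterns partition the sample space: the one-clique case covers "all err" and "none errs", and each unordered split {C_1, C_2} covers the two mutually exclusive events where exactly one clique errs. A small sanity check, again with independent joint error rates and helper names of my own:

```python
from itertools import combinations
from math import prod

e_ind = {0: 0.1, 1: 0.2, 2: 0.3}
N = 3
C = tuple(range(N))

def e_joint(I):
    """Joint error rate e_I; independence assumed for this check."""
    return prod(e_ind[i] for i in I) if I else 1.0

def p_one_clique():
    # P(all agree) = e_C + 1 + sum_k (-1)^k sum_{|I|=k} e_I
    p = e_joint(C) + 1.0
    for k in range(1, N + 1):
        for I in combinations(C, k):
            p += (-1) ** k * e_joint(I)
    return p

def p_two_cliques(C1, C2):
    # P(all of C1 err, none of C2) + P(all of C2 err, none of C1)
    p = e_joint(C1) + e_joint(C2)
    for k in range(1, len(C2) + 1):
        for I in combinations(C2, k):
            p += (-1) ** k * e_joint(tuple(sorted(set(I) | set(C1))))
    for k in range(1, len(C1) + 1):
        for I in combinations(C1, k):
            p += (-1) ** k * e_joint(tuple(sorted(set(I) | set(C2))))
    return p

total = p_one_clique()
seen = set()
for k in range(1, N):
    for C1 in combinations(C, k):
        C2 = tuple(i for i in C if i not in C1)
        key = frozenset((frozenset(C1), frozenset(C2)))
        if key not in seen:  # each unordered split counted once
            seen.add(key)
            total += p_two_cliques(C1, C2)
```

The total comes out to exactly 1, so the per-sample likelihood is properly normalized over agreement patterns.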