Statistical Hypothesis Testing in Positive Unlabelled Data

Konstantinos Sechidis*, Borja Calvo**, Gavin Brown*
*School of Computer Science, University of Manchester, UK
**Department of Computer Science and Artificial Intelligence, University of the Basque Country, Spain

Scope of this work
We use hypothesis tests a lot: the G-test and the chi-squared test (the two are strongly related), and mutual information too. For example: Bayesian network structure learning (should there be an arc between nodes X and Y?), feature selection (should we select feature X?), and decision trees (should we split on feature X?).

Contributions of this work
This work explores the dynamics of hypothesis testing in "positive-unlabelled" data.
1. Can we perform a valid hypothesis test in this situation?
2. Sample size determination: how many examples do I need?
3. "Supervision determination": how many labelled examples do I need?

"Positive-Unlabelled" data
Fully labelled: every example has an observed class label Y. Semi-supervised: some labels are missing. Positive unlabelled: only a subset of the positive examples are labelled; everything else is unlabelled. Running example (fully labelled version):
X Y
2 1
3 1
1 1
2 1
1 1
3 0
1 0
3 0
3 0
2 0

PU data applications
Many, many applications. Functional genomics: "Gene 23 is associated with the disease." "The other ones… we don't know." Text mining.

Background: Hypothesis Tests
Step 1. Calculate the G-statistic.
Step 2. Compare it to a critical value, obtained from a lookup table.
If G > critical value: REJECT the null hypothesis, i.e. assume X and Y are related.
Otherwise: ACCEPT the null hypothesis, i.e. assume X and Y are independent.
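The two steps above can be sketched in pure Python on the poster's ten-example table. This is an illustrative re-implementation, not code from the paper; the critical value 5.991 is the 0.95 chi-squared quantile for df = (3−1)(2−1) = 2.

```python
import math
from collections import Counter

def g_statistic(pairs):
    """G = 2 * sum O * ln(O / E) over the non-empty cells of the contingency table."""
    n = len(pairs)
    joint = Counter(pairs)                       # observed cell counts O
    x_marg = Counter(x for x, _ in pairs)        # row marginals
    y_marg = Counter(y for _, y in pairs)        # column marginals
    # E = row_total * col_total / n, so O/E = O*n/(row*col)
    return 2.0 * sum(o * math.log(o * n / (x_marg[x] * y_marg[y]))
                     for (x, y), o in joint.items())

# The running example from the poster: 5 positives, 5 negatives.
data = [(2, 1), (3, 1), (1, 1), (2, 1), (1, 1),
        (3, 0), (1, 0), (3, 0), (3, 0), (2, 0)]

g = g_statistic(data)          # Step 1: calculate the G-statistic
critical = 5.991               # Step 2: chi-squared quantile, alpha = 0.05, df = 2
print(f"G = {g:.3f}")          # G = 1.726
print("reject null" if g > critical else "accept null (X, Y independent)")
```

On this tiny sample G is well below the critical value, so the test accepts independence.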
Hypothesis Testing in PU data?
Assume all unlabelled examples are negative? Define a "surrogate" class variable S: S = 1 for the labelled positive examples and S = 0 for everything else.
X Y S
2 1 1
3 1 1
1 1 1
2 1 0
1 1 0
3 0 0
1 0 0
3 0 0
3 0 0
2 0 0
Then, instead of the unobservable G(X;Y), use the test G(X;S).
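One way to see the "assume all unlabelled are negative" trick is to run the same G-test against the surrogate label S. A minimal sketch (the g_statistic helper is an illustrative re-implementation, not code from the paper):

```python
import math
from collections import Counter

def g_statistic(pairs):
    # G = 2 * sum O * ln(O / E) over the non-empty cells of the contingency table.
    n = len(pairs)
    joint = Counter(pairs)
    a = Counter(x for x, _ in pairs)
    b = Counter(y for _, y in pairs)
    return 2.0 * sum(o * math.log(o * n / (a[x] * b[y]))
                     for (x, y), o in joint.items())

# X, true class Y, surrogate S (only 3 of the 5 positives are labelled).
rows = [(2, 1, 1), (3, 1, 1), (1, 1, 1), (2, 1, 0), (1, 1, 0),
        (3, 0, 0), (1, 0, 0), (3, 0, 0), (3, 0, 0), (2, 0, 0)]

g_xy = g_statistic([(x, y) for x, y, _ in rows])  # unobservable in PU data
g_xs = g_statistic([(x, s) for x, _, s in rows])  # the test we can actually run
print(f"G(X;Y) = {g_xy:.3f}, G(X;S) = {g_xs:.3f}")
```

G(X;S) comes out much smaller than G(X;Y): the surrogate statistic keeps the same false positive behaviour in theory, but a real dependency is easier to miss.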
Theorem 1 implies…
Testing with the surrogate, G(X;S), will have an IDENTICAL FALSE POSITIVE rate to the original (unobservable) test G(X;Y). However, testing with G(X;S) will have a HIGHER FALSE NEGATIVE rate. That is, if X and Y are truly related, we may miss that dependency by using the surrogate test G(X;S). The technical explanation for this…

Next: sample size determination — how many examples do I need?

Sample Size Determination
I want my statistical test to detect an "effect" as small as I(X;Y) = 0.053, and to have a FALSE POSITIVE rate of 0.01 and a FALSE NEGATIVE rate of 0.01. How many examples (N) do I need? For the fully supervised test G(X;Y), the answer is N = 227.
Standard sample size determination does not apply to the surrogate test, so we derive a correction factor, kappa (κ), for the non-centrality parameter of the test: the G(X;S) test with sample size N/κ obtains identical FPR and FNR to the (unobservable) test G(X;Y). κ is a function of p(s=1), the probability of being labelled, and of the prior probability p(y=1).
PU sample size determination, assuming we know p(y=1): the fully supervised N = 227 becomes N = 1077 for the PU test.
PU sample size determination with uncertainty over p(y=1): place a Beta distribution over p(y=1) and use Monte Carlo simulation to determine the posterior over the sample size N; an analytical solution seems to be intractable. [Figure: belief over p(y=1) against the minimum N for the specified test, when p(s=1) = 0.05.]

Next: "supervision determination" — how many labelled examples do I need?
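The poster's numbers are reproducible with a correction factor of the form κ = p(s=1)(1−p(y=1)) / (p(y=1)(1−p(s=1))) and a normal approximation to the required noncentrality. Treat both closed forms as my reconstruction (they reproduce the 227 → 1077 example exactly), not as a quotation of the paper's derivation; the Beta(4, 16) prior is likewise an illustrative choice.

```python
import math
import random
from statistics import NormalDist

def kappa(p_s, p_y):
    # Reconstructed correction factor: the fraction by which the surrogate
    # test shrinks the noncentrality (kappa <= 1 whenever p_s <= p_y).
    return p_s * (1.0 - p_y) / (p_y * (1.0 - p_s))

def noncentrality(fpr, fnr):
    # Normal approximation to the noncentrality needed for the requested
    # false positive and false negative rates.
    z = NormalDist().inv_cdf
    return (z(1.0 - fpr / 2.0) + z(1.0 - fnr)) ** 2

def sample_size(effect, fpr, fnr, k=1.0):
    # For the G-test the noncentrality is 2*N*I(X;Y); the surrogate test only
    # achieves 2*N*kappa*I(X;Y), so N must grow by a factor of 1/kappa.
    return math.ceil(noncentrality(fpr, fnr) / (2.0 * k * effect))

n_full = sample_size(0.053, 0.01, 0.01)                    # fully supervised
n_pu = sample_size(0.053, 0.01, 0.01, kappa(0.05, 0.20))   # PU: p(s=1)=0.05, p(y=1)=0.20
print(n_full, n_pu)                                        # 227 1077, as on the poster

# Uncertainty over p(y=1): Beta prior + Monte Carlo, as the poster suggests.
random.seed(0)
draws = sorted(sample_size(0.053, 0.01, 0.01,
                           kappa(0.05, max(0.05, random.betavariate(4, 16))))
               for _ in range(10000))
n_conservative = draws[int(0.95 * len(draws))]  # N sufficient for 95% of prior draws
print(n_conservative)
```

The clamp max(0.05, ·) just keeps p(s=1) ≤ p(y=1), since we cannot label more positives than exist.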
PU "supervision determination"
Use a similar methodology to before, except that we fix N and solve for p(s=1). From a total of N = 1000 examples, you need only 57 labels; the rest can be unlabelled. If uncertain about p(y=1)… [figure not captured]

Example for Practitioner: supervision determination
Design an experiment to observe…
… a medium effect of I(X;Y) = 0.045 nats
… with False Positive Rate = 0.01
… with False Negative Rate = 0.05
… having N = 3000 examples.
Positive unlabelled, knowing the prior to be p(y=1) = 0.20: of the 3000 examples,
•  49 positives
•  2951 unlabelled

Example for Practitioner: sample size determination
Design an experiment to observe…
… a medium effect of I(X;Y) = 0.045 nats
… with False Positive Rate = 0.01
… with False Negative Rate = 0.20
Traditional, fully supervised: 130 labelled examples.
Positive unlabelled, with 5% labelling, and knowing the prior to be p(y=1) = 0.20: 617 examples
•  31 positives
•  586 unlabelled

Conclusions
1. We can perform a valid hypothesis test in PU data by treating the unlabelled examples as negatives.
2. Sample size determination: how many examples we need to collect.
3. "Supervision determination": how many labelled examples we need to collect.

Thank you for your attention!
www.cs.man.ac.uk/~gbrown/posunlabelled/
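Both practitioner examples can be checked numerically with the same reconstruction. Again, the closed-form κ and the normal-approximation noncentrality are my hedged reading of the poster's numbers rather than the paper's exact formulas; they reproduce 130 → 617 (with 31 positives) and the 49 labels for N = 3000.

```python
import math
from statistics import NormalDist

def kappa(p_s, p_y):
    # Reconstructed correction factor for the surrogate test's noncentrality.
    return p_s * (1.0 - p_y) / (p_y * (1.0 - p_s))

def noncentrality(fpr, fnr):
    # Normal approximation to the noncentrality needed for the given FPR/FNR.
    z = NormalDist().inv_cdf
    return (z(1.0 - fpr / 2.0) + z(1.0 - fnr)) ** 2

# Sample size determination: I(X;Y) = 0.045 nats, FPR = 0.01, FNR = 0.20.
lam = noncentrality(0.01, 0.20)
n_full = math.ceil(lam / (2 * 0.045))                     # fully supervised
n_pu = math.ceil(lam / (2 * kappa(0.05, 0.20) * 0.045))   # PU, 5% labelling
print(n_full, n_pu, round(n_pu * 0.05))                   # 130 617 31, as on the poster

# Supervision determination: fix N = 3000, FNR = 0.05, and solve for p(s=1)
# by inverting kappa: p_s = k*p_y / (1 - p_y + k*p_y).
lam2 = noncentrality(0.01, 0.05)
k_needed = lam2 / (2 * 3000 * 0.045)    # the kappa this design can afford
p_y = 0.20
p_s = k_needed * p_y / (1 - p_y + k_needed * p_y)
print(math.ceil(3000 * p_s))            # 49 labelled positives suffice
```

The inversion step is just algebra on the κ definition; the rest of the 3000 examples can stay unlabelled.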