Learning Classifiers For Non-IID Data
Balaji Krishnapuram,
Computer-Aided Diagnosis and Therapy
Siemens Medical Solutions, Inc.
Collaborators:
Volkan Vural, Jennifer Dy [Northeastern], Ya Xue [Duke],
Murat Dundar, Glenn Fung, Bharat Rao [Siemens]
Jun 27, 2006
Outline
Implicit IID assumption in traditional classifier design
Often not valid in real life: motivating CAD problems
Convex algorithms for Multiple Instance Learning (MIL)
Bayesian algorithms for Batch-wise classification
Faster, approximate algorithms via mathematical programming
Summary / Conclusions
IID assumption in classifier design
Training data D = {(xi, yi)}i=1..N : xi ∈ R^d, yi ∈ {+1, −1}
Testing data T = {(xi, yi)}i=1..M : xi ∈ R^d, yi ∈ {+1, −1}
Assume each training/testing sample is drawn
independently from an identical distribution:
(xi, yi) ~ P_XY(x, y)
This is why we can classify one test sample at a
time, ignoring the features of the other test samples
E.g. logistic regression: P(yi = 1 | xi, w) = 1/(1 + exp(−w^T xi))
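To make the independence concrete, here is a minimal sketch (hypothetical weights and data) of per-sample logistic-regression prediction: each output depends only on that sample's own features.

```python
import numpy as np

def predict_iid(w, X):
    """P(y_i = 1 | x_i, w) = 1 / (1 + exp(-w^T x_i)), one sample at a time."""
    return 1.0 / (1.0 + np.exp(-X @ w))  # row i uses only x_i, never x_j

# Hypothetical usage: a 3-feature classifier scoring 4 unrelated test samples.
w = np.array([0.5, -1.0, 0.2])
X = np.random.randn(4, 3)
print(predict_iid(w, X))  # shuffling the rows of X just shuffles the outputs
```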
Evaluating classifiers: learning theory
Binomial test set bounds: with high probability over the
random draw of the M samples in the testing set T,
if M is large and a classifier w is observed to be accurate on T,
then its expected accuracy over a random draw
of a sample from P_XY(x, y) will be high
If the IID assumption fails, all bets are off!
Thought experiment: repeat the same test sample M times; the observations are no longer independent, and the bound no longer applies
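A binomial test set bound of this kind is easy to compute by inverting the binomial tail; the sketch below (the function name and bisection scheme are my own choices, not from the talk) finds the largest true error rate still consistent, at confidence 1 − δ, with observing k errors on M IID test samples.

```python
from scipy.stats import binom

def binomial_test_set_bound(k, M, delta=0.05):
    """Largest error rate p with P(<= k errors out of M | p) >= delta."""
    lo, hi = k / M, 1.0
    for _ in range(50):  # bisection: binom.cdf(k, M, p) decreases in p
        mid = (lo + hi) / 2
        if binom.cdf(k, M, mid) >= delta:
            lo = mid  # k or fewer errors is still plausible at rate mid
        else:
            hi = mid
    return hi

# e.g. 5 errors on 1000 IID samples -> roughly a 1% error-rate upper bound
print(binomial_test_set_bound(5, 1000))
```

Repeating one test sample M times makes this binomial likelihood the wrong model entirely, which is exactly why the bound collapses without IID.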
Training classifiers: learning theory
With high probability over the random draw of the N samples in the
training set D, the expected accuracy on a random sample
from P_XY(x, y) for the learnt classifier w will be high if it is
accurate on the training set D, and N is large
and it satisfies intuitions held before seeing data ("prior", large margin, etc.)
PAC-Bayes, VC theory, etc. rely on the IID assumption
Relaxation to exchangeability is being explored
CAD: Correlations among candidate ROI
Hierarchical Correlation Among Samples
Additive Random Effect Models
Classification is treated as IID, but only conditionally, given both:
fixed effects (unique to each sample)
random effects (shared among samples)
A simple additive model explains the correlations:
P(yi | xi, w, ri, v) = 1/(1 + exp(−w^T xi − v^T ri))
P(yi | xi, w, ri) = ∫ P(yi | xi, w, ri, v) p(v|D) dv
Sharing v^T ri among many samples ⇒ correlated predictions
…but only small improvements in real-life applications
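A minimal sketch of this prediction rule, approximating the integral over v by Monte Carlo; the shapes are hypothetical, and the samples from p(v|D) are assumed given (in practice they would come from training).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict_random_effects(w, X, R, v_samples):
    """Monte Carlo estimate of P(y_i | x_i, w, r_i) = int P(y_i | ..., v) p(v|D) dv."""
    # the same draw of v enters every row, which is what couples the predictions
    return np.mean([sigmoid(X @ w + R @ v) for v in v_samples], axis=0)

# Hypothetical: 4 correlated samples, 3 fixed-effect and 2 random-effect features
w = np.array([0.5, -1.0, 0.2])
X = np.random.randn(4, 3)          # fixed-effect features x_i
R = np.random.randn(4, 2)          # random-effect features r_i
v_samples = [np.random.randn(2) for _ in range(200)]  # stand-in for p(v|D)
print(predict_random_effects(w, X, R, v_samples))
```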
CAD detects early stage colon cancer
Candidate-Specific Random Effects Model: Polyps
[Figure: ROC curve, sensitivity vs. specificity]
CAD algorithms: domain-specific issues
Multiple (correlated) views: one detection is
sufficient
Systemic treatment of diseases: e.g. detecting
one pulmonary embolism (PE) is sufficient
Modeling the data acquisition mechanism
Errors in guessing class labels for the training set
The Multiple Instance Learning Problem
A bag is a collection of many instances (samples)
The class label is provided for bags, not instances
A positive bag has at least one positive instance in it
Examples of “bag” definition for CAD applications:
Bag=samples from multiple views, for the same region
Bag=all candidates referring to same underlying structure
Bag=all candidates from a patient
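A small sketch of the scoring rule this definition suggests: since one positive instance makes the bag positive, the bag-level probability can be taken as the max, or the noisy-OR, of instance-level probabilities (both are common choices; the talk's CH-MIL algorithm uses a different, convex-hull construction described next).

```python
import numpy as np

def bag_probability(instance_probs, rule="max"):
    """Bag is positive iff at least one instance is positive."""
    p = np.asarray(instance_probs)
    if rule == "max":
        return p.max()                 # one confident candidate flags the bag
    return 1.0 - np.prod(1.0 - p)      # noisy-OR: P(at least one positive)

# e.g. three candidates from the same polyp: one strong detection suffices
print(bag_probability([0.1, 0.05, 0.9]))        # -> 0.9
print(bag_probability([0.1, 0.05, 0.9], "or"))  # -> ~0.91
```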
CH-MIL Algorithm: 2-D illustration
CH-MIL Algorithm for Fisher’s Discriminant
Easy implementation via Alternating Optimization
Scales well to very large datasets
Convex problem with a unique optimum (a simplified sketch of the alternation follows)
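The sketch below is a deliberately simplified stand-in for CH-MIL's alternation, with two stated substitutions: a least-squares discriminant replaces Fisher's discriminant, and a crude convex-combination update replaces the exact hull-weight step. It only illustrates the structure, representing each positive bag by a point in the convex hull of its instances and alternating the two sub-problems.

```python
import numpy as np

def ch_mil_sketch(pos_bags, X_neg, n_iters=20, step=0.1):
    # start each positive bag at the centroid of its convex hull
    lambdas = [np.ones(len(B)) / len(B) for B in pos_bags]
    for _ in range(n_iters):
        # (a) fix hull weights: one representative point per positive bag
        reps = np.array([lam @ B for lam, B in zip(lambdas, pos_bags)])
        X = np.vstack([reps, X_neg])
        y = np.hstack([np.ones(len(reps)), -np.ones(len(X_neg))])
        w = np.linalg.lstsq(X, y, rcond=None)[0]  # stand-in for Fisher step
        # (b) fix w: shift each bag's weights toward its best-scoring instance
        for i, B in enumerate(pos_bags):
            target = np.zeros(len(B))
            target[np.argmax(B @ w)] = 1.0
            lambdas[i] = (1 - step) * lambdas[i] + step * target  # stays in simplex
    return w

# hypothetical toy data: two positive bags and a pool of negative candidates
rng = np.random.default_rng(0)
pos_bags = [rng.normal([2, 2], 1.0, (3, 2)), rng.normal([2, 2], 1.0, (4, 2))]
X_neg = rng.normal([-1, -1], 1.0, (10, 2))
print(ch_mil_sketch(pos_bags, X_neg))
```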
Lung CAD
[Figure: Computed Tomography, lung nodules & pulmonary emboli; AX DR CAD (*pending FDA approval)]
CH-MIL: Pulmonary Embolisms
CH-MIL: Polyps in Colon
Classifying a Correlated Batch of Samples
Let the classification of individual samples xi be based on ui
E.g. linear: ui = w^T xi; or kernel predictor: ui = Σj=1..N αj k(xi, xj)
Instead of basing the classification on ui, we will base it on an
unobserved (latent) random variable zi
Prior: even before observing any features xi (and thus before ui),
the zi are known to be correlated a priori:
p(z) = N(z | 0, Σ)
E.g. due to spatial adjacency: Σ = exp(−D),
where D is the matrix of pairwise distances between samples
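A minimal sketch of building this prior covariance from candidate locations; the coordinates and the unit length-scale are hypothetical (the exact distance-to-covariance map used in the talk may differ).

```python
import numpy as np
from scipy.spatial.distance import cdist

def adjacency_covariance(coords):
    """Sigma_ij = exp(-D_ij): nearby candidates get strongly correlated z's."""
    D = cdist(coords, coords)  # pairwise Euclidean distances
    return np.exp(-D)          # exponential kernel, positive definite, Sigma_ii = 1

coords = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])  # two nearby, one far
print(adjacency_covariance(coords))  # ~0.90 between neighbours, ~0 to the outlier
```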
Likelihood: treat ui as a noisy observation of the latent variable zi:
p(ui | zi) = N(ui | zi, σ²)
Posterior: remains correlated, even after observing the features xi:
p(z | u) = N(z | (σ²Σ⁻¹ + I)⁻¹u, (Σ⁻¹ + σ⁻²I)⁻¹)
Intuition: E[zi] = Σj=1..N Aij uj, where A = (σ²Σ⁻¹ + I)⁻¹
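A small sketch of this posterior smoothing, assuming Σ and the raw scores u are given; the toy numbers are hypothetical, and the correlated candidates visibly pull each other's scores together.

```python
import numpy as np

def posterior_mean(u, Sigma, sigma2=1.0):
    """E[z] = (sigma^2 Sigma^{-1} + I)^{-1} u, i.e. E[z_i] = sum_j A_ij u_j."""
    n = len(u)
    A = np.linalg.inv(sigma2 * np.linalg.inv(Sigma) + np.eye(n))
    return A @ u

# three candidates: the first two are spatially adjacent, the third is isolated
Sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
u = np.array([2.0, -0.5, 1.0])   # raw per-candidate scores u_i = w^T x_i
print(posterior_mean(u, Sigma))  # first two move toward each other; third just shrinks
```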
SVM-like Approximate Algorithm
Intuition: classify using E[zi] = Σj=1..N Aij uj, with A = (σ²Σ⁻¹ + I)⁻¹
What if we used A = (Σ + I) instead? (contrasted with the exact smoother in the sketch below)
Reduces computation by avoiding matrix inversion
Not principled, but a heuristic for speed
Yields an SVM-like mathematical programming formulation
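A toy sketch contrasting the exact smoother with the inversion-free heuristic; the function names and numbers are illustrative, and the SVM-like program itself is not reconstructed here.

```python
import numpy as np

def smooth_exact(u, Sigma, sigma2=1.0):
    n = len(u)
    return np.linalg.inv(sigma2 * np.linalg.inv(Sigma) + np.eye(n)) @ u

def smooth_heuristic(u, Sigma):
    # A = Sigma + I: no inversion, but still mixes in correlated neighbours
    return (Sigma + np.eye(len(u))) @ u

Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])  # two strongly correlated candidates
u = np.array([2.0, -0.5])
print(smooth_exact(u, Sigma))      # principled posterior mean
print(smooth_heuristic(u, Sigma))  # cheaper, same qualitative coupling
```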
Detecting Polyps in Colon
Detecting Pulmonary Embolisms
Detecting Nodules in the Lung
Conclusions
The IID assumption is nearly universal in ML
Often violated in real life, but usually ignored
Explicit modeling of the correlations can substantially improve accuracy
This talk described 3 models, utilizing increasing levels of
correlation information
Additive random effects models: weak correlation information
Multiple instance learning: stronger correlations enforced
Batch-wise classification models: explicit correlation information
Statistically significant improvements in accuracy
Only starting to scratch the surface; lots to improve!