
Comprehensive Introduction to the Evaluation of Neural
Networks and other Computational Intelligence Decision
Functions: Receiver Operating Characteristic, Jackknife,
Bootstrap and other Statistical Methodologies
David G. Brown and Frank Samuelson
Center for Devices and Radiological Health, FDA
6 July 2014
Course Outline
I. Performance measures for Computational Intelligence (CI) observers
   1. Accuracy
   2. Prevalence dependent measures
   3. Prevalence independent measures
   4. Maximization of performance: Utility analysis/Cost functions
II. Receiver Operating Characteristic (ROC) analysis
   1. Sensitivity and specificity
   2. Construction of the ROC curve
   3. Area under the ROC curve (AUC)
III. Error analysis for CI observers
   1. Sources of error
   2. Parametric methods
   3. Nonparametric methods
   4. Standard deviations and confidence intervals
IV. Bootstrap methods
   1. Theoretical foundation
   2. Practical use
V. References
What’s the problem?
• Emphasis on algorithm innovation to exclusion
of performance assessment
• Use of subjective measures of performance –
“beauty contest”
• Use of “accuracy” as a measure of success
• Lack of error bars—My CIO is .01 better than
yours (+/- ?)
• Flawed methodology—training and testing on
same data
• Lack of appreciation for the many different
sources of error that can be taken into account
[Figure: "Original image" (Lena) and "CI improved image" (Baboon), courtesy of the Signal and Image Processing Institute at the University of Southern California; "Panel of experts" cartoon from funnymonkeysite.com]
I. Performance measures for computational
intelligence (CI) observers
• Task based: (binary) discrimination task
  – Two populations involved: "normal" and "abnormal"
  – Different consequences for success or failure in each population
• Accuracy – intuitive but incomplete
• Some measures depend on the prevalence, Pr = (number of abnormals in population) / (total number of subjects in population), and some do not
  – Prevalence dependent: accuracy, positive predictive value, negative predictive value
  – Prevalence independent: sensitivity, specificity, ROC, AUC
• True optimization of performance requires knowledge of cost functions or utilities for successes and failures in both populations
How to make a CIO with >99% accuracy
• Medical problem: Screening mammography
(“screening” means testing in an asymptomatic
population)
• Prevalence of breast cancer in the screening
population Pr = 0.5 %
• My CIO always says “normal”
• Accuracy (Acc) is 99.5% (accuracy of accepted
present-day systems ~75%)
• Accuracy in a diagnostic setting (Pr~20%) is
80% -- Acc=1-Pr (for my CIO)
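As a quick check of this arithmetic, here is a minimal Python sketch (my own illustration, not part of the course material) of the always-"normal" CIO's accuracy:

import numpy as np  # not strictly needed here; kept for consistency with later sketches

# Accuracy = Pr*TPF + (1-Pr)*TNF; the always-"normal" CIO has TPF = 0 and TNF = 1,
# so its accuracy is simply 1 - Pr, whatever the prevalence.
def always_normal_accuracy(prevalence: float) -> float:
    tpf, tnf = 0.0, 1.0
    return prevalence * tpf + (1.0 - prevalence) * tnf

print(always_normal_accuracy(0.005))  # screening prevalence: 0.995
print(always_normal_accuracy(0.20))   # diagnostic prevalence: 0.80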
CIO operates on two different populations
[Figure: probability distributions of the CIO output t for normal cases, p(t|0), and abnormal cases, p(t|1), along the t-axis, with decision threshold t = T]
Must consider effects on normal and
abnormal populations separately
• CIO output t
• p(t|0): probability distribution of t for the population of normals
• p(t|1): probability distribution of t for the population of abnormals
• Threshold T: everything to the right of T is called abnormal, and everything to the left of T is called normal
• The area of p(t|0) to the left of T is the true negative fraction (TNF = specificity), and to the right the false positive fraction (FPF = type I error): TNF + FPF = 1
• The area of p(t|1) to the left of T is the false negative fraction (FNF = type II error), and to the right the true positive fraction (TPF = sensitivity): FNF + TPF = 1
• TNF, FPF, FNF, TPF are all prevalence independent, since each is some fraction of one of our two probability distributions
• {Accuracy = Pr x TPF + (1 - Pr) x TNF}
[Figure: example operating point. For the normal cases the threshold T splits p(t|0) into TNF = 0.5 and FPF = 0.5; for the abnormal cases it splits p(t|1) into TPF = 0.95 and FNF = 0.05]
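These four fractions can be estimated directly from sampled CIO outputs. A minimal Python sketch (my own illustration; the function and variable names are not from the course):

import numpy as np

# Estimate the four decision fractions at a fixed threshold T.
# Following the slides, scores above T are called "abnormal".
def decision_fractions(t_normal, t_abnormal, T):
    t_normal = np.asarray(t_normal)
    t_abnormal = np.asarray(t_abnormal)
    fpf = np.mean(t_normal > T)    # fraction of normals called abnormal
    tnf = 1.0 - fpf                # specificity
    tpf = np.mean(t_abnormal > T)  # sensitivity
    fnf = 1.0 - tpf
    return tpf, fpf, tnf, fnf

# Toy data roughly matching the slide's operating point (TNF ~ 0.5, TPF ~ 0.95):
rng = np.random.default_rng(0)
print(decision_fractions(rng.normal(0.0, 1.0, 10000),
                         rng.normal(1.65, 1.0, 10000), T=0.0))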
Prevalence dependent measures
• Accuracy (Acc): Acc = Pr × TPF + (1 − Pr) × TNF
• Positive predictive value (PPV): fraction of positives that are true positives
  PPV = TPF × Pr / (TPF × Pr + FPF × (1 − Pr))
• Negative predictive value (NPV): fraction of negatives that are true negatives
  NPV = TNF × (1 − Pr) / (TNF × (1 − Pr) + FNF × Pr)
• Using Pr = .05 and the previous TPF, TNF, FNF, FPF values: TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5
  Acc = .05 × .95 + .95 × .5 ≈ .52
  PPV = .95 × .05 / (.95 × .05 + .5 × .95) ≈ .09
  NPV = .5 × .95 / (.5 × .95 + .05 × .05) ≈ .995
Prevalence dependent measures
• Accuracy (Acc): Acc = Pr × TPF + (1 − Pr) × TNF
• Positive predictive value (PPV): fraction of positives that are true positives
  PPV = TPF × Pr / (TPF × Pr + FPF × (1 − Pr))
• Negative predictive value (NPV): fraction of negatives that are true negatives
  NPV = TNF × (1 − Pr) / (TNF × (1 − Pr) + FNF × Pr)
• Using the mammography screening Pr and the previous TPF, TNF, FNF, FPF values: Pr = .005, TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5
  Acc = .005 × .95 + .995 × .5 ≈ .50
  PPV = .95 × .005 / (.95 × .005 + .5 × .995) ≈ .01
  NPV = .5 × .995 / (.5 × .995 + .05 × .005) ≈ .9995
Acc, PPV, NPV as functions of prevalence (screening mammography)
[Figure: Acc, PPV, and NPV plotted as functions of prevalence for the fixed operating point TPF = .95, FNF = .05, TNF = 0.5, FPF = 0.5]

Acc = NPV as a function of prevalence (forced "normal" response CIO)
[Figure: Acc (= NPV = 1 − Pr) as a function of prevalence for the CIO that always responds "normal"]
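A small Python sketch (my own illustration) of the three prevalence dependent measures as functions of prevalence, using the fixed operating point from the slides:

# Prevalence dependent measures for a fixed operating point (TPF = .95, TNF = .50).
def prevalence_dependent(pr, tpf=0.95, tnf=0.50):
    fpf, fnf = 1.0 - tnf, 1.0 - tpf
    acc = pr * tpf + (1.0 - pr) * tnf
    ppv = tpf * pr / (tpf * pr + fpf * (1.0 - pr))
    npv = tnf * (1.0 - pr) / (tnf * (1.0 - pr) + fnf * pr)
    return acc, ppv, npv

print(prevalence_dependent(0.05))    # higher prevalence
print(prevalence_dependent(0.005))   # screening prevalence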
Prevalence independent measures
• Sensitivity = TPF
• Specificity = TNF (= 1 − FPF)
• Receiver Operating Characteristic (ROC)
= TPF as a function of FPF (Sensitivity as
a function of 1 – Specificity)
• Area under the ROC curve (AUC)
= Sensitivity averaged over all values of
Specificity
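Since the ROC curve is just TPF as a function of FPF under a sweeping threshold, it can be estimated empirically from two score samples. A minimal Python sketch (my own construction; trapezoidal integration for the area is one of several reasonable choices):

import numpy as np

# Empirical ROC: sweep the threshold over every observed score and record
# (FPF, TPF) at each setting.
def empirical_roc(t_normal, t_abnormal):
    thresholds = np.unique(np.concatenate([t_normal, t_abnormal]))[::-1]
    fpf = [np.mean(t_normal > T) for T in thresholds]
    tpf = [np.mean(t_abnormal > T) for T in thresholds]
    # prepend (0,0) and append (1,1) so the curve spans the full range
    return np.array([0.0] + fpf + [1.0]), np.array([0.0] + tpf + [1.0])

rng = np.random.default_rng(1)
fpf, tpf = empirical_roc(rng.normal(0.0, 1.0, 200), rng.normal(1.5, 1.0, 200))
auc = np.trapz(tpf, fpf)   # area under the empirical ROC curve
print(auc)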
[Figure: normal / class 0 and abnormal / class 1 output distributions with the threshold, and the resulting entire ROC curve of TPF (sensitivity) versus FPF (1 − specificity), with the ROC slope indicated]
Empirical ROC data for mammography screening in the US
[Figure: empirical operating points of true positive fraction versus false positive fraction (equivalently, false negative fraction versus true negative fraction) for mammography screening in the US; data from Craig Beam et al.]
Maximization of performance
• Need to know utilities or costs of each type of decision outcome –
but these are very hard to estimate accurately. You don’t just
maximize accuracy.
• Need prevalence
• For the mammography example:
  – TPF: prolongation of life minus treatment cost
  – FPF: diagnostic work-up cost, anxiety
  – TNF: peace of mind
  – FNF: delay in treatment => shortened life
• Hypothetical assignment of utilities for some decision threshold T:
  – Utility_T = U(TPF) × TPF × Pr + U(FPF) × FPF × (1 − Pr) + U(TNF) × TNF × (1 − Pr) + U(FNF) × FNF × Pr
  – U(TPF) = 100, U(FPF) = −10, U(TNF) = 4, U(FNF) = −20
  – Utility_T = 100 × .95 × .05 − 10 × .50 × .95 + 4 × .50 × .95 − 20 × .05 × .05 = 1.85
• Now if we only knew how to trade off TPF versus FPF, we could
optimize (?) medical performance.
Utility maximization (mammography example)
[Figure: choice of ROC operating point through utility analysis for screening mammography]
Utility maximization calculation
u = (U_TPF · TPF + U_FNF · FNF) · Pr + (U_TNF · TNF + U_FPF · FPF) · (1 − Pr)
  = (U_TPF · TPF + U_FNF · (1 − TPF)) · Pr + (U_TNF · (1 − FPF) + U_FPF · FPF) · (1 − Pr)
du/dFPF = (U_FPF − U_TNF)(1 − Pr) + (U_TPF − U_FNF) · Pr · dTPF/dFPF = 0
  ⇒ dTPF/dFPF = (U_TNF − U_FPF)(1 − Pr) / [(U_TPF − U_FNF) · Pr]
Pr = .005 ⇒ dTPF/dFPF ≈ 23
Pr = .05 ⇒ dTPF/dFPF ≈ 2.2
(U_TPF = 100, U_FNF = −20, U_TNF = 4, U_FPF = −10)
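A short Python sketch (my own; utility values taken from the slides, with U_FPF = −10) of the expected-utility calculation and the utility-maximizing ROC slope:

# Expected utility of an operating point, and the optimal ROC slope dTPF/dFPF.
def expected_utility(tpf, fpf, pr, u_tpf=100, u_fpf=-10, u_tnf=4, u_fnf=-20):
    tnf, fnf = 1.0 - fpf, 1.0 - tpf
    return (u_tpf * tpf + u_fnf * fnf) * pr + (u_tnf * tnf + u_fpf * fpf) * (1.0 - pr)

def optimal_roc_slope(pr, u_tpf=100, u_fpf=-10, u_tnf=4, u_fnf=-20):
    return (u_tnf - u_fpf) * (1.0 - pr) / ((u_tpf - u_fnf) * pr)

print(expected_utility(tpf=0.95, fpf=0.50, pr=0.05))  # 1.85, as on the slide
print(optimal_roc_slope(pr=0.005))                    # ~23
print(optimal_roc_slope(pr=0.05))                     # ~2.2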
[Figure: normal and abnormal case distributions with the decision threshold, and the corresponding entire ROC curve of TPF (sensitivity) versus FPF (1 − specificity), with the ROC slope indicated]
Estimators
• TPF, FPF, TNF, FNF, Accuracy, the ROC
curve, and AUC are all fractions or
probabilities.
• Normally we have a finite sample of
subjects on which to test our CIO. From
this finite sample we try to estimate the
above fractions
– These estimates will vary depending upon the
sample selected (statistical variation).
– Estimates can be nonparametric or
parametric
Estimators
• TPF (population) = Number of abnormals that would be selected by the CIO in the population / Number of abnormals in the population
• TPF (sample estimate) = Number of abnormals that were selected by the CIO in the sample / Number of abnormals in the sample
• Number in sample << Number in population (at least in theory)
II. Receiver Operating
Characteristic (ROC)
• Receiver Operating Characteristic
• Binary Classification
• Test result is compared to a threshold
Distribution of CIO Output for all Subjects
[Figure: distribution of the computational intelligence observer output t for all subjects, with a decision threshold on the t-axis]
[Figure: the same output separated into the distribution for normal / class 0 subjects, p(t|0), and the distribution for abnormal / class 1 subjects, p(t|1), with the threshold]
[Figure: the area of p(t|0) to the left of the threshold is the specificity = true negative fraction (TNF); the area of p(t|1) to the right of the threshold is the sensitivity = true positive fraction (TPF)]
[Figure: the same two distributions annotated with 1 − specificity = false positive fraction (FPF) and 1 − sensitivity = false negative fraction (FNF). Decisions D0 ("normal") and D1 ("abnormal") versus truth H0 (normal) and H1 (abnormal):]

                        Decision D0    Decision D1
  Truth H0 (normal)     TNF = 0.50     FPF = 0.50
  Truth H1 (abnormal)   FNF = 0.05     TPF = 0.95
[Figure: each threshold setting maps to one operating point on the plot of TPF (sensitivity) versus FPF (1 − specificity): a low threshold gives high sensitivity, an intermediate threshold gives sensitivity ≈ specificity, and a high threshold gives high specificity]
Which CIO is best?
[Figure: three CIO operating points on the TPF (sensitivity) versus FPF (1 − specificity) plane]

            TPF     FPF
  CIO #1    0.50    0.07
  CIO #2    0.78    0.22
  CIO #3    0.93    0.50

Do not compare rates of one class, e.g. TPF, at different rates of the other class (FPF).
Entire ROC curve
[Figure: sweeping the threshold over all values traces the entire ROC curve of TPF (sensitivity) versus FPF (1 − specificity)]
[Figure: ROC curves with AUC = 0.98, AUC = 0.85, and AUC = 0.5 (chance); the area under the curve summarizes discriminability, i.e., CIO performance]
AUC (Area under ROC Curve)
• AUC is a separation probability
• AUC = probability that
– CIO output for abnormal > CIO output for normal
– CIO correctly tells which of 2 subjects is normal
• Estimating AUC from a finite sample:
  – Select an abnormal subject score x_i
  – Select a normal subject score y_k
  – Is x_i > y_k?
  – Average over all x, y:

$$\widehat{AUC} = \frac{1}{n_1 n_0}\sum_{i=1}^{n_1}\sum_{k=1}^{n_0} I(x_i > y_k)$$

(A code sketch of this estimator follows the next slide.)
ROC as a Q-Q plot
• ROC plots in
probability space
• ROC plots in quantile
space
Linear Likelihood Ratio Observer
for Gaussian Data
• When the input features of the data are
distributed as Gaussians with equal variance,
– The optimal discriminant, the log-likelihood ratio, is a
linear function,
– That linear discriminant is also distributed as a
Gaussian,
– The signal to noise ratio (SNR) is easily calculated
from the input data distributions and is a monotonic
function of AUC.
• Can serve as a benchmark against which to
measure CIO performance
Linear Ideal Observer
• p(x|0): probability distribution of the data x for the population of normals; p(x|1): probability distribution of x for the population of abnormals, with components x_i independent Gaussian distributed with means 0 and m_i respectively and identical variances σ_i²:

$$p(\mathbf{x} \mid 0) = \prod_{i=1}^{D} (2\pi\sigma_i^2)^{-1/2} \exp\!\left(-x_i^2 / 2\sigma_i^2\right)$$

$$p(\mathbf{x} \mid 1) = \prod_{i=1}^{D} (2\pi\sigma_i^2)^{-1/2} \exp\!\left(-(x_i - m_i)^2 / 2\sigma_i^2\right)$$
Maximum Likelihood CIO

$$L = \frac{p(1 \mid \mathbf{x})}{p(0 \mid \mathbf{x})} = \frac{p(\mathbf{x} \mid 1)\, p(1)}{p(\mathbf{x} \mid 0)\, p(0)} = k\, \frac{p(\mathbf{x} \mid 1)}{p(\mathbf{x} \mid 0)}$$

$$L = k \exp\!\left(\sum_{i=1}^{D} \frac{m_i x_i}{\sigma_i^2}\right) \exp\!\left(-\sum_{i=1}^{D} \frac{m_i^2}{2\sigma_i^2}\right)$$

$$t_1 = \exp\!\left(\sum_{i=1}^{D} \frac{m_i x_i}{\sigma_i^2}\right), \qquad t = \ln t_1 = \sum_{i=1}^{D} \frac{m_i x_i}{\sigma_i^2}$$
Linear Ideal Observer ROC

$$p(t \mid 0) = \mathrm{Gauss}\!\left(0, \sum_{i=1}^{D} \frac{m_i^2}{\sigma_i^2}\right) = \mathrm{Gauss}(0, \lambda), \qquad p(t \mid 1) = \mathrm{Gauss}(\lambda, \lambda)$$

$$\mathrm{SNR} = \frac{\langle t \rangle_1 - \langle t \rangle_0}{\sigma_{(t\mid 1) - (t\mid 0)}} = \frac{\lambda}{(2\lambda)^{1/2}} = \left(\frac{\lambda}{2}\right)^{1/2}$$

$$\mathrm{AUC} = \Phi(\mathrm{SNR}), \qquad \Phi = \text{cumulative Gaussian distribution}$$
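A small Python sketch (my own illustration; the feature means and variances are arbitrary) of this benchmark: lambda = sum_i m_i^2 / sigma_i^2, SNR = sqrt(lambda/2), AUC = Phi(SNR).

import numpy as np
from math import erf, sqrt

def gauss_cdf(x: float) -> float:
    """Cumulative Gaussian distribution Phi(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

m = np.array([1.0, 0.5, 0.25])      # class-1 feature means (class-0 means are 0)
sigma = np.array([1.0, 1.0, 2.0])   # per-feature standard deviations

lam = float(np.sum(m**2 / sigma**2))
snr = sqrt(lam / 2.0)
print(lam, snr, gauss_cdf(snr))     # benchmark AUC for any CIO on these data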
Likelihood Ratio = Slope of ROC
• The likelihood ratio of the decision variable t is
the slope of the ROC curve:
• ROC: TPF as a function of FPF, where TPF = 1 − P(t|1) and FPF = 1 − P(t|0), with P(t|c) the cumulative distribution of t for class c

$$\text{slope} = \frac{d\,\mathrm{TPF}}{d\,\mathrm{FPF}} = \frac{d\,P(t \mid 1)}{d\,P(t \mid 0)} = \frac{p(t \mid 1)}{p(t \mid 0)} = L(t)$$
III. Error analysis for CI observers
• Sources of error
• Parametric methods
• Nonparametric methods
• Standard deviations and confidence intervals
• Hazards
Sources of error
• Test error—limited number of samples in
the test set
• Training error—limited number of samples
in the training set
– Incorrect parameters
– Incorrect feature selection, etc.
• Human observer error (when applicable)
– Intraobserver
– Interobserver
Parametric methods
• Use known underlying probability
distribution – may be exact for simulated
data
• Assume Gaussian distribution
• Other parameterizations – e.g., binomial, or ROC linearity in z-transformation coordinates (Φ⁻¹(TPF) versus Φ⁻¹(FPF), where Φ is the cumulative Gaussian distribution)
Binomial Estimates of Variance
• For single population measures,
f= TPF, FPF, FNF, TNF
• Var(f) = f (1-f) / N
• For AUC (back-of-envelope calculation): Var(AUC) = AUC (1 − AUC) / N
Data rich case
• Repeat experiment M times
• Estimate distribution parameters – e.g., for a Gaussian distributed performance measure f ~ G(μ, σ²):

$$\hat{f} = \hat{\mu}_f = \frac{1}{M}\sum_{i=1}^{M} f_i, \qquad \hat{\sigma}_f^2 = \frac{1}{M-1}\sum_{i=1}^{M}\left(f_i - \hat{\mu}_f\right)^2, \qquad \hat{\sigma}_{\hat{f}}^2 = \hat{\sigma}_f^2 / M$$

• Find error bars or confidence limits:

$$\hat{f} \pm \hat{\sigma}_{\hat{f}}, \qquad \hat{f} \pm k\,\hat{\sigma}_{\hat{f}} \qquad (k_{95\%} = 1.96)$$
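A minimal Python sketch (my own) of this data-rich recipe:

import numpy as np

# Repeat the experiment M times; form the mean, the distribution variance,
# the variance of the mean, and a 95% confidence interval for a measure f.
def error_bars(f_values, k=1.96):
    f = np.asarray(f_values, dtype=float)
    M = f.size
    f_hat = f.mean()
    s2_f = f.var(ddof=1)          # distribution variance, 1/(M-1) convention
    s2_mean = s2_f / M            # variance of the mean
    half_width = k * np.sqrt(s2_mean)
    return f_hat, np.sqrt(s2_mean), (f_hat - half_width, f_hat + half_width)

rng = np.random.default_rng(3)
print(error_bars(rng.normal(0.90, 0.03, size=40)))   # e.g. 40 AUC estimates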
Example: AUC
• Mean AUC
• "Distribution" variance
• Variance of the mean
• Error bars, confidence interval
[Figure: probability distribution for the calculation of AUC from 40 values]
[Figure: probability distribution for the calculation of SNR from 40 values]
But what’s a poor boy to do?
• Reuse the data you have:
Resubstitution, Resampling
• Two common approaches:
– Jackknife
– Bootstrap
Resampling
Jackknife
• Have N observations
• Leave out m of these, then have M
subsets of the N observations to calculate
m, s2
$$M = \binom{N}{m}$$

• N = 10, m = 5: M = 252;  N = 10, m = 1: M = 10
Round-robin jackknife bias
derivation and variance
• Given N datasets, assume the finite-sample estimate carries a bias of order 1/N:

$$AUC_N = AUC - \frac{k}{N}, \qquad AUC_{N-1} = AUC - \frac{k}{N-1}$$

$$2\,AUC = AUC_N + AUC_{N-1} + k\,\frac{2N-1}{N(N-1)}, \qquad k\,\frac{1}{N(N-1)} = AUC_N - AUC_{N-1}$$

$$\widehat{AUC}_J = N \cdot AUC_N - (N-1)\cdot AUC_{N-1}$$

$$\hat{\sigma}_J^2 = \frac{N-1}{N} \sum_{i=1}^{N} \left(AUC_{N-1(i)} - \overline{AUC}_{N-1}\right)^2$$
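A minimal Python sketch (my own construction, reusing the auc_mann_whitney estimator sketched earlier) of one common way to apply the leave-one-out jackknife variance estimate to a two-class AUC, leaving out one case at a time from either class:

import numpy as np

def jackknife_auc_variance(x_abnormal, y_normal):
    x, y = np.asarray(x_abnormal), np.asarray(y_normal)
    loo = []
    for i in range(x.size):                       # leave out one abnormal case
        loo.append(auc_mann_whitney(np.delete(x, i), y))
    for k in range(y.size):                       # leave out one normal case
        loo.append(auc_mann_whitney(x, np.delete(y, k)))
    loo = np.asarray(loo)
    N = loo.size
    return (N - 1) / N * np.sum((loo - loo.mean())**2)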
Fukunaga-Hayes bias derivation
• Divide both the normal and abnormal classes in half, yielding 4 possible pairings:

$$AUC_N = AUC - \frac{k}{N}, \qquad AUC_{N/2} = AUC - \frac{2k}{N}$$

$$\widehat{AUC}_{F\text{-}H} = 2\cdot AUC_N - AUC_{N/2}$$
Jackknife bias correction example: training error
[Figure: AUC estimates as a function of the number of cases N. The solid line is the multilayer perceptron result; open circles, jackknife; closed circles, Fukunaga-Hayes. The horizontal dotted line is the asymptotic ideal result.]
IV. Bootstrap methods
• Theoretical foundation
• Practical use
Bootstrap variance
• What you have is what you’ve got—the
data is your best estimate of the
probability distribution:
– Sampling with replacement, M times
– Adequate number of samples M>N
Simple bootstrap

$$\hat{\sigma}_B^2 = \frac{1}{M-1}\sum_{i=1}^{M}\left(AUC_{B(i)} - \overline{AUC}_B\right)^2$$
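A minimal Python sketch (my own, again reusing the auc_mann_whitney estimator sketched earlier) of the simple bootstrap variance estimate:

import numpy as np

# Resample the cases with replacement M times and take the variance of the
# resulting AUC values.
def bootstrap_auc_variance(x_abnormal, y_normal, M=2000, seed=0):
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x_abnormal), np.asarray(y_normal)
    aucs = []
    for _ in range(M):
        xb = rng.choice(x, size=x.size, replace=True)
        yb = rng.choice(y, size=y.size, replace=True)
        aucs.append(auc_mann_whitney(xb, yb))
    return np.var(aucs, ddof=1)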
Bootstrap and jackknife error estimates
[Figure: standard deviation of AUC. The solid line shows simulation results, open circles the jackknife estimate, closed circles the bootstrap estimate. Note how much larger the jackknife error bars are than those provided by the bootstrap method.]
Comparison of s.d. estimates:
2 Gaussian distributions, 20 normal and 20 abnormal cases, population AUC = .936

  Actual s.d.                     .0380
  Binomial approximation          .0380
  Bootstrap                       .0388
  Jackknife                       .0396

  Mean bootstrap AUC estimate     .936
  Mean jackknife bias estimate    2 × 10⁻¹⁷
.632 bootstrap for classifier
performance evaluation
• Have N cases, draw M samples of size N with replacement for
training (Have on average .632 x N unique cases in each sample of
size N)
• Test on the unused (~.368 x N) cases for each sample
pcasei = (1  N1 ) N  e 1
pcasei = 1  pcasei  1  e 1 = .632...
 casestraining = 1  casestesting = .632...  N
N
5
10
20
100
Infinity
(1  1 / N) N
.328
1-.672
.349
1-.651
.358
1-.642
.366
1-.634
.368
1-.632
.632 bootstrap for classifier
performance evaluation 2
• Have N cases, draw M samples of size N with
replacement for training (Have on average .632
x N unique cases in each sample of size N)
• Test on the unused (~.368 x N) cases for each
sample
• Get the bootstrap average result AUC_B
• Get the resubstitution result (testing on the training set) AUC_R
• AUC_.632 = .632 × AUC_B + .368 × AUC_R
• As the variance, take the variance of the AUC_B values
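A minimal Python sketch of the .632 estimate; fit_and_auc is a hypothetical stand-in (my own name, not from the course) for training the CIO on one index set and returning its AUC on another:

import numpy as np

def auc_632(n_cases, fit_and_auc, M=200, seed=0):
    rng = np.random.default_rng(seed)
    all_idx = np.arange(n_cases)
    auc_b = []
    for _ in range(M):
        train = rng.choice(all_idx, size=n_cases, replace=True)
        test = np.setdiff1d(all_idx, train)        # the ~36.8% unused cases
        auc_b.append(fit_and_auc(train, test))
    auc_b = float(np.mean(auc_b))
    auc_r = fit_and_auc(all_idx, all_idx)          # resubstitution estimate
    return 0.632 * auc_b + 0.368 * auc_r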
Traps for the unwary:
Overparameterization
• Cover's theorem:

$$C(N, d) = 2\sum_{k=0}^{d}\binom{N-1}{k}$$

• For N < 2(d+1) a hyperplane exists that will perfectly separate almost all possible dichotomies of N points in d-dimensional space

[Figure: f_d(N) for d = 1, 5, 25, 125, and the limit of large d. The abscissa x = N/(2(d+1)) is scaled so that the values f_d(N) = 0.5 lie superposed at x = 1 for all d.]
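A small Python sketch (my own) of Cover's counting function and the separable fraction f_d(N) = C(N, d) / 2^N:

from math import comb

# C(N, d) = 2 * sum_{k=0}^{d} C(N-1, k) dichotomies of N points in general
# position in d dimensions are linearly separable.
def cover_fraction(N: int, d: int) -> float:
    c = 2 * sum(comb(N - 1, k) for k in range(d + 1))
    return c / 2**N

for N in (4, 8, 16, 64):
    print(N, cover_fraction(N, d=3))   # fraction separable in 3-D; 0.5 at N = 2(d+1)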
Poor data hygiene
• Reporting on training data results/ testing
on training data
• Carrying out any part of the training
process on data later used for testing
– e.g., using all of the data to select a
manageable feature set from among a large
number of features—and then dividing the
data into training and test sets.
Overestimate of AUC from poor data hygiene
[Figure: left, distributions of AUC values in 900 simulation experiments for the four validation methods; right, the corresponding mean ROC curves (legend AUC values: 0.52, 0.62, 0.82, 0.91)]
Distributions of AUC values in 900 simulation experiments (on the left) and the mean ROC curves (on the right) for four validation methods: Method 1 – feature selection and classifier training on one dataset and classifier testing on another independent dataset; Method 2 – given perfect feature selection, classifier training on one dataset and classifier testing on another independent dataset; Method 3 – feature selection using the entire dataset, which is then partitioned into two parts, one for training and one for testing the classifier; Method 4 – feature selection, classifier training, and testing using the same dataset.
Correct feature selection is hard to do
[Figure: left, the number of experiments (out of 900) in which each feature index is selected; right, the distribution of the number of useful features (out of 30) selected in the 900 experiments]
An insight into feature selection performance in Method 1. The left panel plots the number of experiments (out of 900) in which each feature is selected; by design of the simulation population, the first 30 features are useful for classification and the remaining are useless. The right panel plots the distribution of the number of useful features (out of 30) selected in the 900 experiments.
Conclusions
• Accuracy and other prevalence dependent
measures are inadequate
• ROC/AUC provide good measures of
performance
• Uncertainty must be quantified
• Bootstrap and jackknife techniques are
useful methods
V. References
[1] K. Fukunaga, Statistical Pattern Recognition, 2nd Edition. Boston: Harcourt Brace Jovanovich, 1990.
[2] K. Fukunaga and R. R. Hayes, “Effects of sample size in classifier design,” IEEE Trans. Pattern Anal. Machine Intell., vol.
PAMI-11, pp. 873–885, 1989.
[3] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics. New York: John Wiley & Sons, 1966.
[4] J. P. Egan, Signal Detection Theory and ROC Analysis. New York: Academic Press, 1975.
[5] C. E. Metz, “Basic principles of ROC analysis,” Seminars in Nuclear Medicine, vol. VIII, no. 4, 1978.
[6] H. H. Barrett and K. J. Myers, Foundations of Image Science. Hoboken: John Wiley & Sons, 2004, ch. 13 Statistical
Decision Theory.
[7] B. Efron and R. J. Tibshirani, Introduction to the Bootstrap. Boca Raton: Chapman & Hall/CRC, 1993.
[8] B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied
Mathematics, 1982.
[9] A. C. Davison and D. V. Hinkley, Bootstrap Methods and their Applications. Cambridge: Cambridge University Press,
1997.
[10] B. Efron, “Estimating the error rate of a prediction rule: Some improvements on cross-validation,” Journal of the
American Statistical Association, vol. 78, pp. 316–331, 1983.
[11] B. Efron and R. J. Tibshirani, “Improvements on cross-validation: The .632+ bootstrap method,” Journal of the American
Statistical Association, vol. 92, no. 438, pp. 548–560, 1997.
[12] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 3rd Edition. New York: Springer, 2009.
[13] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[14] ——, Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
[15] R. F. Wagner, D. G. Brown, J.-P. Guedon, K. J. Myers, and K. A. Wear, “Multivariate Gaussian pattern classification:
effects of finite sample size and the addition of correlated or noisy features on summary measures of goodness,” in
Information processing in Medical Imaging, Proceedings of IPMI ’93, 1993, pp. 507–524.
[16] ——, “On combining a few diagnostic tests or features,” in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
[17] D. G. Brown, A. C. Schneider, M. P. Anderson, and R. F. Wagner, “Effects of finite sample size and correlated noisy
input features on neural network pattern classification,” in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
[18] C. A. Beam, “Analysis of clustered data in receiver operating characteristic studies,” Statistical Methods in Medical
Research, vol. 7, pp. 324–336, 1998.
[19] W. A. Yousef, et al. “Assessing Classifiers from Two Independent Data Sets Using ROC Analysis: A Nonparametric
Approach,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1809-1817, 2006
[20] F. W. Samuelson and D. G. Brown, “Application of Cover’s theorem to the evaluation of the performance of CI
observers,” in Proceedings of the IJCNN 2011, 2011.
[21] W. Chen and D. G. Brown, “Optimistic bias in the assessment of high dimensional classifiers with a limited dataset,” in
Proceedings of the IJCNN 2011, 2011.
Appendix I
Searching suitcases
Previous class results