Why the naïve Bayesian classifier can dominate the proper one
("All models are wrong but some are useful")

Hans-J. Lenz
Inst. f. Statistik und Ökonometrie, Freie Universität Berlin
hans-j.lenz@fu-berlin.de
FU Berlin, Aug 2013

Overview
1. The classification problem
2. Proper and naïve Bayesian classifier
3. The Gammerman, Thatcher (1991) study
4. Explanation
   1. Curse of dimension
   2. Estimation error
   3. Missing values
5. Short conclusion

Entity-Relationship Model
[Diagram: entities Patient, Treatment case, Doctor, Symptom $x$, ICD, Disease $\theta$]

The classification problem in hospitals
Given a new patient and his observed $p \in \mathbb{N}$ symptoms, classify him according to the ICD set of known diseases (illnesses) and prior information from the hospital's database system.
$\Theta$: finite parameter (disease) space
$X$: finite sampling (symptom) space
DEF.: Diagnostic function $\delta: X^p \to \mathbb{R}_{[0,1]}$ with $\delta(x) = P(\theta \text{ given } x)$ and ...

Bayesian classifier
$\delta_B(x) = \delta^* = P(\theta^* \mid x)$ where $\theta^* = \arg\max_{\theta} P(\theta \mid x)$
Likelihood: $P(x \mid \theta)$
Prior information: $P(\theta)$
Posterior information: $P(\theta \mid x)$
Bayes rule: $P(\theta \mid x) \propto P(x \mid \theta)\, P(\theta)$
[Diagram: $x$, Likelihood, Prior information, Bayes rule, Posterior information, Learning, Clinical Database, Feedback]

Proper Bayesian classifier
$P_B(\theta \mid x) \propto P_{x \mid \theta}(x \mid \theta)\, P(\theta)$ given $x = (x_1, x_2, \ldots, x_p)$
(Note: $p$ symptoms observed for disease $\theta$)
Symptoms are not independent given the disease: $x_i \not\perp x_j \mid \theta$ for all cases.
[Diagram: symptoms $x_1, x_2, \ldots, x_p$; case; disease $\theta$]
$\delta_B(x) = P_B(\theta^* \mid x)$

Naïve (Idiot) Bayesian classifier
$P_{nB}(\theta \mid x) \propto \prod_{i=1}^{p} P(x_i \mid \theta)\, P(\theta)$ given $x = (x_1, x_2, \ldots, x_p)$
Symptoms are assumed independent given the disease: $x_i \perp x_j \mid \theta$ for all cases.
[Diagram: symptoms $x_1, x_2, \ldots, x_p$; case; disease $\theta$]
$\delta_{nB}(x) = P_{nB}(\theta^* \mid x)$
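The naïve classifier defined above can be illustrated with a small Python sketch. It is not the code used in the study: binary (present/absent) symptoms, the Laplace smoothing constant `alpha`, and the function names `fit_naive_bayes`, `naive_posterior` and `classify` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(cases, alpha=1.0):
    """Estimate priors and per-symptom likelihoods from training cases.

    cases: list of (symptoms, disease), where symptoms is a tuple of 0/1 flags.
    alpha: Laplace smoothing constant (an assumption, not from the slides).
    """
    n_cases = len(cases)
    p = len(cases[0][0])
    disease_counts = Counter(disease for _, disease in cases)
    present = defaultdict(lambda: [0] * p)
    for symptoms, disease in cases:
        for i, flag in enumerate(symptoms):
            present[disease][i] += flag
    prior = {d: c / n_cases for d, c in disease_counts.items()}
    # P(symptom i present | disease), smoothed so that no factor is exactly 0 or 1.
    likelihood = {
        d: [(present[d][i] + alpha) / (disease_counts[d] + 2 * alpha) for i in range(p)]
        for d in disease_counts
    }
    return prior, likelihood


def naive_posterior(x, prior, likelihood):
    """Posterior over diseases: prior times the product of one factor per symptom."""
    scores = {}
    for disease, p_disease in prior.items():
        score = p_disease
        for x_i, q in zip(x, likelihood[disease]):
            score *= q if x_i else (1.0 - q)
        scores[disease] = score
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


def classify(x, prior, likelihood):
    """Return the most probable disease and its posterior probability."""
    posterior = naive_posterior(x, prior, likelihood)
    best = max(posterior, key=posterior.get)
    return best, posterior[best]
```

The point of the sketch is the parameter count: the naïve model needs one probability per symptom and disease, whereas the proper classifier needs one probability per complete symptom vector and disease.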

The Gammerman & Thatcher (1991) study
UK hospital data sampled in 1988
9 diseases
135 symptoms
Sample sizes: $n_{Training} = 2000$, $n_{Test} = 4387$

     Method         Percentage +) of correct diagnoses
I    Physician      76
II   Idiot Bayes    74
III  Proper Bayes   65 (!)
+) estimated from test set
Source: Gammerman and Thatcher (1988)

Curse of dimension
ICD relates the finite disease space $\Theta$ with the power set of the finite sampling (symptom) space $X$, i.e. ICD: $\Theta \to 2^{X}$
Note that the number of observed symptoms ($p_l$) varies from case to case ($l$): missing value problem
Prob(missing value) = 5%
$P(\text{at least one missing value}) = 1 - (1 - \text{Prob}(\text{missing value}))^{p}$
[Figure "Curse of Dimension": P(at least one missing value) plotted against dimension d, for d from 1 to 120]

Estimation of disease probabilities
$N$ = #(all cases)
$x \in X^d$: symptom vector for disease $\theta$ according to ICD
$n(\theta, x)$ = #(cases where symptom vector $x$ and disease $\theta$ is recorded)
$\hat{P}(\theta \mid x) = \frac{n(\theta, x) / N}{n(x) / N} = \frac{n(\theta, x)}{n(x)}$ for all $\theta \in \Theta$
O(N): one pass table scan for fixed d!

Estimation Error
Missing value problem: $p_l \le p$ is not constant over all cases $l$ for each disease $\theta$, with a known upper bound $p$ according to ICD
Over-fitting effect: a too large value $p_{max}$ for $p$ (too many symptoms per disease considered)
Sampling error in weakly occupied cells is increased: $\operatorname{var} \hat{P}(x, \theta) \sim 1 / m$, where $m$ is the absolute frequency in cell $(x, \theta)$
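The missing-value formula and the one-pass count estimator above can be tried out with a short Python sketch. The function names `p_at_least_one_missing` and `estimate_posteriors` are hypothetical, and the formula is taken to assume that values go missing independently across symptoms with the same rate.

```python
from collections import Counter


def p_at_least_one_missing(p, missing_rate=0.05):
    """P(at least one missing value) among p symptoms.

    Assumes each symptom is missing independently with probability missing_rate.
    """
    return 1.0 - (1.0 - missing_rate) ** p


def estimate_posteriors(cases):
    """Relative-frequency estimate n(disease, x) / n(x) in one pass over the cases.

    cases: iterable of (x, disease), where x is a hashable symptom vector.
    Returns the estimates together with the cell counts n(disease, x).
    """
    joint = Counter()     # n(disease, x)
    marginal = Counter()  # n(x)
    for x, disease in cases:
        joint[(disease, x)] += 1
        marginal[x] += 1
    estimates = {(d, x): n / marginal[x] for (d, x), n in joint.items()}
    return estimates, joint


if __name__ == "__main__":
    for p in (1, 20, 60, 135):
        print(p, round(p_at_least_one_missing(p), 4))
```

Under these assumptions the probability of a complete record falls quickly as p grows. The cell counts returned by `estimate_posteriors` show the second effect: with many symptoms most cells hold only a handful of cases, so the 1/m variance of the proper estimate becomes large.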

Short conclusion
The magic triangle:
modeling structural dependency
missing values imputation
estimation error

Thank you for your attention