Model Measures: Are They Fit for Purpose?
Tim Wright
NZMUGS Conference, 2014
9th Sept 2014

Introduction
• ‘Fit for Purpose’
• Draft MUGS Data Comparison Guidelines (GEH)
• Expanded review of:
  – GEH
  – R²
  – RMSE
• I am no statistician (but I love statistics!)
• ‘Keeping it real’:
  – “For every grain of sand on Earth, there are 10,000 stars in the observable universe”
  – “1 million seconds is around 12 days. 1 billion seconds is around 32 years”

GEH Definition
With M the modelled flow and O the observed count:
• $GEH = \sqrt{\frac{(M-O)^2}{(M+O)/2}}$
• $GEH = \frac{\sqrt{(M-O)^2}}{\sqrt{(M+O)/2}}$
• $GEH = \frac{(M-O)}{\sqrt{(M+O)/2}}$
• $GEH = \frac{\text{Difference}}{\sqrt{\text{Average}}}$

GEH Trend
• Single measure of overall fit
[Figures: GEH trend charts (three slides)]

GEH Thresholds
[Figures: GEH threshold charts (two slides)]

GEH Examples

                  Model 1   Model 2
Count                 400       400
Modelled              500       300
Error                 100      -100
% Error               25%      -25%
GEH                   4.7       5.3
GnT (Proposed)        5.0       5.0

                  Model 1   Model 2
Count                 900       900
Modelled              600      1200
Error                -300       300
% Error              -33%       33%
GEH                  11.0       9.3
GnT (Proposed)       10.0      10.0

• GEH: $\frac{\text{Difference}}{\sqrt{\text{Average}}}$
• $\sqrt{\text{Average}}$ reflects uncertainty in the count
• What is the purpose?
  – To identify possible issues for further investigation
• GnT: $\frac{\text{Difference}}{\sqrt{\text{Observed}}}$

GnT Thresholds
[Figures: GnT threshold charts (two slides)]
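The GEH and GnT definitions on the examples slide can be checked with a short sketch. The function names `geh` and `gnt` are illustrative and not taken from any guideline. The absolute difference is used so that both measures are non-negative, as in the table.

```python
import math


def geh(modelled: float, observed: float) -> float:
    """GEH: difference divided by the square root of the average of M and O."""
    return abs(modelled - observed) / math.sqrt((modelled + observed) / 2)


def gnt(modelled: float, observed: float) -> float:
    """Proposed GnT: difference divided by the square root of the observed count."""
    return abs(modelled - observed) / math.sqrt(observed)


# (count, modelled) pairs from the GEH Examples slide
examples = [(400, 500), (400, 300), (900, 600), (900, 1200)]

for count, modelled in examples:
    print(
        f"count={count:5d} modelled={modelled:5d} "
        f"GEH={geh(modelled, count):5.1f} GnT={gnt(modelled, count):5.1f}"
    )
```

For a given count, GEH differs between over- and under-estimates of the same size because the modelled value enters the denominator, whereas GnT depends only on the count and so is symmetric.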
R²: Definition
• $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
• $R^2 = 1 - \frac{\sum \text{Errors}^2}{\sum (\text{Difference between a count and the mean count})^2}$
[Figure: two scatter plots of Modelled against Observed, both axes 0 to 1000, with the y = x line and the mean marked]

R²: A Comparison

Gridville
Observed  Modelled   GEH   % Error    Error²     Diff²
                           (M-O)/O    (M-O)²   (O-Av)²
     400       380   1.0       -5%       400    11,130
     450       473   1.0       +5%       506     3,080
     500       475   1.1       -5%       625        30
     505       530   1.1       +5%       638         0
     510       485   1.1       -5%       650        20
     515       541   1.1       +5%       663        90
     520       494   1.2       -5%       676       210
     525       551   1.1       +5%       689       380
     530       504   1.2       -5%       702       600
     600       630   1.2       +5%       900     8,930
Ave: 506                   Totals:     6,449    24,473
                           R²: 0.736

Hierarchy City
Observed  Modelled   GEH   % Error    Error²       Diff²
                           (M-O)/O    (M-O)²     (O-Av)²
      20        16   0.9      -20%        16     688,070
      25        30   1.0      +20%        25     679,800
     100        80   2.1      -20%       400     561,750
     150       180   2.3      +20%       900     489,300
     400       320   4.2      -20%     6,400     202,050
     600       720   4.7      +20%    14,400      62,250
     900       720   6.3      -20%    32,400       2,550
    1300      1560   6.9      +20%    67,600     202,950
    2000      1600   9.4      -20%   160,000   1,323,650
    3000      3600  10.4      +20%   360,000   4,624,650
Ave: 850                   Totals:   642,141   8,837,023
                           R²: 0.927

[Figure: scatter plots of Modelled against Observed for Gridville and Hierarchy City]

R²: An Alternative?
• $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
• $R^2 = 1 - \frac{\sum \text{Error}^2}{\sum (\text{Difference between a count and the mean count})^2}$
• Alternative: $R^2 = 1 - \frac{\sum \text{Error}^2}{\sum \text{Count}^2}$

R²: A Modification?

Gridville
Observed  Modelled   GEH   % Error    Error²        Obs²
                           (M-O)/O    (M-O)²        (O)²
     400       380   1.0       -5%       400     160,000
     450       473   1.0       +5%       506     202,500
     500       475   1.1       -5%       625     250,000
     505       530   1.1       +5%       638     255,025
     510       485   1.1       -5%       650     260,100
     515       541   1.1       +5%       663     265,225
     520       494   1.2       -5%       676     270,400
     525       551   1.1       +5%       689     275,625
     530       504   1.2       -5%       702     280,900
     600       630   1.2       +5%       900     360,000
Ave: 506                   Totals:     6,449   2,579,775
                           R² (standard): 0.736   R² (alternative): 0.998

Hierarchy City
Observed  Modelled   GEH   % Error    Error²         Obs²
                           (M-O)/O    (M-O)²         (O)²
      20        16   0.9      -20%        16          400
      25        30   1.0      +20%        25          625
     100        80   2.1      -20%       400       10,000
     150       180   2.3      +20%       900       22,500
     400       320   4.2      -20%     6,400      160,000
     600       720   4.7      +20%    14,400      360,000
     900       720   6.3      -20%    32,400      810,000
    1300      1560   6.9      +20%    67,600    1,690,000
    2000      1600   9.4      -20%   160,000    4,000,000
    3000      3600  10.4      +20%   360,000    9,000,000
Ave: 850                   Totals:   642,141   16,053,525
                           R² (standard): 0.927   R² (alternative): 0.960

[Figure: scatter plots of Modelled against Observed for Gridville and Hierarchy City]

RMSE Definition (or %RMSE)
• $\%RMSE = \frac{\sqrt{\sum (M-O)^2 / (N-1)}}{\sum O / N} \times 100$
• $= \frac{\sqrt{\sum \text{Error}^2 / (N-1)}}{\text{Ave Count}}$
• $\approx \frac{\sqrt{\sum \text{Error}^2 / N}}{\text{Ave Count}}$
• $= \frac{\sqrt{\text{Average Squared Error}}}{\text{Ave Count}}$
• What does an RMSE of 30% mean?
• How is this a better measure than the ‘weighted average % error’, i.e. $\frac{\text{Average Absolute Difference}}{\text{Average Count}}$?

%RMSE As a Measure of Confidence
• RMSE is closely related to Standard Deviation (SD)
• Theory is that SD (or RMSE) tells us something about the confidence in the model
[Figure: distribution of errors, Proportion against Error, with the anticipated range within 1 SD or %RMSE marked]
• i.e. typically:
  – 68% of errors < RMSE
  – 95% of errors < 2 x RMSE
• But this assumes a normal distribution of errors
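To make the %RMSE definition and the ‘weighted average % error’ concrete, here is a minimal sketch. It reuses the Hierarchy City counts from the R² tables purely as sample data, which is an assumption for illustration and not how the slides use that data. The function names are hypothetical.

```python
import math

# Hierarchy City observed and modelled values, reused here only as sample data
observed = [20, 25, 100, 150, 400, 600, 900, 1300, 2000, 3000]
modelled = [16, 30, 80, 180, 320, 720, 720, 1560, 1600, 3600]


def rmse(obs, mod):
    """Root mean square error using the N-1 divisor from the %RMSE definition."""
    squared = sum((m - o) ** 2 for o, m in zip(obs, mod))
    return math.sqrt(squared / (len(obs) - 1))


def pct_rmse(obs, mod):
    """%RMSE: RMSE as a percentage of the average observed count."""
    return 100 * rmse(obs, mod) / (sum(obs) / len(obs))


def pct_mad(obs, mod):
    """Weighted average % error: average absolute difference over average count."""
    return 100 * sum(abs(m - o) for o, m in zip(obs, mod)) / sum(obs)


def share_within(obs, mod, multiple=1.0):
    """Fraction of errors no larger than `multiple` times the RMSE."""
    limit = multiple * rmse(obs, mod)
    return sum(1 for o, m in zip(obs, mod) if abs(m - o) <= limit) / len(obs)


print(f"%RMSE = {pct_rmse(observed, modelled):.1f}%")
print(f"%MAD  = {pct_mad(observed, modelled):.1f}%")
print(f"share within 1 x RMSE = {share_within(observed, modelled):.0%}")
print(f"share within 2 x RMSE = {share_within(observed, modelled, 2.0):.0%}")
```

On this sample every modelled value is 20% away from its count, so the weighted average % error is 20%. The %RMSE is about 31% because squaring gives the largest counts more weight.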
%RMSE As a Measure of Confidence
• SD = 57 (RMSE = 24%)
• Normal distribution implies 68% of errors < SD
• Actual data shows 81% of data within SD
• So what is SD or RMSE actually telling us?
[Figure: Frequency Distribution of Model Flow Errors vs Normal Distribution. x-axis: Error (Modelled - Count), -500 to 400, with -57 and 57 marked; y-axis: Proportion, 0 to 0.04; series: Actual Frequency, Normal Distribution Probability]

%RMSE Thoughts
• Scatter plots are a better indication of overall model fit
• Is the %‘Mean Absolute Deviation’ (%MAD), i.e. $\frac{\text{Average Absolute Difference}}{\text{Average Count}}$, a more direct, simpler, intuitive single measure?
• E.g. %MAD = 14% vs RMSE = 24%

Conclusions
• Is GnT a better indicator of potential issues with models than GEH?
• Is R² appropriate to our purpose or should this be modified?
• %RMSE is not intuitive and of dubious value. Suggest replacing with %MAD
• Preference is to investigate and document reasons for all significant model vs. data discrepancies prior to and after any ME, rather than focussing on achieving a raft of arbitrary criteria.
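As a numerical check on the R² question, the Gridville and Hierarchy City comparison can be reproduced with the sketch below. Two assumptions are made here: the modelled values are generated as exactly ±5% and ±20% of the counts, alternating and starting with the negative error (the tables show rounded modelled values), and the helper names are illustrative only.

```python
def ss_res(obs, mod):
    """Sum of squared errors between modelled and observed values."""
    return sum((m - o) ** 2 for o, m in zip(obs, mod))


def r_squared(obs, mod):
    """Standard form: 1 - SS_res / SS_tot, with SS_tot taken about the mean count."""
    mean_obs = sum(obs) / len(obs)
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res(obs, mod) / ss_tot


def r_squared_alt(obs, mod):
    """Alternative form from the slides: 1 - sum(Error^2) / sum(Count^2)."""
    return 1 - ss_res(obs, mod) / sum(o ** 2 for o in obs)


def alternating(obs, pct):
    """Modelled values alternately pct below and pct above each count (assumed)."""
    return [o * (1 - pct) if i % 2 == 0 else o * (1 + pct) for i, o in enumerate(obs)]


gridville = [400, 450, 500, 505, 510, 515, 520, 525, 530, 600]
hierarchy = [20, 25, 100, 150, 400, 600, 900, 1300, 2000, 3000]

for name, obs, pct in [("Gridville", gridville, 0.05), ("Hierarchy City", hierarchy, 0.20)]:
    mod = alternating(obs, pct)
    print(f"{name}: R2 = {r_squared(obs, mod):.3f}, alternative R2 = {r_squared_alt(obs, mod):.4f}")
```

Under these assumptions the standard form gives the tighter ±5% model the lower score (about 0.736 against 0.927) because its counts vary little about their mean. The alternative form comes out at roughly 1 minus the squared percentage error in both cases, so it ranks the two models the other way round.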