Model Measures: Are they fit for purpose?

Tim Wright
NZMUGS Conference, 2014
9th Sept 2014
Model Measures – Are They Fit for Purpose?
Introduction
• ‘Fit for Purpose’
• Draft MUGS Data Comparison Guidelines (GEH)
• Expanded review of:
– GEH
– R²
– RMSE
• I am no statistician (but I love statistics!)
• ‘Keeping it real’
“For every grain of sand on Earth, there are 10,000 stars in the observable universe”
“1 million seconds is around 12 days. 1 billion seconds is around 32 years”
GEH Definition
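• $\mathrm{GEH} = \sqrt{\dfrac{(M-O)^2}{(M+O)/2}}$, where $M$ is the modelled flow and $O$ is the observed count
• $\mathrm{GEH} = \dfrac{|M-O|}{\sqrt{(M+O)/2}}$
• $\mathrm{GEH} = \dfrac{\text{Difference}}{\sqrt{\text{Average}}}$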
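A minimal sketch of the definition above, assuming M is the modelled flow and O the observed count. The function names are illustrative, not from the guidelines. It checks that the "squared difference over average" form and the "difference over root of average" form give the same number, using the 400 count / 500 modelled pair from the GEH Examples slide.

```python
import math


def geh(modelled: float, observed: float) -> float:
    """GEH written as sqrt((M - O)^2 / ((M + O) / 2))."""
    # Undefined when modelled and observed are both zero (division by zero).
    return math.sqrt((modelled - observed) ** 2 / ((modelled + observed) / 2))


def geh_difference_form(modelled: float, observed: float) -> float:
    """The same statistic written as |difference| / sqrt(average)."""
    return abs(modelled - observed) / math.sqrt((modelled + observed) / 2)


# Count of 400 against a modelled flow of 500.
print(round(geh(500, 400), 1))                  # 4.7
print(round(geh_difference_form(500, 400), 1))  # 4.7
```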
GEH Trend
• Single measure of overall fit
GEH Thresholds
GEH Examples
Example 1
                  Model 1   Model 2
Count                 400       400
Modelled              500       300
Error                 100      -100
% Error               25%      -25%
GEH                   4.7       5.3
GnT (Proposed)        5.0       5.0

Example 2
                  Model 1   Model 2
Count                 900       900
Modelled              600      1200
Error                -300       300
% Error              -33%       33%
GEH                  11.0       9.3
GnT (Proposed)       10.0      10.0

• GEH: $\dfrac{\text{Difference}}{\sqrt{\text{Average}}}$
– $\sqrt{\text{Average}}$ reflects uncertainty in the count
• What is the purpose?
– To identify possible issues for further investigation
• GnT: $\dfrac{\text{Difference}}{\sqrt{\text{Observed}}}$
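The example tables can be reproduced with a short sketch. It assumes GnT is the absolute difference divided by the square root of the observed count; that reading is inferred from the tabulated values, not stated in full on the slide. The function names are illustrative.

```python
import math


def geh(modelled: float, observed: float) -> float:
    """|difference| / sqrt(average of modelled and observed)."""
    return abs(modelled - observed) / math.sqrt((modelled + observed) / 2)


def gnt(modelled: float, observed: float) -> float:
    """Assumed form of the proposed measure: |difference| / sqrt(observed)."""
    return abs(modelled - observed) / math.sqrt(observed)


# (count, modelled) pairs from the two example tables.
examples = [(400, 500), (400, 300), (900, 600), (900, 1200)]

rows = [(c, m, round(geh(m, c), 1), round(gnt(m, c), 1)) for c, m in examples]
for count, modelled, geh_value, gnt_value in rows:
    print(f"count={count} modelled={modelled} GEH={geh_value} GnT={gnt_value}")
```

Under this reading GnT scores an over-estimate and an under-estimate of the same size identically, whereas GEH scores the under-estimate higher, because the average in its denominator moves with the modelled value.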
GnT Thresholds
R²: Definition
• $R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$
• $R^2 = 1 - \dfrac{\text{Errors}^2\text{, summed}}{(\text{Difference between a count and the mean count})^2\text{, summed}}$
[Figure: two scatter plots of Modelled against Observed, both axes 0 to 1000, with the y = x line and the Mean marked.]
R²: A Comparison
Gridville (Ave Observed: 506)
Observed  Modelled   GEH   % Error    Error²       Diff²
                           (M-O)/O    (M-O)²     (O-Av)²
     400       380   1.0       -5%       400      11,130
     450       473   1.0       +5%       506       3,080
     500       475   1.1       -5%       625          30
     505       530   1.1       +5%       638           0
     510       485   1.1       -5%       650          20
     515       541   1.1       +5%       663          90
     520       494   1.2       -5%       676         210
     525       551   1.1       +5%       689         380
     530       504   1.2       -5%       702         600
     600       630   1.2       +5%       900       8,930
                           Totals:     6,449      24,473
                               R²:     0.736

Hierarchy City (Ave Observed: 850)
Observed  Modelled   GEH   % Error    Error²       Diff²
                           (M-O)/O    (M-O)²     (O-Av)²
      20        16   0.9      -20%        16     688,070
      25        30   1.0      +20%        25     679,800
     100        80   2.1      -20%       400     561,750
     150       180   2.3      +20%       900     489,300
     400       320   4.2      -20%     6,400     202,050
     600       720   4.7      +20%    14,400      62,250
     900       720   6.3      -20%    32,400       2,550
    1300      1560   6.9      +20%    67,600     202,950
    2000      1600   9.4      -20%   160,000   1,323,650
    3000      3600  10.4      +20%   360,000   4,624,650
                           Totals:   642,141   8,837,023
                               R²:     0.927

[Figure: scatter plots of Modelled against Observed for Gridville (axes 0 to 1500 and 0 to 1200) and Hierarchy City (axes 0 to 4000).]
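The two R² values can be checked with a short sketch. It assumes the modelled flows are the observed counts with alternating errors of -5%/+5% (Gridville) and -20%/+20% (Hierarchy City), as the % Error columns indicate; the tabulated modelled values appear to be rounded versions of these. `alternating_errors` is a hypothetical helper introduced only for this illustration.

```python
def r_squared(observed, modelled):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean observed count."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((m - o) ** 2 for o, m in zip(observed, modelled))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def alternating_errors(observed, pct):
    """Modelled flows with errors of -pct, +pct, -pct, ... applied in turn."""
    return [o * (1 - pct if i % 2 == 0 else 1 + pct) for i, o in enumerate(observed)]


gridville_obs = [400, 450, 500, 505, 510, 515, 520, 525, 530, 600]
hierarchy_obs = [20, 25, 100, 150, 400, 600, 900, 1300, 2000, 3000]

r2_gridville = r_squared(gridville_obs, alternating_errors(gridville_obs, 0.05))
r2_hierarchy = r_squared(hierarchy_obs, alternating_errors(hierarchy_obs, 0.20))
print(f"{r2_gridville:.3f} {r2_hierarchy:.3f}")  # 0.736 0.927
```

The model with 20% errors scores higher than the model with 5% errors because the wide spread of counts in Hierarchy City inflates SS_tot.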
R²: An Alternative?
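• $R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$
• $R^2 = 1 - \dfrac{\text{Error}^2\text{, summed}}{(\text{Difference between a count and the mean count})^2\text{, summed}}$
• Alternative: $R^2 = 1 - \dfrac{\text{Error}^2\text{, summed}}{\text{Count}^2\text{, summed}}$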
R²: A Modification?
Gridville (Ave Observed: 506)
Observed  Modelled   GEH   % Error    Error²        Obs²
                           (M-O)/O    (M-O)²        (O)²
     400       380   1.0       -5%       400     160,000
     450       473   1.0       +5%       506     202,500
     500       475   1.1       -5%       625     250,000
     505       530   1.1       +5%       638     255,025
     510       485   1.1       -5%       650     260,100
     515       541   1.1       +5%       663     265,225
     520       494   1.2       -5%       676     270,400
     525       551   1.1       +5%       689     275,625
     530       504   1.2       -5%       702     280,900
     600       630   1.2       +5%       900     360,000
                           Totals:     6,449   2,579,775
R² (original): 0.736    R² (modified): 0.998

Hierarchy City (Ave Observed: 850)
Observed  Modelled   GEH   % Error    Error²        Obs²
                           (M-O)/O    (M-O)²        (O)²
      20        16   0.9      -20%        16         400
      25        30   1.0      +20%        25         625
     100        80   2.1      -20%       400      10,000
     150       180   2.3      +20%       900      22,500
     400       320   4.2      -20%     6,400     160,000
     600       720   4.7      +20%    14,400     360,000
     900       720   6.3      -20%    32,400     810,000
    1300      1560   6.9      +20%    67,600   1,690,000
    2000      1600   9.4      -20%   160,000   4,000,000
    3000      3600  10.4      +20%   360,000   9,000,000
                           Totals:   642,141  16,053,525
R² (original): 0.927    R² (modified): 0.960

[Figure: scatter plots of Modelled against Observed for Gridville (axes 0 to 1500 and 0 to 1200) and Hierarchy City (axes 0 to 4000).]
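A sketch of the modified measure, assuming it replaces SS_tot with the sum of squared observed counts and that the modelled flows are the observed counts with alternating -5%/+5% and -20%/+20% errors. `r_squared_modified` and `alternating_errors` are illustrative names, not from the source.

```python
def r_squared_modified(observed, modelled):
    """Modified R^2 = 1 - sum((M - O)^2) / sum(O^2)."""
    ss_res = sum((m - o) ** 2 for o, m in zip(observed, modelled))
    ss_obs = sum(o ** 2 for o in observed)
    return 1 - ss_res / ss_obs


def alternating_errors(observed, pct):
    """Modelled flows with errors of -pct, +pct, -pct, ... applied in turn."""
    return [o * (1 - pct if i % 2 == 0 else 1 + pct) for i, o in enumerate(observed)]


gridville_obs = [400, 450, 500, 505, 510, 515, 520, 525, 530, 600]
hierarchy_obs = [20, 25, 100, 150, 400, 600, 900, 1300, 2000, 3000]

mod_gridville = r_squared_modified(gridville_obs, alternating_errors(gridville_obs, 0.05))
mod_hierarchy = r_squared_modified(hierarchy_obs, alternating_errors(hierarchy_obs, 0.20))
print(f"{mod_gridville:.4f} {mod_hierarchy:.4f}")  # 0.9975 0.9600
```

When every link has the same proportional error p, this measure reduces to 1 - p² whatever the range of counts, so the 5% model (0.9975, shown as 0.998) now ranks above the 20% model (0.960).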
RMSE Definition (or %RMSE)
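• $\%\,\mathrm{RMSE} = \dfrac{\sqrt{\sum (M-O)^2 / (N-1)}}{\sum O / N} \times 100$
• $= \dfrac{\sqrt{\sum \text{Error}^2 / (N-1)}}{\text{Ave Count}}$
• $\approx \dfrac{\sqrt{\sum \text{Error}^2 / N}}{\text{Ave Count}}$
• $= \dfrac{\sqrt{\text{Average Squared Error}}}{\text{Ave Count}}$
• What does an RMSE of 30% mean?
• How is this a better measure than the ‘weighted average % error’, i.e.:
– $\dfrac{\text{Average Absolute Difference}}{\text{Average Count}}$ ?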
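To make the comparison concrete, here is a sketch that computes both measures for the Hierarchy City counts from the R² comparison, where every link is 20% out. The function names are illustrative; %RMSE uses the N - 1 divisor from the first form of the definition.

```python
import math


def pct_rmse(observed, modelled):
    """%RMSE = sqrt(sum((M - O)^2) / (N - 1)) / (average count) * 100."""
    n = len(observed)
    ss_res = sum((m - o) ** 2 for o, m in zip(observed, modelled))
    return 100 * math.sqrt(ss_res / (n - 1)) / (sum(observed) / n)


def pct_mad(observed, modelled):
    """Average absolute difference / average count * 100."""
    total_abs = sum(abs(m - o) for o, m in zip(observed, modelled))
    return 100 * total_abs / sum(observed)


hierarchy_obs = [20, 25, 100, 150, 400, 600, 900, 1300, 2000, 3000]
hierarchy_mod = [16, 30, 80, 180, 320, 720, 720, 1560, 1600, 3600]

rmse_value = pct_rmse(hierarchy_obs, hierarchy_mod)
mad_value = pct_mad(hierarchy_obs, hierarchy_mod)
print(f"%RMSE = {rmse_value:.1f}, weighted average % error = {mad_value:.1f}")
# %RMSE = 31.4, weighted average % error = 20.0
```

The weighted average % error returns the 20% that was built into the data, while %RMSE comes out higher because squaring gives the largest links more weight.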
%RMSE As a Measure of Confidence
• RMSE closely related to Standard Deviation
• Theory is that SD (or RMSE) tells us something about the confidence in the model:
[Figure: normal distribution curve of Proportion against Error, showing the anticipated range within 1 SD or %RMSE.]
• i.e. typically:
– 68% of errors < RMSE
– 95% of errors < 2 x RMSE
• But this assumes a normal distribution of errors
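The 68% and 95% figures come straight from the normal distribution, and can be set against what a given set of errors actually does. The sketch below uses illustrative function names and, as sample data, the ten Hierarchy City errors from the R² comparison. It treats RMSE as the root-mean-square error about zero with an N - 1 divisor, which is an assumption about how the check would be applied.

```python
import math


def normal_share_within(k: float) -> float:
    """Share of a normal distribution lying within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


def observed_share_within(errors, k: float = 1.0) -> float:
    """Share of errors smaller in size than k times their RMSE (N - 1 divisor)."""
    rmse = math.sqrt(sum(e ** 2 for e in errors) / (len(errors) - 1))
    return sum(abs(e) < k * rmse for e in errors) / len(errors)


print(f"{normal_share_within(1):.3f} {normal_share_within(2):.3f}")  # 0.683 0.954

# Modelled minus observed for the ten Hierarchy City links.
hierarchy_errors = [-4, 5, -20, 30, -80, 120, -180, 260, -400, 600]
print(observed_share_within(hierarchy_errors))  # 0.8
```

With only ten links this is an illustration, not evidence, but it shows how the share within one RMSE can sit well away from 68% when errors scale with flow.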
%RMSE As a Measure of Confidence
[Figure: Frequency Distribution of Model Flow Errors vs Normal Distribution. x-axis: Error (Modelled - Count), -500 to 400, with -57 and 57 marked; y-axis: Proportion (Actual Frequency) and Probability (Normal Distribution), 0 to 0.04.]
• SD = 57 (RMSE = 24%)
• Normal Distribution implies 68% of errors < SD
• Actual data shows 81% of data within SD
• So what is SD or RMSE actually telling us?
%RMSE Thoughts
• Scatter plots are a better indication of overall model fit
• Is the %‘Mean Absolute Deviation’, i.e. $\dfrac{\text{Average Absolute Difference}}{\text{Average Count}}$, a more direct, simpler, intuitive single measure?
• E.g. %MAD = 14% vs RMSE = 24%
Conclusions
• Is GnT a better indicator of potential issues with models than GEH?
• Is R² appropriate to our purpose or should this be modified?
• %RMSE not intuitive and of dubious value. Suggest replacing with %MAD
• Preference is to investigate & document reasons for all significant model vs. data discrepancies prior to and after any ME, rather than focussing on achieving a raft of arbitrary criteria.