SNHT and SNH2T

Critical values improvement for the Standard Normal
Homogeneity Test by combining Monte Carlo and regression
approaches
Michele Rienzner
Università degli Studi di Milano, DISAA
and Francesca Ieva
Università degli Studi di Milano, Department of Mathematics
Supplemental material (Appendices)
Appendix A
A.1 Description of the MATLAB® algorithms
lscov (MATLAB® general least squares algorithm, with given covariance matrix) GLS is
similar to Ordinary Least Squares (OLS) but accounts for the different uncertainty of the data
by specifying the covariance matrix of the errors. In the application described here, a matrix
with the estimated variances of the data (s²n,α) on the diagonal and zeros elsewhere was
imposed as the error covariance matrix. This algorithm applies only when the equation to be
fitted is linear in the parameters.
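As an illustration, a minimal sketch of such a GLS call follows; the design matrix A, the data vector y, and the variance vector s2 are hypothetical placeholders, not the actual model of equation (10).

    % GLS sketch (hypothetical inputs): x abscissas, y data, s2 estimated
    % error variances of the data points.
    A    = [ones(numel(x), 1), x(:)];  % design matrix, linear in the parameters
    V    = diag(s2);                   % diagonal error covariance matrix
    beta = lscov(A, y(:), V);          % GLS estimates of the parameters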
lsqcurvefit (non-linear least squares curve fit, MATLAB Optimization Toolbox® 3.1.2,
R2007b) is a subspace trust-region method based on the interior-reflective Newton method
described in Coleman and Li (1994, 1996). Each iteration involves the approximate solution
of a large linear system using the method of preconditioned conjugate gradients.
In the application described here, thousands of calls to the function were made to reduce the
probability of getting trapped in local minima (see the sketch below). Since lsqcurvefit does
not account for different uncertainties in the data, a replication of the data points was adopted
in order to weigh the residuals according to their uncertainty (Appendix A.2).
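A minimal sketch of such a multi-start strategy, under the assumption that the initial points are drawn at random within bounds, follows; lb, ub, model, xdata, and ydata are hypothetical names.

    % Multi-start sketch: restart lsqcurvefit from random initial points
    % and keep the parameters with the lowest residual norm, to reduce
    % the risk of stopping in a local minimum.
    best = Inf;
    for j = 1:5000
        p0 = lb + rand(size(lb)) .* (ub - lb);   % random start within the bounds
        [p, resnorm] = lsqcurvefit(model, p0, xdata, ydata, lb, ub);
        if resnorm < best
            best = resnorm;
            pHat = p;                            % best parameters so far
        end
    end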
For deeper insight, the reader may refer to Draper and Smith (1966), Ledermann (1984),
and Motulsky and Christopoulos (2003) for regression theory, and to Kiviet (2011) for
applications of the Monte Carlo technique.
A.2 Accounting for heteroscedasticity
To account for heteroscedasticity, the calibration should consider the normalized residuals
(ri,α/si,α). However, some calibration algorithms, such as lsqcurvefit (MATLAB
Optimization Toolbox®), do not consider heteroscedasticity. Since the optimization is
nevertheless performed by reducing the sum of the squared residuals, it is possible to weigh
the residuals according to the uncertainty of the corresponding data points.
The desired weighted sum of squares, S², can be defined as:
$$S^2=\sum_{i=1}^{P}\frac{r_{i,\alpha}^2}{s_{i,\alpha}^2}=\frac{1}{s_{M,\alpha}^2}\sum_{i=1}^{P}\frac{s_{M,\alpha}^2}{s_{i,\alpha}^2}\,r_{i,\alpha}^2\approx\frac{1}{s_{M,\alpha}^2}\sum_{i=1}^{P}k_{i,\alpha}\,r_{i,\alpha}^2\qquad(15)$$
where P is the number of the considered values of n, ri,α is the i-th residual, s²i,α is the
corresponding estimation variance, s²M,α is the highest estimation variance (s²M,α = max(s²i,α)),
and ki,α is the integer closest to s²M,α/s²i,α. Notice that the calibration is not influenced by a
scale factor in the objective function; therefore, the factor 1/s²M,α can be neglected. So, if the
calibration is done with a dataset where ki,α is the cardinality of the i-th data point (for any i
from 1 to P), the resulting sum of squared residuals accounts for heteroscedasticity, as in the
sketch below.
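A minimal sketch of this replication, with hypothetical variable names, could read (repelem is available in recent MATLAB releases):

    % Replication sketch: k_i copies of each data point make the unweighted
    % sum of squares minimized by lsqcurvefit proportional to the weighted
    % sum S^2 of equation (15).
    k    = round(max(s2) ./ s2);   % k_i = integer closest to sM^2 / s_i^2
    xRep = repelem(x(:), k);       % replicated abscissas
    yRep = repelem(y(:), k);       % replicated data points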
Appendix B
Table B1 SNHT, base estimation, values of the parameters of equation (10)
Parameter    value
a             2.5396958930
b             0.3426638052
c             3.5215643691
d             0.0031724177
e             1.9822161561
f            -0.0106019771
Table B2 SNHT, parameters of equation (11) at varying α
alpha      q1              q2              q3              q4               q5
0.1000     0.0000000000     0.0000000000    0.0000000000     1.0000000000     0.0000000000
0.0800     0.0002216562    -0.0065724527    0.0607364114     0.8772380010     0.0150363668
0.0750     0.0002849471    -0.0084140444    0.0771037982     0.8512722397    -0.0084283899
0.0600     0.0005079990    -0.0149694431    0.1361098420     0.7514867608    -0.0884043919
0.0500     0.0007402975    -0.0218729482    0.2015205436     0.5901702369    -0.0325818775
0.0250     0.0019098465    -0.0573689917    0.5572624072    -0.5373265386     0.9791022702
0.0100     0.0042829304    -0.1309253957    1.3377863800    -3.5497122673     4.8349085691
0.0080     0.0049715711    -0.1526579805    1.5752104846    -4.5296731884     6.2098313470
0.0075     0.0052037248    -0.1599447203    1.6551016000    -4.8677951834     6.7039553896
0.0060     0.0059649280    -0.1840764766    1.9215675826    -5.9980578324     8.3505999960
0.0050     0.0066474361    -0.2056962081    2.1609934543    -7.0270805710     9.8755575961
0.0025     0.0095302300    -0.2973588990    3.1858182901   -11.5339617891    16.7515926161
0.0010     0.0141330928    -0.4457925578    4.8813302765   -19.3071122579    29.2396304872
0.0008     0.0153644702    -0.4856708482    5.3404055837   -21.4446426091    32.7351956150
0.00075    0.0156620084    -0.4954310302    5.4536495410   -21.9699200438    33.5838781948
0.0006     0.0170587871    -0.5403611009    5.9694794753   -24.3840641991    37.5674596516
0.0005     0.0182700973    -0.5795352695    6.4221254592   -26.5252000811    41.1481518693
0.00025    0.0221872392    -0.7082158181    7.9270037754   -33.6711706513    53.0659097475
0.0001     0.0279371477    -0.8988595556   10.1816436731   -44.5741388127    71.6407624533
Table B3 SNH2T, parameters of equation (14) at varying α
alpha      a              b               c              d              e
0.1000     0.7533752930   -0.9038251499   0.3845344216   0.7836932442   1.0881573219
0.0750     0.7838461044   -1.0581999700   0.3060080448   0.8411960917   1.3039295419
0.0500     0.8339804041   -1.2062237184   0.2301043479   0.9129267155   1.5272022994
0.0250     0.9430067785   -1.3527573086   0.1344601246   1.0553664267   1.7979224064
0.0100     1.0828170098   -1.4755575156   0.0643764983   1.2608554452   2.0369970173
0.0075     1.1408831094   -1.4785656324   0.0506819516   1.3295269682   2.0604788435
0.0050     1.1885071022   -1.5138325364   0.0441189096   1.3578127011   2.1129836704
0.0025     1.2826657627   -1.5589969871   0.0261919224   1.5113881641   2.1984014854
0.0010     1.3707505127   -1.6756028175   0.0092289133   1.8493668432   2.3817090845
0.00075    1.4003879462   -1.6742195050   0.0110481169   1.7676319785   2.3747120411
0.0005     1.4557947693   -1.7051382132   0.0046646477   2.0717405282   2.4279475470
0.00025    1.5195503517   -1.7307883269   0.0042280692   2.0759754177   2.4625086193
0.0001     1.5558508769   -1.7303631642   0.0305932717   1.3116896618   2.4028429328
Appendix C
Equation (16) is obtained by dividing the terms of equation (4), at each n, by the standard
deviation of the corresponding Monte Carlo estimation (sn,α), which is assumed to be known,
and then applying the variance operator to both sides.
$$\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right)=\operatorname{var}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}}\right)+\operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)+2\operatorname{cov}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}},\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)\qquad(16)$$
We recall again that en,α is the distance between the unknown true critical value, C(n,α), and
its MC estimate at the data point, C^est n,α; εn,α is the distance between the regression curve
and the unknown true value; and rn,α is the regression residual.
The left-side term of equation (16) can be easily computed, since rn,α and sn,α are known,
while the right side contains the unknown terms εn,α and en,α, which are more difficult to
treat. However, some useful approximations can be considered.
At first, it is useful to define the following quantity and discuss its properties.
$$O_P=\operatorname{var}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}}\right)-1+2\operatorname{cov}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}},\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)\qquad(17)$$
The values en,α/sn,α are distributed like a Student's t, approaching a standard normal as sn,α
becomes a good approximation of σn,α. Therefore, if the error variance is accurately identified
and the number of regression points (P) is large enough, the first term on the right side of
equation (17) is close to one, being the variance of P standard normal variables. The third
term is twice the covariance between the standardized Monte Carlo error and the residual
error, εn,α, divided by the Monte Carlo uncertainty. Note that, if a data point lies above its
true value, we will have en,α > 0; increasing its value further, thus increasing the error, will
bias the estimated regression curve upward, increasing εn,α; so a positive correlation is
expected between en,α and εn,α. On the other hand, a reduction of this dependence between
εn,α and en,α is obtained by increasing the number of data points and accounting for the data
uncertainty (heteroscedastic data calibration). Indeed, a large error in a data point implies a
large uncertainty sn,α, which reduces the influence of that data point both in the calibration
and in equation (17). Therefore, with P large and a calibration accounting for
heteroscedasticity, the covariance term can be expected to be, on average, positive and close
to zero.
Summing up, OP is a random variable with a small, positive expectation, and its asymptotic
value (as P tends to infinity) is 0.
Including OP in equation (16) we can write:
$$\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right)=\operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)+\operatorname{var}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}}\right)+2\operatorname{cov}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}},\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)+1-1\qquad(18)$$

$$\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right)=\operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)+1+\left[\operatorname{var}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}}\right)-1+2\operatorname{cov}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}},\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)\right]\qquad(19)$$

$$\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right)=1+\operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)+O_P\qquad(20)$$
Defining
$$m_{\varepsilon/s}=\frac{1}{P}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\qquad(21)$$
the second term on the right side of equation (20) can then be expanded as:
$$\operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)=\frac{1}{P-1}\sum_{n=1}^{P}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}-m_{\varepsilon/s}\right)^2\qquad(22)$$

$$=\frac{1}{P-1}\left(\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}-2\,m_{\varepsilon/s}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}+P\,m_{\varepsilon/s}^2\right)\qquad(23)$$

$$=\frac{1}{P-1}\left(\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}-2\,P\,m_{\varepsilon/s}^2+P\,m_{\varepsilon/s}^2\right)\qquad(24)$$

$$=\frac{1}{P-1}\left(\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}-P\,m_{\varepsilon/s}^2\right)\qquad(25)$$

$$\operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right)=\frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}-\frac{P}{P-1}\,m_{\varepsilon/s}^2\qquad(26)$$
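The algebra of equations (22) to (26) can be verified numerically, as in the sketch below (epsn and sn are hypothetical vectors of length P holding the εn,α and sn,α values):

    % Numerical check of eq. (26); MATLAB's var uses the 1/(P-1) normalization.
    z   = epsn ./ sn;                                    % standardized errors
    P   = numel(z);
    lhs = var(z);
    rhs = sum(z.^2) / (P - 1) - P / (P - 1) * mean(z)^2;
    % lhs and rhs agree up to floating-point rounding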
Since, under the optimality conditions (reported in Section 3.2), rn,α and en,α are zero-mean
normal variables, εn,α will also be a zero-mean variable (equation 4) and, sn,α being strictly
positive, the expectation of mε/s is zero (i.e., it is the mean of P zero-mean variables). Since
the values of sn,α are independent of P, the variance of mε/s will tend to 0 as P increases.
Therefore, with P large, we can treat the last term in equation (26) as a negligible, strictly
positive quantity (oP), and equation (20) becomes:
$$\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right)=1+\frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}+O_P-o_P\qquad(27)$$
and it can be rewritten as:
$$\frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}=\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right)-\left(1+O_P-o_P\right)\qquad(28)$$
The term on the left side of equation (28) is the average of the ratios between a rough
estimate of the uncertainty introduced by adopting the regression curve (ε²n,α) and the
variance of the corresponding initial Monte Carlo estimate (s²n,α).
Indeed, the values ε²n,α are squared errors with zero mean and can be seen as rough estimates
of the variance of the error of the regression function in matching the actual curve at the
inspected values of n, while the values s²n,α are the variance estimates of the data points in
matching the actual curve. Therefore, this term can be used to build an indicator of the error-
variance reduction. Define then PVRα as the average (over n) of the variance of the Monte
Carlo errors minus the variance of the regression error, divided by the former. PVRα can then
be written as:
$$\mathrm{PVR}_{\alpha}=100\,\frac{1}{P}\sum_{n=1}^{P}\frac{s_{n,\alpha}^2-\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}\qquad(29)$$
According to the findings above, we can rewrite it as:
$$\mathrm{PVR}_{\alpha}=100\left(1-\frac{1}{P}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}\right)=100\left(1-\frac{P-1}{P}\,\frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}\right)\qquad(30)$$

$$\mathrm{PVR}_{\alpha}=100\left[1-\left(1-\frac{1}{P}\right)\frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}\right]\qquad(31)$$
Including equation (28) in equation (30) we obtain:
 P  1   rn, 

  1  OP  oP   
 1001 
var 
P   sn, 

 


r 
  1 

 1001  1  1  var  n,   OP  oP   
P 

 sn, 
 
 

  1 
r
PVR  1001  1  1  var  n,
  P 
 sn,
 
   1001  1 OP  oP 

 P
 
(32)
(33)
(34)
Obviously, OP and oP have to be neglected when computing equation (34), as in the sketch
below. Since the expectations of both OP and oP are small and positive, their difference
should reduce their effect on the estimated value. However, since oP is a squared quantity, it
is likely to have a heavy right tail, and its variance is inversely related to the values of sn,α.
Therefore, large draws of oP can occur (especially in the case of small sn,α values),
significantly biasing some of the estimated PVRs upward.
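A minimal sketch of the resulting estimator follows (r and s are hypothetical vectors holding the P regression residuals and Monte Carlo standard deviations for one value of alpha):

    % PVR estimator of eq. (34) with O_P and o_P neglected.
    P      = numel(r);
    PVRhat = 100 * (1 - (1 - 1/P) * (var(r ./ s) - 1));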
On the other hand, the PVRs obtained in our applications (Tables 1 and 2) are not rejected by
the Lilliefors test (p-values higher than 0.2) and show similar standard deviations (14 for
SNHT and 12 for SNH2T), suggesting a similar behaviour of the PVR estimator in the two
applications.