Critical values improvement for the Standard Normal Homogeneity Test by combining Monte Carlo and regression approaches

Michele Rienzner, Università degli Studi di Milano, DISAA
Francesca Ieva, Università degli Studi di Milano, Department of Mathematics

Supplemental material (Appendices)

Appendix A

A.1 MATLAB® algorithms description

lscov (MATLAB® general least squares algorithm, with given covariance matrix). GLS is similar to Ordinary Least Squares (OLS) but accounts for the different uncertainty of the data points through a user-specified covariance matrix of the errors. In the application described here, a diagonal matrix (all off-diagonal entries null) holding the estimated variances of the data ($s_{n,\alpha}^2$) was imposed as the error covariance matrix. This algorithm applies only when the equation to be fitted is linear in the parameters.

lsqcurvefit (non-linear least squares curve fit, MATLAB Optimization Toolbox® 3.1.2, R2007b) is a subspace trust-region method based on the interior-reflective Newton method described in Coleman and Li (1994, 1996). Each iteration involves the approximate solution of a large linear system using the method of preconditioned conjugate gradients. In the application described here, thousands of calls to the function were made to reduce the probability of being trapped in local minima. Since lsqcurvefit does not account for different uncertainties in the data, a replication of the data points was adopted in order to weigh the residuals according to their uncertainty (Appendix A.2).

For deeper insight, the reader may refer to Draper and Smith (1966), Ledermann (1984), and Motulsky and Christopoulos (2003) for regression theory, and to Kiviet (2011) for applications of the Monte Carlo technique.

A.2 Accounting for heteroscedasticity

To account for heteroscedasticity, the calibration should consider the normalized residuals ($r_{i,\alpha}/s_{i,\alpha}$). On the other hand, some calibration algorithms, such as lsqcurvefit (MATLAB Optimization Toolbox®), do not consider heteroscedasticity. However, since the optimization minimizes the sum of the squared residuals, it is possible to weigh the residuals according to the uncertainty of the corresponding data points. The desired weighted sum of squares, $S^2$, can be defined as:

$$S^2 = \sum_{i=1}^{P} \frac{r_{i,\alpha}^2}{s_{i,\alpha}^2} = \frac{1}{s_{M,\alpha}^2} \sum_{i=1}^{P} \frac{s_{M,\alpha}^2}{s_{i,\alpha}^2}\, r_{i,\alpha}^2 \approx \frac{1}{s_{M,\alpha}^2} \sum_{i=1}^{P} k_{i,\alpha}\, r_{i,\alpha}^2 \qquad (15)$$

where $P$ is the number of considered values of $n$, $r_{i,\alpha}$ is the $i$-th residual, $s_{i,\alpha}^2$ is the corresponding estimation variance, $s_{M,\alpha}^2 = \max_i(s_{i,\alpha}^2)$ is the largest estimation variance, and $k_{i,\alpha}$ is the integer closest to $s_{M,\alpha}^2 / s_{i,\alpha}^2$. Notice that the calibration is not influenced by a scale factor in the objective function; therefore, the factor $1/s_{M,\alpha}^2$ can be neglected. So, if the calibration is run on a dataset in which the $i$-th data point appears with cardinality $k_{i,\alpha}$ (for any $i$ from 1 to $P$), the resulting sum of squared residuals accounts for heteroscedasticity.
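As an illustration only, the following minimal MATLAB sketch shows both approaches; all variable names (n, C_est, s2, A, beta0) and the model handle @model are hypothetical placeholders, and repelem requires a MATLAB release newer than the R2007b toolbox cited above:

    % Hypothetical inputs: sample sizes n, Monte Carlo estimates C_est,
    % and their estimation variances s2 (one entry per data point).

    % GLS with lscov, for a model that is linear in the parameters:
    V = diag(s2);                      % diagonal error covariance matrix
    betaGLS = lscov(A, C_est(:), V);   % A is the (hypothetical) design matrix

    % Heteroscedastic weighting for lsqcurvefit via data replication:
    k = round(max(s2) ./ s2);          % k_i = integer closest to s_M^2 / s_i^2
    xRep = repelem(n(:), k(:));        % replicate each abscissa k_i times
    yRep = repelem(C_est(:), k(:));    % replicate each data point k_i times
    beta = lsqcurvefit(@model, beta0, xRep, yRep);

Fitting the replicated dataset makes the plain sum of squared residuals minimized by lsqcurvefit approximate the weighted sum $S^2$ of equation (15), up to the neglected scale factor.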
Appendix B

Table B1 SNHT, base estimation, values of the parameters of equation (10)

Parameter         value
a          2.5396958930
b          0.3426638052
c          3.5215643691
d          0.0031724177
e          1.9822161561
f         -0.0106019771

Table B2 SNHT, parameters of equation (11) at varying α

alpha      q1             q2             q3             q4              q5
0.1000     0.0000000000   0.0000000000   0.0000000000    1.0000000000    0.0000000000
0.0800     0.0002216562  -0.0065724527   0.0607364114    0.8772380010    0.0150363668
0.0750     0.0002849471  -0.0084140444   0.0771037982    0.8512722397   -0.0084283899
0.0600     0.0005079990  -0.0149694431   0.1361098420    0.7514867608   -0.0884043919
0.0500     0.0007402975  -0.0218729482   0.2015205436    0.5901702369   -0.0325818775
0.0250     0.0019098465  -0.0573689917   0.5572624072   -0.5373265386    0.9791022702
0.0100     0.0042829304  -0.1309253957   1.3377863800   -3.5497122673    4.8349085691
0.0080     0.0049715711  -0.1526579805   1.5752104846   -4.5296731884    6.2098313470
0.0075     0.0052037248  -0.1599447203   1.6551016000   -4.8677951834    6.7039553896
0.0060     0.0059649280  -0.1840764766   1.9215675826   -5.9980578324    8.3505999960
0.0050     0.0066474361  -0.2056962081   2.1609934543   -7.0270805710    9.8755575961
0.0025     0.0095302300  -0.2973588990   3.1858182901  -11.5339617891   16.7515926161
0.0010     0.0141330928  -0.4457925578   4.8813302765  -19.3071122579   29.2396304872
0.0008     0.0153644702  -0.4856708482   5.3404055837  -21.4446426091   32.7351956150
0.00075    0.0156620084  -0.4954310302   5.4536495410  -21.9699200438   33.5838781948
0.0006     0.0170587871  -0.5403611009   5.9694794753  -24.3840641991   37.5674596516
0.0005     0.0182700973  -0.5795352695   6.4221254592  -26.5252000811   41.1481518693
0.00025    0.0221872392  -0.7082158181   7.9270037754  -33.6711706513   53.0659097475
0.0001     0.0279371477  -0.8988595556  10.1816436731  -44.5741388127   71.6407624533

Table B3 SNH2T, parameters of equation (14) at varying α

alpha      a              b              c             d             e
0.1000     0.7533752930  -0.9038251499   0.3845344216  0.7836932442  1.0881573219
0.0750     0.7838461044  -1.0581999700   0.3060080448  0.8411960917  1.3039295419
0.0500     0.8339804041  -1.2062237184   0.2301043479  0.9129267155  1.5272022994
0.0250     0.9430067785  -1.3527573086   0.1344601246  1.0553664267  1.7979224064
0.0100     1.0828170098  -1.4755575156   0.0643764983  1.2608554452  2.0369970173
0.0075     1.1408831094  -1.4785656324   0.0506819516  1.3295269682  2.0604788435
0.0050     1.1885071022  -1.5138325364   0.0441189096  1.3578127011  2.1129836704
0.0025     1.2826657627  -1.5589969871   0.0261919224  1.5113881641  2.1984014854
0.0010     1.3707505127  -1.6756028175   0.0092289133  1.8493668432  2.3817090845
0.00075    1.4003879462  -1.6742195050   0.0110481169  1.7676319785  2.3747120411
0.0005     1.4557947693  -1.7051382132   0.0046646477  2.0717405282  2.4279475470
0.00025    1.5195503517  -1.7307883269   0.0042280692  2.0759754177  2.4625086193
0.0001     1.5558508769  -1.7303631642   0.0305932717  1.3116896618  2.4028429328

Appendix C

Equation (16) is obtained by dividing the terms of equation (4), at each n, by the standard deviation of the corresponding Monte Carlo estimate ($s_{n,\alpha}$), which is assumed to be known, and then applying the variance operator to both sides:

$$\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right) = \operatorname{var}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}}\right) + \operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right) + 2\operatorname{cov}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}},\frac{e_{n,\alpha}}{s_{n,\alpha}}\right) \qquad (16)$$

We recall again that $e_{n,\alpha}$ is the distance between the unknown true critical value $C(n,\alpha)$ and the Monte Carlo estimate $C^{est}_{n,\alpha}$, $\varepsilon_{n,\alpha}$ is the distance between the regression curve and the unknown true value, and $r_{n,\alpha}$ is the regression residual. The left side of equation (16) can easily be computed, since $r_{n,\alpha}$ and $s_{n,\alpha}$ are known, whereas the right side contains the unknown terms $\varepsilon_{n,\alpha}$ and $e_{n,\alpha}$, which are more difficult to treat. However, some useful approximations can be considered.
First, it is useful to define the following quantity and discuss its properties:

$$O_P = \operatorname{var}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}}\right) - 1 + 2\operatorname{cov}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}},\frac{e_{n,\alpha}}{s_{n,\alpha}}\right) \qquad (17)$$

The values $e_{n,\alpha}/s_{n,\alpha}$ are distributed like a Student's t, getting closer to a standard normal as $s_{n,\alpha}$ becomes a better approximation of $\sigma_{n,\alpha}$. Therefore, if the error variance is accurately identified and the number of regression points ($P$) is large enough, the first term on the right side of equation (17) is close to one, being the variance of $P$ nearly standard normal variables. The third term is the covariance between the standardized Monte Carlo error and the regression error $\varepsilon_{n,\alpha}$ divided by the Monte Carlo uncertainty. Note that, if a data point lies above its true value, we have $e_{n,\alpha} > 0$; increasing its value further (i.e., increasing the error) biases the estimated regression curve upward, increasing $\varepsilon_{n,\alpha}$; hence a positive correlation is expected between $e_{n,\alpha}$ and $\varepsilon_{n,\alpha}$. On the other hand, this dependence between $\varepsilon_{n,\alpha}$ and $e_{n,\alpha}$ is reduced by increasing the number of data points and by accounting for the data uncertainty (heteroscedastic calibration). Indeed, a large error in a data point implies a large uncertainty $s_{n,\alpha}$, which reduces the influence of that data point both in the calibration and in equation (17). Therefore, with $P$ large and a calibration accounting for heteroscedasticity, the covariance term can be expected to be, on average, positive and close to zero. Summing up, $O_P$ is a random variable with a small, positive expectation, and its asymptotic value (as $P$ tends to infinity) is 0. Including $O_P$ in equation (16) we can write:

$$\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right) = \operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right) + \operatorname{var}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}}\right) + 2\operatorname{cov}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}},\frac{e_{n,\alpha}}{s_{n,\alpha}}\right) \qquad (18)$$

$$= \operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right) + 1 + \left[\operatorname{var}\left(\frac{e_{n,\alpha}}{s_{n,\alpha}}\right) - 1 + 2\operatorname{cov}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}},\frac{e_{n,\alpha}}{s_{n,\alpha}}\right)\right] \qquad (19)$$

$$\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right) = 1 + \operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right) + O_P \qquad (20)$$

Defining

$$m_{\varepsilon/s} = \frac{1}{P}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}} \qquad (21)$$

the second term on the right side of equation (20) can then be expanded as:

$$\operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right) = \frac{1}{P-1}\sum_{n=1}^{P}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}} - m_{\varepsilon/s}\right)^2 \qquad (22)$$

$$= \frac{1}{P-1}\left[\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2} - 2\,m_{\varepsilon/s}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}} + P\,m_{\varepsilon/s}^2\right] \qquad (23)$$

$$= \frac{1}{P-1}\left[\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2} - 2P\,m_{\varepsilon/s}^2 + P\,m_{\varepsilon/s}^2\right] \qquad (24)$$

$$= \frac{1}{P-1}\left[\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2} - P\,m_{\varepsilon/s}^2\right] \qquad (25)$$

$$\operatorname{var}\left(\frac{\varepsilon_{n,\alpha}}{s_{n,\alpha}}\right) = \frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2} - \frac{P}{P-1}\,m_{\varepsilon/s}^2 \qquad (26)$$

Since, under the optimality conditions (reported in Section 3.2), $r_{n,\alpha}$ and $e_{n,\alpha}$ are zero-mean normal variables, $\varepsilon_{n,\alpha}$ is also a zero-mean variable (equation 4) and, $s_{n,\alpha}$ being strictly positive, the expectation of $m_{\varepsilon/s}$ is zero (i.e., it is the mean of $P$ zero-mean variables). Since the values of $s_{n,\alpha}$ do not depend on $P$, the variance of $m_{\varepsilon/s}$ tends to 0 as $P$ increases. Therefore, with $P$ large, the last term in equation (26) can be considered a negligible, strictly positive quantity ($o_P$), and equation (20) becomes:

$$\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right) = 1 + \frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2} + O_P - o_P \qquad (27)$$

which can be rewritten as:

$$\frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2} = \operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right) - 1 - O_P + o_P \qquad (28)$$

The term on the left side of equation (28) is the average of the ratios between a bare estimate of the variance introduced by adopting the regression curve ($\varepsilon_{n,\alpha}^2$) and the variance of the corresponding initial Monte Carlo estimates ($s_{n,\alpha}^2$).
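As a quick numerical check of the algebra in equations (22)-(26), the following standalone MATLAB sketch (with simulated values; the name eps_over_s is illustrative and stands for the ratios $\varepsilon_{n,\alpha}/s_{n,\alpha}$) verifies that the expanded form matches the sample variance:

    % Verify equation (26): var(x) = sum(x.^2)/(P-1) - P*mean(x)^2/(P-1)
    P = 200;
    eps_over_s = randn(P, 1);                       % simulated ratios eps/s
    m   = mean(eps_over_s);                         % m_{eps/s}, equation (21)
    lhs = var(eps_over_s);                          % sample variance (divisor P-1)
    rhs = sum(eps_over_s.^2)/(P-1) - P*m^2/(P-1);   % right side of equation (26)
    disp(lhs - rhs)                                 % zero up to machine precision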
Indeed, the values $\varepsilon_{n,\alpha}^2$ are squared errors with zero mean, and can be seen as bare estimates of the variance of the error made by the regression function in matching the actual curve at the inspected values of $n$, while the values $s_{n,\alpha}^2$ are the estimated variances of the data points in matching the actual curve. Therefore, this term can be used to build an indicator of the error variance reduction. Define then $PVR_\alpha$ as the average (over $n$) of the variance of the Monte Carlo errors, minus the variance of the regression error, divided by the former. $PVR_\alpha$ can thus be written as:

$$PVR_\alpha = 100\,\frac{1}{P}\sum_{n=1}^{P}\frac{s_{n,\alpha}^2 - \varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2} \qquad (29)$$

According to the findings above, we can rewrite

$$PVR_\alpha = 100\left[1 - \frac{1}{P}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}\right] = 100\left[1 - \frac{P-1}{P}\,\frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}\right] \qquad (30)$$

$$= 100\left[1 - \left(1 - \frac{1}{P}\right)\frac{1}{P-1}\sum_{n=1}^{P}\frac{\varepsilon_{n,\alpha}^2}{s_{n,\alpha}^2}\right] \qquad (31)$$

Including equation (28) in equation (30) we obtain:

$$PVR_\alpha = 100\left\{1 - \left(1 - \frac{1}{P}\right)\left[\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right) - 1 - O_P + o_P\right]\right\} \qquad (32)$$

$$= 100\left[1 - \left(1 - \frac{1}{P}\right)\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right)\right] + 100\left(1 - \frac{1}{P}\right)\left(1 + O_P - o_P\right) \qquad (33)$$

$$PVR_\alpha = 100\left[1 - \left(1 - \frac{1}{P}\right)\left(\operatorname{var}\left(\frac{r_{n,\alpha}}{s_{n,\alpha}}\right) - 1\right)\right] + 100\left(1 - \frac{1}{P}\right)\left(O_P - o_P\right) \qquad (34)$$

Obviously, $O_P$ and $o_P$ have to be neglected in computing equation (34). Since the expectations of both $O_P$ and $o_P$ are small and positive, their difference should reduce their effect on the estimated value. However, since $o_P$ is a squared quantity, it is likely to have a heavy right tail, and its variance is inversely related to the values of $s_{n,\alpha}$. Therefore, large drawings of $o_P$ can occur (especially when the $s_{n,\alpha}$ are small), significantly biasing some of the estimated $PVR$s upward. On the other hand, the $PVR$ values obtained in our applications (Tables 1 and 2) are not rejected by the Lilliefors test (p-values higher than 0.2) and show similar standard deviations (14 for SNHT and 12 for SNH2T), suggesting a similar behaviour of the $PVR$ estimator in the two applications.
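For concreteness, a minimal MATLAB sketch of the $PVR$ computation from equation (34), with $O_P$ and $o_P$ neglected; the input vectors r and s are hypothetical placeholders for the regression residuals $r_{n,\alpha}$ and the Monte Carlo standard deviations $s_{n,\alpha}$ at the $P$ inspected sample sizes:

    % Estimate PVR from the standardized regression residuals, equation (34)
    P   = numel(r);                            % number of regression points
    vr  = var(r ./ s);                         % variance of standardized residuals
    PVR = 100 * (1 - (1 - 1/P) * (vr - 1));    % O_P and o_P neglected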