A Non-Linear Regression Model for Time Series with Heteroskedastic Conditional Variance
(Un Modello Non Lineare per l'Analisi delle Serie Storiche con Varianza Condizionale Eteroschedastica)

Isabella Morlini
Dipartimento di Economia – Sezione di Statistica, Via Kennedy 6, 43100 Parma, Italy, e-mail: [email protected]

Abstract: This paper presents a methodology for applying non-linear models to the analysis of time series with heteroskedastic conditional variance. We consider the generic model y_t = f(x; w) + ε_t, where y_t is the value of the series at time t, w is a vector of parameters, x is a vector of exogenous and/or lagged endogenous variables, f is a non-linear function and ε_t ~ N(0, σ²). In the proposed methodology this model is fitted by minimising a cost function with a weight-decay penalty term, in order to describe the conditional expectation ⟨y_t|x⟩. A second model is then fitted to describe the conditional variance h_t² = ⟨(y_t − ⟨y_t|x⟩)²|x⟩ at time t, using {y_t − ⟨y_t|x⟩}² as target values. The methodology is applied to series simulated from AR(2) processes with ARCH(1) and GARCH(1,1) disturbances.

Keywords: Flexible modelling, Prediction intervals, Weight decay.

1. Time Series Prediction Using Non-Linear Modelling

Consider a non-linear regression model defined by

    y_t = f(x; w) + ε_t        (t = 1, …, T)        (1)

where y_t is the value of a series at time t, x is a vector of exogenous or lagged endogenous variables, w is a vector of adaptive parameters, ε_t ~ N(0, σ²) with ε_t|x ~ N(0, h_t²), and f is a smooth, non-linear mapping function. If the estimated parameters w̃ are chosen to minimise the sum-of-squares error E = Σ_{t=1}^{T} (y_t − ŷ_t(x; w̃))², then the predicted values ŷ_t(x; w̃) approximate the conditional averages of the targets:

    ŷ_t(x; w̃) ≈ ⟨y_t|x⟩        (t = 1, …, T)        (2)

where ŷ_t(x; w̃) is the network mapping with the weight vector at the minimum of the error function and ⟨y_t|x⟩ = ∫ y_t p(y_t|x) dy_t.
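Result (2) can be checked numerically. The sketch below is illustrative only (it substitutes a simple polynomial for the network mapping f and uses synthetic data not taken from the paper): it fits a flexible model by least squares and compares the fitted values with the true conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: true conditional mean <y|x> = sin(x),
# with additive Gaussian noise of standard deviation 0.3.
x = rng.uniform(-3.0, 3.0, 20000)
y = np.sin(x) + rng.normal(0.0, 0.3, x.size)

# Least-squares fit of a flexible (degree-9 polynomial) mapping.
coef = np.polyfit(x, y, deg=9)
y_hat = np.polyval(coef, x)

# The fitted values approximate the conditional averages of the targets:
# the deviation from sin(x) is small compared with the noise level.
max_err = np.max(np.abs(y_hat - np.sin(x)))
print(max_err)
```

With a large sample and a sufficiently flexible mapping, the fitted values track ⟨y|x⟩ even though each individual target is noisy, which is exactly the property exploited in the next section.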
This result (Bishop, 1995, pp. 201-202) is independent of the choice of the mapping function; it only requires that the representation of the non-linear mapping be sufficiently general and that the data set and the vector of adaptive parameters be sufficiently large. A non-parametric approach to determining the input-dependent variance of y_t may be based on result (2). Once model (1) has been fitted, the predicted values ⟨y_t|x⟩ are subtracted from the target values y_t; the results are then squared and used as targets for a second non-linear model, which is also fitted using a sum-of-squares error function. The output of this second model then represents the conditional average of (y_t − ⟨y_t|x⟩)² and thus approximates the variance h_t²(x) = ⟨(y_t − ⟨y_t|x⟩)²|x⟩. The principal limitation of this technique is that it requires training until convergence, which may cause overfitting in practical applications, where the training data are of limited size and affected by a high degree of noise.

2. The Suggested Approach

The idea of this work is to present an alternative approach to estimating the input-dependent variance, using weight decay to fit the models. Instead of minimising the sum-of-squares error, we minimise the cost function

    C = Σ_{t=1}^{T} (y_t − ŷ_t(x; w))² + λ Σ_{k=1}^{K} w_k²

where K is the total number of parameters and λ is a regularisation coefficient. Although this method can be considered an ad-hoc device to prevent overfitting, we justify the approach on Bayesian grounds, in order to demonstrate that result (2) is still valid when ŷ_t(x; w̃) is the non-linear mapping with the parameter vector at the minimum of C. Given the set of target values Y = (y_1, …, y_T), the approximated posterior distribution of w is (MacKay, 1992)

    p(w|Y) = (1/Z_w) exp{ −C(w̃) − ½ (w − w̃)ᵀ H (w − w̃) }

where Z_w is the normalisation constant, C(w̃) is the cost function at its minimum value and H is the Hessian matrix, whose klth entry is ∂²C/∂w_k∂w_l.
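As a concrete sketch of the two-stage procedure, the fragment below replaces the neural network with a Gaussian radial-basis expansion; this is an assumption made for compactness, since the mapping is then linear in the adaptive parameters and the weight-decay minimisation has a closed form. All data and parameter values are illustrative. Stage one fits the conditional mean; stage two reuses the squared residuals as targets for the variance model.

```python
import numpy as np

def design(x, centers, width=0.5):
    # Gaussian radial-basis expansion: a flexible non-linear mapping
    # that is linear in the adaptive parameters w.
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def fit_weight_decay(X, y, lam):
    # Minimise sum((y - X w)^2) + lam * sum(w^2)  (weight decay):
    # closed-form solution of the penalised normal equations.
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, 5000)
h2 = 0.05 + 0.2 * x**2                      # input-dependent (heteroskedastic) variance
y = np.sin(x) + rng.normal(0.0, np.sqrt(h2))

centers = np.linspace(-3.0, 3.0, 25)
X = design(x, centers)

# Stage 1: estimate the conditional mean <y_t|x>.
w_mean = fit_weight_decay(X, y, lam=1e-3)
y_hat = X @ w_mean

# Stage 2: estimate the conditional variance h_t^2(x),
# using the squared residuals as targets.
r2 = (y - y_hat) ** 2
w_var = fit_weight_decay(X, r2, lam=1e-3)
h2_hat = np.maximum(X @ w_var, 1e-6)        # clip to keep the variance positive
```

The positivity clip is a practical safeguard, not part of the formal derivation: a sum-of-squares fit to squared residuals is not guaranteed to stay non-negative everywhere.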
Recalling equation (1), which leads to p(y_t|x; w) ∝ exp{ −(y_t − ŷ_t(x; w))² / 2h_t² }, the posterior distribution p(y_t|x; Y) = ∫ p(y_t|x; w) p(w|Y) dw can be written as follows:

    p(y_t|x; Y) ∝ ∫ exp{ −(y_t − ŷ_t(x; w))² / 2h_t² } exp{ −½ (w − w̃)ᵀ H (w − w̃) } dw        (3)

where any constant factor independent of y_t has been dropped. Assume the width of this distribution to be sufficiently narrow that the function ŷ_t(x; w) can be approximated by its linear expansion around w̃. Then ŷ_t(x; w) = ŷ_t(x; w̃) + gᵀ(w − w̃), where g is a vector whose kth entry is ∂ŷ_t/∂w_k, and expression (3) can be written as follows:

    p(y_t|x; Y) ∝ ∫ exp{ −(y_t − ŷ_t(x; w̃) − gᵀ(w − w̃))² / 2h_t² − ½ (w − w̃)ᵀ H (w − w̃) } dw        (4)

The integral in (4) is evaluated to give the Gaussian distribution

    p(y_t|x; Y) = (1/Z_y) exp{ −(y_t − ŷ_t(x; w̃))² / [2(h_t² + gᵀH⁻¹g)] }        (5)

where Z_y is the normalisation factor, ŷ_t(x; w̃) is the mean ⟨y_t|x; Y⟩ and (h_t² + gᵀH⁻¹g) is the variance ⟨(y_t − ⟨y_t|x; Y⟩)²|x; Y⟩. This variance is given by the sum of two terms: the first is the variance h_t² of y_t conditioned on the input vector; the second represents the width of the posterior distribution of the parameters. The estimates ĥ_t²(x; ṽ) of this variance are obtained at a second stage by fitting another non-linear model, governed by a vector v of parameters, with the same input vector x and with target values given by the squared differences (y_t − ŷ_t(x; w̃))². This second model is also fitted by weight decay. The procedure enables the approximated intervals [ŷ_t(x; w̃) ± z_{α/2} ĥ_t(x; ṽ)] to be obtained, which are based on the distribution of y_t conditioned on x and on the data set.

3. Simulation Study

Since neural networks provide a practical framework for approximating arbitrary non-linear mappings, these models are used to validate the approach suggested in Section 2.
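Once ŷ_t and ĥ_t² are available, the approximated intervals are straightforward to compute. A minimal helper (the names are illustrative, not from the paper), with z_{α/2} = 1.96 for the 95% case:

```python
import numpy as np

def prediction_interval(y_hat, h2_hat, z=1.96):
    # Approximate 100(1 - alpha)% interval  y_hat +/- z_{alpha/2} * h_hat,
    # based on the estimated conditional standard deviation.
    h = np.sqrt(h2_hat)
    return y_hat - z * h, y_hat + z * h

lo, hi = prediction_interval(np.array([0.2]), np.array([0.04]))
print(lo, hi)  # approximately [-0.192] [0.592]
```

Note that ĥ_t² here stands in for the full variance h_t² + gᵀH⁻¹g of equation (5), since the second model is trained to approximate that conditional second moment directly.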
Two multi-layer feedforward networks are developed, first to obtain time series predictions and then to estimate the heteroskedastic variances h_t²(x). The four most recent observations of the series are used as inputs, so the networks have four nodes in the input layer, seven nodes in the hidden layer and one node in the output layer. The transfer functions are hyperbolic tangents in the hidden layer and linear in the output layer. Five second-order autoregressive (AR(2)) time series with Gaussian errors modelled by an ARCH(1) process (Engle, 1982) and five AR(2) series with errors modelled by a GARCH(1,1) process are simulated, considering the equation ε_t = h_t ζ_t, with ζ_t i.i.d. N(0, 1) pseudo-random numbers generated from a Gaussian distribution, h_t² = α_0 + α_1 ε_{t−1}² for the ARCH process and h_t² = α_0 + α_1 ε_{t−1}² + β_1 h_{t−1}² for the GARCH process (α_0 and α_1 being positive). All time series have starting values y_1 = 0.264, y_2 = 0.532, h_0² = 0.031 and α_0 = 0.3. The signal-to-noise ratio, the ratio of the unconditional variance σ_y² of y_t to the unconditional noise variance σ², ranges from 1.85 to 116, while σ² ranges from 0.02 to 0.9. Each time series consists of 9,000 observations: 5,000 are used for training and 4,000 for testing. Table 1 shows the results of the experiment: the networks are trained with backpropagation to minimise the weight-decay cost function, and ŷ_t(x; w̃), ĥ_t²(x; ṽ) and the 95% approximated prediction intervals are then computed for the test points. Table 1 reports the percentage of observations falling within these intervals, within [ŷ_t(x; w̃) ± 1.96 h_t], where the estimated values ŷ_t(x; w̃) but the true variances h_t² are used, and within [⟨y_t|x⟩ ± 1.96 ĥ_t(x; ṽ)], where the true values ⟨y_t|x⟩ but the estimated variances ĥ_t²(x; ṽ) are used. The average sizes of the real intervals and of the approximated prediction intervals are also given.
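The simulated series can be generated as sketched below. The AR coefficients and the ARCH/GARCH parameters other than α_0 = 0.3 are illustrative assumptions, since the paper does not report them; setting b1 = 0 recovers the ARCH(1) case, and the starting values y_1 = 0.264, y_2 = 0.532, h_0² = 0.031 follow the paper (the initial residuals are assumed to be zero).

```python
import numpy as np

def simulate_ar2_garch(T, phi1, phi2, a0, a1, b1=0.0, seed=0):
    # AR(2) series with GARCH(1,1) errors eps_t = h_t * zeta_t, zeta_t ~ N(0,1);
    # b1 = 0 gives the ARCH(1) special case.
    rng = np.random.default_rng(seed)
    y, h2, eps = np.empty(T), np.empty(T), np.empty(T)
    y[0], y[1] = 0.264, 0.532          # starting values from the paper
    h2[0] = h2[1] = 0.031              # h_0^2 from the paper
    eps[0] = eps[1] = 0.0              # assumed initial residuals
    for t in range(2, T):
        h2[t] = a0 + a1 * eps[t - 1] ** 2 + b1 * h2[t - 1]
        eps[t] = np.sqrt(h2[t]) * rng.standard_normal()
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]
    return y, h2

# Illustrative parameter choice (stationary AR part, a1 + b1 < 1).
y, h2 = simulate_ar2_garch(9000, phi1=0.5, phi2=-0.3, a0=0.3, a1=0.2, b1=0.3)
```

With a1 + b1 < 1 the unconditional noise variance is a0 / (1 − a1 − b1), which for these illustrative values equals 0.6; the sample mean of the simulated h_t² should be close to that.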
The accuracy of the fit between predicted and real values is measured by the root mean square error (RMSE) and by the mean absolute error (MAE). In order to evaluate the ability of the first network to predict the change in direction, rather than in magnitude, the confusion rate is also given. The last two rows report the sample means of h_t² and of ĥ_t²(x; ṽ).

Table 1: Simulation results

Model                                 ----------- ARCH(1) ----------   ---------- GARCH(1,1) --------
Signal-to-noise ratio                 1.85   116    116   1.85   3.86   1.85   116    116   1.85   3.86
Noise variance σ²                     0.10   0.10   0.50  0.90   0.90   0.02   0.02   0.50  0.90   0.90
% of points in [ŷ_t ± 1.96 ĥ_t]      90.6   92.9   96.0  88.8   92.2   91.9   91.7   90.1  92.5   92.9
% of points in [ŷ_t ± 1.96 h_t]      94.5   92.5   93.1  94.9   94.6   94.4   93.7   93.3  94.9   83.8
% of points in [⟨y_t|x⟩ ± 1.96 ĥ_t]  90.7   94.5   98.6  94.5   92.8   92.2   93.2   91.1  92.7   96.6
Average size of real intervals        1.05   1.05   2.61  2.25   2.25   0.73   0.73   2.71  2.46   2.46
Average size of estimated intervals   1.10   1.20   2.45  2.26   2.35   0.84   0.84   2.75  2.42   3.22
RMSE for ⟨y_t|x⟩                      0.09   0.16   0.56  0.33   0.67   0.07   0.10   0.44  0.25   0.75
MAE for ⟨y_t|x⟩                       0.03   0.08   0.16  0.08   0.10   0.03   0.04   0.17  0.07   0.49
Confusion rate for ⟨y_t|x⟩            0.03   0.01   0.00  0.03   0.02   0.05   0.00   0.01  0.03   0.20
RMSE for h_t²                         0.15   0.24   0.87  7.55   6.71   0.11   0.12   1.10  6.31   5.94
MAE for h_t²                          0.05   0.08   0.46  0.60   0.52   0.04   0.04   0.37  0.46   0.71
Sample mean of h_t²                   0.10   0.10   0.50  0.75   0.75   0.05   0.05   0.50  0.75   0.75
Sample mean of ĥ_t²(x; ṽ)            0.11   0.10   0.57  0.54   0.58   0.06   0.05   0.60  0.52   0.84

(In the interval rows, ŷ_t stands for ŷ_t(x; w̃) and ĥ_t for ĥ_t(x; ṽ).)

The prediction intervals, capturing between 88.8% and 96.0% of the points, are consistent with the nominal coverage, although the percentage of points tends to be smaller than 95%. When the coverage falls short of 95%, the average size of the prediction intervals is not too narrow with respect to the real prediction intervals. This means that the main source of error lies in the estimates ŷ_t(x; w̃); the better performance when the true values ⟨y_t|x⟩ are used confirms this remark.
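The coverage percentages in Table 1 are obtained by counting test points inside the intervals. A minimal sketch of this computation, on synthetic Gaussian data that are illustrative only (not the paper's simulated series):

```python
import numpy as np

def coverage(y, y_hat, h2_hat, z=1.96):
    # Fraction of observations falling inside y_hat +/- z * sqrt(h2_hat);
    # to be compared with the nominal level (95% for z = 1.96).
    h = np.sqrt(h2_hat)
    inside = (y >= y_hat - z * h) & (y <= y_hat + z * h)
    return inside.mean()

# Check on 4,000 synthetic test points with known mean 0 and variance 0.5.
rng = np.random.default_rng(2)
h2 = np.full(4000, 0.5)
y = rng.normal(0.0, np.sqrt(h2))
cov = coverage(y, np.zeros(4000), h2)
print(round(100 * cov, 1))  # close to the nominal 95
```

With the true mean and variance the empirical coverage concentrates around 95%, so shortfalls such as the 88.8% entry in Table 1 point to estimation error rather than to the interval construction itself.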
A large value of σ² avoids overestimation of h_t² but leads to less accurate fits both for h_t² and for ⟨y_t|x⟩. The sample mean of the estimated variances is close to the mean of the real values, especially for small σ². In conclusion, the approximated prediction intervals and the errors in fit reveal quite satisfactory performance. The model appears to work adequately, especially considering that it does not require the correct process generating the conditional variance to be specified. More research is needed to validate these preliminary results.

Main References

Bishop C.M. (1995) Neural Networks for Pattern Recognition, Clarendon Press, Oxford.
Engle R.F. (1982) Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation, Econometrica, 50, 987-1007.
MacKay D.J.C. (1992) A practical Bayesian framework for back-propagation networks, Neural Computation, 4 (3), 448-472.
