EVALUATION OF WEATHER FORECASTS
SHORT COURSE ON PREDICTABILITY – III
Zoltan Toth
Global Systems Division, GSD/ESRL/OAR/NOAA
Acknowledgements: Yuejian Zhu, Gopal Iyengar, Olivier Talagrand and a large group of colleagues and collaborators
ELTE, Budapest, 26-27 May 2016

OUTLINE
• Purposes and types of evaluation – Verification, diagnostics
• Attributes of forecast systems – Statistical reliability & resolution
• Measures of reliability & resolution
• Metrics of ensemble performance – Reliability – More complex measures

FORM OF WEATHER FORECASTS
• Predictability of weather is limited due to
  – Initial condition and model related errors &
  – The chaotic nature of the atmosphere
• Predictability varies from case to case due to
  – Complex nonlinear interactions in the atmosphere
• Probabilistic forecasts capture case dependent variations in predictability
  – Ensemble forecasting
• The only scientifically consistent form of forecasts

FORECASTING IN A CHAOTIC ENVIRONMENT – PROBABILISTIC FORECASTING BASED ON A SINGLE FORECAST
• One integration with an NWP model, combined with past verification statistics

DETERMINISTIC APPROACH – PROBABILISTIC FORMAT
• Does not contain all forecast information
• Not the best estimate for the future evolution of the system
• UNCERTAINTY CAPTURED ONLY IN A TIME AVERAGE SENSE
• NO ESTIMATE OF CASE DEPENDENT VARIATIONS IN FORECAST UNCERTAINTY

PROBABILISTIC FORECAST APPROACHES
• Theoretically appealing approach – the Liouville equation
  – Linear constraints & computational expense render it impractical
• Brute force Monte Carlo approach – ensemble forecasting
  – Dependable results with multiple integrations
  – Case dependent probabilistic forecasts = predictability
• Conceptual challenges
  – Estimate the analysis error distribution
  – Sample the analysis error distribution
  – Represent model related uncertainty
• Practical implementations
  – Multiple analyses (EnKF, (L)ETKF, multiple 4DVar, etc.)
  – Breeding or Ensemble Transform
  – Singular Vectors
• Probabilistic forecast products derived from ensembles

NEEDS FOR SYSTEMATIC EVALUATION OF FORECASTS
• Research exploration
  – Learn about forecast systems – random vs. systematic errors, etc.
• Research guidance
  – Guide NWP development efforts
• Compare different systems
  – E.g., with different types of output – single unperturbed vs. ensemble systems
• Evaluate the value of / added by subcomponents of a system
  – E.g., improved DA, ensemble, or product generation methods
• Management decisions
  – Select the best performing DA–forecast system for operational use
• Assess user value
  – Social, economic, environmental applications

USER REQUIREMENTS
• General characteristics of forecast users
  – Each user is affected in a specific way by
    • Various weather elements at
    • Different points in time & space
• Requirements for optimal decisions related to operations affected by weather
  – Possible weather scenarios with covariances across
    • Variables, space, and time
• Provision of weather information
  – The specific information needed by each user can't be foreseen or provided
    • Only vanilla-type, often used probabilistic info can be distributed
  – Ensemble data must be made accessible to users in statistically reliable form
    • All forecast info can be derived from this, including vanilla probabilistic products

STATISTICAL ASSESSMENT
• Performance of forecast systems
  – Assessed over a sample of forecasts – not a single forecast
• Sample size affects
  – How fine a detail we can uncover
  – Statistical significance of results (see the sketch after this slide)
  – Size of the sample is limited by the available observational record
• Verification
  – Quality & utility
  – Compare output from the forecast system with a proxy for truth
  – Main subject of this lecture
• Diagnostics
  – Other characteristics
  – Depend only on forecast properties
    • Unrelated to truth
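The deck later lists routine significance testing among missing verification components; since sample size directly controls the significance of score differences, here is a minimal sketch of a paired bootstrap test for the difference in mean scores between two forecast systems. This is not part of the original lecture; the function name, the 95% level, and the synthetic data are illustrative assumptions, and only numpy is assumed.

```python
import numpy as np

def bootstrap_score_diff(score_a, score_b, n_boot=10000, seed=0):
    """Paired bootstrap test for the difference in mean verification
    scores of two forecast systems over the same cases.

    score_a, score_b : 1-D arrays of per-case scores (e.g., Brier scores).
    Returns the observed mean difference and a 95% confidence interval.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(score_a) - np.asarray(score_b)
    n = diff.size
    # Resample cases with replacement, keeping the pairing intact
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return diff.mean(), (lo, hi)

# Usage with synthetic per-case scores: if the interval excludes zero,
# the difference is significant at roughly the 5% level for this sample.
mean_diff, ci = bootstrap_score_diff(np.random.rand(500) * 0.20,
                                     np.random.rand(500) * 0.25)
print(f"mean score difference: {mean_diff:.4f}, 95% CI: {ci}")
```

For serially correlated daily scores, a block bootstrap (resampling contiguous blocks of cases) would be the more appropriate variant.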
VERIFICATION
• Measures of quality
  – Environmental science issues – the focus of this talk
• Measures of utility
  – Multidisciplinary issues
  – Social & economic questions, beyond the environmental sciences
  – The socio-economic value of forecasts is the ultimate measure
    • Approximate measures can be constructed
• Improved quality => enhanced utility
• How to improve utility if quality is fixed?
  – Communicate all available user relevant information
    • Offer probabilistic or other information on forecast uncertainty
  – Engage in education, training
• Can we selectively improve certain aspects of forecasts?
  – E.g., improve precipitation forecasts without improving circulation forecasts?

VERIFICATION BASICS
• Define the predictand
  – An exhaustive set of events, e.g.,
    • Continuous temperature
    • Precipitation types (categorical)
• Choose a proxy for truth
  – Observations or
  – An observationally-based fine scale analysis
  – Results are subject to error in the proxy for truth
• Collect a sample of forecasts and the verifying truth

VERIFICATION PREAMBLE
• What do we want / need to assess?
  – Attributes of forecast systems
• What can we measure that reflects those attributes?
  – Quantitative metrics
• Comprehensive approach
  – Cover all bases
  – Understand the relationships between metrics

EVALUATING THE QUALITY OF FORECAST SYSTEMS
• Formats of forecasts
  – Categorical, single value / multiple value, probabilistic
• Forecast system attributes
  – Performance specific characteristics
  – Traditionally defined separately for each forecast format
• A general definition is needed to
  – Compare forecasts
    • From any system &
    • Of any type / format
  – Support systematic evaluation of the
    • End-to-end (provider-to-user) forecast process
      – Statistical post-processing as an integral part of the system

FORECAST SYSTEM ATTRIBUTES
• Abstract concepts (like length)
  – Reliability and resolution
    • Both can be measured through different statistics
• Statistical properties
  – Interpreted over a large set of forecasts
    • Describe the behavior of the forecast system, not a single forecast
• Assumptions
  – Forecasts
    • Can be of any format
      – Single value, ensemble, categorical, probabilistic, etc.
    • Take a finite number of different "classes" Fa
  – Observations
    • Can also be grouped into a finite number of "classes" like Oa

STATISTICAL RELIABILITY – TEMPORAL AGGREGATE
STATISTICAL CONSISTENCY OF FORECASTS WITH OBSERVATIONS
BACKGROUND:
• Consider a particular forecast class – Fa
• Consider the frequency distribution of the observations that follow forecasts Fa – fdoa
DEFINITION:
• If forecast Fa has the exact same form as fdoa, for all forecast classes, the forecast system is statistically consistent with observations => the forecast system is perfectly reliable
MEASURES OF RELIABILITY:
• Based on different ways of comparing Fa and fdoa
[Figure: examples for a control forecast and an ensemble]

STATISTICAL RESOLUTION – TEMPORAL EVOLUTION
ABILITY TO DISTINGUISH, AHEAD OF TIME, AMONG DIFFERENT OUTCOMES
BACKGROUND:
• Assume observed events are classified into a finite number of classes, like Oa
DEFINITION:
• If all observed classes (Oa, Ob, …) are preceded by distinctly different forecasts (Fa, Fb, …), the forecast system "resolves" the problem => the forecast system has perfect resolution
MEASURES OF RESOLUTION:
• Based on the degree of separation of the fdo's that follow the various forecast classes
• Measured by the difference between the fdo's & the climate distribution
• Measures differ in how the differences between distributions are quantified (illustrated in the sketch below)
[Figure: example forecasts and observations]
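To make the two definitions above concrete, the sketch below groups binary observations by forecast class and compares, for each class Fa, the mean forecast probability with the observed frequency (the fdo): agreement indicates reliability, while spread of the fdo's away from climatology indicates resolution. This is a minimal illustration assuming binary events and probability forecasts; the function name, bin choice, and synthetic data are not from the lecture.

```python
import numpy as np

def fdo_by_class(fcst_prob, obs, bins=np.linspace(0.0, 1.0, 11)):
    """Group binary observations by forecast class and return, per class,
    the mean forecast probability (Fa), the observed relative frequency
    (the fdo for that class), and the class sample size."""
    fcst_prob = np.asarray(fcst_prob)
    obs = np.asarray(obs, dtype=float)
    which = np.digitize(fcst_prob, bins[1:-1])   # class index per case
    rows = []
    for k in range(len(bins) - 1):
        mask = which == k
        if mask.any():
            rows.append((fcst_prob[mask].mean(),  # Fa: what was forecast
                         obs[mask].mean(),        # fdo_a: what happened
                         int(mask.sum())))        # sample size in class
    return rows

# Reliability: Fa close to fdo_a in every class.
# Resolution: fdo_a values spread out, away from the overall climatology.
probs = np.random.rand(1000)
events = (np.random.rand(1000) < probs).astype(int)  # reliable by construction
for fa, fdo, n in fdo_by_class(probs, events):
    print(f"Fa={fa:.2f}  fdo={fdo:.2f}  n={n}")
print("climatology:", events.mean())
```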
CHARACTERISTICS OF RELIABILITY & RESOLUTION
• Reliability
  – Related to the form of the forecast, not the forecast content
    • Fidelity of the forecast
  – Reproduces nature: when resolution is also perfect, the forecast looks like nature
  – Not related to the time sequence of the forecast/observed systems
  – How to improve?
    • Make the model more realistic
      – Also expected to improve resolution
    • Statistical bias correction: can be statistically imposed at one time level
      – If both the natural & forecast systems are stationary in time &
      – If there is a large enough set of observed–forecast pairs
      – Link with verification:
        » Replace the forecast with the corresponding fdo
• Resolution
  – Related to the inherent predictive value of the forecast system
  – Not related to the form of the forecasts
    • Statistical consistency at one time level (reliability) is irrelevant
  – How to improve?
    • Enhanced knowledge about the time sequence of events
      – A more realistic numerical model should help
        » May also improve reliability

CHARACTERISTICS OF FORECAST SYSTEM ATTRIBUTES
RELIABILITY AND RESOLUTION ARE
• General forecast attributes
  – Valid for any forecast format (single, categorical, probabilistic, etc.)
• Independent attributes
  – For example
    • A climate pdf forecast is perfectly reliable, yet has no resolution
    • A reversed rain / no-rain forecast can have perfect resolution and no reliability
  – To separate them, they must be measured according to the general definition
    • If measured according to the traditional, narrower definitions, reliability & resolution can be mixed
• A full description of forecast quality
  – There is no other relevant forecast attribute
    • Perfect reliability and perfect resolution = a perfect forecast system
      – A "deterministic" forecast system that is always correct
• Both needed for utility of forecast systems
  – Need both reliability and resolution
    • Especially if no observed/forecast pairs are available (e.g., extreme forecasts, etc.)

FORMAT OF FORECASTS – PROBABILISTIC FORMAT
• Do we have a choice?
  – When forecasts are imperfect
    • Only the probabilistic format can be reliable/consistent with nature
• Abstract concept – dimensionless
  – Related to forecast system attributes
    • Space of probability – dimensionless pdf or similar format
      – For environmental variables (not those variables themselves)
• Steps (see the sketch after the next slide)
  1. Define the event
     • A function of concrete variables, features, etc.
       – E.g., "temperature above freezing"; "thunderstorm"
  2. Determine the probability of the event occurring in the future
     – Based on knowledge of the initial state and evolution of the system

OPERATIONAL PROB/ENSEMBLE FORECAST VERIFICATION
• Requirements
  – Use the same general dimensionless probabilistic measures for verifying
    • Any event
    • Against either
      – Observations or
      – A numerical analysis
• Measures used at NCEP
  – Probabilistic forecast measures – the ensemble interpreted probabilistically
    • Reliability
      – Component of BSS, RPSS, CRPSS
      – Attributes & Talagrand diagrams
    • Resolution
      – Component of BSS, RPSS, CRPSS
      – ROC, attributes diagram, potential economic value
  – Special ensemble verification procedures
    • Designed to assess performance of a finite set of forecasts
      – Most likely member statistics, PECA
• Missing components include
  – General event definition – spatial/temporal/cross-variable considerations
  – Routine testing of statistical significance
  – Other "spatial" and/or "diagnostic" measures?
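As a toy illustration of the two steps listed above (define an event, then assign it a probability), the sketch below derives an event probability from an ensemble as the fraction of members in which the event occurs. The variable names, the 0 °C threshold, and the synthetic members are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def event_probability(ens_fcst, threshold=0.0):
    """Probability of the event 'temperature above freezing', estimated
    as the fraction of ensemble members exceeding the threshold.

    ens_fcst : array of shape (n_members, ...) of forecast temperatures.
    Returns an array of probabilities with the member axis removed.
    """
    ens_fcst = np.asarray(ens_fcst)
    return (ens_fcst > threshold).mean(axis=0)

# 20 hypothetical members forecasting temperature at 5 grid points
members = np.random.normal(loc=1.0, scale=2.0, size=(20, 5))
print(event_probability(members))   # e.g., [0.70 0.60 0.75 ...]
```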
FORECAST PERFORMANCE MEASURES
COMMON CHARACTERISTIC: a function of both forecast and observed values

MEASURES OF RELIABILITY:
DESCRIPTION: Statistically compare any sample of forecasts with the sample of corresponding observations
GOAL: To assess the similarity of the samples (e.g., whether the 1st and 2nd moments match)
EXAMPLES:
• Reliability component of the Brier score
• Ranked probability score
• Analysis rank histogram
• Spread vs. ensemble mean error
• Etc.

MEASURES OF RESOLUTION:
DESCRIPTION: Compare the distributions of observations that follow different classes of forecasts with the climate distribution (as reference)
GOAL: To assess how well the observations are separated when grouped by the different classes of preceding forecasts
EXAMPLES:
• Resolution component of the Brier score
• Ranked probability score
• Information content
• Relative operating characteristics
• Relative economic value
• Etc.

COMBINED (REL+RES) MEASURES: Brier score, continuous ranked probability score, rmse, PAC, …

EXAMPLE – PROBABILISTIC FORECASTS
RELIABILITY: Forecast probabilities for a given event match the observed frequencies of that event (given that probability forecast)
RESOLUTION: Many forecasts fall into classes corresponding to high or low observed frequency of the given event (occurrence and non-occurrence of the event is well resolved by the forecast system)

PROBABILISTIC FORECAST PERFORMANCE MEASURES
TO ASSESS THE TWO MAIN ATTRIBUTES OF PROBABILISTIC FORECASTS: RELIABILITY AND RESOLUTION
• Univariate measures: statistics accumulated point by point in space
• Multivariate measures: spatial covariance is considered

BRIER SKILL SCORE (BSS)
COMBINED MEASURE OF RELIABILITY AND RESOLUTION
METHOD: Compares the forecast pdf against the analysis
• Resolution (random error)
• Reliability (systematic error)
EVALUATION:
• BSS – higher is better
• Resolution – higher is better
• Reliability – lower is better
RESULTS:
• Resolution dominates initially; reliability becomes important later
• ECMWF best throughout – good analysis/model?
• NCEP good at days 1-2 – good initial perturbations? Lack of model perturbations hurts later?
• CANADIAN good at days 8-10 – model diversity helps?
[Figure: May-June-July 2002 average Brier skill score for the EC-EPS (grey lines with full circles), the MSC-EPS (black lines with open circles) and the NCEP-EPS (black lines with crosses). Bottom: resolution (dotted) and reliability (solid) contributions to the Brier skill score. Values refer to the 500 hPa geopotential height over the Northern Hemisphere latitudinal band 20°-80°N, computed using 10 equally-climatologically-likely intervals. From Buizza, Houtekamer, Toth et al., 2004]

RANKED PROBABILITY SCORE
COMBINED MEASURE OF RELIABILITY AND RESOLUTION

CONTINUOUS RANKED PROBABILITY SCORE (CRPS)
COMBINED MEASURE OF RELIABILITY AND RESOLUTION

CRPS = \int_{-\infty}^{\infty} [F(x) - H(x - x_0)]^2 \, dx

where F is the forecast cumulative distribution (built from the ordered ensemble members p01, p02, …, p10, rising from 0% to 100%), x_0 is the verifying observation (truth), and H is the Heaviside function: H(x - x_0) = 0 for x < x_0 and 1 for x >= x_0.

The CRP Skill Score is CRPSS = (CRPS_c - CRPS_f) / CRPS_c, where CRPS_f refers to the forecast and CRPS_c to the climatological reference (see the sketch below).

[Figure: forecast CDF from 10 ordered ensemble members vs. the Heaviside step function at the observed value x_0]

INFORMATION CONTENT
MEASURE OF RESOLUTION

RELATIVE OPERATING CHARACTERISTICS
MEASURE OF RESOLUTION

ECONOMIC VALUE OF FORECASTS
MEASURE OF RESOLUTION
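A minimal sketch of two of the scores above, assuming binary events for the Brier score and a scalar predictand for the CRPS. The decomposition follows the standard reliability / resolution / uncertainty partition of the Brier score, and the CRPS uses the standard kernel form; all function and variable names are illustrative, not the operational NCEP code.

```python
import numpy as np

def brier_decomposition(fcst_prob, obs, n_bins=10):
    """Brier score and its reliability / resolution / uncertainty terms
    for binary events: BS = reliability - resolution + uncertainty."""
    fcst_prob = np.asarray(fcst_prob)
    obs = np.asarray(obs, dtype=float)
    n = obs.size
    clim = obs.mean()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    k = np.digitize(fcst_prob, edges[1:-1])      # bin index per case
    rel = res = 0.0
    for b in range(n_bins):
        m = k == b
        nk = m.sum()
        if nk:
            fk, ok = fcst_prob[m].mean(), obs[m].mean()
            rel += nk * (fk - ok) ** 2           # penalty: Fa far from fdo
            res += nk * (ok - clim) ** 2         # reward: fdo far from climate
    rel, res = rel / n, res / n
    unc = clim * (1.0 - clim)
    bs = np.mean((fcst_prob - obs) ** 2)
    return bs, rel, res, unc

def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast vs. a scalar observation, via the
    kernel form: mean|X - obs| - 0.5 * mean|X - X'|."""
    x = np.asarray(members, dtype=float)
    term1 = np.abs(x - obs).mean()
    term2 = 0.5 * np.abs(x[:, None] - x[None, :]).mean()
    return term1 - term2
```

With the climatological frequency as reference, the skill score can then be formed as BSS = 1 - BS/BS_c, which for this partition equals (resolution - reliability)/uncertainty.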
ENSEMBLE MEAN ERROR VS. ENSEMBLE SPREAD
MEASURE OF RELIABILITY
Statistical consistency between the ensemble and the verifying analysis means that the verifying analysis should be statistically indistinguishable from the ensemble members => the ensemble mean error (the distance between the ensemble mean and the analysis) should be equal to the ensemble spread (the distance between the ensemble mean and the ensemble members).
In case of a statistically consistent ensemble, ensemble spread = ensemble mean error, and they are both a MEASURE OF RESOLUTION. In the presence of bias, both rms error and PAC are a combined measure of reliability and resolution.

ANALYSIS RANK HISTOGRAM (TALAGRAND DIAGRAM)
MEASURE OF RELIABILITY (see the sketch below)
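A minimal sketch of the two reliability checks just introduced, the analysis rank histogram and the spread vs. ensemble mean error comparison, assuming an ensemble verified against analyses over many cases. The names and the synthetic data are illustrative; only numpy is assumed.

```python
import numpy as np

def rank_histogram(ens, truth):
    """Talagrand diagram: for each case, the rank of the verifying value
    within the ordered ensemble. A flat histogram is a necessary
    condition for the truth being statistically indistinguishable
    from the members.

    ens   : array (n_cases, n_members);  truth : array (n_cases,)
    Returns counts of length n_members + 1.
    """
    ranks = (np.asarray(ens) < np.asarray(truth)[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

def spread_error(ens, truth):
    """RMS ensemble mean error vs. RMS spread about the ensemble mean.
    For a statistically consistent ensemble the two should match
    (up to a small finite-ensemble-size factor)."""
    ens = np.asarray(ens)
    mean = ens.mean(axis=1)
    err = np.sqrt(np.mean((mean - truth) ** 2))
    spread = np.sqrt(np.mean(ens.var(axis=1, ddof=1)))
    return err, spread

# Consistent synthetic ensemble: truth drawn from the same distribution
rng = np.random.default_rng(1)
ens = rng.normal(size=(5000, 10))
truth = rng.normal(size=5000)
print(rank_histogram(ens, truth))   # roughly flat over 11 bins
print(spread_error(ens, truth))     # err ~ spread
```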
PERTURBATION VS. ERROR CORRELATION ANALYSIS (PECA)
MULTIVARIATE COMBINED MEASURE OF RELIABILITY & RESOLUTION
METHOD: Compute the correlation between ensemble perturbations and the error in the control forecast for
– Individual members
– The optimal combination of members
– Each ensemble
– Various areas, all lead times
EVALUATION: A large correlation indicates that the ensemble captures the error in the control forecast
– Caveat: errors are defined relative to the analysis
RESULTS:
– Canadian best on large scales
  • Benefit of model diversity?
– ECMWF gains most from combinations
  • Benefit of orthogonalization?
– NCEP best on small scales, short term
  • Benefit of breeding (best estimate of initial error)?
– PECA increases with lead time
  • Lyapunov convergence
  • Nonlinear saturation
– Higher values on small scales

SAMPLING ANALYSIS ERRORS
• Multiple analysis cycles
  – Random & realistic sample of both growing & non-growing patterns
• Bred Vectors (BV)
  – Random sample of the growing subspace only
• Ensemble Transform (ET)
  – Random & orthogonalized sample of the growing subspace
  – Minor increase in the space spanned
• Singular Vectors
  – Directed sampling of the fastest finite-time apparent growth
    • From an unknown distribution (when general norms are used)
  – The Hessian norm samples the errors ASSUMED by DA
(Wei et al. 2006)

DA-FORECAST CYCLES & ANALYSIS ERRORS
• Analysis = weighted mean of the First Guess (FG) forecast & observations
• FG error is dominated by growing errors
  – Initial errors rotate into fast growing nonlinear patterns
  – Very few fast growing directions are supported by the instabilities of the flow
• Observations – random errors
  – Very large subspace
• Analysis error – a combination of FG & observation errors
(Patil et al. 2002: 10-20% of the area is unstable, characterized by 2-3 growing perturbations in 1,100 x 1,100 km boxes; Toth & Kalnay 1993)

ENSEMBLE METRICS
• Mean / spread
• Talagrand
  – Regular
  – Time-lag for jump
• Temporal consistency of perturbations
• Craig's / Mozheng's stat
• PECA
• Higher moment statistics
  – Joint probabilities, correlations, covariances
• Diagnostics
  – DOF (degrees of freedom)

FORECAST EVALUATION – SUMMARY
• Forecast system attributes
  – Statistical reliability (calibration)
  – Statistical resolution (skill or information content)
• An array of metrics for
  – Either attribute
  – Joint evaluation of both attributes
• Different metrics for assessing
  – (Continuous) probabilistic forecasts
  – Scenario-based (ensemble) forecasts
• Complex forecast systems need a multitude of metrics
  – Different metrics
  – Different parameters including
    • Joint probabilities, covariance, etc.

BACKGROUND
http://wwwt.emc.ncep.noaa.gov/gmb/ens/ens_info.html
Toth, Z., O. Talagrand, and Y. Zhu, 2005: The Attributes of Forecast Systems: A Framework for the Evaluation and Calibration of Weather Forecasts. In: Predictability Seminars, 9-13 September 2002, Ed.: T. Palmer, ECMWF, pp. 584-595.
Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. In: Environmental Forecast Verification: A practitioner's guide in atmospheric science. Ed.: I. T. Jolliffe and D. B. Stephenson. Wiley, pp. 137-164.

REFERENCES
• Value of forecasts; decomposition of scores – A.

OUTLINE / SUMMARY
• SCIENCE OF FORECASTING
  – GOAL OF SCIENCE: Forecasting
  – VERIFICATION: Model development, user feedback
• GENERATION OF PROBABILISTIC FORECASTS
  – SINGLE FORECASTS: Statistical rendition of pdf
  – ENSEMBLE FORECASTS: NWP-based, case-dependent pdf
• ATTRIBUTES OF FORECAST SYSTEMS
  – RELIABILITY: Forecasts look like nature statistically
  – RESOLUTION: Forecasts indicate actual future developments
• VERIFICATION OF PROBABILISTIC & ENSEMBLE FORECASTS
  – UNIFIED PROBABILISTIC MEASURES: Dimensionless
  – ENSEMBLE MEASURES: Evaluate a finite sample
• STATISTICAL POSTPROCESSING OF FORECASTS
  – STATISTICAL RELIABILITY: Make it perfect
  – STATISTICAL RESOLUTION: Keep it unchanged

WHAT DO WE NEED FOR POSTPROCESSING TO WORK?
• A LARGE SET OF FORECAST–OBSERVATION PAIRS
  – Consistency is defined over a large sample – the same is needed for post-processing
  – The larger the sample, the more detailed the corrections that can be made
• BOTH THE FORECAST AND REAL SYSTEMS MUST BE STATIONARY IN TIME
  – Otherwise calibration can make things worse
  – Subjective forecasts are difficult to calibrate

HOW DO WE MEASURE STATISTICAL INCONSISTENCY?
• MEASURES OF STATISTICAL RELIABILITY
  – Time mean error
  – Analysis rank histogram (Talagrand diagram)
  – Reliability component of the Brier and other scores
  – Reliability diagram

SOURCES OF STATISTICAL INCONSISTENCY
• TOO FEW FORECAST MEMBERS
  – A single forecast is inconsistent by definition, unless perfect
  – MOS forecasts are hedged toward climatology as forecast skill is lost
  – Small ensemble – sampling error due to limited ensemble size (Houtekamer 1994?)
• MODEL ERROR (BIAS)
  – Deficiencies due to various problems in NWP models
  – The effect is exacerbated with increasing lead time
• SYSTEMATIC ERRORS (BIAS) IN THE ANALYSIS
  – Induced by observations
    • The effect dies out with increasing lead time
  – Model related
    • Bias manifests itself even in the initial conditions
• ENSEMBLE FORMATION (IMPROPER SPREAD)
  – Inappropriate initial spread
  – Lack of representation of model related uncertainty in the ensemble
    • I.e., use of a simplified model that is not able to account for model related uncertainty

HOW TO IMPROVE STATISTICAL CONSISTENCY?
• MITIGATE THE SOURCES OF INCONSISTENCY
  – TOO FEW MEMBERS
    • Run a larger ensemble
  – MODEL ERRORS
    • Make models more realistic
  – INSUFFICIENT ENSEMBLE SPREAD
    • Enhance models so they can represent model related forecast uncertainty
• OTHERWISE =>
  – STATISTICALLY ADJUST THE FORECAST TO REDUCE THE INCONSISTENCY (see the sketch below)
    • The less preferred way of doing it
    • What we learn can feed back into development to mitigate the problem at its source
    • Can have a LARGE impact on (inexperienced) users
    • Two separate issues
      – Bias correct against the NWP analysis
        » Reduce lead time dependent model behavior
      – Downscale the NWP analysis
        » Connect with observed variables that are unresolved by NWP models
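A minimal sketch of the statistical adjustment mentioned above: a decaying-average first-moment bias correction maintained separately for each lead time, in the spirit of the bias correction against the NWP analysis that the slide describes. The weighting parameter, function names, and numbers are illustrative assumptions, not the lecture's operational scheme.

```python
import numpy as np

def update_bias(prev_bias, fcst, verif, w=0.02):
    """Decaying-average estimate of the lead-time-dependent mean error.
    prev_bias  : running bias estimate for this lead time
    fcst, verif: latest forecast and verifying analysis (scalars or arrays)
    w          : weight given to the newest error sample (illustrative)
    """
    return (1.0 - w) * prev_bias + w * (fcst - verif)

def calibrate(fcst, bias):
    """First-moment calibration: remove the estimated mean error.
    Aims to improve reliability while leaving resolution unchanged."""
    return fcst - bias

# Usage over a stream of (forecast, analysis) pairs at one lead time:
bias = 0.0
for fcst, verif in [(2.3, 1.9), (2.6, 2.1), (2.4, 2.0)]:
    bias = update_bias(bias, fcst, verif)
corrected = calibrate(2.5, bias)
print(f"estimated bias: {bias:.3f}, corrected forecast: {corrected:.3f}")
```

Because the running mean decays old samples, the scheme only works if, as stated above, both the forecast and the real systems are reasonably stationary over the averaging window.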
SCIENCE OF FORECASTING
• The ultimate goal of science – forecasting
  – Prominent role of meteorology
    • Weather forecasting is constantly in the public's eye
• Approach
  – Observe what is relevant and available
    • Analyze data
  – Create general knowledge about nature
    • Generalization & abstraction
      – Laws, relationships
  – Build a model of reality
    • Conceptual, analogs, quantitative/numerical, incl. physical processes
  – Predict what's not observable in
    • Space – e.g., data assimilation
    • Time – e.g., future weather
    • Variables / processes
  – Verify (i.e., compare with observations)
    • Determine to what extent the model represents reality
    • Assess whether predictions have any utility
  – Improve general knowledge and the model

USER REQUIREMENTS
• General characteristics of forecast users
  – Each user is affected in a specific way by
    • Various weather elements at
    • Different points in time & space
• Requirements for optimal decisions related to operations affected by weather
  – Possible weather scenarios with covariances across
    • Variables, space, and time
• Provision of weather information
  – The specific information needed by each user can't be foreseen or provided
    • Only vanilla-type, often used probabilistic info can be distributed
  – Ensemble data must be made accessible to users in statistically reliable form
    • All forecast info can be derived from this, including vanilla probabilistic products
• IT infrastructure requirements
  – Staging ground for ensemble data (disk)
  – Sophisticated data access / interrogation tools
    • Subset ensemble data, derive required parameters
  – Technical achievements – NOMADS, AWIPS2
  – Telecommunication (bandwidth)

USER REQUIREMENTS: PROBABILISTIC FORECAST INFORMATION IS CRITICAL

PREDICTIONS IN TIME
• Method
  – Use a model of nature for projection in time
  – Start the model with an estimate of the state of nature at the "initial" time
• Sources of errors
  – Discrepancy between model and nature
    • Added at every time step
  – Discrepancy between the estimated and actual state of nature
    • Initial error
• Chaotic systems
  – A common type of dynamical system
    • Characterized by at least one perturbation pattern that amplifies
  – All errors project onto amplifying directions
    • Any initial and/or model error
  – Predictability is limited (see the sketch below)
    • Ed Lorenz's legacy
    • Verification quantifies the situation
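As a toy illustration of limited predictability in a chaotic system, the sketch below integrates the classic Lorenz (1963) equations twice from slightly different initial states and tracks the growing separation. The crude Euler stepping, step size, run length, and initial perturbation are arbitrary choices made for brevity.

```python
import numpy as np

def lorenz63_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz (1963) system (crude, but
    sufficient to demonstrate error growth)."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

# Twin experiment: "truth" vs. a forecast with a tiny initial error
truth = np.array([1.0, 1.0, 1.0])
fcst = truth + np.array([1e-6, 0.0, 0.0])
for step in range(1, 3001):
    truth = lorenz63_step(truth)
    fcst = lorenz63_step(fcst)
    if step % 500 == 0:
        err = np.linalg.norm(fcst - truth)
        print(f"t = {step * 0.01:5.1f}   error = {err:.2e}")
# The error grows roughly exponentially until it saturates at the size
# of the attractor: predictability is lost no matter how small the
# initial error was.
```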