Part VI: How to evaluate the quality of an EPS and its forecasts?

    How and what to evaluate is important because it not only gives one a sense of correctness or wrongness but, more importantly, it can shape how an EPS or a model is developed in the long run. In general, four aspects need to be verified to measure the quality of an ensemble system: equal likelihood of the ensemble members, superiority of the ensemble mean to a single control forecast, a strong spread-skill relation, and reliable probabilities. These four aspects are related to each other in certain ways. Because all perturbed ICs are supposed to be equally likely to be true, and all perturbed physics, varying physics schemes or alternative models are also supposed to be equally plausible, the performance of all ensemble members should, in principle, be similar on average. Otherwise, it indicates a problem with the ensembling technique employed, e.g., either the IC perturbation size is too large or the alternative models, physics schemes or perturbations added are not really equally plausible. Because of this equal-likelihood property, one can imagine that it is difficult, if not impossible, to determine a priori which member is likely to perform best in a particular forecast.
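
    As an illustration of how the equal-likelihood property might be checked in practice, the following Python sketch computes the average RMSE of each ensemble member against the same verifying analyses; for a well-constructed ensemble these per-member scores should be nearly indistinguishable. This is only a minimal sketch: the function names, array shapes and synthetic data are assumptions for illustration, not taken from any operational verification package.

import numpy as np

def member_rmse(forecasts, analysis):
    """forecasts: (n_members, n_cases, n_points); analysis: (n_cases, n_points).
    Returns the RMSE of each member, averaged over all cases and grid points."""
    err = forecasts - analysis[np.newaxis, :, :]      # per-member error fields
    return np.sqrt(np.mean(err ** 2, axis=(1, 2)))    # one RMSE value per member

# Synthetic demo: members drawn from the same error distribution should have
# nearly identical RMSEs; one member standing clearly apart would hint at
# over-sized perturbations or a non-equivalent model/physics configuration.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 100))                    # 30 cases x 100 grid points
members = truth[np.newaxis, :, :] + rng.normal(scale=1.0, size=(21, 30, 100))
print(member_rmse(members, truth))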

    Due to nonlinear filtering, discrepancies among members (i.e., less predictable elements) are damped or cancelled, and only the features common to the members (i.e., more predictable parts) are retained during ensemble averaging. On average, this results in an ensemble mean forecast that is superior to a single, or even a higher-resolution, control forecast. Figure 5 is an example from hurricane track forecasting showing that the ensemble mean is close to the observed track. In grid-point verification, the smoothing effect of averaging also contributes to this superiority, but to a much lesser degree than the nonlinear filtering and merely as a side effect. It should be pointed out that ensemble averaging can only remove random error, not systematic bias, if an ensemble consists of only one model with one version of the physics package. For such an ensemble system, the effectiveness of error reduction by ensemble averaging should be measured with respect to the random portion of the forecast error, not with respect to the systematic error such as the mean error, or the total error such as the root-mean-square error (RMSE), which is a mixture of random and systematic errors. Otherwise, one might be comparing the relative performance of two modeling systems rather than two ensembling techniques or strategies, and the conclusion drawn could be misleading. For a multi-model and/or multi-physics ensemble, however, bias can also be reduced by ensemble averaging because different schemes or models may possess different biases. Similar to the ensemble mean, the ensemble median forecast should also verify well on average. To measure ensemble mean/median forecast accuracy, all methods normally used to evaluate deterministic single forecasts can be applied, such as the Threat Score, Equitable Threat Score, RMSE and correlation.
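
    The separation of systematic and random error discussed above can be made concrete with a small Python sketch; it is a minimal illustration with assumed function names and array shapes, not an operational verification code. It returns the mean bias, the bias-removed (random) RMSE and the total RMSE; for a single-model, single-physics ensemble, the benefit of ensemble averaging should be judged against the random component.

import numpy as np

def error_components(forecast, analysis):
    """forecast, analysis: (n_cases, n_points).
    Returns (mean bias, bias-removed random RMSE, total RMSE)."""
    err = forecast - analysis
    bias = err.mean()                                  # systematic part
    total_rmse = np.sqrt(np.mean(err ** 2))            # mixture of both parts
    random_rmse = np.sqrt(np.mean((err - bias) ** 2))  # random part only
    return bias, random_rmse, total_rmse

# Usage idea: compute these for the control forecast and for the ensemble mean
# (e.g., members.mean(axis=0)); for a single-model, single-physics ensemble the
# gain from averaging should appear mainly in random_rmse, not in bias.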

    For a good ensemble system, the ensemble spread should be a good indicator of the possible forecast error distribution, since the spread should reflect the true predictability of a flow. Large (small) spread indicates a less (more) predictable event, while a less (more) predictable event is more (less) difficult to forecast and should have a wider (narrower) error range. Therefore, spread and the absolute error of the ensemble mean forecast should be positively correlated on average, as shown in Fig. 6; this is called the spread-skill relation (Whitaker and Loughe, 1998; Roulston, 2005; Grimit and Mass, 2007). Figure 7 is a snapshot of the spatial distributions of the ensemble mean forecast error and the ensemble spread from the NCEP SREF, which shows a good relation between spread and error in the large-scale spatial pattern. Figure 8 shows the spatial correlation of various variables averaged over a period of about a month. We can see that such correlation exceeds 50% at day 3.5 for sea-level pressure, 500-hPa height and 2-m temperature, which is comparable to or even better than the quality of quantitative precipitation forecasts by the current state-of-the-art NWP models and is, therefore, skillful enough to provide useful guidance to forecasters. It is also interesting to notice that the spread-skill correlation of sea-level pressure and 500-hPa height increases steadily with forecast time but remains below 30% at day 1. This could imply that the current ensemble technique used by the NCEP SREF might not be suitable for the very short range (0-24 hr), a subject worth investigating. Many studies examine the spread-skill relation by simply comparing two domain-averaged curves of spread vs. RMSE of the ensemble mean. In doing so, one needs to keep in mind that the closeness of the spread curve to the error curve is only a necessary, not a sufficient, condition for a true spread-skill relation, since the actual spatial patterns might not match each other even though the domain-wide summary statistics do. To avoid this, the comparison between spread and error must be carried out point by point in space. Spatial correlation, as discussed above, is one way to do this. Another way is the rank histogram or rank distribution, known as the Talagrand diagram/distribution, initially proposed by Talagrand et al. (1997) and Anderson (1996) and later also documented by Hamill (2001). For a good ensemble system, the truth acts just like one of the ensemble members and is not distinguishable from them. In other words, the truth has an equal chance of lying anywhere between any two members. In the Talagrand distribution, the n ensemble members at a given point and time are first sorted from smallest to largest in value to form n+1 bins, including the two open-ended bins at either end. Then the frequency with which the truth falls into each bin is counted point by point and summed over a region and a period of time. If the resulting distribution over all bins is flat, the truth has the same statistical properties as the ensemble members, and the ensemble spread is reliable and reflects the true error distribution (perfect spread); if the distribution has an upside-down-U shape, the spread overestimates forecast uncertainty (over-dispersive); if the distribution has an L shape, the forecasts have a high bias; if it has a reversed-L shape, the forecasts have a low bias; and if the distribution has a U shape, it indicates either that the spread underestimates the uncertainty (under-dispersive) or that some forecasts have a low bias and others a high bias.
Figure 9 is an example of this, showing quite satisfactory spread of the 500-hPa height from the NCEP SREF. Even for a perfect but finite ensemble of n members, one should always expect the truth to fall outside the ensemble cloud (an outlier) [2/(n+1)] x 100% of the time (the sum of the two end bins) and to be encompassed by the ensemble cloud [(n-1)/(n+1)] x 100% of the time. For example, a 20-member ensemble should see the truth fall outside its envelope about 2/21, or roughly 9.5%, of the time even when the ensemble is perfect. To obtain a more reliable assessment from the Talagrand distribution, the Minimum Spanning Tree approach is sometimes used to aid the calculation (Smith and Hansen, 2004; Wilks, 2004; Gombos et al., 2007).
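
    The rank histogram and the point-by-point spread-error correlation described above can be computed along the following lines. This Python sketch uses synthetic data and assumed array shapes purely for illustration: it builds a toy "perfect" ensemble in which the truth is statistically exchangeable with the members, so the rank histogram should come out roughly flat and the spread-error correlation clearly positive.

import numpy as np

def rank_histogram(members, truth):
    """members: (n_members, n_samples); truth: (n_samples,).
    Counts the rank of the truth among the sorted members (n_members+1 bins)."""
    ranks = np.sum(members < truth[np.newaxis, :], axis=0)   # 0 .. n_members
    return np.bincount(ranks, minlength=members.shape[0] + 1)

def spread_error_correlation(members, truth):
    """Point-by-point correlation between ensemble spread and the absolute
    error of the ensemble mean, pooled over all samples."""
    spread = members.std(axis=0)
    abs_err = np.abs(members.mean(axis=0) - truth)
    return np.corrcoef(spread, abs_err)[0, 1]

# Toy "perfect" ensemble: truth and members are drawn from the same
# flow-dependent distribution around a common center.
rng = np.random.default_rng(1)
center = rng.normal(size=5000)                        # hypothetical ensemble center
sigma = rng.uniform(0.5, 2.0, size=5000)              # flow-dependent uncertainty
truth = center + sigma * rng.normal(size=5000)
members = center[np.newaxis, :] + sigma * rng.normal(size=(21, 5000))
print(rank_histogram(members, truth))                 # roughly flat counts
print(spread_error_correlation(members, truth))       # clearly positive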

    There are two attributes that measure the usefulness of a probabilistic forecast: reliability and resolution (Jolliffe and Stephenson, 2003; Roulston and Smith, 2002; Atger, 1999b). In a reliable probabilistic forecast, a probability really means what its face value says. For example, a perfectly reliable 60% forecast of an event means that in 60 out of 100 such "60%" forecasts (whether spanning space or time) the event actually occurs and in 40 it does not. A median forecast (50%) should verify half of the time. This property can be measured by the reliability score (Wilks, 2006). A perfect reliability curve is a diagonal line, with forecast probability on the x-axis and the observed frequency of the predicted event on the y-axis. Figure 10 shows an example of this based on the NCEP SREF. A reliable probabilistic forecast has, however, no ability to tell which particular probabilistic forecast will verify and which will not. The ability to distinguish the occasions that will occur from those that will not is called resolution. Resolution is related to the sharpness of the probability density function (PDF): the sharper a PDF is, the higher the resolution and the more skill or information a probabilistic forecast has. A perfect deterministic single forecast has perfect resolution (and perfect reliability too): full capability of distinguishing a "yes" event from a "no" event, where yes means yes and no means no. A climatological forecast is perfectly reliable but has no resolution.
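
    A reliability curve of this kind can be tabulated as in the following Python sketch; the bin edges, names and shapes are illustrative assumptions. Forecast probabilities are grouped into bins and the observed frequency of the event is computed for each bin, to be compared against the diagonal.

import numpy as np

def reliability_table(prob_forecasts, outcomes, n_bins=10):
    """prob_forecasts: forecast probabilities in [0, 1]; outcomes: 0/1 observations.
    Returns (bin centers, observed frequency per bin, sample count per bin)."""
    p = np.asarray(prob_forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)   # bin index per forecast
    counts = np.bincount(idx, minlength=n_bins)
    hits = np.bincount(idx, weights=o, minlength=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore", invalid="ignore"):
        obs_freq = np.where(counts > 0, hits / counts, np.nan)
    return centers, obs_freq, counts

# For a perfectly reliable system each point (bin center, observed frequency)
# lies on the diagonal: events forecast with ~60% probability occur ~60% of the time.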

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Figure 10

When a PDF becomes as flat as the climatological PDF, a probabilistic forecast has no skill even though it is perfectly reliable. So one can see that, as long as a probabilistic forecast is reliable, the higher its resolution, the more valuable it is. Reliability reflects how well the ICs and model physics, etc., are perturbed in an EPS. A good spread-skill relation is the basis for a reliable probabilistic forecast. Resolution, however, cannot be improved through ensembling technique but only through improvement of the model and IC quality themselves. Note that the reliability score can be severely contaminated by model systematic bias, while resolution is mainly related to and affected by random error. Therefore, removing model systematic bias can improve reliability but not resolution in a probabilistic forecast. Since ensemble averaging can reduce random error, it improves resolution relative to a single deterministic forecast. For a single forecast, spatial correlation (with respect to the truth) is one way to measure resolution, since it reflects random rather than systematic error.
    To assess probabilistic forecast accuracy, many other scores are also used, such as the Brier Score (BS; Brier, 1950) for a one-category event (e.g., rain or no rain), the Ranked Probability Score (RPS; Epstein, 1969; Murphy, 1969 and 1971; Wilks, 2006; Du et al., 1997) for multiple mutually exclusive and collectively exhaustive categories, and the Continuous Ranked Probability Score (CRPS) for continuous variables (Brown, 1974; Unger, 1985; Grimit et al., 2006). Analogous to the RMSE in single-forecast verification, the BS and RPS are the average squared deviation between the predicted probabilities for a set of events and their outcomes, so a lower score represents higher accuracy: 0 is perfect and 1 (J-1) is the worst possible BS (RPS), where J is the number of event categories. Similar to the RMSE, the BS and RPS measure total error and contain contributions from both reliability and resolution. Traditionally, the BS, RPS and CRPS can be decomposed into reliability, resolution and uncertainty components (Hersbach, 2000). The Relative Operating Characteristic (ROC) diagram is another tool often used to assess probabilistic forecasts, with the False Alarm Rate (FAR) on the x-axis and the Hit Rate (HR) on the y-axis (Harvey et al., 1992). An ideal forecast has 0% FAR and 100% HR, while the worst forecast has 100% FAR and 0% HR. When HR equals FAR, the forecast has no skill, shown by the diagonal line in the ROC diagram, as for a climatological forecast. The area under the ROC curve (AUC) is used to quantify this score: 1 for a perfect forecast, 0 for the worst and 0.5 for a no-skill forecast. Motivated by the search for a metric that relates ensemble forecast performance to things that customers actually care about, the Economic Value (EV) score was developed (Richardson, 2000; Zhu et al., 2002). The calculation of EV is based on the two components of the ROC (FAR and HR) as well as a cost-loss ratio, which is closely related to a customer's dependency on weather. Most of the above scores are sometimes converted into a skill-score format with respect to the same score of a reference forecast, such as climatology, to give a better idea of the relative performance of the forecast. Caution is needed when using climatology to calculate forecast skill, to avoid possible overestimation of skill scores due to possibly different climatologies being used (Hamill and Juras, 2007).
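
    As an illustration of the reliability/resolution/uncertainty decomposition mentioned above, the following Python sketch computes the Brier score for a binary event together with its three components in a binned, approximate form (a Murphy-type decomposition); the bin edges and names are assumptions for illustration and the sketch does not reproduce the exact algorithm of any particular reference.

import numpy as np

def brier_decomposition(prob_forecasts, outcomes, n_bins=10):
    """prob_forecasts: probabilities in [0, 1]; outcomes: 0/1 observations.
    Returns (BS, reliability, resolution, uncertainty); BS ~ rel - res + unc
    (exact only when forecast probabilities coincide with the bin values)."""
    p = np.asarray(prob_forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    bs = np.mean((p - o) ** 2)                               # total Brier score
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    o_bar = o.mean()                                         # climatological frequency
    rel = 0.0
    res = 0.0
    for k in range(n_bins):
        mask = idx == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        p_k = p[mask].mean()                                 # mean forecast probability in bin
        o_k = o[mask].mean()                                 # observed frequency in bin
        rel += n_k * (p_k - o_k) ** 2                        # reliability term (smaller is better)
        res += n_k * (o_k - o_bar) ** 2                      # resolution term (larger is better)
    rel /= p.size
    res /= p.size
    unc = o_bar * (1.0 - o_bar)                              # uncertainty of the event itself
    return bs, rel, res, unc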

    Finally, one should keep in mind that, no matter how an EPS is verified, a good EPS forecast should demonstrate the following three general properties: consistency from one cycle to the next (probability is found to be much more consistent than a single-value forecast), quality (for a single-value forecast) or reliability (for a probabilistic forecast) in terms of the distance between forecast and observation, and the value or benefit realized from actions taken based on the forecast information.

Contact: Jun Du