NOAA/CTB/CST Discussions on the Overconfidence of GCM Predictions

The NOAA Climate Testbed Science Team (NOAA/CTB/CST) held a teleconference on March 24, 2005. One of the hot issues on multi-model ensembles, the overconfidence of model predictions, attracted lively discussion. The issues raised were 1) the meaning of overconfidence - in terms of the standard deviation and/or the signal-to-noise ratio, and 2) the physical reasons for it. The following are communications collected after the meeting.

Siegfried Schubert (NASA/GSFC):    The fact that AGCMs tend to be overconfident seems to be a common problem (perhaps someone knows of a model that is less sure of itself?), and likely reflects an improper (too strong) response of the current generation of models to tropical SST anomalies. People may recall that the DSP (Dynamical Seasonal Prediction) project highlighted a very large range in the signal-to-noise ratios of some of the earlier models (we now seem to have converged). In any event, progress on this would presumably reduce the need for calibration. This is a difficult problem because I don't think anybody really knows the reason for the stronger response in current models (or, if it is known, it is for very unphysical reasons), and it is something that is difficult to validate. For those reasons, I would like to suggest that this problem is best tackled as a multi-model development exercise. E.g., coordinated experimentation that addresses the sensitivity of the atmospheric response to the details of the convection and boundary layer schemes, and that takes full advantage of the extensive knowledge base (e.g., the history of tuning) of the various model development groups, would do much to accelerate progress on this problem.

Lisa Goddard (IRI):   It used to be thought that having high signal-to-noise in a model was desirable, and now we know that is only the case if it is realistic. Previous research brought up the issue of modest predictability over the US. That may just be the way it is. If so, what we should strive for is reliable forecasts. The tough thing about looking at that is sampling: even if you aggregate over the entire US, a reliability diagram with 10 bins will look fairly noisy over only 20 years of data. From my presentation, looking at the reliability diagrams, you may notice that one of the individual model lines is less flat than the others. That is ECHAM4.5. What I had forgotten, when asked the question about all models being overconfident, was a set of similar reliability figures in the Hagedorn et al. paper. She shows that all the DEMETER CGCMs are overconfident. Her analysis looks only at 'above-average' (i.e., 2-category) forecasts, which are typically less impacted by the overconfidence problem than 3-category forecasts (which is why we will be checking those diagrams for the above-normal and below-normal tercile classes). I still think that careful recalibration of the single-model ensembles is a very worthwhile thing to do that will improve reliability.
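The sampling concern above can be illustrated numerically. The sketch below is a generic illustration (the data, grid size, and event rate are assumptions, not IRI's actual diagnostic): it builds a 10-bin reliability diagram from a *perfectly reliable* synthetic tercile forecast over 20 years, and the per-bin observed frequencies still come out noticeably noisy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 20 years x 200 US grid points, binary event
# ("above normal") with climatological rate near 1/3.
n_years, n_points = 20, 200
true_prob = np.clip(1/3 + 0.15 * rng.standard_normal((n_years, n_points)),
                    0.01, 0.99)
forecast_prob = true_prob                       # reliable by construction
outcome = rng.random((n_years, n_points)) < true_prob

# 10-bin reliability diagram: observed frequency vs. mean forecast per bin.
bins = np.linspace(0.0, 1.0, 11)
idx = np.digitize(forecast_prob.ravel(), bins) - 1
for b in range(10):
    mask = idx == b
    if mask.any():
        print(f"bin {bins[b]:.1f}-{bins[b+1]:.1f}: "
              f"mean fcst {forecast_prob.ravel()[mask].mean():.2f}, "
              f"obs freq {outcome.ravel()[mask].mean():.2f}, n={mask.sum()}")
```

Even with 4000 forecast-event pairs, the thinly populated outer bins wander off the diagonal; aggregating fewer points or years makes the lines wobblier still.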

(ref: Hagedorn, R., F.J. Doblas-Reyes and T.N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting.  Part I: Basic concept. Tellus, in press.)

Huug M. van den Dool (NOAA/NWS/NCEP):    Just to make sure what is meant by 'overconfidence': there are three levels of calibration, a) the bias, b) the standard deviation, and c) reliability. If no calibration is done at all, the forecasts (relative to observed tercile boundaries) may all fall in one class just because of, say, a warm bias. If you correct for the mean error (which takes a few hindcast runs), the standard deviation may still be off, leading to overpopulation of certain classes. Is that the meaning of overconfident? The standard deviation can also be corrected (this takes more hindcast runs), and it is only at this point that I feel the issue of overconfidence needs any deeper interpretation. Recently, NCEP scientists studied the probabilistic skill of CFS Nino3.4 forecasts over 1981-2003 and found surprisingly reliable forecasts after correcting the mean bias (the standard deviation was not corrected). As most of us will agree, we need to exclude simple errors and have a long hindcast history before discussing deeper reasons for overconfidence. The bottom line may be that not all models are overconfident.
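The first two levels, a) and b), amount to a simple rescaling against the hindcast record. A minimal sketch (a generic illustration, not NCEP's procedure; the array names and 23-year example period are assumptions):

```python
import numpy as np

def calibrate(hindcast, observed, forecast):
    """Levels a) and b): remove the hindcast mean bias, then rescale so
    the forecast standard deviation matches the observed one.  Level c),
    reliability, requires probabilistic verification beyond this."""
    scale = observed.std() / hindcast.std()
    return (forecast - hindcast.mean()) * scale + observed.mean()

# Example: a hindcast that runs 2 degrees warm with inflated variability.
rng = np.random.default_rng(1)
obs = rng.standard_normal(23)                # e.g. 1981-2003 anomalies
hind = 2.0 + 1.5 * rng.standard_normal(23)   # warm bias, too much spread
cal = calibrate(hind, obs, hind)
print(cal.mean(), cal.std())                 # matches obs.mean(), obs.std()
```

As the discussion notes, level a) needs only a few hindcast runs to estimate the mean, while level b) needs more, since a standard deviation is a noisier statistic than a mean.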

Lisa Goddard (IRI):    It is correct to point out the different sources. Bias (mean bias) is not at issue, since any sane analysis would get rid of this right up front. The standard deviation is another matter; I do feel that this is where the majority of the problem can be found. In fact, the reliability is not entirely separate from the standard deviation, at least in the way that my 3x3 contingency matrix would come into play. It is rare that the contingency table reverses the sign of the forecast -- and I think in that case it would be better not to issue a forecast. There are methods for correcting the standard deviation of the models, and we are looking into those also. The contingency-table approach is not desirable because you lose the strong probabilities -- it does not care whether the ensemble mean is barely above normal or extremely above normal; the same set of probabilities will be issued. Also, without a very long dataset, there will be severe undersampling. There is an interesting approach outlined in the Doblas-Reyes et al. paper that involves variance inflation. I think it would be useful to get more intercomparison of these methods for adjusting the year-to-year seasonal pdf. In this respect (the standard deviation issue), all AGCMs that I have looked at (and the CGCMs in DEMETER) are overconfident -- some more than others, at least for terrestrial climate (even in the tropics).
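One simple form of variance inflation, in the spirit of (but not identical to) the scheme in the Doblas-Reyes et al. paper, scales the ensemble-mean signal by its correlation with observations and re-inflates the member spread so that the total variance matches the observed variance. The function name, array layout, and exact scaling below are illustrative assumptions, not the published algorithm.

```python
import numpy as np

def inflate(ens, obs):
    """Variance-inflation sketch for an ensemble array of shape
    (years, members): shrink/stretch the ensemble-mean signal by its
    correlation with obs, then rescale the member departures so the
    total variance of the corrected forecasts equals the observed one."""
    em = ens.mean(axis=1)                  # ensemble mean per year
    dev = ens - em[:, None]                # member departures (noise)
    rho = np.corrcoef(em, obs)[0, 1]
    alpha = rho * obs.std() / em.std()
    beta = np.sqrt(max(1.0 - rho**2, 0.0)) * obs.std() / dev.std()
    return alpha * em[:, None] + beta * dev

# Example: 23 years x 10 members, overconfident (too-small spread).
rng = np.random.default_rng(2)
obs = rng.standard_normal(23)
ens = 0.8 * obs[:, None] + 0.3 * rng.standard_normal((23, 10))
out = inflate(ens, obs)
print(out.std(), obs.std())   # total variance now matches observations
```

Because the rescaled signal and spread are uncorrelated by construction, the corrected total variance equals the observed variance exactly in-sample; the open question raised above is how such schemes compare out-of-sample.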

(ref: Doblas-Reyes, F.J., R. Hagedorn and T.N. Palmer, 2005: The rationale behind the success of multi-model ensembles in seasonal forecasting.  Part II: Calibration and combination. Tellus, in press.)

Huug M. van den Dool (NOAA/NWS/NCEP):    Yes, what I listed in b) and c) can be folded into one procedure. In fact, we have used such an approach for a long time in the 6-10 day forecast (see the reference below). And CDC used to have this method (for week 2) before they did the grand Reforecast (Hamill & Whitaker) and switched to logistic regression. Going back to the question of whether there are any underconfident models: if the issue is the standard deviation only, then I remember an early NASA model (some two-layer (Held-Suarez) rendition) with a standard deviation 1.5 times larger than observed. But most models have a smaller standard deviation than observed, and since so many processes are not included, I wonder whether we should not prefer it that way (you can always postprocess). At NCEP we have the additional factor that the model is exposed, day in and day out, to an rmse-minimizing verification attitude, so model changes that lower the rmse are more easily accepted than wilder model changes, even if they make physical sense. This leads to some damping.
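For comparison, the logistic-regression style of recalibration mentioned above can be sketched in miniature. This is a generic, self-contained illustration (one predictor, fit by plain gradient descent), not the Hamill & Whitaker implementation; the predictor and event definitions are assumptions.

```python
import numpy as np

def fit_logistic(x, y, iters=2000, lr=0.5):
    """Fit P(event) = 1 / (1 + exp(-(a + b*x))) by gradient descent,
    where x is, e.g., an ensemble-mean anomaly and y a binary observed
    event (such as 'above the median')."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))
        a -= lr * (p - y).mean()
        b -= lr * ((p - y) * x).mean()
    return a, b

# Example: the event becomes more likely as the predictor grows.
rng = np.random.default_rng(3)
x = rng.standard_normal(500)
y = (x + 0.8 * rng.standard_normal(500) > 0).astype(float)
a, b = fit_logistic(x, y)
```

The appeal of this form is that it folds b) and c) into one step: the fitted slope automatically damps an overconfident predictor toward climatological probabilities, but it needs a long reforecast history to estimate well.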

(ref: Pan, J. and H. M. van den Dool, 1998: Extended-range probability forecast based on dynamical model output. Weather and Forecasting, 13, 983-996.)

Siegfried Schubert (NASA/GSFC):    Regarding the relationship between b) and c) (van den Dool): I guess that even if a model had a perfect standard deviation, it could still do poorly with c) - there are lots of ways to partition the standard deviation. It would be worthwhile to revisit the signal-to-noise ratios in current models.
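A minimal way to compute such a signal-to-noise ratio from a hindcast ensemble is the usual ANOVA-style split (a generic sketch under assumed array shape and variances, not the DSP project's exact diagnostic):

```python
import numpy as np

def signal_to_noise(ens):
    """For an ensemble array of shape (years, members):
    signal = interannual variance of the ensemble mean,
    noise  = variance of member departures about each year's mean."""
    em = ens.mean(axis=1)
    signal = em.var(ddof=1)
    noise = (ens - em[:, None]).var(ddof=1)
    return signal / noise

# Example: boundary forcing with variance 4 against unit internal noise.
rng = np.random.default_rng(4)
forced = 2.0 * rng.standard_normal(200)               # SST-forced signal
ens = forced[:, None] + rng.standard_normal((200, 40))
print(signal_to_noise(ens))   # roughly 4
```

A model whose partitioning put too much of a fixed total standard deviation into the forced signal would score an inflated ratio here, which is exactly the b)-versus-c) distinction being made above.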

(Contact Kingtse Mo)