US National Oceanic and Atmospheric Administration

Methods of Multi-Model Consolidation, with Emphasis on Cross-Validation

Huug van den Dool
Climate Prediction Center, NOAA/NWS/NCEP, Camp Springs, MD 20746
1. Introduction

In many contexts with limited data and no patience to wait for new and independent data, one needs to design schemes that mimic the real-time forecast situation on a fixed old data set. This is often done nowadays by cross-validation (CV). The purpose of CV is to establish properties of a forecast scheme that would apply on independent future data, for instance to estimate a priori skill. However, while CV is often a necessity, it may also itself be the source of a problem in evaluating skill.

CV is not an exactly defined procedure in general, so let us focus on a situation where systematic error correction is thought to be required. Given N pairs of forecast and verification, say seasonal Nino34 forecasts for 1981-2001, we can set M pairs (M much less than N=21) aside, calculate the systematic error over the N-M cases, then apply the correction to all or some of the M cases left out. This is done exhaustively, so all data are used as (assumed independent) verification at least once. Naturally, researchers want to get away with M=0 or M=1, since that is simpler than M>1, and skill may appear higher that way. Don't we want high skill? Yes, but not if the assessment misleads us as to the performance in real time. Dependent data generally overstate the skill level.

In this write-up we want to make a strong case for M=3, i.e. keeping (at least) three forecast/observed pairs out. This appears to be the right approach in the context of multi-model ensembles, where not only systematic error correction is required but also the determination of weights to be assigned to the participating models. The procedure we recommend, used in Peña and Van den Dool (2008), is more completely named CV-3RE, where CV is cross-validation, 3 means three years left out, R refers to the random choice of two of the three years left out, and E refers to an external climatology (ideally from a data set for a constant climate outside the period of experimentation).
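The exhaustive leave-out procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the idea only, not code from the paper; the function name and the synthetic handling of the random withheld years are my own choices.

```python
import numpy as np

def cv_sec(forecast, obs, extra=2, rng=None):
    """Cross-validated systematic error correction (SEC), leave-(1+extra)-out.

    For each test year, the SEC (mean forecast-minus-observed difference) is
    computed with the test year plus `extra` randomly chosen years withheld,
    and the correction is then applied to the test year only.  With extra=2
    this is the leave-three-out scheme recommended in the text.
    """
    rng = np.random.default_rng(rng)
    n = len(forecast)
    corrected = np.empty(n)
    for i in range(n):
        others = np.delete(np.arange(n), i)
        held_out = np.concatenate(([i], rng.choice(others, size=extra, replace=False)))
        keep = np.setdiff1d(np.arange(n), held_out)
        sec = np.mean(forecast[keep] - obs[keep])   # bias estimated on N-3 cases
        corrected[i] = forecast[i] - sec            # applied to the test year
    return corrected
```

Every year thus gets its bias estimate from data that exclude it, mimicking the real-time situation.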
The reason that three years should be taken out for the systematic error correction (SEC) is that one can show analytically that the correlation does not change upon taking out just one year, i.e. CV-1 does not do anything. With the number of elements withheld taken to be odd (as a convenience), three is thus the minimum. Typically that would be three successive years as a block, but here we argue that the three removed should be a) the test year, and b) two additional randomly chosen years.
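The analytic result behind "CV-1 does not do anything" can be checked numerically: the anomaly of year i with respect to the mean of the other N-1 years is exactly N/(N-1) times the full-sample anomaly, a constant positive rescaling, so the correlation between cross-validated forecast and observed anomalies is identical to the dependent-data value. A small demonstration (my own construction, with arbitrary synthetic data):

```python
import numpy as np

# CV-1 anomaly identity:  x_i - mean(x without i) = N/(N-1) * (x_i - mean(x))
rng = np.random.default_rng(0)
n = 21
f = rng.normal(size=n)   # stand-in "forecast" series
o = rng.normal(size=n)   # stand-in "observed" series

cv1_f = np.array([f[i] - np.delete(f, i).mean() for i in range(n)])
cv1_o = np.array([o[i] - np.delete(o, i).mean() for i in range(n)])
full_f = f - f.mean()
full_o = o - o.mean()

# Exact rescaling, hence the correlation is unchanged by CV-1:
assert np.allclose(cv1_f, n / (n - 1) * full_f)
assert np.isclose(np.corrcoef(cv1_f, cv1_o)[0, 1],
                  np.corrcoef(full_f, full_o)[0, 1])
```

Since a constant rescaling leaves correlation invariant, withholding a single year buys no independence in this metric.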

2. An example of systematic error correction

Table 1 provides the details of an example. Column 1 shows June temperatures for 1981-2001 (top to bottom) in the Nino34 area as predicted at a lead of 5 months by one of the Demeter models ("Model #4"), which has its initial states in January. The observed SST is shown in column 3. The anomalies in columns 2 and 4 are with respect to the 21-year mean of the model and observed data respectively. The bottom line shows the 21-year averages. Column 6 shows the SEC that would be applied to the year in column 5. Columns 7 and 8 are two randomly selected years also withheld in calculating the recommended SEC.

Clearly Model #4 needs systematic error correction badly, since it is about 2.5°C too cold. This is a large error in the mean, given that anomalies are rarely larger than 1.5°C. However, it would be wrong to assume that we know SEC = 2.45 with such certainty as to apply it to all cases in the sample of 21; this would be the full-sample, dependent-data approach. If one withholds each year in turn (in the hope of creating an independent year), plus two more years chosen at random, and calculates the difference in the mean of forecast and observation over the remaining 18 cases, one finds the SEC to vary somewhat but not greatly, from 2.32 to 2.63 to be specific. Fortunately, the forecast still improves greatly as a result of applying a variable SEC, but not as much as, seemingly, when applying a constant SEC = 2.45. It is more correct to say that the dependent-data case (N=21) overestimates skill, and we have a professional duty to calculate an estimate that will hold up in true real time. As shown in Fig. 1, the skill, as measured by correlation, is around three points lower than the dependent-data result for each of the 9 models on the left in Fig. 1 considered by Peña and Van den Dool (2008).
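The spread of the SEC estimates across withheld triples can be illustrated with synthetic numbers (these are not the Table 1 values; the cold bias of 2.5°C and the noise level are assumptions chosen to resemble the example):

```python
import numpy as np
from itertools import combinations

# A hypothetical 21-year record with a model that is ~2.5 degC too cold.
rng = np.random.default_rng(1)
n = 21
obs = rng.normal(26.5, 1.0, size=n)              # stand-in June Nino34 SST
fcst = obs - 2.5 + rng.normal(0, 0.5, size=n)    # biased model forecasts

# SEC (here as the obs-minus-forecast mean, i.e. the amount to add to the
# forecast) computed over the 18 cases remaining after withholding 3 years.
secs = []
for held_out in combinations(range(n), 3):
    keep = np.setdiff1d(np.arange(n), held_out)
    secs.append(np.mean(obs[keep] - fcst[keep]))

print(min(secs), max(secs))   # the SEC varies somewhat, but not greatly
```

As in the text, the estimate wobbles around the true bias depending on which three years are withheld, but remains well constrained.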
That the year for which forecast accuracy is tested should not be included in the SEC determination is most easily seen in the extreme case N=1: that would make the forecast perfect in a misguided way. But even for N=21 the test case has a noticeable impact, because of the "compensation" effects that are known to affect CV. For instance, in 1987 (see Table 1) the forecast and observation are 'only' 2.2°C apart, and including this case keeps the SEC at 2.45, whereas excluding it makes it 2.63. The opposite happens in 1985 and 1993, two years that feature forecast errors larger than average. Withholding three elements dilutes the compensation effect. In section 3 we will see a more complicated compensation effect. In the next section we argue again that three should be taken out, but for a very different reason.

3. Degeneracy in regression

In earlier work we found highly negative correlations in CV applied to forecasts based on regression schemes, where a zero correlation would have been more reasonable. This feature was ultimately explained in Barnston and van den Dool (1993). Fig. 2, reproduced from that paper, shows a synthetic data case. We generated pairs of correlated (forecast, observation) data, varying the correlation along the x-axis from 0 to 1. We then did a CV-1 approach to calculate the correlation from limited data (32 pairs). When the correlation is large, CV-1 functions acceptably and only shows some normal 'shrinkage'. But when the intended correlation is small, between 0 and 0.2 in this case, the outcome of CV-1 is a disaster: one can get a perfect -1 correlation. This happens because of compensation effects at the covariance level (in section 2 we had compensation in the mean). Suppose we have zero correlation on the full sample between forecast and observation, and thus also zero covariance. When we leave out one pair which happens, by chance, to covary positively, the remaining N-1 pairs have, by necessity, a negative covariance in the mean. Thus a regression forecast based on the N-1 pairs will be opposite to what is observed in the one case left out, leading to a large negative correlation.
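The degeneracy is easy to reproduce in a Monte Carlo experiment along the lines of the synthetic case above (this sketch is my own construction, not the Barnston and van den Dool code): with truly uncorrelated pairs, CV-1 regression yields a markedly negative cross-validated correlation rather than the near-zero value one might expect.

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs, n_trials = 32, 200
cv_corrs = []
for _ in range(n_trials):
    x = rng.normal(size=n_pairs)   # predictor
    y = rng.normal(size=n_pairs)   # observations, independent of x (true corr = 0)
    yhat = np.empty(n_pairs)
    for i in range(n_pairs):
        xt, yt = np.delete(x, i), np.delete(y, i)
        b = np.cov(xt, yt)[0, 1] / np.var(xt, ddof=1)   # slope from the N-1 pairs
        a = yt.mean() - b * xt.mean()
        yhat[i] = a + b * x[i]                          # forecast for the left-out pair
    cv_corrs.append(np.corrcoef(yhat, y)[0, 1])

print(np.mean(cv_corrs))   # markedly negative, despite zero true correlation
```

Each left-out pair is predicted with a slope that leans the wrong way, by construction, so the verification is systematically anti-correlated with the forecasts.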
[Table 1 about here]
[Figure 1 about here]
[Figure 2 about here]
This can happen in real-life regression forecasts. For instance, Nino34 correlates with seasonal temperature over the US, but with opposite signs in the NW and the SE US. Along the broad band of zero and small correlation, presumably the nodal line of a teleconnection pattern, the CV-1 score of a regression forecast is highly negative. Here we get punished for our good intentions. The solution, aside from waiting forever for more years, is to take out more than one year. For instance, when taking out the test year as well as two more years, the compensation effect is obviously diluted. Choosing the two additional years at random (as opposed to a block of three with the test year in the middle) is better because serial correlation (caused by climate change, among other things) violates the assumption of independence.

The above discussion applies to the multi-model ensemble (MME) approach because the MME is a linear combination of several forecasts, with weights derived from a limited data set as per regression. We should apply CV-3RE, and we can fold the CV for the SEC and the CV required for the weights (the regression aspect) into one single procedure. The seven entries on the right-hand side of Fig. 1 are MMEs by different schemes subjected to CV-3RE. The various ridge regression approaches fare much better under CV than an unconstrained regression (UR).

4. Conclusion

We recommend as cross-validation procedure something called CV-3RE, where CV is cross-validation, 3 means three years left out, R refers to the random choice of two of the three years left out, and E refers to an external climatology (ideally from a data set for a constant climate outside the period of experimentation). We have not laid out the case for the external climatology in this short write-up, but this aspect also helps stabilize the answers one gets. While we believe CV-3RE is appropriate for the multi-model ensemble, it may also be a good strategy in many other situations.
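The consolidation step discussed above, ridge-regression weights fitted inside the leave-three-out loop, can be sketched as follows. This is a simplified illustration under my own assumptions (synthetic data, an arbitrary ridge parameter, and no external-climatology or SEC step shown); it is not the Peña and Van den Dool implementation.

```python
import numpy as np

def ridge_weights(F, o, lam):
    """Ridge-regression consolidation weights for the participating models.

    F: (n_years, n_models) forecast anomalies; o: (n_years,) observed anomalies.
    lam: ridge parameter; lam=0 reduces to unconstrained regression (UR),
    which cross-validates poorly.
    """
    k = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ o)

def cv3re_consolidation(F, o, lam, rng=None):
    """CV-3RE sketch: withhold the test year plus two random years, fit the
    weights on the remaining years, then forecast the test year."""
    rng = np.random.default_rng(rng)
    n = len(o)
    pred = np.empty(n)
    for i in range(n):
        others = np.delete(np.arange(n), i)
        held = np.concatenate(([i], rng.choice(others, size=2, replace=False)))
        keep = np.setdiff1d(np.arange(n), held)
        w = ridge_weights(F[keep], o[keep], lam)
        pred[i] = F[i] @ w
    return pred
```

Folding the weight determination into the same leave-three-out loop as the SEC keeps every test year out of all the fitting steps at once.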
However, each problem requires some deliberations of its own, and a general theory/algorithm for CV appears elusive (to me).

References

Barnston, A. G., and H. M. van den Dool, 1993: A Degeneracy in Cross-Validated Skill in Regression-Based Forecasts. J. Climate, 6, 963-977.

Peña, M., and H. van den Dool, 2008: Consolidation of Multimodel Forecasts by Ridge Regression: Application to Pacific Sea Surface Temperature. J. Climate, 21, 6521-6538.

Contact: Huug van den Dool