From: Ben Santer
Date: Thu, 27 Dec 2007 16:26:19 -0800
To: John Lanzante, Thomas R. Karl, Carl Mears, David C. Bader, Dian J. Seidel, Francis, Frank Wentz, Karl E. Taylor, Leopold Haimberger, Melissa Free, Mike MacCracken, Phil Jones, Steven Sherwood, Steve Klein, Susan Solomon, Peter Thorne, Tim Osborn, Tom Wigley, Gavin Schmidt
Subject: More significance testing

Dear folks,

This email briefly summarizes the trend significance test results. As I
mentioned in yesterday's email, I've added a new case (referred to as
"TYPE3" below). I've also added results for tests with a stipulated 10%
significance level. Here is the explanation of the four different types
of trend test:

1. "OBS-vs-MODEL": Observed MSU trends in RSS and UAH are tested against
trends in synthetic MSU data in 49 realizations of the 20c3m experiment.
Results from RSS and UAH are pooled, yielding a total of 98 tests for T2
trends and 98 tests for T2LT trends.

2. "MODEL-vs-MODEL (TYPE1)": Involves model data only. Trend in
synthetic MSU data in each of 49 20c3m realizations is tested against
each trend in the remaining 48 realizations (i.e., no trend tests
involving identical data). Yields a total of 49 x 48 = 2352 tests. The
significance of trend differences is a function of BOTH inter-model
differences (in climate sensitivity, applied 20c3m forcings, and the
amplitude of variability) AND "within-model" effects (i.e., is related
to the different manifestations of natural internal variability
superimposed on the underlying forced response).

3. "MODEL-vs-MODEL (TYPE2)": Involves model data only. Limited to the M
models with multiple realizations of the 20c3m experiment. For each of
these M models, the number of unique combinations C of N 20c3m
realizations into R trend pairs is determined. For example, in the case
of N = 5, C = N! / [ R!(N-R)! ] = 10. The significance of trend
differences is solely a function of "within-model" effects (i.e., is
related to the different manifestations of natural internal variability
superimposed on the underlying forced response). There are a total of 62
tests (not 124, as I erroneously reported yesterday!)

4. "MODEL-vs-MODEL (TYPE3)": Involves model data only. For each of the
19 models, only the first 20c3m realization is used. The trend in each
model's first 20c3m realization is tested against each trend in the
first 20c3m realization of the remaining 18 models. Yields a total of 19
x 18 = 342 tests. The significance of trend differences is solely a
function of inter-model differences (in climate sensitivity, applied
20c3m forcings, and the amplitude of variability).
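
(For anyone who wants to check the bookkeeping, here is a minimal Python
sketch of how the four sets of test pairs described above are enumerated.
The per-model realization counts in "n_real" are hypothetical placeholders,
not the actual 20c3m ensemble sizes; with the real counts - 49 realizations
spread over the 19 models - the four lists contain 98, 2352, 62, and 342
entries, matching the tables below.)

# Enumeration of the four sets of trend tests (sketch only).
from itertools import combinations, permutations

n_real = {"model_%02d" % j: 2 for j in range(19)}  # hypothetical ensemble sizes

# All (model, realization) labels.
runs = [(m, r) for m, n in n_real.items() for r in range(n)]

# 1. OBS-vs-MODEL: RSS and UAH are each tested against every model run.
obs_vs_model = [(obs, run) for obs in ("RSS", "UAH") for run in runs]

# 2. TYPE1: every run against every other run (ordered pairs, no self-tests).
type1 = list(permutations(runs, 2))

# 3. TYPE2: unordered within-model pairs, for models with >1 realization;
#    a model with N runs contributes C = N! / [ R!(N-R)! ] pairs, with R = 2.
type2 = [pair for m, n in n_real.items() if n > 1
         for pair in combinations([(m, r) for r in range(n)], 2)]

# 4. TYPE3: the first realization of each model against every other model's first.
firsts = [(m, 0) for m in n_real]
type3 = list(permutations(firsts, 2))

print(len(obs_vs_model), len(type1), len(type2), len(type3))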

REJECTION RATES FOR STIPULATED 5% SIGNIFICANCE LEVEL
Test type                   No. of tests     T2 "Hits"     T2LT "Hits"
1. OBS-vs-MODEL             49 x 2 (98)      2 (2.04%)     1 (1.02%)
2. MODEL-vs-MODEL (TYPE1)   49 x 48 (2352)   58 (2.47%)    32 (1.36%)
3. MODEL-vs-MODEL (TYPE2)   --- (62)         0 (0.00%)     0 (0.00%)
4. MODEL-vs-MODEL (TYPE3)   19 x 18 (342)    22 (6.43%)    14 (4.09%)

REJECTION RATES FOR STIPULATED 10% SIGNIFICANCE LEVEL
Test type                   No. of tests     T2 "Hits"     T2LT "Hits"
1. OBS-vs-MODEL             49 x 2 (98)      4 (4.08%)     2 (2.04%)
2. MODEL-vs-MODEL (TYPE1)   49 x 48 (2352)   80 (3.40%)    46 (1.96%)
3. MODEL-vs-MODEL (TYPE2)   --- (62)         1 (1.61%)     0 (0.00%)
4. MODEL-vs-MODEL (TYPE3)   19 x 18 (342)    28 (8.19%)    20 (5.85%)

REJECTION RATES FOR STIPULATED 20% SIGNIFICANCE LEVEL
Test type                   No. of tests     T2 "Hits"     T2LT "Hits"
1. OBS-vs-MODEL             49 x 2 (98)      7 (7.14%)     5 (5.10%)
2. MODEL-vs-MODEL (TYPE1)   49 x 48 (2352)   176 (7.48%)   100 (4.25%)
3. MODEL-vs-MODEL (TYPE2)   --- (62)         4 (6.45%)     3 (4.84%)
4. MODEL-vs-MODEL (TYPE3)   19 x 18 (342)    42 (12.28%)   28 (8.19%)

Features of interest:

A) As you might expect, for each of the three significance levels, TYPE3
tests yield the highest rejection rates of the null hypothesis of "no
difference in trend". TYPE2 tests yield the lowest rejection rates. This
is simply telling us that the inter-model differences in trends tend to
be larger than the "between-realization" differences in trends within
any individual model.

B) Rejection rates for the model-versus-observed trend tests are
consistently LOWER than for the model-versus-model (TYPE3) tests. On
average, therefore, the tropospheric trend differences between the
observational datasets used here (RSS and UAH) and the synthetic MSU
temperatures calculated from 19 CMIP-3 models are actually LESS
SIGNIFICANT than the inter-model trend differences arising from
differences in sensitivity, 20c3m forcings, and levels of variability.

I also thought that it would be fun to use the model data to explore the
implications of Douglass et al.'s flawed statistical procedure. Recall
that Douglass et al. compare (in their Table III) the observed T2 and
T2LT trends in RSS and UAH with the overall means of the multi-model
distributions of T2 and T2LT trends. Their standard error, sigma{SE}, is
meant to represent an "estimate of the uncertainty of the mean" (i.e.,
the mean trend). sigma{SE} is given as:

sigma{SE} = sigma / sqrt{N - 1}

where sigma is the standard deviation of the model trends, and N is "the
number of independent models" (22 in their case). Douglass et al.
apparently estimate sigma using ensemble-mean trends for each model (if
20c3m ensembles are available).

So what happens if we apply this procedure using model data only? This
is rather easy to do. As above (in the TYPE1, TYPE2, and TYPE3 tests), I
simply used the synthetic MSU trends from the 19 CMIP-3 models employed
in our CCSP Report and in Santer et al. 2005 (so N = 19). For each
model, I calculated the ensemble-mean 20c3m trend over 1979 to 1999
(where multiple 20c3m realizations were available). Let's call these
mean trends b{j}, where j (the index over models) = 1, 2, ..., 19.
Further, let's regard b{1} as the surrogate observations, and then use
Douglass et al.'s approach to test whether b{1} is significantly
different from the overall mean of the remaining 18 members of b{j}.
Then repeat with b{2} as surrogate observations, etc. For each
layer-averaged temperature series, this yields 19 tests of the
significance of differences in mean trends.
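
(If anyone wants to replay this exercise, the sketch below shows the
leave-one-out calculation in Python. The trend list is a made-up
placeholder rather than the real b{j}, and the use of the sample standard
deviation for SIGMA and of a two-sided normal p-value are my reading of
the procedure - treat both as assumptions.)

# Leave-one-out application of the Douglass et al. "test" (sketch only).
import math

trends = [0.16, 0.26, 0.36, 0.15, 0.19]  # hypothetical b{j}, deg C/decade

def douglass_test(b, j):
    """Treat b[j] as surrogate obs; test it against the mean of the other trends."""
    rest = [x for i, x in enumerate(b) if i != j]
    modave = sum(rest) / len(rest)                 # mean trend of remaining models
    # Sample standard deviation of the remaining trends (ddof = 1 is an assumption).
    sigma = math.sqrt(sum((x - modave) ** 2 for x in rest) / (len(rest) - 1))
    sigma_se = sigma / math.sqrt(len(b) - 1)       # sigma / sqrt(N - 1), N = total models
    normd = abs(b[j] - modave) / sigma_se          # normalized trend difference
    p = math.erfc(normd / math.sqrt(2.0))          # two-sided p-value, normal approximation
    return modave, sigma, sigma_se, normd, p

for j, obs in enumerate(trends):
    print(j, obs, douglass_test(trends, j))

With the actual 19 ensemble-mean trends, the same arithmetic should
reproduce the MODAVE, SIGMA, SIGMA{SE}, NORMD, and P-VALUE columns in the
table below (up to the ddof convention noted in the code).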

To give you a feel for this stuff, I've reproduced below the results for
tests involving T2LT trends. The "OBS" column is the ensemble-mean T2LT
trend in the surrogate observations. "MODAVE" is the overall mean trend
in the 18 remaining members of the distribution, and "SIGMA" is the
1-sigma standard deviation of these trends. "SIGMA{SE}" is 1 x the
standard error defined above (note that Douglass et al. give 2 x
SIGMA{SE} in their Table III; multiplying our SIGMA{SE} results by two
gives values similar to theirs). "NORMD" is the absolute normalized
difference |OBS - MODAVE| / SIGMA{SE}, and "P-VALUE" is the p-value for
the normalized difference, assuming that this difference is
approximately normally distributed.

MODEL "OBS" MODAVE SIGMA SIGMA{SE} NORMD P-VALUE

CCSM3.0 0.1580 0.2179 0.0910 0.0215 2.7918 0.0052

GFDL2.0 0.2576 0.2124 0.0915 0.0216 2.0977 0.0359

GFDL2.1 0.3567 0.2069 0.0854 0.0201 7.4404 0.0000

GISS_EH 0.1477 0.2185 0.0906 0.0214 3.3153 0.0009

GISS_ER 0.1938 0.2159 0.0919 0.0217 1.0205 0.3075
MIROC3.2_T42 0.1285 0.2196 0.0897 0.0211 4.3094 0.0000
MIROC3.2_T106 0.2298 0.2139 0.0920 0.0217 0.7305 0.4651
MRI2.3.2a 0.2800 0.2111 0.0907 0.0214 3.2196 0.0013

PCM 0.1496 0.2184 0.0907 0.0214 3.2170 0.0013

HADCM3 0.1936 0.2159 0.0919 0.0217 1.0327 0.3018

HADGEM1 0.3099 0.2095 0.0891 0.0210 4.7784 0.0000

CCCMA3.1 0.4236 0.2032 0.0769 0.0181 12.1591 0.0000

CNRM3.0 0.2409 0.2133 0.0918 0.0216 1.2762 0.2019

CSIRO3.0 0.2780 0.2113 0.0908 0.0214 3.1195 0.0018
ECHAM5 0.1252 0.2197 0.0895 0.0211 4.4815 0.0000
IAP_FGOALS1.0 0.1834 0.2165 0.0917 0.0216 1.5314 0.1257
GISS_AOM 0.1788 0.2168 0.0916 0.0216 1.7579 0.0788
INMCM3.0 0.0197 0.2256 0.0790 0.0186 11.0541 0.0000
IPSL_CM4 0.2258 0.2142 0.0920 0.0217 0.5359 0.5920

T2LT: No. of p-values .le. 0.05: 12. Rejection rate: 63.16%
T2LT: No. of p-values .le. 0.10: 13. Rejection rate: 68.42%
T2LT: No. of p-values .le. 0.20: 14. Rejection rate: 73.68%

The corresponding rejection rates for the tests involving T2 data are:

T2: No. of p-values .le. 0.05: 12. Rejection rate: 63.16%
T2: No. of p-values .le. 0.10: 13. Rejection rate: 68.42%
T2: No. of p-values .le. 0.20: 15. Rejection rate: 78.95%

Bottom line: If we applied Douglass et al.'s ridiculous test of
difference in mean trends to model data only - in fact, to virtually the
same model data they used in their paper - we would conclude that nearly
two-thirds of the individual models had trends significantly different
from the multi-model mean trend! To follow Douglass et al.'s flawed
logic, this would mean that two-thirds of the models really aren't
models after all...

Happy New Year to all of you!

With best regards,

Ben
----------------------------------------------------------------------------
Benjamin D. Santer
Program for Climate Model Diagnosis and Intercomparison
Lawrence Livermore National Laboratory
P.O. Box 808, Mail Stop L-103
Livermore, CA 94550, U.S.A.
Tel: (925) 422-2486
FAX: (925) 422-7675
email: santer1@llnl.gov
----------------------------------------------------------------------------