A friend and I are looking at ways of selecting forecasters based on past performance. My friend's preferred method is to select the forecasters who are most correlated with the truth, based on a Pearson correlation.
I've been trying to express to my friend some reasons why I think this method is a bit flawed, using the following example:
| True value | Forecaster 1 | Forecaster 2 |
|-----------:|-------------:|-------------:|
| 5          | 6            | 51           |
| 10         | 9            | 99           |
| 20         | 21           | 201          |
Forecaster 1’s estimates are within ±1 of the true value, while Forecaster 2’s estimates are within ±1 of ten times the true value. The Pearson correlation between Forecaster 1 and the truth is .9897, while the Pearson correlation between Forecaster 2 and the truth is .9999. Recall that under my friend's method we select the forecaster with the highest Pearson correlation with the truth. So in this example Forecaster 2 would be selected, even though Forecaster 2’s forecasts are wildly inaccurate relative to Forecaster 1’s.
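Those correlations are easy to verify; a quick sketch in Python with numpy, using the numbers from the table above:

```python
import numpy as np

truth = np.array([5, 10, 20])
f1 = np.array([6, 9, 21])     # within 1 of the truth
f2 = np.array([51, 99, 201])  # within 1 of ten times the truth

# Pearson correlation of each forecaster with the truth
r1 = np.corrcoef(truth, f1)[0, 1]
r2 = np.corrcoef(truth, f2)[0, 1]
print(round(r1, 4), round(r2, 4))  # 0.9897 0.9999
```

Correlation is invariant to scaling and shifting a forecaster's numbers, which is exactly why Forecaster 2's tenfold errors are invisible to it.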
Yes, it does expose a problem, although your friend may have a different scientific question in mind (though it is hard to guess what that might be). Forecaster 3, who guesses 105, 110, 120, does even better, I think you will find: a perfect correlation of 1.
There are a host of options. In the medical field, an influential paper for the case where you are comparing two methods of measurement is Bland and Altman's "Statistical methods for assessing agreement between two methods of clinical measurement". In the thirty years since publication it has been cited more than 38,000 times according to Google Scholar.
You might also want to look at the mean squared error: https://en.wikipedia.org/wiki/Mean_squared_error
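For the example in the question, mean squared error ranks the forecasters the way you would hope. A sketch in Python, with the same numbers as the question's table:

```python
import numpy as np

truth = np.array([5, 10, 20])
f1 = np.array([6, 9, 21])
f2 = np.array([51, 99, 201])

# MSE = average of squared forecast errors
mse1 = np.mean((f1 - truth) ** 2)  # every error is +/-1, so MSE = 1
mse2 = np.mean((f2 - truth) ** 2)  # errors of 46, 89, 181
print(mse1, mse2)  # 1.0 14266.0
```

Unlike correlation, MSE penalizes errors in level and scale, not just departures from a linear relationship.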
As far as I can see, the issue here is similar to the issue of consistency vs. agreement in the context of interrater reliability. Consider your true values as ratings offered by one rater (who happens to be very good...) and the forecasters as each providing their own ratings.
Raters are consistent if, e.g., those people who are rated as higher than average by one rater are also rated similarly higher than average by another rater. Consistency can be assessed with correlations.
Raters are in agreement if the mean of one rater's set of ratings is similar to the mean of another rater's set of ratings. Agreement can be assessed with tests of differences between means.
Consistency and agreement are orthogonal. Two raters can be consistent but not in agreement:

|           | Rater 1 | Rater 2 |
|:----------|--------:|--------:|
| Subject 1 | 1       | 10      |
| Subject 2 | 2       | 20      |
| Subject 3 | 3       | 30      |
| Subject 4 | 4       | 40      |
| Subject 5 | 5       | 50      |
| Mean      | 3       | 30      |

r = 1
and in agreement but not (really) consistent:

|           | Rater 1 | Rater 2 |
|:----------|--------:|--------:|
| Subject 1 | 1       | 1       |
| Subject 2 | 2       | 5       |
| Subject 3 | 3       | 4       |
| Subject 4 | 4       | 2       |
| Subject 5 | 5       | 3       |
| Mean      | 3       | 3       |

r = .1
or both, or neither (you can imagine those numbers for yourself).
Ideally, you want both consistency and agreement to be high. Both can be assessed together using the intraclass correlation coefficient (ICC).
Unless I'm missing something obvious, you could do something similar here between your true values and your forecasts.
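As a sketch of that idea, here is ICC(2,1) (two-way random effects, absolute agreement, single measures — the variant that penalizes both inconsistency and disagreement in level) computed by hand with numpy for the question's two forecasters. The `icc_2_1` helper is my own illustration; the formula is the standard Shrout and Fleiss one:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n subjects) x (k raters) array.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((ratings.mean(axis=1) - grand) ** 2) / (n - 1)  # subjects
    msc = n * np.sum((ratings.mean(axis=0) - grand) ** 2) / (k - 1)  # raters
    sse = np.sum((ratings - grand) ** 2) - msr * (n - 1) - msc * (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

truth = [5, 10, 20]
icc_f1 = icc_2_1(np.column_stack([truth, [6, 9, 21]]))
icc_f2 = icc_2_1(np.column_stack([truth, [51, 99, 201]]))
print(round(icc_f1, 2), round(icc_f2, 2))  # 0.99 0.08
```

Forecaster 1 scores near 1 while Forecaster 2 scores near 0, because the absolute-agreement ICC treats the tenfold scale difference as disagreement rather than ignoring it the way Pearson correlation does.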