For people whose profession revolves around making order out of seemingly random observations, scientists sure are inconsistent at judging the work of other scientists. Why? It certainly doesn't seem to be like this at all levels. For example, according to the GRE's website,
For the Analytical Writing section, each essay receives a score from two trained raters, using a six-point holistic scale. In holistic scoring, raters are trained to assign scores on the basis of the overall quality of an essay in response to the assigned task. If the two assigned scores differ by more than one point on the scale, the discrepancy is adjudicated by a third GRE reader. Otherwise, the two scores on each essay are averaged.
This implies that it's uncommon for two assigned scores to differ by more than one point on the scale, i.e. GRE essay raters usually agree. Similarly, as far as I know, undergraduate thesis readers, MS thesis readers and even PhD thesis readers don't usually come to diametrically opposed judgments on the piece of work. Yet once it gets to research-level material, peer reviewers no longer seem to agree. Why?
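The adjudication rule in the quoted GRE text can be written down in a few lines. This is only a sketch: the function name `essay_score` is made up, and the quote does not say how the third reader's score enters the final result, so the code assumes it simply replaces the two discordant scores.

```python
from typing import Optional


def essay_score(first: float, second: float, third: Optional[float] = None) -> float:
    """Combine rater scores for one essay, following the quoted rule."""
    # Scores within one point of each other are averaged.
    if abs(first - second) <= 1:
        return (first + second) / 2
    # Otherwise a third reader has to adjudicate.
    if third is None:
        raise ValueError("scores differ by more than one point; third reader needed")
    # ASSUMPTION (not stated in the quote): the third reader's score stands.
    return third


print(essay_score(4, 5))           # two raters agree closely, scores are averaged
print(essay_score(2, 5, third=4))  # discrepancy, third reader decides
```

Nothing comparable exists for peer review: there is no shared scale and no agreed rule for what happens when two reports disagree.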
Good question. Hard to answer. Some thoughts:
Considering these observations, it is unrealistic to expect two review reports to align. The difficult decision then falls to the associate editor, who is also a volunteer and not a specialist in the author’s field.
That leaves the question of why this is accepted when it wouldn’t be outside science. Honestly, I don’t know. Just some guesses:
Added based on comment:
- reviewers are busy scientists
- reviewers are career-wise not rewarded for conducting reviews
The biggest difference is that up to PhD thesis level, the person doing the assessing is more of an expert than the person being assessed. In almost all these cases there is an agreed set of standard skills, techniques and knowledge that any assessor can be expected to possess and any assessee is being measured against.
This isn't so true of a PhD thesis, but in the end once a supervisor/thesis committee has green lit a student, almost all PhD theses are passed.
It's definitely not true higher up. In almost all cases the person being reviewed will be more of an expert in their work than anyone doing the reviewing. The only exceptions will be direct competitors, and they will be excluded. We are talking about work right at the edge of human knowledge, where different people have different knowledge and skill sets.
I'm quite surprised that the GRE scores are so consistent. It's long been known that essay marking is pretty arbitrary (see for example Diederich 1974). Mind you, 1 mark on a 6-mark scale is more than 15% - a pretty big difference. In our degree a 70 and above is a 1st class degree - the best mark there is, whereas 55 is a 2:2, a degree that won't get you an interview for most graduate jobs. Losing 15% on a grant assessment will almost certainly lose you the grant.
But even to obtain this level of consistency, the graders must have been given a pretty prescriptive grading rubric. In research, no such rubric exists: there are no predefined criteria against which a piece of research is measured, and any attempt to lay one down would more or less break the whole point of research.
With respect to the problem of good papers being rejected, a factor that doesn't seem to have been mentioned yet is that the consequences of accepting a bogus paper are much worse than those of rejecting a good paper. If a good paper is rejected, it can always be resubmitted to a different journal. And if the authors first revise according to the reviewer comments, the version that ends up getting published may well be better written than the one that was rejected. All that's lost is time.
But if a bogus paper is accepted, other scientists may see it in the literature, assume its results to be valid, and build their own work upon it. This could result in significant lost time on their part, as experiments that depend on the bogus result don't work out as they should (which at least may lead to the bogus paper being retracted if the errors are bad enough). Or maybe they'll avoid researching along a line that would have worked, because the bogus paper implies it wouldn't, or worse, they'll end up with inaccurate results themselves and end up putting another paper with bad data into the literature. All of these are far worse outcomes than just needing to resubmit a paper, so false negatives are preferred to false positives when reviewing.
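One way to make this asymmetry concrete is a toy expected-cost comparison. Everything here is an assumption for illustration: the function name `should_accept`, the idea that a reviewer can put a number on their doubt, and the cost values are all hypothetical, not anything journals actually compute.

```python
def should_accept(p_bogus: float, cost_false_accept: float, cost_false_reject: float) -> bool:
    """Accept only if accepting has the lower expected cost.

    p_bogus           -- reviewer's estimated probability that the paper is bogus
    cost_false_accept -- cost of publishing a bogus paper (hypothetical units)
    cost_false_reject -- cost of rejecting a good paper (hypothetical units)
    """
    expected_cost_accept = p_bogus * cost_false_accept
    expected_cost_reject = (1.0 - p_bogus) * cost_false_reject
    return expected_cost_accept < expected_cost_reject


# Same paper, same 20% doubt, different cost assumptions.
print(should_accept(0.2, cost_false_accept=1, cost_false_reject=1))   # symmetric costs
print(should_accept(0.2, cost_false_accept=10, cost_false_reject=1))  # bogus paper 10x worse
```

With symmetric costs the paper with 20% doubt is accepted; once a false acceptance is assumed to be ten times as costly, the same paper is rejected. The more lopsided the costs, the less doubt it takes to tip a review toward rejection.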
Different tasks, different results.
All the GRE graders have to do is assign scores but they are doing so to dozens or hundreds of essays. They receive clear guidance and examples about what score given essays should probably receive. So it’s basically checking boxes to justify a small set of results.
A peer review analysis is fundamentally different since you’re asking for a much more technically difficult task. Reviewers have to evaluate whether the analysis is accurate, not whether it’s responsive to a prompt. There’s no set of examples to draw on either. So the focus of peer review can be very different for different reviewers who may have different sets of expertise and certainly will have their own points of view.
To compare academic peer review to GRE grading -- that makes apples and oranges look all but identical. Let's step a little closer:
Similarly, as far as I know, undergraduate thesis readers, MS thesis readers and even PhD thesis readers don't usually come to diametrically opposed judgments on the piece of work.
That is certainly not always true and highly field dependent. In certain parts of academia it is a standard grad student horror story that Committee Member A insists that the thesis be cast in terms of Theoretical Perspective X, while Committee Member B insists that the thesis be cast in terms of Theoretical Perspective Y, where X and Y may be intellectually incompatible or sociologically incompatible: i.e., each theory has rejection of the other as a central tenet. This is more common in humanities where the relationship of "theory" to the rest of the work is rather different, but it is not unheard of in the sciences either.
As a frequent committee member, I also happen to know that coming to a consensus judgment is a sociological phenomenon as well as an intellectual one -- i.e., some differences in judgment are limited only to the private discussion following the defense and other differences in judgment are never verbalized at all.
This is helpful in understanding the disparity in peer review: in peer review, the different referees are (in my experience, at least) never in direct communication with each other, and in fact may not be seeing each other's verdicts at all: as a referee, I believe that I have never been shown another referee report. In fact,
There is no aspect of the academic process that makes me feel like a lone masked vigilante more than being a referee. Surely people who do GRE grading go through some lengthy training process of repeated practice evaluating, feedback on those evaluations, discussion of the larger goals, and so forth. There is nothing like this for academic referees. We get no practice, and there is very little evaluation of our work. If I turn in what is (I guess!) an unusually comprehensive report unusually quickly, I will often get a "Hey, thanks!" email from the editor. In the (thankfully rather small) number of instances where my referee reports were months overdue, I either heard nothing from the editors (I am ashamed to say that once I figured out on my own that a paper I thought I had had for a few months had actually been with me for an entire year) or got carefully polite pleas for me to turn in the report. I have never gotten any negative feedback after the fact. Unlike GRE graders, referees are volunteers.
I find (again, in my experience and in my academic field of mathematics) that referees are almost never given instructions that amount to any more than "1) Use your best judgment. 2) We are a really good journal and want you to impose high standards." I also notice that 2) is said for journals of wildly differing quality. What does it mean to "impose high standards"? I take that directive seriously and fire my shots into the dark as carefully as I can, but....of course that is ridiculously, maximally subjective.
Contributing a point beyond other answers:
Different levels of effort going into the review lead to different outcomes.
Papers are often written so that, on a first-pass read, they come across as "pretty good" even if a more critical deep read and/or check of references would expose gaping holes, serious methodological issues, and alternative explanations for the results observed. Sometimes, an even more effortful review can find that these issues don't actually matter in the particular case applicable to that specific paper (though the author should generally add this to the paper text itself).
While reviewers are incentivized to do a good job by the general knowledge that the system depends on that, specific instances are generally not incentivized and reviews sometimes get left to the last minute with a reviewer who's short on sleep and long on other tasks, who doesn't put in the effort for a good review. Thus, the result could be very different than even the same paper getting reviewed by the same reviewer at a different time. With no visibility into the factors affecting that outcome, it seems random.
We write more and more, and the typical submission quality seems to be going down. There are various reasons for this, including bad incentives, in particular in China. If your salary depends directly on the papers accepted, quantity beats quality...
IMHO we are close to a tipping point now. Many of the expert reviewers refuse almost any reviewing request, because so many submissions are so sloppy that it's quite annoying to review them. It should be different: most submissions should be of such high quality that you enjoy reading them and can focus on the details. So more and more experts are just annoyed. They delegate more of the reviewing to students, or simply refuse. But that now means the remaining reviewers get more requests, and more bad papers. This can tip quickly, just like most ecosystems.
So the editors need to find other reviewers, and we get fewer and fewer expert reviewers. This also opens doors to scams and schemes. Multimedia Tools and Applications, for example, seems to have fallen prey to an editor and reviewer manipulation scheme.
So what's the solution? I don't know.
This won't really answer your question, I realize, but I'd like to address your first example - rejected papers that later led to Nobel prizes.
Sometimes a piece of work is Frame Breaking and it leads to a Paradigm Shift within a field. This has happened many times in history, since at least Copernicus and Galileo. Einstein's early work on relativity was rejected among the physics/astronomy hoi oligoi as it was too different from the belief in the Aether at the time. The most prominent members of the field reject a radically new idea, and their students, who are pervasively represented, usually go along.
It has been said that revolutions in physics require the death or retirement of the most respected researchers so that the ideas of the young can get a fair hearing and come to the fore.
That is in fact an explanation of at least some of the eight papers referenced in your first link.
I don't think that many of us write paradigm changing papers, but it occasionally happens. The truly brilliant (not guilty) among us often must labor in near silence and obscurity for most of a generation. The next generation may celebrate them, or it may take even longer.
When a reviewer is faced with a truly frame-breaking paper they, by definition, have no frame of reference in which to evaluate it. It is orthogonal to their entire way of thinking. "This must be nonsense" is the all-too-natural response.
To address the aspect of:
The same paper resubmitted to the same journal after several years often ends up rejected due to 'serious methodological errors'
In about one third of the papers I reviewed, I identified fundamental flaws that could not be addressed by revising the paper (you would have to write a new paper instead). Some examples just to give you a taste:
While I may have been wrong about these things, the authors never addressed my concerns, be it in a rebuttal or in a version of the paper published in another journal (which never happened in most of these cases).
Now these issues may seem like they should have been easy to spot, but evidently they weren’t: I only spotted some of these flaws while writing up the actual review, and I witnessed (and performed) quite a few jaw droppings when discussing papers with co-reviewing colleagues¹ whom I knew to be thorough. Also, in some cases I saw reports of other referees who were otherwise exhaustive but did not spot the issues.
So, to summarise: Even spotting fundamental flaws in a paper can be very difficult. A given reviewer only has a comparatively small chance to spot a flaw in a given paper. Therefore there is a considerable chance that all of the reviewers fail.
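To put rough numbers on that summary, here is a back-of-the-envelope sketch. It assumes that each reviewer spots the flaw independently and with the same probability, which is a simplification, and the 30% figure is purely hypothetical; `prob_all_miss` is a name invented for this example.

```python
def prob_all_miss(p_spot: float, n_reviewers: int) -> float:
    """Probability that none of n independent reviewers spots a given flaw.

    p_spot is the (assumed identical) chance that a single reviewer spots it.
    """
    if not 0.0 <= p_spot <= 1.0:
        raise ValueError("p_spot must be between 0 and 1")
    if n_reviewers < 0:
        raise ValueError("n_reviewers must be non-negative")
    # Independence assumption: the individual miss probabilities multiply.
    return (1.0 - p_spot) ** n_reviewers


# Hypothetical: each reviewer has a 30% chance of spotting the flaw.
for n in (1, 2, 3):
    print(n, round(prob_all_miss(0.3, n), 3))
```

Under these assumptions, the chance that every reviewer misses the flaw shrinks only slowly as reviewers are added. A later round with different reviewers is a fresh draw, which is one way the same paper can sail through once and be rejected for serious errors the next time.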
¹ Yes, that’s a thing in my field and fully accepted by the journals.