Gelman and Hill (2006) write on p. 46 that:
The regression assumption that is generally least important is that the errors are normally distributed. In fact, for the purpose of estimating the regression line (as compared to predicting individual data points), the assumption of normality is barely important at all. Thus, in contrast to many regression textbooks, we do not recommend diagnostics of the normality of regression residuals.
Gelman and Hill don't seem to explain this point any further.
Are Gelman and Hill correct? If so, then:
Why "barely important at all"? Why is it neither important nor completely irrelevant?
Why is the normality of residuals important when predicting individual data points?
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
For estimation, normality isn't exactly an assumption; the main consideration is efficiency. In many cases a good linear estimator will do fine, and in that case (by Gauss-Markov) the LS estimate is the best, i.e. minimum-variance, among the linear unbiased estimators. (If your tails are quite heavy, or very light, it may make sense to consider something else.)
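As a rough illustration of the efficiency point, here is a minimal simulation sketch. It uses the simplest possible regression (intercept only, so the LS estimate is the sample mean) and compares it with the sample median under normal and under heavier-tailed Laplace errors. The sample size, replication count, seed, and the choice of the median as the "something else" are all assumptions made for illustration; none of them comes from Gelman and Hill.

```python
import random
import statistics


def variance_ratio(draw, n=101, reps=2000, seed=0):
    """Return var(sample mean) / var(sample median) over repeated samples.

    A ratio below 1 means the mean (the LS estimate in an intercept-only
    model) is the more efficient estimator; above 1 favours the median.
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means) / statistics.pvariance(medians)


def normal_error(rng):
    return rng.gauss(0.0, 1.0)


def laplace_error(rng):
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    return rng.expovariate(1.0) - rng.expovariate(1.0)


ratio_normal = variance_ratio(normal_error)
ratio_laplace = variance_ratio(laplace_error)
print(f"normal errors:  var(mean)/var(median) = {ratio_normal:.2f}")
print(f"Laplace errors: var(mean)/var(median) = {ratio_laplace:.2f}")
```

Large-sample theory puts these ratios near 2/pi (about 0.64) for normal errors and near 2 for Laplace errors, so the ranking of the two estimators flips once the tails get heavy enough.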
In the case of tests and CIs, normality is assumed, but it's usually not all that critical (again, as long as the tails are not really heavy or light, or perhaps one of each). At least in not-very-small samples, the tests and typical CIs tend to have close to their nominal properties (not too far from the claimed significance level or coverage) and to perform well (reasonable power in typical situations, or CIs not much wider than the alternatives). As you move further from the normal case, power can become more of an issue, and large samples won't generally improve relative efficiency. So where effect sizes are such that power is only middling for a test with relatively good power, it may be very poor for the tests that assume normality.
This tendency of CIs and tests to have close to their nominal coverage and significance levels comes from several factors operating together. One of them is the tendency of linear combinations of variables to be close to normally distributed, as long as many values are involved and none of them contributes a large fraction of the total variance.
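The linear-combination factor can be made concrete for simple regression, where the LS slope is a weighted sum of the responses. The sketch below is an illustration under assumptions that are mine rather than the answer's: fixed x values, i.i.d. errors with a finite fourth moment, and exponential errors (skewness 2, excess kurtosis 6) as the non-normal example. It uses the additivity of cumulants to compute exactly how much of the errors' skewness and excess kurtosis survives in the slope estimate. The two designs and the function names are hypothetical.

```python
def slope_weights(x):
    """Weights w_i such that the LS slope equals sum(w_i * y_i) for fixed x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [(xi - xbar) / sxx for xi in x]


def slope_shape(x, err_skew, err_excess_kurt):
    """Skewness and excess kurtosis of the LS slope estimate.

    Assumes i.i.d. errors; relies on cumulants of independent terms adding,
    so the k-th cumulant of sum(w_i * e_i) is sum(w_i ** k) times that of e.
    """
    w = slope_weights(x)
    s2 = sum(wi ** 2 for wi in w)
    skew = err_skew * sum(wi ** 3 for wi in w) / s2 ** 1.5
    excess_kurt = err_excess_kurt * sum(wi ** 4 for wi in w) / s2 ** 2
    return skew, excess_kurt


# Design 1: 50 evenly spaced x values, so no point dominates the variance.
even_x = [float(i) for i in range(50)]
# Design 2: 49 points in [0, 1] plus one far-away, high-leverage point.
leverage_x = [i / 49 for i in range(49)] + [50.0]

even_skew, even_kurt = slope_shape(even_x, 2.0, 6.0)
lev_skew, lev_kurt = slope_shape(leverage_x, 2.0, 6.0)
print(f"even design:     skew = {even_skew:.3f}, excess kurtosis = {even_kurt:.3f}")
print(f"leverage design: skew = {lev_skew:.3f}, excess kurtosis = {lev_kurt:.3f}")
```

With the even design the excess kurtosis shrinks by a factor on the order of 1/n, while the single high-leverage point leaves the slope estimate almost as non-normal as the errors themselves. That is the "none of them contributes a large fraction of the total variance" condition in action.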
In the case of a prediction interval based on the normal assumption, however, normality is relatively more critical, since the width of the interval depends strongly on the distribution of a single value. Even there, for the most common interval size (a 95% interval), the fact that many unimodal distributions have very close to 95% of their probability within about 2 standard deviations of the mean tends to result in reasonable performance of a normal prediction interval even when the distribution isn't normal. [This doesn't carry over quite so well to much narrower or wider intervals, say a 50% interval or a 99.9% interval, though.]
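The "about 2 standard deviations" observation can be checked directly. The following sketch computes, from closed-form CDFs, the probability that a value falls within the normal-theory half-width of its mean for a few unit-variance distributions. The choice of distributions is an assumption made for illustration, and the calculation treats the mean and standard deviation as known, so it ignores the extra uncertainty from estimating the regression line.

```python
import math
from statistics import NormalDist


def coverage_within_k_sd(k):
    """Exact P(|X - mean| <= k * sd) for several distributions with sd = 1."""
    root3 = math.sqrt(3.0)
    return {
        "normal": 2.0 * NormalDist().cdf(k) - 1.0,
        # Uniform on [-sqrt(3), sqrt(3)]: light tails.
        "uniform": min(1.0, k / root3),
        # Laplace with scale 1/sqrt(2): heavier tails.
        "laplace": 1.0 - math.exp(-k * math.sqrt(2.0)),
        # Logistic with scale sqrt(3)/pi.
        "logistic": math.tanh(k * math.pi / (2.0 * root3)),
        # Exponential with rate 1 (mean 1, sd 1): a skewed case.
        "exponential": math.exp(-max(0.0, 1.0 - k)) - math.exp(-(1.0 + k)),
    }


results = {}
for nominal in (0.50, 0.95, 0.999):
    # Half-width, in sd units, of the normal-theory interval at this level.
    k = NormalDist().inv_cdf(0.5 + nominal / 2.0)
    results[nominal] = coverage_within_k_sd(k)
    row = ", ".join(f"{name} {p:.4f}" for name, p in results[nominal].items())
    print(f"nominal {nominal}: {row}")
```

In this sketch the Laplace, logistic, and exponential cases land within about two percentage points of 95%, and the uniform over-covers. At the 50% level the uniform and Laplace cases are off by roughly ten points, and at 99.9% the Laplace and exponential cases miss about ten times as often as the nominal 0.1%.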
2: When predicting individual data points, the interval around that prediction (strictly a prediction interval rather than a confidence interval) assumes that the residuals are normally distributed.
This isn't much different from the general assumption behind confidence intervals: to be valid, we need to know the relevant sampling distribution, and the most common assumption is normality. For example, a standard confidence interval around a mean works because the distribution of sample means approaches normality, so we can use a z or t distribution.