Why does it say data should be normally distributed for analysis, when different tests follow their own distributions (e.g. t, Z, F)?

Kynda 05/16/2018. 2 answers, 294 views

Why does it say data should be normally distributed for statistical analysis when different tests follow their own distributions (e.g. t, Z, F)?

What does normality have to do with this?

Glen_b 05/16/2018.

Let's look at a specific example, a one sample t-test.

The t-statistic consists of a numerator and a denominator:

$$t = \frac{\bar{X}-\mu_0}{s_X/\sqrt{n}}$$

Both $\bar{X}$ and $s_X$ -- the sample mean and standard deviation -- are random quantities that depend on the (random) sample.

Under the null hypothesis, the random values in the sample we will be taking ($X_1,X_2,...,X_n$) are assumed to be independent and identically distributed as $N(\mu_0,\sigma^2)$ for some unknown $\sigma^2$, so their mean $\bar{X}$ is in turn normally distributed with mean $\mu_0$ and variance $\sigma^2/n$ (these statements can be proved under the assumptions; for the mean and variance see https://en.wikipedia.org/wiki/Expected_value#Basic_properties and https://en.wikipedia.org/wiki/Variance#Basic_properties).
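
As a brief worked version of the mean and variance claims (a sketch that uses only linearity of expectation and the independence assumption above):

$$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \mu_0, \qquad \operatorname{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{\sigma^2}{n}$$

That $\bar{X}$ is itself normal follows because a linear combination of independent normal variables is again normal.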

So the numerator of our t-statistic is distributed as $N(0,\sigma^2/n)$. If we knew $\sigma$, we could divide that by $\sigma/\sqrt{n}$ and get a test statistic that was distributed as a standard normal (a Z test). ($Z = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}$)

However, we don't have that information, and because we don't know how variable the population is, we don't have a way to exactly work out how "unusual" a particular sample mean would be (if it were to have come from a normal distribution with the hypothesized mean).

We can, however, get an estimate of that variability (we can estimate $\sigma^2$ by $s_X^2$, the sample variance). If we do that, and use the estimate in place of the population value, the statistic becomes the t-statistic. But because the numerator and denominator are now both random (different samples will give different values for both), the statistic no longer has a normal distribution. The tendency of $1/s_X$ to be occasionally much larger than $1/\sigma$ makes the tails heavier, and it turns out that under the assumptions we made, the statistic now has a Student's t distribution.
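
The heavier tails can be seen in a small simulation. The following is a minimal sketch using only the Python standard library; the sample size `n=5`, the number of replications, the seed, and the cutoff of 1.96 (the usual two-sided 5% point of the standard normal) are illustrative assumptions, not values from the discussion above.

```python
import math
import random
import statistics


def simulate_tail_rates(n=5, reps=20000, mu0=0.0, sigma=1.0, cutoff=1.96, seed=1):
    """Draw normal samples and count how often |Z| and |t| exceed the cutoff."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_exceed = 0
    t_exceed = 0
    for _ in range(reps):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
        z = (xbar - mu0) / (sigma / math.sqrt(n))  # uses the known sigma
        t = (xbar - mu0) / (s / math.sqrt(n))      # uses the estimate s
        if abs(z) > cutoff:
            z_exceed += 1
        if abs(t) > cutoff:
            t_exceed += 1
    return z_exceed / reps, t_exceed / reps


z_rate, t_rate = simulate_tail_rates()
print(f"share of |Z| > 1.96: {z_rate:.3f}")
print(f"share of |t| > 1.96: {t_rate:.3f}")
```

With the same numerator in both statistics, any difference between the two shares comes from replacing $\sigma$ by $s_X$ in the denominator.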

[It's named for Student, the pseudonym of the person who correctly guessed the form of the distribution of the t-statistic and checked it using a kind of simulation; Fisher later proved the guess correct.]

The t-distribution is one of a number of distributions connected to the normal distribution (in that they're what you end up with when you do certain calculations on samples from normal populations). Others include the chi-squared distribution and the F-distribution.
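
In symbols, the standard constructions look like this (a sketch; $Z$, $V$, $U$, $W$, $\nu$, $d_1$ and $d_2$ are names introduced here for illustration). If $Z \sim N(0,1)$ and $V \sim \chi^2_{\nu}$ are independent, and $U \sim \chi^2_{d_1}$ and $W \sim \chi^2_{d_2}$ are independent, then

$$\frac{Z}{\sqrt{V/\nu}} \sim t_{\nu}, \qquad \frac{U/d_1}{W/d_2} \sim F_{d_1,d_2}$$

For the one-sample t-statistic under normality, take $Z = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}$, $V = (n-1)s_X^2/\sigma^2$ and $\nu = n-1$; the unknown $\sigma$ cancels, leaving $\frac{\bar{X}-\mu_0}{s_X/\sqrt{n}}$.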

Similar, though slightly more complicated, explanations apply to the F-distribution, which arises in problems related to ANOVA, regression, and tests of variance when used on normal or conditionally normal populations (as appropriate).

What does normality have to do with this?

The calculations that get to a t-distribution or an F-distribution respectively rely on the original values being one (or more) random sample(s) from normally distributed populations. (In the case of regression, it's the error term that's normally distributed rather than the unconditional population of responses.)

If you didn't have normal populations, you wouldn't get things that are t- or F-distributed respectively, but something else.

For example, if we do a one-sample t-test on data from an exponential distribution with n=8, then if the hypothesized population mean is correct the distribution of the t-statistic looks like this:

The bars are a histogram and the green curve is a kernel density estimate for a one-sample t-statistic based on drawing many samples from an exponential population at n=8.

You might have expected (since the numerator would be right-skew) that the t-statistic would be skewed right, but it's clearly left-skew in this instance. [This should reinforce to us the idea that you cannot ignore the behavior of the denominator when dealing with the t-test.]

If we used the ordinary t-tables with this statistic our p-values would be wrong -- quite wrong for either one-tailed test (too high or too low depending on which side we're testing against) but still somewhat wrong for a two-tailed test.
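
One rough way to gauge how wrong the tail probabilities are is to simulate the setup described above. This is a minimal sketch, not the code behind the original plot: the unit rate, replication count, and seed are assumptions, and `T_CRIT` is the tabulated 97.5% point of a t-distribution with 7 degrees of freedom (about 2.365), so each tail would be hit about 2.5% of the time if the statistic really were t-distributed.

```python
import math
import random
import statistics

T_CRIT = 2.365  # approximate 97.5% quantile of t with 7 df, from standard tables


def exponential_t_tail_rates(n=8, reps=20000, rate=1.0, seed=2):
    """Share of one-sample t-statistics in each nominal 2.5% tail for exponential data."""
    rng = random.Random(seed)
    mu0 = 1.0 / rate  # true mean of the exponential, so the null hypothesis holds
    lower = 0
    upper = 0
    for _ in range(reps):
        sample = [rng.expovariate(rate) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)
        t = (xbar - mu0) / (s / math.sqrt(n))
        if t < -T_CRIT:
            lower += 1
        elif t > T_CRIT:
            upper += 1
    return lower / reps, upper / reps


lower_rate, upper_rate = exponential_t_tail_rates()
print(f"share below -{T_CRIT}: {lower_rate:.3f} (nominal 0.025)")
print(f"share above +{T_CRIT}: {upper_rate:.3f} (nominal 0.025)")
```

The gap between each share and the nominal 0.025 shows how far off a one-tailed test would be in each direction.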

The two-sample test is considerably less impacted than this with exponential data, but it's still not t-distributed. The F-test for equality of variances is more substantially affected, however.
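
The sensitivity of the variance-ratio F-test can be checked the same way. This is a hedged sketch under assumed settings: two samples of size 10 each, a two-sided 5% test, and `F_CRIT` taken as the tabulated upper 2.5% point of $F_{9,9}$ (about 4.03). The helper name `variance_ratio_rejection_rate` is hypothetical.

```python
import random
import statistics

F_CRIT = 4.03  # approximate upper 2.5% point of F(9, 9), from standard tables


def variance_ratio_rejection_rate(draw, n=10, reps=20000, seed=3):
    """Rejection rate of a two-sided 5% variance-ratio test when both populations are identical."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = [draw(rng) for _ in range(n)]
        b = [draw(rng) for _ in range(n)]
        ratio = statistics.variance(a) / statistics.variance(b)
        # Equal degrees of freedom, so the lower critical value is 1 / F_CRIT.
        if ratio > F_CRIT or ratio < 1.0 / F_CRIT:
            rejections += 1
    return rejections / reps


normal_rate = variance_ratio_rejection_rate(lambda rng: rng.gauss(0.0, 1.0))
exp_rate = variance_ratio_rejection_rate(lambda rng: rng.expovariate(1.0))
print(f"normal data:      {normal_rate:.3f} (nominal 0.05)")
print(f"exponential data: {exp_rate:.3f} (nominal 0.05)")
```

Both populations have equal variances in each scenario, so every rejection is a false positive; comparing the two rates shows how much the non-normal case departs from the nominal level.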

Tests are typically done on statistics, which are functions of data. The mean is one example where you sum N random variables and then divide by N. These functions often have their own distributions that are distinct from the underlying data.