Why is it said that data should be normally distributed for statistical analysis when each test follows its own distribution (e.g. t, Z, F)?

What does normality have to do with this?

Glen_b 05/16/2018.

Let's look at a specific example, a one sample t-test.

The t-statistic consists of a numerator and a denominator:

$$t = \frac{\bar{X}-\mu_0}{s_X/\sqrt{n}}$$

Both $\bar{X}$ and $s_X$ -- the sample mean and standard deviation -- are random quantities that depend on the (random) sample.
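To make the formula concrete, here is a minimal sketch that computes the statistic by hand. The function name `one_sample_t` and the data are made up for illustration; only the standard library is used.

```python
import math

def one_sample_t(sample, mu0):
    """One-sample t-statistic for H0: population mean equals mu0."""
    n = len(sample)
    xbar = sum(sample) / n
    # Sample variance with the n - 1 (Bessel) denominator.
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    return (xbar - mu0) / (math.sqrt(s2) / math.sqrt(n))

# Made-up data, purely for illustration.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
t_stat = one_sample_t(sample, mu0=2.0)
print(t_stat)
```

Here $\bar{X}=3$ and $s_X^2=2.5$, so the statistic is $1/\sqrt{0.5}\approx 1.414$. A different sample would change both the numerator and the denominator.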

Because the random values in the sample we will be taking ($X_1,X_2,...,X_n$) are assumed to be independent and identically distributed as $N(\mu_0,\sigma^2)$ for some unknown $\sigma^2$, their mean $\bar{X}$ is in turn normally distributed with mean $\mu_0$ and variance $\sigma^2/n$ (these statements can be proved under the assumptions; for the mean and variance see https://en.wikipedia.org/wiki/Expected_value#Basic_properties and https://en.wikipedia.org/wiki/Variance#Basic_properties).
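For the mean and variance, the argument is short enough to sketch here. It uses linearity of expectation, and independence for the variance step:

$$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \mu_0, \qquad \operatorname{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

That $\bar{X}$ is itself normal (not just that it has this mean and variance) follows from the fact that a sum of independent normal random variables is normal.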

So the numerator of our t-statistic is distributed as $N(0,\sigma^2/n)$. If we knew $\sigma$, we could divide that by $\sigma/\sqrt{n}$ and get a test statistic that was distributed as a standard normal (a Z test). ($Z = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}$)

However, we don't have that information, and because we don't know how variable the population is, we don't have a way to exactly work out how "unusual" a particular sample mean would be (if it were to have come from a normal distribution with the hypothesized mean).

We can, however, get an estimate of that variability (we can estimate $\sigma^2$ by $s_X^2$, the sample variance). If we do that, and use the estimate in place of the population value, the statistic becomes the t-statistic. But because the numerator and denominator are now *both* random (different samples will give different values for both), the statistic no longer has a normal distribution. The tendency of $1/s_X$ to be occasionally much larger than $1/\sigma$ makes the tails heavier, and it turns out that under the assumptions we made, the statistic now has a Student's t distribution.
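A quick simulation can show the heavier tails. This is only a sketch under assumed settings (n = 5, standard normal data, 20000 replications, a fixed seed, and the hypothetical helper name `tail_rates`): it counts how often each statistic lands beyond the normal cutoff of 1.96.

```python
import math
import random

def tail_rates(n=5, reps=20000, mu0=0.0, sigma=1.0, cutoff=1.96, seed=1):
    """Fraction of simulated Z and t statistics with absolute value above cutoff."""
    rng = random.Random(seed)
    z_hits = 0
    t_hits = 0
    for _ in range(reps):
        x = [rng.gauss(mu0, sigma) for _ in range(n)]
        xbar = sum(x) / n
        s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
        z = (xbar - mu0) / (sigma / math.sqrt(n))  # uses the true sigma
        t = (xbar - mu0) / (s / math.sqrt(n))      # uses the estimate s
        z_hits += abs(z) > cutoff
        t_hits += abs(t) > cutoff
    return z_hits / reps, t_hits / reps

z_rate, t_rate = tail_rates()
print(f"Z beyond +/-1.96: {z_rate:.3f}")
print(f"t beyond +/-1.96: {t_rate:.3f}")
```

The Z rate should come out near 0.05, while the t rate should be noticeably larger, since a t distribution with 4 degrees of freedom puts roughly 12% of its probability beyond ±1.96.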

[It's named for Student, the pseudonym of the person who correctly guessed the form of the distribution of the t-statistic and checked it using a kind of simulation; Fisher later proved the guess correct.]

The t-distribution is one of a number of distributions connected to the normal distribution (in that they're what you end up with when you do certain calculations on samples from normal populations). Others include the chi-squared distribution and the F-distribution.

Similar, but slightly more complicated, explanations apply to the F-distribution, which arises in problems related to ANOVA, regression, and tests of variance when used on normal or conditionally normal populations (as appropriate).

What does normality have to do with this?

The calculations that get to a t-distribution or an F-distribution respectively rely on the original values being one (or more) random sample(s) from normally distributed populations. (In the case of regression, it's the error term that's normally distributed rather than the unconditional population of responses.)

If you didn't have normal populations, you wouldn't get things that are t- or F- distributed respectively, but something else.

For example, if we do a one-sample t-test on data from an exponential distribution with n=8, then if the hypothesized population mean is correct the distribution of the t-statistic looks like this:

The bars are a histogram and the green curve is a kernel density estimate for a one-sample t-statistic based on drawing many samples from an exponential population at n=8.

You might have expected (since the numerator would be right-skew) that the t-statistic would be skewed right, but it's clearly left-skew in this instance. [This should reinforce to us the idea that you *cannot* ignore the behavior of the denominator when dealing with the t-test.]

If we used the ordinary t-tables with this statistic our p-values would be wrong -- quite wrong for either one-tailed test (too high or too low depending on which side we're testing against) but still somewhat wrong for a two-tailed test.
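The effect on one-tailed tests can be checked with a small simulation. This is a rough sketch rather than the code behind the plot described above: it assumes a rate-1 exponential population (so the true mean is 1), a nominal 5% level on each side, 20000 replications, and a fixed seed. The critical value 1.895 is the standard table value for 7 degrees of freedom.

```python
import math
import random

N = 8
T_CRIT_95_DF7 = 1.895  # upper 5% point of Student's t with 7 df (table value)

def exponential_t_tail_rates(reps=20000, seed=1):
    """Lower- and upper-tail rejection rates of the t-test on exponential data."""
    rng = random.Random(seed)
    mu0 = 1.0  # true mean of a rate-1 exponential, so the null is correct
    lower = 0
    upper = 0
    for _ in range(reps):
        x = [rng.expovariate(1.0) for _ in range(N)]
        xbar = sum(x) / N
        s = math.sqrt(sum((v - xbar) ** 2 for v in x) / (N - 1))
        t = (xbar - mu0) / (s / math.sqrt(N))
        if t < -T_CRIT_95_DF7:
            lower += 1
        elif t > T_CRIT_95_DF7:
            upper += 1
    return lower / reps, upper / reps

lower_rate, upper_rate = exponential_t_tail_rates()
print(f"reject in lower tail: {lower_rate:.3f} (nominal 0.05)")
print(f"reject in upper tail: {upper_rate:.3f} (nominal 0.05)")
```

If the statistic really were t-distributed, both rates would be close to 0.05. With exponential data the lower-tail rate should come out well above that and the upper-tail rate well below it, which matches the left skew.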

The *two*-sample test is considerably less impacted than this with exponential data, but it's still not t-distributed. The F-test for equality of variances is more substantially affected, however.
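The same kind of check works for these two tests. The sketch below assumes two independent rate-1 exponential samples of size 8 each (so both null hypotheses hold), two-sided tests at a nominal 5% level, an equal-variance pooled t-test, and table critical values for 14 and (7, 7) degrees of freedom. The helper names are made up.

```python
import math
import random

N = 8
T_CRIT_975_DF14 = 2.145   # two-sided 5% critical value, pooled t with 14 df
F_CRIT_975_DF7_7 = 4.995  # upper 2.5% point of F(7, 7)

def mean_var(x):
    """Sample mean and variance (n - 1 denominator)."""
    m = sum(x) / len(x)
    return m, sum((v - m) ** 2 for v in x) / (len(x) - 1)

def rejection_rates(reps=20000, seed=1):
    """Two-sided rejection rates of the pooled t-test and the variance-ratio F-test."""
    rng = random.Random(seed)
    t_rej = 0
    f_rej = 0
    for _ in range(reps):
        x = [rng.expovariate(1.0) for _ in range(N)]
        y = [rng.expovariate(1.0) for _ in range(N)]
        mx, vx = mean_var(x)
        my, vy = mean_var(y)
        pooled = (vx + vy) / 2  # pooled variance when both samples have size N
        t = (mx - my) / math.sqrt(pooled * 2 / N)
        f = vx / vy
        t_rej += abs(t) > T_CRIT_975_DF14
        f_rej += (f > F_CRIT_975_DF7_7) or (f < 1 / F_CRIT_975_DF7_7)
    return t_rej / reps, f_rej / reps

t_rate2, f_rate = rejection_rates()
print(f"two-sample t rejects: {t_rate2:.3f} (nominal 0.05)")
print(f"variance F rejects:   {f_rate:.3f} (nominal 0.05)")
```

The t-test rate should stay fairly close to the nominal level, while the F-test rate should be several times too large.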

Dimitriy V. Masterov 05/16/2018.

Tests are typically done on statistics, which are functions of data. The mean is one example where you sum N random variables and then divide by N. These functions often have their own distributions that are distinct from the underlying data.
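A toy illustration (a fair six-sided die, which is my own example and not part of the question): a single roll is uniform on 1 to 6, but the mean of two rolls is not uniform. The sketch enumerates all 36 outcomes exactly.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Distribution of a single roll: uniform on 1..6.
single = {f: Fraction(1, 6) for f in faces}

# Distribution of the mean of two independent rolls, by exact enumeration.
mean_counts = Counter(Fraction(a + b, 2) for a, b in product(faces, repeat=2))
mean_dist = {m: Fraction(c, 36) for m, c in sorted(mean_counts.items())}

for m, p in mean_dist.items():
    print(float(m), p)
```

The mean takes 11 possible values and has a triangular distribution peaked at 3.5 (probability 1/6), while the extremes 1 and 6 each have probability 1/36.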
