You can’t assume a normal distribution for your data with N>30

The central limit theorem (CLT) is one of the most foundational concepts in probability (Daly, 2013). It is commonly understood as follows: when the means of samples of a variable, each with a suitable number of observations, are plotted on a graph, they can form a normal distribution. When the data come from many independent and random events (e.g. coin tosses), the sum or mean of these observations will produce this distribution, even if the population the sample is drawn from isn’t normally distributed. This is important as it implies that methods which assume a normal distribution can be used on data that isn’t itself normally distributed. All that’s needed, supposedly, is samples of 30 observations or more, and the sampling distribution will be (roughly) normal.

I understood this to mean it applied without reservation[note]Or at least it wasn’t clearly stated that there are many instances where it doesn’t apply. I also forgot it applies to the mean or the sum of data sets, not the data itself.[/note]. To see why this isn’t the case, we need to examine some key distributions.

Poisson distribution

A Poisson distribution is often used to model counts: the number of occurrences of an outcome within a specified time period or area. A Poisson process is a process that generates a Poisson distribution. In it, independent[note]This means the probability of an event occurring doesn’t affect the probability of subsequent events e.g. winning the lottery.[/note] and discrete[note]A discrete distribution means the data can only take on certain values e.g. ratings on a Likert scale, rather than any value within a specified range (continuous).[/note] events occur over time or space at a constant average rate. This rate determines the number of outcomes observed in a given sample and is represented by lambda (λ).

When λ is small and events are rare, the distribution is asymmetric. The greater λ is, the more the distribution resembles a normal distribution. So if the observations occur at a frequent enough rate, the sampling distribution of the statistic[note]A statistic is a sample estimate of a population parameter (Baguley, 2012).[/note] will be (relatively) similar to a normal distribution, and tests that rely on a normal distribution may be appropriate. Poisson distributions are often called upon to justify the use of such tests on non-normally distributed data, but this may not be valid[note]This is explained later in the post, using Jolliffe (1995) as evidence.[/note].
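As a quick illustration of what “the sampling distribution of the statistic” means here, the sketch below (variable names are illustrative) draws 10,000 samples of n=30 from a Poisson distribution with λ=6 and histograms the sample means; the result looks close to normal:

```r
# Sampling distribution of the mean of a Poisson(6) variable:
# 10,000 samples, each of n = 30 observations.
set.seed(7)
sample_means <- replicate(10000, mean(rpois(30, 6)))
hist(sample_means, col = "skyblue2",
     main = "Sampling distribution of the mean, Poisson(6), n = 30")
mean(sample_means)  # close to the population rate of 6
```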

Source: https://en.wikipedia.org/wiki/Poisson_distribution

If two or more independent Poisson distributions are summed, the result is a Poisson distribution whose λ is the sum of the rates of the constituent distributions. For example, if one person gets into 1 fight a week and another gets into 2 fights a week, the combined λ is 3 fights per week. This is important as it implies that measurements of the same kind can be combined even when their rates differ (provided the observations are independent).
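This additivity is easy to check by simulation. The sketch below (values are illustrative) sums two independent Poisson samples with rates 1 and 2; the combined variable behaves like a single Poisson variable with rate 3:

```r
# Summing independent Poisson variables: rates 1 and 2 give a
# combined Poisson variable with rate 3.
set.seed(42)
a <- rpois(100000, 1)  # e.g. fights per week for one person
b <- rpois(100000, 2)  # fights per week for another
combined <- a + b
mean(combined)  # close to 3
var(combined)   # also close to 3 (a Poisson variable's mean equals its variance)
```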

If events are rare, a Poisson distribution with λ=nP can approximate a binomial distribution[note]This is the distribution of the number of ‘successes’ in a series of n trials, each with 2 possible outcomes and probability P of success e.g. 10 coin flips (n=10, P=0.5, with 2 possible outcomes: ‘heads’ or ‘tails’).[/note]; the approximation holds when n is large and P is small. When there are lots of opportunities for the event to occur (n is very high), the distribution of results can resemble a normal distribution, even if the probability of any single event is small (P is low).
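The quality of this approximation can be checked directly with R’s built-in probability functions. The sketch below (the choice of n=1000 and P=.003 is illustrative) compares binomial probabilities with the Poisson probabilities for λ=nP=3:

```r
# Poisson approximation to the binomial: many trials (n = 1000),
# rare events (P = .003), so lambda = n * P = 3.
k <- 0:15
binom_probs <- dbinom(k, 1000, .003)
pois_probs  <- dpois(k, 3)
max(abs(binom_probs - pois_probs))  # the two sets of probabilities agree closely
```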

Low probability, normal distribution

The graphs below are examples of non-normally distributed variables whose sampling distributions of the statistic approximate a normal distribution. The first graph shows a simulated Poisson distribution of 1,000,000 random values with λ=6. A normal curve is fitted on top, and the simulated distribution clearly resembles it.

Below are two binomial distributions for simulated data[note]The data has been simulated 100,000 times.[/note] with different probabilities P, each with n=30 trials. The first shows the distribution for an event with P=.5 e.g. a fair coin flip. The distribution clearly approximates a normal distribution.

The same roughly holds true when P=.15 (the independent event happens 1.5 times in every 10 trials); the sampling distribution of the statistic generally approximates a normal distribution.

Therefore, many argue data which isn’t necessarily normally distributed can be treated as such when it forms Poisson and binomial distributions. This is the CLT in action.

The CLT only holds true for distributions with finite mean and finite variance[note]This is sometimes taken to mean that it applies to all everyday examples, because these variables will always have finite means. But this isn’t necessarily the case: if you have two variables a and b and b=0, the ratio a/b becomes non-finite (Baguley, personal communication, 2018).[/note]. Both Poisson and binomial distributions have these characteristics, so it seems reasonable to presume that the CLT applies to them given a large enough n. Even with smaller n‘s, Poisson distributions are sometimes invoked to show that highly skewed data can be treated as normally distributed data (Jolliffe, 1995). This is because highly skewed data can, with a relatively modest n, approximate a normal distribution when looking at the sampling distribution of the mean. This has led to the propagation of the rule described in the first paragraph: when n>30, a normal distribution can be assumed (especially when n>100).

But this is not the case.

No matter how many observations you have (whatever the n is), you can’t guarantee the sampling distribution will be normal (Jolliffe, 1995). You would also need to know the rate at which these observations occur. This is true both for single and for summed Poisson distributions: even a very large n might not bring the total λ (from the combined Poisson distributions) high enough to approximate a normal distribution.

The CLT can’t save everything

The graph below shows a simulated Poisson distribution of 1,000,000 random values with λ=1. Even with an n of 1,000,000, the data is heavily right-skewed. If the rate (λ) is low enough, no sample size can guarantee a normal distribution. The same holds for summed Poisson distributions whose combined λ remains small.

The same is true for a binomial distribution. When P=.01 (the event happens 1 in every 100 trials), the distribution is heavily right-skewed. So the distribution of the statistic of a relatively unlikely event with two potential outcomes doesn’t approximate a normal distribution; this remains true even when n=100. Therefore, it shouldn’t be analysed using statistics that assume a normal distribution. This clearly shows the CLT doesn’t apply without reservations.
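The skew is easy to quantify. The sketch below uses a hand-rolled skewness function (base R doesn’t provide one; the function name is my own) on the n=30, P=.01 simulation; a normal-like distribution would have skewness near 0, while here it is strongly positive:

```r
# Skewness of a simulated binomial with n = 30 trials, P = .01,
# repeated 100,000 times. Symmetric data would give a value near 0.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
set.seed(1)
j <- rbinom(100000, 30, .01)
skewness(j)  # strongly positive: heavily right-skewed
```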

What these graphs seek to show is that it’s not just the number of trials/participants you have in your sample that determines whether you can assume a normal distribution. The rate of convergence (how quickly it approximates a normal distribution) is strongly influenced by the probability or the rate of these independent events happening. There are certainly a lot of instances where the CLT applies, but what I was taught was wrong: you can’t always assume a normal distribution when n>30.

Note

Almost all of the information in this post comes from the excellent statistics book by Baguley (2012).

Author feedback

Baguley clarified a few points and gave a bit more detail to be added.

Code


#Binomial distribution of 30 attempts with a probability of .5, simulated 100,000 times
g <- rbinom(100000, 30, .5)
graph1 <- hist(g, xlab = expression(paste(italic(n), '=30, ', italic(P), '=.5')), main = "Simulated data for 30 trials with a probability of .5", col = "skyblue2")
#Binomial distribution of 30 attempts with a probability of .15, simulated 100,000 times
h <- rbinom(100000, 30, .15)
graph2 <- hist(h, xlab = expression(paste(italic(n), '=30, ', italic(P), '=.15')), main = "Simulated data for 30 trials with a probability of .15", col = "skyblue2")
#Binomial distribution of 30 attempts with a probability of .01, simulated 100,000 times
j <- rbinom(100000, 30, .01)
graph3 <- hist(j, xlab = expression(paste(italic(n), '=30, ', italic(P), '=.01')), main = "Simulated data for 30 trials with a probability of .01", col = "skyblue2")
#Binomial distribution of 100 attempts with a probability of .01, simulated 100,000 times
k <- rbinom(100000, 100, .01)
graph4 <- hist(k, xlab = expression(paste(italic(n), '=100, ', italic(P), '=.01')), main = "Simulated data for 100 trials with a probability of .01", col = "skyblue2")
#Simulated Poisson distribution with λ=6 and 10^6 random values. A normal curve (red; mean and SD taken from the simulated data) is superimposed on the histogram. lwd = line width
y <- rpois(10^6, 6); up <- max(y)
hist(y, prob = TRUE, br = (-1:up) + .5, col = "skyblue2", xlab = "x", main = "Simulated sample from a Poisson distribution λ=6 with normal approximation")
curve(dnorm(x, mean(y), sd(y)), col = "red", lwd = 2, add = TRUE)
#Simulated Poisson distribution with λ=1 and 10^6 random values
p <- rpois(10^6, 1); up <- max(p)
hist(p, prob = TRUE, br = (-1:up) + .5, col = "skyblue2", xlab = "x", main = "Simulated sample from a Poisson distribution λ=1")

References

Baguley, T. S. (2012). Serious stats: A guide to advanced statistics for the behavioral sciences. New York, NY: Palgrave Macmillan.

BruceET. (2017). Answer to: Plotting in R: Probability mass function for a Poisson distribution [closed]. Available at: https://math.stackexchange.com/questions/2412983/plotting-in-r-probability-mass-function-for-a-poisson-distribution

Daly, F. (2013). The central limit theorem and Poisson approximation: An introduction to Stein’s method. Available at: http://www.macs.hw.ac.uk/~fd78/Daly_SMSTC_notes.pdf

Jolliffe, I. T. (1995). Sample sizes and the central limit theorem: The Poisson distribution as an illustration. The American Statistician, 49(3), 269. DOI: 10.1080/00031305.1995.10476161

 

2 responses to “You can’t assume a normal distribution for your data with N>30”

  1. So I thought that the justification for doing that is because the sampling distribution would form a normal distribution, despite the original sample/process not being normally distributed?


    1. That’s what I was taught as well: the “sampling distribution would form a normal distribution, despite the original sample/process not being normally distributed”. But as Baguley (2012) and Jolliffe (1995) explain, the number of participants isn’t the only factor that determines this. It’s also the rate at which the observations of the variable occur. If the observations are really rare, there’s almost no n great enough to ensure the sampling distribution of the statistic is anywhere near normal.

