Type I and Type II errors, β, α, p-values, power and effect sizes – the ritual of null hypothesis significance testing contains many strange concepts.

Much has been said about significance testing – most of it negative. Methodologists constantly point out that researchers misinterpret p-values. Some say that significance testing is at best a meaningless exercise and at worst an impediment to scientific discovery. Consequently, I believe it is extremely important that students and researchers correctly interpret statistical tests. This visualization is meant as an aid for students when they are learning about statistical hypothesis testing. It is based on a one-sample Z-test. You can vary the sample size, power, significance level, and effect size using the sliders, and choose whether to solve for the sample size or the effect size, to see how the sampling distributions change.


Clarification on power ("-") when the effect is 0

The visualization shows "-" for both "power" and "Type II error" when d is set to zero. The Type I error rate implies that a certain proportion of tests will reject H0 even when there is no effect, and it is tempting to call this ratio the test's "power"; textbooks and software frequently do just that. Some sources instead say that power is zero when Ha is equal to H0. Both claims are incorrect: power is not defined when the true effect is an element of H0's parameter space. In that case the power function simply returns α, so even though it tells us that, say, 5 % of tests will reject the null when α = 0.05, it does not make sense to call this rejection rate "power". This also implies that as Ha approaches H0, the value of the power function approaches α. Consequently, the "power" slider is not allowed to be set equal to or less than α.
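To make this concrete, here is a minimal sketch of the power function for the two-sided one-sample Z-test that the visualization is based on. The function name rejection_prob and the sample size n = 32 are my own illustration, not part of the visualization's code:

```python
from scipy.stats import norm

def rejection_prob(d, n, alpha=0.05):
    """P(reject H0) for a two-sided one-sample Z-test when the
    true standardized effect is d and the sample size is n."""
    z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value
    shift = d * n ** 0.5              # mean of the Z statistic when the effect is d
    return norm.cdf(-z_crit - shift) + norm.sf(z_crit - shift)

print(rejection_prob(d=0.5, n=32))  # ~0.80: a genuine power
print(rejection_prob(d=0.0, n=32))  # 0.05: exactly alpha, not "power"
```

At d = 0 the two terms collapse to α/2 + α/2 = α, which is why the visualization shows "-" instead of reporting a power value.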

Book recommendation

Here are some recommended books that discuss the issues surrounding NHST.

A Very Short Primer on Null Hypothesis Significance Testing and Statistical Power

The two distributions in the visualization show the theoretical sampling distribution under the null hypothesis (H0) and the sampling distribution under the alternative hypothesis (Ha). Although this site is not meant as a first introduction to NHST, here is a quick summary of the core concepts.

  • α: The conditional probability of incorrectly rejecting H0 when it actually is true (i.e., making a Type I error).
  • β: The conditional probability of failing to reject H0 when it is false (i.e., making a Type II error).
  • Power: The complement of β (i.e., 1 - β); the probability of correctly rejecting H0 when it is false.
  • H0: The null hypothesis, usually stated as the population mean being zero, or that there is no difference. However, it does not have to be a zero or no-difference hypothesis.
  • Ha: The alternative hypothesis, usually stated as the population mean being non-zero, or greater than or less than zero.

Simply put, when we perform a traditional (frequentist) statistical test, we collect some data and then calculate the probability of observing data at least as extreme as ours, given that no effect exists in the population. This conditional probability is the p-value, and if it is smaller than α (usually 0.05 or 0.01) we claim that our findings are “statistically significant”. Moreover, α is the long-run probability of making a Type I error when H0 is true. The acceptable Type I error rate is set before running the study, and α should not be confused with the p-value from a single study.

Before we collect our data we should perform a power analysis. Usually we specify the minimum effect we are interested in detecting (say, Cohen’s d = 0.5), set α to 0.05, and set β to 0.2 (i.e., 80 % power). The power analysis tells us how large our sample needs to be to achieve this power. Given this sample size, if we reran our study many times with new random samples, we would correctly reject the null hypothesis 80 % of the time, i.e., we would find p < α.
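As a rough sketch of the arithmetic behind such a power analysis, here is the standard normal-approximation formula for the required sample size of a two-sided one-sample Z-test. The helper name n_for_power is my own, and the formula ignores the negligible rejection region on the far side of the null:

```python
from math import ceil

from scipy.stats import norm

def n_for_power(d, alpha=0.05, power=0.80):
    """Smallest n giving the requested power for a two-sided
    one-sample Z-test against a true standardized effect d."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)           # standard normal quantile at 1 - beta
    return ceil(((z_alpha + z_beta) / d) ** 2)

print(n_for_power(d=0.5))  # 32: the sample size needed for d = 0.5 at 80 % power
```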

What Null Hypothesis Significance Testing Does Not Tell Us

  • It does not give us the probability that our results are due to chance.
  • If we reject H0 with α = 0.05 this does not mean that we are 95 % sure that the alternative hypothesis is true.
  • Rejecting H0 with α = 0.05 does not mean that the probability that we have made a Type I error is 5 %.
  • A p-value does not tell us that our findings are relevant, clinically significant, or of any scientific value whatsoever.
  • A small p-value does not tell us our results will replicate.
  • A small p-value does not indicate a large treatment effect.
  • Failing to reject the null hypothesis is not evidence of it being true.
  • If our test has 80 % power and we fail to reject the null hypothesis, then this does not mean that the probability is 20 % that the null is true.
  • If our test has 80 % power and we do reject the null hypothesis, then this does not mean that the probability is 80 % that the alternative hypothesis is true.
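Several of these points trade on the difference between long-run error rates and statements about a single study. A small simulation, using the same one-sample Z-test (the effect size, sample size, and seed are arbitrary choices of mine), shows what the long-run reading does license:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2024)
alpha, d, n, reps = 0.05, 0.5, 32, 100_000
crit = norm.ppf(1 - alpha / 2)

# Z statistics for many replications of the same study design.
z_h0 = rng.normal(0.0, 1.0, size=reps)             # H0 true: d = 0
z_ha = rng.normal(d * np.sqrt(n), 1.0, size=reps)  # Ha true: effect d = 0.5

print((np.abs(z_h0) > crit).mean())  # ~0.05: long-run Type I error rate
print((np.abs(z_ha) > crit).mean())  # ~0.80: long-run power
```

These are statements about what happens over many replications; they say nothing about the probability that H0 or Ha is true in any single study.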

Some NHST Testimonials

I am deeply skeptical about the current use of significance tests. The following quotes might spark your interest in the controversies surrounding NHST.

"What's wrong with [null hypothesis significance testing]? Well, among many other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!"

– Cohen (1994)

“… surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students”

– Rozeboom (1997)

“… despite the awesome pre-eminence this method has attained in our journals and textbooks of applied statistics, it is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research”

– Rozeboom (1960)

“… an instance of a kind of essential mindlessness in the conduct of research”

– Bakan (1966)

“Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution”

– Schmidt and Hunter (1997)

“The textbooks are wrong. The teaching is wrong. The seminar you just attended is wrong. The most prestigious journal in your scientific field is wrong.”

– Ziliak and McCloskey (2008)

These quotes were mostly taken from Nickerson’s (2000) excellent review “Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy”.


If you have any suggestions, send me a message on Twitter or use the contact form on my site.