Introduction

Belia, Fidler, Williams, and Cumming (2005) found that researchers in psychology, behavioral neuroscience and medicine are surprisingly poor at judging when error bars indicate that two means differ significantly (p = 0.05). They emailed researchers an invitation to a web-based test and got 473 usable responses. The test showed an interactive plot with error bars for two independent groups, and participants were asked to move the error bars to a position they believed would correspond to a significant t-test at p = 0.05. They did this both for error bars based on 95 % CIs and for bars based on the groups’ standard errors. On average, participants set the 95 % CIs too far apart, with mean placements corresponding to a p-value of .009. They did the opposite with the SE error bars, placing them too close together, which yielded placements corresponding to p = 0.109. And if you’re wondering: they found no difference between the three disciplines.

Plots

I wanted to pull my weight, so I created some plots in R that show error bars corresponding to significant differences at various p-values.

Figure 3. Error bars corresponding to a significant difference at p = .001 (equal group sizes and equal variances)

Based on the first plot, we see that an overlap of about one third of the total width of the 95 % CIs corresponds to p = 0.05. For the SE error bars, the bar tips are about 1 SE apart when p = 0.05.
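These rules of thumb are easy to check numerically. A minimal sketch, assuming two independent groups with equal sample sizes (n = 100 per group here) and equal variances:

```r
## Numeric check of the overlap rules of thumb for two independent groups
## with equal n and equal variances (assumed n = 100, SD = 1).
n   <- 100
se  <- 1 / sqrt(n)                       # SE of each group mean
sed <- sqrt(2) * se                      # SE of the mean difference
d   <- qt(0.975, df = 2 * n - 2) * sed   # difference giving exactly p = .05

moe <- qt(0.975, df = n - 1) * se        # half-width of each 95% CI
overlap_ci <- (2 * moe - d) / (2 * moe)  # CI overlap as share of full CI width
gap_se     <- (d - 2 * se) / se          # gap between SE bar tips, in SE units

round(c(overlap_ci, gap_se), 2)          # roughly 0.30 and 0.79
```

So with these assumptions the 95 % CIs overlap by roughly 30 % of their total width, and the SE bar tips end up a bit under 1 SE apart, in line with the plots.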

R Code

Here’s the complete R code used to produce these plots.
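For readers without access to the original script, here is a minimal sketch (not the author’s code) of how such a plot can be built with base graphics: place two group means exactly a “p = .05 difference” apart and draw their 95 % CI error bars. The sample size and SD are assumptions.

```r
## Sketch: two means separated by exactly the difference that gives p = .05
## in an equal-variance t-test (assumed n = 30 per group, SD = 1).
n   <- 30
sd  <- 1
se  <- sd / sqrt(n)
sed <- sqrt(2) * se
d   <- qt(0.975, df = 2 * n - 2) * sed   # difference significant at p = .05
m   <- c(0, d)                           # the two group means
moe <- qt(0.975, df = n - 1) * se        # 95% CI half-width

plot(1:2, m, xlim = c(0.5, 2.5), ylim = range(m - moe, m + moe),
     pch = 19, xaxt = "n", xlab = "Group", ylab = "Mean",
     main = "95% CIs for a difference significant at p = .05")
axis(1, at = 1:2, labels = c("Group 1", "Group 2"))
arrows(1:2, m - moe, 1:2, m + moe, angle = 90, code = 3, length = 0.1)
```

Swapping `moe` for `se` in the `arrows()` call draws SE bars instead, and changing `0.975` in the `d` line changes the target p-value.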

Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10(4), 389–396. PMID: 16392994

Kristoffer Magnusson

I'm a psychologist from Sweden with a passion for research and statistics. I hold a Master of Science degree in psychology from Umeå University. This is my personal blog about psychological research and statistical programming with R.

1. great post.

Is there a wee typo in this sentence: ‘Based on the first plot we see that an overlap of about one third of the 95 % CIs corresponds to p = 0.5.’ ?

Otherwise, I have wondered about this question a lot. I was always told that CIs should not include the other group’s mean and SEs should not overlap. Turns out that this advice was actually correct, even though it was also conservative.

2. One can alternatively calculate a Least Significant Difference (LSD) value (being the minimum difference between means that achieves a specified statistical significance level) and display this on the plot.

The LSD info could either be displayed as a ‘floating’ error bar or perhaps as error bars on the means as above (1/2 the LSD on each side of the mean). In the latter case, overlap between two error bars means lack of statistical significance, which is a nice simple interpretation. This is not a plot that I have seen used before though – people are more used to seeing CI or SE bars, and you would need to explain carefully what the ‘LSD’ bars are and how to interpret them (& their limitations – e.g. LSDs don’t adjust for multiple comparisons…).
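The LSD construction described above can be sketched in a few lines of R. The data here are made up, and equal group sizes are assumed so that a single LSD applies:

```r
## Sketch of the LSD idea for two groups with equal n (made-up data):
## LSD = t_crit * SED, and plotting each mean with bars of +/- LSD/2 makes
## "bars overlap" equivalent to "not significant at alpha = .05".
g1 <- c(5.1, 4.8, 5.6, 5.0, 5.3, 4.9)
g2 <- c(5.9, 6.1, 5.7, 6.3, 5.8, 6.0)
n  <- length(g1)
sp2 <- (var(g1) + var(g2)) / 2            # pooled variance (equal n)
sed <- sqrt(2 * sp2 / n)                  # SE of the difference in means
lsd <- qt(0.975, df = 2 * n - 2) * sed    # least significant difference

abs(mean(g1) - mean(g2)) > lsd            # TRUE <=> t-test gives p < .05
```

With bars of half the LSD drawn on each mean, the comparison above is exactly the “do the bars overlap?” question.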

• That’s a great point. I believe difference-adjusted CIs using the t distribution are similar to the LSD approach. Though, if I remember correctly, the LSD error bars will all be of equal width across groups regardless of their SEs? Difference-adjusted CIs allow the groups to have error bars of different widths. But just as you say, I’ve never seen either of these error bars used in a publication.

• Yeah, the LSD approach only works if you have one (possibly approximate) LSD for all comparisons, so equal sample sizes, essentially. But if you have unequal SEs, and hence different SEDs/LSDs for different comparisons, you are going to have problems coming up with any reliable method of graphically representing significance via error bar overlap, it seems to me. Do the guidelines suggested by your graphs above work when your confidence interval bars are of different widths for different means?

I must look closely at that difference adjusted CI idea – thanks for that link!

I have also encountered the suggestion of choosing a confidence level for a CI that produces error bars approximately equivalent to my aforementioned LSD bars. This is usually based on the assumption of equal variances and sample sizes for the two groups compared, and you end up with a confidence level of about 83% (I think – need to check this, so don’t quote me!). This trick is useful when you can readily calculate a CI, but not an LSD, for example when using Fieller’s theorem to obtain the CI of a ratio of means…
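That matching confidence level is easy to check. Under the stated assumptions (equal variances and sample sizes) and a large-sample z approximation, the CI half-width equals half the LSD when the critical value is z₀.₉₇₅/√2:

```r
## Which CI confidence level makes the CI half-width equal to half the LSD?
## half-LSD = z_{.975} * sqrt(2) * SE / 2, so the matching critical value
## is z_c = z_{.975} / sqrt(2) (large-sample sketch, equal n and variances).
zc   <- qnorm(0.975) / sqrt(2)
conf <- 2 * pnorm(zc) - 1
round(conf, 3)    # about 0.834, i.e. roughly an 83% CI
```

So the matching level works out to roughly 83 %, consistent with the often-cited rule that ~83–84 % CIs give “overlap means non-significance” for two equal groups.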

• Cumming & Finch (2005) reported that an overlap of about one quarter will approximate p = 0.05 for 95 % CIs that differ by a factor up to 2, however they did not assume homogeneity of variances in their calculations. So the overlap is based on the average margin of error, which is not that practical to visually estimate.

Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. American Psychologist, 60, 170–180.

• I’ve also used this ‘LSD-bar’ trick when I’ve transformed data prior to analysing, but the client wants a graph they can interpret on the original scale. Given that it is feasible to produce an LSD-bar plot on the transformed scale, one can then back-transform to the original scale to obtain a plot in which significance of differences is still interpretable in terms of error bar overlaps. But again, not a trick I’ve seen elsewhere…
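The back-transform trick described above can be sketched like this, using made-up data, a natural-log transform, and equal group sizes:

```r
## Sketch: build "LSD bars" on the log scale, then exponentiate the bar
## endpoints so overlap is still readable on the original scale
## (made-up data, natural log, equal n).
g1 <- c(12, 15, 11, 18, 14, 16)
g2 <- c(25, 31, 22, 28, 35, 26)
l1 <- log(g1); l2 <- log(g2)
n  <- length(g1)
sed <- sqrt((var(l1) + var(l2)) / n)            # SED on the log scale
half_lsd <- qt(0.975, df = 2 * n - 2) * sed / 2

## Bar endpoints computed on the log scale, back-transformed:
bars <- exp(rbind(mean(l1) + c(-1, 1) * half_lsd,
                  mean(l2) + c(-1, 1) * half_lsd))
bars   # rows: groups; columns: lower / upper bar endpoints
```

Because `exp()` is monotone, the bars on the original scale overlap exactly when the log-scale bars do, so the overlap-means-non-significance reading survives the back-transform (the bars are no longer symmetric around the back-transformed means, though).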

3. Good post. I think there should be a convention that 95% CI bars are displayed differently from SE bars, because at the moment they look the same yet mean entirely different things. Maybe make one of them dotted and the other solid lines. Or display both at the same time with a double-crossed line.

• That’s a good point, it really irks me when it’s not clearly stated what kind of error bars are used in a plot. Personally, I think two-tiered error bars are an interesting idea that might be more informative than regular CIs or SE error bars.

• Indeed. A former boss of mine used to get really irked when people (a) failed to understand the distinction between an SE and an SD, and (b) failed to report the base used when they calculated a logarithm…

4. Hi Kristoffer,
Good point and cool visualisation.
I would like to show these examples to my students. I tried to reproduce these plots, but there is only one. Could you also share the code for the other two plots?
Many thanks

• Hi! Sorry for the late reply. To get p-values other than 0.05, just change the numerical part of the conditional statement on line 10. Other than that, the code is identical for all the plots.