Earlier this week I read this article about “Why Publishing Everything Is More Effective than Selective Publishing of Statistically Significant Results” by Mercal et al (2014). The authors simulated different meta-analytic scenarios and came to the conclusion that publishing everything is more effective for the scientific collective. This got me thinking about publication bias and its expected effect on effect sizes, particularly Cohen’s d. It is evident that the effect will be overestimated under selective publishing, and that this bias will be larger in small studies, simply because the sampling distribution from small studies will have a larger standard errors. Daniel Simons wrote a similar post about “What effect size would you expect?”. In his post Simons looked at the expectation of the absolute value of d when there is no effect in the population. However, I am not interested in the absolute effect and not so much in the null distribution. So in this post I will both use d‘s theoretical and empirical sampling distribution to show the expected overestimation due to selective publishing. I will look at the overestimation for various sample sizes when the population effect \(\delta\) is 0, 0.2, 0.5 and 0.8. The conclusion is that you should be weary of effect sizes from small samples, and that the issue is rather with type M (magnitude) errors than type I errors. At least is clinical psychology the pervasive problem is overestimation of effects and not falsely rejecting null hypothesis.

## About the R-code

To make this post easier to follow I have chosen to put all of the R code at the end of this post.

## When the population effect is zero

### Cohen’s d sample distribution

It is known that the sampling distribution for Cohen’s *d* follows a noncentral t distribution. However, it is also knows that it is asymptotically normal with a standard deviation of \(\sqrt{\frac{n_1 + n_2}{n_1 n_2}}\). What we will do is look at the 25th, 50th and 75th percentile of the upper-half of the sampling distribution, which corresponds to the 62.5th, 75th and 87.5th percentile of the full distribution. This means that we are looking at the expected value of Cohen’s d if researchers selectively publish studies that yield an effect above these percentiles. Stated as fractions the 62.5th percentile corresponds to the top 3 out of 8 studies, 75th percentile is 1 out of 4 studies and 87.5th 1 out of 8 studies.

*Figure 1*

In clinical psychology we often have a hypothesis about the direction of an effect, so we are not interested in the absolute effect. Consequently I will refer to the percentiles of the full distribution, but if you are interested in the absolute value of d you can think of the percentiles as 25th, 50th and 75th percent of the sampling distribution. *Figure 2* bellow shows the results from simulating this and from using the quantile function of the standard normal distribution. We see that the analytical methods give slightly different results for small samples.

*Figure 2*

### Using the noncentral *t* distribution

To get more exact results we will use the noncentral t distribution to find the three percentiles from above. We accomplish this by finding the noncentrality parameters that correspond to the respective percentile and then multiply this by the standard error to get the expected Cohen’s d. This is the same methodology as when the noncentral t distribution is used to construct 95 % confidence intervals for d. The interested reader can look at Cumming (2001) for a short primer on this method. *Figure 3* shows that this methods closely matches the simulation results. We also see that when we have 20 participants in each group 1 in 4 studies will estimate d to be larger then 0.2, even though no real effect exists.

*Figure 3*

## When there is a population effect

### A note about the Cohen’s d and Hedges’ g

The terminology regarding Cohen’s *d* is inconsistent, but in this post I refer to the biased estimate of \(\delta\) as Cohen’s *d* and the bias-adjusted estimate as Hedges’ *g*. The bias is trivial for moderate and large samples. Cohen’s *d* can be converted to Hedges’ *g* by multiplying by an adjustment factor. However, since I want to analytically estimate the sample distribution of Cohen’s *d*, I need to adjust \(\delta\) to take this bias into account. I simply do this by dividing \(\delta\) by the adjustment factor. Like this

where the adjustment factor \(A = \left(1-\frac{3}{4df-1}\right)\)

### When d is non-zero

I think it is fair to say that null hypotheses are seldom true. Therefore I examined a scenario were researchers only publish values above some expected cutoff. For instance, say that the true effect is 0.2 but researchers only publish values larger then this, then what is the expected 25th, 50th and 75th percentile of this distribution?

*Figure 4*

*Figure 5*

*Figure 6*

The figure below shows the percentage of bias for studies at the 75th percentile. The bias percentage is calculated like this

*Figure 7*

Figures 4 show that the overestimation is huge for small samples when the population effect is 0.2. For instance, with 20 subjects in each group 1 in 4 studies is expected to find an effect that is at least 0.4. And 1 in 8 studies will find an effect close to 0.6 or over.

When the population effect is 0.5, Figure 5 shows that with 20 subjects per group 1 in 4 studies is expected to find an effect close to 0.75 or higher. And 1 in 8 studies will find and effect close to 0.9 or higher. Using Cohen’s rule of thumb for interpreting *d* we would wrongly conclude that there is an “large” effect when in fact there is a “medium” effect in the population. When the population effect is 0.8 the overestimation is slightly smaller, but the overall pattern is the same.

## But are these effect sizes “statistically significant”?

I have not focused on null hypothesis significance testing (NHST) in this post. But in the figure below I show the expected p-values for the effect sizes at the 75th percentile. The *p*-values are both from independent *t*-tests and mixed-design ANOVAs.

*Figure 8*

It should be no surprise that, independent of the size of *n*, *p* > 0.05 when the null is true. A traditional null hypothesis test at \(\alpha = 0.05\) will – by definition – never reject values at the 75th percentile. Moreover, we see that *p*-values are no safeguard against type M errors when the population effect is 0.5 or 0.8. On the contrary, NHST most probably contributes to even larger overestimations of effect sizes. If the population effect is 0.2 and we repeatedly rerun our study until *p* < 0.05, then the effect will be grossly overestimated. If \(n_1 = n_2 = 20\) then d will be about 0.64 when *p* = 0.05 (two-tailed *t*-test). If we once again use the noncentral *t* distribution we can show that if \(\delta=0.2\) then about 10 % of the studies will estimate d to be equal to or greater then 0.64. This means that the estimate is 220 % biased, but still “statistically significant”. So *p*-values from underpowered studies clearly perpetuates the problem of overestimation — unscrupulous researchers just have work a little harder.

## Summary and conclusions

- The overestimation due to publication bias is not trivial.
- Selective publishing of the top 25 % of small studies will yield effects that are at least 50–100 % biased, if the true effect is between 0.2 and 0.5.
- Selective reporting of outcomes will give the same effect. However the overestimation depends on the correlation between outcomes.
- Larger studies will give less biased estimates even under publication bias. Larger studies are also more expensive and resource consuming making willful selective publishing more tedious.
- Don’t trust effect sizes from small studies.
- Interpret confidence intervals, not point estimates.
- Focus on replication.
- Think “meta-analytically”, interpret single studies with caution.
- Null hypothesis significance testing will only make the problem worse.

## R-code

library(MBESS) library(ggplot2) library(MASS) # # FIGURE 1 ---------------- # # normal plot - percentiles x <- seq(-1,1, by=0.001) y <- dnorm(x, 0, sd=sqrt(sigma^2/25 + sigma^2/25)) sample_dist <- data.frame("x" = x, "y"= y) d_25 <- qnorm(0.625, 0, sqrt(sigma^2/25 + sigma^2/25)) d_50 <- qnorm(0.75, 0, sqrt(sigma^2/25 + sigma^2/25)) d_75 <- qnorm(0.875, 0, sqrt(sigma^2/25 + sigma^2/25)) polygon <- sample_dist[sample_dist$x > 0,] polygon$y1 <- 0 ggplot(sample_dist, aes(x,y, color=NULL)) + geom_line(size=1) + geom_polygon(data=polygon, fill="#c0392b") + geom_vline(aes(xintercept=c(d_25, d_50, d_75)), linetype="dashed") + labs(title="Cohen's d sampling distribution (normal aprox.; n = 25)\n Dashed lines showing 62.5th, 75th and 87.5th percentile", y="Density", x="Cohen'd") ggsave("figure1.png", width=10, height=8, dpi=72) # # FIGURE 2 ---------------- # # normal approx, d = 0 wrap_sim_d <- function(n.1, n.2) { sim <- function(n.1, n.2) { group1 <- rnorm(n.1) group2 <- rnorm(n.2) return(smd(group1, group2, Unbiased=F)) } sim_data <- replicate(10000, sim(n.1, n.2)) # get percentiles sigma <- 1 SE <- sqrt((n.1+n.2)/(n.1*n.2)) analytical <- c("25%"=qnorm(0.625, 0, SE), "50%"=qnorm(0.75, 0, SE), "75%"=qnorm(0.875, 0, SE)) mc <- quantile(abs(sim_data), c(0.25, 0.5, 0.75)) output <- data.frame("y"=c(analytical, mc)) output$n <- n.1 output$group <- gl(2,3, labels=c("analytical","MC")) output$percentile <- gl(3,1, labels=c("62.5%","75%","87.5%")) return(output) } result_d <- do.call(rbind, lapply(seq(5,100, by=5), function(i) wrap_sim_d(n.1=i, n.2=i))) ggplot(result_d, aes(x=n, y=y, group=interaction(group, percentile), color=percentile, linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_color_manual(values=c("#3498db","black","#3498db")) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = "Expected Cohen's d if only positive effects are published and the population parameter is zero", y="Cohen's d") ggsave("figure2.png", dpi=72, width=10, height=8) # # FIGURE 3 ------------- # # code for d using non-central t wrap_sim_d_nct <- function(n.1, n.2) { sim <- function(n.1, n.2) { group1 <- rnorm(n.1) group2 <- rnorm(n.2) return(smd(group1, group2, Unbiased=F)) } sim_data <- replicate(50000, sim(n.1, n.2)) # get percentiles SE <- sqrt((n.1+n.2)/(n.1*n.2)) analytical <- c("25%" = conf.limits.nct(0, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.625)$Upper.Limit * SE, "50%" = conf.limits.nct(0, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.750)$Upper.Limit * SE, "75%" = conf.limits.nct(0, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.875)$Upper.Limit * SE) # alpha.lower is not used for anything here mc <- quantile(abs(sim_data), c(0.25, 0.5, 0.75)) output <- data.frame("y" = c(analytical, mc)) output$n <- n.1 output$group <- gl(2, 3, labels=c("analytical", "MC")) output$percentile <- gl(3, 1, labels=c("62.5%", "75%", "87.5%")) return(output) } result_d_nct <- do.call(rbind, lapply(seq(5, 100, by = 5), function(i) wrap_sim_d_nct(n.1 = i, n.2 = i))) ggplot(result_d_nct, aes(x = n, y = y, group = interaction(group, percentile), color = percentile, linetype = group)) + geom_line() + geom_point(shape = 1, size = 3) + scale_color_manual(values=c("#3498db", "black", "#3498db")) + scale_x_continuous(breaks = c(5, 10, 15, 20, 25, 50, 100)) + labs(title = "Expected Cohen's d under publication bias when the population parameter is zero (noncentral t)", y="Cohen's d") ggsave("figure3.png", dpi = 72, width = 10, height = 8) # # FIGURE 4, 5 & 6 ----------- # code for Cohen's d nct > 0 # # wrap_sim_d_not_zero <- function(n.1, n.2, d) { sim <- function(n.1, n.2, d) { group1 <- rnorm(n.1, d) group2 <- rnorm(n.2) return(smd(group1, group2, Unbiased = F)) } sim_data <- replicate(10000, sim(n.1, n.2, d)) mean(sim_data[2,] < 0) * 2 getp(mean(sim_data[1,]), n.1) # get percentiles SE <- sqrt((n.1+n.2)/(n.1*n.2)) df <- (n.1+n.2)-2 biased_d <- d/(1-3/(4*df-1)) # give biased d ncp <- biased_d * sqrt((n.1 * n.2)/(n.1 + n.2)) analytical <- c("25%" = conf.limits.nct(ncp, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.625)$Upper.Limit * SE, "50%" = conf.limits.nct(ncp, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.750)$Upper.Limit * SE, "75%" = conf.limits.nct(ncp, (n.1+n.2)-2, alpha.lower=0.05, alpha.upper=1-0.875)$Upper.Limit * SE) mc <- quantile(sim_data[sim_data > d], c(0.25, 0.5, 0.75)) output <- data.frame("y"=c(analytical, mc)) output$n <- n.1 output$group <- gl(2,3, labels=c("analytical","MC")) output$percentile <- gl(3,1, labels=c("62.5%", "75%", "87.5%")) return(output) } d.val <- 0.2 result_d_02 <- do.call(rbind, lapply(seq(5,100, by=5), function(i) wrap_sim_d_not_zero(n.1=i, n.2=i, d=d.val))) plot_d_02 <- ggplot(result_d_02, aes(x=n, y=y, group=interaction(group, percentile), color=percentile, linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_color_manual(values=c("#3498db","black","#3498db")) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = bquote(Expected ~ "Cohen's" ~ d ~ under ~ publication ~ bias ~ when ~ delta == ~ .(d.val)~"(noncentral t)"), y="Cohen's d") d.val <- 0.5 result_d_05 <- do.call(rbind, lapply(seq(5,100, by=5), function(i) wrap_sim_d_not_zero(n.1=i, n.2=i, d=d.val))) plot_d_05 <- ggplot(result_d_05, aes(x=n, y=y, group=interaction(group, percentile), color=percentile, linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_color_manual(values=c("#3498db","black","#3498db")) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = bquote(Expected ~ "Cohen's" ~ d ~ under ~ publication ~ bias ~ when ~ delta == ~ .(d.val)~"(noncentral t)"), y="Cohen's d") d.val <- 0.8 result_d_08 <- do.call(rbind, lapply(seq(5,100, by=5), function(i) wrap_sim_d_not_zero(n.1=i, n.2=i, d=d.val))) plot_d_08 <- ggplot(result_d_08, aes(x=n, y=y, group=interaction(group, percentile), color=percentile, linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_color_manual(values=c("#3498db","black","#3498db")) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = bquote(Expected ~ "Cohen's" ~ d ~ under ~ publication ~ bias ~ when ~ delta == ~ .(d.val)~"(noncentral t)"), y="Cohen's d") plot_d_02; plot_d_05; plot_d_08 ggsave("plot_d_02.png", plot_d_02, width=10, height=8, dpi=72) ggsave("plot_d_05.png", plot_d_05, width=10, height=8, dpi=72) ggsave("plot_d_08.png", plot_d_08, width=10, height=8, dpi=72) # # FIGURE 7 ----------- # # bias for d=0.2, 0.5, 0.8 result_all <- data.frame("y" = c(result_d_02$y[result_d_02$percentile=="75%" & result_d_02$group=="MC"], result_d_05$y[result_d_05$percentile=="75%" & result_d_05$group=="MC"], result_d_08$y[result_d_08$percentile=="75%" & result_d_08$group=="MC"])) result_all$n <- rep(seq(5,100, by=5), 3) result_all$d <- rep(c(0.2, 0.5, 0.8), each=20) result_all$bias <- ((result_all$y - result_all$d)/result_all$d) * 100 # percentage bias plot_all <- ggplot(result_all, aes(x=n, y=bias, group=factor(d), color=factor(d))) + geom_line() + geom_point(shape=1, size=3) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100)) + labs(title = bquote(Expected ~ "Cohen's d under publication bias (>75th perc.) when" ~ delta == "(0.2, 0.5, 0.8)"~"(Monte Carlo estimates)"), y="% Bias") ggsave("plot_d_nonzero_all.png", plot_all, width=10, height=8, dpi=72) # # FIGURE 8 ------------- # # p.vals from t getp <- function(d, n) { df <- (2*n)-2 t <- (d / sqrt(2/n)) 2*pt(-abs(t), df) # two tailed } result_d_0 <- result_d_nct result_d_0$p <- getp(result_d_nct$y, result_d_nct$n) result_d_0$ES <- 0 result_d_02$p <- getp(result_d_02$y, result_d_02$n) result_d_02$ES <- 0.2 result_d_05$p <- getp(result_d_05$y, result_d_05$n) result_d_05$ES <- 0.5 result_d_08$p <- getp(result_d_08$y, result_d_08$n) result_d_08$ES <- 0.8 p_vals <- rbind(result_d_0[result_d_0$group == "MC" & result_d_0$percentile == "75%", ], result_d_02[result_d_02$group == "MC" & result_d_02$percentile == "75%", ], result_d_05[result_d_05$group == "MC" & result_d_05$percentile == "75%", ], result_d_08[result_d_08$group == "MC" & result_d_08$percentile == "75%", ]) p_vals_plot <- ggplot(p_vals, aes(x=n, y=p, group=factor(ES), color=factor(ES))) + geom_line() + geom_point(shape=1, size=3) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100,150)) + geom_hline(yintercept=0.05) + labs(title = bquote("P-values from independent t-tests for" ~ delta== ~ "{0, 0.2, 0.5, 0.8}"), y="p-value (two-tailed)", parse=T) ggsave("p_vals_plot.png", p_vals_plot, width=10, height=8, dpi=72) ## pvals anova getANOVAp <- function(d, n, rho) { if(!require(MASS)) stop("MASS is not installed") rho <- rho cov_matrix <- function(sigmas) { B <- matrix(sigmas, ncol=length(sigmas), nrow=length(sigmas)) sigma <- t(B) * B * rho diag(sigma) <- sigmas^2 sigma } innerFunc <- function(d, n, rho=rho) { delta <- d/2 CBT <- c(3, 3-delta, 3-(2*delta)) WL <- c(3, 3, 3) CBT_i <- mvrnorm(n=n, mu=CBT, Sigma=cov_matrix(c(1,1,1)), empirical=T) WL_i <- mvrnorm(n=n, mu=WL, Sigma=cov_matrix(c(1,1,1)), empirical=T) data <- data.frame("DV" = c(CBT_i, WL_i), "time" = gl(3, n, 2*3*n), "group" = gl(2,3*n, 2*3*n), "subjects" = factor(c(rep(1:n, 3), rep((n+1):(2*n), 3))) ) result<-summary(aov(DV ~ time*group + Error(subjects), data=data)) return(result[[2]][[1]][[5]][2]) } mapply(function(i,j) innerFunc(i,j), i=d, j=n) } result_d_0_anova <- result_d_nct result_d_0_anova$p <- getANOVAp(result_d_nct$y, result_d_nct$n, 0.5) result_d_0_anova$ES <- 0 result_d_02_anova <- result_d_02 result_d_02_anova$p <- getANOVAp(result_d_02$y, result_d_02$n, 0.5) result_d_02_anova$ES <- 0.2 result_d_05_anova <- result_d_05 result_d_05_anova$p <- getANOVAp(result_d_05$y, result_d_05$n, 0.5) result_d_05_anova$ES <- 0.5 result_d_08_anova <- result_d_08 result_d_08_anova$p <- getANOVAp(result_d_08$y, result_d_08$n, 0.5) result_d_08_anova$ES <- 0.8 p_vals_anova <- rbind(result_d_0_anova[result_d_0_anova$group == "MC" & result_d_0_anova$percentile == "75%", ], result_d_02_anova[result_d_02_anova$group == "MC" & result_d_02_anova$percentile == "75%", ], result_d_05_anova[result_d_05_anova$group == "MC" & result_d_05_anova$percentile == "75%", ], result_d_08_anova[result_d_08_anova$group == "MC" & result_d_08_anova$percentile == "75%", ]) p_vals_plot_anova <- ggplot(p_vals_anova, aes(x=n, y=p, group=factor(ES), color=factor(ES))) + geom_line() + geom_point(shape=1, size=3) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100,150)) + geom_hline(yintercept=0.05) + labs(title = expression(atop("P-values from mixed-design ANOVAS for" ~ delta== ~ "{0, 0.2, 0.5, 0.8} at endpoint","3 time points, two groups, "~rho==0.5)), y="p-value (time x group)", parse=T) ggsave("p_vals_plot_anova.png", p_vals_plot, width=10, height=8, dpi=72) p_vals$group <- "t-test" p_vals_anova$group <- "ANOVA" p_vals_df <- rbind(p_vals, p_vals_anova) p_vals_plot_comb <- ggplot(p_vals_df, aes(x=n, y=p, group=interaction(factor(ES), group), color=factor(ES), linetype=group)) + geom_line() + geom_point(shape=1, size=3) + scale_x_continuous(breaks = c(5,10,15,20,25,50,100,150)) + scale_y_continuous(breaks = c(0, 0.05, 0.2, 0.4, 0.6, 0.8)) + geom_hline(yintercept=0.05) + labs(title = expression(atop("P-values from mixed-design ANOVAs (time x group) and t-tests (two-tailed)", ~ delta== ~ "{0, 0.2, 0.5, 0.8} at endpoint, 3 time points, two groups, "~rho==0.5)), y="p-value", parse=T) ggsave("p_vals_plot_comb.png", p_vals_plot_comb, width=10, height=8, dpi=72)