# An R Script to Automatically Download PubMed Citation Counts by Year of Publication

## Background

I believe there’s some information to be gained from looking at publication trends over time. Doing it by hand is really troublesome; fortunately, it’s not so troublesome to do in R. Data like these should, however, be interpreted with extreme caution.

## How the script works

I tried to use the `RISmed` package to query PubMed, but found it to be really unreliable. Instead, my script queries PubMed’s E-Utilities using `RCurl` and `XML`. The E-utilities work like this:
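As a minimal sketch of what such a request looks like, the snippet below builds an ESearch URL in base R. The search term is only an illustration, and the URL is assembled from the standard ESearch parameters rather than copied from the script, so treat the exact string as an assumption.

```r
# Base URL of the ESearch utility
base_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Illustrative search term (an assumption, not taken from the script)
term <- "cognitive behavioral therapy[tw]"

# Replace spaces with "+" so the term can be placed in a URL
term_encoded <- gsub(" ", "+", term)

# db=pubmed selects the PubMed database; rettype=count asks for the hit count only
url <- paste0(base_url, "?db=pubmed&term=", term_encoded, "&rettype=count")
print(url)
```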

We can see after the URL that I’m telling E-Utilities that I want to search PubMed’s database, and retrieve it as ‘count’. This will give a minimal XML-output containing only the number of hits for my search term, which is exactly what I want.
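Here is a sketch of how the count could be read out of such a minimal XML reply with the `XML` package. The XML string is a hand-written stand-in for a real response (its exact shape and the number 42 are assumptions for illustration), so the example runs without a network connection.

```r
library(XML)

# Hand-written stand-in for an ESearch reply requested with rettype=count
response <- "<eSearchResult><Count>42</Count></eSearchResult>"

# Parse the text and extract the value of the <Count> node
doc <- xmlParse(response, asText = TRUE)
count <- as.numeric(xpathSApply(doc, "//Count", xmlValue))
print(count)
```

In a live run the string would come from the server instead, for example via `getURL()` from `RCurl`.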

That’s really the basic gist of what my script is doing. If you look at the code (at the bottom of this post), you can see that I construct the query at the beginning of the script using `paste()` and `gsub()`. The main part of the script is the `for` loop, which loops through all the years in the specified year range, retrieves the number of hits for each year, and combines the results into one data frame. To get counts for a specific year `i`, I add `AND i[PPDAT]` (the Print Dates Only tag) to the end of each query.
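The loop can be sketched roughly as follows. This is not the script's actual code: `get_yearly_counts()` and `fetch_count()` are hypothetical names, and the fetch step is a stub that returns 0 so the example runs offline. A real version would request each URL and parse the count from the reply.

```r
# Hypothetical stand-in for the network call; always returns 0 here.
fetch_count <- function(url) 0

get_yearly_counts <- function(term, years, fetch = fetch_count) {
  base_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  counts <- numeric(length(years))
  for (k in seq_along(years)) {
    # Restrict the query to one print year, e.g. "biology[tw] AND 2001[PPDAT]"
    yearly_term <- paste0(term, " AND ", years[k], "[PPDAT]")
    url <- paste0(base_url, "?db=pubmed&term=",
                  gsub(" ", "+", yearly_term), "&rettype=count")
    counts[k] <- fetch(url)
  }
  # One row per year
  data.frame(year = years, count = counts)
}

df <- get_yearly_counts("biology[tw]", 2000:2002)
print(df)
```

Passing the fetch step in as an argument is only a convenience for this sketch; it makes it easy to swap the stub for a real request.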

Since I have all the necessary code in `getCount()`, I can run the same script for any number of queries using `ldply(query.index, getCount)`. By doing that I end up with a data frame that contains the data for all the queries arranged in long format. The end of the script calculates relative counts by dividing the matches each year by the total number of publications that year. I’ve also added a function called `PubTotalHits()` that prints the total number of hits for each query.
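The relative-count step can be illustrated in base R. The numbers below are invented toy values, and `total_table` is a hypothetical stand-in for the table of yearly PubMed totals; only the lookup-and-divide logic is the point.

```r
# Toy yearly counts for two queries in long format (invented numbers)
df <- data.frame(
  .id   = c("a", "a", "b", "b"),
  year  = c(2000, 2001, 2000, 2001),
  count = c(10, 20, 5, 15)
)

# Toy table of total publications per year (invented numbers)
total_table <- data.frame(year = c(2000, 2001), total_count = c(1000, 2000))

# Look up each row's yearly total, then divide the count by it
df$total_count <- total_table$total_count[match(df$year, total_table$year)]
df$relative <- df$count / df$total_count
print(df)
```

If a year is missing from the totals table, `match()` returns `NA` and the relative count for that year is `NA` as well.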

## Why use Print Dates Only Tag [PPDAT] and not Publication Date [DP]?

The problem with using DP is in how PubMed handles articles. If an article is published electronically at the end of, say, 2011 but printed in 2012, that article will be counted in both 2011 and 2012 if I search those two years individually (which is what my script does).

By using PPDAT I will miss some articles that don’t have a published print date. If you’d rather get some duplicates in your data than miss any citations, you can easily change PPDAT to DP; the script will run the same either way.

## A quick example: [PPDAT] vs [DP]

To illustrate the differences I did a quick search using a Cognitive Behavioral Therapy-query. When searching with PubMed’s website I specified the year range as 1940:2012[DP]/[PPDAT], and used the same interval in R.

The correct number of hits is 5501, which is what PubMed’s website returns with the [DP] tag. It’s also the number reported when no time interval is specified. Consequently, if you use my script with the [PPDAT] tag you would, in this scenario, be about 2.5% below the correct number, and about 3% above it if you use [DP]. Other queries may well give different results. However, the error seems small enough that it doesn’t warrant any changes to the code. Duplicates could be avoided by downloading the PMIDs with every search and then checking for duplicates between adjacent years, but that change would require an unnecessary amount of data transfer for an error that appears to be only about 3%.
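For the curious, the duplicate check described above could look something like this in base R. The PMIDs are invented for illustration, and the script itself does not download PMIDs, so this is only a sketch of the idea.

```r
# Invented PMIDs returned for two adjacent years (illustration only)
pmids_2011 <- c("101", "102", "103")
pmids_2012 <- c("103", "104")

# PMIDs that show up in both years would be counted twice
duplicates <- intersect(pmids_2011, pmids_2012)

# Count for 2012 after dropping articles already counted in 2011
corrected_2012 <- length(setdiff(pmids_2012, pmids_2011))

print(duplicates)
print(corrected_2012)
```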

## How to use my script

It’s really simple to use this script. Assuming you have R installed, all you need to do is download the files, point R to that directory, and run the script with `source("PubMedTrend.R")`.

Once that is done you specify your query like this:
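Judging from the console output quoted in the archived comments below, a query is a named character vector. The `"biology"` entry appears there verbatim; the names given to the other two entries are my assumption.

```r
# Names label each query; values are the PubMed search terms
query <- c("medicine"   = "medicine[tw]",
           "psychology" = "psychology[tw]",
           "biology"    = "biology[tw]")
print(query)
```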

Now all you have to do is execute my `PubMedTrend()` function for those queries and save the results in a data frame, for example `df <- PubMedTrend(query)`.

The content of `df` will be in long format, with the columns `.id`, `year`, `count`, `total_count`, and `relative`.

### Additional arguments

The default year range is set to 1950–2009, but it can easily be changed by passing a start and end year, for example `df <- PubMedTrend(query, 1990, 2005)`.

### Some notes about using relative values

PubMed’s total counts (as posted in a table on their website) haven’t been updated since April 8, 2011, but the *de facto* totals have changed since then, because PubMed is always adding citations (both new and old). This can easily be remedied by looping through 1950[PPDAT], 1951[PPDAT] … 2012[PPDAT] (or you can use [DP]). I did that for you and made a graph of the two data sets, and as you can see there are some differences, but they’re not that big. Nonetheless, I’ve included both files with my script.

## Some example runs

When searching, a progress bar shows the progress, and the search will look like this once completed.

Using the function to get total hits will give this output.

## A few words on usage guidelines

In PubMed’s E-utilities usage guidelines it’s specified that:

> In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Failure to comply with this policy may result in an IP address being blocked from accessing NCBI.

To comply with this, my script waits 0.5 seconds after each iteration, resulting in (theoretically) 2 URL GETs per second. This means that searching for 100 yearly counts will take a minimum of 50 seconds for each query. You can change the wait time if you feel that 0.5 seconds is too low or too high.
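A minimal sketch of this kind of throttling, using `Sys.sleep()`. The loop body is a placeholder and the variable names are made up for the example; only three iterations are run so it finishes quickly.

```r
wait_seconds <- 0.5  # pause after each request
n_requests   <- 3    # kept small so the sketch finishes quickly

start <- Sys.time()
for (i in seq_len(n_requests)) {
  # ... the URL request for one year would go here ...
  Sys.sleep(wait_seconds)
}
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

# Lower bound on the run time: number of requests times the wait
minimum <- n_requests * wait_seconds
print(minimum)
```

With 100 yearly counts the same arithmetic gives 100 × 0.5 = 50 seconds per query.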

## And here’s the R code to look at PubMed trends

*Update (2013 August 2)*: I am currently updating this script and moving
it to a GitHub repo, so it will be easier to maintain. You can find the
repo here.

## And some example ggplot2 code

Written by **Kristoffer Magnusson**, a researcher in clinical psychology.

Published April 19, 2012


## Archived Comments (38)

Hi,

Great script but I can't run it

It appears this error. Can you help me? Thkx a lot

df <- PubMedTrend(query)

Searching for: medicine[tw]

| | 0%

Space required after the Public Identifier

SystemLiteral " or ' expected

SYSTEM or PUBLIC, the URI is missing

Error: 1: Space required after the Public Identifier

2: SystemLiteral " or ' expected

3: SYSTEM or PUBLIC, the URI is missing

Having the same problem...was solved by changing the URL in the retrieval loop to include https instead of http.

Hello Kristoffer i am trying to use your code to extract the # of citation from a lot of articles in PubMed but i get this error

""Error in function (type, msg, asError = TRUE) :

SSL certificate problem: unable to get local issuer certificate""

Maybe you could help?i dont have any idea what this error is,although i search it on the WEB

thnx in advance

Thank you, this script is great!

I had one problem with it however: Line 62, where you load total_table.csv

You wrote

tmp <- getURL("https://raw.github.com/rpsy...")

but for me this only returned an empty character ""

What worked is changing the URL to

tmp <- getURL("https://raw.githubuserconte...")

If I missed something: sorry! If not, it might reflect an internal change at github and you might want to update the file.

All the best,

Florian

Hi Kristoffer,

very nice script, however it does not seem to work for me. I tried several queries but they never work. Always the same error, see below. The debug does not point to a specific line so I have no clue where to start. Maybe you have suggestion? All packages are up to date.

> source("PubMedTrend.R")

Loading required package: RCurl

Loading required package: bitops

Loading required package: XML

Loading required package: plyr

> query <- c("biology" = "biology[tw]")

> df <- PubMedTrend(query)

Searching for: biology[tw]

|=======================================================================================================| 100%

Error in read.table(file = file, header = header, sep = sep, quote = quote, :

no lines available in input

Called from: read.csv(text = tmp)

Browse[1]>

Morning Kristoffer, I need a Pubmed count done on publications on QEEG and ERP in the past 10 years with a graph. I have one, but do not know where the author got it or how to reference as it is from the user group Facebook page and not an academic resource. Could you pls assist. Perhaps if you draw up a new similar statistic for me you could assist in how to reference it correctly. Hope I am understanding correctly what you do and that you can assist in this for me. Pls let me know if I needed to compensate you. Your work is really interesting and of huge use.

You may want to contribute your code to one of these two R packages:

https://github.com/ropensci...

https://github.com/ropensci...

Hi,

thanks a lot for this function.

I tried to run it, but encountered this error

df <- PubMedTrend(query, 1990,2005)

Searching for: medicine[tw]

|======================================================================| 100%

Searching for: psychology[tw]

|======================================================================| 100%

Searching for: biology[tw]

|======================================================================| 100%

Error in match(df$year, total.table$year) :

object 'total.table' not found

Do you have an idea why that is?

Thanks a lot,

Best,

JR

The script didn't find the total.table file. I've updated the script to avoid this problem. Now it'll download the total.table data-file from my GitHub repo.

Hi, great code. I am a bit confused though. You are calculating the number of publications per million by calculating the relative counts * 10,000. Is this not the number of publication per 10,000 or am I not getting the point?

You are absolutely right. Thanks for notifying me. Must have been tired when I wrote that :). I've changed the function so it'll return the relative counts as percentages. I figured it's better to do any computations outside the main procedure, if needed.

Great script, thanks for writing it! One caveat I noticed when I was running/modifying it. I was searching for a count of "Dangeardiella macrospora". The API call (no years) shows that there are 109 hits:

http://eutils.ncbi.nlm.nih....

*but*, searching at PubMed shows that the search result should actually be zero (since "Dangeardiella" is not found in PubMed, they conveniently just discard that search term):

http://www.ncbi.nlm.nih.gov...

To ask that PubMed not discard terms, you need to append "[All fields]". For example:

http://www.ncbi.nlm.nih.gov...

http://eutils.ncbi.nlm.nih....[All%20fields]

Hope that's helpful!

The last link in my comment is slightly broken (square brackets got lost). The correct link is here: http://tinyurl.com/km238fq

Thanks for posting this, Andrew! I've opened an issue regarding this on my GitHub repo. I will look into it when I have some time.

Nice one Kris,

I am puzzled as to why the total counts for 2011 onwards to 2013 are NAs I cannot find any fault in the code and my system date is on the correct year! so for example if I query for 2009 to 2012 I get:

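| | .id | year | count | total_count | relative |
|---|---|---|---|---|---|
| 1 | App | 2009 | 150 | 778683 | 1.9263295 |
| 2 | App | 2010 | 169 | 680098 | 2.4849360 |
| 3 | App | 2011 | 166 | NA | NA |
| 4 | App | 2012 | 194 | NA | NA |
| 5 | Rev | 2009 | 14 | 778683 | 0.1797907 |
| 6 | Rev | 2010 | 21 | 680098 | 0.3087790 |
| 7 | Rev | 2011 | 31 | NA | NA |
| 8 | Rev | 2012 | 23 | NA | NA |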

Hi Andrews, the script didn't include total values for 2011 onwards. The totals were fetched from this table http://www.nlm.nih.gov/bsd/.... It's been updated now, and so has my script.

Great piece of code. I am trying to extend the publication year range from 1970 to 2012 but the total_table only goes from 1947 to 2009 (as commented in your code), correct?

Is there a way I can download or update the table to include 2012?

Hi Fulton. Here are two different ways to update the total_table https://github.com/rpsychol...

Amazing. I'm wondering how it might be applied to ProMED-mail archives now.....

In the meantime, however...a much easier problem. How would one plot the data with a log2 y-axis? I'm finding the learning curve for ggplot2 is much steeper than that of R itself....

Hi Robert. Did you figure out how to plot the data with a log2 y-axis? If not, you can use something like this: `scale_y_continuous(trans=log2_trans())`. This uses `log2_trans()` from the "scales" package, so you need that package loaded.

Very nice, any plans to post it as a package on CRAN?

Hi Ricardo, maybe In the future, but it's not really something I've thought about!

Whilst I actually like this publish, I believe there was an spelling error near towards the finish of the third section.

Thanks for notifying me, it's been sorted!

I get this error after it has finished searching when trying to run your script (using rstudio):

Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection

In addition: Warning message:

In readChar(con, 5L, useBytes = TRUE) :

cannot open compressed file 'total_table', probable reason 'No such file or directory'

Any idea what is going on?

I get a similar error in that it can't find 'total.table'. I have redownloaded and tried renaming total.table to total_table...

Using Mac OS X 10.8.4

Kris,

thanks for sharing this script. It's exactly what I was trying to do, with a lot of pain.

However I don't find any file other than PubMedTrend.r and I get the same error as Wouter. Also, I don't find RunPubMed.R mentioned by Michael MacAskill.

I assume both come from the same zip file, but where is the link ?

Cheers

D

Hi, Dani. I have moved the script over to GitHub, so you should find everything here.

Works brilliantly, in a different league performance and reliability-wise compared to RISmed.

One possible error in line 28 of RunPubMed.R

Should:

df.hits <- PubMedTrend()

actually be:

df.hits <- PubTotalHits()

Cheers.

Glad you liked it! You are correct about the error on line 28, thanks for spotting it!

Hi Kris,

Thank you for this fine and incredibly useful script.

Unfortunately it does not work with the recent version of R (2.15) since some interfaces seem to have changed. When running the example script I receive an error message which seems to have troubled some other R users already:

> source("PubMedTrend.R")

Error in source("PubMedTrend.R") :

7 arguments passed to .Internal(identical) which requires 6

http://r.789695.n4.nabble.c...

Since I am not too experienced yet, I do not have an idea which of the function calls .Internal() . If you have a guess and give a hint I would try bugfixing.

Best regards,

Robert

Hi Robert,

Unfortunately I couldn't replicate your error on R 2.15.0. Are you using the latest versions of "plyr", "XML" and "RCurl"?

I'm not explicitly calling .Internal in my script, but you could try running traceback() to get some more information on the error.

Hi Kris,

Strangely, it seems that neither your script nor R itself is the troublemaker but RStudio - sorry for not mentioning the use of an alternative editor first. Because of the link given above, I thought that the error was related to the updated version of R.

When directly inputting your script into RStudio (by loading -> Strg A -> Run) everything works well and the given examples are executable.

When loading your script with source followed by traceback():

> source("PubMedTrend.R")

Error in source("PubMedTrend.R") :

7 arguments passed to .Internal(identical) which requires 6

> traceback()

2: source("PubMedTrend.R")

1: source("PubMedTrend.R")

Nevertheless, using the native R console no error is thrown even when loading the script with source().

Sorry for the confusion and thank you again for providing the code!

Thanks for posting your solution, I'm glad you got it to work. It's strange though, it runs just fine in RStudio for me.

Thanks for sharing Kris, particularly since the code even includes thoughtful validation checking!

Looking forward to trying this out, but will be a few days I think. An advantage of the RISmed approach was that (when it worked...) entire records were returned. This meant that further analyses could be done, above just the simple raw counts (e.g. examining trends within a particular journal). I'll be happy to at least get reliable count data, but does this method lend itself to getting individual record-level information as well?

Cheers,

Michael

It's possible to extend this method to download complete records in XML. This script actually started out with that functionality, but it seemed a bit unnecessary to download the complete records when I mostly wanted to look at yearly counts.

I will release a version of my script that will download complete records, I only need to add a function to batch download articles, since PubMed has got a retrieval cap of 10k articles.

Thanks for commenting,

Kris