An R Script to Automatically Download PubMed Citation Counts by Year of Publication


Background

I believe there's some information to be gained from looking at publication trends over time. Doing it by hand is really tedious; fortunately, it's much less tedious in R. That said, data like these should be interpreted with extreme caution.

How the script works

I tried to use the RISmed package to query PubMed, but found it to be really unreliable. Instead, my script queries PubMed's E-utilities directly using RCurl and XML. The E-utilities work like this:
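For example, a count-only search request has this shape (the exact URL from the original post isn't reproduced here; the search term is just an illustration):

```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&rettype=count&term=psychology[tw]
```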

After the base URL, you can see that I'm telling the E-utilities to search PubMed's database (db=pubmed) and to return the result as a count (rettype=count). This gives a minimal XML output containing only the number of hits for my search term, which is exactly what I want.

That's really the basic gist of what my script is doing. If you look at the code (at the bottom of this post), you can see that I construct the query at the beginning of the script using paste() and gsub(). The main part of the script is then the for-loop, which loops through all the years in the specified range, retrieves the number of hits, and pastes everything together into one data frame. To get counts for a specific year i, I append AND i[PPDAT] (the Print Dates Only tag) to each query.
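As a rough sketch of that construction (the variable names here are illustrative, not necessarily the original script's):

```r
# Build the search term for one query and one year, then replace spaces
# with "+" so the term can go straight into the request URL
query.term <- "psychology[tw]"
i <- 1990
term <- gsub(" ", "+", paste(query.term, " AND ", i, "[PPDAT]", sep = ""))
# => "psychology[tw]+AND+1990[PPDAT]"
```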

Since all the necessary code lives in getCount(), I can run the same script for any number of queries using ldply(query.index, getCount). This produces one data frame containing the data for all the queries, arranged in long format. The end of the script calculates relative counts by dividing each year's matches by the total number of publications that year. I've also added a function, PubTotalHits(), that prints the total number of hits for each query.
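In outline, that part amounts to something like this (a sketch; getCount() and query.index are the names used above, the rest is illustrative):

```r
library(plyr)

# One long-format data frame; each query contributes a block of rows,
# labelled by the query's name in the .id column
df <- ldply(query.index, getCount)

# Relative counts: yearly matches divided by that year's total publications
df$relative <- df$count / df$total_count
```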

Why use the Print Dates Only tag [PPDAT] and not Publication Date [DP]?

The problem with using DP lies in how PubMed handles articles. If an article is published electronically at the end of, say, 2011 but printed in 2012, that article will be counted in both 2011 and 2012 if I search those two years individually (which is what my script does).

By using PPDAT I will miss some articles that don't have a print publication date. If you'd rather get some duplicates in your data than miss any citations, you can easily change PPDAT to DP; the script will run the same either way.

A quick example: [PPDAT] vs [DP]

To illustrate the differences I did a quick search using a Cognitive Behavioral Therapy query. On PubMed's website I specified the year range as 1940:2012, once with [DP] and once with [PPDAT], and used the same interval in R.

The correct number of hits is 5,501, which is what PubMed's website returns with the [DP] tag; it's also the number reported when no time interval is specified at all. Consequently, if you use my script with the [PPDAT] tag you would, in this scenario, be about 2.5% below the correct count, and about 3% above it if you use [DP]. Other queries may of course give different results. However, the error seems small enough that it doesn't warrant any changes to the code. Duplicates could be avoided by downloading PMIDs with every search and then checking each pair of adjacent years for duplicates, but that change would require an unnecessary amount of data transfer for an error that appears to be only about 3%.

How to use my script

It's really simple to use this script. Assuming you have R installed, all you need to do is download the files, point R to that directory, and source "PubMedTrend.R", like this:
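A minimal sketch (the directory path is a placeholder):

```r
setwd("~/path/to/PubMedTrend")  # placeholder: wherever you saved the files
source("PubMedTrend.R")
```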

Once that is done, you specify your query like this:
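The query is a named character vector; here is a sketch using the example queries from the runs below:

```r
# Names become the labels in the output; values are PubMed search strings
query <- c("medicine"   = "medicine[tw]",
           "psychology" = "psychology[tw]",
           "biology"    = "biology[tw]")
```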

Now all you have to do is execute my PubMedTrend() function for those queries and save the results in a data frame:
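For example:

```r
df <- PubMedTrend(query)
```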

The content of df will be structured like this:
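Roughly, it's one row per query and year (the column names follow the script's output; actual values are omitted here):

```
       .id year count total_count relative
1 medicine 1950   ...         ...      ...
2 medicine 1951   ...         ...      ...
3 medicine 1952   ...         ...      ...
```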

Additional arguments

The default year range is set to 1950–2009, but can easily be changed, like this:
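For example, to search 1990 through 2005:

```r
df <- PubMedTrend(query, 1990, 2005)
```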

Some notes about using relative values

PubMed's total counts (as posted in a table on their website) haven't been updated since April 8, 2011, but the de facto totals have changed since then, because PubMed is continually adding new citations (both new and old ones). This can easily be remedied by looping through 1950[PPDAT], 1951[PPDAT] … 2012[PPDAT] (or you can use [DP]). I did that for you and made a graph of the two data sets; as you can see, there are some differences, but they're not that big. Nonetheless, I've included both files with my script.
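A sketch of such a loop, reusing the count-only request from earlier (the function and column names here are mine):

```r
library(RCurl)
library(XML)
library(plyr)

# Fetch PubMed's total number of citations for each year via [PPDAT]
totals <- ldply(1950:2012, function(i) {
  url <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
               "?db=pubmed&rettype=count&term=", i, "[PPDAT]", sep = "")
  doc <- xmlParse(getURL(url), asText = TRUE)
  Sys.sleep(0.5)  # keep within NCBI's request-rate guidelines
  data.frame(year = i,
             total_count = as.numeric(xpathSApply(doc, "//Count", xmlValue)))
})
```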

[Figure: PubMed total citation counts per year, comparing the two data sets. By Kristoffer Magnusson.]

Some example runs

[Figure: PubMed hits by year, showing publication trends for biology, psychology and medicine. By Kristoffer Magnusson.]

[Figure: PubMed hits by year, showing publication trends for biology, psychology and medicine. By Kristoffer Magnusson.]

While searching, a progress bar shows the progress; a completed search looks like this:
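(The exact rendering depends on your console; roughly:)

```
Searching for: medicine[tw]
  |======================================================================| 100%
Searching for: psychology[tw]
  |======================================================================| 100%
Searching for: biology[tw]
  |======================================================================| 100%
```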

Using PubTotalHits() to get the total hits gives output like this:
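Roughly, it's a table of each query with its total hit count; a sketch with placeholder values:

```
         .id total
1   medicine   ...
2 psychology   ...
3    biology   ...
```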

A few words on usage guidelines

In PubMed’s E-utilities usage guidelines it’s specified that:

In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Failure to comply with this policy may result in an IP address being blocked from accessing NCBI.

To comply with this, my script waits 0.5 seconds after each iteration, resulting in (theoretically) two URL GETs per second. This means that searching for 100 yearly counts will take a minimum of 50 seconds per query. You can change the wait time if you feel that 0.5 seconds is too low or too high.

And here’s the R code to look at PubMed trends
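The maintained version lives in the GitHub repo mentioned in the update below; what follows is a minimal re-sketch of the core logic as described above. PubMedTrend() and getCount() are the names used in this post; everything else is illustrative, and the relative-count step and PubTotalHits() are omitted.

```r
library(RCurl)  # getURL()
library(XML)    # xmlParse(), xpathSApply()
library(plyr)   # ldply()

PubMedTrend <- function(query, yrStart = 1950, yrEnd = 2009) {
  getCount <- function(query.term) {
    # One count-only esearch request per year in the range
    results <- data.frame(year = yrStart:yrEnd, count = NA)
    for (i in yrStart:yrEnd) {
      # Append the year tag and URL-encode the spaces,
      # e.g. "biology[tw]+AND+1990[PPDAT]"
      term <- gsub(" ", "+",
                   paste(query.term, " AND ", i, "[PPDAT]", sep = ""))
      url <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                   "?db=pubmed&rettype=count&term=", term, sep = "")
      doc <- xmlParse(getURL(url), asText = TRUE)
      results$count[results$year == i] <-
        as.numeric(xpathSApply(doc, "//Count", xmlValue))
      Sys.sleep(0.5)  # roughly two requests per second, per the guidelines
    }
    results
  }
  # Long format: one block of rows per query, labelled via the .id column
  ldply(query, getCount)
}
```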

Update (August 2, 2013): I am currently updating this script and moving it to a GitHub repo, so it will be easier to maintain. You can find the repo here.

And some example ggplot2 code
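For instance, a line plot of the relative counts (this assumes the df structure shown above; the original post's plotting code isn't reproduced here):

```r
library(ggplot2)

# One line per query, relative PubMed hits over time
ggplot(df, aes(x = year, y = relative, color = .id)) +
  geom_line() +
  labs(x = "Year of publication",
       y = "Relative number of PubMed hits",
       color = "Query")
```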


Written by Kristoffer Magnusson, a researcher in clinical psychology. You should follow him on Twitter and come hang out on the open science Discord, Git Gud Science.



Published April 19, 2012 (View on GitHub)


Questions & Comments

Please use GitHub Discussions for any questions related to this post, or open an issue on GitHub if you've found a bug or want to make a feature request.


Archived Comments (38)

Gey 2018-10-14

Hi,
Great script, but I can't run it.
This error appears. Can you help me? Thanks a lot.
df <- PubMedTrend(query)
Searching for: medicine[tw]
| | 0%Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing

hkbob 2019-04-20

Having the same problem... It was solved by changing the URL in the retrieval loop to use https instead of http.

Aplamhden 2014-12-12

Hello Kristoffer, I am trying to use your code to extract the number of citations from a lot of articles in PubMed, but I get this error:

"Error in function (type, msg, asError = TRUE) :

SSL certificate problem: unable to get local issuer certificate"

Maybe you could help? I don't have any idea what this error is, although I searched for it on the web.
Thanks in advance

Florian Markowetz 2014-11-14

Thank you, this script is great!

I had one problem with it, however: line 62, where you load total_table.csv.

You wrote
tmp <- getURL("https://raw.github.com/rpsy...")
but for me this only returned an empty character ""

What worked is changing the URL to
tmp <- getURL("https://raw.githubuserconte...")

If I missed something: sorry! If not, it might reflect an internal change at GitHub, and you might want to update the file.

All the best,
Florian

Wouter van den Bos 2014-10-13

Hi Kristoffer,

very nice script; however, it does not seem to work for me. I tried several queries, but they never work. Always the same error, see below. The debug output does not point to a specific line, so I have no clue where to start. Maybe you have a suggestion? All packages are up to date.

> source("PubMedTrend.R")

Loading required package: RCurl

Loading required package: bitops

Loading required package: XML

Loading required package: plyr

> query <- c("biology" = "biology[tw]")

> df <- PubMedTrend(query)

Searching for: biology[tw]

|=======================================================================================================| 100%

Error in read.table(file = file, header = header, sep = sep, quote = quote, :

no lines available in input

Called from: read.csv(text = tmp)

Browse[1]>

Lorraine Paquet 2014-02-14

Morning Kristoffer. I need a PubMed count done on publications on QEEG and ERP in the past 10 years, with a graph. I have one, but do not know where the author got it or how to reference it, as it is from a user-group Facebook page and not an academic resource. Could you please assist? Perhaps if you draw up a new, similar statistic for me, you could also advise on how to reference it correctly. I hope I am understanding correctly what you do and that you can assist me with this. Please let me know if I need to compensate you. Your work is really interesting and of huge use.

Scott Chamberlain 2013-09-21

You may want to contribute your code to one of these two R packages:

https://github.com/ropensci...
https://github.com/ropensci...

JR 2013-08-12

Hi,

thanks a lot for this function.

I tried to run it, but encountered this error

df <- PubMedTrend(query, 1990,2005)

Searching for: medicine[tw]
|======================================================================| 100%
Searching for: psychology[tw]
|======================================================================| 100%
Searching for: biology[tw]
|======================================================================| 100%

Error in match(df$year, total.table$year) :
object 'total.table' not found

Do you have an idea why that is?

Thanks a lot,

Best,

JR

Kristoffer Magnusson 2013-08-17

The script didn't find the total.table file. I've updated the script to avoid this problem. Now it'll download the total.table data-file from my GitHub repo.

Michael 2013-07-16

Hi, great code. I am a bit confused, though. You are calculating the number of publications per million by calculating the relative counts * 10,000. Is this not the number of publications per 10,000, or am I not getting the point?

Kristoffer Magnusson 2013-08-17

You are absolutely right. Thanks for notifying me. I must have been tired when I wrote that :). I've changed the function so it'll return the relative counts as percentages. I figured it's better to do any computations outside the main procedure, if that's needed.

Andrew Su 2013-06-14

Great script, thanks for writing it! One caveat I noticed while running/modifying it: I was searching for a count of "Dangeardiella macrospora". The API call (no years) shows that there are 109 hits:

http://eutils.ncbi.nlm.nih....

*but*, searching at PubMed shows that the search result should actually be zero (since "Dangeardiella" is not found in PubMed, they conveniently just discard that search term):

http://www.ncbi.nlm.nih.gov...

To ask that PubMed not discard terms, you need to append "[All fields]". For example:

http://www.ncbi.nlm.nih.gov...
http://eutils.ncbi.nlm.nih....[All%20fields]

Hope that's helpful!

Andrew Su 2013-06-14

The last link in my comment is slightly broken (square brackets got lost). The correct link is here: http://tinyurl.com/km238fq

Kristoffer Magnusson 2013-08-17

Thanks for posting this, Andrew! I've opened an issue about this on my GitHub repo. I will look into it when I have some spare time.

Andrew Roberts 2013-04-24

Nice one Kris,

I am puzzled as to why the total counts from 2011 to 2013 are NAs. I cannot find any fault in the code, and my system date is set to the correct year! So, for example, if I query for 2009 to 2012 I get:

  .id year count total_count  relative
1 App 2009   150      778683 1.9263295
2 App 2010   169      680098 2.4849360
3 App 2011   166          NA        NA
4 App 2012   194          NA        NA
5 Rev 2009    14      778683 0.1797907
6 Rev 2010    21      680098 0.3087790
7 Rev 2011    31          NA        NA
8 Rev 2012    23          NA        NA

Kristoffer Magnusson 2013-08-17

Hi Andrew, the script didn't include total values for 2011 onwards. The totals were fetched from this table: http://www.nlm.nih.gov/bsd/.... It's been updated now, and so has my script.

Fulton Shannon 2012-10-26

Great piece of code. I am trying to extend the publication year range from 1970 to 2012, but the total_table only goes from 1947 to 2009 (as commented in your code), correct?

Is there a way I can download or update the table to include 2012?

Kristoffer Magnusson 2013-08-17

Hi Fulton. Here are two different ways to update the total_table: https://github.com/rpsychol...

Robert Fisher 2012-09-13

Amazing. I'm wondering how it might be applied to ProMED-mail archives now...

In the meantime, however, a much easier problem: how would one plot the data with a log2 y-axis? I'm finding the learning curve for ggplot2 is much steeper than that of R itself...

Kristoffer Magnusson 2012-10-12

Hi Robert. Did you figure out how to plot the data with a log2 y-axis? If not, you can use something like this: scale_y_continuous(trans = log2_trans()). This uses log2_trans() from the "scales" package, so you need that package loaded.

Ricardo Pietrobon 2012-08-20

Very nice, any plans to post it as a package on CRAN?

Kristoffer Magnusson 2012-10-12

Hi Ricardo, maybe in the future, but it's not really something I've thought about!

news 2012-06-06

While I actually like this post, I believe there was a spelling error near the end of the third section.

Kristoffer Magnusson 2012-06-27

Thanks for notifying me, it's been sorted!

Wouter 2012-05-18

I get this error after it has finished searching when trying to run your script (using RStudio):

Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
cannot open compressed file 'total_table', probable reason 'No such file or directory'

Any idea what is going on?

Hugo 2013-08-14

I get a similar error, in that it can't find 'total.table'. I have re-downloaded everything and tried to rename total.table to total_table...

Using Mac OS X 10.8.4

Dani 2013-07-26

Kris,
thanks for sharing this script. It's exactly what I was trying to do, with a lot of pain.
However, I don't find any file other than PubMedTrend.r, and I get the same error as Wouter. Also, I don't find the RunPubMed.R mentioned by Michael MacAskill.
I assume both come from the same zip file, but where is the link?
Cheers
D

Kristoffer Magnusson 2013-08-17

Hi, Dani. I have moved the script over to GitHub, so you should find everything there.

Michael MacAskill 2012-04-23

Works brilliantly; it's in a different league, performance- and reliability-wise, compared to RISmed.

One possible error in line 28 of RunPubMed.R

Should:
df.hits <- PubMedTrend()

actually be:
df.hits <- PubTotalHits()

Cheers.

Kristoffer Magnusson 2012-04-23

Glad you liked it! You are correct about the error on line 28, thanks for spotting it!

Robert Adams 2012-04-23

Hi Kris,

Thank you for this fine and incredibly useful script.

Unfortunately it does not work with the recent version of R (2.15), since some interfaces seem to have changed. When running the example script I receive an error message which already seems to distress some R users:

> source("PubMedTrend.R")
Error in source("PubMedTrend.R") :
7 arguments passed to .Internal(identical) which requires 6

http://r.789695.n4.nabble.c...

Since I am not too experienced yet, I do not have an idea which of the functions calls .Internal(). If you have a guess and can give a hint, I will try bugfixing.

Best regards,

Robert

Kristoffer Magnusson 2012-04-23

Hi Robert,

Unfortunately I couldn't replicate your error on R 2.15.0. Are you using the latest versions of "plyr", "XML" and "RCurl"?

I'm not explicitly calling .Internal in my script, but you could try running traceback() to get some more information on the error.

Robert Adams 2012-04-23

Hi Kris,

Strangely, it seems that neither your script nor R itself is the troublemaker, but RStudio; sorry for not mentioning the use of an alternative editor first. Due to the link given above I thought that the error was related to the updated version of R.

When directly inputting your script into RStudio (by loading it -> Ctrl+A -> Run) everything works well and the given examples are executable.

When loading your script with source followed by traceback():

> source("PubMedTrend.R")
Error in source("PubMedTrend.R") :
7 arguments passed to .Internal(identical) which requires 6
> traceback()
2: source("PubMedTrend.R")
1: source("PubMedTrend.R")

Nevertheless, using the native R console no error is thrown even when loading the script with source().

Sorry for the confusion and thank you again for providing the code!

Kristoffer Magnusson 2012-04-23

Thanks for posting your solution, I'm glad you got it to work. It's strange though, it runs just fine in RStudio for me.

Michael MacAskill 2012-04-23

Thanks for sharing, Kris, particularly since the code even includes thoughtful validation checking!

Looking forward to trying this out, but it will be a few days, I think. An advantage of the RISmed approach was that (when it worked...) entire records were returned. This meant that further analyses could be done beyond just the simple raw counts (e.g. examining trends within a particular journal). I'll be happy to at least get reliable count data, but does this method lend itself to getting individual record-level information as well?

Cheers,

Michael

Kristoffer Magnusson 2012-04-23

It's possible to extend this method to download complete records in XML. This script actually started out with that functionality, but it seemed a bit unnecessary to download the complete records when I mostly wanted to look at yearly counts.

I will release a version of my script that downloads complete records; I only need to add a function to batch-download articles, since PubMed has a retrieval cap of 10k articles.

Thanks for commenting,
Kris