An R Script to Automatically Download PubMed Citation Counts by Year of Publication
I believe there’s some information to be gained from looking at publication trends over time. But it’s really troublesome to do by hand; fortunately, it’s not so troublesome to do in R. That said, data like these should be interpreted with extreme caution.
I tried to use the RISmed package to query PubMed, but found it to be really unreliable. Instead, my script queries PubMed’s E-utilities and parses the XML output. The E-utilities work like this:
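For example, a count request to the esearch endpoint might look like this (the search term here is my own illustration):

```
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&rettype=count&term=%22cognitive+behavioral+therapy%22
```

which returns a minimal XML document containing just the number of hits, along these lines (the count shown is illustrative):

```xml
<eSearchResult>
    <Count>5501</Count>
</eSearchResult>
```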
We can see after the URL that I’m telling the E-utilities that I want to search PubMed’s database and retrieve the result as ‘count’. This gives a minimal XML output containing only the number of hits for my search term, which is exactly what I want.
That’s really the basic gist of what my script is doing. If you look at the code (at the bottom of this post), you can see that I construct the query at the beginning of the script. The main part of the script is the for-loop: it loops through all the years in the specified range, retrieves the number of hits for each year, and pastes the results together into one data frame. To get counts for a specific year i, I append AND i[PPDAT] (the Print Dates Only tag) to each query.
Since I have all the necessary code in getCount(), I can run the same script for any number of queries.
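For reference, a minimal sketch of what a getCount()-style function could look like in base R (this is my own simplified version, not the script’s actual code):

```r
# Sketch: fetch the number of PubMed hits for one query in one year
# via the E-utilities' esearch endpoint with rettype=count.
getCount <- function(query, year) {
  term <- paste0(query, " AND ", year, "[PPDAT]")
  url  <- paste0("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                 "?db=pubmed&rettype=count&term=",
                 URLencode(term, reserved = TRUE))
  xml  <- paste(readLines(url, warn = FALSE), collapse = "")
  # The minimal XML contains a single <Count> element holding the hits
  as.numeric(sub(".*<Count>([0-9]+)</Count>.*", "\\1", xml))
}
```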
By doing that I end up with a data frame containing data for all the queries, arranged in long format. The end of the script calculates relative counts by dividing each year’s matches by the total number of publications that year. I’ve also added a function that prints out the total number of hits for each query.
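The relative-count step boils down to something like this (the object and column names are my assumptions, not necessarily the script’s):

```r
# df:    one row per query/year, with a 'count' column of yearly hits
# total: total number of PubMed publications per year
df$relative <- df$count / total$count[match(df$year, total$year)]
```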
The problem with using DP is in how PubMed handles articles. If an article is published electronically at the end of, say, 2011 but printed in 2012, that article will be counted in both 2011 and 2012 when I search those two years individually (which my script does).
By using PPDAT I will miss some articles that don’t have a print publication date. If you’d rather get some duplicates in your data than miss any citations, you can easily change PPDAT to DP; the script will run the same either way.
To illustrate the differences I did a quick search using a Cognitive Behavioral Therapy-query. When searching with PubMed’s website I specified the year range as 1940:2012[DP]/[PPDAT], and used the same interval in R.
The correct number of hits is 5501, which is what PubMed’s website returns with the [DP] tag; it’s also the number reported when no time interval is specified at all. Consequently, in this scenario my script would be about 2.5% under the correct number with the [PPDAT] tag, and about 3% over it with [DP]. Other queries may generate different results, but the error seems small enough not to warrant any changes to the code. Duplicates could be avoided by downloading PMIDs with every search and checking adjacent years for duplicates, but that change would require an unnecessary amount of data transfer for an error that appears to be only about 3%.
It’s really simple to use this script. Assuming you have R installed, all you need to do is download the files, point R to that directory, and tell it to run “PubMedTrend.R”. Like this:
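Assuming the files live in a directory called PubMedTrend (the path is just an example):

```r
setwd("~/PubMedTrend")    # point R to the download directory
source("PubMedTrend.R")   # load the script's functions
```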
Once that is done you specify your query like this:
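For example, the queries could be stored as a named character vector, where the names serve as short labels (the labels and search strings here are my own illustration):

```r
query <- c("cbt" = "\"cognitive behavioral therapy\"",
           "pdt" = "\"psychodynamic therapy\"")
```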
Now all you have to do is execute my PubMedTrend() function for those queries and save the results in a data frame:
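That is, roughly:

```r
df <- PubMedTrend(query)
```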
The content of df will be structured like this:
The default year range is set to 1950–2009, but can easily be changed, like this:
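For instance (the argument names here are illustrative; check the script for the actual ones):

```r
df <- PubMedTrend(query, yrStart = 1960, yrEnd = 2012)
```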
PubMed’s total counts (as posted in a table on their website) haven’t been updated since April 8, 2011, but the de facto totals have changed since then, because PubMed is constantly adding new citations (both new and old articles). This is easily remedied by looping through 1950[PPDAT], 1951[PPDAT] … 2012[PPDAT] (or you can use [DP]). I did that for you and made a graph of the two data sets; as you can see, there are some differences, but they’re not that big. Nonetheless, I’ve included both files with my script.
While searching, a progress bar will show the progress, and the search will look like this once completed.
Using the function to get total hits will give this output:
In PubMed’s E-utilities usage guidelines it’s specified that:
In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Failure to comply with this policy may result in an IP address being blocked from accessing NCBI.
To comply with this, my script waits 0.5 sec after each iteration, resulting in (theoretically) at most 2 URL GETs per second. This means that fetching 100 yearly counts will take a minimum of 50 seconds per query. You can change the wait time if you feel that 0.5 sec is too low or too high.
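In R, that wait is a single call at the end of each loop iteration:

```r
Sys.sleep(0.5)  # pause 0.5 s after each request, i.e. at most 2 requests/s
```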
Update (2013 August 2): I am currently updating this script and moving it to a GitHub repo, so it will be easier to maintain. You can find the repo here.
Published April 19, 2012 (View on GitHub)