Using ggplot2 to plot top 100 albums

March 22, 2012

I found out that data files had been made available for a Best of 2011 artist list, and I thought it’d be a great opportunity to learn some more about data management in R and ggplot2. I began by downloading and importing the tab-separated (TSV) data file.
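A minimal sketch of the import step. The file contents and column names below are invented stand-ins, since the real file is not reproduced here; only `read.delim()` itself is standard base R.

```r
# Hypothetical stand-in for the downloaded file: a tiny TSV in a temp file.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "rank\tartist\talbum\tplays",
  "1\tArtist A\tAlbum A\t1200",
  "2\tArtist B\tAlbum B\t950"
), tsv_path)

# read.delim() defaults to sep = "\t" and header = TRUE.
albums <- read.delim(tsv_path, stringsAsFactors = FALSE)
str(albums)
```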

Then I did some data cleanup, because one row just contained junk and some columns were unnecessary. I also removed all entries after row 100.
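The cleanup could look roughly like this. The data frame is made up (a junk first row, one unneeded `url` column, 120 rows in total), so the row and column positions are assumptions, not those of the real file.

```r
# Made-up data: a junk first row, an unneeded column, 120 rows in total.
raw <- data.frame(
  artist = c("junk", paste("Artist", 1:119)),
  plays  = c(NA, seq(119000, 1000, by = -1000)),
  url    = "not needed",
  stringsAsFactors = FALSE
)

clean <- raw[-1, ]                      # drop the junk row
clean <- clean[, c("artist", "plays")]  # keep only the needed columns
clean <- head(clean, 100)               # drop everything after row 100
rownames(clean) <- NULL                 # renumber rows from 1
```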

I did a search for missing values, but none were found.
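One common way to do such a check in base R is to count `NA`s per column. This is a sketch on invented data, not necessarily the check used here.

```r
# Invented example data with no missing values.
albums <- data.frame(
  artist = c("Artist A", "Artist B"),
  plays  = c(1200, 950)
)

na_per_column <- colSums(is.na(albums))  # number of NAs in each column
na_per_column
anyNA(albums)                            # TRUE if any value is missing
```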

The XML file contained information about the artists' locations, so I loaded it and cleaned it up a bit. The location column was a bit messy, so I edited it manually in Stata's data editor, which I figured was the easiest way. I then read the edited data file back into R and combined that data.frame with the rest of the data from the TSV file.
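Combining two data frames can be sketched with `merge()` on a shared key. Whether the join was done this way is an assumption, and the `artist` key and all values below are hypothetical.

```r
# Hypothetical play counts (standing in for the TSV data).
plays <- data.frame(
  artist = c("Artist A", "Artist B", "Artist C"),
  plays  = c(1200, 950, 700)
)

# Hypothetical locations (standing in for the edited XML data), in another order.
locations <- data.frame(
  artist   = c("Artist C", "Artist A", "Artist B"),
  location = c("Sweden", "USA", "UK")
)

# merge() matches rows on the key, so row order in the inputs does not matter.
combined <- merge(plays, locations, by = "artist")
combined
```

If both data frames are already in the same row order, `cbind()` would also work, but a keyed merge is safer when the order might differ.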

I tried plotting this data.frame with ggplot2, but the location variable contained 17 countries, which made for a messy plot. I therefore chose to group some countries under the label “other”.
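A minimal sketch of collapsing the less frequent countries into “other”. Which countries to keep is an assumption here, and the location values are invented.

```r
# Invented location values.
location <- c("USA", "UK", "USA", "Sweden", "Canada", "UK", "USA", "France")

# Countries to keep as their own category (an assumption for this sketch).
keep <- c("USA", "UK")

# Everything not in `keep` becomes "other"; the level order is set explicitly.
grouped <- ifelse(location %in% keep, location, "other")
grouped <- factor(grouped, levels = c(keep, "other"))
table(grouped)
```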

I still wasn’t satisfied with the plot, because it wasn’t sorted by album plays. I tried quite a lot of different methods of sorting the data.frame before figuring out how to do it successfully with reorder().
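The reason `reorder()` works where sorting the data.frame does not is that ggplot2 orders a discrete axis by factor levels, not by row order. A small sketch with invented values:

```r
# Invented data, deliberately not sorted by plays.
albums <- data.frame(
  artist = c("Artist A", "Artist B", "Artist C"),
  plays  = c(700, 1200, 950)
)

# reorder() returns a factor whose levels are sorted by the second argument.
albums$artist <- reorder(factor(albums$artist), albums$plays)
levels(albums$artist)
```

The rows stay where they are; only the level order changes, and that is what the plot axis follows.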

I wanted my plot to have readable decimal notation, so I created my own x-breaks.
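One way to build such breaks is sketched below. The maximum and the step size are assumed values; `format()` with `scientific = FALSE` keeps the labels out of scientific notation.

```r
# Assumed axis range and step size, for illustration only.
max_plays <- 2500000
x_breaks  <- seq(0, max_plays, by = 500000)

# Plain decimal labels with thousands separators instead of e.g. 5e+05.
x_labels <- format(x_breaks, big.mark = ",", scientific = FALSE, trim = TRUE)
x_labels

# These could then be passed to ggplot2 via
# scale_x_continuous(breaks = x_breaks, labels = x_labels).
```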

I also used my own custom colors for the plot's legend, which I saved in a list before calling ggplot2.
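Custom colors are often stored as a named character vector, where the names match the levels of the grouping variable. The hex codes and level names below are placeholders, not the colors used in the plot.

```r
# Placeholder colors; the names must match the factor levels used in the plot.
legend_colors <- c(
  USA   = "#1f77b4",
  UK    = "#d62728",
  other = "#7f7f7f"
)
legend_colors

# In ggplot2 this vector would be passed as
# scale_colour_manual(values = legend_colors) or scale_fill_manual(values = legend_colors).
```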

Then, at last, I drew the plot with ggplot2.

[Figure: plot 2011. By Kristoffer Magnusson]
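Putting the pieces together, a plot along these lines can be sketched as follows. The data, the choice of a dot plot (`geom_point()`), the colors and the axis titles are all assumptions for illustration, not a reconstruction of the original figure.

```r
library(ggplot2)

# Invented data with the three location groups.
albums <- data.frame(
  artist   = c("Artist A", "Artist B", "Artist C", "Artist D"),
  plays    = c(2400000, 1800000, 600000, 450000),
  location = factor(c("UK", "USA", "USA", "other"),
                    levels = c("USA", "UK", "other"))
)

legend_colors <- c(USA = "#1f77b4", UK = "#d62728", other = "#7f7f7f")
x_breaks <- seq(0, 2500000, by = 500000)

p <- ggplot(albums, aes(x = plays, y = reorder(artist, plays), colour = location)) +
  geom_point(size = 3) +
  scale_x_continuous(
    breaks = x_breaks,
    labels = format(x_breaks, big.mark = ",", scientific = FALSE, trim = TRUE)
  ) +
  scale_colour_manual(values = legend_colors) +
  labs(x = "Album plays", y = NULL, colour = "Location")

print(p)
```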

We can see that the plot is dominated by the USA and the UK, and that Adele and Lady Gaga got far more album plays than the rest. To give a summary of $location I used summary().

Which gave the following:
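For reference, this is how `summary()` behaves on a factor in general: it returns the count per level. The values below are made up and are not the counts from the album data.

```r
# Made-up factor; NOT the album data.
location <- factor(
  c("USA", "UK", "USA", "other", "UK", "USA"),
  levels = c("USA", "UK", "other")
)

# For a factor, summary() returns a named vector of counts per level.
loc_summary <- summary(location)
loc_summary
```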

Written by Kristoffer Magnusson, a researcher in clinical psychology.

