Using ggplot2 to plot last.fm top 100 albums
I found out that last.fm had made data files available for their Best of 2011 artist list, and I thought it’d be a great opportunity to learn more about data management in R and ggplot2.
I began by downloading and importing the tab-separated (TSV) data file from last.fm.
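The import step can be sketched roughly as follows. This is only an illustration: the file contents and the column names (`rank`, `artist`, `album`, `plays`) are made up, since the real last.fm file layout isn't shown here.

```r
# Hypothetical stand-in for the downloaded file: write a tiny TSV to a temp path.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "rank\tartist\talbum\tplays",
  "1\tArtist A\tAlbum A\t2000000",
  "2\tArtist B\tAlbum B\t1500000"
), tsv_path)

# read.delim() defaults to tab-separated input with a header row.
albums <- read.delim(tsv_path, stringsAsFactors = FALSE)
str(albums)
```

With the real download, `tsv_path` would simply point at the saved file.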
Then I did some data cleanup, because one row just contained junk and some columns were unnecessary. I also removed all entries after row 100.
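A minimal sketch of that kind of cleanup, using made-up data. Which row was junk and which columns were dropped are assumptions here, not the actual ones from the last.fm file.

```r
# Made-up data: one junk row at the top, 101 real rows, one unneeded column.
albums <- data.frame(
  rank   = 0:101,
  artist = c("junk", paste("Artist", 1:101)),
  plays  = c(0, seq(2020000, 20000, by = -20000)),
  url    = "http://example.com",
  stringsAsFactors = FALSE
)

albums <- albums[-1, ]                            # drop the junk row by position
albums <- albums[, c("rank", "artist", "plays")]  # keep only the needed columns
albums <- head(albums, 100)                       # drop everything after row 100
rownames(albums) <- NULL                          # renumber rows after subsetting
```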
I did a search for missing values, but none were found.
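One way to do such a check in base R, shown on a small made-up data.frame (the column names are assumptions):

```r
# Made-up example data with no missing values.
df <- data.frame(
  artist = c("Artist A", "Artist B", "Artist C"),
  plays  = c(300, 200, 100),
  stringsAsFactors = FALSE
)

any_missing <- any(is.na(df))         # TRUE if any cell is NA
missing_per_column <- colSums(is.na(df))  # NA count for each column
print(any_missing)
print(missing_per_column)
```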
The XML file contained information about the artists’ locations, so I
loaded it and cleaned it up a bit. The location column was a bit messy, so I
edited it manually in Stata’s data editor, which I figured was the easiest
way. I then read the edited data file back into R and combined that
data.frame with the rest of the data from the TSV file.
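Combining two data.frames like this is typically done with `merge()`. The sketch below assumes both tables share an `artist` column to join on, which is a guess about the real data, and it uses placeholder values.

```r
# Made-up stand-ins for the TSV data and the hand-edited location data.
albums <- data.frame(
  artist = c("Artist A", "Artist B", "Artist C"),
  plays  = c(300, 200, 100),
  stringsAsFactors = FALSE
)
locations <- data.frame(
  artist   = c("Artist C", "Artist A", "Artist B"),
  location = c("Sweden", "USA", "UK"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every album even if its artist has no location entry.
combined <- merge(albums, locations, by = "artist", all.x = TRUE)
print(combined)
```

Row order in the two inputs doesn't matter, because `merge()` matches on the key column.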
I tried plotting this data.frame with ggplot, but the location
variable contained 17 countries, which made for a messy plot. I therefore
chose to group some countries under the label “other”.
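The grouping could look something like this. Which countries keep their own label is an assumption in this sketch (`keep` is a hypothetical name), and the location values are made up.

```r
# Made-up location values.
location <- c("USA", "UK", "Sweden", "Canada", "USA", "France")

# Assumption: only these countries keep their own label.
keep <- c("USA", "UK")

# Everything not in `keep` is collapsed into a single "other" group.
grouped <- ifelse(location %in% keep, location, "other")
print(table(grouped))
```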
I still wasn’t satisfied with the plot, because it wasn’t sorted by
album plays. I tried quite a lot of different methods of sorting the
data.frame before figuring out how to do it successfully with
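The snippet that originally followed isn't reproduced here, so this is not necessarily the method I ended up with. One common approach, shown as a sketch on made-up data, is `reorder()`, which orders a factor's levels by another variable. ggplot2 draws discrete axes in level order, so the plot comes out sorted.

```r
# Made-up data, deliberately not sorted by plays.
df <- data.frame(
  album = c("Album A", "Album B", "Album C"),
  plays = c(200, 300, 100),
  stringsAsFactors = FALSE
)

# Turn album into a factor whose levels are ordered by increasing plays.
df$album <- reorder(df$album, df$plays)
print(levels(df$album))
```

Sorting the rows of the data.frame alone usually isn't enough, because the axis follows the factor levels rather than the row order.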
I wanted my plot to have readable decimal notation so I created my own x-breaks.
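A sketch of how custom breaks with plain decimal labels might be built. The break positions are made up, and `x_breaks` and `x_labels` are hypothetical names.

```r
# Assumed break positions; the real ones depend on the range of album plays.
x_breaks <- seq(0, 2000000, by = 500000)

# scientific = FALSE avoids labels like 1e+06; big.mark adds thousands separators.
x_labels <- format(x_breaks, big.mark = ",", scientific = FALSE, trim = TRUE)
print(x_labels)

# In a ggplot2 plot these would be passed to the x scale, e.g.
# scale_x_continuous(breaks = x_breaks, labels = x_labels)
```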
I also used my own custom colors for the plot’s legend, which I saved in
a list before initiating
Then, at last, I drew the plot with
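The original plotting call isn't reproduced here. As a rough sketch only, a plot along these lines could be drawn as below, assuming the ggplot2 package is installed. The data, the choice of `geom_point()`, the color values, and the name `legend_colors` are all assumptions. The colors are stored in a named character vector, which is what `scale_colour_manual()` accepts.

```r
library(ggplot2)

# Made-up data standing in for the cleaned top-100 data.frame.
df <- data.frame(
  album    = c("Album A", "Album B", "Album C", "Album D"),
  plays    = c(2000000, 1500000, 400000, 300000),
  location = c("UK", "USA", "other", "USA"),
  stringsAsFactors = FALSE
)
df$album <- reorder(df$album, df$plays)  # sort the album axis by plays

# Custom legend colors, keyed by the values of the location variable.
legend_colors <- c(USA = "#1b9e77", UK = "#d95f02", other = "#7570b3")

p <- ggplot(df, aes(x = plays, y = album, colour = location)) +
  geom_point(size = 3) +
  scale_colour_manual(values = legend_colors) +
  labs(x = "Album plays", y = NULL, colour = "Location")

print(p)  # render the plot
```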
We can see that the plot is dominated by the USA and the UK, and that Adele and
Lady Gaga got far more album plays than the rest. To give a
$location I used
Which gave the following:
Published March 22, 2012