Colored plotmatrix in ggplot2

I feel like plotmatrix is a much-neglected ggplot2 function (it’s not even on the ggplot2 webpage). It’s the equivalent of plot(dataframe) from the core graphics package, with the added bonus of a kernel density plot along the diagonal.

The one thing that seems to be missing is the ability to color the points by some factor. I modified the plotmatrix function to enable this. Here’s an example.

My data looks something like this:

> head(data)
   a  b  c  d level
1  1  2  5  2 FALSE
2  1 21  2  1 FALSE
3 26  1 NA  1 FALSE
4  8 45 13 12 FALSE
5  3 25  4  4 FALSE
6  4 NA  1  2 FALSE

And when I plot it using plotmatrix I get the following:

plotmatrix(data, mapping=aes(colour=data$level))

None of the aesthetic mappings actually make it to the plot.  I changed it so that the aesthetics get passed to the points only:

plotmatrix2(data, mapping=aes(colour=data$level, shape=data$level))

And here’s the code:

Today Tumblr informed users about the Protect-IP Act and Stop Online Piracy Act by “censoring” users’ dashboards.
The above plot displays the increase in posts on Tumblr mentioning ‘SOPA’ or ‘censorship’ from the beginning of today up to just a few minutes ago. We launched the announcement just after 11:00 EST and were quickly producing 3.6 calls per second to representatives around the country.

Re-organizing factors in ggplot2 plots

When you have a factor with many levels and want to see a bar chart (kind of like a histogram for discrete data, as Hadley pointed out in the comments), it’s always more useful to see them sorted in some kind of order.  An example of this is shown below:

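library(ggplot2)

# Build a factor with four levels, each repeated a random number of times
# (rep() truncates the non-integer count to a whole number)
labels <- c(
  rep("a", 100*rexp(1)),
  rep("b", 100*rexp(1)),
  rep("c", 100*rexp(1)),
  rep("d", 100*rexp(1)))
x <- data.frame(labels = factor(labels), some.value = runif(length(labels)))
qplot(labels, data=x)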

This produces the following chart:

Not bad if you only have 4 levels in your factor, but what if you have 400?  I’d like to see the sequence of bars go a, d, b, c instead of a, b, c, d.  Here’s the trick to re-order your factor:

qplot(reorder(x$labels, as.numeric(x$labels), length))

The reorder function takes three parameters:

  • The factor that you want to reorder
  • A numeric vector of equal length (the values don’t matter for this specific task, as long as each number corresponds to one factor level)
  • A function to apply to the numeric vector
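As a quick sanity check of those three arguments, here is a minimal sketch on a toy factor. The factor f and its counts are made up for illustration and are not taken from the data above. It only inspects the resulting level order, so no plot is needed:

```r
# Hypothetical toy factor with known counts: a = 3, b = 1, c = 2
f <- factor(c("a", "a", "a", "b", "c", "c"))

# reorder() groups the numeric vector by level, applies length() to each
# group, and sorts the levels by that score, smallest first
by_count <- reorder(f, as.numeric(f), length)

levels(by_count)   # "b" "c" "a"
table(by_count)    # counts now appear in increasing order
```

Because qplot draws bars in level order, a factor reordered this way should plot with its shortest bar first.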

In this case, we want to re-order by the number of observations in each level, so we use the length function. The plot now looks like this:

To get them in descending order, I figured it was easiest to just multiply the length by -1.

qplot(reorder(x$labels, as.numeric(x$labels), function(y){-1*length(y)}))

Which will produce the following:
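The descending order can also be checked without a plot. This sketch uses a made-up toy factor (f is hypothetical, not the data above) and compares the negated-length trick with a base R alternative that sorts the table of counts directly:

```r
# Hypothetical toy factor with known counts: a = 3, b = 1, c = 2
f <- factor(c("a", "a", "a", "b", "c", "c"))

# Negating the count makes the most frequent level sort first
desc <- reorder(f, as.numeric(f), function(y) -length(y))
levels(desc)   # "a" "c" "b"

# Alternative: rebuild the factor with levels taken from the sorted counts
alt <- factor(f, levels = names(sort(table(f), decreasing = TRUE)))
levels(alt)    # "a" "c" "b"
```

Both give the same level order here; the second form avoids the dummy numeric vector, at the cost of being a little less compact inside a qplot call.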

Using idata.frames in R

I recently discovered the idata.frame (immutable data frame) datatype in the plyr package in R, which MASSIVELY speeds up *ply functions.  

Usually, when using any of the *ply functions, the pieces of the data.frame (or whatever other object you give it) are copied into new objects when being split apart.  With the idata.frame, they are simply passed by reference, saving lots of time.

Here’s an example. I’ll load 50,000 lines of server log data and count the number of individual user agents found per second.

> x <- read.csv("../data/haproxy.csv", header=F, nrows=50000)
> library(plyr)
> system.time(x1 <- ddply(x, .(V1, V19), nrow))
   user  system elapsed 
 43.790   8.104  51.893 
> system.time(x2 <- ddply(idata.frame(x), .(V1, V19), nrow))
   user  system elapsed 
  2.968   0.010   2.978 

Even with the conversion to idata.frame, it only took 3 seconds instead of nearly a minute. If anyone knows why the first line of any plyr function isn’t x<-idata.frame(x), I’d be curious to know.
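Since the log file above isn’t something you can download, here is a minimal sketch of the same comparison on synthetic data. The data frame logs, its columns second and agent, and the sizes are all assumptions for illustration. Timings depend heavily on the plyr version and on the data, so none are shown:

```r
library(plyr)

set.seed(42)

# Hypothetical stand-in for the server log: one row per request
n <- 20000
logs <- data.frame(
  second = sample(1:200, n, replace = TRUE),
  agent  = sample(paste0("ua", 1:20), n, replace = TRUE)
)

# Count rows per (second, agent) group on a regular data.frame ...
t_df  <- system.time(r1 <- ddply(logs, .(second, agent), nrow))

# ... and on an immutable data frame, where pieces are not copied
t_idf <- system.time(r2 <- ddply(idata.frame(logs), .(second, agent), nrow))

# Compare elapsed times on your own machine
rbind(data.frame = t_df, idata.frame = t_idf)[, "elapsed"]
```

Both calls should return one row per group, with the count in a column named V1, so the counts in each result add up to the number of input rows.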