I feel like plotmatrix is a much-neglected ggplot2 function (it’s not even on the ggplot2 webpage). It’s the equivalent of plot(dataframe) from the core graphics package, with the added bonus of a kernel density plot along the diagonal.
The one thing that seems to be missing is the ability to color the points by some factor. I modified the plotmatrix function to enable this. Here’s an example.
My data looks something like this:
> head(data)
   a  b  c  d level
1  1  2  5  2 FALSE
2  1 21  2  1 FALSE
3 26  1 NA  1 FALSE
4  8 45 13 12 FALSE
5  3 25  4  4 FALSE
6  4 NA  1  2 FALSE
And when I plot it using plotmatrix I get the following:
plotmatrix(data, mapping=aes(colour=data$level))

None of the aesthetic mappings actually make it to the plot. I changed it so that the aesthetics get passed to the points only:
plotmatrix2(data, mapping=aes(colour=data$level, shape=data$level))
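For comparison only, and not the modified plotmatrix2 function itself: base graphics' pairs() accepts per-point colours and shapes directly. The sketch below assumes a made-up data frame called toy with the same column names as the sample above; the values are randomly generated and purely illustrative.

```r
# Hypothetical toy data shaped like the example above (values are made up).
set.seed(1)
toy <- data.frame(
  a = rpois(50, 5),
  b = rpois(50, 10),
  c = rpois(50, 3),
  d = rpois(50, 4),
  level = rep(c(FALSE, TRUE), each = 25)
)

# One colour and one plotting symbol per value of the grouping column.
point_colours <- ifelse(toy$level, "firebrick", "steelblue")
point_shapes  <- ifelse(toy$level, 17, 16)

# pairs() draws the scatterplot matrix; col and pch are passed on to the points.
pairs(toy[, c("a", "b", "c", "d")], col = point_colours, pch = point_shapes)
```

This gives the colouring but not the density plots along the diagonal, which is what makes plotmatrix attractive in the first place.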

And here’s the code:
Today Tumblr informed users about the Protect-IP Act and Stop Online Piracy Act by “censoring” users’ dashboards.
The above plot displays the increase in posts on Tumblr mentioning ‘SOPA’ or ‘censorship’ from the beginning of today up to just a few minutes ago. We launched the announcement just after 11:00 EST and were quickly producing 3.6 calls per second to representatives around the country.
Drew Conway and John Myles White posted the code used in their O’Reilly book Machine Learning for Email on GitHub. Check it out to see implementations (all in R) of priority inbox, spam classification, and other algorithms.
I’m really looking forward to these “big data” visualization features for ggplot2.
via the Revolution Analytics blog
There’s some great stuff on this list. See anything that was missed?
Some that I’d add:
When you have a factor with many levels and want to see a bar chart (kind of like a histogram for discrete data, as Hadley pointed out in the comments), it’s always more useful to see them sorted in some kind of order. An example of this is shown below:
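library(ggplot2)
# Build a factor with four levels, each repeated a random number of times
# (rep() truncates the non-integer count).
labels <- c(
  rep("a", 100*rexp(1)),
  rep("b", 100*rexp(1)),
  rep("c", 100*rexp(1)),
  rep("d", 100*rexp(1)))
x <- data.frame(labels = factor(labels), some.value = runif(length(labels)))
# Bar chart of counts per level, in the default (alphabetical) level order
qplot(labels, data=x)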
This produces the following chart:

Not bad if you only have 4 levels in your factor, but what if you have 400? I’d like to see the sequence of bars go a, d, b, c instead of a, b, c, d. Here’s the trick to re-order your factor:
qplot(reorder(x$labels, as.numeric(x$labels), length))
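The reorder function takes three parameters: the factor to re-order (here x$labels), a vector of the same length whose values are grouped by factor level (as.numeric(x$labels)), and a function that is applied to each group to produce the value the levels are sorted by (length).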
In this case, we want to re-order by the number of observations in each level, so we use the length function. The plot now looks like this:

To get them in descending order, I figured it was easiest to just multiply the length by -1.
qplot(reorder(x$labels, as.numeric(x$labels), function(y){-1*length(y)}))
Which will produce the following:

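The effect of both calls can be checked without plotting by looking at the level order directly. This is a minimal sketch using made-up labels with fixed counts (3, 5, 1 and 4) rather than the random data above:

```r
# Made-up labels: "b" is the most common level, "c" the rarest.
labels <- factor(c(rep("a", 3), rep("b", 5), rep("c", 1), rep("d", 4)))

# Ascending by count: reorder() sorts the levels by the value FUN returns.
ascending <- reorder(labels, as.numeric(labels), length)
levels(ascending)   # "c" "a" "d" "b"

# Descending by count: negate the length, as in the post.
descending <- reorder(labels, as.numeric(labels), function(y) -length(y))
levels(descending)  # "b" "d" "a" "c"

# The observations themselves are unchanged; only the level order differs.
table(descending)
```

Because a bar chart draws one bar per level in level order, changing the levels is all that is needed to change the order of the bars.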
I recently discovered the idata.frame (immutable data frame) datatype in the plyr package in R, which MASSIVELY speeds up *ply functions.
Usually, when using any of the *ply functions, the pieces of the data.frame (or whatever other object you give it) are copied into new objects when being split apart. With the idata.frame, they are simply passed by reference, saving lots of time.
Here’s an example. I’ll load 50,000 lines of server log data and count the number of individual user agents found per second.
> x <- read.csv("../data/haproxy.csv", header=FALSE, nrows=50000)
> library(plyr)
> system.time(x1 <- ddply(x, .(V1, V19), nrow))
   user  system elapsed
 43.790   8.104  51.893
> system.time(x2 <- ddply(idata.frame(x), .(V1, V19), nrow))
   user  system elapsed
  2.968   0.010   2.978
Even with the conversion to idata.frame, it only took 3 seconds instead of nearly a minute. If anyone knows why the first line of any plyr function isn’t x<-idata.frame(x), I’d be curious to know.
R and Python: uniting against common enemies.