A couple of summers ago I had a 3-hour delay flying from Key West to New York. I was frustrated by everything except the free wifi at the airport. So I decided to download the data on every takeoff and landing for all major airline carriers for 2010 and most of 2011 to see which airlines and routes are the worst, so that I could avoid this issue in the future. I only looked at takeoff delays, because that’s what was on my mind as I drank terrible coffee in the airport while my friends and family were out golfing and riding jet skis.

The graphs show the average delay time on each route (red is longer). Some other interesting facts:

  • Southwest has the highest percentage of delayed flights (60%) when you include very short delays (5-10 minutes), which probably have no effect on arrival time.
  • Southwest also has the lowest percentage of flights that are delayed for 40+ minutes (8% for >= 40 min, 4% for >= 60 min).
  • JetBlue and Pinnacle (some regional airline that I had never heard of?) have the highest percentage of 40+ minute delays (both around 15% for >= 40 min and 10% for >= 60 min).
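
For anyone who wants to reproduce numbers like these, here is a minimal sketch in R of the per-carrier calculation. The data frame, its column names (`carrier`, `dep_delay`), and the values are made up for illustration and are not the schema of the data I downloaded; it also assumes a flight counts as "delayed" once it leaves 5 or more minutes late.

```r
# Toy stand-in for the flight data. Column names and values are assumptions
# for illustration, not the original download.
flights <- data.frame(
  carrier   = c("WN", "WN", "WN", "WN", "B6", "B6", "B6", "B6"),
  dep_delay = c(5, 8, 0, 45, 0, 62, 41, 3)  # departure delay in minutes
)

# Share of flights per carrier whose departure delay meets a threshold.
# The mean of a logical vector is the fraction of TRUE values.
delay_share <- function(df, threshold) {
  tapply(df$dep_delay >= threshold, df$carrier, mean)
}

any_delay  <- delay_share(flights, 5)   # includes very short delays
long_delay <- delay_share(flights, 40)  # 40+ minute delays only

print(round(100 * any_delay))
print(round(100 * long_delay))
```

Running the same function at several thresholds is what separates the "lots of short delays" carriers from the "fewer but longer delays" ones.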

This is just something I did on the airplane on the way home, but it would be fun to build this out as an actual web service (any volunteers for help?).

Oh, and they also lost my luggage on that flight.

This is something I’ve always been curious about. I love the empty windows right around Christmas, Thanksgiving, and July 4. The original article says that this data is about births in the US from 1973-1999. I would imagine that there are very few people born on September 11 in the last 12 years.

I’d also bet that this varies by occupation. For example, I’ve had friends who are teachers tell me that the best time to have a baby is 6 weeks before the end of the school year, so you can coast from maternity leave into summer vacation. Any teacher who gets pregnant in February or March is probably having a surprise.

The folks over at RStudio have been killing it lately. A few months ago they released their integration with knitr and RMarkdown. We use it here at Tumblr for report generation and I absolutely love it.

Most recently, they released Shiny, which lets you build interactive visualizations in the browser with minimal effort. I haven’t played around with it yet but I’m really looking forward to doing so.

If you’ve built any interesting Shiny apps yet, leave a link in the comments. I’d love to see them.

Canopy is a project that came out of the DataKind event a few weeks ago. The NYC Parks Department brought full dumps of their databases and a handful of questions. Volunteers brought their modeling, data munging, visualizing, and overall hacking skills.

I was a “data ambassador” for one of the groups, which means I got to look at the data in advance to make sure that we could 1) easily open and start working with the data that was provided and 2) actually accomplish what the Parks Department was asking with the information they provided.

Our project was to provide a good understanding of what the tree diversity is like across the city, and how it is changing over time. The results are above: an interactive map where you can find all of the tree types in the city, the diversity of each census block (“diversity” being the number of unique species seen), some information about each tree type, and more. It was in a near-complete state after just one full day of work from Christopher Reed, Andrew Hill, Brian Abelson, Bennett Andrews, and myself. Chris did all of the front-end work and has been updating the project relentlessly, making it better pretty much every day. Andrew set up the cartography database (CartoDB), which exposes an amazing API for querying the data. Bennett pulled in all of the tree information from Encyclopedia of Life. And Brian and I took the raw data provided by the Parks Department and transformed it into a workable shape.
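
To make the diversity measure concrete, here is a minimal sketch in R of counting unique species per census block. The data frame, the column names (`census_block`, `species`), and the rows are hypothetical; this is an illustration of the definition above, not the code from the event or the schema of the Parks Department data.

```r
# Hypothetical tree records: one row per tree. Names and values are
# made up for illustration.
trees <- data.frame(
  census_block = c("A", "A", "A", "B", "B", "C"),
  species      = c("London planetree", "Pin oak", "Pin oak",
                   "Honeylocust", "Honeylocust", "Ginkgo")
)

# Diversity as defined in the post: number of unique species seen per block.
diversity <- tapply(trees$species, trees$census_block,
                    function(s) length(unique(s)))

print(diversity)
```

A plain species count ignores how evenly the species are spread within a block, which is a trade-off of keeping the definition this simple.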

This is something that the Parks Department probably couldn’t have thrown together on its own (especially this quickly), and now they have a tool that they can use and share. Huge thanks to Jake Porway and DataKind for putting events like these together. For more information on DataKind, check out Jake’s talk from DataGotham.

Colored plotmatrix in ggplot2

I feel like plotmatrix is a much-neglected ggplot2 function (it’s not even on the ggplot2 webpage). It’s the equivalent of plot(dataframe) from the core graphics package, with the added bonus of a kernel density plot along the diagonal.
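
For reference, here is what the base graphics version looks like, as a minimal sketch using the built-in iris data set (not my data). Calling plot() on a data frame of numeric columns draws a scatterplot matrix via pairs(), and a `col` vector with one entry per row colors the points. The three color names are arbitrary choices for illustration.

```r
# Base graphics scatterplot matrix, colored by a factor.
# iris ships with R; the color names are arbitrary.
species_col <- c("black", "red", "blue")[as.integer(iris$Species)]

# plot() on a data frame of numeric columns draws a pairs() matrix;
# col is passed through to the points in each panel.
plot(iris[, 1:4], col = species_col)
```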

The one thing that seems to be missing is the ability to color the points by some factor. I modified the plotmatrix function to enable this. Here’s an example.

My data looks something like this:

> head(data)
   a  b  c  d level
1  1  2  5  2 FALSE
2  1 21  2  1 FALSE
3 26  1 NA  1 FALSE
4  8 45 13 12 FALSE
5  3 25  4  4 FALSE
6  4 NA  1  2 FALSE

And when I plot it using plotmatrix I get the following:

plotmatrix(data, mapping=aes(colour=data$level))

None of the aesthetic mappings actually make it to the plot. I changed it so that the aesthetics get passed to the points only:

plotmatrix2(data, mapping=aes(colour=data$level, shape=data$level))

And here’s the code: