I sent out a tweet last night asking if there were any good data libraries for Scala. The next morning, Saddle was released. I’ve heard it described as “Pandas for Scala,” which is great because nothing beats having data frames ported to a new language.
The videos from PyData NYC 2012 are now online. There are about 40 videos, most of which are 30-45 minutes long, with some lightning talks thrown in. They cover network analysis, literate programming with the IPython Notebook, R/Hadoop/Pig integration, plotting with matplotlib, and much more.
These are Drew Conway’s slides from his Monktoberfest talk, in which he explains how to perform the seemingly trivial actions of asking a good question and then answering it by analyzing some data.
The key takeaway is summarized by JD “Mister McButter” Long:

I find this interaction funnier than I probably should.
R Flavored Markdown is a plain-text formatting syntax for creating documents that can be rendered to HTML. In fact it’s like HTML, but simpler. R Flavored Markdown is a variant of the original Markdown with a few additional features.
This is a really exciting step towards reproducible research. The markdown code below creates this output:
# Normal Distributions Functions in R
Density, distribution function, quantile function and random
generation for the normal distribution with mean equal to ‘mean’
and standard deviation equal to ‘sd’.
Use them this way:
```{r}
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
```
The math behind the code:
$$latex
f(x) = \frac{1}{(\sigma\sqrt{2 \pi})} e^{-(\frac{(x - \mu)^2}{2 \sigma^2})}
$$
There is still room at the NY Open Statistical Meetup tomorrow to hear Jeff and others talk about creating dynamic reports in R.
pull request merged!
(Source: git-animals, via dallas)
This is a pretty interesting site with 90 2-minute R tutorial videos. I’ve only watched a couple but the narration is pretty high-octane as far as programming tutorial videos go.
Some examples:
1980: C
printf("%10.2f", x);
1988: C++
cout << setw(10) << setprecision(2) << showpoint << x;
1996: Java
java.text.NumberFormat formatter = java.text.NumberFormat.getNumberInstance();
formatter.setMinimumFractionDigits(2);
formatter.setMaximumFractionDigits(2);
String s = formatter.format(x);
for (int i = s.length(); i < 10; i++) System.out.print(' ');
System.out.print(s);
2004: Java
System.out.printf("%10.2f", x);
2008: Scala and Groovy
printf("%10.2f", x)
via horstmann.com
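For comparison, the same format specification (total width 10, two digits after the decimal point) carries over to Python almost unchanged. This is only a minimal sketch: the sample value of `x` is my own assumption, not something from the original list, and the last variant imitates the hand-rolled padding of the 1996 Java example rather than anything you would normally write.

```python
# Sample value chosen for illustration only; it is not from the original post.
x = 3.14159

# printf-style formatting: right-aligned in a field of width 10, 2 decimals.
old_style = "%10.2f" % x

# The same specification via the built-in format() and via an f-string.
new_style = format(x, "10.2f")
f_string = f"{x:10.2f}"

# Imitation of the 1996 Java approach: format to 2 decimals first,
# then left-pad with spaces by hand until the string is 10 characters wide.
rounded = "%.2f" % x
manual = " " * (10 - len(rounded)) + rounded

# repr() makes the leading padding visible when printed.
print(repr(old_style))
```

All four variants produce the same ten-character string, which is more or less the point of the timeline: the one-line format string does everything the manual version does.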
This talk sums up several of my favorite things about programming in Python. The day after I watched this, I had to use the Google API client, which is summarized in the video still above.
It’s massive, way too complicated, and poorly documented (there are the pydocs, but they don’t help you with workflow). Every code example that I found for working with Google Docs spreadsheets was different, and full of comments from people claiming that it did or did not work for them.
If you’re looking to work on a project that will be used widely, maybe try rewriting this client.
I recently discovered the idata.frame (immutable data frame) datatype in the plyr package in R, which MASSIVELY speeds up *ply functions.
Usually, when using any of the *ply functions, the pieces of the data.frame (or whatever other object you give it) are copied into new objects when being split apart. With the idata.frame, they are simply passed by reference, saving lots of time.
Here’s an example. I’ll load 50,000 lines of server log data and count the number of individual user agents found per second.
> x <- read.csv("../data/haproxy.csv", header=F, nrows=50000)
> library(plyr)
> system.time(x1 <- ddply(x, .(V1, V19), nrow))
user system elapsed
43.790 8.104 51.893
> system.time(x2 <- ddply(idata.frame(x), .(V1, V19), nrow))
user system elapsed
2.968 0.010 2.978
Even with the conversion to idata.frame, it only took 3 seconds instead of nearly a minute. If anyone knows why the first line of any plyr function isn’t x<-idata.frame(x), I’d be curious to know.