Adam Laiacano

I'm a data engineer at Tumblr and this is my blog. I write mostly about personal projects, data science, R/Python, and various curiosities.

  1. SADDLE: Scala Data Library

    I sent out a tweet last night asking if there are any good data libraries for Scala. The next morning, Saddle gets released. I’ve heard it described as “Pandas for Scala,” which is great because nothing beats having data frames ported to a new language.

  2. 2013-04-02
    #tech #scala #pandas #programming
  3. The videos from PyData NYC 2012 are now online. There are about 40 videos, most of which are 30-45 minutes long, with some lightning talks thrown in. They cover network analysis, literate programming with the IPython Notebook, R/Hadoop/Pig integration, plotting with matplotlib, and much more.

  4. 2012-11-13
    #data science #pydata #python #machine learning #programming #hadoop #big data
  5. These are from the slides of Drew Conway’s Monktoberfest talk, in which he explains how to perform the seemingly trivial actions of asking a good question and then answering it by analyzing some data.

    The key takeaway is summarized by JD “Mister McButter” Long:

  6. 2012-10-04
    #data science #stack overflow #github #programming
  7. I find this interaction funnier than I probably should.

  8. 2012-07-25
    #lol #programming #regex #twitter
  9. Jeffrey Horner: Announcing The R markdown Package

    jeffreyhorner:

    R Flavored Markdown is a plain-text formatting syntax for creating documents that can be rendered to HTML. In fact it’s like HTML, but simpler. R Flavored Markdown is a variant of original Markdown with a few additional features

    This is a really exciting step towards reproducible research. The markdown code below creates this output:

    # Normal Distributions Functions in R
    
    Density, distribution function, quantile function and random
    generation for the normal distribution with mean equal to ‘mean’
    and standard deviation equal to ‘sd’.
    
    Use them this way:
    
    ```{r}
         dnorm(x, mean = 0, sd = 1, log = FALSE)
         pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
         qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
         rnorm(n, mean = 0, sd = 1)
    ```
    
    The math behind the code:
    
    $$latex  
    f(x) = \frac{1}{(\sigma\sqrt{2 \pi})} e^{-(\frac{(x - \mu)^2}{2 \sigma^2})}
    $$
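
    The density formula in that sample is easy to check against dnorm() in base R. A minimal sketch; the values of x, mu, and sigma are arbitrary choices of mine, not part of Jeff's example:

    ```r
    # Arbitrary illustrative inputs (assumptions, not from the post above)
    x <- c(-1, 0, 0.5, 2)
    mu <- 0
    sigma <- 1

    # The density written out term by term, as in the formula
    by_hand <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

    # The built-in density from the code chunk
    built_in <- dnorm(x, mean = mu, sd = sigma)

    # TRUE when the two agree to numerical tolerance
    densities_match <- isTRUE(all.equal(by_hand, built_in))
    print(densities_match)
    ```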
    
    

    There is still room at the NY Open Statistical Meetup tomorrow to hear Jeff and others talk about creating dynamic reports in R.

  10. 2012-06-04
    #rstats #programming
  11. pull request merged!

    (Source: git-animals, via dallas)

  12. 2012-06-04
    #git #github #programming #gif
  13. 2-minute R tutorial videos

    This is an interesting site with 90 two-minute R tutorial videos. I’ve only watched a couple, but the narration is pretty high-octane as far as programming tutorial videos go.

    Some examples:

    • 013 how to read spss, stata, and sas files into r
    • 029 how to run analyses across multiple categories of a data table with the tapply and aggregate functions in r
    • 083 how to plot residuals from a regression in r (assuming you know some fancy statistics)
    • 085 how to export or save a plot in r
    • 089 how to run a block of commands at start-up to do stuff like setting your CRAN mirror permanently with r
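
    To give a flavor of what video 029 covers, here is a minimal sketch of my own using the mtcars data that ships with R. It is not taken from the videos, and the choice of columns is just for illustration:

    ```r
    # Mean miles per gallon by cylinder count, computed two ways
    by_tapply <- tapply(mtcars$mpg, mtcars$cyl, mean)
    by_aggregate <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

    # tapply() returns a named array; aggregate() returns a data frame
    print(by_tapply)
    print(by_aggregate)
    ```

    Both give the same group means; which one is nicer mostly depends on whether you want a vector or a data frame back.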
  14. 2012-05-31
    #rstats #r #data science #programming
  15. The March of Progress

    alandipert:

    • 1980: C

      printf("%10.2f", x);
    • 1988: C++

      cout << setw(10) << setprecision(2) << showpoint << x;
    • 1996: Java

      java.text.NumberFormat formatter = java.text.NumberFormat.getNumberInstance(); formatter.setMinimumFractionDigits(2); formatter.setMaximumFractionDigits(2); String s = formatter.format(x); for (int i = s.length(); i < 10; i++) System.out.print(' '); System.out.print(s);
    • 2004: Java

      System.out.printf("%10.2f", x);
    • 2008: Scala and Groovy

      printf("%10.2f", x)

    via horstmann.com
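
    For what it’s worth, base R sits at the 1980 end of this timeline. A quick sketch (the value of x is just an example I picked):

    ```r
    x <- 3.14159

    # Width 10, two decimal places, right-aligned: the same format string as C
    formatted <- sprintf("%10.2f", x)
    cat(formatted, "\n")
    ```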

  16. 2012-05-12
    #programming #java #c #scala #groovy
  17. This talk sums up several of my favorite things about programming in Python. The day after I watched this, I had to use the Google API client, which is summarized in the video still above.

    It’s massive, way too complicated, and poorly documented (there are the pydocs, but they don’t help you with workflow). Every code example that I found for working with Google Docs spreadsheets was different, and full of comments from people claiming that it did or did not work for them.

    If you’re looking to work on a project that will be used widely, maybe try rewriting this client.

  18. 2012-03-22
    #python #programming #google #pycon #pydata
  19. Using idata.frames in R

    I recently discovered the idata.frame (immutable data frame) datatype in the plyr package in R, which MASSIVELY speeds up *ply functions.  

    Usually, when using any of the *ply functions, the pieces of the data.frame (or whatever other object you give it) are copied into new objects when being split apart.  With the idata.frame, they are simply passed by reference, saving lots of time.

    Here’s an example. I’ll load 50,000 lines of server log data and count how many times each user agent appears per second.

    > x <- read.csv("../data/haproxy.csv", header=FALSE, nrows=50000)
    > library(plyr)
    > system.time(x1 <- ddply(x, .(V1, V19), nrow))
       user  system elapsed 
     43.790   8.104  51.893 
    > system.time(x2 <- ddply(idata.frame(x), .(V1, V19), nrow))
       user  system elapsed 
      2.968   0.010   2.978 

    Even with the conversion to idata.frame, it took only 3 seconds instead of nearly a minute. If anyone knows why the first line of every plyr function isn’t x <- idata.frame(x), I’d be curious to hear it.
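
    If you want to try this without my log file, here is a rough sketch on synthetic data. The column names, sizes, and seed are made up for illustration, and it assumes plyr is installed and that your version of plyr still accepts an idata.frame in ddply(). Timings will depend on your machine and data, so don’t expect the numbers above.

    ```r
    library(plyr)  # assumes plyr is installed

    # Synthetic stand-in for the log file: one row per request.
    # Column names and sizes are hypothetical.
    set.seed(42)
    n <- 50000
    logs <- data.frame(
      second = sample(1:600, n, replace = TRUE),
      agent  = sample(paste0("agent", 1:20), n, replace = TRUE)
    )

    # Same grouping and counting as above, on a regular and an immutable data frame
    time_df  <- system.time(counts_df  <- ddply(logs, .(second, agent), nrow))
    time_idf <- system.time(counts_idf <- ddply(idata.frame(logs), .(second, agent), nrow))

    # User, system, and elapsed seconds for each version
    print(rbind(data.frame = time_df, idata.frame = time_idf)[, 1:3])
    ```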

  20. 2011-09-30
    #rstats #programming #plyr #idata.frame