Adam Laiacano

I'm a data engineer at Tumblr and this is my blog. I write mostly about personal projects, data science, R/Python, and various curiosities. You can read more about me here if you'd like.

  1. John Myles White was kind enough to come by Tumblr HQ this week and give a talk about the advantages that MAB (Multi-Armed Bandit) testing provides over traditional A/B testing.

    Most of the content is drawn from his ebook “Bandit Algorithms for Website Optimization.”

  2. 2012-12-06
    #statistics #data mining #machine learning #ab test #multi-armed bandit #computer science
  3. The videos from PyData NYC 2012 are now online. There are about 40 videos, most of which are 30-45 minutes long, with some lightning talks thrown in. They cover everything from network analysis and literate programming with IPython Notebook to R/Hadoop/Pig integration, plotting with matplotlib, and much more.

  4. 2012-11-13
    #data science #pydata #python #machine learning #programming #hadoop #big data
  5. Upcoming Events

    I’ve got a few cool things coming up:

    DataKind - Sept. 7-9. I’ll be leading a team of volunteer data hackers to work with NYC governmental agencies and make sense of the data that they collect. DK will be announcing all of the details soon, but it sounds like there are a lot of interesting projects to work on over the course of the weekend. My group’s work from the last DataDive was presented at the United Nations General Assembly, and others’ work on NYC’s Stop and Frisk policy has been getting a lot of media attention lately.

    DataGotham Conference - Sept. 13-14. DG is a conference celebrating the data community in NYC. There is a great list of speakers and tutorials so far, primarily by engineers, researchers, and data scientists at New York-based institutions. Follow @DataGotham to see when the schedule is announced, but it sounds like I’ll be on a really interesting panel.

    Bad Data Handbook - Nov. 2012 (est). I contributed a chapter to this collection of tips, tricks, and “war stories” on working with disorganized, inconsistent, and overall messy data.

  6. 2012-08-13
    #data science #machine learning #new york city #tech
  7. DataGotham is a 1.5-day conference focusing on the data analysis community in New York City. The focus will be on tutorials, stories, and ideas rather than tools and “enterprise solutions.”

    The call for proposals was just posted, so get over there and share your work.

  8. 2012-07-17
    #data science #new york city #nyc #hadoop #statistics #rstats #machine learning
  9. CONE Welder: Collaborative Observatory for Natural Environments

    This could be the basis of a really fun image recognition project.

    With over 38,849 bird photos, this collection can serve as a training set for visual bird detection and image processing research.

    Via @siah

  10. 2012-02-14
    #machine learning #image recognition #birds
  11. Python function for sampling from an arbitrary discrete distribution

    Sampling is fun, right?  Here’s a simple implementation of a slice sampler for discrete probability distributions.
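For reference, here is a minimal sketch of what a discrete slice sampler can look like. It assumes the signature implied by the calls below (`px`, `N`, and an optional `x`), defaults to 0-based indices when `x` is omitted, and adds a hypothetical `seed` parameter for reproducibility. It uses only the standard library and is not necessarily the original implementation.

```python
import random


def slice_sampler(px, N=1, x=None, seed=None):
    """Draw N samples from the discrete distribution px by slice sampling.

    px   : probabilities (or unnormalized weights), at least one positive
    x    : optional values to return instead of 0-based indices (assumption)
    seed : optional seed for reproducibility (hypothetical parameter)
    """
    rng = random.Random(seed)
    values = list(range(len(px))) if x is None else list(x)

    # Start at the most probable state so that px[state] > 0.
    state = max(range(len(px)), key=lambda i: px[i])

    samples = []
    for _ in range(N):
        # Vertical step: pick a height under the current state's bar.
        # random() is in [0, 1), so u < px[state] and the current state
        # is always part of the slice.
        u = rng.random() * px[state]
        # Horizontal step: pick uniformly among states whose bar exceeds u.
        slice_states = [i for i, p in enumerate(px) if p > u]
        state = rng.choice(slice_states)
        samples.append(values[state])
    return samples
```

Consecutive draws are correlated because each one depends on the previous state, but the long-run frequencies match `px`. This sketch returns a plain list rather than a NumPy array.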

    And here’s how to call it:

    >>> px = [.2, .4, .1, .3]
    >>> slice_sampler(px, N=5)
    array([2, 3, 3, 3, 3])
    >>> slice_sampler(px, N=5, x=[100, 200, 300, 400])
    array([200, 200, 400, 200, 200])

    Set N to something high and take a histogram and you’ll see that you have the right distribution.

    >>> from pylab import *
    >>> samples = slice_sampler(px, N=10000, x=[100, 200, 300, 400])
    >>> hist(samples)
    >>> grid()

    And here’s what you get:

    [Image: histogram of the samples]

  12. 2011-12-29
    #machine learning #math #sampling #statistics #python
  13. A video of Hadley Wickham (author of ggplot2, plyr, reshape, stringr, lubridate, etc.) talking about “tidy data” at the NYC Open Statistical Programming Meetup.

    Highly recommended for anyone who works with data from multiple sources that comes in various structures.

  14. 2011-12-20
    #rstats #machine learning #data analysis #data science
  15. Machine Learning for Email

    Drew Conway and John Myles White posted the code used in their O’Reilly book Machine Learning for Email on GitHub. Check it out to see implementations (all in R) of priority inbox, spam classification, and other algorithms.

  16. 2011-11-15
    #machine learning #rstats
  17. Comparison of high-level languages for MapReduce: k-means

  18. 2011-10-14
    #hadoop #rstats #machine learning
  19. engineering:

    As the new data guy at Tumblr, my first project is to take a look at algorithms we use to find and suggest blogs that a given user might be interested in.  This graph is a simple visual sample of my initial research.

    Another engineer graciously volunteered to let me peek at the list of blogs he follows, from which I gathered a list of all the blogs they follow. From those two lists, I was able to create a large matrix with a row for each blog and a column for each person that he or she follows. Using a fairly simple SVD recommender, we are able to see a few distinct blog clusters (the axes here are the first three principal components).

    The red dots are the blogs our guinea pig engineer follows (first degree), and the blue are the blogs his followers follow (second degree).  We performed a few spot tests to make sure that the groups made sense, and sure enough they do.  Up in the top left are some Tumblr staff blogs (including the official Staff Blog and David’s Log). The cluster on the far right, meanwhile, is made up of a lot of “funny things I found on the internet”-style blogs. This engineer only follows one blog in the heart of that cloud, but you can see that the other followers of that blog are very cliquey (that is, they all follow each other).

    My first post on the Tumblr engineering blog.  This is coming along really well. I can’t wait to see it go live.
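To make the idea in the quoted post concrete, here is a rough sketch of projecting a follow matrix onto its first three principal components with an SVD. The toy `follows` matrix and the `svd_coordinates` helper are made up for illustration and assume NumPy is available. This is not the code behind the graph.

```python
import numpy as np

# Hypothetical toy data: follows[i, j] = 1 if blog i follows blog j.
# Blogs 0-2 all follow each other, and so do blogs 3-5.
follows = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)


def svd_coordinates(matrix, k=3):
    """Return each row's coordinates on the top-k principal components."""
    # Center the columns so the SVD yields principal components.
    centered = matrix - matrix.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Scale the left singular vectors by their singular values.
    return U[:, :k] * s[:k]


coords = svd_coordinates(follows, k=3)
print(coords.shape)  # one 3-D point per blog, ready for a scatter plot
```

In this toy case the first component separates the two cliques: blogs 0-2 land on one side of zero and blogs 3-5 on the other.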

  20. 2011-08-19
    #machine learning #recommendation systems #prediction #data #data visualization #infographics #tumblr #math #statistics