The videos from PyData NYC 2012 are now online. There are about 40 videos, most of which are 30-45 minutes long, with some lightning talks thrown in. They cover everything from network analysis and literate programming with the IPython Notebook to R/Hadoop/Pig integration, plotting with matplotlib, and much more.
I’ve got a few cool things coming up:
DataKind - Sept. 7-9. I’ll be leading a team of volunteer data hackers to work with NYC governmental agencies and make sense of the data that they collect. DK will be announcing all of the details soon, but it sounds like there are a lot of interesting projects to work on over the course of the weekend. My group’s work from the last DataDive was presented at the United Nations General Assembly, and others’ work on NYC’s Stop and Frisk policy has been getting a lot of media attention lately.
DataGotham Conference - Sept. 13-14. DG is a conference celebrating the Data community in NYC. There is a great list of speakers and tutorials so far, primarily by engineers, researchers, and data scientists at New York-based institutions. Follow @DataGotham to see when the schedule is announced, but it sounds like I’ll be on a really interesting panel.
Bad Data Handbook - Nov. 2012 (est). I contributed a chapter to this collection of tips, tricks, and “war stories” on working with disorganized, inconsistent, and overall messy data.
DataGotham is a 1.5-day conference focusing on the data analysis community in New York City. The emphasis will be on tutorials, stories, and ideas rather than tools and “enterprise solutions.”
The call for proposals was just posted, so get over there and share your work.
This could be the basis of a really fun image recognition project.
With 38,849 bird photos, this collection can serve as a training set for visual bird detection and image-processing research.
Sampling is fun, right? Here’s a simple implementation of a slice sampler for discrete probability distributions.
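A minimal sketch of such a sampler, assuming the `slice_sampler(px, N=..., x=...)` signature used in the calls below (the exact implementation may differ):

```python
import numpy as np

def slice_sampler(px, N=1, x=None):
    """Draw N samples from a discrete distribution via slice sampling.

    px : sequence of (possibly unnormalized) probabilities
    N  : number of samples to draw
    x  : optional labels; if given, labels are returned instead of indices
    """
    px = np.asarray(px, dtype=float)
    values = np.zeros(N, dtype=int)
    # start the chain at a random state
    k = np.random.randint(len(px))
    for n in range(N):
        # slice step: draw a height uniformly under the current bar
        u = np.random.uniform(0, px[k])
        # move uniformly among all states whose bar is taller than u
        candidates = np.nonzero(px > u)[0]
        k = candidates[np.random.randint(len(candidates))]
        values[n] = k
    if x is not None:
        return np.asarray(x)[values]
    return values
```

Because the current state `k` always satisfies `px[k] > u`, the candidate set is never empty, and the chain's stationary distribution is proportional to `px`.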
And here’s how to call it:
>>> px = [.2, .4, .1, .3]
>>> slice_sampler(px, N=5)
array([2, 3, 3, 3, 3])
>>> slice_sampler(px, N=5, x=[100, 200, 300, 400])
array([200, 200, 400, 200, 200])
Set N to something high, plot a histogram, and you’ll see that the samples follow the right distribution.
>>> from pylab import *
>>> samples = slice_sampler(px, N=10000, x=[100, 200, 300, 400])
>>> hist(samples)
>>> grid()
And here’s what you get:
Drew Conway and John Myles White posted the code used in their O’Reilly book Machine Learning for Email on GitHub. Check it out to see implementations (all in R) of priority inbox, spam classification, and other algorithms.
As the new data guy at Tumblr, my first project is to take a look at algorithms we use to find and suggest blogs that a given user might be interested in. This graph is a simple visual sample of my initial research.
Another engineer graciously volunteered to let me peek at the list of blogs he follows, from which I gathered a list of all the blogs they follow. From those two lists, I was able to create a large matrix with a row for each blog and a column for each user, where each entry indicates whether that user follows that blog. Using a fairly simple SVD recommender, we are able to see a few distinct blog clusters (the axes here are the first three principal components).
The red dots are the blogs our guinea pig engineer follows (first degree), and the blue are the blogs his followers follow (second degree). We performed a few spot tests to make sure that the groups made sense, and sure enough they do. Up in the top left are some Tumblr staff blogs (including the official Staff Blog and David’s Log). The cluster on the far right, meanwhile, are a lot of “funny things I found on the internet”-style blogs. This engineer only follows one blog in the heart of that cloud, but you can see that the other followers of that blog are very cliquey (that is, they all follow each other).
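The projection step above can be sketched with the SVD directly. This is a hedged illustration, not the production code: the `follows` matrix here is a fabricated random stand-in for the real (private) Tumblr follow data.

```python
import numpy as np

# Hypothetical follow matrix: rows are blogs, columns are users;
# entry (i, j) is 1.0 if user j follows blog i. Fabricated at random
# purely for illustration.
rng = np.random.default_rng(42)
follows = (rng.random((100, 20)) < 0.15).astype(float)

# Center each column, then take the SVD. Scaling the left singular
# vectors by the singular values gives each blog's coordinates along
# the principal components.
centered = follows - follows.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = U[:, :3] * S[:3]  # one 3-D point per blog

print(coords.shape)  # one row of (x, y, z) coordinates per blog
```

Plotting `coords` as a 3-D scatter (coloring first-degree follows differently from second-degree ones) reproduces the kind of cluster picture described here.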
My first post on the tumblr engineering blog. This is coming along really well. I can’t wait to see it go live.