April 2013
3 posts
2 tags
Baseball by the (jersey) numbers
The other day I saw this video of a 14-year-old high school basketball sensation. He’s billed as “The Next LeBron,” who was in turn billed as “The Next Michael Jordan.” What caught my eye is that they all wear number 23 on their jerseys (until LeBron moved to the Heat, at least).
When I was a kid I played lots of sports, and the best basketball player on my team was...
2 tags
The blogroll « Statistical Modeling, Causal... →
A great list of blogs on statistics, visualization, politics and more. Curated by Andrew Gelman. I’m not sure how Simply Statistics didn’t make the list.
4 tags
SADDLE: Scala Data Library →
I sent out a tweet last night asking if there are any good data libraries for Scala. The next morning, Saddle gets released. I’ve heard it described as “Pandas for Scala,” which is great because nothing beats having data frames ported to a new language.
March 2013
8 posts
3 tags
All models are wrong, but some are useful.
– George E P Box (1919-2013)
2 tags
How to approach a problem: self-indulgent music...
I’ve been thinking a lot about music recommendations lately, and I realized that I’m usually a little bearish about listening to recommended bands that I’ve never heard of before. Maybe it’s just because I listen to a pretty broad variety of music, but I love re-discovering a band that I know but haven’t thought of in a while. So with that, let’s build a 100% self-centered music recommender. The...
3 tags
So, yes, lasso was a great idea, and I didn’t get the point. I’m still amazed in...
– It’s always impressive to see a leader in any field admit when they were “way way wrong” about a certain technique or idea. This post by Andrew Gelman is a great summary of what Lasso (or regularization in general) is, and how his opinion on it has changed over time.
Tibshirani...
Keeping tabs on your data analysis workflow
I’ve been thinking a lot lately about data analysis workflows, from raw materials to a finished product. Here at Tumblr and at my previous job at Columbia Business School, I’ve had to create a ton of one-off reports that would “probably never be used again.” The problem is that an awful lot of those “one time” requests have come back to rear their heads a month, or even a...
4 tags
Any data scientist worth their salary will tell you that you should start with a...
– Jake Porway on how You Can’t Just Hack Your Way to Social Change
1 tag
Some more shameless self promotion
I’m going to be giving two talks in the next few weeks:
An overview of our data stack at the NYC Data Science Meetup at Foursquare HQ on March 28.
A keynote talk at BigData TechCon in Boston, April 8-10.
I hope to see a bunch of you there. Feel free to say hello if we haven’t met.
4 tags
Using entropy to route web traffic
Earlier this week, Blake asked me for some help with a problem he’s working on. He has a couple of hash functions that are being used to route web traffic to a number of different servers. A hash function takes an input, such as a blog’s URL, and outputs a number between 0 and 2^32. Say we have 1000 servers; that means that each one will handle about 4.3 million points in the...
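A minimal sketch of one way entropy could be used here, assuming the goal is to check how evenly a hash function spreads keys across servers. The function names, the sample URLs, and the use of CRC32 as a stand-in 32-bit hash are illustrative assumptions, not the hash functions from the post.

```python
import math
import zlib
from collections import Counter


def crc32_hash(key):
    """Stand-in 32-bit hash (assumption): returns an int in [0, 2**32)."""
    return zlib.crc32(key.encode("utf-8"))


def bucket_entropy(keys, num_servers, hash_fn):
    """Shannon entropy (in bits) of the key-to-server assignment.

    Each key is routed to server hash_fn(key) % num_servers. A perfectly
    even spread gives log2(num_servers) bits; sending every key to one
    server gives 0 bits.
    """
    counts = Counter(hash_fn(k) % num_servers for k in keys)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical blog URLs, used only to exercise the function.
keys = [f"blog{i}.tumblr.com" for i in range(10000)]
num_servers = 16

observed = bucket_entropy(keys, num_servers, crc32_hash)
maximum = math.log2(num_servers)
print(f"entropy: {observed:.4f} bits (maximum possible: {maximum:.4f})")
```

Comparing the observed entropy with the maximum gives a single number for ranking candidate hash functions: the closer to log2 of the server count, the more evenly the load is spread.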
February 2013
7 posts
6 tags
Sifter: an API for website testing and...
I’m happy to finally give Sifter a name and a home. It’s a multi-armed bandit (MAB) API for testing web pages, optimizing which version (“arm”) of the test gets displayed. These arms could be article titles, ad positions, logos, colors, etc. Here’s the gist of how a MAB test works:
A user visits a page that is under test.
The page makes a request to sifter...
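To make the idea of choosing an arm concrete, here is a rough sketch of an epsilon-greedy bandit. This is one common MAB strategy picked for illustration; it is an assumption, not a description of the algorithm Sifter uses, and the class name, arm names, and click rates are all hypothetical.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit: explore a random arm with probability
    epsilon, otherwise show the arm with the best observed mean reward."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.pulls = {arm: 0 for arm in self.arms}
        self.rewards = {arm: 0.0 for arm in self.arms}

    def mean(self, arm):
        """Average reward observed for an arm (0.0 if never shown)."""
        n = self.pulls[arm]
        return self.rewards[arm] / n if n else 0.0

    def select_arm(self):
        """Pick which version of the page to display."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore
        return max(self.arms, key=self.mean)   # exploit

    def update(self, arm, reward):
        """Record the outcome (e.g. 1 for a click, 0 otherwise)."""
        self.pulls[arm] += 1
        self.rewards[arm] += reward


# Simulated test with made-up click-through rates for two titles.
true_rates = {"title_a": 0.05, "title_b": 0.15}
rng = random.Random(42)
bandit = EpsilonGreedyBandit(true_rates, epsilon=0.1, rng=rng)

for _ in range(5000):
    arm = bandit.select_arm()
    clicked = 1 if rng.random() < true_rates[arm] else 0
    bandit.update(arm, clicked)

for arm in bandit.arms:
    print(arm, bandit.pulls[arm], round(bandit.mean(arm), 3))
```

Unlike a fixed 50/50 split, the share of traffic each arm receives shifts over the course of the test as evidence accumulates.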
1 tag
Are you going to be optimizing the solution to one problem, or solving a wide...
– Hilary Mason is on point, as usual. Read Finding a Great Data Science Job for more excellent considerations.
1 tag
An API for website testing/optimization
This is something that I’ve been working on very heavily for the last 6 weeks or so, and I’m excited to start showing it off. It’s currently being hosted at http://banditapi.herokuapp.com.
In a normal A/B test on a website, you randomly present users with one version of a feature or another. This could be the placement of your logo, for example. You let the test run for some...
January 2013
9 posts
4 tags
BigData TechCon 2013 →
I’m honored to be giving a keynote talk at the 2013 BigData TechCon conference in Boston (April 8-10). There are a lot of excellent speakers at the event that I’m looking forward to seeing, including Claudia Perlich (Media6Degrees), Jonathan Seidman (Cloudera, formerly Orbitz), and Oscar Boykin (Twitter).
Registration is open here. Here’s a short abstract for my talk:
Social...
Significance testing in general has been a greatly overworked procedure, and in...
– George Box “Statistics For Experimenters,” 1978.
1 tag
When I was a kid, I thought a lot about what made me different from the other...
– Aaron Swartz
2 tags
Upcoming data meetups in NYC
Here are a handful of great events coming up in New York in the next few weeks:
Jan 17 Music recommendations at Spotify (700 people have RSVP’d for this. Yikes!)
Jan 24 Algorithms, Art and Authorship (waiting list only at this point)
Jan 31 Ranking Algorithms for Sports, Social Networks and More (some background reading here)
Feb 6 Data Science at the United Nations
4 tags
New year, new projects.
Happy new year! 2012 was a pretty crazy year for me. The biggest thing was probably moving into my great new apartment with Lana, but I also worked on a bunch of great projects, including:
Leading a team at the DataKind hackathon, which had great results.
Speaking on a panel at the Data Gotham conference.
Contributing a chapter to Bad Data Handbook.
I also made (and almost completed) a little app...
3 tags
I think what makes a good data scientist is the same thing that makes you a good...
– Roger Peng on What makes a good data scientist.
December 2012
4 posts
4 tags
The Fast Fourier Transform →
This is a great post about the FFT algorithm with a simple implementation example. It’s definitely worth reading as either an introduction to or refresher on one of the most important algorithms in modern history.
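For a quick refresher, here is a minimal sketch of the recursive radix-2 Cooley-Tukey FFT. It is not the implementation from the linked post, only an illustration of the divide-and-conquer idea, and it assumes the input length is a power of two.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT.

    Assumes len(x) is a power of two. Returns a list of complex numbers
    equal to the discrete Fourier transform of x.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    if n == 1:
        return [complex(x[0])]

    # Transform the even- and odd-indexed samples separately.
    even = fft(x[0::2])
    odd = fft(x[1::2])

    # Combine the two half-size transforms using the twiddle factors.
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


signal = [1, 2, 3, 4, 0, -1, -2, -3]
spectrum = fft(signal)
print([round(abs(c), 3) for c in spectrum])
```

Splitting the problem in half at each level is what brings the cost down from O(n^2) for the direct sum to O(n log n).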
November 2012
7 posts
3 tags
Optimal Descriptive NFL Rankings
seanjtaylor:
Most NFL fans, like myself, obsess over who’s going to win games or which players to start in our fantasy football leagues. One of the fundamental tools we use to look at this are rankings. Rankings are the simplest possible model that can represent a total order, which you can think of as a function that allows you to compare all possible pairs in your set.
Describing the NFL...
7 tags
The videos from PyData NYC 2012 are now online. There are about 40 videos, most of which are 30-45 minutes long, with some lightning talks thrown in. They cover network analysis, literate programming with IPython Notebook, R/Hadoop/Pig integration, plotting with matplotlib, and much more.
2 tags
NYC awards Columbia $15 million for new...
columbiascience:
Engineering professors Kathleen McKeown, at right, and Patricia Culligan have been named director and associate director, respectively, of Columbia’s new data-science institute. / Photograph by Jenica Miller
Researchers call it “big data”: the troves of digital information constantly being generated by the mobile devices in our hands, the GPS units on our dashboards, and the...
October 2012
12 posts
3 tags
Simply Statistics: On weather forecasts, Nate... →
simplystatistics:
The interesting thing is that even though we only estimate that Obama leads by about 0.5%, he wins 68% of the simulated elections…
He never gets more than 54% or so and never less than 47% or so. So it is always a reasonably close election. Silver’s calculations are obviously more complicated, but the basic idea of simulating elections is the same.
An excellent, simple...
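The basic idea of simulating elections can be sketched in a few lines. This is a toy model under stated assumptions, not the Simply Statistics or Nate Silver calculation: the candidate's two-party vote share is drawn from a normal distribution, and the mean and standard deviation used below are made-up values that are not meant to reproduce the 68% figure in the quote.

```python
import random


def simulate_win_probability(mean_share, sd, n_sims=10000, seed=0):
    """Fraction of simulated elections won by the candidate.

    Each simulated election draws the candidate's two-party vote share
    (in percent) from Normal(mean_share, sd). The candidate wins when
    the drawn share is above 50.
    """
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n_sims) if rng.gauss(mean_share, sd) > 50.0)
    return wins / n_sims


# Hypothetical inputs: a 0.5-point lead (50.25% vs 49.75%) and an assumed
# polling uncertainty of 1 percentage point.
p_win = simulate_win_probability(mean_share=50.25, sd=1.0)
print(f"candidate wins {p_win:.1%} of simulated elections")
```

Even a small lead turns into a win probability noticeably above 50%, because the simulation counts how often the lead survives the uncertainty rather than how large it is.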
6 tags
wssrstrm asked: So, in a theme, how do I hide the content source when it's the same as the reblog parent?
2 tags
Adam: Should I spell "referer" with...
Everyone: ONE R
NSQ: realtime distributed message processing at... →
wordbitly:
NSQ is a realtime message processing system designed to operate at bitly’s scale, handling billions of messages per day.
It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message…
1 tag
Unsung heroes in startups
bijan:
In startups, the heroes often cited are the founders. Or the product leads. Or the designers. They are the ones making the headlines or get invited to speak at conferences.
And for good reason. They are driving the vision of the product experience and often breaking new ground. So the praise is often justified.
But they aren’t the only heroes.
More and more, I’m paying more...
4 tags
Pandas - to - R cheat sheet →
A great pandas-to-R reference page from gappy.
(I’d love to see this as a GitHub gist rather than a read-only Google Doc.)
4 tags