April 2013
3 posts
2 tags
Baseball by the (jersey) numbers
The other day I saw this video of a 14-year-old high school basketball sensation. He’s billed as “The Next LeBron,” who was in turn billed as “The Next Michael Jordan.” What caught my eye is that they all wear number 23 on their jersey (until LeBron moved to the Heat, at least). When I was a kid I played lots of sports, and the best basketball player on my team was...
Apr 30th
12 notes
2 tags
The blogroll « Statistical Modeling, Causal... →
A great list of blogs on statistics, visualization, politics and more. Curated by Andrew Gelman. I’m not sure how Simply Statistics didn’t make the list.
Apr 29th
2 notes
4 tags
SADDLE: Scala Data Library →
I sent out a tweet last night asking if there are any good data libraries for Scala. The next morning, Saddle gets released. I’ve heard it described as “Pandas for Scala,” which is great because nothing beats having data frames ported to a new language.
Apr 2nd
6 notes
March 2013
8 posts
3 tags
“All models are wrong, but some are useful.”
– George E P Box (1919-2013)
Mar 31st
5 notes
2 tags
How to approach a problem: self-indulgent music...
I’ve been thinking a lot about music recommendations lately, and I realized that I’m usually a little bearish about listening to recommended bands that I’ve never heard of before. Maybe it’s just because I listen to a pretty broad variety of music, but I love re-discovering a band that I know but haven’t thought of in a while. So with that, let’s build a 100% self-centered music recommender. The...
Mar 24th
25 notes
3 tags
“So, yes, lasso was a great idea, and I didn’t get the point. I’m still amazed in...”
– It’s always impressive to see a leader in any field admit when they were “way way wrong” about a certain technique or idea. This post by Andrew Gelman is a great summary of what Lasso (or regularization in general) is, and how his opinion on it has changed over time. Tibshirani...
Mar 20th
2 notes
Keeping tabs on your data analysis workflow
I’ve been thinking a lot lately about data analysis workflows, from raw materials to a finished product. Here at Tumblr and at my previous job at Columbia Business School, I’ve had to create a ton of one-off reports that would “probably never be used again.” The problem is that an awful lot of those “one time” requests have come back to show their head a month, or even a...
Mar 14th
13 notes
2 tags
Mar 14th
6 notes
4 tags
“Any data scientist worth their salary will tell you that you should start with a...”
– Jake Porway on how You Can’t Just Hack Your Way to Social Change
Mar 7th
67 notes
1 tag
Some more shameless self promotion
I’m going to be giving two talks in the next few weeks: an overview of our data stack at the NYC Data Science Meetup at Foursquare HQ on March 28, and a keynote talk at BigData TechCon in Boston, April 8-10. I hope to see a bunch of you there. Feel free to say hello if we haven’t met.
Mar 6th
5 notes
4 tags
Using entropy to route web traffic
Earlier this week, Blake asked me for some help with a problem he’s working on. He has a couple of hash functions that are being used to route web traffic to a number of different servers. A hash function takes an input, such as a blog’s URL, and outputs a number between 0 and 2^32. Say we have 1000 servers; that means that each one will handle about 4.3 million points in the...
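Here is a minimal Python sketch of the idea in the title, under stated assumptions. The teaser doesn't show the actual hash functions or how entropy was applied, so `route`, the stand-in hashes (CRC32 and a truncated MD5), and the made-up blog URLs are all hypothetical. The sketch maps each URL to one of 1000 servers by slicing the 32-bit hash range evenly, then measures the Shannon entropy of the resulting load. A hash that spreads traffic evenly scores closer to the maximum, log2(1000) ≈ 9.97 bits.

```python
import hashlib
import math
import zlib
from collections import Counter

NUM_SERVERS = 1000  # from the post; everything else here is illustrative


def crc32_hash(key: str) -> int:
    """Stand-in 32-bit hash (assumption: not the hash from the post)."""
    return zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF


def md5_hash(key: str) -> int:
    """Another stand-in: the first 4 bytes of MD5 as a 32-bit integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


def route(key: str, hash_fn, num_servers: int = NUM_SERVERS) -> int:
    """Split the 32-bit hash range into equal slices, one per server."""
    return hash_fn(key) * num_servers // 2**32


def entropy_bits(counts) -> float:
    """Shannon entropy (in bits) of an observed load distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical input: made-up blog URLs.
urls = [f"blog{i}.example.com" for i in range(100_000)]
max_entropy = math.log2(NUM_SERVERS)

for name, fn in [("crc32", crc32_hash), ("md5", md5_hash)]:
    load = Counter(route(u, fn) for u in urls)
    print(f"{name}: {entropy_bits(load.values()):.3f} of {max_entropy:.3f} bits")
```

Comparing each hash's entropy against the maximum gives a single number for how evenly it balances load, which makes it easy to rank candidate hash functions on real traffic.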
Mar 1st
97 notes
February 2013
7 posts
6 tags
Sifter: an API for website testing and...
I’m happy to finally give Sifter a name and a home. It’s a multi-armed bandit (MAB) API for testing web pages, optimizing which version (“arm”) of the test gets displayed. These arms could be article titles, ad positions, logos, colors, etc. Here’s the gist of how a MAB test works: A user visits a page that is under test. The page makes a request to sifter...
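To make the “arm” idea concrete, here is a small Python sketch of one common bandit strategy, epsilon-greedy. This is an assumption for illustration only: the teaser doesn't say which algorithm Sifter uses, and the class name, method names, and click-through rates below are hypothetical, not Sifter's API or data.

```python
import random


class EpsilonGreedyBandit:
    """Illustrative epsilon-greedy bandit (assumption: not Sifter's actual algorithm or API)."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.shows = {arm: 0 for arm in self.arms}
        self.successes = {arm: 0 for arm in self.arms}
        self.rng = random.Random(seed)

    def rate(self, arm):
        """Observed success rate for an arm (0.0 before it has been shown)."""
        shown = self.shows[arm]
        return self.successes[arm] / shown if shown else 0.0

    def select_arm(self):
        """Explore a random arm with probability epsilon, otherwise exploit the best one so far."""
        if self.rng.random() < self.epsilon:
            arm = self.rng.choice(self.arms)
        else:
            arm = max(self.arms, key=self.rate)
        self.shows[arm] += 1
        return arm

    def record_success(self, arm):
        """Call when the user does the desired thing (e.g. clicks)."""
        self.successes[arm] += 1


# Simulated test of two article titles with made-up click-through rates.
true_ctr = {"title_a": 0.05, "title_b": 0.10}
bandit = EpsilonGreedyBandit(true_ctr, epsilon=0.1, seed=42)
visitors = random.Random(0)

for _ in range(20_000):
    arm = bandit.select_arm()
    if visitors.random() < true_ctr[arm]:
        bandit.record_success(arm)

print(bandit.shows)
```

Unlike a fixed 50/50 split, the bandit shifts traffic toward whichever arm is performing better while the test is still running, and keeps a small exploration budget so a slow starter still gets a chance.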
Feb 21st
12 notes
5 tags
Feb 21st
6 notes
1 tag
Feb 18th
10 notes
Feb 14th
12 notes
2 tags
Feb 12th
19 notes
1 tag
“Are you going to be optimizing the solution to one problem, or solving a wide...”
– Hilary Mason is on point, as usual. Read Finding a Great Data Science Job for more excellent considerations.
Feb 4th
5 notes
1 tag
An API for website testing/optimization
This is something that I’ve been working on very heavily for the last 6 weeks or so, and I’m excited to start showing it off. It’s currently being hosted at http://banditapi.herokuapp.com. In a normal A/B test on a website, you randomly present users with one version of a feature or another. This could be the placement of your logo, for example. You let the test run for some...
Feb 2nd
5 notes
January 2013
9 posts
4 tags
BigData TechCon 2013 →
I’m honored to be giving a keynote talk at the 2013 BigData TechCon conference in Boston (April 8-10). There are a lot of excellent speakers at the event that I’m looking forward to seeing, including Claudia Perlich (Media6Degrees), Jonathan Seidman (Cloudera, formerly Orbitz), and Oscar Boykin (Twitter). Registration is open here. Here’s a short abstract for my talk: Social...
Jan 23rd
9 notes
“Significance testing in general has been a greatly overworked procedure, and in...”
– George Box “Statistics For Experimenters,” 1978.
Jan 23rd
2 notes
8 tags
Jan 18th
9 notes
1 tag
Jan 17th
825 notes
“When I was a kid, I thought a lot about what made me different from the other...”
–  Aaron Swartz
Jan 16th
502 notes
2 tags
Upcoming data meetups in NYC
Here are a handful of great events coming up in New York in the next few weeks:
Jan 17: Music recommendations at Spotify (700 people have RSVP’d for this. Yikes!)
Jan 24: Algorithms, Art and Authorship (waiting list only at this point)
Jan 31: Ranking Algorithms for Sports, Social Networks and More (some background reading here)
Feb 6: Data Science at the United Nations
Jan 15th
5 notes
1 tag
I made this a few months ago and just came across...
Jan 15th
26 notes
4 tags
New year, new projects.
Happy new year! 2012 was a pretty crazy year for me. The biggest thing was probably moving into my great new apartment with Lana, but I also worked on a bunch of great projects, including: leading a team at the DataKind hackathon, which had great results; speaking on a panel at the Data Gotham conference; and contributing a chapter to Bad Data Handbook. I also made (and almost completed) a little app...
Jan 5th
11 notes
3 tags
“I think what makes a good data scientist is the same thing that makes you a good...”
– Roger Peng on What makes a good data scientist.
Jan 4th
2 notes
December 2012
4 posts
1 tag
Dec 19th
338 notes
1 tag
Dec 17th
10 notes
4 tags
The Fast Fourier Transform →
This is a great post about the FFT algorithm with a simple implementation example. It’s definitely worth reading as either an introduction to or refresher on one of the most important algorithms in modern history.
Dec 17th
8 notes
6 tags
Dec 6th
35 notes
November 2012
7 posts
3 tags
Nov 26th
207 notes
3 tags
Optimal Descriptive NFL Rankings
seanjtaylor: Most NFL fans, like myself, obsess over who’s going to win games or which players to start in our fantasy football leagues. One of the fundamental tools we use to look at this are rankings. Rankings are the simplest possible model that can represent a total order, which you can think of as a function that allows you to compare all possible pairs in your set. Describing the NFL...
Nov 20th
20 notes
4 tags
Nov 15th
2 notes
7 tags
The videos from PyData NYC 2012 are now online. There are about 40 videos, most of which are 30-45 minutes long, with some lightning talks thrown in. They cover network analysis, literate programming with IPython Notebook, R/Hadoop/Pig integration, plotting with matplotlib, and much more.
Nov 13th
3 notes
2 tags
Nov 13th
124 notes
Nov 12th
359 notes
NYC awards Columbia $15 million for new...
columbiascience: Engineering professors Kathleen McKeown, at right, and Patricia Culligan have been named director and associate director, respectively, of Columbia’s new data-science institute. / Photograph by Jenica Miller
Researchers call it “big data”: the troves of digital information constantly being generated by the mobile devices in our hands, the GPS units on our dashboards, and the...
Nov 8th
8 notes
October 2012
12 posts
Oct 31st
981 notes
3 tags
Simply Statistics: On weather forecasts, Nate... →
simplystatistics: The interesting thing is that even though we only estimate that Obama leads by about 0.5%, he wins 68% of the simulated elections… He never gets more than 54% or so and never less than 47% or so. So it is always a reasonably close election. Silver’s calculations are obviously more complicated, but the basic idea of simulating elections is the same. An excellent, simple...
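The “simulate many elections” idea can be sketched in a few lines of Python. This is not the Simply Statistics model: it assumes a single national margin drawn from a normal distribution, and the 1.07-point standard deviation is a hypothetical value picked so that a 0.5-point lead wins roughly 68% of the time. It does not attempt to reproduce the 47% to 54% range quoted above.

```python
import random
from statistics import NormalDist

LEAD = 0.5        # estimated lead in percentage points (from the quoted post)
MARGIN_SD = 1.07  # assumed uncertainty in the margin; hypothetical, tuned to give ~68%
N_SIMS = 100_000


def simulate_win_share(lead, sd, n_sims, seed=0):
    """Fraction of simulated elections in which the leader's margin is positive."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(lead, sd) > 0 for _ in range(n_sims))
    return wins / n_sims


simulated = simulate_win_share(LEAD, MARGIN_SD, N_SIMS)
analytic = 1 - NormalDist(mu=LEAD, sigma=MARGIN_SD).cdf(0)
print(f"simulated: {simulated:.3f}, analytic: {analytic:.3f}")
```

The point the sketch illustrates is that a small lead, combined with modest uncertainty, can still translate into a win probability well above 50%.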
Oct 30th
14 notes
5 tags
Oct 28th
4 notes
6 tags
Oct 28th
7 notes
wssrstrm asked: So, in a theme, how do I hide the content source when it's the same as the reblog parent?
Oct 22nd
1 note
4 tags
Oct 16th
11 notes
3 tags
Oct 12th
5 notes
2 tags
Adam: Should I spell "referer" with...
Everyone: ONE R
Oct 10th
6 notes
NSQ: realtime distributed message processing at... →
wordbitly: NSQ is a realtime message processing system designed to operate at bitly’s scale, handling billions of messages per day. It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message…
Oct 9th
5 notes
1 tag
Unsung heroes in startups
bijan: In startups, the heroes often cited are the founders. Or the product leads. Or the designers. They are the ones making the headlines or get invited to speak at conferences.  And for good reason. They are driving the vision of the product experience and often breaking new ground. So the praise is often justified.  But they aren’t the only heroes.  More and more, I’m paying more...
Oct 7th
84 notes
4 tags
Pandas - to - R cheat sheet →
A great pandas-to-R reference page from gappy. (I’d love to see this as a GitHub gist rather than a read-only Google Doc.)
Oct 4th
4 tags
Oct 4th
23 notes