The other day I saw this video of a 14-year-old high school basketball sensation. He’s billed as “The Next LeBron,” who was in turn billed as “The Next Michael Jordan.” What caught my eye is that they all wear number 23 on their jersey (until LeBron moved to The Heat, at least).
When I was a kid I played lots of sports, and the best basketball player on my team was always #23 and the best football player was always #34 (Bo Jackson, Walter Payton, etc). So that got me to wondering what the “desirable” numbers are in other sports. I asked about other sports on twitter and #10 is clearly the best number to wear in soccer, and maybe #9 for hockey (Richard, Hull, Howe).
I wanted to figure out which jersey numbers have been the best historically, and baseball is the obvious sport to turn to for this, since the data is so reliable and readily available. So I downloaded the career stats for a little over 17,000 players (which I think is every baseball player ever) and decided to see which are the best jersey numbers overall.
For each player, I got their career batting average and their jersey number, which proved to be harder than I expected. For example, Johnny Damon wore #18 from 2002-2009 (during his best seasons with the Red Sox and Yankees), but has also worn 51, 8, 22, and 33. So to make it a little easier on myself, I just got the number that they wore on the most teams. Johnny Damon has had 7 numbers on 6 teams, but wore #18 on 4 teams, so I’m associating him with #18.
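Here’s a minimal sketch of that "number worn on the most teams" rule in plain Python. It is not the code behind this post: the function name `primary_number`, the (team, number) input shape, the tie-breaking rule, and the sample stints are all assumptions made for illustration.

```python
from collections import Counter


def primary_number(stints):
    """Return the jersey number a player wore on the most teams.

    `stints` is an iterable of (team, number) pairs. Ties go to the
    lowest number so the result is deterministic (an assumed rule).
    """
    teams_per_number = Counter()
    for _team, number in set(stints):  # count each team/number pair once
        teams_per_number[number] += 1
    # sorted() puts the lowest number first, and max() keeps the first maximum
    return max(sorted(teams_per_number), key=teams_per_number.get)


# Made-up stints for illustration; the team labels are placeholders.
stints = [
    ("team_a", 51),
    ("team_b", 18),
    ("team_c", 18),
    ("team_c", 18),  # a second season with the same team counts once
    ("team_d", 8),
    ("team_e", 18),
    ("team_f", 18),
    ("team_f", 33),
]
print(primary_number(stints))
```

Counting teams rather than seasons keeps one long stint from dominating, which matches the rule described above.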
For scoring the jersey number, I’m taking the mean of career batting averages for all players who wore that number, weighted by the number of plate appearances per batter. I perform this weighting so that someone like Roberto Clemente who hit .317 over 10,211 plate appearances would have more influence than someone like Buster Posey, who hit .311 with only 1,324 plate appearances. That’s a bit of a mouthful so here’s some math that might make more sense:
\[ S_j = \sum_{i=1}^{N_j} \frac{p_{j,i} b_{j,i}}{\sum_{k=1}^{N_j}p_{j,k}} \]
Where \( S_j \) is the score for jersey number \( j \), \( b_{j,i} \) and \( p_{j,i} \) are respectively the batting average and number of plate appearances for the \( i^{th} \) player who wears jersey number \( j \), and \( N_j \) is the total number of players who wore jersey number \( j \). I was also sure to limit the data set to \( p_{j,i} \geq 500 \) and \( N_j \geq 5 \). There were also a lot of older (~100 years ago) players whose number I couldn’t gather, so I dropped those as well. This narrowed the data set down to about 3,500 players.
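As a rough sketch of the formula for \( S_j \), here is the weighted mean in plain Python, with both cutoffs applied. The function name `jersey_scores` and the tuple layout are assumptions, and the last sample row is an invented player added only to show the 500 plate appearance filter.

```python
from collections import defaultdict

MIN_PLATE_APPEARANCES = 500  # the p_{j,i} >= 500 cutoff
MIN_PLAYERS = 5              # the N_j >= 5 cutoff


def jersey_scores(players):
    """Return {jersey number: plate-appearance-weighted batting average}.

    `players` is an iterable of (number, plate_appearances, batting_average).
    """
    weighted_sum = defaultdict(float)  # sum of p * b for each number
    total_pa = defaultdict(int)        # sum of p for each number
    count = defaultdict(int)           # N_j for each number
    for number, pa, avg in players:
        if pa < MIN_PLATE_APPEARANCES:
            continue  # too few plate appearances to count
        weighted_sum[number] += pa * avg
        total_pa[number] += pa
        count[number] += 1
    return {
        number: weighted_sum[number] / total_pa[number]
        for number in count
        if count[number] >= MIN_PLAYERS
    }


# Five of the #4 hitters from the table in this post, plus one invented
# short-career player who is dropped by the 500 PA minimum.
sample = [
    (4, 9480, 0.358),
    (4, 9663, 0.340),
    (4, 5134, 0.336),
    (4, 2736, 0.331),
    (4, 6228, 0.324),
    (4, 120, 0.150),
]
scores = jersey_scores(sample)
print(round(scores[4], 3))
```

Dividing the summed products by the summed plate appearances is the same as the formula above, since the denominator does not depend on \( i \).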
The next thing to consider is that since I’m using lifetime batting average and number of plate appearances to rank players, what I’m really doing is ranking hitters. Many pitchers in the National League end up with over 500 at bats, but their batting average is just going to hurt the overall ranking of their jersey number. The following graph makes this crystal clear. The y-axis shows the jersey number, and the x-axis is the batting average for each player.

So once we remove the pitchers, here’s where each number ranks in terms of weighted batting average:

There are some clear winners here. The number 51 is a bit of an outlier because there are only 10 batters with that number who meet the minimum plate appearance requirement, but the list includes Ichiro Suzuki (.321) and Bernie Williams (.297) among others with high batting averages. Here are the top players who wore #4, a more “classic” number:
name plate_appearances batting_average
Rogers Hornsby 9480 0.358
Lou Gehrig 9663 0.340
Riggs Stephenson 5134 0.336
Dale Alexander 2736 0.331
Babe Herman 6228 0.324
Luke Appling 10254 0.310
Hack Wilson 5556 0.307
Paul Molitor 12167 0.306
Smead Jolley 1815 0.305
Mel Ott 11348 0.304
Not a bad group to be in. There are also some clear loser numbers, but they’re mostly higher numbers which I think are often worn by pitchers.
Here are the 10 best and worst lifetime batting averages:
name number plate_appearances batting_average
Rogers Hornsby 4 9480 0.358 0
Ted Williams 9 9788 0.344 0
Bill Terry 3 7108 0.341 0
Lou Gehrig 4 9663 0.340 0
Tony Gwynn 19 10232 0.338 0
Riggs Stephenson 4 5134 0.336 0
Al Simmons 7 9518 0.334 0
Paul Waner 24 10766 0.333 0
Dale Alexander 4 2736 0.331 0
Stan Musial 6 12717 0.331 0
...
Corky Miller 37 575 0.188 0
J.R. Phillips 17 545 0.188 0
Bill Plummer 8 1007 0.188 0
Gus Gil 18 538 0.186 0
Brandon Wood 32 751 0.186 0
Drew Butera 41 531 0.183 0
Kevin Cash 17 714 0.183 0
Tommy Dean 3 594 0.180 0
Ray Oyler 1 1445 0.175 0
John Vukovich 16 607 0.161 0
If you’re a baseball fan, you can see that the bottom of the list is populated by light-hitting catchers and infielders with short careers, because nobody who hits below .200 gets thousands of at bats in the majors. In fact, baseball’s bias towards keeping good hitters and cutting poor ones is obvious when you plot lifetime batting average against total plate appearances.
So what number would I wear if I were a pro baseball player? I can’t NOT choose #9, even though it’s retired by the Red Sox.
Also, all of the code and results for this blog post are available on my GitHub page.
A great list of blogs on statistics, visualization, politics and more. Curated by Andrew Gelman. I’m not sure how Simply Statistics didn’t make the list.
I sent out a tweet last night asking if there are any good data libraries for Scala. The next morning, Saddle gets released. I’ve heard it described as “Pandas for Scala,” which is great because nothing beats having data frames ported to a new language.
I’ve been thinking a lot about music recommendations lately, and I realized that I’m usually a little bearish about listening to recommended bands that I’ve never heard of before. Maybe it’s just because I listen to a pretty broad variety of music, but I love re-discovering a band that I know but haven’t thought of in a while. So with that, let’s build a 100% self-centered music recommender. The goal is to remind myself of some bands that I might like to play next based on what I’m listening to right now.
Fortunately for me, I’ve used last.fm to record the last 135,000+ tracks that I’ve listened to over the course of 7 years (my “now playing” is listed at the top of this page). And even more fortunately, they let you grab your entire history via their API. I was actually able to get 127,873 of them, which is more than plenty to work with. So let’s check out which artists I’ve listened to the most and see how well it matches up with my last.fm profile:
Artist Plays
----------------------------
Tom Waits 3155
Justin Townes Earle 2613
Iron & Wine 2053
M. Ward 2005
Lucero 1832
Old 97's 1761
The Black Keys 1755
Beach House 1624
Death Cab for Cutie 1592
Ryan Adams 1527
The first thing that should become clear is that I listen to a lot of Sad Bastard music. At least both Dillinger Four and Samiam are in the top 20.
When designing this recommender, I’m going to try to answer the following question: Given the artist I’m currently listening to, what have I generally listened to next?
Now since this is specific to me, I’m going to add a few constraints. The first of which is this: I prefer listening to full albums rather than individual tracks. I’m not going to recommend songs, I’m going to recommend artists. Because of that, the only important attributes I need for each track are the time that I listened to it and the artist.
Let’s look at the song-to-song transitions. That is, given that I’m listening to a song by The Antlers, which band am I likely to listen to next? This table shows the number of times I transition from listening to an Antlers song to each artist on the list, as well as the probability of the transition.
artist transitions transition_prob
The Antlers 854 0.870540
Beach House 22 0.022426
Arcade Fire 5 0.005097
Patrick Watson 4 0.004077
The Tallest Man on Earth 3 0.003058
Okkervil River 3 0.003058
The Avett Brothers 3 0.003058
Carla Bruni 3 0.003058
Pinback 2 0.002039
North Highlands 2 0.002039
This simply confirms what I stated earlier: when I listen to music, I listen to full albums. 87% of the time that I listen to an Antlers song, I listen to another one of their songs next. That’s not helpful for recommendations, so I’ll add another constraint: I’m only interested in transitions where the artists are not the same. Now the above list looks like this:
artist transitions transition_prob
Beach House 22 0.173228
Arcade Fire 5 0.039370
Patrick Watson 4 0.031496
The Tallest Man on Earth 3 0.023622
The Avett Brothers 3 0.023622
Okkervil River 3 0.023622
Carla Bruni 3 0.023622
Pinback 2 0.015748
North Highlands 2 0.015748
Fleetwood Mac 2 0.015748
The order is the same as before, but the transition probabilities are much higher. This is a reasonable list of artists to recommend to someone who listens to The Antlers. Even last.fm has Beach House and Okkervil River in the top related artists.
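The counting behind these tables can be sketched in a few lines of plain Python. This is an illustration rather than the pandas code behind this post: the function name `transition_probs` is made up, and the short listening history is invented.

```python
from collections import Counter


def transition_probs(plays, current_artist):
    """Return {next artist: probability}, ignoring same-artist transitions.

    `plays` is a list of artist names in the order they were played.
    """
    counts = Counter(
        nxt
        for cur, nxt in zip(plays, plays[1:])  # consecutive pairs of plays
        if cur == current_artist and nxt != current_artist
    )
    total = sum(counts.values())
    # most_common() keeps the result ordered from most to least frequent
    return {artist: n / total for artist, n in counts.most_common()}


# Invented listening history, oldest play first.
plays = [
    "The Antlers", "The Antlers", "Beach House",
    "The Antlers", "Beach House", "Beach House",
    "The Antlers", "Arcade Fire",
]
for artist, prob in transition_probs(plays, "The Antlers").items():
    print(artist, round(prob, 3))
```

Dropping the same-artist pairs before dividing is what makes the remaining probabilities sum to one again.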
We’re doing well so far, but let’s see if we can make it a little better with just a bit more work. Beach House is the top recommendation, but I listen to them a lot. Of all of the tracks I’ve recorded, 1.27% of them are Beach House tracks. Considering there are a total of 1,342 unique artists in my data set, that means I’m \(\frac{0.0127}{1 / 1342} = 17 \) times more likely to listen to Beach House than the “average” band.
So let’s use this information by dividing each transition probability by the unconditional probability of listening to a given artist. The unconditional probability is simply the total plays for each artist divided by the total number of plays (1624/127873 = 0.0127 for Beach House). The equation for the ranking has now become the following, where \( Pr(artist | Antlers) \) means the probability of listening to a given artist immediately after listening to The Antlers:
\[ \frac{Pr(artist | Antlers)}{Pr(artist)} \]
When I divide by the unconditional probability, I will give weight to artists that I listen to less often overall, making the results a little more exciting. If I multiply by this probability, however, I’ll give extra weight to artists I listen to more often, making the results more familiar. It’s probably important to point out that this is a bit of a hack that I came up with while typing up this blog post, and shouldn’t be confused with Bayes’ Theorem even though it looks sort of related. Anyway, let’s see what these rankings look like:
original less familiar more familiar
------------------------------------------------------------------------------
Beach House April March Beach House
Arcade Fire Broken Bells Tom Waits
Patrick Watson Beach House Arcade Fire
The Tallest Man on Earth Army of Ponch Patrick Watson
The Avett Brothers Mineral The Tallest Man on Earth
Okkervil River Carla Bruni Okkervil River
Carla Bruni Arcade Fire Bon Iver
Pinback North Highlands The Avett Brothers
North Highlands The Murder City Devils Dillinger Four
Fleetwood Mac Pinback Band of Horses
You can see that Tom Waits moved up on the chart on the right because I’ve listened to him more than anybody else. For the middle list, however, there’s lots of stuff that didn’t even make the original cut. A band like Army of Ponch might not seem like the best recommendation to someone currently listening to The Antlers, but I’ve made that transition twice and might want to again.
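To make the three rankings concrete, here is a small sketch that divides or multiplies each transition probability by the unconditional probability. The function name `rankings` is an assumption, the transition probabilities and the Army of Ponch play count are invented, and only the Beach House and Tom Waits play counts and the 127,873 total come from this post.

```python
def rankings(transition_prob, play_counts, total_plays):
    """Return (original, less_familiar, more_familiar) artist orderings."""
    # Pr(artist): share of all plays that belong to each artist
    uncond = {a: play_counts[a] / total_plays for a in transition_prob}

    def ordered(score):
        return sorted(transition_prob, key=score, reverse=True)

    original = ordered(lambda a: transition_prob[a])
    less_familiar = ordered(lambda a: transition_prob[a] / uncond[a])
    more_familiar = ordered(lambda a: transition_prob[a] * uncond[a])
    return original, less_familiar, more_familiar


# Invented transition probabilities for illustration.
transition_prob = {"Beach House": 0.5, "Tom Waits": 0.3, "Army of Ponch": 0.2}
# Beach House and Tom Waits counts are from the post; the third is made up.
play_counts = {"Beach House": 1624, "Tom Waits": 3155, "Army of Ponch": 10}

original, less, more = rankings(transition_prob, play_counts, 127873)
print("original:     ", original)
print("less familiar:", less)
print("more familiar:", more)
```

With these toy numbers the rarely played band jumps to the top of the "less familiar" list and the most played artist tops the "more familiar" one, which is the same effect seen in the tables above.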
While we’re at it, here’s the list of recommendations for Bruce Springsteen:
original less familiar more familiar
----------------------------------------------------------------------
Chuck Ragan Buddy Holly Tom Waits
Built to Spill Buckingham Nicks Chuck Ragan
Camera Obscura Sam Cooke Built to Spill
Tom Waits The Jayhawks Wilco
Wilco Chuck Ragan Camera Obscura
Okkervil River Bridge and Tunnel Okkervil River
Mean Creek Built to Spill Death Cab for Cutie
Death Cab for Cutie Mastodon Old 97's
Dan Auerbach Camera Obscura Ryan Adams
Bridge and Tunnel Iron & Wine and Calexico Spoon
Just reading that list reminds me that Mean Creek has a new record out that I’m going to listen to right now.
So which list is best? How does it compare to standard information retrieval techniques? Well, that’s probably different for each person and the only way to find out is to test it. I could (and might eventually) put together a little app that recommends me some artists from my listening history based on what I’m listening to right now. With a simple A/B test, I could see which of the three recommendation algorithms I follow most often and stick with that one in the future. To do that, I would have to record which recommendations each algorithm displays and which of those I actually go on to play.
The recommendation algorithm that provides the highest play / display ratio is the one I’d like to go with in the future. This seems like an obvious place to plug sifter for performing A/B and other types of testing in scenarios like this.
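A play / display ratio is simple enough to sketch. Everything here is hypothetical, since no such app exists yet: the event log format, the algorithm labels, and the function name `play_display_ratio` are assumptions.

```python
from collections import Counter


def play_display_ratio(events):
    """Return {algorithm: plays / displays}.

    `events` is an iterable of (algorithm, was_played) pairs, one per
    recommendation that was displayed.
    """
    displayed = Counter()
    played = Counter()
    for algorithm, was_played in events:
        displayed[algorithm] += 1
        if was_played:
            played[algorithm] += 1
    return {a: played[a] / displayed[a] for a in displayed}


# Invented log: each row is one displayed recommendation.
events = [
    ("original", True), ("original", False), ("original", False),
    ("less_familiar", True), ("less_familiar", True),
    ("less_familiar", False), ("less_familiar", False),
    ("more_familiar", False), ("more_familiar", False),
]
ratios = play_display_ratio(events)
best = max(ratios, key=ratios.get)
print(ratios)
print("best so far:", best)
```

With a log this small the ratios are mostly noise, so a real comparison would need many more displays per algorithm before picking a winner.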
The point of this blog post is more about the thought process than the technical parts of the recommender. There are lots of things that I could have done “right,” like using properties of Markov Chains (which is essentially what I built) to improve the system, or account for the fact that Buddy Holly follows Bruce Springsteen in my music library, so maybe that isn’t a true transition.
I think the main takeaway is really in the constraints that I put on the system. The idea for this one-day project took the following course:
Identifying the problem correctly let me build something in just a few hours. Is it the best recommender system the world has ever seen? Actually, it might be, because I’ve never seen any recommender that only suggests content that you are already familiar with and that’s what I wanted. But we’d have to test it against the likes of Last.fm, Spotify, and Pandora to find out.
As I said in the beginning, I’ve been thinking about this stuff a lot lately, but that’s not to say I’ve put this method into production anywhere. The code for all of this was done using python/pandas, and breaks pretty much every rule that I laid out in my previous blog post, so I’ll clean that up and get it posted soon.
It’s always impressive to see a leader in any field admit when they were “way way wrong” about a certain technique or idea. This post by Andrew Gelman is a great summary of what Lasso (or regularization in general) is, and how his opinion on it has changed over time.
I’ve been thinking a lot lately about data analysis workflows, from raw materials to a finished product. Here at Tumblr and at my previous job at Columbia Business School, I’ve had to create a ton of one-off reports that would “probably never be used again.” The problem is that an awful lot of those “one time” requests have come back to show their head a month, or even a year later, and suddenly you have to read through old code (that was written in a hurry) in order to figure out what exactly was in “counts2.csv” and how it was populated.
To avoid this problem, you should always make sure that you can go from raw/messy data to a finished result with one command. I generally have, at the bare minimum, a bash script to execute a series of commands to re-generate my results. The script will do any or all of the following things, in order:
If you did it right, you should be able to clone your repository (you are using version control, right?) onto a new computer that’s set up with the right tools and libraries, execute your build.sh script, and get your final results back with no errors along the way. Trust me, your future self will appreciate the extra effort.
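The "one command" idea can be sketched as a tiny driver that runs each step in order and stops at the first failure. I normally use a bash script for this, so treat the Python version below as an illustration only: the step names and the placeholder commands are assumptions, and a real pipeline would call its own scripts.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run each (name, command) step in order, stopping at the first failure.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit code, so later steps never run on top of a broken earlier one.
    """
    for name, command in steps:
        print(f"running: {name}", flush=True)
        subprocess.run(command, check=True)


# Placeholder steps; each one just prints a message.
steps = [
    ("gather", [sys.executable, "-c", "print('gathering raw data')"]),
    ("clean", [sys.executable, "-c", "print('cleaning data')"]),
    ("report", [sys.executable, "-c", "print('generating report')"]),
]
run_pipeline(steps)
```

Keeping the list of steps in one place doubles as documentation of how the final result was produced.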
Here are some tips for making this process easier:
I love ipython and the R console as much as the next nerd, but you have to be careful in there. If you have a line like
In[342]: results.to_csv('final_output.csv')
then you’re going to have a hell of a time when you need to run those 342 commands again. If you’re working on the command line, make sure you’re copy/pasting the right code into a script while you’re working. Scrolling through a terminal history to see what worked and what didn’t is never fun. RStudio makes it really easy to do this. Just type your code in the editor and hit control+enter to execute any highlighted lines, rather than typing them into the console window.
ipython notebook and knitr (for R) make it easy to describe what you’re doing while you’re doing it. You can easily combine descriptions with code, results, and graphs. They’re often used to present final results (we use knitr to generate some daily email reports), but this is as good for your own code documentation / commentary as it is for producing the report that you send off to someone else.
The actual report generation is usually the tip of the iceberg. There’s a lot more that happens before that, and if you’re like me it’s usually spread out among multiple scripts, unix tools, and programming languages. For example, we at Tumblr use scribe to log events, and then store the archived data in HDFS. So if I want to run an analysis that includes some of today’s actions, I need to get data from two different places (the scribe server and HDFS), and probably treat it slightly differently. Here’s an example of gathering data from two places, combining it together, and generating a report.
There are some more tools out there to help with this, such as ProjectTemplate for R projects, which executes a series of R scripts in an organized manner to gather, munge, model, and export data. There’s also age-old GNU make, which Mike Bostock recently endorsed for data projects. This provides the ability to list dependencies for specific scripts and data files. In the above example, generate_report.R relies on combined_data.csv, which relies on both clean_data_${TODAY_DATE}.csv and archive_data.csv. With make, you can specify all of these dependencies and if something throws an error along the way, the whole project will halt.
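To show what a dependency list buys you, here is a small sketch that works out a valid build order for the files named above. It uses Python’s standard `graphlib` module (3.9 or newer) rather than make itself, and `clean_data_TODAY.csv` is a stand-in for the dated file name.

```python
from graphlib import TopologicalSorter

# Each key depends on the files in its set.
# "clean_data_TODAY.csv" stands in for clean_data_${TODAY_DATE}.csv.
dependencies = {
    "generate_report.R": {"combined_data.csv"},
    "combined_data.csv": {"clean_data_TODAY.csv", "archive_data.csv"},
}

# static_order() yields every node after all of its dependencies.
build_order = list(TopologicalSorter(dependencies).static_order())
for target in build_order:
    print(target)
```

make does more than this, since it also skips targets that are already up to date, but the ordering is the part that keeps a report from being built on stale or missing inputs.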
Have any other tips or tools for this? I’d love to hear them in the comments.
In celebration of what I have come to call “Hey, math! Amirite? Day”, go read about how to estimate the value of \( \pi \) using Monte Carlo simulation.
(via Happy Pi Day, Now Go Estimate It! « Zero Intelligence Agents)
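For anyone who wants to try it before clicking through, here is a minimal sketch of the usual Monte Carlo approach: throw random points at the unit square and count how many land inside the quarter circle. This is the textbook version and not necessarily the method in the linked post; the function name `estimate_pi` and the seed argument are my own choices.

```python
import random


def estimate_pi(n_points, seed=None):
    """Estimate pi from the share of random points inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:           # inside the quarter circle
            inside += 1
    # The quarter circle has area pi / 4 and the square has area 1.
    return 4 * inside / n_points


print(estimate_pi(100_000, seed=42))
```

The error shrinks roughly with the square root of the number of points, so each extra digit of accuracy costs about 100 times more samples.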
Jake Porway on how You Can’t Just Hack Your Way to Social Change
I’m going to be giving two talks in the next few weeks:
I hope to see a bunch of you there. Feel free to say hello if we haven’t met.