## Baseball by the (jersey) numbers

The other day I saw this video of a 14-year-old high school basketball sensation. He’s billed as “The Next LeBron,” who was in turn billed as “The Next Michael Jordan.” What caught my eye is that they all wear number 23 on their jersey (until LeBron moved to the Heat, at least).

When I was a kid I played lots of sports, and the best basketball player on my team was always #23 and the best football player was always #34 (Bo Jackson, Walter Payton, etc.). That got me wondering what the “desirable” numbers are in other sports. I asked on Twitter, and #10 is clearly the best number to wear in soccer, and maybe #9 for hockey (Richard, Hull, Howe).

I wanted to figure out which jersey numbers have been the best historically, and baseball is the obvious sport to turn to for this, since the data is so reliable and readily available. So I downloaded the career stats for a little over 17,000 players (which I think is every baseball player ever) and decided to see which are the best jersey numbers overall.

For each player, I got their career batting average and their jersey number. The jersey number proved to be harder than I expected. For example, Johnny Damon wore #18 from 2002-2009 (during his best seasons with the Red Sox and Yankees), but has also worn 51, 8, 22, and 33. So to make it a little easier on myself, I just took the number that each player wore on the most teams. Johnny Damon has had 7 numbers on 6 teams, but wore #18 on 4 teams, so I’m associating him with #18.
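The "most teams" rule can be sketched in a few lines of Python. This is only an illustration under assumptions, not necessarily how the code for this post does it: the `(team, number)` stint format and the function name `primary_number` are hypothetical, the sample stints are made up rather than taken from any real player, and ties are broken in favor of the number seen first.

```python
from collections import defaultdict

def primary_number(stints):
    """Return the jersey number worn on the most distinct teams.

    `stints` is an iterable of (team, number) pairs (a hypothetical format).
    Returns None if there are no stints.
    """
    teams_by_number = defaultdict(set)
    for team, number in stints:
        teams_by_number[number].add(team)
    if not teams_by_number:
        return None
    # max() keeps the first maximal key in insertion order,
    # so a tie goes to the number that appeared earliest.
    return max(teams_by_number, key=lambda n: len(teams_by_number[n]))

# Made-up stints for illustration only
example = [("A", 51), ("B", 18), ("C", 18), ("D", 22), ("E", 18)]
print(primary_number(example))  # 18
```

Counting distinct teams rather than seasons matches the rule described above, but it also means a number worn briefly on several teams can beat one worn for a decade on a single team.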

For scoring the jersey number, I’m taking the mean of career batting averages for all players who wore that number, weighted by the number of plate appearances per batter. I perform this weighting so that someone like Roberto Clemente who hit .317 over 10,211 plate appearances would have more influence than someone like Buster Posey, who hit .311 with only 1,324 plate appearances. That’s a bit of a mouthful so here’s some math that might make more sense:

\[ S_j = \sum_{i=1}^{N_j} \frac{p_{j,i} b_{j,i}}{\sum_{k=1}^{N_j}p_{j,k}} \]

Where \( S_j \) is the score for jersey number \( j \), \( b_{j,i} \) and \( p_{j,i} \) are respectively the batting average and number of plate appearances for the \( i^{th} \) player who wears jersey number \( j \), and \( N_j \) is the total number of players who wore jersey number \( j \). I was also sure to limit the data set to \( p_{j,i} \geq 500 \) and \( N_j \geq 5 \). There were also a lot of older (~100 years ago) players whose number I couldn’t gather, so I dropped those as well. This narrowed the data set down to about 3,500 players.
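To make the formula concrete, here is a minimal Python sketch of the weighted score with both cutoffs applied. The function name `jersey_scores` and the `(number, plate_appearances, batting_average)` tuple layout are assumptions for illustration, not the code behind this post. The usage example borrows the Clemente and Posey figures quoted above and puts them under a placeholder number `0` purely to show the weighting (they did not share a number), with the player-count cutoff relaxed to 1.

```python
from collections import defaultdict

MIN_PA = 500       # minimum plate appearances per player
MIN_PLAYERS = 5    # minimum qualifying players per jersey number

def jersey_scores(players, min_pa=MIN_PA, min_players=MIN_PLAYERS):
    """Plate-appearance-weighted mean batting average per jersey number.

    `players` is an iterable of (number, plate_appearances, batting_average)
    tuples (a hypothetical layout).
    """
    groups = defaultdict(list)
    for number, pa, avg in players:
        if pa >= min_pa:                 # drop players with too few plate appearances
            groups[number].append((pa, avg))

    scores = {}
    for number, rows in groups.items():
        if len(rows) < min_players:      # drop numbers with too few qualifying players
            continue
        total_pa = sum(pa for pa, _ in rows)
        scores[number] = sum(pa * avg for pa, avg in rows) / total_pa
    return scores

# Placeholder number 0; figures are the two examples quoted in the text.
toy = [(0, 10211, 0.317), (0, 1324, 0.311)]
print(round(jersey_scores(toy, min_players=1)[0], 4))  # 0.3163
```

The weighted result (about .3163) sits much closer to .317 than the plain average of .314 does, which is the point of weighting by plate appearances.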

The next thing to consider is that since I’m using lifetime batting average and number of plate appearances to rank players, what I’m really doing is ranking *hitters*. Many pitchers in the National League end up with over 500 at bats, but their batting average is just going to hurt the overall ranking of their jersey number. The following graph makes this crystal clear. The y-axis shows the jersey number, and the x-axis is the batting average for each player.

So once we remove the pitchers, here’s where each number ranks in terms of weighted batting average:

There are some clear winners here. The number 51 is a bit of an outlier because there are only 10 batters with that number who meet the minimum plate appearance requirement, but the list includes Ichiro Suzuki (.321) and Bernie Williams (.297) among others with high batting averages. Here are the top players who wore #4, a more “classic” number:

```
name              plate_appearances  batting_average
Rogers Hornsby    9480               0.358
Lou Gehrig        9663               0.340
Riggs Stephenson  5134               0.336
Dale Alexander    2736               0.331
Babe Herman       6228               0.324
Luke Appling      10254              0.310
Hack Wilson       5556               0.307
Paul Molitor      12167              0.306
Smead Jolley      1815               0.305
Mel Ott           11348              0.304
```

Not a bad group to be in. There are also some clear loser numbers, but they’re mostly higher numbers which I think are often worn by pitchers.

Here are the 10 best and worst lifetime batting averages:

```
name              number  plate_appearances  batting_average
Rogers Hornsby    4       9480               0.358
Ted Williams      9       9788               0.344
Bill Terry        3       7108               0.341
Lou Gehrig        4       9663               0.340
Tony Gwynn        19      10232              0.338
Riggs Stephenson  4       5134               0.336
Al Simmons        7       9518               0.334
Paul Waner        24      10766              0.333
Dale Alexander    4       2736               0.331
Stan Musial       6       12717              0.331
...
Corky Miller      37      575                0.188
J.R. Phillips     17      545                0.188
Bill Plummer      8       1007               0.188
Gus Gil           18      538                0.186
Brandon Wood      32      751                0.186
Drew Butera       41      531                0.183
Kevin Cash        17      714                0.183
Tommy Dean        3       594                0.180
Ray Oyler         1       1445               0.175
John Vukovich     16      607                0.161
```

If you’re a baseball fan, you’ll notice that with the pitchers removed, the bottom of the list is populated by light-hitting catchers and infielders with short careers, because no sub-.190 batter sticks around long enough to pile up at bats in the majors. In fact, baseball’s bias towards keeping good hitters and cutting poor ones is obvious when you plot lifetime batting average against total plate appearances:

So what number would I wear if I were a pro baseball player? I can’t NOT choose #9, even though it’s retired by the Red Sox.

Also, all of the code and results for this blog post are available on my GitHub page.