I'm a data engineer at Tumblr and this is my blog. I write mostly about personal projects, data science, R/Python, and various curiosities.
It’s always impressive to see a leader in any field admit when they were “way way wrong” about a certain technique or idea. This post by Andrew Gelman is a great summary of what Lasso (or regularization in general) is, and how his opinion on it has changed over time.
Most NFL fans, like myself, obsess over who’s going to win games or which players to start in our fantasy football leagues. One of the fundamental tools we use to look at this is rankings. A ranking is the simplest possible model that can represent a total order, which you can think of as a function that lets you compare every possible pair in your set.
Describing the NFL season
In the NFL regular season, each of the 32 teams plays 16 games. The result is 256 binary outcomes. If we treat each outcome like a coin flip (an independent trial of a fair coin), then the entire season contains at most 256 bits of information. One way to think about this is that if you could send a string of 256 1s and 0s from January back in time to September, then with a simple coding scheme your past self could correctly predict all 256 games.
I began to think: we all love rankings, but how well can you describe the season with a ranking of teams? There are 32! possible rankings, so a ranking of all 32 teams contains log2(32!) bits of information, or about 118 bits. This is the smallest amount of information you could use to describe what happens in all possible pairings of teams. How accurate is it to describe 256 bits with 118 bits? How many games would you get wrong?
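Both numbers above are easy to sanity-check:

```python
import math

# 32 teams x 16 games each, but every game is shared by two teams:
games = 32 * 16 // 2                       # 256 games = 256 binary outcomes
# A ranking is one of 32! possible orderings, so it carries log2(32!) bits:
bits_in_ranking = math.log2(math.factorial(32))
print(games, round(bits_in_ranking))       # 256 118
```

So a full ranking can carry at most about 118 of the season's 256 bits.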
Looks like Sean J. Taylor has moved his blog to tumblr! Click through and scroll to the bottom to skip the math and see the cool interactive NFL ranking graphs.
Sampling is fun, right? Here’s a simple implementation of a slice sampler for discrete probability distributions.
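The implementation itself didn't survive here, so what follows is my reconstruction: a minimal NumPy sketch of a discrete slice sampler whose signature (`px`, `N`, `x`) is inferred from the example calls below, not taken from the original source.

```python
import numpy as np

def slice_sampler(px, N=1, x=None):
    """Draw N slice samples from the discrete distribution px.
    If x is given, return the sampled values x[i] instead of indices."""
    px = np.asarray(px, dtype=float)
    values = np.arange(len(px)) if x is None else np.asarray(x)
    samples = np.empty(N, dtype=values.dtype)
    i = np.random.randint(len(px))           # arbitrary starting state
    for n in range(N):
        u = np.random.uniform(0, px[i])      # slice height under the current bar
        candidates = np.nonzero(px > u)[0]   # states whose bar rises above the slice
        i = np.random.choice(candidates)     # move uniformly within the slice
        samples[n] = values[i]
    return samples
```

Each step samples a height uniformly under the current bar, then jumps uniformly among all bars taller than that height; the stationary distribution of this chain is `px`.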
And here’s how to call it:
>>> px = [.2, .4, .1, .3]
>>> slice_sampler(px, N=5)
array([2, 3, 3, 3, 3])
>>> slice_sampler(px, N=5, x=[100, 200, 300, 400])
array([200, 200, 400, 200, 200])
Set N to something high, take a histogram, and you’ll see that you recover the right distribution.
>>> from pylab import *
>>> samples = slice_sampler(px, N=10000, x=[100, 200, 300, 400])
>>> hist(samples)
>>> grid()
And here’s what you get: