Earlier this week, Blake asked me for some help with a problem he’s working on. He has a couple of hash functions that are being used to route web traffic to a number of different servers. A hash function takes an input, such as a blog’s url, and outputs a number between 0 and 232. Say we have 1000 servers, that means that each one will handle about 430 million points in the hash-space.
The data looked something like this (with fake blog names, of course):
## blog.name H1 H2 H3 H4
## 1 23.tumblr.com 3.137e+09 1.866e+09 6.972e+08 5.792e+08
## 2 19.tumblr.com 1.875e+09 2.545e+08 2.606e+09 1.312e+09
## 3 34.tumblr.com 1.366e+09 2.236e+09 1.106e+09 3.640e+09
## 4 43.tumblr.com 2.639e+09 1.098e+09 8.755e+08 1.507e+09
## 5 90.tumblr.com 6.564e+08 5.397e+07 3.084e+09 2.961e+09
## 6 29.tumblr.com 2.476e+09 4.532e+08 2.787e+08 4.894e+08
One important thing to point out before we get started is that this has to be a representative sample of the request data. Despite the wild popularity of my personal blog, it doesn’t get a sliver of the traffic that Beyonce gets. That fact needs to be represented in the sample data, meaning that her blog should appear in more rows of the sample data than mine.
The first thing I ever do is plot data to get a sense of what I’m working with and what I’m trying to accomplish. The density plots below show the distribution of values in the hash space for each algorithm. If you’re not familiar with kernel density plots, you can imagine this to be a smoothed (and prettier) version of a histogram. For the electrical engineers out there, it’s the sum of the convolution of a kernel function (usually a gaussian), with an impulse function at each of the points on the x-axis (represented here by dots).

By comparison, here are the density plots of a “near-ideal” example (1000 pulls from a uniform distribution) and a bad example (all assigned to the value 2e+09). The worst case example here shows the shape of the kernel function.

In information theory, entropy is the minimum number of bits required (on average) to identify an encoded symbol (stay with me here…). If we’re trying to transmit a bunch of text digitally, we would encode the alphabet where each “symbol” is a letter that will be represented in 1’s and 0’s. In order to transmit our message quickly, we want to use as few bits as possible. Since the letter “e” appears more frequently than the letter “q”, we want the symbol for “e” to have fewer bits than “q”. Make sense?
Huffman Coding is one encoding algorithm. There’s an example implementation here, which assigns the code 100 to “e” and 1111110010 to “q”. The more “uneven” your symbol distribution is, the fewer bits it will take, on average, to transmit your message (meaning you’ll have a lower entropy). The entropy value is a lower bound for the actual weighted average of the symbol lengths. There are special cases where some encoding algorithms get closer to the entropy value than others, but none will ever surpass it.
The actual entropy formula is:
\[ H(x)=-\sum_{i=0}^{N-1} p_i log_2(p_i) \]
Where \( H(x) \) is the entropy, and \( p_i \) is the probability of that symbol \( i \) will appear. In the example I linked to above, \( p_e = 0.124 \) and \( p_q=0.0009 \), so it makes sense that e’s symbol is so much shorter. In the example, the average number of bits per symbol is \( \sum S_i p_i = 4.173 \frac{bits}{symbol} \), where \( S_i \) is the number of bits in the symbol. The entropy, from the above equation, is \( 4.142 \frac{bits}{symbol} \).
The example problem of web traffic distribution is a little different. We’re not actually encoding anything, but rather trying to make the theoretical lower bound for average number of bits/signal as high as possible.
We can consider each server to be a symbol, and the amount of traffic that it recieves is decided by the hash function that we’re trying to choose. If one of our servers is the equivalent of the letter “e”, it’s going to be totally overloaded while the “q” isn’t going to be handling much traffic at all. We want each symbol (server) to appear (receive traffic) equally often.
So to calculate the entropy, we’ll take a histogram of the hash values with 20 buckets (representing the 20 servers). That will give us the number of requests that go to each server. Dividng that by the total number of requests gives us each server’s probability of handing the next incoming request. These are the \( p_i \) values that we need in order to calculate the entropy. In code, it looks like this:
calc.entropy <- function(hash.value) {
h = hist(hash.value, plot = FALSE, breaks = seq(0, 2^32, length.out = 21))
probs = h$counts/sum(h$counts)
print(probs)
entropy = -sum(probs * log2(probs))
}
The entropy values for our four hash functions are:
## hash.function entropy
## 1 H1 4.203
## 2 H2 4.226
## 3 H3 4.254
## 4 H4 4.180
And while we’re at it, here’s the entropy of our best/worst case example that we plotted earlier.
## hash.function entropy
## 1 near.ideal 4.309
## 2 worst.possible 0.000
Why is the worst case value 0? Because if all traffic is going to one server, we wouldn’t need any bits at all to tell us which server the request is going to. The theoretical limit for a histogram with 20 buckets is: \( -20\frac{1}{20}log_2{\frac{1}{20}} = 4.32 \), which we’re close to but can never exceed.
All of our hash functions appear to be working pretty well, especially for such a small sample size that I’m using for this blog post. It looks like our winner is H3.
To summarize here’s what we did to find the optimal hashing function:
There are more considerations to take, like setting an upper bound on the value of \( p_i \) to ensure that no single server ever gets so busy that it can’t handle its load.
If you want to read up more on information theory, Elements of Information Theory by Thomas Cover and Joy Thomas is an excellent book that is reasonably priced (used) on Amazon.
There is also, of course, Claude Shannon’s landmark paper from 1948 “A Mathematical Theory of Communication”“, in which he essentially defines the entire field.
I’m happy to finally give Sifter a name and a home. It’s a multi-armed bandit (MAB) API for testing web pages, optimizing which version (“arm”) of the test gets displayed. These arms could be article titles, ad positions, logos, colors, etc. Here’s the gist of how how a MAB test works:
There is an endless number of uses for something like this. Here are a few off the top of my head:
/select_arm route in the docs for more on default values)There are a lot of features that I think are pretty clever, like setting a default test result value and a TTL for the test, so that the default value gets reported when the test “expires,” and the ability to update this default value and TTL by sending a “heartbeat.” And there’s a lot on my To Do list, such as confidence intervals, multivariate testing, bucketing/aggregating test results for high-traffic websites, and more.
Right now the project is in a limited beta mode. I’m running it on a few pages that don’t get much traffic and I would love some help working out whatever bugs may come up. If you’re interested, please get in touch!
The Lakers are 6-22 when Kobe has more than 19 field goal attempts and 12-3 in the rest of the games.
(via Data supports claim that if Kobe stops ball hogging the Lakers will win more | Simply Statistics)
I was going to spend today working on a new project, but realized that the amount of work required just to get the boilerplate register/log-in/log-out code written is way too repetitive. So I stripped some code out of my multi-armed bandit API and called it bootstrap-bootstrap. It’s super basic, built with mongo/tornado/bootstrap, easy to deploy to heroku, and now up on github.
WNYC did a “How Single & How Happy Are You” survey, which I annotated for them.
(h/t DataNews for the link)
Huge turnout for the first NYC Data Science meetup! Probably about 300 people on a gross January night.
Hilary Mason is on point, as usual. Read Finding a Great Data Science Job for more excellent considerations.
This is something that I’ve been working on very heavily for the last 6 weeks or so, and I’m excited to start showing it off. It’s currently being hosted at http://banditapi.herokuapp.com.
In a normal A/B test on a website, you randomly present users with one version of a feature or another. This could be the placement of your logo, for example. You let the test run for some amount of time, then measure the relative performance of the two versions (“arms”) and go with the better one from that point on.
Multi-Armed Bandit tests are different. They are designed to learn which arm is preferred, and move towards that version of the website as quickly as possible. There are many nuances to these types of tests, and I highly recommend John Myles White’s book on the subject (there’s also a video of him talking about this subject at tumblr recently).
The API that I built provides a couple of different Multi-Armed Bandit algorithms, which are exposed as simple routes to request an arm to play, and to update the algorithm with the test results. There’s also a little admin view to see how your test is performing:

I’m at the point now where I need some help testing. This thing is not at the point where you should deploy it on a production site, but if you have something to test out that doesn’t take a lot of traffic (and you don’t mind a little latency), it should be reliable. There’s a also a little client library on github if you just want to fuss with it. Either email me at myfirstname.mylastname@gmail.com, or send me a message through tumblr to get an account.
I’m honored to be giving a keynote talk at the 2013 BigData TechCon conference in Boston (April 8-10). There are a lot of excellent speakers at the event that I’m looking forward to seeing, including Claudia Perlich (Media6Degrees), Jonathan Seidman (Cloudera, formerly Orbitz), and Oscar Boykin (twitter).
Registration is open here. Here’s a short abstract for my talk:
Social networks are able to collect large amounts of activity data from their user and customer base. As Big Data professionals, we conduct experiments on custom data sets to measure the effectiveness of our products or advertising methodologies. Since a social network is effectively useless without an active community, our companies owe it to their users to create new and better products based on this information. Learn how our data analysis and predictive analytics must take a different approach than Big Data in fields like finance, medicine, and defense.
I’m extra excited that the conference is in my home town. It’s been a while since I’ve spent a few days in Boston and I look forward to seeing all of my friends up there. I can already taste the Silhouette popcorn.
George Box “Statistics For Experimenters,” 1978.