1. ### Using entropy to route web traffic

Earlier this week, Blake asked me for some help with a problem he’s working on. He has a few hash functions that are being used to route web traffic to a number of different servers. A hash function takes an input, such as a blog’s URL, and outputs a number between 0 and $$2^{32}$$. Say we have 1,000 servers; that means each one will handle about 4.3 million points in the hash space.
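To make the routing concrete, here’s a rough sketch (in R, the language used for the rest of this post) of how a hash value could be bucketed into one of N servers. The function name and server count are just for illustration, not Blake’s actual routing code:

assign.server <- function(hash.value, n.servers = 1000) {
  # Split the hash space [0, 2^32) into n.servers equal-width buckets
  floor(hash.value / 2^32 * n.servers) + 1
}
assign.server(3.2e+09)  # 746: this request would go to server 746 of 1,000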

The data looked something like this (with fake blog names, of course):

##       blog.name        H1        H2        H3        H4
## 1 23.tumblr.com 3.137e+09 1.866e+09 6.972e+08 5.792e+08
## 2 19.tumblr.com 1.875e+09 2.545e+08 2.606e+09 1.312e+09
## 3 34.tumblr.com 1.366e+09 2.236e+09 1.106e+09 3.640e+09
## 4 43.tumblr.com 2.639e+09 1.098e+09 8.755e+08 1.507e+09
## 5 90.tumblr.com 6.564e+08 5.397e+07 3.084e+09 2.961e+09
## 6 29.tumblr.com 2.476e+09 4.532e+08 2.787e+08 4.894e+08


One important thing to point out before we get started is that this has to be a representative sample of the request data. Despite the wild popularity of my personal blog, it doesn’t get a sliver of the traffic that Beyonce gets. That fact needs to be represented in the sample data, meaning that her blog should appear in more rows of the sample data than mine.

## Plot the data

The first thing I ever do is plot the data to get a sense of what I’m working with and what I’m trying to accomplish. The density plots below show the distribution of values in the hash space for each algorithm. If you’re not familiar with kernel density plots, you can think of them as a smoothed (and prettier) version of a histogram. For the electrical engineers out there, it’s the convolution of a kernel function (usually a Gaussian) with a train of impulses, one at each of the points on the x-axis (represented here by dots).

By comparison, here are the density plots of a “near-ideal” example (1000 pulls from a uniform distribution) and a bad example (all assigned to the value 2e+09). The worst case example here shows the shape of the kernel function.
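If you want to reproduce plots like these, here’s roughly how the two reference examples could be generated and plotted in base R (a sketch, not the exact code behind the figures):

near.ideal <- runif(1000, min = 0, max = 2^32)  # 1000 pulls from a uniform distribution
worst.possible <- rep(2e+09, 1000)              # every request hashed to the same value

plot(density(near.ideal, from = 0, to = 2^32), main = "near-ideal")
rug(near.ideal)  # the dots: one tick per observed hash value

# With zero variance, an explicit bandwidth is needed; the curve is just the kernel shape
plot(density(worst.possible, bw = 1e+08), main = "worst possible")
rug(worst.possible)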

## Calculate the entropy

In information theory, entropy is the minimum number of bits required (on average) to identify an encoded symbol (stay with me here…). If we’re trying to transmit a bunch of text digitally, we would encode the alphabet where each “symbol” is a letter that will be represented in 1’s and 0’s. In order to transmit our message quickly, we want to use as few bits as possible. Since the letter “e” appears more frequently than the letter “q”, we want the symbol for “e” to have fewer bits than “q”. Make sense?

Huffman Coding is one such encoding algorithm. There’s an example implementation here, which assigns the code 100 to “e” and 1111110010 to “q”. The more “uneven” your symbol distribution is, the fewer bits it will take, on average, to transmit your message (meaning you’ll have a lower entropy). The entropy value is a lower bound for the actual weighted average of the symbol lengths. In special cases some encoding algorithms get closer to that bound than others, but none will ever go below it.
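As a quick illustration of that claim (this example isn’t from the linked implementation), here’s the entropy of a few toy symbol distributions in R:

H <- function(p) -sum(p * log2(p))  # entropy in bits/symbol
H(c(0.5, 0.5))    # evenly used 2-symbol alphabet: 1 bit/symbol
H(c(0.9, 0.1))    # uneven 2-symbol alphabet: ~0.47 bits/symbol
H(rep(1/26, 26))  # 26 equally likely letters: log2(26), ~4.70 bits/symbol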

The actual entropy formula is:

$$H(x)=-\sum_{i=0}^{N-1} p_i \log_2(p_i)$$

Where $$H(x)$$ is the entropy, and $$p_i$$ is the probability that symbol $$i$$ will appear. In the example I linked to above, $$p_e = 0.124$$ and $$p_q=0.0009$$, so it makes sense that e’s symbol is so much shorter. In that example, the average number of bits per symbol is $$\sum_i S_i p_i = 4.173 \frac{bits}{symbol}$$, where $$S_i$$ is the number of bits in symbol $$i$$. The entropy, from the above equation, is $$4.142 \frac{bits}{symbol}$$.

Our problem of web traffic distribution is a little different. We’re not actually encoding anything; rather, we’re trying to make the theoretical lower bound on the average number of bits/symbol as high as possible, which amounts to spreading the traffic as evenly as possible.

We can consider each server to be a symbol, and the amount of traffic that it receives is decided by the hash function that we’re trying to choose. If one of our servers is the equivalent of the letter “e”, it’s going to be totally overloaded, while the “q” server isn’t going to be handling much traffic at all. We want each symbol (server) to appear (receive traffic) equally often.

So to calculate the entropy, we’ll take a histogram of the hash values with 20 buckets (representing the 20 servers). That will give us the number of requests that go to each server. Dividing that by the total number of requests gives us each server’s probability of handling the next incoming request. These are the $$p_i$$ values that we need in order to calculate the entropy. In code, it looks like this:

calc.entropy <- function(hash.value) {
  # Bucket the hash values into 20 equal-width bins, one per server
  h <- hist(hash.value, plot = FALSE, breaks = seq(0, 2^32, length.out = 21))
  # Each server's probability of handling the next request
  probs <- h$counts / sum(h$counts)
  # Drop empty bins so that 0 * log2(0) doesn't produce NaN
  probs <- probs[probs > 0]
  -sum(probs * log2(probs))
}
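Applied to each hash function’s column, it might look something like this (assuming the sample shown above lives in a data frame called blog.hashes; that name doesn’t appear in the post):

hash.cols <- c("H1", "H2", "H3", "H4")
sapply(blog.hashes[hash.cols], calc.entropy)  # one entropy value per hash function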


The entropy values for our four hash functions are:

##   hash.function entropy
## 1            H1   4.203
## 2            H2   4.226
## 3            H3   4.254
## 4            H4   4.180


And while we’re at it, here are the entropies of the near-ideal and worst-case examples that we plotted earlier.

##    hash.function entropy
## 1     near.ideal   4.309
## 2 worst.possible   0.000


Why is the worst case value 0? Because if all traffic is going to one server, we wouldn’t need any bits at all to tell us which server the request is going to. The theoretical limit for a histogram with 20 buckets is $$-20 \cdot \frac{1}{20}\log_2\left(\frac{1}{20}\right) = \log_2(20) \approx 4.32$$, which we can get close to but never exceed.

All of our hash functions appear to be working pretty well, especially given the small sample size I’m using for this blog post. It looks like our winner is H3.

To summarize, here’s what we did to find the optimal hashing function:

• Get a representative sample of your web traffic
• Run each request through the hashing function
• Take a histogram of the resulting values with N bins, where N is the number of servers you have available
• Divide the bin counts by the total number of requests in your sample to get each server’s probability of handling a request
• Calculate the entropy, $$H(x)=-\sum_{i=0}^{N-1} p_i \log_2(p_i)$$, for each hash function

There are more considerations to take into account, like setting an upper bound on the value of $$p_i$$ to ensure that no single server ever gets so busy that it can’t handle its load.

If you want to read up more on information theory, Elements of Information Theory by Thomas Cover and Joy Thomas is an excellent book that is reasonably priced (used) on Amazon.

There is also, of course, Claude Shannon’s landmark paper from 1948, “A Mathematical Theory of Communication,” in which he essentially defines the entire field.

2. 2013-03-01
#information theory #tumblr #engineering #data science
3. ### Sifter: an API for website testing and optimization

I’m happy to finally give Sifter a name and a home. It’s a multi-armed bandit (MAB) API for testing web pages, optimizing which version (“arm”) of the test gets displayed. These arms could be article titles, ad positions, logos, colors, etc. Here’s the gist of how a MAB test works (a rough client-side sketch follows the list):

• A user visits a page that is under test.
• The page makes a request to sifter asking for an arm to display.
• Sifter makes a calculated decision about which arm to display and returns the result to the page as a value between 0 and (# arms - 1).
• The page renders the selected version (specific colors, logos, layouts, etc).
• The user interacts with the page in one way or another. Maybe they sign up, maybe they buy something, maybe they don’t.
• When the interaction is complete, the page sends the reward earned (if any) back to Sifter. This could be something binary like click/no click, or some other value such as the amount of money spent.
• Sifter updates the bandit algorithm, influencing the arm that is selected the next time the page is rendered.
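Since the loop is easier to see in code, here’s a rough sketch of it in R. To be clear about what’s assumed: the only route name taken from this post is /select_arm; the host, query parameters, response shape, and the reward-reporting route are placeholders, not Sifter’s actual API.

library(httr)  # any HTTP client would do

base.url <- "http://sifter.example.com"  # placeholder host

# 1. Page loads: ask Sifter which arm to display (a value between 0 and # arms - 1)
arm <- content(GET(paste0(base.url, "/select_arm"), query = list(test = "logo-test")))

# 2. Render the page with that arm and let the user interact with it ...

# 3. Report the reward back (e.g. 1 for a sign-up, 0 otherwise, or dollars spent)
POST(paste0(base.url, "/report_reward"),  # placeholder route name
     body = list(test = "logo-test", arm = arm, reward = 1))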

There is an endless number of uses for something like this. Here are a few off the top of my head:

### Run a news website?

• Run a few different title options for an article on your front page.
• Figure out how to get users to read more of a paginated article:
  • Test the number of words shown on each page (say 500 or 1000)
  • Each time the user clicks “next page”, update the default test result value to the user’s progress through the article. (See the /select_arm route in the docs for more on default values)

### Run a website with promoted content?

• Allow advertisers to choose multiple promotions to run at the same time, and optimize which one gets shown based on clicks and/or user feedback.
• Test the amount of labeling around the fact that this content is promoted vs organically created by users.

### Test the audience demographic response.

• Set up a standard A/B test.
• Display the same content to every user.
• Report back the results of the test where the “arm” represents a specific user demographic, rather than an alteration to the web page.
• Your test results will show which demographic responds best to your content.

There are a lot of features that I think are pretty clever, like setting a default test result value and a TTL for the test, so that the default value gets reported when the test “expires,” and the ability to update this default value and TTL by sending a “heartbeat.” And there’s a lot on my To Do list, such as confidence intervals, multivariate testing, bucketing/aggregating test results for high-traffic websites, and more.

Right now the project is in a limited beta mode. I’m running it on a few pages that don’t get much traffic and I would love some help working out whatever bugs may come up. If you’re interested, please get in touch!

4. 2013-02-21
#sifter #projects #optimization #operations research #multi armed bandit #ab test
5. The Lakers are 6-22 when Kobe has more than 19 field goal attempts and 12-3 in the rest of the games.
6. 2013-02-21
#nba #kobe bryant #sports #lakers #la lakers
7. I was going to spend today working on a new project, but realized that the amount of work required just to get the boilerplate register/log-in/log-out code written is way too repetitive. So I stripped some code out of my multi-armed bandit API and called it bootstrap-bootstrap. It’s super basic, built with mongo/tornado/bootstrap, easy to deploy to heroku, and now up on github.

8. 2013-02-18
#projects
9. WNYC did a “How Single & How Happy Are You” survey, which I annotated for them.

10. 2013-02-14
11. Huge turnout for the first NYC Data Science meetup! Probably about 300 people on a gross January night.

12. 2013-02-11
#meetups #data science
13. Are you going to be optimizing the solution to one problem, or solving a wide variety of problems? If it’s just one problem, is it something that you can imagine being happy about working on a year from now?

Hilary Mason is on point, as usual. Read Finding a Great Data Science Job for more excellent considerations.

14. 2013-02-04
#data science
15. ### An API for website testing/optimization

This is something that I’ve been working on very heavily for the last 6 weeks or so, and I’m excited to start showing it off. It’s currently being hosted at http://banditapi.herokuapp.com.

In a normal A/B test on a website, you randomly present users with one version of a feature or another. This could be the placement of your logo, for example. You let the test run for some amount of time, then measure the relative performance of the two versions (“arms”) and go with the better one from that point on.

Multi-Armed Bandit tests are different. They are designed to learn which arm is preferred and move toward that version of the website as quickly as possible. There are many nuances to these types of tests, and I highly recommend John Myles White’s book on the subject (there’s also a video of him talking about it at tumblr recently).
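To give a flavor of how a bandit “moves toward” the better arm, here’s a minimal epsilon-greedy sketch in R. This is one standard MAB algorithm, not necessarily one of the algorithms the API actually implements:

# Explore a random arm with probability epsilon, otherwise exploit the
# arm with the highest observed average reward.
select.arm <- function(counts, rewards, epsilon = 0.1) {
  if (runif(1) < epsilon || all(counts == 0)) {
    sample(seq_along(counts), 1)           # explore
  } else {
    which.max(rewards / pmax(counts, 1))   # exploit
  }
}

update.arm <- function(counts, rewards, arm, reward) {
  counts[arm] <- counts[arm] + 1
  rewards[arm] <- rewards[arm] + reward
  list(counts = counts, rewards = rewards)
}

Note that R is 1-indexed, while the API reports arms as values between 0 and (# arms - 1).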

The API that I built provides a couple of different Multi-Armed Bandit algorithms, which are exposed as simple routes to request an arm to play and to update the algorithm with the test results. There’s also a little admin view to see how your test is performing.

I’m at the point now where I need some help testing. This thing is not at the point where you should deploy it on a production site, but if you have something to test out that doesn’t take a lot of traffic (and you don’t mind a little latency), it should be reliable. There’s also a little client library on github if you just want to fuss with it. Either email me at myfirstname.mylastname@gmail.com, or send me a message through tumblr to get an account.

16. 2013-02-02
#projects
17. ### BigData TechCon 2013

I’m honored to be giving a keynote talk at the 2013 BigData TechCon conference in Boston (April 8-10). There are a lot of excellent speakers at the event that I’m looking forward to seeing, including Claudia Perlich (Media6Degrees), Jonathan Seidman (Cloudera, formerly Orbitz), and Oscar Boykin (twitter).

Registration is open here. Here’s a short abstract for my talk:

Social networks are able to collect large amounts of activity data from their user and customer base. As Big Data professionals, we conduct experiments on custom data sets to measure the effectiveness of our products or advertising methodologies. Since a social network is effectively useless without an active community, our companies owe it to their users to create new and better products based on this information. Learn how our data analysis and predictive analytics must take a different approach than Big Data in fields like finance, medicine, and defense.

I’m extra excited that the conference is in my home town. It’s been a while since I’ve spent a few days in Boston and I look forward to seeing all of my friends up there. I can already taste the Silhouette popcorn.

18. 2013-01-23
#data science #conferences #2013 #projects
Significance testing in general has been a greatly overworked procedure, and in many cases where significance statements have been made it would have been better to provide an interval within which the value of the parameter would be expected to lie.

George Box, “Statistics For Experimenters,” 1978.

20. 2013-01-22