Using entropy to route web traffic

Earlier this week, Blake asked me for some help with a problem he’s working on. He has a couple of hash functions that are being used to route web traffic to a number of different servers. A hash function takes an input, such as a blog’s URL, and outputs a number between 0 and \( 2^{32} \). Say we have 1000 servers; that means each one will handle about 4.3 million points in the hash-space.
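To make that concrete, here’s a minimal R sketch of one way the routing could work. The equal-slice assignment is my assumption, not necessarily how Blake’s system maps hashes to servers:

n.servers <- 1000
slice <- 2^32 / n.servers  # ~4.3 million hash points per server

route <- function(hash.value) {
    # map a hash value in [0, 2^32) to a server id in 1..n.servers
    floor(hash.value / slice) + 1
}

route(3.137e+09)  # the H1 value for 23.tumblr.com lands on server 731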

The data looked something like this (with fake blog names, of course):

##       blog.name        H1        H2        H3        H4
## 1 23.tumblr.com 3.137e+09 1.866e+09 6.972e+08 5.792e+08
## 2 19.tumblr.com 1.875e+09 2.545e+08 2.606e+09 1.312e+09
## 3 34.tumblr.com 1.366e+09 2.236e+09 1.106e+09 3.640e+09
## 4 43.tumblr.com 2.639e+09 1.098e+09 8.755e+08 1.507e+09
## 5 90.tumblr.com 6.564e+08 5.397e+07 3.084e+09 2.961e+09
## 6 29.tumblr.com 2.476e+09 4.532e+08 2.787e+08 4.894e+08

One important thing to point out before we get started is that this has to be a representative sample of the request data. Despite the wild popularity of my personal blog, it doesn’t get a sliver of the traffic that Beyoncé gets. That fact needs to be represented in the sample, meaning that her blog should appear in more rows than mine.

Plot the data

The first thing I always do is plot the data to get a sense of what I’m working with and what I’m trying to accomplish. The density plots below show the distribution of values in the hash space for each algorithm. If you’re not familiar with kernel density plots, you can think of them as a smoothed (and prettier) version of a histogram. For the electrical engineers out there, it’s the sum of the convolutions of a kernel function (usually a Gaussian) with an impulse function at each of the points on the x-axis (represented here by dots).

[Image: kernel density plots of the hash values for each of H1–H4]

By comparison, here are the density plots of a “near-ideal” example (1000 pulls from a uniform distribution) and a bad example (all assigned to the value 2e+09). The worst case example here shows the shape of the kernel function.

[Image: density plots of the near-ideal (uniform) example and the worst case example]
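If you’d like to reproduce plots along these lines, here’s my reconstruction in R (the original plotting code isn’t shown, so the details are guesses):

near.ideal <- runif(1000, min = 0, max = 2^32)  # 1000 pulls from a uniform
worst.possible <- rep(2e+09, 1000)              # everything piled on one value
plot(density(near.ideal, from = 0, to = 2^32))
rug(near.ideal)  # the dots along the x-axis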

Calculate the entropy

In information theory, entropy is the minimum number of bits required (on average) to identify an encoded symbol (stay with me here…). If we’re trying to transmit a bunch of text digitally, we’d encode the alphabet so that each “symbol” (a letter) is represented by a string of 1’s and 0’s. In order to transmit our message quickly, we want to use as few bits as possible. Since the letter “e” appears more frequently than the letter “q”, we want the symbol for “e” to have fewer bits than the one for “q”. Make sense?

Huffman coding is one such encoding algorithm. There’s an example implementation here, which assigns the code 100 to “e” and 1111110010 to “q”. The more “uneven” your symbol distribution is, the fewer bits it will take, on average, to transmit your message (meaning you’ll have a lower entropy). The entropy value is a lower bound on the actual weighted average of the symbol lengths. Some encoding algorithms get closer to the entropy value than others in special cases, but none will ever beat it.
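Here’s a toy check of that bound in R (my own example, not taken from the linked implementation). For “dyadic” probabilities like these, Huffman coding meets the entropy bound exactly:

# a valid Huffman code for these probabilities: "a" -> 0, "b" -> 10, "c" -> 11
p   <- c(a = 0.50, b = 0.25, c = 0.25)
len <- c(a = 1, b = 2, c = 2)
sum(p * len)       # average code length: 1.5 bits/symbol
-sum(p * log2(p))  # entropy: 1.5 bits/symbol (the bound is met exactly)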

The actual entropy formula is:

\[ H(x) = -\sum_{i=0}^{N-1} p_i \log_2(p_i) \]

Where \( H(x) \) is the entropy, and \( p_i \) is the probability that symbol \( i \) will appear. In the example I linked to above, \( p_e = 0.124 \) and \( p_q = 0.0009 \), so it makes sense that e’s symbol is so much shorter. In that example, the average number of bits per symbol is \( \sum_i S_i p_i = 4.173 \) bits/symbol, where \( S_i \) is the number of bits in symbol \( i \). The entropy, from the equation above, is \( 4.142 \) bits/symbol.

The example problem of web traffic distribution is a little different. We’re not actually encoding anything; rather, we’re trying to make the theoretical lower bound on the average number of bits/symbol as high as possible, which happens exactly when traffic is spread evenly across the servers.

We can consider each server to be a symbol, and the amount of traffic that it receives is decided by the hash function that we’re trying to choose. If one of our servers is the equivalent of the letter “e”, it’s going to be totally overloaded, while the “q” server isn’t going to be handling much traffic at all. We want each symbol (server) to appear (receive traffic) equally often.

So to calculate the entropy, we’ll take a histogram of the hash values with 20 buckets (for this post, imagine we have 20 servers rather than 1000). That gives us the number of requests that go to each server. Dividing by the total number of requests gives us each server’s probability of handling the next incoming request. These are the \( p_i \) values that we need in order to calculate the entropy. In code, it looks like this:

calc.entropy <- function(hash.value) {
    # split the 2^32 hash-space into 20 equal buckets (one per server)
    h <- hist(hash.value, plot = FALSE, breaks = seq(0, 2^32, length.out = 21))
    # each bucket's share of the requests is its probability p_i
    probs <- h$counts / sum(h$counts)
    # drop empty buckets so that 0 * log2(0) doesn't produce NaN
    probs <- probs[probs > 0]
    -sum(probs * log2(probs))
}
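To run it over all four hash functions at once, something like this should work (assuming the sample above lives in a data frame called traffic, which is my name for it, not Blake’s):

sapply(traffic[, c("H1", "H2", "H3", "H4")], calc.entropy)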

The entropy values for our four hash functions are:

##   hash.function entropy
## 1            H1   4.203
## 2            H2   4.226
## 3            H3   4.254
## 4            H4   4.180

And while we’re at it, here are the entropies of the best and worst case examples that we plotted earlier.

##    hash.function entropy
## 1     near.ideal   4.309
## 2 worst.possible   0.000

Why is the worst case value 0? Because if all traffic is going to one server, we wouldn’t need any bits at all to tell us which server a request is going to. The theoretical limit for a histogram with 20 buckets is \( -20 \cdot \frac{1}{20} \log_2\!\left(\frac{1}{20}\right) = \log_2(20) \approx 4.32 \), which we can get close to but never exceed.
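Both numbers are easy to sanity-check in R:

log2(20)                        # 4.322: the 20-bucket ceiling
calc.entropy(rep(2e+09, 1000))  # 0: every request lands in one bucket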

All of our hash functions appear to be working pretty well, especially given the small sample size I’m using for this blog post. It looks like our winner is H3.

To summarize, here’s what we did to find the optimal hashing function:

  • Get a representative sample of your web traffic
  • Run each request through the hashing function
  • Take a histogram of the resulting values with N bins, where N is the number of servers you have available
  • Divide the bin counts by the total number of requests in your sample to get each server’s probability of handling a request
  • Calculate the entropy, \( H(x) = -\sum_{i=0}^{N-1} p_i \log_2(p_i) \), for each hash function

There are more considerations to take into account, like setting an upper bound on the value of \( p_i \) to ensure that no single server ever gets so busy that it can’t handle its load.
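As a sketch of that guardrail (my suggestion, not part of the original analysis), you could flag any server whose share of the sampled traffic exceeds a cap:

check.load <- function(hash.value, cap = 0.10) {
    h <- hist(hash.value, plot = FALSE, breaks = seq(0, 2^32, length.out = 21))
    probs <- h$counts / sum(h$counts)
    which(probs > cap)  # servers whose share of traffic exceeds the cap
}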

If you want to read up more on information theory, Elements of Information Theory by Thomas Cover and Joy Thomas is an excellent book that is reasonably priced (used) on Amazon.

There is also, of course, Claude Shannon’s landmark 1948 paper, “A Mathematical Theory of Communication”, in which he essentially defines the entire field.

I made this a few months ago and just came across it again. I took the activity volume on tumblr in three different time zones: GMT, EST (-5 hours) and JST (Japan, +9 hours). I assigned a different pure-tone chord to each time zone, and varied the amplitude as the activity volume changed over the course of a week.

It’s fun to listen to how much tumblr can sound like the soundtrack to a weird B-movie.

storyboard:

Fuck Yeah Fuckyeah Blogs

No one really knows why the “Fuck Yeah X” blog phenomenon became so popular — nor why it’s still going very strong in terms of raw numbers. As for ultimate beginnings, conventional wisdom points to the pop-culture longevity of “America, Fuck Yeah” from the soundtrack to Trey Parker and Matt Stone’s 2004 flick Team America: World Police, but there’s no real evidence beyond the circumstantial to support this conclusion. Only a few mainstream media outlets dared cover the trend due to the profanity in the name (may we suggest “fudge yeah” as a workaround?).

Coincidentally, the bloggers behind Fuck Yeah Menswear were (allegedly) prematurely revealed yesterday as Kevin Burrows and Lawrence Schlossman (the latter running the non-fuckyeah Tumblr How to Talk to Girls at Parties); they have a book coming out this fall. So on Tumblr, where did the fuckyeah blogs really come from, and what are people fuckyeahing about these days?

Displaying Equations on Tumblr

thatdatabaseguy:

Just installed MathJax on my Tumblr (steps here: http://bit.ly/zQtVGs). If everything’s working correctly, this should display the sigmoid function:

\( f(z) = \frac{1}{1 + e^{-z}} \)

This is awesome! A+ find, and it’s just a single line of JavaScript that needs to be added to your theme:

<script type="text/javascript"
   src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

Tips for making a technical blog on tumblr

Technical blogs have a few special formatting requirements (code samples, equations, and graphs/figures), and anyone making a technical blog probably hates using WYSIWYG/rich text editors. Here are a few tips that will hopefully be useful. If you have any questions or better suggestions, hit up the comments section.

Default Post Editor

First of all, you can change the default editor mode in your account settings (the cog image in the top right of your dashboard).

I personally hate Markdown (experiment: see how many people google HTML-to-markdown formatting help vs. markdown-to-HTML formatting help), but I can definitely see how it’s preferred to a rich text editor.

Including Code Examples

There are a couple of ways of doing this: the easiest is to use the <code> and <pre> tags in HTML mode (or backticks in Markdown mode). That should produce something like this:

library(ggplot2)  # qplot comes from ggplot2

labels <- c(rep("a", 100 * rexp(1)),
            rep("b", 100 * rexp(1)),
            rep("c", 100 * rexp(1)),
            rep("d", 100 * rexp(1)))
x <- data.frame(labels = factor(labels), some.value = runif(length(labels)))
qplot(labels, data = x)

You can also embed a much prettier Github Gist. Just create a public Gist, copy the embed code and then put it in your tumblr post (paste it in using the HTML mode).

There are some pros and cons to using Gists. The pro is that it looks much better on your blog (syntax highlighting, etc.); the cons are that it won’t appear in the rich text editor or in your post preview (though it will be there once you publish the post), and that it shows up as a grey box in the Dashboard.

I personally suggest using the <pre><code> tags for very short examples and Gists for longer ones.

Images

Including images isn’t anything specific to technical blogs, but I want to point out that it’s a good idea in general to add photos using the “+ Upload photo” uploader instead of linking to remote images.  It won’t make a difference when viewing your blog, but remote images (on flickr, your personal website, etc) will show up as grey polaroid thumbnails in the Dashboard. This is because tumblr doesn’t know how big those remote images are, and if your Dashboard is full of 2MB high-resolution images, it would take forever to load.

LaTeX Equations

First of all, don’t search the “latex” tag on tumblr if you expect to find any tips about typesetting. That said, there isn’t an easy way to display equations on tumblr. Your best bet right now seems to be using a LaTeX generating API such as the Google Chart API. Inserting an image with the URL:

http://chart.apis.google.com/chart?cht=tx&chl=\frac{sin(x)}{x}

will embed the rendered equation as an image. If you want it to appear in the Dashboard, you’ll have to save/upload the image. I also have no idea if there is a rate limit on that API.

Hopefully these tips will help some people out. As I said above, if you have better ways of doing any of these, be sure to let me know.

Pie Charts & Word Clouds

Pie charts and word clouds have no place on technical blogs.

UPDATE

For LaTeX equations, use MathJax. Just include this line in your tumblr theme:

<script type="text/javascript"
   src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

Now you can insert LaTeX by wrapping it in \( and \) delimiters:

\(\frac{sin(x)}{x}\)

We’re all about identities, not identification.

- David Karp (via joshuanguyen)

We (all of Tumblr NYC) had a nice chat with Jeff Jarvis in the office today. It was mostly about his ideas on “publicness” in the internet age. I was impressed by how strongly every person on the tumblr team stood behind the above quote, each making arguments for it independently.

engineering:

As the new data guy at Tumblr, my first project is to take a look at algorithms we use to find and suggest blogs that a given user might be interested in.  This graph is a simple visual sample of my initial research.

Another engineer graciously volunteered to let me peek at the list of blogs he follows, from which I gathered a list of all the blogs they follow. From those two lists, I was able to create a large matrix with a row for each blog and a column for each person that he or she follows. Using a fairly simple SVD recommender, we are able to see a few distinct blog clusters (the axes here are the first three principal components).

The red dots are the blogs our guinea pig engineer follows (first degree), and the blue are the blogs his followers follow (second degree).  We performed a few spot tests to make sure that the groups made sense, and sure enough they do.  Up in the top left are some Tumblr staff blogs (including the official Staff Blog and David’s Log). The cluster on the far right, meanwhile, are a lot of “funny things I found on the internet”-style blogs. This engineer only follows one blog in the heart of that cloud, but you can see that the other followers of that blog are very cliquey (that is, they all follow each other).

My first post on the tumblr engineering blog.  This is coming along really well. I can’t wait to see it go live.
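For the curious, here’s a rough R sketch of the kind of projection described above. The data here is random stand-in data (not the real follower graph), and I’m skipping mean-centering for brevity, so this is illustrative only:

set.seed(1)
follows <- matrix(rbinom(200 * 50, 1, 0.1), nrow = 200)  # blogs x followers
s <- svd(follows)
coords <- s$u[, 1:3] %*% diag(s$d[1:3])  # first three principal components
pairs(coords)  # blogs followed by similar people land near each other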