Adam Laiacano

I'm a data engineer at tumblr and this is my blog. I write mostly about personal projects, data science, R/python, and various curiosities.

  1. How to approach a problem: self-indulgent music recommendations.

    I’ve been thinking a lot about music recommendations lately, and I realized that I’m usually a little bearish about listening to recommended bands that I’ve never heard of before. Maybe it’s just because I listen to a pretty broad variety of music, but I love re-discovering a band that I know but haven’t thought of in a while. So with that, let’s build a 100% self-centered music recommender. The goal is to remind myself of some bands that I might like to play next based on what I’m listening to right now.

    Fortunately for me, I’ve used last.fm to record the last 135,000+ tracks that I’ve listened to over the course of 7 years (my “now playing” is listed at the top of this page). And even more fortunately, they let you grab your entire history via their API. I was actually able to get 127,873 of them, which is more than plenty to work with. So let’s check out which artists I’ve listened to the most and see how well it matches up with my last.fm profile:

    Artist                 Plays
    ----------------------------
    Tom Waits              3155
    Justin Townes Earle    2613
    Iron & Wine            2053
    M. Ward                2005
    Lucero                 1832
    Old 97's               1761
    The Black Keys         1755
    Beach House            1624
    Death Cab for Cutie    1592
    Ryan Adams             1527
    

    The first thing that should become clear is that I listen to a lot of Sad Bastard music. At least both Dillinger Four and Samiam are in the top 20.

    Approach

    When designing this recommender, I’m going to try to answer the following question: Given the artist I’m currently listening to, what have I generally listened to next?

    Now since this is specific to me, I’m going add a few constraints. The first of which is this: I prefer listening to full albums rather than individual tracks. I’m not going to recommend songs, I’m going to recommend artists. Because of that, the only important attributes I need for each track are the time that I listened to it and the artist.

    First attempt

    Let’s look at the song-to-song transitions. That is, given that I’m listening to a song by The Antlers, which band am I likely to listen to next? This table shows the number of times I transition from listening to an Antlers song to each artist on the list, as well as the probability of the transition.

                      artist  transitions  transition_prob
                 The Antlers          854         0.870540
                 Beach House           22         0.022426
                 Arcade Fire            5         0.005097
              Patrick Watson            4         0.004077
    The Tallest Man on Earth            3         0.003058
              Okkervil River            3         0.003058
          The Avett Brothers            3         0.003058
                 Carla Bruni            3         0.003058
                     Pinback            2         0.002039
             North Highlands            2         0.002039
    

    This simply confirms what I stated earlier: when I listen to music, I listen to full albums. 87% of the time that I listen to an Antlers song, I listen to another one of their songs next. That’s not helpful for recommendations, so I’ll add another constraint: I’m only interested in transitions where the artists are not the same. Now the above list looks like this:

                      artist  transitions  transition_prob
                 Beach House           22         0.173228
                 Arcade Fire            5         0.039370
              Patrick Watson            4         0.031496
    The Tallest Man on Earth            3         0.023622
          The Avett Brothers            3         0.023622
              Okkervil River            3         0.023622
                 Carla Bruni            3         0.023622
                     Pinback            2         0.002039
             North Highlands            2         0.002039
               Fleetwood Mac            2         0.015748
    

    The order is the same as before, but the transition probabilities are much higher. This is a reasonable list of artists to recommend to someone who listens to The Antlers. Even last.fm has Beach House and Okkervil River in the top related artists.

    Modifications

    We’re doing well so far, but let’s see if we can make it a little better with just a bit more work. Beach House is the top recommendation, but I listen to them a lot . Of all of the tracks I’ve recorded, 1.27% of them are Beach House tracks. Considering there are a total of 1,342 unique artists in my data set, that means I’m \(\frac{0.0127}{1 / 1342} = 17 \) times more likely to listen to Beach House than the “average” band.

    So let’s use this information by dividing each transition probability by the unconditional probability of listening to a given artist. The unconditional probability is simply the total plays for each artist divided by the total number of plays (1624/127873 = 0.0127 for Beach House). The equation for the ranking has now become the following, where \( Pr(artist | Antlers) \) means the probability of listening to a given artist immediately after listening to The Antlers:

    \[ \frac{Pr(artist | Antlers)}{Pr(artist)} \]

    When I divide by the unconditional probability, I will give weight to artists that I listen to less often overall, making the results a little more exciting. If I multiply by this probability, however, I’ll give extra weight to artists I listen to more often, making the results more familiar. It’s probably important to point out that this is a bit of a hack that I came up with while typing up this blog post, and shouldn’t be confused with Bayes’ Theorem even though it looks sort of related. Anyway, let’s see what these rankings look like:

                    original             less familiar               more familiar
    ------------------------------------------------------------------------------
                 Beach House               April March                 Beach House
                 Arcade Fire              Broken Bells                   Tom Waits
              Patrick Watson               Beach House                 Arcade Fire
    The Tallest Man on Earth             Army of Ponch              Patrick Watson
          The Avett Brothers                   Mineral    The Tallest Man on Earth
              Okkervil River               Carla Bruni              Okkervil River
                 Carla Bruni               Arcade Fire                    Bon Iver
                     Pinback           North Highlands          The Avett Brothers
             North Highlands    The Murder City Devils              Dillinger Four
               Fleetwood Mac                   Pinback              Band of Horses
    

    You can see that Tom Waits moved up on the chart on the right because I’ve listen to him more than anybody else. For the middle list, however, there’s lots of stuff that didn’t even make the original cut. A band like Army of Ponch might not seem like the best recommendation to someone currently listening to The Antlers, but I’ve made that transition twice and might want to again.

    While we’re at it, here’s the list of recommendations for Bruce Springsteen:

               original               less familiar          more familiar
    ----------------------------------------------------------------------
            Chuck Ragan                 Buddy Holly              Tom Waits
         Built to Spill            Buckingham Nicks            Chuck Ragan
         Camera Obscura                   Sam Cooke         Built to Spill
              Tom Waits                The Jayhawks                  Wilco
                  Wilco                 Chuck Ragan         Camera Obscura
         Okkervil River           Bridge and Tunnel         Okkervil River
             Mean Creek              Built to Spill    Death Cab for Cutie
    Death Cab for Cutie                    Mastodon               Old 97's
           Dan Auerbach              Camera Obscura             Ryan Adams
      Bridge and Tunnel    Iron & Wine and Calexico                  Spoon
    

    Just reading that list reminds me that Mean Creek has a new record out that I’m going to listen to right now.

    Performance

    So which list is best? How does it compare to standard information retrieval techniques? Well, that’s probably different for each person and the only way to find out is to test it. I could (and might eventually) put together a little app that recommends me some artists from my listening history based on what I’m listening to right now. With a simple A/B test, I could see which of the three recommendation algorithms I follow most often and stick with that one in the future. To do that, I would have to record

    • The artist that is currently playing
    • Which artists recommendations are displayed
    • Which recommended artist (if any) was played next

    The recommendation algorithm that provides the highest play / display ratio is the one I’d like to go with in the future. This seems like an obvious place to plug sifter for performing A/B and other types of testing in scenarios like this.

    Conclusion

    The point of this blog post is more about the thought process than the technical parts of the recommender. There are lots of things that I could have done “right,” like using properties of Markov Chains (which is essentially what I built) to improve the system, or account for the fact that Buddy Holly follows Bruce Springsteen in my music library, so maybe that isn’t a true transition.

    I think the main takeaway is really in the constraints that I put on the system. The idea for this one-day project followed the following course:

    1. Build a music recommendation system
    2. Build a music recommendation system that only uses my last.fm data
    3. Build a music recommendation system that only recommends music I already know
    4. Build an artist recommendation system, not songs
    5. Only recommend artists that I’ve listened to immediately after the given artist
    6. Come up with a few simple variations and test them for performance

    Identifying the problem correctly let me build something in just a few hours. Is it the best recommender system the world has ever seen? Actually, it might be, because I’ve never seen any recommender that only suggests content that you are already familiar with and that’s what I wanted. But we’d have to test it against the likes of Last.fm, Spotify, and Pandora to find out.

    As I said in the beginning, I’ve been thinking about this stuff a lot lately, but that’s not to say I’ve put this method into production anywhere. The code for all of this was done using python/pandas, and breaks pretty much every rule that I laid out in my previous blog post, so I’ll clean that up and get it posted soon.

  2. 2013-03-24
    #recommendations #data science
  3. Any data scientist worth their salary will tell you that you should start with a question, NOT the data. Unfortunately, data hackathons often lack clear problem definitions. Most companies think that if you can just get hackers, pizza, and data together in a room, magic will happen. This is the same as if Habitat for Humanity gathered its volunteers around a pile of wood and said, “Have at it!” By the end of the day you’d be left with a half of a sunroom with 14 outlets in it.

    Jake Porway on how You Can’t Just Hack Your Way to Social Change

  4. 2013-03-07
    #data science #hackathons #jake porway #datakind
  5. Using entropy to route web traffic

    Earlier this week, Blake asked me for some help with a problem he’s working on. He has a couple of hash functions that are being used to route web traffic to a number of different servers. A hash function takes an input, such as a blog’s url, and outputs a number between 0 and 232. Say we have 1000 servers, that means that each one will handle about 430 million points in the hash-space.

    The data looked something like this (with fake blog names, of course):

    ##       blog.name        H1        H2        H3        H4
    ## 1 23.tumblr.com 3.137e+09 1.866e+09 6.972e+08 5.792e+08
    ## 2 19.tumblr.com 1.875e+09 2.545e+08 2.606e+09 1.312e+09
    ## 3 34.tumblr.com 1.366e+09 2.236e+09 1.106e+09 3.640e+09
    ## 4 43.tumblr.com 2.639e+09 1.098e+09 8.755e+08 1.507e+09
    ## 5 90.tumblr.com 6.564e+08 5.397e+07 3.084e+09 2.961e+09
    ## 6 29.tumblr.com 2.476e+09 4.532e+08 2.787e+08 4.894e+08
    

    One important thing to point out before we get started is that this has to be a representative sample of the request data. Despite the wild popularity of my personal blog, it doesn’t get a sliver of the traffic that Beyonce gets. That fact needs to be represented in the sample data, meaning that her blog should appear in more rows of the sample data than mine.

    Plot the data

    The first thing I ever do is plot data to get a sense of what I’m working with and what I’m trying to accomplish. The density plots below show the distribution of values in the hash space for each algorithm. If you’re not familiar with kernel density plots, you can imagine this to be a smoothed (and prettier) version of a histogram. For the electrical engineers out there, it’s the sum of the convolution of a kernel function (usually a gaussian), with an impulse function at each of the points on the x-axis (represented here by dots).

    image

    By comparison, here are the density plots of a “near-ideal” example (1000 pulls from a uniform distribution) and a bad example (all assigned to the value 2e+09). The worst case example here shows the shape of the kernel function.

    image

    Calculate the entropy

    In information theory, entropy is the minimum number of bits required (on average) to identify an encoded symbol (stay with me here…). If we’re trying to transmit a bunch of text digitally, we would encode the alphabet where each “symbol” is a letter that will be represented in 1’s and 0’s. In order to transmit our message quickly, we want to use as few bits as possible. Since the letter “e” appears more frequently than the letter “q”, we want the symbol for “e” to have fewer bits than “q”. Make sense?

    Huffman Coding is one encoding algorithm. There’s an example implementation here, which assigns the code 100 to “e” and 1111110010 to “q”. The more “uneven” your symbol distribution is, the fewer bits it will take, on average, to transmit your message (meaning you’ll have a lower entropy). The entropy value is a lower bound for the actual weighted average of the symbol lengths. There are special cases where some encoding algorithms get closer to the entropy value than others, but none will ever surpass it.

    The actual entropy formula is:

    \[ H(x)=-\sum_{i=0}^{N-1} p_i log_2(p_i) \]

    Where \( H(x) \) is the entropy, and \( p_i \) is the probability of that symbol \( i \) will appear. In the example I linked to above, \( p_e = 0.124 \) and \( p_q=0.0009 \), so it makes sense that e’s symbol is so much shorter. In the example, the average number of bits per symbol is \( \sum S_i p_i = 4.173 \frac{bits}{symbol} \), where \( S_i \) is the number of bits in the symbol. The entropy, from the above equation, is \( 4.142 \frac{bits}{symbol} \).

    The example problem of web traffic distribution is a little different. We’re not actually encoding anything, but rather trying to make the theoretical lower bound for average number of bits/signal as high as possible.

    We can consider each server to be a symbol, and the amount of traffic that it recieves is decided by the hash function that we’re trying to choose. If one of our servers is the equivalent of the letter “e”, it’s going to be totally overloaded while the “q” isn’t going to be handling much traffic at all. We want each symbol (server) to appear (receive traffic) equally often.

    So to calculate the entropy, we’ll take a histogram of the hash values with 20 buckets (representing the 20 servers). That will give us the number of requests that go to each server. Dividng that by the total number of requests gives us each server’s probability of handing the next incoming request. These are the \( p_i \) values that we need in order to calculate the entropy. In code, it looks like this:

    calc.entropy <- function(hash.value) {
        h = hist(hash.value, plot = FALSE, breaks = seq(0, 2^32, length.out = 21))
        probs = h$counts/sum(h$counts)
        print(probs)
        entropy = -sum(probs * log2(probs))
    }
    

    The entropy values for our four hash functions are:

    ##   hash.function entropy
    ## 1            H1   4.203
    ## 2            H2   4.226
    ## 3            H3   4.254
    ## 4            H4   4.180
    

    And while we’re at it, here’s the entropy of our best/worst case example that we plotted earlier.

    ##    hash.function entropy
    ## 1     near.ideal   4.309
    ## 2 worst.possible   0.000
    

    Why is the worst case value 0? Because if all traffic is going to one server, we wouldn’t need any bits at all to tell us which server the request is going to. The theoretical limit for a histogram with 20 buckets is: \( -20\frac{1}{20}log_2{\frac{1}{20}} = 4.32 \), which we’re close to but can never exceed.

    All of our hash functions appear to be working pretty well, especially for such a small sample size that I’m using for this blog post. It looks like our winner is H3.

    To summarize here’s what we did to find the optimal hashing function:

    • Get a representative sample of your web traffic
    • Run each request through the hashing function
    • Take a histogram of the resulting values with N bins, where N is the number of servers you have available
    • Divide the bin counts by the total number of requests in your sample to get the probability of handing a request for each server
    • Calculate the entropy, \( H(x)=-\sum_{i=0}^{N-1} p_i log_2(p_i) \), for each hash function

    There are more considerations to take, like setting an upper bound on the value of \( p_i \) to ensure that no single server ever gets so busy that it can’t handle its load.

    If you want to read up more on information theory, Elements of Information Theory by Thomas Cover and Joy Thomas is an excellent book that is reasonably priced (used) on Amazon.

    There is also, of course, Claude Shannon’s landmark paper from 1948 “A Mathematical Theory of Communication”“, in which he essentially defines the entire field.

  6. 2013-03-01
    #information theory #tumblr #engineering #data science
  7. Huge turnout for the first NYC Data Science meetup! Probably about 300 people on a gross January night.

    Huge turnout for the first NYC Data Science meetup! Probably about 300 people on a gross January night.

  8. 2013-02-11
    #meetups #data science
  9. Are you going to be optimizing the solution to one problem, or solving a wide variety of problems? If it’s just one problem, is it something that you can imagine being happy about working on a year from now?

    Hilary Mason is on point, as usual. Read Finding a Great Data Science Job for more excellent considerations.

  10. 2013-02-04
    #data science
  11. BigData TechCon 2013

    I’m honored to be giving a keynote talk at the 2013 BigData TechCon conference in Boston (April 8-10). There are a lot of excellent speakers at the event that I’m looking forward to seeing, including Claudia Perlich (Media6Degrees), Jonathan Seidman (Cloudera, formerly Orbitz), and Oscar Boykin (twitter).

    Registration is open hereHere’s a short abstract for my talk:

    Social networks are able to collect large amounts of activity data from their user and customer base. As Big Data professionals, we conduct experiments on custom data sets to measure the effectiveness of our products or advertising methodologies. Since a social network is effectively useless without an active community, our companies owe it to their users to create new and better products based on this information. Learn how our data analysis and predictive analytics must take a different approach than Big Data in fields like finance, medicine, and defense.

    I’m extra excited that the conference is in my home town. It’s been a while since I’ve spent a few days in Boston and I look forward to seeing all of my friends up there. I can already taste the Silhouette popcorn.

  12. 2013-01-23
    #data science #conferences #2013 #projects
  13. I think what makes a good data scientist is the same thing that makes you a good [insert field here] scientist. In other words, a good data scientist is a good scientist.

    Roger Peng on What makes a good data scientist.

  14. 2013-01-04
    #data science #simply statistics #roger peng
  15. Optimal Descriptive NFL Rankings

    seanjtaylor:

    Most NFL fans, like myself, obsess over who’s going to win games or which players to start in our fantasy football leagues. One of the fundamental tools we use to look at this are rankings. Rankings are the simplest possible model that can represent a total order, which you can think of as a function that allows you to compare all possible pairs in your set.

    Describing the NFL season

    In the NFL regular season, each of the 32 teams plays 16 games. The result is 256 binary outcomes, which if we assume are the same as coin flips (independent trials of a fair coin), then the entirety of the season contains at most 256 bits of information. One way to think about this is that if you could send yourself a string of 256 1s or 0s from January back in time to Septmeber, with a simplistic coding scheme, your past self could correctly predict all 256 games.

    I began to think: we all love rankings, but how well can you describe the season with a ranking of teams? There are 32! possible rankings, so a ranking of all 32 teams contains log2(32!) bits of information, or about 118 bits. This is the smallest amount of information you could use to describe what happens in all possible pairings of teams. How accurate is it to describe 256 bits with 118 bits? How many games would you get wrong?

    Looks like Sean J. Taylor has moved his blog to tumblr! Click through and scroll to the bottom to skip the math and see the cool interactive NFL ranking graphs.

  16. 2012-11-20
    #nfl #statistics #data science
  17. The videos from PyData NYC 2012 are now online. There are about 40 videos, most of which are 30-45 minutes long with some lightening talks thrown in. They cover everything from network analysis, literate programming with iPython Notebook, R/Hadoop/Pig integration, plotting with matplotlib, and much more.

  18. 2012-11-13
    #data science #pydata #python #machine learning #programming #hadoop #big data
  19. The Bad Data Handbook is finally out! It&#8217;s the first time I&#8217;ve contributed to a publication and I&#8217;m incredibly excited about it.
The top required skills of a data scientist are generally considered to be: mathematical know-how, programming capabilities, and some sort of domain knowledge enabling them to ask (and then answer) relevant questions.
This book is about all of the other garbage that you have to put up with along the way. Each chapter is written by someone who has spent more time than they probably would have liked dealing with a specific issue, and they provide some tips and pitfalls. I haven&#8217;t read the other chapters yet, but some of the included topics are:
Test drive your data to see if it’s ready for analysis
Work spreadsheet data into a usable form
Handle encoding problems that lurk in text data
Develop a successful web-scraping effort (that&#8217;s me)
Use NLP tools to reveal the real sentiment of online reviews
Address cloud computing issues that can impact your analysis effort
Avoid policies that create data analysis roadblocks
Take a systematic approach to data quality analysis
I hope at least a few of you buy it and enjoy it.

    The Bad Data Handbook is finally out! It’s the first time I’ve contributed to a publication and I’m incredibly excited about it.

    The top required skills of a data scientist are generally considered to be: mathematical know-how, programming capabilities, and some sort of domain knowledge enabling them to ask (and then answer) relevant questions.

    This book is about all of the other garbage that you have to put up with along the way. Each chapter is written by someone who has spent more time than they probably would have liked dealing with a specific issue, and they provide some tips and pitfalls. I haven’t read the other chapters yet, but some of the included topics are:

    • Test drive your data to see if it’s ready for analysis
    • Work spreadsheet data into a usable form
    • Handle encoding problems that lurk in text data
    • Develop a successful web-scraping effort (that’s me)
    • Use NLP tools to reveal the real sentiment of online reviews
    • Address cloud computing issues that can impact your analysis effort
    • Avoid policies that create data analysis roadblocks
    • Take a systematic approach to data quality analysis

    I hope at least a few of you buy it and enjoy it.

  20. 2012-11-12
    #data science #o'reilly media