<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>I’m a data engineer at tumblr and this is my blog. I write mostly about personal projects, data science, R/python, and various curiosities.</description><title>Adam Laiacano</title><generator>Tumblr (3.0; @adamlaiacano)</generator><link>http://www.adamlaiacano.com/</link><item><title>Baseball by the (jersey) numbers</title><description>&lt;p&gt;The other day I saw &lt;a href="http://blog.mattlehrer.com/post/48637680583/cbssports-is-seventh-woods-the-next-lebron" target="_blank"&gt;this video&lt;/a&gt; of a 14 year old high school basketball sensation. He&amp;#8217;s billed as &amp;#8220;The Next LeBron&amp;#8221; who was in turn billed as &amp;#8220;The Next Michael Jordan.&amp;#8221; What caught my eye is that they all wear number 23 on their jersey (until LeBron moved to The Heat, at least).&lt;/p&gt;

&lt;p&gt;When I was a kid I played lots of sports, and the best basketball player on my team was always #23 and the best football player was always #34 (Bo Jackson, Walter Payton, etc). So that got me to wondering what the &amp;#8220;desirable&amp;#8221; numbers are in other sports. I asked about other sports on twitter and #10 is clearly the best number to wear in soccer, and maybe #9 for hockey (Richard, Hull, Howe).&lt;/p&gt;

&lt;p&gt;I wanted to figure out which jersey numbers have been the best historically, and baseball is the obvious sport to turn to for this, since data is so reliable and readily available. So I downloaded the career stats for a little over 17,000 players (which I think is every baseball player ever) and decided to see which are the best jersey numbers overall.&lt;/p&gt;

&lt;p&gt;For each player, I got their career batting average and their jersey number, which proved to be harder than I expected. For example, Johnny Damon wore #18 from 2002-2009 (during his best seasons with the Red Sox and Yankees), but has also worn 51, 8, 22, and 33. So to make it a little easier on myself, I just got the number that they wore on the most teams. Johnny Damon has had 5 numbers on 7 teams, but wore #18 on 4 teams, so I&amp;#8217;m associating him with #18.&lt;/p&gt;
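
&lt;p&gt;As a rough illustration of that rule, here is a minimal Python sketch. It is not the code from the analysis itself; the function name &lt;code&gt;primary_number&lt;/code&gt;, the input layout, and the team/number pairs are all made up for the example.&lt;/p&gt;

```python
from collections import Counter


def primary_number(stints):
    """Return the jersey number worn on the most teams.

    stints: list of (team, number) pairs, one pair per team/number combination.
    If two numbers tie, the one that appears first in the list wins.
    """
    counts = Counter(number for _team, number in stints)
    return counts.most_common(1)[0][0]


# Hypothetical player: three teams with #18, one with #51, one with #8.
stints = [("Team A", 18), ("Team B", 51), ("Team C", 18), ("Team D", 8), ("Team E", 18)]
print(primary_number(stints))  # prints 18
```

&lt;p&gt;Counting teams rather than seasons is a simplification: a number worn for one long stint counts the same as one worn for a single short stop.&lt;/p&gt;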

&lt;p&gt;For scoring the jersey number, I&amp;#8217;m taking the mean of career batting averages for all players who wore that number, weighted by the number of plate appearances per batter. I perform this weighting so that someone like Roberto Clemente who hit .317 over 10,211 plate appearances would have more influence than someone like Buster Posey, who hit .311 with only 1,324 plate appearances. That&amp;#8217;s a bit of a mouthful so here&amp;#8217;s some math that might make more sense:&lt;/p&gt;

&lt;p&gt;\[ S_j = \sum_{i=1}^{N_j} \frac{p_{j,i} b_{j,i}}{\sum_{k=1}^{N_j}p_{j,k}} \]&lt;/p&gt;

&lt;p&gt;Where \( S_j \) is the score for jersey number \( j \), \( b_{j,i} \) and \( p_{j,i} \) are respectively the batting average and number of plate appearances for the \( i^{th} \) player who wears jersey number \( j \), and \( N_j \) is the total number of players who wore jersey number \( j \). I was also sure to limit the data set to \( p_{j,i} \geq 500 \) and \( N_j \geq 5 \). There were also a lot of older (~100 years ago) players whose number I couldn&amp;#8217;t gather, so I dropped those as well. This narrowed the data set down to about 3,500 players.&lt;/p&gt;
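
&lt;p&gt;The formula and the two cutoffs can be sketched in a few lines of Python. This is only an illustration under assumed names: &lt;code&gt;jersey_scores&lt;/code&gt; and the tuple layout are hypothetical, and the toy input lowers the minimum number of players to 2 so that it produces a score.&lt;/p&gt;

```python
from collections import defaultdict


def jersey_scores(players, min_pa=500, min_players=5):
    """Plate-appearance-weighted mean batting average per jersey number.

    players: iterable of (jersey_number, plate_appearances, batting_average).
    Players below min_pa are dropped first, then numbers with fewer than
    min_players remaining players are dropped.
    """
    by_number = defaultdict(list)
    for number, pa, avg in players:
        if pa >= min_pa:
            by_number[number].append((pa, avg))

    scores = {}
    for number, rows in by_number.items():
        if len(rows) >= min_players:
            total_pa = sum(pa for pa, _avg in rows)
            scores[number] = sum(pa * avg for pa, avg in rows) / total_pa
    return scores


# Toy input: two stat lines quoted later in this post plus one invented
# player who falls under the plate-appearance cutoff.
players = [
    (4, 9480, 0.358),  # Rogers Hornsby
    (4, 9663, 0.340),  # Lou Gehrig
    (4, 120, 0.100),   # invented, dropped because 120 is under 500
]
scores = jersey_scores(players, min_players=2)
print(round(scores[4], 4))
```

&lt;p&gt;Because the weights are plate appearances, the result sits between .340 and .358 and is pulled slightly toward the player with more trips to the plate.&lt;/p&gt;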

&lt;p&gt;The next thing to consider is that since I&amp;#8217;m using lifetime batting average and number of plate appearances to rank players, what I&amp;#8217;m really doing is ranking &lt;em&gt;hitters&lt;/em&gt;. Many pitchers in the National League end up with over 500 at bats, but their batting average is just going to hurt the overall ranking of their jersey number. The following graph makes this crystal clear. The y-axis shows the jersey number, and the x-axis is the batting average for each player.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/64671180625438f8d95bff637228ecc3/tumblr_inline_mm2fr120WA1qz4rgp.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;So once we remove the pitchers, here&amp;#8217;s where each number ranks in terms of weighted batting average:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/53c3e13439cb981bf5966615761e1b8a/tumblr_inline_mm2frbTwn31qz4rgp.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;There are some clear winners here. The number 51 is a bit of an outlier because there are only 10 batters with that number who meet the minimum plate appearance requirement, but the list includes Ichiro Suzuki (.321) and Bernie Williams (.297) among others with high batting averages. Here are the top players who wore #4, a more &amp;#8220;classic&amp;#8221; number:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;            name plate_appearances batting_average
  Rogers Hornsby              9480           0.358
      Lou Gehrig              9663           0.340
Riggs Stephenson              5134           0.336
  Dale Alexander              2736           0.331
     Babe Herman              6228           0.324
    Luke Appling             10254           0.310
     Hack Wilson              5556           0.307
    Paul Molitor             12167           0.306
    Smead Jolley              1815           0.305
         Mel Ott             11348           0.304
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Not a bad group to be in. There are also some clear loser numbers, but they&amp;#8217;re mostly higher numbers which I think are often worn by pitchers.&lt;/p&gt;

&lt;p&gt;Here are the 10 best and worst lifetime batting averages:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;            name number plate_appearances batting_average
  Rogers Hornsby      4              9480           0.358          0
    Ted Williams      9              9788           0.344          0
      Bill Terry      3              7108           0.341          0
      Lou Gehrig      4              9663           0.340          0
      Tony Gwynn     19             10232           0.338          0
Riggs Stephenson      4              5134           0.336          0
      Al Simmons      7              9518           0.334          0
      Paul Waner     24             10766           0.333          0
  Dale Alexander      4              2736           0.331          0
     Stan Musial      6             12717           0.331          0
...
    Corky Miller     37               575           0.188          0
   J.R. Phillips     17               545           0.188          0
    Bill Plummer      8              1007           0.188          0
         Gus Gil     18               538           0.186          0
    Brandon Wood     32               751           0.186          0
     Drew Butera     41               531           0.183          0
      Kevin Cash     17               714           0.183          0
      Tommy Dean      3               594           0.180          0
       Ray Oyler      1              1445           0.175          0
   John Vukovich     16               607           0.161          0
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you&amp;#8217;re a baseball fan, you can see that the bottom is all populated by pitchers, because no .180 batter would ever get 950 at bats in the majors. In fact, baseball&amp;#8217;s bias towards keeping good hitters and cutting poor hitters is obvious when you plot lifetime batting average against total plate appearances:&lt;/p&gt;

&lt;p&gt;So what number would I wear if I were a pro baseball player? I can&amp;#8217;t NOT choose #9, even though it&amp;#8217;s retired by the Red Sox.&lt;/p&gt;

&lt;p&gt;Also, all of the code and results for this blog post are available on my &lt;a href="https://github.com/alaiacano/baseball-jersey-numbers" target="_blank"&gt;github page&lt;/a&gt;.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/49255762570</link><guid>http://www.adamlaiacano.com/post/49255762570</guid><pubDate>Tue, 30 Apr 2013 09:03:00 -0400</pubDate><category>sports</category><category>baseball</category></item><item><title>The blogroll « Statistical Modeling, Causal Inference, and Social Science</title><description>&lt;a href="http://andrewgelman.com/2013/04/29/the-blogroll/"&gt;The blogroll « Statistical Modeling, Causal Inference, and Social Science&lt;/a&gt;: &lt;p&gt;A great list of blogs on statistics, visualization, politics and more. Curated by &lt;a href="http://andrewgelman.com/" target="_blank"&gt;Andrew Gelman&lt;/a&gt;. I’m not sure how &lt;a href="http://simplystatistics.org/" target="_blank"&gt;Simply Statistics&lt;/a&gt; didn’t make the list.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/49176512997</link><guid>http://www.adamlaiacano.com/post/49176512997</guid><pubDate>Mon, 29 Apr 2013 09:27:00 -0400</pubDate><category>links</category><category>andrew gelman</category></item><item><title>SADDLE: Scala Data Library</title><description>&lt;a href="http://saddle.github.com/"&gt;SADDLE: Scala Data Library&lt;/a&gt;: &lt;p&gt;I sent out a tweet last night asking if there are any good data libraries for Scala. The next morning, Saddle gets released. 
I’ve heard it described as “Pandas for Scala,” which is great because nothing beats having &lt;a href="http://saddle.github.com/doc/quickstart.html#frame" target="_blank"&gt;data frames&lt;/a&gt; ported to a new language.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/46948129555</link><guid>http://www.adamlaiacano.com/post/46948129555</guid><pubDate>Tue, 02 Apr 2013 13:44:41 -0400</pubDate><category>tech</category><category>scala</category><category>pandas</category><category>programming</category></item><item><title>"All models are wrong, but some are useful."</title><description>“All models are wrong, but some are useful.”&lt;br/&gt;&lt;br/&gt; - &lt;em&gt;&lt;a href="http://robjhyndman.com/hyndsight/gepbox/?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed:%20RobJHyndman-ResearchTips%20(Research%20tips)" target="_blank"&gt;George E P Box (1919-2013)&lt;/a&gt;&lt;/em&gt;</description><link>http://www.adamlaiacano.com/post/46756663915</link><guid>http://www.adamlaiacano.com/post/46756663915</guid><pubDate>Sun, 31 Mar 2013 09:49:00 -0400</pubDate><category>george box</category><category>statistics</category><category>quotes</category></item><item><title>How to approach a problem: self-indulgent music recommendations.</title><description>&lt;p&gt;I’ve been thinking a lot about music recommendations lately, and I realized that I’m usually a little bearish about listening to recommended bands that I’ve never heard of before. Maybe it’s just because I listen to a pretty broad variety of music, but I love re-discovering a band that I know but haven’t thought of in a while. So with that, let’s build a 100% self-centered music recommender. The goal is to remind myself of some bands that I might like to play next based on what I’m listening to right now.&lt;/p&gt;

&lt;p&gt;Fortunately for me, I’ve &lt;a href="http://www.last.fm/user/alaiacano" target="_blank"&gt;used last.fm&lt;/a&gt; to record the last 135,000+ tracks that I’ve listened to over the course of 7 years (my “now playing” is listed at the top of this page). And even more fortunately, they let you grab your entire history via their API. I was actually able to get 127,873 of them, which is more than plenty to work with. So let’s check out which artists I’ve listened to the most and see how well it matches up with my last.fm profile:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Artist                 Plays
----------------------------
Tom Waits              3155
Justin Townes Earle    2613
Iron &amp;amp; Wine            2053
M. Ward                2005
Lucero                 1832
Old 97's               1761
The Black Keys         1755
Beach House            1624
Death Cab for Cutie    1592
Ryan Adams             1527
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first thing that should become clear is that I listen to a lot of Sad Bastard music. At least both Dillinger Four and Samiam are in the top 20.&lt;/p&gt;

&lt;h1&gt;Approach&lt;/h1&gt;

&lt;p&gt;When designing this recommender, I’m going to try to answer the following question: Given the artist I’m currently listening to, what have I generally listened to next?&lt;/p&gt;

&lt;p&gt;Now since this is specific to me, I’m going to add a few constraints. The first of which is this: I prefer listening to full albums rather than individual tracks. I’m not going to recommend songs, I’m going to recommend artists. Because of that, the only important attributes I need for each track are the time that I listened to it and the artist.&lt;/p&gt;

&lt;h1&gt;First attempt&lt;/h1&gt;

&lt;p&gt;Let’s look at the song-to-song transitions. That is, given that I’m listening to a song by The Antlers, which band am I likely to listen to next? This table shows the number of times I transition from listening to an Antlers song to each artist on the list, as well as the probability of the transition.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;                  artist  transitions  transition_prob
             The Antlers          854         0.870540
             Beach House           22         0.022426
             Arcade Fire            5         0.005097
          Patrick Watson            4         0.004077
The Tallest Man on Earth            3         0.003058
          Okkervil River            3         0.003058
      The Avett Brothers            3         0.003058
             Carla Bruni            3         0.003058
                 Pinback            2         0.002039
         North Highlands            2         0.002039
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This simply confirms what I stated earlier: when I listen to music, I listen to full albums. 87% of the time that I listen to an Antlers song, I listen to another one of their songs next. That’s not helpful for recommendations, so I’ll add another constraint: I’m only interested in transitions where the artists are not the same. Now the above list looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;                  artist  transitions  transition_prob
             Beach House           22         0.173228
             Arcade Fire            5         0.039370
          Patrick Watson            4         0.031496
The Tallest Man on Earth            3         0.023622
      The Avett Brothers            3         0.023622
          Okkervil River            3         0.023622
             Carla Bruni            3         0.023622
                 Pinback            2         0.015748
         North Highlands            2         0.015748
           Fleetwood Mac            2         0.015748
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The order is the same as before, but the transition probabilities are much higher. This is a reasonable list of artists to recommend to someone who listens to The Antlers. Even last.fm has Beach House and Okkervil River in the top &lt;a href="http://www.last.fm/music/The+Antlers/+similar" target="_blank"&gt;related artists&lt;/a&gt;.&lt;/p&gt;
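
&lt;p&gt;The counting behind these two tables can be sketched as follows. This is a minimal Python illustration, not the pandas code used for the post; the function name &lt;code&gt;transitions_from&lt;/code&gt;, the &lt;code&gt;skip_self&lt;/code&gt; flag, and the tiny listening history are invented for the example.&lt;/p&gt;

```python
from collections import Counter


def transitions_from(plays, current, skip_self=True):
    """Count which artist follows `current` in a listening history.

    plays: list of artist names in the order the tracks were played.
    Returns a list of (next_artist, count, probability), most frequent first.
    """
    counts = Counter(
        nxt
        for prev, nxt in zip(plays, plays[1:])
        if prev == current and not (skip_self and nxt == current)
    )
    total = sum(counts.values())
    return [(artist, n, n / total) for artist, n in counts.most_common()]


# Invented listening history, oldest track first.
plays = [
    "The Antlers", "The Antlers", "Beach House",
    "The Antlers", "Arcade Fire",
    "The Antlers", "Beach House",
]
for artist, n, prob in transitions_from(plays, "The Antlers"):
    print(artist, n, round(prob, 3))
```

&lt;p&gt;Passing &lt;code&gt;skip_self=False&lt;/code&gt; keeps the same-artist transitions, which corresponds to the first table; the default drops them and renormalizes, which corresponds to the second.&lt;/p&gt;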

&lt;h1&gt;Modifications&lt;/h1&gt;

&lt;p&gt;We’re doing well so far, but let’s see if we can make it a little better with just a bit more work. Beach House is the top recommendation, but I listen to them a lot. Of all of the tracks I’ve recorded, 1.27% of them are Beach House tracks. Considering there are a total of 1,342 unique artists in my data set, that means I’m \(\frac{0.0127}{1 / 1342} = 17 \) times more likely to listen to Beach House than the “average” band.&lt;/p&gt;

&lt;p&gt;So let’s use this information by dividing each transition probability by the unconditional probability of listening to a given artist. The unconditional probability is simply the total plays for each artist divided by the total number of plays (1624/127873 = 0.0127 for Beach House). The equation for the ranking has now become the following, where \( Pr(artist | Antlers) \) means the probability of listening to a given artist immediately after listening to The Antlers:&lt;/p&gt;

&lt;p&gt;\[ \frac{Pr(artist | Antlers)}{Pr(artist)} \]&lt;/p&gt;

&lt;p&gt;When I divide by the unconditional probability, I will give weight to artists that I listen to less often overall, making the results a little more exciting. If I multiply by this probability, however, I’ll give extra weight to artists I listen to more often, making the results more familiar. It’s probably important to point out that this is a bit of a hack that I came up with while typing up this blog post, and shouldn’t be confused with Bayes’ Theorem even though it looks sort of related. Anyway, let’s see what these rankings look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;                original             less familiar               more familiar
------------------------------------------------------------------------------
             Beach House               April March                 Beach House
             Arcade Fire              Broken Bells                   Tom Waits
          Patrick Watson               Beach House                 Arcade Fire
The Tallest Man on Earth             Army of Ponch              Patrick Watson
      The Avett Brothers                   Mineral    The Tallest Man on Earth
          Okkervil River               Carla Bruni              Okkervil River
             Carla Bruni               Arcade Fire                    Bon Iver
                 Pinback           North Highlands          The Avett Brothers
         North Highlands    The Murder City Devils              Dillinger Four
           Fleetwood Mac                   Pinback              Band of Horses
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can see that Tom Waits moved up on the chart on the right because I’ve listened to him more than anybody else. For the middle list, however, there’s lots of stuff that didn’t even make the original cut. A band like Army of Ponch might not seem like the best recommendation to someone currently listening to The Antlers, but I’ve made that transition twice and might want to again.&lt;/p&gt;
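
&lt;p&gt;Here is a small Python sketch of the three rankings. The function name &lt;code&gt;rank_artists&lt;/code&gt; and the mode labels are assumptions, and the probabilities are invented to show how the order can flip; they are not taken from my data.&lt;/p&gt;

```python
def rank_artists(transition_prob, play_share, mode="original"):
    """Rank candidate artists for the currently playing artist.

    transition_prob: {artist: Pr(artist | current artist)}
    play_share:      {artist: Pr(artist)}, the share of all plays
    mode: "original", "less_familiar" (divide) or "more_familiar" (multiply)
    """
    if mode == "less_familiar":
        score = {a: p / play_share[a] for a, p in transition_prob.items()}
    elif mode == "more_familiar":
        score = {a: p * play_share[a] for a, p in transition_prob.items()}
    elif mode == "original":
        score = dict(transition_prob)
    else:
        raise ValueError("unknown mode: " + mode)
    return sorted(score, key=score.get, reverse=True)


# Invented probabilities, chosen only to show how the order can change.
transition_prob = {"Beach House": 0.17, "Tom Waits": 0.10, "Army of Ponch": 0.016}
play_share = {"Beach House": 0.0127, "Tom Waits": 0.0247, "Army of Ponch": 0.0005}

for mode in ("original", "less_familiar", "more_familiar"):
    print(mode, rank_artists(transition_prob, play_share, mode))
```

&lt;p&gt;With these made-up inputs the rarely played artist jumps to the top when dividing, and the most played artist jumps to the top when multiplying.&lt;/p&gt;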

&lt;p&gt;While we’re at it, here’s the list of recommendations for Bruce Springsteen:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;           original               less familiar          more familiar
----------------------------------------------------------------------
        Chuck Ragan                 Buddy Holly              Tom Waits
     Built to Spill            Buckingham Nicks            Chuck Ragan
     Camera Obscura                   Sam Cooke         Built to Spill
          Tom Waits                The Jayhawks                  Wilco
              Wilco                 Chuck Ragan         Camera Obscura
     Okkervil River           Bridge and Tunnel         Okkervil River
         Mean Creek              Built to Spill    Death Cab for Cutie
Death Cab for Cutie                    Mastodon               Old 97's
       Dan Auerbach              Camera Obscura             Ryan Adams
  Bridge and Tunnel    Iron &amp;amp; Wine and Calexico                  Spoon
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Just reading that list reminds me that Mean Creek has a &lt;a href="http://open.spotify.com/album/6GOGhliFDXMN2AXbw6TNnq" target="_blank"&gt;new record out&lt;/a&gt; that I’m going to listen to right now.&lt;/p&gt;

&lt;h1&gt;Performance&lt;/h1&gt;

&lt;p&gt;So which list is best? How does it compare to standard information retrieval techniques? Well, that’s probably different for each person and the only way to find out is to test it. I could (and might eventually) put together a little app that recommends me some artists from my listening history based on what I’m listening to right now. With a simple A/B test, I could see which of the three recommendation algorithms I follow most often and stick with that one in the future. To do that, I would have to record:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;The artist that is currently playing&lt;/li&gt;
&lt;li&gt;Which artist recommendations are displayed&lt;/li&gt;
&lt;li&gt;Which recommended artist (if any) was played next&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;The recommendation algorithm that provides the highest play / display ratio is the one I’d like to go with in the future. This seems like an obvious place to plug &lt;a href="http://www.sifter.cc" target="_blank"&gt;sifter&lt;/a&gt; for performing A/B and other types of testing in scenarios like this.&lt;/p&gt;
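
&lt;p&gt;The play / display ratio could be computed from such a log along these lines. This is a hedged Python sketch: the tuple layout, the function name &lt;code&gt;play_display_ratio&lt;/code&gt;, and the log entries are all hypothetical, since no such app exists yet.&lt;/p&gt;

```python
from collections import defaultdict


def play_display_ratio(log):
    """Fraction of displays that led to playing a recommended artist.

    log: iterable of (algorithm, current_artist, displayed, played_next),
    where displayed is the list of recommended artists that was shown and
    played_next is the artist actually played next, or None.
    """
    displays = defaultdict(int)
    plays = defaultdict(int)
    for algorithm, _current, displayed, played_next in log:
        displays[algorithm] += 1
        if played_next in displayed:
            plays[algorithm] += 1
    return {alg: plays[alg] / displays[alg] for alg in displays}


# Invented log entries for two of the three variants.
log = [
    ("original", "The Antlers", ["Beach House", "Arcade Fire"], "Beach House"),
    ("original", "Bruce Springsteen", ["Chuck Ragan", "Wilco"], "Tom Waits"),
    ("less_familiar", "The Antlers", ["April March", "Mineral"], "Mineral"),
    ("less_familiar", "Bruce Springsteen", ["Buddy Holly", "Sam Cooke"], None),
]
print(play_display_ratio(log))
```

&lt;p&gt;A real test would need far more than a handful of displays per variant before the ratios mean anything.&lt;/p&gt;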

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;The point of this blog post is more about the thought process than the technical parts of the recommender. There are lots of things that I could have done “right,” like using properties of Markov Chains (which is essentially what I built) to improve the system, or accounting for the fact that Buddy Holly follows Bruce Springsteen in my music library, so maybe that isn’t a true transition.&lt;/p&gt;

&lt;p&gt;I think the main takeaway is really in the constraints that I put on the system. The idea for this one-day project took the following course:&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;Build a music recommendation system&lt;/li&gt;
&lt;li&gt;Build a music recommendation system that only uses my last.fm data&lt;/li&gt;
&lt;li&gt;Build a music recommendation system that only recommends music I already know&lt;/li&gt;
&lt;li&gt;Build an &lt;em&gt;artist&lt;/em&gt; recommendation system, not songs&lt;/li&gt;
&lt;li&gt;Only recommend artists that I’ve listened to immediately after the given artist&lt;/li&gt;
&lt;li&gt;Come up with a few simple variations and test them for performance&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Identifying the problem correctly let me build something in just a few hours. Is it the best recommender system the world has ever seen? Actually, it might be, because I’ve never seen any recommender that only suggests content that you are already familiar with and that’s what I wanted. But we’d have to test it against the likes of Last.fm, Spotify, and Pandora to find out.&lt;/p&gt;

&lt;p&gt;As I said in the beginning, I’ve been thinking about this stuff a lot lately, but that’s not to say I’ve put this method into production anywhere. The code for all of this was done using python/pandas, and breaks pretty much every rule that I laid out in my &lt;a href="http://www.adamlaiacano.com/post/45356689519/keeping-tabs-on-your-data-analysis-workflow" target="_blank"&gt;previous blog post&lt;/a&gt;, so I’ll clean that up and get it posted soon.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/46158753116</link><guid>http://www.adamlaiacano.com/post/46158753116</guid><pubDate>Sun, 24 Mar 2013 10:00:00 -0400</pubDate><category>recommendations</category><category>data science</category></item><item><title>"So, yes, lasso was a great idea, and I didn’t get the point. I’m still amazed in retrospect that as..."</title><description>“So, yes, lasso was a great idea, and I didn’t get the point. I’m still amazed in retrospect that as late as 2003, I was fitting uncontrolled regressions with lots of predictors and not knowing what to do. Or, I should say, as late as 2013, considering I still haven’t fully integrated these ideas into my work. I do use bayesglm() routinely, but that has very weak priors.”&lt;br/&gt;&lt;br/&gt; - &lt;em&gt;&lt;p&gt;It’s always impressive to see a leader in any field admit when they were “way way wrong” about a certain technique or idea. This post by Andrew Gelman is a great summary of what Lasso (or regularization in general) is, and how his opinion on it has changed over time.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://andrewgelman.com/2013/03/18/tibshirani-announces-new-research-result-a-significance-test-for-the-lasso/" target="_blank"&gt;Tibshirani announces new research result: A significance test for the lasso « Statistical Modeling, Causal Inference, and Social Science&lt;/a&gt;&lt;/p&gt;&lt;/em&gt;</description><link>http://www.adamlaiacano.com/post/45832891153</link><guid>http://www.adamlaiacano.com/post/45832891153</guid><pubDate>Wed, 20 Mar 2013 09:19:00 -0400</pubDate><category>statistics</category><category>andrew gelman</category><category>lasso</category></item><item><title>Keeping tabs on your data analysis workflow</title><description>&lt;p&gt;I&amp;#8217;ve been thinking a lot lately about data analysis workflows, from raw materials to a finished product. Here at Tumblr and at my previous job at Columbia Business School, I&amp;#8217;ve had to create a ton of one-off reports that would &amp;#8220;probably never be used again.&amp;#8221; The problem is that an awful lot of those “one time” requests have come back to show their head a month, or even a year later, and suddenly you have to read through old code (that was written in a hurry) in order to figure out what exactly was in &amp;#8220;counts2.csv&amp;#8221; and how it was populated.&lt;br/&gt;&lt;br/&gt; To avoid this problem, you should always make sure that you can go from raw/messy data to a finished result with one command. I generally have, at the bare minimum, a bash script to execute a series of commands to re-generate my results. The script will do any or all of the following things, in order:&lt;strong&gt;&lt;br/&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;&lt;span&gt;Delete all of the previous temporary or final result files that have anything to do with this project. You need a clean slate.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Gather the data, whether it&amp;#8217;s from an API, database, Hadoop job, or any other source, and munge it into the right format (usually .csv).&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Build any models you may need, or summary statistics, or anything else.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Generate the report with a script.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;If you send the results to anybody, put them in a .zip file with all code that you need to re-create it.&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;If you did it right, you should be able to clone your repository (you are using version control, right?) onto a new computer that&amp;#8217;s set up with the right tools and libraries, execute your build.sh script, and get your final results back with no errors along the way. Trust me, your future self will appreciate the extra effort.&lt;strong&gt;&lt;br/&gt;&lt;/strong&gt;&lt;br/&gt; Here are some tips for making this process easier:&lt;/p&gt;
&lt;h3&gt;Avoid the command line&lt;/h3&gt;
&lt;p&gt;I love ipython and the R console as much as the next nerd, but you have to be careful in there. If you have a line like&lt;/p&gt;
&lt;pre&gt;In[342]: results.to_csv('final_output.csv')&lt;/pre&gt;
&lt;p&gt;then you&amp;#8217;re going to have a hell of a time when you need to run those 342 commands again. If you&amp;#8217;re working on the command line, make sure you&amp;#8217;re copy/pasting the right code into a script while you&amp;#8217;re working. Scrolling through a terminal history to see what worked and what didn&amp;#8217;t is never fun. &lt;a href="http://www.rstudio.org" target="_blank"&gt;RStudio&lt;/a&gt; makes it really easy to do this. Just type your code in the editor and hit control+enter to execute any highlighted lines, rather than typing them into the console window.&lt;br/&gt;&lt;img alt="image" src="http://media.tumblr.com/d67903dea4745cc8eb57dd492b7ad60c/tumblr_inline_mjnykeLRLA1qz4rgp.png"/&gt;&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;

&lt;h3&gt;Use Literate Programming&lt;/h3&gt;
&lt;p&gt;ipython notebook and knitr (for R) make it easy to describe what you&amp;#8217;re doing while you&amp;#8217;re doing it. You can easily combine descriptions with code, results, and graphs. They&amp;#8217;re often used to present final results (we use knitr to generate some daily email reports), but this is as good for your own code documentation / commentary as it is for producing the report that you send off to someone else.&lt;br/&gt;&lt;br/&gt;&lt;img alt="image" src="http://media.tumblr.com/f5df437988afa24c04583e7ff6b1175e/tumblr_inline_mjnyh0Ltlz1qz4rgp.png"/&gt;&lt;/p&gt;

&lt;h3&gt;Broader workflows&lt;/h3&gt;
&lt;p&gt;The actual report generation is usually the tip of the iceberg. There&amp;#8217;s a lot more that happens before that, and if you&amp;#8217;re like me it&amp;#8217;s usually spread out among multiple scripts, unix tools, and programming languages. For example, we at Tumblr use scribe to log events, and then store the archived data in HDFS. So if I want to run an analysis that includes some of today&amp;#8217;s actions, I need to get data from two different places (the scribe server and HDFS), and probably treat it slightly differently. Here&amp;#8217;s an example of gathering data from two places, combining it together, and generating a report.&lt;/p&gt;
&lt;div class="gist"&gt;&lt;a href="https://gist.github.com/alaiacano/5163726" target="_blank"&gt;https://gist.github.com/alaiacano/5163726&lt;/a&gt;&lt;/div&gt;
&lt;p&gt;There are some more tools out there to help with this, such as &lt;a href="http://projecttemplate.net/" target="_blank"&gt;ProjectTemplate&lt;/a&gt; for R projects, which executes a series of R scripts in an organized manner to gather, munge, model, and export data. There&amp;#8217;s also age-old &lt;a href="http://www.gnu.org/software/make/manual/make.html" target="_blank"&gt;GNU make&lt;/a&gt;, which Mike Bostock recently &lt;a href="http://bost.ocks.org/mike/make/" target="_blank"&gt;endorsed&lt;/a&gt; for data projects. This provides the ability to list dependencies for specific scripts and data files. In the above example, &lt;code&gt;generate_report.R&lt;/code&gt; relies on &lt;code&gt;combined_data.csv&lt;/code&gt;, which relies on both &lt;code&gt;clean_data_${TODAY_DATE}.csv&lt;/code&gt; and &lt;code&gt;archive_data.csv&lt;/code&gt;. With make, you can specify all of these dependencies and if something throws an error along the way, the whole project will halt.&lt;br/&gt;&lt;br/&gt; Have any other tips or tools for this? I&amp;#8217;d love to hear them in the comments.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/45356689519</link><guid>http://www.adamlaiacano.com/post/45356689519</guid><pubDate>Thu, 14 Mar 2013 14:43:00 -0400</pubDate></item><item><title>In celebration of what I have come to call “Hey, math!...</title><description>&lt;img src="http://25.media.tumblr.com/ee4168fc62b200851aec2e7c64654017/tumblr_mjnpb7FFMN1r0vuydo1_400.gif"/&gt;&lt;br/&gt;&lt;br/&gt;&lt;p&gt;In celebration of what I have come to call “Hey, math! Amirite? Day”, go read about how to estimate the value of \( \pi \) using Monte Carlo simulation.&lt;/p&gt;
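&lt;p&gt;For a quick taste of the idea, here is a minimal Python sketch of the textbook dart-throwing estimate. It is not necessarily the approach taken in the linked post, and the function name &lt;code&gt;estimate_pi&lt;/code&gt; and its parameters are made up for this example.&lt;/p&gt;

```python
import random


def estimate_pi(n_points, seed=None):
    """Estimate pi by sampling points uniformly in the unit square.

    The fraction of points that land inside the quarter circle of radius 1
    approximates pi / 4.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if 1.0 >= x * x + y * y:
            inside += 1
    return 4.0 * inside / n_points


print(estimate_pi(200000, seed=1))
```

&lt;p&gt;The error shrinks roughly with the square root of the number of points, so each extra digit of accuracy costs about 100 times more samples.&lt;/p&gt;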
&lt;p&gt;(via &lt;a href="http://www.drewconway.com/zia/?p=2667" target="_blank"&gt;Happy Pi Day, Now Go Estimate It! « Zero Intelligence Agents&lt;/a&gt;)&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/45346367353</link><guid>http://www.adamlaiacano.com/post/45346367353</guid><pubDate>Thu, 14 Mar 2013 11:21:00 -0400</pubDate><category>pi day</category><category>monte carlo simulation</category></item><item><title>"Any data scientist worth their salary will tell you that you should start with a question, NOT the..."</title><description>“Any data scientist worth their salary will tell you that you should start with a question, NOT the data. Unfortunately, data hackathons often lack clear problem definitions. Most companies think that if you can just get hackers, pizza, and data together in a room, magic will happen. This is the same as if Habitat for Humanity gathered its volunteers around a pile of wood and said, “Have at it!” By the end of the day you’d be left with a half of a sunroom with 14 outlets in it.”&lt;br/&gt;&lt;br/&gt; - &lt;em&gt;Jake Porway on how &lt;a href="http://blogs.hbr.org/cs/2013/03/you_cant_just_hack_your_way_to.html" target="_blank"&gt;You Can’t Just Hack Your Way to Social Change&lt;/a&gt;&lt;/em&gt;</description><link>http://www.adamlaiacano.com/post/44802622969</link><guid>http://www.adamlaiacano.com/post/44802622969</guid><pubDate>Thu, 07 Mar 2013 15:37:10 -0500</pubDate><category>data science</category><category>hackathons</category><category>jake porway</category><category>datakind</category></item><item><title>Some more shameless self promotion</title><description>&lt;p&gt;I&amp;#8217;m going to be giving two talks in the next few weeks:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;An overview of our data stack at the &lt;a href="http://www.meetup.com/NYC-Data-Science/events/107903562/?a=ea1_grp&amp;amp;rv=ea1&amp;amp;_af_eid=107903562&amp;amp;_af=event" target="_blank"&gt;NYC Data Science Meetup&lt;/a&gt; at Foursquare HQ on March 28.&lt;/li&gt;
&lt;li&gt;A keynote talk at the &lt;a href="http://www.bigdatatechcon.com/boston2013/" target="_blank"&gt;BigData TechCon&lt;/a&gt; in Boston, April 8-10.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;I hope to see a bunch of you there. Feel free to say hello if we haven&amp;#8217;t met.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/44709803523</link><guid>http://www.adamlaiacano.com/post/44709803523</guid><pubDate>Wed, 06 Mar 2013 10:55:38 -0500</pubDate><category>talks</category></item><item><title>Using entropy to route web traffic</title><description>&lt;p&gt;Earlier this week, &lt;a href="http://tumblr.mobocracy.net" target="_blank"&gt;Blake&lt;/a&gt; asked me for some help with a problem he&amp;#8217;s working on. He has a couple of hash functions that are being used to route web traffic to a number of different servers. A hash function takes an input, such as a blog&amp;#8217;s URL, and outputs a number between 0 and 2&lt;sup&gt;32&lt;/sup&gt;. Say we have 1000 servers. That means that each one will handle about 4.3 million points in the hash space.&lt;/p&gt;
&lt;p&gt;The data looked something like this (with fake blog names, of course):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;##       blog.name        H1        H2        H3        H4
## 1 23.tumblr.com 3.137e+09 1.866e+09 6.972e+08 5.792e+08
## 2 19.tumblr.com 1.875e+09 2.545e+08 2.606e+09 1.312e+09
## 3 34.tumblr.com 1.366e+09 2.236e+09 1.106e+09 3.640e+09
## 4 43.tumblr.com 2.639e+09 1.098e+09 8.755e+08 1.507e+09
## 5 90.tumblr.com 6.564e+08 5.397e+07 3.084e+09 2.961e+09
## 6 29.tumblr.com 2.476e+09 4.532e+08 2.787e+08 4.894e+08
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One important thing to point out before we get started is that this has to be a &lt;em&gt;representative sample&lt;/em&gt; of the request data. Despite the wild popularity of my personal blog, it doesn&amp;#8217;t get a sliver of the traffic that &lt;a href="http://beyonce.tumblr.com" target="_blank"&gt;Beyonce&lt;/a&gt; gets. That fact needs to be represented in the sample data, meaning that her blog should appear in more rows of the sample data than mine.&lt;/p&gt;
&lt;h2&gt;Plot the data&lt;/h2&gt;
&lt;p&gt;The first thing I always do is plot the data to get a sense of what I&amp;#8217;m working with and what I&amp;#8217;m trying to accomplish. The density plots below show the distribution of values in the hash space for each algorithm. If you&amp;#8217;re not familiar with &lt;a href="http://en.wikipedia.org/wiki/Kernel_density_estimation" target="_blank"&gt;kernel density plots&lt;/a&gt;, you can think of them as a smoothed (and prettier) version of a histogram. For the electrical engineers out there, it&amp;#8217;s the convolution of a kernel function (usually a Gaussian) with a sum of impulse functions, one at each of the points on the x-axis (represented here by dots).&lt;/p&gt;
&lt;p&gt;&lt;img alt="image" src="http://media.tumblr.com/38b19bb8bc204a4b6dc160e90ee895c2/tumblr_inline_mizgewlbod1qz4rgp.png"/&gt;&lt;/p&gt;
&lt;p&gt;By comparison, here are the density plots of a “near-ideal” example (1000 pulls from a uniform distribution) and a bad example (all assigned to the value 2e+09). The worst case example here shows the shape of the kernel function.&lt;/p&gt;
&lt;p&gt;&lt;img alt="image" src="http://media.tumblr.com/8102f4803a4aa667c0b9dd642e38b690/tumblr_inline_mizgfvggEd1qz4rgp.png"/&gt;&lt;/p&gt;
&lt;h2&gt;Calculate the entropy&lt;/h2&gt;
&lt;p&gt;In information theory, entropy is the minimum number of bits required (on average) to identify an encoded symbol (stay with me here…). If we&amp;#8217;re trying to transmit a bunch of text digitally, we would encode the alphabet where each “symbol” is a letter that will be represented in 1&amp;#8217;s and 0&amp;#8217;s. In order to transmit our message quickly, we want to use as few bits as possible. Since the letter “e” appears more frequently than the letter “q”, we want the symbol for “e” to have fewer bits than “q”. Make sense?&lt;/p&gt;
&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Huffman_coding" target="_blank"&gt;Huffman Coding&lt;/a&gt; is one encoding algorithm. There&amp;#8217;s an example implementation &lt;a href="http://en.literateprograms.org/Huffman_coding_(Python" target="_blank"&gt;here&lt;/a&gt;, which assigns the code &lt;code&gt;100&lt;/code&gt; to “e” and &lt;code&gt;1111110010&lt;/code&gt; to “q”. The more “uneven” your symbol distribution is, the fewer bits it will take, on average, to transmit your message (meaning you&amp;#8217;ll have a lower entropy). The entropy value is a lower bound for the actual weighted average of the symbol lengths. Some encoding algorithms get closer to the entropy value than others, but none can ever go below it.&lt;/p&gt;
&lt;p&gt;The actual entropy formula is:&lt;/p&gt;
&lt;p&gt;\[ H(x)=-\sum_{i=0}^{N-1} p_i \log_2(p_i) \]&lt;/p&gt;
&lt;p&gt;Where \( H(x) \) is the entropy, and \( p_i \) is the probability that symbol \( i \) will appear. In the example I linked to above, \( p_e = 0.124 \) and \( p_q=0.0009 \), so it makes sense that the code for “e” is so much shorter. In the example, the average number of bits per symbol is \( \sum S_i p_i = 4.173 \frac{bits}{symbol} \), where \( S_i \) is the number of bits in the code for symbol \( i \). The entropy, from the above equation, is \( 4.142 \frac{bits}{symbol} \).&lt;/p&gt;
&lt;p&gt;The example problem of web traffic distribution is a little different. We&amp;#8217;re not actually encoding anything, but rather trying to make the theoretical lower bound for the average number of bits/symbol as &lt;em&gt;high&lt;/em&gt; as possible.&lt;/p&gt;
&lt;p&gt;We can consider each server to be a symbol, and the amount of traffic that it receives is decided by the hash function that we&amp;#8217;re trying to choose. If one of our servers is the equivalent of the letter “e”, it&amp;#8217;s going to be totally overloaded while the “q” isn&amp;#8217;t going to be handling much traffic at all. We want each symbol (server) to appear (receive traffic) equally often.&lt;/p&gt;
&lt;p&gt;So to calculate the entropy, we&amp;#8217;ll take a histogram of the hash values with 20 buckets (one per server, assuming 20 servers for this example). That will give us the number of requests that go to each server. Dividing that by the total number of requests gives us each server&amp;#8217;s probability of handling the next incoming request. These are the \( p_i \) values that we need in order to calculate the entropy. In code, it looks like this:&lt;/p&gt;
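&lt;pre&gt;&lt;code class="r"&gt;calc.entropy &amp;lt;- function(hash.value) {
    # Bucket the 32-bit hash space into 20 equal-width bins, one per server
    h = hist(hash.value, plot = FALSE, breaks = seq(0, 2^32, length.out = 21))
    # Share of requests that lands on each server
    probs = h$counts/sum(h$counts)
    # Skip empty buckets: 0 * log2(0) is NaN in R, but its limit is 0
    probs = probs[probs &amp;gt; 0]
    # Return the entropy visibly instead of assigning it as the last step
    -sum(probs * log2(probs))
}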
&lt;/code&gt;&lt;/pre&gt;
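&lt;p&gt;Before running it on the real data, here&amp;#8217;s a quick sanity check on made-up inputs. This is only a sketch: the toy vectors &lt;code&gt;even&lt;/code&gt; and &lt;code&gt;skewed&lt;/code&gt; are invented for illustration and have nothing to do with the actual traffic sample. Each fake hash value sits in the middle of one of the 20 buckets, so we know exactly how many requests each server gets.&lt;/p&gt;
&lt;pre&gt;&lt;code class="r"&gt;# Hypothetical toy data: one hash value in the middle of each of 20 buckets
n.servers &amp;lt;- 20
mids &amp;lt;- (seq_len(n.servers) - 0.5) * 2^32 / n.servers

# 1000 requests spread evenly: 50 per server
even &amp;lt;- rep(mids, each = 50)
# 1000 requests where the first server takes 525 and the rest get 25 each
skewed &amp;lt;- rep(mids, times = c(525, rep(25, n.servers - 1)))

print(calc.entropy(even))    # equals log2(20), about 4.32
print(calc.entropy(skewed))  # lower, since one server dominates
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The evenly loaded case hits the maximum possible value for 20 buckets, and piling traffic onto a single server pulls the entropy down, which is the behavior we want from the metric.&lt;/p&gt;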
&lt;p&gt;The entropy values for our four hash functions are:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;##   hash.function entropy
## 1            H1   4.203
## 2            H2   4.226
## 3            H3   4.254
## 4            H4   4.180
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And while we&amp;#8217;re at it, here&amp;#8217;s the entropy of our best/worst case example that we plotted earlier.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;##    hash.function entropy
## 1     near.ideal   4.309
## 2 worst.possible   0.000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why is the worst case value 0? Because if all traffic is going to one server, we wouldn&amp;#8217;t need any bits at all to tell us which server the request is going to. The theoretical limit for a histogram with 20 buckets is: \( -20 \cdot \frac{1}{20} \log_2 \frac{1}{20} \approx 4.32 \), which we&amp;#8217;re close to but can never exceed.&lt;/p&gt;
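&lt;p&gt;That limit is just the general result for \( N \) equally loaded servers. If every \( p_i = \frac{1}{N} \), the entropy formula collapses to:&lt;/p&gt;
&lt;p&gt;\[ H_{max} = -\sum_{i=0}^{N-1} \frac{1}{N} \log_2 \frac{1}{N} = \log_2 N \]&lt;/p&gt;
&lt;p&gt;For \( N = 20 \) that gives \( \log_2 20 \approx 4.32 \). One way to compare results across different server counts (a suggestion, not something we did here) is to divide the measured entropy by \( \log_2 N \). By that measure H3 scores \( 4.254 / 4.322 \approx 0.98 \).&lt;/p&gt;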
&lt;p&gt;All of our hash functions appear to be working pretty well, especially given the small sample size that I&amp;#8217;m using for this blog post. It looks like our winner is H3.&lt;/p&gt;
&lt;p&gt;To summarize, here&amp;#8217;s what we did to find the optimal hashing function:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Get a &lt;em&gt;representative sample&lt;/em&gt; of your web traffic&lt;/li&gt;
&lt;li&gt;Run each request through the hashing function&lt;/li&gt;
&lt;li&gt;Take a histogram of the resulting values with N bins, where N is the number of servers you have available&lt;/li&gt;
&lt;li&gt;Divide the bin counts by the total number of requests in your sample to get the probability of handling a request for each server&lt;/li&gt;
&lt;li&gt;Calculate the entropy, \( H(x)=-\sum_{i=0}^{N-1} p_i \log_2(p_i) \), for each hash function&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;There are more considerations to take into account, like setting an upper bound on the value of \( p_i \) to ensure that no single server ever gets so busy that it can&amp;#8217;t handle its load.&lt;/p&gt;
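&lt;p&gt;A check like that could be sketched along the following lines. The helper name &lt;code&gt;over.capacity&lt;/code&gt; and the 10% cap are made up for illustration. A real limit would come from what your servers can actually handle.&lt;/p&gt;
&lt;pre&gt;&lt;code class="r"&gt;# Hypothetical helper: which servers get more than max.share of the traffic?
over.capacity &amp;lt;- function(hash.value, n.servers = 20, max.share = 0.1) {
    h = hist(hash.value, plot = FALSE,
             breaks = seq(0, 2^32, length.out = n.servers + 1))
    probs = h$counts/sum(h$counts)
    which(probs &amp;gt; max.share)  # indices of the overloaded servers
}

# Made-up sample: 70 requests spread over the hash space, 30 stuck at 1e9
fake.hashes &amp;lt;- c(seq(1, 2^32 - 1, length.out = 70), rep(1e9, 30))
over.capacity(fake.hashes)  # 5, the bucket that contains 1e9
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A hash function with great entropy could still fail this test if one very popular blog lands in an otherwise ordinary bucket, so it&amp;#8217;s worth looking at both numbers.&lt;/p&gt;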
&lt;p&gt;If you want to read up more on information theory, &lt;a href="http://www.amazon.com/Elements-Information-Theory-Thomas-Cover/dp/0471062596/ref=sr_1_25?ie=UTF8&amp;amp;qid=1362113533&amp;amp;sr=8-25&amp;amp;keywords=information+theory" target="_blank"&gt;Elements of Information Theory&lt;/a&gt; by Thomas Cover and Joy Thomas is an excellent book that is reasonably priced (used) on Amazon.&lt;/p&gt;
&lt;p&gt;There is also, of course, Claude Shannon&amp;#8217;s landmark paper from 1948, “&lt;a href="http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf" target="_blank"&gt;A Mathematical Theory of Communication&lt;/a&gt;”, in which he essentially defines the entire field.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/44295213078</link><guid>http://www.adamlaiacano.com/post/44295213078</guid><pubDate>Fri, 01 Mar 2013 10:02:00 -0500</pubDate><category>information theory</category><category>tumblr</category><category>engineering</category><category>data science</category></item><item><title>Sifter: an API for website testing and optimization</title><description>&lt;p&gt;I&amp;#8217;m happy to finally give &lt;a href="http://www.sifter.cc" target="_blank"&gt;Sifter&lt;/a&gt; a name and a home. It&amp;#8217;s a multi-armed bandit (MAB) API for testing web pages, optimizing which version (&amp;#8220;arm&amp;#8221;) of the test gets displayed. These arms could be article titles, ad positions, logos, colors, etc. Here&amp;#8217;s the gist of how a MAB test works:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;A user visits a page that is under test.&lt;/li&gt;
&lt;li&gt;The page makes a request to Sifter asking for an arm to display.&lt;/li&gt;
&lt;li&gt;Sifter makes a calculated decision about which arm to display and returns the result to the page as a value between 0 and (# arms - 1).&lt;/li&gt;
&lt;li&gt;The page renders the selected version (specific colors, logos, layouts, etc).&lt;/li&gt;
&lt;li&gt;The user interacts with the page in one way or another. Maybe they sign up, maybe they buy something, maybe they don&amp;#8217;t.&lt;/li&gt;
&lt;li&gt;When the interaction is complete, the page sends the reward earned (if any) back to Sifter. This could be something binary like click/no click, or some other value such as the amount of money spent.&lt;/li&gt;
&lt;li&gt;Sifter updates the bandit algorithm, influencing the arm that is selected the next time the page is rendered.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;There is an endless number of uses for something like this. Here are a few off the top of my head:&lt;/p&gt;
&lt;h3&gt;Run a news website?&lt;/h3&gt;
&lt;ul&gt;&lt;li&gt;Run a few different title options for an article on your front page.&lt;/li&gt;
&lt;li&gt;Figure out how to get users to read more of a paginated article:&lt;/li&gt;
&lt;li&gt;Test the number of words shown on each page (say 500 or 1000)&lt;/li&gt;
&lt;li&gt;Each time the user clicks &amp;#8220;next page&amp;#8221;, update the default test result value to the user&amp;#8217;s progress through the article. (See the &lt;code&gt;/select_arm&lt;/code&gt; route in the &lt;a href="http://www.sifter.cc/docs" target="_blank"&gt;docs&lt;/a&gt; for more on default values)&lt;/li&gt;
&lt;/ul&gt;&lt;h3&gt;Run a website with promoted content?&lt;/h3&gt;
&lt;ul&gt;&lt;li&gt;Allow advertisers to choose multiple promotions to run at the same time, and optimize which one gets shown based on clicks and/or user feedback.&lt;/li&gt;
&lt;li&gt;Test the amount of labeling around the fact that this content is promoted vs organically created by users.&lt;/li&gt;
&lt;/ul&gt;&lt;h3&gt;Test the audience demographic response.&lt;/h3&gt;
&lt;ul&gt;&lt;li&gt;Set up a standard A/B test.&lt;/li&gt;
&lt;li&gt;Display the same content to every user.&lt;/li&gt;
&lt;li&gt;Report back the results of the test where the &amp;#8220;arm&amp;#8221; represents a specific user demographic, rather than an alteration to the web page.&lt;/li&gt;
&lt;li&gt;Your test results will show which demographic responds best to your content.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;There are a lot of features that I think are pretty clever, like setting a default test result value and a TTL for the test, so that the default value gets reported when the test &amp;#8220;expires,&amp;#8221; and the ability to update this default value and TTL by sending a &amp;#8220;heartbeat.&amp;#8221; And there&amp;#8217;s a lot on my To Do list, such as confidence intervals, multivariate testing, bucketing/aggregating test results for high-traffic websites, and more.&lt;/p&gt;
&lt;p&gt;Right now the project is in a limited beta mode. I&amp;#8217;m running it on a few pages that don&amp;#8217;t get much traffic and I would love some help working out whatever bugs may come up. If you&amp;#8217;re interested, please get in touch!&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/43646302605</link><guid>http://www.adamlaiacano.com/post/43646302605</guid><pubDate>Thu, 21 Feb 2013 10:08:00 -0500</pubDate><category>sifter</category><category>projects</category><category>optimization</category><category>operations research</category><category>multi armed bandit</category><category>ab test</category></item><item><title>
The Lakers are 6-22 when Kobe has more than 19 field goal...</title><description>&lt;img src="http://25.media.tumblr.com/07779a77b6c76fa590fb36e28fb37fc0/tumblr_mikorhq9Mc1r0vuydo1_400.png"/&gt;&lt;br/&gt;&lt;br/&gt;&lt;blockquote&gt;
&lt;div&gt;The Lakers are 6-22 when Kobe has more than 19 field goal attempts and 12-3 in the rest of the games.&lt;/div&gt;
&lt;/blockquote&gt;
&lt;p&gt;(via &lt;a href="http://simplystatistics.org/2013/01/28/data-supports-claim-that-if-kobe-stops-ball-hogging-the-lakers-will-win-more/" target="_blank"&gt;Data supports claim that if Kobe stops ball hogging the Lakers will win more | Simply Statistics&lt;/a&gt;)&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/43643034043</link><guid>http://www.adamlaiacano.com/post/43643034043</guid><pubDate>Thu, 21 Feb 2013 08:42:00 -0500</pubDate><category>nba</category><category>kobe bryant</category><category>sports</category><category>lakers</category><category>la lakers</category></item><item><title>I was going to spend today working on a new project, but...</title><description>&lt;img src="http://25.media.tumblr.com/7a3cd6de2fcced639d3d69a4a7a955eb/tumblr_mifjhnH3RX1r0vuydo1_500.png"/&gt;&lt;br/&gt; &lt;br/&gt;&lt;img src="http://25.media.tumblr.com/a1bd68a4e70824132a8a7bf3e89006a3/tumblr_mifjhnH3RX1r0vuydo3_500.png"/&gt;&lt;br/&gt; &lt;br/&gt;&lt;img src="http://25.media.tumblr.com/286b3e2b41b10492c35063e2e4b13ddd/tumblr_mifjhnH3RX1r0vuydo2_500.png"/&gt;&lt;br/&gt; &lt;br/&gt;&lt;p&gt;I was going to spend today working on a new project, but realized that the amount of work required just to get the boilerplate register/log-in/log-out code written is way too repetitive. So I stripped some code out of my &lt;a href="http://banditapi.herokuapp.com/" target="_blank"&gt;multi-armed bandit API&lt;/a&gt; and called it &lt;strong&gt;bootstrap-bootstrap&lt;/strong&gt;. It’s super basic, built with mongo/tornado/bootstrap, easy to deploy to heroku, and now &lt;a href="https://github.com/alaiacano/bootstrap-bootstrap" target="_blank"&gt;up on github&lt;/a&gt;.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/43416823594</link><guid>http://www.adamlaiacano.com/post/43416823594</guid><pubDate>Mon, 18 Feb 2013 14:00:00 -0500</pubDate><category>projects</category></item><item><title>radioon:

WNYC did a “How Single &amp; How Happy Are You”...</title><description>&lt;img src="http://25.media.tumblr.com/9b5d6f3a826f02bdb7cf1ab5b8649571/tumblr_mi88rriMN11qgq7mto1_500.png"/&gt;&lt;br/&gt;&lt;br/&gt;&lt;p&gt;&lt;a class="tumblr_blog" href="http://radioon.tumblr.com/post/43093663214/wnyc-did-a-how-single-how-happy-are-you" target="_blank"&gt;radioon&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;WNYC did a &lt;a href="http://www.wnyc.org/articles/wnyc-news-2/2013/feb/14/share-how-single-how-happy-are-you/" target="_blank"&gt;“How Single &amp; How Happy Are You” survey&lt;/a&gt;, which I annotated for them.&lt;/p&gt;
&lt;p&gt;(h/t &lt;a href="http://datanews.tumblr.com/post/43093145400/how-single-how-happy-are-you-wnyc-invites-you" target="_blank"&gt;DataNews&lt;/a&gt; for the link)&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://www.adamlaiacano.com/post/43095397388</link><guid>http://www.adamlaiacano.com/post/43095397388</guid><pubDate>Thu, 14 Feb 2013 15:52:37 -0500</pubDate></item><item><title>Huge turnout for the first NYC Data Science meetup! Probably...</title><description>&lt;img src="http://25.media.tumblr.com/9db1e9026f079bd842d9b713f311315a/tumblr_mi36jy2IsB1r0vuydo1_500.jpg"/&gt;&lt;br/&gt;&lt;br/&gt;&lt;p&gt;Huge turnout for the first NYC Data Science meetup! Probably about 300 people on a gross January night.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/42893723335</link><guid>http://www.adamlaiacano.com/post/42893723335</guid><pubDate>Mon, 11 Feb 2013 21:50:21 -0500</pubDate><category>meetups</category><category>data science</category></item><item><title>"Are you going to be optimizing the solution to one problem, or solving a wide variety of problems?..."</title><description>“Are you going to be optimizing the solution to one problem, or solving a wide variety of problems? If it’s just one problem, is it something that you can imagine being happy about working on a year from now?”&lt;br/&gt;&lt;br/&gt; - &lt;em&gt;Hilary Mason is on point, as usual. Read&lt;span&gt; &lt;/span&gt;&lt;a href="http://www.linkedin.com/today/post/article/20130204193931-1865628-finding-a-great-data-science-job" target="_blank"&gt;Finding a Great Data Science Job&lt;/a&gt; for more excellent considerations.&lt;/em&gt;</description><link>http://www.adamlaiacano.com/post/42292926844</link><guid>http://www.adamlaiacano.com/post/42292926844</guid><pubDate>Mon, 04 Feb 2013 15:17:00 -0500</pubDate><category>data science</category></item><item><title>An API for website testing/optimization</title><description>&lt;p&gt;This is something that I&amp;#8217;ve been working on very heavily for the last 6 weeks or so, and I&amp;#8217;m excited to start showing it off. 
It&amp;#8217;s currently being hosted at &lt;a href="http://banditapi.herokuapp.com" target="_blank"&gt;http://banditapi.herokuapp.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In a normal A/B test on a website, you randomly present users with one version of a feature or another. This could be the placement of your logo, for example. You let the test run for some amount of time, then measure the relative performance of the two versions (&amp;#8220;arms&amp;#8221;) and go with the better one from that point on.&lt;/p&gt;
&lt;p&gt;Multi-Armed Bandit tests are different. They are designed to &lt;em&gt;learn&lt;/em&gt; which arm is preferred, and move towards that version of the website as quickly as possible. There are many nuances to these types of tests, and I highly recommend John Myles White&amp;#8217;s &lt;a href="http://shop.oreilly.com/product/0636920027393.do" target="_blank"&gt;book on the subject&lt;/a&gt; (there&amp;#8217;s also a &lt;a href="http://adamlaiacano.tumblr.com/post/37333552557/john-myles-white-was-kind-enough-to-come-by-tumblr" target="_blank"&gt;video&lt;/a&gt; of him talking about this subject at tumblr recently).&lt;/p&gt;
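&lt;p&gt;To make the idea concrete, here&amp;#8217;s a minimal sketch of epsilon-greedy, one of the simplest bandit algorithms. It is only an illustration and not necessarily what the API runs. The function names, the 10% exploration rate, and the simulated click rates are all made up for this example.&lt;/p&gt;
&lt;pre&gt;&lt;code class="r"&gt;# Pick an arm: explore at random with probability epsilon, otherwise exploit
choose.arm &amp;lt;- function(counts, values, epsilon = 0.1) {
    if (runif(1) &amp;lt; epsilon) {
        sample(length(counts), 1)   # explore: any arm, uniformly at random
    } else {
        which.max(values)           # exploit: best average reward so far
    }
}

# Fold one observed reward into the running average for the arm that was shown
record.reward &amp;lt;- function(counts, values, arm, reward) {
    counts[arm] &amp;lt;- counts[arm] + 1
    values[arm] &amp;lt;- values[arm] + (reward - values[arm]) / counts[arm]
    list(counts = counts, values = values)
}

# Simulated test with two arms and hypothetical click rates of 5% and 10%
set.seed(1)
true.rates &amp;lt;- c(0.05, 0.10)
state &amp;lt;- list(counts = c(0, 0), values = c(0, 0))
for (i in 1:5000) {
    arm &amp;lt;- choose.arm(state$counts, state$values)
    reward &amp;lt;- rbinom(1, 1, true.rates[arm])
    state &amp;lt;- record.reward(state$counts, state$values, arm, reward)
}
state$counts  # most plays should end up on the better arm
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unlike a fixed 50/50 A/B split, the share of traffic going to each arm shifts as the results come in, while the random exploration keeps the algorithm from locking onto an early fluke.&lt;/p&gt;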
&lt;p&gt;The API that I built provides a couple of different Multi-Armed Bandit algorithms, which are exposed as simple routes to request an arm to play, and to update the algorithm with the test results. There&amp;#8217;s also a little admin view to see how your test is performing:&lt;/p&gt;
&lt;p&gt;&lt;img alt="image" src="http://media.tumblr.com/2097c26004fe41c5823b7058efd3a1db/tumblr_inline_mhlydrMiCL1qz4rgp.png"/&gt;&lt;/p&gt;
&lt;p&gt;I&amp;#8217;m at the point now where I need some help testing. This thing is not at the point where you should deploy it on a production site, but if you have something to test out that doesn&amp;#8217;t take a lot of traffic (and you don&amp;#8217;t mind a little latency), it should be reliable. There&amp;#8217;s also a little &lt;a href="https://github.com/alaiacano/banditapi-client" target="_blank"&gt;client library&lt;/a&gt; on github if you just want to fuss with it. Either email me at myfirstname.mylastname@gmail.com, or send me a message through tumblr to get an account.&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/42121103586</link><guid>http://www.adamlaiacano.com/post/42121103586</guid><pubDate>Sat, 02 Feb 2013 14:40:00 -0500</pubDate><category>projects</category></item><item><title>BigData TechCon 2013</title><description>&lt;a href="http://www.bigdatatechcon.com/boston2013/"&gt;BigData TechCon 2013&lt;/a&gt;: &lt;p&gt;I’m honored to be giving a keynote talk at the 2013 BigData TechCon conference in Boston (April 8-10). &lt;span&gt;There are a lot of excellent &lt;/span&gt;&lt;a href="http://www.bigdatatechcon.com/boston2013/speakers.aspx" target="_blank"&gt;speakers&lt;/a&gt;&lt;span&gt; at the event that I’m looking forward to seeing, including Claudia Perlich (Media6Degrees), &lt;/span&gt;&lt;a href="https://twitter.com/jseidman" target="_blank"&gt;Jonathan Seidman&lt;/a&gt;&lt;span&gt; (Cloudera, formerly Orbitz), and &lt;/span&gt;&lt;a href="https://twitter.com/posco" target="_blank"&gt;Oscar Boykin&lt;/a&gt;&lt;span&gt; (twitter).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;Registration is open &lt;/span&gt;&lt;a href="http://www.bigdatatechcon.com/boston2013/conferencepricing.aspx" target="_blank"&gt;here&lt;/a&gt;&lt;span&gt;. &lt;/span&gt;Here’s a short abstract for my talk:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Social networks are able to collect large amounts of activity data from their user and customer base. As Big Data professionals, we conduct experiments on custom data sets to measure the effectiveness of our products or advertising methodologies. Since a social network is effectively useless without an active community, our companies owe it to their users to create new and better products based on this information. Learn how our data analysis and predictive analytics must take a different approach than Big Data in fields like finance, medicine, and defense.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;span&gt;I’m extra excited that the conference is in my home town. It’s been a while since I’ve spent a few days in Boston and I look forward to seeing all of my friends up there. I can already taste the Silhouette popcorn.&lt;/span&gt;&lt;/p&gt;</description><link>http://www.adamlaiacano.com/post/41299142944</link><guid>http://www.adamlaiacano.com/post/41299142944</guid><pubDate>Wed, 23 Jan 2013 15:48:08 -0500</pubDate><category>data science</category><category>conferences</category><category>2013</category><category>projects</category></item><item><title>"Significance testing in general has been a greatly overworked procedure, and in many cases where..."</title><description>“Significance testing in general has been a greatly overworked procedure, and in many cases where significance statements have been made it would have better to provide an interval within which the value of the parameter would be expected to lie.”&lt;br/&gt;&lt;br/&gt; - &lt;em&gt;George Box “Statistics For Experimenters,” &lt;strong&gt;&lt;em&gt;1978&lt;/em&gt;&lt;/strong&gt;.&lt;/em&gt;</description><link>http://www.adamlaiacano.com/post/41244718256</link><guid>http://www.adamlaiacano.com/post/41244718256</guid><pubDate>Tue, 22 Jan 2013 21:22:33 -0500</pubDate></item></channel></rss>
