On the heels of the Airbnb blog post about how they use PMML models in their machine learning framework, I figured I’d post some code I’ve been toying with for using PMML for batch processing in Scalding.

I think this will be a huge help if you’re able to build a predictive model on a subset of your data in in something like R or python (which doesn’t have the best support for PMML yet), and then apply it to a much larger data volume in Hadoop.

I’ll have a full post up eventually, but here’s a simple example usage.

All of the work is done in a single trait called Predictable, and any class that extends it will get a predict method to apply a model.

Naive Bayes classification for large data sets

I put together a Naive Bayes classifier in scalding. It’s modeled after the scikit-learn approach, and I’ve tried to make the API look similar. The advantage of using scalding is that it is designed to run over enormous data sets in Hadoop. I’ve run some binary classification jobs on 100GB+ of data and it works quite well.

Here’s an example usage on the famous iris data set.

The three classes are the species of iris, and the four features/attributes are the length and width of the flower’s sepal and petal. The train method (line 21) returns a Pipe containing all of the information required for classification.

The scikit-learn documentation contains a good explanation of Bayes’ theorem, which boils down to the following equation:

\[ \hat{y} = \underset{y} {\mathrm{argmax}} ~P(y|x_i) = \underset{y} {\mathrm{argmax}} ~P(y) \prod_{i=1}^{n}P(x_i | y) \]

Where \( \hat{y} \) is the predicted class, \( y \) is the training class, and \( x_i \) are the features (sepalWidth, sepalLength, petalWidth, petalLength). Most Naive Bayes examples that I’ve seen are dealing with word counts, therefore \( P(x_i | y) = \frac{\text{number of times word} x_i \text{appears in class} y}{\text{total number of words in class} y} \). That’s how the MultinomialNB class works, but in the iris data set, we’re dealing with continuous, normally distributed measurements and not counts of objects in a multinomial distribution.

Therefore, we want to use GaussianNB which uses the following equation in the classification:

\[ P(x_i | y) = \frac{1}{\sqrt{2\pi {\sigma_y}^2}} \text{exp}(-\frac{(x_i-u_y)^2}{2\pi\sigma_y^2}) \]

That means that our model pipe must contain the following information in order to calculate \(P(y|x_i)\)

  • classId, \(y\) - The type of flower.
  • feature, \(i\ - The name of the feature (sepalWidth, sepalLength, petalWidth, or petalLength).
  • classPrior, \(P(y)\)- Prior probability of an iris blonging to the class.
  • mu, \(\mu_y\) - The mean of the given feature within the class.
  • sigma, \(\sigma_y\) - The standard deviation of the given feature within the class.

The model Pipe is then crossed with the test set and we calculate the likelihood that the point belongs in each class, \(P(y|x_i)\). Once we have the probability of a data point belonging to each class, we simply group by the point’s ID field and keep only the class with the maximum likelihood.

The results are shown below (plotting only two of the four features). The x's are the training points used, and o's are successfully classified points.

For now, your best bet for using this is to just copy the code off of github. If I get some time, I’d love to port this over to scalding’s typed API, combine it with some other machine learning functions (such as the K nearest neighbor classifier I wrote) to provide a nice little library of tools for scaling machine learning algorithms. If you’d like to be involved, get in touch.

I’m always amazed by the amount of work that gets done at every DataKind event. In just 14 (consecutive) hours, we were able to understand a very complex data problem, scope the project, and start laying the groundwork for a number of tools to help Amnesty International save lives.

Thanks to the folks from AI and the 30 or so people who helped out with our project.

I gave a talk about Digital Signal Processing in Hadoop at this month’s NYC Machine Learning meetup. Here’s the abstract:

In this talk I’m going to introduce the concepts of digital signals, filters, and their interpretation in both the time and frequency domain, and work through a few simple examples of low-pass filter design and application. It’s much more application focused than theoretical, and there is no assumed prior knowledge of signal processing. I’ll show how they can be used either in a real-time stream or in batch-mode in Hadoop (with Scalding) and give a demo on how to detect trendy meme-ish blogs on Tumblr.

This is a great post showing how to build and work with random variables in scala. It starts with something as simple as drawing from a uniform distribution and moves to distribution transforms, adding random variables, conditional probability, and more. The code examples are great if you’re a scala programmer who wants to learn more about probability, or if you already know the probability and want to learn some scala by example.

Here’s a quick taste of how to build a generic Distribution trait and create uniform and Bernoulli distributions out of it.

trait Distribution[A] {
  def get: A

  def sample(n: Int): List[A] = {

  def map[B](f: A => B): Distribution[B] = new Distribution[B] {
    override def get = f(self.get)

val uniform = new Distribution[Double] {
  private val rand = new java.util.Random()
  override def get = rand.nextDouble()

def bernoulli(p: Double): Distribution[Boolean] = {
  uniform.map(_ < p)

Data generating products.

People put a lot of effort into predicting the sentiment around a certain article, tweet, photo, or any other piece of information on the internet. There’s huge value in knowing who is consuming your content, how they feel about it, and what kinds of things they feel similar about. For example, if I regularly share my love for Pepsi products and disdain for Coca-Cola products, then Pepsi can consider me a loyal customer and target their advertising dollars elsewhere.

Measuring sentiment is not always easy. The cononical approach is to build large list of “positive” and “negative” words (either manually or via labeled training data), and then count how many of each group appear in the text that you want to classify. You can add some weights and filters to the words, but it really all comes down to the same sort of thing. This works OK in some contexts, but will never be exact and gets much more difficult if you try to figure out the sentiment of specific users and not an aggregate “mood.”

A few months ago I spoke at a conference and focused my talk on building “data generating products.” By that, I mean building features that enhance a user’s experience, and simultaneously let you collect useful data that you would otherwise have to predict in order to build new products on top of the new information. The example that I used was tumblr’s typed post system.

With the seven post types, we can give users tools that make it easier to share specific types of media. Sharing a song? Search for it on Spotify or SoundCloud. Sharing a photo? Drag-and-drop the images, or take one with your webcam. It’s easy for us to determine the type of media that you share or consume. If your blog is focused on sharing songs that you enjoy, we can recommend it to people who are using tumblr to discover new music.

Today bitly released a bitly for feelings bookmarklet (above). It lets users bookmark and/or share articles and websites that they come across, and uses a cute short-url domains like oppos.es or wtfthis.me, based on how you feel about the article. I’m not sure what their long-term plans are for this product (it’s still in beta), but I’d love to be able to log into my bitly account and see all of the funny links that I’ve saved (lolthis.me), or all of the products I’d like to buy someday (iwantth.is).

It lets me, the user, organize content and makes me more likely to use bitly as a bookmarking service. It also tells them explicitly how I feel about a photo, article, or product and gives them an idea about why I am bookmarking and/or sharing it. There’s no need to scrape my twitter account and analyze the words I use to describe the link that I’m sharing.

There are other products that have been well designed to collect rich user data, such as LinkedIn Endorsements or when Facebook switched from having text lists of bands/movies/books that I like to subscribing to individual ‘like’ pages for each item. These are fine ideas that create clear signals, but I’m no more likely to use LinkedIn because I can verify that my friend knows how to program in Ruby. Bitly added a simple interface to enhance their simple service, and are getting a wealth of valuable data out of it.

Now that they’re collecting this great information, I can’t wait to see what they build with it. Hopefully they’ll work it into their real time search engine.