## Naive Bayes classification for large data sets

I put together a Naive Bayes classifier in scalding. It’s modeled after the scikit-learn approach, and I’ve tried to make the API look similar. The advantage of using scalding is that it is designed to run over enormous data sets in Hadoop. I’ve run some binary classification jobs on 100GB+ of data and it works quite well.

Here’s an example usage on the famous iris data set.

The three classes are the species of iris, and the four features/attributes are the length and width of the flower’s sepal and petal. The `train` method (line 21) returns a Pipe containing all of the information required for classification.

The scikit-learn documentation contains a good explanation of Bayes’ theorem, which boils down to the following equation:

\[ \hat{y} = \underset{y} {\mathrm{argmax}} ~P(y|x_i) = \underset{y} {\mathrm{argmax}} ~P(y) \prod_{i=1}^{n}P(x_i | y) \]

Where \( \hat{y} \) is the predicted class, \( y \) is the training class, and \( x_i \) are the features (sepalWidth, sepalLength, petalWidth, petalLength). Most Naive Bayes examples I’ve seen deal with word counts, where \( P(x_i | y) = \frac{\text{number of times word } x_i \text{ appears in class } y}{\text{total number of words in class } y} \). That’s how the `MultinomialNB` class works, but in the iris data set we’re dealing with continuous, normally distributed measurements, not counts of objects in a multinomial distribution.
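To make the word-count version of \( P(x_i | y) \) concrete, here is a small sketch in plain Scala. The classes and counts are made up purely for illustration; they are not from the post’s code.

```scala
object MultinomialExample {
  // wordCounts(classId) = Map(word -> count of that word in the class)
  val wordCounts: Map[String, Map[String, Int]] = Map(
    "spam" -> Map("buy" -> 20, "now" -> 15, "hello" -> 5),
    "ham"  -> Map("buy" -> 2,  "now" -> 3,  "hello" -> 30)
  )

  // P(x_i | y) = times word x_i appears in class y / total words in class y
  def likelihood(word: String, classId: String): Double = {
    val counts = wordCounts(classId)
    counts.getOrElse(word, 0).toDouble / counts.values.sum
  }

  def main(args: Array[String]): Unit = {
    println(likelihood("buy", "spam")) // 20 / 40 = 0.5
    println(likelihood("buy", "ham"))
  }
}
```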

Therefore, we want to use `GaussianNB`, which uses the following equation in the classification:

\[ P(x_i | y) = \frac{1}{\sqrt{2\pi {\sigma_y}^2}} \exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right) \]
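The Gaussian likelihood above is straightforward to compute. Here is a minimal sketch in plain Scala (no Scalding dependency), with a function name of my own choosing:

```scala
object GaussianLikelihood {
  // P(x_i | y) for feature value x, given the class's mean mu and
  // standard deviation sigma for that feature.
  def gaussianLikelihood(x: Double, mu: Double, sigma: Double): Double =
    (1.0 / math.sqrt(2 * math.Pi * sigma * sigma)) *
      math.exp(-(x - mu) * (x - mu) / (2 * sigma * sigma))

  def main(args: Array[String]): Unit = {
    // The density peaks at the class mean and falls off with distance.
    println(gaussianLikelihood(5.0, 5.0, 0.35)) // maximum, at x = mu
    println(gaussianLikelihood(6.0, 5.0, 0.35)) // much smaller
  }
}
```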

That means that our model pipe must contain the following information in order to calculate \(P(y|x_i)\):

- `classId`, \(y\) - The type of flower.
- `feature`, \(i\) - The name of the feature (sepalWidth, sepalLength, petalWidth, or petalLength).
- `classPrior`, \(P(y)\) - The prior probability of an iris belonging to the class.
- `mu`, \(\mu_y\) - The mean of the given feature within the class.
- `sigma`, \(\sigma_y\) - The standard deviation of the given feature within the class.
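Those model fields can be computed with a group-by on (class, feature). The following is a sketch in plain Scala rather than Scalding's Pipe API; the case-class and field names are my own, not the post's actual code:

```scala
object FitModel {
  // One Row per (training point, feature), e.g. ("setosa", "sepalWidth", 3.1)
  case class Row(classId: String, feature: String, value: Double)
  case class ModelEntry(classId: String, feature: String,
                        classPrior: Double, mu: Double, sigma: Double)

  def fit(rows: Seq[Row]): Seq[ModelEntry] = {
    // Every point contributes one Row per feature, so the share of rows
    // in a class equals the share of points, i.e. the class prior P(y).
    val total = rows.size.toDouble
    rows.groupBy(r => (r.classId, r.feature)).map { case ((c, f), group) =>
      val prior = rows.count(_.classId == c) / total
      val xs = group.map(_.value)
      val mu = xs.sum / xs.size                                   // mean
      val sigma =                                                 // std dev
        math.sqrt(xs.map(x => (x - mu) * (x - mu)).sum / xs.size)
      ModelEntry(c, f, prior, mu, sigma)
    }.toSeq
  }
}
```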

The `model` Pipe is then crossed with the test set and we calculate the likelihood that each point belongs to each class, \(P(y|x_i)\). Once we have the probability of a data point belonging to each class, we simply group by the point’s ID field and keep only the class with the maximum likelihood.
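For a single test point, that classification step can be sketched in plain Scala as follows. The names are illustrative (not the post's code), and log probabilities are used so the product over features doesn't underflow:

```scala
object Classify {
  case class ModelEntry(classId: String, feature: String,
                        classPrior: Double, mu: Double, sigma: Double)

  // log of the Gaussian density, for numerical stability
  def logGaussian(x: Double, mu: Double, sigma: Double): Double =
    -0.5 * math.log(2 * math.Pi * sigma * sigma) -
      (x - mu) * (x - mu) / (2 * sigma * sigma)

  // point maps feature name -> observed value, e.g. "sepalWidth" -> 3.1
  def classify(model: Seq[ModelEntry], point: Map[String, Double]): String =
    model.groupBy(_.classId).map { case (c, entries) =>
      // log P(y) + sum over features of log P(x_i | y)
      val logPrior = math.log(entries.head.classPrior)
      val logLik =
        entries.map(e => logGaussian(point(e.feature), e.mu, e.sigma)).sum
      c -> (logPrior + logLik)
    }.maxBy(_._2)._1   // keep the class with the maximum likelihood
}
```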

The results are shown below (plotting only two of the four features). The `x`’s are the training points used, and the `o`’s are successfully classified points.

For now, your best bet for using this is to just copy the code off of GitHub. If I get some time, I’d love to port this over to scalding’s typed API and combine it with some other machine learning functions (such as the K-nearest-neighbor classifier I wrote) to provide a nice little library of tools for scaling machine learning algorithms. If you’d like to be involved, get in touch.