Apache Mahout in DataStax Enterprise – Building a Classification System
DataStax Enterprise includes a ready-to-use distribution of Apache Mahout. In this post, we'll take a look at an aspect of Mahout called classification, and we'll use DataStax Enterprise's included version of Mahout to create a functioning classifier. A classifier, in simplified terms, is a system that can automatically assign input data to one of several predefined categories. Given some information I and a set of categories C, the classification system's output will be exactly one element of C, based on its examination of I. We'll see a concrete example of this soon, but first let's take a look at how to run Mahout in DataStax Enterprise.
Running Mahout in DataStax Enterprise
Included in the Mahout distribution is a handy utility program aptly named "mahout". That program is available in DataStax Enterprise as "$DSE_HOME/bin/dse mahout". Let's run it now and have a look at its output:
$ bin/dse mahout --help
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  ...
  runlogistic: : Run a logistic regression model against CSV data
  ...
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  ...
As you can see from the output, there are quite a few mahout commands available. We don't need to pay attention to most of these now, but there are a couple that are of particular interest to us, given our objective of building a classification system. These are trainlogistic and runlogistic.
The trainlogistic command is used to produce a model that can be used to automatically classify data sets of a specific format. trainlogistic takes as input a CSV file containing training data, and uses it to produce the target model. We'll have a look at the format of the training data in just a little bit, but first, here's how we use the command to produce the model:
$ bin/dse mahout trainlogistic \
    --input animals.csv \
    --output ./animal-model \
    --target good_pet \
    --categories 2 \
    --predictors easy_to_train \
                 lives_in_water \
                 lives_on_land \
                 submissive \
                 aggressive \
                 obedient \
                 bites \
                 poisonous \
                 low_cost_food \
                 requires_cage \
    --types numeric \
    --features 20 \
    --passes 100 \
    --rate 50
"input" is the path to the training data.
"output" is the path to the file that the model will be written to.
"target" is the name of the field in the training data that explicitly states the correct classification of each input record.
"categories" is the number of unique possible values that "target" can be assigned.
"predictors" is a list of field names in the training data that are to be considered key indicators in determining the value of "target".
"types" specifies the data types for the items in the "predictors" list.
"features" is the number of internal hashed features the model uses.
"passes" is the number of times the training algorithm passes over the input data.
"rate" is the learning rate.
Now let's take a look at the first few lines of the input training data set.
$ head animals.csv
"easy_to_train","lives_in_water","lives_on_land","submissive","aggressive","obedient","bites","poisonous","requires_cage","low_cost_food","good_pet"
1,0,1,1,0,1,0,0,0,1,1
1,0,1,0,0,1,0,0,0,1,1
0,0,1,1,0,0,0,0,0,1,1
1,1,0,0,0,0,1,1,1,1,0
1,1,0,1,0,1,0,0,1,1,1
0,1,0,0,1,0,1,1,1,0,0
1,0,1,1,0,1,0,1,1,1,0
1,1,0,0,1,0,1,0,1,0,0
1,1,0,0,1,0,1,1,1,0,0
The first line, as you would expect from a CSV file, lists the field names for the rows of the data. I've chosen a very simple binary data format here for the field values. Each field value can be either a 1 or a 0. A value of 1 means "yes" and a value of 0 means "no". As you might have guessed from the field names, the rows of this data set represent animals. More specifically, the field values of each row describe certain characteristics of an animal.
The Target Field Value
The last field in this data set, "good_pet", holds the answer to the main classification question we want our model to answer: given the characteristics of this animal, would it make a good pet? To keep things simple, only two possible values are allowed, just like for all the other fields. But this is specific to our use case. It's perfectly acceptable, and sometimes reasonable, to have several possible values for the target question. We could have chosen a set of values such as ["yes", "no", "unlikely", "likely", "you're out of your mind!"], but for this use case it's either "yes" or "no".
good_pet is the field name we passed into trainlogistic via its "--target" option. trainlogistic uses the value of this field to make sense of the other field values, and ultimately to learn what attribute values (when considered together) result in a 1 or a 0 for good_pet.
Now, let's take a look at the first record in our training data set:
easy_to_train = "yes"
lives_in_water = "no"
lives_on_land = "yes"
submissive = "yes"
aggressive = "no"
obedient = "yes"
bites = "no"
poisonous = "no"
requires_cage = "no"
low_cost_food = "yes"
good_pet = "yes"
Submissive, obedient, not aggressive, doesn't bite, is not poisonous... sounds like a pretty good pet to me!
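The decoding of the first record above can be reproduced with a few lines of Python. This is only a sketch: it embeds the header and first row of animals.csv as a string rather than reading the file, and the helper name decode_rows is made up for this illustration.

```python
import csv
import io

# Header and first record of animals.csv, copied from the listing above.
SAMPLE = (
    '"easy_to_train","lives_in_water","lives_on_land","submissive",'
    '"aggressive","obedient","bites","poisonous","requires_cage",'
    '"low_cost_food","good_pet"\n'
    "1,0,1,1,0,1,0,0,0,1,1\n"
)


def decode_rows(text):
    """Yield one dict per CSV row, mapping each field name to "yes" or "no"."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {name: "yes" if value == "1" else "no" for name, value in row.items()}


first = next(decode_rows(SAMPLE))
for name, answer in first.items():
    print(f"{name} = {answer}")
```

To decode the whole file, the same function could be fed the contents of animals.csv instead of SAMPLE.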
Notice that some attributes are more important than others. Bites, poisonous, and obedient are more important than lives_on_land or requires_cage, for example. During the training process, one of the primary goals is to get the model to weight the field values correctly, so that the attributes you consider to be important are the same attributes it considers to be important.
When you run trainlogistic, you tell it which fields it should consider to be key indicators of the value of the "target" field. These field names are passed into trainlogistic via its "--predictors" option. The goal is to be able to use the predictor fields to predict the correct value of the "target" field in future, previously unseen data sets.
Now that we are more familiar with how trainlogistic works, let's take a look at the output of the trainlogistic command we ran earlier:
good_pet ~ 3.756*Intercept Term + 3.756*aggressive + 17.012*bites...
  Intercept Term    3.75570
  aggressive        3.75570
  bites            17.01198
  easy_to_train    -9.49807
  lives_in_water  -22.85121
  lives_on_land   -21.37863
  low_cost_food     8.48469
  obedient          8.48469
  poisonous         8.48469
  requires_cage     8.48469
  submissive        0.78405
As you can see, it seems to have done a fairly decent job of determining from the input data which fields are more important than others. For example, bites has been weighted pretty heavily. The lives_in_water and lives_on_land fields also carry large weights, but in the rows shown earlier every animal has exactly one of the two set, so together they add a nearly constant amount to each record and do little to tell one animal from another; they are really not very important at all. Poisonous, however, seems to have a relatively low weight for the potential hardship of living with a poisonous animal, and this brings up a good point about the need for using the right training data as well as the need for sufficient training data. It's very likely that there simply wasn't enough data present to indicate just how important the poisonous attribute value really is!
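To get a feel for how weights like these turn a record into a prediction, here is a minimal sketch of a logistic regression score. It assumes the printed weights can be read as a plain linear model, which is a simplification, since Mahout hashes the predictors into an internal feature vector. It also makes no assumption about which of the two good_pet categories the resulting value refers to. The function names are made up for this illustration.

```python
import math

# Weights copied from the trainlogistic output above.
INTERCEPT = 3.75570
WEIGHTS = {
    "aggressive": 3.75570,
    "bites": 17.01198,
    "easy_to_train": -9.49807,
    "lives_in_water": -22.85121,
    "lives_on_land": -21.37863,
    "low_cost_food": 8.48469,
    "obedient": 8.48469,
    "poisonous": 8.48469,
    "requires_cage": 8.48469,
    "submissive": 0.78405,
}


def linear_score(record):
    """Intercept plus the sum of weight * value over all predictors."""
    return INTERCEPT + sum(WEIGHTS[name] * record.get(name, 0) for name in WEIGHTS)


def logistic(z):
    """Squash a linear score into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# First record of animals.csv; predictors that are 0 are simply left out.
first_record = {
    "easy_to_train": 1,
    "lives_on_land": 1,
    "submissive": 1,
    "obedient": 1,
    "low_cost_food": 1,
}

score = linear_score(first_record)
print(round(score, 5), logistic(score))
```

Only the fields set to 1 contribute to the score, which is why a single heavily weighted field such as bites can swing the result on its own.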
The process of preparing a training data set is not straightforward and we won't get into it here, but just know that garbage in really does equal garbage out. If you get the training data wrong, then it's very unlikely that the rest of the classification system will work.
One way we could have placed more importance on the poisonous attribute would be to include some more rows in our data set where all the field values indicated that the animal would make a good pet, except for the fact that it was poisonous. We'd indiscriminately set good_pet to "no" on every row where poisonous was "yes", no matter what the other field values indicated. The training algorithm used by trainlogistic would then have had a better chance of figuring out that poisonous is indeed a very important factor in determining whether an animal makes a good pet.
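As a rough sketch of that idea, the following Python generates extra rows describing an otherwise ideal pet that happens to be poisonous, always labeled good_pet = 0. The function names and the choice of "ideal" attribute values are assumptions made for this illustration. In practice you would probably vary the other attributes rather than repeat one row.

```python
import csv
import io

# Field order taken from the header of animals.csv.
FIELDS = [
    "easy_to_train", "lives_in_water", "lives_on_land", "submissive",
    "aggressive", "obedient", "bites", "poisonous", "requires_cage",
    "low_cost_food", "good_pet",
]


def poisonous_counterexamples(count):
    """Build rows for otherwise ideal pets that are poisonous."""
    ideal_but_poisonous = {
        "easy_to_train": 1, "lives_in_water": 0, "lives_on_land": 1,
        "submissive": 1, "aggressive": 0, "obedient": 1, "bites": 0,
        "poisonous": 1,      # the one bad trait
        "requires_cage": 0, "low_cost_food": 1,
        "good_pet": 0,       # always "no" when poisonous is "yes"
    }
    return [dict(ideal_but_poisonous) for _ in range(count)]


def to_csv(rows):
    """Render rows as CSV lines (no header) in the animals.csv field order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


extra = to_csv(poisonous_counterexamples(3))
print(extra, end="")
```

The resulting lines could be appended to the training file before running trainlogistic again.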
So now that we have generated a model, let's use runlogistic to test our system to see how well it does in classifying things.
$ dse mahout runlogistic \
    --input animals.csv \
    --model ./animal-model \
    --auc \
    --confusion
AUC = 0.95
confusion: [[48.0, 11.0], [8.0, 53.0]]
entropy: [[-1.0, NaN], [-7.1, -0.2]]
Here we see that runlogistic used the same input data as trainlogistic. It scores those records with the model (animal-model) that trainlogistic produced and reports how closely the predictions match the known good_pet values. Keep in mind that testing on the training data shows how well the model fits data it has already seen, so it tends to overstate how well the model will do on new data. AUC stands for Area Under the Curve and can range from 0 to 1. A value of 1 means the model ranked all the input records correctly, a value of 0.5 is what random guessing would produce, and a value of 0 means it ranked every record the wrong way around. Obviously, we want to strive for a value closer to 1. Here 0.95 is not bad at all! It suggests that our model is working well on this data.
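The confusion matrix above can be turned into a plain accuracy figure, and the meaning of AUC can be illustrated with a few made-up scores. The sketch below assumes only that the diagonal of the confusion matrix holds the correctly classified records. The scores passed to auc are hypothetical and are not output from runlogistic.

```python
# Confusion matrix printed by runlogistic above.
CONFUSION = [[48.0, 11.0], [8.0, 53.0]]


def accuracy(matrix):
    """Fraction of records on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def auc(scores_positive, scores_negative):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# 101 of the 120 records sit on the diagonal.
print(round(accuracy(CONFUSION), 3))

# Hypothetical model scores, for illustration only.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

Accuracy depends on the cut-off used to turn scores into yes/no answers, while AUC looks only at how the records are ranked, which is why the two numbers differ.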
So now we have a model that we can use to answer the question of whether a given animal will make a good pet, given a set of predefined attributes that describe the animal. What we can do now is use this model on new data that has not been seen before and it will, with some degree of accuracy, classify the data for us automatically.
In this post we took a little dive into the world of machine learning and Mahout with DataStax Enterprise. We developed a very simple classifier that is capable of telling us whether a particular animal would make a good pet, based solely on the attributes of the animal. The data set we used to train the classifier was small and simple, consisting of only a couple hundred rows of data. In real-world applications you'd use much larger training data sets. Mahout in DataStax Enterprise really starts to shine when extremely large data sets are required. This is when other systems begin to show their limitations and DataStax Enterprise with its integrated Hadoop and Mahout really excels!
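Since runlogistic was run above on the same file that was used for training, one way to see how the model handles unseen data is to hold some rows back before training. The sketch below splits a list of data rows into a training part and a test part. The function name, the 80/20 split and the fixed seed are all assumptions made for this illustration.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


def write_lines(path, header, rows):
    """Write a header line followed by the given data lines to a file."""
    with open(path, "w") as handle:
        handle.write(header + "\n")
        for row in rows:
            handle.write(row + "\n")


# Placeholder data lines standing in for the rows of animals.csv.
rows = [f"row-{i}" for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

With the two parts written out by write_lines (for example to hypothetical files named animals-train.csv and animals-test.csv), trainlogistic would read the first file and runlogistic the second.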