DataStax Enterprise 3.1 Documentation

Using Mahout

This documentation corresponds to an earlier product version. Make sure this document corresponds to your version.


DataStax Enterprise 2.1 and later support Apache Mahout, a Hadoop component that offers machine learning libraries. Mahout facilitates building intelligent applications that learn from data and user input. Machine learning has many use cases, and some are well known, such as the ability of web sites to recommend products to visitors based on previous visits.

Currently, Mahout jobs that use Lucene features are not supported.

Running the Mahout demo

The DataStax Enterprise installation includes a Mahout demo. The demo determines, with some percentage of certainty, which entries in the input data remained statistically in control and which did not. The input data is historical time series data. Using the Mahout algorithms, the demo classifies the data into categories based on whether it exhibited relatively stable behavior over a period of time. The demo produces a file of classified results.

To run the Mahout demo

  1. After installing DataStax Enterprise, start an analytics node.

  2. Go to the demos directory in one of these locations:

    • Tarball install: cd <install_location>/demos/mahout
    • Packaged install: cd /usr/share/dse-demos/mahout
  3. Run the script in the demo directory. For example, on Linux:

    ./run_mahout_example.sh
    

    If you are running OpsCenter, you can view the Hadoop job progress there.

    [Figure: Hadoop job progress displayed in OpsCenter]

When the demo completes, a message on the standard output gives the location of the output file. For example:

The output is in /tmp/clusteranalyze.txt
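The demo steps above can be sketched as a small script. This is a minimal sketch, not part of the DataStax distribution: DSE_HOME is a hypothetical variable standing in for the tarball install location, and the function names are illustrative only. The output path is the example location shown above and may differ on your system.

```shell
#!/bin/sh
# Sketch: locate the Mahout demo directory and run the demo script.
# DSE_HOME is a hypothetical variable for the tarball install location.

# Print the demo directory for an install type ("tarball" or "package").
demo_dir() {
    case "$1" in
        tarball) printf '%s\n' "${DSE_HOME:-$HOME/dse}/demos/mahout" ;;
        package) printf '%s\n' "/usr/share/dse-demos/mahout" ;;
        *) echo "unknown install type: $1" >&2; return 1 ;;
    esac
}

# Run the demo only if the script is actually present on this machine.
run_demo() {
    dir=$(demo_dir "$1") || return 1
    if [ -x "$dir/run_mahout_example.sh" ]; then
        (cd "$dir" && ./run_mahout_example.sh)
    else
        echo "demo script not found in $dir" >&2
        return 1
    fi
}

# Example use (assumes the output lands in the location shown above):
#   run_demo package && head -n 20 /tmp/clusteranalyze.txt
```

Running the demo in a subshell keeps the caller's working directory unchanged.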

Using Mahout commands in DataStax Enterprise

You can run Mahout commands on the dse command line. For example, on Mac OS X, to list the available commands:

cd ~/dse-3.0
bin/dse mahout

The list of commands appears.

Mahout command line help

To get help on a command, use the command name as the first argument, followed by the help option. For example:

cd ~/dse-3.0
bin/dse mahout arff.vector --help

The output is help on the arff.vector command.
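The help invocation can be wrapped in a small helper when you want to look up several commands. This is a sketch under assumptions: DSE_BIN is a hypothetical variable pointing at the dse launcher, and the helper names are illustrative. If the launcher is not found, the helper only prints the command it would have run.

```shell
#!/bin/sh
# Sketch: print help for a Mahout command through the dse launcher.
# DSE_BIN is a hypothetical variable; adjust it to your install location.
DSE_BIN="${DSE_BIN:-$HOME/dse-3.0/bin/dse}"

# Build the help invocation for one Mahout command name.
mahout_help_cmd() {
    printf '%s mahout %s --help\n' "$DSE_BIN" "$1"
}

# Run the help invocation if the launcher exists; otherwise just show it.
mahout_help() {
    if [ -x "$DSE_BIN" ]; then
        "$DSE_BIN" mahout "$1" --help
    else
        echo "would run: $(mahout_help_cmd "$1")"
    fi
}

mahout_help arff.vector
```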

Adding Mahout classes to the class path and executing Hadoop commands

You can use Hadoop commands to work with Mahout. Using this syntax first adds Mahout classes to the class path, and then executes the Hadoop command.

dse mahout hadoop <hadoop command> <options>

For example, the following command takes a Mahout file as input and converts it to text so that you can read it:

cd ~/dse-3.0
bin/dse mahout hadoop fs -text <mahout file> | more
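The same syntax works with other Hadoop file system commands. The sketch below lists a directory and then dumps one file as text. MAHOUT_DIR and the part-00000 file name are hypothetical placeholders, not paths taken from the product. If dse is not on the PATH, the script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: list a directory, then dump one file as text.
# MAHOUT_DIR and the part-file name are hypothetical placeholders.
MAHOUT_DIR="${MAHOUT_DIR:-/user/example/output}"

# Compose "dse mahout hadoop <hadoop command> <options>" as one string.
mahout_hadoop_cmd() {
    printf 'dse mahout hadoop %s\n' "$*"
}

if command -v dse >/dev/null 2>&1; then
    # dse is available: run the Hadoop commands with Mahout classes loaded.
    dse mahout hadoop fs -ls "$MAHOUT_DIR"
    dse mahout hadoop fs -text "$MAHOUT_DIR/part-00000" | head -n 20
else
    # dse is not available: only show what would be run.
    mahout_hadoop_cmd fs -ls "$MAHOUT_DIR"
    mahout_hadoop_cmd fs -text "$MAHOUT_DIR/part-00000"
fi
```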

The Apache web site offers an in-depth tutorial.