Suppose you have a MySQL data source, npa_nxx_demo, and you want to migrate its table and data to a Cassandra column family. You execute the dse command in the bin directory of the DataStax Enterprise installation, using cassandra options. For example, if you already ran the Sqoop demo, delete the npa_nxx.java file in $DSE_HOME/bin, then run this command:
sudo ./dse sqoop import --connect jdbc:mysql://127.0.0.1/npa_nxx_demo \ --username root \ --table npa_nxx \ --cassandra-keyspace newKS \ --cassandra-column-family npa_nxx_cf \ --cassandra-row-key npa_nxx_key \ --cassandra-thrift-host 127.0.0.1 \ --cassandra-create-schema
First, the dse command passes the external data source parameters:
Next, the dse command passes the Cassandra parameters:
DSE interprets these command options as follows:
--cassandra-row-key <key>
The values of the key column of the source table become the row keys for each corresponding row in Cassandra.
Note
No detection of duplicates is performed, so using a unique key in the source table is strongly recommended to prevent overwriting and losing data.
--cassandra-thrift-host <cassandra-host(s)>
A comma-separated list of IP addresses identifies Cassandra nodes.
--cassandra-create-schema
An option that creates a Cassandra keyspace and column family during the import operation.
Note
The column family is created with no column metadata and all data is imported as strings.
Usage Notes
Using this option when the keyspace already exists is valid.
Using this option when the column family already exists causes an error that aborts the import operation.
Using this option when the keyspace or column family, or both, do not exist is required; otherwise, the import operation is aborted.
Using this option creates a column family with no column metadata and all data is imported as strings. Do not use this option if you want column metadata.
Note
To generate column metadata for imported data, create a column family, setting column values and names as string types, and then import data.
After running the dse sqoop import command, a Java class appears in the DSE installation bin directory. This file, by default named npa_nxx.java after the demo table, can encapsulate one row of the imported data. You can specify the name of this JAR file, the output directory, and the class package using Sqoop command line options. For more information, see Sqoop documentation.
Snippets of the output and end result of the import operation look like this:
INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
INFO tool.CodeGenTool: Beginning code generation
INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `npa_nxx` AS t LIMIT 1
INFO orm.CompilationManager: HADOOP_HOME is /Users/robin/dev/dse-2.0/resources/hadoop/bin/..
INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-robin/compile/2e2b8b85fba83ccf1f52a8ee77c3b12f/npa_nxx.jar
. . .
INFO mapreduce.ImportJobBase: Beginning import of npa_nxx
. . .
INFO config.DseConfig: Load of settings is done.
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`npa_nxx_key`), MAX(`npa_nxx_key`) FROM `npa_nxx`
WARN db.TextSplitter: Generating splits for a textual index column.
WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
INFO mapred.JobClient: Running job: job_201203051624_0002
INFO mapred.JobClient: map 0% reduce 0%
INFO mapred.JobClient: map 25% reduce 0%
INFO mapred.JobClient: map 50% reduce 0%
INFO mapred.JobClient: map 75% reduce 0%
INFO mapred.JobClient: map 100% reduce 0%
INFO mapred.JobClient: Job complete: job_201203051624_0002
. . .
INFO mapred.JobClient: FILE_BYTES_WRITTEN=88472
INFO mapred.JobClient: CFS_BYTES_READ=587
INFO mapred.JobClient: File Input Format Counters
INFO mapred.JobClient: Bytes Read=0
INFO mapred.JobClient: Map-Reduce Framework
INFO mapred.JobClient: Map input records=105291
INFO mapred.JobClient: Spilled Records=0
INFO mapred.JobClient: Total committed heap usage (bytes)=340000768
INFO mapred.JobClient: Map output records=105291
INFO mapred.JobClient: SPLIT_RAW_BYTES=587
INFO mapreduce.ImportJobBase: Transferred 0 bytes in 10.724 seconds (0 bytes/sec)
INFO mapreduce.ImportJobBase: Retrieved 105291 records
In this version of Sqoop, there is a Map, but no Reduce, phase.