<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="bbPress/1.0.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>DataStax Support Forums &#187; Topic: Problem JOINing data from Cassandra and Cfs</title>
		<link>http://www.datastax.com/support-forums/topic/problem-joining-data-from-cassandra-and-cfs</link>
		<description>Software, Support, and Training for Apache Cassandra</description>
		<language>en-US</language>
		<pubDate>Sun, 26 May 2013 03:30:33 +0000</pubDate>
		<generator>http://bbpress.org/?v=1.0.3</generator>
		<textInput>
			<title><![CDATA[Search]]></title>
			<description><![CDATA[Search all topics from these forums.]]></description>
			<name>q</name>
			<link>http://www.datastax.com/support-forums/search.php</link>
		</textInput>
		<atom:link href="http://www.datastax.com/support-forums/rss/topic/problem-joining-data-from-cassandra-and-cfs" rel="self" type="application/rss+xml" />

		<item>
			<title>Anonymous on "Problem JOINing data from Cassandra and Cfs"</title>
			<link>http://www.datastax.com/support-forums/topic/problem-joining-data-from-cassandra-and-cfs#post-1719</link>
			<pubDate>Thu, 19 Apr 2012 20:41:48 +0000</pubDate>
			<dc:creator>Anonymous</dc:creator>
			<guid isPermaLink="false">1719@http://www.datastax.com/support-forums/</guid>
			<description>&#60;p&#62;I'm having a similar problem, and it seems that the exception only happens when performing a group or join using the field that is derived from the CF row key. &#60;/p&#62;
&#60;p&#62;Here's the simplest failing example:&#60;br /&#62;
If I load a CF and flatten it to get multiple tuples for each row:&#60;/p&#62;
&#60;p&#62;&#60;code&#62;&#60;br /&#62;
p1 = LOAD 'cassandra://keyspace/CF' USING CassandraStorage() AS (key:chararray, columns: bag{T: tuple(property:chararray, value:chararray)});&#60;br /&#62;
p2 = group p1 by columns::property;&#60;br /&#62;
dump p2;&#60;br /&#62;
&#60;/code&#62;&#60;/p&#62;
&#60;p&#62;This works just fine. But if I change the group by to&#60;br /&#62;
&#60;code&#62;&#60;br /&#62;
p2 = group p1 by key;&#60;br /&#62;
dump p2;&#60;br /&#62;
&#60;/code&#62;&#60;br /&#62;
it fails with&#60;/p&#62;
&#60;blockquote&#62;&#60;p&#62;
 Type mismatch in key from map: expected org.apache.pig.impl.io.NullableText, recieved org.apache.pig.impl.io.NullableBytesWritable
&#60;/p&#62;&#60;/blockquote&#62;
&#60;p&#62;However, if I do a load, using the same schema, from a text file with lines of the form&#60;/p&#62;
&#60;blockquote&#62;&#60;p&#62;
key      {(c1,v1),(c2,v2),(c3,v3)}
&#60;/p&#62;&#60;/blockquote&#62;
&#60;p&#62;I can group (or join) by the key successfully.&#60;/p&#62;
&#60;p&#62;Any suggestions on how to proceed?&#60;/p&#62;
&#60;p&#62;Thanks,&#60;br /&#62;
John
&#60;/p&#62;</description>
		</item>
		<item>
			<title>larsen on "Problem JOINing data from Cassandra and Cfs"</title>
			<link>http://www.datastax.com/support-forums/topic/problem-joining-data-from-cassandra-and-cfs#post-545</link>
			<pubDate>Fri, 07 Oct 2011 14:00:06 +0000</pubDate>
			<dc:creator>larsen</dc:creator>
			<guid isPermaLink="false">545@http://www.datastax.com/support-forums/</guid>
			<description>&#60;p&#62;I am having troubles using Pig + Cassandra (I deployed a staging node using DataStax Auto-Clustering&#60;br /&#62;
AMI 2.0, ami-edc30384 on an EC2 m1.large instance)&#60;/p&#62;
&#60;p&#62;Here is a description of what I'm doing, with snippets from the Pig script I'm using:&#60;/p&#62;
&#60;p&#62;&#60;code&#62;&#60;br /&#62;
    register '/home/ubuntu/src/pig/contrib/piggybank/java/piggybank.jar';&#60;br /&#62;
    register '/home/ubuntu/src/pygmalion/udf/target/pygmalion-1.1.0-SNAPSHOT.jar';&#60;br /&#62;
    define FromCassandraBag org.pygmalion.udf.FromCassandraBag();&#60;br /&#62;
    define ToCassandraBag   org.pygmalion.udf.ToCassandraBag();&#60;/p&#62;
&#60;p&#62;    data = LOAD 'cassandra://RawLogTest/RawLog2' using CassandraStorage()&#60;br /&#62;
        AS (key, columns: {T: tuple( name, value ) });&#60;/p&#62;
&#60;p&#62;    rows = FOREACH data GENERATE key, FLATTEN(FromCassandraBag('..., some_id, ...', columns)) AS (&#60;br /&#62;
        ...&#60;br /&#62;
        some_id:int,&#60;br /&#62;
        ...&#60;br /&#62;
    );&#60;/p&#62;
&#60;p&#62;    rows_stripped = FOREACH rows {&#60;br /&#62;
            t = REGEX_EXTRACT( key, '([0-9]+)\\.([0-9]+)\\.([0-9]+)', 1 );&#60;br /&#62;
            GENERATE (long)t AS timestamp, ..., some_id, ...;&#60;br /&#62;
    }&#60;br /&#62;
&#60;/code&#62;&#60;/p&#62;
&#60;p&#62;Now I select a slice of rows_stripped, using parameters coming from the user.&#60;br /&#62;
Then I GROUP and generate a report from the raw data.&#60;/p&#62;
&#60;p&#62;&#60;code&#62;&#60;br /&#62;
    interval = FILTER rows_stripped BY (timestamp &#38;gt;= $FROM and timestamp &#38;lt;= $TO);&#60;br /&#62;
    grouped = GROUP interval BY (...);&#60;br /&#62;
    report = FOREACH grouped GENERATE FLATTEN( group ), COUNT( interval );&#60;br /&#62;
&#60;/code&#62;&#60;/p&#62;
&#60;p&#62;So far so good. I can DUMP all the relations defined, getting back the&#60;br /&#62;
data I expect.&#60;/p&#62;
&#60;p&#62;Now I gather some other data from CSV files I previously put on Cfs (they actually&#60;br /&#62;
come from a db table, which I imported via Sqoop). My purpose is to join&#60;br /&#62;
the two relations to augment the report.&#60;/p&#62;
&#60;p&#62;&#60;code&#62;&#60;br /&#62;
    other_info = LOAD 'lat.csv' using PigStorage(',') as (... , some_id:int, ...);&#60;br /&#62;
&#60;/code&#62;&#60;/p&#62;
&#60;p&#62;This relation is DUMPable as well without problems.&#60;/p&#62;
&#60;p&#62;And now I join the two relations into augmented_report:&#60;/p&#62;
&#60;p&#62;&#60;code&#62;&#60;br /&#62;
    augmented_report = JOIN other_info BY some_id, report BY some_id;&#60;br /&#62;
&#60;/code&#62;&#60;/p&#62;
&#60;p&#62;When I try to STORE or DUMP augmented_report, the MapReduce job fails with the following&#60;br /&#62;
error: &#60;/p&#62;
&#60;p&#62;&#60;code&#62;&#60;br /&#62;
java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableIntWritable, recieved org.apache.pig.impl.io.NullableBytesWritable&#60;br /&#62;
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1013)&#60;br /&#62;
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:690)&#60;br /&#62;
	at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)&#60;br /&#62;
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)&#60;br /&#62;
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:239)&#60;br /&#62;
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:232)&#60;br /&#62;
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)&#60;br /&#62;
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)&#60;br /&#62;
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)&#60;br /&#62;
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)&#60;br /&#62;
	at org.apache.hadoop.mapred.Child$4.run(Child.java:259)&#60;br /&#62;
	at java.security.AccessController.doPrivileged(Native Method)&#60;br /&#62;
	at javax.security.auth.Subject.doAs(Subject.java:396)&#60;br /&#62;
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)&#60;br /&#62;
	at org.apache.hadoop.mapred.Child.main(Child.java:253)&#60;br /&#62;
&#60;/code&#62;&#60;/p&#62;
&#60;p&#62;I can work around this problem by storing report in Cfs and then JOINing other_info against&#60;br /&#62;
a new relation loaded from it, but I'd like to be able to build augmented_report using data&#60;br /&#62;
from the original source.&#60;/p&#62;
&#60;p&#62;What strategy do you recommend to debug and solve the problem?&#60;/p&#62;
&#60;p&#62;Thank you,&#60;br /&#62;
s.
&#60;/p&#62;</description>
		</item>

	</channel>
</rss>
