Hadoop Avro: correct jar files issue

I'm writing my first Avro job that is meant to take an avro file and output text. I tried to reverse engineer it from this example:
https://gist.github.com/chriswhite199/6755242
I am getting the error below though.
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
I looked around and found it was likely an issue with which jar files are being used. I'm running CDH4 with MR1 and am using the jar files below:
avro-tools-1.7.5.jar
hadoop-core-2.0.0-mr1-cdh4.4.0.jar
hadoop-mapreduce-client-core-2.0.2-alpha.jar
I can't post code for security reasons, but it shouldn't need anything not used in the example code. I don't have Maven set up yet either, so I can't follow those routes. Is there something else I can try to get around these issues?

Try using Avro 1.7.3; this looks like the AVRO-1170 bug.
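For what it's worth, "Found interface ... but class was expected" usually means MR1 and MR2/YARN jars are being mixed on the classpath, so dropping hadoop-mapreduce-client-core-2.0.2-alpha.jar and keeping only the MR1 jars may also help. Below is a minimal, untested sketch of an Avro-to-text job written against the older org.apache.avro.mapred (MR1-style) API, which stays away from the org.apache.hadoop.mapreduce classes involved in the mismatch; the class name and paths are illustrative only.
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroInputFormat;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

// Map-only job: read Avro records and emit their toString() form as text lines.
public class AvroToText {

    public static class AvroToTextMapper extends MapReduceBase
            implements Mapper<AvroWrapper<GenericRecord>, NullWritable, Text, NullWritable> {
        @Override
        public void map(AvroWrapper<GenericRecord> key, NullWritable value,
                        OutputCollector<Text, NullWritable> output, Reporter reporter)
                throws IOException {
            output.collect(new Text(key.datum().toString()), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(AvroToText.class);
        conf.setJobName("avro-to-text");

        // Old-API Avro input; if your records need a specific reader schema,
        // set it with org.apache.avro.mapred.AvroJob.setInputSchema(conf, schema).
        conf.setInputFormat(AvroInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        conf.setMapperClass(AvroToTextMapper.class);
        conf.setNumReduceTasks(0); // map-only
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}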

Related

Elephant Bird "does not exist" error while loading JSON data in Pig 0.16

Can anyone help me figure out why I am getting an error while using REGISTER to register the Elephant Bird jar files to load JSON data?
I'm working in local mode with Pig 0.16 and get these errors:
/home/shanky/Downloads/elephant-bird-hadoop-compat-4.1.jar' does not exist.
/home/shanky/Downloads/elephant-bird-pig-4.1.jar' does not exist.
Code to load json data:
REGISTER '/home/shanky/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/home/shanky/Downloads/elephant-bird-pig-4.1.jar';
REGISTER '/home/shanky/Downloads/json-simple-1.1.1.jar';
load_tweets = LOAD '/home/shanky/Downloads/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
dump load_tweets;
I tried changing the REGISTER statements by removing the quotes and by adding an hdfs:// prefix, but nothing worked for me.
The quotes shouldn't be included per the pig documentation (https://pig.apache.org/docs/r0.16.0/basic.html#register-jar), but your syntax did work for me (I'm using 0.12.0-cdh5.12.0 though).
Since you said you tried it without the quotes, some thoughts:
* You mention trying to add hdfs://: are these dependencies on HDFS by any chance? It doesn't seem like it, since they have Downloads in the path, but if they are, you won't be able to locate them when running Pig in local mode. If they are on your local filesystem, you should be able to access them with the path as you have it, whether you run locally or not.
* Are the files actually there? Are the permissions right? Etc.
* Assuming you just want to get around the issue for now, have you tried any of the other methods of registering a jar, such as -Dpig.additional.jars.uris=/home/shanky/elephant-bird-hadoop-compat-4.1.jar,/home/shanky/Downloads/elephant-bird-pig-4.1.jar (or registering them programmatically, as in the sketch below)?
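For completeness, one more way to take the grunt shell out of the equation is to drive Pig from Java (embedded Pig) and register the jars through the PigServer API. This is a rough, untested sketch that simply reuses the paths and the LOAD statement from the question; nothing here is specific to Elephant Bird beyond that.
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class RegisterElephantBirdJars {
    public static void main(String[] args) throws Exception {
        // Local mode, matching the question; use ExecType.MAPREDUCE for cluster runs.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Same jars the REGISTER statements point at.
        pig.registerJar("/home/shanky/elephant-bird-hadoop-compat-4.1.jar");
        pig.registerJar("/home/shanky/Downloads/elephant-bird-pig-4.1.jar");
        pig.registerJar("/home/shanky/Downloads/json-simple-1.1.1.jar");

        // Same LOAD as in the question.
        pig.registerQuery("load_tweets = LOAD '/home/shanky/Downloads/data.json' "
                + "USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;");

        // Equivalent of "dump load_tweets;".
        Iterator<Tuple> tweets = pig.openIterator("load_tweets");
        while (tweets.hasNext()) {
            System.out.println(tweets.next());
        }
    }
}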

HDInsight Hive not finding SerDe jar in ADD JAR statement

I've uploaded json-serde-1.1.9.2.jar to the blob store with path "/lib/" and added
ADD JAR /lib/json-serde-1.1.9.2.jar
But I am getting:
/lib/json-serde-1.1.9.2.jar does not exist
I've tried it without the path and also provided the full url to the ADD JAR statement with the same result.
Would really appreciate some help on this, thanks!
If you don't include the scheme, then Hive is going to look on the local filesystem (you can see the code around line 768 of the source).
When you include the URI, make sure you use the full form:
ADD JAR wasb:///lib/json-serde-1.1.9.2.jar
If that still doesn't work, provide your updated command as well as some details about how you are launching the code. Are you RDP'd in to the cluster running via the Hive shell, or running remote via PowerShell or some other API?

Using DSE Hadoop/Mahout, NoClassDef of org.w3c.dom.Document

I'm trying to run a simple Hadoop job, but Hadoop is throwing a NoClassDefFoundError on "org/w3c/dom/Document".
I'm trying to run the basic examples from the "Mahout In Action" book (https://github.com/tdunning/MiA).
I do this using nearly the same Maven setup, but tooled for Cassandra use rather than a file data model.
But when I try to run the *-job.jar, it throws a NoClassDefFoundError from the DataStax/Hadoop end.
I'm using version 1.0.5-dse of the driver, as that's the only one that supports the current DSE version of Cassandra (1.2.1), if that helps at all, though the issue seems to be deeper.
Attached is a gist with more info included.
There is the maven file, this brief overview, and the console output.
https://gist.github.com/zmarcantel/8d56ae4378247bc39be4
Thanks
Try dropping the jar file that contains org.w3c.dom.Document into the $DSE/resource/hadoop/lib/ folder as a workaround.

Skipping bad input files in hadoop

I'm using Amazon Elastic MapReduce to process some log files uploaded to S3.
The log files are uploaded daily from servers using S3, but it seems that some get corrupted during the transfer. This results in a java.io.IOException: IO error in map input file exception.
Is there any way to have hadoop skip over the bad files?
There's a whole bunch of record-skipping configuration properties you can use to do this; see the mapred.skip.* prefixed properties at http://hadoop.apache.org/docs/r1.2.1/mapred-default.html
There's also a nice blog post entry about this subject and these config properties:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
That said, if your file is completely corrupt (i.e. broken before the first record), you might still have issues even with these properties.
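If it helps, here is a minimal, untested sketch of turning those properties on through the old-API SkipBadRecords helper class rather than setting the raw mapred.skip.* keys by hand; the threshold values and output path are placeholders.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipBadRecordsConfig {
    public static void configure(JobConf conf) {
        // Start skip mode after the second failed attempt of a task.
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);

        // Tolerate any number of bad records around a failure instead of
        // bisecting down to the single offending record (placeholder value).
        SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);

        // Optionally keep the skipped records on HDFS so they can be inspected later.
        SkipBadRecords.setSkipOutputPath(conf, new Path("/logs/skipped-records"));
    }
}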
Chris White's comment suggesting writing your own RecordReader and InputFormat is exactly right. I recently faced this issue and was able to solve it by catching the file exceptions in those classes, logging them, and then moving on to the next file.
I've written up some details (including full Java source code) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/
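In the same spirit, here is a rough, untested sketch of what such a fault-tolerant input format could look like with the old mapred API: it wraps the stock LineRecordReader, logs any IOException, and simply gives up on the rest of that file so the job can move on. Class names are illustrative, and this is not the code from the linked post.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// TextInputFormat variant that tolerates unreadable or corrupt files: any
// IOException from the underlying reader is logged and the rest of that
// split is skipped instead of failing the task.
public class FaultTolerantTextInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new FaultTolerantLineRecordReader(job, (FileSplit) split);
    }

    static class FaultTolerantLineRecordReader implements RecordReader<LongWritable, Text> {
        private final Path path;
        private LineRecordReader delegate; // stays null if the file cannot even be opened

        FaultTolerantLineRecordReader(JobConf job, FileSplit split) {
            this.path = split.getPath();
            try {
                this.delegate = new LineRecordReader(job, split);
            } catch (IOException e) {
                System.err.println("Skipping unreadable file " + path + ": " + e);
            }
        }

        @Override
        public boolean next(LongWritable key, Text value) {
            if (delegate == null) {
                return false;
            }
            try {
                return delegate.next(key, value);
            } catch (IOException e) {
                System.err.println("Skipping rest of corrupt file " + path + ": " + e);
                return false; // give up on this split and move on to the next one
            }
        }

        @Override public LongWritable createKey() { return new LongWritable(); }
        @Override public Text createValue() { return new Text(); }
        @Override public long getPos() throws IOException { return delegate == null ? 0 : delegate.getPos(); }
        @Override public float getProgress() throws IOException { return delegate == null ? 1.0f : delegate.getProgress(); }
        @Override public void close() throws IOException { if (delegate != null) delegate.close(); }
    }
}
You would then point the job at it with conf.setInputFormat(FaultTolerantTextInputFormat.class).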

Something like .hiverc for Hue

I would like to be able to configure Hue/Hive to have a few custom jar files added and a few UDFs created so that the user does not have to do this every time.
Ideally, I am hopeful that there might be a feature similar to the Hive CLI's ".hiverc" file where I can simply put a few HQL statements to do all of this. Does anyone know if Hue has this feature? It looks like it is not using the file $HIVE_HOME/conf/.hiverc.
Alternatively, if I could handle both the custom jar file and the UDFs separately, that would be fine too. For example, I'm thinking I could put the jar in $HADOOP_HOME/lib on all of the tasktrackers, and maybe also on Hue's classpath somehow. Not sure, but I don't think this would be too difficult...
But that still leaves the UDFs. It seems like I might be able to modify the Hive source (org.apache.hadoop.hive.ql.exec.FunctionRegistry probably) and compile a custom version of Hive, but I'd really rather not go down that rabbit hole if at all possible.
It looks like this is covered by this JIRA: https://issues.cloudera.org/browse/HUE-1066 ("[beeswax] Preload jars into the environment").
