Error: Failed to create Data Storage while running embedded pig in java - hadoop

I wrote a simple program to test the embedded pig in java to run in mapreduce mode.
The hadoop version in the server I am running is 0.20.2-cdh3u4a, and pig version is 0.10.0-cdh3u4a.
When I try to run in local mode, it runs successfully. But when I try to run in mapreduce mode, it gives me the error.
I run my program using the following commands as shown in http://pig.apache.org/docs/r0.9.1/cont.html#embed-java
javac -cp pig.jar EmbedPigTest.java
javac -cp pig.jar:.:/etc/hadoop/conf EmbedPigTest.java input.txt
My program gives error as:
Exception in thread "main" java.lang.RuntimeException: Failed to create DataStorage
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
at org.apache.pig.PigServer.<init>(PigServer.java:226)
at org.apache.pig.PigServer.<init>(PigServer.java:215)
at org.apache.pig.PigServer.<init>(PigServer.java:211)
at org.apache.pig.PigServer.<init>(PigServer.java:207)
at WordCount.main(EmbedPigTest.java:9)
In some online resources they say that this problem occurs due to different hadoop version. But, I didn't understand what I should do. Suggestions please !!

This is happening because you are linking to the wrong jar, Please see the link below it describes this issue very well.
http://localsteve.wordpress.com/2012/09/30/embedding-pig-for-cdh4-java-apps-fer-realz/

I was faced same kind of issue when I tried to use pig in map reduce mode without starting the services.
Please check all services using jps before using pig in map reduce mode.

Related

Spark 2.0.1 not finding file passed in through archives flag

I was running Spark job which make use of other files that is passed in through --archives flag of spark
spark-submit .... --archives hdfs:///user/{USER}/{some_folder}.zip .... {file_to_run}.py
Spark is currently running on YARN and when I tried it with spark version 1.5.1 it was fine.
However, when I ran the same commands with spark 2.0.1, I got
ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Cannot run program "/home/{USER}/{some_folder}/.....": error=2, No such file or directory
Since the resource is managed by YARN, it is challenging to manually check if the file gets successfully decompressed and exist when the job runs.
I wonder if anyone has experienced similar issue.

Running MapReduce code that uses zooKeeper

I want to ask about how to execute a MapReduce java code that uses zooKeeper.
My first code is just to create a variable (znode) and to modify it by each mapper.
So I modified the wordCount code just to test zookeeper for the first time.
When I run it using the eclipse console, everything goes well, so I can see the changes on the value of the znode, etc.
However, I was trying to execute it using linux command line:
**bin/hadoop jar ./myjar.jar algo.WordCount /input.txt /out
I got the following error
**Error: java.lang.ClassNotFoundException: org.apache.zookeeper.Watcher
Although that I added the path of the jar file using conf.set("mapred.jar","...."); in the mapreduce code but I don't know why it did not recognize the classes of zookeeper.
Any idea?

UnsupportedOperationException: Not implemented by the KosmosFileSystem FileSystem implementation

I'd like to know your input as to why this error is happening. On production environment onshore, we're using CDH4. On our local testing environment, we're just using Apache Hadoop v2.2.0. When I run the same jar compiled on CDH4, the MR jobs are executed fine. But when I run the jar on Hadoop v2.2.0 (YARN enabled), I get this error:
INFO mapreduce.Job: Task Id : attempt_1391062333435_0001_m_000000_0, Status : FAILED
Error: java.lang.UnsupportedOperationException: Not implemented by the KosmosFileSystem FileSystem implementation
The log showed Map jobs ran successfully, but the Reduce jobs - all of them failed - with the above error. There's not too many hits on Google regarding this error so I'm kind of nowhere to run but here.
Any thoughts guys? Thanks.
Sorry for the lateness of this reply.
This problem was solved when we synched our environment with the one onshore. That is, instead of using plain Apache Hadoop, we used the Cloudera distribution.

What is 'NoClassDefFoundError: clojure.core.protocols$seq_reduce' while running Storm topology?

When I try to run the word counting topology as storm jar wordcount.jar words.txt, I get the below error:
My topology is as below:
Why I might be getting this error? I'm using storm-0.8.2 an external jar and have Storm 0.8.2 installed.
Read the stacktrace again please:
Exception in thread "main" java.lang.NoClassDefFoundError: streaming.words.txt
Does this ring a bell? Check your setup again.
Holy smokes, gcj? Clojure requires an actual java runtime environment, 1.5+. I have never found gcj to be good for much of anything, but especially I'm certain it can't run Clojure.

run pig 0.7.0 error : ERROR 2998: Unhandled internal error

I have to connect pig to a hadoop which changed a little from Hadoop 0.20.0. I choose pig 0.7.0, and setting PIG_CLASSPATH by
export PIG_CLASSPATH=$HADOOP_HOME/conf
when I run pig, an error is reported like this:
ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage
So, I copy hadoop-core.jar in $HADOOP_HOME to overwrite hadoop20.jar in $PIG_HOME/lib, then "ant". Now, I can run pig, but when I use dump or store, another error:
Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(Lorg/apache/hadoop/mapreduce/Job;Lorg/apache/ hadoop/fs/Path;)V
java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(Lorg/apache/hadoop/mapreduce/Job;Lorg/apache/hadoop/fs/ Path;)V
at org.apache.pig.builtin.BinStorage.setStoreLocation(BinStorage.java:369)
...
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
================================================================================
Does anyone have encountered this error, or is my compile way not right?
Thanks.
There is a section about this issue in the Pig FAQ which should give you a good idea what's wrong. Here is the outline taken from this page:
This usually happens when you are connecting hadoop cluster other than standard Apache hadoop 20.2 release. Pig bundles standard hadoop 20.2 jars in release. If you want to connect to other version of hadoop cluster, you need to replace bundled hadoop 20.2 jars with compatible jars. You can try:
do "ant"
copy hadoop jars from your hadoop installation to overwrite ivy/lib/Pig/hadoop-core-0.20.2.jar and ivy/lib/Pig/hadoop-test-0.20.2.jar
do "ant" again
cp pig.jar to overwrite pig-*-core.jar
Some other tricks is also possible. You can use "bin/pig -secretDebugCmd" to inspect the command line of Pig. Make sure you are using the right version of hadoop.
As pointed in this FAQ section, if nothing works I would advise just upgrading to a recent version of Pig after 0.9.1, Pig 0.7 is a bit old.
The Pig (core) jar has a bundled Hadoop dependency, which may differ from the version you want to use. If you have an old Pig version (< 0.9) the you have the option, to build a jar without Hadoop:
cd $PIG_HOME
ant jar-withouthadoop
cp $PIG_HOME/build/pig-x.x.x-dev-withouthadoop.jar $PIG_HOME
Then start Pig:
cd $PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/hadoop-core-x.x.x.jar:$HADOOP_HOME/lib/*:$HADOOP_HOME/conf:$PIG_HOME/pig-x.x.x-dev-withouthadoop.jar; ./pig
Newer Pig versions contain the prebuilt withouthadoop version (see this ticket) so you can skip the building process. Furthermore when you run pig it will pick up the withouthadoop jar from PIG_HOME rather than the bundled version, so you don't need to add withouthadoop.jar
to the PIG_CLASSPATH either (provided, that you run Pig from $PIG_HOME/bin)
..Back to your question:
Hadoop 0.20 and its modified variant (0.20-append?) can work even with the latest Pig distribution (0.11.1) :
You just need to do the followings:
unpack Pig 0.11.1
cd $PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/hadoop-core-x.x.jar:$HADOOP_HOME/lib/*:$HADOOP_HOME/conf; ./pig
If you still get "Failed to create DataStorage" it's worth to start Pig with -secretDebugCmd as Charles Menguy suggested, so that you
can see whether Pig gets the right Hadoop version..etc.
Did you remember to run start-all.sh from /usr/local/bin? I ran into the same problem and I basically retraced my steps in configuring Hadoop itself. I am able to use Pig now.

Resources