Streaming job fails in HDP 2.0 - hadoop

I am trying to run a streaming job as below on an HDP-2.0.6 cluster.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar -input /USER/city.txt -output /USER/streamout -mapper /bin/cat -reducer /bin/cat -numReduceTasks 2
(Reference: http://hortonworks.com/blog/using-r-and-other-non-java-languages-in-mapreduce-and-hive/)
On executing the command I am getting the error below:
Streaming Command Failed!
I couldn't find any other details in the console output that would help me track down the issue.
But the same job runs without any errors on my local installation of hadoop-2.2.0.2.0.6.0-76 (the same Hadoop version as on the cluster, except that the cluster is HDP installed via Ambari, while locally Hadoop was installed manually from the tarball).
Has anyone come across such an issue, or has any idea about the root cause?
Any suggestions would really be appreciated.
Thanks in advance.

Related

Spark-submit not working when application jar is in hdfs

I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar on my local filesystem, it works. However, when I copy my application jar to a directory in HDFS, I get the following exception:
Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar.
java.lang.ClassNotFoundException: com.example.SimpleApp
Here's the command:
$ ./bin/spark-submit --class com.example.SimpleApp --master local hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar
I'm using Hadoop 2.6.0 and Spark 1.2.1.
The only way it worked for me was when I used
--master yarn-cluster
To make a jar stored on HDFS accessible to the Spark job, you have to run the job in cluster mode.
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--class <main_class> \
--master yarn-cluster \
hdfs://myhost:8020/user/root/myjar.jar
Also, there is a Spark JIRA raised for client mode, which is not supported yet:
SPARK-10643: Support HDFS application download in client mode spark submit
There is a workaround. You could mount the HDFS directory that contains your application jar as a local directory.
I did the same (with Azure blob storage, but it should be similar for HDFS).
Example command for Azure (wasb):
sudo mount -t cifs //{storageAccountName}.file.core.windows.net/{directoryName} {local directory path} -o vers=3.0,username={storageAccountName},password={storageAccountKey},dir_mode=0777,file_mode=0777
Now, in your spark-submit command, you provide the path from the command above:
$ ./bin/spark-submit --class com.example.SimpleApp --master local {local directory path}/simple-project-1.0-SNAPSHOT.jar
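For HDFS itself, a similar approach is possible if the HDFS NFS Gateway is enabled on the cluster. This is only a sketch under that assumption; the gateway host and the /mnt/hdfs mount point below are placeholders:
# create a local mount point and mount HDFS through the NFS Gateway (placeholder host)
sudo mkdir -p /mnt/hdfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync <nfs_gateway_host>:/ /mnt/hdfs
# then submit using the mounted path
$ ./bin/spark-submit --class com.example.SimpleApp --master local /mnt/hdfs/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar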
spark-submit --master spark://kssr-virtual-machine:7077 --deploy-mode client --executor-memory 1g hdfs://localhost:9000/user/wordcount.py
This works for me. I am using Hadoop 3.3.1 and Spark 3.2.1, and I am able to read the file from HDFS.
Yes, it has to be a local file. I think that's simply the answer.
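If you go that route, one simple option (just a sketch, reusing the jar path from the question) is to copy the jar out of HDFS first and submit the local copy:
# pull the jar from HDFS to a local directory
hdfs dfs -get hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar /tmp/
# submit the local copy
$ ./bin/spark-submit --class com.example.SimpleApp --master local /tmp/simple-project-1.0-SNAPSHOT.jar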

Running spark-submit with --master yarn-cluster: issue with spark-assembly

I am running Spark 1.1.0, HDP 2.1, on a kerberized cluster. I can successfully run spark-submit using --master yarn-client and the results are properly written to HDFS, however, the job doesn't show up on the Hadoop All Applications page. I want to run spark-submit using --master yarn-cluster but I continue to get this error:
appDiagnostics: Application application_1417686359838_0012 failed 2 times due to AM Container
for appattempt_1417686359838_0012_000002 exited with exitCode: -1000 due to: File does not
exist: hdfs://<HOST>/user/<username>/.sparkStaging/application_<numbers>_<more numbers>/spark-assembly-1.1.0-hadoop2.4.0.jar
.Failing this attempt.. Failing the application.
I've provisioned my account with access to the cluster. I've configured yarn-site.xml. I've cleared .sparkStaging. I've tried including --jars [path to my spark assembly in spark/lib]. I've found this question, which is very similar, yet unanswered. I can't tell if this is an HDP 2.1 issue, a Spark 1.1.0 issue, the kerberized cluster, my configuration, or something else. Any help would be much appreciated.
This is probably because you left sparkConf.setMaster("local[n]") in the code.

Executing Mahout against Hadoop cluster

I have a jar file which contains the Mahout jars as well as other code I wrote.
It works fine on my local machine.
I would like to run it on a cluster that has Hadoop already installed.
When I do
$HADOOP_HOME/bin/hadoop jar myjar.jar args
I get the error
Exception in thread "main" java.io.IOException: Mkdirs failed to create /some/hdfs/path (exists=false, cwd=file:local/folder/where/myjar/is)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
...
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
I checked that I can access and create the directory in HDFS.
I have also run Hadoop code (no Mahout) without a problem.
I am running this on a Linux machine.
Check that the Mahout user and the Hadoop user are the same, and also check Mahout and Hadoop version compatibility.
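For example (a minimal sketch; the HDFS path is a placeholder taken from the error above), you can compare the user submitting the job with the owner of the target HDFS directory, and check which Hadoop version the client on that machine picks up:
# which user is submitting the job
whoami
# who owns the parent of the output path from the error
hadoop fs -ls /some/hdfs
# which Hadoop version the client on this machine uses
hadoop version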
Regards
Jyoti ranjan panda

How to submit a Hadoop streaming job and check execution history with Hadoop 2.x

I am a newbie to Hadoop. In Hadoop 1.x, I can submit a Hadoop streaming job from the master node and check the result and execution time from the NameNode web UI.
The following is a sample Hadoop streaming command in Hadoop 1.x:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
However, in Hadoop 2.x the JobTracker has been removed. How can I get the same feature in Hadoop 2.x?
In Hadoop 2.0, you can view the jobs in multiple ways:
1) View the jobs from the ResourceManager UI: ResourceManagerHostname:8088/cluster
2) View the jobs from Hue: HUEServerHostname.com:8888/jobbrowser/
3) From the command line (once the job has completed):
usage: yarn logs -applicationId [OPTIONS]
general options are:
-appOwner AppOwner (assumed to be current user if not specified)
-containerId ContainerId (must be specified if node address is specified)
-nodeAddress NodeAddress in the format nodename:port (must be specified if container id is specified)
Example: yarn logs -applicationId application_1414530900704_0005
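You can also list applications and check the status of a specific one from the command line (a short example, assuming default settings and reusing the application id above):
# list applications in all states
yarn application -list -appStates ALL
# show the status of one application
yarn application -status application_1414530900704_0005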

How to configure Pivotal Hadoop

We are working on a Greenplum cluster with HAWQ installed. I would like to run a hadoop-streaming job. However, it seems that Hadoop is not configured or started. How can I start MapReduce to make sure that I can use hadoop-streaming?
Try the command below to get a word count:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input <inputDir> \
-output <outputDir> \
-mapper /bin/cat \
-reducer /bin/wc
If that gives you the correct word count, then it's working; otherwise, check the error that is printed when you run the command.
First, make sure that the cluster is started and working. To do this, go to the Pivotal Command Center (usually at a link like https://<admin_node>:5443/) and check the cluster status, or ask your administrator to do so.
Next, make sure that you have the PHD client libraries installed on the machine from which you are trying to start your job. Run "rpm -qa | grep phd".
Next, if the cluster is running and the libraries are installed, you can run the streaming job like this:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-streaming.jar -mapper /bin/cat -reducer /bin/wc -input /example.txt -output /testout
The /example.txt file should exist on HDFS.
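If it does not exist yet, you can copy it up from the local filesystem first, for example:
# copy a local example.txt into the root of HDFS
hadoop fs -put example.txt /example.txt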
I did this a while back on Greenplum/Pivotal Hadoop.
--1. For installation
icm_client deploy
ex. - icm_client deploy HIVE
--2. For status
HDFS
service hadoop-namenode status
service hadoop-datanode status
service hadoop-secondarynamenode status
MapRed
service hadoop-jobtracker status
service hadoop-tasktracker status
Hive
service hive-server status
service hive-metastore status
--3. For start/stop/restart
service hive-server start
service hive-server stop
service hive-server restart
Note: You will find all these commands and further details in the installation guide, which may also be available online.
Thanks,
