How to submit a Hadoop streaming job and check execution history with Hadoop 2.x

I am a newbie to Hadoop. In Hadoop 1.x, I can submit a Hadoop streaming job from the master node and check the result and execution time from the JobTracker web UI.
The following is a sample Hadoop streaming command in Hadoop 1.x:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
However, in Hadoop 2.x the JobTracker has been removed. How can I get the same functionality in Hadoop 2.x?

In Hadoop 2.x, you can view the jobs in multiple ways:
1) From the ResourceManager UI: ResourceManagerHostname:8088/cluster
2) From HUE: HUEServerHostname.com:8888/jobbrowser/
3) From the command line (once the job is completed):
usage: yarn logs -applicationId <application ID> [OPTIONS]
general options are:
-appOwner <Application Owner>     AppOwner (assumed to be current user if not specified)
-containerId <Container ID>       ContainerId (must be specified if node address is specified)
-nodeAddress <Node Address>       NodeAddress in the format nodename:port (must be specified if container id is specified)
Example: yarn logs -applicationId application_1414530900704_0005
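While a job is still running (or to find the application ID in the first place), the same yarn client can also list and describe applications. A minimal sketch, reusing the example application ID from above:
yarn application -list
yarn application -status application_1414530900704_0005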

Related

Set YARN application name for Hadoop Distcp job

NOTE: I don't want to specify a YARN queue name, as in "Hadoop: specify yarn queue for distcp".
I frequently use hadoop distcp for moving data around HDFS and would like to have a descriptive application name for these jobs.
Presently all copying jobs just appear with the name "distcp" in the Resource Manager UI and there is no way to distinguish between different jobs.
Is there a way to improve this?
Like many other MR tools, hadoop distcp also allows you to pass mapred properties using
-Dmapred.property.name=property-value
so when I use
hadoop distcp \
-Dmapred.job.name=billing_db.replicate \
-m 10 \
/user/hive/warehouse/billing_db.db/ \
s3a://my-s3-bucket/billing_db.db/
it appears nicely under that name in the Resource Manager UI.
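As a side note, mapred.job.name is the old-style key; on Hadoop 2.x the non-deprecated equivalent is mapreduce.job.name, and the deprecated key is still translated automatically. A sketch of the same invocation with the newer key (same example paths as above):
hadoop distcp \
-Dmapreduce.job.name=billing_db.replicate \
-m 10 \
/user/hive/warehouse/billing_db.db/ \
s3a://my-s3-bucket/billing_db.db/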
References
Hadoop: specify yarn queue for distcp
Sqoop User Guide: Using Generic and Specific Arguments

Error when running python map reduce job using Hadoop streaming in Google Cloud Dataproc environment

I want to run a Python MapReduce job in Google Cloud Dataproc using Hadoop streaming. My mapper and reducer Python scripts, the input file, and the job output are all located in Google Cloud Storage.
I tried to run this command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
-mapper gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
-file gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-reducer gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-input gs://bucket-name/intro_to_mapreduce/purchases.txt \
-output gs://bucket-name/intro_to_mapreduce/output_prod_cat
But I got this error output:
File:
/home/ramaadhitia/gs:/bucket-name/intro_to_mapreduce/mapper_prod_cat.py
does not exist, or is not readable.
Try -help for more information
Streaming Command Failed!
Is the Cloud Storage connector not working with Hadoop streaming? Is there any other way to run a Python MapReduce job using Hadoop streaming with the Python scripts and input file located in Google Cloud Storage?
Thank you.
The -file option from hadoop-streaming only works for local files. Note, however, that its help text mentions that the -file flag is deprecated in favor of the generic -files option. The generic -files option lets you specify a remote (hdfs:// or gs://) file to stage. Note also that generic options must precede application-specific flags.
Your invocation would become:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-mapper mapper_prod_cat.py \
-reducer reducer_prod_cat.py \
-input gs://bucket-name/intro_to_mapreduce/purchases.txt \
-output gs://bucket-name/intro_to_mapreduce/output_prod_cat
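A common follow-up gotcha (not specific to Dataproc): invoking a staged script directly by name requires the script to have a shebang line and to be executable once localized. If that is uncertain, a safer sketch of the same command calls the interpreter explicitly:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-mapper "python mapper_prod_cat.py" \
-reducer "python reducer_prod_cat.py" \
-input gs://bucket-name/intro_to_mapreduce/purchases.txt \
-output gs://bucket-name/intro_to_mapreduce/output_prod_cat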

Setting S3 output file grantees for spark output files

I'm running Spark on AWS EMR and I'm having some issues getting the correct permissions on the output files (rdd.saveAsTextFile('<file_dir_name>')). In Hive, I would add set fs.s3.canned.acl=BucketOwnerFullControl at the beginning of the script and that would set the correct permissions. For Spark, I tried running:
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.driver.extraJavaOptions -Dfs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
But the permissions do not get set properly on the output files. What is the proper way to pass in the 'fs.s3.canned.acl=BucketOwnerFullControl' or any of the S3 canned permissions to the spark job?
Thanks in advance
I found the solution. In the job, you have to access the JavaSparkContext and from there get the Hadoop configuration and set the parameter there. For example:
sc._jsc.hadoopConfiguration().set('fs.s3.canned.acl','BucketOwnerFullControl')
The proper way to pass Hadoop configuration keys to Spark is to use --conf with the key prefixed with "spark.hadoop.". Your command would look like:
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.hadoop.fs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
Unfortunately I cannot find a reference for this in the official Spark documentation.
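As an alternative (a sketch, assuming you can edit the cluster's Spark configuration and that the file lives at the usual default location, which may differ per distribution), the same spark.hadoop.-prefixed key can be set once in spark-defaults.conf so every job picks it up without an explicit --conf flag:
# /etc/spark/conf/spark-defaults.conf
spark.hadoop.fs.s3.canned.acl    BucketOwnerFullControl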

How to configure Pivotal Hadoop

We are working on a Greenplum cluster with HAWQ installed. I would like to run a hadoop-streaming job. However, it seems that Hadoop is not configured or started. How can I start MapReduce to make sure that I can use hadoop-streaming?
Try the command below to get a word count:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input <inputDir> \
-output <outputDir> \
-mapper /bin/cat \
-reducer /bin/wc
If that gives you a correct word count then it's working; otherwise, check the error that is printed when you run the command.
First, make sure that the cluster is started and working. To do that, go to the Pivotal Command Center (usually at a link like https://<admin_node>:5443/ ) and check the cluster status, or ask your administrator to do so.
Next, make sure that you have the PHD client libraries installed on the machine from which you are trying to start your job. Run "rpm -qa | grep phd".
Next, if the cluster is running and the libraries are installed, you can run the streaming job like this:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-streaming.jar \
-mapper /bin/cat \
-reducer /bin/wc \
-input /example.txt \
-output /testout
The /example.txt file should already exist on HDFS.
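Once the streaming job completes, a quick sanity check of the result (a sketch; the number and names of the part files depend on the number of reducers) is:
hadoop fs -ls /testout
hadoop fs -cat /testout/part-*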
I did this a while back on Greenplum/Pivotal Hadoop:
1. For installation:
icm_client deploy
e.g. icm_client deploy HIVE
2. For status:
HDFS
service hadoop-namenode status
service hadoop-datanode status
service hadoop-secondarynamenode status
MapReduce
service hadoop-jobtracker status
service hadoop-tasktracker status
Hive
service hive-server status
service hive-metastore status
3. For start/stop/restart:
service hive-server start
service hive-server stop
service hive-server restart
Note: you will find all of these commands and more details in the installation guide, which may also be available online.
Thanks,

Streaming job fails in HDP 2.0

I am trying to run a streaming job as below on an HDP 2.0.6 cluster.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
-input /USER/city.txt \
-output /USER/streamout \
-mapper /bin/cat \
-reducer /bin/cat \
-numReduceTasks 2
(Reference:
http://hortonworks.com/blog/using-r-and-other-non-java-languages-in-mapreduce-and-hive/ )
On executing the command, I am getting the error below.
Streaming Command Failed!
I couldn't find any other details from the console which would help me to track the issue.
However, the same command runs without any errors on my local installation of hadoop-2.2.0.2.0.6.0-76 (the same Hadoop version as on the cluster, except that the cluster is HDP installed via Ambari, while the local setup was installed manually from the tarball).
Has anyone come across such an issue?
Or does anyone have an idea about the root cause?
Any suggestions would really be appreciated.
Thanks in advance.
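Since "Streaming Command Failed!" on its own hides the real cause, a reasonable first debugging step (a sketch; the application ID below is a placeholder, and this only helps if the job was actually submitted to YARN) is to look the application up and pull its logs with the yarn client shown in the first answer above:
yarn application -list
yarn logs -applicationId <application_id>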
