Pass queue name in mapreduce job via command line - hadoop

How can we pass the queue name to a MapReduce job when running it from the command line? I have tried passing it as:
set -e; export HADOOP_USER_CLASSPATH_FIRST='true';export HADOOP_OPTS='-Djava.io.tmpdir=/tmp'; export HADOOP_CLASSPATH='/path/to/jars-1.0.jar';sudo -E -u myUser hadoop jar /path/to/jar com.pacakage.ClassName -D mapred.job.queue.name=prod_queue --input {inputPath} --output {outputPath}
I also tried setting mapred.job.queue.name as:
set -e; export HADOOP_USER_CLASSPATH_FIRST='true';export HADOOP_OPTS='-Djava.io.tmpdir=/tmp'; export HADOOP_CLASSPATH='/path/to/jars-1.0.jar';set mapred.job.queue.name=prod_queue;sudo -E -u myUser hadoop jar /path/to/jar com.pacakage.ClassName --input {inputPath} --output {outputPath}
Neither of the above commands works, and the error I am getting is:
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxxxx to YARN : Application application_xxxxx submitted by user myUser to unknown queue: default

After Hadoop 2.4.1, the property name is mapreduce.job.queuename.
If it still does not work via the command line, you can try setting the property directly in the job:
job.getConfiguration().set("mapreduce.job.queuename", queue);
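If the driver goes through ToolRunner/GenericOptionsParser, the -D flag itself is the right mechanism and the usual culprit is the old property name. A minimal sketch, reusing the placeholders from the question and assuming Hadoop 2.4.1+ (if the driver does not parse generic options, the -D flag is silently ignored):
# Sketch only: same command as in the question, with the newer property name.
# Assumes the driver parses generic options via ToolRunner/GenericOptionsParser.
sudo -E -u myUser hadoop jar /path/to/jar com.pacakage.ClassName \
    -D mapreduce.job.queuename=prod_queue \
    --input {inputPath} --output {outputPath}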

Related

How can I get the job configuration on the command line?

I get the running apps with yarn application -appStates RUNNING, then I take one application ID from the list.
Then I can get the status of the app with: yarn application -status
I want to get the job configuration information on the command line. Is that possible?
That's not the "Job Configuration"; it is the whole cluster config.
You can use cURL to query it:
$ curl -s http://localhost:8088/conf | grep defaultFS
<property><name>fs.defaultFS</name><value>file:///</value><final>false</final><source>core-default.xml</source></property>
<property><name>mapreduce.job.hdfs-servers</name><value>${fs.defaultFS}</value><final>false</final><source>mapred-default.xml</source></property>
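If what you actually want is the configuration of a specific (finished) MapReduce job rather than the cluster config, the MapReduce JobHistory server exposes it over REST as well. A hedged example, assuming the history server runs on its default port 19888 and job_xxxxx is a placeholder job id:
# Assumption: JobHistory server at localhost:19888; job_xxxxx is a placeholder job id
curl -s http://localhost:19888/ws/v1/history/mapreduce/jobs/job_xxxxx/conf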

I'm getting error code 127 while creating a Jenkins pipeline. Here is the script:

pg_dump -h 10.12.0.4 -U pet--rsmb--prod-l1--usr -w -c -f 2022-08-10t1228z-data.sql
/var/lib/jenkins/workspace/BACKUP-RSMB--POSTGRESQL#tmp/durable-510acc0f/script.sh: 1: /var/lib/jenkins/workspace/BACKUP-RSMB--POSTGRESQL#tmp/durable-510acc0f/script.sh: pg_dump: not found
Your error clearly indicates that the shell executor cannot find the pg_dump command. Either pg_dump is not set up properly on the Jenkins server, or it has not been added to the executable $PATH.
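As a quick sanity check, you can reproduce what the pipeline sees from a shell on the Jenkins agent. A sketch, assuming a Debian/Ubuntu agent; the package name and PostgreSQL version directory are assumptions, so adjust them to your setup:
# Run as the jenkins user on the agent
which pg_dump || echo "pg_dump not on PATH"
# Debian/Ubuntu example: install the PostgreSQL client tools
sudo apt-get install -y postgresql-client
# Or, if the client lives in a non-standard location, extend PATH in the
# pipeline's sh step (the version directory below is hypothetical)
export PATH="$PATH:/usr/lib/postgresql/14/bin"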

Cannot start cluster from namenode (master): different $HADOOP_HOME on datanode (slave) and namenode (master)

I am using Hadoop 1.2.1 on the master and the slave, but I have them installed in different directories. So when I invoke bin/start-dfs.sh on the master, I get the following error.
partho@partho-Satellite-L650: starting datanode, logging to /home/partho/hadoop/apache/hadoop-1.2.1/libexec/../logs/hadoop-partho-datanode-partho-Satellite-L650.out
hduser@node2-VirtualBox: bash: line 0: cd: /home/partho/hadoop/apache/hadoop-1.2.1/libexec/..: No such file or directory
hduser@node2-VirtualBox: bash: /home/partho/hadoop/apache/hadoop-1.2.1/bin/hadoop-daemon.sh: No such file or directory
partho@partho-Satellite-L650: starting secondarynamenode, logging to /home/partho/hadoop/apache/hadoop-1.2.1/libexec/../logs/hadoop-partho-secondarynamenode-partho-Satellite-L650.out
The daemons start fine on the master, as you can see below:
partho@partho-Satellite-L650:~/hadoop/apache/hadoop-1.2.1$ jps
4850 Jps
4596 DataNode
4441 NameNode
4764 SecondaryNameNode
It is obvious that Hadoop is trying to find hadoop-daemon.sh and libexec on the slave using the $HADOOP_HOME of the master.
How can I configure the individual datanodes/slaves so that, when I start the cluster from the master, the Hadoop home directory of the respective slave is checked for hadoop-daemon.sh?
Hadoop usually sets the HADOOP_HOME environment variable on each node in a file named hadoop-env.sh.
You can update hadoop-env.sh on each node with the path for that node; it is likely under /home/partho/hadoop/apache/hadoop-1.2.1/. You will probably want to stop the cluster first so it picks up the changes.
If you have locate installed, run
locate hadoop-env.sh
or find / -name "hadoop-env.sh"
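As a sketch, the per-node override could look like the lines below in each node's hadoop-env.sh; the install path is hypothetical and must match the actual directory on that node:
# Example hadoop-env.sh entries on a slave (paths are assumptions for that node)
export HADOOP_HOME=/home/hduser/hadoop/apache/hadoop-1.2.1
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64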
The best solution is to keep the Hadoop directory at the same path on every node, for example:
on the master:
/opt/hadoop
on the slave:
/opt/hadoop
It doesn't matter which version you are using, but the directory path should be the same.
Once you have set up the cluster, start all daemons from the master:
bin/hadoop namenode -format(if required)
bin/stop-dfs.sh
bin/start-dfs.sh
bin/start-mapred.sh
In order to start all nodes from the master:
- you need to install ssh on each node
- once you have installed ssh and generated an ssh key on each server, try connecting to each node from the master
- make sure the slaves file on the master node contains the IPs of all nodes
So the commands would be:
- install ssh (on each node): apt-get install openssh-server
- once ssh is installed, generate a key: ssh-keygen -t rsa -P ""
- create a passwordless login from the namenode to each node:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub user@datanodeIP
(user is the hadoop user on each machine)
- put the IPs of all nodes in the slaves file (in the conf dir) on the namenode (example below)
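For example, the conf/slaves file on the namenode could look like this; the entries are placeholders, one worker hostname or IP per line:
# conf/slaves on the namenode - one datanode per line (example entries)
192.168.1.101
192.168.1.102
node2-VirtualBox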
Short Answer
On Master-Side
hadoop-daemons.sh
In $HADOOP_HOME/sbin/hadoop-daemons.sh (not $HADOOP_HOME/sbin/hadoop-daemon.sh, there is an s in the filename), there is a line calling $HADOOP_HOME/sbin/slaves.sh. In my version (Hadoop v2.7.7), it reads:
exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_PREFIX" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$#"
Change it to the following line to make it respect slave-side environment variables:
exec "$bin/slaves.sh" "source" ".bash_aliases" \; "hadoop-daemon.sh" "$#"
yarn-daemons.sh
Similarly, in $HADOOP_HOME/sbin/yarn-daemons.sh, change the line:
exec "$bin/slaves.sh" --config $YARN_CONF_DIR cd "$HADOOP_YARN_HOME" \; "$bin/yarn-daemon.sh" --config $YARN_CONF_DIR "$#"
to
exec "$bin/slaves.sh" "source" ".bash_aliases" \; "yarn-daemon.sh" "$#"
On Slave-Side
Put all Hadoop-related environment variables into $HOME/.bash_aliases.
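A minimal sketch of what the slave-side $HOME/.bash_aliases might contain; the paths match the /home/hduser/hadoop-2.7.7 layout seen in the logs below, but adjust them to your slave's install:
# Slave-side ~/.bash_aliases (paths are examples matching the logs below)
export HADOOP_HOME=/home/hduser/hadoop-2.7.7
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin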
Start / Stop
To start HDFS, just run start-dfs.sh on the master side. The slave-side datanode will be started as if hadoop-daemon.sh start datanode were executed from an interactive shell on the slave side.
To stop HDFS, just run stop-dfs.sh.
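To verify that the slave picked everything up, you can run jps through the same non-interactive ssh path the scripts use; slave1 below is just the example hostname from the logs:
# Hypothetical check: source .bash_aliases on the slave, then list the running JVMs
ssh slave1 'source .bash_aliases; jps'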
Note
The changes above already complete the fix. But for perfectionists, you may also want to fix the callers of sbin/hadoop-daemons.sh so that the commands are correct when you dump them. In that case, find all occurrences of hadoop-daemons.sh in the Hadoop scripts and replace --script "$bin"/hdfs with --script hdfs (and, more generally, --script "$bin"/something with just --script something). In my case, all the occurrences use hdfs, and since the slave side rewrites the command path when it is hdfs related, the command works just fine with or without this fix.
Here is an example fix in sbin/start-secure-dns.sh.
Change:
"$HADOOP_PREFIX"/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script "$bin"/hdfs start datanode $dataStartOpt
to
"$HADOOP_PREFIX"/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode $dataStartOpt
In my version (Hadoop v2.7.7), the following files need to be fixed:
sbin/start-secure-dns.sh (1 occurrence)
sbin/stop-secure-dns.sh (1 occurrence)
sbin/start-dfs.sh (5 occurrences)
sbin/stop-dfs.sh (5 occurrences)
Explanation
In sbin/slaves.sh, the line which connects the master to the slaves via ssh reads:
ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" \
2>&1 | sed "s/^/$slave: /" &
I added 3 lines before it to dump the variables:
printf 'XXX HADOOP_SSH_OPTS: %s\n' "$HADOOP_SSH_OPTS"
printf 'XXX slave: %s\n' "$slave"
printf 'XXX command: %s\n' $"${@// /\\ }"
In sbin/hadoop-daemons.sh, the line calling sbin/slaves.sh reads (I split it into 2 lines to prevent scrolling):
exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_PREFIX" \; \
"$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$#"
The sbin/start-dfs.sh script calls sbin/hadoop-daemons.sh. Here is the result when sbin/start-dfs.sh is executed:
Starting namenodes on [master]
XXX HADOOP_SSH_OPTS:
XXX slave: master
XXX command: cd
XXX command: /home/hduser/hadoop-2.7.7
XXX command: ;
XXX command: /home/hduser/hadoop-2.7.7/sbin/hadoop-daemon.sh
XXX command: --config
XXX command: /home/hduser/hadoop-2.7.7/etc/hadoop
XXX command: --script
XXX command: /home/hduser/hadoop-2.7.7/sbin/hdfs
XXX command: start
XXX command: namenode
master: starting namenode, logging to /home/hduser/hadoop-2.7.7/logs/hadoop-hduser-namenode-akmacbook.out
XXX HADOOP_SSH_OPTS:
XXX slave: slave1
XXX command: cd
XXX command: /home/hduser/hadoop-2.7.7
XXX command: ;
XXX command: /home/hduser/hadoop-2.7.7/sbin/hadoop-daemon.sh
XXX command: --config
XXX command: /home/hduser/hadoop-2.7.7/etc/hadoop
XXX command: --script
XXX command: /home/hduser/hadoop-2.7.7/sbin/hdfs
XXX command: start
XXX command: datanode
slave1: bash: line 0: cd: /home/hduser/hadoop-2.7.7: Permission denied
slave1: bash: /home/hduser/hadoop-2.7.7/sbin/hadoop-daemon.sh: Permission denied
Starting secondary namenodes [master]
XXX HADOOP_SSH_OPTS:
XXX slave: master
XXX command: cd
XXX command: /home/hduser/hadoop-2.7.7
XXX command: ;
XXX command: /home/hduser/hadoop-2.7.7/sbin/hadoop-daemon.sh
XXX command: --config
XXX command: /home/hduser/hadoop-2.7.7/etc/hadoop
XXX command: --script
XXX command: /home/hduser/hadoop-2.7.7/sbin/hdfs
XXX command: start
XXX command: secondarynamenode
master: starting secondarynamenode, logging to /home/hduser/hadoop-2.7.7/logs/hadoop-hduser-secondarynamenode-akmacbook.out
As you can see from the above result, the script does not respect the slave-side .bashrc and etc/hadoop/hadoop-env.sh.
Solution
From the result above, we know that the variable $HADOOP_CONF_DIR is resolved on the master side. The problem would be solved if it were resolved on the slave side. However, since the shell created by ssh (with a command attached) is a non-interactive shell, the .bashrc script is not loaded on the slave side. Therefore, the following command prints nothing:
ssh slave1 'echo $HADOOP_HOME'
We can force it to load .bashrc:
ssh slave1 'source .bashrc; echo $HADOOP_HOME'
However, the following block in .bashrc (the default one on Ubuntu 18.04) returns early in non-interactive shells:
# If not running interactively, don't do anything
case $- in
*i*) ;;
*) return;;
esac
At this point, you may remove the above block from .bashrc to try to achieve the goal, but I don't think it's a good idea. I did not try it, but I think that the guard is there for a reason.
On my platform (Ubuntu 18.04), when I login interactively (via console or ssh), .profile loads .bashrc, and .bashrc loads .bash_aliases. Therefore, I have a habit of keeping all .profile, .bashrc, .bash_logout unchanged, and put any customizations into .bash_aliases.
If on your platform .bash_aliases does not load, append the following code to .bashrc:
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi
Back to the problem: we can load .bash_aliases instead of .bashrc. The following command does the job, and the $HADOOP_HOME from the slave side is printed:
ssh slave1 'source .bash_aliases; echo $HADOOP_HOME'
By applying this technique to the sbin/hadoop-daemons.sh script, the result is the Short Answer mentioned above.

How to access file in s3n when running a customized jar in Amazon Elastic MapReduce

I am running the below step in my cluster in EMR:
./elastic-mapreduce -j CLUSTERID -jar s3n://mybucket/somejar
--main-class SomeClass
--arg -conf --arg 's3n://mybucket/configuration.xml'
SomeClass is a Hadoop job and implements the Runnable interface. It reads configuration.xml for its parameters, but with the above command SomeClass cannot access "s3n://mybucket/configuration.xml" (no error reported). I tried "s3://mybucket/configuration.xml" and it does not work either. I am sure the file exists, since I can see it with "hadoop fs -ls s3n://mybucket/configuration.xml". Any suggestions for this problem?
Thanks,
Here are the options to try:
1. Use s3 instead of s3n.
2. Check the access permissions on the s3 bucket. You can also specify a log location and check the logs after the job fails; create the job flow like below, which gives you more debug information:
elastic-mapreduce --create --name "j_flow_name" --log-uri "s3://your_s3_bucket"
3. ./elastic-mapreduce -j JobFlowId -jar s3://your_bucket --arg "s3://your_conf_file_bucket_name" --arg "second parameter"
For more detailed information, see the EMR CLI documentation.
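Putting the pieces together, a hedged end-to-end example that creates the job flow with a log location and the custom-jar step in one command; all bucket, jar, and class names are the placeholders from the question:
# Sketch only: one job flow with logging enabled and a single custom-jar step
./elastic-mapreduce --create --name "my_job_flow" \
    --log-uri "s3://mybucket/logs/" \
    --jar s3://mybucket/somejar \
    --main-class SomeClass \
    --arg -conf --arg 's3://mybucket/configuration.xml'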

EC2 Job Flow Failure

I have a MapReduce jar file that I'd like to run from S3. It takes two args, an input dir and an output file.
So I tried the following command using the elastic-mapreduce ruby cmd line tool:
elastic-mapreduce -j j-JOBFLOW --jar s3n://this.bucket.com/jars/this.jar --arg s3n://this.bucket.com/data/ --arg s3n://this.bucket.com/output/this.csv
This failed with error
Exception in thread "main" java.lang.ClassNotFoundException: s3n://this/bucket/com/data/
So I tried it passing --input and --output with the respective args. That failed too, with an error about the --input class not being found (it seems it couldn't decipher --input itself, rather than the argument after it).
This seems like such a basic thing, yet I'm having trouble getting it to work. Any help is much appreciated. Thanks.
Try:
elastic-mapreduce --create --jar s3n://this.bucket.com/jars/this.jar --args "s3n://this.bucket.com/data/,s3n://this.bucket.com/output/this.csv"
Double-check that your jar and input data are there:
s3cmd ls s3://this.bucket.com/data/
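The same check applies to the jar itself:
s3cmd ls s3://this.bucket.com/jars/this.jar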
