How to set queue name for Pig on Tez? - hadoop

How do I set a queue name from the command line when running Pig on Tez?
I would like to run a Pig script from the command line such as:
pig -useHCatalog -p INPUT=input_dir \
-p OUT_FILE=out_file \
-p UDF_PATH=udf.py \
-f ./script_name.pig \
-Dmapred.job.queue.name=my_queue_name \
-x tez;
I tried the following settings:
-tez.job.queue.name=my_queue_name
-q mapreduce.job.queuename=my_queue_name
-Dmapred.job.queue.name=my_queue_name
-q my_queue_name
However, my job is not running in the queue I specified.
Thank you!

The property is tez.queue.name.
<property>
  <name>tez.queue.name</name>
  <value>myqueue</value>
</property>
So try
-Dtez.queue.name=my_queue_name
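Note that Pig generally expects -D Java properties to appear before its other command-line arguments, so placement can matter too. A sketch of the invocation from the question rearranged that way (parameter names taken from the question):
pig -Dtez.queue.name=my_queue_name \
-useHCatalog \
-p INPUT=input_dir \
-p OUT_FILE=out_file \
-p UDF_PATH=udf.py \
-x tez \
-f ./script_name.pig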

In my version of Pig (0.16.0.2.5.3.0-37) I could not set this parameter on the command line.
Instead adding
SET tez.queue.name 'my_queue';
to the beginning of the Pig script did work.

Related

snowsql not found from cron tab

I am trying to execute snowsql from a shell script that I have scheduled with a cron job, but I am getting an error: snowsql: command not found.
I went through many links where they ask to give the full path to snowsql. I tried that as well, but no luck.
https://support.snowflake.net/s/question/0D50Z00007ZBOZnSAP/snowsql-through-shell-script. Below is my code snippet abc.sh:
#!/bin/bash
set -x
snowsql --config /home/basant.jain/snowsql_config.conf \
-D cust_name=mean \
-D feed_nm=lbl \
-o exit_on_error=true \
-o timing=false \
-o friendly=false \
-o output_format=csv \
-o header=false \
-o variable_substitution=True \
-q 'select count(*) from table_name'
and my crontab looks like below:
*/1 * * * * /home/basant.jain/abc.sh
Cron doesn't set PATH like your login shell does.
As you already wrote in your question, you could specify the full path to snowsql, e.g.
#!/bin/bash
/path/to/snowsql --config /home/basant.jain/snowsql_config.conf \
...
Note: /path/to/snowsql is only an example. Of course you should find out the real path of snowsql, e.g. using type snowsql.
Or you can try to source /etc/profile. Maybe this will set up PATH for calling snowsql.
#!/bin/bash
. /etc/profile
snowsql --config /home/basant.jain/snowsql_config.conf \
...
see How to get CRON to call in the correct PATHs
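A third variant, as a sketch, is to extend PATH at the top of abc.sh itself; the directory below is only a placeholder, use whichever directory type snowsql reports:
#!/bin/bash
set -x
# placeholder directory: replace with the one that `type snowsql` reports
export PATH="$PATH:/path/to/snowsql-dir"
snowsql --config /home/basant.jain/snowsql_config.conf \
...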

What is the complete list of streaming command line options possible for Hadoop YARN version?

I was browsing through the Hadoop website and found the following link for hadoop streaming.
https://hadoop.apache.org/docs/current1/streaming.html
But I am more interested in the streaming command-line options for Hadoop YARN (MRv2).
If someone has the exhaustive list, can you please post it here?
If such a list cannot be found, can somebody please tell me whether any of the command-line options in the following command are illegal?
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.jab.name="Streaming wordCount Rating" \
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-D map.output.key.field.separator=\t \
-D mapreduce.partition.keycomparator.options=-k2,2nr \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-files mapper2.py,reducer2.py \
-mapper "python mapper2.py" \
-reducer "python reducer2.py" \
-input ${OUT_DIR} \
-output ${OUT_DIR_2} > /dev/null
If you want to see all the Hadoop streaming command-line options, refer to StreamJob.java - setupOptions():
allOptions = new Options().
addOption(input).
addOption(output).
addOption(mapper).
addOption(combiner).
addOption(reducer).
addOption(file).
addOption(dfs).
addOption(additionalconfspec).
addOption(inputformat).
addOption(outputformat).
addOption(partitioner).
addOption(numReduceTasks).
addOption(inputreader).
addOption(mapDebug).
addOption(reduceDebug).
addOption(jobconf).
addOption(cmdenv).
addOption(cacheFile).
addOption(cacheArchive).
addOption(io).
addOption(background).
addOption(verbose).
addOption(info).
addOption(debug).
addOption(help).
addOption(lazyOutput);
The options related to MapReduce are general options for all MapReduce applications; to see whether they are valid, look at the mapred-default.xml configuration variables. FYI: this refers to Hadoop 2.8.0, so you might need to find the appropriate XML for your version of Hadoop.
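Since the option list above includes help and info, the streaming jar can also print its own usage text. As a quick check (reusing the jar path from the question), something like this should list the supported options:
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -info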

spark-submit: command not found

A very simple question:
I am trying to use a bash script to submit Spark jobs, but somehow it keeps complaining that it cannot find the spark-submit command.
When I just copy the command out and run it directly in my terminal, it runs fine.
My shell is fish; here's what I have in my fish config, ~/.config/fish/config.fish:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
Here's my bash script:
#!/usr/bin/env bash
SUBMIT_COMMAND="HADOOP_USER_NAME=hdfs spark-submit \
--master $MASTER \
--deploy-mode client \
--driver-memory $DRIVER_MEMORY \
--executor-memory $EXECUTOR_MEMORY \
--num-executors $NUM_EXECUTORS \
--executor-cores $EXECUTOR_CORES \
--conf spark.shuffle.compress=true \
--conf spark.network.timeout=2000s \
$DEBUG_PARAM \
--class com.fisher.coder.OfflineIndexer \
--verbose \
$JAR_PATH \
--local $LOCAL \
$SOLR_HOME \
--solrconfig 'resource:solrhome/' \
$ZK_QUORUM_PARAM \
--source $SOURCE \
--limit $LIMIT \
--sample $SAMPLE \
--dest $DEST \
--copysolrconfig \
--shards $SHARDS \
$S3_ZK_ZNODE_PARENT \
$S3_HBASE_ROOTDIR \
"
eval "$SUBMIT_COMMAND"
What I've tried:
I can run this command perfectly fine in my Mac OS X fish shell when I copy it out literally and run it directly.
However, what I want to achieve is to be able to run ./submit.sh -local, which executes the above shell script.
Any clues please?
You seem to be confused about what a fish alias is. When you run this:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
You are actually doing this:
function spark-submit
/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit $argv
end
That is, you are defining a fish function. Your bash script has no knowledge of that function. You need to either put that path in your $PATH variable or put a similar alias command in your bash script.
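For example, a minimal sketch of both options at the top of the bash script (the path is the one from your fish alias; note that bash only expands aliases in scripts after shopt -s expand_aliases):
#!/usr/bin/env bash
# Option 1: put the Spark bin directory on this script's PATH
export PATH="$PATH:/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin"

# Option 2: define an alias inside the script; bash ignores aliases in
# non-interactive shells unless this is enabled first
shopt -s expand_aliases
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'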
Make sure the Spark bin directory is added to your PATH:
export PATH=$PATH:/Users/{your_own_path_where_spark_installed}/bin
On a Mac, open one of ~/.bash_profile, ~/.zprofile, or ~/.zshrc and add the command above to the file.

Changing script from PBS to SLURM

I have just switched from PBS to SLURM and am trying to convert my submission script. With PBS it looked something like:
qsub -N $JK -e $LOGDIR/JK_MASTER.error -o $LOGDIR/JK_MASTER.log \
  -v Z="$ZBIN",NBINS="$nbins",MIN="$Theta_min" submit_MASTER_analysis.sh
Now I need something like:
sbatch --job-name=$JK -e $LOGDIR/JK_MASTER.error -o $LOGDIR/JK_MASTER.log \
  --export=Z="$ZBIN",NBINS="$nbins",MIN="$Theta_min" submit_MASTER_analysis.sh
But for some reason this is not quite executing the job; I think it's a problem with the variables.
I have found out how to do this now, so I thought I had better update the post for anyone else who is interested.
In my launch script I now have:
sbatch --job-name=REALIZ_${R}_zbin${Z} \
--output=$RAND_DIR/RANDOM_MASTER_${R}_zbin${Z}.log \
--error=$RAND_DIR/RANDOM_MASTER_${R}_zbin${Z}.error \
--ntasks=1 \
--cpus-per-task=1 \
--ntasks-per-core=1 \
--threads-per-core=1 \
submit_RANDOMS_analysis.sh $JK $ZBIN $nbins $R $Theta_min 'LOW'
where $JK $ZBIN $nbins $R $Theta_min 'LOW' are the arguments I pass through to the script I am submitting to the queue, submit_RANDOMS_analysis.sh. These are then picked up inside the submitted script; for instance, the first argument is read with JK=$1.
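As a sketch, the top of the submitted script might read the positional arguments like this (variable names mirror the launch line above; the real script may of course differ):
#!/bin/bash
# Hypothetical header for submit_RANDOMS_analysis.sh: read the arguments
# passed on the sbatch line, in order.
JK=$1
ZBIN=$2
nbins=$3
R=$4
Theta_min=$5
MODE=$6    # 'LOW' in the example above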

How to set PIG_HEAPSIZE value in a shell script triggering a pig job

Also, what is the maximum value that can be set for this? Please let me know of any preconditions I need to consider while setting this flag.
Thanks!
I configured the value PIG_HEAPSIZE=6144 just before the pig command, as shown below:
PIG_HEAPSIZE=6144 pig \
-logfile ${pig_log_file} \
-p ENV=${etl_env} \
-p OUTPUT_PATH=${pad_output_dir} \
${pig_script} >> $log_file 2>&1;
And it worked!
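If the wrapper script runs Pig more than once, an equivalent approach (a sketch reusing the variables from the snippet above) is to export the variable once at the top; the value is the heap size in MB for the JVM that launches Pig:
#!/bin/bash
export PIG_HEAPSIZE=6144   # heap size in MB for the Pig client JVM

pig -logfile ${pig_log_file} \
-p ENV=${etl_env} \
-p OUTPUT_PATH=${pad_output_dir} \
${pig_script} >> $log_file 2>&1;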
