How to set PIG_HEAPSIZE value in a shell script triggering a pig job - shell

Also, what is the maximum value that can be set for this? Please let me know of any preconditions I need to consider when setting this flag.
Thanks!

I set the value PIG_HEAPSIZE=6144 just before the pig command, as shown below:
PIG_HEAPSIZE=6144 pig \
-logfile ${pig_log_file} \
-p ENV=${etl_env} \
-p OUTPUT_PATH=${pad_output_dir} \
${pig_script} >> $log_file 2>&1;
And it worked!
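On the maximum value and preconditions: PIG_HEAPSIZE sets the heap size (in MB) of the client-side JVM that runs the Pig script, so in practice it is limited by the free memory on the machine launching the job rather than by a fixed cap. As a minimal sketch using the same variables as above, you can also export it once so it applies to every pig call in the script:
# Export PIG_HEAPSIZE (in MB) once; 6144 = 6 GB of client-side heap
export PIG_HEAPSIZE=6144
pig \
-logfile ${pig_log_file} \
-p ENV=${etl_env} \
-p OUTPUT_PATH=${pad_output_dir} \
${pig_script} >> $log_file 2>&1;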

Related

psql return value / error killing the shell script that called it?

I'm running several psql commands inside a bash shell script. One of the commands imports a CSV file into a table. The problem is that the CSV file is occasionally corrupt: it has invalid characters at the end, and the import fails. When that happens, and I have the ON_ERROR_STOP=on flag set, my entire shell script stops at that point as well.
Here are the relevant bits of my bash script:
$(psql \
-X \
$POSTGRES_CONNECTION_STRING \
-w \
-b \
-L ./output.txt
-A \
-q \
--set ON_ERROR_STOP=on \
-t \
-c "\copy mytable(...) from '$input_file' csv HEADER"\
)
echo "import is done"
The above works fine as long as the CSV file isn't corrupt. If it is, however, psql prints a message to the console that begins ERROR: invalid byte sequence for encoding "UTF8": 0xb1, and my bash script apparently stops cold at that point: my echo statement above doesn't execute, and neither do any subsequent commands.
Per the psql documentation, a hard stop in psql should return an error code of 3:
psql returns 0 to the shell if it finished normally, 1 if a fatal error of its own occurs (e.g. out of memory, file not found), 2 if the connection to the server went bad and the session was not interactive, and 3 if an error occurred in a script and the variable ON_ERROR_STOP was set
That's fine and good, but is there a reason returning a value of 3 should terminate my calling bash script? And can I prevent that? I'd like to keep ON_ERROR_STOP set to on because I actually have other commands I'd like to run in that psql statement if the initial import succeeds, but not if it doesn't.
ON_ERROR_STOP will not work with the -c option.
Also, the $(...) surrounding the psql call looks wrong: do you want to execute its output as a command?
Finally, you forgot a backslash after the -L option.
Try using a “here document”:
psql \
-X \
$POSTGRES_CONNECTION_STRING \
-w \
-b \
-L ./output.txt \
-A \
-q \
--set ON_ERROR_STOP=on \
-t <<EOF
\copy mytable(...) from '$input_file' csv HEADER
EOF
echo "import is done"

snowsql not found from cron tab

I am trying to execute snowsql from a shell script that I have scheduled with a cron job, but I am getting an error like snowsql: command not found.
I went through many links that suggest giving the full path to snowsql. I tried that as well, but no luck.
https://support.snowflake.net/s/question/0D50Z00007ZBOZnSAP/snowsql-through-shell-script. Below is my code snippet abc.sh:
#!/bin/bash
set -x
snowsql --config /home/basant.jain/snowsql_config.conf \
-D cust_name=mean \
-D feed_nm=lbl \
-o exit_on_error=true \
-o timing=false \
-o friendly=false \
-o output_format=csv \
-o header=false \
-o variable_substitution=True \
-q 'select count(*) from table_name'
and my crontab looks like below:
*/1 * * * * /home/basant.jain/abc.sh
Cron doesn't set PATH like your login shell does.
As you already wrote in your question, you could specify the full path of snowsql, e.g.
#!/bin/bash
/path/to/snowsql --config /home/basant.jain/snowsql_config.conf \
...
Note: /path/to/snowsql is only an example. Of course you should find out the real path of snowsql, e.g. using type snowsql.
Or you can try to source /etc/profile. Maybe this will set up PATH for calling snowsql.
#!/bin/bash
. /etc/profile
snowsql --config /home/basant.jain/snowsql_config.conf \
...
see How to get CRON to call in the correct PATHs
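A third option is to set PATH at the top of the crontab itself so that every scheduled job inherits it. A sketch with placeholder directories; append whichever directory type snowsql reports:
# crontab -e
PATH=/usr/local/bin:/usr/bin:/bin:/home/basant.jain/bin
*/1 * * * * /home/basant.jain/abc.sh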

spark-submit: command not found

A very simple question:
I'm trying to use a bash script to submit Spark jobs, but it keeps complaining that it cannot find the spark-submit command.
When I just copy the command out and run it directly in my terminal, it runs fine.
My shell is fish; here's what I have in my fish config, ~/.config/fish/config.fish:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
Here's my bash script:
#!/usr/bin/env bash
SUBMIT_COMMAND="HADOOP_USER_NAME=hdfs spark-submit \
--master $MASTER \
--deploy-mode client \
--driver-memory $DRIVER_MEMORY \
--executor-memory $EXECUTOR_MEMORY \
--num-executors $NUM_EXECUTORS \
--executor-cores $EXECUTOR_CORES \
--conf spark.shuffle.compress=true \
--conf spark.network.timeout=2000s \
$DEBUG_PARAM \
--class com.fisher.coder.OfflineIndexer \
--verbose \
$JAR_PATH \
--local $LOCAL \
$SOLR_HOME \
--solrconfig 'resource:solrhome/' \
$ZK_QUORUM_PARAM \
--source $SOURCE \
--limit $LIMIT \
--sample $SAMPLE \
--dest $DEST \
--copysolrconfig \
--shards $SHARDS \
$S3_ZK_ZNODE_PARENT \
$S3_HBASE_ROOTDIR \
"
eval "$SUBMIT_COMMAND"
What I've tried:
I can run this command perfectly fine in my fish shell on Mac OS X when I copy it out literally and run it directly.
However, what I want to achieve is to be able to run ./submit.sh -local, which executes the script above.
Any clues please?
You seem to be confused about what a fish alias is. When you run this:
alias spark-submit='/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit'
You are actually doing this:
function spark-submit
/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin/spark-submit $argv
end
That is, you are defining a fish function. Your bash script has no knowledge of that function. You need to either put that path in your $PATH variable or put a similar alias command in your bash script.
Make sure the Spark bin directory is added to your PATH:
export PATH=$PATH:/Users/{your_own_path_where_spark_installed}/bin
On macOS, open one of ~/.bash_profile, ~/.zprofile, or ~/.zshrc and add the command above to the file.
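Since ./submit.sh starts a non-interactive bash process, it reads neither the fish config nor your interactive profile files, so another option is to extend PATH inside the script itself. A minimal sketch; the Spark path is the one from the alias in the question, so adjust it to your install:
#!/usr/bin/env bash
# Make spark-submit visible to this script without relying on the fish alias
export PATH="$PATH:/Users/MY_NAME/Downloads/spark-2.0.2-bin-hadoop2.7/bin"

spark-submit --version   # quick check that the command now resolves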

Changing script from PBS to SLURM

I have just switched from PBS to SLURM. My script originally looked something like:
qsub -N $JK -e $LOGDIR/JK_MASTER.error -o $LOGDIR/JK_MASTER.log \
-v Z="$ZBIN",NBINS="$nbins",MIN="$Theta_min" submit_MASTER_analysis.sh
Now I need something like:
sbatch --job-name=$JK -e $LOGDIR/JK_MASTER.error -o $LOGDIR/JK_MASTER.log \
--export=Z="$ZBIN",NBINS="$nbins",MIN="$Theta_min" \
submit_MASTER_analysis.sh
But for some reason this is not quite executing the job; I think it's a problem with the variables.
I have found out how to do this now, so I thought I had better update the post for anyone else interested.
In my launch script I now have
sbatch --job-name=REALIZ_${R}_zbin${Z} \
--output=$RAND_DIR/RANDOM_MASTER_${R}_zbin${Z}.log \
--error=$RAND_DIR/RANDOM_MASTER_${R}_zbin${Z}.error \
--ntasks=1 \
--cpus-per-task=1 \
--ntasks-per-core=1 \
--threads-per-core=1 \
submit_RANDOMS_analysis.sh $JK $ZBIN $nbins $R $Theta_min 'LOW'
where $JK $ZBIN $nbins $R $Theta_min 'LOW' are the arguments I pass through to the script I am submitting to the queue, submit_RANDOMS_analysis.sh. These are then picked up inside the submitted script, for instance the first argument as JK=$1.
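For completeness, this is roughly how those positional arguments are read at the top of the submitted script; a sketch with the variable names from the post and the order assumed to match the sbatch line above:
#!/bin/bash
# submit_RANDOMS_analysis.sh: pick up the arguments passed on the sbatch line
JK=$1
ZBIN=$2
nbins=$3
R=$4
Theta_min=$5
MODE=$6   # 'LOW' in the example above; the name MODE is a placeholder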

How to set queue name for Pig on Tez?

How do I set the queue name from the command line when running Pig on Tez?
I would like to run a Pig script from the command line such as:
pig -useHCatalog -p INPUT=input_dir \
-p OUT_FILE=out_file \
-p UDF_PATH=udf.py \
-f ./script_name.pig \
-Dmapred.job.queue.name=my_queue_name \
-x tez;
I tried the following settings:
-tez.job.queue.name=my_queue_name
-q mapreduce.job.queuename=my_queue_name
-Dmapred.job.queue.name=my_queue_name
-q my_queue_name
However, my job is not running in the queue I specified.
Thank you!
The property is tez.queue.name.
<property>
  <name>tez.queue.name</name>
  <value>myqueue</value>
</property>
So try passing it as a Java property before the other Pig options:
-Dtez.queue.name=my_queue_name
In my version of Pig (0.16.0.2.5.3.0-37) I could not set this parameter on the command line.
Instead, adding
SET tez.queue.name 'my_queue';
to the beginning of the Pig script did work.
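Putting the two answers together, a hedged sketch of the original invocation: either pass the property with -D placed before the other Pig options, or drop the flag and rely on the SET line at the top of script_name.pig (whether the command-line form is honoured depends on the Pig/Tez version, as noted above):
pig -Dtez.queue.name=my_queue_name \
-useHCatalog \
-p INPUT=input_dir \
-p OUT_FILE=out_file \
-p UDF_PATH=udf.py \
-f ./script_name.pig \
-x tez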
