Grep qstat output and copy files once done - bash

I am using the PBS job scheduler on my cluster. In bash, I would like to monitor the job status, and once the job is done, copy the results to a certain location (/data/myfolder/).
My qstat output looks like this:
JobID       Username  Queue  Jobname  SessID   NDS  TSK  Memory  Time  Status
------------------------------------------------------------------------------
717.XXXXXX  user      XXXX   SS       2323283  1    24   122gb   --    E
Thanks in advance

There is a script here that does this (for SGE). I started to excerpt just the relevant parts for you, but it will probably be easier to start with the full script: insert your qsub commands inside the submit_job function, then put the code for copying the results after the wait_job_finish call. You can remove the log printing at the end if you want.
#!/bin/bash
# this script will submit a qsub job and check on host information for the cluster
# node which it ends up running on
# ~~~~~ CUSTOM FUNCTIONS ~~~~~ #
submit_job () {
    local job_name="$1"
    qsub -j y -N "$job_name" -o :${PWD}/ -e :${PWD}/ <<E0F
set -x
hostname
cat /etc/hosts
python -c "import socket; print(socket.gethostbyname(socket.gethostname()))"
# sleep 5000
E0F
}
wait_job_start () {
    local job_id="$1"
    printf "waiting for job to start"
    while ! qstat | grep "$job_id" | grep -Eq '[[:space:]]r[[:space:]]'
    do
        printf "."
        sleep 1
    done
    printf "\n\n"
    local node_name="$(get_node_name "$job_id")"
    printf "Job is running on node $node_name \n\n"
}
wait_job_finish () {
    local job_id="$1"
    printf "waiting for job to finish"
    while qstat | grep -q "$job_id"
    do
        printf "."
        sleep 1
    done
    printf "\n\n"
}
check_for_job_submission () {
    local job_id="$1"
    if qstat | grep -q "$job_id" ; then
        echo "it's there"
    else
        echo "not there"
    fi
}
get_node_name () {
    local job_id="$1"
    qstat | grep "$job_id" | sed -e 's|^.*[[:space:]]\([a-zA-Z0-9.]*@[^ ]*\).*$|\1|g'
}
# ~~~~~ RUN ~~~~~ #
printf "Submitting cluster job to get node hostname and IP\n\n"
job_name="get_node_hostnames"
job_id="$(submit_job "$job_name")" # Your job 832606 ("get_node_hostnames") has been submitted
job_id="$(echo "$job_id" | sed -e 's|.*[[:space:]]\([[:digit:]]*\)[[:space:]].*|\1|g' )"
job_stdout_log="${job_name}.o${job_id}"
printf "Job ID:\t%s\nJob Name:\t%s\n\n" "$job_id" "$job_name"
wait_job_start "$job_id"
wait_job_finish "$job_id"
printf "\n\nReading log file ${job_stdout_log}\n\n"
[ -f "$job_stdout_log" ] && cat "$job_stdout_log"
printf "\n\nRemoving log file ${job_stdout_log}\n\n"
[ -f "$job_stdout_log" ] && rm -f "$job_stdout_log"
Sidenote: if you like Python, there is a slightly more robust equivalent here.
You'll probably have to make some small tweaks to both to adjust them for your PBS system, since this was written for SGE.

You can just look for " C " (completed) in the qstat status column with grep, but you could also use -o [hostname:]path to stream the output straight to the final destination, as long as you have your ssh keys set up from the node for your POSIX account.
If you end up doing grep, be a good citizen and limit your check frequency to once or twice a minute, so you don't hammer the scheduler and impact its performance.
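If you do go the grep route, a minimal polling sketch could look like this (assumptions: your qstat accepts a job ID as an argument, " C " appears in the status column on completion, and results/ stands in for your job's output directory):
#!/bin/bash
job_id="717"              # the PBS job ID from the question
dest="/data/myfolder/"
# check once a minute; stop when the job reports C (completed) or leaves the queue
while qstat "$job_id" >/dev/null 2>&1; do
    qstat "$job_id" | grep -q " C " && break
    sleep 60
done
cp -r results/* "$dest"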

Related

Slurm wait option: show time waiting

When you use the --wait flag with sbatch in a Slurm script, is it possible to display in real time how long the job has been waiting?
When sbatch is used with the --wait option, the command does not exit until the submitted job terminates.
There is no additional option available to show the pending time.
However, you can open another session and execute the following command to display the pending time (in seconds) if the job is still in pending state:
squeue --Format=PendingTime -j <jobid> --noheader
One-time display
If you simply wish to know the elapsed time before the job was scheduled, you can add the following line in your batch script:
echo "waited: $(squeue --Format=PendingTime -j $SLURM_JOB_ID --noheader | tr -d ' ')s"
Note: the tr command is used here to delete the trailing spaces added by squeue
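If you just want to keep an eye on a pending job from another terminal, a watch loop also works (a sketch; replace <jobid>, and keep the interval large enough to be gentle on the Slurm controller):
watch -n 60 'squeue --Format=PendingTime -j <jobid> --noheader'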
Real time counter
If you would like to display the elapsed time in real time you can remove the --wait option and use a sbatch-wrapper such as:
#!/bin/sh
# Time before issuing another squeue command
# XXX: Ensure this is large enough to avoid flooding the Slurm controller
WAIT=20
# Convert seconds to days:hours:minutes:seconds format
seconds_to_days()
{
    printf '%dd:%dh:%dm:%ds\n' $(($1/86400)) $(($1%86400/3600)) $(($1%3600/60)) $(($1%60))
}
# Convert days-hours:minutes:seconds time format to seconds
squeue_time_to_seconds()
{
    local time=$(echo $1 | tr -d ' ') # Remove spaces
    # Print input and return if the time format is not recognized
    echo $time | grep -q ':' ||
    {
        printf "$time"
        return
    }
    # Check if time contains hours, otherwise add 0 hours
    [ $(echo $time | awk -F: '{print NF-1}') -eq 2 ] || time="0:$time"
    # Check if time contains days, otherwise add 0 days
    echo $time | grep -q '-' || time="0-$time"
    # Parse and convert to seconds
    echo $time | tr '-' ':' |
        awk -F: '{ print ($1 * 86400) + ($2 * 3600) + ($3 * 60) + $4 }'
}
# Poll job counter with squeue
squeue_polling()
{
    local counter=$1
    local counter_description=$2
    local jobid=$3
    local prev_time="-${WAIT}"
    while true; do
        elapsed_time=$(squeue --Format=$counter -j $jobid --noheader || exit $?)
        elapsed_time=$(squeue_time_to_seconds "$elapsed_time")
        # Return in case no counter is found
        if [ -z "$elapsed_time" ]; then
            echo; return
        fi
        # Update the counter one last time if it is no longer progressing
        if [ "$elapsed_time" -lt "$((prev_time + WAIT))" ]; then
            printf "\33[2K\r$counter_description: $(seconds_to_days $prev_time)\n"
            return
        fi
        # Update the counter without calling squeue, to relieve the pressure
        # on the Slurm controller
        for i in $(seq 1 $WAIT); do
            printf "\33[2K\r$counter_description: $(seconds_to_days $(($elapsed_time + i)))"
            sleep 1
        done
        prev_time=$elapsed_time
    done
}
# Execute sbatch and display the output
OUTPUT=$(sbatch "$@")
STATUS=$?
echo "$OUTPUT"
# Exit on error (test sbatch's status, not echo's)
if [ $STATUS -ne 0 ]; then
    exit $STATUS
fi
# Parse the job ID
JOBID=$(echo $OUTPUT | sed -rn 's/Submitted batch job ([0-9]+)/\1/p')
# Display pending time until the job is scheduled
squeue_polling 'PendingTime' 'Pending time' $JOBID
# Display the time used by the allocation until the job is over
squeue_polling 'TimeUsed' 'Allocation time' $JOBID
It will act as if you submitted the job with the --wait flag (i.e., it will return when the job has completed). The pending time will be updated in real time:
./sbatch-wait <options> <batch script>
Submitted batch job 42
Pending time: 0d:0h:1m:0s
Allocation time: 0d:0h:1m:23s
An easy way is to (ab)use the pv command like this:
sbatch --wait ... | pv -t
It will look like this:
$ sbatch --wait --wrap "sleep 30" | pv -t
Submitted batch job 123456
0:00:42
and the stopwatch stops when the job completes.

Looping over IP addresses from a file using bash array

I have a file in which I have given all the IP addresses. The file looks like following:
[asad.javed@tarts16 ~]# cat file.txt
10.171.0.201
10.171.0.202
10.171.0.203
10.171.0.204
10.171.0.205
10.171.0.206
10.171.0.207
10.171.0.208
I have been trying to loop over the IP addresses by doing the following:
launch_sipp () {
readarray -t sipps < file.txt
for i in "${!sipps[@]}";do
ip1=(${sipps[i]})
echo $ip1
sip=(${i[@]})
echo $sip
done
But when I try to access the array I get only the last IP address, which is 10.171.0.208. This is how I am trying to access it in the same function launch_sipp():
local sipp=$1
echo $sipp
Ip=(${ip1[*]})
echo $Ip
Currently I have IP addresses in the same script and I have other functions that are using those IPs:
launch_tarts () {
local tart=$1
local ip=${ip[tart]}
echo " ---- Launching Tart $1 ---- "
sshpass -p "tart123" ssh -Y -X -L 5900:$ip:5901 tarts#$ip <<EOF1
export DISPLAY=:1
gnome-terminal -e "bash -c \"pwd; cd /home/tarts; pwd; ./launch_tarts.sh exec bash\""
exit
EOF1
}
kill_tarts () {
local tart=$1
local ip=${ip[tart]}
echo " ---- Killing Tart $1 ---- "
sshpass -p "tart123" ssh -tt -o StrictHostKeyChecking=no tarts#$ip <<EOF1
. ./tartsenvironfile.8.1.1.0
nohup yes | kill_tarts mcgdrv &
nohup yes | kill_tarts server &
pkill -f traf
pkill -f terminal-server
exit
EOF1
}
ip[1]=10.171.0.10
ip[2]=10.171.0.11
ip[3]=10.171.0.12
ip[4]=10.171.0.13
ip[5]=10.171.0.14
case $1 in
kill) function=kill_tarts;;
launch) function=launch_tarts;;
*) exit 1;;
esac
shift
for ((tart=1; tart<=$1; tart++)); do
($function $tart) &
ips=(${ip[tart]})
tarts+=(${tart[@]})
done
wait
How can I use a different list of IPs, read from a file, for functions created for different purposes?
How about using GNU parallel? It's a powerful, very popular, free Linux tool that is well worth knowing and easy to install.
Firstly, here's a basic parallel usage example:
$ parallel echo {} :::: list_of_ips.txt
# The four colons function as file input syntax.†
10.171.0.202
10.171.0.201
10.171.0.203
10.171.0.204
10.171.0.205
10.171.0.206
10.171.0.207
10.171.0.208
†(Specific to parallel; see the parallel usage cheatsheet here.)
But you can replace echo with just about any series of commands, however complex, or with calls to other scripts. parallel loops through the input it receives and performs the same operation (in parallel) on each input.
More specific to your question, you could simply replace echo with a call to your script.
Your script would then no longer need to handle any looping over the IPs itself, and could instead be designed to take just a single IP as input. parallel will handle running the program in parallel (you can set the number of concurrent jobs with the option -j n for any integer n).*
*By default parallel sets the number of jobs to the number of vCPUs it automatically determines your machine has available.
$ parallel process_ip.sh :::: list_of_ips.txt
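For example, to cap the number of concurrent jobs at four (process_ip.sh being the hypothetical per-IP script named above):
$ parallel -j 4 process_ip.sh :::: list_of_ips.txt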
In pure Bash:
#!/bin/bash
while read -r ip; do
echo "$ip"
# ...
done < file.txt
Or in parallel:
#!/bin/bash
while read -r ip; do
(
sleep "0.$RANDOM" # random execution time
echo "$ip"
# ...
) &
done < file.txt
wait
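For completeness, the asker's original readarray approach also works once the array is indexed correctly; a minimal sketch (file and function names taken from the question, the per-IP action is a placeholder):
launch_sipp () {
    # read every line of file.txt into the array 'sipps'
    readarray -t sipps < file.txt
    # iterate over the array indices and handle each IP in turn
    for i in "${!sipps[@]}"; do
        local ip="${sipps[i]}"
        echo "$ip"
        # ... per-IP work goes here ...
    done
}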

trap ERR doesn't work with pipes

I am trying to make a system backup script with trap "" ERR. I realized the trap doesn't get called when commands are part of pipes |.
Here are some parts of my code that don't work with trap "" ERR ...
OpenFiles=$(lsof "$Source" | wc -l)
PackagesList=$(dpkg --get-selections | awk '!/deinstall|purge|hold/ {print $1}' | tee "$FilePackagesList")
How can I get this to work without using if [ "$?" -eq 0 ]; then or similar code? Avoiding that is the reason I declared a trap this way.
Here is the script ...
root@Lian-Li:~# cat /usr/local/bin/create_incremental_backup_of_system.sh
#!/bin/bash
# Create an incremental GNU-standard backup of important system-files.
# This script works with Debian Jessie and newer systems.
# Created for my lian-li NAS 2016-11-27.
MailTo="admin#example.com" # Mail Address of an admin
Source="boot etc root usr/local usr/lib/cgi-bin var/www"
BackupDirectory=/media/hdd1/backups/lian-li
SubDir="system.d"
FileTimeStamp=$(date "+%Y%m%d%H%M%S")
FileName=$(uname -n)
File="${BackupDirectory}/${SubDir}/${FileName}-${FileTimeStamp}.tgz"
FileIncremental="${BackupDirectory}/${SubDir}/${FileName}.gtar"
FilePackagesList="${BackupDirectory}/${SubDir}/installed_packages_on_${FileName}.txt"
# have2do ...
# Backup rotate
MailContent="None"
TimeStamp=$(date "+%F %T") # This format "2011-12-31 23:59:59" is needed to read the journal
exec 1> >(logger -i -s -t "$0" -p 3) 2>&1 # all error messages are redirected to syslog journal and after that to stdout
trap "BriefExit" ERR # Provide information for an admin (via sendmail) when an error occurred and exit the script
function BriefExit(){
    rm -f "$File"
    if [ "$MailContent" = "None" ]
    then
        case "$LANG" in
            de_DE.UTF-8)
                echo "Beende Skript, aufgrund vorherige Fehler." 1>&2
                ;;
            *)
                echo "Stopping script because of previous error(s)." 1>&2
                ;;
        esac
        MailContent=$(journalctl -p 3 -o "short" --since="$TimeStamp" --no-pager)
        ScriptName="${0##*/}"
        SystemName=$(uname -n)
        MailSubject="${SystemName}: ${ScriptName}"
        echo -e "Subject: ${MailSubject}\n\n${MailContent}\n" | sendmail "$MailTo"
    fi
    exit 1
}
if [ ! -d "${BackupDirectory}/${SubDir}" ]
then
    mkdir -p "${BackupDirectory}/${SubDir}"
fi
LoopCount=0
OpenFiles=1
cd /
while [ "$OpenFiles" -ne 0 ]
do
    if [ "$LoopCount" -le 180 ]
    then
        sleep 1
        OpenFiles=$(lsof $Source | wc -l)
        LoopCount=$(($LoopCount + 1))
    else
        echo "Closing Script. Reason: Can't create incremental backup, because some files are open." 1>&2
        BriefExit
    fi
done
tar -cpzf "$File" -g "$FileIncremental" $Source
chmod 0700 "$File"
PackagesList=$(dpkg --get-selections | awk '!/deinstall|purge|hold/ {print $1}' | tee "$FilePackagesList")
while read -r PackageName
do
    case "$PackageName" in
        minidlna)
            # Code ...
            ;;
        slapd)
            # Code ...
            ;;
    esac
done <<< "$PackagesList"
exit 0
This isn't a problem with ERR traps at all, or with command substitutions, but with pipelines: a pipeline's exit status is that of its last command, so
false | true
exits successfully, unless the pipefail option is set.
Thus in OpenFiles=$(lsof "$Source" | wc -l), only a failure in wc will cause the pipeline to be considered a failure, or in PackagesList=$(dpkg --get-selections | awk '!/deinstall|purge|hold/ {print $1}' | tee "$FilePackagesList"), only a failure in tee will cause the command as a whole to be considered failed.
Put the command set -o pipefail at the top of your script if you want a failure from any pipeline component (as opposed to the last component alone) to cause the command as a whole to be considered failed -- and note the other caveats for ERR traps given in BashFAQ #105.
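A minimal demonstration of that combination (a sketch, separate from the asker's script):
#!/bin/bash
set -o pipefail
trap 'echo "a command failed" 1>&2' ERR
false | true   # without pipefail this pipeline succeeds; with it, the ERR trap fires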
Another alternative is to look at the status for each stage in the pipeline:
# cat test_bash_return.bash
true | true | false | true
echo "${PIPESTATUS[#]}"
# ./test_bash_return.bash
0 0 1 0

Conditional variables in bash script?

I'm not used to writing code in bash but I'm teaching myself. I'm trying to create a script that will query info from the process list. I've done that, but I want to take it further and make it so:
The script runs with one set of commands if A OS is present.
The script runs with a different set of commands if B OS is present.
Here's what I have so far. It works on my CentOS distro but won't work on my Ubuntu. Any help is greatly appreciated.
#!/bin/bash
pid=$(ps -eo pmem,pid | sort -nr -k 1 | cut -d " " -f 2 | head -1)
howmany=$(lsof -l -n -p $pid | wc -l)
nameofprocess=$(ps -eo pmem,fname | sort -nr -k 1 | cut -d " " -f 2 | head -1)
percent=$(ps -eo pmem,pid,fname | sort -k 1 -nr | head -1 | cut -d " " -f 1)
lsof -l -n -p $pid > ~/`date "+%Y-%m-%d-%H%M"`.process.log 2>&1
echo " "
echo "$nameofprocess has $howmany files open, and is using $percent"%" of memory."
echo "-----------------------------------"
echo "A log has been created in your home directory"
echo "-----------------------------------"
echo " "
echo ""$USER", do you want to terminate? (y/n)"
read yn
case $yn in
    [yY] | [yY][Ee][Ss] )
        kill -15 $pid
        ;;
    [nN] | [nN][Oo] )
        echo "Not killing. Powering down."
        echo "......."
        sleep 2
        ;;
    *) echo "Does not compute"
        ;;
esac
Here's my version of your script. It works with Ubuntu and Debian. It's probably safer than yours in some regards (yours clearly had a bug when a process takes more than 10% of memory, due to the awkward cut). Moreover, your ps calls are not "atomic", so things can change between the different calls of ps.
#!/bin/bash
read percent pid nameofprocess < <(ps -eo pmem,pid,fname --sort=-pmem h)
mapfile -t openfiles < <(lsof -l -n -p $pid)
howmany=${#openfiles[@]}
printf '%s\n' "${openfiles[#]}" > ~/$(date "+%Y-%m-%d-%H%M.process.log")
cat <<EOF
$nameofprocess has $howmany files open, and is using $percent% of memory.
-----------------------------------
A log has been created in your home directory
-----------------------------------
EOF
read -p "$USER, do you want to terminate? (y/n) "
case $REPLY in
    [yY] | [yY][Ee][Ss] )
        kill -15 $pid
        ;;
    [nN] | [nN][Oo] )
        echo "Not killing. Powering down."
        echo "......."
        sleep 2
        ;;
    *) echo "Does not compute"
        ;;
esac
First, check that your version of ps has the --sort flag and the h option:
--sort=-pmem tells ps to sort wrt decreasing pmem
h tells ps to not show any header
All this is fed to the read bash builtin, which reads space-separated fields (here pmem, pid, fname) and puts these values into the corresponding variables percent, pid and nameofprocess.
The mapfile command reads standard input (here the output of the lsof command) and puts each line into an element of the array. The size of this array is computed by the line howmany=${#openfiles[@]}. The output of lsof, as stored in the array openfiles, is then written to the corresponding log file.
Then, instead of the many echos, we use a cat <<EOF, and read is used with the -p (prompt) option.
I don't know if this really answers your question, but at least you have a well-written bash script with fewer useless command calls (up to your case statement, you spawned 16 processes; I spawned only 4). Moreover, after the first ps call, things can change in your script (even though it's very unlikely to happen), but not in mine.
You might also like the following, which doesn't put the output of lsof in an array but uses an extra wc call:
#!/bin/bash
read percent pid nameofprocess < <(ps -eo pmem,pid,fname --sort=-pmem h)
logfilename="~/$(date "+%Y-%m-%d-%H%M.process.log")
lsof -l -n -p $pid > "$logfilename"
howmany=$(wc -l < "$logfilename")
cat <<EOF
$nameofprocess has $howmany files open, and is using $percent% of memory.
-----------------------------------
A log has been created in your home directory ($logfilename)
-----------------------------------
EOF
read -p "$USER, do you want to terminate? (y/n) "
case $REPLY in
    [yY] | [yY][Ee][Ss] )
        kill -15 $pid
        ;;
    [nN] | [nN][Oo] )
        echo "Not killing. Powering down."
        echo "......."
        sleep 2
        ;;
    *) echo "Does not compute"
        ;;
esac
You could achieve this, for example, as follows (updated):
#!/bin/bash
# place distribution independent code here
# dist=$(lsb_release -is)
if [[ -f /etc/redhat-release ]];
then # this is a Red Hat based distribution like CentOS, Fedora, ...
    dist="redhat"
elif [[ -f /etc/issue.net ]];
then
    # dist=$(cat /etc/issue.net | cut -d' ' -f1) # debian, ubuntu, ...
    dist="ubuntu"
else
    dist="unknown"
fi
if [[ $dist == "ubuntu" ]];
then
# use your ubuntu command set
elif [[ $dist == "redhead" ]];
then
# use your centos command set
else
# do some magic here
fi
# place distribution independent code here

shell scripting help cron job not executing

#!/bin/bash
# Need help
__help() { echo "$0 [ stop|start ]" 1>&2; exit 1; }
# Not enough args to run properly
[ $# -ne 1 ] && __help
# See what we're called with
case "$1" in
start) # Start sniffer as root, under a different argv[0] and make it drop rights
s=$(/usr/local/sbin/tcpdump -n -nn -f -q -i lo | awk 'END {print NR}')
echo "$s" > eppps_$(/bin/date +'%Y%m%d%H%M')
;;
stop) # End run, first "friendly", then strict:
/usr/bin/pkill -15 -f /usr/local/sbin/tcpdump >/dev/null 2>&1|| { sleep 3s; /usr/bin/pkill -9 -f /usr/local/sbin/tc$
;;
*) # Superfluous but show we only accept these args
__help
;;
esac
exit 0
This code runs perfectly when tested manually, but when I couple it with cron it just doesn't do anything. No output file is created.
My cron entries for the script look like this:
http://postimage.org/image/1pztgd6xw/
It looks like you are not setting the working directory, so you may need to give an absolute path for the output file.
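For example, the crontab entry can cd into a known directory first (the paths here are assumptions; adjust them to your setup):
0 2 * * * cd /home/user/scripts && /home/user/scripts/sniffer.sh start
Alternatively, have the script itself write to an absolute path:
echo "$s" > /var/log/eppps_$(/bin/date +'%Y%m%d%H%M')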
