Multithreading semaphore for bash script (sub-processes) - bash

Is there any way or binary that provides a semaphore-like structure, e.g. for running a fixed number of background sub-processes as we loop through a directory of files? (I say "sub-process" rather than "thread" because I am appending & to my bash commands to do the "multithreading", but I would be open to more convenient suggestions.)
My actual use case is using a binary called bcp on CentOS 7 to write a variable-sized set of TSV files to a remote MSSQL Server DB. I have observed that the program seems to have a problem when too many threads are running. E.g. something like
for filename in $DATAFILES/$TARGET_GLOB; do
if [ ! -f $filename ]; then
echo -e "\nFile $filename not found!\nExiting..."
exit 255
else
echo -e "\nImporting $filename data to $DB/$TABLE"
fi
echo -e "\nStarting BCP export threads for $filename"
/opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
$TO_SERVER_ODBCDSN \
-U $USER -P $PASSWORD \
-d $DB \
$RECOMMEDED_IMPORT_MODE \
-t "\t" \
-e ${filename}.bcperror.log &
done
# collect all subprocesses at the end
wait
which starts a new sub-process for every file all at once, without any limit, appears to crash each sub-process. I would like to see whether adding a semaphore-like structure to the loop, to cap the number of sub-processes that are spun up, would help. E.g. something like (using some non-bash pseudo-code here)
sem = Semaphore(locks=5)
for filename in $DATAFILES/$TARGET_GLOB; do
if [ ! -f $filename ]; then
echo -e "\nFile $filename not found!\nExiting..."
exit 255
else
echo -e "\nImporting $filename data to $DB/$TABLE"
fi
sem.lock()
<same code from original loop>
sem.unlock()
done
# collect all subprocesses at the end
wait
If anything like this is possible or if this is a common problem with an existing best practice solution (I'm pretty new to bash programming), advice would be appreciated.

This isn't strictly equivalent, but you can use xargs to start up to a given number of processes at once:
-P max-procs, --max-procs=max-procs
Run up to max-procs processes at a time; the default is 1. If
max-procs is 0, xargs will run as many processes as possible at
a time. Use the -n option or the -L option with -P; otherwise
chances are that only one exec will be done. While xargs is
running, you can send its process a SIGUSR1 signal to increase
the number of commands to run simultaneously, or a SIGUSR2 to
decrease the number. You cannot decrease it below 1. xargs
never terminates its commands; when asked to decrease, it merely
waits for more than one existing command to terminate before
starting another.
Something like:
printf "%s\n" $DATAFILES/$TARGET_GLOB |
xargs -d '\n' -I {} --max-procs=5 bash -c '
filename=$1
if [ ! -f $filename ]; then
echo -e "\nFile $filename not found!\nExiting..."
exit 255
else
echo -e "\nImporting $filename data to $DB/$TABLE"
fi
echo -e "\nStarting BCP export threads for $filename"
/opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
$TO_SERVER_ODBCDSN \
-U $USER -P $PASSWORD \
-d $DB \
$RECOMMEDED_IMPORT_MODE \
-t "\t" \
-e ${filename}.bcperror.log
' _ {}
You'll need to export the TABLE, TO_SERVER_ODBCDSN, USER, PASSWORD, DB and RECOMMEDED_IMPORT_MODE variables beforehand, so that they're available in the processes started by xargs. Alternatively, you can put the commands run by bash -c in a separate script and set the variables in that script.
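As a minimal sketch of the export step described above (not the original script): variables marked with export are inherited by the bash -c processes that xargs starts. The function name import_all, the *.tsv pattern and the echo that stands in for the bcp call are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Sketch only: import_all and the echo placeholder are hypothetical;
# the real bcp command line would replace the echo.
import_all() {
    local dir=$1 max_procs=${2:-5}
    # Exported variables are inherited by every process xargs starts.
    export DB TABLE
    # NUL-delimited names survive spaces and newlines in file names.
    find "$dir" -maxdepth 1 -type f -name '*.tsv' -print0 |
        xargs -0 -n 1 -P "$max_procs" bash -c '
            echo "Importing $1 into $DB/$TABLE"
        ' _
}
```

With DB and TABLE set in the calling script, import_all "$DATAFILES" 5 would keep at most five of these child shells running at a time.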

Following the recommendation by @Mark Setchell, I used GNU Parallel to replace the loop (in a simulated cron environment (see https://stackoverflow.com/a/2546509/8236733)) with
bcpexport() {
filename=$1
TO_SERVER_ODBCDSN=$2
DB=$3
TABLE=$4
USER=$5
PASSWORD=$6
RECOMMEDED_IMPORT_MODE=$7
DELIMITER=$8 # DO NOT use format like "'\t'", nested quotes seem to cause hard-to-catch error
<same code from original loop>
}
export -f bcpexport
parallel -j 10 bcpexport \
::: $DATAFILES/$TARGET_GLOB \
::: "$TO_SERVER_ODBCDSN" \
::: $DB \
::: $TABLE \
::: $USER \
::: $PASSWORD \
::: $RECOMMEDED_IMPORT_MODE \
::: $DELIMITER
to run at most 10 jobs at a time. Here $DATAFILES/$TARGET_GLOB is a glob that expands to all of the files we want to process (e.g. "$storagedir/tsv/*.tsv"), and each file it returns is combined with the remaining fixed arguments, which are passed as additional parallel input sources. (The $TO_SERVER_ODBCDSN variable is actually "-D -S <some ODBC DSN>", so it needs quotes to be passed as a single argument.) So if the $DATAFILES/$TARGET_GLOB glob returns files A, B, C, ..., we end up running the commands
bcpexport A "$TO_SERVER_ODBCDSN" $DB ...
bcpexport B "$TO_SERVER_ODBCDSN" $DB ...
bcpexport C "$TO_SERVER_ODBCDSN" $DB ...
...
in parallel. An additional nice thing about using parallel is that
GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially.
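For comparison with the Semaphore pseudo-code in the question, a counting semaphore can also be approximated in plain bash with a FIFO that holds one token per free slot. This is only a sketch under stated assumptions: the names sem_init, sem_acquire, sem_release and run_limited are made up for illustration, file descriptor 9 is assumed to be unused, and opening a FIFO read-write is assumed to work as it does on Linux. It does not rely on wait -n, which needs bash 4.3 or later.

```shell
#!/usr/bin/env bash
# Sketch of a counting semaphore; all function names here are hypothetical.
sem_init() {                      # sem_init <slots>
    local dir i
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/sem" || return 1
    exec 9<>"$dir/sem"            # read-write open, so the open does not block
    rm -rf "$dir"                 # the descriptor stays valid after unlinking
    for ((i = 0; i < $1; i++)); do
        printf '\n' >&9           # one token (an empty line) per free slot
    done
}
sem_acquire() { local _; read -r -u 9 _; }   # blocks until a token is available
sem_release() { printf '\n' >&9; }           # hand the token back

run_limited() {                   # run_limited <command> [args...]
    sem_acquire
    { "$@"; sem_release; } &      # release the slot when the command finishes
}
```

In the original loop this would amount to calling sem_init 5 once, starting each import with run_limited instead of a trailing &, and keeping the final wait.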

Using &
Example code
#!/bin/bash
xmms2 play &
sleep 5
xmms2 next &
sleep 1
xmms2 stop

Related

Looping over IP addresses from a file using bash array

I have a file in which I have given all the IP addresses. The file looks like following:
[asad.javed@tarts16 ~]#cat file.txt
10.171.0.201
10.171.0.202
10.171.0.203
10.171.0.204
10.171.0.205
10.171.0.206
10.171.0.207
10.171.0.208
I have been trying to loop over the IP addresses by doing the following:
launch_sipp () {
readarray -t sipps < file.txt
for i in "${!sipps[@]}";do
ip1=(${sipps[i]})
echo $ip1
sip=(${i[@]})
echo $sip
done
But when I try to access the array I get only the last IP address which is 10.171.0.208. This is how I am trying to access in the same function launch_sipp():
local sipp=$1
echo $sipp
Ip=(${ip1[*]})
echo $Ip
Currently I have IP addresses in the same script and I have other functions that are using those IPs:
launch_tarts () {
local tart=$1
local ip=${ip[tart]}
echo " ---- Launching Tart $1 ---- "
sshpass -p "tart123" ssh -Y -X -L 5900:$ip:5901 tarts@$ip <<EOF1
export DISPLAY=:1
gnome-terminal -e "bash -c \"pwd; cd /home/tarts; pwd; ./launch_tarts.sh exec bash\""
exit
EOF1
}
kill_tarts () {
local tart=$1
local ip=${ip[tart]}
echo " ---- Killing Tart $1 ---- "
sshpass -p "tart123" ssh -tt -o StrictHostKeyChecking=no tarts@$ip <<EOF1
. ./tartsenvironfile.8.1.1.0
nohup yes | kill_tarts mcgdrv &
nohup yes | kill_tarts server &
pkill -f traf
pkill -f terminal-server
exit
EOF1
}
ip[1]=10.171.0.10
ip[2]=10.171.0.11
ip[3]=10.171.0.12
ip[4]=10.171.0.13
ip[5]=10.171.0.14
case $1 in
kill) function=kill_tarts;;
launch) function=launch_tarts;;
*) exit 1;;
esac
shift
for ((tart=1; tart<=$1; tart++)); do
($function $tart) &
ips=(${ip[tart]})
tarts+=(${tart[@]})
done
wait
How can I use different list of IPs for a function created for different purpose from a file?
How about using GNU parallel? It is a powerful, popular, free Linux tool that is easy to install.
First, here is a basic usage example:
$ parallel echo {} :::: list_of_ips.txt
# The four colons function as file input syntax.†
10.171.0.202
10.171.0.201
10.171.0.203
10.171.0.204
10.171.0.205
10.171.0.206
10.171.0.207
10.171.0.208
†(Specific to parallel; see the parallel usage cheatsheet.)
But you can replace echo with just about any series of commands, however complex, or with calls to other scripts. parallel loops through the input it receives and performs the same operation on each input, in parallel.
More specific to your question, you could simply replace echo with a call to your script:
$ parallel process_ip.sh :::: list_of_ips.txt
Your script would then no longer need to loop through the IPs itself and could instead be written to handle a single IP as input. parallel will handle running the program in parallel (you can set the number of concurrent jobs with the option -j n for any integer n)*.
*By default, parallel sets the number of jobs to the number of CPU cores it detects on your machine.
In pure Bash:
#!/bin/bash
while read ip; do
echo "$ip"
# ...
done < file.txt
Or in parallel:
#!/bin/bash
while read ip; do
(
sleep "0.$RANDOM" # random execution time
echo "$ip"
# ...
) &
done < file.txt
wait
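The loop in the question overwrites ip1 on every iteration, which is why only the last address is left afterwards. As a minimal sketch of keeping the whole list, the file can be read into an array once and indexed by tart number, mirroring the ip[tart] lookups in the existing functions. The helper names load_ips and ip_for are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: load_ips and ip_for are hypothetical helper names.
load_ips() {
    # One array element per line of the file; -t strips the trailing newline.
    mapfile -t ips < "$1"
}
ip_for() {
    # Tart numbers start at 1, array indices at 0.
    local tart=$1
    echo "${ips[tart-1]}"
}
```

A separate file and a separate array name could be used in the same way for functions that need a different list of addresses.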

How to convert for loop to multiple job submission?

I submit a job to cluster using qsub SubmitJob.sh. It works well but takes a long time to finish. Inside of SubmitJob.sh there is for loop which runs sequentially. I would like to convert my for loop for parallel job submission, such that each of them submits a single job (SubmitJob.sh).
#!/bin/bash
#$ -S /bin/bash
#$ -V -cwd
#$ -e ./error.$JOB_NAME.$JOB_ID
#$ -o ./outpt.$JOB_NAME.$JOB_ID
#$ -l h_vmem=256g
##$ -q long
##$ -pe smp 4
#$ -l h_rt=24:00:00
cd /mydirectroy/
for ID in $(cat FilID.txt) ; do
Do_Somthing -n $ID -o /OutputDirectory/$ID
done
I had to do something like this once or twice. The general idea is that you pass parts of an array by reference to a function and run that function in child processes. I chose the square root of the number of items as the chunk size, because the workload grows linearly with the number of items to process.
#! /bin/bash
FILE="FilID.txt"
DATA=($(cat ${FILE}))
AMOUNT=${#DATA[@]}
RANGE=$(echo "sqrt(${AMOUNT})" | bc)
echo ${AMOUNT}
echo $RANGE
function _child {
local -n numbers=$1
echo "From ${numbers[0]} to ${numbers[-1]}"
for n in ${numbers[@]}; do echo -n "$n, "; done
echo
}
for ((i=0; i<AMOUNT; i+=RANGE)) {
part=(${DATA[@]:$i:$RANGE})
_child part &
# wait
}
wait
exit 0
You can test the script by populating FilID.txt as follows. Uncomment the wait in the for loop for readable output.
$ seq 0 98 > FilID.txt
You might want to wait until every N child processes have finished before you start the next batch. Back when I executed the script, the load became too high and Linux chose to kill our virtual development environment :p
P.S. If FilID.txt contains filenames with spaces, you have to set IFS=$'\n' or something similar.
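A different technique from the chunking above, sketched here on the assumption that the cluster runs Grid Engine (the #$ directives in the question suggest it): an array job lets the scheduler start one task per ID, and each task reads its own line of FilID.txt via SGE_TASK_ID. The helper name id_for_task is hypothetical, and the echo stands in for the real command.

```shell
#!/usr/bin/env bash
# Sketch for a Grid Engine array job, submitted for example as:
#   qsub -t 1-"$(wc -l < FilID.txt)" SubmitJob.sh
# Each task receives its own index in SGE_TASK_ID.
id_for_task() {
    # Print line number $2 of file $1 (lines are numbered from 1).
    sed -n "${2}p" "$1"
}

# Only act when running as an array task with a numeric task index.
if [[ ${SGE_TASK_ID:-} =~ ^[0-9]+$ && -r FilID.txt ]]; then
    ID=$(id_for_task FilID.txt "$SGE_TASK_ID")
    echo "task $SGE_TASK_ID handles $ID"   # the real command would run here
fi
```

The scheduler then decides how many tasks run at once, so the script itself no longer needs a loop or background processes.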

Unix shell scripting, need to adjust my script for performance?

I have a script below that does a few things...
#!/bin/bash
# Script to sync dr-xxxx
# 1. Check for locks and die if exists
# 2. CPIO directories found in cpio.cfg
# 3. RSYNC to remote server
# 5. TRAP and remove lock so we can run again
if ! mkdir /tmp/drsync.lock; then
printf "Failed to acquire lock.\n" >&2
exit 1
fi
trap 'rm -rf /tmp/drsync.lock' EXIT # remove the lockdir on exit
# Config specific to CPIO
BASE=/home/mirxx
DUMP_DIR=/usrx/drsync
CPIO_CFG="$BASE/cpio.cfg"
while LINE=: read -r f1 f2
do
echo "Working with $f1"
cd $f1
find . -print | cpio -o | gzip > $DUMP_DIR/$f2.cpio.gz
echo "Done for $f1"
done <"$CPIO_CFG"
RSYNC=/usr/bin/rsync # use latest version
RSYNC_BW="4500" # 4.5MB/sec
DR_PATH=/usrx/drsync
DR_USER=root
DR_HOST=dr-xxxx
I=0
MAX_RESTARTS=5 # max rsync retries before quitting
LAST_EXIT_CODE=1
while [ $I -le $MAX_RESTARTS ]
do
I=$(( $I + 1 ))
echo $I. start of rsync
$RSYNC \
--partial \
--progress \
--bwlimit=$RSYNC_BW \
-avh $DUMP_DIR/*gz \
$DR_USER@$DR_HOST:$DR_PATH
LAST_EXIT_CODE=$?
if [ $LAST_EXIT_CODE -eq 0 ]; then
break
fi
done
# check if successful
if [ $LAST_EXIT_CODE -ne 0 ]; then
echo rsync failed for $I times. giving up.
else
echo rsync successful after $I times.
fi
What I would like to change above is, for this line..
find . -print | cpio -o | gzip > $DUMP_DIR/$f2.cpio.gz
I would like to change it so that it starts a parallel process for every entry in CPIO_CFG that gets fed in. I believe I have to use & at the end? Should I implement any safety precautions?
Is it also possible to modify the above command to include an exclude list that I can pass in via $f3 in the cpio.cfg file?
For the below code..
while [ $I -le $MAX_RESTARTS ]
do
I=$(( $I + 1 ))
echo $I. start of rsync
$RSYNC --partial --progress --bwlimit=$RSYNC_BW -avh $DUMP_DIR/*gz $DR_USER@$DR_HOST:$DR_PATH
LAST_EXIT_CODE=$?
if [ $LAST_EXIT_CODE -eq 0 ]; then
break
fi
done
The same question applies here: is it possible to run multiple rsync processes, one for each .gz file found in $DUMP_DIR/*.gz?
I think the above would greatly increase the speed of my script, the box is fairly beefy (AIX 7.1, 48 cores and 192GB RAM).
Thank you for your help.
The original code is a traditional batch queue. Let's add a bit of lean thinking...
The actual workflow is the transformation and transfer of a set of directories in compressed cpio format. Assuming that there is no dependency between the directories/archives, we should be able to create a single action for creating the archive and the transfer.
It helps if we break up the script into functions, which should make our intentions more visible.
First, create a function transfer_archive() with archive_name and an optional number_of_attempts as arguments. This contains your second while loop, but replaces $DUMP_DIR/*gz with $archive_name. Details will be left as an exercise.
function transfer_archive {
typeset archive_name=${1:?"pathname to archive expected"}
typeset number_of_attempts=${2:-1}
(
n=0
while
((n++))
((n<=number_of_attempts))
do
${RSYNC:?} \
--partial \
--progress \
--bwlimit=${RSYNC_BW:?} \
-avh ${archive_name:?} ${DR_USER:?}@${DR_HOST:?}:${DR_PATH:?} && exit 0
done
exit 1
)
}
Inside the function we use a subshell, ( ... ) with two exit statements.
The function will return the exit value of the subshell, either true (rsync succeeded), or false (too many attempts).
We then combine that with archive creation:
function create_and_transfer_archive {
(
# only cd in a subshell - no confusion upstairs
cd ${DUMP_DIR:?Missing global setting} || exit
dir=${1:?directory}
archive=${2:?archive}
# cd, find and cpio must be in the same subshell together
(cd ${dir:?} && find . -print | cpio -o ) |
gzip > ${archive:?}.cpio.gz || return # bail out
transfer_archive ${archive:?}.cpio.gz
)
}
Finally, your main loop will process all directories in parallel:
while LINE=: read -r dir archive_base
do
(
create_and_transfer_archive $dir ${archive_base:?} &&
echo $dir Done || echo $dir failed
) &
done <"$CPIO_CFG" | cat
Instead of the pipe with cat, you could just add wait at the end of the script, but
it has the nice effect of capturing all output from the background processes.
Now, I've glossed over one important aspect, and that is the number of jobs you can run in
parallel. This will scale reasonably well, but it would be better to actually maintain a
job queue. Above a certain number, adding more jobs will start to slow things down, and
at that point you will have to add a job counter and a job limit. Once the job limit is
reached, stop starting more create_and_transfer_archive jobs, until processes have completed.
How to keep track of those jobs is a separate question.
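One way the job counter and job limit described above could look, as a rough sketch rather than a definitive implementation: count the running background jobs before starting a new one and block while the limit is reached. The names max_jobs and throttle are hypothetical, and wait -n is assumed to be available (bash 4.3 or later).

```shell
#!/usr/bin/env bash
# Sketch: max_jobs and throttle are hypothetical names; needs bash 4.3+ for wait -n.
max_jobs=4

throttle() {
    # Block while the number of running background jobs is at the limit.
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n    # returns as soon as any one background job exits
    done
}
```

In the main loop, throttle would be called just before each backgrounded create_and_transfer_archive subshell, so that no more than max_jobs of them run at once.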

Bash script to run query on 28 cores

I am trying to have an outfile query run as a single process per value in an array, to speed up exporting data from MySQL; I'd like to run the script on multiple cores. My bash script is:
dbquery=$(mysql -u user -p -e "SELECT distinct(ticker) FROM db.table")
array=( $( for i in $dbquery ; do echo $i ; done ) )
csv ()
{
dbquery=$(mysql -u user --password=password -e "SELECT * FROM db2.table2 WHERE symbol = '$i' INTO OUTFILE '/tmp/$i.csv' FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
}
set -m
for i in `seq 28`; do #trying to run on 28 cores
for j in ${array[@]}; do
csv $j &
done
sleep 5 &
done
while [ 1 ];
do
fg 2> /dev/null; [ $? == 1 ] && break;
done
I ran this, but it is not exporting files as I wished, and I cannot figure out how to kill the processes. Could you help me understand how to fix this so that it runs the outfile query once per ticker? Also, how do I kill the script that is currently running without killing other scripts and programs that are running?
You can use xargs to automatically handle job scheduling:
dbquery=$(mysql -u user -p -e "SELECT distinct(ticker) FROM db.table")
array=( $( for i in $dbquery ; do echo $i ; done ) )
csv ()
{
dbquery=$(mysql -u user --password=password -e "SELECT * FROM db2.table2 WHERE symbol = '$1' INTO OUTFILE '/tmp/$1.csv' FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
}
export -f csv
echo "${array[@]}" | xargs -P 28 -n 1 bash -c 'csv "$1"' --
The problem with your approach is that because the loops are nested, you start all processes 28 times each, rather than running them once and 28 at a time.
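If you keep the background jobs, drop the outer loop and replace the fg loop with wait, which blocks until all the child processes are done. Note that this starts every job at once rather than 28 at a time:
for j in ${array[@]}; do
csv "$j" &
done
wait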
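To make the argument passing in the xargs approach concrete, here is a small stand-alone sketch. The function export_ticker is a hypothetical stand-in for csv (an echo replaces the mysql call), and run_exports is likewise an illustrative name. The point is that the exported function reads its ticker from $1, and xargs runs each ticker exactly once with a bounded number of jobs.

```shell
#!/usr/bin/env bash
# Sketch: export_ticker stands in for csv; the echo replaces the mysql call.
export_ticker() {
    local ticker=$1            # use the positional argument, not a loop variable
    echo "exporting $ticker"
}
export -f export_ticker        # make the function visible to child bash processes

run_exports() {
    local max_procs=$1
    shift
    # One ticker per line; xargs keeps at most max_procs jobs running.
    printf '%s\n' "$@" |
        xargs -n 1 -P "$max_procs" bash -c 'export_ticker "$1"' --
}
```

Calling run_exports 28 "${array[@]}" would then process every ticker once, at most 28 at a time; the order of the output lines is not guaranteed.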

SSH commands via bash script

I've made several failed attempts to do the following:
Basically, what I need is to execute several sequenced commands on a remote unix shell, such as setting environment variables with variables that I have on the script, move to a particular directory and run a script there and so on.
I've tried using printf with that portion of the script and piping it to the ssh command, but it didn't work well. I've also read about the "ssh ... << END" marker, which is great, but since I'm using functions, it doesn't work well.
Do you have any thoughts?
Here's an excerpt of the code:
deployApp() {
inputLine=$1;
APP_SPECIFIC_DEPLOY_SCRIPT="$(echo $inputLine | cut -d ' ' -s -f1)";
BRANCH="$(echo $inputLine | cut -d ' ' -s -f2)";
JBOSS_HOME="$(echo $inputLine | cut -d ' ' -s -f3)";
BASE_PORT="$(echo $inputLine | cut -d ' ' -s -f4)";
JAVA_HOME_FOR_JBOSS="$(echo $inputLine | cut -d ' ' -s -f5)";
JAVA_HEAP="$(echo $inputLine | cut -d ' ' -s -f6)";
echo "DEPLOYING $APP_SPECIFIC_DEPLOY_SCRIPT"
echo "FROM BRANCH $BRANCH"
echo "IN JBOSS $JBOSS_HOME"
echo "WITH BASE PORT $BASE_PORT"
echo "USING $JAVA_HOME_FOR_JBOSS"
if [[ -n "$JAVA_HEAP" ]]; then
echo "WITH $JAVA_HEAP"
fi
echo
echo "Exporting jboss to $JBOSS_HOME"
ssh me@$SERVER <<END
cleanup() {
rm -f $JBOSS_SERVER/log/*.log
rm -Rf $JBOSS_SERVER/deploy/
rm -Rf $JBOSS_SERVER/tmp/
mkdir $JBOSS_SERVER/deploy
}
startJboss() {
cd $JBOSS_SERVER/bin
./jbossctl.sh start
return 0;
}
export JBOSS_HOME
export JBOSS_SERVER=$JBOSS_HOME/server/default
END
return 0;
}
With that "HERE" approach, I'm getting this error: "syntax error: unexpected end of file"
Thanks a lot in advance!
Just put the functions in your here document, too:
var="Hello World"
ssh user@host <<END
x() {
print "x function with args=$*"
}
x "$var"
END
[EDIT] Some comments:
You say "export JBOSS_HOME" but you never define a value for the variable in the here document. You should use export JBOSS_HOME="$JBOSS_HOME". BASH will take all text between the two END, replace all variables, and send the result to SSH for processing.
That also means the other side will see rm -f /path/to/jboss/server/*.log; the assignment to JBOSS_SERVER in the last line of the here document has no effect (at least not to the code in cleanup()).
If you want to pass $ unmodified to the remote server, you have to escape it with \: rm -f \$JBOSS_SERVER/log/*.log
You never call cleanup()
There is a } missing after return 0 to finish the definition of deployapp()
There may be other problems as well. Run the script with bash -x to see what the shell actually executes. You can also add echo commands in the here document to see what the values of the variables are or you can add set -x before cleanup() to get the same output as with bash -x but from the remote side.
I don't understand why you're using cut to split the arguments to your function. Just do
APP_SPECIFIC_DEPLOY_SCRIPT=$1
BRANCH=$2
JBOSS_HOME=$3
# etc.
If you don't quote your here document delimiter, the contents are expanded before they're sent to the server. That may be what you want. If you don't and you want all expansion to be done on the server side, then quote it like this:
ssh me@$SERVER <<'END'
# etc.
END
If you want a mixture, don't quote the delimiter, but do escape those things that you want delayed expansion for:
ssh me@$SERVER <<END
echo $EXPAND_ME_NOW \$EXPAND_ME_LATER
END
What are the export statements supposed to do? I can't see that they would have any effect at all.
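The difference between a quoted and an unquoted here document delimiter can be tried locally without a server. In this sketch, bash -s stands in for the remote shell that ssh would start; the function name demo_heredoc and the variable NAME are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: bash -s plays the role of the remote shell; names are hypothetical.
demo_heredoc() {
    local NAME="local-value"

    echo "unquoted delimiter:"
    # $NAME is expanded by the local shell, \$NAME by the receiving shell.
    bash -s <<END
NAME="remote-value"
echo "now=$NAME later=\$NAME"
END

    echo "quoted delimiter:"
    # Nothing is expanded locally; the receiving shell sees the text as written.
    bash -s <<'END'
NAME="remote-value"
echo "later=$NAME"
END
}
```

Running demo_heredoc prints now=local-value later=remote-value for the unquoted case and later=remote-value for the quoted one.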
