I am experienced in Bash, and I have a set of variables stored in an array which I want to pass to a shell script that I want to run simultaneously.
Right now I have something like this dummy code working:
array=(1 2 3 4)
for i in "${array[@]}"
do
    if [ condition ]; then
        call script1
    else
        call script2
    fi
done
But instead of going through the elements of the array one by one, I want to run everything inside the loop concurrently, once for each element. How would I do that? I know how to call scripts concurrently using &, but I am not sure how to handle the if conditions.
Have you tried using GNU Parallel? I think it's exactly what you're looking for. I'm no expert with parallel, but I know you can easily pipe a list of newline separated commands into it:
# An array of commands to run in parallel
array=(command1 command2 command3 command4)
# Pipe the array of commands to parallel as a list
(IFS=$'\n'; echo "${array[*]}") | parallel -j4
The -j flag lets you select how many parallel jobs to run. Parallel works through the list, executing each line with bash -c until all lines have been executed.
I often use a for loop to build a list of commands and pipe the output directly into parallel. For example:
## Multicore Parallel FizzBuzz Engine
for((i=1;i<=100;++i));do
echo 'i='$i';p="";((i%3==0))&&p=Fizz;((i%5==0))&&p+=Buzz;[[ ! $p ]]&&p=$i;echo $p;';
done | parallel -kj4
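To connect this back to the original question's if/else: one option in the same spirit is to let the loop decide which script to run and emit the corresponding command line, then hand the whole list to parallel. A minimal sketch, where the test and the script paths are placeholders standing in for whatever the real condition and scripts are:
array=(1 2 3 4)
for i in "${array[@]}"; do
    # The test is a stand-in for the real condition.
    if [ "$i" -gt 2 ]; then
        echo "./script1 $i"
    else
        echo "./script2 $i"
    fi
done | parallel -j4
Alternatively, staying with plain &, you could wrap the if/else in a function, background each call with &, and finish with wait.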
I was writing a question, but finally came up with a solution. As it might be useful for others (my future self, at least), here it is.
Context
To run a single command in parallel in several detached screens that automatically close themselves, this works nicely:
timeslots='00_XX 01_XX 02_XX 03_XX 04_XX 05_XX 06_XX'
for timeslot in $timeslots;
do
screen -dmS $timeslot bash -c "echo '$timeslot' >> DUMP";
done
But what if, for each timeslot, we want to execute in screen not one but several (RAM-heavy) commands, one after the other?
In our bash script, we can write a function (inside which everything runs sequentially) that takes an argument.
test_function () {
# Commands to be executed sequentially, one at a time:
echo $1 >> DUMP; # technically we'd put heavy things that shouldn't be executed in parallel
echo $1 $1 >> DUMP; # these are just dummy MWE commands
# ETC
}
But how do we create detached screens that run this function with the $timeslot argument?
There are lots of discussions on Stack Overflow about running a distinct executable script file, or about using screen's stuff command to inject keystrokes, but that's not what I want to do. The idea here is to avoid unnecessary files and keep everything in the same small bash script, simple and clean.
Function definition (in script.sh)
test_function () {
# Commands to be executed sequentially, one at a time:
echo $1 >> DUMP; # technically we'd put heavy things that shouldn't be executed in parallel
echo $1 $1 >> DUMP; # these are just dummy MWE commands
# ETC
}
export -f test_function # < absolutely crucial bit to enable using this with screen
Usage (further down in script.sh)
Now we can do
timeslots='00_XX 01_XX 02_XX 03_XX 04_XX 05_XX 06_XX'
for timeslot in $timeslots;
do
screen -dmS $timeslot bash -c "test_function $timeslot";
done
And it works.
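As a quick sanity check (a minimal sketch, assuming the session names and the DUMP file from above), you can watch the detached sessions and then inspect the result once they are gone:
screen -ls                                           # lists the detached 00_XX ... 06_XX sessions while they run
while screen -ls | grep -q '_XX'; do sleep 1; done   # crude wait-for-completion
cat DUMP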
Somehow I can't find a sufficient answer to my problem, only partial hackarounds.
I'm calling a single "chained" shell command (from a Node app) that starts a long-running update process, whose stdout/stderr should be handed over, as arguments, to the second part of the shell command (another Node app that logs to a DB).
I'd like to do something like this:
updateCommand 2>$err 1>$out ; logDBCommand --log arg err out
I can't use > on its own, as it only redirects to files or file descriptors.
Also, if I use shell variables (like error=$( { updateCommand | sed 's/Output/tmp/'; } 2>&1 ); logDBCommand --log arg \"${error}.\"), I can only capture stdout, or both streams mixed into one argument.
And I don't want to pipe, as the second command (logDBCommand) should run whether the first one succeeded or failed.
And I don't want to cache to a file on disk, because honestly that misses the point and introduces another asynchronous error vector.
After a little chat in #!/bin/bash, someone suggested simply making use of tmpfs (a file system held in RAM), which is the second most elegant (but the only feasible) way to do this. That way I can use the > operator and still end up with stdout and stderr in separate variables in memory.
command1 >/dev/shm/c1stdout 2>/dev/shm/c1stderr
A=$(cat /dev/shm/c1stdout)
B=$(cat /dev/shm/c1stderr)
command2 "$A" "$B"
(or, shorter):
A=$(command1 2>/dev/shm/c1stderr)
B=$(cat /dev/shm/c1stderr)
command2 "$A" "$B"
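Applied to the commands from the question (updateCommand and logDBCommand as named above; the exact arguments to --log are placeholders), a minimal sketch looks like this:
updateCommand >/dev/shm/upd_out 2>/dev/shm/upd_err   # always falls through to the next line, success or failure
out=$(cat /dev/shm/upd_out)
err=$(cat /dev/shm/upd_err)
logDBCommand --log arg "$err" "$out"
rm -f /dev/shm/upd_out /dev/shm/upd_err              # tidy up the tmpfs files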
I have a shell script which usually takes nearly 10 minutes for a single run, but I need to know: if another request to run the script comes in while an instance is already running, does the new request have to wait for the existing instance to complete, or will a new instance be started?
I need a new instance to be started whenever a new request for the same script arrives.
How can I do that?
The shell script is a polling script which looks for a file in a directory and executes it. The execution takes nearly 10 minutes or more, but if a new file arrives during execution, it also has to be executed simultaneously.
The shell script is below; how can I modify it to execute multiple requests?
#!/bin/bash
while [ 1 ]; do
newfiles=`find /afs/rch/usr8/fsptools/WWW/cgi-bin/upload/ -newer /afs/rch/usr$
touch /afs/rch/usr8/fsptools/WWW/cgi-bin/upload/.my_marker
if [ -n "$newfiles" ]; then
echo "found files $newfiles"
name2=`ls /afs/rch/usr8/fsptools/WWW/cgi-bin/upload/ -Art |tail -n 2 |head $
echo " $name2 "
mkdir -p -m 0755 /afs/rch/usr8/fsptools/WWW/dumpspace/$name2
name1="/afs/rch/usr8/fsptools/WWW/dumpspace/fipsdumputils/fipsdumputil -e -$
$name1
touch /afs/rch/usr8/fsptools/WWW/dumpspace/tempfiles/$name2
fi
sleep 5
done
When writing scripts like the one you describe, I take one of two approaches.
First, you can use a pid file to indicate that a second copy should not run. For example:
#!/bin/sh
pidfile=/var/run/${0##*/}.pid
# remove pid if we exit normally or are terminated
trap "rm -f $pidfile" 0 1 3 15
# Write the pid as a symlink
if ! ln -s "pid=$$" "$pidfile"; then
echo "Already running. Exiting." >&2
exit 0
fi
# Do your stuff
I like using symlinks to store the pid because creating a symlink is an atomic operation; two processes can't conflict with each other. You don't even need to check for the existence of the pid symlink, because a failure of ln clearly indicates that a pid cannot be set. That's either a permission or path problem, or it's due to the symlink already being there.
The second option is to make it possible, nay preferable, not to block additional instances, and instead configure whatever it is that this script does to permit multiple servers to run at the same time on different queue entries. "Single-queue-single-server" is never as good as "single-queue-multi-server". Since you haven't included code in your question, I have no way to know whether this approach would be useful for you, but here's some explanatory meta bash:
#!/usr/bin/env bash
workdir=/var/tmp   # Set a better $workdir than this.
a=( $(get_list_of_queue_ids) )   # A command? A function? Up to you.
for qid in "${a[@]}"; do
    # Set a "lock" for this item .. or don't, and move on.
    if ! ln -s "pid=$$" "$workdir/$qid.working"; then
        continue
    fi
    # Do your stuff with just this $qid.
    ...
    # And finally, clean up after ourselves.
    remove_qid_from_queue "$qid"
    rm "$workdir/$qid.working"
done
The effect of this is to transfer the idea of "one at a time" from the handler to the data. If you have a multi-CPU system, you probably have enough capacity to handle multiple queue entries at the same time.
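Applied to the polling script from the question, the same per-item lock idea might look roughly like this sketch; process_one_file is a placeholder for the real 10-minute job, and error handling is omitted:
#!/usr/bin/env bash
uploaddir=/afs/rch/usr8/fsptools/WWW/cgi-bin/upload   # from the question
workdir=/var/tmp

while true; do
    newfiles=$(find "$uploaddir" -newer "$uploaddir/.my_marker" -type f)
    touch "$uploaddir/.my_marker"
    for f in $newfiles; do
        qid=${f##*/}
        # One lock per file: whichever instance creates the symlink first handles it.
        ln -s "pid=$$" "$workdir/$qid.working" 2>/dev/null || continue
        (
            process_one_file "$f"             # placeholder for the real long-running work
            rm -f "$workdir/$qid.working"
        ) &
    done
    sleep 5
done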
ghoti's answer shows some helpful techniques, if modifying the script is an option.
Generally speaking, for an existing script:
Unless you know with certainty that either:
- the script has no side effects other than writing to the terminal or to files with shell-instance-specific names (such as names incorporating $$, the current shell's PID) or some other instance-specific location,
- or the script was explicitly designed for parallel execution,
I would assume that you cannot safely run multiple copies of the script simultaneously.
It is not reasonable to expect the average shell script to be designed for concurrent use.
From the viewpoint of the operating system, several processes may of course execute the same program in parallel. No need to worry about this.
However, it is conceivable that a (careless) programmer wrote the program in such a way that it produces incorrect results when two copies are executed in parallel.
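As a hedged illustration of the kind of side effect that breaks under concurrency (the filenames here are made up), compare a fixed scratch file with an instance-specific one:
# Unsafe under concurrency: every instance writes to the same temp file.
echo "partial results" > /tmp/myscript.tmp

# Instance-specific: $$ (the shell's PID) keeps each copy's file separate,
# which is the kind of design described above.
echo "partial results" > /tmp/myscript.$$.tmp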
I have written a script to count how many records have been inserted into 3 individual HBase tables every 2 hours. I'm aware it's shoddy, but it works well and I retrieve the desired results. However, I am having to invoke the HBase shell on every pass through the loop.
Is there a way to improve my code so that I don't have to do this, to speed things up?
#!/bin/bash
declare -a hbaseTables=("table1" "table2" "table3");
for i in "${hbaseTables[#]}"
do
echo $i >> results.txt
time=1431925200000
for ((x=0; x<2; x=x+1))
do
hbase shell <<EOF | tail -2 | grep -oE "^[0-9]+" >> results.txt
scan '$i', {TIMERANGE => [$time,$time+7199999]}
EOF
time=$((time+7200000))
done
echo ----- >> results.txt
done
HBase shell is written in Ruby so you have full access to any Ruby commands.
So for example if I wanted to drop all the tables in a cluster that do not start with the string dev01 I could do this:
$ echo 'a=list; a.delete_if{ |t| t=~/dev01.*/}; \
a.each{ |t| disable t; drop t}; quit;' | hbase shell
The above makes a copy of the list array into a. It then deletes in the copied array, a, all the elements that start with dev01, and then it loops through the remaining elements in a and runs the HBase shell command disable X followed by drop X.
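For the question above, the same trick can batch every scan into a single hbase shell invocation. A hedged sketch (untested; the "N row(s) in ..." output format may vary between HBase versions, so the post-processing step is only indicative):
declare -a hbaseTables=("table1" "table2" "table3")

{
  for i in "${hbaseTables[@]}"; do
    time=1431925200000
    for ((x=0; x<2; x++)); do
      echo "scan '$i', {TIMERANGE => [$time,$((time+7199999))]}"
      time=$((time+7200000))
    done
  done
  echo "quit"
} | hbase shell > raw_results.txt

# Pull the per-scan row counts out of the combined output afterwards.
grep ' row(s) in ' raw_results.txt >> results.txt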
While working in telecom I often needed to interact with various CLI tools that had no API. For almost all cases like that, expect was a perfect tool. It works in the style 'expect prompt', then 'write command', then 'collect output'. For advanced scripting, it can be combined with the Tcl language.
It once allowed me to control a distributed setup with several routers reachable only over SSH. So it is definitely an approach you can use; the question is whether it is more power than you need here.
Another alternative is to prepare a script for the HBase shell in an external file and then execute it, processing the output. That is probably the best balance of effort and result.
Let's say I have a bash function
Yadda() {
# time-consuming processes that must take place sequentially
# the result will be appended >> $OUTFILE
# $OUTFILE is set by the main body of the script
# No manipulation of variables in the main body
# Only locally defined (local) variables are manipulated
}
Am I allowed to invoke the function as a background job in a subshell? E.g.:
OUTFILE=~/result
for PARM in $PARAMLIST; do
( Yadda $PARM ) &
done
wait
cat $OUTFILE
What do you think?
You can invoke the function as a background job in a subshell. It will work just like you typed in your example.
I see one problem in the way you demonstrated it in your example. If some of the processes finish simultaneously, they will try to write to $OUTFILE at the same time and the output might get mixed up.
I suggest letting each process write to its own file, then collecting the files after all processes are done.
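A minimal sketch of that suggestion, assuming Yadda appends to whatever $OUTFILE happens to be set to in its environment:
OUTFILE=~/result
tmpdir=$(mktemp -d)

for PARM in $PARAMLIST; do
    ( OUTFILE="$tmpdir/$PARM.out"; Yadda "$PARM" ) &   # each job gets its own output file
done
wait

cat "$tmpdir"/*.out >> "$OUTFILE"   # collect the pieces once everything has finished
rm -r "$tmpdir"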