How to process files concurrently with bash? - bash

Suppose I have 10K files and a bash script which processes a single file. Now I would like to process all these files concurrently, with only K scripts running in parallel. I do not want (obviously) to process any file more than once.
How would you suggest implementing this in bash?

One way of executing a limited number of parallel jobs is with GNU parallel. For example, with this command:
find . -type f -print0 | parallel -0 -P 3 ./myscript {}
You will pass all files in the current directory (and its subdirectories) as arguments to myscript, one at a time. The -0 option sets the delimiter to the null character, and the -P option sets the number of jobs that run in parallel. The default number of parallel processes is equal to the number of cores in the system. There are further options for parallel processing on clusters and so on, which are documented in the GNU parallel manual.

In bash you can easily run part of the script in a different process just by using '(' and ')'. If you add &, the parent process will not wait for the child. So you effectively use ( command1; command2; command3; ... ) &:
while ...; do
    (
        your script goes here, executed in a separate process
    ) &
    CHILD_PID=$!     # no spaces around '=' in bash assignments
done
Also, $! gives you the PID of the child process. What else do you need to know? Once you have k processes launched, you need to wait for them before starting more. This is done using wait <PID>:
wait $CHILD_PID
If you want to wait for all of them, just use wait.
This should be sufficient for you to implement the system.
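For instance, a minimal sketch along those lines (k=4 and ./myscript are placeholders for your own values):
#!/bin/bash
k=4                              # maximum number of concurrent children
pids=()
for f in *; do
    (
        ./myscript "$f"          # runs in a separate process
    ) &
    pids+=($!)                   # remember the child's PID
    if (( ${#pids[@]} >= k )); then
        wait "${pids[@]}"        # wait for the current batch of k children
        pids=()
    fi
done
wait                             # wait for any leftover children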

for f1 in *; do
    (( cnt = cnt + 1 ))
    if [ "$cnt" -le "$k" ]; then
        nohup ./script1 "$f1" &
        continue
    fi
    wait
    cnt=1
    nohup ./script1 "$f1" &    # do not skip the file that triggered the wait
done
wait
Please test it, I don't have time to.

Bash wait command ignoring specified process IDs

DIRECTORIES=( group1 group2 group3 group4 group5 )
PIDS=()
function GetFileSpace() {
    shopt -s nullglob
    TARGETS=(/home/${1}/data/*)
    for ITEM in "${TARGETS[@]}"
    do
        # Here we launch du on a user in the background
        # And then add their process id to PIDS
        du -hs $ITEM >> ./${1}_filespace.txt &
        PIDS+=($!)
    done
}
# Here I launch function GetFileSpace for each group.
for GROUP in "${DIRECTORIES[@]}"
do
    echo $GROUP
    # Store standard error to collect files with bad permissions
    GetFileSpace $GROUP 2>> ./${GROUP}_permission_denied.txt &
done
for PID in "${PIDS[@]}"
do
    wait $PID
done
echo "Formatting Results..."
# The script continues after this, but it isn't relevant.
I am trying to write a script that monitors storage volume and file permissions of individual users across 5 groups.
|_home                  # For additional reference to understand my code,
  |_group1              # directories are laid out like this
  |  |_data
  |  |  |_user1
  |  |  |_user2
  |  |  |_user3
  |
  |_group2
     |_data
        |_user4
        |_user5
First, I use a loop to iteratively launch a function, GetFileSpace, for each group in DIRECTORIES. This function then runs du -sh for each user found within a group.
To speed up this whole process, I launch each instance of GetFileSpace and the subsequent du -sh sub processes in the background with &. This makes it so everything can run pretty much simultaneously, which takes much less time.
My issue is that after I launch these processes I want my script to wait for every background instance of du -sh to finish before moving on to the next step.
To do this, I have tried to collect process IDs after each task is launched within the array PIDS. Then I try to loop through the array and wait for each PID until all sub-processes finish. Unfortunately this doesn't seem to work. The script correctly launches du -sh for each user, but then immediately tries to move on to the next step, breaking.
My question then, is why does my script not wait for my background tasks to finish and how can I implement this behavior?
As a final note, I have tried several other methods to accomplish this from this SO post, but haven't been able to get them working either.
GetFileSpace ... &
You are running the whole function as a subprocess. So the script immediately tries to move on to the next step, and PIDS stays empty in the parent because it is being set inside the subprocess.
Do not run it in the background.
GetFileSpace ... # no & on the end.
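For reference, here is a sketch of the question's script with just that fix applied (the & on the GetFileSpace call is dropped; expansions are also quoted, as the notes below suggest):
DIRECTORIES=( group1 group2 group3 group4 group5 )
PIDS=()
function GetFileSpace() {
    shopt -s nullglob
    TARGETS=("/home/${1}/data"/*)
    for ITEM in "${TARGETS[@]}"; do
        # du still runs in the background, but PIDS is now filled in the parent shell
        du -hs "$ITEM" >> "./${1}_filespace.txt" &
        PIDS+=($!)
    done
}
for GROUP in "${DIRECTORIES[@]}"; do
    echo "$GROUP"
    GetFileSpace "$GROUP" 2>> "./${GROUP}_permission_denied.txt"   # no & here
done
for PID in "${PIDS[@]}"; do
    wait "$PID"
done
echo "Formatting Results..."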
Notes: Consider using xargs or GNU parallel. Prefer lower case for script local variables. Quote variable expansions. Use shellcheck to check for such errors.
work() {
    tmp=$(du -hs "$2")
    echo "$tmp" >> "./${1}_filespace.txt"
}
export -f work
for i in "${directories[@]}"; do
    printf "$i %s\n" /home/"$i"/data/*
done | xargs -n2 -P"$(nproc)" bash -c 'work "$@"' _
Note that when the job is I/O bound, running multiple processes (especially with no upper bound) doesn't really help much if everything is on one disk.

Is there a way to stop scripts that are running simultaneously if one of them sends an echo?

I need to find if a value (actually it's more complex than that) is in one of 20 servers I have. And I need to do it as fast as possible. Right now I am sending the scripts simultaneously to all the servers. My main script is something like this (but with all the servers):
#!/bin/sh
#mainScript.sh
value=$1
c1=`cat serverList | sed -n '1p'`
c2=`cat serverList | sed -n '2p'`
sh find.sh $value $c1 & sh find.sh $value $c2
#!/bin/sh
#find.sh
#some code here .....
if [ $? -eq 0 ]; then
rm $tempfile
else
myValue=`sed -n '/VALUE/p' $tempfile | awk 'BEGIN{FS="="} {print substr($2, 8, length($2)-2)}'`
echo "$myValue"
fi
So the script only returns a response if it finds the value on the server. I would like to know if there is a way to stop executing the other scripts once one of them has already returned a value.
I tried adding an "exit" in the find.sh script, but it won't stop all the scripts. Can somebody please tell me if what I want to do is possible?
I would suggest that you use something that can handle this for you: GNU Parallel. From the linked tutorial:
If you are looking for success instead of failures, you can use success. This will finish as soon as the first job succeeds:
parallel -j2 --halt now,success=1 echo {}\; exit {} ::: 1 2 3 0 4 5 6
Output:
1
2
3
0
parallel: This job succeeded:
echo 0; exit 0
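Applied to your case, and assuming find.sh is adjusted so that it exits 0 only when it actually finds the value (see below) and that serverList has one server per line, that could look like:
parallel -j0 --halt now,success=1 sh find.sh "$value" {} :::: serverList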
I suggest you start by modifying your find.sh so that its return code depends on its success; that will let us identify a successful call more easily. For instance:
myValue=`sed -n '/VALUE/p' $tempfile | awk 'BEGIN{FS="="} {print substr($2, 8, length($2)-2)}'`
[ -n "$myValue" ]     # succeed only if a value was actually extracted
success=$?
echo "$myValue"
exit $success
To terminate all the find.sh processes spawned by your script, you can use pkill with a parent process ID criterion and a command name criterion:
pkill -P $$ find.sh # $$ refers to the current process' PID
Note that this requires that you start the find.sh script directly rather than passing it as a parameter to sh. Normally that shouldn't be a problem, but if you have a good reason to call sh rather than your script, you can replace find.sh in the pkill command by sh (assuming you're not spawning other scripts you wouldn't want to kill).
Now that find.sh exits with success only when it finds the expected string, you can plug the two actions with && and run the whole thing in the background:
{ find.sh $value $c1 && pkill -P $$ find.sh; } &
The first occurrence of find.sh that terminates with success will invoke the pkill command that will terminate all others (those killed processes will have non-zero exit codes and therefore won't run their associated pkill).

bash: limiting subshells in a for loop with file list

I've been trying to get a for loop to run a bunch of commands more or less simultaneously and was attempting to do it via subshells. I've managed to cobble together the script below to test, and it seems to work OK.
#!/bin/bash
for i in {1..255}; do
(
#commands
)&
done
wait
The only problem is that my actual loop is going to be for i in files*, and then it just crashes, I assume because it's started too many subshells to handle. So I added
#!/bin/bash
for i in files*; do
(
#commands
)&
if (( $i % 10 == 0 )); then wait; fi
done
wait
which now fails. Does anyone know a way around this? Either using a different command to limit the number of subshells, or providing a number for $i?
Cheers
xargs/parallel
Another solution would be to use tools designed for concurrency:
printf '%s\0' files* | xargs -0 -P6 -n1 yourScript
The -P6 is the maximum number of concurrent processes that xargs will launch. Make it 10 if you like.
I suggest xargs because it is likely already on your system. If you want a really robust solution, look at GNU Parallel.
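For comparison, a GNU Parallel version of the same idea could look like this (same job limit of 6; parallel must be installed):
printf '%s\0' files* | parallel -0 -P6 yourScript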
Filenames in array
For another approach specific to your question: use the counter as the array index:
files=( files* )
for i in "${!files[@]}"; do
    commands "${files[i]}" &
    (( i % 10 )) || wait
done
(The parentheses around the compound command aren't important because backgrounding the job will have the same effects as using a subshell anyway.)
Function
Just different semantics:
simultaneous() {
    while [[ $1 ]]; do
        for i in {1..10}; do
            [[ ${@:i:1} ]] || break
            commands "${@:i:1}" &
        done
        shift 10 || shift "$#"
        wait
    done
}
simultaneous files*
You may find it useful to count the number of jobs with jobs, e.g.:
wc -w <<<$(jobs -p)
So, your code would look like this:
#!/bin/bash
for i in files*; do
(
#commands
)&
if (( $(wc -w <<<$(jobs -p)) % 10 == 0 )); then wait; fi
done
wait
As @chepner suggested:
In bash 4.3, you can use wait -n to proceed as soon as any job completes, rather than waiting for all of them
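For example, a sketch of the loop above using that (the job-count test is one way to decide when to call wait -n; requires bash >= 4.3, and commands stands for your own command, as in the answers above):
#!/bin/bash
max=10
for i in files*; do
    commands "$i" &       # your commands here
    # once max jobs are running, wait for any one of them to finish
    while (( $(jobs -pr | wc -l) >= max )); do
        wait -n
    done
done
wait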
Define the counter explicitly
#!/bin/bash
for f in files*; do
(
#commands
)&
(( i++ % 10 == 0 )) && wait
done
wait
There's no need to initialize i, as it will default to 0 the first time you use it. There's also no need to reset the value, as i % 10 will be 0 for i=10, 20, 30, etc.
If you have Bash≥4.3, you can use wait -n:
#!/bin/bash
max_nb_jobs=10
for i in file*; do
    # Wait until there are less than max_nb_jobs jobs running
    while mapfile -t < <(jobs -pr) && (( ${#MAPFILE[@]} >= max_nb_jobs )); do
        wait -n
    done
    {
        # Your commands here: no useless subshells! use grouping instead
    } &
done
wait
If you don't have wait -n available, you can use something like this:
#!/bin/bash
set -m
max_nb_jobs=10
sleep_jobs() {
    # This function sleeps until there are less than $1 jobs running
    local n=$1
    while mapfile -t < <(jobs -pr) && (( ${#MAPFILE[@]} >= n )); do
        coproc read
        trap "echo >&${COPROC[1]}; trap '' SIGCHLD" SIGCHLD
        [[ $COPROC_PID ]] && wait $COPROC_PID
    done
}
for i in files*; do
    # Wait until there are less than 10 jobs running
    sleep_jobs "$max_nb_jobs"
    {
        # Your commands here: no useless subshells! use grouping instead
    } &
done
wait
The advantage of proceeding like this is that we make no assumptions about the time the jobs take to finish. A new job is launched as soon as there is room for it. Moreover, it's all pure Bash, so it doesn't rely on external tools and (maybe more importantly) you may use your Bash environment (variables, functions, etc.) without exporting them (arrays can't easily be exported, so that can be a huge pro).

Running bash script in parallel

I have a very simple command that I would like to execute in parallel rather than sequential.
for i in ../data/*; do ./run.sh $i; done
run.sh processes the input files from the ../data directory, and I would like to run all of these at the same time using a shell script rather than a Python program or something like that. Is there a way to do this using GNU Parallel?
You can try this:
shopt -s nullglob
FILES=(../data/*)
[[ ${#FILES[@]} -gt 0 ]] && printf '%s\0' "${FILES[@]}" | parallel -0 --jobs 2 ./run.sh
I have not used GNU Parallel but you can use & to run your script in the background. Add a wait (optional) later if you want to wait for all the scripts to finish.
for i in ../data/*; do ./run.sh $i & done
# Below wait command is optional
wait
echo "All scripts executed"
You can try this:
find ../data -maxdepth 1 -name '[^.]*' -print0 | parallel -0 --jobs 2 ./run.sh
The name argument of the find command is needed because you used shell globbing ../data/* in your example and so we need to ignore files starting with a dot.

How do I terminate all the subshell processes?

I have a bash script to test how a server performs under load.
num=1
if [ $# -gt 0 ]; then
num=$1
fi
for i in {1 .. $num}; do
(while true; do
{ time curl --silent 'http://localhost'; } 2>&1 | grep real
done) &
done
wait
When I hit Ctrl-C, the main process exits, but the background loops keep running. How do I make them all exit? Or is there a better way of spawning a configurable number of logic loops executing in parallel?
Here's a simpler solution -- just add the following line at the top of your script:
trap "kill 0" SIGINT
Killing 0 sends the signal to all processes in the current process group.
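Applied to the script in the question, a sketch (the {1 .. $num} loop is replaced with an arithmetic for loop, which is discussed further down):
#!/bin/bash
trap "kill 0" SIGINT        # on Ctrl-C, signal every process in our process group

num=${1:-1}
for (( i = 0; i < num; i++ )); do
    (while true; do
        { time curl --silent 'http://localhost'; } 2>&1 | grep real
    done) &
done
wait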
One way to kill subshells, but not self:
kill $(jobs -p)
Bit of a late answer, but for me solutions like kill 0 or kill $(jobs -p) go too far (kill all child processes).
If you just want to make sure one specific child-process (and its own children) are tidied up then a better solution is to kill by process group (PGID) using the sub-process' PID, like so:
set -m
./some_child_script.sh &
some_pid=$!
kill -- -${some_pid}
Firstly, the set -m command enables job control (if it isn't enabled already). This is important, as otherwise all commands, sub-shells etc. will be assigned to the same process group as your parent script (unlike when you run the commands manually in a terminal), and kill will just give a "no such process" error. This needs to be called before you run the background command you wish to manage as a group (or just call it at script start if you have several).
Secondly, note that the argument to kill is negative, this indicates that you want to kill an entire process group. By default the process group ID is the same as the first command in the group, so we can get it by simply adding a minus sign in front of the PID we fetched with $!. If you need to get the process group ID in a more complex case, you will need to use ps -o pgid= ${some_pid}, then add the minus sign to that.
Lastly, note the use of the explicit end of options --, this is important, as otherwise the process group argument will be treated as an option (signal number), and kill will complain it doesn't have enough arguments. You only need this if the process group argument is the first one you wish to terminate.
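For the more complex case mentioned above, a sketch of looking up the PGID with ps and killing that group (the tr strips the padding spaces ps adds to the field):
pgid=$(ps -o pgid= "${some_pid}" | tr -d ' ')
kill -- "-${pgid}"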
Here is a simplified example of a background timeout process, and how to cleanup as much as possible:
#!/bin/bash
# Use the overkill method in case we're terminated ourselves
trap 'kill $(jobs -p | xargs)' SIGINT SIGHUP SIGTERM EXIT
# Setup a simple timeout command (an echo)
set -m
{ sleep 3600; echo "Operation took longer than an hour"; } &
timeout_pid=$!
# Run our actual operation here
do_something
# Cancel our timeout
kill -- -${timeout_pid} >/dev/null 2>&1
wait -- -${timeout_pid} >/dev/null 2>&1
printf '' 2>&1
This should cleanly handle cancelling this simplistic timeout in all reasonable cases; the only case that can't be handled is the script being terminated immediately (kill -9), as it won't get a chance to cleanup.
I've also added a wait, followed by a no-op (printf ''), this is to suppress "terminated" messages that can be caused by the kill command, it's a bit of a hack, but is reliable enough in my experience.
You need to use job control, which, unfortunately, is a bit complicated. If these are the only background jobs that you expect will be running, you can run a command like this one:
jobs \
| perl -ne 'print "$1\n" if m/^\[(\d+)\][+-]? +Running/;' \
| while read -r ; do kill %"$REPLY" ; done
jobs prints a list of all active jobs (running jobs, plus recently finished or terminated jobs), in a format like this:
[1] Running sleep 10 &
[2] Running sleep 10 &
[3] Running sleep 10 &
[4] Running sleep 10 &
[5] Running sleep 10 &
[6] Running sleep 10 &
[7] Running sleep 10 &
[8] Running sleep 10 &
[9]- Running sleep 10 &
[10]+ Running sleep 10 &
(Those are jobs that I launched by running for i in {1..10} ; do sleep 10 & done.)
perl -ne ... is me using Perl to extract the job numbers of the running jobs; you can obviously use a different tool if you prefer. You may need to modify this if your jobs output has a different format; but the above output is what jobs prints on Cygwin, and it's very likely identical to yours.
read -r reads a "raw" line from standard input, and saves it into the variable $REPLY. kill %"$REPLY" will be something like kill %1, which "kills" (sends an interrupt signal to) job number 1. (Not to be confused with kill 1, which would kill process number 1.) Together, while read -r ; do kill %"$REPLY" ; done goes through each job number printed by the Perl script, and kills it.
By the way, your for i in {1 .. $num} won't do what you expect, since brace expansion is handled before parameter expansion, so what you have is equivalent to for i in "{1" .. "$num}". (And you can't have whitespace inside a brace expansion, anyway.) Unfortunately, I don't know of a clean alternative; I think you have to do something like for i in $(bash -c "echo {1..$num}"), or else switch to an arithmetic for-loop or whatnot.
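For instance, with num=3 (a quick sketch to illustrate the expansion order and the workarounds):
num=3
echo {1..$num}                      # prints "{1..3}" -- brace expansion ran before $num was expanded
echo $(bash -c "echo {1..$num}")    # prints "1 2 3"
for (( i = 1; i <= num; i++ )); do echo "$i"; done    # arithmetic loop, prints 1 2 3 on separate lines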
Also by the way, you don't need to wrap your while-loop in parentheses; & already causes the job to be run in a subshell.
Here's my eventual solution. I'm keeping track of the subshell process IDs using an array variable, and trapping the Ctrl-C signal to kill them.
declare -a subs #array of subshell pids
function kill_subs() {
for pid in ${subs[#]}; do
kill $pid
done
exit 0
}
num=1 if [ $# -gt 0 ]; then
num=$1 fi
for ((i=0;i < $num; i++)); do
while true; do
{ time curl --silent 'http://localhost'; } 2>&1 | grep real
done &
subs[$i]=$! #grab the pid of the subshell
done
trap kill_subs 1 2 15
wait
While this is not an answer, I would just like to point out something which invalidates the selected one: using jobs or kill 0 might have unexpected results; in my case it killed unintended processes, which is not an option for me.
It has been highlighted in some of the answers, but I am afraid not with enough stress, or it has not been considered:
"Bit of a late answer, but for me solutions like kill 0 or kill $(jobs -p) go too far (kill all child processes)."
"If these are the only background jobs that you expect will be running, you can run a command like this one:"
