GNU parallel processing - shell

I have the following script that I want to run using GNU Parallel; it is a for loop that needs to be run n times. How can I do this using GNU Parallel?
SHARK=tshark
# Create file list
FILELIST=`ls $1`
TEMPDIR=/tmp/foobar
mkdir $TEMPDIR
i=1
for I in $FILELIST; do
echo "$i $I $2"
$SHARK -r $I -w $TEMPDIR/~$I-$i -R "$2" &>/dev/null
i=`echo $i+1|bc`
done

There are a number of ways of doing this, either with sub-shells and sub-processes, see e.g.
Running shell script in parallel
or by installing neat utilities designed to do this, e.g.:
|P|P|S|S| - (Distributed) Parallel Processing Shell Script
GNU Parallel
I would try to get it done first with sub-shells, and then try the others if you still need more power.
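Applied to the tshark loop above, a minimal GNU Parallel sketch could look like this (assuming $1 is the directory holding the capture files and $2 is the display filter, as in the original script):
SHARK=tshark
TEMPDIR=/tmp/foobar
mkdir -p "$TEMPDIR"
# {} is the input file, {/} its basename and {#} the job sequence number,
# which replaces the hand-rolled $i counter. One job runs per CPU core by default.
parallel "$SHARK -r {} -w $TEMPDIR/~{/}-{#} -R '$2' >/dev/null 2>&1" ::: "$1"/*
Add -j N to the parallel invocation to cap the number of simultaneous tshark processes.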

Related

BASH - transfer large files and process after transfer limiting the number of processes

I have several large files that I need to transfer to a local machine and process. The transfer takes about as long as the processing of each file, and I would like to start processing a file immediately after it transfers. But the processing could take longer than the transfer, and I don't want the processes to keep building up, so I would like to limit them to some number, say 4.
Consider the following:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
scp user@host:$FILE ./
myCommand $FILE &
done
This will transfer each file and start processing it after the transfer while allowing the next file to start transferring. However, if myCommand $FILE takes much longer than the time to transfer one file, these processes could keep piling up and bog down the local machine. So I would like to limit myCommand to maybe 2-4 parallel instances; further invocations of myCommand should wait until a "slot" opens up. Is there a good way to do this in BASH (using xargs or other utilities is acceptable)?
UPDATE:
Thanks for the help in getting this far. Now I'm trying to implement the following logic:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
echo "Starting on $FILE" # should go to terminal output
scp user@host:$FILE ./
echo "Processing $FILE" # should go to terminal output
echo $FILE # should go through pipe to parallel
done | parallel myCommand
You can use GNU Parallel for that. Just echo the commands you want run into parallel and it will run one job per CPU core your machine has.
for f in ... ; do
scp ...
echo ./process "$f"
done | parallel
If you specifically want 4 processes at a time, use parallel -j 4.
If you want a progress bar, use parallel --bar.
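For example, keeping the same loop as above, both options can be combined:
for f in ... ; do
scp ...
echo ./process "$f"
done | parallel -j 4 --bar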
Alternatively, echo just the filename with null-termination, and add the processing command into the invocation of parallel:
for f in ... ; do
scp ...
printf "%s\0" "$f"
done | parallel -0 -j4 ./process

How to create "workers" using bash?

I'd like to run some different scripts simultaneously using bash.
All of them say something to a screen session.
What we have:
worker=1
while [[ ! -f "worker$worker.sh" ]]; do
if [[ ! -f "worker$worker.sh" ]]; then
cat >worker$worker.sh <<EOL
#some code with variables which change and say something to an screen session#
EOL
chmod a+x worker$worker.sh
./worker$worker.sh
break
else
(( worker ++ ))
continue
fi
done
The current code does not work :/ What's wrong?
tmux is an alternative to screen.
GNU Parallel has an interface to tmux, so try this:
parallel --fg --delay 0.1 --tmuxpane ::: worker*.sh
parallel --fg --delay 0.1 --tmux ::: worker*.sh
If you do not need the tmux interface:
parallel ::: worker*.sh
Start by watching the intro videos for a quick introduction:
http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Then look at the EXAMPLEs after the list of OPTIONS (Use LESS=+/EXAMPLE: man parallel). That will give you an idea of what GNU parallel is capable of.
Then spend an hour walking through the tutorial (man parallel_tutorial). Your command line will love you for it.

shell script, for loop, does loop wait for execution of the command to iterate

I have a shell script with a for loop. Does the loop wait for the command in its body to finish before iterating?
Thanks in advance.
Here is my code. Will the commands execute sequentially or in parallel?
for m in "${mode[@]}"
do
cmd="exec $perlExecutablePath $perlScriptFilePath --owner $j -rel $i -m $m"
$cmd
eval "$cmd"
done
Assuming that you haven't background-ed the command, then yes.
For example:
for i in {1..10}; do cmd; done
waits for cmd to complete before continuing the loop, whereas:
for i in {1..10}; do cmd & done
doesn't.
If you want to run your commands in parallel, I would suggest changing your loop to something like this:
for m in "${mode[@]}"
do
"$perlExecutablePath" "$perlScriptFilePath" --owner "$j" -rel "$i" -m "$m" &
done
This runs each command in the background, so it doesn't wait for one command to finish before the next one starts.
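If the script must not continue until every one of those background jobs has exited, a wait after the loop blocks until they have all finished:
for m in "${mode[@]}"
do
"$perlExecutablePath" "$perlScriptFilePath" --owner "$j" -rel "$i" -m "$m" &
done
wait # returns once all background jobs have finished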
An alternative would be to look at GNU Parallel, which is designed for this purpose.
Using GNU Parallel it looks like this:
parallel $perlExecutablePath $perlScriptFilePath --owner $j -rel $i -m {} ::: "${mode[@]}"
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
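As a minimal sketch (the job script name is hypothetical), that scenario could look like this:
# Run 32 jobs, never more than 4 at a time; a new job starts
# as soon as a running one finishes.
parallel -j 4 ./myjob.sh ::: {1..32}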
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Running bash script in parallel

I have a very simple command that I would like to execute in parallel rather than sequential.
for i in ../data/*; do ./run.sh $i; done
run.sh processes the input files from the ../data directory and I would like to perform this process all at the same time using a shell script rather than a Python program or something like that. Is there a way to do this using GNU Parallel?
You can try this:
shopt -s nullglob
FILES=(../data/*)
[[ ${#FILES[@]} -gt 0 ]] && printf '%s\0' "${FILES[@]}" | parallel -0 --jobs 2 ./run.sh
I have not used GNU Parallel but you can use & to run your script in the background. Add a wait (optional) later if you want to wait for all the scripts to finish.
for i in ../data/*; do ./run.sh $i & done
# Below wait command is optional
wait
echo "All scripts executed"
You can try this:
find ../data -maxdepth 1 -name '[^.]*' -print0 | parallel -0 --jobs 2 ./run.sh
The name argument of the find command is needed because you used shell globbing ../data/* in your example and so we need to ignore files starting with a dot.

Process Scheduling

Let's say, I have 10 scripts that I want to run regularly as cron jobs. However, I don't want all of them to run at the same time. I want only 2 of them running simultaneously.
One solution I'm thinking of is to create two scripts, put 5 statements in each, and add them as separate entries in the crontab. However, that solution seems very ad hoc.
Is there an existing Unix tool to perform the task I mentioned above?
The jobs builtin can tell you how many child processes are running. Some simple shell scripting can accomplish this task:
MAX_JOBS=2
launch_when_not_busy()
{
while [ $(jobs | wc -l) -ge $MAX_JOBS ]
do
# at least $MAX_JOBS are still running.
sleep 1
done
"$#" &
}
launch_when_not_busy bash job1.sh --args
launch_when_not_busy bash jobTwo.sh
launch_when_not_busy bash job_three.sh
...
wait
NOTE: As pointed out by mobrule, my original answer will not work because the wait builtin with no arguments waits for ALL children to finish. Hence the following 'parallelexec' script, which avoids polling at the cost of more child processes:
#!/bin/bash
N="$1"
I=0
{
if [[ "$#" -le 1 ]]; then
cat
else
while [[ "$#" -gt 1 ]]; do
echo "$2"
set -- "$1" "${#:3}"
done
fi
} | {
d=$(mktemp -d /tmp/fifo.XXXXXXXX)
mkfifo "$d"/fifo
exec 3<>"$d"/fifo
rm -rf "$d"
while [[ "$I" -lt "$N" ]] && read C; do
($C; echo >&3) &
let I++
done
while read C; do
read -u 3
($C; echo >&3) &
done
}
The first argument is the number of parallel jobs. If there are further arguments, each one is run as a job; otherwise the commands to run are read from stdin, one per line.
I use a named pipe (which is sent to oblivion as soon as the shell opens it) as a synchronization method. Since only single bytes are written, there are no race condition issues that could complicate things.
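Based on the argument handling above, the script can be invoked either with the commands as arguments or fed on stdin; for example (the job scripts are hypothetical):
# Run at most 2 jobs at a time, commands passed as arguments
./parallelexec 2 "bash job1.sh" "bash job2.sh" "bash job3.sh"
# Or read the commands from stdin, one per line
printf '%s\n' "bash job1.sh" "bash job2.sh" "bash job3.sh" | ./parallelexec 2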
GNU Parallel is designed for this kind of task:
sem -j2 do_stuff
sem -j2 do_other_stuff
sem -j2 do_third_stuff
do_third_stuff will only be run when either do_stuff or do_other_stuff has finished.
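If the calling script should block until all of the queued commands have finished, sem also has a --wait option; a minimal sketch:
sem --wait # wait for all jobs started with sem to complete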
Watch the intro videos to learn more:
http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Resources