I have a build script that runs very slowly, especially on Solaris. I want to improve its performance by running it as multiple parallel jobs. How can I do that?
Try GNU Parallel; it is quite easy to use:
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
If you use xargs and tee today you will find GNU parallel very easy to use as GNU parallel is written to have the same options as xargs. If you write loops in shell, you will find GNU parallel may be able to replace most of the loops and make them run faster by running several jobs in parallel.
GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially. This makes it possible to use output from GNU parallel as input for other programs.
For each line of input GNU parallel will execute command with the line as arguments. If no command is given, the line of input is executed. Several lines will be run in parallel. GNU parallel can often be used as a substitute for xargs or cat | bash.
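For example, if part of your build compiles many independent source files, a minimal sketch (assuming a C build with gcc and that GNU Parallel is installed) could look like this:
# compile every .c file in parallel, one job per CPU thread by default
parallel gcc -c -o {.}.o {} ::: *.c
Here {} is the input argument and {.} is the argument with its extension removed; both are standard GNU Parallel replacement strings.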
You mentioned that it is a build script. If you are using the command-line utility make, you can parallelize builds using make's -j<N> option:
GNU make knows how to execute several recipes at once. Normally, make will execute only one recipe at a time, waiting for it to finish before executing the next. However, the ‘-j’ or ‘--jobs’ option tells make to execute many recipes simultaneously.
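For example (assuming GNU make and that your makefile's rules are safe to run in parallel):
# run up to 4 recipes at once
make -j4
# or size the job count to the machine; nproc is a GNU coreutils tool,
# so on Solaris you may need psrinfo or a hard-coded number instead
make -j"$(nproc)"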
Also, there is distcc, which can be used with make to distribute compilation across multiple hosts:
export DISTCC_POTENTIAL_HOSTS='localhost red green blue'
cd ~/work/myproject;
make -j8 CC=distcc
GNU parallel is quite good. @Maxim - good suggestion, +1.
For a one-off, if you cannot install new software, try this for a slow command that has to run multiple times. It runs slow_command 17 times; change things to fit your needs:
#!/bin/bash
cnt=0
while [ $cnt -lt 17 ]    # loop 17 times
do
    slow_command &
    cnt=$(( $cnt + 1 ))
    [ $(( $cnt % 5 )) -eq 0 ] && wait    # 5 jobs at a time in parallel
done
wait    # you will have 2 jobs you did not wait for in the loop (17 % 5 == 2)
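The batch approach above waits for all 5 jobs before starting the next group, so one slow job can stall its whole batch. If your bash is 4.3 or newer, wait -n (wait for any single job to finish) keeps a steady 5 jobs running; a minimal sketch of the same loop:
#!/bin/bash
# keep at most 5 copies of slow_command running at any time (requires bash >= 4.3)
cnt=0
while [ $cnt -lt 17 ]
do
    slow_command &
    cnt=$(( $cnt + 1 ))
    [ "$(jobs -pr | wc -l)" -ge 5 ] && wait -n   # block until any one job exits
done
wait   # wait for the stragglers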
I have a requirement where I need to trigger 10 shell scripts at a time. I may have 200+ shell scripts to be executed.
e.g. if I trigger 10 jobs and two of them complete, I need to trigger another 2 jobs to bring the number currently executing back up to 10.
I need your help and suggestions to meet this requirement.
Yes with GNU Parallel like this:
parallel -j 10 < ListOfJobs.txt
Or, if your jobs are called job_1.sh to job_200.sh:
parallel -j 10 job_{}.sh ::: {1..200}
Or, if your jobs have discontiguous, random names but are all shell scripts with a .sh suffix in one directory:
parallel -j 10 ::: *.sh
There is a very good overview in the GNU Parallel documentation, and there are lots of related questions and answers on Stack Overflow.
Simply run them as background jobs:
for i in {1..10}; do ./script.sh & done
To keep adding more jobs whenever fewer than 10 are running:
while true; do
    pids=($(jobs -pr))
    (( ${#pids[@]} < 10 )) && ./script.sh &
done &> /dev/null
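As written, that loop re-checks the job count as fast as the shell can spin. A variant that blocks until a job exits (wait -n, bash 4.3+) or sleeps briefly between checks avoids burning a CPU core on the bookkeeping; a minimal sketch:
while true; do
    if (( $(jobs -pr | wc -l) < 10 )); then
        ./script.sh &
    else
        wait -n   # bash 4.3+: block until one job exits; otherwise use: sleep 1
    fi
done &> /dev/null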
There are different ways to handle this:
Launch them together as background tasks (1)
Launch them in parallel (1)
Use the crontab (2)
Use at (3)
Explanations:
(1) You can launch the processes exactly when you like (by launching a command, clicking a button, or whatever event you choose)
(2) The processes will be launched at the same time, every (working) day, periodically.
(3) You choose a time when the processes will be launched together once.
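Hedged examples of (2) and (3), assuming a hypothetical launcher script /path/to/launch_all.sh that starts your processes:
# (2) crontab entry (edit with: crontab -e): run it every working day at 02:00
0 2 * * 1-5 /path/to/launch_all.sh
# (3) at: queue it once, to start at 23:00 today
echo "/path/to/launch_all.sh" | at 23:00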
I have used the below to trigger 10 jobs at a time.
max_jobs_trigger=10
while mapfile -t -n "${max_jobs_trigger}" ary && ((${#ary[@]})); do
    jobs_to_trigger=$(printf '%s\n' "${ary[@]}")
    #Trigger script in background
done
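A fleshed-out version of the same idea, assuming the script names are listed one per line in a hypothetical file jobs.txt and that each batch of 10 should finish before the next starts:
max_jobs_trigger=10
while mapfile -t -n "${max_jobs_trigger}" ary && ((${#ary[@]})); do
    for job in "${ary[@]}"; do
        bash "$job" &    # trigger each script of this batch in the background
    done
    wait                 # wait for the batch of (up to) 10 to finish
done < jobs.txt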
I have a Bash script that I submit to a cluster that calls a pipeline of Python scripts which are built to be multithreaded for parallel processing. I need to call this pipeline on all files in a directory, which I can accomplish with a for-loop. However, I am worried that this will run the operations (i.e. the pipeline) on just a single-thread rather than the full range that was intended.
The batch file for submission looks like this:
#!/bin/bash
#SBATCH <parameters>
for filename in /path/to/*.txt; do
PythonScript1.py "$filename"
PythonScript2.py "$filename"
done
Will this work as intended, or will the for loop hamper the efficiency/parallel processing of the Python scripts?
If you are running on a single server:
parallel ::: PythonScript1.py PythonScript2.py ::: /path/to/*.txt
This will generate all combinations of {PythonScript1.py,PythonScript2.py} and *.txt. These combinations will be run in parallel but GNU parallel will only run as many at a time as there are CPU threads in the server.
If you are running on multiple servers in a cluster, it really depends on what system is used for controlling the cluster. On some systems you ask for a list of servers and then you can ssh to those:
get list of servers > serverlist
parallel --slf serverlist ::: PythonScript1.py PythonScript2.py ::: /path/to/*.txt
On others you have to give each of the commands you want to run to the queueing system:
parallel queue_this ::: PythonScript1.py PythonScript2.py ::: /path/to/*.txt
Without knowing more about which cluster control system is used, it is hard to help you more.
As originally written, PythonScript2.py won't run until PythonScript1.py returns, and the for loop won't iterate until PythonScript2.py returns.
Note that I said "returns", not "finishes"; if PythonScript1.py and/or PythonScript2.py forks or otherwise goes into the background on its own, then it will return before it is finished, and will continue processing while the calling bash script continues on to its next step.
You could have the calling script put them into the background with PythonScript1.py & and PythonScript2.py &, but this might or might not be what you want, since PythonScript1.py and PythonScript2.py will thus (likely) be running at the same time.
If you want multiple files processed at the same time, but want PythonScript1.py and PythonScript2.py to run in strict order, follow the comment from William Pursell:
for filename in /path/to/*.txt; do
{ PythonScript1.py "$filename"; PythonScript2.py "$filename"; } &
done
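The same "strict order per file, files in parallel" pattern can also be written with GNU Parallel, which additionally caps the number of files processed at once (one per CPU thread by default); a hedged sketch:
parallel 'PythonScript1.py {} && PythonScript2.py {}' ::: /path/to/*.txt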
So I have a situation where I'm running numerous commands with parallel and piping the output to another script that consumes the output. The problem I'm having is that my script that does the processing of output needs to know when a particular command has finished executing.
I'm using the --tag option so that I know what command has generated output but currently I have to wait until parallel is done running all commands before I can know that I'm not going to get anymore output from a particular command. From my understanding of parallel I see the following possible solutions but none really suit me.
I could group the output lines with the --line-buffer option so it looks as if they were run sequentially. Then whenever I see output from the next command I know the previous one has finished. However, doing it that way slows me down, as one command may take 30 seconds to complete while after it there may be 20 other commands that only took one second each, and I wish to process them in as close to real time as possible.
I could wrap my command in a tiny bash script that outputs 'Process with some ID DONE' to get the notification that the command completed. I don't really like this because I'm running several hundred commands at a time and don't really want to add all those extra bash processes.
I am really hoping that I'm just missing something in the docs and there is a flag in there to do what I'm looking for.
My understanding is that parallel is implemented in Perl, which I'm comfortable with, but I would rather not have to add the functionality myself unless it's completely necessary.
Any help or suggestions are greatly appreciated.
The default behaviour with --tag should work perfectly. It will not output anything until the job is done. And then your postprocessor can simply grab the argument from the start of the line.
Example:
parallel -j3 --tag 'echo Job {} start; sleep {}; echo Job {} ended' ::: 7 1 3 5 2 4 6
If you want to keep the order:
parallel -j3 --keep-order --tag 'echo Job {} start; sleep {}; echo Job {} ended' ::: 7 1 3 5 2 4 6
Notice how the jobs would mix if the output was done immediately. Compare with --ungroup (which you do not want):
parallel -j3 --ungroup 'echo Job {} start; sleep {}; echo Job {} ended' ::: 7 1 3 5 2 4 6
I would like to run commands in parallel, so that if one fails, the whole job exits as a failure. More specifically, I have 5 aws sync commands that currently run sequentially. I would like them to run in parallel so that if one fails, the whole job fails. How can I do that?
GNU Parallel is a really handy and powerful tool that works with anything you can run from bash:
http://www.gnu.org/software/parallel/
https://www.youtube.com/watch?v=OpaiGYxkSuQ
# run lines from a file, 8 at a time
cat commands.txt | parallel --eta -j 8 "{}"
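To get the "fail the whole job if any one command fails" behaviour asked about here, recent versions of GNU Parallel have a --halt option that kills the remaining jobs and makes parallel exit non-zero as soon as one job fails. A hedged sketch with hypothetical bucket and directory names:
# stop everything as soon as the first sync fails; parallel then exits non-zero
parallel --halt now,fail=1 -j 5 ::: \
  "aws s3 sync ./dir1 s3://my-bucket/dir1" \
  "aws s3 sync ./dir2 s3://my-bucket/dir2" \
  "aws s3 sync ./dir3 s3://my-bucket/dir3"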
I need a shell script that will create a loop to start parallel tasks read in from a file...
Something along the lines of:
#!/bin/bash
mylist=/home/mylist.txt
for i in $(cat "$mylist")
do
    # do something like: cp -rp "$i" /destination &
done
wait
So what I am trying to do is send a bunch of tasks into the background with "&" for each line in $mylist and wait for them to finish before exiting.
However, there may be a lot of lines in there, so I want to control how many parallel background processes get started; I want to be able to cap it at, say, 5 or 10.
Any ideas?
Thank you
Your operating system's scheduler will make it seem like you can run many parallel jobs. How many you can actually run with maximum efficiency depends on your processor. Overall you don't have to worry about starting too many processes, because your system will schedule them for you. If you want to limit them anyway, because the number could get absurdly high, you could use something like this (provided you execute a cp command every time):
...
while ...; do
    jobs=$(pgrep 'cp' | wc -l)
    [[ $jobs -gt 50 ]] && { sleep 100; continue; }
    ...
done
The number of running cp commands will be stored in the jobs variable and before starting a new iteration it will check if there are too many already. Note that we jump to a new iteration so you'd have to keep track of how many commands you already executed. Alternatively you could use wait.
Edit:
On a side note, you can assign a specific CPU core to a process using taskset; it may come in handy when you have fewer, more complex commands.
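For example (taskset comes from util-linux and is Linux-specific; on Solaris the rough equivalent is pbind):
# pin this copy to CPU core 0
taskset -c 0 cp -rp "$i" /destination &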
You are probably looking for something like this using GNU Parallel:
parallel -j10 cp -rp {} /destination :::: /home/mylist.txt
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel