shell script to loop and start processes in parallel? - bash

I need a shell script that will create a loop to start parallel tasks read in from a file...
Something in the lines of..
#!/bin/bash
mylist=/home/mylist.txt
for i in ('ls $mylist')
do
do something like cp -rp $i /destination &
end
wait
So what I am trying to do is send a bunch of tasks in the background with the "&" for each line in $mylist and wait for them to finish before existing.
However, there may be a lot of lines in there so I want to control how many parallel background processes get started; want to be able to max it at say.. 5? 10?
Any ideas?
Thank you

Your task manager will make it seem like you can run many parallel jobs. How many you can actually run to obtain maximum efficiency depends on your processor. Overall you don't have to worry about starting too many processes because your system will do that for you. If you want to limit them anyway because the number could get absurdly high you could use something like this (provided you execute a cp command every time):
...
while ...; do
jobs=$(pgrep 'cp' | wc -l)
[[ $jobs -gt 50 ]] && (sleep 100 ; continue)
...
done
The number of running cp commands will be stored in the jobs variable and before starting a new iteration it will check if there are too many already. Note that we jump to a new iteration so you'd have to keep track of how many commands you already executed. Alternatively you could use wait.
Edit:
On a side note, you can assign a specific CPU core to a process using taskset, it may come in handy when you have fewer more complex commands.

You are probably looking for something like this using GNU Parallel:
parallel -j10 cp -rp {} /destination :::: /home/mylist.txt
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Related

is there a way to trigger 10 scripts at any given time in Linux shell scripting?

I have a requirement where I need to trigger 10 shell scripts at a time. I may have 200+ shell scripts to be executed.
e.g. if I trigger 10 jobs and two jobs completed, I need to trigger another 2 jobs which will make number of jobs currently executing to 10.
I need your help and suggestion to cater this requirement.
Yes with GNU Parallel like this:
parallel -j 10 < ListOfJobs.txt
Or, if your jobs are called job_1.sh to job_200.sh:
parallel -j 10 job_{}.sh ::: {1..200}
Or. if your jobs are named with discontiguous, random names but are all shell scripts named with .sh suffix in one directory:
parallel -j 10 ::: *.sh
There is a very good overview here. There are lots of questions and answers on Stack Overflow here.
Simply run them as background jobs:
for i in {1..10}; { ./script.sh & }
Adding more jobs if less than 10 are running:
while true; do
pids=($(jobs -pr))
((${#pids[#]}<10)) && ./script.sh &
done &> /dev/null
There are different ways to handle this:
Launch them together as background tasks (1)
Launch them in parallel (1)
Use the crontab (2)
Use at (3)
Explanations:
(1) You can launch the processes exactly when you like (by launching a command, click a button or whatever event you choose)
(2) The processes will be launched at the same time, every (working) day, periodically.
(3) You choose a time when the processes will be launched together once.
I have used below to trigger 10 jobs a time.
max_jobs_trigger=10
while mapfile -t -n ${max_jobs_trigger} ary && ((${#ary[#]})); do
jobs_to_trigger=`printf '%s\n' "${ary[#]}"`
#Trigger script in background
done

Julia Command Line Running Processes in Parallel

I have a Julia script that converts csvs to a binary format. Trust me it's great. I also have many (seemingly innumerable) csvs that I want to process. It's a shared network and so I can only process five files at a clip without savagely burdening the CPU and making my coworkers irate and potentially unstable. Accordingly, I want to run the script in groups of five, wait for them to finish, and then run the next batch as background processes until it's Miller time all using Julia's wonderful run() function ala:
julia csvparse3.jl /home/file1.csv > /dev/null 2>&1 &
I'm fairly certain that I could sidestep all of this by using addprocs() and pmap() if I made my parsing script into a Julia module/function. However, the reason I'm asking this is because I don't know what I would then do if my original script was written in Fortran or even worse Python? Is there a way for me to achieve my aforementioned goals for an arbitrary number of external programs, ascertain when the processes are finished, and start anew in the context of a simple loop? Many thanks.
With GNU Parallel you can run:
parallel -j5 julia csvparse3.jl ::: /home/*.csv > /dev/null 2>&1
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Queue using several processes to launch bash jobs

I need to run many (hundreds) commands in shell, but I only want to have a maximum of 4 processes running (from the queue) at once. Each process will last several hours.
When a process finishes I want the next command to be "popped" from the queue and executed.
I also want to be able to add more process after the beginning, and it will be great if I could remove some jobs from the queue, or at least empty the queue.
I have seen solutions using makefile, but this only work if I have all my list of commands before the beginning. Also tried using mkfifo sjobq, and others, but I never could reach my needs...
Does anyone have code to solve this problem?
Edit: In response to Mark Setchell
The solution with tail -f and parallel is almost perfect, but when I do it, it always keep not launching the last 4 commands until I add more, and so on, I don't know why, and it is quite troublesome...
As for Redis, good solution also, but it takes more time to master all of it.
Thanks !
Use GNU Parallel to make a job queue like this:
# Clear out file containing job queue
> jobqueue
# Start GNU Parallel processing jobs from queue
# -k means "keep" output in order
# -j 4 means run 4 jobs at a time
tail -f jobqueue | parallel -k -j 4
# From another terminal, submit 40 jobs to the queue
for i in {1..40}; do echo "sleep 5;date +'%H:%M:%S Job $i'"; done >> jobqueue
Another option is to use REDIS - see my answer here Run several jobs parallelly and Efficiently

How to run shell script in few jobs

I have a build script, which works very slowly, especially on Solaris. I want to improve its performance by running it in multiple jobs. How can I do that?
Try GNU Parallel, it is quite easy to use:
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
If you use xargs and tee today you will find GNU parallel very easy to use as GNU parallel is written to have the same options as xargs. If you write loops in shell, you will find GNU parallel may be able to replace most of the loops and make them run faster by running several jobs in parallel.
GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially. This makes it possible to use output from GNU parallel as input for other programs.
For each line of input GNU parallel will execute command with the line as arguments. If no command is given, the line of input is executed. Several lines will be run in parallel. GNU parallel can often be used as a substitute for xargs or cat | bash.
You mentioned that it is a build script. If you are using command line utility make you can parallelize builds using make's -j<N> option:
GNU make knows how to execute several recipes at once. Normally, make will execute only one recipe at a time, waiting for it to finish before executing the next. However, the ‘-j’ or ‘--jobs’ option tells make to execute many recipes simultaneously.
Also, there is distcc which can be used with make to distribute compiling to multiple hosts:
export DISTCC_POTENTIAL_HOSTS='localhost red green blue'
cd ~/work/myproject;
make -j8 CC=distcc
GNU parallel is quite good. #Maxim - good suggestion +1.
For a one off, if you cannot install new software, try this for a slow command that has to run multiple times, run slowcommand 17 times. Change things to fit your needs:
#!/bin/bash
cnt=0
while [ $cnt -le 17 ] # loop 17 times
do
slow_command &
cnt=$(( $cnt + 1 ))
[ $(( $cnt % 5 )) -eq 0 ] && wait # 5 jobs at a time in parallel
done
wait # you will have 2 jobs you di not wait for in the loop 17 % 5 == 2

Small Scale load levelling

I have a series of jobs which need to be done; no dependencies between jobs. I'm looking for a tool which will help me distribute these jobs to machines. The only restriction is that each machine should run one job at a time only. I'm trying to maximize throughput, because the jobs are not very balanced. My current hacked together shell scripts are less than efficient as I pre-build the per-machine job-queue, and can't move jobs from the queue of a heavily loaded machine to one which is waiting, having already finished everything.
Previous suggestions have included SLURM which seems like overkill, and even more overkill LoadLeveller.
GNU Parallel looks like almost exactly what I want, but the remote machines don't speak SSH; there's a custom job launcher used (which has no queueing capabilities). What I'd like is Gnu Parallel where the machine can just be substituted into a shell script on the fly just before the job is dispatched.
So, in summary:
List of Jobs + List of Machines which can accept: Maximize throughput. As close to shell as possible is preferred.
Worst case scenario something can be hacked together with bash's lockfile, but I feel as if a better solution must exist somewhere.
Assuming your jobs are in a text file jobs.tab looking like
/path/to/job1
/path/to/job2
...
Create dispatcher.sh as something like
mkfifo /tmp/jobs.fifo
while true; do
read JOB
if test -z "$JOB"; then
break
fi
echo -n "Dispatching job $JOB .."
echo $JOB >> /tmp/jobs.fifo
echo ".. taken!"
done
rm /tmp/jobs.fifo
and run one instance of
dispatcher.sh < jobs.tab
Now create launcher.sh as
while true; do
read JOB < /tmp/jobs.fifo
if test -z "$JOB"; then
break
fi
#launch job $JOB on machine $0 from your custom launcher
done
and run one instance of launcher.sh per target machine (giving the machine as first and only argument)
GNU Parallel supports your own ssh command. So this should work:
function my_submit { echo On host $1 run command $3; }
export -f my_submit
parallel -j1 -S "my_submit server1,my_submit server2" my_command ::: arg1 arg2

Resources