Run script in multiple machines in parallel - bash

I am interested to know the best way to start a script in the background in multiple machines as fast as possible. Currently, I'm doing this
Run for each IP address
ssh user#ip -t "perl ~/setup.pl >& ~/log &" &
But this takes time as it individually tries to SSH into each one by one to start the setup.pl in the background in that machine. This takes time as I've got a large number of machines to start this script on.
I tried using GNU parallel, but couldn't get it to work properly:
seq COUNT | parallel -j 1 -u -S ip1,ip2,... perl ~/setup.pl >& ~/log
But it doesn't seem to work, I see the script started by GNU parallel in the target machine, but it's stagnant. I don't see anything in the log.
What am I doing wrong in using the GNU parallel?

GNU Parallel assumes per default that it does not matter which machine it runs a job on - which is normally true for computations. In your case it matters greatly: You want one job on each of the machine. Also GNU Parallel will give a number as argument to setup.pl, and you clearly do not want that.
Luckily GNU Parallel does support what you want using --nonall:
http://www.gnu.org/software/parallel/man.html#example__running_the_same_command_on_remote_computers
I encourage you to read and understand the rest of the examples, too.

I recommend that you use pdsh
It allows you to run the same command on multiple machines
Usage:
pdsh -w machine1,machine2,...,machineN <command>
It might not be included in your distribution of linux so get it through yum or apt

Try to wrap ssh user#ip -t "perl ~/setup.pl >& ~/log &" & in the shell script, and run for each ip address ./mysctipt.sh &

Related

Parallel execution of a command on few boxes over ssh (using bash)

What would be the cleanest way to execute same command remotely on several boxes with joint output to console?
For instance, I would like to tail logs from several boxes all together in my console as one output.
Definitely GNU parallel is a nice tool for parallelizing things in the shell. It also has reasonable remote execution capabilities.
It can be as easy as
parallel -S $SERVER1 -S $SERVER2 echo ::: running on more hosts

Running bash scripts in parallel

I would like to run commands in parallel. So that if one fails, the whole job exists as failure. How can I do that? More specifically, I would like to run aws sync commands in parallel. I have 5 aws sync commands that run sequentially. I would like them to run in parallel so that if one fails, the whole job fails. How can I do that?
GNU Parallel is a really handy and powerful tool that works with anything bash
http://www.gnu.org/software/parallel/
https://www.youtube.com/watch?v=OpaiGYxkSuQ
# run lines from a file, 8 at a time
cat commands.txt | parallel --eta -j 8 "{}"

Perform a command on cluster computers

I'd like to perform some bash command on set of computers of yarn cluster. For example, print last line of each logs:
tail -n 1 `ls /data/pagerank/Logs/*`
There are too many computers in cluster subset to manually enter to each computer and perform the command. Is there a way to automate the procedure?
I think you could use the Parallel SSH tool. You can find more at
https://code.google.com/p/parallel-ssh/
A basic tutorial how to use it can be found at
http://noone.org/blog/English/Computer/Debian/CoolTools/SSH/parallel-ssh.html

shell script to loop and start processes in parallel?

I need a shell script that will create a loop to start parallel tasks read in from a file...
Something in the lines of..
#!/bin/bash
mylist=/home/mylist.txt
for i in ('ls $mylist')
do
do something like cp -rp $i /destination &
end
wait
So what I am trying to do is send a bunch of tasks in the background with the "&" for each line in $mylist and wait for them to finish before existing.
However, there may be a lot of lines in there so I want to control how many parallel background processes get started; want to be able to max it at say.. 5? 10?
Any ideas?
Thank you
Your task manager will make it seem like you can run many parallel jobs. How many you can actually run to obtain maximum efficiency depends on your processor. Overall you don't have to worry about starting too many processes because your system will do that for you. If you want to limit them anyway because the number could get absurdly high you could use something like this (provided you execute a cp command every time):
...
while ...; do
jobs=$(pgrep 'cp' | wc -l)
[[ $jobs -gt 50 ]] && (sleep 100 ; continue)
...
done
The number of running cp commands will be stored in the jobs variable and before starting a new iteration it will check if there are too many already. Note that we jump to a new iteration so you'd have to keep track of how many commands you already executed. Alternatively you could use wait.
Edit:
On a side note, you can assign a specific CPU core to a process using taskset, it may come in handy when you have fewer more complex commands.
You are probably looking for something like this using GNU Parallel:
parallel -j10 cp -rp {} /destination :::: /home/mylist.txt
GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

GNU parallel: different commands to different computers?

Have searched on SO and GNU parallel tutorial and gone through examples here, but still don't quite see what I need solved. Any tips appreciated on how I could accomplish the following:
I need to invoke the same script on several remote servers with a different argument passed to each one (argument is a string), then wait until all those jobs are done... Then, run that same script some more times on those same remote servers, but this time try to keep the remote servers as busy as possible (ie when they finish their job, send them another job). Ideally the strings could be read in from a file on the "master" machine that is sending the jobs to the remote servers.
To diagram this, I'm trying to run *my_script* like this:
server A: myscript fee
server B: myscript fi
When both jobs are done I then want to do something like:
server A: myscript fo
server B: myscript fum
... and supposing A finished its work before server B, immediately sending it the next job like :
server A: myscript englishmun
... etc
Again, hugely appreciate any ideas people might have about whether this is easy/hard with GNU parallel (or if something else like pdsh, cluster ssh, might be better suited).
TIA!
It seems we can split the problem up in two parts: An initialization part that needs to be run on all server and a job processing part that does not care which server it is run on.
The last part is GNU Parallel's specialty:
cat argfile | parallel -S serverA,serverB myscript
The first part is a bit more tricky: You want the first k arguments to go onto to k servers.
head -n 2 argfile | parallel -j1 -S serverA,serverB myscript
The problem is here that if there are loads of servers, then serverA may finish before you get to the last server. It is much easier to run the same job on all servers:
head -n 1 argfile | parallel --onall -S serverA,serverB myscript

Resources