How can I implement MapReduce using shell commands? - parallel-processing

How do you execute a Unix shell command (e.g. an awk one-liner) on a cluster in parallel (step 1) and collect the results back to a central node (step 2)?
Update: I've just found http://blog.last.fm/2009/04/06/mapreduce-bash-script
It seems to do exactly what I need.

If all you're trying to do is fire off a bunch of remote commands, you could just use Perl. You can "open" an ssh command and pipe the results back to Perl. (You of course need to set up keys to allow password-less access.)
# open a pipe from ssh running the remote script (requires key-based auth)
open(REMOTE, "ssh user\@hostB \"myScript\" |") or die "cannot start ssh: $!";
while (<REMOTE>) {
    print $_;
}
close(REMOTE);
You'd want to craft a loop over your machine names and fire off one ssh for each. After that, just do non-blocking reads on the filehandles to pull back the data as it becomes available.
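The same fan-out-and-collect idea can also be expressed directly in the shell. Here is a minimal sketch, assuming password-less ssh; the host names, the remote command myScript, and the out.* file names are placeholders:
for host in hostA hostB hostC; do           # placeholder host names
    ssh "$host" 'myScript' > "out.$host" &  # run the remote job in the background
done
wait                                        # block until every ssh has returned
cat out.*                                   # gather the per-host results on the central node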

GNU parallel can be installed on your central node and used to run a command across multiple machines.
In the example below, multiple ssh connections are used to run commands on the remote hosts (-j is the number of jobs to run at the same time from the central node). The result can then be piped to commands that perform the "reduce" stage (sort then uniq -c in this example).
parallel -j 50 ssh {} "ls" ::: host1 host2 hostn | sort | uniq -c
This example assumes "keyless ssh login" has been set up between the central node and all machines in the cluster.
Escaping can be tricky when running more complex commands than "ls" remotely; sometimes you have to escape the escape character itself. The bashreduce script you mention may simplify this.
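A fuller sketch of the map/reduce shape might look like the following: the "map" step just streams a log file from each host, and the central node counts the first field of every line and sorts the totals. The host names and the /var/log/app.log path are placeholders:
parallel -j 10 ssh {} "cat /var/log/app.log" ::: host1 host2 host3 \
    | awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' \
    | sort -rn -k2
Keeping the remote command to a plain cat sidesteps most of the escaping trouble mentioned above.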

Related

Parallel execution of a command on a few boxes over ssh (using bash)

What would be the cleanest way to execute same command remotely on several boxes with joint output to console?
For instance, I would like to tail logs from several boxes all together in my console as one output.
GNU parallel is definitely a nice tool for parallelizing things in the shell, and it has reasonable remote execution capabilities.
It can be as easy as
parallel -S $SERVER1 -S $SERVER2 echo ::: running on more hosts
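For the log-tailing case specifically, something along these lines could work, using --nonall to run the same command once on every listed server and --tag plus --line-buffer to interleave the output line by line. The server names and the log path are placeholders:
parallel --nonall --tag --line-buffer -S serverA,serverB tail -f /var/log/syslog
Each output line is prefixed with the server it came from, so the combined stream stays readable in one console.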

Perform a command on cluster computers

I'd like to perform some bash command on a set of computers in a YARN cluster. For example, print the last line of each log:
tail -n 1 `ls /data/pagerank/Logs/*`
There are too many computers in the cluster subset to log in to each one manually and run the command. Is there a way to automate the procedure?
I think you could use the Parallel SSH tool. You can find more at
https://code.google.com/p/parallel-ssh/
A basic tutorial on how to use it can be found at
http://noone.org/blog/English/Computer/Debian/CoolTools/SSH/parallel-ssh.html
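As a rough sketch, assuming a hosts.txt file with one machine per line and password-less ssh (both assumptions), the original command could be run everywhere with pssh's -h (host file) and -i (inline output) options; tail accepts the glob directly, so the backticked ls is not needed:
pssh -h hosts.txt -i 'tail -n 1 /data/pagerank/Logs/*'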

GNU parallel: different commands to different computers?

Have searched on SO and GNU parallel tutorial and gone through examples here, but still don't quite see what I need solved. Any tips appreciated on how I could accomplish the following:
I need to invoke the same script on several remote servers with a different argument passed to each one (the argument is a string), then wait until all those jobs are done. Then, run that same script some more times on those same remote servers, but this time try to keep the remote servers as busy as possible (i.e. when they finish their job, send them another job). Ideally the strings could be read from a file on the "master" machine that is sending the jobs to the remote servers.
To diagram this, I'm trying to run myscript like this:
server A: myscript fee
server B: myscript fi
When both jobs are done I then want to do something like:
server A: myscript fo
server B: myscript fum
... and supposing server A finishes its work before server B, immediately send it the next job, like:
server A: myscript englishmun
... etc
Again, hugely appreciate any ideas people might have about whether this is easy/hard with GNU parallel (or if something else like pdsh, cluster ssh, might be better suited).
TIA!
It seems we can split the problem into two parts: an initialization part that needs to be run on all servers, and a job-processing part that does not care which server it runs on.
The last part is GNU Parallel's specialty:
cat argfile | parallel -S serverA,serverB myscript
The first part is a bit trickier: you want the first k arguments to go to k different servers, one each.
head -n 2 argfile | parallel -j1 -S serverA,serverB myscript
The problem here is that if there are loads of servers, serverA may finish before you even get to the last server. It is much easier to run the same job on all servers:
head -n 1 argfile | parallel --onall -S serverA,serverB myscript
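Putting the two phases together, a rough end-to-end sketch might look like this, using the -j1 variant for the per-server initialization arguments because the question wants a different argument on each server. The server names, myscript, and argfile come from the question; treating the first two lines of argfile as the initialization arguments is an assumption:
# phase 1: one initialization argument per server, run before anything else
head -n 2 argfile | parallel -j1 -S serverA,serverB myscript
# phase 2: stream the remaining arguments, keeping both servers busy
tail -n +3 argfile | parallel -S serverA,serverB myscript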

How can I run two commands at exactly the same time on two different Unix servers?

My requirement is that I have to reboot two servers at the same time (exactly the same timestamp). So my plan is to create two shell scripts that will ssh to the servers and trigger the reboot. My doubts are:
How can I run the same shell script on two servers at the same time (same timestamp)?
Even if I run Script1 & Script2, this will not ensure that the reboot is issued at the same time; there will be a minor time difference.
If you are doing it remotely, you could use a terminal emulator with broadcast input, so that what you type is sent to all open sessions. On Linux, tmux is one such emulator.
The other easy way is to write a shell script that waits for the same timestamp on both machines and then reboots each of them.
First, make sure both machines' clocks are aligned (use the best implementation of http://en.wikipedia.org/wiki/Network_Time_Protocol and your system's related utilities).
Then,
If you need this just one time: on each server do a
echo /path/to/your/script | at ....
(.... being when you want it. See man at).
If you need to do it several times: use crontab instead of at
(see man cron and man crontab)
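As a concrete sketch of the one-shot case, the jobs could be scheduled from a central node over ssh. The time, the host names, and the assumption that the user may reboot without a password prompt are all placeholders:
for host in serverA serverB; do
    ssh "$host" 'echo "sudo reboot" | at 03:00'   # schedule the reboot for 03:00 local time on each box
done
Because at fires on each machine's own clock, the NTP synchronization step above is what actually makes the two reboots coincide.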

Run script in multiple machines in parallel

I am interested to know the best way to start a script in the background in multiple machines as fast as possible. Currently, I'm doing this
Run for each IP address
ssh user@ip -t "perl ~/setup.pl >& ~/log &" &
But this takes time, as it tries to SSH into each machine one by one to start setup.pl in the background, and I've got a large number of machines to start this script on.
I tried using GNU parallel, but couldn't get it to work properly:
seq COUNT | parallel -j 1 -u -S ip1,ip2,... perl ~/setup.pl >& ~/log
But it doesn't seem to work: I see the script started by GNU parallel on the target machine, but it's stagnant. I don't see anything in the log.
What am I doing wrong in using the GNU parallel?
GNU Parallel assumes by default that it does not matter which machine it runs a job on, which is normally true for computations. In your case it matters greatly: you want one job on each machine. Also, GNU Parallel will pass a number as an argument to setup.pl, which you clearly do not want.
Luckily GNU Parallel does support what you want, using --nonall:
http://www.gnu.org/software/parallel/man.html#example__running_the_same_command_on_remote_computers
I encourage you to read and understand the rest of the examples, too.
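A rough sketch of the --nonall approach, reusing the host list and paths from the question; nohup and the explicit redirection are assumptions, added so the remote script keeps running after parallel's ssh session closes:
parallel --nonall -S ip1,ip2 'nohup perl ~/setup.pl > ~/log 2>&1 &'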
I recommend that you use pdsh
It allows you to run the same command on multiple machines
Usage:
pdsh -w machine1,machine2,...,machineN <command>
It might not be included in your distribution of Linux, so get it through yum or apt.
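Applied to the question, a hypothetical invocation might look like this (the comma-separated host list is a placeholder, and the nohup/redirection again keeps the remote job alive after pdsh returns):
pdsh -w ip1,ip2,ip3 'nohup perl ~/setup.pl > ~/log 2>&1 &'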
Try wrapping ssh user@ip -t "perl ~/setup.pl >& ~/log &" & in a shell script, and run it for each IP address: ./myscript.sh &
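A minimal sketch of that wrapper (the name myscript.sh is from the answer; taking the IP as the first argument is an assumption):
#!/bin/bash
# myscript.sh - start setup.pl in the background on one remote host
ssh -t "user@$1" "perl ~/setup.pl >& ~/log &" &
It could then be launched for every machine with something like: for ip in ip1 ip2 ip3; do ./myscript.sh "$ip"; done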
