Starting a Linux PSOCK cluster from a Windows machine hangs R

I'm trying to set up a cluster on a Linux box using the parallel package. A wart is that the machine I'm using as the master is running Windows rather than CentOS.
After some hacking around with puttygen and plink (PuTTY's version of ssh), I arrived at a command string that executes Rscript on a slave without needing a password:
plink -i d:/hong/documents/gpadmin.ppk -l gpadmin 192.168.224.128 Rscript
where gpadmin.ppk is a private key file generated using puttygen, and copied to the slave.
I translated this into a makeCluster call, as follows:
cl <- makeCluster("192.168.224.128",
                  user="gpadmin",
                  rshcmd="plink -i d:/hong/documents/gpadmin.ppk",
                  master="192.168.224.1",
                  rscript="Rscript")
but when I try to run this, R (on Windows) hangs. Well, it doesn't hang as in crashing, but it doesn't do anything until I press Escape.
However, I can laboriously get the cluster running by adding manual=TRUE to the end of the call:
cl <- makeCluster("192.168.224.128",
                  user="gpadmin",
                  rshcmd="plink -i d:/hong/documents/gpadmin.ppk",
                  master="192.168.224.1",
                  rscript="Rscript",
                  manual=TRUE)
I then log into the slave using the above plink command and, at the resulting bash prompt, run the string that R displayed. This suggests the string itself is fine, but that makeCluster gets confused trying to run it by itself.
Can anyone help diagnose what's going on, and how to fix it? I'd rather not have to start the cluster by manually logging into 16+ nodes every time.
I'm running R 3.0.2 on Windows 7 on the master, and R 3.0.0 on CentOS on the slave.

Your method of creating the cluster seems correct. Using your instructions, I was able to start a PSOCK cluster on a Linux machine from a Windows machine.
My first thought was that it was a quoting problem, but that doesn't seem to be the case since the Rscript command worked for you in manual mode. My second thought was that your environment is not correctly initialized when running non-interactively. For instance, you'd have a problem if Rscript was only in your PATH when running interactively, but again, that doesn't seem to be the case, since you were able to execute Rscript via plink. Have you checked if you have anything in ~/.Rprofile that only works interactively? You might want to temporarily remove any ~/.Rprofile on the Linux machine to see if that helps.
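One quick sanity check of the non-interactive environment, assembled from the plink command in the question (the remote command is just an illustration):

plink -i d:/hong/documents/gpadmin.ppk -l gpadmin 192.168.224.128 "echo $PATH; which Rscript"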
Use outfile="" so that any error or warning messages from the worker are visible. Also run ps on the Linux machine while makeCluster is hanging, to see whether the worker has exited or is itself hanging. If it is still running, that suggests a networking problem that only happens when running non-interactively, strange as that seems.
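While it hangs, something like this on the slave will show whether the worker process is alive (username taken from the question):

# List the gpadmin user's processes; look for an R/Rscript worker:
ps -fu gpadmin
# Or filter for it directly:
ps -ef | grep -i rscript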
Some additional comments:
Use Rterm.exe on the master so you see any worker output when using outfile="".
I recommend using "Pageant" so that you don't need to use an unencrypted private key. That's safer and avoids the need for the plink "-i" option (see the first sketch after this list).
It's a good idea to use the same version of R on the master and workers.
If you're desperate, you could write a wrapper script for Rscript on the Linux machine that executes Rscript via strace. That would tell you what system calls were executed when the worker either exited or hung (see the second sketch after this list).
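Two sketches for the comments above. First, the Pageant route (key path from the question; Pageant's install location is an assumption):

rem Load the key into Pageant once per Windows session (it prompts for the passphrase of an encrypted key):
start "" "C:\Program Files\PuTTY\pageant.exe" d:\hong\documents\gpadmin.ppk
rem plink then authenticates through the agent, with no -i option:
plink -l gpadmin 192.168.224.128 Rscript --version

Second, a hypothetical strace wrapper (the wrapper location, real Rscript path, and trace file location are all assumptions):

#!/bin/sh
# Save as e.g. ~/bin/Rscript so it shadows the real one in the worker's PATH,
# or point makeCluster's rscript= argument at it directly.
# -f follows forked children; each run writes a trace file under /tmp.
exec strace -f -o /tmp/Rscript-trace.$$ /usr/bin/Rscript "$@"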

Related

How to start WSL cron jobs on boot?

I have some scripts that must be run under WSL, and must be run all the time. Currently, my Windows 11 randomly decides when it thinks it's convenient to reboot and install some updates. Is there a way to start WSL cron jobs automatically with Windows?
If this sounds like an XY Problem, I'd be more than happy to elaborate further.
I set up the cron job via crontab -e (and also with sudo crontab -e). I was expecting it to behave as it would on a regular Linux distro, but it does nothing until I run sudo service cron start and keep at least one WSL window open.
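A minimal sketch of one common workaround, assuming Windows Task Scheduler is acceptable and your distro manages cron via service (not systemd) inside WSL:

rem Run once from an elevated Windows prompt to register a logon task that starts cron:
schtasks /create /tn "WSL cron" /tr "wsl.exe -u root service cron start" /sc onlogon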

Bash script to send commands to remote ssh session

Is it possible to write a bash script that opens a remote node (i.e. through ssh and/or slurm) and starts an interactive session there after running some commands? I'm trying to automate the process of starting a jupyter session on a remote computing cluster, which currently looks like this:
- ssh into a login node of the remote cluster, using a specific port
- use slurm to request an interactive session on one of the compute nodes, including X11 forwarding through that port
- change directory to the working directory
- activate the conda environment for my project
- open jupyter from the command line, specifying the port I used previously
It's a lengthy process, and if I get something wrong at any step I usually have to go back and start from the beginning because the port I'm using is still tied up. So I'm looking for a way I can run a single script (possibly with arguments) from my local machine that jumps through all the hoops to get me a working jupyter session with a link I can paste to my browser.
As @Diego Torres Milano said, you would need to write a script that can do the interactive part, then invoke that script remotely.
But since your process is interactive, this gets tricky. Luckily, Linux has a tool called expect, easily installed via a package manager, which is designed for scripting multi-step interactive sessions.
So you would write an expect script that "expects" certain prompts, reads them, and uses conditional logic to respond to each one appropriately.
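A minimal sketch of such a script, with hypothetical prompts, paths, and environment names (everything here is an assumption to adapt to your cluster):

#!/usr/bin/expect -f
# Never time out while waiting for slurm to grant the allocation.
set timeout -1
# Request an interactive session with X11 forwarding (srun flags are a guess at your site's setup).
spawn srun --pty --x11 bash
# Wait for a shell prompt on the compute node; "$ " is a guess at your prompt.
expect "$ "
send "cd /path/to/workdir && conda activate myproject\r"
expect "$ "
send "jupyter notebook --no-browser --port=8888\r"
# Hand control back so you can copy the jupyter link from the output.
interact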
Once you have this written and it works locally, it's just a matter of executing it on the remote server via ssh:
ssh user@12.34.56.78 /path/to/script.ex

Running a Python script in parallel with Ansible

I am managing six (or more) machines at AWS with Ansible. Those machines must run a Python script that runs forever (the script has a while True loop).
I call the Python script via the command module: python3 script.py
But only 5 machines run the script; the others don't. I can't figure out what I'm doing wrong.
(Before the script call, everything works fine on all machines: echo, ping, etc.)
I already found the answer.
Ansible's forks setting restricts execution to 5 machines in parallel by default. You must set a greater forks value in the configuration file, and the machine running Ansible must have the power to manage that many parallel connections.
I'll leave the question up because the answer was pretty hard for me to find.
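For reference, a minimal sketch of the change (20 is an arbitrary value; pick what your control machine can handle):

# ansible.cfg (or /etc/ansible/ansible.cfg)
[defaults]
forks = 20

You can also raise it for a single run with the -f/--forks command-line option, e.g. ansible-playbook -f 20 playbook.yml.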

Cannot successfully disconnect from remote machine using 'nohup' or 'screen'

I am trying to do some work on a remote machine and disconnect without terminating the work. I have tried both nohup and screen, but unfortunately neither is working out. After I type exit to log out, my work also terminates immediately.
I am trying to run 108 simulations on a remote machine. For that purpose I have written a script named batch.sh which runs one simulation after the other until all 108 are done. The program that actually runs a simulation launches 5 programs in 5 different terminals (using xterm -e). I run batch.sh using:
nohup bash batch.sh &
As long as I am connected everything works just fine. If I disconnect and then reconnect to check whether everything is working as it should...no joy :(
Are there any caveats I am overlooking? Possibly because my program launches other programs in external terminals?
UPDATE
If I follow the suggestions of adding -oForwardX11=no to ssh and unsetting DISPLAY before launching my script, I get:
nohup: ignoring input and appending output to nohup.out
In nohup.out I have these messages:
xterm Xt error: Can't open display:
xterm: DISPLAY is not set
Apparently your script/program is trying to launch xterm on its own. These days many systems enable X11 forwarding for their SSH client by default - as a result the DISPLAY variable is set in your shell session but becomes invalid once you disconnect. Therefore, as long as you are connected to the remote system, the xterm processes can access the X server on your local machine through the SSH connection, but die once that connection is severed.
I have occasionally encountered the same issue with Java programs that use e.g. the Java AWT subsystem to generate image files, even when there is no actual graphical window. You should first see if your program will somehow adapt if there is no X server available. One option is to disable X11 forwarding with the -oForwardX11=no option to ssh:
$ ssh -oForwardX11=no user@server.host.name
You could also try unsetting the DISPLAY environment variable before starting your script and see what happens.
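For example, on the remote machine:

# Clear the forwarded display before launching, to see how the script behaves without X:
unset DISPLAY
nohup bash batch.sh &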
However, if your program launches xterm windows indiscriminately, then you'd have to make it use an output file on the server instead, modifying it if necessary. As an added advantage, you would get rid of the network load and timing overhead involved with forwarded X connections.
If you cannot change the way your program works and you do not actually care about the output in those xterm windows, then you could try launching a virtual framebuffer X server on the remote system and have your script use that for xterm.
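A minimal sketch of that approach using Xvfb (assuming the package is installed on the remote machine; display :99 is an arbitrary unused number):

# Start a virtual X server that is independent of the ssh session:
Xvfb :99 -screen 0 1024x768x16 &
# Point the xterms at it instead of the forwarded display:
export DISPLAY=:99
nohup bash batch.sh &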

Running a Linux script remotely from Windows and getting the execution result code

I have the following scenario to deal with:
I have to schedule the backup of my company's Linux-based server (running SUSE Linux) with ARCServe R15 (installed on Windows 2003 R2 SP2).
I know I have the ability in my backup software (ARCServe) to add pre/post execution scripts to my backup-jobs.
If the script fails, ARCServe is configured NOT to run the backup job; if it succeeds, the job runs. I have no problem with this part.
The problem is, I want to make a Windows script (to be launched by ARCServe) that executes a Linux script on the cluster:
- If the Linux script fails, I want my Windows script to fail, so my ARCServe backup job won't run.
- If the Linux script succeeds, I want my Windows script to end normally with exit code 0, so my ARCServe job runs normally.
I've tried creating this batch file (let's call it HPC.bat):
echo ON
start /wait "C:\Program Files\PUTTY\plink.exe" -v -l root -i "C:\IST\admin\scripts\HPC\pri.ppk" [cluster_name] /appli/admin/backup_admin
exit %errorlevel%
If I manually launch this .bat by double-clicking on it, or launching it in a command prompt under Windows, it executes normally and then ends.
If I have ARCServe launch it, the script never seems to end.
My job stays in "waiting" status; it seems the exit code of the Linux script isn't returned to my batch file, and the batch file never closes.
My guess is that plink just opens the connection to the Linux machine, sends the script execution command, and then closes the connection, so the exit code can't be returned to the batch file. Am I right?
Is what I want to do possible, or am I attempting something impossible?
Do I have to proceed differently?
Do I have to use PuTTY or Cygwin instead of plink?
Please help; it's giving me headaches...
If you install Cygwin, you can do it exactly as you would from Linux to Linux, i.e. remotely run a command with ssh someuser@remoteserver.com somecommand
This command returns on the calling client with the same return code that the command exited with on the remote end. If you use shared SSH keys for authentication instead of passwords, it can also be scripted without user interaction.
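For example, HPC.bat could become something like this (the Cygwin install path and the converted key name are assumptions; the .ppk key must first be exported to OpenSSH format with puttygen):

@echo on
rem ssh exits with the remote script's exit code, which we hand back to ARCServe:
C:\cygwin64\bin\ssh.exe -i C:\IST\admin\scripts\HPC\pri_openssh root@[cluster_name] /appli/admin/backup_admin
exit /b %ERRORLEVEL%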