Setting up a remote environment that runs on a SLURM compute node - cluster-computing

I'm trying to set up a remote environment (PyCharm preferred, VSCode also okay) so I can run and debug all my experiments on a compute node in a SLURM-managed cluster. The issue is that I have to run srun from a login node (which I SSH into) in order to reserve resources and connect, and most times I do this I end up on a different machine. What I'd like is to be able to connect once via ssh and srun, and then tunnel everything into that job's hardware-limited context on the allocated node.
Some things I've tried:
Running an interactive bash session using srun, then attaching to that job ID using sattach. I think this will always just wait until the bash session has ended before running the command supplied to sattach.
sshing into a node already running my job. (Due to what I think is pam_slurm_adopt, I can't ssh into just any node, but I can ssh into one where my job is already running.) This works, but it gives me access to all GPUs and all hardware on the machine, and causes chaos once someone else joins the same node having reserved only one or two GPUs.
Everything in this thread https://github.com/microsoft/vscode-remote-release/issues/1722
An idea I had was to srun tmux instead of bash and then forward ports and attach to that tmux session somehow on connection via SSH to the login node, but I'm not entirely sure how that would work.
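Roughly, the sketch I have in mind looks like this (the partition, time limit, node name, and login host are all made up); I just don't know if this is the right way to wire it up:
# on the login node: reserve resources and start tmux inside the allocation
srun --partition=gpu --gres=gpu:1 --time=08:00:00 --pty tmux new -s dev
# from a second shell on the login node: find which node the job landed on
squeue -u $USER --format="%N"
# reattach to the session from any later SSH connection to the login node
ssh -t node042 tmux attach -t dev
# from the laptop, point the IDE at the compute node via the login node as a jump host
ssh -J login.example.org node042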

Related

Run batch file on windows ec2 instance then shutdown instance?

I have a .bat file on a Windows EC2 instance that I would like to run every day.
Is there any way to schedule the instance to run this file every day and then shut down the EC2 instance, without manually going to the EC2 management console and launching the instance?
There are two requirements here:
Start the instance each day at a particular time (this is an assumption I made based on your desire to shut down the instance each day, so something needs to turn it on)
Run the script and then shutdown
Option 1: Start & Stop
Amazon CloudWatch Events can perform a task on a given schedule, such as once-per-day. While it has many in-built capabilities, it cannot natively start an instance. Therefore, configure it to trigger an AWS Lambda function. The Lambda function can start the instance with a single API call.
When the instance starts up, use the normal Windows OS capabilities to run your desired program, e.g.: Automatically run program on Windows Server startup
When the program has finished running, it should issue a command to the Windows OS to shut down Windows. The benefit of doing it this way (instead of trying to schedule a shutdown) is that the program will run to completion before any shutdown is activated. Just be sure to configure the EC2 instance to Stop on Shutdown (which is the default behaviour).
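As a rough sketch of the start side, the Lambda function body amounts to the same single API call that this AWS CLI command makes (the instance ID is a placeholder):
# start the stopped instance on the daily schedule
aws ec2 start-instances --instance-ids i-0123456789abcdef0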
Option 2: Launch & Terminate
Instead of starting and stopping an instance, you could instead launch a new instance using an Amazon CloudWatch Events schedule.
Pass the desired PowerShell script to run in the instance's User Data. This script can install and run software.
When the script has finished, it should call the Windows OS command to shut down Windows. However, this time configure Terminate on Shutdown so that the instance is terminated (deleted). This is fine because the above schedule will launch a new instance next time.
The benefit of this method is that the software configuration, and what should be run each time, can be fully configured via the User Data script, rather than having to start the instance, login, change the scripts, then shutdown. There is no need to keep an instance around just to be Stopped for most of the day.
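A sketch of that launch with the AWS CLI (the scheduled rule would trigger the equivalent RunInstances API call; the AMI ID, instance type, and script file are placeholders):
# launch a fresh instance, pass the PowerShell script as User Data, and let it delete itself on shutdown
aws ec2 run-instances --image-id ami-0123456789abcdef0 --instance-type t3.medium \
    --user-data file://run-job.ps1 --instance-initiated-shutdown-behavior terminate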
Option 3: Rethink your plan and Go Serverless!
Instead of using an Amazon EC2 instance to run a script, investigate the ability to run an AWS Lambda function instead. The Lambda function might be able to do all the processing you desire, without having to launch/start/stop/terminate instances. It is also cheaper!
Some limitations might preclude this option (e.g. the maximum 5 minutes of run-time, the limit of 500MB disk space), but it should be the first option you explore rather than starting/stopping an Amazon EC2 instance.

spark submit on edge node

I am submitting my spark-submit command through my Edge Node. For this I am using client mode. I am accessing my Edge Node (which is on the same network as my cluster) through my laptop. I know that the driver program runs on my Edge Node; what I want to know is why my Spark job automatically suspends when I close my SSH session with the Edge Node. Does opening the Edge Node PuTTY connection through VPN/wireless internet have any effect on the Spark job vs. using an Ethernet cable from within the network? At present the spark-submit job is very slow even though the cluster is really powerful! Please help!
Thanks!
You are submitting the job with --master yarn but possibly you are not specifying --deploy-mode cluster, so the driver application (your Java code) is running locally on this edge node machine. When choosing --deploy-mode cluster the driver will run on your cluster and will overall be more robust.
The Spark job dies when you close the ssh connection because you're killing the driver when you do so; it is running in your terminal session. To avoid this you must send the command as a background job by adding & at the end of your spark-submit. For example:
spark-submit --master yarn --class foo bar zaz &
This will send the driver into the background, and the stdout will still be sent to your tty, polluting your session but not killing the process when you close the ssh connection.
If, however, you don't want it to be so polluted, you can send the stdout to /dev/null like this:
spark-submit --master yarn --class foo bar zaz &>/dev/null &
However, you won't know why things failed. You can also redirect the stdout to a file instead of /dev/null.
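For example, keeping a log file (adding nohup also guards against the shell sending SIGHUP to the job when the session closes):
nohup spark-submit --master yarn --class foo bar zaz > driver.log 2>&1 &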
Finally, once this is clear enough, I strongly recommend not deploying your Spark jobs like this, since the driver process on the edge node failing for any funky reason will kill the job running in the cluster. It also has a strange behavior: the job dying in the cluster (some runtime problem) will not stop or kill your driver on the edge node, which leads to a lot of wasted memory on that machine if you don't take care to manually kill all those old driver processes.
All of this is avoided by using the flag --deploy-mode cluster in your spark-submit.
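With the same placeholder class and arguments as above, that looks like:
spark-submit --master yarn --deploy-mode cluster --class foo bar zaz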

Terminate google cloud compute engine instance with shell/bash script

I am using Google Cloud Compute Engine for some computationally intense tasks (32 parallel processes). My tasks sometimes finish in the middle of the night, and I am wondering whether there is a way to stop the instance once all my processes stop. I would prefer to write a shell script that monitors all my processes and stops the instance when everything is finished.
halt, shutdown, or poweroff does not work for me, as my command only submits jobs. The command finishes immediately while all processes (computing tasks) keep running in the background. If I put halt or shutdown at the end of my command line, the instance simply shuts down as soon as I enter the command.
Take a look at How to automatically exit/stop the running instance.
To summarize, you can simply run halt or shutdown -h now. Once the operating system halts, the instance will terminate and you will no longer be charged.
Alternatively if you've started the instance with the appropriate permissions/scope you could issue the gcloud compute instances stop command:
https://cloud.google.com/sdk/gcloud/reference/compute/instances/stop
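For the original requirement (stopping only once all 32 processes have finished), a small wrapper along these lines should work, assuming the tasks can be launched from the same script; the task command, instance name, and zone are placeholders:
#!/bin/bash
# launch the 32 computing tasks in the background (replace with the real commands)
for i in $(seq 1 32); do
    ./my_task "$i" &
done
# wait blocks until every background job has exited
wait
# then stop the machine from inside the OS...
sudo shutdown -h now
# ...or, given the right API scope, via gcloud:
# gcloud compute instances stop my-instance --zone=us-central1-a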
I typed:
gcloud compute instances stop [virtual-machine-instance]
Thereafter, you specify the zone [zone] to confirm stopping the instance.

How to execute a shell script on all nodes of an EMR cluster?

Is there a proper way to execute a shell script on every node in a running EMR hadoop cluster?
Everything I look for brings up bootstrap actions, but that only applies to when the cluster is starting, not for a running cluster.
My application is using python, so my current guess is to use boto to list the IPs of each node in the cluster, then loop through each node and execute the shell script via ssh.
Is there a better way?
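Concretely, that guess would look something like this using the AWS CLI instead of boto (the cluster ID, key file, and script name are placeholders):
# list the private IPs of all running instances in the cluster, then run the script on each
for ip in $(aws emr list-instances --cluster-id j-XXXXXXXXXXXXX \
        --instance-states RUNNING \
        --query 'Instances[].PrivateIpAddress' --output text); do
    scp -i my-key.pem myscript.sh hadoop@"$ip":/tmp/
    ssh -i my-key.pem hadoop@"$ip" 'bash /tmp/myscript.sh'
done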
If your cluster is already started, you should use steps.
The steps are executed after the cluster is started, so technically it appears to be what you are looking for.
Be careful: steps are executed only on the master node, so you need to connect to the rest of your nodes in some other way to modify them.
Steps are scripts as well, but they run only on machines in the Master-Instance group of the cluster. This mechanism allows applications like Zookeeper to configure the master instances and allows applications like Hbase and Apache Drill to configure themselves.
Reference
See this also.
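For completeness, adding a step to an already running cluster from the CLI looks roughly like this; note that it still only runs on the master node (the region in the script-runner path, the cluster ID, and the script location are placeholders):
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=CUSTOM_JAR,Name=RunScript,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://my-bucket/myscript.sh]'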

Running multiple mesos slaves locally

I'm trying to run a test cluster locally following this guide https://mesosphere.com/2014/07/07/installing-mesos-on-your-mac-with-homebrew/
Currently, I'm able to have a master running at localhost:5050 and a slave running at the default port 5051 (with slave id say S0). However, when I tried to start another slave at a different port, it re-registered itself as S0 and the master console only showed 1 activated slave. Does anybody know how would I start another slave S1? Thanks!
Did you specify another work_dir?
E.g.
sudo /usr/local/sbin/mesos-slave --master=localhost:5050 --port=5052 --work_dir=/tmp/mesos2
To explain a bit why this is needed and where the error you saw came from:
Mesos supports so-called slave recovery to help with upgrades and error recovery.
Therefore, when starting a slave, it will check its work_dir for a checkpoint and try to recover that state (i.e. reconnect to still-running executors).
In your case, as both slaves wanted to start from the same working directory, the second one tried to recover the checkpoint of the still-running first slave...
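So, as a sketch, two local workers would each get their own port and work_dir (the paths are arbitrary):
sudo /usr/local/sbin/mesos-slave --master=localhost:5050 --port=5051 --work_dir=/tmp/mesos1 &
sudo /usr/local/sbin/mesos-slave --master=localhost:5050 --port=5052 --work_dir=/tmp/mesos2 &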
P.S. I should probably replace all the above occurrences of slave with worker (https://issues.apache.org/jira/browse/MESOS-1478), but I hope this is easier to read.
