In VS Code, how do I debug on a compute node of an SGE cluster?

I am trying to debug with VS Code on a compute node of an SGE cluster, reached via qrsh. However, every time I enter debug mode, it stays on the login node instead of the compute node.
Here's my dilemma:
First, I log in to the login node via Remote-SSH in VS Code.
Then I request a compute node with a GPU via a qrsh command.
The hostname of the assigned compute node can change each time.
If I hit the debug button to debug a Python program, it falls back to the login node.
I tried to google a solution; some people suggested using ProxyJump in ~/.ssh/config, but that seems unrealistic since the assigned compute node's hostname differs from run to run.
At the moment, I have to either print everything out or use pdb.set_trace(). Neither is convenient, because some of the Python programs are very large, and using the IDE's debugger is really helpful and efficient for understanding other people's code.
Is there any way to fix this? Most of the solutions I found are for Slurm clusters rather than SGE clusters.

You can use wildcards in ~/.ssh/config host definitions.
For example, if the compute nodes in your HPC are named compute-1, compute-2, compute-3, and so on, you can set up a single definition that covers all of them:
Host compute*
You don't give an example of how ProxyJump might be used here, so I can't test this, but hopefully it gets you one step closer to a solution.
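For instance, a minimal (untested) sketch, assuming the login node is reachable as login-node and that the compute nodes accept direct SSH sessions, which some SGE sites disallow; the hostnames and username are placeholders:

    Host login-node
        HostName login.mycluster.example
        User myuser

    Host compute*
        User myuser
        ProxyJump login-node

With something like this in place, once qrsh hands you a node such as compute-13, you can open a second Remote-SSH connection to compute-13 and the SSH client will hop through the login node automatically. Note that ProxyJump needs a reasonably recent OpenSSH client; older versions can use a ProxyCommand line instead.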

Related

How to fix a Hyper-V issue

While running vagrant up --provision, I get the following error:
One of the issues that I have occasionally struggled with in Hyper-V is that of time synchronization. Right now, for example, the time on my Windows 10 desktop that I am using to write this article is 9:37 p.m. However, the Hyper-V host that I am watching on another monitor is displaying a time of 9:33 p.m.
On the surface, a few minutes of clock skew might seem like a non-issue. However, there are several reasons why the differing clocks are problematic.
One such reason is that the Kerberos protocol, which is heavily used by the Windows operating system, is time-sensitive. If two systems' clocks are more than a few minutes out of sync with one another, it can cause Kerberos to stop working.
Another reason why clock skew is such a big deal is because multi-tier applications often span multiple Hyper-V hosts. A database server might, for instance, reside on one host, while a Web front end exists on another host, possibly even within a different host cluster.
Looking at start_vm.ps1, it seems to use the Start-VM cmdlet.
And since this seems to be Windows, the Event Viewer might tell you why it crashes.
I'd have a look at the event channels Hyper-V-Config or Hyper-V-Worker.
The Test-VHD cmdlet might also be worth a try.
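As a rough sketch (the VHD path is a placeholder), checking the disk and reading the worker channel from PowerShell might look like:

    # Validate the virtual disk file for corruption
    Test-VHD -Path 'C:\VMs\myvm.vhdx'
    # Show the most recent events from the Hyper-V worker channel
    Get-WinEvent -LogName 'Microsoft-Windows-Hyper-V-Worker-Admin' -MaxEvents 20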
I am not finding anything in the event log, and I do not understand it.

CloudianOS Aml v1.2.7: command line for reboot using the quick cache argument

I am new to CloudianOS Aml (I have only used the Glass '18 distribution before), and it seems much different from its other distributions.
Here's a line from the changelog of August 25th:
My application requires a system restart (after the user agrees to a prompt). However, when executing a general reboot (which is what power reboot does), some of my processes are started all over, without their cache and data, which prevents the app from knowing where it ended and where to proceed.
I did some research, and it seems that application cache and data (I didn't really understand when it applies to data) are saved in "quick cache storage", which isn't cleared after the reboot.
I tried all these:
I also tried executing those with QuickAccess. None worked.
Any help would be highly appreciated!
Quick storage is specified in one of two ways: just an uppercase "Q" in ccd, or index.storage.quick when using os.exec(params).
It is tricky to figure out, but I think the correct command is power restart -Q. Use a single dash to introduce a parameter, followed by the parameter's name, "Q".

Considerations when porting an MS VC++ program (single machine) to a Rocks cluster

I am trying to port an MS VC++ program to run on a Rocks cluster! I am not very good with Linux, but I am eager to learn, and I imagine porting it wouldn't be an impossible task for me. However, I do not understand how to take advantage of the cluster nodes, because the code seems to run only on the front-end server (obviously).
I have read a little about MPI, and it seems I should use MPI to communicate between nodes. The program is currently written such that a main thread synchronizes all worker threads; the main thread also receives commands to manipulate the simulation or query its state. If the simulation is properly set up, communication between executing threads can be minimized significantly. What I don't understand is: how do I start the process on the compute nodes, and how do I handle node failures? Is there anything else I should consider when porting my program to run on a cluster?
The first step is porting the threaded MS VC++ program to run on a single Linux machine.
Once you have gotten past that point, then modify your program to use MPI in addition to threads (or instead of threads). You can do this on a single computer as well.
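A minimal sketch of what that skeleton looks like in C++ with MPI (this is just the shape, not your program: rank 0 plays the role your coordinating main thread plays today):

    // Minimal MPI skeleton: rank 0 coordinates, the other ranks compute.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);               // start the MPI runtime
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); // this process's id
        MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of processes
        if (rank == 0) {
            // coordinator: receive commands, hand out work, gather results
            std::printf("coordinating %d workers\n", size - 1);
        } else {
            // worker: run this process's slice of the simulation
        }
        MPI_Finalize();                       // shut the runtime down cleanly
        return 0;
    }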
To run the program on multiple nodes of your cluster, you will need to submit it to whatever scheduling system your cluster uses. The exact command depends on the scheduling software of your Rocks cluster; ask your administrator. It may look something like mpirun -np 32 yourprogram.
Handling failures in the nodes is a broad question. Your first pass should probably just report the failure and then fail the program. If the program doesn't take too long to compute on the cluster, restarting it, adjusted for the failed node, may be good enough. Beyond that, your application can write intermediate information to disk so it can resume where it left off. This is called checkpointing your application: when a node fails, the job fails, but restarting the job doesn't start from the beginning. Much more advanced would be actually detecting node failures and rescheduling the work unit that was on the failed node; this assumes the work unit has no non-idempotent side effects. That sort of thing gets really complicated. Checkpointing is likely good enough.
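A bare-bones illustration of the checkpointing idea (the file name and state layout are invented for the example; a real simulation would persist its own state, and ideally write to a temporary file and rename it so a crash mid-write can't corrupt the checkpoint):

    // Persist simulation state so a restarted job can resume mid-run.
    #include <cstdio>

    struct State { long step; double data[1024]; }; // illustrative only

    bool save_checkpoint(const State& s) {
        std::FILE* f = std::fopen("checkpoint.bin", "wb");
        if (!f) return false;
        bool ok = std::fwrite(&s, sizeof(State), 1, f) == 1;
        std::fclose(f);
        return ok;
    }

    bool load_checkpoint(State& s) {
        std::FILE* f = std::fopen("checkpoint.bin", "rb");
        if (!f) return false;                  // no checkpoint: start fresh
        bool ok = std::fread(&s, sizeof(State), 1, f) == 1;
        std::fclose(f);
        return ok;
    }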

MPI application on HPC doesn't communicate between the nodes

I'm trying to do some parallel programming on an HPC cluster running Windows Server 2008. I'm trying to use MPI, and particularly MPI.NET, an implementation of MPI in C#. Right now I'm following the tutorial to understand the library.
I'm trying to run the pingpong.exe program that comes with the SDK. It works fine on the HPC if the processes are on the same node (so it has the same utility as a threading system), but as soon as it spans more than one node, it doesn't work.
What am I missing ?
Thanks.
OK, I've been able to understand why it wasn't working: the firewall was blocking the messages between the nodes (source). It's possible to change the firewall so it doesn't block anything inside the cluster.
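For example (untested; the rule name and program path are placeholders), an inbound rule for the MPI processes can be added on Server 2008 with:

    netsh advfirewall firewall add rule name="MPI pingpong" dir=in action=allow program="C:\path\to\pingpong.exe" enable=yes profile=domain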
Hope it helps the next person.
Thanks Mark and IronMan.

Jenkins/Hudson - Run script on all slaves

I have a requirement to run a script on all available slave machines, primarily so they get relevant Windows hotfixes and new third-party tools before building.
The script I have can be run multiple times without undesirable side effects and is quite lightweight, so I'm happy for this to be brute force if necessary.
Can anybody give suggestions as to how to ensure that a slave is 'up-to-date' before it works on a job?
I'm happy with solutions that are driven by a job on the master, or ones which can inject the task (automatically) before normal slave job processing.
My shop does this as part of the slave launch process. We have the slaves configured to launch via execution of a command on the master; that command runs a shell script that rsyncs the latest tool files to the slave and then launches the slave process. When there is a tool update, all we need to do is restart the slaves or the master.
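A rough sketch of what such a master-side launch script can look like on Linux (the paths, hostname variable, and agent jar location are placeholders):

    #!/bin/sh
    # Sync the tool directory to the slave, then start the agent process there.
    rsync -az /opt/buildtools/ "$SLAVE_HOST":/opt/buildtools/
    ssh "$SLAVE_HOST" java -jar /opt/jenkins/slave.jar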
However - we use Linux whereas it looks like you are on Windows, so I'm not sure what the equivalent solution would be for you.
To your title: either use Parameter Plugin or use matrix configuration and list your nodes in it.
To your question about ensuring a slave is reliable: we mark it with a 'testbox' label and try out a variety of jobs on it. You could also have a job that is deployed to all of them and have the job take the machine offline if it fails, I imagine.
Using Windows for slaves is very obnoxious for us too :(
