slurm - action_unknown in pam_slurm_adopt - cluster-computing

What does "source job" refer to in the description of action_unknown?
action_unknown
The action to perform when the user has multiple jobs on the node
and the RPC does not locate the **source job**. If the RPC mechanism works
properly in your environment, this option will likely be relevant only
when connecting from a login node. Configurable values are:
newest (default)
Pick the newest job on the node. The "newest" job is chosen based
on the mtime of the job's step_extern cgroup; asking Slurm would
require an RPC to the controller. Thus, the memory cgroup must be in
use so that the code can check mtimes of cgroup directories. The user
can ssh in but may be adopted into a job that exits earlier than the
job they intended to check on. The ssh connection will at least be
subject to appropriate limits and the user can be informed of better
ways to accomplish their objectives if this becomes a problem.
allow
Let the connection through without adoption.
deny
Deny the connection.
https://slurm.schedmd.com/pam_slurm_adopt.html

pam_slurm_adopt tries to capture an incoming SSH session into the cgroup corresponding to the job currently running on the host. The action_unknown option decides what to do when there are several jobs running on the node for the user who initiates the ssh command and the source job cannot be located.
The 'source job' is the job ID of the process that initiates the ssh call. Typically, if you use an interactive ssh session from the frontend, there is no 'source job'; but if the ssh command is run from within a submission script, then the 'source job' is the one corresponding to that submission script.
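For context, action_unknown is set on the module entry in the PAM stack; a minimal sketch of an sshd entry follows (the file location and policy choice are site-specific assumptions):
# /etc/pam.d/sshd (typical location; order within the account stack matters)
account    required    pam_slurm_adopt.so action_unknown=newest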

Related

Is it possible to execute post-script after slurm job execution?

Is it possible to tell Slurm that it must execute a specific script, for example post-script.py, after the submitted task has completed?
Not submit a new task, just run it on the login node.
Something like...
#SBATCH --at-end-run="bash post-script.sh"
Or is the only option to check every N minutes whether the task has completed?
The short answer is that there is no such option in Slurm.
If post-script.sh can run on a compute node, the best option would be:
if it is short: to add it at the end of the job submission script;
if it is long: to submit it in its own job and use the --dependency option to make it start at the end of the first job (see the sketch below).
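A minimal sketch of the dependency approach (the script names are assumptions):
# Submit the main job and capture its job ID.
JOBID=$(sbatch --parsable main-job.sh)
# Run post-script.sh in its own job, started only once the first job has finished (any exit state).
sbatch --dependency=afterany:"$JOBID" --wrap="bash post-script.sh"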
If you have root privileges, you can use strigger to run post-script.sh after the job has completed. That would run on the slurmctld server.
If post-script.sh must run on the login node, for external network access for instance, then the options mentioned first would still work if you are able/allowed to SSH from a compute node to a login node. This is sometimes prevented/forbidden, but if not, you can run ssh login.node bash post-script.sh at the end of the submission script, or in a job of its own.
If that is not a possibility, then "busy polling" is indeed needed. You can do it in a Bash loop (see the sketch below), making sure not to put too large a burden on the Slurm controller (every 5 minutes is OK; every 5 seconds is useless and harmful to the system).
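A minimal polling sketch to run on the login node (the job ID, script name, and 5-minute interval are assumptions):
JOBID=123456   # hypothetical job ID returned by sbatch
# Poll every 5 minutes until the job is no longer listed by squeue.
while squeue -h -j "$JOBID" 2>/dev/null | grep -q .; do
    sleep 300
done
bash post-script.sh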
You can also use a dedicated workflow management tool such as Maestro that will allow you to define a job and a dependent task to run on the login node.
See some general information about workflows on HPC systems here.

deny parallel ssh connection to server for specific host / IP

I have a bot machine (controlled via a mobile device) which connects to the server and fetches information from it by means of ssh, shell scripts, OS commands, SQL queries, etc., then feeds that information over the (private) internet.
I want to disallow multiple parallel connections to the server from the bot machine ONLY; there are other machines that connect to the server which must not be affected.
Suppose
Client A, from his mobile, accesses the bot machine (via a webpage), then the bot machine connects to the server (1st session). If this connection takes 5 minutes, during that period the bot machine will be creating, querying, deleting, appending, updating, etc.
In the meantime (suppose 2 minutes after the 1st session started), Client B from his mobile accesses the bot machine (via the webpage), then the bot machine connects to the server (2nd session); this will conflict with the 1st session and create havoc...
Limitation
First of all, I do not want to edit any setting on the SERVER WHATSOEVER.
I do not want to edit the webpage/mobile app, etc.
I already know about the lock-file method for parallel shell scripts, and it is implemented at the script level, but what about OS commands and the like that are not in a bash script?
My Thought
What I thought was: whenever we create a connection with the server, it creates a process (SSH) which is viewable in ps -fu OSUSER, so by applying a unique id/tag/name to our connection we can identify whether a session is already active. This would be checked as soon as the bot connects to the server. But I do not know how to do that... Please also suggest any further information on this.
Also, is there a way to identify whether an existing process is hung, or when it started and how long it has been running?
Maybe try using limits.conf to enforce a hard limit of 1 login for the user/group.
You might need a periodic cron job to check for and remove any stale logins.
Locks/mutexes are hard to get right and add complexity. Limits.conf is a standard feature of most unix/linux systems and should be more reliable, emphasis on should...
A similar question was raised here:
https://unix.stackexchange.com/questions/127077/number-of-ssh-connections-on-a-single-linux-machine
Details here:
http://linux.die.net/man/5/limits.conf
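A minimal sketch of the limits.conf approach suggested above (the botuser account name is an assumption):
# /etc/security/limits.conf -- cap the bot account at a single concurrent login
botuser    hard    maxlogins    1
This takes effect only if pam_limits is enabled for SSH logins, which it is by default on most distributions.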
I assume you have a single login for the ssh account and that this runs a script on login
Add something like this to the script at login
#!/bin/bash
LOCK_FILE="/tmp/sshlock"
# Remove the lock if this session is interrupted.
trap 'rm -f "$LOCK_FILE"; exit' SIGHUP SIGINT SIGTERM
# If a lock exists and is less than 30 minutes old, another session is active: bail out.
if [ -e "$LOCK_FILE" ] && [ $(( $(date +%s) - $(stat -L --format %Y "$LOCK_FILE") )) -lt $((30*60)) ]; then
    exit 0
fi
touch "$LOCK_FILE"
When the processes that the ssh login calls end, delete the $LOCK_FILE
The trap statement is an important part of this way of locking; please do use it.
The "30*60" is a 30-minute timeout, thanks to the answer to this question: How can I tell if a file is older than 30 minutes from /bin/sh?
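As for the side question about when an existing session process started or how long it has been running, a minimal sketch using standard ps output options (the user name and filter are assumptions):
# Show start time, elapsed time and state for the bot user's ssh-related processes.
ps -o pid,lstart,etime,stat,cmd -u OSUSER | grep '[s]sh'
The STAT column shows the process state (for example D for uninterruptible sleep), which can hint at a stuck process, but it cannot prove that a process is hung.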

Rsync stop when failover

I have two cPanel servers (A->B) with failover configured in DNS Made Easy. Right now I have rsync set up to sync the /home/account folder every 4 hours from A->B.
So when A fails, B takes over, with up to a 4-hour backlog of data that exists only on server A.
My problem is that when A comes back to normal after a failure, the scheduled A->B rsync overwrites the newer data on B with stale data from A.
I'd like to know the best method to prevent the rsync from running after the first failover so that I can handle the rsync manually. I am thinking of a shell script that tries to access a text file on server A and, if that fails, stops the cron job from running.
Is this a good way to handle this, or is there an easier way?
Well, I have done something similar on a group of servers I have at the office. An overview of what I have found to work well is simply to run a cron script that keeps the status of each of the other servers in a temporary status file and the status is updated with calls to ping.
Specifically, the routine works by maintaining a list of hosts to be included in the check. Each host (except for the name matching the machine running the cron job) has a status file maintained in the /tmp directory called hoststatus.$HOSTNAME. Each status file contains either up or down (if the status file does not exist, it is created during the check process and assumed up). The status files themselves provide a local means for any script to check the status of each remote host before running.
The cron job that checks the status reads the status file for each remote host and feeds the status to a case statement. Where the status is up, a call is made to the remote host with ping -c1 hostname. If the ping succeeds, the script exits (the remote host is up). If the ping fails, the script waits 20 seconds (to ensure the remote isn't just rebooting, etc.) and checks again. If the second call succeeds, the status remains up and the script exits. If the second call to ping fails, the 20-second wait and retest repeat. If the third test fails, down is written to the status file and the remote host is considered down (a condensed sketch follows below).
Continuing in the case statement, if the initial status was down, a simple check is made with ping. If it succeeds, status is changed to up, if it fails, it remains down.
A log file is also kept that reflects each change of status to provide a running history of server availability.
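A condensed sketch of that ping-with-retries check (hostnames, file locations, and retry counts are assumptions; the case-statement handling of a previously-down host is omitted):
#!/bin/bash
HOST="hostA"                                        # hypothetical remote host
STATUS_FILE="/tmp/hoststatus.$HOST"
[ -e "$STATUS_FILE" ] || echo up > "$STATUS_FILE"   # assume up if there is no record yet
for attempt in 1 2 3; do
    if ping -c1 "$HOST" > /dev/null 2>&1; then
        echo up > "$STATUS_FILE"
        exit 0
    fi
    [ "$attempt" -lt 3 ] && sleep 20   # pause between retries (remote may be rebooting)
done
echo down > "$STATUS_FILE"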
Something similar would work for your case. If server A goes down, server B could write a simple marker file in a similar fashion, something like rsynchold.hostA, that is checked before rsync is run in either direction (A->B or B->A). This would allow you to intervene manually for the first rsync after a failure, at which time you could reset the rsynchold.hostA file.
This isn't elegant, but it has proven fairly foolproof over the past several years.
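And a minimal sketch of the guard around the scheduled rsync on B (the hold-file path, hostname, and rsync options are assumptions):
#!/bin/bash
HOLD_FILE="/tmp/rsynchold.hostA"    # written by B when A was last seen down
# Skip the automatic sync until the hold file is removed manually.
if [ -e "$HOLD_FILE" ]; then
    exit 0
fi
rsync -a hostA:/home/account/ /home/account/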

Open a JDBC connection in a specific AS400 subsystem

I have a web service that calls some stored procedures on an AS400 via JTOpen.
What I would like is for the connections used to call the stored procedures to be opened in a specific subsystem, with a specific user, instead of QUSRWRK/QUSER as now (the default).
I think I should be able to clone the QUSRWRK subsystem so that it starts with a specific user, but what I cannot figure out is the mechanism to open the connection in that specific subsystem.
I guess there should be a property at connection level to say subsystem=MySubsystem.
But unfortunately I haven't found that property.
Any hint would be appreciated.
Flavio
Let the system take care of which subsystem the database server job is started in.
You should just focus on the application (which is what IBM i excels in).
If need be, you can tweak subsystem parameters for QUSRWRK to improve performance by allocating memory, etc.
The system uses a pool of prestarted jobs as described in the FAQ: When I do WRKACTJOB, why is the host server job running under QUSER instead of the profile specified on the AS400 object?
To improve performance, the host server jobs are prestarted jobs running under QUSER. When the Toolbox connects to a host server job in order to perform an API call, run a command, etc, a request is sent from the Toolbox to an available prestarted job. This request includes the user profile specified on the AS400 object that represents the connection. The host server job receives the request and swaps to the specified user profile before it runs the request. The host server itself originally runs under the QUSER profile, so output from the WRKACTJOB command will show the job as being owned by QUSER. However, the job is in fact running under the profile specified on the request. To determine what profile is being used for any given host server job, you can do one of three things:
1. Display the job log for that job and find the message indicating which user profile is used as a result of the swap.
2. Work with the job and display job status attributes to view the current user profile.
3. Use Navigator for i to view all of the server jobs, which will list the current user of each job. You can also use Navigator for i to look at the server jobs being used by a particular user.

multiple commands in the same session with ssh

I am currently using the Trilead ssh2 library, and when I try to execute multiple commands (using execCommand) in the same session, I get "a remote execution has already started" error.
Just wanted to clarify: the session is limited to one system command; does that mean I can only send one command through execCommand() per connection? Is there any other alternative besides injecting multiple commands with semicolons?
