Slurmd remains inactive/failed on start

I currently have a cluster of 10 worker nodes and 1 master node managed by Slurm. I previously managed to set the cluster up successfully after some teething problems, and I put all my scripts and instructions in my GitHub repo (https://brettchapman.github.io/Nimbus_Cluster/). I recently needed to start over to increase hard drive space, and now I can't seem to install and configure it correctly no matter what I've tried.
Slurmctld and slurmdbd are installed and configured correctly (both show as active and running under systemctl status); however, slurmd remains in a failed/inactive state.
The following is my slurm.conf file:
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=node-0
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=600
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=password
#JobCompPort=
JobCompType=jobcomp/mysql
JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node-[1-10] NodeAddr=node-[1-10] CPUs=16 RealMemory=64323 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node-[1-10] Default=YES MaxTime=INFINITE State=UP
And the following is my slurmdbd.conf file:
AuthType=auth/munge
AuthInfo=/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=info
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=password
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm
Running pdsh -a sudo systemctl status slurmd on my compute nodes gives me the following error:
pdsh@node-0: node-5: ssh exited with exit code 3
node-6: ● slurmd.service - Slurm node daemon
node-6: Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
node-6: Active: inactive (dead) since Tue 2020-08-11 03:52:58 UTC; 2min 45s ago
node-6: Docs: man:slurmd(8)
node-6: Process: 9068 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
node-6: Main PID: 8983
node-6:
node-6: Aug 11 03:34:09 node-6 systemd[1]: Starting Slurm node daemon...
node-6: Aug 11 03:34:09 node-6 systemd[1]: slurmd.service: Supervising process 8983 which is not our child. We'll most likely not notice when it exits.
node-6: Aug 11 03:34:09 node-6 systemd[1]: Started Slurm node daemon.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Killing process 8983 (n/a) with signal SIGKILL.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Killing process 8983 (n/a) with signal SIGKILL.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Succeeded.
pdsh@node-0: node-6: ssh exited with exit code 3
I did not receive this type of error when I previously had the cluster up and running, so I'm unsure what I did differently this time. My guess is that it's something to do with file/folder permissions, as I have found those can be quite critical during setup; I may have missed documenting something I did previously. This is my second attempt at setting up a Slurm-managed cluster.
My entire workflow and scripts can be followed from my GitHub repo. If you need any other error outputs, please ask.
Thank you for any help you can provide.
Brett
EDIT:
Logging into one of the nodes, node-1, and running sudo slurmd -Dvvv, I get this:
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug: task/cgroup: now constraining jobs allocated cores
slurmd: debug: task/cgroup/memory: total:64323M allowed:100%(enforced), swap:0%(permissive), max:100%(64323M) max+swap:100%(128646M) min:30M kmem:100%(64323M permissive) min:30M swappiness:0(unset)
slurmd: debug: task/cgroup: now constraining jobs allocated memory
slurmd: debug: task/cgroup: unable to open /etc/slurm-llnl/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug: task/cgroup: now constraining jobs allocated devices
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: debug: /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
slurmd: Munge credential signature plugin loaded
slurmd: slurmd version 19.05.5 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: error: Error binding slurm stream socket: Address already in use
slurmd: error: Unable to bind listen port (*:6818): Address already in use
Logging into a different node, node-10, I get this:
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug: task/cgroup: now constraining jobs allocated cores
slurmd: debug: task/cgroup/memory: total:64323M allowed:100%(enforced), swap:0%(permissive), max:100%(64323M) max+swap:100%(128646M) min:30M kmem:100%(64323M permissive) min:30M swappiness:0(unset)
slurmd: debug: task/cgroup: now constraining jobs allocated memory
slurmd: debug: task/cgroup: unable to open /etc/slurm-llnl/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug: task/cgroup: now constraining jobs allocated devices
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: debug: /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
slurmd: Munge credential signature plugin loaded
slurmd: slurmd version 19.05.5 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: slurmd started on Tue, 11 Aug 2020 06:56:10 +0000
slurmd: CPUs=16 Boards=1 Sockets=1 Cores=8 Threads=2 Memory=64323 TmpDisk=297553 Uptime=756 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug: AcctGatherInterconnect NONE plugin loaded
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
On another node, node-5, I get the same output as on node-1:
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug: task/cgroup: now constraining jobs allocated cores
slurmd: debug: task/cgroup/memory: total:64323M allowed:100%(enforced), swap:0%(permissive), max:100%(64323M) max+swap:100%(128646M) min:30M kmem:100%(64323M permissive) min:30M swappiness:0(unset)
slurmd: debug: task/cgroup: now constraining jobs allocated memory
slurmd: debug: task/cgroup: unable to open /etc/slurm-llnl/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug: task/cgroup: now constraining jobs allocated devices
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: debug: /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
slurmd: Munge credential signature plugin loaded
slurmd: slurmd version 19.05.5 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: error: Error binding slurm stream socket: Address already in use
slurmd: error: Unable to bind listen port (*:6818): Address already in use
node-10 was down previously, and I struggled to bring it back up, so the error may be unrelated to the overall problem.
Edit2: After killing stuck slurmd processes across all nodes, slurmd still fails on start:
slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Tue 2020-08-11 07:10:42 UTC; 3min 58s ago
Docs: man:slurmd(8)
Aug 11 07:09:11 node-1 systemd[1]: Starting Slurm node daemon...
Aug 11 07:09:11 node-1 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not permitted
Aug 11 07:10:42 node-1 systemd[1]: slurmd.service: start operation timed out. Terminating.
Aug 11 07:10:42 node-1 systemd[1]: slurmd.service: Failed with result 'timeout'.
Aug 11 07:10:42 node-1 systemd[1]: Failed to start Slurm node daemon.
The sudo slurmd -Dvvv output on node-1:
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug: task/cgroup: now constraining jobs allocated cores
slurmd: debug: task/cgroup/memory: total:64323M allowed:100%(enforced), swap:0%(permissive), max:100%(64323M) max+swap:100%(128646M) min:30M kmem:100%(64323M permissive) min:30M swappiness:0(unset)
slurmd: debug: task/cgroup: now constraining jobs allocated memory
slurmd: debug: task/cgroup: unable to open /etc/slurm-llnl/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug: task/cgroup: now constraining jobs allocated devices
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: debug: /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
slurmd: Munge credential signature plugin loaded
slurmd: slurmd version 19.05.5 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: slurmd started on Tue, 11 Aug 2020 07:14:08 +0000
slurmd: CPUs=16 Boards=1 Sockets=1 Cores=8 Threads=2 Memory=64323 TmpDisk=297553 Uptime=15897 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug: AcctGatherInterconnect NONE plugin loaded
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
Edit3: I get these debug messages from the slurmd.log file, which seem to indicate the PID can't be retrieved and certain files/folders are not accessible:
[2020-08-11T07:38:27.973] slurmd version 19.05.5 started
[2020-08-11T07:38:27.973] debug: Job accounting gather NOT_INVOKED plugin loaded
[2020-08-11T07:38:27.973] debug: job_container none plugin loaded
[2020-08-11T07:38:27.973] debug: switch NONE plugin loaded
[2020-08-11T07:38:27.973] slurmd started on Tue, 11 Aug 2020 07:38:27 +0000
[2020-08-11T07:38:27.973] CPUs=16 Boards=1 Sockets=1 Cores=8 Threads=2 Memory=64323 TmpDisk=297553 Uptime=17357 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2020-08-11T07:38:27.973] debug: AcctGatherEnergy NONE plugin loaded
[2020-08-11T07:38:27.973] debug: AcctGatherProfile NONE plugin loaded
[2020-08-11T07:38:27.974] debug: AcctGatherInterconnect NONE plugin loaded
[2020-08-11T07:38:27.974] debug: AcctGatherFilesystem NONE plugin loaded
[2020-08-11T07:38:27.974] debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
[2020-08-11T07:38:27.975] debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2020-08-11T07:38:33.496] got shutdown request
[2020-08-11T07:38:33.496] all threads complete
[2020-08-11T07:38:33.496] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
[2020-08-11T07:38:33.496] debug2: xcgroup_get_pids: unable to get pids of '(null)'
[2020-08-11T07:38:33.496] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
[2020-08-11T07:38:33.496] debug2: xcgroup_get_pids: unable to get pids of '(null)'
[2020-08-11T07:38:33.497] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
[2020-08-11T07:38:33.497] debug2: xcgroup_get_pids: unable to get pids of '(null)'
[2020-08-11T07:38:33.497] Consumable Resources (CR) Node Selection plugin shutting down ...
[2020-08-11T07:38:33.497] Munge credential signature plugin unloaded
[2020-08-11T07:38:33.497] Slurmd shutdown completing
Edit4: slurmd is active, but only after running sudo service slurmd restart. Running stop and then start does not activate slurmd.
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2020-08-11 08:17:46 UTC; 1min 37s ago
Docs: man:slurmd(8)
Process: 28281 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 28474
Tasks: 0
Memory: 1.1M
CGroup: /system.slice/slurmd.service
Aug 11 08:17:46 node-1 systemd[1]: Starting Slurm node daemon...
Aug 11 08:17:46 node-1 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not permitted
Aug 11 08:17:46 node-1 systemd[1]: Started Slurm node daemon.
Aug 11 08:18:41 node-1 systemd[1]: slurmd.service: Supervising process 28474 which is not our child. We'll most likely not notice when it exits.
Edit5: Another issue, which may be related, is that sacct can only be run with sudo; otherwise it complains about permissions on the log file. I tried changing the permissions on /var/log, but that caused problems since it's a system folder:
ubuntu@node-0:/data/pangenome_cactus$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
/var/log/slurm_jobacct.log: Permission denied
ubuntu@node-0:/data/pangenome_cactus$ sudo sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2 cactus_pa+ debug (null) 0 FAILED 127:0
3 cactus_pa+ debug (null) 0 RUNNING 0:0
3.0 singulari+ (null) 0 RUNNING 0:0

I am going to add my own findings to this question, since I think my solution works. slurmd does not seem to care whether the PidFile path exists or not; however, when running as a daemon, it returns an error code if it fails to write to the given path. The service manager catches that error code and thinks the daemon failed to start, but in reality slurmd has already started. This is why you get the "Address already in use" error when trying to start it again. So the solution is to ensure the PidFile directory exists when the machine boots.
Solution #1
Don't create the PID file under /var/run; use some other directory that isn't reserved for root. If you want to use /var/run, go to solution #2.
Solution #2
/var/run is a temporary directory created in memory; it does not persist across reboots. Another problem is that /var/run belongs to root, not slurm, which is why slurmd has no right to write into it. So I would recommend creating /var/run/slurm and putting everything in there.
To solve this, we can take Munge as a reference. If you do ls -l /var/run/, you will notice /var/run/munge is owned by user munge and group munge, and munge is able to create the /var/run/munge directory on startup.
To create the directory under /var/run at boot, simply create a file at /usr/lib/tmpfiles.d/slurm.conf (again, this is how Munge does it; see /usr/lib/tmpfiles.d/munge.conf):
d /var/run/slurm 0755 slurm slurm -
d /var/log/slurm 0755 slurm slurm -
d /var/spool/slurm 0755 slurm slurm -
Then make sure your slurm.conf, slurmd.service, and slurmctld.service all point their PID files at the same location as above; a sketch is shown below.
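For example (a minimal sketch assuming the /var/run/slurm layout from the tmpfiles.d snippet above; adjust the paths to match your install):
# In slurm.conf:
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid

# In the [Service] section of slurmctld.service:
PIDFile=/var/run/slurm/slurmctld.pid

# In the [Service] section of slurmd.service:
PIDFile=/var/run/slurm/slurmd.pid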
That should do the trick. I also ran into another odd problem where the service would fail on start, so I had to add Restart=on-failure and RestartSec=5 to my services so that they eventually succeed (after about 10-20 s). This is not neat, but it does work; a sketch of the drop-in follows.
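As a sketch of that workaround (assuming you apply it as a systemd drop-in, e.g. /etc/systemd/system/slurmd.service.d/override.conf, followed by systemctl daemon-reload, rather than editing the packaged unit):
[Service]
Restart=on-failure
RestartSec=5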

The slurmd daemon logs got shutdown request, so it was terminated by systemd, probably because of Can't open PID file /run/slurmd.pid (yet?) after start. systemd only considers slurmd started successfully once the PID file /run/slurmd.pid exists, but the Slurm configuration states SlurmdPidFile=/var/run/slurmd.pid. Try changing it to SlurmdPidFile=/run/slurmd.pid so the two agree.
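A minimal sketch of the two settings that need to agree (the unit-file path is inferred from the systemd log messages above, which reference /run/slurmd.pid):
# /etc/slurm-llnl/slurm.conf
SlurmdPidFile=/run/slurmd.pid

# /lib/systemd/system/slurmd.service, [Service] section (left as shipped)
PIDFile=/run/slurmd.pid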

Related

Ansible Reload Cisco Devices Hangs

I have a task that reloads a number of Cisco routers and Cisco firewalls. For some reason the playbook always hangs at this task after it has already reloaded the first device. The reload command is actually sent to the second device and I can see the device restart, but the task eventually fails.
Output:
Task [Reload Host Device]
fatal: [firewall2]: FAILED! => {"changed": false, "msg": "command timeout triggered, timeout value is 30 secs. \nsee the timeout setting options in the Network Debug and Troubleshooting Guide."} ...ignoring
Task:
- name: Reload Host Device
  cli_command:
    command: "reload"
    prompt:
      - "Proceed with reload?"
    answer:
      - "y"
  ignore_errors: yes
According to the given description, task, and error message
command timeout triggered ... see the timeout setting options in the Network Debug and Troubleshooting Guide.
the device that Ansible connects to restarts faster than Ansible and the cli_command module can maintain their own connection; in other words, the session is closed and disconnected before the module returns.
According to the Command Reference and Cisco Router: Auto restart in x seconds, you may try:
- name: Schedule Host Device Reload in 1 min
  cli_command:
    command: "reload in 1"
    prompt:
      - "Proceed with reload?"
    answer:
      - "y"
  register: result

- name: Show result
  debug:
    var: result
Further Reading
Another approach, which was used in the past for Linux systems:
How to automate system reboots using the Ansible?

How to save ansible debug logs to a single file

Goal: check the status of the filebeat and telegraf services from Ansible on 20 production servers, so that I can get an alert if either service is stopped on any server.
---
- hosts: ALL
  tasks:
    - name: checking service status
      command: systemctl status "{{ item }}"
      with_items:
        - filebeat
        - telegraf
      register: result
      ignore_errors: yes

    - debug:
        var: result
I got the output below:
ok: [10.5.10.10] => {
"result.results[0].stdout": "* filebeat.service - Filebeat sends log files to Logstash or directly to Elasticsearch.\n Loaded: loaded (/usr/lib/systemd/system/filebeat.service; disabled; vendor preset: disabled)\n Active: active (running) since Tue 2019-08-06 11:07:34 IST; 3 weeks 6 days ago\n Docs: https://www.elastic.co/products/beats/filebeat\n Main PID: 102961 (filebeat)\n CGroup: /system.slice/filebeat.service\n `-102961 /usr/share/filebeat/bin/filebeat -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data /var/lib/filebeat -path.logs /var/log/filebeat\n\nWarning: Journal has been rotated since unit was started. Log output is incomplete or unavailable."
}
ok: [10.5.10.11] => {
"result.results[0].stdout": "* filebeat.service - Filebeat sends log files to Logstash or directly to Elasticsearch.\n Loaded: loaded (/usr/lib/systemd/system/filebeat.service; disabled; vendor preset: disabled)\n Active: inactive (dead)\n Docs: https://www.elastic.co/products/beats/filebeat"
}
How can I store these outputs in a file on my Ansible server, so that I can alert if any service is not running on any server?
Why not parse the output and flag a warning when one of them is not active?
Answering your question:
What I would do is save the output to a file and copy that file back to your Ansible server, then append all the results. See the sketch below.
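As a rough sketch of that idea (this variant writes the registered output from the status task straight to the control node with delegate_to, rather than fetching a remote file; the destination path is just an example):
# Append after the status-check task in the same play; assumes "result" was registered above
- name: Save service status output on the Ansible server
  copy:
    content: "{{ item.stdout | default('') }}"
    dest: "/tmp/service_status_{{ inventory_hostname }}_{{ item.item }}.log"
  delegate_to: localhost
  with_items: "{{ result.results }}"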

Ansible: Start service in next host after service finished starting on previous host

I have three hosts on which I want to start a service in a rolling fashion. Host 2 needs to wait for the service to finish starting on host 1, and host 3 needs to wait for the service on host 2.
Host 1 has finished starting the service when a line with an instruction like:
Starting listening for CQL clients
is written to a file.
How can I instruct Ansible (preferably with the service module) to start the service on the next host only when the service on the previous host has written that line to the file?
You'll probably need to break your playbook down a bit. For example:
Your restart.yml contents:
- service:
    name: foobar
    state: restarted

- wait_for:
    search_regex: "Starting listening for CQL clients"
    path: /tmp/foo
and then your main.yml contents:
- include_tasks: restart.yml
  with_items:
    - host1
    - host2
    - host3
https://docs.ansible.com/ansible/latest/modules/wait_for_module.html
It seems it's not possible to serialize at the task level, so I had to build a separate playbook just for starting the service and use serial: 1 in the playbook YAML.
My files now look like this:
roles/start-cluster/tasks/main.yaml
- name: Start Cassandra
  become: yes
  service:
    name: cassandra
    state: started

- name: Wait for node to join cluster
  wait_for:
    search_regex: "Starting listening for CQL clients"
    path: /var/log/cassandra/system.log
start-cluster.yaml
- hosts: all
  serial: 1
  gather_facts: False
  roles:
    - start-cluster

How to instruct Ansible for remote node command to find a text resource file on remote node

I am trying to upgrade some JBoss servers for my application running on remote nodes using Ansible. Through Ansible I can invoke a JBoss server start script, which has to upgrade and start my server on the remote node.
The problem is that the script internally takes a configuration property file as an argument, and that file resides on the remote server being upgraded (there are many such servers, and each has a different property file stored on the node itself, so I cannot keep these files locally on the Ansible controller machine). However, Ansible expects the resource file to be available on the Ansible controller (locally) and fails to do the upgrade.
Is there any way I can instruct Ansible to find the particular resource or file directly on the remote node, rather than finding it locally and then copying every resource to the remote node for execution?
Ansible Playbook file contents
---
- name: Upgrade Server
  hosts: remote_host
  connection: ssh
  vars:
    server_version: 188
    server_name: UpgradeTest
  tasks:
    - name: Start server
      shell: "{{ jboss_home }}/bin/startJBossServer.sh {{ server_name }} >/dev/null 2>&1 &"

    - name: Wait for port {{ server_http_port }} to come up
      wait_for: host="localhost" port="{{ server_http_port }}" delay=15 timeout=300 state=started

    - name: Test server is up and running
      action: uri url="http://localhost:{{ server_http_port }}/{{ server_name }}" return_content=yes timeout=90
      register: webpage
      until: webpage.status == 200
      retries: 25
      delay: 5
The file startJBossServer.sh contains the following command:
nohup "${JBOSS_HOME}/bin/standalone.sh" -Djboss.server.base.dir=${JBOSS_HOME}/${i_server_name} -b=0.0.0.0 -c=#fm.config.xml# -P=${start_server_properties_file} </dev/null > "/dev/null" 2>&1 &
As you can see, we need ${start_server_properties_file} for the -P argument, and it is only available on the remote node; however, Ansible expects the same resource to be available on the local machine and hence fails to run the command.

Initialize Couchbase with puppet

I set up a Vagrant box that runs my Couchbase DB. When creating the box, I want to initialize Couchbase with Puppet. When I run the following command (which initializes the Couchbase cluster) manually, it works.
vagrant@precise64:~$ /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost --cluster-init-username=Administrator --cluster-init-password=foobar --cluster-init-ramsize=256 -u Administrator -p foobar -d
INFO: running command: cluster-init
INFO: servers {'add': {}, 'failover': {}, 'remove': {}}
METHOD: POST
PARAMS: {'username': 'Administrator', 'password': 'foobar', 'port': 'SAME', 'initStatus': 'done'}
ENCODED_PARAMS: username=Administrator&password=foobar&port=SAME&initStatus=done
REST CMD: POST /settings/web
response.status: 200
METHOD: POST
PARAMS: {'memoryQuota': '256'}
ENCODED_PARAMS: memoryQuota=256
REST CMD: POST /pools/default
response.status: 200
SUCCESS: init localhost
vagrant@precise64:~$ echo $?
0
However, when I run the same command via Puppet, Puppet complains about a non-zero return value.
vagrant@precise64:~$ puppet apply --debug -e 'exec { "couchbase-init-cluster": command => "/opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost --cluster-init-username=administrator --cluster-init-password=foobar --cluster-init-ramsize=256 -u administrator -p foobar"}'
warning: Could not retrieve fact fqdn
debug: Creating default schedules
debug: Failed to load library 'selinux' for feature 'selinux'
debug: Failed to load library 'shadow' for feature 'libshadow'
debug: Failed to load library 'ldap' for feature 'ldap'
debug: /File[/home/vagrant/.puppet/var/state/state.yaml]: Autorequiring File[/home/vagrant/.puppet/var/state]
debug: /File[/home/vagrant/.puppet/var/log]: Autorequiring File[/home/vagrant/.puppet/var]
debug: /File[/home/vagrant/.puppet/var/state/last_run_report.yaml]: Autorequiring File[/home/vagrant/.puppet/var/state]
debug: /File[/home/vagrant/.puppet/var/state/graphs]: Autorequiring File[/home/vagrant/.puppet/var/state]
debug: /File[/home/vagrant/.puppet/var/run]: Autorequiring File[/home/vagrant/.puppet/var]
debug: /File[/home/vagrant/.puppet/ssl/private]: Autorequiring File[/home/vagrant/.puppet/ssl]
debug: /File[/home/vagrant/.puppet/ssl]: Autorequiring File[/home/vagrant/.puppet]
debug: /File[/home/vagrant/.puppet/var/facts]: Autorequiring File[/home/vagrant/.puppet/var]
debug: /File[/home/vagrant/.puppet/var/clientbucket]: Autorequiring File[/home/vagrant/.puppet/var]
debug: /File[/home/vagrant/.puppet/ssl/certificate_requests]: Autorequiring File[/home/vagrant/.puppet/ssl]
debug: /File[/home/vagrant/.puppet/var/state/last_run_summary.yaml]: Autorequiring File[/home/vagrant/.puppet/var/state]
debug: /File[/home/vagrant/.puppet/var/state]: Autorequiring File[/home/vagrant/.puppet/var]
debug: /File[/home/vagrant/.puppet/var/client_data]: Autorequiring File[/home/vagrant/.puppet/var]
debug: /File[/home/vagrant/.puppet/ssl/public_keys]: Autorequiring File[/home/vagrant/.puppet/ssl]
debug: /File[/home/vagrant/.puppet/var/lib]: Autorequiring File[/home/vagrant/.puppet/var]
debug: /File[/home/vagrant/.puppet/ssl/certs]: Autorequiring File[/home/vagrant/.puppet/ssl]
debug: /File[/home/vagrant/.puppet/var]: Autorequiring File[/home/vagrant/.puppet]
debug: /File[/home/vagrant/.puppet/var/client_yaml]: Autorequiring File[/home/vagrant/.puppet/var]
debug: /File[/home/vagrant/.puppet/ssl/private_keys]: Autorequiring File[/home/vagrant/.puppet/ssl]
debug: Finishing transaction 70097870601760
debug: Loaded state in 0.00 seconds
debug: Loaded state in 0.00 seconds
info: Applying configuration version '1387188181'
debug: /Schedule[daily]: Skipping device resources because running on a host
debug: /Schedule[monthly]: Skipping device resources because running on a host
debug: /Schedule[hourly]: Skipping device resources because running on a host
debug: Exec[couchbase-init-cluster](provider=posix): Executing '/opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost --cluster-init-username=administrator --cluster-init-password=foobar --cluster-init-ramsize=256 -u administrator -p foobar'
debug: Executing '/opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost --cluster-init-username=administrator --cluster-init-password=foobar --cluster-init-ramsize=256 -u administrator -p foobar'
err: /Stage[main]//Exec[couchbase-init-cluster]/returns: change from notrun to 0 failed: /opt/couchbase/bin/couchbase-cli cluster-init --cluster=localhost --cluster-init-username=administrator --cluster-init-password=foobar --cluster-init-ramsize=256 -u administrator -p foobar returned 2 instead of one of [0] at line 1
debug: /Schedule[never]: Skipping device resources because running on a host
debug: /Schedule[weekly]: Skipping device resources because running on a host
debug: /Schedule[puppet]: Skipping device resources because running on a host
debug: Finishing transaction 70097871491620
debug: Storing state
debug: Stored state in 0.01 seconds
notice: Finished catalog run in 0.63 seconds
debug: Finishing transaction 70097871014480
debug: Received report to process from precise64
debug: Processing report from precise64 with processor Puppet::Reports::Store
Any ideas how I can run that command with Puppet?
I believe puppet apply -e takes a Puppet expression, not an arbitrary shell expression. You probably want something like:
puppet apply -e 'exec { "couchbase-init": \
command => "/opt/couchbase/bin/couchbase-cli cluster-init <rest of options>" }'
I invite you to look at this blog post:
http://blog.couchbase.com/couchbase-cluster-minutes-vagrant-and-puppet
