Autostart `slurmd` service on computes after reboot - systemd

I am calling scontrol reboot <nodename> to reboot compute nodes in my SLURM cluster.
The reboot usually times out (as seen from SLURM) and the node is set to state "DOWN".
(RESUME_TIMEOUT is set to 300.)
This presumably happens because the slurmd service does not start automatically after boot.
By default, the service is "disabled":
[root@c1 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Enabling it with systemctl enable slurmd does not persist across the next reboot; the service is "disabled" again afterwards.
I assume this is because the change is not made in the image used for booting.
How can I enable the slurmd service on the computes so that it starts on boot and scontrol reboot works?

This is probably not the recommended way, but I set up a mini cluster at work and the way I fixed it was with a cron job:
@reboot /usr/bin/scontrol update nodename=[put hostname here] state=resume
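For what it's worth, one way to install that entry is to append it to root's crontab on each compute node; the $(hostname -s) substitution for the hard-coded hostname is my own tweak, and it assumes scontrol can reach the controller from the node:
# Append the @reboot job to root's existing crontab (run as root on the compute node).
(crontab -l 2>/dev/null; echo '@reboot /usr/bin/scontrol update nodename=$(hostname -s) state=resume') | crontab -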

I got a reply from Antanas Budriūnas via the OpenHPC mailing list which solved the issue.
(execute on master node)
# chroot /<path>/<to>/<cnode>/<image>
# systemctl enable slurmd
# exit
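The same thing should also work non-interactively, if that is more convenient (keeping the placeholder path from above):
# chroot /<path>/<to>/<cnode>/<image> systemctl enable slurmd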

Related

What is the proper way to shut down ICp?

I have an ICp installation on some bare metal that I use to educate myself, so I don't need to keep it running all the time. What is the proper way to shut it down while I am not using it? I have two physical nodes, master and worker. Currently I just ssh into each and issue a sudo shutdown now command.
When I bring the cluster back online later, I can't get to the admin UI. It responds with a 502 bad gateway error. When I load https://master:9443 I get the Welcome to Liberty page (indicating that at least the web server is running).
If you stop docker containers or the docker runtime, then the kubelet will attempt to restart them.
If you want to shut down the system, you must stop the kubelet on each node. On Ubuntu, you would use systemctl:
sudo systemctl stop kubelet
sudo systemctl stop docker
Confirm that all processes are shut down:
top
And that all related network ports are no longer in use:
netstat -antp
(Note that netstat's "-p" option requires root privileges to inspect the pid holding onto the port).
To restart the cluster, start docker and then the kubelet. Again for Ubuntu:
sudo systemctl start docker
sudo systemctl start kubelet
And of course you can follow the logs for the kubelet:
sudo journalctl -e -u kubelet
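Putting the stop/start sequence together, a rough sketch that could be run from one machine, using the two node names from the question and assuming passwordless SSH and sudo on both (neither assumption is part of the original setup):
# Shut down: stop kubelet first, then docker, on each node.
for node in master worker; do
  ssh "$node" 'sudo systemctl stop kubelet && sudo systemctl stop docker'
done
# Later, bring the cluster back: start docker first, then kubelet.
for node in master worker; do
  ssh "$node" 'sudo systemctl start docker && sudo systemctl start kubelet'
done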
Stop Docker to shut it down; I hope this helps.
systemctl stop docker

ntpd service in a docker container is dead, cannot restart

I'm trying to set up a local Hadoop cluster using Docker and Ambari. The problem I'm having is that the Ambari install check shows NTP is not running, and it is needed to know whether the services installed with Ambari are working. I checked ntpd in the containers and tried to launch it, but it failed:
[root@97ea7075ca78 ~]# service ntpd start
Starting ntpd: [ OK ]
[root@97ea7075ca78 ~]# service ntpd status
ntpd dead but pid file exists
Is there a way to start ntp daemon in those containers?
In Docker you don't use the service command, as there is no init system. Just run the ntpd command and it should work.
ntpd goes to the background by default; if that were not the case, you would need to use ntpd &.
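A small sketch of what that can look like; the -g (allow a large initial correction) and -n (stay in the foreground) flags are standard ntpd options, and note that setting the clock from inside a container generally needs the SYS_TIME capability:
# Started interactively inside an already-running container: let ntpd daemonize itself.
ntpd -g
# Or, as the container's main (foreground) process, e.g. from a Dockerfile CMD:
ntpd -g -n
# Either way the container needs permission to adjust the clock, e.g.
# docker run --cap-add SYS_TIME ...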

ZooKeeper automatic start fails when the firewall is stopped in the rc.local script

I am using Apache Hadoop 2.7.1 on CentOS 7.
My cluster is an HA cluster, and I am using a ZooKeeper quorum for automatic failover.
I want to automate the ZooKeeper start process, and of course the shell script has to stop the firewall first so that the other quorum members can contact the current ZooKeeper member.
I am writing the following script in /etc/rc.d/rc.local:
hostname jn1
systemctl stop firewalld
ZOOKEEPER='/usr/local/zookeeper-3.4.9/'
source /etc/rc.d/init.d/functions
source $ZOOKEEPER/bin/zkEnv.sh
daemon --user root $ZOOKEEPER/bin/zkServer.sh start
The problem I am facing is that when I issue the command
systemctl stop firewalld
in rc.local and then run zkServer.sh status after the host boots, I get the error:
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper-3.4.9/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
But if I execute the same commands without a script, i.e. manually after the host boots:
systemctl stop firewalld
zkServer.sh start
there is no problem, and zkServer.sh status shows the node's mode.
I have noticed a difference in the zookeeper.out log between running the rc.local script and running the commands manually after boot: when the commands are run manually, the server environment is read.
What could be the effect of stopping the firewall from the rc.local script on the server environment, and how do I handle it?
I had a big headache with the stop-and-restart-the-firewall scenarios, and I discovered that stopping the firewall from rc.local does not really stop it.
Since I don't want the firewall to run at all, I ended up with the following solution:
systemctl disable firewalld
https://www.rootusers.com/how-to-disable-the-firewall-in-centos-7-linux/
With that, the firewall no longer starts on any boot.
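A short sketch combining that with stopping the service immediately; the mask step is an optional extra beyond the original answer:
# Stop firewalld now and keep it from starting on any future boot.
systemctl stop firewalld
systemctl disable firewalld
# Optional: mask the unit so nothing can start it as a dependency.
systemctl mask firewalld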

How do I restart Hadoop services on a Dataproc cluster

I may be searching with the wrong terms, but google is not telling me how to do this. The question is how can I restart hadoop services on Dataproc after changing some configuration files (yarn properties, etc)?
Services have to be restarted in a specific order throughout the cluster. There must be scripts or tools out there, hopefully in the Dataproc installation, that I can invoke to restart the cluster.
Configuring properties is a common and well supported use case.
You can do this via cluster properties, no daemon restart required. Example:
gcloud dataproc clusters create my-cluster --properties yarn:yarn.resourcemanager.client.thread-count=100
If you're doing something more advanced, like updating service log levels, then you can use systemctl to restart services.
First, SSH to a cluster node and run systemctl to see the list of available services. For example, to restart the HDFS NameNode, run sudo systemctl restart hadoop-hdfs-namenode.service.
If this is part of an initialization action, then sudo is not needed.
On master nodes:
sudo systemctl restart hadoop-yarn-resourcemanager.service
sudo systemctl restart hadoop-hdfs-namenode.service
On worker nodes:
sudo systemctl restart hadoop-yarn-nodemanager.service
sudo systemctl restart hadoop-hdfs-datanode.service
After that, you can use systemctl status <name> to check the service status, also check logs in /var/log/hadoop.
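If the same restart has to be applied to several workers, a rough sketch from a machine with gcloud configured could look like the following; the <cluster>-w-N instance names follow Dataproc's default naming, and the cluster name, worker count, and zone are assumptions for illustration:
# Hypothetical loop over two worker VMs of a cluster named my-cluster.
CLUSTER=my-cluster
ZONE=us-central1-a   # assumed zone
for node in "${CLUSTER}-w-0" "${CLUSTER}-w-1"; do
  gcloud compute ssh "$node" --zone "$ZONE" \
    --command 'sudo systemctl restart hadoop-yarn-nodemanager.service hadoop-hdfs-datanode.service'
done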

Why does the clock offset error on the host keep occurring again and again: Cloudera

I have stopped ntpd and restarted it again, and have run ntpdate pool.ntp.org. The error went away once and the hosts were healthy, but after some time I got the clock offset error again.
I also observed that after running ntpdate, the Cloudera web interface stopped working; it reports a potential configuration mismatch and asks me to fix and restart Hue.
I have the Cloudera QuickStart VM with CentOS set up on VMware.
Check that the /etc/ntp.conf file is the same across all nodes/masters,
restart ntpd,
and add the daemon with chkconfig, set to on (see the sketch below).
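A minimal sketch of those last two steps on a CentOS host (chkconfig covers SysV-managed services; on a systemd-only setup, systemctl enable ntpd would be the equivalent):
# Restart ntpd and make sure it comes back after a reboot.
service ntpd restart
chkconfig ntpd on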
You can fix it by restarting the NTP service, which synchronizes the time with a central source.
You can do this by logging in as root on the command line and running service ntpd restart.
After about a minute the error in CM should go away.
Host Terminal
sudo su
service ntpd restart
A clock offset error occurs in Cloudera Manager when the host/node's NTP service could not be located or did not respond to a request for the clock offset.
Solution:
1) Identify the NTP server IP (or get the NTP server details) for your Hadoop cluster.
2) On your Hadoop cluster nodes, edit /etc/ntp.conf.
3) Add entries to ntp.conf:
server [NTP Server IP]
server xxx.xx.xx.x
4) Restart the service. Execute:
service ntpd restart
5) Restart the cluster from Cloudera Manager.
Note: if the problem still persists, reboot your Hadoop nodes and check the processes.
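As an illustration only, the ntp.conf entries might look like the following, using the public CentOS pool as a stand-in for your site's NTP server, followed by a restart and a peer check:
# /etc/ntp.conf (example entries; replace with your NTP server's address)
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
# then restart and verify that ntpd is reaching its servers
service ntpd restart
ntpq -p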
Check with $ cat /etc/ntp.conf and make sure the configuration file is the same as on the other nodes, then:
$ systemctl restart ntpd
$ ntpdc -np
$ ntpdate -u 0.centos.pool.ntp.org
$ hwclock --systohc
$ systemctl restart cloudera-scm-agent
After that, wait a few seconds to let it configure itself automatically.
