How to force move cluster resource when maintenance mode on - cluster-computing

I have pacemaker cluster that is in maintence mode because some maintenance is needed to be done manually on some of the resources. One of the resource is a virtual IP that is currently running on node1. How can I force to move this resource over to another node even though maintenance mode is on.
Tried
pcs resource move virtualip node2
pcs resource ban virtualip node1
I can always manually remove the virtual IP from node1 and add it to node2 but then when I enable the cluster again it first brings the virtualIP back to node1 and then again back again to node2. So technically after a few seconds I end up with the desired situation but I want to avoid this brief outage. I thought if I moved fhe resource through cluster somehow it will understand that the IP should be on node2 and wouldnt want to try to "fix" it.

Related

pacemaker pending tasks list

Does anyone know how to get list of pending tasks in linux HA cluster with pacemaker/openais?
Suppose I have a 2-node cluster with both nodes in online state. I add nodeB into standby state and then stop the pacemaker/openais services on it.
Then, I accidentally executed using crm: "crm node online nodeB".
As soon as pacemaker/openais service is started on nodeB, the node instead of remaining in standby state, changes its state to online.
I want to know if we can view such pending actions/tasks and is there a way to undo/remove them?

Do EC2 system reboot events cause EMR to wait for the node or to replace it?

I have a long running EMR cluster. I received EC2 event notifications of upcoming system reboots. The help document advises that even rebooting these manually will not reschedule this, though stopping and starting the instances might.
The EMR cluster claims if a core node goes unresponsive it will provision a new one. I suspect this provisioning takes longer than a reboot, so what I cannot find in the documentation is whether the EC2 event is known to EMR and the cluster will wait for it's missing core nodes (or task nodes) to reboot and rejoin, or whether EMR will respond as though these instances disappeared un-expectantly, and thus will start provisioning new replacements even as the nodes come back and rejoin the cluster.
Does anyone know which it will be?
It turns out the AWS service person operating the HW replacement and rebooting the instance was tasked with doing the correct adjustments in EMR for changing the instance. They started by adding a node, then draining the old node of tasks. Then they rebooted the node and it was reattached to EMR. Then they drained the added node and shut that down.
I'm not sure this is happening every time there's a reboot event though. It seems like the script of service steps is modified for different types of case.

Changing FQDN of nodes in hadoop cluster

I would like to change the DNS of the nodes added to my hadoop cluster.
FOr example, FQDN of a node in my cluster is hadoop1.dev.com and I would like to change it to hadoop1.abc.xyz
COuld someone suggest me the process to change it without effecting my cluster data.
Update your /etc/hosts file like below, then restart network service to take effect
x.x.x.x hadoop1.dev.com hadoop1.abc.xyz

Dynamic IPs for Hadoop cluster

I need to setup a multi-node Hadoop cluster. So far, I have done installations using static IP addresses for each of the cluster nodes. However, in my latest cluster, I need to work with DHCP assigned nodes. So I am wondering, how should I get the cluster working and survive restarts etc.
Is it mandatory to have static IP address for the cluster nodes or can we get it working with dynamic IPs as well?
Any expert guidance please.
For standalone and pseudo-distributed modes, you can get going on dynamic IP address for it runs on a single node.
For fully distributed mode, the nodes are identified with the file masters and slaves located in 'HADOOP_HOME/conf'. These names are hosts which have been described in '/etc/hosts'. So, when IP of any node changes, Hadoop cannot identify the machines or nodes or hosts (even if identified, these nodes have no Hadoop configured). Thus, you cannot achieve the fully distributed Hadoop setup.
Get your DHCP configured on a router if you can. Else install DHCP on all of the nodes. And get going!!

Understand Spark: Cluster Manager, Master and Driver nodes

Having read this question, I would like to ask additional questions:
The Cluster Manager is a long-running service, on which node it is running?
Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
In case where the Driver node fails, who is responsible of re-launching the application? and what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
Similarly to the previous question: In case where the Master node fails, what will happen exactly and who is responsible of recovering from the failure?
1. The Cluster Manager is a long-running service, on which node it is running?
Cluster Manager is Master process in Spark standalone mode. It can be started anywhere by doing ./sbin/start-master.sh, in YARN it would be Resource Manager.
2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
Master is per cluster, and Driver is per application. For standalone/yarn clusters, Spark currently supports two deploy modes.
In client mode, the driver is launched in the same process as the client that submits the application.
In cluster mode, however, for standalone, the driver is launched from one of the Worker & for yarn, it is launched inside application master node and the client process exits as soon as it fulfils its responsibility of submitting the application without waiting for the app to finish.
If an application submitted with --deploy-mode client in Master node, both Master and Driver will be on the same node. check deployment of Spark application over YARN
3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If the driver fails, all executors tasks will be killed for that submitted/triggered spark application.
4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
Master node failures are handled in two ways.
Standby Masters with ZooKeeper:
Utilizing ZooKeeper to provide leader election and some state storage,
you can launch multiple Masters in your cluster connected to the same
ZooKeeper instance. One will be elected “leader” and the others will
remain in standby mode. If the current leader dies, another Master
will be elected, recover the old Master’s state, and then resume
scheduling. The entire recovery process (from the time the first
leader goes down) should take between 1 and 2 minutes. Note that this
delay only affects scheduling new applications – applications that
were already running during Master failover are unaffected. check here
for configurations
Single-Node Recovery with Local File System:
ZooKeeper is the best way to go for production-level high
availability, but if you want to be able to restart the Master if
it goes down, FILESYSTEM mode can take care of it. When applications
and Workers register, they have enough state written to the provided
directory so that they can be recovered upon a restart of the Master
process. check here for conf and more details
The Cluster Manager is a long-running service, on which node it is running?
A cluster manager is just a manager of resources, i.e. CPUs and RAM, that SchedulerBackends use to launch tasks.
A cluster manager does nothing more to Apache Spark, but offering resources, and once Spark executors launch, they directly communicate with the driver to run tasks.
You can start a standalone master server by executing:
./sbin/start-master.sh
Can be started anywhere.
To run an application on the Spark cluster
./bin/spark-shell --master spark://IP:PORT
Is it possible that the Master and the Driver nodes will be the same machine?
I presume that there should be a rule somewhere stating that these two nodes should be different?
In standalone mode, when you start your machine certain JVM will start.Your SparK Master will start up and on each machine Worker JVM will start and they will register with the Spark Master.
Both are the resource manager.When you start your application or submit your application in cluster mode a Driver will start up wherever you do ssh to start that application.
Driver JVM will contact to the SparK Master for executors(Ex) and in standalone mode Worker will start the Ex.
So Spark Master is per cluster and Driver JVM is per application.
In case where the Driver node fails, who is responsible of re-launching the application? and what will happen exactly?
i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If a Ex JVM will crashes the Worker JVM will start the Ex and when Worker JVM ill crashes Spark Master will start them.
And with a Spark standalone cluster with cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with non-zero exit code.Spark Master will start Driver JVM
Similarly to the previous question: In case where the Master node fails,
what will happen exactly and who is responsible of recovering from the failure?
failing on master will result in executors not able to communicate with it. So, they will stop working. Failing of master will make driver unable to communicate with it for job status. So, your application will fail.
Master loss will be acknowledged by the running applications but otherwise these should continue to work more or less like nothing happened with two important exceptions:
1.application won't be able to finish in elegant way.
2.if Spark Master is down Worker will try to reregisterWithMaster. If this fails multiple times workers will simply give up.
reregisterWithMaster()-- Re-register with the active master this worker has been communicating with. If there is none, then it means this worker is still bootstrapping and hasn't established a connection with a master yet, in which case we should re-register with all masters.
It is important to re-register only with the active master during failures.worker unconditionally attempts to re-register with all masters,
will may arise race condition.Error detailed in SPARK-4592:
At this moment long running applications won't be able to continue processing but it still shouldn't result in immediate failure.
Instead application will wait for a master to go back on-line (file system recovery) or a contact from a new leader (Zookeeper mode), and if that happens it will continue processing.

Resources