How to recover deleted Master Node in dataproc cluster? - hadoop

One of the master nodes in my Dataproc cluster got deleted accidentally. Is there any way to recover that master node, or can I spin up a new master node and add it to my cluster? The reason for the deletion is still unknown.
Any help is really appreciated.

After realizing that I didn't have many options, I tried the steps below and it worked.
Determine the current active NameNode (hdfs haadmin -getServiceState nn0/nn1).
Create an AMI of the current active NameNode.
Launch a new instance from that AMI with the exact same name as the deleted master node. (This is crucial, as all HDFS properties inside hdfs-site.xml are configured using this hostname. So make sure every detail of this instance is exactly the same as the lost one.)
Our AMI contains every required configuration and service, so as the new instance starts, Dataproc automatically identifies the node and adds it back to the cluster.
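A minimal sketch of the first and third steps, assuming an HA setup where the NameNodes are registered as nn0 and nn1 (check dfs.ha.namenodes.&lt;nameservice&gt; in hdfs-site.xml for the actual IDs):

    # Ask each NameNode for its HA state; the "active" one is the node to image
    hdfs haadmin -getServiceState nn0
    hdfs haadmin -getServiceState nn1

    # On the replacement instance, confirm the hostname matches the deleted
    # master exactly, since hdfs-site.xml references the master by hostname
    hostname -f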

If it has been deleted, I don't think it can be restored to whatever state you had before the deletion. However, you can prevent future accidental deletions by making sure the instance doesn't get scheduled for deletion.
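One concrete way to guard against this, assuming the master is an ordinary Compute Engine instance named with the usual &lt;cluster-name&gt;-m pattern (the instance name and zone below are hypothetical):

    # Enable deletion protection so the instance cannot be deleted
    # until the flag is explicitly cleared
    gcloud compute instances update my-cluster-m \
        --zone us-central1-a \
        --deletion-protection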

Related

Do EC2 system reboot events cause EMR to wait for the node or to replace it?

I have a long running EMR cluster. I received EC2 event notifications of upcoming system reboots. The help document advises that rebooting these instances manually will not reschedule the event, though stopping and starting the instances might.
EMR claims that if a core node goes unresponsive it will provision a new one. I suspect this provisioning takes longer than a reboot, so what I cannot find in the documentation is whether the EC2 event is known to EMR and the cluster will wait for its missing core nodes (or task nodes) to reboot and rejoin, or whether EMR will respond as though these instances disappeared unexpectedly and start provisioning replacements even as the nodes come back and rejoin the cluster.
Does anyone know which it will be?
It turns out the AWS service person handling the hardware replacement and rebooting the instance was tasked with making the correct adjustments in EMR for the instance change. They started by adding a node, then draining the old node of tasks. Then they rebooted the node and it was reattached to EMR. Then they drained the added node and shut it down.
I'm not sure this happens every time there's a reboot event, though. It seems like the script of service steps is modified for different types of cases.

Closing down amazon EMR temporarily

I wanted to know if I can temporarily shut down my EMR EC2 instances to avoid extra charges, i.e. whether I can take a snapshot of my cluster and stop the EC2 instances temporarily.
You cannot currently terminate/stop your master instance without losing everything on your cluster, including the data in HDFS, but one thing you might be able to do is shrink your core/task node instance groups when you don't need them. You must still keep at least one core instance (or more if you have a lot of data in HDFS that you want to keep), but you can resize your task instance groups down to zero if your cluster is not in use.
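As a rough sketch of that resize with the AWS CLI, assuming j-XXXXXXXXXXXXX is your cluster ID and ig-TASKGROUP is the ID of a task instance group (both hypothetical):

    # Find the instance group IDs and their current sizes
    aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
        --query 'Cluster.InstanceGroups[*].[Id,InstanceGroupType,RequestedInstanceCount]'

    # Shrink the task instance group to zero while the cluster is idle
    aws emr modify-instance-groups \
        --instance-groups InstanceGroupId=ig-TASKGROUP,InstanceCount=0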
On the other hand, unless you really have anything on your cluster that you need to keep, you might just want to terminate your cluster when you no longer need it, then clone it to a new cluster when you require it again.
For instance, if you only ever store your output data in S3 or another service external to your EMR cluster, there is no reason you need to keep your EMR cluster running while idle and no reason to need to "snapshot" it because your cluster is essentially stateless.
If you do have any data/configuration stored on your cluster that you don't want to lose, you might want to consider moving it off the cluster so that you can shut down your cluster when not in use. (Of course, how you would do this depends upon what exactly we're talking about.)

Creating rethinkdb cluster

I'm writing an automation script that is supposed to create 4 instances in AWS and deploy a rethinkdb cluster on them without any human interaction. According to the documentation I need to either use the --join parameter on the command line or put join statements in the configuration file. However, what I don't understand is whether I need to specify join only once in order to create the cluster, or every time I restart any of the cluster nodes.
My current understanding is that I only need to issue it once; the cluster configuration is somehow stored in metadata, and next time I can just start rethinkdb without the --join parameter and it will reconnect to the rest of the cluster on its own. But when would I need the join option in the configuration file then?
If this is true, do I need to start rethinkdb with the --join option in my script, then shut it down and start it again without --join? Is this the right way to do it, or are there better alternatives?
You're right that on subsequent restarts you don't need to specify --join on the command line; the node will discover the cluster and attempt to re-connect. Part of the cluster state is stored in the system table server_config.
Even if you wiped out the data directory on this node, it may still be able to rejoin the cluster, because other nodes may have information about it and will attempt to connect to it. But if no other node stores information about this particular server, or if this node is restarted with a new IP address for some reason and its data directory is wiped as well, then the cluster doesn't know about it (with the new IP address).
So I would always specify --join. It doesn't hurt, and in the worst case it helps the new node join the cluster again.
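A minimal sketch of the configuration-file approach, assuming the other nodes are reachable at node1.example.com and node2.example.com (hypothetical hostnames) on the default intracluster port 29015:

    # /etc/rethinkdb/instances.d/cluster.conf (path may differ on your setup)
    bind=all
    # Join the existing cluster on startup; harmless on later restarts
    join=node1.example.com:29015
    join=node2.example.com:29015

With the join lines kept in the config file, the automation script can start rethinkdb the same way every time, instead of starting once with --join and restarting without it.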

How to delete datanode from hadoop clusters without losing data

I want to delete a datanode from my hadoop cluster, but I don't want to lose my data. Is there any technique so that the data on the node I am going to delete gets replicated to the remaining datanodes?
What is the replication factor of your hadoop cluster?
If it is the default, which is generally 3, you can delete the datanode directly, since the data automatically gets re-replicated. This process is controlled by the NameNode.
If you changed the replication factor of the cluster to 1, then deleting the node will lose the data on it; it cannot be replicated further.
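A quick way to check, assuming the standard Hadoop client tools are on the path:

    # Default replication factor as seen by the client configuration
    hdfs getconf -confKey dfs.replication

    # Filesystem health, including any under-replicated or missing blocks
    hdfs fsck /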
Check that all the current datanodes are healthy. For this you can go to the Hadoop master admin console under the Datanodes tab; the address is normally something like http://server-hadoop-master:50070
Add the server you want to delete, using its full domain name, to the file /opt/hadoop/etc/hadoop/dfs.exclude on the Hadoop master and on all the current datanodes (your config directory may be different, please double check this), as shown in the sketch after these steps.
Refresh the cluster node configuration by running the command hdfs dfsadmin -refreshNodes on the Hadoop NameNode master.
Check the Hadoop master admin home page for the state of the server being removed under the "Decommissioning" section; this may take from a couple of minutes to several hours, or even days, depending on the volume of data you have.
Once the server is shown as decommission complete, you may delete the server.
NOTE: if you have other services like YARN running on the same server, the process is relatively similar, but with the file /opt/hadoop/etc/hadoop/yarn.exclude and then running yarn rmadmin -refreshNodes from the YARN master node.
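A minimal sketch of the HDFS part of these steps, assuming the exclude file is /opt/hadoop/etc/hadoop/dfs.exclude (it must be the file referenced by dfs.hosts.exclude in hdfs-site.xml) and the node to remove is datanode3.example.com (hypothetical hostname):

    # Add the node's full domain name to the exclude file
    echo "datanode3.example.com" >> /opt/hadoop/etc/hadoop/dfs.exclude

    # Make the NameNode re-read the include/exclude lists
    hdfs dfsadmin -refreshNodes

    # Watch progress; the node should move from "Decommission in progress"
    # to "Decommissioned" before it is safe to remove
    hdfs dfsadmin -report | grep -A 2 "datanode3.example.com"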

"LOST" node in EMR Cluster

How do I troubleshoot and recover a Lost Node in my long running EMR cluster?
The node stopped reporting a few days ago. The host seems to be fine and HDFS too. I noticed the issue only from the Hadoop Applications UI.
EMR nodes are ephemeral and you cannot recover them once they are marked as LOST. You can avoid this in the first place by enabling the 'Termination Protection' feature during cluster launch.
Regarding finding the reason for the LOST node, you can check the YARN ResourceManager logs and/or the instance controller logs of your cluster to find out more about the root cause.
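A couple of places to start, assuming a standard EMR layout (the log paths below are the typical EMR locations and may vary by release):

    # From the master node: list YARN nodes in every state, including LOST
    yarn node -list -all

    # Instance controller logs on the affected node
    less /emr/instance-controller/log/instance-controller.log

    # ResourceManager logs on the master
    ls /var/log/hadoop-yarn/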
