I can find information about the process of decommissioning a node in a Hadoop cluster, but not about what happens to the data on the decommissioned node.
The only information I found is the answer given by Tariq at Does decomissioning a node remove data from that node?
My question is about recommissioning an old node.
If the data is lost, it is the same as commissioning a new node. Problem solved.
But if the data is not lost, then when we recommission the node, the blocks present on it become redundant, since those blocks were already copied elsewhere during the decommission process. This leads to an inconsistency in the replica counts of those blocks.
How does the Hadoop framework take care of this situation?
I am reading the basics about YARN and the Hadoop FileSystem. Some blogs online told me that YARN is just a resource management system and HDFS is about storage. But I encountered the following lines in the book Hadoop: The Definitive Guide:
From those lines I can infer that there should be some connection between the locations of DataNodes and NodeManagers; maybe they can be on the same machine. That contradicts what I learned from the blogs.
Can anyone help explain this?
I googled a lot for "connection between Datanode and Node Manager" and could not find a direct answer.
YARN is the OS, the compute power.
HDFS is the disk.
It is beneficial to move the compute to a node where the data is located. A node will often run both a NodeManager, which manages the compute (YARN), and a DataNode (HDFS). So both a container and the files for a YARN/Hadoop job can be colocated on one node/server. You can also have a NodeManager on a node that isn't a DataNode, and you can have a DataNode that isn't a NodeManager. The two are independent, but it frequently makes sense to colocate them to take advantage of data locality. After all, who wants an OS without a disk? (There is actually a use case for that, but let's not get into "compute nodes".)
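To make the connection concrete, here is a minimal Java sketch of what an application master can do: ask HDFS (via the public FileSystem API) which hosts hold a file's blocks, then ask YARN (via AMRMClient) for containers on exactly those hosts. The path and resource sizes are made-up example values, and a real application master would also have to register with the ResourceManager and drive the allocate() loop; this only shows the locality hand-off between the two systems.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask the NameNode which hosts hold each block of the input file.
        Path input = new Path("/data/input.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(input);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // A real application master would register with the ResourceManager
        // and run the allocate() heartbeat loop; this only shows the requests.
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(conf);
        rm.start();

        for (BlockLocation block : blocks) {
            String[] hosts = block.getHosts(); // datanodes storing this block
            System.out.println("Block hosts: " + Arrays.toString(hosts));
            // Ask YARN for a container on one of those hosts. This can only be
            // satisfied node-locally if a NodeManager runs on the same machine
            // as the DataNode -- which is why the two are usually colocated.
            rm.addContainerRequest(new ContainerRequest(
                    Resource.newInstance(1024, 1), // 1 GB, 1 vcore (example values)
                    hosts, null, Priority.newInstance(0)));
        }
    }
}
```

This hand-off is the whole point of colocating the daemons: the request can only land node-locally if a NodeManager runs where the DataNode stored the block.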
I'm replacing multiple machines in my Hadoop CDH 5.7 cluster.
I started by adding a few new machines and decommissioning the same number of existing datanodes.
I noticed that blocks are marked as under-replicated when decommissioning a node.
Does it mean I'm at risk when decommissioning multiple nodes?
Can I decommission all nodes in parallel?
Is there a better way of replacing all machines?
Thanks!
It's obvious that when a node is down (or removed), its data is under-replicated.
When you add a new node and rebalance, this is fixed automatically.
What's actually happening?
Let's say the replication factor on your cluster is 3. When a node is decommissioned, all the data stored on it is gone, and the replication of that data is now 2 (and hence under-replicated). Now when you add a new node and re-balance, the missing copy is made again, restoring the replication to the configured factor.
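You can see this from a client with the public FileSystem API: compare the file's configured replication factor against the number of hosts actually holding each block. A minimal Java sketch (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/example.txt")); // hypothetical

        short configured = status.getReplication(); // e.g. 3
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            int live = block.getHosts().length; // hosts currently holding a replica
            if (live < configured) {
                // e.g. 2 < 3 right after a datanode is removed, until the
                // NameNode re-replicates the block somewhere else.
                System.out.println(block + " under-replicated: " + live + "/" + configured);
            }
        }
    }
}
```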
Am I at risk?
Not if you are doing it one by one.
That is: replace a node, re-balance the cluster, repeat. (I think this is the only way!)
If you just remove multiple nodes at once, there is a good chance of losing data, since you may lose all replicas of some blocks (the ones that resided only on those nodes).
Don't decommission multiple nodes at once!
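If you are scripting the replacement, you can wait for each decommission to complete before starting the next. A sketch using the datanode report from DistributedFileSystem; note that the DECOMMISSIONING report type assumes a Hadoop 2.7+ client, so verify it against the client libraries your CDH release actually ships:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.DatanodeReportType;

public class WaitForDecommission {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the HDFS cluster.
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());

        // Poll until no datanode is still in the DECOMMISSIONING state,
        // i.e. all of its blocks have been re-replicated elsewhere.
        DatanodeInfo[] pending;
        while ((pending = dfs.getDataNodeStats(DatanodeReportType.DECOMMISSIONING)).length > 0) {
            for (DatanodeInfo dn : pending) {
                System.out.println("Still decommissioning: " + dn.getHostName());
            }
            Thread.sleep(60_000); // check again in a minute
        }
        System.out.println("Safe to decommission the next node.");
    }
}
```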
I have a Hadoop cluster with 10 datanodes and 2 namenodes, with the replication factor configured to 3. I was wondering: if one of the datanodes goes down, will Hadoop try to regenerate the lost replicas on the other live nodes, or just do nothing (since there are still 2 replicas left)?
Also, what if the downed datanode comes back after a while; can Hadoop recognize the data on that node? Thanks!
Will Hadoop try to regenerate the lost replicas on the other live nodes, or just do nothing (since there are still 2 replicas left)?
Yes, Hadoop will notice the loss and make copies of that data on other nodes. When the NameNode stops receiving heartbeats from a datanode, it assumes that datanode is lost. To keep all the data at the configured replication factor, it makes new copies on the remaining datanodes.
What if the downed datanode comes back after a while; can Hadoop recognize the data on that node?
Yes. When a datanode comes back with all its data, the NameNode will schedule the extra copies of that data for deletion. In the reply to the next heartbeat from a datanode, the NameNode will send the instruction to remove the excess replicas and free up disk space.
A snippet from the Apache HDFS documentation:
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
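The "absence of a Heartbeat" window is not magic: stock HDFS derives the dead-node timeout from two configuration properties, dfs.heartbeat.interval (default 3 seconds) and dfs.namenode.heartbeat.recheck-interval (default 300000 ms), as 2 * recheck + 10 * heartbeat. A small Java sketch that computes it:

```java
import org.apache.hadoop.conf.Configuration;

public class DeadNodeTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Datanodes heartbeat every dfs.heartbeat.interval seconds (default 3).
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        // The NameNode rechecks for stale heartbeats every
        // dfs.namenode.heartbeat.recheck-interval milliseconds (default 300000).
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);

        // Stock HDFS declares a datanode dead after:
        long timeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
        System.out.println("Datanode declared dead after ~" + timeoutMs / 1000
                + " s"); // 630 s with the defaults
    }
}
```

With the defaults that is 630 seconds, i.e. a datanode is only declared dead about 10.5 minutes after its last heartbeat, which is why a briefly rebooted node usually comes back before any re-replication starts.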
I'd like to use a Hadoop/HDFS-based system, but I'm a bit concerned because I think I will want to have all the data for one user on the same physical machine. Is there a way of accomplishing this in the Hadoop universe?
During the HDFS write process, the first replica of a data block is written to the node from which the client is accessing the cluster, provided that node is a datanode.
To solve your problem, make the edge nodes datanodes as well. Edge nodes are where the user starts interacting with the cluster.
But using datanodes as edge nodes has some disadvantages. One of them is data distribution: the data will not be spread evenly, and if such a node fails, re-balancing the cluster will be very costly.
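If you control the writing client, there is also a more direct option: newer Hadoop 2.x clients expose a create() overload on DistributedFileSystem that takes "favored nodes" as a placement hint. Treat this as a sketch under that assumption; the hostnames and path are made up, and the NameNode treats the hint as best-effort rather than a guarantee:

```java
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Hint that this user's data should land on these datanodes
        // (hypothetical hostnames; 50010 is the default datanode port in 2.x).
        InetSocketAddress[] favored = {
                new InetSocketAddress("dn-user42-a.example.com", 50010),
                new InetSocketAddress("dn-user42-b.example.com", 50010),
                new InetSocketAddress("dn-user42-c.example.com", 50010),
        };

        try (FSDataOutputStream out = dfs.create(
                new Path("/users/user42/data.bin"), // hypothetical path
                FsPermission.getFileDefault(),
                true,                               // overwrite
                conf.getInt("io.file.buffer.size", 4096),
                (short) 3,                          // replication
                dfs.getDefaultBlockSize(),
                null,                               // no progress callback
                favored)) {
            out.writeBytes("hello");
        }
    }
}
```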
Assume the data is not present on the TaskTracker's own node but on some other machine.
How will the TaskTracker know which node contains the data?
Does it talk to that datanode directly? Or will it contact its own datanode, which then takes the responsibility of copying that data?
How will the TaskTracker know which node contains the data?
The TaskTracker does not know it. The JobTracker contacts the NameNode, gets the locations of the data, and tries its best to assign each task to a TaskTracker on the same node as the data (or as close as possible).
Does it talk to that datanode directly? Or will it contact its own datanode, which then takes the responsibility of copying that data?
It talks to the Datanode directly.
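The same holds for any HDFS client, a map task included: open() consults the NameNode only for block locations, and the subsequent read() streams bytes directly from a datanode holding the nearest replica. A minimal Java sketch with a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the NameNode for block locations; the read() below
        // connects directly to a datanode that holds the block -- the local
        // one when a replica is colocated, a remote one otherwise.
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"))) { // hypothetical
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.println("Read " + n + " bytes from the nearest replica");
        }
    }
}
```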