How does Hadoop handle data on disk after the namenode is reformatted? - hadoop

I want to know how Hadoop handles the data already on disk after the namenode is reformatted.
The namenode stores the cluster's metadata, which maps HDFS files to block data on disk.
When the namenode is reformatted, how does Hadoop delete the data on disk?
Any help is appreciated!

If you reformat the namenode, all the metadata in the namenode is deleted and the namenode starts fresh. It is given a new cluster ID, while all the other nodes still carry the old cluster ID. So it's always a bad idea to reformat the namenode while the cluster is active.
If you do this accidentally, you need to restore the namenode metadata from the secondary namenode.
If you want to delete the data on your datanodes, you can simply execute hdfs dfs -rm -r /
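The cluster-ID mismatch described above can be sketched in a few lines of toy Python. This is only an illustration of the handshake check, not a real parser of the on-disk VERSION file; the field names and function names are hypothetical:

```python
# Toy sketch of the registration check a DataNode performs against the NameNode.
# Field names mirror the VERSION file (clusterID=...), but this is not HDFS code.

def parse_version(text):
    """Parse simple key=value lines, like those found in a VERSION file."""
    return dict(line.split("=", 1) for line in text.strip().splitlines())

def datanode_can_register(namenode_version, datanode_version):
    """A DataNode refuses to register when its stored clusterID differs
    from the NameNode's clusterID (as happens after a reformat)."""
    return namenode_version["clusterID"] == datanode_version["clusterID"]

# Before the reformat, both sides share one clusterID...
nn_old = parse_version("clusterID=CID-aaaa\nlayoutVersion=-63")
dn     = parse_version("clusterID=CID-aaaa\nlayoutVersion=-57")
assert datanode_can_register(nn_old, dn)

# ...after `hdfs namenode -format` the NameNode gets a fresh ID, so the
# DataNode's old storage is rejected ("Incompatible clusterIDs" in the logs).
nn_new = parse_version("clusterID=CID-bbbb\nlayoutVersion=-63")
assert not datanode_can_register(nn_new, dn)
```

This is why reformatting on a live cluster effectively orphans the datanodes: their data is still on disk, but they can no longer register with the new namenode.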

Related

How to take the backup of datanode in the hadoop cluster

I can find many solutions for backing up the metadata on the name node, but I would like to know how to back up a datanode. Leaving the replication factor aside, what is the detailed process for backing up datanodes at production level on a 20-node cluster?
The distcp command in Hadoop can copy data from a source cluster to a target cluster.
For example:
hadoop distcp hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase
This command copies the hbase folder from cdh57-namenode to CDH59-nameservice.
More information can be obtained from this link:
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_admin_distcp_data_cluster_migrate.html

Format HDFS when adding a node

I have an HDFS setup running in pseudo-distributed mode. If I add an additional datanode, do I need to run the HDFS format on the new node before or after adding it as an additional slave?
Refer to this link; there is no need to format any node. Just follow the instructions in the mentioned link.
There is no command to format a datanode. After adding a datanode, you should just start it.
You can only format a name node.
There is no need to format the namenode after adding a new datanode. You just need to refresh the nodes with
hadoop dfsadmin -refreshNodes

Fi-Ware Cosmos: Name node is in safe mode

I am trying to delete a folder in my Cosmos account, but I get a SafeModeException:
# hadoop fs -rmr /home/<user>/input
rmr: org.apache.hadoop.hdfs.server.namenode.SafeModeException:
Cannot delete /user/<user>/input. Name node is in safe mode
During startup the namenode loads the filesystem state from the fsimage and the edits log file. It then waits for the datanodes to report their blocks, so that it does not prematurely start replicating blocks even though enough replicas may already exist in the cluster. During this time the namenode stays in safe mode. Safe mode is essentially a read-only mode for the HDFS cluster: the namenode does not allow any modifications to the filesystem or to blocks. These operations take some time, after which the namenode comes out of safe mode.
If that still doesn't happen, or if you want the namenode to leave safe mode immediately, run
hadoop dfsadmin -safemode leave
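As a rough illustration, the exit condition the namenode evaluates can be sketched like this. The default of 0.999 corresponds to the dfs.namenode.safemode.threshold-pct configuration property; the function itself is a simplification, not the actual HDFS source:

```python
def safemode_should_exit(reported_blocks, total_blocks, threshold_pct=0.999):
    """Toy model: the NameNode leaves safe mode once the fraction of blocks
    covered by datanode block reports reaches the configured threshold
    (dfs.namenode.safemode.threshold-pct, 0.999 by default)."""
    if total_blocks == 0:
        return True  # empty namespace: nothing to wait for
    return reported_blocks / total_blocks >= threshold_pct

assert not safemode_should_exit(900, 1000)   # only 90% reported: stay in safe mode
assert safemode_should_exit(999, 1000)       # threshold reached: leave safe mode
```

In the real cluster you can check the current state with `hdfs dfsadmin -safemode get` rather than forcing an exit.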

Where does Name node store fsImage and edit Log?

I am a java programmer, learning Hadoop.
I read that the name node in HDFS stores its information in two files, namely fsImage and editLog. On startup it reads this data from disk and performs a checkpoint operation.
But in many places I also read that the name node stores the data in RAM, which is why Apache recommends a machine with a lot of RAM for the name node server.
Please enlighten me on this.
What data does it store in RAM & where does it store fsImage and edit Log ?
Sorry if I asked anything obvious.
Let me first answer
What data does it store in RAM & where does it store fsImage and edit Log ?
In RAM -- the file-to-block and block-to-datanode mappings.
In persistent storage (which includes both the edit log and the fsimage) -- file-related metadata (permissions, names and so on).
Regarding the storage location of the fsimage and editlog, #mashuai's answer is spot on.
For a more detailed discussion you can read up on this
When the namenode starts, it loads the fsimage from persistent storage (disk); its location is specified by the property dfs.name.dir (hadoop-1.x) or dfs.namenode.name.dir (hadoop-2.x) in hdfs-site.xml. The fsimage is loaded into main memory. As you mentioned, the namenode also performs a checkpoint operation during startup. The namenode keeps the fsimage in RAM in order to serve requests fast.
Apart from initial checkpoint, subsequent checkpoints can be controlled by tuning the following parameters in hdfs-site.xml.
dfs.namenode.checkpoint.period # in seconds, 3600 by default
dfs.namenode.checkpoint.txns # number of namenode transactions
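The interplay of these two parameters can be sketched as a toy trigger function (the period default is from the source above; the transaction limit of 1,000,000 is the stock Hadoop 2.x default, and the function name is illustrative):

```python
def checkpoint_due(seconds_since_last, txns_since_last,
                   period=3600, txn_limit=1_000_000):
    """Toy model: a checkpoint fires when EITHER
    dfs.namenode.checkpoint.period seconds have elapsed OR
    dfs.namenode.checkpoint.txns transactions have accumulated."""
    return seconds_since_last >= period or txns_since_last >= txn_limit

assert not checkpoint_due(600, 5_000)    # neither limit reached yet
assert checkpoint_due(4000, 5_000)       # time period exceeded
assert checkpoint_due(600, 1_000_000)    # transaction limit reached
```

Whichever threshold is hit first triggers the next checkpoint, so a very busy namenode checkpoints on transaction count while an idle one checkpoints on the timer.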
It stores the fsimage and editlog in dfs.name.dir, which is set in hdfs-site.xml. When you start the cluster, the NameNode loads the fsimage and editlog into memory.
When the name node starts, it goes into safe mode. It loads the FSImage from persistent storage and replays the edit logs to create an updated view of HDFS storage (the FILE-TO-BLOCK MAPPING), then writes this updated FSImage back to persistent storage. The name node then waits for block reports from the data nodes, from which it builds the BLOCK-TO-DATANODE MAPPING. Once the name node has received a certain threshold of block reports, it leaves safe mode and can start serving client requests. Whenever a client makes any metadata change, the NameNode (NN) first writes the change to an edit log segment, with an increasing transaction ID, on persistent storage (hard disk). Only then does it update the FSImage held in its RAM.
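That write ordering (edit log first, in-memory image second) is the classic write-ahead-log pattern. A toy sketch, with hypothetical class and method names, of both the update path and the replay-on-restart path:

```python
class ToyNameNode:
    """Toy write-ahead-log model of NameNode metadata updates: edits are
    durably appended (here: a list) BEFORE the in-memory state changes."""

    def __init__(self):
        self.edit_log = []    # append-only records with increasing txids
        self.namespace = {}   # in-memory "FSImage" view of the filesystem
        self.txid = 0

    def apply(self, path, meta):
        self.txid += 1
        self.edit_log.append((self.txid, path, meta))  # 1) log durably
        self.namespace[path] = meta                    # 2) update RAM

    @classmethod
    def restart(cls, fsimage, edit_log):
        """On startup: load the last checkpoint, then replay logged edits."""
        nn = cls()
        nn.namespace = dict(fsimage)
        for txid, path, meta in edit_log:
            nn.namespace[path] = meta
            nn.txid = txid
        return nn

nn = ToyNameNode()
nn.apply("/user/a", {"perm": "755"})
nn.apply("/user/b", {"perm": "644"})
# Simulate a crash and restart from an empty checkpoint plus the edit log:
recovered = ToyNameNode.restart({}, nn.edit_log)
assert recovered.namespace == nn.namespace
```

The longer the edit log grows between checkpoints, the longer this replay takes, which is exactly why checkpointing matters for startup time.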
The fsimage and editlog are stored in dfs.name.dir, which is set in hdfs-site.xml.
When the cluster starts, the NameNode loads the fsimage and editlog into memory (RAM).

name node Vs secondary name node

Hadoop is consistent and partition tolerant, i.e. it falls under the CP category of the CAP theorem.
Hadoop is not available because all the nodes are dependent on the name node. If the name node goes down, the cluster goes down.
But considering that the HDFS cluster has a secondary name node, why can't we call Hadoop available? If the name node is down, can't the secondary name node be used for writes?
What is the major difference between the name node and the secondary name node that makes Hadoop unavailable?
Thanks in advance.
The namenode stores the HDFS filesystem information in a file named fsimage. Updates to the filesystem (adding/removing blocks) do not update the fsimage file; instead they are logged to a separate file, so the I/O is fast append-only streaming as opposed to random file writes. When restarting, the namenode reads the fsimage and then applies all the changes from the log file to bring the filesystem state up to date in memory. This process takes time.
The secondary namenode's job is not to be a backup for the name node, but only to periodically read the filesystem change log and apply the changes to the fsimage file, thus bringing it up to date. This allows the namenode to start up faster the next time.
Unfortunately the secondarynamenode service is not a standby namenode, despite its name. Specifically, it does not offer HA for the namenode. This is well illustrated here.
See Understanding NameNode Startup Operations in HDFS.
Note that more recent distributions (currently Hadoop 2.6) introduce namenode high availability using NFS (shared storage) and/or namenode high availability using the Quorum Journal Manager.
Things have changed over the years, especially with Hadoop 2.x. The namenode is now highly available, with a failover feature.
The secondary namenode is now optional, and a standby namenode is used for the failover process.
The standby NameNode stays up to date with all the filesystem changes the active NameNode makes.
HDFS high availability is possible with two options, NFS and the Quorum Journal Manager, but the Quorum Journal Manager is the preferred option.
Have a look at the Apache documentation.
From slide 8 of: http://www.slideshare.net/cloudera/hdfs-futures-world2012-widescreen
When the active node performs any namespace modification, it durably logs a record of the modification to a majority of these JournalNodes (JNs). The standby node reads these edits from the JNs and applies them to its own namespace.
In the event of a failover, the standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the active state. This ensures that the namespace state is fully synchronized before a failover occurs.
Have a look at about fail over process in related SE question :
How does Hadoop Namenode failover process works?
Regarding your queries on CAP theory for Hadoop:
It can be strongly consistent.
HDFS is almost highly available, unless you run into some bad luck
(if all three replicas of a block are down, you won't get the data).
It is partition tolerant.
The name node is the primary node, in which all the metadata is periodically stored in the fsimage and editlog files. But when the name node goes down, the secondary node comes online with only read access to the fsimage and editlog files; it does not have write access to them. All of the secondary node's operations are stored in a temp folder. When the name node comes back online, this temp folder is copied to the name node, and the name node updates the fsimage and editlog files.
Even in HDFS High Availability, where there are two NameNodes instead of one NameNode and one SecondaryNameNode, there is not availability in the strict CAP sense. It only applies to the NameNode component, and even there if a network partition separates the client from both of the NameNodes then the cluster is effectively unavailable.
To explain it in a simple way: think of the name node as a person (working/live) and the secondary name node as an ATM machine (data storage).
All the functions are carried out by the NN (the person) only; if it goes down/fails, the SNN is useless on its own, but it can later be used to recover your data or logs.
When the NameNode starts, it loads the FSImage and replays the edit logs to create the latest updated namespace. This process may take a long time if the edit log file is large, which increases startup time.
The job of the secondary name node is to periodically read the edit log, replay it to create an updated FSImage, and store that in persistent storage. When the name node starts, it doesn't need to replay the whole edit log to create an updated FSImage; it uses the FSImage created by the secondary name node.
The namenode is a master node that holds metadata in the form of the fsimage and also the edit log. The edit log contains recently added/removed block information in the namenode's namespace. The fsimage file contains the metadata of the entire Hadoop filesystem in permanent storage. To fold changes permanently into the fsimage, the namenode would have to be restarted so the edit log could be applied, but that takes a lot of time.
A secondary namenode is used to bring the fsimage up to date. The secondary namenode reads the edit log and applies the changes to the fsimage permanently, so that the namenode can start up faster next time.
Basically, the secondary namenode is a helper that performs housekeeping functionality for the namenode.