Does data remain in HDFS when the Hadoop cluster is down?

I am new to Qubole and wanted to know: does data remain in HDFS after the Hadoop cluster goes down?
Any help is appreciated.
Thank you.

No, data on HDFS is gone. We don't back up or restore HDFS. The model of computation on EC2/S3 is that long-lived data always lives on S3 and HDFS is used only for intermediate and control data. We also sometimes use HDFS (and local disk) as a cache.
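To make that model concrete, here is a minimal sketch using the Hadoop FileSystem API, assuming the s3a:// connector (hadoop-aws) is on the classpath and fs.defaultFS points at the cluster's HDFS; the bucket and paths are hypothetical:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LongLivedVsScratch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Long-lived data goes to S3 and survives cluster termination
        // (bucket name is hypothetical; requires the hadoop-aws s3a connector).
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        s3.copyFromLocalFile(new Path("/tmp/results.csv"),
                new Path("s3a://my-bucket/warehouse/results.csv"));

        // Intermediate and control data goes to HDFS (the default file
        // system here) and disappears with the cluster.
        FileSystem hdfs = FileSystem.get(conf);
        hdfs.mkdirs(new Path("/tmp/job-scratch"));
    }
}
```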

That depends on what exactly is down in the cluster. Hadoop runs several daemons: the NameNode, DataNodes, the ResourceManager, ApplicationMasters, etc.
If the NameNode (master node) is down, the data remains as-is in the cluster, but you will not be able to access it at all, because the NameNode holds the metadata for the DataNodes.
If a DataNode (slave node) is down, you will not be able to access the data from that node, but by default each block is stored in 3 locations in the cluster for fault tolerance, so you can still access the data from the other two replicas.
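If you want to verify this on your own cluster, here is a minimal sketch (the file path is hypothetical) that reads a file's replication factor through the Java FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckReplication {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster configured in core-site.xml/hdfs-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file path; dfs.replication defaults to 3.
        FileStatus status = fs.getFileStatus(new Path("/data/events.log"));
        System.out.println("Replication factor: " + status.getReplication());
    }
}
```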

Related

Hadoop - data balanced automatically on copying to HDFS?

If I copy a set of files to HDFS in a 7-node Hadoop cluster, will HDFS take care of automatically balancing the data across the 7 nodes? And is there any way I can tell HDFS to constrain/force data to a particular node in the cluster?
The NameNode is 'the' master that decides where to put data blocks on the different nodes in the cluster. In theory you should not alter this behavior, and doing so is not recommended. If you copy files to the Hadoop cluster, the NameNode will automatically take care of distributing them almost equally across all the DataNodes.
If you still want to force this behaviour to change (not recommended), these posts could be useful:
How to put files to specific node?
How to explicitly define datanodes to store a particular given file in HDFS?
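As a rough illustration of what a client can and cannot control, here is a sketch (path and values hypothetical): the create() call lets you choose the per-file replication factor and block size, but which DataNodes receive the blocks remains the NameNode's decision.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // All values are hypothetical; placement of the resulting blocks
        // on specific DataNodes is still decided by the NameNode.
        FSDataOutputStream out = fs.create(
                new Path("/data/report.txt"), // file path
                true,                         // overwrite if it exists
                4096,                         // client-side buffer size
                (short) 2,                    // per-file replication factor
                128L * 1024 * 1024);          // block size: 128 MB
        out.writeUTF("hello hdfs");
        out.close();
    }
}
```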

Does HBase really scale linearly?

I have started to learn HBase and I don't understand how it scales linearly.
The problem is that before you install HBase you have to have an HDFS cluster. An HDFS cluster has a master node, of which there can be only one active in the whole cluster, so it is a bottleneck. Of course we can run one more master node (only one more is possible), but it will be in standby state.
As I understand it, HBase uses the HDFS cluster to store data. So to me it seems logical that it makes no sense to run more than one HMaster, because all requests will go to the active HDFS master, whose performance can suffer if we have too many requests.
I also don't properly understand whether we need to install HBase on the same nodes as HDFS or separately, and what the benefits are of running HBase separately from HDFS.
To me it seems logical to install the HBase cluster on the same nodes as HDFS, as in the following example:
HDFS active master - HMaster
HDFS standby master - HMaster backup
HDFS data node - HRegionServer
This seems to me the most logical structure, because if we separate the HDFS master from the HMaster, the probability of losing the HBase cluster will be twice as high.
I will be very happy if someone can share information about all this, because I really don't understand how HBase can scale linearly and how it works with HDFS.
First, you can install HBase over any supported file system; it is not mandatory to run it over HDFS. But running it on HDFS gives it advantages such as fault tolerance, data replication, and checksums.
That's why it is recommended to run HBase over HDFS.
Moreover, although the NameNode is a bottleneck in HDFS, it does not hurt HBase's efficiency, because not every operation depends internally on the HDFS NameNode. For instance, RegionServers serve data for reads and writes: when accessing data, clients communicate with HBase RegionServers directly, while region assignment and DDL operations (creating and deleting tables) are handled by the HBase Master process. This means that reading and writing data is independent of creating and deleting tables.
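To make the two paths concrete, here is a minimal sketch with the HBase 2.x Java client (the table and column family names are hypothetical): the Admin calls for DDL go through the HMaster, while the Table calls for puts and gets talk to RegionServers directly.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePaths {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            TableName name = TableName.valueOf("users"); // hypothetical table

            // DDL path: table creation is handled by the HBase Master.
            try (Admin admin = conn.getAdmin()) {
                admin.createTable(TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                        .build());
            }

            // Data path: reads and writes go directly to RegionServers,
            // bypassing the HBase Master.
            try (Table table = conn.getTable(name)) {
                Put put = new Put(Bytes.toBytes("user-1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
                        Bytes.toBytes("Alice"));
                table.put(put);

                Result r = table.get(new Get(Bytes.toBytes("user-1")));
                System.out.println(Bytes.toString(
                        r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
            }
        }
    }
}
```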
You can refer to https://www.mapr.com/blog/in-depth-look-hbase-architecture for more details about HBase architecture.
Also see this webinar on HBase by Lars George: https://m.youtube.com/watch?v=_HLoH_PgrLk
Hope this will clear your doubts.

What will Hadoop do after one of the DataNodes goes down?

I have a Hadoop cluster with 10 data nodes and 2 name nodes, with the replication factor configured as 3. I was wondering: if one of the data nodes goes down, will Hadoop try to regenerate the lost replicas on the other live nodes, or just do nothing (since there are still 2 replicas left)?
Additionally, what if the downed data node comes back after a while? Can Hadoop recognize the data on that node? Thanks!
will Hadoop try to regenerate the lost replicas on the other live nodes, or just do nothing (since there are still 2 replicas left)?
Yes, Hadoop will recognize it and make copies of that data on some other nodes. When the NameNode stops receiving heartbeats from a data node, it assumes the data node is lost. To keep all data at the configured replication factor, it makes copies on other data nodes.
Additionally, what if the downed data node comes back after a while? Can Hadoop recognize the data on that node?
Yes, when a data node comes back with all its data, the NameNode will remove the extra copies of the data. In its next heartbeat reply to that data node, the NameNode will send the instruction to remove the extra data and free up space on disk.
Snippet from Apache HDFS documentation:
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
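The same re-replication machinery can be observed by changing a file's target replication factor by hand. A minimal sketch (file path hypothetical): raising the target makes the NameNode schedule extra copies, and lowering it makes the NameNode instruct DataNodes, via heartbeat replies, to delete surplus replicas, which is just what happens when a dead DataNode rejoins.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log"); // hypothetical path

        // Raising the target from 3 to 4 makes the NameNode schedule an
        // extra copy of each block; lowering it would mark replicas as
        // excess, to be deleted by DataNodes on a later heartbeat reply.
        boolean ok = fs.setReplication(file, (short) 4);
        System.out.println("Replication change accepted: " + ok);
    }
}
```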

How to shard on user id with HDFS?

I'd like to use a Hadoop/HDFS-based system, but I'm a bit concerned because I think I will want to have all the data for one user on the same physical machine. Is there a way of accomplishing this in the Hadoop universe?
During the HDFS write process, a data block is written first to the node from which the client is accessing the cluster, provided that node is a DataNode.
So, to solve your problem, the edge nodes would also have to be DataNodes (edge nodes are the machines from which the user interacts with the cluster).
But using DataNodes as edge nodes has some disadvantages. One of them concerns data distribution: the distribution of data will not be even, and if such a node fails, rebalancing the cluster will be very costly.
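One way to check this behaviour empirically is to write a file from the node in question and then list the hosts holding each block. A minimal sketch (path hypothetical): if the client runs on a DataNode, its own hostname should appear among the hosts of every block it wrote.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteLocalityCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/alice/part-0"); // hypothetical path

        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("user data");
        }

        // If this client runs on a DataNode, its hostname should appear
        // among the hosts of each block (the first replica is local).
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(String.join(", ", loc.getHosts()));
        }
    }
}
```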

Difference between Rack Awareness and NameNode

I was going through Hadoop and I have a doubt: is there a difference between rack awareness and the NameNode? Will rack awareness and the NameNode remain on the same box?
As Aviral rightly said, the question is quite vague. But just to quote for your understanding:
Namenode : The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
You can read in detail about this concept here.
Rack Awareness : In simple words, rack awareness is the strategy the NameNode employs to choose the nearest DataNode based on rack information. You can read the details here.
Furthermore, I would like to suggest this blog.
From Apache HDFS Users Guide
HDFS is the primary distributed storage used by Hadoop applications.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
Typically, large Hadoop clusters are arranged in racks, and network traffic between different nodes within the same rack is much more desirable than network traffic across racks. In addition, the NameNode tries to place replicas of a block on multiple racks for improved fault tolerance.
From RackAwareness tutorial:
Hadoop components are rack-aware. For example, HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.
Let's see how Hadoop writes are implemented.
If the writer is on a datanode, the 1st replica is placed on the local machine; otherwise it is placed on a random datanode.
The 2nd replica is placed on a datanode that is on a different rack.
The 3rd replica is placed on a different node in the same rack as the 2nd replica.
Because data blocks are replicated on three different nodes spanning two different racks, Hadoop read operations provide high availability of data blocks.
At least one replica is stored on a different rack, so if one rack is not accessible, Hadoop can still fetch the data blocks from the other rack.
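To see this placement on a real cluster, a minimal sketch (file path hypothetical) can print each block's topology paths; with rack awareness configured, these look like /rack-name/host:port, and the replicas of a block should span at least two racks.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RackPlacement {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file path.
        FileStatus status = fs.getFileStatus(new Path("/data/events.log"));

        // One line per block; each entry is a replica's network location,
        // e.g. /rack1/host1:9866 | /rack2/host4:9866 | /rack2/host5:9866.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(String.join(" | ", loc.getTopologyPaths()));
        }
    }
}
```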
