Difference between Rack Awareness and Name node - hadoop

I was going through Hadoop, I have doubt whether there is difference between Rack wareness and Name Node. Will Rack wareness and name node will remain on same box

As Aviral rightly said, the question has been quite vague. But just quoting for your understanding,
Namenode : The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds the successful requests by returning a list of relevant DataNode servers where the data lives.
You can read in detail about this concept here.
Rack Awareness : In simple words rack awareness is the strategy namenode employs to choose the nearest datanode based on rack information. You can read details here
Further more, I would like to suggest this blog
Image credits Brad Hedlund

From Apache HDFS Users Guide
HDFS is the primary distributed storage used by Hadoop applications.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data
Typically large Hadoop clusters are arranged in racks and network traffic between different nodes with in the same rack is much more desirable than network traffic across the racks. In addition NameNode tries to place replicas of block on multiple racks for improved fault tolerance.
From RackAwareness tutorial:
Hadoop components are rack-aware. For example, HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.
Let's see how Hadoop writes are implemented.
If the writer is on a datanode, the 1st replica is placed on the local machine, otherwise a random datanode.
The 2nd replica is placed on a datanode that is on a different rack.
The 3rd replica is placed on a datanode which is on a different node of the rack as the second replica.
Due to replication of data blocks on three different nodes across two different RACs, Hadoop read operations provides high availability of data blocks.
At least one replica is stored on different RAC. If one RAC is not accessible, still Hadoop can fetch data block from other RAC.

Related

Can a hadoop client leverage the benefit of rack awareness?

I have 10 ingestion machines which use akka stream for data ingestion.
I have a Hadoop cluster of 50 nodes, and run pipelines using Spark Streaming. Hadoop cluster uses the data generated by 10 machines for producing reports.
Can I leverage rack awareness from those 10 machines without adding them as part of a Hadoop cluster?
When I say rack awareness, I mean if those machines are in the same rack as Hadoop data nodes, so using rack awareness, I would want each ingestion machine to upload data to it's nearest datanode instead of random manner, so that I would have less network traffic.
Please let me know if that's possible.
If I understood your setup correctly, this should happen automagically. According to HDFS Architecture:
For the common case, when the replication factor is three, HDFS’s
placement policy is to put one replica on the local machine if the
writer is on a datanode, otherwise on a random datanode in the same
rack as that of the writer, another replica on a node in a different
(remote) rack, and the last on a different node in the same remote
rack.
(highlighted is whats relevant to your case if your ingest nodes are not cluster datanodes.)

Need of maintaining replication factor on datanodes

Please pardon if this question has come up earlier as I'm not able to find any related question for this.
1) I want to know the reason why it is important to maintain the same replication factor(or for that matter any configuration) across the datanodes and namenodes in the cluster?
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
3) Wouldn't maintaining the configuration only on the namenodes suffice?
4) What are the implications of having the configuration different across namenode and datanodes?
Any Help is much appreciated. Thank you! :)
I will try to answer your question taking replication as an example.
Few things to keep in mind -
Data always resides on datanodes, Namenode never deals with data or store data, it only keeps metadata about the data.
Replication factor is configurable, you can change it for every file copy, for example file1 may have replication factor of 2 while file2 may have replication factor of say 3, in a similar way some other properties can also be configured at the time of execution.
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
I am not sure about what you exactly mean by namenode managing the storage, here is how a file upload to hdfs gets executed -
1) Client sends a request to Namenode for file upload to hdfs
2) Namenode based on the configuration(if not explicitly specified by the client application) calculates the number of blocks data will be broken into.
3) Namenode also decides which Datanodes will store the blocks, based on the replication factor specified in configuration(if not explicitly specified by the client application)
4) Namenode sends information calculated in step #2 and #3 to the client
5) Client application will break the file into blocks and write each block to 'a' Datanode say DN1.
6) Now DN1 will be responsible to replicate the received blocks to other Datanodes as chosen by the Namenode in #3; It will initiate replication when Namenode instructs it.
For you questions #3 and #4, it is important to understand that any distributed application will require a set of configurations available with each node to be able to interact with each other and also perform designated task as per expectation. In case every node chooses to have its own configuration what would be the basis of co-ordination? DN1 has replication factor of 5, while DN2 has of 2 how would data be actually replicated?
Update start
hdfs-site.xml contains lots of other config specifications as well for namenode, datanode and secondary namenode, some client and hdfs specific settings and not just the replication factor.
Now imagine having a 50 node cluster, would you like to go and configure on each node or simply copy a pre-configured file?
Update end
If you keep all configurations at one location, each node will need to connect to that shared resource to load configuration every time it has to perform an action, this would add to latency apart from consistency/synchronization issues in case any config property is changed.
Hope this helps.
Hadoop is designed to deal with large datasets. It's not a good idea to store a large dataset on a single machine because if your storage system or hard disk crashes, you may lose all of your data.
Before Hadoop, people were using a traditional system to store large amounts of data, but the traditional system was very costly. There were also challenges while analyzing large datasets from the traditional system as it was time consuming process to read data from the traditional system. With these things in mind, the Hadoop Framework was designed.
In the hadoop framework, when you load large amounts of data, it splits the data into small chunks, known as blocks. These blocks are basically used to place the data into a datanode in a distributed cluster, and also they also are used during the analysis of the data.
The region behind the splitting of the data is parallel processing and distributed storage (i.e.: you can store your data onto multiple machines, and when you want to analyze it you can do it via parallel analysis).
Now Coming to your questions:
Reason: Hadoop is a framework which allows distributed storage and computing. In other words, this means you can store the data onto multiple machines. It has functionality of replication that means you are keeping multiple copy (based on the replication factor) of the same data.
Ans1: Hadoop is designed to run on the commodity hardware and failures are common on commodity hardware so suppose if you store the data on a single machine and when your machine get crashed you will lose your entire data. But in the hadoop cluster you can recover the data from another replication( if you have replication factor more than 1) as hadoop doesn't store replicated copy of the data on the same machine where your original replication resides.These things are handled from hadoop itself.
Ans2: When you upload file on the HDFS, your actual data goes to the datanode and NameNode keep the metadata information of your data. NameNode metadata information conatains are like block name, block location, filename, directory location of the file.
Ans3: You need to maintain entire configuration related to your hadoop cluster. Maintaining one configuration file is not sufficient and further you may face other problem.
Ans4: NameNode configurations properties are related to NameNode functionality like namespace services metadata location etc,RPC address that handles all clients requests Datanode configuration properties are related to services which is performed by the DataNode like storage balancing among the DataNode's volumes,available disk space,the DataNode server address and port for data transfer
Please check this link to understand more about the different configuration property.
Please provide more clarification about the question 3 and 4 if you think something more you want to know.

Data replication in hadoop cluster

I am a beginner learning Hadoop. Is it possible that 2 different data blocks from the same file could be stored in the same data node? For example: blk-A and blk-B from file "file.txt" could be placed in the same data node (datanode 1).
Here is the documentation that explains block placement policy. Currently, HDFS replication is 3 by default which means there are 3 replicas of a block. The way they are placed is:
One block is placed on a datanode on a unique rack.
Second block is placed on a datanode on a different rack.
Third block is placed on a different datanode on the same rack as
second block.
This policy helps when there is an event such as datanode is dead, block gets corrupted, etc.
Is it possible?
Unless you make changes in the source code, there is no property that you can change that will allow you to place two blocks on same datanode.
My opinion is that placing two blocks on same datanode beats the purpose of HDFS. Blocks are replicated so HDFS can recover for reasons described above. If blocks are placed on same datanode and that datanode is dead, you will lose two blocks instead of one.
The answer depends on the cluster topology. Hadoop tries to distribute data among data centers and data nodes. But What if you only have one data center ? or if you have only one node cluster (pseudo cluster). In those cases the optimal distribution doesn't happen and it is possible that all blocks end in the same data node. In production it is recommended have more than one data center (physically, not only in configuration) and at least the same number of data nodes than the replication number.

What are the possible reasons behind the imbalance of files stored on HDFS?

Sometimes, the data blocks are stored in imbalanced way across the data node. Based on HDFS block placement policy, the first replica is favored to be stored on the writer node (i.e. the client node), then the second replica is stored on a remote rack and the third one is stored on a local rack. What are the use cases that make the data blocks unbalanced across the data nodes under this placement policy? one possible reason in mind that if the writer nodes are few, then one replica of the data blocks will be stored on these nodes. Are there any other reasons ?
Here are some potential reasons for data skew:
If some of the DataNodes are unavailable for some time (not accepting requests/writes), the cluster can end up unbalanced.
TaskTrackers are not collocated with DataNodes evenly across cluster nodes. If we write data through MapReduce in this situation, the cluster can be unbalanced because the nodes hosting both a TaskTracker and a DataNode would be preferred.
Same as above, but with the RegionServers of HBase.
Large deletion of data can result in an unbalanced cluster depending on the location of the deleted blocks.
Adding new DataNodes will not automatically rebalance existing blocks across the cluster.
The "hdfs balancer" command allows admins to rebalance the cluster. Also, https://issues.apache.org/jira/browse/HDFS-1804 added a new block storage policy that takes into account free space left on the volume.

HDFS replication factor

When I'm uploading a file to HDFS, if I set the replication factor to 1 then the file splits gonna reside on one single machine or the splits would be distributed to multiple machines across the network ?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to the Hadoop : Definitive Guide
Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy). The second replica is placed on a
different rack from the first (off-rack), chosen at random. The third replica is placed on
the same rack as the second, but on a different node chosen at random. Further replicas
are placed on random nodes on the cluster, although the system tries to avoid placing
too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes. But, the book was published in 2009 and there had been a lot of changes in the Hadoop framework.
I think it depends on, whether the client is same as a Hadoop node or not. If the client is a Hadoop node then all the splits will be on the same node. This doesn't provide any better read/write throughput in-spite of having multiple nodes in the cluster. If the client is not same as the Hadoop node, then the node is chosen at random for each split, so the splits are spread across the nodes in a cluster. Now, this provides a better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the node goes down, a couple of splits might be down, but at least some data can be recovered somehow from the remaining splits.
If you set replication to be 1, then the file will be present only on the client node, that is the node from where you are uploading the file.
If your cluster is single node then when you upload a file it will be spilled according to the blocksize and it remains in single machine.
If your cluster is Multi node then when you upload a file it will be spilled according to the blocksize and it will be distributed to different datanode in your cluster via pipeline and NameNode will decide where the data should be moved in the cluster.
HDFS replication factor is used to make a copy of the data (i.e) if your replicator factor is 2 then all the data which you upload to HDFS will have a copy.
If you set replication factor is 1 it means that the single node cluster. It has only one client node http://commandstech.com/replication-factor-in-hadoop/. Where you can upload files then use in a single node or client node.

Resources