Please pardon me if this question has come up earlier; I was not able to find any related question on this.
1) Why is it important to maintain the same replication factor (or, for that matter, any configuration) across the datanodes and namenodes in the cluster?
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
3) Wouldn't maintaining the configuration only on the namenodes suffice?
4) What are the implications of having the configuration different across namenode and datanodes?
Any help is much appreciated. Thank you! :)
I will try to answer your questions, taking replication as an example.
A few things to keep in mind:
Data always resides on the datanodes; the namenode never deals with or stores data, it only keeps metadata about the data.
The replication factor is configurable per file copy: file1 may have a replication factor of 2 while file2 has a replication factor of, say, 3. In a similar way, some other properties can also be configured at execution time.
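For example, here is a minimal sketch using the Java FileSystem API (the paths /file1 and /file2 are hypothetical); the FS shell equivalent would be hdfs dfs -setrep 2 /file1:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Each file can carry its own replication factor,
        // independent of the cluster-wide dfs.replication default.
        fs.setReplication(new Path("/file1"), (short) 2);
        fs.setReplication(new Path("/file2"), (short) 3);

        fs.close();
    }
}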
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
I am not sure what exactly you mean by the namenode managing the storage; here is how a file upload to HDFS gets executed:
1) The client sends a request to the namenode to upload a file to HDFS.
2) Based on the configuration (if not explicitly specified by the client application), the namenode calculates the number of blocks the data will be broken into.
3) The namenode also decides which datanodes will store the blocks, based on the replication factor specified in the configuration (if not explicitly specified by the client application).
4) The namenode sends the information calculated in steps #2 and #3 to the client.
5) The client application breaks the file into blocks and writes each block to a datanode, say DN1.
6) DN1 is then responsible for replicating the received blocks to the other datanodes chosen by the namenode in #3; it initiates replication when the namenode instructs it to.
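From the client's point of view, steps 1 through 6 are hidden behind a single create/write call; here is a minimal Java sketch of the client side (the path and payload are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-4 (negotiating blocks with the namenode) and
        // steps 5-6 (streaming blocks to the datanodes and their
        // replication) all happen behind this call.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/file1.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}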
For your questions #3 and #4, it is important to understand that any distributed application requires a common set of configurations on each node so that the nodes can interact with each other and perform their designated tasks as expected. If every node chose its own configuration, what would be the basis of coordination? If DN1 had a replication factor of 5 while DN2 had 2, how would data actually be replicated?
Update start
hdfs-site.xml contains many other configuration settings as well, for the namenode, datanode, and secondary namenode, plus some client- and HDFS-specific settings, not just the replication factor.
Now imagine having a 50-node cluster: would you rather go and configure each node individually, or simply copy a pre-configured file?
Update end
If you kept all configuration in one shared location, each node would need to connect to that shared resource to load the configuration every time it had to perform an action. That would add latency, on top of the consistency/synchronization issues whenever any config property changed.
Hope this helps.
Hadoop is designed to deal with large datasets. It's not a good idea to store a large dataset on a single machine because if your storage system or hard disk crashes, you may lose all of your data.
Before Hadoop, people used traditional systems to store large amounts of data, but those systems were very costly. There were also challenges in analyzing large datasets, since reading data from a traditional system was a time-consuming process. With these things in mind, the Hadoop framework was designed.
In the Hadoop framework, when you load a large amount of data, it is split into small chunks, known as blocks. The blocks are used to place the data on the datanodes of the distributed cluster, and they are also used during analysis of the data.
The reason behind splitting the data is parallel processing and distributed storage (i.e., you can store your data on multiple machines, and when you want to analyze it you can do so in parallel).
Now, coming to your questions:
Reason: Hadoop is a framework that provides distributed storage and computing; in other words, you can store data on multiple machines. It has replication functionality, meaning you keep multiple copies (based on the replication factor) of the same data.
Ans1: Hadoop is designed to run on commodity hardware, and failures are common on commodity hardware. Suppose you stored the data on a single machine: when that machine crashed, you would lose all of your data. In a Hadoop cluster, however, you can recover the data from another replica (if your replication factor is greater than 1), because Hadoop doesn't store a replicated copy of the data on the same machine where the original copy resides. These things are handled by Hadoop itself.
Ans2: When you upload a file to HDFS, your actual data goes to the datanodes, and the NameNode keeps the metadata about your data. The NameNode's metadata contains things like the block names, block locations, the filename, and the directory location of the file.
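You can see a slice of that metadata from a client through the FileStatus API; a small sketch (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/demo/file1.txt"));

        // All of this is answered from the namenode's metadata;
        // no datanode is contacted.
        System.out.println("path:        " + st.getPath());
        System.out.println("length:      " + st.getLen());
        System.out.println("block size:  " + st.getBlockSize());
        System.out.println("replication: " + st.getReplication());
        fs.close();
    }
}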
Ans3: You need to maintain the configuration across your entire Hadoop cluster. Maintaining the configuration file on only one node is not sufficient, and you may face other problems.
Ans4: NameNode configuration properties relate to NameNode functionality, such as the namespace metadata location and the RPC address that handles all client requests. DataNode configuration properties relate to services performed by the DataNode, such as storage balancing among the DataNode's volumes, available disk space, and the DataNode server address and port for data transfer.
Please check this link to understand more about the different configuration properties.
Please provide more clarification about questions 3 and 4 if there is something more you want to know.
I just started learning Hadoop, and I am a little confused about how the data is stored in a distributed manner. I have an MPI background. With MPI, we typically have a master processor that sends out data to various other processors; this is done explicitly by the programmer.
With Hadoop, you have a Hadoop Distributed File System (HDFS). So when you put some file from your local server into HDFS, does HDFS automatically store this file in a distributed manner without anything needed to be done by the programmer? The name, HDFS, seems to imply this, but I just wanted to verify.
Yes, it does.
The file is uploaded, and the NameNode coordinates the replication, based on the replication factor (usually 3), to the DataNodes where it is stored.
In addition, the NameNode has a job that looks for under-replicated files or blocks and will duplicate them to maintain the replication factor. See HDFS Architecture - Data Replication for more information.
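That bookkeeping happens inside the NameNode (hdfs fsck / is the usual way to see actual under-replicated blocks). As a rough client-side approximation, you can compare each file's requested replication factor against a target; in this sketch the directory and the target of 3 are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FindLowReplication {
    public static void main(String[] args) throws Exception {
        final short target = 3; // assumed desired replication factor
        FileSystem fs = FileSystem.get(new Configuration());

        // Flag every file under /user/demo whose requested replication
        // factor is below the target. Note: this is the requested factor,
        // not the live replica count the NameNode monitors.
        for (FileStatus st : fs.listStatus(new Path("/user/demo"))) {
            if (st.isFile() && st.getReplication() < target) {
                System.out.println("low replication: " + st.getPath());
            }
        }
        fs.close();
    }
}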
I have read multiple articles about how HBase gains data locality, e.g. this link, or the HBase: The Definitive Guide book.
I understand that when an HFile is rewritten, Hadoop writes the blocks to the same machine, which is the machine running the RegionServer that performed the compaction and created the bigger file on Hadoop. Everything is well understood so far.
Questions:
Assume a RegionServer has a region file (HFile) which is split on Hadoop into multiple blocks, i.e. A, B, C. Does that mean all the blocks (A, B, C) will be written to the same RegionServer?
What would happen if the HFile after compaction had 10 blocks (a huge file), but the RegionServer didn't have storage for all of them? Does that mean we lose data locality, since those blocks would be written to other machines?
Thanks for the help.
HBase uses the HDFS API to write data to the distributed file system (HDFS). I know this will increase your doubt about the data locality.
When a client writes data to HDFS using the HDFS API, it ensures that a copy of the data is written to the local datanode (if applicable) and then goes for replication.
Now I will answer your questions,
Yes. The HFile blocks written by a specific RegionServer (RS) reside on the local datanode until the region is moved for load balancing or recovery by the HMaster (locality is regained on the next major compaction). So the blocks A, B, C would be on the same RegionServer's machine.
Yes, this may happen. But we can control it by configuring region start and end keys for each region of an HBase table at creation time, which allows the data to be distributed equally across the cluster.
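For instance, here is a minimal sketch of pre-splitting a table at creation time with the HBase Admin API; the table name, column family, and split keys are made up:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            // Region boundaries chosen up front, so rows spread evenly
            // across RegionServers from the start.
            byte[][] splitKeys = {
                Bytes.toBytes("g"),
                Bytes.toBytes("n"),
                Bytes.toBytes("t")
            };

            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("demo_table"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .build(),
                splitKeys);
        }
    }
}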
Hope this helps.
Recently I read the book Hadoop: The Definitive Guide, and I came across one paragraph which I cannot understand:
"One important aspect of this design is that the client contacts datanodes directly to
retrieve data and is guided by the namenode to the best datanode for each block. This
design allows HDFS to scale to a large number of concurrent clients because the data
traffic is spread across all the datanodes in the cluster."
I cannot understand why the author says "This design allows HDFS to scale to a large number of concurrent clients". He gives the reason, "because the data traffic is spread across all the datanodes in the cluster", but I cannot follow his words. Can someone explain it to me in an easier way?
There is a single namenode which knows on which datanodes the blocks for a given file are located.
Imagine your HDFS has a file of size 1024 MB, split into 8 blocks of size 128 MB. Let's imagine those blocks are conveniently distributed on 8 different nodes.
When your client needs to download the file it will ask the namenode for it. If the namenode wanted to return the file itself, it would have to download it from all the datanodes first, and then return them one by one to the client. This is grossly inefficient because the namenode would have to serve all clients on its own, and waste memory/cpu to store and serve intermediate data. If 2 clients requested the same file at the same time, the namenode would have to serve 16 blocks in total to the clients.
What would be efficient, though, is if your client directly downloaded 1 block from each machine. That way, if you had 2 clients requesting the same file, each datanode would only have to serve 2 blocks simultaneously.
When you use an HDFS client, for example the FS shell which comes with your Hadoop installation and is callable via hdfs dfs -<cmd> or hadoop fs -<cmd>, the client asks the namenode for the file, on port 9000 by default. The namenode returns the URIs of the different blocks on the different datanodes. The client can then download the blocks from the separate datanodes, usually on port 50010, the data transfer port.
If you use the FS shell to download a file, and monitor the network on your client machine, you will see that it downloads blocks directly from different datanodes. Here is an example downloading a 4-block file and monitoring the network with the tcptrack tool.
This relieves the namenode, spreading out the workload, and also enables the client to make simultaneous downloads of blocks from multiple datanodes. You can see that in the screenshot, where 2 connections are active and 1 connection is closing from 3 different IP addresses (datanodes).
When all downloads are finished, the client concatenates all the blocks to obtain the full file.
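You can observe the same block-to-datanode mapping through the client API; a small sketch (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/demo/bigfile"));

        // The namenode answers this query; the client then reads each
        // block directly from one of the listed datanodes.
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("offset " + b.getOffset()
                + ", length " + b.getLength()
                + ", hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}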
I'm planning a data processing pipeline. My scenario is this:
A user uploads data to a server
This data should be distributed to one (and only one) node in my cluster. There is no distributed computing, just picking the node which currently has the least to do.
The data processing pipeline gets data from some kind of distributed job engine. Here, though, is (finally) my question: many job engines rely on HDFS to work on the data. But since this data is processed on only one node, I would rather avoid distributing it. My understanding is that HDFS keeps the data redundant, though I could not find any info on whether this means all data on HDFS is available on all nodes, or the data mostly resides on the node where it is processed (locality).
For IO reasons, it would be a concern in my usage scenario if data on HDFS were completely redundant.
You can go with Hadoop (Map Reduce + HDFS) to solve your problem.
You can tell HDFS to store a specific number of copies; see the dfs.replication property below. Set this value to 1 if you want only one copy.
conf/hdfs-site.xml - On master and all slave machines
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
It is not necessary that HDFS copies data to each and every node. More info.
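If you would rather not edit the XML on every machine, the same property can also be set per client application before the FileSystem handle is created; a small sketch (the path and payload are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleCopyWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Overrides the cluster-wide default, but only for files
        // created through this client's FileSystem handle.
        conf.set("dfs.replication", "1");

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/local-only.dat"))) {
            out.writeBytes("single replica\n");
        }
        fs.close();
    }
}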
Hadoop works on the principle of 'move code to data'. Since moving code (mostly in MBs) demands far less network bandwidth than moving data in GBs or TBs, you do not need to worry about data locality or network bandwidth; Hadoop takes care of it.
I was going through Hadoop and I have a doubt: is there a difference between rack awareness and the NameNode? Will rack awareness and the NameNode remain on the same box?
As Aviral rightly said, the question is quite vague. But just quoting for your understanding:
Namenode : The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
You can read in detail about this concept here.
Rack Awareness : In simple words, rack awareness is the strategy the namenode employs to choose the nearest datanode based on rack information. You can read the details here.
Furthermore, I would like to suggest this blog.
From the Apache HDFS Users Guide:
HDFS is the primary distributed storage used by Hadoop applications.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data
Typically large Hadoop clusters are arranged in racks and network traffic between different nodes with in the same rack is much more desirable than network traffic across the racks. In addition NameNode tries to place replicas of block on multiple racks for improved fault tolerance.
From the RackAwareness tutorial:
Hadoop components are rack-aware. For example, HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.
Let's see how Hadoop writes are implemented.
If the writer is on a datanode, the 1st replica is placed on the local machine, otherwise a random datanode.
The 2nd replica is placed on a datanode that is on a different rack.
The 3rd replica is placed on a datanode which is on a different node of the same rack as the 2nd replica.
Due to the replication of data blocks on three different nodes across two different racks, Hadoop read operations provide high availability of data blocks.
At least one replica is stored on a different rack. If one rack is not accessible, Hadoop can still fetch the data block from the other rack.
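If your cluster has a rack topology configured, you can check the rack of each replica from a client too; a hedged sketch (the path is made up, and getTopologyPaths() returns values like /rack1/host:port only when topology information is available):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowRackPlacement {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/demo/file1.txt"));

        // Each topology path names the rack and datanode of a replica,
        // so you can verify that replicas really span two racks.
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("block at offset " + b.getOffset() + ": "
                + String.join(", ", b.getTopologyPaths()));
        }
        fs.close();
    }
}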