Hadoop datanodes use too much bandwidth after adding new nodes

The problem is this: I had 3 datanodes when I created the cluster, and a few days ago I added another two datanodes.
After I did this, I ran the balancer; it finished quickly and reported that the cluster was balanced.
But I found that once I put data (about 30 MB) into the cluster, the datanodes used a lot of bandwidth (about 400 Mbps) sending and receiving data between the old datanodes and the new ones.
Could someone tell me what the possible reason might be?
Maybe I haven't described the problem very clearly, so I'll show you two pictures (from Zabbix): hadoop-02 is one of the "old" datanodes, and hadoop-07 is one of the "new" datanodes.

If you mean total network traffic: HDFS uses a write pipeline. Assuming the replication factor is 3, the data flow is
client --> Datanode_1 --> Datanode_2 --> Datanode_3
If the data size is 30 MB, the overall traffic is 90 MB plus a little overhead (for connection creation, packet headers, and data checksums in packets).
If you mean traffic rate: I believe HDFS currently has no bandwidth throttling between client <--> DN or DN <--> DN; it will use as much bandwidth as it can get.
If you noticed more data flowing between the old datanodes and the new ones, it may be because some blocks were under-replicated before. After you add new nodes, the NameNode periodically schedules replication tasks from the old DNs to other DNs (not necessarily the new ones).
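If you want to confirm that, a few stock commands help. This is only a hedged sketch: it assumes you run it on a node with the Hadoop client configured, and the dfsadmin calls need HDFS admin rights.

# Check the cluster for blocks that are still under- or mis-replicated;
# the summary at the end of the report includes an "Under-replicated blocks" count.
hdfs fsck /

# Per-datanode usage summary, useful for spotting skew between old and new nodes.
hdfs dfsadmin -report

# Cap the bandwidth each datanode may spend on balancing (bytes/sec; 10 MB/s here).
# Note this throttles balancer traffic only, not normal client/DN writes.
hdfs dfsadmin -setBalancerBandwidth 10485760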

Hold on! Are you saying that the bandwidth is over-utilized during the data transfer, or that the DNs were not balanced after putting the data in? The balancer is used to balance the amount of data present on the nodes in the cluster.

Related

Is it good to have hadoop Namenode and datanode in two different networks?

We are installing an HA-enabled 10-node Hadoop cluster using the Cloudera distribution.
Is it good to have the Namenode and datanodes on two different subnets secured by a hardware firewall?
As long as network requests work in both directions between the active namenode (assuming you set up HA) and every datanode, it should work fine, although the extra network hop would add some latency.
In big data networks, a single client interaction generates a large number of node-to-node interactions to complete the expected operation or produce the result (for example, a client reading more than a single block of data). Such networks suffer a performance impact from the additional hop count, which increases latency between the client, the namenode & jobtracker, and the datanode & tasktracker as data traverses rack switches.
Hadoop provides distributed processing of large data sets across clusters of computers, which means networking plays a key role in the deployment architecture and is directly tied to performance and scalability. HDFS and MapReduce have a high east-west traffic pattern.
In HDFS, if rack awareness is configured for HA, replication is a continuous activity that happens across the network based on the replication factor. The shuffle phase, which transfers data from mappers to reducers, is one of the most bandwidth-consuming activities in Hadoop, since all the involved servers transfer data to each other simultaneously; this makes the network topology critical.
Also, RPC mechanisms are used by platform services such as HDFS, HBase, and Hive when a client asks a remote service to execute a function. Every RPC needs its response returned to the client as soon as possible, and if the response is delayed, the command takes longer to execute.
For optimum Hadoop performance, the network must have high bandwidth, low latency, and reliable connectivity between nodes, which boils down to keeping the hop count as low as possible.
In a typical network deployment, firewalls placed between cluster nodes can impact cluster performance because they have to inspect the packets on the network. Hence, it is better to avoid a firewall between nodes in the cluster.
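If you do split nodes across subnets, it is also worth checking how the cluster actually maps datanodes to racks. A small hedged sketch, assuming a working HDFS client and admin privileges:

# Print the datanode-to-rack mapping as the namenode currently sees it;
# with no topology script configured, every node lands in /default-rack.
hdfs dfsadmin -printTopology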

Replication Path Cassandra NoSQL

In data replication, is it correct to claim that the time for replication is the write time on the source server plus the delay between the nodes plus the write time on the target server?
Basically. That's included in the coordinator latency if you want to look at that, at least up to the number of replicas required by the requested consistency level.
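If you want to observe that coordinator-side latency rather than reason about it, a hedged sketch (assuming nodetool is available on a cluster node):

# Coordinator (proxy) read/write latency histograms; these include the replica
# writes and inter-node hops needed to satisfy the requested consistency level.
nodetool proxyhistograms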

Cannot understand the reason of why HDFS can scale to a large number of concurrent clients

Recently I was reading the book "Hadoop: The Definitive Guide" and came across a paragraph I cannot understand:
"One important aspect of this design is that the client contacts datanodes directly to
retrieve data and is guided by the namenode to the best datanode for each block. This
design allows HDFS to scale to a large number of concurrent clients because the data
traffic is spread across all the datanodes in the cluster."
I cannot understand why the author says "This design allows HDFS to scale to a large number of concurrent clients". He does give the reason, "because the data traffic is spread across all the datanodes in the cluster", but I still don't follow. Can someone explain it to me in an easier way?
There is a single namenode which knows on which datanodes the blocks for a given file are located.
Imagine your HDFS has a file of size 1024 MB, split into 8 blocks of size 128 MB. Let's imagine those blocks are conveniently distributed on 8 different nodes.
When your client needs to download the file, it will ask the namenode for it. If the namenode wanted to return the file itself, it would have to fetch the blocks from all the datanodes first and then return them one by one to the client. This would be grossly inefficient because the namenode would have to serve all clients on its own and waste memory/CPU storing and serving intermediate data. If 2 clients requested the same file at the same time, the namenode would have to serve 16 blocks in total.
What would be efficient, though, is if your client downloaded 1 block directly from each machine. That way, if you had 2 clients requesting the same file, each datanode would only have to serve 2 blocks simultaneously.
When you use a HDFS client, for example the FS shell which comes with your hadoop installation and is callable via hdfs dfs -<cmd> or hadoop fs -<cmd>, the client asks the namenode for the file, on port 9000 by default. The namenode returns the URIs of the different blocks on the different datanodes. The client can download the blocks from separate datanodes, usually on port 50010, the data transfer port.
If you use the FS shell to download a file, and monitor the network on your client machine, you will see that it downloads blocks directly from different datanodes. Here is an example downloading a 4-block file and monitoring the network with the tcptrack tool.
This relieves the namenode, spreading out the workload, and also enables the client to make simultaneous downloads of blocks from multiple datanodes. You can see that in the screenshot, where 2 connections are active and 1 connection is closing from 3 different IP addresses (datanodes).
When all downloads are finished, the client concatenates the blocks to obtain the full file.
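You can look at this block-to-datanode mapping yourself. A minimal sketch, assuming a file already exists at the hypothetical path /data/bigfile on your cluster:

# Ask the namenode where each block of the file lives (metadata only, no data transfer).
hdfs fsck /data/bigfile -files -blocks -locations

# Then download it; the client pulls the blocks from those datanodes directly.
hdfs dfs -get /data/bigfile ./bigfile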

Need of maintaining replication factor on datanodes

Please pardon me if this question has come up earlier; I was not able to find any related question.
1) Why is it important to maintain the same replication factor (or, for that matter, any configuration) across the datanodes and namenodes in the cluster?
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
3) Wouldn't maintaining the configuration only on the namenodes suffice?
4) What are the implications of having the configuration different across namenode and datanodes?
Any Help is much appreciated. Thank you! :)
I will try to answer your question taking replication as an example.
A few things to keep in mind:
Data always resides on the datanodes; the Namenode never stores or handles the data itself, it only keeps metadata about the data.
The replication factor is configurable, and you can set it per file: file1 may have a replication factor of 2 while file2 has a replication factor of, say, 3. In a similar way, some other properties can also be set at execution time (see the command sketch at the end of this answer).
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
I am not sure exactly what you mean by the namenode managing the storage; here is how a file upload to HDFS is executed:
1) The client sends a request to the Namenode to upload a file to HDFS.
2) Based on the configuration (if not explicitly specified by the client application), the Namenode calculates the number of blocks the data will be broken into.
3) The Namenode also decides which Datanodes will store the blocks, based on the replication factor specified in the configuration (if not explicitly specified by the client application).
4) The Namenode sends the information calculated in steps #2 and #3 to the client.
5) The client application breaks the file into blocks and writes each block to a Datanode, say DN1.
6) DN1 is then responsible for replicating the received blocks to the other Datanodes chosen by the Namenode in #3; it initiates replication when the Namenode instructs it.
For your questions #3 and #4, it is important to understand that any distributed application requires a set of configurations to be available on each node so the nodes can interact with each other and perform their designated tasks as expected. If every node chose its own configuration, what would be the basis of coordination? If DN1 had a replication factor of 5 while DN2 had 2, how would the data actually be replicated?
Update start
hdfs-site.xml contains many other configuration settings as well, for the namenode, datanode, and secondary namenode, plus some client- and HDFS-specific settings, not just the replication factor.
Now imagine a 50-node cluster: would you rather go and configure each node by hand, or simply copy a pre-configured file?
Update end
If you kept all configuration in one location, each node would need to connect to that shared resource to load the configuration every time it had to perform an action. This would add latency, on top of consistency/synchronization issues whenever a config property changed.
Hope this helps.
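As mentioned above, the replication factor can be set per file rather than only cluster-wide. A small sketch; the path /user/test and the file names are hypothetical:

# Upload a file with a per-file replication factor of 2, overriding dfs.replication.
hadoop fs -D dfs.replication=2 -put file1.txt /user/test/file1.txt

# Change the replication factor of an existing file to 3 and wait until it is done.
hdfs dfs -setrep -w 3 /user/test/file2.txt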
Hadoop is designed to deal with large datasets. It's not a good idea to store a large dataset on a single machine because if your storage system or hard disk crashes, you may lose all of your data.
Before Hadoop, people used traditional systems to store large amounts of data, but those systems were very costly, and analyzing large datasets with them was challenging because reading the data was a time-consuming process. The Hadoop framework was designed with these things in mind.
In the Hadoop framework, when you load a large amount of data, it splits the data into small chunks known as blocks. These blocks are used to place the data onto datanodes in a distributed cluster, and they are also used during the analysis of the data.
The reason behind splitting the data is parallel processing and distributed storage (i.e. you can store your data on multiple machines, and when you want to analyze it you can do so in parallel).
Now Coming to your questions:
Reason: Hadoop is a framework which allows distributed storage and computing. In other words, you can store the data on multiple machines. It provides replication, meaning you keep multiple copies (based on the replication factor) of the same data.
Ans1: Hadoop is designed to run on commodity hardware, and failures are common on commodity hardware. If you stored the data on a single machine and that machine crashed, you would lose your entire data. In a Hadoop cluster, however, you can recover the data from another replica (if your replication factor is more than 1), because Hadoop doesn't store a replicated copy of the data on the same machine where the original copy resides. These things are handled by Hadoop itself.
Ans2: When you upload a file to HDFS, your actual data goes to the datanodes, and the NameNode keeps the metadata about your data. The NameNode metadata includes things like the block names, block locations, filename, and the directory location of the file.
Ans3: You need to maintain the entire configuration for your Hadoop cluster. Maintaining one configuration file is not sufficient, and you may face other problems if you try.
Ans4: NameNode configuration properties relate to NameNode functionality, such as the namespace metadata location and the RPC address that handles all client requests. DataNode configuration properties relate to services performed by the DataNode, such as storage balancing among the DataNode's volumes, available disk space, and the DataNode server address and port for data transfer.
Please check this link to learn more about the different configuration properties.
Please provide more clarification about questions 3 and 4 if there is something more you want to know.
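If you want to verify what a particular node has actually loaded (and whether it agrees with the rest of the cluster), here is a hedged sketch using the stock getconf tool; the keys shown are just common examples:

# Print individual configuration values as resolved on this node.
hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.blocksize

# List the namenode(s) this node's configuration points at.
hdfs getconf -namenodes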

Is it faster to replicate your data in hdfs for all your nodes?

If I have 6 data nodes, is it faster to set the replication factor to 6 so all the data is replicated across all my nodes, allowing the cluster to split up queries (say in Hive) without having to move data around? I believe that if you have a replication factor of 3 and you put a 300 GB file into HDFS, it splits it across just 3 of the data nodes, and then when all 6 nodes are needed for a query, data has to be moved to the other 3 nodes where it doesn't exist, causing slower responses. Is that accurate?
I understand what you mean; you are talking about data locality. Generally speaking, data locality can reduce run time because it saves the time spent transmitting blocks over the network. But in fact, if you don't enable "HDFS Short-Circuit Local Reads" (it is off by default; please visit here), a MapTask will still read the block over the TCP protocol, i.e. through the network stack, even if the block and the MapTask are on the same node.
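To check whether short-circuit reads are actually enabled on your cluster, a hedged sketch using the standard HDFS property names (run on a node with the Hadoop client configured):

# Both properties must be set for short-circuit local reads to work:
# dfs.client.read.shortcircuit should be true and dfs.domain.socket.path non-empty.
hdfs getconf -confKey dfs.client.read.shortcircuit
hdfs getconf -confKey dfs.domain.socket.path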
Recently, while optimizing Hadoop and HDFS, we replaced the HDD disks with SSDs, but we found the effect was not good and the run time was not shorter, because the disk was not the bottleneck and the network load was not heavy. From the results, we concluded that the CPU was heavily loaded. If you want to understand your Hadoop cluster's situation clearly, I advise you to use Ganglia to monitor the cluster; it can help you analyze your cluster's bottleneck. Please see here.
Lastly, Hadoop is a very large and complicated system; disk performance, CPU performance, network bandwidth, parameter values, and many other factors all need to be considered. If you want to save time, there is much more to work on than just the replication factor.
