How to explicilty define datanodes to store a particular given file in HDFS? - hadoop

I want to write a script or something like .xml file which explicitly defines the datanodes in Hadoop cluster to store a particular file blocks.
for example:
Suppose there are 4 slave nodes and 1 Master node (total 5 nodes in hadoop cluster ).
there are two files file01(size=120 MB) and file02(size=160 MB).Default block size =64MB
Now I want to store one of two blocks of file01 at slave node1 and other one at slave node2.
Similarly one of three blocks of file02 at slave node1, second one at slave node3 and third one at slave node4.
So,my question is how can I do this ?
actually there is one method :Make changes in conf/slaves file every time to store a file.
but I don't want to do this
So, there is another solution to do this ??
I hope I made my point clear.
Waiting for your kind response..!!!

There is no method to achieve what you are asking here - the Name Node will replicate blocks to data nodes based upon rack configuration, replication factor and node availability, so even if you do managed to get a block on two particular data nodes, if one of those nodes goes down, the name node will replicate the block to another node.
Your requirement is also assuming a replication factor of 1, which doesn't give you any data redundancy (which is a bad thing if you lose a data node).
Let the namenode manage block assignments and use the balancer periodically if you want to keep your cluster evenly distibuted

NameNode is an ultimate authority to decide on the block placement.
There is Jira about the requirements to make this algorithm pluggable:
https://issues.apache.org/jira/browse/HDFS-385
but unfortunetely it is in the 0.21 version, which is not production (alhough working not bad at all).
I would suggest to plug you algorithm to 0.21 if you are on the research state and then wait for 0.23 to became production, or, to downgrade the code to 0.20 if you do need it now.

Related

HDFS replication factor on single node cluster

Can I have more than one replica's for single node cluster? i have updated replication factor as 2 in hdfs-site.xml and restarted all nodes, but still only one block created for new files, help me to get clarity on this
No. You can't have more than one replication factor for a single node cluster. What makes you think that it is even possible?
Replication is the procedure to save your data so that you don't lose it in any worst conditions. If you're setting it to 2, that means you want your data to be copied on 2 nodes(machines) so that if one goes down you'll still have your data safe on another node.
Now, the default replication provided by Hadoop is 3. Which means there will 3 Replications(Copy) of data on 3 different nodes on different racks(That's another concept which is called Hadoop's Rack awareness)
So you won't be able to get more than one copy of your data on a Single node cluster. I hope it clears your query!

Data replication in hadoop cluster

I am a beginner learning Hadoop. Is it possible that 2 different data blocks from the same file could be stored in the same data node? For example: blk-A and blk-B from file "file.txt" could be placed in the same data node (datanode 1).
Here is the documentation that explains block placement policy. Currently, HDFS replication is 3 by default which means there are 3 replicas of a block. The way they are placed is:
One block is placed on a datanode on a unique rack.
Second block is placed on a datanode on a different rack.
Third block is placed on a different datanode on the same rack as
second block.
This policy helps when there is an event such as datanode is dead, block gets corrupted, etc.
Is it possible?
Unless you make changes in the source code, there is no property that you can change that will allow you to place two blocks on same datanode.
My opinion is that placing two blocks on same datanode beats the purpose of HDFS. Blocks are replicated so HDFS can recover for reasons described above. If blocks are placed on same datanode and that datanode is dead, you will lose two blocks instead of one.
The answer depends on the cluster topology. Hadoop tries to distribute data among data centers and data nodes. But What if you only have one data center ? or if you have only one node cluster (pseudo cluster). In those cases the optimal distribution doesn't happen and it is possible that all blocks end in the same data node. In production it is recommended have more than one data center (physically, not only in configuration) and at least the same number of data nodes than the replication number.

why isn't hadoop distributing a file to all nodes?

I set up a 4 node hadoop cluster according to the walk-through in http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/. I used replication of 1 (the cluster is just for testing)
I copied a 2GB file from local. When browsing the file in the http interface I see it was split to 31 blocks, but all of them are on one node (the master)
Is this correct? How can I investigate the reason?
They are all on one node because by default Hadoop will write to the local node first by default. I'm going to guess you were using the Hadoop client from that node. Since you have a replication of one, it's only going to be on that node.
Since you are just playing around, you might want to force spreading the data out. To do this, you can run the rebalancer with hadoop rebalancer. Just control-C it after a few minutes.

HDFS replication factor

When I'm uploading a file to HDFS, if I set the replication factor to 1 then the file splits gonna reside on one single machine or the splits would be distributed to multiple machines across the network ?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to the Hadoop : Definitive Guide
Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy). The second replica is placed on a
different rack from the first (off-rack), chosen at random. The third replica is placed on
the same rack as the second, but on a different node chosen at random. Further replicas
are placed on random nodes on the cluster, although the system tries to avoid placing
too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes. But, the book was published in 2009 and there had been a lot of changes in the Hadoop framework.
I think it depends on, whether the client is same as a Hadoop node or not. If the client is a Hadoop node then all the splits will be on the same node. This doesn't provide any better read/write throughput in-spite of having multiple nodes in the cluster. If the client is not same as the Hadoop node, then the node is chosen at random for each split, so the splits are spread across the nodes in a cluster. Now, this provides a better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the node goes down, a couple of splits might be down, but at least some data can be recovered somehow from the remaining splits.
If you set replication to be 1, then the file will be present only on the client node, that is the node from where you are uploading the file.
If your cluster is single node then when you upload a file it will be spilled according to the blocksize and it remains in single machine.
If your cluster is Multi node then when you upload a file it will be spilled according to the blocksize and it will be distributed to different datanode in your cluster via pipeline and NameNode will decide where the data should be moved in the cluster.
HDFS replication factor is used to make a copy of the data (i.e) if your replicator factor is 2 then all the data which you upload to HDFS will have a copy.
If you set replication factor is 1 it means that the single node cluster. It has only one client node http://commandstech.com/replication-factor-in-hadoop/. Where you can upload files then use in a single node or client node.

How can I be sure that data is distributed evenly across the hadoop nodes?

If I copy data from local system to HDFS, сan I be sure that it is distributed evenly across the nodes?
PS HDFS guarantee that each block will be stored at 3 different nodes. But does this mean that all blocks of my files will be sorted on same 3 nodes? Or will HDFS select them by random for each new block?
If your replication is set to 3, it will be put on 3 separate nodes. The number of nodes it's placed on is controlled by your replication factor. If you want greater distribution then you can increase the replication number by editing the $HADOOP_HOME/conf/hadoop-site.xml and changing the dfs.replication value.
I believe new blocks are placed almost randomly. There is some consideration for distribution across different racks (when hadoop is made aware of racks). There is an example (can't find link) that if you have replication at 3 and 2 racks, 2 blocks will be in one rack and the third block will be placed in the other rack. I would guess that there is no preference shown for what node gets the blocks in the rack.
I haven't seen anything indicating or stating a preference to store blocks of the same file on the same nodes.
If you are looking for ways to force balancing data across nodes (with replication at whatever value) a simple option is $HADOOP_HOME/bin/start-balancer.sh which will run a balancing process to move blocks around the cluster automatically.
This and a few other balancing options can be found in at the Hadoop FAQs
Hope that helps.
You can open HDFS Web UI on port 50070 of Your namenode. It will show you the information about data nodes. One thing you will see there - used space per node.
If you do not have UI - you can look on the space used in the HDFS directories of the data nodes.
If you have a data skew, you can run rebalancer which will solve it gradually.
Now with Hadoop-385 patch, we can choose the block placement policy, so as to place all blocks of a file in the same node (and similarly for replicated nodes). Read this blog about this topic - look at the comments section.
Yes, Hadoop distributes data per block, so each block would be distributed separately.

Resources