why isn't Hadoop distributing a file to all nodes?

I set up a 4-node Hadoop cluster according to the walk-through at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/. I used a replication factor of 1 (the cluster is just for testing).
I copied a 2 GB file from the local filesystem. When browsing the file in the HTTP interface I see it was split into 31 blocks, but all of them are on one node (the master).
Is this correct? How can I investigate the reason?

They are all on one node because Hadoop writes the first replica to the local node by default. I'm going to guess you were using the Hadoop client from that node. Since your replication factor is 1, the data will only be on that node.
Since you are just playing around, you might want to force spreading the data out. To do this, you can run the balancer with hadoop balancer. Just control-C it after a few minutes.
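For example, a minimal sketch of running the balancer (the threshold value is illustrative; on Hadoop 2.x and later the same tool is invoked as hdfs balancer):

# Move blocks around until every datanode's utilisation is within 10% of
# the cluster average. Ctrl-C after a few minutes for a rough spread.
hadoop balancer -threshold 10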

Related

Deleting HDFS Block Pool

I am running Spark on a Hadoop cluster. I tried running a Spark job and noticed I was getting some issues; by looking at the datanode logs I eventually realised that the filesystem of one of the datanodes is full.
I looked at hdfs dfsadmin -report to identify this. The category DFS remaining is 0B because the non-DFS used is massive (155GB of 193GB configured capacity).
When I looked at the file system on this data node I could see most of this comes from the /usr/local/hadoop_work/ directory. There are three block pools there and one of them is very large (98GB). When I look on the other data node in the cluster it only has one block pool.
What I am wondering is can I simply delete two of these block pools? I'm assuming (but don't know enough about this) that the namenode (I have only one) will be looking at the most recent block pool which is smaller in size and corresponds to the one on the other data node.
As outlined in the comment above, eventually I did just delete the two block pools. I did this based on the fact that these block pool IDs didn't exist on the other datanode, and by looking through the local filesystem I could see the files under these IDs hadn't been updated for a while.
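Before deleting anything, it is worth confirming which block pool the namenode is actually serving. A rough sketch, assuming typical directory locations (the real paths come from dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml):

# On the namenode: the active block pool ID is recorded in the VERSION file
grep blockpoolID /usr/local/hadoop_work/hdfs/namenode/current/VERSION

# On each datanode: list the block pools present on disk
ls -d /usr/local/hadoop_work/hdfs/datanode/current/BP-*

# A BP-* directory whose ID does not match the namenode's blockpoolID is
# left over from an earlier format and is the candidate for removal.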

Hadoop doesn't use one node for job

I've got a four-node YARN cluster set up and running. I recently had to format the namenode due to a smaller problem.
Later I ran Hadoop's Pi example to verify every node was still taking part in the calculation, which they all did. However, when I start my own job now, one of the nodes is not being used at all.
I figured this might be because this node doesn't have any data to work on. So I tried to balance the cluster using the balancer. This doesn't work and the balancer tells me the cluster is balanced.
What am I missing?
While processing, your ApplicationMaster negotiates containers with the ResourceManager, which in turn tries to allocate them on the datanodes nearest to the data. Since your replication factor is 3, HDFS will try to place one whole copy on a single datanode and distribute the rest across all the datanodes.
1) Change the replication factor to 1 (Since you are only trying to benchmark, reducing replication should not be a big issue).
2) Make sure your client (the machine from which you issue the -copyFromLocal command) does not have a datanode running on it. If it does, HDFS will tend to place most of the data on that node, since writing locally has lower latency (see the sketch after this list).
3) Control the file distribution using the dfs.blocksize property.
4) Check the status of your datanodes using hdfs dfsadmin -report.
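For points 1 and 2, a minimal sketch (file names and HDFS paths are illustrative) of copying from a machine that is not running a datanode and then checking where the blocks landed:

# Copy the input with a replication factor of 1 so blocks are spread out
hadoop fs -D dfs.replication=1 -copyFromLocal input.dat /user/hduser/input.dat

# Show which datanodes actually hold each block
hdfs fsck /user/hduser/input.dat -files -blocks -locations

# Confirm every datanode is registered and has free capacity
hdfs dfsadmin -report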
Make sure your node is joining the ResourceManager. Look into the NodeManager log on the problem node and see if there are errors. Look into the ResourceManager web UI (port 8088 by default) and make sure the node is listed there.
Make sure the node is bringing enough resources to the pool to be able to run a job. Check yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb in yarn-site.xml on the node. The memory should be more than the minimum memory requested by a container (see yarn.scheduler.minimum-allocation-mb).
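A quick way to check both points from the command line (the configuration path is illustrative):

# List every node the ResourceManager knows about, including unhealthy ones
yarn node -list -all

# On the problem node, inspect the resources it advertises to YARN
grep -E -A1 'resource.memory-mb|resource.cpu-vcores|minimum-allocation-mb' \
    $HADOOP_CONF_DIR/yarn-site.xml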

How to explicitly define which datanodes store a particular file in HDFS?

I want to write a script, or something like an .xml file, which explicitly defines which datanodes in the Hadoop cluster store the blocks of a particular file.
for example:
Suppose there are 4 slave nodes and 1 master node (5 nodes in the Hadoop cluster in total).
There are two files, file01 (size = 120 MB) and file02 (size = 160 MB). The default block size is 64 MB.
Now I want to store one of the two blocks of file01 on slave node1 and the other on slave node2.
Similarly, one of the three blocks of file02 on slave node1, the second on slave node3 and the third on slave node4.
So, my question is: how can I do this?
Actually there is one method: change the conf/slaves file every time before storing a file.
But I don't want to do this.
So, is there another solution?
I hope I made my point clear.
Waiting for your kind response..!!!
There is no method to achieve what you are asking here. The NameNode replicates blocks to datanodes based upon rack configuration, replication factor and node availability, so even if you do manage to get a block onto two particular datanodes, if one of those nodes goes down the NameNode will replicate the block to another node.
Your requirement is also assuming a replication factor of 1, which doesn't give you any data redundancy (which is a bad thing if you lose a data node).
Let the NameNode manage block assignments and use the balancer periodically if you want to keep your cluster evenly distributed.
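If you do want to keep things spread out, a simple approach is to run the balancer on a schedule. A sketch (the installation path, time and threshold are illustrative):

# crontab entry: run the HDFS balancer every night at 02:00, keeping each
# datanode's utilisation within 10% of the cluster average
0 2 * * * /usr/local/hadoop/bin/start-balancer.sh -threshold 10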
The NameNode is the ultimate authority on block placement.
There is a JIRA issue about making this algorithm pluggable:
https://issues.apache.org/jira/browse/HDFS-385
but unfortunately it is in the 0.21 release, which is not production-ready (although it works reasonably well).
I would suggest plugging your algorithm into 0.21 if you are at the research stage, then waiting for 0.23 to become production-ready, or backporting the code to 0.20 if you need it right now.

starting and stopping hadoop daemons/processes in a cluster

I have a Linux cluster with 9 nodes and I have installed Hadoop 1.0.2. I have a GIS program that I am running using multiple slaves. I need to measure the speedup of my program by using, say, 1, 2, 3, 4 ... 8 slave nodes. I use the start-all.sh/stop-all.sh scripts to start/stop my cluster once I make changes in the conf/slaves file by varying the number of slaves.
But I am getting weird errors while doing so, and it feels like I am not using the correct technique to add/remove slave nodes in the cluster.
Any help regarding the ideal "technique to make changes in slaves file and to restart the cluster" will be appreciated.
The problem likely is that you are not allowing Hadoop to gracefully remove the nodes from the system.
What you want to be doing is decommissioning the nodes so that HDFS has time to re-replicate the files elsewhere. The process is essentially to add the nodes you want to remove to an excludes file. Then you run bin/hadoop dfsadmin -refreshNodes, which reads the configuration and refreshes the cluster's view of the nodes.
When adding nodes, and even perhaps when removing nodes, you should think about running the rebalancer. This will spread the data out evenly and will help avoid the performance hit you may otherwise see when new nodes don't have any data.
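A minimal sketch of the decommissioning steps for Hadoop 1.x, assuming dfs.hosts.exclude in hdfs-site.xml already points at an excludes file (the hostname and paths are illustrative):

# Add the node to the excludes file referenced by dfs.hosts.exclude
echo "slave7.example.com" >> /usr/local/hadoop/conf/excludes

# Tell the NameNode to re-read the host lists and start decommissioning
bin/hadoop dfsadmin -refreshNodes

# Wait until the node shows "Decommissioned" before stopping its daemons
bin/hadoop dfsadmin -report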

HDFS replication factor

When I'm uploading a file to HDFS, if I set the replication factor to 1, will the file's blocks reside on one single machine, or will they be distributed to multiple machines across the network?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to Hadoop: The Definitive Guide:
Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes. But the book was published in 2009 and there have been a lot of changes in the Hadoop framework since.
I think it depends on whether the client is also a Hadoop node or not. If the client is a Hadoop node, then all the splits will be on that same node, which doesn't give you any better read/write throughput despite having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each split, so the splits are spread across the nodes in the cluster, and that does give better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the nodes goes down, a couple of splits might be lost, but at least some data can be recovered from the remaining splits.
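To check which way it went on your cluster, you can ask fsck where the blocks ended up. A sketch, assuming the file from the command above landed under /user/ablimit:

# Print, for every block of the file, the datanodes that hold a replica
hdfs fsck /user/ablimit/file.txt -files -blocks -locations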
If you set replication to 1, then the file will be present only on the client node, that is the node from which you are uploading the file (assuming that node runs a datanode).
If your cluster is a single node, then when you upload a file it will be split according to the block size and all blocks remain on that single machine.
If your cluster is multi-node, then when you upload a file it will be split according to the block size and distributed to different datanodes in your cluster via the write pipeline; the NameNode decides where the data is placed in the cluster.
The HDFS replication factor is used to make copies of the data, i.e. if your replication factor is 2 then every block you upload to HDFS will have one extra copy.
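If you later decide you do want extra copies, the replication factor of an existing file can be raised in place. A small sketch (the path is the one used above):

# Raise the replication factor of an already-uploaded file to 2 and wait
# (-w) until the extra replicas have been created
hadoop fs -setrep -w 2 /user/ablimit/file.txt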
If you set the replication factor to 1, each block is kept on only one node, typically the client node you upload from; see http://commandstech.com/replication-factor-in-hadoop/ for more on replication factors.
