Adding a new volume to a pseudo-distributed Hadoop node failing silently - amazon-ec2

I'm attempting to add a new volume to a Hadoop pseudo-distributed node by adding the location of the volume to dfs.name.dir in hdfs-site.xml, and I can see the lock file in this location. But try as I might, when I load files (using Hive) these locations are hardly used, even though the lock files and some sub-folders appear, so Hadoop clearly has access to them. When the main volume comes close to running out of space, I get the following exception:
Failed with exception java.io.IOException: File /tmp/hive-ubuntu/hive_2011-02-24_15-39-15_997_1889807000233475717/-ext-10000/test.csv could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:643)
Any pointers on how to add new volumes to Hadoop? FWIW, I'm using EC2.
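For reference, a quick sanity check from the shell can show whether the DataNode is actually writing blocks into the new location and how much capacity HDFS thinks it has; the mount path below is a hypothetical example:
# Block files should accumulate under the volume's current/ directory if HDFS is using it.
du -sh /vol/new-ebs/hdfs/data/current
# Configured, used, and remaining DFS capacity per DataNode, as the NameNode reports it.
hadoop dfsadmin -report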

There are a few things you can do, according to the FAQ:
1. Manually copy files in HDFS to a new name, delete the old files, then rename the new files to be what they were originally.
2. Increase the replication factor temporarily, setting it back once blocks have balanced out between nodes.
3. Remove the full node, wait for its blocks to replicate to the other nodes, then bring it back up. This doesn't really help because your full node is still full when you bring it back online.
4. Run the rebalancer script on the head node.
I'd try running #4 first, then #2.
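For #4, a sketch of invoking the balancer on a Hadoop 1.x installation (the threshold percentage is just an example; the balancer runs until every DataNode is within that percentage of the cluster-average utilization, or until you stop it):
# Rebalance blocks across DataNodes; safe to interrupt with Ctrl-C.
hadoop balancer -threshold 5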

When adding new disks/capacity to a data node, Hadoop does not guarantee that the disks will be balanced fairly (e.g. it will not put more blocks on drives with more free space). The best way I have found to work around this is to increase the replication factor (e.g. from 2 to 3).
hadoop fs -setrep 3 -R /<path>
Watch the 'under replicated blocks' report on the NameNode. As soon as it reaches 0, decrease the replication factor (e.g. from 3 to 2). This will randomly delete replicas from the system, which should balance out the local node.
hadoop fs -setrep 2 -R /<path>
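To watch the 'under replicated blocks' figure from the shell (the NameNode web UI shows the same number), one option is fsck; this is just a sketch:
# fsck prints a summary that includes a line such as "Under-replicated blocks: N".
hadoop fsck / | grep -i 'under-replicated'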
It's not going to be 100% balanced, but it should be in a lot better shape than it was before. This is covered in the Hadoop wiki to some extent. If you are running pseudo-distributed and have no other data nodes, the balancer script will not help you.
http://wiki.apache.org/hadoop/FAQ#If_I_add_new_DataNodes_to_the_cluster_will_HDFS_move_the_blocks_to_the_newly_added_nodes_in_order_to_balance_disk_space_utilization_between_the_nodes.3F

Related

Deleting HDFS Block Pool

I am running Spark on a Hadoop cluster. I tried running a Spark job and noticed some issues; by looking at the logs of the data node, I eventually realised that the file system of one of the datanodes is full.
I used hdfs dfsadmin -report to identify this. The DFS Remaining figure is 0 B because Non DFS Used is massive (155 GB of 193 GB configured capacity).
When I looked at the file system on this data node, I could see that most of this comes from the /usr/local/hadoop_work/ directory. There are three block pools there, and one of them is very large (98 GB). The other data node in the cluster has only one block pool.
What I am wondering is: can I simply delete two of these block pools? I'm assuming (but don't know enough about this) that the namenode (I have only one) will be looking at the most recent block pool, which is smaller in size and corresponds to the one on the other data node.
As outlined in the comment above, eventually I did just delete the two block pools. I did this based on the fact that these block pool IDs didn't exist on the other data node, and by looking through the local filesystem I could see that the files under these IDs hadn't been updated for a while.
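A sketch of the kind of checks described above, assuming the usual dfs.datanode.data.dir/current/BP-* layout under the directory from the question (the BP-OLD-POOL-ID name is hypothetical):
# List the block pools on this DataNode and their sizes.
ls /usr/local/hadoop_work/current/
du -sh /usr/local/hadoop_work/current/BP-*
# Spot-check how recently files under a suspect pool were modified.
find /usr/local/hadoop_work/current/BP-OLD-POOL-ID -type f -printf '%TY-%Tm-%Td %p\n' | sort | tail
# Capacity and usage as the NameNode reports it.
hdfs dfsadmin -report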

Copying a large file (~6 GB) from S3 to every node of an Elastic MapReduce cluster

It turns out that copying a large file (~6 GB) from S3 to every node in an Elastic MapReduce cluster in a bootstrap action doesn't scale well; the pipe is only so big, and downloads to the nodes get throttled as the number of nodes grows.
I'm running a job flow with 22 steps, and this file is needed by maybe 8 of them. Sure, I can copy from S3 to HDFS and cache the file before every step, but that's a major speed kill (and can affect scalability). Ideally, the job flow would start with the file on every node.
There are StackOverflow questions at least obliquely addressing persisting a cached file through a job flow:
Re-use files in Hadoop Distributed cache,
Life of distributed cache in Hadoop.
I don't think they help me. Anyone have some fresh ideas?
Two ideas; please consider your case specifics and disregard at will:
Share the file over NFS from a server whose instance type has good enough networking, in the same placement group or AZ (a bootstrap-action sketch follows below).
Use EBS PIOPS volumes and EBS-optimized instances with the file pre-loaded, and just attach the volumes to your nodes in a bootstrap action.
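For the first idea, a bootstrap action along these lines might work; the server address, export path, and file name are hypothetical, and it assumes an NFS client is installed on the AMI:
#!/bin/bash
# Mount a shared NFS export on every node so the large file is served from one
# well-provisioned instance instead of being downloaded from S3 by each node.
sudo mkdir -p /mnt/shared
sudo mount -t nfs nfs-server.internal:/export/shared /mnt/shared
ls -lh /mnt/shared/big-reference-file.dat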

why isn't hadoop distributing a file to all nodes?

I set up a 4-node Hadoop cluster according to the walk-through at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/. I used a replication factor of 1 (the cluster is just for testing).
I copied a 2 GB file from the local filesystem. When browsing the file in the HTTP interface, I see it was split into 31 blocks, but all of them are on one node (the master).
Is this correct? How can I investigate the reason?
They are all on one node because Hadoop will, by default, write to the local node first. I'm going to guess you were using the Hadoop client from that node. Since you have a replication factor of one, the data is only going to be on that node.
Since you are just playing around, you might want to force the data to spread out. To do this, you can run the rebalancer with hadoop balancer. Just Ctrl-C it after a few minutes.
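To verify where the blocks actually ended up after balancing, fsck can list the DataNode locations per block (the file path below is a hypothetical example):
# Shows every block of the file and the nodes holding a replica of it.
hadoop fsck /user/hduser/big-2gb.file -files -blocks -locations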

When I store files in HDFS, will they be replicated?

I am new to Hadoop.
When I store Excel files using the hadoop fs -put command, they are stored in HDFS.
Replication factor is 3.
My question is: Does it take 3 copies and store them into 3 nodes each?
Here is a comic that explains how HDFS works:
https://docs.google.com/file/d/0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1/edit?pli=1
Does it take 3 copies and store them into 3 nodes each?
The answer is: no.
Replication is done by pipelining:
that is, the client writes a portion of the file to datanode1, then datanode1 copies it to datanode2, and datanode2 copies it to datanode3.
See Replication Pipelining in the HDFS design doc:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Replication+Pipelining
Your HDFS client (hadoop fs in this case) is given, by the NameNode, the block names and the DataNode locations where each block should be stored (the first being the closest location, if the NameNode can determine this from the rack-awareness script).
The client then copies the blocks to the closest data node. That data node is responsible for copying the block to a second datanode (preferably on another rack), and finally the second copies it to the third (typically on the same rack as the second).
So your client will only copy data to one of the data nodes, and the framework will take care of the replication between datanodes.
It will store the original file as one block (or more, in the case of large files). These blocks will be replicated to two other nodes.
Edit: My answer applies to Hadoop 2.2.0. I have no experience with prior versions.
Yes, it will be replicated on 3 nodes (at most 3 nodes).
The Hadoop client is going to break the data file into smaller "blocks" and place those blocks on different machines throughout the cluster. The more blocks you have, the more machines will be able to work on the data in parallel. At the same time, these machines may be prone to failure, so it is safer to ensure that every block of data is on multiple machines at once to avoid data loss.
So each block will be replicated in the cluster as it is loaded. The standard setting for Hadoop is to keep three copies of each block in the cluster. This can be configured with the dfs.replication parameter in hdfs-site.xml.
Replicating data is not a drawback of Hadoop at all; in fact, it is an integral part of what makes Hadoop effective. Not only does it provide a good degree of fault tolerance, it also helps run map tasks close to the data to avoid putting extra load on the network (read about data locality).
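A quick way to check the replication factor actually applied to a stored file (a sketch; the path is a hypothetical example):
# The second column of -ls output is the replication factor; -stat %r prints it directly.
hadoop fs -ls /user/hduser/data.xls
hadoop fs -stat %r /user/hduser/data.xls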
Yes, it makes n copies in HDFS (n being the replication factor).
Use this command to find the locations of a file's blocks, the racks they are stored on, and the block names:
hadoop fsck /path/to/your/directory -files -blocks -locations -racks
Use this command to load data into HDFS with a specific replication factor:
hadoop fs -Ddfs.replication=1 -put big.file /tmp/test1.file
With -Ddfs.replication=1 you can define the number of replica copies that will be created while loading data into HDFS.

Transferring whole HDFS from one Cluster to another

I have lots of Hive tables stored in HDFS on a test cluster with 5 nodes. The data should be around 70 GB * 3 (replication). Now I want to transfer the whole setup to a different environment with many more nodes. A network connection between the two clusters is not possible.
The thing is that I don't have much time with the new cluster, and also no possibility to test the transfer with another test environment. Therefore I need a solid plan. :)
What options do I have?
How can I transfer the Hive setup with a minimum of configuration effort on the new cluster?
Is it possible to just copy the HDFS directories of the 5 nodes to 5 nodes of the new cluster, then add the rest of the nodes to the new cluster and start the balancer?
Without a network connection, it will be tricky!
I would
Copy the files out of HDFS onto some kind of removable storage (USB stick, external HDD, etc.)
Move the storage to the new cluster
Copy the files back into HDFS
Note that this won't preserve metadata like file creation/last access time, and, more importantly, ownership and permissions.
Small-scale testing of this process should be pretty simple.
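A sketch of the removable-storage route using the standard shell commands (the mount point and paths are hypothetical):
# On the old cluster: pull the data out of HDFS onto the removable disk.
hadoop fs -copyToLocal /user/hive/warehouse /media/usb-disk/hdfs-export
# Physically move the disk, then on the new cluster: push it back into HDFS.
hadoop fs -copyFromLocal /media/usb-disk/hdfs-export/warehouse /user/hive/warehouse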
If you can get network connectivity between the two clusters (even temporarily), then distcp would be the way to go. It uses MapReduce to parallelise the transfers, potentially resulting in massive time savings.
You can copy directories and files from one cluster to another using the hadoop distcp command.
Here is a small example that describes its usage:
http://souravgulati.webs.com/apps/forums/topics/show/8534378-hadoop-copy-files-from-one-hadoop-cluster-to-other-hadoop-cluster
You can copy data by using this command:
sudo -u hdfs hadoop --config {PathtotheVpcCluster}/vpcCluster distcp hdfs://SourceIP:8020/user/hdfs/WholeData hdfs://DestinationIP:8020/user/hdfs/WholeData
