Does Hadoop distcp copy replicas?

If I use distcp to copy data between 2 clusters, does it copy all replicas, or does it copy just one replica of the data and replicate it on the new cluster?
Say, for example, I try to copy 3 GB of data from a cluster with a replication factor (RF) of 3. Will distcp copy the full 3 GB, or does it know that since the RF is 3 it only needs to move 1 GB (one copy) of data, and the destination cluster then looks at its own RF and replicates the data accordingly?

Only the raw data size matters. If the raw data is 1 GB, it takes up 3 × 1 GB of disk space with a replication factor of 3. When copying data from one cluster to another, only the raw 1 GB of data is transferred to the destination cluster.
HDFS handles the replication of blocks internally. It will notice the new data on the cluster and replicate any blocks that are under-replicated, i.e. have fewer replicas than the RF.

When you copy with distcp, only the actual data (that is, one copy of the data) is transferred. The replication is then handled by the framework, just like when fresh data is written to HDFS. In addition, for distcp runs between 2 clusters, you can also specify whether you want to preserve the replication factor of the source.
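For example, something along these lines should work (the NameNode host names, ports and paths here are just placeholders for your own clusters); the -p option with the r flag tells distcp to preserve the replication factor of the source files:
hadoop distcp -pr hdfs://source-nn:8020/user/data hdfs://dest-nn:8020/user/data
If -pr is not given, the copied files typically end up with whatever default dfs.replication the destination cluster is configured with.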
For more information:
https://hadoop.apache.org/docs/stable1/distcp.html

Related

Data Storage in Hadoop cluster

This is a question from a Hadoop book, and the answer I thought was 200 TB, but that is not correct. Can anyone explain?
Assume that there are 50 nodes in your Hadoop cluster with a total of 200 TB (4 TB per node) of raw disk space allocated to HDFS storage. Assuming Hadoop's default configuration, how much data will you be able to store?
HDFS has the default replication factor set to 3, so each block of your data has 3 copies in HDFS unless a different factor is specified explicitly when the file is created.
Under the default HDFS configuration, you can therefore only store about 200/3 ≈ 66.7 TB of actual data.
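If you want to check the actual numbers on a live cluster, hdfs dfsadmin -report (or hadoop dfsadmin -report on older 1.x installs) prints the configured capacity, DFS used and DFS remaining per DataNode and for the cluster as a whole, so you can compare the 200/3 estimate with reality:
hdfs dfsadmin -report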

When and who exactly creates the input splits for MapReduce in Hadoop?

When I copy a data file to HDFS using the -copyFromLocal command, the data gets copied into HDFS. When I view this file through the web browser, it shows that the replication factor is 3 and the file is at location "/user/hduser/inputData/TestData.txt" with a size of 250 MB.
I have 3 CentOS servers as DataNodes and a CentOS desktop as the NameNode and client.
When I copy from local to the above-mentioned path, where exactly does it get copied to?
Is it copied to the NameNode or to the DataNodes as blocks of 64 MB?
Or does it not replicate until I run a MapReduce job, when the map prepares splits and replicates the data to the DataNodes?
Please clarify my queries.
1. When I copy from local to the above-mentioned path, where exactly does it get copied to? Ans: The data gets copied to HDFS, the Hadoop Distributed File System, which consists of DataNodes and a NameNode. The data you copy resides on the DataNodes as blocks (of up to 64 MB each), and the information about which blocks reside on which DataNodes, along with their replicas, is stored on the NameNode.
2. Is it copied to the NameNode or to the DataNodes as splits of 64 MB? Ans: Your file is stored on the DataNodes as blocks of 64 MB, and the location and order of those blocks are stored on the NameNode.
3. It won't replicate until I run a MapReduce job, and the map prepares splits and replicates to the DataNodes. Ans: This is not true. As soon as the data is copied into HDFS, the filesystem replicates it according to the configured replication factor, irrespective of the process used to copy the data.
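If you want to see this for yourself after the -copyFromLocal, fsck lists every block of the file together with the DataNodes holding its replicas (using the path from the question):
hadoop fsck /user/hduser/inputData/TestData.txt -files -blocks -locations
For a 250 MB file with a 64 MB block size you should see 4 blocks, each reported with 3 locations.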

Does HDFS need 3 times the data space?

I was taking a course on Hadoop and MapReduce on Udacity.com, and the instructor mentioned that in HDFS, to reduce the points of failure, each block is replicated 3 times. Is that really true? Does it mean that if I have 1 petabyte of logs I will need 3 petabytes of storage? Because that will cost me more.
Yes, it is true: HDFS requires space for each redundant copy, and it needs those copies to achieve fault tolerance and data locality during processing.
But this is not necessarily true of MapReduce, which can also run on other file systems like S3 or Azure blobs, for instance. It is HDFS that requires the 3 copies.
By default, the HDFS configuration parameter dfs.replication is set to 3. That provides fault tolerance, availability, etc. (all HDFS parameters are documented in hdfs-default.xml).
But at install time you can set the parameter to 1, and HDFS won't make replicas of your data. With dfs.replication=1, 1 petabyte of data is stored in roughly 1 petabyte of space.
Yes, that's true. Say you have 4 machines running DataNodes: by default, each block will also be replicated to two other machines chosen more or less at random. If you don't want that, you can switch it to 1 by setting the dfs.replication property in hdfs-site.xml.
This is because HDFS replicates data when you store it. The default replication factor for HDFS is 3, which you can find in the hdfs-site.xml file under the dfs.replication property. You can set this value to 1 or 5 as per your requirements.
Data replication is very useful: if a particular node goes down, you will have a copy of the data available on another node (or nodes) for processing.
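As a rough sketch, the hdfs-site.xml entry looks like this (pick whatever value you need; 1 disables replication entirely):
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Note that this only affects files written afterwards; files that already exist keep the factor they were created with unless you change it explicitly (for example with hadoop fs -setrep).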

What happens when the data to be inserted into HDFS is larger than the capacity of the DataNodes?

I know data uploaded into HDFS is replicated across the DataNodes in a Hadoop cluster as blocks. My question is: what happens when the combined capacity of all the DataNodes in the cluster is insufficient? E.g. I have 3 DataNodes, each with 10 GB of data capacity (30 GB altogether), and I want to insert 60 GB of data into HDFS on that cluster. I don't see how the 60 GB of data can be split into blocks (~64 MB typically) and accommodated by the DataNodes.
Thanks
I haven't tested it, but it should fail with an out-of-storage message. As each block of data is written into HDFS, it goes through the replication process. Your upload would get about halfway through and then die.
That being said, you could gzip the data (high compression) before the upload and potentially squeeze it in, depending on how compressible the data is.
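A rough sketch of that approach (file names are placeholders; note that a plain .gz file is not splittable for MapReduce, so this mainly helps with storage):
gzip -9 big_dataset.log
hadoop fs -put big_dataset.log.gz /data/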
I ran into this issue when I was trying to move a large file from the local filesystem to HDFS. It got stuck in the middle, returned a Java out-of-space error, cancelled the move/copy command, and deleted all the blocks of the file that had already been copied to HDFS.
So that means we can't copy a single file larger than the HDFS capacity of the cluster.

When I store files in HDFS, will they be replicated?

I am new to Hadoop.
When I store Excel files using the hadoop fs -put command, they are stored in HDFS.
The replication factor is 3.
My question is: does it make 3 copies and store them on 3 separate nodes?
Here is a comic that explains how HDFS works:
https://docs.google.com/file/d/0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1/edit?pli=1
Does it make 3 copies and store them on 3 separate nodes?
The answer is: no, the client itself does not write 3 copies.
Replication is done via pipelining:
that is, the client writes a portion of the file to the first DataNode, the first DataNode copies it to the second DataNode, and the second copies it to the third.
See Replication Pipelining here:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Replication+Pipelining
Your HDFS client (hadoop fs in this case) is told by the NameNode which block names to use and which DataNode locations to store them in (the first being the closest location, if the NameNode can determine this from the rack-awareness script).
The client then copies each block to the closest DataNode. That DataNode is then responsible for copying the block to a second DataNode (preferably on another rack), and finally the second copies it to the third (on the same rack as the second).
So your client only copies data to one of the DataNodes, and the framework takes care of the replication between DataNodes.
It will store the original file as one block (or more, in the case of large files). Each of these blocks will be replicated to two other nodes.
Edit: My answer applies to Hadoop 2.2.0. I have no experience with prior versions.
Yes, it will be replicated across 3 nodes (at most 3 nodes).
The Hadoop client breaks the data file into smaller “blocks” and places those blocks on different machines throughout the cluster. The more blocks you have, the more machines will be able to work on this data in parallel. At the same time, these machines may be prone to failure, so it is safer to ensure that every block of data is on multiple machines at once to avoid data loss.
So each block is replicated in the cluster as it is loaded. The standard setting for Hadoop is to keep 3 copies of each block in the cluster. This can be configured with the dfs.replication parameter in the hdfs-site.xml file.
And replicating data is not a drawback of Hadoop at all; in fact, it is an integral part of what makes Hadoop effective. Not only does it provide a good degree of fault tolerance, it also helps your map tasks run close to the data to avoid putting extra load on the network (read about data locality).
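If you later want a different number of copies for a file that is already in HDFS, you can change it afterwards; as a sketch (the path is a placeholder; -w waits until the blocks have actually reached the new factor):
hadoop fs -setrep -w 2 /path/to/your/file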
Yes, it makes n copies (where n is the replication factor) in HDFS.
Use this command to find out the location of a file, which racks it is stored on, and the block names on each rack:
hadoop fsck /path/to/your/directory -files -blocks -locations -racks
Use this command to load data into HDFS with a specific replication factor:
hadoop fs -Ddfs.replication=1 -put big.file /tmp/test1.file
With -Ddfs.replication=1 you can define how many replica copies will be created while loading the data into HDFS.
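To double-check, the second column of the hadoop fs -ls output shows the replication factor of the file you just loaded:
hadoop fs -ls /tmp/test1.file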
