When I store files in HDFS, will they be replicated?

I am new to Hadoop.
When I store Excel files using the hadoop fs -put command, they are stored in HDFS.
The replication factor is 3.
My question is: does HDFS make 3 copies of the file and store each copy on a different node?

Here is a comic showing how HDFS works.
https://docs.google.com/file/d/0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1/edit?pli=1

Does it take 3 copies and store them on 3 nodes each?
The answer is: no.
Replication is done by pipelining:
some part of the file is copied to datanode1, then from datanode1 to datanode2, and from datanode2 to datanode3.
See Replication Pipelining:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Replication+Pipelining
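The pipeline can be sketched in Python (a toy illustration, not Hadoop's actual code; the DataNode class, send function, and packet names are made up):

```python
class DataNode:
    """Toy stand-in for an HDFS datanode that just stores packets."""
    def __init__(self, name):
        self.name = name
        self.blocks = []

def send(packet, node, downstream):
    # The receiving datanode stores the packet, then forwards it to the
    # next node in the pipeline -- the client never fans out itself.
    node.blocks.append(packet)
    if downstream:
        send(packet, downstream[0], downstream[1:])

dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
# The client contacts only the first datanode in the pipeline.
send("block_0", dn1, [dn2, dn3])
print([len(dn.blocks) for dn in (dn1, dn2, dn3)])  # [1, 1, 1]
```

The point of the sketch: one send from the client still ends with the packet on all three nodes, because each datanode forwards it downstream.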

Your HDFS Client (hadoop fs in this case) will be given the block names and datanode locations (the first being the closest location if the NameNode can determine this from the rack awareness script) of where to store these files by the NameNode.
The client then copies each block to the closest datanode. That datanode is then responsible for copying the block to a second datanode (preferably on another rack), and finally the second copies it to a third (on the same rack as the second).
So your client will only copy data to one of the data nodes, and the framework will take care of the replication between datanodes.

It will store the original file as one block (or more, in the case of large files). These blocks will be replicated to two other nodes.
Edit: My answer applies to Hadoop 2.2.0. I have no experience with prior versions.

Yes, it will be replicated across 3 nodes (at most 3 nodes).
The Hadoop client breaks the data file into smaller blocks and places those blocks on different machines throughout the cluster. The more blocks you have, the more machines can work on this data in parallel. At the same time, these machines may be prone to failure, so it is safer to ensure that every block of data is on multiple machines at once to avoid data loss.
So each block will be replicated in the cluster as it is loaded. The standard setting for Hadoop is to keep (3) copies of each block in the cluster. This can be configured with the dfs.replication parameter in the file hdfs-site.xml.
And replicating data is not a drawback of Hadoop at all, in fact it is an integral part of what makes Hadoop effective. Not only does it provide you with a good degree of fault tolerance, but it also helps in running your map tasks close to the data to avoid putting extra load on the network (read about data locality).
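As a sketch, the cluster-wide default lives in hdfs-site.xml (the property name is real; the value shown is just the default):

```xml
<!-- hdfs-site.xml: default number of replicas kept for each block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```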

Yes, it makes n copies (n = replication factor) in HDFS.
Use this command to find the locations of a file: which rack each block is stored on, and the block names on all racks:
hadoop fsck /path/to/your/directory -files -blocks -locations -racks

Use this command to load data into hdfs with replication
hadoop fs -Ddfs.replication=1 -put big.file /tmp/test1.file
With -Ddfs.replication you can define how many replicas will be created while loading data into HDFS.

Related

When does file from local system is moved to HDFS

I am new to Hadoop, so please excuse me if my questions are trivial.
Is the local file system different from HDFS?
While creating a MapReduce program, we give the input file path using the FileInputFormat.addInputPath() function. Does it split that data across multiple datanodes and also perform input splits? If yes, how long will this data stay in the datanodes? And can we write a MapReduce program against data that already exists in HDFS?
1: HDFS is a solution for distributed storage; purely local storage runs into storage ceilings and backup problems. HDFS treats the storage resources of a server cluster as a whole: the NameNode manages the directory tree and block information, while the DataNodes act as the block storage containers. HDFS can be regarded as a higher-level abstraction over local storage that solves the core problems of distributed storage.
2: If we use Hadoop's FileInputFormat, it first calls open() on the filesystem, connects to the NameNode to get the block locations, and returns those locations to the client. It then creates an FSDataInputStream to read from the different nodes one by one, and at the end closes the FSDataInputStream.
If we put data into HDFS, the client splits the data into multiple blocks and stores them on different machines (whenever the file is bigger than the block size: 128 MB by default in Hadoop 2, 64 MB in Hadoop 1).
The data is persisted on the datanodes' hard disks.
So if your file is far too big for a single common server and you need distributed computing, you can use HDFS.
HDFS is not your local filesystem - it is a distributed file system. This means your dataset can be larger than the maximum storage capacity of a single machine in your cluster. HDFS by default uses a block size of 64 MB. Each block is replicated to 3 nodes in the cluster (by default) to provide redundancy against failures such as a node going down. So with HDFS, you can think of your entire cluster as one large file system.
When you write a MapReduce program and set your input path, it will try to locate that path on HDFS. The input is then automatically divided up into what are known as input splits - fixed-size partitions containing multiple records from your input file. A Mapper is created for each of these splits. Next, the map function (which you define) is applied to each record within each split, and the output generated is stored in the local filesystem of the node where the map function ran. The Reducer then copies this output file to its node and applies the reduce function. In the case of a runtime error when executing map, the failed task is rerun on another node, and the reducer copies that output instead.
The reducers use the outputs generated from all the mapper tasks, so by this point, the reducers are not concerned with the input splits that were fed to the mappers.
Grouping answers as per the questions:
HDFS vs local filesystem
Yes, HDFS and local file system are different. HDFS is a Java-based file system that is a layer above a native filesystem (like ext3). It is designed to be distributed, scalable and fault-tolerant.
How long do data nodes keep data?
When data is ingested into HDFS, it is split into blocks, replicated 3 times (by default) and distributed throughout the cluster data nodes. This process is all done automatically. This data will stay in the data nodes till it is deleted and finally purged from trash.
InputSplit calculation
FileInputFormat.addInputPath() specifies the HDFS file or directory from which files should be read and sent to mappers for processing. Before this point is reached, the data should already be available in HDFS, since it is now attempting to be processed. So the data files themselves have been split into blocks and replicated throughout the data nodes. The mapping of files, their blocks and which nodes they reside on - this is maintained by a master node called the NameNode.
Now, based on the input path specified by this API, Hadoop will calculate the number of InputSplits required for processing the file/s. Calculation of InputSplits is done at the start of the job by the MapReduce framework. Each InputSplit then gets processed by a mapper. This all happens automatically when the job runs.
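The split-count calculation can be sketched roughly as follows (a simplified version of what FileInputFormat does; the 10% "slop" that lets the last split absorb a small tail is real, but this is not the exact Hadoop code):

```python
def num_input_splits(file_size, split_size, slop=1.1):
    """Simplified FileInputFormat logic: keep carving off full splits
    while the remainder is more than slop * split_size; the final
    split absorbs whatever is left (up to 10% larger than normal)."""
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits

MB = 1024 * 1024
# A 250 MB file with 64 MB splits -> 4 splits (three full + one 58 MB)
print(num_input_splits(250 * MB, 64 * MB))  # 4
# A 70 MB file stays a single split: 70/64 is within the 1.1 slop
print(num_input_splits(70 * MB, 64 * MB))   # 1
```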
MapReduce on existing data
Yes, MapReduce program can run on existing data in HDFS.

Hadoop - data balanced automatically on copying to HDFS?

If I copy a set of files to HDFS in a 7-node Hadoop cluster, will HDFS take care of automatically balancing the data across the 7 nodes? And is there any way I can tell HDFS to constrain/force data onto a particular node in the cluster?
NameNode is 'the' master who decides about where to put data blocks on different nodes in the cluster. In theory, you should not alter this behavior as it is not recommended. If you copy files to hadoop cluster, NameNode will automatically take care of distributing them almost equally on all the DataNodes.
If you want to force change this behaviour (not recommended), these posts could be useful:
How to put files to specific node?
How to explicitly define datanodes to store a particular given file in HDFS?

When and who exactly creates the input splits for MapReduce in Hadoop?

When I copy the data file to HDFS using the -copyFromLocal command, the data gets copied into HDFS. When I view this file through the web browser, it shows that the replication factor is 3 and the file is in location "/user/hduser/inputData/TestData.txt" with a size of 250 MB.
I have 3 CentOS servers as DataNodes, CentOS Desktop as NameNode and client.
When I copy from local to the above mentioned path, where exactly does it get copied to?
Does it get copied to the NameNode, or to DataNodes as blocks of 64 MB?
Or will it not replicate until I run a MapReduce job, where the map prepares splits and replicates the data to DataNodes?
Please clarify my queries.
1. When I copy from local to the above mentioned path, where exactly does it get copied to?
Ans: The data gets copied to HDFS, the Hadoop Distributed File System, which consists of datanodes and a namenode. The data you copy resides on the datanodes as blocks (64 MB, or multiples of 64 MB), and the information about which blocks reside on which datanodes, along with their replicas, is stored on the namenode.
2. Does it get copied to the namenode or to datanodes as blocks of 64 MB?
Ans: Your file will be stored on datanodes as blocks of 64 MB, and the location and order of the blocks is stored on the namenode.
3. Won't it replicate until I run a MapReduce job?
Ans: This is not true. As soon as the data is copied into HDFS, the filesystem replicates it based on the configured replication factor, irrespective of the process used to copy the data.
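For the 250 MB file above, the arithmetic can be sketched like this (assuming the 64 MB block size and default replication factor 3 from the question):

```python
import math

BLOCK_MB = 64     # HDFS block size assumed in the question
REPLICATION = 3   # default dfs.replication

file_mb = 250
blocks = math.ceil(file_mb / BLOCK_MB)   # three full 64 MB blocks + one 58 MB block
replicas_stored = blocks * REPLICATION   # total block copies across the cluster
print(blocks, replicas_stored)  # 4 12
```

So the namenode ends up tracking 4 blocks for this file, with 12 block copies spread across the datanodes.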

Does HDFS needs 3 times the data space?

I was attending a course on Hadoop and MapReduce on Udacity.com and the instructor mentioned that in HDFS, to reduce the points of failure, each block is replicated 3 times across the cluster. Is it really true? Does it mean that if I have 1 petabyte of logs, I will need 3 petabytes of storage? Because that will cost me more.
Yes, it is true; HDFS requires space for each redundant copy, and it needs those copies to achieve fault tolerance and data locality during processing.
But this is not necessarily true about MapReduce, which can run on other file systems like S3 or Azure blobs for instance. It is HDFS that requires the 3 copies.
By default, the HDFS configuration parameter dfs.replication is set to 3. That provides fault tolerance, availability, etc. (all HDFS parameters are documented here).
But at install time you could set the parameter to 1, and then HDFS won't make replicas of your data. With dfs.replication=1, 1 petabyte is stored in the same amount of space.
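The storage overhead is just multiplication (a sketch; real clusters also need working space beyond this, e.g. for intermediate job output):

```python
def raw_storage_pb(logical_pb, replication=3):
    # Raw disk needed across the cluster for the given logical data size
    return logical_pb * replication

print(raw_storage_pb(1))                 # 3 PB with the default factor
print(raw_storage_pb(1, replication=1))  # 1 PB when dfs.replication=1
```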
Yes, that's true. So if you have, say, 4 machines with datanodes running on them, then by default each block will also be replicated on two other machines chosen at random. If you don't want that, you can switch it to 1 by setting the dfs.replication property in hdfs-site.xml.
This is because HDFS replicates data when you store it. The default replication factor for hdfs is 3, which you can find in hdfs-site.xml file under dfs.replication property. You can set this value to 1 or 5 as per your requirement.
Data replication is very useful: if a particular node goes down, you will have a copy of the data available on other nodes for processing.

How does HDFS download files?

If the Hadoop replication factor is set to 3 and I use hadoop dfs -get to download a file, how many datanodes transport data to me simultaneously? Is the download parallel, like RAID, or are the datanodes read one by one sequentially?
The data is read sequentially from only one node.
Note that the file might consist of multiple blocks, in which case the blocks are pulled from different nodes.
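A toy sketch of that read path (the block names and datanode lists are made up; in reality the client picks the closest replica per block, not simply the first one the NameNode lists):

```python
# Hypothetical block -> replica locations, as the NameNode would report them
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn4", "dn5"],
}

def plan_download(locations):
    """Read blocks in order, one replica per block: sequential, not RAID-style."""
    return [(blk, replicas[0]) for blk, replicas in locations.items()]

print(plan_download(block_locations))  # [('blk_1', 'dn1'), ('blk_2', 'dn2')]
```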
