When is a file from the local system moved to HDFS? - hadoop

I am new to Hadoop, so please excuse me if my questions are trivial.
Is the local file system different from HDFS?
While creating a MapReduce program, we set the input file path using the FileInputFormat.addInputPath() function. Does it split that data across multiple data nodes and also perform input splits? If yes, how long will this data stay in the data nodes? And can we write a MapReduce program against existing data in HDFS?

1: HDFS is a solution for distributed storage; purely local storage runs into capacity ceilings and backup problems that HDFS is designed to avoid. HDFS treats the storage of the whole server cluster as a single resource: the NameNode manages the directory tree and the block metadata, while the DataNodes act as the containers that actually store the blocks. You can think of HDFS as a higher-level abstraction over local storage, built to solve the core problems of distributed storage.
2: When we use Hadoop's FileInputFormat, the client first calls open() on the FileSystem and contacts the NameNode to get the block locations, which are returned to the client. It then creates an FSDataInputStream to read from the different nodes one by one, and at the end closes the FSDataInputStream.
If we put data into HDFS, the client splits the data into multiple blocks that are stored on different machines (whenever the file is bigger than the block size, 128 MB by default, 64 MB in older versions).
The data is persisted on the hard disks of those machines.
So if your file is far too big for a single common server and you need distributed computing, you can use HDFS.
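To make the read path in point 2 concrete, here is a minimal sketch using the Hadoop FileSystem API; the path is hypothetical and it assumes your client configuration (core-site.xml / hdfs-site.xml) points at the cluster:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // client handle; asks the NameNode for metadata
        Path path = new Path("/user/hadoop/input/sample.txt");   // hypothetical HDFS path
        try (FSDataInputStream in = fs.open(path);     // open() fetches block locations and returns a stream
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {   // bytes arrive from the DataNodes, block by block
                System.out.println(line);
            }
        }
    }
}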

HDFS is not your local filesystem - it is a distributed file system. This means your dataset can be larger than the maximum storage capacity of a single machine in your cluster. HDFS by default uses a block size of 64 MB (128 MB in newer versions). Each block is replicated across 3 nodes in the cluster by default (the replication factor) to guard against failures such as a node going down. So with HDFS, you can think of your entire cluster as one large file system.
When you write a MapReduce program and set your input path, it will try to locate that path on HDFS. The input is then automatically divided up into what are known as input splits - fixed-size partitions containing multiple records from your input file. A Mapper is created for each of these splits. Next, the map function (which you define) is applied to each record within each split, and the output it generates is stored on the local filesystem of the node where the map function ran. The Reducer then copies this output to its own node and applies the reduce function. If a map task fails with a runtime error, Hadoop reruns the same mapper task on another node and has the reducer copy that output instead.
The reducers use the outputs generated from all the mapper tasks, so by this point the reducers are not concerned with the input splits that were fed to the mappers.
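To illustrate "the map function is applied to each record within each split", here is a minimal word-count-style mapper sketch; the class and variable names are illustrative, not from the question:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One Mapper instance is created per input split; map() runs once per record.
// With TextInputFormat a record is a single line, keyed by its byte offset in the file.
public class LineTokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            // Output is buffered and spilled to the node's local disk,
            // from where the reducers later fetch it.
            context.write(word, ONE);
        }
    }
}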

Grouping answers as per the questions:
HDFS vs local filesystem
Yes, HDFS and local file system are different. HDFS is a Java-based file system that is a layer above a native filesystem (like ext3). It is designed to be distributed, scalable and fault-tolerant.
How long do data nodes keep data?
When data is ingested into HDFS, it is split into blocks, replicated 3 times (by default) and distributed throughout the cluster data nodes. This process is all done automatically. This data will stay in the data nodes till it is deleted and finally purged from trash.
InputSplit calculation
FileInputFormat.addInputPath() specifies the HDFS file or directory from which files should be read and sent to mappers for processing. Before this point is reached, the data should already be available in HDFS, since it is about to be processed. So the data files themselves have already been split into blocks and replicated throughout the data nodes. The mapping of files, their blocks, and which nodes they reside on is maintained by a master node called the NameNode.
Now, based on the input path specified by this API, Hadoop will calculate the number of InputSplits required for processing the file/s. Calculation of InputSplits is done at the start of the job by the MapReduce framework. Each InputSplit then gets processed by a mapper. This all happens automatically when the job runs.
MapReduce on existing data
Yes, a MapReduce program can run on existing data in HDFS.
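As a sketch of what that looks like, here is a minimal driver that points FileInputFormat.addInputPath() at data already sitting in HDFS. It uses Hadoop's bundled TokenCounterMapper and IntSumReducer so it stays self-contained; the input and output paths are hypothetical command-line arguments:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ExistingDataWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count on existing HDFS data");
        job.setJarByClass(ExistingDataWordCount.class);
        job.setMapperClass(TokenCounterMapper.class);   // Hadoop's bundled tokenizing mapper
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Point the job at data that already lives in HDFS; one InputSplit is
        // computed per block (by default) and one mapper runs per split.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}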

Related

How does Hadoop HDFS decide what data to put into each block?

I have been trying to dive into how Hadoop HDFS decides what data to put into one block and can't seem to find any solid answer. We know that Hadoop will automatically distribute data into blocks in HDFS across the cluster, but which data from each file should be put together in a block? Will it just split it arbitrarily? And is this the same for Spark RDDs?
HDFS block behavior
I'll attempt to highlight, by way of example, how block splits differ depending on the file. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864 (~64 MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864 (~64 MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation, since it means that the TaskTracker will have to fetch 16 blocks of data, most of which will not be local to the TaskTracker.
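A back-of-the-envelope sketch of the arithmetic behind those two scenarios, assuming the default behaviour where split size equals block size for the splittable case:
// Back-of-the-envelope arithmetic for the two scenarios above.
public class SplitMath {
    public static void main(String[] args) {
        long fileSize = 1024L * 1024 * 1024;   // 1 GB file
        long blockSize = 64L * 1024 * 1024;    // dfs.block.size = 67108864 (~64 MB)
        long blocks = (fileSize + blockSize - 1) / blockSize;   // 16 blocks either way
        System.out.println("blocks: " + blocks);
        // Splittable file: split size defaults to the block size, so 16 splits -> 16 mappers.
        // Non-splittable gzip: still 16 blocks on disk, but only 1 split -> 1 mapper
        // that has to pull most of those blocks over the network.
        System.out.println("mappers (splittable): " + blocks);
        System.out.println("mappers (gzip): 1");
    }
}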
Spark reading a splittable HDFS file:
sc.textFile doesn't commence any reading. It simply defines a driver-resident data structure which can be used for further processing.
It is not until an action is called on an RDD that Spark will build up a strategy to perform all the required transforms (including the read) and then return the result.
If there is an action called to run the sequence, and your next transformation after the read is to map, then Spark will need to read a small section of lines of the file (according to the partitioning strategy based on the number of cores) and then immediately start to map it until it needs to return a result to the driver, or shuffle before the next sequence of transformations.
If your partitioning strategy (defaultMinPartitions) seems to be swamping the workers because the java representation of your partition (an InputSplit in HDFS terms) is bigger than available executor memory, then you need to specify the number of partitions to read as the second parameter to textFile. You can calculate the ideal number of partitions by dividing your file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:
sc.textFile(file, numPartitions)
.count()
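For reference, a hedged Java sketch of the same check, deriving numPartitions from a hypothetical file size and target partition size as described above:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadCheckSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hdfs-read-check");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            long fileSizeBytes = 10L * 1024 * 1024 * 1024;    // hypothetical 10 GB input
            long targetPartitionBytes = 128L * 1024 * 1024;   // hypothetical 128 MB per partition
            int numPartitions = (int) Math.max(1, fileSizeBytes / targetPartitionBytes);  // 80
            // count() is the action that actually triggers the distributed read.
            long lines = sc.textFile("hdfs:///data/big.txt", numPartitions).count();
            System.out.println("lines: " + lines);
        }
    }
}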

Does uploading a file to HDFS automatically store the files in a distributed manner?

I just started learning Hadoop, and I am little confused regarding how the data is stored in a distributed manner. I have an MPI background. With MPI, we typically have a master processor that sends out data to various other processors. This is done explicitly by the programmer.
With Hadoop, you have a Hadoop Distributed File System (HDFS). So when you put some file from your local server into HDFS, does HDFS automatically store this file in a distributed manner without anything needed to be done by the programmer? The name, HDFS, seems to imply this, but I just wanted to verify.
Yes, it does.
When the file is uploaded, the NameNode coordinates its replication to the DataNodes where it is stored, based on the replication factor (usually 3).
In addition, the NameNode has a job that looks for under-replicated files or blocks and will duplicate them to maintain the replication factor. See HDFS Architecture - Data Replication for more information.
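If you want to confirm the replication factor of an uploaded file from code, here is a small sketch using the Hadoop FileSystem API; the path is hypothetical:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheckSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/hadoop/data/input.csv");   // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(path);
        System.out.println("replication factor: " + status.getReplication());
        System.out.println("block size (bytes): " + status.getBlockSize());
    }
}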

Concept of blocks in Hadoop HDFS

I have some questions regarding the blocks in Hadoop. I read that Hadoop uses HDFS, which creates blocks of a specific size.
First question: Do the blocks physically exist on the hard disk on the normal file system (like NTFS), i.e. can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the Hadoop commands?
Second question: Does Hadoop create the blocks before running the tasks, i.e. do blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Third question: Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?
Fourth question: Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks: one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?
1. Do the blocks physically exist on the hard disk on the normal file system (like NTFS), i.e. can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the Hadoop commands?
Yes. Blocks exist physically. You can use commands like hadoop fsck /path/to/file -files -blocks
Refer to the SE question below for commands to view the blocks:
Viewing the number of blocks for a file in hadoop
2. Does Hadoop create the blocks before running the tasks, i.e. do blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Hadoop = distributed storage (HDFS) + distributed processing (MapReduce & YARN).
A MapReduce job works on input splits, and the input splits are created from the data blocks on the DataNodes. Data blocks are created when a file is written. If you are running a job on existing files, the data blocks already exist before the job, and the InputSplits are created when the job starts. You can think of a data block as a physical entity and an InputSplit as a logical entity. A MapReduce job does not change the input data blocks. The reducer writes its output as new data blocks.
Mappers process the input splits and emit their output to the reducers.
3. Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?
The input is already available as physical HDFS blocks. A MapReduce job works on InputSplits. Blocks and InputSplits may or may not be the same. A block is a physical entity and an InputSplit is a logical entity. Refer to the SE question below for more details:
How does Hadoop perform input splits?
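For reference, a small sketch that mirrors the split-size formula used by the new-API FileInputFormat, where the split size is derived from the block size and the configured minimum/maximum split sizes:
// Mirrors the split-size formula used by the new-API FileInputFormat:
// splitSize = max(minSize, min(maxSize, blockSize)), where min/max come from
// mapreduce.input.fileinputformat.split.minsize / .maxsize.
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // physical HDFS block size
        long minSize = 1L;                     // default minimum split size
        long maxSize = Long.MAX_VALUE;         // default maximum split size
        // With the defaults, the logical split size equals the physical block size,
        // which is why splits and blocks often (but not necessarily) line up.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}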
4. Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks: one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?
Mapper input: the input blocks pre-exist. The map process starts on input blocks/splits, which were stored in HDFS before the job began.
Mapper output: not stored in HDFS; it would not make sense to store intermediate results in HDFS with a replication factor greater than 1.
Reducer output: the reducer output is stored in HDFS. The number of blocks will depend on the size of the reducer output data.
Do the blocks physically exist on the hard disk on the normal file system (like NTFS), i.e. can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the Hadoop commands?
Yes, the blocks exist physically on disk across the datanodes in your cluster. I suppose you could "see" them if you were on one of the datanodes and you really wanted to, but it would likely not be illuminating. It would only be a random 128m (or whatever dfs.block.size is set to in hdfs-site.xml) fragment of the file with no meaningful filename. The hdfs dfs commands enable you to treat HDFS as a "real" filesystem.
Does Hadoop create the blocks before running the tasks, i.e. do blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Hadoop takes care of splitting the file into blocks and distributing them among the datanodes when you put a file in HDFS (through whatever method applies to your situation).
Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?
Not entirely sure what you mean, but the blocks exist before, and irrespective of, any processing you do with them.
Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks: one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?
Again, blocks in HDFS are determined before any processing is done, if any is done at all. HDFS is simply a way to store a large file in a distributed fashion. When you do processing, for example with a MapReduce job, Hadoop will write intermediate results to disk. This is not related to the blocking of the raw file in HDFS.

MapReduce intermediate data output location

You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper’s map method?
A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
The TaskTracker is a daemon responsible for spawning map and reduce workers, and it usually resides on a DataNode. Map and reduce tasks buffer their records in memory until a certain threshold is reached; at that point the records are written to disk in the background (see Memory Management in Hadoop's MapReduce tutorial). Writing to disk once the threshold capacity is reached is also called spilling to disk. The threshold values are given by configurable parameters (e.g. mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent for the map side).
Answer A is off because intermediate data may be written to disk.
Answers B and E can be excluded because spilled intermediate data isn't written to HDFS but to the local filesystem.
Finally, D is wrong because the question asks about the intermediate data emitted from the Mapper’s map method, which is written on the node running the Mapper, not the Reducer. Also, it's not necessary to specify "outside HDFS" because in the Hadoop context the local filesystem is always understood to be non-HDFS.
So, the correct answer is C.
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory whose location can be set in the config by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
I think this is the parameter that has to be modified to change the intermediate data location:
mapreduce.cluster.local.dir
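A small sketch of where those knobs live in the client Configuration; the values below are illustrative only, and the final lookup may print null if the mapred defaults are not on the classpath:
import org.apache.hadoop.conf.Configuration;

public class SpillConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Illustrative values only; tune them for your cluster.
        conf.set("mapreduce.task.io.sort.mb", "256");          // in-memory sort buffer per map task (MB)
        conf.set("mapreduce.map.sort.spill.percent", "0.80");  // buffer fill ratio that triggers a spill
        // Spills land under this directory on the node's local filesystem, not in HDFS
        // (may be null here unless mapred-site.xml / mapred-default.xml is loaded):
        System.out.println(conf.get("mapreduce.cluster.local.dir"));
    }
}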
The mapper output is stored on the local filesystem (not HDFS) of the TaskTracker node. So your answer is option "C".

When I store files in HDFS, will they be replicated?

I am new to Hadoop.
When I store Excel files using the hadoop fs -put command, they are stored in HDFS.
Replication factor is 3.
My question is: does it make 3 copies and store each of them on a different node?
Here is a comic explaining how HDFS works:
https://docs.google.com/file/d/0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1/edit?pli=1
Does it make 3 copies and store each of them on a different node?
The answer is: NO.
Replication is done by pipelining:
that is, the client copies a part of the file to datanode1, datanode1 then copies it to datanode2, and datanode2 copies it to datanode3.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Replication+Pipelining
(see the Replication Pipelining section there)
Your HDFS Client (hadoop fs in this case) will be given the block names and datanode locations (the first being the closest location if the NameNode can determine this from the rack awareness script) of where to store these files by the NameNode.
The client then copies the blocks to the closest DataNode. That DataNode is then responsible for copying the block to a second DataNode (preferably on another rack), and finally the second copies it to a third (on the same rack as the second).
So your client will only copy data to one of the data nodes, and the framework will take care of the replication between datanodes.
It will store the original file as one block (or more, in the case of large files). These blocks will be replicated to two other nodes.
Edit: My answer applies to Hadoop 2.2.0. I have no experience with prior versions.
Yes, it will be replicated on 3 nodes (at most 3 with the default settings).
The Hadoop client is going to break the data file into smaller "blocks" and place those blocks on different machines throughout the cluster. The more blocks you have, the more machines will be able to work on this data in parallel. At the same time, these machines may be prone to failure, so it is safer to ensure that every block of data is on multiple machines at once to avoid data loss.
So each block will be replicated in the cluster as it is loaded. The standard setting for Hadoop is to keep (3) copies of each block in the cluster. This can be configured with the dfs.replication parameter in the file hdfs-site.xml.
And replicating data is not a drawback of Hadoop at all, in fact it is an integral part of what makes Hadoop effective. Not only does it provide you with a good degree of fault tolerance, but it also helps in running your map tasks close to the data to avoid putting extra load on the network (read about data locality).
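Besides the cluster-wide dfs.replication setting, the replication factor can also be changed per file from code; a small sketch (the path is hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/test1.file");   // hypothetical file already in HDFS
        // Ask HDFS to keep only 2 replicas of this particular file's blocks;
        // the cluster-wide default still comes from dfs.replication in hdfs-site.xml.
        boolean accepted = fs.setReplication(path, (short) 2);
        System.out.println("replication change accepted: " + accepted);
    }
}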
Yes, it makes n copies (n = replication factor) in HDFS.
Use this command to find the locations of a file: which racks it is stored on and what the block names are on each rack:
hadoop fsck /path/to/your/directory -files -blocks -locations -racks
Use this command to load data into HDFS with a specific replication factor:
hadoop fs -Ddfs.replication=1 -put big.file /tmp/test1.file
With -Ddfs.replication=1 you can define the number of replica copies that will be created while loading data into HDFS.
