Does Hadoop Distcp copy at block level? - hadoop

DistCp runs between or within clusters are MapReduce jobs. My assumption was that it copies files at the input-split level, which would help copy performance because a file would be copied by multiple mappers working on multiple "pieces" in parallel.
However, when I went through the Hadoop DistCp documentation, it seemed that DistCp only works at the file level.
Please refer to: hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
According to the DistCp doc, DistCp only splits the list of files, not the files themselves, and gives the partitions of the list to the mappers.
Can anyone explain exactly how this works?
Additional question: if a file is assigned to only one mapper, how does the mapper find all the input splits on the node it is running on?

For a single file of ~50 GB, one map task will be triggered to copy the data, since a file is the finest level of granularity in DistCp.
Quoting from the documentation:
Why does DistCp not run faster when more maps are specified?
At present, the smallest unit of work for DistCp is a file. i.e., a file is processed by only one map. Increasing the number of maps to a value exceeding the number of files would yield no performance benefit. The number of maps launched would equal the number of files.
UPDATE
The block locations of the file are obtained from the NameNode when the MapReduce job runs. In DistCp, each mapper is started, if possible, on the node where the first block of the file resides. If the file consists of multiple blocks, the blocks that are not available on that node are fetched from nearby nodes.
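To illustrate the file-level granularity described above, here is a minimal Python sketch. It is not DistCp's actual code: it assumes a plain round-robin assignment of whole files to maps, and the function name assign_files_to_maps and the file path are hypothetical. The real tool uses its own copy-listing and balancing strategies, but the point is the same: the list of files is partitioned, and no file is ever divided between maps.

```python
def assign_files_to_maps(files, requested_maps):
    """Partition a list of (path, size) entries among maps, whole files only."""
    # A file is the smallest unit of work, so maps beyond the file count are useless.
    effective_maps = min(requested_maps, len(files))
    assignments = [[] for _ in range(effective_maps)]
    for index, entry in enumerate(files):
        # Round-robin is an assumption made for this sketch only.
        assignments[index % effective_maps].append(entry)
    return assignments


# One 50 GB file with 20 requested maps still ends up on a single map.
files = [("/data/big.bin", 50 * 1024**3)]
print(len(assign_files_to_maps(files, 20)))  # 1
```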

Related

When is a file from the local system moved to HDFS?

I am new to Hadoop, so please excuse me if my questions are trivial.
Is the local file system different from HDFS?
While creating a MapReduce program, we set the input file path using the FileInputFormat.addInputPath() function. Does it split that data across multiple data nodes and also compute the input splits? If yes, how long will this data stay on the data nodes? And can we write a MapReduce program against data that already exists in HDFS?
1: HDFS is a distributed storage solution; purely local storage runs into capacity ceilings and backup problems. HDFS treats the storage of the whole server cluster as one resource: the NameNode manages the directory tree and block metadata, and the DataNodes are the containers that store the blocks. HDFS can be regarded as a higher-level abstraction over local storage, and is best understood as solving the core problem of distributed storage.
2: When you use Hadoop's FileInputFormat, it first calls open() on the FileSystem, which contacts the NameNode to get the block locations and returns them to the client. It then creates an FSDataInputStream that reads from the different nodes one by one, and closes the FSDataInputStream at the end.
When the client puts data into HDFS, a file bigger than the block size (128 MB, or 64 MB in older versions) is split into multiple blocks that are stored on different machines.
The data is persisted on hard disk.
So if your file is far bigger than a single ordinary server can handle and you need distributed computing, you can use HDFS.
HDFS is not your local filesystem - it is a distributed file system. This means your dataset can be larger than the maximum storage capacity of a single machine in your cluster. The default HDFS block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later. Each block is replicated to 3 nodes by default, so that the data survives failures such as the loss of a node. So with HDFS, you can think of your entire cluster as one large file system.
When you write a MapReduce program and set your input path, it will try to locate that path on HDFS. The input is then automatically divided up into what are known as input splits - fixed-size partitions containing multiple records from your input file. A Mapper is created for each of these splits. Next, the map function (which you define) is applied to each record within each split, and the output generated is stored in the local filesystem of the node where the map function ran. The Reducer then copies this output file to its node and applies the reduce function. If a map task fails with a runtime error, Hadoop reruns the same map task on another node and has the reducer copy that output.
The reducers use the outputs generated from all the mapper tasks, so by this point, the reducers are not concerned with the input splits that were fed to the mappers.
Grouping answers as per the questions:
HDFS vs local filesystem
Yes, HDFS and local file system are different. HDFS is a Java-based file system that is a layer above a native filesystem (like ext3). It is designed to be distributed, scalable and fault-tolerant.
How long do data nodes keep data?
When data is ingested into HDFS, it is split into blocks, replicated 3 times (by default) and distributed throughout the cluster data nodes. This process is all done automatically. This data will stay in the data nodes till it is deleted and finally purged from trash.
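As a rough illustration of that arithmetic, here is a small Python sketch. It assumes a 128 MB block size and the default replication factor of 3; the helper name hdfs_footprint is hypothetical. Note that the last block of a file is not padded, so the raw space used is the file size times the replication factor.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=128 * 1024**2, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    # Integer ceiling division: a partial last block still counts as a block.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    replicas = blocks * replication
    # The last block only occupies as much space as it actually holds.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


# A 1 GB file with the assumed defaults: 8 blocks, 24 replicas, 3 GB of raw storage.
print(hdfs_footprint(1024**3))
```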
InputSplit calculation
FileInputFormat.addInputPath() specifies the HDFS file or directory from which files should be read and sent to mappers for processing. Before this point is reached, the data should already be available in HDFS, since the job is now attempting to process it. So the data files themselves have already been split into blocks and replicated throughout the data nodes. The mapping of files, their blocks and the nodes they reside on is maintained by a master node called the NameNode.
Now, based on the input path specified by this API, Hadoop will calculate the number of InputSplits required for processing the file(s). Calculation of InputSplits is done at the start of the job by the MapReduce framework. Each InputSplit then gets processed by a mapper. This all happens automatically when the job runs.
MapReduce on existing data
Yes, a MapReduce program can run on existing data in HDFS.

Concept of blocks in Hadoop HDFS

I have some questions regarding blocks in Hadoop. I read that Hadoop uses HDFS, which creates blocks of a specific size.
First question: Do the blocks physically exist on the hard disk in the normal file system (such as NTFS), i.e. can we see the blocks on the hosting file system (NTFS), or can they only be seen using the hadoop commands?
Second question: Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Third question: Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class) regardless of the number of splits, or afterwards, depending on the splits?
Fourth question: Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to the data nodes for executing the task?
1. Do the blocks physically exist on the hard disk in the normal file system (such as NTFS), i.e. can we see the blocks on the hosting file system (NTFS), or can they only be seen using the hadoop commands?
Yes. Blocks exist physically. You can use commands like hadoop fsck /path/to/file -files -blocks
Refer to the SE question below for commands to view blocks:
Viewing the number of blocks for a file in hadoop
2. Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Hadoop = Distributed storage (HDFS) + Distributed processing (MapReduce & YARN).
A MapReduce job works on input splits, and the input splits are derived from the data blocks on the DataNodes. Data blocks are created when a file is written. If you are running a job on existing files, the data blocks already exist before the job starts, and the InputSplits are computed when the job is submitted, before the map tasks run. You can think of a data block as a physical entity and an InputSplit as a logical entity. A MapReduce job does not change its input data blocks. The reducers write their output as new data blocks.
Mappers process input splits and emit output to the reducers.
3. Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class) regardless of the number of splits, or afterwards, depending on the splits?
The input is already available as physical DFS blocks. A MapReduce job works on InputSplits. Blocks and InputSplits may or may not coincide. A block is a physical entity and an InputSplit is a logical entity. Refer to the SE question below for more details:
How does Hadoop perform input splits?
4. Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to the data nodes for executing the task?
Mapper input: Input blocks pre-exist. The map phase starts on input blocks/splits that were stored in HDFS before the mapper job began.
Mapper output: Not stored in HDFS; it does not make sense to store intermediate results on HDFS with a replication factor greater than 1.
Reducer output: Reducer output is stored in HDFS. The number of blocks will depend on the size of the reducer output data.
Do the blocks physically exist on the hard disk in the normal file system (such as NTFS), i.e. can we see the blocks on the hosting file system (NTFS), or can they only be seen using the hadoop commands?
Yes, the blocks exist physically on disk across the datanodes in your cluster. I suppose you could "see" them if you were on one of the datanodes and you really wanted to, but it would likely not be illuminating. It would only be a random 128 MB (or whatever dfs.block.size is set to in hdfs-site.xml) fragment of the file with no meaningful filename. The hdfs dfs commands enable you to treat HDFS as a "real" filesystem.
Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Hadoop takes care of splitting the file into blocks and distributing them among the datanodes when you put a file in HDFS (through whatever method applies to your situation).
Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class) regardless of the number of splits, or afterwards, depending on the splits?
Not entirely sure what you mean, but the blocks exist before, and irrespective of, any processing you do with them.
Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to the data nodes for executing the task?
Again, blocks in HDFS are determined before any processing is done, if any is done at all. HDFS is simply a way to store a large file in a distributed fashion. When you do processing, for example with a MapReduce job, Hadoop will write intermediate results to disk. This is not related to the blocking of the raw file in HDFS.

Does input split get copied to JobTracker FileSystem?

As mentioned in Hadoop: The Definitive Guide, during submission of an MR job the input splits are computed and then copied to the JobTracker's file system. However, this does not make sense to me if the data is really huge. The copy would take a lot of time, and if the node running the JobTracker does not have enough space, what would happen to the copy? Please clarify how this processing framework works.
Thanks in advance.
InputSplits are just a logical abstraction of block boundaries; they do not contain the data itself. Generally an InputSplit contains the following information:
Path to the file
Block start position
Number of bytes in the file to process
List of hosts containing the blocks for file being processed
For a given job, it is the responsibility of the JobClient to compute the input split information (which is just an ArrayList of the FileSplit objects described above) by calling the writeSplits method, which internally calls the InputFormat's getSplits method. Once computed, this information is copied to HDFS, from where the JobTracker reads it and schedules the mappers based on data locality.
If you are interested in how the splits themselves are calculated, take a look at the FileInputFormat.getSplits method.
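The idea can be sketched in Python as follows. This is a simplified model of the split calculation, not Hadoop's source: it assumes the split size is max(min_size, min(max_size, block_size)) and that a tail of up to 10% of a split is folded into the last split. The FileSplit class here only mirrors the four fields listed above, and the names compute_splits and block_hosts are hypothetical. It also shows why the copied information is tiny: each split is a few numbers and host names, never file data.

```python
from dataclasses import dataclass
from typing import List

SPLIT_SLOP = 1.1  # assumed tolerance: a small tail joins the last split


@dataclass
class FileSplit:
    path: str         # path to the file
    start: int        # byte offset where the split begins
    length: int       # number of bytes to process
    hosts: List[str]  # nodes holding the block at the start offset


def compute_splits(path, file_length, block_size, block_hosts,
                   min_size=1, max_size=2**63 - 1):
    """Return logical splits for one file; block_hosts[i] lists hosts of block i."""
    split_size = max(min_size, min(max_size, block_size))
    splits = []
    remaining = file_length
    while remaining / split_size > SPLIT_SLOP:
        start = file_length - remaining
        splits.append(FileSplit(path, start, split_size, block_hosts[start // block_size]))
        remaining -= split_size
    if remaining > 0:
        start = file_length - remaining
        splits.append(FileSplit(path, start, remaining, block_hosts[start // block_size]))
    return splits
```

Under these assumptions a 300 MB file with 128 MB blocks yields three splits (128, 128 and 44 MB), while a 140 MB file yields a single 140 MB split even though it occupies two blocks, which is one case where blocks and splits do not coincide.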

Improving performance when you have many small input files using Pig Latin

Currently I'm working with approximately 19 gigabytes of log data, and it is heavily fragmented: the number of input files is 145258 (according to Pig stats).
Between launching the application and the MapReduce job appearing in the web UI, an enormous amount of time (about 3 hours?) is spent on preparation before the MapReduce job starts.
The MapReduce job itself (run through a Pig script) is also pretty slow; it takes about an hour.
The MapReduce logic is not that complex, just something like a group-by operation.
I have 3 datanodes, 1 namenode and 1 secondary namenode.
How can I optimize the configuration to improve MapReduce performance?
You should set pig.maxCombinedSplitSize to a reasonable size and make sure that pig.splitCombination is set to its default true.
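To see why split combination helps, here is a minimal Python sketch of packing many small files into combined splits. It is a plain greedy packing written for illustration, not Pig's actual algorithm, and the function name combine_small_files is hypothetical. Each resulting group stands for one mapper's input.

```python
def combine_small_files(file_sizes, max_combined_size):
    """Greedily pack file sizes (bytes) into groups of at most max_combined_size.

    A file larger than the limit gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        # Close the current group if adding this file would exceed the limit.
        if current and current_size + size > max_combined_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 145258 files of an assumed uniform 140000 bytes, packed into 128 MB groups.
groups = combine_small_files([140_000] * 145_258, 128 * 1024**2)
print(len(groups))  # far fewer map inputs than one per file
```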
Where is your data: on HDFS or on S3? If the data is on S3, you should merge it into larger files once and then run your Pig scripts on the merged files; otherwise it is going to take a long time anyway. S3 returns object lists with pagination, and fetching the list takes a long time. Also, if you have more objects in the bucket and you are not selecting your files with a prefix-only pattern, Hadoop will list all of the objects, because there is no other option in S3.
Try a hadoop fs -ls /path/to/files | wc -l and look at how long that takes to come back - you have two problems:
Discovering the files to process - the above ls will probably take a good number of minutes to complete. Each file then has to be queried for its block size to determine whether it can be split / processed by multiple mappers
Retaining all the information from the above is most probably going to push the JVM limits of your client. You'll probably see a huge amount of GC as it tries to assign, allocate and grow the collection used to store the split information for at least 145k splits.
So, as already suggested, try to combine your files into more sensible file sizes (somewhere near your block size, or a multiple thereof). Maybe you can combine all files for the same hour into a single concatenated file (or per day, depending on your processing use case).
Looks like the problem is more of Hadoop than Pig. You might want to try to combine all the small files into a Hadoop Archive and see if it improves the performance. For details refer to this link
Another approach you can try is run a separate Pig job which periodically UNIONs all the log files into one "big" log file. This should help in reducing the processing time for your main job.

Processing a very small file with Hadoop

I have a question about using Hadoop to process a small file. My file only has about 1,000 records, but I want the records to be distributed roughly evenly among the nodes. Is there a way to do this? I'm new to Hadoop, and so far it seems that all the execution is happening on one node instead of on multiple nodes simultaneously. Let me know if my question makes sense or if I need to clarify anything. Like I said, I'm very new to Hadoop but am hoping to get some clarification. Thanks.
Use the NLineInputFormat and specify the number of records to be processed by each mapper. This way the records in a single block will be processed by multiple mappers.
The other option is to split your one input file into multiple input files (in the one input path directory). Each of those input files will then be able to be spread across HDFS, and the map operations will occur on the worker machines that own those input splits.
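The effect of the first suggestion, a fixed number of records per mapper, can be sketched in plain Python. This only models the grouping, it does not use Hadoop; the function name split_by_lines and the record values are hypothetical.

```python
def split_by_lines(records, lines_per_split):
    """Group records into chunks of at most lines_per_split records.

    Each chunk stands for the input of one mapper.
    """
    if lines_per_split < 1:
        raise ValueError("lines_per_split must be at least 1")
    return [records[i:i + lines_per_split]
            for i in range(0, len(records), lines_per_split)]


# 1,000 records at 100 records per mapper gives 10 map inputs.
records = ["record-%d" % i for i in range(1000)]
chunks = split_by_lines(records, 100)
print(len(chunks))  # 10
```

Choosing the number of records per mapper is a trade-off: smaller chunks spread the work over more nodes, but each extra map task adds startup overhead.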

Resources