Hadoop input split for a compressed block

If I have a 1 GB compressed file in a splittable format, and the default block size and input split size is 128 MB, then 8 blocks and 8 input splits are created. When a compressed block is read by MapReduce it is uncompressed, and say that after uncompression the size of the block becomes 200 MB. But the input split assigned to it is 128 MB, so how is the remaining 72 MB processed?
Is it processed by the next input split?
Is the size of the same input split increased?

Here is my understanding:
Let's assume 1 GB of compressed data = 2 GB of decompressed data,
so you have 16 blocks of data. Bzip2 knows the block boundaries, because a bzip2 file provides a synchronization marker between blocks. So the bzip2 data can be split into 16 splits and sent to 16 mappers. Each mapper gets an amount of decompressed data equal to one input split, i.e. 128 MB.
(Of course, if the data is not an exact multiple of 128 MB, the last mapper will get less data.)
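To see how the framework tells a splittable codec apart from a non-splittable one, here is a small sketch using standard Hadoop classes (CompressionCodecFactory and SplittableCompressionCodec); the file paths are placeholders, and the codec is picked purely from the file extension:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CheckSplittable {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        // Placeholder paths; the codec is chosen from the file extension.
        for (String name : new String[] {"/data/input.bz2", "/data/input.gz"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(name + " -> "
                + (codec == null ? "no codec" : codec.getClass().getSimpleName())
                + ", splittable=" + splittable);
        }
    }
}
```

BZip2Codec implements SplittableCompressionCodec while GzipCodec does not, which is why a .bz2 file can be carved into several input splits but a .gz file cannot.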

I am referring here to compressed files that are splittable, like bzip2. If an input split is created for a 128 MB block of bzip2 data, and during MapReduce processing this is uncompressed to 200 MB, what happens?

Total file size: 1 GB
Block size: 128 MB
Number of splits: 8
Creating a split for each block won’t work since it is impossible to start reading at an arbitrary point in the gzip stream and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way. For this reason, gzip does not support splitting.
MapReduce will not split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting. This will work, but at the expense of locality: a single map will process all 8 HDFS blocks, most of which will not be local to the map.
Have a look at this article, section "Issues about compression and input split".
EDIT: (for splittable compression)
BZip2 is a compression/decompression algorithm which does compression on blocks of data, and these compressed blocks can later be decompressed independently of each other. This is indeed an opportunity: instead of one BZip2-compressed file going to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block should be processed by only one mapper, and ultimately all the blocks of the file should be processed. (By processing we mean the actual utilization of the uncompressed data, coming out of the codec, in a mapper.)
Source: https://issues.apache.org/jira/browse/HADOOP-4012
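As a practical footnote to the EDIT above: if you want the output of your own jobs to remain splittable for downstream jobs, one common option is to write it bzip2-compressed. A minimal driver sketch, assuming the new mapreduce API; the job name and output path are placeholders, and the mapper/reducer setup is omitted:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Bzip2OutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "bzip2-output-demo");
        // Compress the job output with bzip2 so downstream jobs can still split it.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        // ... mapper/reducer classes, input path and job submission omitted
    }
}
```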

Related

Blocks in Mapreduce

I have a very important question because I have to give a presentation about MapReduce.
My question is:
I have read that the file in MapReduce is divided into blocks and every block is replicated on 3 different nodes. A block can be 128 MB. Is this block the input file? I mean, is this 128 MB block split into parts, with every part going to a single map? If yes, into which size is this 128 MB divided?
Or does the file break into blocks, and these blocks are the input for the mappers?
I'm a little bit confused.
Could you look at the picture and tell me which one is right?
Here the HDFS file is divided into blocks, and every single 128 MB block is the input for one map.
Here the HDFS file is one block, and this 128 MB is split so that every part is the input for one map.
Let's say you have a file of 2 GB and you want to place that file in HDFS: then there will be 2 GB / 128 MB = 16 blocks, and these blocks will be distributed across the different DataNodes.
Data splitting happens based on file offsets. The goal of splitting the file and storing it in different blocks is parallel processing and failover of the data.
A split is a logical split of the data, basically used during data processing with a Map/Reduce program or other data-processing techniques in Hadoop. The split size is a user-defined value, and you can choose your own split size based on the volume of data (how much data you are processing).
Splits are basically used to control the number of mappers in a Map/Reduce program. If you have not defined an input split size in your Map/Reduce program, then the default HDFS block size is taken as the input split (i.e. input split = input block, so 16 mappers will be triggered for a 2 GB file). If the split size is defined as, say, 100 MB, then 21 mappers will be triggered (20 mappers for 2000 MB and a 21st mapper for the remaining 48 MB).
Hope this clears your doubt.
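For the 100 MB example above, here is a hedged sketch of how the split size is typically capped in the driver (new mapreduce API; the job name and input path are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Cap splits at ~100 MB: splitSize = max(minSize, min(maxSize, blockSize)).
        // For a 2 GB (2048 MB) input this yields 21 splits, hence 21 mappers.
        FileInputFormat.setMaxInputSplitSize(job, 100L * 1024 * 1024);
        // ... mapper/reducer classes and output path omitted
    }
}
```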
HDFS stores the file as blocks, and each block is 128 MB in size (by default).
MapReduce processes this HDFS file. Each mapper processes one block (input split).
So, to answer your question: 128 MB is a single block size, which will not be further split.
Note: the input split size used in the MapReduce context is a logical split, whereas the block size mentioned for HDFS is a physical split.
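To make the physical-versus-logical distinction concrete: dfs.blocksize governs the physical layout at write time, while the mapreduce.input.fileinputformat.split.* properties only influence the logical splits at read time. A tiny sketch with illustrative values:

```java
import org.apache.hadoop.conf.Configuration;

public class PhysicalVsLogical {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Physical: HDFS block size, fixed when the file is written.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        // Logical: lower bound on the MapReduce input split size, applied at read time.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);
        System.out.println("block=" + conf.get("dfs.blocksize")
            + " split.minsize=" + conf.get("mapreduce.input.fileinputformat.split.minsize"));
    }
}
```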

How does HDFS store a single record which is larger than the block size?

How will Hadoop split the data in case one of my single records is larger than the block size?
E.g. the data (a single record) I am storing is of size 80 MB and the block size is 64 MB, so how does Hadoop manage such a scenario?
If we use a 64 MB block size, then the data will be loaded into only two blocks (64 MB and 16 MB); hence the size of the metadata is decreased.
Edit:
The Hadoop framework divides a large file into blocks (64 MB or 128 MB) and stores them on the slave nodes. HDFS is unaware of the content of the blocks. While writing the data into a block, it may happen that a record crosses the block limit, so that part of the record is written to one block and the rest to another.
The way Hadoop tracks this split of data is through a logical representation of the data known as an input split. When the MapReduce client calculates the input splits, it actually checks whether the entire record resides in the same block or not. If the record overruns the block boundary and part of it is written into another block, the input split captures the location information of the next block and the byte offset of the data needed to complete the record. This usually happens with multi-line records, as Hadoop is intelligent enough to handle the single-line record scenario.
Usually, the input split is configured to be the same as the block size, but consider what happens if the input split is larger than the block size. An input split represents the amount of data that will go to one mapper. Consider the example below:
• Input split = 256 MB
• Block size = 128 MB
Then the mapper will process two blocks that can be on different machines, which means that to process the second block the mapper has to transfer data between machines. Hence, to preserve data locality and avoid unnecessary data movement, we usually keep the input split the same as the block size.
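The boundary rule described above can be illustrated with a small, self-contained toy (plain Java, no Hadoop classes, and deliberately much simpler than the real LineRecordReader): a reader whose split does not start at offset 0 skips its partial first line, and every reader finishes the record it is in even if that means reading past the end of its split, so each record is consumed exactly once.

```java
public class SplitBoundaryDemo {
    // Toy "file": newline-terminated records; a split size of 16 bytes
    // guarantees that some records straddle a split boundary.
    static final String DATA = "record-one\nrecord-two\nrecord-three\n";
    static final int SPLIT = 16;

    public static void main(String[] args) {
        for (int start = 0; start < DATA.length(); start += SPLIT) {
            int end = Math.min(start + SPLIT, DATA.length());
            int pos = start;
            // Rule 1: a split that does not start at offset 0 skips its partial
            // first line; the previous reader is responsible for it.
            if (start != 0) {
                pos = DATA.indexOf('\n', start) + 1;
            }
            System.out.println("split [" + start + ", " + end + "):");
            // Rule 2: keep reading whole records while still inside the split,
            // so the last record may be read past the split boundary.
            while (pos < end && pos < DATA.length()) {
                int nl = DATA.indexOf('\n', pos);
                System.out.println("  record: " + DATA.substring(pos, nl));
                pos = nl + 1;
            }
        }
    }
}
```

Running it shows each record printed by exactly one "split", even though the split boundaries fall in the middle of records.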

Hadoop Mapreduce with compressed/encrypted files (file of large size)

I have an HDFS cluster which stores large CSV files in a compressed/encrypted form, as selected by the end user.
For compression and encryption, I have created a wrapper input stream which feeds data to HDFS in compressed/encrypted form. The compression format used is GZ, the encryption format is AES256.
A 4.4 GB CSV file is compressed to 40 MB on HDFS.
Now I have a MapReduce job (Java) which processes multiple compressed files together. The MR job uses FileInputFormat.
When the splits are calculated, the 4.4 GB compressed file (40 MB) is allocated only 1 mapper, with split start 0 and split length equivalent to 40 MB.
How do I process such a compressed file of larger size? One option I found was to implement a custom RecordReader and use the wrapper input stream to read uncompressed data and process it.
Since I don't have the actual length of the file, I don't know how much data to read from the input stream.
If I read up to the end of the InputStream, then how do I handle the case when 2 mappers are allocated to the same file, as explained below?
If the compressed file size is larger than 64 MB, then 2 mappers will be allocated to the same file.
How do I handle this scenario?
Hadoop version: 2.7.1
The compression format should be decided keeping in mind whether the file will be processed by MapReduce: if the compression format is splittable, then MapReduce works normally.
However, if it is not splittable (in your case gzip is not splittable, and MapReduce will know it), then the entire file is processed by one mapper. This will serve the purpose, but it has data locality issues, as a single mapper performs the whole job and fetches data from other blocks.
From the Hadoop Definitive Guide:
"For large files, you should not use a compression format that does not support splitting on the whole file, because you lose locality and make MapReduce applications very inefficient".
You can refer to the compression section in the Hadoop I/O chapter for more information.
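If you want to be sure a file is never handed to two mappers (the 64 MB concern in the question), one option is an input format that refuses to split. A hedged sketch; the class name is hypothetical, it only shows where isSplitable is overridden, and your decrypting/decompressing wrapper stream would still live in a custom RecordReader:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format: never split a file, so each file
// always goes to exactly one mapper regardless of its size.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```

It would be registered in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class).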

Convert .tar.gz to sequence file after splitting tar.gz

Is it possible to convert 1 .tar.gz file to 1 sequence file using MapReduce?
So far, all the solutions I have come across do this without splitting the tar.gz, or from the local file system.
http://qethanm.cc/projects/forqlift/examples/
Imagine your gzip-compressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks. However, creating a split for each block won't work, since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting.
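Since gzip is not splittable, the whole .tar.gz has to be read sequentially anyway, so a common workaround is a small single-process converter (no MapReduce) that streams the archive once and writes one SequenceFile, which later jobs can split. A hedged sketch, assuming Apache Commons Compress is on the classpath; the class and argument names are placeholders, and each archive entry is buffered in memory, so it only suits entries that fit in the heap:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TarGzToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // args[0] = local .tar.gz, args[1] = output sequence file path
        // (HDFS or local, depending on fs.defaultFS).
        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                 new GZIPInputStream(new BufferedInputStream(new FileInputStream(args[0]))));
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                 SequenceFile.Writer.file(new Path(args[1])),
                 SequenceFile.Writer.keyClass(Text.class),
                 SequenceFile.Writer.valueClass(BytesWritable.class))) {

            TarArchiveEntry entry;
            byte[] buf = new byte[64 * 1024];
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile()) continue;
                // Read the current tar entry fully into memory.
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                int n;
                while ((n = tar.read(buf)) != -1) {
                    bos.write(buf, 0, n);
                }
                // key = file name inside the archive, value = its raw bytes.
                writer.append(new Text(entry.getName()),
                              new BytesWritable(bos.toByteArray()));
            }
        }
    }
}
```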

Why does mapreduce split a compressed file into input splits?

So, from my understanding, when HDFS stores a bzip2-compressed 1 GB file with a block size of 64 MB, the file will be stored as 16 different blocks. If I want to run a MapReduce job on this compressed file, MapReduce tries to split the file again. Why doesn't MapReduce automatically use the 16 blocks in HDFS instead of splitting the file again?
I think I see where your confusion is coming from. I'll attempt to clear it up.
HDFS slices your file up into blocks. These are physical partitions of the file.
MapReduce creates logical splits on top of these blocks. These splits are defined based on a number of parameters, with block boundaries and locations being a huge factor. You can set your minimum split size to 128MB, in which case each split will likely be exactly two 64MB blocks.
All of this is unrelated to your bzip2 compression. If you had used gzip compression, each split would be an entire file, because gzip is not a splittable compression format.
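For reference, the split size FileInputFormat ends up with follows the rule splitSize = max(minSize, min(maxSize, blockSize)); a tiny sketch with the 64 MB block / 128 MB minimum-split numbers from above:

```java
public class ComputeSplitSize {
    // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // HDFS block size: 64 MB
        long minSize   = 128L * 1024 * 1024;  // mapreduce.input.fileinputformat.split.minsize
        long maxSize   = Long.MAX_VALUE;      // mapreduce.input.fileinputformat.split.maxsize (default)
        // Prints 134217728 (128 MB): each split spans two 64 MB blocks.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}
```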
