Convert .tar.gz to sequence file after splitting tar.gz - hadoop

Is it possible to convert one .tar.gz file to one sequence file using MapReduce?
So far, all the solutions I have come across either do this without splitting the tar.gz or work from the local file system, for example:
http://qethanm.cc/projects/forqlift/examples/

Imagine a gzip-compressed file stored in HDFS whose size is 1 GB. With an HDFS block size of
64 MB, the file will be stored as 16 blocks. However, creating a split for each block won’t
work since it is impossible to start reading at an arbitrary point in the gzip stream, and
therefore impossible for a map task to read its split independently of the others. The
gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data
as a series of compressed blocks. The problem is that the start of each block is not
distinguished in any way that would allow a reader positioned at an arbitrary point in
the stream to advance to the beginning of the next block, thereby synchronizing itself
with the stream. For this reason, gzip does not support splitting.
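Since gzip cannot be split, the conversion has to be driven by a single sequential reader per archive (one mapper per .tar.gz, or a plain driver program). Below is a minimal sketch of that idea, assuming Apache Commons Compress is on the classpath; the class name, paths, and the choice of Text keys / BytesWritable values are illustrative, not a fixed recipe. It walks the tar entries and appends each regular file as one record of a sequence file:

    import java.io.BufferedInputStream;
    import java.io.ByteArrayOutputStream;
    import java.util.zip.GZIPInputStream;

    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class TarGzToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path input = new Path(args[0]);   // e.g. /data/archive.tar.gz on HDFS
            Path output = new Path(args[1]);  // e.g. /data/archive.seq

            try (TarArchiveInputStream tar = new TarArchiveInputStream(
                         new GZIPInputStream(new BufferedInputStream(fs.open(input))));
                 SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                         SequenceFile.Writer.file(output),
                         SequenceFile.Writer.keyClass(Text.class),
                         SequenceFile.Writer.valueClass(BytesWritable.class))) {

                TarArchiveEntry entry;
                byte[] buffer = new byte[8192];
                while ((entry = tar.getNextTarEntry()) != null) {
                    if (entry.isDirectory()) {
                        continue;  // only file entries become records
                    }
                    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                    int read;
                    while ((read = tar.read(buffer)) != -1) {  // read() stops at the entry boundary
                        bytes.write(buffer, 0, read);
                    }
                    // key = path inside the archive, value = raw file contents
                    writer.append(new Text(entry.getName()),
                                  new BytesWritable(bytes.toByteArray()));
                }
            }
        }
    }

The resulting sequence file has sync markers between records, so it is splittable and downstream jobs get their parallelism back even though the original archive could not be split.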

Related

HDFS compression at the block level

One of the big issues with HDFS is compression: if you compress a file, you have to deal with splittable compression. Why does HDFS require you to compress an entire file, instead of implementing compression at the HDFS block level?
That would solve the problem: a 64 MB block is read or written in a single chunk, it is big enough to be worth compressing, and it doesn't interfere with operations or require splittable compression.
Are there any implementations of this?
I'm speculating here, but I can see several problems.
HDFS has a feature called local short-circuit reads. This allows the datanode to open the block file, validate security, and then pass the file descriptor to the application running on the same node. This completely bypasses any file transfer via HTTP or other means from HDFS to the M/R app (or to whatever HDFS app is reading the file). On performant clusters short-circuit reads are the norm rather than the exception, since processing occurs where the split is located. What you describe would require the reader to understand the block compression in order to read the block.
Other considerations relate to splits that span blocks. Compressed formats generally lack random access and require sequential access. Reading the last few bytes of a block, to complete a split that spans into the next block, could be as expensive as reading the entire block, because of the compression.
I'm not saying block compression is impossible, but I feel it is more complex than you expect.
Besides, block compression can be transparently delegated to the underlying filesystem.
And, last but not least, better compression formats exist at data layers above HDFS: ORC, Parquet.
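To illustrate that last point, here is a hedged sketch of how compression is usually applied a layer above HDFS, in the job's output format rather than in the filesystem: a block-compressed SequenceFile compresses batches of records yet stays splittable thanks to its sync markers, even with a non-splittable codec such as gzip. The classes and methods below are the standard MapReduce ones; the codec choice is just an example.

    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class OutputCompressionExample {
        public static void configure(Job job) {
            // Compression happens in the file format (above HDFS),
            // not in the HDFS blocks themselves.
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            // BLOCK mode compresses groups of records; the file remains
            // splittable because of the sync markers between record blocks.
            SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        }
    }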

Hadoop input split for a compressed block

If I have a 1 GB compressed file that is splittable, and by default the block size and input split size are 128 MB, then 8 blocks and 8 input splits are created. When a compressed block is read by MapReduce it is uncompressed, and say that after decompression the block becomes 200 MB. But the input split assigned to it is 128 MB, so how is the remaining 72 MB processed?
Is it processed by the next input split?
Or is the input split size increased?
Here is my understanding:
Let's assume 1 GB of compressed data = 2 GB of decompressed data,
so you have 16 blocks of data. Bzip2 knows the block boundaries, because a bzip2 file provides a synchronization marker between blocks. So the data is split into 16 splits and sent to 16 mappers. Each mapper gets a decompressed data size of one input split = 128 MB.
(Of course, if the data is not an exact multiple of 128 MB, the last mapper will get less data.)
I am referring here to compressed files that are splittable, like bzip2. If an input split is created for a 128 MB block of bzip2 data, and during MapReduce processing it is decompressed to 200 MB, what happens?
Total file size : 1 GB
Block size : 128 MB
Number of splits: 8
Creating a split for each block won’t work since it is impossible to start reading at an arbitrary point in the gzip stream and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way. For this reason, gzip does not support splitting.
MapReduce does not split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting. This will work, but at the expense of locality: a single map will process all 8 HDFS blocks, most of which will not be local to the map.
Have a look at this article, in the section 'Issues about compression and input split'.
EDIT: (for splittable decompression)
BZip2 is a compression/decompression algorithm that compresses blocks of data, and these compressed blocks can later be decompressed independently of each other. This is an opportunity: instead of one BZip2-compressed file going to one mapper, chunks of the file can be processed in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block is processed by exactly one mapper and, ultimately, all the blocks of the file are processed. (By processing we mean the actual use, in a mapper, of the uncompressed data coming out of the codec.)
Source: https://issues.apache.org/jira/browse/HADOOP-4012
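As a rough illustration of how the framework makes this decision, the sketch below mirrors the check used by the text input formats: look up the codec from the filename extension and test whether it implements SplittableCompressionCodec (bzip2 does, gzip does not). This is a simplified sketch for clarity, not the framework's exact code; the example paths are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittableCheck {
        // Returns true if MapReduce would consider this file splittable,
        // mirroring the logic in TextInputFormat.isSplitable().
        public static boolean isSplittable(Configuration conf, Path file) {
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
            if (codec == null) {
                return true;  // uncompressed: always splittable
            }
            return codec instanceof SplittableCompressionCodec;  // bzip2 yes, gzip no
        }

        public static void main(String[] args) {
            Configuration conf = new Configuration();
            System.out.println(isSplittable(conf, new Path("/data/logs.bz2")));  // true
            System.out.println(isSplittable(conf, new Path("/data/logs.gz")));   // false
        }
    }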

Hadoop Mapreduce with compressed/encrypted files (file of large size)

I have an HDFS cluster which stores large CSV files in a compressed/encrypted form, as selected by the end user.
For compression and encryption, I have created a wrapper input stream which feeds data to HDFS in compressed/encrypted form. The compression format used is GZ, the encryption format AES256.
A 4.4 GB CSV file is compressed to 40 MB on HDFS.
Now I have a MapReduce job (Java) which processes multiple compressed files together. The MR job uses FileInputFormat.
When splits are calculated, the 4.4 GB compressed file (40 MB) is allocated only 1 mapper, with split start 0 and split length equivalent to 40 MB.
How do I process such a compressed file of larger size? One option I found was to implement a custom RecordReader and use the wrapper input stream to read uncompressed data and process it.
Since I don't have the actual length of the file, I don't know how much data to read from the input stream.
If I read up to the end of the InputStream, then how do I handle the case when 2 mappers are allocated to the same file, as explained below?
If the compressed file size is larger than 64 MB, then 2 mappers will be allocated for the same file.
How do I handle this scenario?
Hadoop version - 2.7.1
The compression format should be chosen keeping in mind whether the file will be processed by MapReduce, because if the compression format is splittable, then MapReduce works normally.
However, if it is not splittable (in your case gzip is not splittable, and MapReduce will know it), then the entire file is processed by a single mapper. This will serve the purpose, but it has data locality issues, as that one mapper performs the whole job and has to fetch data from blocks on other nodes.
From Hadoop definitive guide:
"For large files, you should not use a compression format that does not support splitting on the whole file, because you lose locality and make MapReduce applications very inefficient".
You can refer to the compression section of the Hadoop I/O chapter for more information.
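For the custom wrapper format in the question, one common way to handle the two-mappers-on-one-file case is to make the input format refuse to split, which is what Hadoop does automatically for gzip but cannot do for a wrapper it doesn't recognise. A minimal sketch follows; the class name is hypothetical, and the real RecordReader would wrap the decrypting/decompressing stream and read it to exhaustion.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical whole-file-per-mapper variant of TextInputFormat.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // The wrapped (compressed/encrypted) stream cannot be entered
            // mid-file, so never split: each file goes to exactly one mapper.
            return false;
        }
    }

With splitting disabled, the record reader never has to guess the uncompressed length: it simply reads until the wrapper stream reports end-of-file.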

Why does mapreduce split a compressed file into input splits?

So from my understanding, when hdfs stores a bzip2 compressed 1GB file with a block size of 64 MB, the file will be stored as 16 different blocks. If I want to run a map-reduce job on this compressed file, map reduce tries to split the file again. Why doesn't mapreduce automatically use the 16 blocks in hdfs instead of splitting the file again?
I think I see where your confusion is coming from. I'll attempt to clear it up.
HDFS slices your file up into blocks. These are physical partitions of the file.
MapReduce creates logical splits on top of these blocks. These splits are defined based on a number of parameters, with block boundaries and locations being a huge factor. You can set your minimum split size to 128MB, in which case each split will likely be exactly two 64MB blocks.
All of this is unrelated to your bzip2 compression. If you had used gzip compression, each split would be an entire file, because gzip is not a splittable compression format.
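A small sketch of the knob mentioned above, to make the block-versus-split distinction concrete: the HDFS block size was fixed when the file was written, but the split size is just job configuration. This uses the standard FileInputFormat API; the 128 MB figure is only an example.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void configure(Job job) {
            // Blocks are physical and fixed at write time (64 MB in the example),
            // while splits are logical and computed at job-submission time.
            // A 128 MB minimum split size makes each split cover about two blocks.
            FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
            // Same effect via configuration:
            // mapreduce.input.fileinputformat.split.minsize
        }
    }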

Advantages of Sequence file over hdfs textfile

What is the advantage of a Hadoop Sequence File over an HDFS flat (text) file? In what way is a Sequence file more efficient?
Small files can be combined and written into a sequence file, but the same can be done for an HDFS text file too. I need to know the difference between the two approaches. I have been googling this for a while; it would be helpful to get some clarity on it.
Sequence files are appropriate for situations in which you want to store keys and their corresponding values. With text files you can do that too, but you have to parse each line.
They can be compressed and still be splittable, which means better workload distribution. You can't split a compressed text file unless you use a splittable compression format.
They can be treated as binary files, which makes them more storage efficient. In a text file a double becomes a string of characters, which means a large storage overhead.
Advantages of Hadoop Sequence files (as per Siva's article on the hadooptutorial.info website):
More compact than text files
Support for compression at different levels - record or block
Files can be split and processed in parallel
They can solve the small-files problem in Hadoop, whose main strength is processing large files with MapReduce jobs; a sequence file can be used as a container for a large number of small files
Temporary output of a mapper can be stored in sequence files
Disadvantages:
Sequence files are append-only
Sequence files are also used as intermediate files during the mapper and reducer phases of MapReduce processing. They are compressible and fast to process: a job can write mapper output to a sequence file, and the reducer reads from it.
There are APIs in Hadoop and Spark to read/write sequence files
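For completeness, a minimal sketch of the plain Hadoop read path (on the Spark side, SparkContext.sequenceFile covers the same format). The path is illustrative; the key and value classes are taken from the file header rather than hard-coded.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SequenceFileDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path(args[0]);  // e.g. /data/archive.seq

            try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                    SequenceFile.Reader.file(path))) {
                // Key/value classes are stored in the file header, so the reader
                // can instantiate them without the caller hard-coding types.
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                while (reader.next(key, value)) {
                    System.out.println(key + "\t" + value);
                }
            }
        }
    }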
