How does the file compression format affect my Spark processing? - hadoop

I am confused about splittable and non-splittable file formats in the big data world.
I was using the ZIP format, and I understood that ZIP files are non-splittable in the sense that when I processed such a file I had to use ZipFileInputFormat, which basically unzips it and then processes it.
Then I moved to gzip, and I am able to process it in my Spark job, but I have always wondered why people say the gzip format is also not splittable. How is this going to affect my Spark job's performance?
For example, if I have 5,000 gzip files of different sizes, some 1 KB and some 10 GB, and I load them into Spark, what will happen?
Should I use gzip in my case or some other compression? If yes, then why?
Also, what is the difference in performance between:
CASE 1: I have a very large (10 GB) gzip file, load it into Spark, and run a count on it.
CASE 2: I have a splittable (bzip2) file of the same size, load it into Spark, and run a count on it.

First, you need to remember that neither gzip nor ZIP is splittable. Bzip2 is splittable out of the box, and LZO is splittable once you build an index for it. Snappy on its own is not splittable either; it is a raw compression codec, usually made splittable by using it inside a container format such as SequenceFile, Avro, or Parquet.
For the purpose of this discussion, splittable files are ones that can be processed in parallel across many machines rather than on only one.
Now, to answer your questions:
CASE 1: a very large (10 GB) gzip file, loaded into Spark, with a count run on it
It is loaded by only one core on one executor, because the file is not splittable.
CASE 2: a splittable (bzip2) file of the same size, loaded into Spark, with a count run on it
Divide the file size by the HDFS block size, and that is roughly how many cores across all executors you should expect to work on counting that file.
As for any file smaller than the HDFS block size: there is no difference, because such a file fits in a single split and is counted by a single core either way.
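To make the two cases concrete, here is a minimal Spark sketch (the HDFS paths are hypothetical) you can run yourself: the number of partitions Spark creates is the number of tasks, and therefore cores, that can work on the count.

import org.apache.spark.sql.SparkSession

object SplittabilityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("splittability-demo").getOrCreate()
    val sc = spark.sparkContext

    // CASE 1: a single large gzip file -> one partition, one core does all the work.
    val gz = sc.textFile("hdfs:///data/huge.csv.gz")
    println(s"gzip partitions:  ${gz.getNumPartitions}") // expect 1

    // CASE 2: the same data compressed with bzip2 -> roughly fileSize / blockSize
    // partitions, so the count is spread across all executors.
    val bz = sc.textFile("hdfs:///data/huge.csv.bz2")
    println(s"bzip2 partitions: ${bz.getNumPartitions}")

    println(s"gzip count:  ${gz.count()}")
    println(s"bzip2 count: ${bz.count()}")

    spark.stop()
  }
}

For the 5,000 mixed-size gzip files, the same rule applies per file: each gzip file becomes exactly one partition, so the 1 KB files are trivial while each 10 GB file is chewed through by a single core, and those big files will dominate the job's run time.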

Related

Compressed file vs. uncompressed file in MapReduce: which one gives better performance?

I have a 10 GB CSV file and I want to process it in Hadoop MapReduce.
I have a 15-node (DataNode) cluster and I want to maximize throughput.
What compression format should I use? Or will an uncompressed text file always give me better results than a compressed text file? Please explain the reason.
I used an uncompressed file and it gave me better results than Snappy. Why is that?
The problem with Snappy compression is that it is not splittable, so Hadoop can't divide the input file into chunks and run several mappers over it. So most likely your 10 GB file is processed by a single mapper (check it in the application history UI). Since Hadoop stores big files in separate blocks on different machines, some parts of this file are not even located on the mapper's machine and have to be transferred over the network. That seems to be the main reason why the Snappy-compressed file works slower than plain text.
To avoid the problem, you can use bzip2 compression, or divide the file into chunks manually and compress each part with Snappy (a sketch of that second option follows below).
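If you happen to have Spark available, one convenient way to do that manual re-chunking is a small job like the sketch below (paths and the partition count are made up, and Hadoop's SnappyCodec needs the native Snappy library installed):

import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.spark.sql.SparkSession

object RechunkWithSnappy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rechunk-snappy").getOrCreate()
    val sc = spark.sparkContext

    // Read the big uncompressed (or bzip2-compressed) file in parallel...
    val lines = sc.textFile("hdfs:///data/big.csv")

    // ...and write it back out as many small Snappy-compressed parts.
    // Each part is small, so the fact that the parts themselves are not
    // splittable no longer matters.
    lines
      .repartition(64)
      .saveAsTextFile("hdfs:///data/big_csv_snappy", classOf[SnappyCodec])

    spark.stop()
  }
}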

Hadoop MapReduce with compressed/encrypted files (files of large size)

I have an HDFS cluster which stores large CSV files in compressed/encrypted form, as selected by the end user.
For compression and encryption, I have created a wrapper input stream which feeds data to HDFS in compressed/encrypted form. The compression format is GZ, the encryption format AES-256.
A 4.4 GB CSV file is compressed to 40 MB on HDFS.
Now I have a MapReduce job (Java) which processes multiple compressed files together. The MR job uses FileInputFormat.
When the splits are calculated, the 4.4 GB compressed file (40 MB) is allocated only one mapper, with split start 0 and split length equivalent to 40 MB.
How do I process such a compressed file of larger size? One option I found was to implement a custom RecordReader and use the wrapper input stream to read the uncompressed data and process it.
Since I don't have the actual length of the file, I don't know how much data to read from the input stream.
If I read up to the end of the InputStream, then how do I handle the case when two mappers are allocated to the same file, as explained below?
If the compressed file size is larger than 64 MB, then two mappers will be allocated to the same file.
How do I handle this scenario?
Hadoop version: 2.7.1
The compression format should be chosen keeping in mind whether the file will be processed by MapReduce, because if the compression format is splittable, MapReduce works normally.
However, if it is not splittable (in your case gzip is not splittable, and MapReduce will know it), then the entire file is processed by one mapper. This will serve the purpose, but it has data locality issues, since only one mapper performs the job and it has to fetch data from other nodes' blocks.
From the Hadoop definitive guide:
"For large files, you should not use a compression format that does not support splitting on the whole file, because you lose locality and make MapReduce applications very inefficient".
You can refer to the compression section in the Hadoop I/O chapter for more information.
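As a rough illustration of how MapReduce "knows" this: the text input formats essentially look up the codec registered for the file's extension and check whether it implements SplittableCompressionCodec. A minimal sketch (the path is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

object SplittableCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val factory = new CompressionCodecFactory(conf)

    // The codec is picked purely by file extension (.gz, .bz2, ...).
    val path = new Path("hdfs:///data/big.csv.gz")
    val codec = factory.getCodec(path)

    val splittable =
      codec == null ||                                // no codec: plain text, splittable
      codec.isInstanceOf[SplittableCompressionCodec]  // e.g. BZip2Codec; GzipCodec is not

    println(s"$path splittable = $splittable")
  }
}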

Hadoop split method

I know, and have read many times, that Hadoop is not aware of what's inside the input file and that the split depends on the InputFormat, but let's be more specific... For example, I read that gzip is not splittable, so if I have a single gzipped input file of 1 TB, and none of the nodes has a disk of that size, what happens? Will the input be split, with Hadoop adding information about the dependencies between one chunk and the others? Another question: if I have a huge .xml file, so basically text, how does the split work, by line or by the configured block size in MB?
BZIP2 is splittable in Hadoop: it provides a very good compression ratio, but is not optimal in terms of CPU time and performance, as compression is very CPU-consuming.
LZO is splittable in Hadoop: leveraging hadoop-lzo you get splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel; the library provides all the means to generate these indexes locally or in a distributed manner (a Spark sketch of reading such an indexed file follows this answer).
LZ4 is splittable in Hadoop: leveraging hadoop-4mc you get splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any level of speed/compression ratio: from fast mode reaching 500 MB/s compression speed up to high/ultra modes providing an increased compression ratio, almost comparable to gzip's.
ZSTD (Zstandard) is now splittable as well in Hadoop/Spark/Flink by leveraging hadoop-4mc.
Please have a look at Elephant Bird to process complex input in your jobs. In any case, XML is not natively splittable in Elephant Bird or Hadoop, AFAIK.
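As a hedged illustration of the hadoop-lzo case in Spark (the class and path names below are assumptions based on the hadoop-lzo project, which must be on the classpath): reading an indexed .lzo file through its input format so that it actually gets split.

import com.hadoop.mapreduce.LzoTextInputFormat   // from the hadoop-lzo project
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.sql.SparkSession

object ReadIndexedLzo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-indexed-lzo").getOrCreate()
    val sc = spark.sparkContext

    // The .lzo file is only split if a matching .lzo.index file sits next to it.
    val lines = sc
      .newAPIHadoopFile(
        "hdfs:///data/big.lzo",
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map(_._2.toString)

    println(s"partitions: ${lines.getNumPartitions}") // > 1 once the index exists

    spark.stop()
  }
}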

Pig and Java MR are only allocating one mapper per LZO file

I have a large amount of data in HDFS in LZO format. I have also indexed the LZO files. When I run a Java MR job, or load and process the LZO files using Pig, I see that only one mapper is being used per LZO file (the job completes without any issues, but slowly). I have the number of mappers configured to 50 in my Hadoop config, but when I process the LZO files I see only 10 mappers used (one per LZO file). Is there any other configuration I should turn on?
Software versions:
Hadoop 1.0.4
Pig 0.11
Thanks.
LZO-compressed text files cannot be processed in parallel on their own because they are not splittable (i.e. if you read from an arbitrary point in the file, you can't determine how to decompress the following compressed data), so MapReduce is forced to read such a file serially with a single mapper.
One way to deal with this is to pre-process the LZO text files to create LZO index files that MapReduce can use to split the text files and so process them in parallel.
A more effective approach is to convert the LZO text files into a splittable binary format such as Avro, Parquet, or SequenceFile (a conversion sketch follows below). These formats allow various data compression codecs (note that Snappy is now much more popular than LZO) and can also provide other benefits such as fast serialization/deserialization, column pruning, and bundled metadata.
The book "Hadoop: The Definitive Guide" has a lot of information on this topic.

Does HDFS encrypt or compress the data while storing?

When I put a file into HDFS, for example
$ ./bin/hadoop dfs -put /source/file input
Is the file compressed while storing?
Is the file encrypted while storing? Is there a config setting that we can specify to change whether it is encrypted or not?
There is no implicit compression in HDFS. In other words, if you want your data to be compressed, you have to write it that way. If you plan on writing MapReduce jobs to process the compressed data, you'll want to use a splittable compression format (a small write sketch follows this answer).
Hadoop can process compressed files and here is a nice article on it. Also, the intermediate and the final MR output can be compressed.
There is a JIRA on 'Transparent compression in HDFS', but I don't see much progress on it.
I don't think there is a separate API for encryption, though you can use a compression codec for encryption/decryption as well. Here are more details about encryption and HDFS.
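To make "you have to write it that way" concrete, here is a minimal sketch (path and payload are made up) that writes a bzip2-compressed file to HDFS through Hadoop's codec API; bzip2 keeps the file splittable for later jobs.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.compress.BZip2Codec
import org.apache.hadoop.util.ReflectionUtils

object CompressOnWrite {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()   // picks up core-site.xml / hdfs-site.xml
    val fs = FileSystem.get(conf)

    // Compress explicitly while writing; HDFS will not do it for you.
    val codec = ReflectionUtils.newInstance(classOf[BZip2Codec], conf)
    val out = codec.createOutputStream(fs.create(new Path("/user/me/input.csv.bz2")))
    try {
      out.write("col1,col2\n1,2\n".getBytes("UTF-8"))
    } finally {
      out.close()
    }
  }
}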
I very recently set compression up on a cluster. The other posts have helpful links, but the actual code you will want to get LZO compression working is here: https://github.com/kevinweil/hadoop-lzo.
You can, out of the box, use GZIP compression, BZIP2 compression, and Unix Compress. Just upload a file in one of those formats. When using the file as an input to a job, you will need to specify that the file is compressed as well as the proper CODEC. Here is an example for LZO compression.
-jobconf mapred.output.compress=true
-jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
Why am I going on and on about LZO compression? The Cloudera article referenced by Praveen goes into this. LZO compression is a splittable compression (unlike GZIP, for example). This means that a single file can be split into chunks to be handed off to mappers. Without a splittable compressed file, a single mapper will receive the entire file. This may cause you to have too few mappers and to move too much data around your network.
BZIP2 is also splittable. It also has higher compression than LZO. However, it is very slow. LZO has a worse compression ratio than GZIP. However, it is optimized to be extremely fast. In fact, it can even increase the performance of your job by minimizing disk I/O.
It takes a bit of work to set up, and is a bit of a pain to use, but it is worth it (transparent compression would be awesome). Once again, the steps are:
Install LZO and LZOP (command-line utility)
Install hadoop-lzo
Upload a file compressed with LZOP.
Index the file as described by hadoop-lzo wiki (the index allows it to be split).
Run your job (with the proper parameters mapred.output.compress and mapred.output.compression.codec; a driver-side sketch follows this list).
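For reference, the two output parameters from the last step can also be set from a new-API MapReduce driver. A rough sketch, assuming hadoop-lzo is on the classpath:

import com.hadoop.compression.lzo.LzopCodec
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object LzoOutputJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "lzo-output-example")
    job.setJarByClass(getClass)

    // Equivalent to -jobconf mapred.output.compress=true
    FileOutputFormat.setCompressOutput(job, true)
    // Equivalent to -jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
    FileOutputFormat.setOutputCompressorClass(job, classOf[LzopCodec])

    // ... set the mapper, reducer, and input/output paths as usual, then:
    // job.waitForCompletion(true)
  }
}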
