I have a large amount of data in HDFS in LZO format, and I have also indexed the LZO files. When I run a Java MR job, or load and process the LZO files using Pig, I see that only one mapper is used per LZO file (the job completes without any issues, but it is slow). I have the number of mappers configured to 50 in my Hadoop config, but when I process the LZO files only 10 mappers are used (one per LZO file). Is there any other configuration I should turn on?
Software versions:
Hadoop 1.0.4
Pig 0.11
Thanks.
LZO-compressed text files cannot be processed in parallel because they are not splittable (i.e. if you start reading at an arbitrary point in the file, you cannot determine how to decompress the compressed data that follows), so MapReduce is forced to read such a file serially with a single mapper.
One way to deal with this is to pre-process the LZO text files to create LZO index files that MapReduce can use to split the text files and so process them in parallel.
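To make that concrete, here is a minimal driver sketch assuming the twitter hadoop-lzo library is on the classpath (the class name com.hadoop.mapreduce.LzoTextInputFormat comes from that library; adjust it if your version differs). The important detail is that the index alone is not enough: the job must also use an LZO-aware input format, otherwise the default TextInputFormat still produces one split per file.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // From hadoop-lzo; the .lzo.index files are assumed to have been built already,
    // e.g. with that library's LzoIndexer/DistributedLzoIndexer tools.
    import com.hadoop.mapreduce.LzoTextInputFormat;

    public class LzoJobDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lzo-aware job");
        job.setJarByClass(LzoJobDriver.class);

        // Use the LZO-aware input format so the .lzo.index files are honoured and
        // each indexed .lzo file is split into many map tasks instead of one.
        job.setInputFormatClass(LzoTextInputFormat.class);

        job.setMapperClass(Mapper.class);          // identity mapper, stand-in for the real one
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

From Pig the idea is the same: load the files with an LZO-aware loader (the elephant-bird project provides LZO loaders) rather than the default PigStorage.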
A more effective approach is to convert the LZO text files into a splittable binary format such as Avro, Parquet, or SequenceFile. These formats allow various data compression codecs (note that Snappy is now much more popular than LZO), and can also provide other benefits such as fast serialization/deserialization, column pruning, and bundled metadata.
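As a rough sketch of what such a conversion could look like with Avro and a Snappy codec (the schema, field name, and paths below are invented for the example; any of the container formats above would work similarly):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class TextToAvro {
      public static void main(String[] args) throws Exception {
        // Hypothetical single-field schema: one line of the original text file per record.
        Schema schema = SchemaBuilder.record("LogLine").fields()
            .requiredString("line")
            .endRecord();

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.snappyCodec());   // block compression, file stays splittable
        writer.create(schema, new File(args[1]));       // local file for the sketch; use an
                                                        // OutputStream variant to write to HDFS

        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
          String line;
          while ((line = in.readLine()) != null) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("line", line);
            writer.append(rec);
          }
        }
        writer.close();
      }
    }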
The book "Hadoop: The Definitive Guide" has a lot of information on this topic.
Related
I am confused about splittable and non-splittable file formats in the big data world.
I was using the zip format, and I understood that zip files are non-splittable in the sense that when I processed such a file I had to use ZipFileInputFormat, which basically unzips it and then processes it.
Then I moved to gzip, and I am able to process it in my Spark job, but I always had a doubt: why do people say the gzip file format is also not splittable?
How is this going to affect my Spark job's performance?
For example, if I have 5k gzip files of different sizes, some of them 1 KB and some of them 10 GB, and I load them in Spark, what will happen?
Should I use gzip in my case or some other compression? If yes, then why?
Also, what is the difference in performance between these two cases:
CASE 1: I have a very large (10 GB) gzip file, I load it in Spark and run a count on it.
CASE 2: I have a splittable (bzip2) file of the same size, I load it in Spark and run a count on it.
First, you need to remember that both gzip and zip are not splittable. bzip2 is splittable as-is, and LZO is splittable once an index has been generated for it. Snappy is only a compression codec, not a file format, so it is splittable only when used inside a container format such as SequenceFile.
For the purposes of this discussion, a splittable file is one that can be processed in parallel across many machines rather than on only one.
Now, to answer your questions:
If I have a very large (10 GB) gzip file and I load it in Spark and run a count on it
It is loaded by only one core on one executor, since the file is not splittable.
A splittable (bzip2) file of the same size, loaded in Spark with a count run on it
Divide the file size by the HDFS block size, and you can expect roughly that many cores across all executors to work on counting that file (for example, a 10 GB file with a 128 MB block size produces about 80 splits, and therefore about 80 tasks).
For any file smaller than the HDFS block size there is no difference, because the whole file fits in a single split and is counted by a single task on one core either way.
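A quick way to see this difference for yourself is to check how many partitions Spark actually creates for each file. A small sketch with the Java API (the paths are placeholders; on Spark versions before 1.6 use rdd.partitions().size() instead of getNumPartitions()):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PartitionCheck {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("partition-check"));

        // A gzip file is read as a single partition, no matter how large it is.
        System.out.println("gzip partitions:  "
            + sc.textFile("hdfs:///data/huge.gz").getNumPartitions());

        // A bzip2 file of the same size is split roughly once per HDFS block,
        // so a count over it can use many cores in parallel.
        System.out.println("bzip2 partitions: "
            + sc.textFile("hdfs:///data/huge.bz2").getNumPartitions());

        sc.stop();
      }
    }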
I have a 10 GB CSV file and I want to process it in Hadoop MapReduce.
I have a 15-node (DataNode) cluster and I want to maximize the throughput.
What compression format should I use? Or will an uncompressed text file always give me better results than a compressed text file? Please explain the reason.
I used an uncompressed file and it gave me better results than Snappy. Why is that?
The problem with Snappy compression is that it is not splittable, so Hadoop cannot divide the input file into chunks and run several mappers over it. Most likely your 10 GB file is therefore processed by a single mapper (check it in the application history UI). Since Hadoop stores big files as separate blocks on different machines, some parts of this file are not even located on the mapper's machine and have to be transferred over the network. That seems to be the main reason why the Snappy-compressed file is processed more slowly than plain text.
To avoid the problem you can use bzip2 compression, or divide the file into chunks manually and compress each part with Snappy.
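For example, if the 10 GB file is itself produced by an upstream MapReduce job, one way to apply the bzip2 suggestion is to have that job write splittable bzip2-compressed output. A minimal sketch of the relevant driver settings (the class and method names here are only for illustration):

    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class Bzip2OutputConfig {
      // Called from a job driver; 'job' is an already configured Job instance.
      static void enableSplittableOutput(Job job) {
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        // bzip2 output can later be split across many mappers,
        // unlike Snappy- or gzip-compressed text.
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
      }
    }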
I have an HDFS cluster which stores large CSV files in a compressed/encrypted form, as selected by the end user.
For compression and encryption, I have created a wrapper input stream which feeds data to HDFS in compressed/encrypted form. The compression format used is GZ, the encryption format AES256.
A 4.4 GB CSV file is compressed down to 40 MB on HDFS.
Now I have a MapReduce job (Java) which processes multiple compressed files together. The MR job uses FileInputFormat.
When the splits are calculated, the 4.4 GB compressed file (40 MB) is allocated only one mapper, with split start 0 and split length equivalent to 40 MB.
How do I process such a compressed file of larger size? One option I found was to implement a custom RecordReader and use the wrapper input stream to read the uncompressed data and process it.
Since I don't have the actual length of the file, I don't know how much data to read from the input stream.
If I read up to the end of the InputStream, then how do I handle the case where 2 mappers are allocated to the same file, as explained below?
If the compressed file size is larger than 64 MB, then 2 mappers will be allocated to the same file.
How should I handle this scenario?
Hadoop Version - 2.7.1
The compression format should be chosen with an eye to whether the file will be processed by MapReduce: if the compression format is splittable, MapReduce works normally.
However, if it is not splittable (in your case gzip is not splittable, and MapReduce will know it), then the entire file is processed by one mapper. This will serve the purpose, but it has data locality issues: a single mapper performs the whole job, and it has to fetch the data from the other blocks over the network.
From "Hadoop: The Definitive Guide":
"For large files, you should not use a compression format that does not support splitting on the whole file, because you lose locality and make MapReduce applications very inefficient".
You can refer to the "Compression" section of the Hadoop I/O chapter for more information.
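If you keep the custom wrapper-stream approach from the question, one way to handle the two-mappers-per-file case is to declare these files non-splittable, so that each compressed/encrypted file always goes to exactly one mapper and the wrapper stream can read it from start to end. A minimal sketch, with a hypothetical class name:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Sketch: an input format that is never split, so each compressed/encrypted
    // file is handled start-to-finish by a single mapper. In the real job you
    // would also override createRecordReader() to plug in the existing
    // decryption/decompression wrapper stream before the lines are parsed.
    public class NonSplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;
      }
    }

You would then register it in the driver with job.setInputFormatClass(NonSplittableTextInputFormat.class).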
We're choosing the file format to store our raw logs, major requirements are compressed and splittable. Block-compressed (whichever codec) SequenceFiles and Hadoop-LZO look the most suitable so far.
Which one would be more efficient to be processed by Map-Reduce and easier to deal with overall?
For raw logs, it is recommended to use a container file format like SequenceFile, which supports both compression and splitting. For storing the logs in this format, you would choose the timestamp as the key and the logged line as the value. In our team, we use SequenceFiles extensively.
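As an illustration of that key/value layout, here is a small sketch that writes log lines into a block-compressed SequenceFile using the Hadoop 2.x writer options (the paths and the choice of DefaultCodec are just examples; block compression keeps the file splittable whichever codec you pick):

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class LogsToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DefaultCodec codec = new DefaultCodec();
        codec.setConf(conf);

        // Block compression groups many records per compressed block,
        // which is what keeps the resulting file splittable.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(new Path(args[1])),
            SequenceFile.Writer.keyClass(LongWritable.class),
            SequenceFile.Writer.valueClass(Text.class),
            SequenceFile.Writer.compression(CompressionType.BLOCK, codec));

        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
          String line;
          while ((line = in.readLine()) != null) {
            // Key: timestamp of ingestion; value: the raw logged line.
            writer.append(new LongWritable(System.currentTimeMillis()), new Text(line));
          }
        }
        writer.close();
      }
    }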
For splittable LZO, you need to pre-process the files to generate the index. Without the index, the MapReduce framework will process the entire file as a single split (one mapper) and processing will be inefficient.
In "Hadoop The Definitive Guide" book (I suggest you read the section on "Compression"), there is a section recommending the compression format to use. As per the recommendation, following are the choices from most effective to least effective:
Container file formats like SequenceFile, Avro, ORCFiles, Parquet files with a fast compressor like LZO, LZ4 or Snappy
Compression format that supports splitting: bzip2 or splittable LZO
Split the file into chunks and compress each chunk separately using a compression format
I know, and have read many times, that Hadoop is not aware of what's inside the input file and that the split depends on the InputFormat, but let's be more specific... For example, I read that gzip is not splittable, so if I have a single gzipped input file of 1 TB and none of the nodes has a disk of that size, what happens? Will the input be split, with Hadoop adding information about the dependencies between one chunk and the others? Another question: if I have a huge .xml file, so basically text, how does the split work, by line or by the configured block size in MB?
BZIP2 is splittable in Hadoop: it provides a very good compression ratio, but in terms of CPU time and performance it does not give optimal results, because compression is very CPU-intensive.
LZO is splittable in Hadoop: leveraging hadoop-lzo you get splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel; the library provides all the means to generate these indexes, either locally or in a distributed manner.
LZ4 is splittable in Hadoop: leveraging hadoop-4mc you get splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any level of speed/compression ratio: from a fast mode reaching 500 MB/s compression speed up to high/ultra modes providing an increased compression ratio, almost comparable to gzip's.
ZSTD (zstandard) is now splittable as well in hadoop/Spark/Flink by leveraging hadoop-4mc.
Please have a look at Elephant Bird to process complex input formats in your jobs. In any case, XML is not natively splittable in Elephant Bird or Hadoop, as far as I know.