Splitting BZip2 is not working - Hadoop

I have a 1.2 GB file in Hadoop, compressed with the BZip2 codec. Our Hadoop YARN cluster has 10 nodes. The HDFS block size is 128 MB, so I think the file is split into 10 blocks. BZip2 is supposed to be a splittable codec, so I expected that when I start processing the input file, Hadoop would run 10 map tasks (one for each block). But when I look at the job logs, I can see only one map task.
I did not find any setting that limits the number of mappers in YARN (in contrast with Hadoop 1).
What am I missing or what am I doing wrong?
Thank you

I've never used BZip2, but I think this issue may have to do with your FileInputFormat; you may need to configure it. Please take a look at this answer.
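One quick way to narrow this down is to check whether Hadoop resolves your file to a codec that implements SplittableCompressionCodec, which is what the stock TextInputFormat consults before allowing splits. A minimal sketch, with a hypothetical class name and HDFS path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class CodecSplitCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hypothetical path; point it at your own .bz2 file in HDFS.
            Path input = new Path(args.length > 0 ? args[0] : "/data/input.bz2");

            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);
            if (codec == null) {
                System.out.println("No codec matched; the file is treated as plain text (splittable).");
            } else {
                boolean splittable = codec instanceof SplittableCompressionCodec;
                System.out.println(codec.getClass().getSimpleName()
                        + (splittable ? " implements " : " does NOT implement ")
                        + "SplittableCompressionCodec");
            }
        }
    }

BZip2Codec does implement that interface, so if this check passes but you still see only one mapper, the input format your job actually uses is the next place to look.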

Related

How does file compression format affect my Spark processing?

I am confused about splittable and non-splittable file formats in the big data world.
I was using the zip format, and I understood that zip files are non-splittable in the sense that when I processed them I had to use ZipFileInputFormat, which basically unzips each file and then processes it.
Then I moved to gzip, and I am able to process it in my Spark job, but I always had a doubt: why do people say the gzip format is also not splittable?
How is this going to affect my Spark job's performance?
So, for example, if I have 5k gzip files of different sizes, some of them 1 KB and some of them 10 GB, and I load them in Spark, what will happen?
Should I use gzip in my case or some other compression? If yes, then why?
Also, what is the difference in performance between:
CASE 1: I have a very huge (10 GB) gzip file, load it in Spark, and run count on it
CASE 2: I have a splittable (bzip2) file of the same size, load it in Spark, and run count on it
First, you need to remember that both Gzip and Zip are not splittable. LZO and Bzip2 are the only splittable archive formats. Snappy is also splittable, but it's only a compression format.
For the purpose of this discussion, splittable files mean they can be processed in parallel across many machines rather than on only one.
Now, to answer your questions:
If I have a very huge (10 GB) gzip file and then I load it in Spark and run count on it
It's loaded by only one CPU on one executor, since the file is not splittable.
If I have a splittable (bzip2) file of the same size and then load this in Spark and run count on it
Divide the file size by the HDFS block size, and you should expect that many cores across all executors to be working on counting that file.
Regarding any file smaller than the HDFS block size, there is no difference, because it'll still take an entire HDFS block read on one CPU just to count that one tiny file.
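To make the two cases above concrete, a small Spark sketch (Java API, hypothetical file paths) prints how many partitions each file yields before counting; the gzip file comes back as a single partition, while the bzip2 file is split roughly per HDFS block:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SplitCompare {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("split-compare"));

            // Hypothetical paths; substitute your own files of comparable size.
            JavaRDD<String> gz  = sc.textFile("hdfs:///data/huge.csv.gz");
            JavaRDD<String> bz2 = sc.textFile("hdfs:///data/huge.csv.bz2");

            // A gzip file comes back as one partition, so count() runs on one core.
            System.out.println("gzip partitions:  " + gz.getNumPartitions());
            // A bzip2 file is split roughly per HDFS block, so count() can use many cores.
            System.out.println("bzip2 partitions: " + bz2.getNumPartitions());

            System.out.println("gzip count:  " + gz.count());
            System.out.println("bzip2 count: " + bz2.count());

            sc.stop();
        }
    }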

Is it possible to create/work with a non-parallelized file in Hadoop?

We always talk about how much faster things will be if we use Hadoop to parallelize our data and programs.
I would like to know: is it possible to keep a small file on one specific DataNode (not parallelized)?
possible to keep a small file in one specific dataNode
HDFS will try to split any file into HDFS blocks. The datanodes don't store the entire file, nor should you try to store it on a particular one. Let Hadoop manage the data locality.
Your file will be replicated 3 times by default in Hadoop for fault tolerance anyway.
If you have small files (less than the HDFS block size, 64 or 128 MB depending on the Hadoop version), then you probably shouldn't be using Hadoop. If you need parallelized processing, start with multi-threading. If you actually need distributed processing, my recommendation nowadays would be Spark or Flink, not Hadoop (MapReduce).
If you want this, it seems like you want object storage, not block storage.
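If you're curious where HDFS actually put the blocks of a given file, a small sketch along these lines (hypothetical class name and path) asks the NameNode for the block locations and replication factor, which also shows why you don't pick a specific datanode yourself:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; point it at any HDFS file you want to inspect.
            Path file = new Path(args.length > 0 ? args[0] : "/data/small-file.txt");
            FileStatus status = fs.getFileStatus(file);

            System.out.println("block size:  " + status.getBlockSize());
            System.out.println("replication: " + status.getReplication());

            // Each BlockLocation lists the datanodes holding replicas of one block.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + loc.getOffset() + " len " + loc.getLength()
                        + " hosts " + String.join(",", loc.getHosts()));
            }
        }
    }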

Compressed file vs. uncompressed file in MapReduce: which one gives better performance?

I have a 10 GB CSV file and I want to process it in Hadoop MapReduce.
I have a 15-node (DataNode) cluster and I want to maximize throughput.
What compression format should I use? Or will an uncompressed text file always give me better results than a compressed one? Please explain the reason.
I used an uncompressed file and it gave me better results than Snappy. Why is that?
The problem with Snappy compression is that it is not splittable, so Hadoop can't divide the input file into chunks and run several mappers over it. So most likely your 10 GB file is processed by a single mapper (check it in the application history UI). Since Hadoop stores big files as separate blocks on different machines, some parts of this file are not even located on the mapper's machine and have to be transferred over the network. That seems to be the main reason why the Snappy-compressed file works slower than plain text.
To avoid the problem you can use bzip2 compression, or divide the file into chunks manually and compress each part with Snappy.
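As a rough sketch of the first option (class names are hypothetical), a map-only MapReduce job can rewrite the input with bzip2 output compression so that later jobs receive splittable input:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RecompressToBzip2 {
        // Identity mapper: emits each input line unchanged.
        public static class PassThrough extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws java.io.IOException, InterruptedException {
                ctx.write(value, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "recompress-to-bzip2");
            job.setJarByClass(RecompressToBzip2.class);
            job.setMapperClass(PassThrough.class);
            job.setNumReduceTasks(0);                 // map-only job
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            // Write bzip2-compressed text output so downstream jobs can split it.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }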

Hadoop split method

I know, and have read many times, that Hadoop is not aware of what's inside the input file and that the split depends on the InputFormat, but let's be more specific. For example, I read that GZIP is not splittable, so if I have a single gzipped input file of 1 TB, and none of the nodes has a disk of that size, what happens? The input will be split, but will Hadoop add information about the dependencies between one chunk and the others? Another question: if I have a huge .xml file, so basically text, how does the split work, by line or by the configured MB of the block size?
BZIP2 is splittable in Hadoop - it provides a very good compression ratio, but in terms of CPU time and performance it does not give optimal results, as its compression is very CPU-intensive.
LZO is splittable in Hadoop - leveraging hadoop-lzo you have splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel. The library provides all the means of generating these indexes in a local or distributed manner.
LZ4 is splittable in Hadoop - leveraging hadoop-4mc you have splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any level of speed/compression ratio: from fast mode reaching 500 MB/s compression speed up to high/ultra modes providing an increased compression ratio, almost comparable with GZIP's.
ZSTD (zstandard) is now splittable as well in Hadoop/Spark/Flink by leveraging hadoop-4mc.
Please have a look at Hadoop Elephant Bird to process complex input formats in your jobs. Anyway, XML is not natively splittable in EB or Hadoop, AFAIK.
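On the last part of the question (a huge plain-text/XML file): with the stock FileInputFormat, split boundaries come from the block/split size, not from lines; the LineRecordReader then handles a line that straddles a split by reading past the split's end. A minimal sketch (hypothetical job setup) of the knobs that bound split sizes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-demo");
            job.setInputFormatClass(TextInputFormat.class);

            // Split boundaries are computed from these bounds plus the HDFS block
            // size; they are byte offsets, not line boundaries.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB

            System.out.println("min split size: " + FileInputFormat.getMinSplitSize(job));
            System.out.println("max split size: " + FileInputFormat.getMaxSplitSize(job));
        }
    }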

Mahout seqdirectory MapReduce just gets one map task

I want to have more map tasks to increase the parallelism of the Mahout seqdirectory job, but every time I try, it creates just one map task.
Hadoop version: 1.2.1
Mahout version: 0.8/0.9 (tested both; neither produces more map tasks)
Scenario: lots of small files (about 566,114 files of a few KB each) stored in HDFS
Before hitting this problem (always one map task), I had faced another problem (GC overhead limit exceeded), so I had allocated more memory to solve it.
When I found that only one map task was being spawned, I edited the Hadoop configuration files (mapred-site.xml, hadoop-env.sh, ...). I set mapred.map.tasks to 20 in mapred-site.xml. This did not help.
I found out that the number of map tasks is decided by the number of chunks. In my case, the total size of the files is over 500 MB (bigger than the default chunk size of 64 MB). I dug into the Mahout source code (v0.9) but am not finding any solution.
Also, since I suspected that the problem could be caused by the small files, I created two big files (two 500 MB files), loaded them into HDFS, and ran the seqdirectory command. But still just one map task is launched.
I have no idea how to solve this. Can someone with knowledge of Hadoop and Mahout help me?
Note: I found that someone has already described this problem; see MAHOUT-833.
I am tracing SequenceFileFromDirectory.java from the Mahout 0.9 source code, trying to figure out why.
Here is my configuration info: job config file, Hadoop config files (mapred-site.xml, hadoop-env.sh).
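For reference, the mapred.map.tasks setting mentioned above can also be set programmatically with the old mapred API (matching Hadoop 1.2.1); it is only a hint to the framework, and the actual number of map tasks comes from the number of input splits, which is consistent with it having no effect here. A minimal sketch with a hypothetical class name:

    import org.apache.hadoop.mapred.JobConf;

    public class MapTaskHint {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Equivalent to setting mapred.map.tasks=20 in mapred-site.xml.
            // This is only a hint: the real count is driven by the input splits
            // the InputFormat produces.
            conf.setNumMapTasks(20);

            System.out.println("mapred.map.tasks = " + conf.get("mapred.map.tasks"));
        }
    }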

Resources