I have a job that is running very slowly, and I think it's because Hadoop is creating too many map tasks for the size of the data. I've read on some websites that it's more efficient for fewer maps to process bigger chunks of data -- is there any way to force this? Thanks
Two possibilities:
increase the block size of your 90 GB of data; setting this to 128 MB or larger will make your map tasks "work more"
use CombineFileInputFormat and batch your blocks together to whatever size you think is appropriate.
The first solution requires you to rewrite the data to change its block size; the second can be embedded in your job.
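To see why block size drives the number of map tasks, a back-of-the-envelope calculation helps (plain Python, not Hadoop code; the formula is FileInputFormat's usual rule of thumb):

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def split_size(block_size, min_size=1, max_size=float("inf")):
    # FileInputFormat's rule of thumb: max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def num_map_tasks(total_bytes, block_size):
    # roughly one map task per input split
    return math.ceil(total_bytes / split_size(block_size))

print(num_map_tasks(90 * GB, 64 * MB))   # 1440 map tasks
print(num_map_tasks(90 * GB, 256 * MB))  # 360 map tasks
```

So going from 64 MB to 256 MB blocks on 90 GB of data cuts the task count by 4x, and with it the per-task startup overhead.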
Too many maps can indeed have a serious performance impact, since the startup overhead of a map task is one to three seconds, depending on your settings and hardware.
The main setting here is JVM reuse (mapred.job.reuse.jvm.num.tasks). Set it to -1 and you will probably get a serious performance boost.
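For reference, a sketch of the corresponding mapred-site.xml entry (this is the MRv1 property; as far as I know, JVM reuse was dropped in YARN/MRv2, where uber tasks serve a similar purpose):

```xml
<!-- mapred-site.xml (MRv1): -1 means reuse each JVM for an
     unlimited number of tasks of the same job -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```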
The usual root cause of this problem is a lot of small files. It is discussed here:
Processing large set of small files with Hadoop
The solutions revolve around organizing them together.
If your files are indeed big but splittable, you can increase the block size, thus reducing the number of splits and therefore the number of maps.
Increase the split size, or use CombineFileInputFormat to pack multiple files into a single split, thus reducing the number of map tasks required to process the data.
Related
I've tried combining the files I upload into HDFS into one file, so HDFS now has fewer files than before but the same total size. In this situation I get faster MapReduce times, I think because the process creates fewer containers (map or reduce tasks).
So I want to ask: how can I set the block size properly to get a faster MapReduce job? Should I set it bigger than the default (to minimize the number of containers)?
Thanks a lot....
Do you know why Hadoop has such strong and fast compute capacity? Because it divides a single big job into many small jobs. That's the spirit of Hadoop.
There are also many mechanisms coordinating its workflow, so adjusting the block size alone may not attain your target.
You can set the parameter "dfs.block.size" (in bytes) to adjust the block size.
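For example, a sketch of the hdfs-site.xml entry (note: newer Hadoop versions spell the property dfs.blocksize, and existing files keep their old block size until they are rewritten):

```xml
<!-- hdfs-site.xml: default block size for newly written files, in bytes -->
<property>
  <name>dfs.block.size</name>
  <value>268435456</value> <!-- 256 MB -->
</property>
```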
The task is to process HUGE (around 10,000,000) number of small files (each around 1MB) independently (i.e. the result of processing file F1, is independent of the result of processing F2).
Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR.
The reason is that in my case the processing of each file is independent. As far as I understand MR, it works best when the output depends on many individual files (for example, counting the frequency of each word across many documents, since a word might appear in any document in the input). But in my case, I just need a lot of independent CPUs/cores.
I was wondering if you have any advice on this.
Side note: there is another issue, which is that MR works best for "huge files rather than a huge number of small files". There seem to be solutions for that, though, so I am ignoring it for now.
It is possible to use MapReduce for your needs. In MapReduce there are two phases, Map and Reduce, but the Reduce phase is not a must. For your situation you could write a map-only MapReduce job, with all the calculations on a single file placed in a customised map function.
However, I haven't processed such a huge number of files in a single job, so I have no idea about its efficiency. Try it yourself, and share with us :)
This is quite easy to do. In such cases the data for an MR job is typically the list of files (and not the files themselves). So the size of the data submitted to Hadoop is the size of 10M file names, which is on the order of a couple of gigabytes at most.
One uses MR to split the list of files into smaller fragments (how many can be controlled by various options). Each mapper then gets a list of files, processes them one at a time, and generates the output.
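A sketch of the idea in plain Python (the names here are illustrative, not Hadoop API): the job input is the list of file names, chunked into fragments, one fragment per mapper.

```python
def partition(file_names, num_mappers):
    # Round-robin the file names into one fragment per mapper;
    # each mapper then opens and processes its own files.
    fragments = [[] for _ in range(num_mappers)]
    for i, name in enumerate(file_names):
        fragments[i % num_mappers].append(name)
    return fragments

# 10,000 stand-in names (the real job would have ~10M)
files = ["s3://bucket/input/file-%07d" % i for i in range(10_000)]
fragments = partition(files, 50)
print(len(fragments), len(fragments[0]))  # 50 fragments of 200 names each
```

Each fragment becomes one map task's input, so the number of mappers is decoupled from the number of files.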
(fwiw - I would suggest Qubole (where I am a founder) instead of EMR cause it would save you a ton of money with auto-scaling and spot integration).
I can control the number of reducers by using PARALLEL clause in the statements which result in reducers.
I want to control the number of mappers. The data source is already created, and I can not reduce the number of parts in the data source. Is it possible to control the number of maps spawned by my pig statements? Can I keep a lower and upper cap on the number of maps spawned? Is it a good idea to control this?
I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred.tasktracker.map.tasks.maximum etc, but they seem to not help.
Can someone please help me understand how to control the number of maps and possibly share a working example?
There is a simple rule of thumb for the number of mappers: there are as many mappers as there are file splits. A file split depends on the size of the block into which HDFS splits your files (64 MB, 128 MB, or 256 MB depending on your configuration); note that FileInputFormat implementations take the block size into account but can define their own behaviour.
Splits are important because they are tied to the physical location of the data in the cluster: Hadoop brings code to the data, not data to the code.
The problem arises when the size of a file is less than the size of a block (64 MB, 128 MB, 256 MB). This means there will be as many splits as there are input files, which is not efficient, as each map task incurs startup time. In this case your best bet is pig.maxCombinedSplitSize, as it will read multiple small files into one mapper, in a way ignoring splits. But if you make it too large you run the risk of bringing data to the code and will hit network issues: if you force too few mappers, data will have to be streamed from other data nodes. Keep the value close to the block size, or half of it, and you should be fine.
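For reference, a hedged sketch of how this might look at the top of a Pig script (the value is in bytes; 128 MB shown here, adjust to your block size):

```pig
-- pack small-file splits up to ~128 MB per mapper (value is in bytes)
set pig.maxCombinedSplitSize 134217728;
```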
Another solution might be to merge the small files into one large splittable file, which will automatically produce an efficient number of mappers.
You can change the property mapred.map.tasks to the number you want; this property holds the default number of map tasks per job. Instead of setting it globally, set the property for your session, so the default is restored once your job is done.
After running a map-reduce job in Hadoop, the result is a directory with part files. The number of part files depend on the number of reducers, and can reach dozens (80 in my case).
Does keeping multiple part files affect the performance of future map-reduce operations, for better or worse? Will taking an extra reduction step and merging all the parts improve or worsen the speed of further processing?
Please refer only to map-reduce performance issues. I don't care about splitting or merging these results in any other way.
Running further mapreduce operations on the part directory should have little to no impact on overall performance.
The reason is that the first thing Hadoop does is split the data in the input directory according to its size and place the splits onto the mappers. Since it's already splitting the data into separate chunks, splitting one file vs. many shouldn't impact performance: the amount of data transferred over the network should be roughly equal, as should the amount of processing and disk time.
There might be some degenerate cases where part files are slower, for example if instead of one large file you had thousands or millions of tiny part files. I can also think of situations where having many part files would be faster. For example, if your files aren't splittable (not usually the case unless you are using certain compression schemes), then your one big file has to go to a single mapper since it's unsplittable, whereas the many part files would be distributed more or less as normal.
It all depends on what the next task needs to do.
If you have analytics data and you have 80 files per (partially processed) input day then you have a huge performance problem if the next job needs to combine the data over the last two years.
If however you have only those 80 then I wouldn't worry about it.
I am using Hadoop in a slightly different way. In my case the input size is really small, but the computation time is long: I run a complicated algorithm on every line of input. So even though the input size is less than 5 MB, the overall computation time is over 10 hours, which is why I am using Hadoop. I use NLineInputFormat to split the file by number of lines rather than by block size. In my initial testing I had around 1500 lines (splitting by 200 lines), and I saw an improvement of only 1.5x on a four-node cluster compared to running serially on one machine. I am using VMs. Could that be the issue, or is it that for smaller input sizes there won't be much benefit from Hadoop? Any insights will be really helpful.
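One thing worth checking is how many splits NLineInputFormat actually produces with these numbers; a rough sketch (plain arithmetic, using the figures above):

```python
import math

lines, lines_per_split, nodes = 1500, 200, 4

splits = math.ceil(lines / lines_per_split)  # number of map tasks
waves = math.ceil(splits / nodes)            # "waves" of mappers on the cluster
print(splits, waves)  # 8 splits -> 2 waves on 4 nodes
```

With only 8 tasks in 2 waves, a single slow VM stalls a whole wave, and per-task startup overhead is a large fraction of the runtime; smaller splits (more, shorter tasks) may balance better.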
To me, your workload resembles the SETI@home workload -- small payloads but hours of crunching time.
Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce - the processing framework you are using.
If you want to keep your workload together:
1) split them into individual files (one workload, one file); if a file is smaller than the block size it will go to one mapper. Typical block sizes are 64 MB or 128 MB
2) create a wrapper around FileInputFormat and override the isSplitable() method to return false. This makes sure the entire file's contents are fed to one mapper, rather than Hadoop trying to split it line by line
reference : http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html
Hadoop is not really good at dealing with tons of small files; hence, it is often desirable to combine a large number of small input files into a smaller number of bigger files, so as to reduce the number of mappers.
Input to a Hadoop MapReduce job is abstracted by InputFormat. FileInputFormat is the default implementation, dealing with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper-bounded by the block size. This means the number of input splits is lower-bounded by the number of input files. This is not an ideal environment for a MapReduce job dealing with a large number of small files, because the overhead of coordinating the distributed processes is far greater than it would be with a relatively small number of large files.
The basic parameter which drives the split size is mapred.max.split.size.
Using CombineFileInputFormat and this parameter we can control the number of mappers.
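A rough sketch of how the cap works (plain Python, illustrative only; real CombineFileInputFormat also considers node and rack locality):

```python
import math

MB = 1024 ** 2

def combined_mapper_count(file_sizes, max_split_size):
    # CombineFileInputFormat packs whole small files into splits of up
    # to max_split_size bytes, so mappers scale with total bytes,
    # not with the number of files
    return max(1, math.ceil(sum(file_sizes) / max_split_size))

small_files = [1 * MB] * 10_000  # 10,000 files of 1 MB each
print(combined_mapper_count(small_files, 256 * MB))  # 40 mappers instead of 10,000
```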
Checkout the implementation I had for another answer here.