I can control the number of reducers by using PARALLEL clause in the statements which result in reducers.
I want to control the number of mappers. The data source is already created, and I can not reduce the number of parts in the data source. Is it possible to control the number of maps spawned by my pig statements? Can I keep a lower and upper cap on the number of maps spawned? Is it a good idea to control this?
I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred.tasktracker.map.tasks.maximum etc, but they seem to not help.
Can someone please help me understand how to control the number of maps and possibly share a working example?
There is a simple rule of thumb for number of mappers: There is as many mappers as there are file splits. A file split depends on the size of the block into which you HDFS splits the files (64MB, 128MB, 256MB depending on your configuration), please note that FileInput formats take into account, but can define their own behaviour.
Splits are important, because they are tied to the physical location of the data in the cluster, Hadoop brings code to the data and not data to the code.
The problem arises when the size of the file is less than the size of the block (64MB, 128MB, 256MB), this means there will be as many splits as there are input files, which is not efficient, as each Map Task usually startup time. In this case your best bet is to use pig.maxCombinedSplitSize, as it will try to read multiple small files into one Mapper, in a way ignore splits. But if you make it too large you run a risk of bringing data to the code and will run into network issues. You could have network limitations if you force too few Mappers, as data will have to be streamed from other data nodes. Keep the number close to the block size or half of it and you should be fine.
Other solution might be to merge the small files into one large splitable file, that will automatically generate and efficient number of Mappers.
You can change the property mapred.map.tasks to number you want. THis property contains default map task/job. Instead of setting it globally set the property for your session so default will be restored once your job is done.
Related
I don't see a value for the reducers in Hadoop in the following scenario:
The Map Tasks generate unique keys (Because we can merge both the Map/Reduce functionality together)
The output size of the Map Tasks is too big (This will exhaust the memory if we wait for the reducers to begin the work)
If we have any functionality that doesn't need grouping and sorting of the keys
Please correct me if I am wrong.
And if someone could give me a real example of the benefits of the reducers and when it should be used, I will appreciate it.
Reducer is beneficial (or required) when you need to do operations like aggregation/grouping etc..
FYI : Reducer is meant for grouping different value for a key which comes from different mapper. So for a use case which do not require grouping/aggregation then there is no point of using reducer(you can set it to Zero , meaning Map-Only jobs).
One quick use-case i can think of is - you want to randomly split a big file to multiple part file. In this case you will supply big file (lets say 100G) to Map-Only jobs. All maps will read a chunk of file and write as a part of file.
What do I need to do to have smaller/larger blocks in Hadoop?
Concretely, I want to have larger number of mappers, that gets smaller piece of data to work on. It seems that I need to decrease the block size, but I'm confused (I'm new to Hadoop) - do I need to do something while putting the file on HDFS, or do I need to specify something related to input split size, or both?
I'm sharing the cluster, so I cannot perform global settings, so need this on a per-job basis, if possible? And I'm running the job from code (later from Oozie, possibly).
What a mapper runs is controlled by the input split, and is completely up to you how you specify it. The HDFS block size has nothing to do with it (other than the fact that most splitters use the block size as a basic 'block' for creating the input splits in order to achieve good data locality). You can write your own splitter that takes an HDFS block and splits in 100 splits, if you so fancy. Aslo look at Change File Split size in Hadoop.
Now that being said, the wisdom of doing that ('many mappers with small splits') is highly questionable. Everybody else is trying to do the opposite (create few mappers with aggregated splits). See Dealing with Hadoop's small files problem, The Small Files Problem, Amazon Elastic MapReduce Deep Dive and Best Practices and so on.
You dont really have to decrease the block size to have more mappers , that would process smaller amount of data.
You dont have to modify the HDFS block size ( dfs.blocksize ), let it be with th default global value as per your cluster configuration.
You may use mapreduce.input.fileinputformat.split.maxsize property in your job configuration with lower value than the block size.
The input splits will be calculated with this value and one mapper will be triggered for every input split calculated.
I am using hadoop in a little different way. In my case, input size is really small. However, computation time is more. I have some complicated algorithm which I will be running on every line of input. So even though the input size is less than 5mb, the overall computation time is over 10hrs. So I am using hadoop here. I am using NLineInputFormat to split the file by number of lines rather than block size. In my initial testing, I had around 1500 lines (Splitting by 200 lines) and I saw only a improvement of 1.5 times in a four node cluster compared to that of running it serially on one machine. I am using VM's. Could that be the issue or for smaller size input there wont be much benefits with hadoop? Any insights will be really helpful.
To me, your workload resembles SETI#Home work load -- small payloads but hours of crunching time.
Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce - the processing framework you are using.
If you want to keep your workload together:
1) split them into individual files (one workload, one file) if the file is less than block size then it will go to one mapper. Typical block sizes are 64MB or 128MB
2) create a wrapper for FileInputFormat, and override the 'isSplitable()' method to false. This will make sure entire file contents are fed to one mapper, rather than hadoop trying to split it line by line
reference : http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html
Hadoop is not really good at dealing with tons of small files, hence, it is often desired to combine a large number of smaller input files into less number of bigger files so as to reduce number of mappers.
As Input to Hadoop MapReduce process is abstracted by InputFormat. FileInputFormat is a default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits typically upper bounded by block size. This means the number of input splits is lower bounded by number of input files. This is not an ideal environment for MapReduce process when it’s dealing with large number of small files, because overhead of coordinating distributed processes is far greater than when there is relatively large number of small files.
The basic parameter which drives the spit size is mapred.max.split.size.
Using CombineFileInputFormat and this parameter we can control the number of mappers.
Checkout the implementation I had for another answer here.
I am using Mahout 0.7's MatrixMultiplicationJob for multiplying a large matrix. But it always uses 1 map task which makes it slow. its probably due to the InputSplit which forces the number of mappers to be 1.
Is there a way I can efficiently multiply matrices in Hadoop / Mahout or change the number of mappers?
Ultimately, it is Hadoop that decides how many mappers to use. Generally it will use one mapper per HDFS block (typically 64 or 128MB). If your data is smaller than that, it's too small to bother with more than 1 mapper.
You can encourage it to use more anyway by setting mapred.max.split.size to something smaller than 64MB (remember the value is set in bytes, not MB). But, are you sure you want to? It is much more common to need more reducers, not mappers, since Hadoop will never use more than 1 unless you (or your job) tells it to.
Also know that Hadoop will not be able to use more than one mapper on a single compressed file. So if your input is one huge compressed file, it will only ever use 1 mapper on that file. You can however split it up yourself into many smaller compressed files.
had you tried to specify number of mappers via command line with -Dmapred.map.tasks=N option? I hadn't tried it, but it should work. If it won't work, then try to set this parameter in the MAHOUT_OPTS environment variable...
I have a job which is going very slowly because I think hadoop is creating too many map tasks for the size of the data. I read on some websites that its efficient for fewer maps to process bigger chunks of data -- is there any way to force this? Thanks
Two possibilities:
increase the block size of your 90gb data, setting this to 128m or larger will make your map tasks "work more"
use the CombineFileInputFormat and batch your blocks together to the size you think it is appropriate.
The first solution requires you to rewrite the data to change the block size, the second solution can be embedded in your job.
Many maps indeed can have serious performance impact since overhead of map task starting is from 1 to 3 seconds, depending on your settings and hardware.
The main setting here is JVM reuse (mapred.job.reuse.jvm.num.tasks). Set it to -1 and you will probabbly got serious performance boost.
The usual root cause of this problem is a lot of small files. It is discussed here:
Processing large set of small files with Hadoop
The solutions are around orgenizing them together.
If your files are indeed big, but splittable - you can increase block side, thus reducing number of split and thereof - number of maps
Increase the split size or use CombineFileInputFormat for packing multiple files in a single split thus reducing the number of map tasks required to process the data.