Different block size Hadoop - hadoop

What do I need to do to have smaller/larger blocks in Hadoop?
Concretely, I want a larger number of mappers, each of which gets a smaller piece of data to work on. It seems that I need to decrease the block size, but I'm confused (I'm new to Hadoop): do I need to do something while putting the file on HDFS, or do I need to specify something related to the input split size, or both?
I'm sharing the cluster, so I cannot change global settings and need this on a per-job basis, if possible. I'm running the job from code (later from Oozie, possibly).

What a mapper runs on is controlled by the input split, and it is completely up to you how you specify it. The HDFS block size has nothing to do with it (other than the fact that most splitters use the block size as the basic unit for creating input splits in order to achieve good data locality). You can write your own splitter that takes an HDFS block and splits it into 100 splits, if you so fancy. Also look at Change File Split size in Hadoop.
That being said, the wisdom of doing that ('many mappers with small splits') is highly questionable. Everybody else is trying to do the opposite (create few mappers with aggregated splits). See Dealing with Hadoop's small files problem, The Small Files Problem, Amazon Elastic MapReduce Deep Dive and Best Practices and so on.

You don't really have to decrease the block size to get more mappers that each process a smaller amount of data.
You don't have to modify the HDFS block size (dfs.blocksize); leave it at the default global value from your cluster configuration.
You can set the mapreduce.input.fileinputformat.split.maxsize property in your job configuration to a value lower than the block size.
The input splits will be calculated with this value, and one mapper will be launched for every input split.
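As a rough illustration of how that property changes the mapper count, the sketch below models the split-size rule used by FileInputFormat in plain Python: the split size is max(minSize, min(maxSize, blockSize)), and the last split is allowed to run up to 10% over. It is a simplified model, not Hadoop code; the function names and the 1 GiB / 128 MiB figures are assumptions chosen for the example.

```python
# Simplified model of FileInputFormat's split calculation (not Hadoop code).
SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size

MIB = 1024 * 1024


def compute_split_size(block_size, min_size, max_size):
    """Split size is the block size clamped to [min_size, max_size]."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_len, block_size, min_size=1, max_size=2**63 - 1):
    """Return how many input splits (and so map tasks) one file yields."""
    split_size = compute_split_size(block_size, min_size, max_size)
    remaining = file_len
    splits = 0
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    file_len = 1024 * MIB    # hypothetical 1 GiB input file
    block_size = 128 * MIB   # hypothetical cluster block size
    # Default limits: one split per block.
    print(count_splits(file_len, block_size))
    # split.maxsize lowered to 32 MiB: four splits per block.
    print(count_splits(file_len, block_size, max_size=32 * MIB))
```

With these assumed figures the default yields 8 splits, and a 32 MiB maximum yields 32, without touching the HDFS block size.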

Related

How to set hadoop block size properly?

I've tried combining the files I upload into HDFS into a single file, so HDFS holds fewer files than before but the same total amount of data. In this condition I get a faster MapReduce time, I think because the job creates fewer containers (map or reduce tasks).
So I want to ask: how can I set the block size properly to get faster MapReduce? Should I set it larger than the default (to minimize the number of containers)?
Thanks a lot....
Do you know why Hadoop has strong and fast compute capacity? Because it divides a single big job into many small tasks. That's the spirit of Hadoop.
There are many mechanisms that coordinate its workflow, so adjusting the block size alone may not achieve your goal.
You can set the parameter "dfs.block.size" (named "dfs.blocksize" in Hadoop 2 and later), in bytes, to adjust the block size.

Pig: Control number of mappers

I can control the number of reducers by using PARALLEL clause in the statements which result in reducers.
I want to control the number of mappers. The data source is already created, and I can not reduce the number of parts in the data source. Is it possible to control the number of maps spawned by my pig statements? Can I keep a lower and upper cap on the number of maps spawned? Is it a good idea to control this?
I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred.tasktracker.map.tasks.maximum etc, but they seem to not help.
Can someone please help me understand how to control the number of maps and possibly share a working example?
There is a simple rule of thumb for the number of mappers: there are as many mappers as there are file splits. A file split depends on the size of the blocks into which HDFS splits your files (64MB, 128MB, 256MB depending on your configuration). Note that file input formats take the block size into account but can define their own behaviour.
Splits are important because they are tied to the physical location of the data in the cluster: Hadoop brings code to the data, not data to the code.
The problem arises when the files are smaller than the block size (64MB, 128MB, 256MB). There will then be as many splits as there are input files, which is not efficient, as each map task has a startup cost. In this case your best bet is to use pig.maxCombinedSplitSize, as it will try to read multiple small files into one mapper, in a way ignoring splits. But if you make it too large you run the risk of bringing data to the code: forcing too few mappers means data has to be streamed from other data nodes, and you can run into network limitations. Keep the value close to the block size, or half of it, and you should be fine.
Another solution might be to merge the small files into one large splittable file, which will automatically generate an efficient number of mappers.
You can change the property mapred.map.tasks to the number you want. This property holds the default number of map tasks per job, and it is only a hint to the framework: the input format's splits ultimately determine the count. Instead of setting it globally, set the property for your session so the default is restored once your job is done.
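To see why combining small files helps, here is a deliberately simplified Python sketch of packing small files into combined splits up to a size cap. It ignores data locality, which the real combining logic takes into account, and the `pack_files` helper and the file sizes are made up for illustration.

```python
MIB = 1024 * 1024


def pack_files(file_sizes, max_combined_size):
    """Greedily group files, in order, so each group stays within the cap.

    A file larger than the cap gets a group of its own.
    Returns a list of groups; each group would feed one map task.
    """
    groups = []
    current, current_size = [], 0
    for size in file_sizes:
        # Close the current group if adding this file would exceed the cap.
        if current and current_size + size > max_combined_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    small_files = [8 * MIB] * 100   # hypothetical: 100 files of 8 MiB each
    uncombined = len(small_files)   # one map task per file
    combined = len(pack_files(small_files, 128 * MIB))
    print(uncombined, combined)
```

Under these assumptions, 100 map tasks shrink to 7 when the cap equals a 128 MiB block, since 16 files of 8 MiB fit into each combined split.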

Wordcount: More than 1 map task per block, with speculative execution off

In Wordcount, it appears that you can get more than 1 map task per block, with speculative execution off.
Does the jobtracker do some magic under the hood to distribute tasks more than provided by the InputSplits?
Blocks and splits are two different things. You might get more than one mapper for one block if that block has more than one split.
The answer to this lies in the way that Hadoop InputFormats work:
In HDFS:
Let's take an example where the blocks are of size 1MB, an input file to HDFS is of size 10MB, and the minimum split size is > 1MB.
1) First, a file is added to HDFS.
2) The file is split into 10 blocks, each of size 1MB.
3) Then, each 1MB block is read by the input splitter.
4) Since the 1MB block is SMALLER than the MIN_SPLIT_SIZE, HDFS processes 1MB at a time, with no extra splitting.
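The moral of the story: FileInputFormat, not HDFS, decides the splits. It computes the split size as max(minSize, min(maxSize, blockSize)), so it will split a block further whenever the maximum split size is below the block size, which is how you get more than one map task per block.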
I guess I totally forgot about this, but looking back, this has been a feature in Hadoop since the beginning. The ability of an input format to arbitrarily split blocks up at runtime is used by many ecosystem tools to distribute loads in an application-specific way.
The tricky part here is that in toy MapReduce jobs we would expect one block per split in all cases, and then in real clusters we overlook the default split size parameters, which don't come into play unless you are using large files.
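The 10MB file with 1MB blocks can be worked through with a small Python model of the split rule max(minSize, min(maxSize, blockSize)). This is a sketch, not Hadoop code: the helper names are hypothetical, and the 512 KiB maximum and 2 MiB minimum are assumed values picked to show both directions.

```python
MIB = 1024 * 1024
SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size


def list_splits(file_len, block_size, min_size=1, max_size=2**63 - 1):
    """Return (offset, length) pairs, modelled on FileInputFormat's rule."""
    split_size = max(min_size, min(max_size, block_size))
    splits, offset = [], 0
    remaining = file_len
    while remaining / split_size > SPLIT_SLOP:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    if remaining > 0:
        splits.append((offset, remaining))
    return splits


def splits_per_block(splits, block_size):
    """Count how many splits start inside each block index."""
    counts = {}
    for offset, _length in splits:
        block = offset // block_size
        counts[block] = counts.get(block, 0) + 1
    return counts


if __name__ == "__main__":
    file_len, block_size = 10 * MIB, 1 * MIB
    default = list_splits(file_len, block_size)
    smaller = list_splits(file_len, block_size, max_size=512 * 1024)
    larger = list_splits(file_len, block_size, min_size=2 * MIB)
    print(len(default), len(smaller), len(larger))
    print(splits_per_block(smaller, block_size))
```

In this model the defaults give 10 splits (one per block), a 512 KiB maximum gives 20 splits (two map tasks per block), and a 2 MiB minimum gives 5 splits that each span two blocks.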

Hadoop smaller input file

I am using Hadoop in a slightly different way. In my case, the input size is really small, but the computation time is long. I have a complicated algorithm that I run on every line of input, so even though the input is less than 5MB, the overall computation time is over 10 hours. That is why I am using Hadoop here. I am using NLineInputFormat to split the file by number of lines rather than by block size. In my initial testing, I had around 1500 lines (split by 200 lines) and I saw only a 1.5x improvement on a four-node cluster compared to running it serially on one machine. I am using VMs. Could that be the issue, or are there not many benefits with Hadoop for small inputs? Any insights will be really helpful.
To me, your workload resembles the SETI@home workload: small payloads but hours of crunching time.
Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce - the processing framework you are using.
If you want to keep your workload together:
1) Split them into individual files (one workload, one file). If a file is smaller than the block size, it will go to one mapper. Typical block sizes are 64MB or 128MB.
2) Create a wrapper for FileInputFormat and override the 'isSplitable()' method to return false. This makes sure the entire file contents are fed to one mapper, rather than Hadoop trying to split it line by line.
reference : http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html
Hadoop is not really good at dealing with tons of small files; hence, it is often desirable to combine a large number of smaller input files into a smaller number of larger files so as to reduce the number of mappers.
Input to a Hadoop MapReduce process is abstracted by InputFormat. FileInputFormat is a default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper bounded by the block size. This means the number of input splits is lower bounded by the number of input files. This is not an ideal environment for a MapReduce process when it's dealing with a large number of small files, because the overhead of coordinating distributed processes is far greater than when there is a relatively small number of large files.
The basic parameter that drives the split size is mapred.max.split.size (mapreduce.input.fileinputformat.split.maxsize in newer releases).
Using CombineFileInputFormat and this parameter, we can control the number of mappers.
Check out the implementation I had for another answer here.

How to increase number of mappers in Mahout MatrixMultiplicationJob?

I am using Mahout 0.7's MatrixMultiplicationJob for multiplying a large matrix. But it always uses 1 map task, which makes it slow. It's probably due to the InputSplit, which forces the number of mappers to be 1.
Is there a way I can efficiently multiply matrices in Hadoop / Mahout or change the number of mappers?
Ultimately, it is Hadoop that decides how many mappers to use. Generally it will use one mapper per HDFS block (typically 64 or 128MB). If your data is smaller than that, it's too small to bother with more than 1 mapper.
You can encourage it to use more anyway by setting mapred.max.split.size to something smaller than 64MB (remember the value is set in bytes, not MB). But are you sure you want to? It is much more common to need more reducers, not mappers, since Hadoop will never use more than 1 unless you (or your job) tell it to.
Also know that Hadoop will not be able to use more than one mapper on a single file compressed with a non-splittable codec such as gzip. So if your input is one huge compressed file of that kind, it will only ever use 1 mapper on that file. You can however split it up yourself into many smaller compressed files.
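Since the split-size value is given in bytes, a small helper like the following can avoid unit mistakes when building the option. This is a Python sketch; the `split_size_option` helper and the 16 MB figure are illustrative assumptions, and whether a given job honors a `-D` option depends on it parsing Hadoop's generic options.

```python
def split_size_option(megabytes, prop="mapred.max.split.size"):
    """Build a -D option string with the size converted from MB to bytes."""
    if megabytes <= 0:
        raise ValueError("split size must be positive")
    size_bytes = int(megabytes * 1024 * 1024)
    return f"-D{prop}={size_bytes}"


if __name__ == "__main__":
    # 16 MB expressed in bytes, well below a 64 MB block.
    print(split_size_option(16))
```

For 16 MB this produces `-Dmapred.max.split.size=16777216`.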
Have you tried to specify the number of mappers via the command line with the -Dmapred.map.tasks=N option? I haven't tried it, but it should work. If it doesn't work, then try setting this parameter in the MAHOUT_OPTS environment variable...

Resources