How to set hadoop block size properly? - hadoop

I've tried to combine my file to upload into HDFS into one file. So, the HDFS have files number smaller than before but with the same size. So, in this condition I get faster mapreduce time, because I think the process make fewer container (map task or reduce task).
So, I want to ask, how can I set the block size properly, to get faster mapreduce? Should I set bigger than default (minimze container number)?
Thanks a lot....

Do you know, why hadoop have strong and fast compute capacity? Because it divides a single big job into many small jobs. That's the spirit of hadoop.
And there are many mechanism to coordinate it work flow, maybe adjust the blocksize couldn't attain your target.
You can set the parameter "dfs.block.size" in bytes to adjust the blocksize.

Related

How if I set hdfs blocksize to 1 GB?

I want to ask. How if I set the hdfs blocksize to 1 GB, and I'll upload file with size almost 1 GB. Would it become faster to process mapreduce? I think that with larger block size, the container request to resource manager (map task) will fewer than the default. So, it will decrease the latency of initialize container, and also decrease network latency too.
So, what do you think all?
Thanks
There are a number of things that this impacts. Most obviously, a file will have fewer blocks if the block size is larger. This can potentially make it possible for client to read/write more data without interacting with the Namenode, and it also reduces the metadata size of the Namenode, reducing Namenode load (this can be an important consideration for extremely large file systems).
With fewer blocks, the file may potentially be stored on fewer nodes in total; this can reduce total throughput for parallel access,and make it more difficult for the MapReduce scheduler to schedule data-local tasks.
When using such a file as input for MapReduce (and not constraining the maximum split size to be smaller than the block size), it will reduce the number of tasks which can decrease overhead. But having fewer, longer tasks also means you may not gain maximum parallelism (if there are fewer tasks than your cluster can run simultaneously), increase the chance of stragglers, and if a task fails, more work needs to be redone. Increasing the amount of data processed per task can also cause additional read/write operations (for example, if a map task changes from having only one spill to having multiple and thus needing a merge at the end).
Usually, it depends on the input data. If you want to maximize throughput for a very large input file, using very large blocks (128MB or even 256MB) is best. For smaller files, using a smaller block size is better. Note that you can have files with different block sizes on the same file system by changing the dfs.block.size parameter when the file is written, e.g. when uploading using the command line tools: "hdfs dfs -put localpath dfspath -D dfs.block.size=xxxxxxx"
Source: http://channel9.msdn.com/Forums/TechOff/Impact-of-changing-block-size-in-Hadoop-HDFS
Useful link to read:
Change block size of dfs file
How Mappers get assigned.
The up is right.You couldn't just to determine the goodness and badness of Hadoop system by adjust the blocksize.
But according to my test that used different blocksize in hadoop, the 256M is a good choice.

Different block size Hadoop

What do I need to do to have smaller/larger blocks in Hadoop?
Concretely, I want to have larger number of mappers, that gets smaller piece of data to work on. It seems that I need to decrease the block size, but I'm confused (I'm new to Hadoop) - do I need to do something while putting the file on HDFS, or do I need to specify something related to input split size, or both?
I'm sharing the cluster, so I cannot perform global settings, so need this on a per-job basis, if possible? And I'm running the job from code (later from Oozie, possibly).
What a mapper runs is controlled by the input split, and is completely up to you how you specify it. The HDFS block size has nothing to do with it (other than the fact that most splitters use the block size as a basic 'block' for creating the input splits in order to achieve good data locality). You can write your own splitter that takes an HDFS block and splits in 100 splits, if you so fancy. Aslo look at Change File Split size in Hadoop.
Now that being said, the wisdom of doing that ('many mappers with small splits') is highly questionable. Everybody else is trying to do the opposite (create few mappers with aggregated splits). See Dealing with Hadoop's small files problem, The Small Files Problem, Amazon Elastic MapReduce Deep Dive and Best Practices and so on.
You dont really have to decrease the block size to have more mappers , that would process smaller amount of data.
You dont have to modify the HDFS block size ( dfs.blocksize ), let it be with th default global value as per your cluster configuration.
You may use mapreduce.input.fileinputformat.split.maxsize property in your job configuration with lower value than the block size.
The input splits will be calculated with this value and one mapper will be triggered for every input split calculated.

Pig: Control number of mappers

I can control the number of reducers by using PARALLEL clause in the statements which result in reducers.
I want to control the number of mappers. The data source is already created, and I can not reduce the number of parts in the data source. Is it possible to control the number of maps spawned by my pig statements? Can I keep a lower and upper cap on the number of maps spawned? Is it a good idea to control this?
I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred.tasktracker.map.tasks.maximum etc, but they seem to not help.
Can someone please help me understand how to control the number of maps and possibly share a working example?
There is a simple rule of thumb for number of mappers: There is as many mappers as there are file splits. A file split depends on the size of the block into which you HDFS splits the files (64MB, 128MB, 256MB depending on your configuration), please note that FileInput formats take into account, but can define their own behaviour.
Splits are important, because they are tied to the physical location of the data in the cluster, Hadoop brings code to the data and not data to the code.
The problem arises when the size of the file is less than the size of the block (64MB, 128MB, 256MB), this means there will be as many splits as there are input files, which is not efficient, as each Map Task usually startup time. In this case your best bet is to use pig.maxCombinedSplitSize, as it will try to read multiple small files into one Mapper, in a way ignore splits. But if you make it too large you run a risk of bringing data to the code and will run into network issues. You could have network limitations if you force too few Mappers, as data will have to be streamed from other data nodes. Keep the number close to the block size or half of it and you should be fine.
Other solution might be to merge the small files into one large splitable file, that will automatically generate and efficient number of Mappers.
You can change the property mapred.map.tasks to number you want. THis property contains default map task/job. Instead of setting it globally set the property for your session so default will be restored once your job is done.

How to increase number of mappers in Mahout MatrixMultiplicationJob?

I am using Mahout 0.7's MatrixMultiplicationJob for multiplying a large matrix. But it always uses 1 map task which makes it slow. its probably due to the InputSplit which forces the number of mappers to be 1.
Is there a way I can efficiently multiply matrices in Hadoop / Mahout or change the number of mappers?
Ultimately, it is Hadoop that decides how many mappers to use. Generally it will use one mapper per HDFS block (typically 64 or 128MB). If your data is smaller than that, it's too small to bother with more than 1 mapper.
You can encourage it to use more anyway by setting mapred.max.split.size to something smaller than 64MB (remember the value is set in bytes, not MB). But, are you sure you want to? It is much more common to need more reducers, not mappers, since Hadoop will never use more than 1 unless you (or your job) tells it to.
Also know that Hadoop will not be able to use more than one mapper on a single compressed file. So if your input is one huge compressed file, it will only ever use 1 mapper on that file. You can however split it up yourself into many smaller compressed files.
had you tried to specify number of mappers via command line with -Dmapred.map.tasks=N option? I hadn't tried it, but it should work. If it won't work, then try to set this parameter in the MAHOUT_OPTS environment variable...

how to force hadoop to process more data per map

I have a job which is going very slowly because I think hadoop is creating too many map tasks for the size of the data. I read on some websites that its efficient for fewer maps to process bigger chunks of data -- is there any way to force this? Thanks
Two possibilities:
increase the block size of your 90gb data, setting this to 128m or larger will make your map tasks "work more"
use the CombineFileInputFormat and batch your blocks together to the size you think it is appropriate.
The first solution requires you to rewrite the data to change the block size, the second solution can be embedded in your job.
Many maps indeed can have serious performance impact since overhead of map task starting is from 1 to 3 seconds, depending on your settings and hardware.
The main setting here is JVM reuse (mapred.job.reuse.jvm.num.tasks). Set it to -1 and you will probabbly got serious performance boost.
The usual root cause of this problem is a lot of small files. It is discussed here:
Processing large set of small files with Hadoop
The solutions are around orgenizing them together.
If your files are indeed big, but splittable - you can increase block side, thus reducing number of split and thereof - number of maps
Increase the split size or use CombineFileInputFormat for packing multiple files in a single split thus reducing the number of map tasks required to process the data.

Resources