I want to ask: what if I set the HDFS block size to 1 GB and upload a file that is almost 1 GB in size? Would MapReduce processing become faster? I think that with a larger block size there will be fewer container requests to the ResourceManager (map tasks) than with the default, so it should reduce the latency of initializing containers and also reduce network latency.
So, what do you all think?
Thanks
There are a number of things that this impacts. Most obviously, a file will have fewer blocks if the block size is larger. This can potentially make it possible for a client to read/write more data without interacting with the Namenode, and it also reduces the metadata size of the Namenode, reducing Namenode load (this can be an important consideration for extremely large file systems).
With fewer blocks, the file may potentially be stored on fewer nodes in total; this can reduce total throughput for parallel access and make it more difficult for the MapReduce scheduler to schedule data-local tasks.
When using such a file as input for MapReduce (and not constraining the maximum split size to be smaller than the block size), it will reduce the number of tasks, which can decrease overhead. But having fewer, longer tasks also means you may not gain maximum parallelism (if there are fewer tasks than your cluster can run simultaneously), it increases the chance of stragglers, and if a task fails, more work needs to be redone. Increasing the amount of data processed per task can also cause additional read/write operations (for example, if a map task changes from having only one spill to having multiple and thus needing a merge at the end).
Usually, it depends on the input data. If you want to maximize throughput for a very large input file, using very large blocks (128 MB or even 256 MB) is best. For smaller files, using a smaller block size is better. Note that you can have files with different block sizes on the same file system by changing the dfs.block.size parameter when the file is written, e.g. when uploading using the command line tools: "hdfs dfs -D dfs.block.size=xxxxxxx -put localpath dfspath"
Source: http://channel9.msdn.com/Forums/TechOff/Impact-of-changing-block-size-in-Hadoop-HDFS
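For completeness, here is a minimal sketch (not part of the quoted answer) of how the same per-file block size can be set through the Java FileSystem API. The path, the payload and the 256 MB value are only illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/data/big-input.txt");     // illustrative path
            long blockSize = 256L * 1024 * 1024;             // 256 MB block size for this file only
            int bufferSize = 4096;
            short replication = fs.getDefaultReplication(path); // keep the cluster default

            // The create() overload with an explicit blockSize writes this one file
            // with a block size different from the cluster default.
            try (FSDataOutputStream out =
                    fs.create(path, true, bufferSize, replication, blockSize)) {
                out.writeBytes("payload...\n");              // write your data here
            }
            fs.close();
        }
    }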
Useful links to read:
Change block size of dfs file
How Mappers get assigned.
The answer above is right. You can't judge whether a Hadoop system is good or bad just by adjusting the block size.
But according to my tests with different block sizes in Hadoop, 256 MB is a good choice.
I know the default block size is 64 MB and the split is 64 MB.
Then for files smaller than 64 MB, when the number of nodes increases from 1 to 6, there will still be only one node working on the single split, so the speed will not improve. Is that right?
If it is a 128 MB file, there will be 2 nodes working on the 2 splits, so it is faster than 1 node, but with more nodes (3 or more) the speed doesn't increase further. Is that right?
I don't know if my understanding is correct. Thanks for any comment!
Here is the answer to your query.
I know the default block size is 64M,
In Hadoop version 1.0 the default size is 64 MB and in version 2.0 the default size is 128 MB. The default block size can be overridden by setting a value for the parameter dfs.block.size (dfs.blocksize in newer versions) in the configuration file hdfs-site.xml.
split is 64M,
Not necessarily, as block size is not the same as split size. Read this post for more clarity. For a normal WordCount example program, we can safely assume that the split size is approximately the same as the block size.
then for files less than 64M , when the number of nodes increase from 1 to 6 , there will be only one node to do with the split, so the speed will not improve? Is that right?
Yes, you are right. If the file size is actually less than the block size, then it would be processed by one node, and increasing the number of nodes from 1 to 6 may not affect the execution speed. However, you must consider the case of speculative execution. In the case of speculative execution, even a smaller file may be processed by 2 nodes simultaneously and hence improve the speed of execution.
From the Yahoo Dev KB link, speculative execution is explained as below:
Speculative execution:
One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively, using the old API; with the newer API you may consider changing mapreduce.map.speculative and mapreduce.reduce.speculative.
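As a minimal sketch (not from the original answer) of how the newer-API properties might be set when configuring a job in Java; the job name is just a placeholder and the mapper/reducer/input/output setup is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class NoSpeculationJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Newer-API property names for disabling speculative execution
            conf.setBoolean("mapreduce.map.speculative", false);
            conf.setBoolean("mapreduce.reduce.speculative", false);

            Job job = Job.getInstance(conf, "no-speculation-example");
            // ... configure mapper, reducer, input and output paths as usual, then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }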
You're assuming a large file is splittable to begin with, which isn't always the case.
If your files are ever smaller than a block, adding more nodes will not improve processing speed; it will only help with replication and total cluster capacity.
Otherwise, your understanding seems correct, though I think the latest default is actually 128 MB, not 64 MB.
Can I increase the performance time of my hadoop map/reduce job by splitting the input data into smaller chunks?
First question:
For example, I have a 1 GB input file for the mapping task. My default block size is 250 MB, so only 4 mappers will be assigned to do the job. If I split the data into 10 pieces, each piece will be 100 MB, and then I have 10 mappers to do the work. But then each split piece will occupy 1 block in storage, which means 150 MB will be wasted for each split data block. What should I do in this case if I don't want to change the block size of my storage?
Second question: If I split the input data before the mapping job, it can increase the performance of the mapping job. So if I want to do the same for the reducing job, should I ask the mapper to split the data before giving it to the reducer, or should I let the reducer do it?
Thank you very much. Please correct me if I have misunderstood something. Hadoop is quite new to me, so any help is appreciated.
When your pieces are 100 MB inside a 250 MB block, the remaining 150 MB is not wasted. An HDFS block only occupies as much disk space as the data it actually contains, so that space is still available storage for the system.
If the number of mappers is increased, it does not mean that performance will definitely increase, because it depends on the number of DataNodes you have. For example, if you have 10 DataNodes -> 10 mappers, that is a good deal. But if you have 4 DataNodes -> 10 mappers, obviously not all mappers can run simultaneously. So if you have 4 DataNodes, it is better to have 4 blocks (with a 250 MB block size).
The Reducer is something like a merge of all your mappers' output, and you can't ask the Mapper to split the data. Instead, you can ask the Mapper to do a mini-reduce by defining a Combiner. A Combiner is nothing but a reducer that runs on the same node where the mapper was executed, before the data is sent to the actual reducer. So the I/O will be minimized, and so will the work of the actual reducer. Introducing a Combiner is a good option to improve performance.
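To illustrate, here is a sketch of wiring a Combiner into a WordCount-style job; it is essentially the standard WordCount example with job.setCombinerClass() added, and the class names are only illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountWithCombiner {

        // Emits (word, 1) for every token in the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Sums counts; usable both as the combiner (local mini-reduce) and as the reducer.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
            job.setJarByClass(WordCountWithCombiner.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // mini-reduce on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }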
Good luck with Hadoop !!
There can be multiple parallel mappers running on a node for the same job, based on the number of map slots available on the node. So yes, making smaller pieces of the input should give you more parallel mappers and speed up the process. (How to input all the pieces as a single input? Put all of them in one directory and add that as the input path.)
On the reducer side, if you are OK with combining multiple output files in post-processing, you can set a higher number of reducers; the maximum number of reducers running in parallel is the number of reduce slots available in your cluster. This should improve cluster utilisation and speed up the reduce phase.
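As a sketch of that reducer-count point, the number of reduce tasks can be requested on the job like this; the value 8 is purely illustrative and the actual parallelism is still bounded by the reduce slots/containers available, and the rest of the job setup is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MoreReducersExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "more-reducers-example");
            // Ask for 8 reduce tasks; their outputs (part-r-00000 ... part-r-00007)
            // can be combined in post-processing if a single file is needed.
            job.setNumReduceTasks(8);
            // ... configure mapper, reducer, input and output paths as usual, then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }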
If possible, you may also use a combiner to reduce disk and network I/O overhead.
If we can change the data block size in Hadoop, please let me know how to do that.
Is it advantageous to change the block size? If yes, then let me know why and how; if not, then let me know why not.
You can change the block size any time unless the dfs.blocksize parameter is defined as final in hdfs-site.xml.
To change the block size:
while running a hadoop fs command you can run hadoop fs -Ddfs.blocksize=67108864 -put <local_file> <hdfs_path>. This command will save the file with a 64 MB block size
while running a hadoop jar command - hadoop jar <jar_file> <class> -Ddfs.blocksize=<desired_block_size> <other_args>. The reducer will use the defined block size while storing the output in HDFS
as part of the MapReduce program, you can set the value on the job's Configuration (see the sketch after this list)
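A minimal sketch of that third option; the 256 MB value is only an example and assumes dfs.blocksize is not marked final on the cluster, and the mapper/reducer/path setup is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class OutputBlockSizeJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Output files written by this job will use a 256 MB HDFS block size.
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

            Job job = Job.getInstance(conf, "custom-output-blocksize");
            // ... configure mapper, reducer, input and output paths as usual, then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }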
Criteria for changing block size:
Typically 128 MB for uncompressed files works well
You can consider reducing the block size for compressed files. If the compression ratio is very high, then having a larger block size might slow down the processing. If the compression codec is not splittable, it will aggravate the issue.
As long as the file size is more than the block size, you need not change the block size. If the number of mappers needed to process the data is very high, you can reduce the number of mappers by increasing the split size, as sketched below. For example, if you have 1 TB of data with a 128 MB block size, then by default it will take about 8000 mappers. Instead of changing the block size you can consider changing the split size to 512 MB or even 1 GB, and it will take far fewer mappers to process the data.
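Here is a sketch of that split-size approach (class name is made up, mapper/reducer setup omitted); raising the minimum split size on the job makes each mapper read several blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class BiggerSplitsJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "bigger-splits-example");
            // With 128 MB blocks, raising the minimum split size to 512 MB makes each
            // mapper read roughly four blocks, cutting ~8000 mappers down to ~2000
            // for 1 TB of input (numbers follow the example above).
            FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
            // ... configure mapper, reducer, input and output paths as usual, then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that reading several blocks per mapper can cost some data locality, as discussed further below.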
I have covered most of this in parts 2 and 3 of this performance tuning playlist.
There seems to be much confusion about this topic, and also wrong advice going around. To lift the confusion it helps to think about how HDFS is actually implemented:
HDFS is an abstraction over distributed disk-based file systems. So the words "block" and "blocksize" have a different meaning than generally understood. For HDFS a "file" is just a collection of blocks, and each "block" in turn is stored as an actual file on a datanode. In fact the same file is stored on several datanodes, according to the replication factor. The blocksize of these individual files and their other performance characteristics in turn depend on the underlying filesystems of the individual datanodes.
The mapping between an HDFS file and the individual files on the datanodes is maintained by the namenode. But the namenode doesn't expect a specific blocksize, it just stores the mappings which were created during the creation of the HDFS file, which is usually split according to the default dfs.blocksize (but can be individually overwritten).
This means for example if you have a 1 MB file with a replication of 3 and a blocksize of 64 MB, you don't lose 63 MB * 3 = 189 MB, since physically just three 1 MB files are stored with the standard blocksize of the underlying filesystems (e.g. ext4).
So the question becomes what a good dfs.blocksize is and if it's advisable to change it.
Let me first list the aspects speaking for a bigger blocksize:
Namenode pressure: As mentioned, the namenode has to maintain the mappings between dfs files and their blocks to physical files on datanodes. So the fewer blocks per file, the less memory pressure and communication overhead it has
Disk throughput: Files are written by a single process in hadoop, which usually results in data written sequentially to disk. This is especially advantageous for rotational disks because it avoids costly seeks. If the data is written that way, it can also be read that way, so it becomes an advantage for reads and writes. In fact this optimization in combination with data locality (i.e. do the processing where the data is) is one of the main ideas of mapreduce.
Network throughput: Data locality is the more important optimization, but in a distributed system this cannot always be achieved, so sometimes it's necessary to copy data between nodes. Normally one file (dfs block) is transferred via one persistent TCP connection, which can reach a higher throughput when big files are transferred.
Bigger default splits: even though the splitsize can be configured on job level, most people don't consider this and just go with the default, which is usually the blocksize. If your splitsize is too small though, you can end up with too many mappers which don't have much work to do, which in turn can lead to even smaller output files, unnecessary overhead and many occupied containers which can starve other jobs. This also has an adverse effect on the reduce phase, since the results must be fetched from all mappers.
Of course the ideal splitsize heavily depends on the kind of work you have to do. But you can always set a lower splitsize when necessary, whereas when you set a higher splitsize than the blocksize you might lose some data locality.
The latter aspect is less of an issue than one would think though, because the rule for block placement in HDFS is: the first block is written on the datanode where the process creating the file runs, the second one on another node in the same rack and the third one on a node on another rack. So usually one replica for each block of a file can be found on a single datanode, so data locality can still be achieved even when one mapper is reading several blocks due to a splitsize which is a multiple of the blocksize. Still in this case the mapred framework can only select one node instead of the usual three to achieve data locality so an effect can't be denied.
But ultimately this point for a bigger blocksize is probably the weakest of all, since one can set the splitsize independently if necessary.
But there also have to be arguments for a smaller blocksize otherwise we should just set it to infinity…
Parallelism/Distribution: If your input data lies on just a few nodes, even a big cluster doesn't help to achieve parallel processing, at least if you want to maintain some data locality. As a rule I would say a good blocksize should match what you can also accept as a splitsize for your default workload.
Fault tolerance and latency: If a network connection breaks, the cost of retransmitting a smaller file is lower. TCP throughput might be important but individual connections shouldn't take forever either.
Weighting these factors against each other depends on your kind of data, cluster, workload etc. But in general I think the default blocksize 128 MB is already a little low for typical usecases. 512 MB or even 1 GB might be worth considering.
But before you even dig into that, you should first check the size of your input files. If most of your files are small and don't even reach the default blocksize, your blocksize is effectively always the filesize, and it wouldn't help anything to increase the default blocksize. There are workarounds like using an input combiner to avoid spawning too many mappers (see the sketch below), but ultimately you need to ensure your input files are big enough to take advantage of a big blocksize.
And if your files are already small, don't compound the problem by making the blocksize even smaller.
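As a sketch of the input-combining workaround mentioned above (class name and the 128 MB limit are only illustrative, mapper/reducer setup omitted), CombineTextInputFormat can pack many small files into fewer splits:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFilesJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            // Pack many small files into splits of at most 128 MB each, so one mapper
            // handles many small files instead of one tiny file each.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // ... configure mapper, reducer and output types as usual, then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }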
It depends on the input data. The number of mappers is directly proportional to the number of input splits, which depend on the DFS block size.
If you want to maximize throughput for a very large input file, using very large blocks (128MB or even 256MB) is best.
If a job has more than 1TB of input, consider increasing the block size of the input dataset to 256M or even 512M so that the number of tasks will be smaller.
For smaller files, using a smaller block size is better.
Have a look at this article
If you have small files that are smaller than the DFS block size, you can use some alternatives like HAR files or SequenceFiles.
Have a look at this cloudera blog
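For the SequenceFile alternative, here is a minimal sketch of packing small files as key/value pairs (file name as key, content as value); the output path and content are made up, and a real job would loop over all the small files.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path out = new Path("/data/smallfiles.seq");   // illustrative output path

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                // Key = original file name, value = file content.
                byte[] content = "example content".getBytes(StandardCharsets.UTF_8);
                writer.append(new Text("small-file-001.txt"), new BytesWritable(content));
            }
        }
    }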
I've tried combining the files I upload into HDFS into one file. So HDFS now holds fewer files than before, but with the same total size. In this condition I get faster MapReduce times, because I think the process creates fewer containers (map tasks or reduce tasks).
So I want to ask: how can I set the block size properly to get faster MapReduce? Should I set it bigger than the default (to minimize the number of containers)?
Thanks a lot....
Do you know why Hadoop has such strong and fast compute capacity? Because it divides a single big job into many small tasks. That's the spirit of Hadoop.
And there are many mechanisms coordinating its workflow; adjusting the blocksize alone may not attain your target.
You can set the parameter "dfs.block.size" in bytes to adjust the blocksize.
In WordCount, it appears that you can get more than 1 map task per block, with speculative execution off.
Does the jobtracker do some magic under the hood to distribute tasks more than provided by the InputSplits?
Blocks and splits are 2 different things. You might get more than one mapper for one block if that block has more than one split.
The answer to this lies in the way that Hadoop InputFormats work:
In HDFS:
Let's take an example where the blocks are of size 1 MB, an input file to HDFS is of size 10 MB, and the minimum split size is > 1 MB.
1) First, a file is added to HDFS.
2) The file is split into 10 blocks, each of size 1 MB.
3) Then, each 1 MB block is read by the input splitter.
4) Since each 1 MB block is smaller than the MIN_SPLIT_SIZE, the input format does not create extra splits within a block; instead, splits are sized up to the minimum split size.
The moral of the story: FileInputFormat does not blindly use one split per block; it recomputes splits from the configured minimum and maximum split sizes, and it will do extra splitting within a block for you when the maximum split size is set below the block size.
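To make the rule concrete: the split size FileInputFormat uses follows the formula max(minSize, min(maxSize, blockSize)), which mirrors its internal computeSplitSize() helper. The little demo below is only an illustration (the 2 MB minimum matches the "> 1 MB" example above, the second case is the opposite scenario), not Hadoop code itself.

    public class SplitSizeDemo {
        // Same formula FileInputFormat uses to size splits.
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024;
            // Example above: 1 MB blocks, minimum split size 2 MB -> 2 MB splits,
            // i.e. splits larger than blocks, never more than one split per block.
            System.out.println(computeSplitSize(1 * mb, 2 * mb, Long.MAX_VALUE));
            // Opposite case: 128 MB blocks, maximum split size 32 MB -> 32 MB splits,
            // i.e. four splits (and up to four mappers) per block.
            System.out.println(computeSplitSize(128 * mb, 1, 32 * mb));
        }
    }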
I guess I totally forgot about this, but looking back, this has been a feature in Hadoop since the beginning. The ability of an input format to arbitrarily split blocks up at runtime is used by many ecosystem tools to distribute load in an application-specific way.
The part that is tricky here is the fact that in toy MapReduce jobs one would expect one block per split in all cases, while on real clusters we overlook the default split size parameters, which don't come into play unless you are using large files.