How to deal with large map output in Hadoop?

I am new to Hadoop and I'm working with a 3-node cluster (each node has 2 GB of RAM).
The input file is small (5 MB), but the map output is very large (about 6 GB).
During the map phase my memory fills up and the tasks run very slowly.
What is the reason for this?
Can anyone help me make my program faster?

Use NLineInputFormat, where N refers to the number of lines of input each mapper will receive. This way more splits are created, forcing smaller input data onto each map task. Otherwise, the entire 5 MB will go into one map task.
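A minimal driver sketch (class name, paths and the line count are placeholders) of how NLineInputFormat might be wired up with the newer mapreduce API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "nline-example");
        job.setJarByClass(NLineDriver.class);

        // Feed each mapper at most 1000 input lines so the small 5 MB file
        // is divided across many map tasks instead of just one.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1000);

        // MyMapper / MyReducer are placeholders for your own classes:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```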

The size of the map output by itself does not cause a memory problem, since the mapper can work in a "streaming" mode: it consumes records, processes them, and writes them to the output. Hadoop stores some amount of data in memory and then spills it to disk.
So your problem is probably caused by one of two things:
a) Your mapper algorithm somehow accumulates data during processing.
b) The cumulative memory given to your mappers exceeds the RAM of the nodes. The OS then starts swapping and your performance can fall by orders of magnitude.
Case b is more likely, since 2 GB is actually very little for a usual Hadoop configuration. If you are going to work with it, I would suggest configuring 1, at most 2, mapper slots per node.
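To illustrate case (a), here is a hedged sketch of a mapper that streams each record straight to the output instead of accumulating results in a field; the per-record expansion is made up for the example:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Streams each record straight to the output; nothing accumulates in memory,
// so a 6 GB map output is spilled to disk by the framework, not held in RAM.
public class StreamingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical per-record expansion: emit one output pair per token.
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            outKey.set(token);
            outValue.set(value);
            context.write(outKey, outValue);
        }
        // Anti-pattern (case a): appending records to a List field here and
        // writing it out in cleanup() would keep the whole 6 GB in the JVM heap.
    }
}
```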

Related

MapReduce shuffle phase bottleneck

I am reading the original MapReduce paper. My understanding is that when working with, say hundreds of GBs of data, the network bandwidth for transferring so much data can be the bottleneck of a MapReduce job. For map tasks, we can reduce network bandwidth by scheduling map tasks on workers that already contain the data for any given split, since reading from local disk does not require network bandwidth.
However, the shuffle phase seems to be a huge bottleneck. A reduce task can potentially receive intermediate key/value pairs from all map tasks, and almost all of these intermediate key/value pairs will be streamed across the network.
When working with hundreds of GBs of data or more, is it necessary to use a combiner to have an efficient MapReduce job?
The Combiner plays an important role if it fits the situation: it acts like a local reducer, so instead of sending all the data, the mapper sends only a few locally aggregated values. But a combiner can't be applied in every case.
If a reduce function is both commutative and associative, then it can be used as a Combiner.
For example, in the case of a median it won't work.
So the Combiner can't be used in every situation.
There are other parameters that can be tuned, for example:
When the map emits output, it does not go directly to disk; it goes into a 100 MB circular buffer which, when 80% full, spills the records to disk. You can increase the buffer size and raise the threshold so that there is less spillage.
If there are many spills, they are merged into a single file, and you can tune the merge (sort) factor as well.
There are threads that copy data from local disk to the reducer JVMs, and their number can be increased.
Compression can be used both at the intermediate level and for the final output.
So the Combiner is not the only solution, and it can't be used in every situation.
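A hedged sketch of where those knobs live, using the MRv2 property names (older releases use io.sort.mb, io.sort.spill.percent, io.sort.factor, mapred.reduce.parallel.copies and mapred.compress.map.output instead); the values below are only illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Bigger sort buffer and higher spill threshold -> fewer spills to disk.
        conf.setInt("mapreduce.task.io.sort.mb", 256);             // default 100 MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);  // default 0.80
        // Merge more spill files per merge round.
        conf.setInt("mapreduce.task.io.sort.factor", 50);          // default 10
        // More parallel copier threads pulling map output over to the reducers.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 20); // default 5
        // Compress intermediate (map) output and the final job output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        return conf;
    }
}
```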
Let's say the mapper is emitting (word, count). If you don't use a combiner and a mapper sees the word abc 100 times, then the reducer has to pull (abc, 1) 100 times. Say the size of a (word, count) pair is 7 bytes. Without a combiner the reducer has to pull 7 * 100 bytes of data, whereas with a combiner the reducer only needs to pull 7 bytes. This example just illustrates how the combiner can reduce network traffic.
Note: this is a rough example just to make the idea simpler to understand.
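The following is a minimal word-count sketch (class names are illustrative) showing where the combiner plugs in; reusing the reducer as the combiner is valid here because integer summation is commutative and associative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);  // without a combiner, a word seen 100 times
            }                          // means 100 (word, 1) pairs cross the network
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            total.set(sum);
            ctx.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // local, map-side aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```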

Shuffle phase lasts too long Hadoop

I have an MR job in which the shuffle phase lasts too long.
At first I thought it was because I'm emitting a lot of data from the Mapper (around 5 GB). Then I fixed that by adding a Combiner, thus emitting less data to the Reducer. After that the shuffle period did not shorten, as I thought it would.
My next idea was to eliminate the Combiner by combining in the Mapper itself. I got that idea from here, where it says that data needs to be serialized/deserialized to use a Combiner. Unfortunately the shuffle phase is still the same.
My only remaining thought is that it could be because I'm using a single Reducer. But this shouldn't be the case, since I'm not emitting a lot of data when using a Combiner or combining in the Mapper.
Here are my stats, all the counters for my Hadoop (YARN) job:
I should also add that this is run on a small cluster of 4 machines. Each has 8 GB of RAM (2 GB reserved) and 12 virtual cores (2 reserved).
These are virtual machines. At first they were all on a single unit, but then I split them 2-2 across two units. So at first they were sharing one HDD; now there are two machines per disk. Between them is a gigabit network.
And here are more stats:
The whole memory is occupied.
The CPU is constantly under pressure while the job runs (the picture shows the CPU for two consecutive runs of the same job).
My question is: why is the shuffle time so long and how can I fix it? I also don't understand why there was no speedup even though I dramatically reduced the amount of data emitted from the Mapper.
A few observations:
For a job of 30 minutes, the GC time is too high (try reusing objects rather than creating a new one on each call to the map()/reduce() method; see the sketch below).
The average map time is far too high at 16 minutes; what are you doing in your map?
YARN memory is at 99%, which means you are running too many services on your HDP cluster and the RAM is not sufficient to support them all.
Please increase the YARN container memory; give it at least 1 GB.
This looks like a GC + overscheduled cluster problem.
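On the GC point, a minimal sketch of what "reusing objects" in map() looks like (the same idea applies to reduce()); the key/value types here are just placeholders:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReuseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Allocated once per task, not once per record: keeps GC time down.
    private final Text outKey = new Text();
    private final IntWritable outValue = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Anti-pattern: calling new Text(...) and new IntWritable(...) on every
        // call creates millions of short-lived objects and inflates GC time.
        outKey.set(value);
        outValue.set(1);
        context.write(outKey, outValue);
    }
}
```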

Can hadoop map/reduce be speeded up by splitting data size?

Can I improve the performance of my Hadoop map/reduce job by splitting the input data into smaller chunks?
First question:
For example, I have a 1 GB input file for the map task. My default block size is 250 MB, so only 4 mappers will be assigned to do the job. If I split the data into 10 pieces of 100 MB each, then I have 10 mappers to do the work. But each split piece will occupy one block in storage, which means 150 MB will be wasted for each split data block. What should I do in this case if I don't want to change the block size of my storage?
Second question: if splitting the input data before the map job can increase the performance of the map phase, and I want to do the same for the reduce job, should I ask the mapper to split the data before giving it to the reducer, or should I let the reducer do it?
Thank you very much. Please correct me if I also misunderstand something. Hadoop is quite new to me. So any help is appreciated.
When you split your input into 100 MB pieces (while keeping the 250 MB block size), the remaining 150 MB of each block is not wasted: an HDFS block only occupies as much disk space as the data it actually holds, so the rest remains available storage for the system.
Increasing the number of mappers does not necessarily increase performance, because it depends on the number of DataNodes you have. For example, 10 DataNodes -> 10 mappers is a good deal, but with 4 DataNodes -> 10 mappers, obviously all the mappers cannot run simultaneously. So if you have 4 data nodes, it is better to have 4 blocks (with a 250 MB block size).
The Reducer is something like a merge of all your mappers' output, and you can't ask the Mapper to split the data. What you can do instead is ask the Mapper to do a mini-reduce by defining a Combiner. A Combiner is nothing but a reducer that runs on the same node where the mapper executed, before the data is sent to the actual reducer. This minimizes the I/O and with it the work of the actual reducer. Introducing a Combiner is a good option for improving performance.
Good luck with Hadoop !!
There can be multiple mappers running in parallel on a node for the same job, based on the number of map slots available on that node. So yes, making smaller pieces of the input should give you more parallel mappers and speed up the process. (How do you input all the pieces as a single input? Put all of them in one directory and add that directory as the input path.)
On the reducer side, if you are OK with combining multiple output files in post-processing, you can set a higher number of reducers; the maximum number of reducers running in parallel is the number of reduce slots available in your cluster. This should improve cluster utilisation and speed up the reduce phase.
If possible, you can also use a combiner to reduce disk and network I/O overhead.
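A hedged driver sketch of the two knobs described above: capping the split size so more mappers run in parallel (without touching the HDFS block size), and raising the reducer count; the numbers and paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitTuningDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-tuning");
        job.setJarByClass(SplitTuningDriver.class);

        // Cap the split size at 100 MB: a 1 GB input then yields ~10 map tasks
        // even though the HDFS block size stays at 250 MB.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileInputFormat.setMaxInputSplitSize(job, 100L * 1024 * 1024);

        // Run several reducers in parallel; their outputs (part-r-00000, ...)
        // can be concatenated or read together afterwards.
        job.setNumReduceTasks(4);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```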

default / finding number of mapper and reducers in hadoop 1.x

Can somebody please help me understand the questions below related to Hadoop 1.x?
Say I have just a single node where I have 8 GB of RAM and 40 TB of hard disk with quad core processor. Block size is 64 MB. We need to process 4 TB of data.
How do we decide the number of Mappers and Reducers?
Can someone please explain in detail? Please let me know if I need to consider any other parameter for calculation.
Say I have 10 Data nodes in a cluster and each node is having 8 GB of RAM and 40 TB of Hard disk with quad core processor. Block size is 64MB. We need to process 40 TB data. How do we decide the number of Mappers and Reducers?
What is the default number for mapper and reducer slots in a Data node with quad core processor?
Many Thanks,
Manish
Number of mappers = number of splits.
The input file is divided into splits, and each split contains a set of records. On average, each split is one block in size (64 MB here), so in your case you would have around 62,500 mappers (or splits) (4 TB / 64 MB). You also have the option of configuring the input split size; this is generally done when you want to read the entire file at once and decide how the records should be processed.
The number of reducers is configurable: you can set it in the job class or on the job-submission command line. At most you can usefully have as many reducers as there are unique keys in the mapper output; by default a hash partitioner distributes the keys across the reducers, and you can also write your own partitioner to decide which keys go to which reducer.
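Since the question is about Hadoop 1.x, here is a hedged sketch using the old mapred API; note that the map-task count is only a hint (the real number comes from the splits), while the reduce-task count is binding:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class OldApiDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OldApiDriver.class);
        conf.setJobName("hadoop1x-example");

        // Hint only: the framework still derives the real count from the splits
        // (roughly input size / split size, i.e. the ~62,500 estimated above).
        conf.setNumMapTasks(62500);

        // This one is binding: this many reduce tasks actually run, and the
        // partitioner decides which keys each reducer receives.
        conf.setNumReduceTasks(8);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```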

What if I set the HDFS block size to 1 GB?

I want to ask: what if I set the HDFS block size to 1 GB and upload a file that is almost 1 GB in size? Would MapReduce process it faster? I think that with a larger block size there will be fewer container requests to the ResourceManager (fewer map tasks) than with the default, so it will decrease the latency of container initialization and also decrease network latency.
So, what do you all think?
Thanks
There are a number of things that this impacts. Most obviously, a file will have fewer blocks if the block size is larger. This can potentially make it possible for a client to read/write more data without interacting with the NameNode, and it also reduces the metadata size of the NameNode, reducing NameNode load (this can be an important consideration for extremely large file systems).
With fewer blocks, the file may potentially be stored on fewer nodes in total; this can reduce total throughput for parallel access and make it more difficult for the MapReduce scheduler to schedule data-local tasks.
When using such a file as input for MapReduce (and not constraining the maximum split size to be smaller than the block size), it will reduce the number of tasks, which can decrease overhead. But having fewer, longer tasks also means you may not gain maximum parallelism (if there are fewer tasks than your cluster can run simultaneously), increases the chance of stragglers, and means more work has to be redone if a task fails. Increasing the amount of data processed per task can also cause additional read/write operations (for example, if a map task changes from having only one spill to having multiple spills and thus needing a merge at the end).
Usually, it depends on the input data. If you want to maximize throughput for a very large input file, using very large blocks (128 MB or even 256 MB) is best. For smaller files, using a smaller block size is better. Note that you can have files with different block sizes on the same file system by changing the dfs.block.size parameter when the file is written, e.g. when uploading using the command line tools: "hdfs dfs -D dfs.block.size=xxxxxxx -put localpath dfspath" (the generic -D option goes before the command's own arguments).
Source: http://channel9.msdn.com/Forums/TechOff/Impact-of-changing-block-size-in-Hadoop-HDFS
Useful link to read:
Change block size of dfs file
How Mappers get assigned.
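For completeness, a hedged sketch of doing the same thing programmatically, i.e. choosing a per-file block size when the file is written; the 1 GB value is just the example from the question, and the paths are placeholders:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class UploadWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 1024L * 1024 * 1024;   // 1 GB, just for this file
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = (short) conf.getInt("dfs.replication", 3);

        // args[0] = local source file, args[1] = HDFS destination path.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             FSDataOutputStream out = fs.create(new Path(args[1]),
                     true, bufferSize, replication, blockSize)) {
            IOUtils.copyBytes(in, out, conf, false);
        }
    }
}
```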
The answer above is right. You can't judge the goodness or badness of a Hadoop system just by adjusting the block size.
But according to my tests with different block sizes in Hadoop, 256 MB is a good choice.
