I have a cluster with 8 nodes and I am parsing a 20GB text file with MapReduce. The goal is for the mapper to read every line and emit it with a key taken from one of the columns of that row. When the reducer receives it, the record is written to a different directory based on the key value. For example:
input file:
test;1234;A;24;49;100
test2;222;B;29;22;22
test2;0099;C;29;22;22
So these rows will be written like this:
/output/A-r-0001
/output/B-r-0001
/output/C-r-0001
I am using the MultipleOutputs object in the reducer, and with a small file everything works fine. But with the 20GB file, 152 mappers and 8 reducers are initialized. Everything finishes really fast on the mapper side, but one reducer keeps running. 7 of the reducers finish in at most 18 minutes, but the last one takes 3 hours.
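For reference, here is a minimal sketch of the mapper/reducer pattern described above (not the actual job code; the class names, the choice of the third semicolon-separated column as the key, and the MultipleOutputs wiring are assumptions):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Mapper: key each line by one of its columns (here the third column, e.g. "A", "B", "C").
class ColumnKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split(";");
        if (cols.length < 3) {
            return; // guard against malformed rows
        }
        outKey.set(cols[2]);
        context.write(outKey, line);
    }
}

// Reducer: write each record to a file named after its key, e.g. /output/A-r-00000.
class KeyedOutputReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // the third argument is the base output path relative to the job's output dir
            mos.write(NullWritable.get(), value, key.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```

In the driver, LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) is often paired with this pattern so that empty default part files are not created.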
At first I suspected that the input of that reducer was bigger than the rest, but that is not the case. One reducer has three times more input than the slow one and finishes in 17 minutes.
I've also tried increasing the number of reducers to 14, but that just resulted in 2 more slow reduce tasks.
I've checked lots of documentation and could not figure out why this is happening. Could you guys help me with it?
EDITED
The problem was due to some corrupt data in my dataset. I've put some strict checks on the input data on the mapper side and it is working fine now.
Thanks guys.
I've seen that happen often when dealing with skewed data, so my best guess is that your dataset is skewed: the mapper emits lots of records with the same key, they all go to the same reducer, and that reducer gets overloaded because it has so many values to work through.
There is no easy solution for this and it really depends on the business logic of your job. You could maybe add a check in your reducer and, if a key has more than N values, ignore everything after N.
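A minimal sketch of that kind of cap, assuming Text keys and values and a made-up MAX_VALUES threshold (only acceptable if your business logic can tolerate dropping the excess records):

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Stop processing a key once it has produced more than MAX_VALUES values.
public class CappedReducer extends Reducer<Text, Text, Text, Text> {

    private static final long MAX_VALUES = 1000000L; // arbitrary cap, tune per job

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long seen = 0;
        for (Text value : values) {
            if (++seen > MAX_VALUES) {
                break; // ignore everything after the first N values for this key
            }
            context.write(key, value);
        }
    }
}
```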
I've also found some documentation about SkewReduce, which is supposed to make it easier to manage skewed data in a Hadoop environment, as described in their paper, but I haven't tried it myself.
Thanks for the explanation. I knew that my dataset does not have evenly distributed key-value pairs. Below are the counters from one of my tests, which used 14 reducers and 152 mappers.
Task which finished in 17 minutes 27 seconds:
FileSystemCounters
FILE_BYTES_READ 10,023,450,978
FILE_BYTES_WRITTEN 10,023,501,262
HDFS_BYTES_WRITTEN 6,771,300,416
Map-Reduce Framework
Reduce input groups 5
Combine output records 0
Reduce shuffle bytes 6,927,570,032
Reduce output records 0
Spilled Records 28,749,620
Combine input records 0
Reduce input records 19,936,319
Task which finished in 14 hours 17 minutes 54 seconds:
FileSystemCounters
FILE_BYTES_READ 2,880,550,534
FILE_BYTES_WRITTEN 2,880,600,816
HDFS_BYTES_WRITTEN 2,806,219,222
Map-Reduce Framework
Reduce input groups 5
Combine output records 0
Reduce shuffle bytes 2,870,910,074
Reduce output records 0
Spilled Records 8,259,030
Combine input records 0
Reduce input records 8,259,030
The one which takes so much time actually has fewer records to go through.
In addition, after some time the same tasks are launched again on different nodes. I am guessing Hadoop thinks the task is slow and starts another one (speculative execution), but it does not help at all.
Here are the counters from the slow-running reducer and the fast-running reducer.
task_201403261540_0006_r_000019 is running very slowly and task_201403261540_0006_r_000000 completed very fast.
It is very clear that one of my reducers is getting a huge number of keys.
We need to optimize our custom partitioner.
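For what it's worth, here is a minimal sketch of what a skew-aware custom partitioner could look like (the hot-key value and bucket count are placeholders; spreading one key across several reducers is only safe when the reduce logic does not need all of that key's values together, as in the per-key file output described at the top of this thread):

```java
import java.util.Random;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Send a known hot key to several reducers instead of just one.
public class SkewAwarePartitioner extends Partitioner<Text, Text> {

    private static final String HOT_KEY = "C";    // placeholder: the key you observed to be huge
    private static final int HOT_KEY_BUCKETS = 4; // placeholder: how many reducers share the hot key
    private final Random random = new Random();

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        if (HOT_KEY.equals(key.toString())) {
            // spread the hot key's records over a few reducers
            return random.nextInt(Math.min(HOT_KEY_BUCKETS, numPartitions));
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered in the driver with job.setPartitionerClass(SkewAwarePartitioner.class).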
Related
Can I increase the performance of my Hadoop map/reduce job by splitting the input data into smaller chunks?
First question:
For example, I have a 1GB input file for the map task. My default block size is 250MB, so only 4 mappers will be assigned to the job. If I split the data into 10 pieces, each piece will be 100MB and I will have 10 mappers to do the work. But then each piece will occupy 1 block in storage, which means 150MB will be wasted for each data block. What should I do in this case if I don't want to change the block size of my storage?
Second question: if splitting the input data before the map job can increase its performance, and I want to do the same for the reduce job, should I ask the mapper to split the data before giving it to the reducer, or should I let the reducer do it?
Thank you very much. Please correct me if I have misunderstood something. Hadoop is quite new to me, so any help is appreciated.
When you split your data into 100 MB pieces, the remaining 150 MB of each block is not wasted. HDFS blocks only occupy as much disk as the data they hold, so that space is still available storage for the system.
Increasing the number of mappers does not necessarily increase performance, because it depends on the number of datanodes you have. For example, 10 datanodes -> 10 mappers is a good deal, but with 4 datanodes -> 10 mappers, obviously all mappers cannot run simultaneously. So if you have 4 datanodes, it is better to have 4 blocks (with a 250MB block size).
The reducer is something like a merge of all your mappers' output, and you can't ask the mapper to split the data. Instead, you can ask the mapper to do a mini-reduce by defining a Combiner. A Combiner is nothing but a reducer that runs on the same node where the mapper executed, before the data is sent to the actual reducer. So the I/O is minimized, and so is the work of the actual reducer. Introducing a Combiner is a good option to improve performance.
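A minimal driver sketch showing where the Combiner is wired in, assuming a word-count-style job whose reduce function is associative and commutative (TokenMapper and SumReducer are hypothetical class names, not classes shipped with Hadoop):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(CombinerDriver.class);
        job.setMapperClass(TokenMapper.class);   // hypothetical mapper emitting (word, 1)
        job.setCombinerClass(SumReducer.class);  // mini-reduce run map-side, on the mapper's node
        job.setReducerClass(SumReducer.class);   // the same class reused as the final reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```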
Good luck with Hadoop !!
There can be multiple mappers running in parallel on a node for the same job, based on the number of map slots available on the node. So yes, making smaller pieces of the input should give you more parallel mappers and speed up the process. (How to feed all the pieces as a single input? Put all of them in one directory and add that directory as the input path.)
On the reducer side, if you are OK with combining multiple output files in post-processing, you can set a higher number of reducers; the maximum number of reducers running in parallel is the number of reduce slots available in your cluster. This should improve cluster utilisation and speed up the reduce phase.
If possible, you may also use a combiner to reduce disk and network I/O overhead.
So I am using MultipleOutputs from the package org.apache.hadoop.mapreduce.lib.output.
I have a reducer that is doing a join of 2 data sources and emitting 3 different outputs.
55 reduce tasks were invoked and on average each of them took about 6 minutes to emit data. There were outliers that took about 11 minutes.
I observed that if I comment out the parts where the actual output happens, i.e. the calls to mos.write() (multiple outputs), then the average time drops to seconds and the whole job completes in about 2 minutes.
I do have a lot of data to emit (approximately 40-50 GB).
What can I do to speed things up a bit, with and without considering compression?
Details: I am using TextOutputFormat and giving an HDFS path/URI.
Further clarifications:
The input data to my reducers is small, however the reducers are doing a reduce-side join and hence emit a large amount of data. Since an outlier reducer already takes about 11 minutes, reducing the number of reducers would increase this time, increase the overall time of my job, and not solve my problem.
Input to the reducer comes from 2 mappers.
Mapper 1 -> Emits about 10,000 records. (Key Id)
Mapper 2 -> Emits about 15M records. (Key Id, Key Id2, Key Id3)
In reducer I get everything belonging to Key Id, sorted by Key Id, KeyId2 and KeyId3.
So I know I have an iterator that looks like:
Mapper1 output and then Mapper2 output.
Here I store Mapper1's output in an ArrayList and start streaming Mapper2's output.
For every Mapper2 record, I do a mos.write(....).
I conditionally store a part of this record in memory (in a HashSet).
Every time KeyId2 changes, I do an extra mos.write(...).
In the close method of my reducer, I emit whatever I stored in the conditional step, so that is the third mos.write(...). (The overall structure is roughly what is sketched below.)
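A hedged sketch of that reducer structure (the record tagging, the position of KeyId2, the keep condition, the simplistic join, and the output names output1/output2/output3 are placeholders rather than the actual code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class JoinReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;
    private final Set<String> conditional = new HashSet<String>(); // kept across keys, flushed in cleanup()
    private final Text out = new Text();                           // reused output Writable

    @Override
    protected void setup(Context ctx) {
        mos = new MultipleOutputs<NullWritable, Text>(ctx);
    }

    @Override
    protected void reduce(Text keyId, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        List<String> mapper1Rows = new ArrayList<String>(); // the small side (~10K records in total)
        String lastKeyId2 = null;

        for (Text v : values) {
            String row = v.toString();
            if (row.startsWith("M1|")) {            // assumed tag marking Mapper 1 records
                mapper1Rows.add(row);
                continue;
            }
            // streaming Mapper 2 records: first mos.write per joined record
            // (simplistic join: pair with the first buffered Mapper 1 row; real logic is job-specific)
            out.set(row + "\t" + (mapper1Rows.isEmpty() ? "" : mapper1Rows.get(0)));
            mos.write(NullWritable.get(), out, "output1");

            String keyId2 = row.split("\\|")[1];    // assumed position of KeyId2 in the record
            if (lastKeyId2 != null && !keyId2.equals(lastKeyId2)) {
                out.set(lastKeyId2);
                mos.write(NullWritable.get(), out, "output2"); // second mos.write on KeyId2 change
            }
            lastKeyId2 = keyId2;

            if (row.endsWith("|KEEP")) {            // assumed condition for storing part of the record
                conditional.add(row);
            }
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        for (String row : conditional) {            // third mos.write, at close
            out.set(row);
            mos.write(NullWritable.get(), out, "output3");
        }
        mos.close();
    }
}
```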
I have gone through the article http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
Here is where I stand on the tips mentioned there:
Tip 1: Configuring my cluster correctly is beyond my control.
Tip 2: Use LZO compression - or compression in general. Something I am trying alongside (see the sketch after this list).
Tip 3: Tune the number of mappers and reducers - My mappers finish really fast (in the order of seconds), probably because they are almost identity mappers. Reducers take some time, as mentioned above (this is the time I'm trying to reduce), so increasing the number of reducers will probably help me - but then there will be resource contention and some reducers will have to wait. This is more of a trial-and-error experiment for me.
Tip 4: Write a combiner. Does not apply to my case (reduce-side joins).
Tip 5: Use the most appropriate Writable - I need to use Text for now. All 3 outputs go into directories that have a Hive schema sitting on top of them. Later, when I figure out how to emit Parquet files from multiple outputs, I might change this and the tables' storage format.
Tip 6: Reuse Writables. Okay, this is something I have not considered so far, but I still believe it is the disk writes that take the time, not the processing or the Java heap. Anyway, I'll give it a shot.
Tip 7: Use "poor man's" profiling. I have kind of already done that, and figured out that it is actually the mos.write steps that take most of the time.
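Regarding Tip 2, here is a minimal sketch of turning on compressed job output; MultipleOutputs generally picks this up as well because it reuses the job's output format configuration. The codec choice is an assumption, and native library availability depends on the cluster:

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputConfig {
    public static void configure(Job job) {
        // compress the final (reduce-side) output files, trading CPU for less HDFS write volume
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}
```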
DO:
1. Reduce the number of reducers.
An optimized reducer count (for a general ETL operation) is around 1GB of data per reducer.
Here, your data size in GB is itself less than the number of reducers.
2. Code optimization can be done. Share the code, or else refer to http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ to optimize it.
3. If this does not help, then understand your data: it might be skewed. If you are not sure what skewed data looks like, the skewed join in Pig will help.
I am reading Hadoop: The Definitive Guide, 3rd edition, by Tom White. It is an excellent resource for understanding the internals of Hadoop, especially MapReduce, which I am interested in.
From the book (page 205):
Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort—and transfers the map outputs to the reducers as inputs—is known as the shuffle.
What I infer from this is that before keys are sent to the reducer, they are sorted, indicating that the output of the map phase of the job is sorted. Please note: I don't call it the mapper, since a map phase includes both the mapper (written by the programmer) and the built-in sort mechanism of the MR framework.
The Map Side
Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default, a size which can be tuned by changing the io.sort.mb property. When the contents of the buffer reaches a certain threshold size (io.sort.spill.percent, default 0.80, or 80%), a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete.
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
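As a concrete illustration of the knobs quoted above, a small configuration sketch using the MRv1-era property names from the book (newer releases rename them to mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent; the values shown are just the quoted defaults, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
    public static Job newJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 100);                // size of the map-side circular buffer, in MB
        conf.setFloat("io.sort.spill.percent", 0.80f); // buffer fill level that triggers a spill to disk
        return Job.getInstance(conf, "spill-tuning-example");
    }
}
```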
My understanding of the quoted passage is that as the mapper produces key-value pairs, they are partitioned and sorted. A hypothetical example:
Consider mapper-1 for a word-count program:
mapper-1 contents
partition-1
xxxx: 2
yyyy: 3
partition-2
aaaa: 15
zzzz: 11
(Note that within each partition the data is sorted by key, but it is not necessary that partition-1's data and partition-2's data follow a sequential order.)
Continuing reading the chapter:
Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.
My understanding here is (note the bolded phrase in the paragraph above, which is what tricked me):
Within a map task, several files may be spilled to disk, but they are merged into a single file which is still partitioned and sorted. Consider the same example as above:
Before a single map-task is finished, its intermediate data could be:
mapper-1 contents

spill 1:
partition-1
xxxx: 2
yyyy: 3
partition-2
aaaa: 15
zzzz: 10

spill 2:
partition-1
xxxx: 3
yyyy: 7
partition-2
bbbb: 15
zzzz: 15

spill 3:
partition-1
hhhh: 5
mmmm: 2
yyyy: 9
partition-2
cccc: 15
zzzz: 13
After the map task is completed, the output from the mapper will be a single file (note that the three spill files above are now merged, but no combiner is applied, assuming no combiner was specified in the job conf):
Mapper-1 contents:
partition-1:
hhhh: 5
mmmm: 2
xxxx: 2
xxxx: 3
yyyy: 3
yyyy: 7
yyyy: 9
partition-2:
aaaa: 15
bbbb: 15
cccc: 15
zzzz: 10
zzzz: 15
zzzz: 13
So here partition-1 may correspond to reducer-1. That is, the data in the partition-1 segment above is sent to reducer-1 and the data in the partition-2 segment is sent to reducer-2.
If my understanding so far is correct,
how will I be able to get the intermediate file that has both partitions and sorted data from the mapper output?
It is interesting to note that running the mapper alone does not produce sorted output, which seems to contradict the point that the data sent to the reducer is sorted. More details here
Also, no combiner is applied if only a mapper is run (no reducer): More details here
Map-only jobs work differently than Map-and-Reduce jobs. It's not inconsistent, just different.
How will I be able to get the intermediate file that has both partitions and sorted data from the mapper output?
You can't. There isn't a hook for getting data from the intermediate stages of MapReduce. The same is true for getting data after the partitioner, or after a record reader, etc.
It is interesting to note that running the mapper alone does not produce sorted output, which seems to contradict the point that the data sent to the reducer is sorted. More details here
It does not contradict it. Mappers sort because the reducer needs the data sorted to be able to do a merge. If there are no reducers, there is no reason to sort, so it doesn't. This is the right behavior, because I don't want the output sorted in a map-only job; that would make my processing slower. I've never had a situation where I wanted my map output to be locally sorted.
Also, no combiner is applied if only a mapper is run (no reducer): More details here
Combiners are an optimization. There is no guarantee that they actually run, or over what data. Combiners are mostly there to make the reducers more efficient. So, again, just like the local sorting, combiners do not run if there are no reducers, because there is no reason to.
If you want combiner-like behavior, I suggest writing data into a buffer (a HashMap perhaps) and then writing out the locally summarized data in the cleanup function that runs when a mapper finishes. Be careful of memory usage if you do this. This is a better approach because combiners are specified as a nice-to-have optimization, so you can't count on them running, or on what they cover even when they do run.
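A minimal sketch of that in-mapper combining pattern for a word count (the flush threshold is an arbitrary assumption used to bound memory):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Aggregate counts in a HashMap and emit them in cleanup(),
// instead of relying on a Combiner that may or may not run.
public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int FLUSH_THRESHOLD = 100000; // arbitrary cap on buffered distinct keys
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
        if (counts.size() > FLUSH_THRESHOLD) {
            flush(context); // keep memory usage bounded, as cautioned above
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        flush(context); // emit the locally summarized counts when the mapper finishes
    }

    private void flush(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
        counts.clear();
    }
}
```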
I understand from When do reduce tasks start in Hadoop that a reduce task in Hadoop consists of three steps: shuffle, sort and reduce, where the sort (and after that the reduce) can only start once all the mappers are done. Is there a way to start the sort and reduce every time a mapper finishes?
For example, let's say we have only one job with mappers mapperA and mapperB and 2 reducers. What I want to do is:
mapperA finishes
the shuffle copies the appropriate partitions of mapperA's output to, say, reducers 1 and 2
reducers 1 and 2 sort and reduce that data and generate some intermediate output
now mapperB finishes
the shuffle copies the appropriate partitions of mapperB's output to reducers 1 and 2
reducers 1 and 2 sort and reduce again, and each merges the new output with its old one
Is this possible? Thanks
You can't with the current implementation. However, people have "hacked" the Hadoop code to do what you want to do.
In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted; plus, you may have some speculative mappers running and you do not know yet which of the duplicate mappers will finish first.
However, as the "Breaking the MapReduce Stage Barrier" paper indicates, for some applications it may make sense not to wait for all of the output of the mappers. If you want to implement this sort of behavior (most likely for research purposes), then you should take a look at the org.apache.hadoop.mapred.ReduceTask.ReduceCopier class, which implements ShuffleConsumerPlugin.
EDIT: Finally, as @teo points out in this related SO question, the ReduceCopier.fetchOutputs() method is the one that holds the reduce task from running until all map outputs are copied (through the while loop in line 2026 of Hadoop release 1.0.4).
You can configure this using the slowstart property, which denotes the percentage of your mappers that need to be finished before the copy to the reducers starts. On many clusters it is configured in the 0.9 - 0.95 (90-95%) range (the stock Apache default is 0.05), but you can override it to 0 if you want:
`mapred.reduce.slowstart.completed.maps` (MRv1) / `mapreduce.job.reduce.slowstart.completedmaps` (MRv2/YARN)
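For example, a minimal driver sketch that sets it so the copy phase starts as soon as the first map completes (the class and job name are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Let reducers start fetching map output immediately. Note this only moves the
// shuffle (copy) earlier; the sort and reduce still wait for all maps, as discussed below.
public class EarlyShuffleDriver {
    public static Job newJob() throws Exception {
        Configuration conf = new Configuration();
        // MRv2/YARN name; on MRv1 use mapred.reduce.slowstart.completed.maps
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.0f);
        return Job.getInstance(conf, "early-shuffle-example");
    }
}
```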
Starting the sort process before all mappers finish is sort of a Hadoop anti-pattern (if I may put it that way!), in that the reducers cannot know that there is no more data to receive until all the mappers finish. You, the invoker, may know that, based on your definition of keys, partitioner, etc., but the reducers don't.
For one of my hadoop jobs, the amount of data fed into my reducer tasks is extremely unbalanced. For instance, if I have 10 reducer tasks, the input size to 9 of them will be in the 50KB range and the last will be close to 200GB. I suspect that my mappers are generating a large number of values for a single key but I don't know what that key is. It's a legacy job and I don't have access to the source code anymore. Is there a way to see the key/value pairs, either as output from the mapper or input to the reducer, while the job is running?
Try adding this to your CLI job run: -D mapred.reduce.tasks=0
This should set the number of reducers to 0, which in effect will have the mappers dump their output directly to HDFS. However, there may be some code that overrides the number of reducers regardless... so this might not work.
If this works, it will show you the output of the mapper.
You can always count the total number of values per key with a separate, simple MapReduce job.
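A minimal sketch of such a counting job (the way the key is re-derived from the input line, here the first tab-separated field, is an assumption; you would plug in whatever keying the legacy job uses):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emit (key, 1) per input record and sum the counts, so the reducer output
// reveals which keys dominate the data.
public class KeyCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // assumption: the key the legacy job uses is the first tab-separated field
        outKey.set(line.toString().split("\t", 2)[0]);
        context.write(outKey, ONE);
    }
}

class KeyCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        context.write(key, new IntWritable(total)); // large totals point to the skewed key(s)
    }
}
```

Since the sum is associative and commutative, KeyCountReducer can also be registered as the combiner to keep this diagnostic job cheap.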