So I am using MultipleOutputs from the package org.apache.hadoop.mapreduce.lib.output.
I have a reducer that is doing a join of 2 data sources and emitting 3 different outputs.
55 reduce tasks were invoked, and on average each of them took about 6 minutes to emit data. There were outliers that took about 11 minutes.
I observed that if I comment out the pieces where the actual output happens, i.e. the calls to mos.write() (MultipleOutputs), the average time drops to seconds and the whole job completes in about 2 minutes.
I do have a lot of data to emit (approximately 40-50 GB).
What can I do to speed things up a bit, with and without considering compression?
Details: I am using TextOutputFormat and giving an HDFS path/URI.
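Regarding compression, here is a minimal sketch of what I'd try at the job level (this is an illustrative fragment, not my actual driver; GzipCodec is used only because it needs no native libraries, and MultipleOutputs should inherit these job-wide output settings):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class CompressedOutputDriver {
        // Hypothetical driver fragment: ask the job to compress everything it writes;
        // MultipleOutputs should pick up these job-wide output settings.
        public static Job configure(Configuration conf) throws Exception {
            Job job = Job.getInstance(conf, "join-with-compressed-output");
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setCompressOutput(job, true);                    // compress all output files
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // gzip needs no native libraries
            return job;
        }
    }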
Further clarifications:
I have small input data to my reducers; however, the reducers are doing a reduce-side join and hence emit a large amount of data. Since an outlier reducer already takes about 11 minutes, reducing the number of reducers would increase this time and hence the overall time of my job, so it won't serve my purpose.
Input to the reducer comes from 2 mappers.
Mapper 1 -> Emits about 10,000 records. (Key Id)
Mapper 2 -> Emits about 15M records. (Key Id, Key Id2, Key Id3)
In the reducer I get everything belonging to Key Id, sorted by Key Id, KeyId2 and KeyId3.
So I know I have an iterator which looks like:
Mapper1 output and then Mapper2 output.
Here I store Mapper1's output in an ArrayList and start streaming Mapper2's output.
For every Mapper2 record, I do a mos.write(...).
I conditionally store a part of this record in memory (in a HashSet).
Every time KeyId2 changes, I do an extra mos.write(...).
At the close (cleanup) method of my reducer, I emit whatever I stored in the conditional step. So that is the third mos.write(...).
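Roughly, the reducer is shaped like the following sketch (the tags, field layout and the named outputs "joined", "rollup" and "final" are placeholders, and the actual join/business logic is omitted; the named outputs are assumed to be registered in the driver with MultipleOutputs.addNamedOutput):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Illustrative only: assumes values arrive tagged ("M1|" / "M2|") so that
    // Mapper1 rows for a Key Id sort before Mapper2 rows.
    public class JoinReducer extends Reducer<Text, Text, Text, Text> {

        private MultipleOutputs<Text, Text> mos;
        private final Text outValue = new Text();                 // reused Writable
        private final Set<String> conditional = new HashSet<>();  // kept until cleanup()

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> mapper1Rows = new ArrayList<>();  // small side, ~10,000 records in total
            String previousKeyId2 = null;

            for (Text value : values) {
                String record = value.toString();
                if (record.startsWith("M1|")) {            // assumed tag for Mapper1 output
                    mapper1Rows.add(record);               // join against these rows (omitted)
                    continue;
                }
                // First output: one write per Mapper2 record (the ~15M-record stream).
                outValue.set(record);
                mos.write("joined", key, outValue);

                String keyId2 = record.split("\\|")[1];    // assumed field layout
                if (previousKeyId2 != null && !keyId2.equals(previousKeyId2)) {
                    // Second output: one extra write whenever KeyId2 changes.
                    outValue.set(previousKeyId2);
                    mos.write("rollup", key, outValue);
                }
                previousKeyId2 = keyId2;
                // conditional.add(...) would happen here, based on the business rule.
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Third output: emit whatever was conditionally stored.
            for (String kept : conditional) {
                mos.write("final", new Text(kept), new Text(""));
            }
            mos.close();  // flushes and closes every underlying record writer
        }
    }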
I have gone through the article http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
As mentioned there:
Tip 1: Configuring my cluster correctly is beyond my control.
Tip 2: Use LZO compression - or compression in general. Something I am trying alongside.
Tip 3: Tune the number of mappers and reducers - My mappers finish really fast (on the order of seconds), probably because they are almost identity mappers. Reducers take some time, as mentioned above (this is the time I'm trying to reduce), so increasing the number of reducers will probably help me - but then there will be resource contention and some reducers will have to wait. This is more of an experimental, trial-and-error sort of thing for me.
Tip 4: Write a combiner. Does not apply to my case (of reduce-side joins).
Tip 5: Use the appropriate Writable - I need to use Text as of now. All 3 of these outputs go into directories that have a Hive schema sitting on top of them. Later, when I figure out how to emit Parquet files from multiple outputs, I might change this and the tables' storage format.
Tip 6: Reuse Writables. Okay, this is something I have not considered so far, but I still believe that it's the disk writes that are taking time, not the processing or the Java heap. But anyway, I'll give it a shot (see the sketch after this list).
Tip 7: Use poor man's profiling. Kind of already done that, and figured out that it's actually the mos.write steps that are taking most of the time.
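For Tip 6, a rough sketch of what reusing Writables would look like in one of my almost-identity mappers (the delimiter and the source tag are made up for illustration):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative only: outKey and outValue are allocated once per task and
    // refilled with set(), instead of new Text(...) per record.
    public class IdentityTagMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(";");  // assumed delimiter
            outKey.set(fields[0]);                          // Key Id
            outValue.set("M2|" + line);                     // assumed source tag
            context.write(outKey, outValue);
        }
    }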
Do:
1. Reduce the number of reducers.
An optimized reducer count (for a general ETL operation) is around 1 GB of data per reducer.
Here your input data (in GB) is itself less than your number of reducers.
2. Code optimization can be done. Share the code, or else refer to http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ to optimize it.
3. If this does not help, then understand your data. The data might be skewed. If you don't know which keys are skewed, the skewed join in Pig will help.
Related
Can I increase the performance time of my hadoop map/reduce job by splitting the input data into smaller chunks?
First question:
For example, I have a 1GB input file for the map task. My default block size is 250MB, so only 4 mappers will be assigned to do the job. If I split the data into 10 pieces, each piece will be 100MB, and then I have 10 mappers to do the work. But then each split piece will occupy 1 block in storage, which means 150MB will be wasted for each split data block. What should I do in this case if I don't want to change the block size of my storage?
Second question: If I split the input data before the map job, it can increase the performance of the map job. So if I want to do the same for the reduce job, should I ask the mapper to split the data before giving it to the reducer, or should I let the reducer do it?
Thank you very much. Please correct me if I also misunderstand something. Hadoop is quite new to me. So any help is appreciated.
When your pieces are 100 MB each with a 250 MB block size, the remaining 150 MB is not wasted: an HDFS block only occupies as much disk space as the data actually stored in it, so that space is still available to the system.
Increasing the number of mappers does not necessarily increase performance, because it depends on the number of DataNodes you have. For example, if you have 10 DataNodes -> 10 mappers, it is a good deal. But if you have 4 DataNodes -> 10 mappers, obviously all mappers cannot run simultaneously. So if you have 4 DataNodes, it is better to have 4 blocks (with a 250MB block size).
The reducer is something like a merge of all your mappers' output, and you can't ask the mapper to split the data. Conversely, you can ask the mapper to do a mini-reduce by defining a Combiner. A Combiner is nothing but a reducer that runs on the same node where the mapper executed, before the data is sent to the actual reducer. So the I/O is minimized, and so is the work of the actual reducer. Introducing a Combiner will be a better option to improve performance.
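For illustration, a minimal word-count-shaped sketch of wiring in a Combiner (all class names are placeholders; the same SumReducer works as both combiner and reducer because summing can safely be applied any number of times):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinerDriver {

        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Used both as the combiner (map-side mini-reduce) and the final reducer.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable total = new IntWritable();

            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                total.set(sum);
                context.write(word, total);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
            job.setJarByClass(CombinerDriver.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);   // the map-side mini-reduce described above
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }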
Good luck with Hadoop !!
There can be multiple mappers running in parallel on a node for the same job, based on the number of map slots available on that node. So yes, making smaller pieces of the input should give you more parallel mappers and speed up the process. (How to feed all the pieces as a single input? Put all of them in one directory and add that directory as the input path.)
On the reducer side, if you are OK with combining multiple output files in post-processing, you can set a higher number of reducers; the maximum number of reducers running in parallel is the number of reduce slots available in your cluster. This should improve cluster utilisation and speed up the reduce phase.
If possible, you may also use a combiner to reduce disk and network I/O overhead.
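A rough sketch of the driver-side pieces mentioned above (the paths and the reducer count are placeholders; the job falls back to identity map/reduce since no classes are set here):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitInputDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "many-small-splits");
            job.setJarByClass(SplitInputDriver.class);

            // One directory holding all the smaller input pieces: every file in it
            // becomes at least one split, i.e. at least one mapper.
            FileInputFormat.addInputPath(job, new Path("/data/input-pieces"));   // placeholder path

            // Reducers beyond the available reduce slots just queue up, so match
            // this roughly to the slots in the cluster.
            job.setNumReduceTasks(8);                                            // placeholder count

            FileOutputFormat.setOutputPath(job, new Path("/data/output"));       // placeholder path
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }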
I just started with Camus.
I am planning to run Camus every hour. We get around ~80,000,000 messages every hour, and the average message size is 4KB (we have a single topic in Kafka).
I first tried with 10 mappers; it took ~2 hours to copy one hour's data, and it created 10 files of ~7GB.
Then I tried 300 mappers, which brought the time down to ~1 hour, but it created 11 files. Later, I tried 150 mappers and it took ~30 minutes.
So, how do I choose the number of mappers here? Also, I want to create more files in Hadoop, as one file's size is growing to 7GB. What configuration do I have to check?
It should ideally be equal to or less than the number of Kafka partitions you have in your topic.
That means, for better throughput, your topic should have more partitions and the same number of Camus mappers.
I found the best answer in this article:
The number of maps is usually driven by the number of DFS blocks in the input files. It causes people to adjust their DFS block size to adjust the number of maps.
The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks.
It is best if the maps take at least a minute to execute.
It all depends on the CPU power you have, the type of application - I/O bound (heavy read/write) or CPU bound (heavy processing) - and the number of nodes in your Hadoop cluster.
Apart from setting the number of mappers and reducers at the global level, override those values at the job level depending on the data-processing needs of that job.
And one more thing at the end: if you think a Combiner reduces the I/O transfer between Mapper and Reducer, use it effectively in combination with a Partitioner.
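A hedged sketch of overriding those values per job from the command line, assuming the driver goes through ToolRunner so that generic -D options are honoured (mapreduce.job.reduces is the newer property name; mapred.reduce.tasks is the older one):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class PerJobTuningDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // getConf() already contains any -D overrides from the command line,
            // e.g. -D mapreduce.job.reduces=20
            Job job = Job.getInstance(getConf(), "per-job-tuned");
            job.setJarByClass(PerJobTuningDriver.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new PerJobTuningDriver(), args));
        }
    }

It could then be invoked, hypothetically, as: hadoop jar app.jar PerJobTuningDriver -D mapreduce.job.reduces=20 /in /out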
I am trying to figure out which steps take how much time in the simple Hadoop wordcount example.
In this example 3 maps and 1 reducer are used, where each map generates ~7MB of shuffle data. I have a cluster which is connected via 1Gb switches. When I looked at the job details, I realized that shuffling takes ~7 seconds after all map tasks are completed, which is more than I expected for transferring such a small amount of data. What could be the reason behind this? Thanks
Hadoop uses heartbeats to communicate with nodes. By default, Hadoop uses a minimum heartbeat interval of 3 seconds. Consequently, Hadoop completes your task within two heartbeats (roughly 6 seconds).
More details: https://issues.apache.org/jira/browse/MAPREDUCE-1906
The transfer is not the only thing that has to complete after the map step. Each mapper writes its output locally, partitioned and sorted. The reducer tasked with a particular partition then gathers its piece from each mapper's output, each requiring a transfer of 7 MB. The reducer then has to merge these segments into a final sorted file.
Honestly though, the scale you are testing on is absolutely tiny. I don't know all the parts of the Hadoop shuffle step, which I understand has some involved details, but you shouldn't expect performance of such small files to be indicative of actual performance on larger files.
I think the shuffling started after the first mapper completed, but then it waited for the other two mappers.
There is an option to start the reduce phase (which begins with shuffling) only after all the mappers have finished, but that doesn't really speed anything up.
(BTW, 7 seconds is considered fast in Hadoop. Hadoop is poor in performance, especially for small files. Unless somebody else is paying for this, don't use Hadoop.)
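For reference, the option mentioned above is, as far as I know, the reduce slow-start fraction. A hedged sketch of setting it in a driver (1.0 means reducers, and hence shuffling, start only after every map has finished; on older Hadoop 1.x clusters the property is named mapred.reduce.slowstart.completed.maps instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowStartExample {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // Fraction of maps that must complete before reducers are scheduled.
            // The default is small (around 0.05); 1.0 delays shuffle until all maps finish.
            conf.set("mapreduce.job.reduce.slowstart.completedmaps", "1.0");
            return Job.getInstance(conf, "slow-start-demo");
        }
    }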
I have a cluster setup which has 8 nodes, and I am parsing a 20GB text file with MapReduce. Normally, my purpose is to get every line in the mapper and send it on with a key, which is one of the columns of the input row. When the reducer gets it, it is written to a different directory based on the key value. If I give an example:
input file:
test;1234;A;24;49;100
test2;222;B;29;22;22
test2;0099;C;29;22;22
So these rows will be written like this:
/output/A-r-0001
/output/B-r-0001
/output/C-r-0001
I am using a MultipleOutputs object in the reducer, and if I use a small file everything is OK. But when I use the 20GB file, 152 mappers and 8 reducers are initialized. Everything finishes really fast on the mapper side, but one reducer keeps running. 7 of the reducers finish in at most 18 minutes, but the last one takes 3 hours.
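Roughly, my reducer is shaped like this sketch (the field layout matches the sample rows above; class names are placeholders, error handling is omitted, and the mapper is assumed to emit the third column as the key with the whole line as the value):

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class KeyedOutputReducer extends Reducer<Text, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> lines, Context context)
                throws IOException, InterruptedException {
            for (Text line : lines) {
                // baseOutputPath "A", "B", "C", ... produces files like A-r-00000
                // in the job output directory.
                mos.write(NullWritable.get(), line, key.toString());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }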
First, I suspected that the input of that reducer was bigger than the rest of them, but that is not the case. One reducer has three times more input than the slow one, and it finishes in 17 minutes.
I've also tried increasing the number of reducers to 14, but that resulted in 2 more slow reduce tasks.
I've checked lots of documentation and could not figure out why this is happening. Could you guys help me with it?
EDITED
The problem was due to some corrupt data in my dataset. I've put some strict checks on the input data on the mapper side, and it is working fine now.
Thanks guys.
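For anyone hitting the same thing, a rough sketch of the kind of mapper-side check I mean (the expected field count, the delimiter and the key column are assumptions based on the sample rows above):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative only: drops rows that do not have the expected 6 ';'-separated
    // fields and counts them, instead of letting corrupt records poison a reducer.
    public class ValidatingMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static final int EXPECTED_FIELDS = 6;
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(";", -1);
            if (fields.length != EXPECTED_FIELDS || fields[2].isEmpty()) {
                context.getCounter("DataQuality", "MALFORMED_ROWS").increment(1);
                return;  // skip the corrupt record
            }
            outKey.set(fields[2]);   // third column ("A", "B", "C") is the key
            context.write(outKey, line);
        }
    }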
I've seen that happen often when dealing with skewed data, so my best guess is that your dataset is skewed. That means your mapper will emit lots of records with the same key, which will all go to the same reducer, and that reducer will be overloaded because it has a lot of values to go through.
There is no easy solution for this, and it really depends on the business logic of your job. You could maybe have a check in your Reducer and say that if you have more than N values, you ignore all values after N.
I've also found some doc about SkewReduce which is supposed to make it easier to manage skewed data in a Hadoop environment as described in their paper, but I haven't tried it myself.
Thanks for the explanation. I knew that my dataset does not have evenly distributed key-value pairs. Below is from one of the tests in which I used 14 reducers and 152 mappers.
Task which finished in 17 minutes 27 seconds:
FileSystemCounters
FILE_BYTES_READ 10,023,450,978
FILE_BYTES_WRITTEN 10,023,501,262
HDFS_BYTES_WRITTEN 6,771,300,416
Map-Reduce Framework
Reduce input groups 5
Combine output records 0
Reduce shuffle bytes 6,927,570,032
Reduce output records 0
Spilled Records 28,749,620
Combine input records 0
Reduce input records 19,936,319
Task which finished in 14 hrs 17 minutes 54 sec:
FileSystemCounters
FILE_BYTES_READ 2,880,550,534
FILE_BYTES_WRITTEN 2,880,600,816
HDFS_BYTES_WRITTEN 2,806,219,222
Map-Reduce Framework
Reduce input groups 5
Combine output records 0
Reduce shuffle bytes 2,870,910,074
Reduce output records 0
Spilled Records 8,259,030
Combine input records 0
Reduce input records 8,259,030
The one which takes so much time has fewer records to go through.
In addition to this, after some time, the same tasks start initializing on different nodes. I am guessing Hadoop thinks that task is slow and speculatively launches another one, but it does not help at all.
Here are the counters from the slow-running reducer and the fast-running reducer:
task_201403261540_0006_r_000019 is running very slowly and task_201403261540_0006_r_000000 completed very fast.
It's very clear that one of my reducers is getting a huge number of keys.
We need to optimize our custom Partitioner (see the sketch below).
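A hedged sketch of what such a custom Partitioner could look like, reserving one reducer for a known hot key (which key is hot has to come from your own counters or logs; the value here is a placeholder):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative only: send one known-hot key to a dedicated reducer and hash
    // everything else over the remaining reducers.
    public class HotKeyAwarePartitioner extends Partitioner<Text, Text> {

        private static final String HOT_KEY = "A";  // placeholder: found via counters/logs

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            if (numPartitions == 1) {
                return 0;
            }
            if (HOT_KEY.equals(key.toString())) {
                return numPartitions - 1;                       // reserved reducer for the hot key
            }
            return (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
        }
    }

It would be registered with job.setPartitionerClass(HotKeyAwarePartitioner.class); note that reserving a reducer like this only helps when a single key dominates.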
For a client, I've been scoping out the short-term feasibility of running a Cloudera-flavor Hadoop cluster on AWS EC2. For the most part the results have been as expected, with the performance of the logical volumes being mostly unreliable; that said, doing what I can, I've got the cluster to run reasonably well under the circumstances.
Last night I ran a full test of their importer script to pull data from a specified HDFS path and push it into HBase. Their data is somewhat unusual in that the records are less than 1KB apiece and have been condensed together into 9MB gzipped blocks. In total there are about 500K text records that get extracted from the gzips, sanity checked, and then pushed onto the reduce phase.
The job runs within the expectations of the environment (the amount of spilled records is expected by me), but one really odd problem is that when the job runs, it runs with 8 reducers, yet 2 reducers do 99% of the work while the remaining 6 do a fraction of it.
My so-far-untested hypothesis is that I'm missing a crucial shuffle or block-size setting in the job configuration which causes most of the data to be pushed into blocks that can only be consumed by 2 reducers. Unfortunately, the last time I worked on Hadoop, another client's data set was in 256GB LZO files on a physically hosted cluster.
To clarify my question: is there a way to tweak an M/R job to actually utilize more of the available reducers, either by lowering the output size of the maps or by causing each reducer to cut down the amount of data it will parse? Even an improvement to 4 reducers over the current 2 would be major.
It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What are the keys as the output of the mapper?
You have a couple of options here:
Try more reducers. Sometimes, you get weird artifacts in the randomness of the hashes, so having a prime number of reducers sometimes helps. This will likely not fix it.
Write a custom partitioner that spreads out the work better.
Figure out why a bunch of your data is getting binned into two keys. Is there a way to make your keys more unique to split up the work (for example by salting the hot keys; see the sketch after this list)?
Is there anything you can do with a combiner to reduce the amount of traffic going to the reducers?
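On the "make your keys more unique" point, a rough sketch of key salting: the mapper appends a small random suffix to the key so one hot key's values spread over several reducers, at the cost of having to merge the salted partial results afterwards (the salt count and key layout are placeholders):

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative only: "key#0" .. "key#7" hash to different reducers, so a hot
    // key's values no longer all land on one reducer. A follow-up pass (or a
    // post-processing step) has to merge the salted partial results.
    public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static final int NUM_SALTS = 8;     // placeholder: roughly the reducer count
        private final Random random = new Random();
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String naturalKey = line.toString().split(";")[0];   // assumed key column
            outKey.set(naturalKey + "#" + random.nextInt(NUM_SALTS));
            context.write(outKey, line);
        }
    }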