why Hadoop shuffling time takes longer than expected - hadoop

I am trying to figure out which steps takes how much time in simple hadoop wordcount example.
In this example 3 maps and 1 reducer is used where each map generates ~7MB shuffle data. I have a cluster which is connected via 1Gb switches. When I look at the job details, realized that shuffling takes ~7 sec after all map tasks are completed wich is more than expected to transfer such a small data. What could be the reason behind this? Thanks

Hadoop uses heartbeats to communicate with nodes. By default hadoop uses minimal heartbeat interval equals to 3seconds. Consequently hadoop completes your task within two heartbeats (roughly 6 seconds).
More details: https://issues.apache.org/jira/browse/MAPREDUCE-1906

The transfer is not the only thing to complete after the map step. Each mapper outputs their part of a given split locally and sorts it. The reducer that is tasked with a particular split then gathers the parts from each mapper output, each requiring a transfer of 7 MB. The reducer then has to merge these segments into a final sorted file.
Honestly though, the scale you are testing on is absolutely tiny. I don't know all the parts of the Hadoop shuffle step, which I understand has some involved details, but you shouldn't expect performance of such small files to be indicative of actual performance on larger files.

I think the shuffling started after first mapper started. But waited for the next two mappers.
There is option to start reduce phase (begins with shuffling) after all the mappers were finished. But that's not really speed up anything.
(BTW. 7 seconds is considered fast in Hadoop. Hadoop is poor in performance. Especially for small files. Unless somebody else is paying for this. Don't use Hadoop.)

Related

Hadoop Mapper Reducer Taking So much Time

I have a MR code for small files its taking 7 minutes for 15gb single file.
But for multiple file of 37gb its taking too much time and showing percentage 1% completed in 1 minute consistently.
Please suggest me.
MapReduce was never designed for low latency. The idea of MapReduce is that you have cases where you process all the data in parallel. The key idea was to reduce the time by parallelism.
Take wordcount for example. Lets say, you want to run a wordcount on 50 GB. Running this on a single machine, might take long. Paralleling it to lets say 10 machines means 5 GB per machine in parallel. Thats an improvement. This are the cases what MapReduce is designed for.
If you looking for a technology that returns result fasts and also does this with random reads, consider a different technology. Depending on your specific needs, there are several approaches that might solve your problem better.
It was my mistake I put custom logger in the code so every time when MR run it was logging in MR log file that's why it was taking time.

Tasks taking longer over time in Apache Spark

I have a large dataset that I am trying to run with Apache Spark (around 5TB). I have noticed that when the job starts, it retrieves data really fast and the first stage of the job (a map transformation) gets done really fast.
However, after having processed around 500GB of data, that map transformation starts being slow and some of the tasks are taking several minutes or even hours to complete.
I am using 10 machines with 122 GB and 16CPUs and I am allocating all resources to each of the worker nodes. I thought about increasing the number of machines, but is there any other thing I could be missing?
I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.
It seems that the stage gets completed locally in some nodes faster than in others. Driven from that observation, here is what I would try:
Cache the RDD that you process. Do not forget to unpersist it, when you don't need it anymore.
Understanding caching, persisting in Spark.
Check if the partitions are balanced, which doesn't seem to be
the case (that would explain why some local stages complete much
earlier than others). Having balanced partitions is the holy grail
in distributed-computing, isn't it? :)
How to balance my data across the partitions?
Reducing the communications costs, i.e. use less workers than you
use, and see what happens. Of course that heavily depends on your
application. You see, sometimes communication costs become so big,
they dominate, so using less machines for example, speeds up the
job. However, I would do that, only if steps 1 and 2 would not suffice.
Without any more info it would seem that at some point of the computation your data gets spilled to the disk because there is no more space in memory.
It's just a guess, you should check your Spark UI.

In which part/class of mapreduce is the logic of stopping reduce tasks implemented

In Hadoop MapReduce no reducer starts before all mappers are finished. Can someone please explain me at which part/class/codeline is this logic implemented? I am talking about Hadoop MapReduce version 1 (NOT Yarn). I have searched the map reduce framework but there are so many classes and i don't understand much the method calls and their ordering.
In other words i need (first for test purposes) to let the reducers start reducing even if there are still working mappers. I know that this way i am getting false results for the job but for know this is the start of some work for changing parts of the framework. So where should i start to look and make changes?
This is done in the shuffle phase. For Hadoop 1.x, take a look at org.apache.hadoop.mapred.ReduceTask.ReduceCopier, which implements ShuffleConsumerPlugin. You may also want to read the "Breaking the MapReduce Stage Barrier" research paper by Verma et al.
EDIT:
After reading #chris-white 's answer, I realized that my answer needed an extra explanation. In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted; plus, you may have some speculative mappers running and you do not know yet which of the duplicate mappers will finish first. However, as the "Breaking the MapReduce Stage Barrier" paper indicates, for some applications, it may make sense not to wait for all of the output of the mappers. If you would want to implement this sort of behavior (most likely for research purposes), then you should take a look at the classes I mentioned above.
Some points for clarification:
A reducer cannot start reducing until all mappers have finished, their partitions copied to the node where the reducer task is running, and finally sorted.
What you may see is a reducer pre-empting the copy of map outputs while other map tasks are still running. This is controlled via a configuration property known as slowstart (mapred.reduce.slowstart.completed.map). This value represents a ratio (0.0 - 1.0) of the number of map tasks that need to have completed before the reducer tasks will start up (copying over the map outputs from those map tasks that have completed). The default value is usually around 0.9, meaning that if you have 100 map tasks for your job, 90 of them would need to finish before the job tracker can start to launch the reduce tasks.
This is all controlled by the job tracker, in the JobInProgress class, lines 775, 1610, 1664.

Why submitting job to mapreduce takes so much time in General?

So usually for 20 node cluster submitting job to process 3GB(200 splits) of data takes about 30sec and actual execution about 1m.
I want to understand what is the bottleneck in job submitting process and understand next quote
Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time
Some process I'm aware:
1. data splitting
2. jar file sharing
A few things to understand about HDFS and M/R that helps understand this latency:
HDFS stores your files as data chunk distributed on multiple machines called datanodes
M/R runs multiple programs called mapper on each of the data chunks or blocks. The (key,value) output of these mappers are compiled together as result by reducers. (Think of summing various results from multiple mappers)
Each mapper and reducer is a full fledged program that is spawned on these distributed system. It does take time to spawn a full fledged programs, even if let us say they did nothing (No-OP map reduce programs).
When the size of data to be processed becomes very big, these spawn times become insignificant and that is when Hadoop shines.
If you were to process a file with a 1000 lines content then you are better of using a normal file read and process program. Hadoop infrastructure to spawn a process on a distributed system will not yield any benefit but will only contribute to the additional overhead of locating datanodes containing relevant data chunks, starting the processing programs on them, tracking and collecting results.
Now expand that to 100 of Peta Bytes of data and these overheads looks completely insignificant compared to time it would take to process them. Parallelization of the processors (mappers and reducers) will show it's advantage here.
So before analyzing the performance of your M/R, you should first look to benchmark your cluster so that you understand the overheads better.
How much time does it take to do a no-operation map-reduce program on a cluster?
Use MRBench for this purpose:
MRbench loops a small job a number of times
Checks whether small job runs are responsive and running efficiently on your cluster.
Its impact on the HDFS layer is very limited
To run this program, try the following (Check the correct approach for latest versions:
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
Surprisingly on one of our dev clusters it was 22 seconds.
Another issue is file size.
If the file sizes are less than the HDFS block size then Map/Reduce programs have significant overhead. Hadoop will typically try to spawn a mapper per block. That means if you have 30 5KB files, then Hadoop may end up spawning 30 mappers eventually per block even if the size of file is small. This is a real wastage as each program overhead is significant compared to the time it would spend processing the small sized file.
As far as I know, there is no single bottleneck which causes the job run latency; if there was, it would have been solved a long time ago.
There are a number of steps which takes time, and there are reasons why the process is slow. I will try to list them and estimate where I can:
Run hadoop client. It is running Java, and I think about 1 second overhead can be assumed.
Put job into the queue and let the current scheduler to run the job. I am not sure what is overhead, but, because of async nature of the process some latency should exists.
Calculating splits.
Running and syncronizing tasks. Here we face with the fact that TaskTrackes poll the JobTracker, and not opposite. I think it is done for the scalability sake. It mean that when JobTracker wants to execute some task, it do not call task tracker, but wait that approprieate tracker will ping it to get the job. Task trackers can not ping JobTracker to frequently, otherwise they will kill it in large clusters.
Running tasks. Without JVM reuse it takes about 3 seconds, with it overhead is about 1 seconds per task.
Client poll job tracker for the results (at least I think so) and it also add some latency to getting information that job is finished.
I have seen similar issue and I can state the solution to be broken in following steps :
When the HDFS stores too many small files with fixed chunk size, there will be issues on efficiency in HDFS, the best way would be to remove all unnecessary files and small files having data. Try again.
Try with the data nodes and name nodes:
Stop all the services using stop-all.sh.
Format name-node
Reboot machine
Start all services using start-all.sh
Check data and name nodes.
Try installing lower version of hadoop (hadoop 2.5.2) which worked in two cases and it worked in hit and trial.

Idea's for balancing out a HDFS -> HBase map reduce job

For a client, I've been scoping out the short-term feasibility of running a Cloudera flavor hadoop cluster on AWS EC2. For the most part the results have been expected with the performance of the logical volumes being mostly unreliable, that said doing what I can I've got the cluster to run reasonably well for the circumstances.
Last night I ran a full test of their importer script to pull data from a specified HDFS path and push it into Hbase. Their data is somewhat unusual in that the records are less then 1KB's a piece and have been condensed together into 9MB gzipped blocks. All total there are about 500K text records that get extracted from the gzips, sanity checked, then pushed onto the reducer phase.
The job runs within expectations of the environment ( the amount of spilled records is expected by me ) but one really odd problem is that when the job runs, it runs with 8 reducers yet 2 reducers do 99% of the work while the remaining 6 do a fraction of the work.
My so far untested hypothesis is that I'm missing a crucial shuffle or blocksize setting in the job configuration which causes most of the data to be pushed into blocks that can only be consumed by 2 reducers. Unfortunately the last time I worked on Hadoop, another client's data set was in 256GB lzo files on a physically hosted cluster.
To clarify, my question; is there a way to tweak a M/R Job to actually utilize more available reducers either by lowering the output size of the maps or causing each reducer to cut down the amount of data it will parse. Even a improvement of 4 reducers over the current 2 would be a major improvement.
It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What are the keys as the output of the mapper?
You have a couple of options here:
Try more reducers. Sometimes, you get weird artifacts in the randomness of the hashes, so having a prime number of reducers sometimes helps. This will likely not fix it.
Write a custom partitioner that spreads out the work better.
Figure out why a bunch of your data is getting binned into two keys. Is there a way to make your keys more unique to split up the work?
Is there anything you can do with a combiner to reduce the amount of traffic going to the reducers?

Resources