Shuffle phase lasts too long in Hadoop

I have a MapReduce job in which the shuffle phase lasts too long.
At first I thought it was because I'm emitting a lot of data from the Mapper (around 5GB). I addressed that by adding a Combiner, so that less data is sent to the Reducer. Even so, the shuffle period did not shorten as I thought it would.
My next idea was to eliminate the Combiner by combining in the Mapper itself. I got that idea from here, where it says that data needs to be serialized/deserialized for a Combiner to be used. Unfortunately, the shuffle phase is still just as long.
My only remaining thought is that the cause could be my use of a single Reducer. But that shouldn't be the case, since I'm not emitting much data when using a Combiner or combining in the Mapper.
Here are my stats:
Here are all the counters for my Hadoop (YARN) job:
I should also add that this runs on a small cluster of 4 machines. Each has 8GB of RAM (2GB reserved) and 12 virtual cores (2 reserved).
These are virtual machines. At first they were all on a single host, but then I split them 2-2 across two hosts. So where they originally shared one HDD, there are now two machines per disk. They are connected by a gigabit network.
And here are more stats:
The whole memory is occupied.
The CPU is constantly under pressure while the job runs (the picture shows CPU usage for two consecutive runs of the same job).
My question is: why is the shuffle time so long, and how can I fix it? I also don't understand why there was no speedup even though I dramatically reduced the amount of data emitted from the Mapper.

A few observations:
For a 30-minute job, the GC time is too high (try reusing objects rather than creating a new one on each call to the map()/reduce() method).
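The reuse pattern mentioned above can be sketched in plain Java. `MutableInt` below is a hypothetical stand-in for Hadoop's `IntWritable`; in a real Mapper you would keep a `Text`/`IntWritable` field and call `set()` on it rather than allocating a new Writable per record:

```java
// MutableInt is a hypothetical stand-in for Hadoop's IntWritable: a mutable
// box that is reset with set() instead of re-allocated for every record.
class MutableInt {
    private int value;
    void set(int v) { value = v; }
    int get() { return value; }
}

public class ReuseDemo {
    // Allocated once and reused for every record -- the same pattern as
    // keeping a Text/IntWritable field in a Mapper and calling set() on it
    // inside map() instead of new-ing a Writable per call.
    private final MutableInt out = new MutableInt();

    public int emit(int v) {
        out.set(v);   // mutate the existing object; no per-record allocation
        return out.get();
    }
}
```

With millions of map() calls, avoiding one allocation per record is exactly what brings the GC time counter down.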
The average map time of 16 minutes is far too high. What are you doing in your map()?
YARN memory is at 99%, which suggests you are running too many services on your HDP cluster and there is not enough RAM to support them all.
Please increase the YARN container memory; give each container at least 1 GB.
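In yarn-site.xml that change could look like the following (the values are assumptions for an 8 GB node with 2 GB reserved for the OS and services; size them to what your machines can actually spare):

```xml
<!-- yarn-site.xml: minimal sketch, values are assumptions for an 8 GB node -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- at least 1 GB per container -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>6144</value> <!-- memory the NodeManager may hand out to containers -->
</property>
```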
This looks like a GC plus overscheduled-cluster problem.

Related

How many mappers and reducers would be advised to process 2TB of data in Hadoop?

I am trying to develop a Hadoop project for one of our clients. We will be receiving around 2 TB of data per day, so as part of reconciliation we would like to read the 2 TB of data and perform sorting and filter operations.
We have set up a Hadoop cluster with 5 data nodes running on t2x.large AWS instances, each with 4 CPU cores and 16GB RAM. What is the advisable number of mappers and reducers to launch to complete the data processing quickly?
Take a look at this:
http://crazyadmins.com/tune-hadoop-cluster-to-get-maximum-performance-part-1/
http://crazyadmins.com/tune-hadoop-cluster-to-get-maximum-performance-part-2/
This depends on the nature of the task, whether it is RAM- or CPU-bound, and on how parallel your system can be.
If every node has 4 CPU cores and 16GB RAM, I would suggest on average 4 to 6 MapReduce tasks per node.
Creating too many MapReduce tasks will degrade your CPU performance, and you may run into container failures from insufficient memory.
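One way to land at roughly 4-6 concurrent tasks per node is to size the task containers accordingly. A hedged sketch for mapred-site.xml, assuming about 12 GB of each 16 GB node is handed to YARN (the exact splits are assumptions):

```xml
<!-- mapred-site.xml: container sizes chosen so ~4-6 tasks fit per node -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value> <!-- ~6 map containers fit in 12 GB -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value> <!-- fewer, larger reduce containers -->
</property>
```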

Tuning Hadoop job execution on YARN

A bit of intro: I'm learning about Hadoop. I have implemented a machine learning algorithm (clustering) on top of Hadoop and tested it only on a small example (30MB).
A couple of days ago I installed Ambari and created a small cluster of four machines (a master and 3 workers). The master runs the ResourceManager and the NameNode.
Now I'm testing my algorithm with increasing amounts of data (300MB, 3GB). I'm looking for pointers on how to tune my mini-cluster. Concretely, I would like to know how to determine the MapReduce2 and YARN settings in Ambari.
How do I determine the min/max memory for a container, the reserved memory for a container, the sort allocation memory, and the map and reduce memory?
The problem is that execution of my jobs is very slow on Hadoop (and clustering is an iterative algorithm, which makes things worse).
I have a feeling that my cluster setup is not good, for the following reasons:
I ran a job on a 30MB dataset (I set the block size for this job to 8MB, since the data is small and the processing is intensive): execution time 30 minutes.
I ran the same job on the same dataset multiplied 10 times, i.e. 300MB (same 8MB block size): execution time 2 hours.
Same amount of data (300MB), but with a 128MB block size: the same execution time, maybe even a bit over 2 hours.
The block size on HDFS is 128MB, so I thought this would produce a speedup, but that is not the case. My suspicion is that the cluster setup (min/max RAM size, map and reduce RAM) is bad, and hence nothing improves even though greater data locality is achieved.
Could this be the consequence of a bad setup, or am I wrong?
Please set the property below in the YARN configuration to allocate 33% of the maximum YARN memory per job; you can alter it based on your requirements.
Change yarn.scheduler.capacity.root.default.user-limit-factor from its default of 1 to:
yarn.scheduler.capacity.root.default.user-limit-factor=0.33
If you need further information on this, please refer to the following link: https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/

Hadoop Cassandra CPU utilization

Summary: How can I get Hadoop to use more CPUs concurrently on my server?
I'm running Cassandra and Hadoop on a single high-end server with 64GB RAM, SSDs, and 16 CPU cores. The input to my MapReduce job has 50M rows. During the map phase, Hadoop creates seven mappers. Six of those complete very quickly, and the seventh runs for two hours to complete the map phase. I've suggested more mappers like this ...
job.getConfiguration().set("mapred.map.tasks", "12");
but Hadoop continues to create only seven. I'd like to get more mappers running in parallel to take better advantage of the 16 cores in the server. Can someone explain how Hadoop decides how many mappers to create?
I have a similar concern during the reduce phase. I tell Hadoop to create 12 reducers like this ...
job.setNumReduceTasks(12);
Hadoop does create 12 reducers, but 11 complete quickly and the last one runs for hours. My job has 300K keys, so I don't imagine they're all being routed to the same reducer.
Thanks.
The number of map tasks depends on your input data.
For example:
If your data source is HBase, the number of map tasks is the number of regions in your data.
If your data source is files, the number of map tasks is your file size divided by the block size (64MB or 128MB).
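That rule of thumb can be written out as a ceiling division. A plain-Java sketch (FileInputFormat's real split logic also honors min/max split-size settings, so treat this as an estimate):

```java
// Sketch of the rule of thumb above: the number of map tasks is roughly
// ceil(fileSize / blockSize). FileInputFormat's real split logic also
// honors min/max split-size settings, so this is only an estimate.
public class SplitCount {
    public static long count(long fileSizeBytes, long blockSizeBytes) {
        // Integer ceiling division without floating point.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }
}
```

For example, 300MB of input with an 8MB block size yields 38 splits, while the same data with 128MB blocks yields only 3.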
You cannot specify the number of map tasks directly in code.
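You can, however, influence it indirectly by shrinking the maximum split size, which forces more splits and therefore more mappers. A hedged example using the new-API property name (64 MB is just an illustrative value):

```xml
<!-- per-job or mapred-site.xml: cap split size to get more, smaller splits -->
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>67108864</value> <!-- 64 MB -->
</property>
```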
The problem of 6 fast mappers and 1 slow one is due to data imbalance. I have not used Cassandra before, so I cannot tell you how to fix it.
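For the reducer skew described in the question, one way to check for imbalance is to replay the formula Hadoop's default HashPartitioner uses, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, over a sample of your keys. A self-contained sketch (the sample keys and reducer count are placeholders):

```java
import java.util.List;

// Replays Hadoop's default HashPartitioner formula over a sample of keys to
// show how they would spread across reducers; a lopsided histogram means one
// reducer receives most of the data.
public class SkewCheck {
    public static int[] histogram(List<String> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String k : keys) {
            // Same expression HashPartitioner uses to pick a partition.
            counts[(k.hashCode() & Integer.MAX_VALUE) % numReducers]++;
        }
        return counts;
    }
}
```

With 300K distinct keys a roughly even histogram is expected; if it is even, the slow reducer is more likely caused by a few keys carrying disproportionately many values rather than by partitioning.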

how to deal with large map output in hadoop?

I am new to Hadoop, and I'm working with a cluster of 3 nodes (each with 2GB RAM).
The input file is small (5MB), but the map output is very large (about 6GB).
In the map phase my memory fills up and the tasks run very slowly.
What is the reason for this?
Can anyone help me make my program faster?
Use an NLineInputFormat, where N refers to the number of lines of input each mapper will receive. This way more splits are created, forcing smaller input data onto each map task. Otherwise, the entire 5MB will go into one map task.
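A sketch of the corresponding setting (new-API property name; 1000 lines per mapper is just an illustrative value, and the driver still needs job.setInputFormatClass(NLineInputFormat.class)):

```xml
<!-- per-job or mapred-site.xml: lines of input handed to each mapper -->
<property>
  <name>mapreduce.input.lineinputformat.linespermap</name>
  <value>1000</value>
</property>
```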
The size of the map output by itself does not cause memory problems, since the mapper can work in a streaming fashion: it consumes records, processes them, and writes them to the output. Hadoop stores some amount of data in memory and then spills it to disk.
So your problem is likely caused by one of two things:
a) Your mapper algorithm somehow accumulates data during processing.
b) The cumulative memory given to your mappers exceeds the RAM of the nodes. The OS then starts swapping, and your performance can fall by orders of magnitude.
Case b is more likely, since 2GB is really too little for a usual Hadoop configuration. If you are going to work with it, I would suggest configuring 1, at most 2, mapper slots per node.
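In the classic (MR1) configuration that suggestion looks like this in mapred-site.xml (a sketch; on YARN you would instead cap the NodeManager's container memory so only 1-2 map containers fit per node):

```xml
<!-- mapred-site.xml (classic MR1): at most 2 concurrent map slots per node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
```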

How to decrease number of map sweeps in a job (without changing data chunk size)?

The gist of my problem is: how does one decrease the number of map sweeps a job may need? The number of map tasks for a job is data_size/HDFS_BLOCK_SIZE. The number of sweeps it takes to complete depends on how many map slots we have. Assuming I am running nothing else, just this one job, I find that the per-node CPU utilization is low (implying I could actually run more map tasks per node). I played with the mapred.tasktracker.map.tasks.maximum parameter (for example, each of my nodes has 32 processors and I set it as high as 30), but I could never increase the number of map slots, and the overall CPU utilization stays at 60% or so. Are there any other parameters to play with? My data size is large enough (32GB, on an 8-node cluster with 32 CPUs each) that it takes two map sweeps (the first sweep does maps 1-130 and the second sweep completes the rest).
In case no one has told you yet:
MapReduce is mainly IO-bound; it has to read a lot of data from disk, write it back, then read it and write it again. In between the reads and writes it executes your map and reduce logic.
So, from what I have heard, lifting the CPU usage comes down to making the cluster no longer IO-bound:
RAID-0 or RAID-10 your hard disks, and get the fastest hard disks out there. In the consumer market there are the Western Digital VelociRaptors with 10k RPM.
SSDs don't contribute too much, since Hadoop is mostly optimized for sequential reads.
Give as much network bandwidth as possible.
Lots of RAM for disk caching.
Even then, you will see <100% CPU utilization, but it is much better, and performance will skyrocket.
However, CPU utilization is not a good metric for a Hadoop cluster, as you might conclude from the points above.
Hadoop is mainly about the reliable storage of data, with neat features to crunch it. It does not give you supercomputer performance; if you need that, get an MPI cluster and a Ph.D. to code your algorithms ;)
Sorry for the noise, but something must have gone wrong with my installation. I happened to reinstall Hadoop and it works as expected. I guess some parameters were conflicting.
