hadoop cassandra cpu utilization - performance

Summary: How can I get Hadoop to use more CPUs concurrently on my server?
I'm running Cassandra and Hadoop on a single high-end server with 64GB RAM, SSDs, and 16 CPU cores. The input to my mapreduce job has 50M rows. During the map phase, Hadoop creates seven mappers. Six of those complete very quickly, and the seventh runs for two hours to complete the map phase. I've suggested more mappers like this ...
job.getConfiguration().set("mapred.map.tasks", "12");
but Hadoop continues to create only seven. I'd like to get more mappers running in parallel to take better advantage of the 16 cores in the server. Can someone explain how Hadoop decides how many mappers to create?
I have a similar concern during the reduce phase. I tell Hadoop to create 12 reducers like this ...
job.setNumReduceTasks(12);
Hadoop does create 12 reducers, but 11 complete quickly and the last one runs for hours. My job has 300K keys, so I don't imagine they're all being routed to the same reducer.
Thanks.

The number of map tasks depends on your input data.
For example:
if your data source is HBase, the number of map tasks is the number of regions in your table;
if your data source is files, the number of map tasks is roughly file size / block size (64 MB or 128 MB).
You cannot directly specify the number of map tasks in code (a sketch of the usual workaround follows below).
The problem of six fast mappers and one slow one is caused by unbalanced data. I have not used Cassandra before, so I cannot tell you how to fix it.
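To make that concrete, here is a minimal sketch of the usual workaround for file-based input: instead of asking for a mapper count, you cap the split size so the same input yields more, smaller splits. (For Cassandra's input format the equivalent knob is its own split-size setting, e.g. ConfigHelper.setInputSplitSize; treat that as an assumption, since it depends on the connector version.)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "more-mappers");
// Cap each input split at 32 MB instead of the HDFS block size.
// More, smaller splits -> more map tasks that can run in parallel.
FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
// ... set mapper/reducer classes and input/output paths as usual, then submit.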

Related

How many mappers and reducers would be advised to process 2TB of data in Hadoop?

I am trying to develop a Hadoop project for one of our clients. We will be receiving around 2 TB of data per day, so as part of reconciliation we would like to read the 2 TB of data and perform sorting and filtering operations.
We have set up the Hadoop cluster with 5 data nodes running on t2.xlarge AWS instances with 4 CPU cores and 16 GB RAM each. What is the advisable number of mappers and reducers to launch to complete the data processing quickly?
Take a look at these:
http://crazyadmins.com/tune-hadoop-cluster-to-get-maximum-performance-part-1/
http://crazyadmins.com/tune-hadoop-cluster-to-get-maximum-performance-part-2/
This depends on the nature of the task, whether it is RAM- or CPU-intensive, and on how much parallelism your system can sustain.
If every node has 4 CPU cores and 16 GB RAM, I would suggest on average 4 to 6 map/reduce tasks per node.
Creating too many tasks will degrade CPU performance, and you may run into container failures from insufficient memory.
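For reference, these are the memory-related knobs that the container warning refers to; a hedged sketch with illustrative values for a 16 GB node, not recommendations:
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// YARN container size per task; too many concurrent containers of this
// size on a 16 GB node and YARN will start failing or killing them.
conf.set("mapreduce.map.memory.mb", "2048");
conf.set("mapreduce.reduce.memory.mb", "4096");
// The JVM heap must stay below the container size (~80% is a common rule of thumb).
conf.set("mapreduce.map.java.opts", "-Xmx1638m");
conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");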

Estimation Of Mappers for a cluster

I need some clarification regarding the estimation of mappers for a particular job in a Hadoop cluster. As per my understanding, the number of mappers depends on the input splits taken for processing. But that is the case when the input data is already residing in HDFS. Here I need clarification regarding the mappers and reducers triggered by a Sqoop job. Please find the questions below.
How is the mapper count estimated for a dedicated cluster: based on RAM or based on the input splits/blocks? (In general)
How is the mapper count estimated for a Sqoop job retrieving data from an RDBMS into HDFS, based on input size? (Sqoop based)
What is meant by CPU cores, and how does it affect the number of mappers that can run in parallel? (General)
Thanks.
How is the mapper count estimated for a dedicated cluster: based on RAM or based on the input splits/blocks? (In general)
You are right. The number of mappers is usually based on the number of DFS blocks in the input.
How is the mapper count estimated for a Sqoop job retrieving data from an RDBMS into HDFS, based on input size? (Sqoop based)
By default, Sqoop will use four tasks in parallel to import/export data.
You may change this with the -m <number of mappers> option.
Refer: Sqoop parallelism
What is meant by CPU cores, and how does it affect the number of mappers that can run in parallel? (General)
CPU cores are processing units. In simple words, the more cores the better: with more cores, more work can be processed in parallel.
Example: if you have 4 cores, 4 mappers can run in parallel (theoretically!).
How is the mapper count estimated for a dedicated cluster: based on RAM or based on the input splits/blocks? (In general)
Answer: it has nothing to do with the RAM size; it depends entirely on the number of input splits.
How is the mapper count estimated for a Sqoop job retrieving data from an RDBMS into HDFS, based on input size? (Sqoop based)
Answer: by default the number of mappers for a Sqoop job is 4. You can change the default with the -m (1, 2, 3, 4, 5 ...) or --num-mappers parameter, but you have to make sure that either your table has a primary key or you use the --split-by parameter; otherwise only one mapper can run, and you have to say -m 1 explicitly.
What is meant by CPU cores, and how does it affect the number of mappers that can run in parallel? (General)
Answer: a CPU core is the processing unit that runs a task; a 4-core processor can run 4 tasks at a time. The number of cores plays no part in how the MapReduce framework calculates the number of mappers, but if there are 4 cores and MapReduce calculates 12 mappers, only 4 mappers will run in parallel at a time and the rest will queue behind them.

Hadoop: Does using CombineFileInputFormat for small files gives performance improvement?

I am new to Hadoop and performing some tests on my local machine.
There have been many solutions suggested for dealing with many small files. I am using CombinedInputFormat, which extends CombineFileInputFormat.
I see that the number of mappers has changed from 100 to 25 with CombinedInputFormat. Should I also expect a performance gain, since the number of mappers has been reduced?
I have performed the map-reduce job on many small files without CombinedInputFormat: 100 mappers took 10 minutes
But when the map-reduce job was executed with CombinedInputFormat: 25 mappers took 33 minutes.
Any help will be appreciated.
Hadoop performs better with a small number of large files, as opposed to a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block; "number" means ranging into the thousands.)
That means if you have 1000 files of 1 MB each, a MapReduce job based on the normal TextInputFormat will create 1000 map tasks, and each of these map tasks requires a certain amount of time to start and finish. This task-startup latency can reduce the performance of the job.
In a multi-tenant cluster with limited resources, obtaining a large number of map slots will also be difficult.
Please refer to this link for more details and benchmark results.
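If the goal is simply fewer, better-packed splits, the stock CombineTextInputFormat can be given a cap on the combined split size; without a cap, splits can grow very large and the job can degenerate into a handful of long-running mappers, which may be part of the 25 mappers / 33 minutes result. (The asker's CombinedInputFormat is their own subclass; this sketch uses the built-in class and an illustrative 128 MB cap.)
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// In the driver, after creating the Job:
job.setInputFormatClass(CombineTextInputFormat.class);
// Cap each combined split so one mapper does not swallow all the small files.
CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);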

Shuffle phase lasts too long Hadoop

I have an MR job whose shuffle phase lasts too long.
At first I thought it was because I was emitting a lot of data from the Mapper (around 5 GB). I addressed that by adding a Combiner, which emits less data to the Reducer, but the shuffle period did not shorten as I thought it would.
My next idea was to eliminate the Combiner by combining in the Mapper itself. I got that idea from here, where it says that data needs to be serialized/deserialized to use a Combiner. Unfortunately the shuffle phase is still the same length.
My only remaining thought is that it might be because I'm using a single Reducer, but that shouldn't be the case since I'm not emitting a lot of data when using the Combiner or combining in the Mapper.
Here are my stats:
Here are all the counters for my Hadoop (YARN) job:
I should also add that this runs on a small cluster of 4 machines. Each has 8 GB of RAM (2 GB reserved) and 12 virtual cores (2 reserved).
These are virtual machines. At first they were all on a single physical host, but then I split them 2 and 2 across two hosts, so where they were all sharing one HDD at first, there are now two machines per disk. They are connected by a gigabit network.
And here are more stats:
All of the memory is occupied.
The CPU is constantly under pressure while the job runs (the picture shows CPU for two consecutive runs of the same job).
My question is: why is the shuffle time so long, and how can I fix it? I also don't understand why there was no speedup even though I dramatically reduced the amount of data emitted from the Mapper.
A few observations:
For a job of 30 minutes, the GC time is too high (try reusing objects rather than creating new ones on each call to the map()/reduce() method; see the sketch after this list).
The average map time is far too high at 16 minutes. What are you doing in your map?
YARN memory is at 99%; this means you are running too many services on your HDP cluster and the RAM is not sufficient to support them all.
Please increase the YARN container memory; give it at least 1 GB.
This looks like a GC + overscheduled cluster problem.
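On the GC point above, here is a minimal sketch of what object reuse in a mapper looks like (a generic word-count-style mapper, not the asker's actual code): the writables are allocated once per task and refilled on every call, instead of creating new objects for every record.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReuseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Allocated once per task instead of once per record, which keeps
    // short-lived garbage (and therefore GC time) down in hot map() calls.
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);          // reuse the same Text instance
            context.write(word, one); // the framework serializes the value here
        }
    }
}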

Can hadoop map/reduce be speeded up by splitting data size?

Can I improve the performance of my Hadoop map/reduce job by splitting the input data into smaller chunks?
First question:
For example, I have a 1 GB input file for the map task and my default block size is 250 MB, so only 4 mappers will be assigned to the job. If I split the data into 10 pieces of 100 MB each, I get 10 mappers to do the work. But then each piece will occupy one block in storage, which means 150 MB will be wasted per block. What should I do in this case if I don't want to change the block size of my storage?
Second question: if splitting the input data before the map job can improve the map phase's performance, and I want to do the same for the reduce phase, should I ask the mapper to split the data before giving it to the reducer, or should I let the reducer do it?
Thank you very much. Please correct me if I have misunderstood something. Hadoop is quite new to me, so any help is appreciated.
When you change your block size to 100 MB, the 150 MB is not wasted; it remains available storage for the system.
Increasing the number of mappers does not necessarily increase performance, because it also depends on the number of datanodes you have. For example, 10 datanodes -> 10 mappers is a good fit, but with 4 datanodes -> 10 mappers, obviously not all mappers can run simultaneously. So if you have 4 data nodes, it is better to have 4 blocks (with a 250 MB block size).
The reducer is essentially a merge of all your mappers' output, and you can't ask the mapper to split the data. What you can do instead is ask the mapper to do a mini-reduce by defining a Combiner. A Combiner is nothing but a reducer that runs on the same node where the mapper executed, before the output is sent to the actual reducer, so I/O is minimized and so is the work of the actual reducer. Introducing a Combiner is a good option for improving performance.
Good luck with Hadoop !!
Multiple mappers can run in parallel on a node for the same job, based on the number of map slots available on that node. So yes, making smaller pieces of the input should give you more parallel mappers and speed up the process. (How to feed all the pieces in as a single input? Put all of them in one directory and add that directory as the input path.)
On the reducer side, if you are OK with combining multiple output files in post-processing, you can set a higher number of reducers; the maximum number of reducers running in parallel is the number of reduce slots available in your cluster. This should improve cluster utilisation and speed up the reduce phase.
If possible, you can also use a combiner to reduce disk and network I/O overhead.
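As a rough sketch of the combiner suggestion, assuming a word-count-style job whose reduce function is a sum (so reusing the reducer class as the combiner is safe); MyMapper is a placeholder for your own mapper, and IntSumReducer is Hadoop's built-in summing reducer:
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// In the driver, after creating the Job:
job.setMapperClass(MyMapper.class);          // placeholder; must emit (Text, IntWritable)
job.setCombinerClass(IntSumReducer.class);   // pre-aggregates map output on each node
job.setReducerClass(IntSumReducer.class);    // final aggregation after the shuffle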
