Hadoop MapReduce2 Optimization in Heterogeneous Cluster - hadoop

I have this configuration:
Hadoop: v2.7.1 (Yarn)
An input file: Size = 100 GB.
3 Slaves: each has 4 VCORES with Speed = 2 GHz and RAM = 8 GB
5 Slaves: each has 2 VCORES with Speed = 1 GHz and RAM = 2 GB
MapReduce program: WordCount
How can I minimize WordCount execution time by assigning small input splits to the 5 slower slaves and big input splits to the 3 fastest slaves?

For each machine you can determine number of map/reduce slots, so if you want to send less workload to the slower machines you can define, for example 2 map/reduce task slots for each slower machine and 4 map/reduce task slot for each of the fast machines. This way you can control how much work load each different node in the cluster receives.

Related

How to increase hive concurrent mappers to more than 4?

Summary
When I run a simple select count(*) from table query in hive only two nodes in my large cluster are being used for mapping. I would like to use the whole cluster.
Details
I am using a somewhat large cluster (tens of nodes each more than 200 GB RAM) running hdfs and Hive 1.2.1 (IBM-12).
I have a table of several billion rows. When I perform a simple
select count(*) from mytable;
hive creates hundreds of map tasks, but only 4 are running simultaneously.
This means that my cluster is mostly idle during the query which seems wasteful. I have tried ssh'ing to the nodes in use and they are not utilizing CPU or memory fully. Our cluster is backed by Infiniband networking and Isilon file storage neither of which seems very loaded at all.
We are using mapreduce as the engine. I have tried removing any limits to resources that I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).
The memory settings are as follows:
yarn.nodemanager.resource.memory-mb 188928 MB
yarn.scheduler.minimum-allocation-mb 20992 MB
yarn.scheduler.maximum-allocation-mb 188928 MB
yarn.app.mapreduce.am.resource.mb 20992 MB
mapreduce.map.memory.mb 20992 MB
mapreduce.reduce.memory.mb 20992 MB
and we are running on 41 nodes. By my calculation I should be able to get 41*188928/20992 = 369 map/reduce tasks. Instead I get 4.
Vcore settings:
yarn.nodemanager.resource.cpu-vcores 24
yarn.scheduler.minimum-allocation-vcores 1
yarn.scheduler.maximum-allocation-vcores 24
yarn.app.mapreduce.am.resource.cpu-vcores 1
mapreduce.map.cpu.vcores 1
mapreduce.reduce.cpu.vcores 1
Is there are way to get hive/mapreduce to use more of my cluster?
How would a go about figuring out the bottle neck?
Could it be that Yarn is not assigning tasks fast enough?
I guess that using tez would improve performance, but I am still interested in why resources utilization is so limited (and we do not have it installed ATM).
Running parallel tasks depends on your memory setting in yarn
for example if you have 4 data nodes and your yarn memory properties are defined as below
yarn.nodemanager.resource.memory-mb 1 GB
yarn.scheduler.minimum-allocation-mb 1 GB
yarn.scheduler.maximum-allocation-mb 1 GB
yarn.app.mapreduce.am.resource.mb 1 GB
mapreduce.map.memory.mb 1 GB
mapreduce.reduce.memory.mb 1 GB
according to this setting you have 4 data nodes so total yarn.nodemanager.resource.memory-mb will be 4 GB that you can use to launch container
and since container can take 1 GB memory so it means at any given point of time you can launch 4 container , one will be used by application master so you can have maximum 3 mapper or reducer tasks can ran at any given point of time since application master,mapper and reducer each is using 1 GB memory
so you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce task
P.S. - Here we are taking about maximum tasks that can be launched,it may be some less than that also

Yarn: How to utilize full cluster resources?

So I am having a cloudera cluster with 7 worker nodes.
30GB RAM
4 vCPUs
Here are some of my configurations which I found important (from Google) in tuning performance of my cluster. I am running with:
yarn.nodemanager.resource.cpu-vcores => 4
yarn.nodemanager.resource.memory-mb => 17GB (Rest reserved for OS and other processes)
mapreduce.map.memory.mb => 2GB
mapreduce.reduce.memory.mb => 2GB
Running nproc => 4 (Number of processing units available)
Now my concern is, when I look at my ResourceManager, I see Available Memory as 119 GB which is fine. But when I run a heavy sqoop job and my cluster is at its peak it uses only ~59 GB of memory, leaving ~60 GB memory unused.
One way which I see, can fix this unused memory issue is increasing map|reduce.memory to 4 GB so that we can use upto 16 GB per node.
Other way is to increase the number of containers, which I am not sure how.
4 cores x 7 nodes = 28 possible containers. 3 being used by other processes, only 5 are currently being available for sqoop job.
What should be the right config to improve cluster performance in this case. Can I increase the number of containers, say 2 containers per core. And is it recommended?
Any help or suggestions on the cluster configuration would be highly appreciated. Thanks.
If your input data is in 26 splits, YARN will create 26 mappers to process those splits in parallel.
If you have 7 nodes with 2 GB mappers for 26 splits, the repartition should be something like:
Node1 : 4 mappers => 8 GB
Node2 : 4 mappers => 8 GB
Node3 : 4 mappers => 8 GB
Node4 : 4 mappers => 8 GB
Node5 : 4 mappers => 8 GB
Node6 : 3 mappers => 6 GB
Node7 : 3 mappers => 6 GB
Total : 26 mappers => 52 GB
So the total memory used in your map reduce job if all mappers are running at the same time will be 26x2=52 GB. Maybe if you add the memory user by the reducer(s) and the ApplicationMaster container, you can reach your 59 GB at some point, as you said ..
If this is the behaviour you are witnessing, and the job is finished after those 26 mappers, then there is nothing wrong. You only need around 60 GB to complete your job by spreading tasks across all your nodes without needing to wait for container slots to free themselves. The other free 60 GB are just waiting around, because you don't need them. Increasing heap size just to use all the memory won't necessarily improve performance.
Edited:
However, if you still have lots of mappers waiting to be scheduled, then maybe its because your installation insconfigured to calculate container allocation using vcores as well. This is not the default in Apache Hadoop but can be configured:
yarn.scheduler.capacity.resource-calculator :
The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. org.apache.hadoop.yarn.util.resource.DefaultResourseCalculator only uses Memory while DominantResourceCalculator uses Dominant-resource to compare multi-dimensional resources such as Memory, CPU etc. A Java ResourceCalculator class name is expected.
Since you defined yarn.nodemanager.resource.cpu-vcores to 4, and since each mapper uses 1 vcore by default, you can only run 4 mappers per node at a time.
In that case you can double your value of yarn.nodemanager.resource.cpu-vcores to 8. Its just an arbitrary value it should double the number of mappers.

Hadoop Running reducers in parallel

I have a 4G file with ~ 16 mill lines, maps are running distributed with 6 maps in parallel out of 15 maps. Generates 35000 keys. I am using MultipleTextoutput so each reducer generates a output independent of other reducer.
I have configured the conf with 25-50 reducers, but it always runs 1 reducer at a time.
Machine - 4 core 32 G ram single machine running hortonworks stack
How do I get more than 1 reduce task to run in parallel ?
Have a look hadoop MapReduce Tutorial
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by ( * ).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Have a look at related SE questions:
How hadoop decides how many nodes will do map and reduce tasks
What is Ideal number of reducers on Hadoop?
With specifying a lower reducer memory of 2 GB, the default in the mapred-site xml was 6GB, the framework brings up 3 reducers in parallel rather than 1.

Apache Spark tuning on distributed environment

I'd maximize Hadoop performance in a distributed environment (using Apache Spark with Yarn) and I'm following the hints on a blog post of Cloudera with this configuration:
6 nodes, 16 core/node, ram 64G/node
and the proposed solution is:
--num-executors 17 --executor-cores 5 --executor-memory 19G
But i didn't understand why they use 17 num executors (in other words 3 executors for each node).
Our configuration is instead:
8 nodes, 8 core/node, ram 8G/node
What is the best solution?
Your ram is pretty low. I would expect this to be higher.
But, we start off with 8 nodes, and 8 cores. To determine our max executors we do nodes*(cores-1) = 56. Minus 1 core from each node for management.
So I would start off with
56 executors, 1 executor core, 1G ram.
If you have out of memory issues, double the ram, have the executors, up the cores.
28 executors, 2 executor cores, 2G ram
but your max executors will be less, because an executor must fit onto a node. You will be able to get a total of 24 allocated containers max.
I would try 3 cores before 4 cores next, as 3 cores will fit 2 executors on each node, while with 4 cores you will have the same executors as 7.
Or, you can skip right to...
8 executors, 7 cores, 7gig ram(want to leave some for the rest of cluster).
I also found if CPU Scheduling was disabled, yarn was overriding my cores setting, and it was always staying at 1, no matter my config. Other settings must also be changed to turn this on.
yarn.schedular.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

Apache Spark: The number of cores vs. the number of executors

I'm trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN.
The test environment is as follows:
Number of data nodes: 3
Data node machine spec:
CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
RAM: 32GB (8GB x 4)
HDD: 8TB (2TB x 4)
Network: 1Gb
Spark version: 1.0.0
Hadoop version: 2.4.0 (Hortonworks HDP 2.1)
Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair -> reduceByKey -> map -> saveAsTextFile
Input data
Type: single text file
Size: 165GB
Number of lines: 454,568,833
Output
Number of lines after second filter: 310,640,717
Number of lines of the result file: 99,848,268
Size of the result file: 41GB
The job was run with following configurations:
--master yarn-client --executor-memory 19G --executor-cores 7 --num-executors 3 (executors per data node, use as much as cores)
--master yarn-client --executor-memory 19G --executor-cores 4 --num-executors 3 (# of cores reduced)
--master yarn-client --executor-memory 4G --executor-cores 2 --num-executors 12 (less core, more executor)
Elapsed times:
50 min 15 sec
55 min 48 sec
31 min 23 sec
To my surprise, (3) was much faster.
I thought that (1) would be faster, since there would be less inter-executor communication when shuffling.
Although # of cores of (1) is fewer than (3), #of cores is not the key factor since 2) did perform well.
(Followings were added after pwilmot's answer.)
For the information, the performance monitor screen capture is as follows:
Ganglia data node summary for (1) - job started at 04:37.
Ganglia data node summary for (3) - job started at 19:47. Please ignore the graph before that time.
The graph roughly divides into 2 sections:
First: from start to reduceByKey: CPU intensive, no network activity
Second: after reduceByKey: CPU lowers, network I/O is done.
As the graph shows, (1) can use as much CPU power as it was given. So, it might not be the problem of the number of the threads.
How to explain this result?
To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as
possible: Imagine a cluster with six nodes running NodeManagers, each
equipped with 16 cores and 64GB of memory. The NodeManager capacities,
yarn.nodemanager.resource.memory-mb and
yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 *
1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100%
of the resources to YARN containers because the node needs some
resources to run the OS and Hadoop daemons. In this case, we leave a
gigabyte and a core for these system processes. Cloudera Manager helps
by accounting for these and configuring these YARN properties
automatically.
The likely first impulse would be to use --num-executors 6
--executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won’t fit within the 63GB capacity
of the NodeManagers. The application master will take up a core on one
of the nodes, meaning that there won’t be room for a 15-core executor
on that node. 15 cores per executor can lead to bad HDFS I/O
throughput.
A better option would be to use --num-executors 17
--executor-cores 5 --executor-memory 19G. Why?
This config results in three executors on all nodes except for the one
with the AM, which will have two executors.
--executor-memory was derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 – 1.47 ~ 19.
The explanation was given in an article in Cloudera's blog, How-to: Tune Your Apache Spark Jobs (Part 2).
Short answer: I think tgbaggio is right. You hit HDFS throughput limits on your executors.
I think the answer here may be a little simpler than some of the recommendations here.
The clue for me is in the cluster network graph. For run 1 the utilization is steady at ~50 M bytes/s. For run 3 the steady utilization is doubled, around 100 M bytes/s.
From the cloudera blog post shared by DzOrd, you can see this important quote:
I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.
So, let's do a few calculations see what performance we expect if that is true.
Run 1: 19 GB, 7 cores, 3 executors
3 executors x 7 threads = 21 threads
with 7 cores per executor, we expect limited IO to HDFS (maxes out at ~5 cores)
effective throughput ~= 3 executors x 5 threads = 15 threads
Run 3: 4 GB, 2 cores, 12 executors
2 executors x 12 threads = 24 threads
2 cores per executor, so hdfs throughput is ok
effective throughput ~= 12 executors x 2 threads = 24 threads
If the job is 100% limited by concurrency (the number of threads). We would expect runtime to be perfectly inversely correlated with the number of threads.
ratio_num_threads = nthread_job1 / nthread_job3 = 15/24 = 0.625
inv_ratio_runtime = 1/(duration_job1 / duration_job3) = 1/(50/31) = 31/50 = 0.62
So ratio_num_threads ~= inv_ratio_runtime, and it looks like we are network limited.
This same effect explains the difference between Run 1 and Run 2.
Run 2: 19 GB, 4 cores, 3 executors
3 executors x 4 threads = 12 threads
with 4 cores per executor, ok IO to HDFS
effective throughput ~= 3 executors x 4 threads = 12 threads
Comparing the number of effective threads and the runtime:
ratio_num_threads = nthread_job2 / nthread_job1 = 12/15 = 0.8
inv_ratio_runtime = 1/(duration_job2 / duration_job1) = 1/(55/50) = 50/55 = 0.91
It's not as perfect as the last comparison, but we still see a similar drop in performance when we lose threads.
Now for the last bit: why is it the case that we get better performance with more threads, esp. more threads than the number of CPUs?
A good explanation of the difference between parallelism (what we get by dividing up data onto multiple CPUs) and concurrency (what we get when we use multiple threads to do work on a single CPU) is provided in this great post by Rob Pike: Concurrency is not parallelism.
The short explanation is that if a Spark job is interacting with a file system or network the CPU spends a lot of time waiting on communication with those interfaces and not spending a lot of time actually "doing work". By giving those CPUs more than 1 task to work on at a time, they are spending less time waiting and more time working, and you see better performance.
As you run your spark app on top of HDFS, according to Sandy Ryza
I’ve noticed that the HDFS client has trouble with tons of concurrent
threads. A rough guess is that at most five tasks per executor can
achieve full write throughput, so it’s good to keep the number of
cores per executor below that number.
So I believe that your first configuration is slower than third one is because of bad HDFS I/O throughput
From the excellent resources available at RStudio's Sparklyr package page:
SPARK DEFINITIONS:
It may be useful to provide some simple definitions
for the Spark nomenclature:
Node: A server
Worker Node: A server that is part of the cluster and are available to
run Spark jobs
Master Node: The server that coordinates the Worker nodes.
Executor: A sort of virtual machine inside a node. One Node can have
multiple Executors.
Driver Node: The Node that initiates the Spark session. Typically,
this will be the server where sparklyr is located.
Driver (Executor): The Driver Node will also show up in the Executor
list.
I haven't played with these settings myself so this is just speculation but if we think about this issue as normal cores and threads in a distributed system then in your cluster you can use up to 12 cores (4 * 3 machines) and 24 threads (8 * 3 machines). In your first two examples you are giving your job a fair number of cores (potential computation space) but the number of threads (jobs) to run on those cores is so limited that you aren't able to use much of the processing power allocated and thus the job is slower even though there is more computation resources allocated.
you mention that your concern was in the shuffle step - while it is nice to limit the overhead in the shuffle step it is generally much more important to utilize the parallelization of the cluster. Think about the extreme case - a single threaded program with zero shuffle.
I think one of the major reasons is locality. Your input file size is 165G, the file's related blocks certainly distributed over multiple DataNodes, more executors can avoid network copy.
Try to set executor num equal blocks count, i think can be faster.
Spark Dynamic allocation gives flexibility and allocates resources dynamically. In this number of min and max executors can be given. Also the number of executors that has to be launched at the starting of the application can also be given.
Read below on the same:
http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
There is a small issue in the First two configurations i think. The concepts of threads and cores like follows. The concept of threading is if the cores are ideal then use that core to process the data. So the memory is not fully utilized in first two cases. If you want to bench mark this example choose the machines which has more than 10 cores on each machine. Then do the bench mark.
But dont give more than 5 cores per executor there will be bottle neck on i/o performance.
So the best machines to do this bench marking might be data nodes which have 10 cores.
Data node machine spec:
CPU: Core i7-4790 (# of cores: 10, # of threads: 20)
RAM: 32GB (8GB x 4)
HDD: 8TB (2TB x 4)
In the 2.) configuration you're reducing the parallel tasks and thus I believe your comparison isn't fair.
Make the --num-executors to atleast 5.
Thus, you will have 20 tasks running in comparison to your 21 tasks in 1.) configuration.
Then, the comparison will be fair as per me.
Also, please calculate the executor memory accordingly.

Resources