Summary
When I run a simple select count(*) from table query in Hive, only two nodes in my large cluster are used for mapping. I would like to use the whole cluster.
Details
I am using a somewhat large cluster (tens of nodes, each with more than 200 GB of RAM) running HDFS and Hive 1.2.1 (IBM-12).
I have a table of several billion rows. When I perform a simple
select count(*) from mytable;
Hive creates hundreds of map tasks, but only 4 run simultaneously.
This means that my cluster is mostly idle during the query, which seems wasteful. I have tried SSHing into the nodes in use, and they are not fully utilizing CPU or memory. Our cluster is backed by InfiniBand networking and Isilon file storage, neither of which seems very loaded at all.
We are using MapReduce as the execution engine. I have tried removing any resource limits I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).
The memory settings are as follows:
yarn.nodemanager.resource.memory-mb 188928 MB
yarn.scheduler.minimum-allocation-mb 20992 MB
yarn.scheduler.maximum-allocation-mb 188928 MB
yarn.app.mapreduce.am.resource.mb 20992 MB
mapreduce.map.memory.mb 20992 MB
mapreduce.reduce.memory.mb 20992 MB
and we are running on 41 nodes. By my calculation, I should be able to run 41*188928/20992 = 369 map/reduce tasks. Instead I get 4.
Vcore settings:
yarn.nodemanager.resource.cpu-vcores 24
yarn.scheduler.minimum-allocation-vcores 1
yarn.scheduler.maximum-allocation-vcores 24
yarn.app.mapreduce.am.resource.cpu-vcores 1
mapreduce.map.cpu.vcores 1
mapreduce.reduce.cpu.vcores 1
Is there a way to get Hive/MapReduce to use more of my cluster?
How would I go about figuring out the bottleneck?
Could it be that YARN is not assigning tasks fast enough?
I guess that using Tez would improve performance, but I am still interested in why resource utilization is so limited (and we do not have it installed at the moment).
The number of parallel tasks you can run depends on your YARN memory settings.
For example, if you have 4 data nodes and your YARN memory properties are defined as below:
yarn.nodemanager.resource.memory-mb 1 GB
yarn.scheduler.minimum-allocation-mb 1 GB
yarn.scheduler.maximum-allocation-mb 1 GB
yarn.app.mapreduce.am.resource.mb 1 GB
mapreduce.map.memory.mb 1 GB
mapreduce.reduce.memory.mb 1 GB
According to these settings, with 4 data nodes the total yarn.nodemanager.resource.memory-mb across the cluster is 4 GB, which is what you can use to launch containers.
Since each container takes 1 GB of memory, you can launch at most 4 containers at any given time. One of them is used by the ApplicationMaster, so you can have at most 3 mapper or reducer tasks running at any given time, because the ApplicationMaster, each mapper, and each reducer all use 1 GB of memory.
So you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce tasks.
P.S. We are talking here about the maximum number of tasks that can be launched; the actual number may be somewhat lower.
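As a minimal sketch of that change (the 8 GB figure is purely illustrative, not a recommendation for any particular cluster), the property is raised in yarn-site.xml on every NodeManager:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- memory (MB) this NodeManager may hand out to containers; illustrative value -->
</property>
With 4 such nodes and 1 GB containers, that would allow roughly 4 * 8 = 32 containers instead of 4, one of which still goes to the ApplicationMaster.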
Related
I am trying to run a simple query. I am assuming that running queries with spark.sql("query") has no performance difference compared to the DataFrame API, since I am using Spark 2.1.0 and have the Catalyst optimizer taking care of the optimization, with Tungsten enabled.
Here I am joining 2 tables with a left outer join. My first table is 200 GB and is the driving table (being on the left side); the second table is 2 GB, and there can be no filters, as per our business requirements.
Configuration of my cluster: as this is a shared cluster, I have been assigned a specific queue which allows me to use 3 TB of memory (yes, 3 terabytes), but the number of vcores is 480. That means I can only run 480 parallel tasks. On top of that, at the YARN level I have a constraint of a maximum of 8 cores per node and a maximum of 16 GB of container memory. Because of this I cannot give my executor memory (which is per node) more than 12 GB, since I leave 3 GB as executor memory overhead to be on the safer side, which comes to 15 GB of per-node memory utilization.
So, from the 480 total allowed vcores with the 8-cores-per-node limit, I get 480/8 = 60 nodes for my computation, which comes to 60*15 = 900 GB of usable memory (I don't know why the total queue memory is set to 3 TB). And this is at peak, if I am the only one using the queue, which is not always the case.
Now the question is how Spark uses this 900 GB of memory. From the numbers and stats I would say that my job should run without any issues, as the data size I am trying to process is just 210-250 GB max and I have 900 GB of available memory.
But I keep getting "container killed" error messages, and I cannot increase the YARN container size because that is set at the YARN level and the whole cluster would get the increased container size, which is not the right thing to do. I have also tried setting the vmem-check-enabled property to false in my code via the SparkSession config, but that doesn't help either; maybe I am not allowed to change anything at the YARN level, so it might be ignoring that.
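For reference, that flag normally lives in yarn-site.xml on the NodeManagers rather than in per-job Spark configuration, which would explain why setting it from the SparkSession appears to be ignored. A sketch, assuming a cluster administrator were willing to change it:
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value> <!-- disables the NodeManager's virtual-memory check; a cluster-wide setting, not a per-job one -->
</property>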
Now, on what basis does Spark split the data initially? Is it based on the block size defined at the cluster level (assuming 128 MB)? I am asking because when my job starts, I see that my big table of around 200 GB gets about 2000 tasks, so on what basis does Spark calculate these 2000 tasks (partitions)? Looking at Input Size/Records and Shuffle Write Size/Records under the Stages tab of the Spark UI, I thought maybe the default partition size when Spark loads my table is quite big, and that this is why I am getting the container-killed error and the suggestion to increase executor memory overhead, which did not help either.
I tried repartitioning the data from 10k to 100k partitions and tried persisting with MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY, but nothing helped. Many of my tasks were failing, and in the end the job would fail, sometimes with container-killed errors, sometimes with direct-buffer errors, and others.
Now, what is the use of persist/caching here and how does it behave? I am doing:
val result = spark.sql("query big_table").repartition(10000, $"<column name>").persist()
The column in repartition is the joining key, so the data gets distributed. To make this work, before the join I call result.show(1), so the action is performed and the data gets persisted to disk; Spark will then read the persisted data from disk for the join, and there will be no load on memory as it is stored in small chunks on disk (am I correct here?).
Why does this same job in Hive, with the same big table plus some additional tables with left joins, complete? It takes time, but it finishes successfully, yet it fails in Spark. Why? Is Spark not a complete replacement for Hive? Doesn't Spark work like Hive when it comes to spilling to disk and writing data to disk when persisting to disk?
Does the YARN container size play a role if we have a small container size but a good number of nodes?
Does Spark combine the memory of all the available nodes (15 GB per node, as per the container size) to load a large partition?
I have to reduce the RAM size of my VirtualBox machine from 4 GB to 1 GB. I have tried reducing it, but it is unchangeable, so please suggest the right way to do it. I am attaching a screenshot.
The same error occurred when I tried this for Hadoop; you can use the following.
Configuring YARN
In a Hadoop cluster, it’s vital to balance the usage of RAM, CPU and disk so that processing is not constrained by any one of these cluster resources. As a general recommendation, we’ve found that allowing for 1-2 Containers per disk and per core gives the best balance for cluster utilization. So with our example cluster node with 12 disks and 12 cores, we will allow for 20 maximum Containers to be allocated to each node.
Each machine in our cluster has 48 GB of RAM. Some of this RAM should be reserved for Operating System usage. On each node, we’ll assign 40 GB RAM for YARN to use and keep 8 GB for the Operating System. The following property sets the maximum memory YARN can utilize on the node:
In yarn-site.xml
<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value>
The next step is to provide YARN guidance on how to break up the total resources available into Containers. You do this by specifying the minimum unit of RAM to allocate for a Container. We want to allow for a maximum of 20 Containers, and thus need (40 GB total RAM) / (20 # of Containers) = 2 GB minimum per container:
In yarn-site.xml
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
YARN will allocate Containers with RAM amounts no smaller than yarn.scheduler.minimum-allocation-mb.
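A related knob, not shown in the excerpt above, is the per-container ceiling. A sketch, assuming it is capped at the full 40 GB handed to YARN on each node:
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>40960</value> <!-- largest single container YARN will grant; assumed equal to the node's 40 GB YARN allocation -->
</property>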
For more information you can visit hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
So I have a Cloudera cluster with 7 worker nodes, each with:
30 GB RAM
4 vCPUs
Here are some of my configurations which I found (through Google) to be important for tuning the performance of my cluster. I am running with:
yarn.nodemanager.resource.cpu-vcores => 4
yarn.nodemanager.resource.memory-mb => 17GB (Rest reserved for OS and other processes)
mapreduce.map.memory.mb => 2GB
mapreduce.reduce.memory.mb => 2GB
Running nproc => 4 (Number of processing units available)
Now my concern is: when I look at my ResourceManager, I see the available memory as 119 GB, which is fine. But when I run a heavy Sqoop job and my cluster is at its peak, it uses only ~59 GB of memory, leaving ~60 GB unused.
One way I can see to fix this unused-memory issue is increasing map|reduce.memory.mb to 4 GB so that we can use up to 16 GB per node.
The other way is to increase the number of containers, but I am not sure how.
4 cores x 7 nodes = 28 possible containers; with 3 being used by other processes, only 25 are currently available for the Sqoop job.
What would be the right configuration to improve cluster performance in this case? Can I increase the number of containers, say 2 containers per core, and is that recommended?
Any help or suggestions on the cluster configuration would be highly appreciated. Thanks.
If your input data is in 26 splits, YARN will create 26 mappers to process those splits in parallel.
If you have 7 nodes with 2 GB mappers for 26 splits, the distribution should be something like:
Node1 : 4 mappers => 8 GB
Node2 : 4 mappers => 8 GB
Node3 : 4 mappers => 8 GB
Node4 : 4 mappers => 8 GB
Node5 : 4 mappers => 8 GB
Node6 : 3 mappers => 6 GB
Node7 : 3 mappers => 6 GB
Total : 26 mappers => 52 GB
So the total memory used by your MapReduce job, if all mappers run at the same time, will be 26x2 = 52 GB. If you add the memory used by the reducer(s) and the ApplicationMaster container, you could reach the ~59 GB you mentioned at some point.
If this is the behaviour you are witnessing, and the job is finished after those 26 mappers, then there is nothing wrong. You only need around 60 GB to complete your job by spreading tasks across all your nodes without needing to wait for container slots to free themselves. The other free 60 GB are just waiting around, because you don't need them. Increasing heap size just to use all the memory won't necessarily improve performance.
Edited:
However, if you still have lots of mappers waiting to be scheduled, then maybe it's because your installation is configured to calculate container allocation using vcores as well. This is not the default in Apache Hadoop, but it can be configured:
yarn.scheduler.capacity.resource-calculator:
The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default, i.e. org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, only uses Memory, while DominantResourceCalculator uses Dominant-resource to compare multi-dimensional resources such as Memory, CPU, etc. A Java ResourceCalculator class name is expected.
Since you defined yarn.nodemanager.resource.cpu-vcores to 4, and since each mapper uses 1 vcore by default, you can only run 4 mappers per node at a time.
In that case you can double your value of yarn.nodemanager.resource.cpu-vcores to 8. It's just an arbitrary value, but it should double the number of mappers.
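A minimal sketch of that change in yarn-site.xml (the value 8 is simply double the current 4, as suggested above; it only matters if the DominantResourceCalculator is in use):
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value> <!-- vcores this NodeManager advertises to YARN; doubled from 4 per the suggestion above -->
</property>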
I want to ask: why does configuring mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts in mapred-site.xml to values bigger than the defaults make my job slower?
But if I configure them too low, I get failed tasks. Given this, I start to think my memory configuration in Hadoop is unnecessary...
Can you give me an explanation?
What might be happening in your environment is that when you increase the mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts values towards their upper bound, you reduce the number of containers allowed to execute map/reduce tasks on every node, which in turn slows down the overall job.
If you have 2 nodes, each with 25 GB of free RAM, and you configure mapreduce.map/reduce.memory.mb as 4 GB, then you might get at least 6 containers on every node, 12 in total. So you would be able to run 12 mapper/reducer tasks in parallel.
If instead you configure mapreduce.map/reduce.memory.mb as 10 GB, then you might get only 2 containers on every node, 4 in total, to execute your mapper/reducer tasks. The mapper/reducer tasks would then mostly run in sequence due to the lack of free containers, which delays the overall job completion time.
You should choose an appropriate value for these configurations by considering the resources available and the amount of resources required by the map/reduce containers in your environment. Hope this makes sense.
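As a minimal sketch of keeping the two settings consistent (the 4096 MB container comes from the 4 GB example above; the roughly 80% heap-to-container ratio is a commonly cited rule of thumb, not something from the question):
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value> <!-- container size for each map task, per the 4 GB example above -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3276m</value> <!-- JVM heap kept at roughly 80% of the container so the task is not killed for exceeding its memory limit -->
</property>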
You can allocate memory for map/reduce containers based on two factors:
the available memory on each DataNode, and
the total number of cores (vcores) you have.
Try to create a number of containers equal to the number of cores you have in each data node (including hyper-threading).
For example, if you have 10 physical cores (20 cores including hyper-threading),
then the total number of containers you can plan for is 19 (leaving 1 core for other processes).
Assume you have X GB of RAM in each data node, and
leave some memory (say Y GB) for other processes' heaps, such as the DataNode, NodeManager, RegionServer, etc.
The memory available for YARN is then X - Y = Z.
Memory for a map container = Z / (number of containers per node)
Memory for a reduce container = Z / (2 * number of containers per node)
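To make the arithmetic concrete (X = 96 GB is an assumed figure, not taken from the question; the 19 containers come from the 10-core example above): with Y = 8 GB reserved, Z = 96 - 8 = 88 GB, so the first formula gives roughly 88 / 19 ≈ 4.6 GB per map container, which you would then round to a multiple of yarn.scheduler.minimum-allocation-mb.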
I currently have a pseudo-distributed Hadoop system running. The machine has 8 cores (16 virtual cores) and 32 GB of RAM.
My input files range from a few MB to ~68 MB (gzipped log files, which get uploaded to my server once they exceed 60 MB, hence there is no fixed maximum size). I want to run some Hive jobs on about 500-600 of those files.
Due to the varying input file sizes, I haven't changed the block size in Hadoop so far. As I understand it, the best-case scenario would be block size = input file size, but does a file smaller than the block size still take up a whole block? And how do the size and number of input files affect performance, as opposed to, say, one big ~40 GB file?
And what would the optimal configuration for this setup look like?
Based on this guide (http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/) I came up with this configuration:
32 GB of RAM, with 2 GB reserved for the OS, gives me 30720 MB that can be allocated to YARN containers.
yarn.nodemanager.resource.memory-mb=30720
With 8 cores I thought a maximum of 10 containers should be safe, so each container gets 30720 / 10 = 3072 MB of RAM.
yarn.scheduler.minimum-allocation-mb=3072
For map task containers I doubled the minimum container size, which allows for a maximum of 5 map tasks:
mapreduce.map.memory.mb=6144
And since I want a maximum of 3 reduce tasks, I allocate:
mapreduce.map.memory.mb=10240
With JVM heap size to fit into the containers:
mapreduce.map.java.opts=-Xmx5120m
mapreduce.reduce.java.opts=-Xmx9216m
Do you think this configuration would be good, or would you change anything, and why?
Yeah, this configuration is good, but there are a few changes I would like to mention.
For reducer memory, it should be
mapreduce.reduce.memory.mb=10240 (I think it's just a typo.)
Also, one major addition I would suggest is the CPU configuration.
You should set:
Container Virtual CPU Cores=15
For the reducers, as you are running only 3, you can give:
Reduce Task Virtual CPU Cores=5
And for the mappers:
Mapper Task Virtual CPU Cores=3
Number of containers that will run in parallel (for map or reduce) = min(total RAM / mapreduce.(map|reduce).memory.mb, total cores / (Map|Reduce) Task Virtual CPU Cores).
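Plugging in the numbers above: mappers = min(30720 / 6144, 15 / 3) = min(5, 5) = 5 parallel map containers, and reducers = min(30720 / 10240, 15 / 5) = min(3, 3) = 3 parallel reduce containers, which matches the 5 mappers and 3 reducers planned in the question.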
Please refer to http://openharsh.blogspot.in/2015/05/yarn-configuration.html for a detailed understanding.