I have a Sun Grid Engine cluster on AWS EC2 that I set up using StarCluster. Each node has 4 processors and 16G of RAM. I would like to submit a task array that dispatches 2 jobs at a time, each using a full node (all 4 processors and 16G of RAM). However, I don't want to create a parallel environment with flags like -pe smp 4, because empirically that reduces performance substantially. Is there a qsub flag that says something like "submit this job to a node that has 16G of memory that hasn't been allocated to any other job"? The flags I'm aware of are:
-l mem_free=16g - submit the job to a node only if it has 16g free at the moment
-l h_vmem=16g - kill the job if its memory usage goes above 16g
Neither of these works for my problem. With mem_free=16g, because the jobs use memory slowly at first, qsub schedules all of the tasks onto the 2 nodes, and then they all run out of memory at the same time.
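For example, an array submitted like this (the array range and script name are just placeholders) ends up with all of its tasks packed onto those 2 nodes:

qsub -t 1-20 -l mem_free=16g ./run_task.sh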
I do that with a manual variable. Here is the StarCluster code for it.
So basically it creates a variable "da_mem_gb". Each machine has an initial value for it equal to its RAM. The jobs then request how much RAM they need via that variable. If a job needs all the RAM of a machine, then only one such job is assigned to that machine at a time.
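The mechanism described is Grid Engine's consumable-resource feature. A rough sketch of setting it up by hand (this is not the StarCluster code itself; the host name, array range and script name are placeholders, and exact qconf syntax can vary between Grid Engine versions):

# 1. Define da_mem_gb as a consumable complex (qconf -mc opens the complex
#    list in an editor; add a line like this one):
#    da_mem_gb    dmg    INT    <=    YES    YES    0    0
# 2. Give each execution host an initial value equal to its RAM in GB:
qconf -rattr exechost complex_values da_mem_gb=16 node001
# 3. Have each task request all of it, so only one such job runs per node:
qsub -t 1-20 -l da_mem_gb=16 ./run_task.sh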
I'm trying to squeeze every last bit out of my cluster when configuring the Spark application, but it seems I'm not understanding everything completely right. I'm running the application on an AWS EMR cluster with 1 master and 2 core nodes of type m3.xlarge (15G of RAM and 4 vCPUs on every node). This means that by default 11.25 GB are reserved on every node for applications scheduled by YARN. The master node is used only by the resource manager (YARN), which means the remaining 2 core nodes are used to schedule applications (so we have 22.5G for that purpose). So far so good. But here comes the part I don't get. I'm starting the Spark application with the following parameters:
--driver-memory 4G --num-executors 4 --executor-cores 7 --executor-memory 4G
My understanding (from what I've found as information) is that 4G will be allocated for the driver and 4 executors will be launched with 4G each. A rough estimate makes it 5*4=20G (let's call it 21G with the expected memory overhead), which should be fine, as we have 22.5G for applications. Here's a screenshot from the Hadoop YARN UI after the launch:
What we can see is that 17.63 GB are used by the application, which is a little less than the expected ~21G, and this triggers the first question: what happened here?
Then I go to the Spark UI's executors page. Here comes the bigger question:
There are 3 executors (not 4), and the memory allocated to each of them and to the driver is 2.1G (not the specified 4G). So Hadoop YARN says 17.63G are used, but Spark says 8.4G are allocated. What is happening here? Is this related to the Capacity Scheduler (I couldn't reach that conclusion from the documentation)?
Can you check whether spark.dynamicAllocation.enabled is turned on? If it is, your Spark application may give resources back to the cluster when they are no longer used. The number of executors launched at startup is decided by spark.executor.instances.
If that is not the case, what is the data source for your Spark application, and what partition size is set for it? Spark literally maps partitions onto Spark cores: if your source has only 10 partitions and you try to allocate 15 cores, it will only use 10 cores, because that is all it needs. I guess this might be why Spark launched 3 executors instead of 4. Regarding memory, I would recommend revisiting your numbers: you are asking for 4 executors and 1 driver with 4GB each, which is roughly 5*4GB + 5*384MB, approximately 22GB. You are trying to use up everything, and not much is left for the OS and the NodeManager to run, which is not the ideal way to do it.
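If it helps to rule things out, here is a sketch of a submission that pins those knobs explicitly. The class and jar names are placeholders, and the sizing is only an assumption for two m3.xlarge core nodes with ~11.25 GB usable each: every 4G container really asks YARN for roughly 4G plus the ~384MB overhead noted above, so three executors plus the driver is about the most that fits.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --driver-memory 4G \
  --num-executors 3 \
  --executor-cores 2 \
  --executor-memory 4G \
  --class com.example.MyApp \
  app.jar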
I have an environment with 4 physical nodes, each with a small amount of RAM and 8 CPU cores.
I noticed that Spark automatically decides to split the RAM across the CPUs, and the result is that a memory error occurs.
I'm working with big data structures, and I want each executor to have the entire RAM of its physical node (otherwise I'll get a memory error).
I tried configuring 'yarn.nodemanager.resource.cpu-vcores 1' in the 'yarn-site.xml' file and 'spark.driver.cores 1' in spark-defaults.conf, without any success.
Try setting spark.executor.cores to 1.
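A minimal sketch of what that could look like on YARN (the memory value and the class/jar names are placeholders; pick spark.executor.memory close to what is actually free on one physical node, so that only a single executor fits per node):

spark-submit \
  --master yarn \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=6g \
  --class com.example.MyApp \
  app.jar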
On Hadoop YARN, if I have more containers available to run map or reduce tasks, will a job be processed faster?
So if that's true, when I make the container allocation memory smaller than the default, I can get more containers running on each host and make the job faster.
And what about vcores? I mean, if we have more containers to run, will they still run one by one according to the vcore allocation? In other words, whether there are many containers or few, do they still run one by one?
No, tasks can run in parallel.
Let's consider a YARN cluster with 24 cores and 96 GB of memory.
The default value of mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores is 1.
So, with 4 GB of memory per container, you can launch 24 containers (96 GB / 4 GB, and 24 cores / 1 vcore), and they can run in parallel. If your job needs more than 24 containers, the first 24 tasks are launched initially and subsequent tasks are launched as soon as the required resources (containers) become available.
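For illustration, if that capacity came from 4 NodeManagers each exposing 6 vcores and 24 GB (a hypothetical split), the relevant settings would look something like this, with the node capacity in yarn-site.xml and the per-task requests in mapred-site.xml:

<property><name>yarn.nodemanager.resource.memory-mb</name><value>24576</value></property>
<property><name>yarn.nodemanager.resource.cpu-vcores</name><value>6</value></property>
<property><name>mapreduce.map.memory.mb</name><value>4096</value></property>
<property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>
<property><name>mapreduce.map.cpu.vcores</name><value>1</value></property>
<property><name>mapreduce.reduce.cpu.vcores</name><value>1</value></property>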
I want to ask: why does setting mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts in mapred-site.xml to values bigger than the defaults make my job slower?
But if I configure them too low, then tasks fail. Under these conditions, my memory configuration in Hadoop seems pointless...
Can you give me an explanation?
What might be happening in your environment is that when you increase the values of the mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts configurations towards the upper bound, you reduce the number of containers allowed to execute map/reduce tasks on every node, which eventually slows down the overall job.
If you have 2 nodes, each with 25 GB of free RAM, and you configure mapreduce.map/reduce.memory.mb as 4 GB, then you might get at least 6 containers on every node, 12 in total. So you would be able to run 12 mapper/reducer tasks in parallel.
If instead you configure mapreduce.map/reduce.memory.mb as 10 GB, then you might get only 2 containers on every node, 4 in total, to execute your mapper/reducer tasks. So the mapper/reducer tasks would mostly run in sequence due to the lack of free containers, which delays the overall job completion time.
You should choose an appropriate value for these configurations by considering the resources available and the amount of resources the map/reduce containers need in your environment. Hope this makes sense.
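As a sketch of the 4 GB case above (values are only illustrative; a common rule of thumb is to set the java.opts heap to roughly 80% of the container size so the JVM fits inside its container):

<property><name>mapreduce.map.memory.mb</name><value>4096</value></property>
<property><name>mapreduce.map.java.opts</name><value>-Xmx3276m</value></property>
<property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>
<property><name>mapreduce.reduce.java.opts</name><value>-Xmx3276m</value></property>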
You can allocate memory for map/reduce containers based on two factors:
the available memory on each DataNode
the total number of cores (vcores) you have.
Try to create a number of containers equal to the number of cores you have in each DataNode (including hyper-threading).
For example, if you have 10 physical cores (20 cores including hyper-threading),
the total number of containers you can plan for is 19 (leaving 1 core for other processes).
Assume you have X GB of RAM in each DataNode. Then:
leave some memory (say Y GB) for the heaps of other processes such as the DataNode, NodeManager, RegionServer, etc.
Now the memory available for YARN is X - Y = Z.
Memory for a map container = Z / number of containers per node
Memory for a reduce container = Z / (2 * number of containers per node)
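For concreteness, plugging hypothetical numbers into the formulas above, say X = 48 GB per DataNode, Y = 8 GB reserved for the daemons, and 19 containers planned per node:

Z = X - Y = 48 - 8 = 40 GB available for YARN
Memory for a map container = 40 / 19 ≈ 2 GB (e.g. mapreduce.map.memory.mb = 2048)
Memory for a reduce container = 40 / (2 * 19) ≈ 1 GB (e.g. mapreduce.reduce.memory.mb = 1024)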
Say I have an EMR job running on an 11-node cluster: an m1.small master node and 10 m1.xlarge slave nodes.
Each m1.xlarge node has 15 GB of RAM.
How do I then decide on the number of parallel mappers and reducers to set?
My jobs are memory intensive and I would like as much heap as possible allotted to each JVM.
Another related question:
If we set the following parameters:
<property><name>mapred.child.java.opts</name><value>-Xmx4096m</value></property>
<property><name>mapred.job.reuse.jvm.num.tasks</name><value>1</value></property>
<property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property>
So will this 4GB be shared by the 4 processes (2 mappers and 2 reducers), or will they each get 4GB?
They will each get 4GB.
You should check what your heap settings are for the TaskTrackers and the DataNodes; then you'll have an idea of how much memory is left over to allocate to the children (the actual mappers/reducers).
Then it's just a balancing act. If you need more memory, you'll want fewer mappers/reducers, and vice versa.
Also keep in mind how many cores your CPU has; you don't want 100 map tasks on a single core. To tune this, it's best to monitor both heap usage and CPU utilization over time so you can fiddle with the knobs.
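As one illustrative balance for an m1.xlarge slave (15 GB of RAM, 4 virtual cores), assuming the TaskTracker, DataNode and OS together need roughly 3 GB, four child JVMs at 3 GB each would use about 12 GB:

<property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property>
<property><name>mapred.child.java.opts</name><value>-Xmx3072m</value></property>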