So I'm trying to run some Hadoop jobs on AWS r3.4xlarge machines. They have 16 vcores and 122 GB of RAM available.
Each of my mappers requires about 8 GB of RAM and one thread, so these machines are very nearly perfect for the job.
I have mapreduce.map.memory.mb set to 8192,
and mapreduce.map.java.opts set to -Xmx6144m.
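For reference, a sketch of how those two properties would sit in mapred-site.xml (the -Xmx value written in the usual megabyte form):

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>8192</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx6144m</value>
</property>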
This should result in approximately 14 mappers (in practice nearer to 12) running on each machine.
This is in fact the case for a 2 slave setup, where the scheduler shows 90 percent utilization of the cluster.
When scaling to, say, 4 slaves, however, it seems that Hadoop simply doesn't create more mappers. In fact it creates FEWER.
On my 2-slave setup I had just under 30 mappers running at any one time; on four slaves I had about 20. The machines were sitting at just under 50 percent utilization.
The vcores are there, the physical memory is there. What the heck is missing? Why is Hadoop not creating more containers?
So it turns out that this is one of those Hadoop things that never makes sense, no matter how hard you try to figure it out.
There is a setting in yarn-default.xml called yarn.nodemanager.heartbeat.interval-ms.
It is set to 1000. Apparently it controls the minimum period, in milliseconds, between container assignments.
This means only one new map task gets created per second, so the number of running containers is limited by that assignment rate multiplied by the time it takes a container to finish.
By setting this value to 50, or better yet 1, I was able to get the kind of scaling that is expected from a Hadoop cluster. This honestly should be documented better.
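A minimal sketch of the override in yarn-site.xml, using the property name given above (worth double-checking the exact name against the yarn-default.xml shipped with your Hadoop version):

<property>
  <name>yarn.nodemanager.heartbeat.interval-ms</name>
  <value>50</value>
</property>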
Related
How can I change/suggest a different allocation of containers to tasks in Hadoop? Regarding a native Hadoop (2.9.1) cluster on AWS.
I am running a native Hadoop cluster (2.9.1) on AWS (with EC2, not EMR) and I want the scheduling/allocation of the containers (mappers/reducers) to be more balanced than it currently is.
It seems like the RM is allocating the mappers in a bin-packing way (onto the nodes where the data resides), while the reducer allocation seems more balanced.
My setup includes three machines with a replication factor of three (all the data is on every machine), and I run my jobs with mapreduce.job.reduce.slowstart.completedmaps=0 in order to start the shuffle as fast as possible (it is important for me that all the containers run concurrently; it is a must condition).
In addition, according to the EC2 instances I have chosen and my settings of the YARN cluster, I can run at most 93 containers (31 each).
For example, if I want to have 9 reducers, then 93 - 9 - 1 = 83 containers could be left for the mappers, with one reserved for the AM.
I have played with the input split size (mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize) in order to find the right balance where all of the machines have the same "work" for the map phase.
But it seems like the first 31 mappers get allocated on one machine, the next 31 on the second, and the last 31 on the last machine. So I can try to use 87 mappers, with 31 of them on Machine #1, another 31 on Machine #2 and another 25 on Machine #3, leaving the rest for the reducers; but as Machine #1 and Machine #2 are then fully occupied, the reducers have to be placed on Machine #3. This way I get an almost balanced allocation of mappers at the expense of an unbalanced reducer allocation.
And this is not what I want...
# of mappers = input size / split size [bytes]
split size = max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.blocksize))
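As a quick sanity check with hypothetical numbers (the actual input size isn't stated here): for a 12 GB input with minsize = 1, maxsize = 256 MB and dfs.blocksize = 128 MB, the split size is max(1, min(256 MB, 128 MB)) = 128 MB, which gives 12 GB / 128 MB = 96 mappers; raising minsize above the block size is what pushes the mapper count down.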
I was using the default scheduler (Capacity), and by default yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments was set to -1 (infinity), which explains why every node that answered the RM first (via heartbeat) was "packing" in as many containers as it could.
To conclude, adding the above parameter to hadoop/etc/hadoop/capacity-scheduler.xml (using a third of the number of mappers results in balanced scheduling of the mappers), followed by yarn rmadmin -refreshQueues after restarting the RM, gives you the option to balance the container allocation in YARN.
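For illustration, with 87 mappers across the three nodes the entry in capacity-scheduler.xml would look something like this (29 being roughly a third of 87), followed by yarn rmadmin -refreshQueues as described above:

<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
  <value>29</value>
</property>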
For more details, please search my discussion here.
I have two Linux machines, both with different configurations:
Machine 1: 16 GB RAM, 4 Virtual Cores and 40 GB HDD (Master and Slave Machine)
Machine 2: 8 GB RAM, 2 Virtual Cores and 40 GB HDD (Slave machine)
I have set up a hadoop cluster between these two machines.
I am using Machine 1 as both master and slave.
And Machine 2 as slave.
I want to run my Spark application and utilise as many of the virtual cores and as much of the memory as possible, but I am unable to figure out what settings to use.
My Spark code looks something like:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SQLContext, SparkSession

conf = SparkConf().setAppName("Simple Application")
sc = SparkContext('spark://master:7077', conf=conf)  # standalone master URL
hc = HiveContext(sc)
sqlContext = SQLContext(sc)
# note: this asks for yarn-cluster even though the context above points at the standalone master
spark = SparkSession.builder.appName("SimpleApplication").master("yarn-cluster").getOrCreate()
So far, I have tried the following:
When I process my 2 GB file only on Machine 1 (in local mode, as a single-node cluster), it uses all 4 CPUs of the machine and completes in about 8 minutes.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
What number of executors, cores, memory do I need to set to maximize the usage of cluster?
I have referred to the articles below, but because my machines have different configurations, I am not sure which parameters would fit best.
Apache Spark: The number of cores vs. the number of executors
Any help will be greatly appreciated.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
It's not clear where your file is stored.
I see you're using Spark Standalone mode, so I'll assume it's not split on HDFS into about 16 blocks (given a block size of 128 MB).
In that scenario, your entire file will be processed at least once in whole, plus the overhead of shuffling that data across the network.
If you used YARN as the Spark master with HDFS as the FileSystem, and a splittable file format, then the computation would go "to the data", and you could expect quicker run times.
As far as optimal settings go, there are tradeoffs between cores, memory, and the number of executors, but there's no magic number for a particular workload, and you'll always be limited by the smallest node in the cluster. Keep in mind that the memory of the Spark driver and the other processes on the OS should be accounted for when calculating sizes.
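For instance, purely as a hypothetical starting point given the machines described above: with the smaller node having 2 virtual cores and 8 GB of RAM, you might begin with 2-core executors of roughly 5-6 GB each, leave headroom for the OS and the driver, and then measure and adjust from there.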
I want to ask: why does setting mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts in mapred-site.xml to values bigger than the defaults make my job slower?
But if I configure them too low, I get failed tasks. And under these conditions, I start to think my memory configuration in Hadoop is not really necessary...
Can you give me an explanation?
What might be happening in your environment is that when you increase the mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts settings towards their upper bound, it reduces the number of containers allowed to execute map/reduce tasks on each node, which ultimately slows down the overall job.
If you have 2 nodes, each with 25 GB of free RAM, and you configure mapreduce.map/reduce.memory.mb as 4 GB, then you get about 6 containers on each node, 12 in total. So you get the chance to run 12 mapper/reducer tasks in parallel.
If instead you configure mapreduce.map/reduce.memory.mb as 10 GB, then you get only 2 containers on each node, 4 containers in total, to execute your mapper/reducer tasks. The mapper/reducer tasks would then mostly run in sequence due to the lack of free containers, which delays the overall job completion time.
You should choose appropriate values for these settings by weighing the resources available against the amount of resources required by the map/reduce containers in your environment. Hope this makes sense.
You can allocate memory for map/reduce containers based on two factors:
available memory on each data node
total number of cores (vcores) you have
Try to create a number of containers equal to the number of cores you have on each data node (including hyper-threading).
For example, if you have 10 physical cores (20 cores including hyper-threading),
then the total number of containers you can plan for is 19 (leaving 1 core for other processes).
Assume you have X GB of RAM on each data node, then
leave some memory (say Y GB) for the heaps of other processes such as the DataNode, NodeManager, RegionServer, etc.
Now the memory available for YARN is X - Y = Z.
Memory for a map container = Z / number of containers per node
Memory for a reduce container = Z / (2 * number of containers per node)
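Plugging hypothetical numbers into the formulas above: with X = 64 GB per data node and Y = 8 GB reserved for the other daemons, you get Z = 64 - 8 = 56 GB for YARN; with 19 containers per node that is roughly 56 / 19 ≈ 3 GB for a map container and 56 / (2 × 19) ≈ 1.5 GB for a reduce container.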
I am trying to benchmark Hadoop on EC2. I am using a 1 GB file with 1 master and 5 slaves. When I varied dfs.blocksize (1m, 64m, 128m, 500m), I was expecting the best performance at 128m, since the file size is 1 GB and there are 5 slaves. But to my surprise, irrespective of the block size, the time taken falls more or less within the same range. How am I getting this weird performance?
A couple of things to think about, most likely explanation first:
Check you are correctly passing in the system variables that control the split size of the job; if you don't change these you won't alter the number of mappers (which you can check in the JobTracker UI). If you get the same number of mappers each time, you're not actually changing anything. To change the split size, use the system props mapred.min.split.size and mapred.max.split.size (see the sketch after these points).
Make sure you are really hitting the cluster and not accidentally running locally with 1 process.
Be aware that (unlike Spark) Hadoop has a horrifying job initialization time. IME it's around 20 seconds, and therefore for only 1 GB of data you're not really seeing much time difference, as the majority of the job is spent in initialization.
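To illustrate the first point, a minimal sketch of pinning the split size for a benchmark run by setting both properties explicitly in the job configuration (134217728 bytes = 128 MB; newer releases spell these mapreduce.input.fileinputformat.split.minsize/maxsize):

<property>
  <name>mapred.min.split.size</name>
  <value>134217728</value>
</property>
<property>
  <name>mapred.max.split.size</name>
  <value>134217728</value>
</property>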
I'm evaluating EC2/EMR for running a ~20 node Hadoop cluster (custom JAR cluster). I've run the simple WordCount example on a single-node 3.3 GHz, 2 GB RAM local VMware instance, which takes less than 10 seconds to complete. The WordCount example takes 3 minutes to complete on EMR with 2 c1.medium instances (excluding the startup time of 3-5 minutes). It takes the same time with 2 m1.small instances. There will be some overhead for running a job on EMR, and maybe this problem size is too small, so this seems understandable.
At about what size problems do you begin to see the performance advantage of the cloud? Or at about how many nodes or compute units?
If you're spinning up an EMR job, that essentially means you're asking Amazon to provide you an on-demand cluster of N machines, and the simple fact of provisioning and giving you these machines can easily take several minutes, not to mention that these machines need to be setup, can have bootstrap actions, and so on. I've rarely seen EMR jobs (even big ones) take more than 10 minutes to have the cluster ready, but I've also rarely seen a cluster be up in less than a couple minutes.
If you have a job that you're running frequently (for example every hour), then the cost of setting up and shutting down your EMR cluster might be too big; in this case it would be a good idea to create your cluster with some reserved instances on EC2. With reserved instances, you will have your own cluster always up and administered by you, so there is no time lost setting up/shutting down your cluster, and it behaves like a regular Hadoop cluster.
What I've been doing for the past couple of years is using an EC2 cluster on reserved instances that is always up, with all the jobs running on it; but for some jobs that are very large and couldn't fit on my cluster, I run them on EMR, where I can choose how many nodes I want, and since these are large jobs the time to set up/shut down the cluster is small in comparison to the total runtime. I would not recommend using EMR for small/frequent jobs.