Yarn and MapReduce resource configuration - hadoop

I currently have a pseudo-distributed Hadoop system running. The machine has 8 cores (16 virtual cores) and 32 GB of RAM.
My input files range from a few MB to ~68 MB (gzipped log files, which get uploaded to my server once they reach >60 MB, hence no fixed maximum size). I want to run some Hive jobs on about 500-600 of those files.
Due to the incongruent input file sizes, I haven't changed the block size in Hadoop so far. As I understand it, the best-case scenario would be block size = input file size, but will Hadoop fill that block until it's full if the file is smaller than the block size? And how do the size and number of input files affect performance, as opposed to, say, one big ~40 GB file?
And what would the optimal configuration for this setup look like?
Based on this guide (http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/) I came up with the following configuration:
32 GB of RAM, with 2 GB reserved for the OS, gives me 30720 MB that can be allocated to YARN containers.
yarn.nodemanager.resource.memory-mb=30720
With 8 cores I figured a maximum of 10 containers should be safe, so each container gets (30720 / 10 =) 3072 MB of RAM.
yarn.scheduler.minimum-allocation-mb=3072
For map task containers I doubled the minimum container size, which would allow for a maximum of 5 map tasks:
mapreduce.map.memory.mb=6144
And if I want a maximum of 3 reduce tasks, I allocate:
mapreduce.map.memory.mb=10240
With the JVM heap sizes set to fit into the containers:
mapreduce.map.java.opts=-Xmx5120m
mapreduce.reduce.java.opts=-Xmx9216m
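In *-site.xml terms, this is roughly what I have in mind (the 10240 MB value is intended for the reduce containers):
<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>30720</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>3072</value>
</property>
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>6144</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>10240</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx5120m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx9216m</value>
</property>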
Do you think this configuration would be good, or would you change anything, and why?

Yes, this configuration is good, but there are a few changes I would like to mention.
For reducer memory, it should be
mapreduce.reduce.memory.mb=10240 (I think it's just a typo in your question).
One other major addition I would suggest is the CPU configuration.
You should set
Container Virtual CPU Cores=15
For the reducers, as you are running only 3 of them, you can give
Reduce Task Virtual CPU Cores=5
And for the mappers
Mapper Task Virtual CPU Cores=3
Number of containers that will run in parallel for (reducer OR mapper) = min(total RAM / mapreduce.(reduce OR map).memory.mb, total cores / (Map OR Reduce) Task Virtual CPU Cores).
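Plugging in your numbers as a quick sanity check:
parallel reduce containers = min(30720 / 10240, 15 / 5) = min(3, 3) = 3
parallel map containers = min(30720 / 6144, 15 / 3) = min(5, 5) = 5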
Please refer to http://openharsh.blogspot.in/2015/05/yarn-configuration.html for a detailed explanation.
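In plain *-site.xml terms, the settings named above would most likely correspond to the following properties (assuming the names above are management-UI labels):
<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>  <!-- "Container Virtual CPU Cores" -->
  <value>15</value>
</property>
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.map.cpu.vcores</name>  <!-- "Map Task Virtual CPU Cores" -->
  <value>3</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>  <!-- "Reduce Task Virtual CPU Cores" -->
  <value>5</value>
</property>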

Related

Using hadoop cluster with different machine configuration

I have two Linux machines, each with a different configuration:
Machine 1: 16 GB RAM, 4 Virtual Cores and 40 GB HDD (Master and Slave Machine)
Machine 2: 8 GB RAM, 2 Virtual Cores and 40 GB HDD (Slave machine)
I have set up a hadoop cluster between these two machines.
I am using Machine 1 as both master and slave, and Machine 2 as a slave only.
I want to run my Spark application and utilise as many of the virtual cores and as much of the memory as possible, but I am unable to figure out the right settings.
My Spark code looks something like this:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext, SparkSession

conf = SparkConf().setAppName("Simple Application")
sc = SparkContext('spark://master:7077')
hc = HiveContext(sc)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.appName("SimpleApplication").master("yarn-cluster").getOrCreate()
So far, I have tried the following:
When I process my 2 GB file only on Machine 1 (in local mode as a single-node cluster), it uses all 4 CPUs of the machine and completes in about 8 minutes.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
What number of executors and cores, and how much memory, do I need to set to maximize the usage of the cluster?
I have referred to the article below, but because I have different machine configurations in my case, I am not sure what parameters would fit best.
Apache Spark: The number of cores vs. the number of executors
Any help will be greatly appreciated.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
It's not clear where your file is stored.
I see you're using Spark Standalone mode, so I'll assume the file is not split on HDFS into about 16 blocks (given a block size of 128 MB).
In that scenario, your entire file will be read at least once in whole, plus the overhead of shuffling that data across the network.
If you used YARN as the Spark master with HDFS as the FileSystem, and a splittable file format, then the computation would go "to the data", and you could expect quicker run times.
As far as optimal settings go, there are tradeoffs between cores & memory per executor and the number of executors, but there's no magic number for a particular workload, and you'll always be limited by the smallest node in the cluster. Keep in mind that the memory for the Spark driver and the other processes on the OS should also be accounted for when calculating sizes.
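As a rough sketch of what that could look like when submitting to YARN (the numbers are only illustrative, sized so an executor fits on the smaller 8 GB / 2-core node, and the HDFS path is hypothetical):
from pyspark.sql import SparkSession

# Illustrative sizing only: keep each executor small enough to fit on the
# weakest node (8 GB RAM, 2 vcores), leaving headroom for the OS, the driver
# and YARN's per-executor memory overhead.
spark = (SparkSession.builder
         .appName("SimpleApplication")
         .master("yarn")
         .config("spark.executor.instances", "2")  # e.g. one executor per node
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "3g")
         .config("spark.driver.memory", "2g")
         .getOrCreate())

df = spark.read.text("hdfs:///data/my_2gb_file")   # hypothetical HDFS path
print(df.count())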

How to increase hive concurrent mappers to more than 4?

Summary
When I run a simple select count(*) from table query in Hive, only two nodes in my large cluster are used for mapping. I would like to use the whole cluster.
Details
I am using a somewhat large cluster (tens of nodes, each with more than 200 GB of RAM) running HDFS and Hive 1.2.1 (IBM-12).
I have a table of several billion rows. When I perform a simple
select count(*) from mytable;
Hive creates hundreds of map tasks, but only 4 run simultaneously.
This means that my cluster is mostly idle during the query, which seems wasteful. I have tried ssh'ing to the nodes in use, and they are not utilizing CPU or memory fully. Our cluster is backed by InfiniBand networking and Isilon file storage, neither of which seems very loaded at all.
We are using MapReduce as the engine. I have tried removing any limits on resources that I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).
The memory settings are as follows:
yarn.nodemanager.resource.memory-mb 188928 MB
yarn.scheduler.minimum-allocation-mb 20992 MB
yarn.scheduler.maximum-allocation-mb 188928 MB
yarn.app.mapreduce.am.resource.mb 20992 MB
mapreduce.map.memory.mb 20992 MB
mapreduce.reduce.memory.mb 20992 MB
and we are running on 41 nodes. By my calculation I should be able to get 41*188928/20992 = 369 map/reduce tasks. Instead I get 4.
Vcore settings:
yarn.nodemanager.resource.cpu-vcores 24
yarn.scheduler.minimum-allocation-vcores 1
yarn.scheduler.maximum-allocation-vcores 24
yarn.app.mapreduce.am.resource.cpu-vcores 1
mapreduce.map.cpu.vcores 1
mapreduce.reduce.cpu.vcores 1
Is there a way to get Hive/MapReduce to use more of my cluster?
How would I go about figuring out the bottleneck?
Could it be that YARN is not assigning tasks fast enough?
I guess that using Tez would improve performance, but I am still interested in why resource utilization is so limited (and we do not have it installed at the moment).
How many tasks run in parallel depends on your memory settings in YARN.
For example, if you have 4 data nodes and your YARN memory properties are defined as below:
yarn.nodemanager.resource.memory-mb 1 GB
yarn.scheduler.minimum-allocation-mb 1 GB
yarn.scheduler.maximum-allocation-mb 1 GB
yarn.app.mapreduce.am.resource.mb 1 GB
mapreduce.map.memory.mb 1 GB
mapreduce.reduce.memory.mb 1 GB
According to these settings you have 4 data nodes, so the total yarn.nodemanager.resource.memory-mb across the cluster is 4 GB that you can use to launch containers.
Since each container takes 1 GB of memory, at any given point in time you can launch 4 containers. One will be used by the application master, so you can have at most 3 mapper or reducer tasks running at any given point in time, since the application master, the mappers and the reducers each use 1 GB of memory.
So you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce tasks.
P.S. - Here we are talking about the maximum number of tasks that can be launched; it may also end up being somewhat less than that.
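For example, sticking with the illustrative 1 GB task size above, raising the NodeManager memory in yarn-site.xml gives each node room for more containers (the value here is purely illustrative):
<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>  <!-- 8 GB per node / 1 GB per container = up to 8 containers per node -->
</property>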

Why more memory on hadoop map task make mapreduce job slower?

Why does configuring mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts in mapred-site.xml to bigger values than the defaults make my job slower?
But if I configure them too low, I get failed tasks. And it makes me think that, in this situation, tuning Hadoop's memory configuration is pointless...
Can you give me an explanation?
What might be happening in your environment is that when you increase the values of the mapreduce.map/reduce.memory.mb and mapreduce.map/reduce.java.opts configurations towards their upper bound, it reduces the number of containers allowed to execute map/reduce tasks on each node, which eventually causes slowness in the overall job time.
If you have 2 nodes, each with 25 GB of free RAM, and you configure mapreduce.map/reduce.memory.mb as 4 GB, then you might get at least 6 containers on every node, 12 in total. So you would get the chance to run 12 mapper/reducer tasks in parallel.
If you configure mapreduce.map/reduce.memory.mb as 10 GB, then you might get only 2 containers on every node, 4 containers in total, to execute your mapper/reducer tasks in parallel. So the mapper/reducer tasks would mostly run in sequence due to the lack of free containers, causing a delay in the overall job completion time.
You should choose appropriate values for these configurations by considering the resources available and the amount of resources required by the map/reduce containers in your environment. Hope this makes sense.
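As a reference point (illustrative values only, not a recommendation for any specific cluster), the container size and the JVM heap are usually tuned together in mapred-site.xml, with the heap (-Xmx) at roughly 80% of the container size so the container memory isn't wasted:
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>  <!-- ~80% of the 2048 MB map container -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>  <!-- ~80% of the 4096 MB reduce container -->
</property>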
You can allocate memory for map/reduce containers based on two factors:
the available memory per data node
the total number of cores (vcores) you have
Try to create a number of containers equivalent to the number of cores you have in each data node (including hyper-threading).
For example, if you have 10 physical cores (20 cores including hyper-threading),
the total number of containers you can plan for is 19 (leaving 1 core for other processes).
Assume that you have 'X' GB of RAM in each data node; then
leave some memory (assume Y GB) for other processes (heap) like the Datanode, NodeManager, RegionServer, etc.
Now the memory available for YARN is X - Y = Z
Memory for a map container = Z / number of containers per node
Memory for a reduce container = Z / (2 * number of containers per node)
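A quick worked example with purely hypothetical numbers, following the formulas above: say X = 96 GB of RAM per data node, Y = 16 GB reserved for the other daemons, and 19 containers per node (10 physical / 20 logical cores):
Z = 96 - 16 = 80 GB available for YARN
memory per map container = 80 GB / 19 ≈ 4 GB
memory per reduce container = 80 GB / (2 * 19) ≈ 2 GB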

Swapping of the RAM while Shuffling in HADOOP

I work with Hadoop 1.1.1. My project processes more than 6000 documents. My cluster contains 2 nodes: a master (CPU: Core i7, RAM: 6 GB) and a slave (CPU: Core i3, RAM: 12 GB). The number of mappers is 16. When I set the number of reducers to more than 1 (i.e. 2, ..., 16), the RAM begins to swap during the shuffle phase, and this causes a significant slowdown of my system.
How can I stop the RAM from swapping?
What is kept in RAM in the process between MAP and REDUCE?
Is there any reference?
Thanks a lot.
So on the master:
6G physical RAM;
2G allocated per process;
8 mappers and 8 reducers can run concurrently;
8x2 + 8x2 = 32 GB of memory required if all tasks are maxed out - over 5x your physical amount.
On the slave:
12G physical RAM;
2G allocated per task;
4 mappers, 4 reducers;
4x2 + 4x2 = 16 GB of memory required - about a third more than physical.
Now if you're only running a single job at a time, you can set the slowstart configuration property to 1.0 to ensure that the mappers and reducers don't run concurrently; that will help, but you're still maxed out on the master.
I suggest you either reduce the memory allocation to 1 GB (if you really want that many map/reduce slots on each node), or reduce the maximum number of tasks for both nodes, so that you're closer to the physical amount (if running maxed out).
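In Hadoop 1.x (which 1.1.1 is), these knobs live in each node's mapred-site.xml; a sketch combining the suggestions above, with illustrative values (the 6 GB master would want smaller slot counts than the 12 GB slave):
<!-- mapred-site.xml (Hadoop 1.x), illustrative values -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>  <!-- 1 GB per map/reduce task instead of 2 GB -->
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>  <!-- concurrent map slots on this TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>  <!-- concurrent reduce slots on this TaskTracker -->
</property>
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>1.0</value>  <!-- don't start reducers until all maps have finished -->
</property>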

HDFS sequence file performance tuning

I'm trying to use Hadoop to process a lot of small files which are stored in sequence files. My program is highly IO bound, so I want to make sure that IO throughput is high enough.
I wrote an MR program that reads the small sample files from the sequence files and writes them to a RAM disk (/dev/shm/test/). Another standalone program deletes the files written to the RAM disk without doing any computation, so the test should be almost purely IO bound. However, the IO throughput is not as good as I expected.
I have 5 datanodes and each datanode has 5 data disks. Each disk can provide about 100 MB/s of throughput. Theoretically this cluster should be able to provide 100 MB/s * 5 (disks) * 5 (machines) = 2500 MB/s. However, I only get about 600 MB/s. I ran "iostat -d -x 1" on the 5 machines and found that the IO load is not well balanced. Usually only a few of the disks are at 100% utilization, some disks have very low utilization (10% or less), and some machines even have no IO load at times. Here's the screenshot. (Of course the load on each disk/machine varies quickly.)
Here's another screenshot that shows CPU usage from the "top -cd1" command:
Here are some more detailed configs for my case:
Hadoop cluster hardware: 5 Dell R620 machines, each equipped with 128 GB of RAM and 32 CPU cores (actually 2x Xeon E5-2650). 2 HDDs form a RAID 1 volume for CentOS, and 5 data disks are used for HDFS, so you can see 6 disks in the above screenshot.
Hadoop settings: block size 128 MB; datanode handler count is 8; 15 maps per TaskTracker; 2 GB MapReduce child heap.
Testing file set: about 400,000 small files, 320 GB in total, stored in 160 sequence files; each seq file is about 2 GB. I tried storing the files in seq files of various sizes (1 GB, 512 MB, 256 MB, 128 MB), but the performance didn't change much.
I don't expect the whole system to reach 100% of the theoretical IO throughput (2500 MB/s), but I think 40% (1000 MB/s) or more should be reasonable. Can anyone provide some guidance for performance tuning?
I solved the problem myself. Hint: the high CPU usage.
It's very abnormal for the CPU usage to be so high, since this is an almost purely IO-bound job.
The root cause is that each task node gets about 500 maps, and each map uses exactly one JVM. By default, Hadoop MapReduce is configured to create a new JVM for every new map task.
Solution: change the value of "mapred.job.reuse.jvm.num.tasks" from 1 to -1, which indicates that the JVM will be reused without limit.
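For reference, that is this mapred-site.xml entry (Hadoop 1.x property name; it can also be set per job):
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>  <!-- -1 means the task JVM can be reused an unlimited number of times -->
</property>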
