I'm working on configuring our CDH cluster for YARN (currently we use MR1). I've been trying to wrap my head around all the YARN memory parameters to make sure we're fully utilizing our machines, and I ended up diagramming it:
We use Cloudera Manager, so I've included relevant CM settings and processes, which won't apply to every cluster.
My question: am I missing any configurable memory-allocation parameters, or am I interpreting any incorrectly? I've had trouble parsing the documentation, and I want to avoid as much trial and error as possible.
Related
Using DC/OS, we'd like to schedule tasks close to the data they require, which in our case is stored in Hadoop/HDFS (on an HDP cluster). The issue is that the Hadoop cluster is not run from within DC/OS, so we are looking for a way to offer only a subset of each node's system resources.
For example: say we'd like to reserve 8 GB of memory for the DataNode services, and then provide the remainder to DC/OS to schedule tasks.
From what I have read so far, a task can specify the resources it requires, but I have not found any means to specify what you want to offer from the node's perspective.
I'm aware that a CDH cluster can be run on DC/OS; that would be one way to go, but for now that is not provided for HDP.
Thanks for any ideas/tips,
Paul
When I look at my logs, I see that my oozie java actions are actually running on multiple machines.
I assume that is because they're wrapped inside a MapReduce job? (Is this correct?)
Is there a way to have only a single instance of the java action executing on the entire cluster?
The Java action runs inside an Oozie "launcher" job, with just one YARN "map" container.
The trick is that every YARN job requires an application master (AM) container for coordination.
So you end up with 2 containers, _0001 for the AM and _0002 for the Oozie action, probably on different machines.
To control the resource allocation for each one, you can set the following Action properties to override your /etc/hadoop/conf/*-site.xml config and/or hard-coded defaults (which are specific to each version and each distro, by the way):
oozie.launcher.yarn.app.mapreduce.am.resource.mb
oozie.launcher.yarn.app.mapreduce.am.command-opts (to align the max heap size with the global memory max)
oozie.launcher.mapreduce.map.memory.mb
oozie.launcher.mapreduce.map.java.opts (...)
oozie.launcher.mapreduce.job.queuename (in case you've got multiple queues with different priorities)
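For example, these can be set in the action's configuration block in workflow.xml. This is only a sketch: the action name, class name, and memory values below are illustrative assumptions, not recommendations.

```xml
<action name="my-java-action">
  <java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- AM container of the Oozie launcher job -->
      <property>
        <name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name>
        <value>512</value>
      </property>
      <property>
        <name>oozie.launcher.yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx409m</value>
      </property>
      <!-- "map" container that actually runs the Java action -->
      <property>
        <name>oozie.launcher.mapreduce.map.memory.mb</name>
        <value>2048</value>
      </property>
      <property>
        <name>oozie.launcher.mapreduce.map.java.opts</name>
        <value>-Xmx1638m</value>
      </property>
    </configuration>
    <main-class>com.example.MyMain</main-class>
  </java>
  <ok to="end"/>
  <error to="fail"/>
</action>
```

Note the heap (-Xmx) values are kept below the container sizes, following the usual rule of leaving headroom for non-heap memory.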
Well, actually, the explanation above is not entirely true... On a Hortonworks distro you end up with 2 containers, as expected.
But with a Cloudera distro, you typically end up with just one container, running both the AM and the action in the same Linux process.
And I have no idea how they do that. Maybe there's a generic YARN config somewhere, maybe it's a Cloudera-specific feature.
Initially I had two machines to set up Hadoop, Spark, HBase, Kafka, ZooKeeper, and MR2. Each machine had 16 GB of RAM. I used Apache Ambari to set up the two machines with the above-mentioned services.
Now I have upgraded the RAM of each of those machines to 128GB.
How can I now tell Ambari to scale up all its services to make use of the additional memory?
Do I need to understand how the memory is configured for each of these services?
Is this part covered in Ambari documentation somewhere?
Ambari calculates recommended memory settings for each service at install time, so a post-install change in memory will not automatically be picked up. You would have to edit these settings manually for each service, and to do that, yes, you would need an understanding of how memory should be configured for each one. I don't know of any Ambari documentation that recommends memory configuration values for each service, so I would suggest one of the following routes:
1) Take a look at each service's documentation (YARN, Oozie, Spark, etc.) and review what they recommend for memory-related parameter configurations.
2) Take a look at the Ambari code that calculates recommended values for these memory parameters and use those equations to come up with new values that account for your increased memory.
I used this guide: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/determine-hdp-memory-config.html
Also, SmartSense is a must: http://docs.hortonworks.com/HDPDocuments/SS1/SmartSense-1.2.0/index.html
We need to specify cores, memory, disks, and whether or not we use HBase; the script will then provide the memory settings for YARN and MapReduce.
[root@ttsv-lab-vmdb-01 scripts]# python yarn-utils.py -c 8 -m 128 -d 3 -k True
Using cores=8 memory=128GB disks=3 hbase=True
Profile: cores=8 memory=81920MB reserved=48GB usableMem=80GB disks=3
Num Container=6
Container Ram=13312MB
Used Ram=78GB
Unused Ram=48GB
yarn.scheduler.minimum-allocation-mb=13312
yarn.scheduler.maximum-allocation-mb=79872
yarn.nodemanager.resource.memory-mb=79872
mapreduce.map.memory.mb=13312
mapreduce.map.java.opts=-Xmx10649m
mapreduce.reduce.memory.mb=13312
mapreduce.reduce.java.opts=-Xmx10649m
yarn.app.mapreduce.am.resource.mb=13312
yarn.app.mapreduce.am.command-opts=-Xmx10649m
mapreduce.task.io.sort.mb=5324
Apart from this, there are formulas in that guide to calculate it manually. I tried these settings and they worked for me.
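Those manual formulas can be sketched in Python. The reserved-memory table and the rounding to 512 MB multiples below follow my reading of the HDP guide and yarn-utils.py, so treat this as an approximation of the script, not a replacement for it:

```python
import math

# Reserved system memory (GB) by total RAM, per the HDP tuning guide.
# When HBase is co-located, it reserves the same amount again.
RESERVATION = [(4, 1), (8, 2), (16, 2), (24, 4), (48, 6), (64, 8),
               (72, 8), (96, 12), (128, 24), (256, 32), (512, 64)]

def reserved_gb(total_gb):
    for limit, res in RESERVATION:
        if total_gb <= limit:
            return res
    return 64

def min_container_mb(total_gb):
    # Minimum YARN container size grows with total node RAM
    if total_gb <= 4:  return 256
    if total_gb <= 8:  return 512
    if total_gb <= 24: return 1024
    return 2048

def yarn_memory(cores, total_gb, disks, hbase=True):
    reserved = reserved_gb(total_gb) * (2 if hbase else 1)
    usable_mb = (total_gb - reserved) * 1024
    min_mb = min_container_mb(total_gb)
    # containers = min(2*CORES, ceil(1.8*DISKS), usable RAM / min container)
    containers = int(min(2 * cores, math.ceil(1.8 * disks), usable_mb // min_mb))
    ram = max(min_mb, usable_mb // containers)
    if ram > 1024:                       # round down to a 512 MB multiple
        ram = (ram // 512) * 512
    return {
        "containers": containers,
        "yarn.scheduler.minimum-allocation-mb": ram,
        "yarn.scheduler.maximum-allocation-mb": containers * ram,
        "yarn.nodemanager.resource.memory-mb": containers * ram,
        "mapreduce.map.memory.mb": ram,
        "mapreduce.map.java.opts": "-Xmx%dm" % int(0.8 * ram),
        "mapreduce.task.io.sort.mb": int(0.4 * ram),
    }

# Same inputs as the yarn-utils.py run above: 8 cores, 128 GB RAM, 3 disks, HBase on
print(yarn_memory(8, 128, 3, hbase=True))
```

Run with the inputs from the session above, this reproduces the script's numbers (6 containers of 13312 MB, 79872 MB per NodeManager).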
Hortonworks HDP can be deployed in two ways:
Sandbox (VM)
Manual Installation.
I would like to understand whether the HDP Sandbox or a manual installation is preferred in a production environment. The choice could be made for obvious reasons like performance, but I would like to understand whether there are any other considerations.
The Hortonworks Sandbox allows you to try out the features and functionality of Hadoop and its ecosystem of projects. That's all.
If you want to go to production, you have three installation types:
Automated with Ambari
Manual
Cloud with Cloudbreak
Regards,
Alain
Performance. Hadoop is about parallel processing; you can't do that with a single node.
Storage. Hadoop uses a distributed file system; with a single node your storage space is very limited.
Redundancy. If this node dies, everything is gone. A normal Hadoop configuration includes a replication factor (3 by default) so that when some nodes or disks go down, all of the data is still reachable. The same goes for a standby NameNode.
There are a few other points, but these are the main ones IMO.
Single-node Hadoop only makes sense for proofs of concept and experimentation, not for providing production-level value.
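For reference, the replication factor mentioned above is controlled in hdfs-site.xml; 3 is already the default, so this fragment is only illustrative:

```xml
<!-- hdfs-site.xml: number of copies HDFS keeps of each block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```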
We have been exploring Apache Ambari with HDP 2.2 to set up a cluster. Our backend features three environments: testing, staging, and production, which is standard practice in our industry.
When we would deploy a cluster in the testing environment with Ambari, what is the easiest way to have the same cluster configuration on the staging, and later, production environment ?
The initial step seems easy: you create a cluster in the testing environment using the UI and then you export the configuration as a blueprint. Subsequently, you use the exported blueprint to create a new cluster in the other environments. So far, so good.
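The export step can also be done with the Ambari REST API rather than the UI. A sketch, in which the host names, cluster names, blueprint name, and admin/admin credentials are all placeholders:

```shell
# Export the running test cluster's configuration as a blueprint
curl -u admin:admin -H 'X-Requested-By: ambari' \
  'http://ambari-test.example.com:8080/api/v1/clusters/TestCluster?format=blueprint' \
  > cluster-blueprint.json

# Register the blueprint on the staging Ambari server
curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
  -d @cluster-blueprint.json \
  'http://ambari-staging.example.com:8080/api/v1/blueprints/TestBlueprint'

# Instantiate a cluster from it, using a host-mapping template (not shown here)
curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
  -d @cluster-template.json \
  'http://ambari-staging.example.com:8080/api/v1/clusters/StagingCluster'
```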
Inevitably, we will need to change our Ambari configuration (e.g. deploy a new service, increase heap sizes for the JVMs, ...). I was hoping we could just update the blueprint (using the UI or by hand) and then use the updated blueprint to also update the different clusters. However, this seems impossible without destroying and recreating the cluster, which seems a bit harsh (we don't want to lose our data).
Alternatively, we could use the REST API of Ambari to make specific updates to the configuration, but as configuration changes relative to the initial blueprint will undoubtedly accumulate, I am afraid this will prove unwieldy and unmaintainable over time.
Can you suggest a better solution for this use case?
I believe the easiest way would be to dump each service's configuration to a file, then import each of those configurations into the other clusters. This can be done using the Ambari API or the script provided by Ambari for updating configurations (/var/lib/ambari-server/resources/scripts/configs.sh).
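For instance, dumping and re-importing a single config type might look like the following. The host names, cluster names, and admin/admin credentials are placeholders; check `configs.sh` without arguments on your version for the exact usage:

```shell
# Dump the yarn-site config of the source cluster to a local JSON file
/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin \
  get ambari-test.example.com test yarn-site yarn-site.json

# Push that file to the target cluster (overwrites its yarn-site)
/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin \
  set ambari-staging.example.com staging yarn-site yarn-site.json
```

You would repeat this per config type (core-site, hdfs-site, mapred-site, ...) and per target cluster.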