Modifying yarn config on EMR - hadoop

I need to make a change to the YARN configuration on an EMR cluster.
Do I need to make the change to just the yarn-site.xml file on the Hadoop master? If so, how can I propagate the change to the datanodes? Do I just need to restart YARN as detailed here? I am using EMR 5.8.0.
https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/

You will need to identify which YARN daemon enforces that parameter and, if needed, restart that daemon accordingly.
For example:
The EMR master node runs the YARN ResourceManager.
EMR core nodes run the YARN NodeManager.
If you need to change a parameter that corresponds to the YARN ResourceManager (like yarn.resourcemanager.*), then you might need to edit yarn-site.xml on just the master and restart only the ResourceManager daemon.
If you want to change a parameter like yarn.nodemanager.*, then you will need to change yarn-site.xml on all core nodes and might need to restart the NodeManager daemon on all of them.
Now, when it comes to changing this setting on all core nodes at once, there are a lot of tools out there to do it (like Ansible, PDSH, AWS SSM, etc.). EMR does not have an API that supports changing configs on the fly. If you are trying to provision a cluster with the desired configuration, use the EMR Configurations API. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
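As a rough sketch of the manual route, assuming you have SSH access from the master, that core-nodes.txt (a hypothetical file holding one core-node hostname per line) and the key path match your setup, and that the EMR 5.x upstart service names from the restart guide linked above apply:

# Hypothetical: core-nodes.txt and ~/mykey.pem are placeholders for your node list and key.
for host in $(cat core-nodes.txt); do
  scp -i ~/mykey.pem yarn-site.xml hadoop@"$host":/tmp/yarn-site.xml
  ssh -i ~/mykey.pem hadoop@"$host" \
    "sudo mv /tmp/yarn-site.xml /etc/hadoop/conf/yarn-site.xml && sudo stop hadoop-yarn-nodemanager; sudo start hadoop-yarn-nodemanager"
done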

Related

Get list of executed jobs on Hadoop cluster after cluster reboot

I have a Hadoop cluster, version 2.7.4. For some reason, I have to restart my cluster, and I need the job IDs of the jobs that were executed on the cluster before the reboot. The command mapred job -list provides details of currently running or waiting jobs only.
You can see a list of all jobs in the YARN ResourceManager web UI.
In your browser, go to http://ResourceManagerIPAddress:8088/
This is how the history looks on the YARN cluster I am currently testing on (I restarted the services several times):
[screenshot of the ResourceManager UI listing completed applications]
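If you prefer the command line over the web UI, the YARN CLI can also list applications in all states; note that how far back this goes depends on how long YARN retains completed applications and, across restarts, on whether ResourceManager state recovery is enabled:

yarn application -list -appStates ALL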

How to connect Apache Spark with Yarn from the SparkContext?

I have developed a Spark application in Java using Eclipse.
So far, I am using the standalone mode by configuring the master's address to 'local[*]'.
Now I want to deploy this application on a Yarn cluster.
The only official documentation I found is http://spark.apache.org/docs/latest/running-on-yarn.html
Unlike the documentation for deploying on a Mesos cluster or in standalone mode (http://spark.apache.org/docs/latest/running-on-mesos.html), there is no URL to use within the SparkContext for the master's address.
Apparently, I have to use the command line to deploy Spark on YARN.
Do you know if there is a way to configure the master's address in the SparkContext as in the standalone and Mesos modes?
There actually is a URL.
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager.
You should have at least hdfs-site.xml, yarn-site.xml, and core-site.xml files that specify all the settings and URLs for the Hadoop cluster you connect to.
Some relevant properties from yarn-site.xml include yarn.resourcemanager.hostname and yarn.resourcemanager.address.
Since the address has a default of ${yarn.resourcemanager.hostname}:8032, you may only need to set the hostname.
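With those files in place, the master is simply yarn rather than a host:port URL. A minimal client-mode submission sketch, where the config path, class name, and jar are placeholders:

export HADOOP_CONF_DIR=/etc/hadoop/conf   # must contain core-site.xml, hdfs-site.xml, yarn-site.xml
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  myapp.jar

Recent Spark versions also accept setMaster("yarn") on the SparkConf in code, as long as HADOOP_CONF_DIR (or YARN_CONF_DIR) points at the cluster's configuration.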

How to run hadoop balancer from client node?

I want to ask how I can run the Hadoop balancer. I tried running the hadoop balancer command on the namenode before, but it had no effect at all (my new datanode is still empty). I also read that the balancer is not run on the namenode but on a client node. So what is a client node, how can I configure it, and how can the client node access the Hadoop file system?
Thanks all, I need your suggestions.
A client node is also known as an edge node. Usually, not all developers in an organization have access to all nodes of the cluster, so developers typically access the cluster through a client node. You need to install the hadoop-client packages on the client node. If you are using a Cloudera RPM-based installation, you can use the command below.
sudo yum install hadoop-client
After the client node installation, update your configuration files such as core-site.xml, hdfs-site.xml, and any other required files. Now when you execute hadoop CLI commands, they will be executed against the cluster.
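To confirm the client node can actually reach the cluster, a quick smoke test is enough:

hdfs dfs -ls /    # should list the root directories of the cluster's HDFS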
The balancer can be run from any node: a client machine or any node in the cluster.
sudo -u hdfs hdfs balancer
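If the balancer reports that there is nothing to move, you can tighten the threshold (the default is 10, meaning each datanode's utilization may deviate from the cluster average by up to 10 percentage points):

sudo -u hdfs hdfs balancer -threshold 5   # rebalance until nodes are within 5% of the average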
Regarding the newly added datanode: first check in the NameNode web UI whether your node is listed. If you can see it there, check the datanode and balancer logs.

Running a script on all nodes of Hadoop in Amazon EMR

How do you run a script on all nodes (master and slaves) on Amazon EMR? The script-runner.jar runs only on the master node.
You have the bootstrap option:
You can use a bootstrap action to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can create custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR.
from the documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
It's as simple as placing a script to do the copying into S3, and then if you're starting EMR from the command line, add a parameter like this:
--bootstrap-action 's3://my-bucket/bootstrap.sh'
Or if you're doing it through the web interface, just enter the location of the file in as a "Custom action" in "Bootstrap Actions".
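A minimal sketch, assuming the current aws CLI; the bucket, key name, instance settings, and script names are placeholders. The bootstrap script itself runs on every node as it joins the cluster:

#!/bin/bash
# bootstrap.sh: executed on every node (master and slaves) at cluster launch
aws s3 cp s3://my-bucket/my-script.sh /tmp/my-script.sh   # placeholder S3 path
chmod +x /tmp/my-script.sh
/tmp/my-script.sh

# Launch the cluster with the bootstrap action attached:
aws emr create-cluster --release-label emr-5.8.0 \
  --applications Name=Hadoop \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles --ec2-attributes KeyName=mykey \
  --bootstrap-actions Path=s3://my-bucket/bootstrap.sh,Name=RunOnAllNodes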

where is the hadoop task manager UI

I installed the hadoop 2.2 system on my ubuntu box using this tutorial
http://codesfusion.blogspot.com/2013/11/hadoop-2x-core-hdfs-and-yarn-components.html
Everything worked fine for me and now when I do
http://localhost:50070
I can see the management UI for HDFS. Very good!!
But then I went through another tutorial which says there must be a task manager UI running at http://mymachine.com:50030 and http://mymachine.com:50060,
yet on my machine I cannot open these ports.
I have already done
start-dfs.sh
start-yarn.sh
start-all.sh
is something wrong? why can't I see the task manager UI?
You have installed YARN (MRv2), which runs the ResourceManager. The URL http://mymachine.com:50030 is the web address of the JobTracker daemon that comes with MRv1, hence you are not able to see it.
To see the ResourceManager UI, check your yarn-site.xml file for the following property:
yarn.resourcemanager.webapp.address
By default, it should point to resource_manager_hostname:8088.
Assuming your ResourceManager runs on mymachine, you should see the ResourceManager UI at http://mymachine.com:8088/
Make sure all your daemons are up and running before you visit the URL for the ResourceManager.
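Assuming a default single-node setup, you can check from a shell which daemons are running and whether the ResourceManager UI port is up:

jps    # should list ResourceManager, NodeManager, NameNode, DataNode, SecondaryNameNode
curl -s http://localhost:8088/cluster >/dev/null && echo "ResourceManager UI is up"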
For Hadoop 2 (aka YARN/MRv2), i.e. any Hadoop installation versioned 2.x or higher, it is at port 8088, e.g. localhost:8088.
For Hadoop 1, i.e. any Hadoop installation versioned lower than 2.x (e.g. 1.x or 0.x), it is at port 50030, e.g. localhost:50030.
By default, the HDFS NameNode UI is at:
http://mymachine.com:50070
