Monitoring and checking status of YARN - hadoop

How can I access YARN metrics such as the status of the resource manager and node managers?
I have the same question about running YARN containers. I would like to do this via a web interface.

You can use the YARN Resource Manager UI, which is usually accessible at port 8088 of your resource manager (although the port can be configured). There you get an overview of your cluster.
Details about the nodes of the cluster can be found in this UI under the Cluster menu, submenu Nodes. There you find the health information, some information about the hardware, the currently running jobs and the software version of each node manager. For more details you would have to log into each node and check the NodeManager logs (for the node where the resource manager is running, this log is also available via the Tools menu -> Local logs, but this is not sufficient if you have more than one node in your cluster).
More details about the Resource Manager (including runtime statistics) are available via the Tools menu -> Server metrics.
If you want to access this information programmatically, you can use the Resource Manager's REST API.
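For example, the cluster metrics, per-node status and running applications shown in the UI are also exposed as JSON endpoints. A minimal sketch using Python and the requests library (the hostname rm-host and port 8088 are assumptions for your setup):

    import requests

    RM = "http://rm-host:8088"  # assumption: your ResourceManager address and port

    # Overall cluster metrics (apps, memory, vcores, node counts, ...)
    metrics = requests.get(RM + "/ws/v1/cluster/metrics").json()["clusterMetrics"]
    print("active nodes:", metrics["activeNodes"], "unhealthy nodes:", metrics["unhealthyNodes"])

    # Per-node details, roughly what the Cluster -> Nodes page shows
    nodes = requests.get(RM + "/ws/v1/cluster/nodes").json()["nodes"]["node"]
    for n in nodes:
        print(n["nodeHostName"], n["state"], n["healthReport"])

    # Currently running applications and their container counts
    apps = requests.get(RM + "/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
    for app in (apps["apps"] or {}).get("app", []):
        print(app["id"], app["name"], app["runningContainers"])

The same data backs the web UI, so this is a convenient way to script health checks.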
Another option might be to use Ambari. Ambari is a Hadoop management tool that can be used to monitor the different services within a Hadoop cluster and to trigger alerts in case of unusual or unexpected events. However, it requires some installation and configuration effort.

Related

Some automatically launched Hadoop YARN application

I'm new to Apache Hadoop. I've installed a YARN cluster with one master and two slaves on AWS. As soon as I start the YARN cluster, I can see that some applications are automatically launched by the user dr.who with app type YARN. It bothers me a lot. I'm hoping someone could help me out with this. Thanks!
application_1531399885156_0041 dr.who hadoop YARN default Thu Jul 12 14:58:37 +0200 2018 N/A ACCEPTED UNDEFINED ApplicationMaster 0
This is a known bug in recent Hadoop releases, and a JIRA has also been created for it. Applications get submitted by dr.who, and when the user kills all those jobs, the NodeManager goes down.
EDIT: Problem Resolution
PROBLEM: The customer was unable to see logs via the Resource Manager UI due to incorrect permissions for the default user dr.who.
RESOLUTION: The customer changed the following property in core-site.xml to resolve the issue. Other values such as hdfs or mapred also resolve the issue. If the cluster is managed by Ambari, this should be added in Ambari > HDFS > Configurations > Advanced core-site > Add Property:
hadoop.http.staticuser.user=yarn
The same thread was posted on Hortonworks and was answered by Sandeep Nemuri, who wrote:
Stop further attacks:
a. Use firewall / IP table settings to allow access only to whitelisted IP addresses for the Resource Manager port (default 8088). Do this on both Resource Managers in your HA setup. This only addresses the current attack. To permanently secure your clusters, all HDP end-points (e.g. WebHDFS) must be blocked from open access outside of firewalls.
b. Make your cluster secure (kerberized).
Clean up existing attacks:
a. If you already see the above problem in your clusters, please filter all applications named “MYYARN” and kill them after verifying that these applications are not legitimately submitted by your own users.
b. You will also need to manually login into the cluster machines and check for any process with “z_2.sh” or “/tmp/java” or “/tmp/w.conf” and kill them.
The link to that thread is: dr.who
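If you need to kill a batch of such rogue applications (after verifying they are not legitimate), the Resource Manager's REST API can do that too. A hedged Python sketch; the ResourceManager address is an assumption, and a secured cluster would additionally require authentication:

    import requests

    RM = "http://rm-host:8088"  # assumption: your ResourceManager address

    # List applications submitted by the default user dr.who
    resp = requests.get(RM + "/ws/v1/cluster/apps", params={"user": "dr.who"}).json()
    for app in (resp["apps"] or {}).get("app", []):
        # Double-check these are not legitimate jobs before killing them
        if app["state"] in ("ACCEPTED", "RUNNING"):
            print("killing", app["id"], app["name"])
            requests.put(RM + "/ws/v1/cluster/apps/" + app["id"] + "/state",
                         json={"state": "KILLED"})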

When do YARN and NameNode interact

When a job is submitted, when do YARN and the NameNode interact? Who does the submitted job get sent to? Could someone explain the end-to-end flow, i.e. how the Hadoop ecosystem works?
Thanks!
NameNode: stores the metadata of all the data stored on the DataNodes and monitors the health of the DataNodes. Basically, it is a master-slave architecture.
YARN: stands for Yet Another Resource Negotiator. YARN has two main components:
1. Scheduler
2. Applications Manager
YARN also has a master, the Resource Manager, and slaves, the Node Managers.
For scheduling purposes, there are three schedulers:
1. FIFO
2. Capacity
3. Fair
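Which of these schedulers a cluster is actually running can be read from the Resource Manager's REST API, for example (a small sketch; rm-host and port 8088 are assumptions):

    import requests

    # The scheduler endpoint reports which scheduler is configured
    # (fifoScheduler, capacityScheduler or fairScheduler)
    info = requests.get("http://rm-host:8088/ws/v1/cluster/scheduler").json()
    print(info["scheduler"]["schedulerInfo"]["type"])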
There is also a component called the Application Master, which is assigned by the Resource Manager and runs under a Node Manager.
One Application Master is assigned to one application.
The job is submitted directly by the client, the Resource Manager assigns the job to an Application Master, and the Node Manager monitors the liveness of the Application Master.
Whenever a job comes in, the Resource Manager creates a job ID and assigns an Application Master for that job. The Resource Manager contacts the NameNode to retrieve the metadata of the data on which the task has to be performed, and the information received by the Resource Manager is then passed to the Application Master.
This is a basic overview of how YARN works with the NameNode. You can also read about it in more detail here: YARN
Also, NameNode interaction only happens for those Hadoop applications running within YARN that actually talk to the NameNode; not all YARN applications need to communicate with HDFS.
Basically there is no direct interaction between YARN and HDFS, see https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
However, YARN jobs require some files (libraries, configuration, etc.) which usually reside on HDFS.
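You can see those files for yourself by listing a job's staging directory over WebHDFS, for example (a hedged sketch: the NameNode address, port and staging path are assumptions and depend on your framework and configuration, e.g. yarn.app.mapreduce.am.staging-dir):

    import requests

    NN = "http://namenode-host:9870"        # 50070 on many Hadoop 2.x clusters
    STAGING = "/tmp/hadoop-yarn/staging"    # assumption: default MapReduce staging dir

    # List the job resources (jars, job.xml, split metadata) that YARN localizes from HDFS
    resp = requests.get(NN + "/webhdfs/v1" + STAGING, params={"op": "LISTSTATUS"}).json()
    for entry in resp["FileStatuses"]["FileStatus"]:
        print(entry["pathSuffix"], entry["type"], entry["length"])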

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my Spark cluster in standalone mode. I am reading data from flat files or Cassandra (depending on the job) and writing the processed data back to Cassandra itself.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, does it give me an additional performance advantage, such as better execution time and better resource management?
Currently, when I am processing a huge chunk of data, there is sometimes a possibility of stage failure during shuffling. If I migrate to YARN, can the resource manager address this issue?
Spark standalone cluster manager can also give you cluster mode capabilities.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode, all your job-related files are copied onto one of the machines in the cluster, which then submits the job on your behalf; if you submit the application in client mode, the machine from which the job is submitted takes care of the driver-related activities. This means that in client mode the machine from which the job was submitted cannot go offline, whereas in cluster mode it can.
Having a Cassandra cluster would also not change any of these behaviors, except that it can save you network traffic if you can get the nearest contact point for the Spark executor (just like data locality).
Failed stages get rescheduled if you use either of the cluster managers.
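How aggressively Spark retries before giving up is controlled by Spark configuration rather than by the cluster manager, for example (a PySpark sketch; the value 8 is just for illustration):

    from pyspark.sql import SparkSession

    # Allow each task up to 8 attempts before the stage (and job) is failed; the default is 4
    spark = (SparkSession.builder
             .appName("retry-tuning-example")
             .config("spark.task.maxFailures", "8")
             .getOrCreate())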
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, does it give me an additional performance advantage, such as better execution time and better resource management?
In the standalone cluster model, each application by default uses all the available nodes in the cluster.
From the Spark standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
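For example, to keep one application from grabbing every core on a standalone cluster, you can cap it explicitly (a minimal PySpark sketch; the master URL and the value 4 are assumptions):

    from pyspark.sql import SparkSession

    # Limit this application to 4 cores so other applications can run concurrently
    spark = (SparkSession.builder
             .master("spark://spark-master:7077")   # assumption: your standalone master URL
             .appName("capped-cores-example")
             .appName("capped-cores-example") if False else SparkSession.builder
             .master("spark://spark-master:7077")
             .appName("capped-cores-example")
             .config("spark.cores.max", "4")
             .getOrCreate())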
In other cases (when you are running multiple applications in the cluster), you may prefer YARN.
Currently, when I am processing a huge chunk of data, there is sometimes a possibility of stage failure during shuffling. If I migrate to YARN, can the resource manager address this issue?
Not sure, since your application logic is not known, but you can give YARN a try.
Have a look at this related SE question for the benefits of YARN over standalone and Mesos:
Which cluster type should I choose for Spark?

Provision to start group of applications on same Mesos slave

I have a cluster of 3 Mesos slaves, on which I have two applications: “redis” and “memcached”. redis depends on memcached, and the requirement is that both applications/services should start on the same node rather than on different slave nodes.
So I have created the application group and added the dependency properly in the JSON file. After launching the JSON file via the “v2/groups” REST API, I observe that sometimes both applications in the group start on the same node, but sometimes they start on different slaves, which breaks our requirement.
The intent/requirement is: if any application fails to start on a slave, both applications should fail over to another slave node. Also, can I configure the JSON file to tell Marathon to start the application group on slave-1 (a specific slave) first if it is available, and otherwise on another slave in the cluster? And if, for some reason, the application group starts on another slave, can Marathon relaunch the application group on slave-1 once it is available to serve requests again?
Thanks in advance for the help.
Edit/Update (2):
Mesos, Marathon, and DC/OS support for PODs is available now:
DC/OS: https://dcos.io/docs/1.9/usage/pods/using-pods/
Mesos: https://github.com/apache/mesos/blob/master/docs/nested-container-and-task-group.md
Marathon: https://github.com/mesosphere/marathon/blob/master/docs/docs/pods.md
I assume you are talking about Marathon apps.
Marathon application groups don't have any semantics concerning co-location on the same node, and the same is the case for dependencies.
You seem to be looking for a Kubernetes-like pod abstraction in Marathon, which is on the roadmap but not yet available (see the update above :-)).
Hope this helps!
I think this should be possible (as a workaround) if you specify the correct app constraints within the group's JSON.
Have a look at the example request at
https://mesosphere.github.io/marathon/docs/generated/api.html#v2_groups_post
and the constraints syntax at
https://mesosphere.github.io/marathon/docs/constraints.html
e.g.
"constraints": [["hostname", "CLUSTER", "slave-1"]]
should do. Downside is that there will be no automatic failover to another slave that way. Still, I'd be curious why both apps need to specifically run on the same slave node...
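For completeness, a hedged sketch of what posting such a group to the v2/groups endpoint (as in your question) could look like in Python; the Marathon address, commands and resource sizes are placeholders:

    import requests

    MARATHON = "http://marathon-host:8080"        # assumption: your Marathon endpoint
    PIN = [["hostname", "CLUSTER", "slave-1"]]    # pin both apps to the same slave

    group = {
        "id": "/cache",
        "apps": [
            {"id": "memcached", "cmd": "memcached -p 11211", "cpus": 0.5, "mem": 256,
             "instances": 1, "constraints": PIN},
            {"id": "redis", "cmd": "redis-server", "cpus": 0.5, "mem": 256,
             "instances": 1, "constraints": PIN,
             "dependencies": ["/cache/memcached"]},   # start memcached before redis
        ],
    }

    resp = requests.post(MARATHON + "/v2/groups", json=group)
    print(resp.status_code, resp.text)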

What happens when the Resource Manager (RM) goes down in Yarn?

What happens when the Resource Manager (RM) goes down in Yarn?
In the middle of running a job, if the Resource Manager goes down, then what will happen to the job?
Does the job get resubmitted automatically, or do we need to submit it again?
Thanks,
Venkat
Resource Manager (RM) high availability is explained in the Apache documentation as follows.
ResourceManager HA is realized through an Active/Standby architecture.
At any point in time, one of the RMs is Active, and the other (standby) RM is waiting to take over if the Active RM fails.
The RM being promoted to an active state loads the RM internal state from State-store and continues to operate from where the previous active left off.
A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing any work.
The State-store must be visible from both the Active and Standby RMs. Currently, there are two RMStateStore implementations for persistence: FileSystemRMStateStore and ZKRMStateStore.
The ZKRMStateStore (ZooKeeper) implicitly allows write access to a single RM at any point in time, and hence is the recommended store to use in an HA cluster.
Using the ZKRMStateStore, there is no need for a separate fencing mechanism to address a potential split-brain situation where multiple RMs could assume the Active role; ZooKeeper handles this situation very well.
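You can check which RM is currently active from each RM's REST endpoint, for example (a small Python sketch; the two hostnames are assumptions for an HA pair):

    import requests

    # Ask each ResourceManager in the HA pair for its current HA state.
    # A standby RM may answer with a redirect to the active one, so redirects are disabled.
    for rm in ("http://rm1-host:8088", "http://rm2-host:8088"):
        try:
            r = requests.get(rm + "/ws/v1/cluster/info", timeout=5, allow_redirects=False)
            if r.status_code == 200:
                print(rm, r.json()["clusterInfo"]["haState"])   # e.g. ACTIVE or STANDBY
            else:
                print(rm, "redirected or unavailable, HTTP", r.status_code)
        except requests.RequestException as err:
            print(rm, "unreachable:", err)

The yarn rmadmin command offers the same information from the command line.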
ZooKeeper is not only used for Resource Manager failover; many applications nowadays use ZooKeeper. Another failover use case in Hadoop is NameNode failover, which also happens through ZooKeeper. Have a look at the NameNode failover process too.
After Hadoop 2.x and before Hadoop 2.6.x:
When a ResourceManager dies and is restarted, or fails over to another ResourceManager in the case of an HA cluster, the newly active ResourceManager instructs running ApplicationMasters to abort. This uses up an application attempt.
Also, if the ResourceManager is down for some time and the ApplicationMaster is unable to connect, it will timeout and abort. That uses up an application attempt too.
When a new ResourceManager becomes active, it can recover applications with failed attempts that have not exceeded their max-attempts.
Have a look at this article for more details
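You can see how many attempts an application has already consumed via the Resource Manager's REST API, for example (a sketch; the host and application id are placeholders):

    import requests

    RM = "http://rm-host:8088"                    # assumption: your ResourceManager
    APP_ID = "application_1234567890123_0001"     # placeholder application id

    # Each ApplicationMaster failure (and, before work-preserving restart,
    # each RM restart) consumes one application attempt
    resp = requests.get(RM + "/ws/v1/cluster/apps/" + APP_ID + "/appattempts").json()
    for attempt in resp["appAttempts"]["appAttempt"]:
        print(attempt["id"], attempt["containerId"], attempt["nodeId"])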
From Hadoop 2.6.0:
The Resource Manager recovers its running state by taking advantage of the container statuses sent from all Node Managers. A Node Manager will not kill its containers when it re-syncs with the restarted Resource Manager.
It continues managing the containers and sends the container statuses across to the Resource Manager when it re-registers.
The Resource Manager reconstructs the container instances and the associated applications' scheduling status by absorbing this container information.
The admin will bring up a new Resource Manager. It will take the latest information from all the Application Masters and update the persistent storage which the new Resource Manager will use. This is purely an admin task.
No application or task can be launched if the RM is unavailable.
If you have RM HA configured, then the standby RM will take over.
