How to get app runtime on Hadoop YARN

Does YARN store information about finished applications, including their runtime, on HDFS? I just want to get the application runtime from files on HDFS (if such files exist; I have checked the logs and there is no runtime information in them), without using any monitoring software.

You can use the ResourceManager REST API to fetch information about all of the finished applications:
http://resource_manager_host:port/ws/v1/cluster/apps?state=FINISHED
A GET request to that URL will return a JSON response (XML can also be obtained). Parse the response for the elapsedTime field of each application to get its running time (in milliseconds).
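For illustration, a minimal Java sketch of that request and parse, assuming Jackson (jackson-databind, which ships with the Hadoop client libraries) is on the classpath; the ResourceManager address is a placeholder to replace with your own resource_manager_host:port:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URL;

public class FinishedAppRuntimes {
    public static void main(String[] args) throws Exception {
        // Placeholder RM web address; replace with your resource_manager_host:port.
        String rmUrl = "http://resourcemanager:8088/ws/v1/cluster/apps?state=FINISHED";

        // Fetch and parse the JSON response from the ResourceManager REST API.
        JsonNode root = new ObjectMapper().readTree(new URL(rmUrl));

        // The payload has the shape {"apps": {"app": [ ... ]}}; each entry
        // carries elapsedTime in milliseconds.
        for (JsonNode app : root.path("apps").path("app")) {
            System.out.printf("%s (%s): elapsedTime=%d ms%n",
                    app.path("id").asText(),
                    app.path("name").asText(),
                    app.path("elapsedTime").asLong());
        }
    }
}
```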

To look up persistent job history files, you will need to check the Job History Server or the Timeline Server instead of the ResourceManager:
Job history is aggregated onto HDFS and can be viewed from the Job History Server UI (or REST API). The history files are stored under the directory configured by mapreduce.jobhistory.done-dir on HDFS.
Job history can also be aggregated by the Timeline Server (filesystem-based, a.k.a. ATS 1.5) and viewed from the Timeline Server UI (or REST API). Those history files are stored under yarn.timeline-service.entity-group-fs-store.done-dir on HDFS.
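If you want to find those files directly, here is a hedged sketch using the stock Hadoop FileSystem API. It assumes the cluster's *-site.xml files are on the classpath; the fallback path is only an illustration, since the real default differs per distribution:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListJobHistoryFiles {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and mapreduce.jobhistory.done-dir from the
        // *-site.xml files on the classpath.
        Configuration conf = new Configuration();

        // The fallback value is only an example; the actual default depends on your distribution.
        String doneDir = conf.get("mapreduce.jobhistory.done-dir", "/mr-history/done");

        // History files (.jhist plus the job configuration XML) sit in date-based
        // subdirectories, hence the recursive listing.
        FileSystem fs = FileSystem.get(conf);
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(doneDir), true);
        while (it.hasNext()) {
            System.out.println(it.next().getPath());
        }
    }
}
```

Note that the Job History Server REST API (/ws/v1/history/mapreduce/jobs) exposes the same finished-job details, including submit, start and finish times, without reading the HDFS files yourself.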

Related

Where are the Hadoop ResourceManager metrics stored?

The Resource Manager REST API provides the status of all applications.
I'm curious to know where this information is actually stored.
Is it possible to get this information into HBase/Hive?
No, you cannot get this information from HBase or Hive, because the ResourceManager REST APIs return live data from in-memory data structures in the RM. The application logs are stored locally on the NodeManagers and in HDFS, and ZooKeeper maintains some state information that could be extracted independently of the RM, but that's all.
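If the goal is to capture that live data programmatically before the RM discards it, the YARN Java client returns the same application reports as the REST API; a minimal sketch (cluster addresses are picked up from yarn-site.xml on the classpath):

```java
import java.util.EnumSet;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DumpFinishedApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());  // reads yarn-site.xml from the classpath
        yarn.start();
        try {
            // The same live application reports the RM REST API serves.
            for (ApplicationReport app :
                    yarn.getApplications(EnumSet.of(YarnApplicationState.FINISHED))) {
                long elapsedMs = app.getFinishTime() - app.getStartTime();
                System.out.printf("%s  %s  %d ms%n",
                        app.getApplicationId(), app.getName(), elapsedMs);
            }
        } finally {
            yarn.stop();
        }
    }
}
```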
Have you looked at Timeline Server v2? ATSv2 can store all application metrics, and it uses HBase as its storage backend. (Link: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html)
Check that ATSv2 is supported in your version of Hadoop.
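As a rough illustration only: the v2 timeline reader serves its data over REST under the /ws/v2/timeline/ prefix described in that document. The host, the port (see yarn.timeline-service.reader.webapp.address) and the exact query paths below are assumptions to verify against your Hadoop version:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class QueryTimelineV2 {
    public static void main(String[] args) throws Exception {
        // Placeholder address for the timeline reader daemon; the /ws/v2/timeline/
        // prefix and the /flows/ query are taken from the TimelineServiceV2 docs,
        // but verify both against your Hadoop version.
        URL url = new URL("http://timeline-reader-host:8188/ws/v2/timeline/flows/");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // raw JSON describing the flows stored in HBase
            }
        }
    }
}
```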

Find and set Hadoop logs to verbose level

I need to track what is happening when I run a job or upload a file to HDFS. I do this using SQL Profiler in SQL Server. However, I am missing such a tool for Hadoop, so I am assuming that I can get some information from the logs. I think all logs are stored at /var/logs/hadoop/, but I am confused about which file I need to look at and how to set that file to capture detailed information.
I am using HDP2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of the NameNode and DataNode services. Each has its own log. The location of the logs is distribution-dependent. See "File Locations" for Hortonworks, or "Apache Hadoop Log Files: Where to find them in CDH, and what info they contain" for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application on YARN, so you are talking about the ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MR application (the M/R component), which is a YARN application with its own log as well.
Jobs consist of tasks, and the tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
Higher-level components (e.g. Hive, Pig, Kafka) have their own logs, aside from the logs produced by the jobs they submit (which log just like any other job).
The good news is that the vendor-specific distributions (Cloudera, Hortonworks etc.) provide UIs that expose the most common logs for easy access. Usually they expose the logs collected by the JobHistory service from the UI that shows job status and job history.
I cannot point you to a SQL Profiler equivalent, because the problem space is orders of magnitude more complex, with many different products, versions and vendor-specific distributions involved. I recommend starting by reading about and learning how the Job History Server runs and how it can be accessed.
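That said, on the MapReduce side there are per-job knobs for verbosity: the task and ApplicationMaster log levels can be raised from the job configuration (daemon logs are controlled separately through each service's log4j.properties, or at runtime via the hadoop daemonlog command). A hedged sketch of a driver doing this; whether the DEBUG output lands where you expect still depends on your distribution's log4j setup:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class VerboseJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Raise the log level of the map tasks, reduce tasks and the MR ApplicationMaster.
        // Valid values are the usual log4j levels (OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, ALL).
        conf.set("mapreduce.map.log.level", "DEBUG");
        conf.set("mapreduce.reduce.log.level", "DEBUG");
        conf.set("yarn.app.mapreduce.am.log.level", "DEBUG");

        Job job = Job.getInstance(conf, "verbose-example");
        // ... set mapper/reducer/input/output as usual, then submit:
        // job.waitForCompletion(true);
    }
}
```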

Bluemix Analytics for Apache Hadoop Big SQL - How to access logs for debug?

I am using Big SQL from Analytics for Apache Hadoop in Bluemix and would like to look into the logs in order to debug (e.g. the MapReduce job log, usually available under http://my-mapreduce-server.com:19888/jobhistory, and bigsql.log from the Big SQL worker nodes).
Is there a way in Bluemix to access those logs?
Log files for most IOP components (e.g. the MapReduce Job History log, the ResourceManager log) are accessible from the Ambari console's Quick Links; just navigate to the respective service page. Log files for Big SQL are currently not available. Since the cluster is not hosted as Bluemix apps, the logs cannot be retrieved using the Bluemix cf command.

Retrieving tasktracker logs for a particular job programmatically

Hi, I am working with the OozieClient API.
I need to retrieve the task tracker logs for a particular workflow job using the OozieClient API. If that is not possible with the OozieClient API, any other programmatic way is also fine. As of now, with the OozieClient I am able to get the job log using client.getJobLog(), but I need the task tracker logs, not the job logs. Kindly help.
Try retrieving the YARN application ID from Oozie using the OozieClient API.
Once you have this ID, you can make a call to the History Server, using its REST API or its client library, to fetch the log directory path via the "jobAttempts" API.
You can then browse this directory using the Hadoop client.
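A rough sketch of the first two steps, assuming the Oozie Java client (org.apache.oozie.client) is available and the placeholder Oozie/history-server addresses are replaced with yours; the jobattempts response should then contain the node and log links for each attempt:

```java
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowAction;
import org.apache.oozie.client.WorkflowJob;

public class WorkflowTaskLogs {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoints; replace with your Oozie and history server addresses.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
        String workflowId = args[0];  // the Oozie workflow job ID

        WorkflowJob wf = oozie.getJobInfo(workflowId);
        for (WorkflowAction action : wf.getActions()) {
            // For MapReduce-launched actions the external ID is the Hadoop job ID.
            String externalId = action.getExternalId();
            if (externalId != null && externalId.startsWith("job_")) {
                // History server REST call that returns the attempt info,
                // including links to the container logs.
                String url = "http://jobhistory-host:19888/ws/v1/history/mapreduce/jobs/"
                        + externalId + "/jobattempts";
                System.out.println(action.getName() + " -> " + url);
            }
        }
    }
}
```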

Using Hadoop Cluster Remotely

I have a web application and one or more remote clusters. These clusters can be on different machines.
I want to perform the following operations from my web application:
1. HDFS Actions:
Create New Directory
Remove files from HDFS (Hadoop Distributed File System)
List Files present on HDFS
Load File onto HDFS
Unload File
2. Job Related Actions:
Submit MapReduce Jobs
View their status, i.e. how much of the job has completed
Time taken by the job to finish
I need a tool that can help me do these tasks from the web application, via an API, REST calls, etc. I'm assuming that the tool will be running on the same machine (as the web application) and can point to a particular, remote cluster.
Though as a last option (as there can be multiple, disparate clusters, it would be difficult to ensure that each of them has the plug-in, library etc. installed), I'm wondering if there is some Hadoop library or plug-in that resides on the cluster, allows access from remote machines and performs the mentioned tasks.
The best framework that covers everything you have listed here is Spring Data - Apache Hadoop. It has Java/scripting API based implementations to do the following (see the sketch after the two lists below):
1. HDFS Actions:
Create New Directory
Remove files from HDFS (Hadoop Distributed File System)
List Files present on HDFS
Load File onto HDFS
Unload File
as well as Spring scheduling based implementations to do the following:
2. Job Related Actions:
Submit MapReduce Jobs
View their status, i.e. how much of the job has completed
Time taken by the job to finish
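A minimal sketch of the HDFS actions using the FsShell helper from that project; the method names follow spring-data-hadoop's FsShell class, but exact signatures can vary by version, and the NameNode address is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.springframework.data.hadoop.fs.FsShell;

public class HdfsActions {
    public static void main(String[] args) throws Exception {
        // Point the client at the remote cluster's NameNode (placeholder address).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FsShell shell = new FsShell(conf);

        shell.mkdir("/user/webapp/new-dir");                  // create a new directory
        shell.copyFromLocal("data.csv", "/user/webapp/");     // load a file onto HDFS
        for (FileStatus status : shell.ls("/user/webapp")) {  // list files on HDFS
            System.out.println(status.getPath());
        }
        shell.copyToLocal("/user/webapp/data.csv", "/tmp/");  // unload a file
        shell.rm("/user/webapp/data.csv");                    // remove a file
    }
}
```

For the job-related actions, the plain org.apache.hadoop.mapreduce.Job handle already exposes mapProgress(), reduceProgress() and isComplete() once a job has been submitted, which covers the status reporting regardless of which framework submits the job.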
