Retrieving tasktracker logs for a particular job programmatically - hadoop

Hi, I am working with the OozieClient API.
I need to retrieve the task tracker logs for a particular workflow job using the OozieClient API. If not with the OozieClient API, any other programmatic way is also fine. So far, with the OozieClient I am able to get the job log using client.getJobLog(), but I need the task tracker logs, not the job logs. Kindly help.
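For reference, a minimal sketch of what currently works for me, with a placeholder Oozie URL and workflow job id:

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public class JobLogFetcher {
    public static void main(String[] args) throws OozieClientException {
        // Placeholder Oozie server URL and workflow job id.
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");
        // This returns the Oozie job log, not the task tracker logs I need.
        String jobLog = client.getJobLog("0000001-170101000000000-oozie-W");
        System.out.println(jobLog);
    }
}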

Try retrieving the YARN application ID from Oozie using the OozieClient API.
Once you have this ID, you can call the History Server's REST API (or its client library) and fetch the log directory path using the "jobAttempts" API.
You can then browse this directory using the Hadoop client.
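A sketch of the first two steps, assuming placeholder host names (oozie-host, historyserver-host on the default History Server web port 19888) and a placeholder workflow id:

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowAction;
import org.apache.oozie.client.WorkflowJob;

public class TaskLogLocator {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
        WorkflowJob wf = oozie.getJobInfo("0000001-170101000000000-oozie-W");

        for (WorkflowAction action : wf.getActions()) {
            // For Hadoop-backed actions this is the Hadoop job id,
            // e.g. job_1484000000000_0001; other actions are skipped.
            String hadoopJobId = action.getExternalId();
            if (hadoopJobId != null && hadoopJobId.startsWith("job_")) {
                // GET this URL and read the log location out of the
                // "jobAttempts" JSON in the response.
                System.out.println("http://historyserver-host:19888/ws/v1/history/mapreduce/jobs/"
                        + hadoopJobId + "/jobattempts");
            }
        }
    }
}

Once the jobAttempts response gives you an aggregated log directory on HDFS, the Hadoop FileSystem client (e.g. FileSystem.get(conf).listStatus(new Path(logDir))) can browse it.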

Related

How to get app runtime on hadoop yarn

Does YARN store information about finished applications, including runtime, on HDFS? I just want to get the application runtime from files on HDFS (if such files exist; I have checked the logs and there is no runtime information) without using any monitoring software.
You can use the ResourceManager REST API to fetch information about all finished applications.
http://resource_manager_host:port/ws/v1/cluster/apps?state=FINISHED
A GET request to this URL will return a JSON response (XML can also be obtained). The response has to be parsed for the elapsedTime field of each application to get its running time.
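A minimal sketch of that GET request in Java, assuming the default ResourceManager web port 8088; a real client would parse the JSON with a proper library rather than printing the raw text:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class AppRuntimeFetcher {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://resource_manager_host:8088/ws/v1/cluster/apps?state=FINISHED");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        // Each app object in the response carries an "elapsedTime"
        // field (milliseconds) - extract it with a JSON parser.
        System.out.println(json);
    }
}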
To look up persistent job history files, you will need to check the Job History Server or the Timeline Server instead of the ResourceManager:
Job history is aggregated onto HDFS and can be seen from the Job History Server UI (or REST API). The history files are stored under the directory configured by mapreduce.jobhistory.done-dir on HDFS.
Job history can also be aggregated by the Timeline Server (filesystem-based, aka ATS 1.5) and can be seen from the Timeline Server UI (or REST API). The history files are stored under yarn.timeline-service.entity-group-fs-store.done-dir on HDFS.

How to expose Hadoop job and workflow metadata using Hive

What I would like to do is make workflow and job metadata, such as start date, end date, and status, available in a Hive table to be consumed by a BI tool for visualization purposes. I would like to monitor, for example, whether a certain workflow fails at certain hours, its success rate, and so on.
For this purpose I need access to the same data Hue is able to show in the Job Browser and the Oozie dashboard. What I am looking for specifically, for workflows for example, is the name, submitter, status, and start and end time. The reason I want this is that, in my opinion, this tool lacks a general overview and good search.
The idea is that once I locate this data I will load it into Hive, either directly or through some processing steps.
Questions that I would like to see answered:
Is this data stored in HDFS, or is it scattered across local data nodes?
If it is stored in HDFS, where can I find it? If it is stored on local data nodes, how does Hue find and show it?
Assuming I can access the data, in what format should I expect it? Is it stored in general log files, or can I expect somewhat structured data?
I am using CDH 5.8
If jobs are submitted through ways other than Oozie, my approach won't be helpful.
We have collected all the logs from the Oozie server through the Oozie Java API and iterated over the coordinator information to get the required info.
First decide what kind of information you need to retrieve.
If all your jobs are submitted through a bundle, then go from the bundle to its coordinators and then to the workflows to find the info.
If you just want all the coordinator info, simply call the API with the number of coordinators to fetch and extract the required fields.
We then loaded the fetched results into a Hive table, where one can filter for failed or timed-out coordinators and various other parameters.
You can start with the example given on the Oozie site:
https://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html#Java_API_Example
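A sketch of the iteration described above, with a placeholder Oozie URL and an arbitrary page size of 50 coordinators; each printed row is a delimited record that could be loaded into a Hive table:

import java.util.List;

import org.apache.oozie.client.CoordinatorJob;
import org.apache.oozie.client.OozieClient;

public class CoordinatorMetadataExtractor {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // An empty filter fetches all coordinators; tune start/len for paging.
        List<CoordinatorJob> coords = oozie.getCoordJobsInfo("", 1, 50);
        for (CoordinatorJob coord : coords) {
            // One tab-delimited row per coordinator, ready for a Hive LOAD.
            System.out.println(coord.getAppName() + "\t"
                    + coord.getUser() + "\t"
                    + coord.getStatus() + "\t"
                    + coord.getStartTime() + "\t"
                    + coord.getEndTime());
        }
    }
}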
If you want to track the status of your jobs scheduled in Oozie, you should use the Oozie REST API or Java API. I haven't worked with the Hue interface for operating Oozie, but I guess it still uses the REST API behind the scenes. It provides you with all the necessary information, and you can create a service that consumes this data and pushes it into a Hive table.
Another option is to access the Oozie database. As you probably know, Oozie keeps all the data about scheduled jobs in an RDBMS such as MySQL or Postgres. You can consume this information through a JDBC connector; a sketch follows below. An interesting approach would be to link this information directly into Hive as a set of external tables through the JdbcStorageHandler. Not sure if it works, but it's worth a try.
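A hedged sketch of the JDBC route, assuming a MySQL-backed Oozie metastore; the connection URL, credentials, and the WF_JOBS table and column names are assumptions to verify against your Oozie version's schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OozieDbReader {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings for the Oozie metastore DB.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://oozie-db-host:3306/oozie", "oozie", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT id, app_name, user_name, status, created_time, end_time "
                 + "FROM WF_JOBS")) {
            while (rs.next()) {
                // These rows carry the same fields Hue shows in its job browser.
                System.out.println(rs.getString("app_name") + "\t"
                        + rs.getString("status") + "\t"
                        + rs.getTimestamp("end_time"));
            }
        }
    }
}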

Running Hadoop Job Through Web Interface

Is there any way to run a Hadoop job through a web interface? E.g., triggering Hadoop job execution with a button.
I want to implement a web interface for my Hadoop project.
Thanks!
Hue, which ships with Cloudera's CDH, will be useful; it is designed for this purpose.
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.2/Hue-2-User-Guide/hue26.html
Try the below options:
Option 1
Create a Java web project with a web service, and add all UI inputs to this client server.
Create another web project as a remote server, which receives all the above job inputs and passes them on to the jobs.
The remote server web project should always be up and running in the cluster, listening for client signals.
Use JSCH on the server side to invoke whatever Hadoop commands you pass from the UI (see the sketch after these options).
OR
Option 2
You could use a MySQL database and store all job parameters from the UI. Then write a simple Java program with JSCH that runs these Hadoop commands by polling the DB: a runnable jar that runs all the time.
Hope the above two ideas help you.
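A minimal sketch of the JSCH piece used by both options; the host, credentials, and Hadoop command are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class RemoteHadoopRunner {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("hadoop-user", "edge-node-host", 22);
        session.setPassword("secret");
        session.setConfig("StrictHostKeyChecking", "no"); // demo only
        session.connect();

        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand("hadoop jar /path/to/job.jar input output");
        // Grab the stream before connecting so no output is missed.
        BufferedReader out = new BufferedReader(
                new InputStreamReader(channel.getInputStream()));
        channel.connect();

        // Stream the command output back to the caller (e.g. the web UI).
        String line;
        while ((line = out.readLine()) != null) {
            System.out.println(line);
        }
        channel.disconnect();
        session.disconnect();
    }
}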

hadoop mapreduce - API for getting job log

I'm developing a Hadoop MapReduce application and I need to present the task log to the end user (the same as Hue does).
Is there a Java API that extracts the logs of a specific job?
I tried the JobClient API without any success.
The Job Attempts API of the History Server provides a link to the logs of each task.
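For example, a sketch of that REST call with a placeholder History Server host (default web port 19888) and a placeholder job id; each jobAttempt object in the JSON response carries a logsLink field:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class JobAttemptLogs {
    public static void main(String[] args) throws Exception {
        String jobId = "job_1484000000000_0001"; // placeholder
        URL url = new URL("http://historyserver-host:19888/ws/v1/history/mapreduce/jobs/"
                + jobId + "/jobattempts");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // extract logsLink with a JSON parser
            }
        }
    }
}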

How to access Hive log information

I am trying to analyze the performance of Hive queries. Though I am able to make Hive queries with Java, I still need to access the log information generated after each query. Instead of using a hack that reads the latest log on disk and extracts the numbers with regexes, I am looking for a graceful method if one is already available.
Any pointers will be helpful. Thanks in advance.
-lg
Query execution details like Status, Finished at, and Finished in are displayed in the Job Tracker, which you can access programmatically. Related info at this link:
How could I programmatically get all the job tracker and tasktracker information that is displayed by Hadoop in the web interface?
Once Hive starts running a query, a corresponding map-reduce job starts. The logs of this Hadoop job can be found on the tasktracker on which each task ran.
Use the JobClient API to retrieve these logs programmatically, as in the sketch below.
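A sketch using the old mapred JobClient API this answer refers to (MR1/JobTracker era); the JobTracker address and job id are placeholders:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskCompletionEvent;

public class HiveJobLogs {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "jobtracker-host:8021"); // placeholder
        JobClient client = new JobClient(conf);

        RunningJob job = client.getJob(JobID.forName("job_201701010000_0001"));
        // Each completion event points at the tasktracker that ran the attempt;
        // its HTTP address serves that attempt's logs under /tasklog.
        for (TaskCompletionEvent event : job.getTaskCompletionEvents(0)) {
            System.out.println(event.getTaskTrackerHttp()
                    + "/tasklog?attemptid=" + event.getTaskAttemptId());
        }
    }
}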
