How to access Hive log information - hadoop

I am trying to analyze the performance of Hive queries. I was able to run Hive queries from Java, but I still need to access the log information generated after each query. Rather than hacking around it by reading the latest log file on disk and extracting the numbers with a regex, I am looking for a more graceful method, if one already exists.
Any pointers will be helpful. Thanks in advance.
-lg

Query execution details like Status, Finished at, and Finished in are displayed in the JobTracker, and you can access the JobTracker programmatically. Related info at this link:
How could I programmatically get all the job tracker and tasktracker information that is displayed by Hadoop in the web interface?
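For illustration, here is a minimal sketch of that approach, assuming the classic mapred JobClient API (class and method names below are from the MRv1/JobTracker-era client, and availability may vary by Hadoop version):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobStatus;
    import org.apache.hadoop.mapred.RunningJob;

    public class JobTrackerInfo {
        public static void main(String[] args) throws Exception {
            // Picks up mapred-site.xml from the classpath and connects to the JobTracker.
            JobClient client = new JobClient(new JobConf());
            for (JobStatus status : client.getAllJobs()) {
                RunningJob job = client.getJob(status.getJobID());
                if (job == null) continue;
                System.out.printf("%s  name=%s  state=%d  started=%d  complete=%s%n",
                        status.getJobID(), job.getJobName(), job.getJobState(),
                        status.getStartTime(), job.isComplete());
            }
            client.close();
        }
    }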

Once Hive starts running a query, a corresponding map-reduce job starts. The logs of this Hadoop job can be found on the TaskTracker on which each task runs.
Use the JobClient API to retrieve these logs programmatically.
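As a hedged sketch of that route: a RunningJob handle exposes task completion events, and each event carries the TaskTracker's HTTP address, from which the per-attempt task log URL can be assembled. The /tasklog?attemptid=... layout below is the classic MRv1 TaskTracker convention and may differ on your version:

    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.TaskCompletionEvent;

    public class TaskLogLinks {
        // Prints one task-log URL per task attempt of the given (running or finished) job.
        public static void printTaskLogUrls(RunningJob job) throws Exception {
            int start = 0;
            TaskCompletionEvent[] events;
            while ((events = job.getTaskCompletionEvents(start)).length > 0) {
                for (TaskCompletionEvent event : events) {
                    String logUrl = event.getTaskTrackerHttp()
                            + "/tasklog?plaintext=true&attemptid=" + event.getTaskAttemptId();
                    System.out.println(event.getTaskStatus() + "  " + logUrl);
                }
                start += events.length;
            }
        }
    }

From there the log text can be fetched over HTTP and parsed, which avoids guessing at the latest file on local disk.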

Related

As a Hadoop Regular User, Is There a Way to See Details about Running Jobs?

I do not have access to any CLI on any of the Hadoop nodes, but I have access to the cluster via Hue and Jupyter. The engineering team has also configured the Hadoop UI that shows New, Running, Submitted, Finished, etc. applications. However, it appears all Spark jobs have a generic name, for instance something like this:
HIVE-f23fa1a1-4444-4ab2-1c44-12345a123456
or similar, and when I click on the application_id, I get a Failed to read the attempts of the application error (even for my own jobs). Similarly, Spark jobs, which you can normally name using setAppName, are all given a generic "Spark-something" name because the Spark context is already initialized upon bringing up Jupyter on an edge node (i.e. I can't set a name because one already exists).
Is there a way for an unprivileged Hadoop user to see what job is actually running (i.e. the Hive query or the Spark / Hadoop command), without having some sort of CLI privilege?
I have tried using a few urls that I suspect have job information in them, for instance:
http://cluster_master:<portnum>/history/application_1234123412341234_12345/jobs/ or
http://cluster_master:<portnum>/jobs/application_1234123412341234_12345/
but neither attempt returns any details about the job itself (not even the things I named myself within the Hive / Spark context using setAppName).
Please let me know if there's a better way to ask this question. I am relatively new to Hadoop/Spark. All the reference docs and SO answers I've found assume CLI or privileged access and I can't find any documentation in either Spark or Hadoop that applies to this problem.

How to get resources used for FINISHED hadoop jobs from YARN logs using job names?

I have a Unix shell script which runs multiple Hive scripts. I have given job names to every Hive query inside the Hive scripts.
What I need is that, at the end of the shell script, I want to retrieve the resources used (in terms of memory and containers) by the Hive queries, based on the job names, from the YARN logs/applications that have appstatus 'FINISHED'.
How do I do this?
Any help would be appreciated.
You can pull this information from the YARN History Server via its REST APIs.
https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html
Scroll through this documentation and you will see examples of how to get cluster-level information on executed jobs, and then how to get information on individual jobs.
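A hedged sketch of that REST route is below. It lists finished applications and prints the raw JSON, in which the "name" field can be matched against the job names set in the Hive scripts. The host/port and the exact resource fields (e.g. memorySeconds, vcoreSeconds, container counts) depend on your Hadoop version, and the ResourceManager's apps endpoint is used here purely as an illustration of the same style of query:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FinishedAppResources {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; adjust to the history server / ResourceManager endpoint you use.
            URL url = new URL("http://resourcemanager-host:8088/ws/v1/cluster/apps?states=FINISHED");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");
            // The response is a JSON list of apps; filter it by the "name" field and read
            // the resource counters from the matching entries.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line);
                }
            }
        }
    }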

How to expose Hadoop job and workflow metadata using Hive

What I would like to do is make workflow and job metadata, such as start date, end date and status, available in a Hive table to be consumed by a BI tool for visualization purposes. I would like to be able to monitor, for example, whether a certain workflow fails at certain hours, its success rate, and so on.
For this purpose I need access to the same data Hue is able to show in the job browser and Oozie dashboard. What I am looking for specifically for workflows for example is the name, submitter, status, start and end time. The reason that I want this is that in my opinion this tool lacks a general overview and good search.
The idea is that once I locate this data I will load it into Hive, either directly or through some processing steps.
Questions that I would like to see answered:
Is this data stored in HDFS or is it scattered in local data nodes?
If it is stored in HDFS, where can I find it? If it is stored on the local data nodes, how does Hue find and show it?
Assuming I can access the data, in what format should I expect it? Is it stored in general log files, or can I expect somewhat structured data?
I am using CDH 5.8
If jobs are submitted in ways other than Oozie, my approach won't be helpful.
We have collected all the logs from the Oozie server through the Oozie Java API and iterated over the coordinator information to get the required info.
You need to think about what kind of information you need to retrieve.
If all your jobs are submitted through a Bundle, then go from the bundle to the coordinator and then to the workflow to find the info.
If you want to get all the coordinator info, simply call the API with the number of coordinators to fetch and retrieve the required info.
We then loaded the fetched results into a Hive table, where one can filter for failed or timed-out coordinators and various other parameters.
You can start by looking at the example from the Oozie site:
https://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html#Java_API_Example
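A minimal sketch of that approach with the Oozie Java client; the Oozie URL, the filter strings, and the 0/100 paging window are placeholders to adapt:

    import java.util.List;
    import org.apache.oozie.client.CoordinatorJob;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieMetadataDump {
        public static void main(String[] args) throws Exception {
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            // Coordinator-level view (empty filter = all coordinators; adjust as needed).
            List<CoordinatorJob> coords = oozie.getCoordJobsInfo("", 0, 100);
            for (CoordinatorJob coord : coords) {
                System.out.printf("%s\t%s\t%s\t%s\t%s%n", coord.getAppName(), coord.getUser(),
                        coord.getStatus(), coord.getStartTime(), coord.getEndTime());
            }

            // Workflow-level view, e.g. only failed runs.
            List<WorkflowJob> workflows = oozie.getJobsInfo("status=FAILED", 0, 100);
            for (WorkflowJob wf : workflows) {
                System.out.printf("%s\t%s\t%s\t%s\t%s%n", wf.getAppName(), wf.getUser(),
                        wf.getStatus(), wf.getStartTime(), wf.getEndTime());
            }
        }
    }

Writing these tab-separated rows to a file and loading that file is one simple way to get them into the Hive table mentioned above.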
If you want to track the status of your jobs scheduled in Oozie, you should use the Oozie RESTful API or Java API. I haven't worked with the Hue interface for operating Oozie, but I guess it still uses the REST API behind the scenes. It provides you with all the necessary information, and you can create a service which consumes this data and pushes it into a Hive table.
Another option is to access the Oozie database. As you probably know, Oozie keeps all the data about the scheduled jobs in an RDBMS such as MySQL or Postgres. You can consume this information through a JDBC connector. An interesting approach would be to link this information directly into Hive as a set of external tables through the JDBCStorageHandler. Not sure whether it works, but it is worth a try.
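For the database option, a hedged JDBC sketch is below. The JDBC URL, credentials, and the WF_JOBS table and column names are assumptions about the Oozie schema and should be verified against your Oozie version before relying on them:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class OozieDbQuery {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details for the Oozie metadata database.
            String url = "jdbc:mysql://oozie-db-host:3306/oozie";
            try (Connection conn = DriverManager.getConnection(url, "oozie", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT app_name, user_name, status, start_time, end_time FROM WF_JOBS")) {
                while (rs.next()) {
                    System.out.printf("%s\t%s\t%s\t%s\t%s%n",
                            rs.getString("app_name"), rs.getString("user_name"),
                            rs.getString("status"), rs.getTimestamp("start_time"),
                            rs.getTimestamp("end_time"));
                }
            }
        }
    }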

Find and set Hadoop logs to verbose level

I need to track what is happening when I run a job or upload a file to HDFS. In SQL Server I do this using SQL Profiler. However, I am missing such a tool for Hadoop, so I assume I can get some information from the logs. I think all logs are stored at /var/logs/hadoop/, but I am confused about which file I need to look at and how to set it to capture detailed information.
I am using HDP2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of the NameNode and DataNode services. Each has its own log. The location of the logs is distribution-dependent. See File Locations for Hortonworks, or Apache Hadoop Log Files: Where to find them in CDH, and what info they contain for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application type in YARN, so you are talking about the ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MR application (the M/R component), which is a YARN application with yet another log of its own.
Jobs consist of tasks, and tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
Higher-level components (e.g. Hive, Pig, Kafka) have their own logs, aside from the logs produced by the jobs they submit (which log like any other job).
The good news is that vendor-specific distributions (Cloudera, Hortonworks, etc.) provide UIs to expose the most common logs for easy access. Usually they expose the logs collected by the JobHistory service from the UI that shows job status and job history.
I cannot point you to anything equivalent to SQL Profiler, because the problem space is orders of magnitude more complex, with many different products, versions and vendor-specific distributions involved. I recommend starting by reading about and learning how the Job History server runs and how it can be accessed.

hadoop mapreduce - API for getting job log

I'm developing a Hadoop MapReduce application and I need to present the task log to the end user
(the same as Hue does).
Is there a Java API that extracts the logs of a specific job?
I tried the JobClient API without any success.
The Job Attempts API of the HistoryServer provides a link to the logs of each task.
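A hedged sketch of calling that endpoint from Java; the History Server host/port (19888 is the usual MapReduce JobHistory web port) and the job id are placeholders, and the "logsLink" field in the JSON response is the link to present to the end user:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class JobAttemptLogs {
        public static void main(String[] args) throws Exception {
            String jobId = "job_1234123412341234_12345";  // placeholder job id
            URL url = new URL("http://historyserver-host:19888/ws/v1/history/mapreduce/jobs/"
                    + jobId + "/jobattempts");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");
            // Each attempt entry in the response includes a "logsLink" URL pointing at its logs.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line);
                }
            }
        }
    }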
