What I would like to do is make workflow and job metadata such as start date, end date and status available in a Hive table, to be consumed by a BI tool for visualization purposes. I would like to be able to monitor, for example, whether a certain workflow fails at certain hours, its success rate, and so on.
For this purpose I need access to the same data Hue is able to show in the job browser and the Oozie dashboard. For workflows specifically, what I am looking for is the name, submitter, status, and start and end time. The reason I want this is that, in my opinion, this tool lacks a general overview and good search.
The idea is that once I locate this data, I will load it into Hive directly or through some processing steps.
Questions that I would like to see answered:
Is this data stored in HDFS, or is it scattered across the data nodes' local storage?
If it is stored in HDFS, where can I find it? If it is stored locally on the data nodes, how does Hue find and show it?
Assuming I can access the data, what format should I expect it in? Is it stored in general log files, or can I expect somewhat structured data?
I am using CDH 5.8
If jobs are submitted in ways other than through Oozie, my approach won't be helpful.
We have collected all the logs from the Oozie server through the Oozie Java API and iterated over the coordinator information to get the required info.
You need to think about what kind of information you want to retrieve.
If all your jobs are submitted through a bundle, then walk from the bundle to its coordinators and then to the workflows to find the info.
If you just want all the coordinator info, simply call the API with the number of coordinators to fetch and pull out the required fields.
We then loaded the fetched results into a Hive table, where one can filter for failed or timed-out coordinators and various other parameters.
You can start with the Java API example on the Oozie site:
https://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html#Java_API_Example
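In rough terms, fetching the workflow-level fields (name, submitter, status, start and end time) with the Oozie Java client looks something like the sketch below; the Oozie URL and the page size are placeholders you would replace with your own:

```java
import java.util.List;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

// Dump basic workflow metadata so it can be written out (e.g. as CSV) and loaded into Hive.
public class OozieJobDump {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie URL; point it at your own server.
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        // Fetch the first 1000 workflow jobs; an empty filter returns everything.
        List<WorkflowJob> jobs = client.getJobsInfo("", 1, 1000);
        for (WorkflowJob job : jobs) {
            System.out.printf("%s,%s,%s,%s,%s,%s%n",
                    job.getId(), job.getAppName(), job.getUser(),
                    job.getStatus(), job.getStartTime(), job.getEndTime());
        }
        // client.getCoordJobsInfo("", 1, 1000) works the same way for coordinators.
    }
}
```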
If you want to track the status of your jobs scheduled in Oozie, you should use the Oozie RESTful API or Java API. I haven't worked with the Hue front end for operating Oozie, but I guess it still uses the REST API behind the scenes. It provides you with all the necessary information, and you can create a service that consumes this data and pushes it into a Hive table.
Another option is to access the Oozie database. As you probably know, Oozie keeps all the data about scheduled jobs in an RDBMS such as MySQL or Postgres. You can consume this information through a JDBC connector. An interesting approach would be to try to expose this information directly in Hive as a set of external tables through the JdbcStorageHandler. Not sure whether it works, but it's worth a try.
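If you go the REST route, the call itself is plain HTTP; a minimal sketch (the host name is hypothetical, and the returned JSON still needs to be parsed before loading into Hive):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Pull the workflow list as JSON from the Oozie REST API; the JSON can then be
// parsed and appended to a file that a Hive external table points at.
public class OozieRestFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie host; v1/jobs returns a JSON document with a "workflows" array.
        URL url = new URL("http://oozie-host:11000/oozie/v1/jobs?jobtype=wf&offset=1&len=50");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // hand this JSON to your parser / Hive loader
            }
        } finally {
            conn.disconnect();
        }
    }
}
```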
I am a total beginner with Hadoop, so sorry if this is a stupid question.
My fictional scenario is that I have several web servers (IIS) with several log locations. I want to centralize these log files and, based on the data, analyze the health of the applications and the web servers.
Since the Hadoop ecosystem offers a variety of tools, I am not sure whether my solution is a valid one.
So I thought I would move the log files to HDFS, create an external table on the directory as well as an internal table, and copy the data via Hive (insert into ... select from) from the external table to the internal table (with some filtering to drop the comment lines beginning with #).
Once the data is stored in the internal table, I delete the previously moved files from HDFS.
Technically it works, I have already tried it, but is this a reasonable approach?
And if yes, how would I automate these steps? So far I have done everything manually via Ambari.
Thanks for your input
BW
Yes, this is a perfectly fine approach.
Outside of setting up the Hive table ahead of time, what's left to automate?
You want to run things on a schedule? Use Oozie, Luigi, Airflow, or Azkaban.
Ingesting logs from other Windows servers because you have a highly available web service? Use Puppet, for example, to configure your log collection agents (not Hadoop related).
Note: if it's only log file collection that you care about, I would probably use Elasticsearch instead of Hadoop to store the data, Filebeat to continuously watch the log files, Logstash to apply per-message filtering, and Kibana to do the visualizations. If you combine Elasticsearch for fast indexing/searching with Hadoop for archival, you can insert Kafka between the log message ingestion and the message writers/consumers.
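If you stay with the Hive route, the copy step from the question can be wrapped in a small program that whichever scheduler you pick invokes on a timer. A minimal sketch over HiveServer2 JDBC, assuming the staging (external) table exposes each raw log line in a column named line; the host, database and table names are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Run the external-to-internal copy step over HiveServer2 JDBC so that a
// scheduler (Oozie, Airflow, ...) can invoke it instead of doing it by hand.
public class IisLogLoad {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 host and table names.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Copy from the external staging table, dropping IIS comment lines that start with '#'.
            stmt.execute("INSERT INTO TABLE iis_logs "
                    + "SELECT * FROM iis_logs_staging WHERE line NOT LIKE '#%'");
        }
    }
}
```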
I do not have access to any CLI on any of the Hadoop nodes, but I have access to the cluster via Hue and Jupyter. The engineering team has also configured the Hadoop UI that shows New, Running, Submitted, Finished, etc. applications. However, it appears all Spark jobs have a generic name, for instance, something like this:
HIVE-f23fa1a1-4444-4ab2-1c44-12345a123456
or similar, and when I click on the application_id, I get a "Failed to read the attempts of the application" error (even for my own jobs). Similarly, Spark jobs, which you can normally name using setAppName, are all named a generic "Spark-something" because the Spark context is already initialized when Jupyter is brought up on an edge node (i.e. I can't set a name because one already exists).
Is there a way for an unprivileged Hadoop user to see what a job is actually running (i.e. the Hive query or the Spark/Hadoop command), without having some sort of CLI privilege?
I have tried using a few urls that I suspect have job information in them, for instance:
http://cluster_master:<portnum>/history/application_1234123412341234_12345/jobs/ or
http://cluster_master:<portnum>/jobs/application_1234123412341234_12345/
but neither attempt returns any details about the job itself (not even the things I named myself within the Hive/Spark context using setAppName).
Please let me know if there's a better way to ask this question. I am relatively new to Hadoop/Spark. All the reference docs and SO answers I've found assume CLI or privileged access and I can't find any documentation in either Spark or Hadoop that applies to this problem.
I have the following requirement that I plan to fulfill through Hadoop frameworks.
I have 40% of data sitting in a SQL Server Database
I have 20% of data available through a Web service
I have the remaining 40% available through another database.
The data from the three sources needs to be joined together to make a fourth data set, which I need to send to two systems: one through a web service call, the other through a direct database import.
To achieve the above, I'm planning to use the Hadoop platform that we already have. The database pulls and pushes can be managed through Sqoop. The transformation is handled through SQL queries written in Hive. All of this is orchestrated through an Oozie workflow.
In the complete gamut of things, what I would like to get help on is:
a. Is it a good approach to directly invoke a web service from Hadoop to fetch the data? Or should I not use Hadoop at all if it involves fetching data from external web services? I don't believe so, as there are ways to make it work, but I would like your views.
b. If this approach is good, how can I materialize it? One option is to provide an Oozie action that invokes the web service and writes the response to an HDFS location. Are there any better options?
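To make option (b) concrete, a minimal sketch of what the main program behind such an Oozie action might do; the endpoint and target path are hypothetical:

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Fetch the web service response and land it in HDFS, so the rest of the
// pipeline (Hive transformations, Sqoop exports) can treat it like any other data set.
public class WebServiceToHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and target path.
        URL endpoint = new URL("http://example.com/api/customers");
        Path target = new Path("/data/staging/webservice/customers.json");

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (InputStream in = endpoint.openStream();
             FSDataOutputStream out = fs.create(target, true)) {
            IOUtils.copyBytes(in, out, conf, false);   // stream the response into HDFS
        }
    }
}
```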
Customize an InputFormat and RecordReader for the web service; that way, Hadoop just treats it as normal input. The first thing you have to do is find a good way to split the web service input into smaller pieces, because MapReduce will start as many map tasks as you have input splits.
At the same time, there may already be a JDBC InputFormat for your DB.
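A minimal sketch of that idea, assuming the service can be paged by URL so that each split is simply one URL to fetch; the property name webservice.urls is made up for the example:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/** Sketch: each split is one web-service URL (e.g. one page of results); the
 *  record reader emits the whole response body as a single Text value. */
public class WebServiceInputFormat extends InputFormat<LongWritable, Text> {

    /** One URL per split. */
    public static class UrlSplit extends InputSplit implements Writable {
        private String url;
        public UrlSplit() { }
        public UrlSplit(String url) { this.url = url; }
        public String getUrl() { return url; }
        @Override public long getLength() { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) throws IOException { out.writeUTF(url); }
        @Override public void readFields(DataInput in) throws IOException { url = in.readUTF(); }
    }

    @Override
    public List<InputSplit> getSplits(JobContext ctx) {
        // "webservice.urls" is a hypothetical comma-separated job property.
        List<InputSplit> splits = new ArrayList<>();
        for (String u : ctx.getConfiguration().getTrimmedStrings("webservice.urls")) {
            splits.add(new UrlSplit(u));           // one map task per URL
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new RecordReader<LongWritable, Text>() {
            private final LongWritable key = new LongWritable(0);
            private final Text value = new Text();
            private boolean consumed = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) throws IOException {
                // Fetch the whole response for this split's URL into memory.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (InputStream in = new URL(((UrlSplit) s).getUrl()).openStream()) {
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = in.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                }
                value.set(buf.toByteArray());
            }

            @Override public boolean nextKeyValue() {
                if (consumed) return false;
                consumed = true;
                return true;              // exactly one record per split
            }
            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return consumed ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}
```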
I am trying to analyze the performance of Hive queries. Though I am able to run Hive queries from Java, I still need access to the log information generated after each query. Instead of a hack that reads the latest log on disk and uses regexes to extract the numbers, I am looking for a more graceful method, if one already exists.
Any pointers will be helpful. Thanks in advance.
-lg
Query execution details like Status, Finished at, and Finished in are displayed in the JobTracker, and you can access the JobTracker programmatically. Related info at this link:
How could I programmatically get all the job tracker and tasktracker information that is displayed by Hadoop in the web interface?
Once Hive starts running a query, a corresponding MapReduce job starts. The logs of this Hadoop job can be found on the TaskTracker on which each of its tasks runs.
Use the JobClient API to retrieve these logs programmatically.
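A minimal sketch with the (old-API) JobClient: it lists the jobs the cluster knows about and prints each job's name, completion state and counters, which carry figures such as CPU time and HDFS bytes read/written:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

// List jobs known to the cluster and pull status/counters for each.
public class JobLogInspector {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf(new Configuration()));
        for (JobStatus status : client.getAllJobs()) {
            RunningJob job = client.getJob(status.getJobID());
            if (job == null) continue;
            System.out.printf("%s\t%s\tcomplete=%s%n",
                    job.getID(), job.getJobName(), job.isComplete());
            System.out.println(job.getCounters());   // per-job execution counters
        }
    }
}
```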
I am implementing a notification system based on the publish-subscribe model, to notify about the availability of data as it arrives or is loaded into HDFS. I couldn't find where to look for this. Is there any HDFS API that can be used to do this, or what method should I use to get information about new data being written to HDFS? I am using Hadoop v2.0.2, and I don't want to use HCatalog; I want to implement my own tool for this.
What you are looking for is Oozie Coordinator.
HDFS is a file system, so something must be built on top of it to check for file availability. HBase has coprocessors, which are trigger-like procedures, but they are only available for HBase tables, so they cannot be used for detecting data availability in HDFS.
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. You can also execute other programs from it:
Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
So you can use the file availability trigger for your notification system too.
If you use HDFS you might want to check out HBase, as it has the functionality you want. In HBase, you can create a pre-put (or post-put) coprocessor, essentially acting as the equivalent of a MySQL trigger: running a bit of code every time data is written to a table.
If HBase doesn't suit your use case and you must use HDFS, AFAIK there aren't similar triggers. You can try wrapping the HDFS API with your own code to perform the notification whenever data is written to your file system under the appropriate circumstances. Alternatively, you can poll HDFS for changes (which sounds like an ugly alternative)...
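If you end up polling, a minimal sketch against the FileSystem API; the directory and interval are arbitrary placeholders, and a real tool would persist the watermark and publish events to its subscribers instead of printing:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Naive watcher: every poll, report files whose modification time is newer
// than the newest one seen in the previous poll.
public class HdfsDirWatcher {
    public static void main(String[] args) throws Exception {
        Path watched = new Path("/data/incoming");   // arbitrary placeholder
        long pollMillis = 30000L;                    // arbitrary placeholder

        FileSystem fs = FileSystem.get(new Configuration());
        long newestSeen = 0L;

        while (true) {
            long threshold = newestSeen;
            for (FileStatus status : fs.listStatus(watched)) {
                if (status.isFile() && status.getModificationTime() > threshold) {
                    newestSeen = Math.max(newestSeen, status.getModificationTime());
                    // A real tool would publish an event to its subscribers here.
                    System.out.println("New data: " + status.getPath());
                }
            }
            Thread.sleep(pollMillis);
        }
    }
}
```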
Hope that helps