Where does the MapReduce History Server store its data? - hadoop

Based on the documentation (MapReduce History Server API), I can get all the information using different REST calls.
Does anyone know where that data is originally stored and read from by the History Server? Also, what format is it in?

It stores the data in HDFS. It will be under /user/history/done and owned by mapred in the Cloudera and Hortonworks distributions.
You can also configure custom locations using the parameters mapreduce.jobhistory.done-dir and mapreduce.jobhistory.intermediate-done-dir.
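As a quick way to check those locations on your own cluster, here is a minimal sketch using the Hadoop FileSystem API. It assumes the cluster's *-site.xml files are on the classpath; the fallback paths passed to conf.get() are just the common CDH/HDP defaults, not guaranteed values. The done directory typically holds a .jhist history file plus a _conf.xml file per finished job, organized in dated subdirectories.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class JobHistoryLocation {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // where finished job history files end up (fallbacks are typical CDH/HDP defaults)
            String doneDir = conf.get("mapreduce.jobhistory.done-dir",
                    "/user/history/done");
            // where still-running jobs write their history before it is moved
            String intermediateDir = conf.get("mapreduce.jobhistory.intermediate-done-dir",
                    "/user/history/done_intermediate");

            System.out.println("done-dir:              " + doneDir);
            System.out.println("intermediate-done-dir: " + intermediateDir);

            // list the top level of the done dir; the .jhist and _conf.xml files
            // sit in dated subdirectories below it
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path(doneDir))) {
                System.out.println(status.getPath());
            }
        }
    }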

Related

Hadoop Integration with Document Capture Software

We have a requirement to send documents to Hadoop (Hortonworks) from our image capture software: the capture software releases PDF documents with metadata.
I don't have much idea about HDP. Is there any REST service or tool that can add documents to Hadoop along with their metadata?
Please help
Hadoop HDFS has both WebHDFS and the NFS Gateway.
However, it's generally recommended not to dump raw data straight onto HDFS; if you control how the data gets there, you can better audit where and how it gets written.
For example, you could use Apache NiFi: start with a ListenHTTP processor, read the document data, parse, filter, and enrich it, and then optionally write it to HDFS or many other destinations.
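For completeness, if you do go the WebHDFS route, an upload is a two-step dance: ask the NameNode for a write location, then PUT the bytes to the DataNode it redirects you to. A minimal Java sketch, assuming an unsecured cluster with the NameNode web port at 9870 (50070 on Hadoop 2); the host, HDFS path, and local file name are hypothetical:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class WebHdfsUpload {
        public static void main(String[] args) throws Exception {
            String nameNode = "http://namenode-host:9870";        // assumption: Hadoop 3 default port
            String target   = "/data/capture/invoice-0001.pdf";   // hypothetical HDFS path

            // Step 1: ask the NameNode where to write; it answers with a 307 redirect to a DataNode
            URL createUrl = new URL(nameNode + "/webhdfs/v1" + target
                    + "?op=CREATE&overwrite=true&user.name=hdfs");
            HttpURLConnection nn = (HttpURLConnection) createUrl.openConnection();
            nn.setRequestMethod("PUT");
            nn.setInstanceFollowRedirects(false);
            String dataNodeUrl = nn.getHeaderField("Location");
            nn.disconnect();

            // Step 2: stream the file content to the DataNode URL returned above
            HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
            dn.setRequestMethod("PUT");
            dn.setDoOutput(true);
            try (OutputStream out = dn.getOutputStream()) {
                out.write(Files.readAllBytes(Paths.get("invoice-0001.pdf")));
            }
            System.out.println("HTTP " + dn.getResponseCode());   // 201 Created on success
        }
    }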

Where are the Hadoop Resource Manager metrics stored?

The Resource Manager REST API provides the status of all applications.
I'm curious to know where this information is actually stored.
Is it possible to get this information to HBase/Hive?
No, you cannot get this information from HBase or Hive, because the Resource Manager REST APIs return live data from data structures inside the RM. The application logs are stored locally on the NodeManagers and in HDFS, and ZooKeeper maintains some state information that could be extracted independently of the RM, but that's all.
Have you looked at Timeline Server v2? ATSv2 can store all application metrics and uses HBase as its storage backend. (Link: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html)
Check that ATSv2 is supported in your version of Hadoop.
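For reference, the live data the RM serves can be pulled straight from its REST endpoint and then persisted wherever you like (for example into Hive or HBase via your own pipeline). A minimal sketch, assuming an unsecured cluster with the ResourceManager web UI on the default port 8088; the host name is hypothetical:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RmAppsDump {
        public static void main(String[] args) throws Exception {
            // the /ws/v1/cluster/apps endpoint returns JSON describing all applications
            URL url = new URL("http://rm-host:8088/ws/v1/cluster/apps");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);   // raw JSON; parse and load into Hive/HBase as needed
                }
            }
        }
    }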

How to expose Hadoop job and workflow metadata using Hive

What I would like to do is make workflow and job metadata such as start date, end date, and status available in a Hive table, to be consumed by a BI tool for visualization purposes. I would like to be able to monitor, for example, whether a certain workflow fails at certain hours, its success rate, ...
For this purpose I need access to the same data Hue is able to show in the job browser and the Oozie dashboard. What I am looking for, specifically for workflows, is for example the name, submitter, status, and start and end time. The reason I want this is that, in my opinion, this tool lacks a general overview and good search.
The idea is that once I locate this data I will load it into Hive, either directly or through some processing steps.
Questions that I would like to see answered:
Is this data stored in HDFS, or is it scattered across local data nodes?
If it is stored in HDFS, where can I find it? If it is stored on local data nodes, how does Hue find and show it?
Assuming I can access the data, in what format should I expect it? Is it stored in general log files, or can I expect somewhat structured data?
I am using CDH 5.8
If jobs are submitted in ways other than through Oozie, my approach won't be helpful.
We collected all the logs from the Oozie server through the Oozie Java API and iterated over the coordinator information to get the required info.
You need to think about what kind of information you need to retrieve.
If all your jobs are submitted through a bundle, then go from the bundle to the coordinator and then to the workflow to find the info.
If you just want all the coordinator info, simply call the API with the number of coordinators to fetch and extract the required fields.
We then loaded the fetched results into a Hive table, where one can filter for failed or timed-out coordinators and various other parameters.
You can start by looking into the example on the Oozie site:
https://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html#Java_API_Example
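As a minimal sketch of that approach using the Oozie Java client (org.apache.oozie.client.OozieClient): the Oozie URL is hypothetical and the empty filter plus paging values are just examples. The printed tab-separated lines can then be written to a file and loaded into a Hive table.

    import java.util.List;
    import org.apache.oozie.client.CoordinatorJob;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieJobDump {
        public static void main(String[] args) throws Exception {
            // assumption: Oozie server reachable at this URL; adjust host/port
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            // fetch the first 100 coordinator jobs (empty filter = all jobs)
            List<CoordinatorJob> coords = oozie.getCoordJobsInfo("", 1, 100);
            for (CoordinatorJob c : coords) {
                System.out.println(c.getAppName() + "\t" + c.getUser() + "\t"
                        + c.getStatus() + "\t" + c.getStartTime() + "\t" + c.getEndTime());
            }

            // fetch the first 100 workflow jobs the same way
            List<WorkflowJob> workflows = oozie.getJobsInfo("", 1, 100);
            for (WorkflowJob w : workflows) {
                System.out.println(w.getAppName() + "\t" + w.getUser() + "\t"
                        + w.getStatus() + "\t" + w.getStartTime() + "\t" + w.getEndTime());
            }
        }
    }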
If you want to track the status of your jobs scheduled in Oozie, you should use the Oozie RESTful API or Java API. I haven't worked with the Hue front end for operating Oozie, but I guess it still uses the REST API behind the scenes. It provides you with all the necessary information, and you can create a service that consumes this data and pushes it into a Hive table.
Another option is to access the Oozie database. As you probably know, Oozie keeps all the data about scheduled jobs in an RDBMS such as MySQL or Postgres. You can consume this information through a JDBC connector. An interesting approach would be to try to link this information directly into Hive as a set of external tables through the JdbcStorageHandler. Not sure if it works, but it's worth a try.

Accessing Hadoop data using REST service

I am trying to update our HDP architecture so that data residing in Hive tables can be accessed by REST APIs. What are the best approaches to exposing data from HDP to other services?
This is my initial idea:
I am storing data in Hive tables and I want to expose some of the information through a REST API, therefore I thought that using HCatalog/WebHCat would be the best solution. However, I found out that it only allows querying metadata.
What are the options that I have here?
Thank you
You can very well use WebHDFS, which is basically a REST service over Hadoop.
Please see documentation below:
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
The REST API gateway for the Apache Hadoop ecosystem is called Knox.
I would check it out before exploring any other options. In other words, do you have any reason to avoid using Knox?
What version of HDP are you running?
The Knox component has been available for quite a while and manageable via Ambari.
Can you get an instance of HiveServer2 running in HTTP mode?
This would give you SQL access through JDBC/ODBC drivers without requiring Hadoop config and binaries (other than those required for the drivers) on the client machines.
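A minimal sketch of the HiveServer2-in-HTTP-mode route, assuming the default HTTP port 10001, httpPath=cliservice, no Kerberos, and the hive-jdbc driver (plus its dependencies) on the client classpath; the host name is hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveHttpQuery {
        public static void main(String[] args) throws Exception {
            // register the Hive JDBC driver (auto-registered in recent versions)
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // transportMode=http and httpPath=cliservice select the HTTP transport
            String url = "jdbc:hive2://hiveserver2-host:10001/default;"
                       + "transportMode=http;httpPath=cliservice";

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));   // table names in the default database
                }
            }
        }
    }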

Siebel applications Hadoop connectivity

I would like to understand whether Hadoop supports Siebel applications; can anybody share their experience in doing that? I looked for online documentation and was not able to find any proper link explaining this, so I'm posting the question here.
I have a Siebel application running on an Oracle database, and I would like to replace it with Hadoop. Is that possible?
No is the answer.
Basically, Hadoop isn't a database at all.
Hadoop is fundamentally a distributed file system (HDFS): it lets you store large amounts of file data on a cluster of machines, handling data redundancy and so on.
On top of that distributed file system, it provides an API for processing all the stored data using something called MapReduce.
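To give a sense of what that MapReduce API looks like (and how different it is from a database), here is the classic word-count job; input and output paths are passed as arguments and must live on HDFS:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // map: emit (word, 1) for every whitespace-separated token in a line
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // reduce: sum the counts for each word
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }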
