The Resource Manager REST API provides the status of all applications.
I'm curious to know where this information is actually stored. Is it possible to get this information into HBase/Hive?
No, you cannot get this information from HBase or Hive, because the ResourceManager REST APIs return live data from in-memory data structures in the RM. The application logs are stored locally on the NodeManagers and in HDFS, and ZooKeeper maintains some state information that could be extracted independently of the RM, but that's all.
Have you looked at Timeline Server v2? ATSv2 can store all application metrics, and it uses HBase as its backing storage. (Link: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html)
Check that ATSv2 is supported in your version of Hadoop.
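For example, here is a minimal sketch of querying the ATSv2 timeline reader REST endpoint. The host, port, application id, and entity type are placeholders, and the exact paths and ports vary by Hadoop version, so verify them against the linked documentation.

```python
# Hypothetical sketch: read one application's entities from the ATSv2 timeline
# reader REST API. Host, port and app id are placeholders; the endpoint layout
# may differ in your Hadoop version.
import json
import urllib.request

TIMELINE_READER = "http://timeline-reader-host:8188"   # placeholder host/port
APP_ID = "application_1234567890123_0001"              # placeholder app id

url = f"{TIMELINE_READER}/ws/v2/timeline/apps/{APP_ID}/entities/YARN_CONTAINER"
with urllib.request.urlopen(url) as resp:
    entities = json.load(resp)

for entity in entities:
    print(entity.get("id"), entity.get("createdtime"))
```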
Will YARN store information about finished apps, including runtime, on HDFS? I just want to get the app runtime from some file on HDFS (if such a file exists; I have checked the logs and there is no runtime information there), without using any monitoring software.
You can use the ResourceManager REST API to fetch information about all the finished applications:
http://resource_manager_host:port/ws/v1/cluster/apps?state=FINISHED
A GET request to that URL will return a JSON response (XML can also be obtained). Parse the response for the elapsedTime of each application to get its running time.
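A minimal sketch of that call, assuming the default RM webapp port 8088 and a placeholder hostname:

```python
# Fetch FINISHED applications from the ResourceManager REST API and print the
# elapsedTime (in milliseconds) reported for each one.
import json
import urllib.request

RM = "http://resource_manager_host:8088"  # placeholder host; 8088 is the usual default port

with urllib.request.urlopen(f"{RM}/ws/v1/cluster/apps?state=FINISHED") as resp:
    data = json.load(resp)

apps = (data.get("apps") or {}).get("app", [])  # "apps" is null when nothing matches
for app in apps:
    print(app["id"], app["name"], app["elapsedTime"])
```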
To look up persistent job history files, you will need to check the Job History Server or the Timeline Server instead of the ResourceManager:
Job history is aggregated onto HDFS and can be seen from the Job History Server UI (or REST API). The history files are stored under the directory configured by mapreduce.jobhistory.done-dir on HDFS.
Job history can also be aggregated by the Timeline Server (filesystem-based, aka ATS 1.5) and can be seen from the Timeline Server UI (or REST API). Those history files are stored under yarn.timeline-service.entity-group-fs-store.done-dir on HDFS.
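As a rough sketch (assuming the default Job History Server webapp port 19888 and a placeholder hostname), the same runtime can also be derived from the Job History Server REST API:

```python
# List finished MapReduce jobs from the Job History Server REST API and derive
# each job's runtime from its finishTime and startTime (both in milliseconds).
import json
import urllib.request

JHS = "http://history_server_host:19888"  # placeholder host; 19888 is the usual default port

with urllib.request.urlopen(f"{JHS}/ws/v1/history/mapreduce/jobs") as resp:
    data = json.load(resp)

for job in (data.get("jobs") or {}).get("job", []):
    print(job["id"], job["state"], job["finishTime"] - job["startTime"])
```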
I am trying to update our HDP architecture so that data residing in Hive tables can be accessed through REST APIs. What are the best approaches to expose data from HDP to other services?
This is my initial idea:
I am storing data in Hive tables, and I want to expose some of that information through a REST API, so I thought that using HCatalog/WebHCat would be the best solution. However, I found out that it only allows querying metadata.
What are the options that I have here?
Thank you
You can very well use WebHDFS, which is basically a REST service over HDFS.
Please see the documentation below:
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
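For instance, a quick sketch of reading a file over WebHDFS (hostname, port, path and user below are placeholders; note that this exposes the raw files under the Hive warehouse directory, not SQL query results):

```python
# Read the first kilobyte of an HDFS file through WebHDFS. The OPEN operation
# answers with a redirect to a datanode, which urllib follows automatically.
import urllib.request

NAMENODE = "http://namenode_host:9870"               # placeholder; 9870 on Hadoop 3.x, often 50070 on older versions
HDFS_PATH = "/user/hive/warehouse/mytable/000000_0"  # placeholder file

url = f"{NAMENODE}/webhdfs/v1{HDFS_PATH}?op=OPEN&user.name=hive"
with urllib.request.urlopen(url) as resp:
    print(resp.read(1024))  # raw bytes of the underlying file
```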
The REST API gateway for the Apache Hadoop ecosystem is called Knox.
I would check it out before exploring any other options. In other words, do you have any reason to avoid using Knox?
What version of HDP are you running?
The Knox component has been available for quite a while and is manageable via Ambari.
Can you get an instance of HiveServer2 running in HTTP mode?
This would give you SQL access through JDBC/ODBC drivers without requiring Hadoop config and binaries (other than those required for the drivers) on the client machines.
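A hedged sketch of what such a client could look like from Python, assuming HiveServer2 is in HTTP mode on port 10001 with the default httpPath of cliservice and that PyHive plus the thrift package are installed (a JDBC client would use a URL along the lines of jdbc:hive2://host:10001/default;transportMode=http;httpPath=cliservice):

```python
# Connect to HiveServer2 over its HTTP transport and run a query. Host, port,
# and table name are placeholders.
from pyhive import hive
from thrift.transport import THttpClient

transport = THttpClient.THttpClient("http://hiveserver2_host:10001/cliservice")
conn = hive.Connection(thrift_transport=transport)

cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table LIMIT 10")  # placeholder table
for row in cursor.fetchall():
    print(row)
```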
Based on the documentation: MapReduce History Server API,
I can get all the information using different REST calls.
Does anyone know where that data is originally stored and read from by the History Server? Also, what format is it in?
It stores the data in HDFS. It will be under /user/history/done and owned by mapred in the Cloudera and Hortonworks distributions. The history files themselves are .jhist files (JSON-encoded job events), along with a _conf.xml per job.
You can also configure custom locations using the parameters mapreduce.jobhistory.done-dir and mapreduce.jobhistory.intermediate-done-dir.
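As an illustration, the done directory can be listed over WebHDFS to see the persisted history files (hostname, port and user below are placeholders; the path is the default quoted above):

```python
# List the job history "done" directory via the WebHDFS LISTSTATUS operation.
import json
import urllib.request

NAMENODE = "http://namenode_host:9870"   # placeholder namenode HTTP address
DONE_DIR = "/user/history/done"          # default location mentioned above

url = f"{NAMENODE}/webhdfs/v1{DONE_DIR}?op=LISTSTATUS&user.name=mapred"
with urllib.request.urlopen(url) as resp:
    statuses = json.load(resp)["FileStatuses"]["FileStatus"]

for st in statuses:
    print(st["type"], st["pathSuffix"])
```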
Is it possible to connect my Hadoop cluster to multiple Google Cloud projects at once?
I can easily use any Google Cloud Storage bucket in a single Google Cloud project via the Google Cloud Storage connector, as explained in the thread Migrating 50TB data from local Hadoop cluster to Google Cloud Storage. But I can't find any documentation or example of how to connect to two or more Google Cloud projects from a single map-reduce job. Do you have any suggestions or tricks?
Thanks a lot.
Indeed, it is possible to connect your cluster to buckets from multiple different projects at once. Ultimately, if you're following the instructions for using a service-account keyfile, the GCS requests are performed on behalf of that service account, which can be treated more or less like any other user. You can either add the service account email your-service-account-email@developer.gserviceaccount.com to all the different cloud projects that own the buckets you want to process (using the permissions section of cloud.google.com/console and simply adding that email address like any other member), or you can set GCS-level ACLs to grant that service account access like any other user.
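To illustrate the point (this is not part of the connector itself): once the service account has been granted access in both projects, the same keyfile can read buckets owned by either project. The sketch below uses the google-cloud-storage Python client purely to show that; the bucket and project names are placeholders.

```python
# One set of service-account credentials listing buckets that live in two
# different projects, mirroring what the GCS connector does for gs:// paths.
from google.cloud import storage
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file("key.json")  # the connector's keyfile
client = storage.Client(project="project-a", credentials=creds)            # placeholder project id

for bucket_name in ("bucket-in-project-a", "bucket-in-project-b"):         # placeholder buckets
    count = sum(1 for _ in client.list_blobs(bucket_name))
    print(bucket_name, count, "objects")
```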
I would like to understand whether Hadoop has support for Siebel applications; can anybody share their experience with doing that? I looked for online documentation and was not able to find any proper link explaining this, so I am posting the question here.
I have a Siebel application running on an Oracle database, and I would like to replace it with Hadoop. Is it possible?
No is the answer.
Basically Hadoop isn't a database at all.
At its core, Hadoop is a distributed file system (HDFS): it lets you store large amounts of file data across a cluster of machines, handling data redundancy, etc.
On top of that distributed file system, it provides an API for processing the stored data using a programming model called MapReduce.
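As a toy illustration of that model (the file name and invocation are hypothetical), here is a word-count mapper and reducer written for Hadoop Streaming: Hadoop feeds input splits to the mapper on stdin, shuffles and sorts the emitted key-value pairs, and hands grouped keys to the reducer.

```python
# wordcount.py -- run with Hadoop Streaming, e.g.
#   -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"
import sys
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word in the input split.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Input arrives sorted by key, so equal keys are adjacent and can be summed.
    pairs = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```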