I have a few Hive jobs and MapReduce programs running in my cluster. In Ambari I can check general resource utilization, but I want to see the resources used by individual applications. Is that possible through the Ambari API? Can you provide some clues?
To my knowledge, the metrics that Ambari provides are for the whole cluster.
But you can check the MapReduce2 Job History UI; it seems to be what you are looking for. The link below has a more detailed description:
http://hortonworks.com/blog/elephants-can-remember-mapreduce-job-history-in-hdp-2-0/
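If you would rather pull per-job numbers programmatically than read them off the Job History UI, the History Server also serves a REST API. Here is a minimal Python sketch, assuming the History Server is reachable at a base URL you supply (19888 is the usual default web port) and that the response uses the `jobs`/`job` layout. The helper names `fetch_jobs` and `summarize_jobs` are my own, not part of Hadoop.

```python
import json
from urllib.request import Request, urlopen


def fetch_jobs(history_server):
    """Fetch the job list from the MapReduce Job History Server REST API.

    history_server is an assumed base URL such as "http://historyhost:19888".
    """
    url = history_server.rstrip("/") + "/ws/v1/history/mapreduce/jobs"
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


def summarize_jobs(payload):
    """Turn the REST payload into one small dict per job."""
    jobs = (payload.get("jobs") or {}).get("job") or []
    summary = []
    for job in jobs:
        start = job.get("startTime", 0)
        finish = job.get("finishTime", 0)
        summary.append({
            "id": job.get("id"),
            "name": job.get("name"),
            "user": job.get("user"),
            "state": job.get("state"),
            "maps": job.get("mapsTotal"),
            "reduces": job.get("reducesTotal"),
            # Times are epoch milliseconds; a missing or zero value means unknown.
            "runtime_sec": (finish - start) / 1000.0 if start and finish else None,
        })
    return summary
```

Typical use would be `summarize_jobs(fetch_jobs("http://historyhost:19888"))`, where the host name is a placeholder for your own History Server.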
I'm looking into the possibilities of monitoring a Hadoop cluster with the ELK/EFK stack. I have searched publicly available sources but couldn't find anything relevant.
Any help in this regard will be highly appreciated.
It's not clear what you're trying to monitor.
Almost everything in Hadoop is a Java process, so adding a JMX exporter such as the Prometheus JMX exporter or Jolokia would expose metrics over HTTP, and from there you would have to periodically poll those into Elasticsearch.
To enable JMX, you'd have to edit the hadoop-env.sh scripts, I believe, for YARN and HDFS, to control any JVM options. Hive, Spark, HBase, etc. all have similar scripts.
A general example using Jolokia is here: https://www.elastic.co/blog/monitoring-java-applications-with-metricbeat-and-jolokia
Other than that, Filebeat and Metricbeat operate the same as on any other system.
If you use Cloudera Manager or Ambari to control your cluster, then monitoring is provided for you by those tools.
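To make the "poll and push into Elasticsearch" step concrete, here is a rough Python sketch. Note that it swaps in a different source than Jolokia or Prometheus: it assumes you read the JSON that Hadoop daemons serve on their built-in `/jmx` HTTP endpoint, which has a top-level `beans` list. The function name `beans_to_bulk`, the index name, and the document layout are my own choices for illustration.

```python
import json
from datetime import datetime, timezone


def beans_to_bulk(jmx_payload, index, host, wanted_prefix="Hadoop:"):
    """Convert a /jmx JSON payload into an Elasticsearch bulk request body.

    Only numeric attributes of beans whose name starts with wanted_prefix
    are kept, so the documents stay small and easy to map.
    """
    now = datetime.now(timezone.utc).isoformat()
    lines = []
    for bean in jmx_payload.get("beans", []):
        name = bean.get("name", "")
        if not name.startswith(wanted_prefix):
            continue
        metrics = {
            key: value
            for key, value in bean.items()
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        }
        if not metrics:
            continue
        doc = {"@timestamp": now, "host": host, "bean": name, "metrics": metrics}
        # Each document is preceded by an action line, as the bulk API expects.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk body is newline-delimited JSON and must end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

You would fetch the payload from something like `http://<namenode>:50070/jmx` (the Hadoop 2 default NameNode web port; adjust for your cluster) and POST the returned body to Elasticsearch's `_bulk` endpoint with `Content-Type: application/x-ndjson`, on whatever schedule suits you.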
I just installed a new version of Hadoop 2. Once I configure a Hadoop cluster and bring it up, how can I know whether data transmission has failed and a failover is needed?
Do I have to install other components, such as ZooKeeper, to track or enable HA events?
Thanks!
High Availability is not enabled by default. I would highly encourage you to read the Hadoop documentation from Apache (http://hadoop.apache.org/). It gives an overview of the architecture and services that run on a Hadoop cluster.
ZooKeeper is used by many Hadoop services to coordinate their actions across the cluster. HDFS automatic failover depends on it, and some ecosystem services such as HBase need it whether or not the cluster is HA. More information can be found in the Apache ZooKeeper documentation (http://zookeeper.apache.org/).
This might be a lame question, but even after a lot of research I still couldn't figure out how to check MapReduce job logs from the ResourceManager UI in the Hortonworks sandbox.
Any help would be greatly appreciated.
Thanks.
#Ragzz
There are two mechanisms to check logs.
From the Ambari UI, navigate to MapReduce2 and find the Quick Links at the top of the screen, then open JobHistory Logs.
Alternatively, go to the /var/log/hadoop-mapreduce/mapred directory, where all the MapReduce job-related logs can be found.
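If you go the directory route, a small script can save some scrolling. This is only a sketch, assuming the logs are plain text and that you know the job ID you are looking for; `find_job_lines` is a hypothetical helper name, and compressed or rotated files would need extra handling.

```python
import os


def find_job_lines(log_dir, job_id, max_hits=50):
    """Return (path, line_number, text) for log lines mentioning job_id."""
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        for fname in sorted(files):
            path = os.path.join(root, fname)
            try:
                # errors="replace" keeps the scan going on odd bytes.
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if job_id in line:
                            hits.append((path, number, line.rstrip("\n")))
                            if len(hits) >= max_hits:
                                return hits
            except OSError:
                # Skip files we are not allowed to read.
                continue
    return hits
```

For example, `find_job_lines("/var/log/hadoop-mapreduce/mapred", "job_...")` with your own job ID filled in would list the matching lines and the files they came from.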
Let me know how it goes. Many Thanks
I need to track what is happening when I run a job or upload a file to HDFS. In SQL Server I do this using SQL Profiler. However, I miss such a tool for Hadoop, so I am assuming that I can get some information from the logs. I think all logs are stored at /var/logs/hadoop/ but I am confused about which file I need to look at and how to configure it to capture detailed information.
I am using HDP 2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of NameNode and DataNode services. Each has its own log. The location of the logs depends on the distribution. See "File Locations" for Hortonworks, or "Apache Hadoop Log Files: Where to find them in CDH, and what info they contain" for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application in YARN, so you are talking about the ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MapReduce ApplicationMaster (the M/R component), which is a YARN application but has its own log.
Jobs consist of tasks, and tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
Higher-level components (e.g. Hive, Pig, Kafka) have their own logs, aside from the logs produced by the jobs they submit (which log as any job does).
The good news is that vendor-specific distributions (Cloudera, Hortonworks, etc.) provide a UI that exposes the most common logs for easy access. Usually they expose the logs collected by the JobHistory service through the UI that shows job status and job history.
I cannot point you to anything equivalent to SQL Profiler, because the problem space is orders of magnitude more complex, with many different products, versions and vendor-specific distributions involved. I recommend starting by reading about how the Job History server runs and how it can be accessed.
My goal is to provide a web UI with Hadoop job statistics for administrative users.
I use a Hortonworks Hadoop 2 cluster, and jobs run on YARN.
From an architecture perspective, I am planning to collect job-related information (such as start time, end time, mappers, etc.) from the YARN ResourceManager REST API in a scheduled cron job, index it into Elasticsearch, and show it in Kibana.
I wonder if there is a better way to do this.
Have you looked into Ambari? It provides metrics, dashboards, and alerting without having to create the framework from scratch.
Apache Ambari
Ambari provides statistics at the infrastructure level, not at the job level. So you need to write custom code against the YARN REST API, which returns a JSON response. You can then use a JSON parser to extract the exact details. I have written one in Python; you can refer to this link: https://dzone.com/articles/customized-alerts-for-hadoop-jobs-using-yarn-rest
http://thelearnguru.com/customized-alerts-for-hadoop-jobs-using-yarn-rest-api
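For a rough idea of what such custom code can look like (this is my own minimal sketch, not the code from the linked articles), the snippet below reads the ResourceManager `/ws/v1/cluster/apps` endpoint and flattens each application into a document keyed by its application ID. The base URL, the function names and the output field names are assumptions; `memorySeconds` and `vcoreSeconds` may be absent on older Hadoop versions, so they are read defensively.

```python
import json
from urllib.request import Request, urlopen


def fetch_apps(rm_url, states="FINISHED,FAILED,KILLED"):
    """Fetch applications from the YARN ResourceManager REST API.

    rm_url is an assumed base URL such as "http://rmhost:8088".
    """
    url = rm_url.rstrip("/") + "/ws/v1/cluster/apps?states=" + states
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


def apps_to_docs(payload):
    """Return a list of (doc_id, doc) pairs, one per application."""
    apps = (payload.get("apps") or {}).get("app") or []
    docs = []
    for app in apps:
        doc = {
            "name": app.get("name"),
            "user": app.get("user"),
            "queue": app.get("queue"),
            "state": app.get("state"),
            "final_status": app.get("finalStatus"),
            # Times from the API are epoch milliseconds.
            "started_ms": app.get("startedTime"),
            "finished_ms": app.get("finishedTime"),
            "elapsed_sec": app.get("elapsedTime", 0) / 1000.0,
            # Aggregate resource usage; None if the cluster does not report it.
            "memory_seconds": app.get("memorySeconds"),
            "vcore_seconds": app.get("vcoreSeconds"),
        }
        docs.append((app["id"], doc))
    return docs
```

Using the application ID as the document ID means a cron job that runs this repeatedly overwrites existing entries instead of duplicating them when you index the documents.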