How to monitor a Hadoop cluster with ELK

I'm looking into the possibilities of monitoring a Hadoop cluster with the ELK/EFK stack. I have searched public sources but couldn't find anything relevant.
Any help in this regard will be highly appreciated.

It's not clear what you're trying to monitor.
Almost everything in Hadoop is a Java process, so attaching a JMX agent such as the Prometheus JMX Exporter or Jolokia would expose metrics over REST, and from there you would have to periodically poll those endpoints and index the results into Elasticsearch.
To enable JMX, you'd have to edit the hadoop-env.sh scripts, I believe, for YARN and HDFS, since those control the JVM options. Hive, Spark, HBase, etc. all have similar scripts.
There is a general Jolokia example here: https://www.elastic.co/blog/monitoring-java-applications-with-metricbeat-and-jolokia
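A minimal sketch of that polling approach, assuming a Jolokia agent is attached to a NameNode JVM on port 8778 and Elasticsearch is reachable on localhost:9200 (the host names, index name, and MBean are placeholders):

```python
import datetime

import requests
from elasticsearch import Elasticsearch

# Assumed endpoints -- adjust to your cluster.
JOLOKIA_URL = "http://namenode-host:8778/jolokia/read/java.lang:type=Memory"
es = Elasticsearch("http://localhost:9200")


def poll_once():
    # Jolokia's read API returns JSON like {"value": {...}, "timestamp": ..., "status": 200}
    resp = requests.get(JOLOKIA_URL, timeout=10)
    resp.raise_for_status()
    payload = resp.json()

    doc = {
        "@timestamp": datetime.datetime.utcnow().isoformat(),
        "mbean": "java.lang:type=Memory",
        "value": payload.get("value"),
    }
    # One document per poll; run this on a schedule (cron, etc.).
    es.index(index="hadoop-jvm-metrics", document=doc)  # elasticsearch-py 8.x signature


if __name__ == "__main__":
    poll_once()
```

In practice the Metricbeat Jolokia module from the linked post does this polling and shipping for you; the snippet is only meant to show the moving parts.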
Other than that, Filebeat and Metricbeat operate the same on Hadoop nodes as on any other system.
If you used Cloudera Manager or Ambari to manage your cluster, then monitoring would be provided for you by those tools.

Related

What is the best practice for NiFi production deployment

I have a three-node NiFi cluster. We just installed the NiFi packages on Linux machines and clustered them with a separate ZooKeeper cluster. I am planning to monitor NiFi performance via Nagios, but we saw that Hortonworks Ambari also provides features for management and monitoring.
What is the best practice for NiFi deployment in production?
How should we scale up?
How can we monitor NiFi?
Should we monitor queue/process performance?
Should we use something like Ambari?
Regards.
Edit-1:
#James Actually I am collecting user event logs from several sources within the company.
All events are first written to Kafka. NiFi consumes from Kafka and does simple transformations like promoting a field from the payload to an attribute.
After the transformations, the data is written to both Elasticsearch and HDFS. Before writing to HDFS we merge FlowFiles, so the writes to HDFS happen in batches.
I have around 50k events/s.

Deploy Elasticsearch for Apache Spark on Kubernetes

I'm wondering if anyone has experience configuring a Kubernetes cluster with the Elasticsearch for Hadoop library. I'm running into issues with node discovery timing out when trying to write from Spark to Elasticsearch. I have Elasticsearch up and running thanks to the elasticsearch-cloud-kubernetes plugin for ES, which handles discovery, but I'm not sure how best to configure elasticsearch-hadoop to be aware of the nodes (pods) within the Kubernetes cluster. I've tried setting spark.es.nodes to an es-client service, but that doesn't seem to work. I'm also aware that I could enable es.nodes.wan.only, but as noted in the documentation this would severely impact performance, which defeats the purpose of running them on the same cluster. Any help would be appreciated.
I'm not that well versed in elasticsearch-hadoop, but have you tried pointing elasticsearch-hadoop at your Elasticsearch service instead of at specific nodes? Your master nodes will normally take care of everything in your ES cluster.
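For what it's worth, a hedged PySpark sketch of that suggestion: point es.nodes at the Kubernetes Service DNS name rather than at individual pods. The service name es-client and the default namespace are assumptions, and the elasticsearch-spark connector jar is assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("es-hadoop-k8s-sketch")
    # Target the Service DNS name (placeholder), not individual pod IPs.
    .config("spark.es.nodes", "es-client.default.svc.cluster.local")
    .config("spark.es.port", "9200")
    # Last resort if discovery still hands back unreachable addresses:
    # .config("spark.es.nodes.wan.only", "true")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "msg"])

(
    df.write
    .format("org.elasticsearch.spark.sql")
    .mode("append")
    .save("spark-demo")  # target index (placeholder)
)
```

Keeping wan.only off preserves parallel writes to the data nodes; only fall back to it if the discovered pod addresses really are unreachable from the Spark executors.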

Does Hadoop itself contain fault-tolerance/failover functionality?

I just installed a new version of Hadoop 2. I wish to know: if I configure a Hadoop cluster and it's brought up, how can I know whether data transmission has failed and a failover is needed?
Do I have to install other components like ZooKeeper to track/enable any HA events?
Thanks!
High Availability is not enabled by default. I would highly encourage you to read the Apache Hadoop documentation (http://hadoop.apache.org/); it gives an overview of the architecture and the services that run on a Hadoop cluster.
ZooKeeper is required for many Hadoop services to coordinate their actions across the entire Hadoop cluster, regardless of whether the cluster is HA or not. More information can be found in the Apache ZooKeeper documentation (http://zookeeper.apache.org/).
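If you do set up HDFS HA, one hedged way to see which NameNode is currently active is the NameNode's built-in /jmx servlet. The host names below are placeholders, and 50070 is the Hadoop 2 web UI default (9870 in Hadoop 3):

```python
import requests

# Assumed NameNode hosts for an HA pair.
NAMENODES = ["nn1.example.com:50070", "nn2.example.com:50070"]

for nn in NAMENODES:
    # The /jmx servlet exposes MBeans as JSON; the NameNodeStatus bean
    # reports that NameNode's HA state.
    url = f"http://{nn}/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
    beans = requests.get(url, timeout=5).json().get("beans", [])
    state = beans[0].get("State", "unknown") if beans else "unknown"
    print(f"{nn}: {state}")  # expected: "active" or "standby"
```

The hdfs haadmin -getServiceState <serviceId> command reports the same information from the command line.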

Resource usage from Ambari

I have a few Hive jobs and MapReduce programs running in my cluster. I can check general resource utilization in Ambari, but I want to see the resources utilized by individual applications. Is this possible through the Ambari API? Can you provide some clues?
To my knowledge, the metrics provided by Ambari are for the whole cluster.
But you can check the MapReduce2 Job History UI; it sounds like that is what you are looking for. Check out this link, where there is a more detailed description:
http://hortonworks.com/blog/elephants-can-remember-mapreduce-job-history-in-hdp-2-0/
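The data behind that UI is also available over REST, so you can pull per-job numbers programmatically. A hedged sketch, where the history server host is a placeholder and 19888 is the default web port of the MapReduce JobHistory server:

```python
import requests

# Assumed host; adjust to your cluster.
HISTORY_SERVER = "http://historyserver.example.com:19888"

resp = requests.get(f"{HISTORY_SERVER}/ws/v1/history/mapreduce/jobs", timeout=10)
resp.raise_for_status()

for job in (resp.json().get("jobs") or {}).get("job", []):
    # Each entry describes one finished MapReduce job; per-job counters
    # (CPU milliseconds, HDFS bytes read/written, etc.) live under
    # .../ws/v1/history/mapreduce/jobs/<job-id>/counters.
    print(job["id"], job.get("name"), job.get("state"))
```

The YARN ResourceManager's /ws/v1/cluster/apps endpoint exposes similar per-application figures (e.g. memorySeconds and vcoreSeconds), which also covers applications that are not plain MapReduce jobs.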

Monitoring a Hadoop cluster with Ganglia

I'm new to Hadoop and am trying to monitor a multi-node cluster using Ganglia. gmond is set up on all nodes and the Ganglia monitor only on the master. However, there are Hadoop metrics graphs only for the master node and just system metrics for the slaves. Do these Hadoop metrics on the master include the slave metrics as well, or is there a mistake in the configuration files? Any help would be appreciated.
I think you should read this in order to understand how metrics flow between the master and the slaves.
However, I would like to note that, in general, Hadoop-based or HBase-based metrics are emitted/sent directly to the master server (by master server, I mean the server on which gmetad is installed). All other OS-related metrics are first collected by the gmond installed on the corresponding slave and then forwarded to the gmond installed on the master server.
So, if you are not getting any OS-related metrics from the slave servers, there is some misconfiguration in your gmond.conf. To learn more about how to configure Ganglia, please read this; it has helped me and should help you too, if you go through it carefully.
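One quick way to see which metrics a given gmond is actually reporting is to dump its XML feed. This is a sketch that assumes the default tcp_accept_channel port 8649 and a placeholder slave host name:

```python
import socket

# Assumed slave host; 8649 is gmond's default tcp_accept_channel port.
HOST, PORT = "slave1.example.com", 8649

chunks = []
with socket.create_connection((HOST, PORT), timeout=10) as sock:
    # gmond dumps its current metric state as one XML document, then closes.
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

xml = b"".join(chunks).decode("utf-8", errors="replace")
# Metrics appear as <METRIC NAME="..."> elements; check whether the Hadoop
# metric names you expect are present alongside the OS-level ones.
print([line for line in xml.splitlines() if "METRIC NAME" in line][:20])
```

If the OS metrics are there but the Hadoop metrics never reach the master, the Ganglia sink settings in hadoop-metrics2.properties are usually the place to look rather than gmond.conf.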
There is a mistake in your configuration files.
More precisely, in how the data is transmitted/collected, whichever approach you use.
