Deploy Elasticsearch for Apache Spark on Kubernetes - hadoop

I'm wondering if anyone has experience configuring a Kubernetes cluster using the Elasticsearch for Hadoop library. I'm running into issues with node discovery timing out when trying to write from Spark to Elasticsearch. I have Elasticsearch up and running thanks to the elasticsearch-cloud-kubernetes plugin for ES, which handles discovery, but I'm not sure how best to configure elasticsearch-hadoop to be aware of the nodes (pods) within the Kubernetes cluster. I've tried setting spark.es.nodes to an es-client service, but that doesn't seem to work. I'm also aware that I could enable es.nodes.wan.only, but as noted in the documentation, this would severely impact performance, which defeats the purpose of having them running on the same cluster. Any help would be appreciated.

I'm not that schooled on elasticsearch-hadoop but have you tried pointing your elasticsearch-hadoop to your elasticsearch service instead of specific nodes? Your master nodes will normally take care of everything in your ES cluster.
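
To make the suggestion above concrete, here is a minimal sketch of the connector options that point elasticsearch-hadoop at a Service instead of individual pods. The Service name, namespace and index name are hypothetical, and es_hadoop_service_options is just a helper made up for illustration. es.nodes, es.port and es.nodes.discovery are standard elasticsearch-hadoop settings, but whether discovery can stay enabled depends on whether the addresses published by the ES pods are reachable from the Spark executors, so treat this as a starting point rather than a confirmed fix.

```python
def es_hadoop_service_options(service_host, port=9200, discovery=True):
    """Build elasticsearch-hadoop options that target a Kubernetes Service."""
    return {
        # Single entry point: the Service DNS name rather than individual pod IPs.
        "es.nodes": service_host,
        "es.port": str(port),
        # With discovery on, the connector asks the cluster for its nodes and
        # then talks to the addresses they publish, so those must be reachable.
        "es.nodes.discovery": "true" if discovery else "false",
    }


# Hypothetical Service name and namespace.
opts = es_hadoop_service_options("es-client.default.svc.cluster.local")

# With PySpark and the elasticsearch-hadoop jar on the classpath, the options
# could be passed to a DataFrame write like this (not executed here):
#   df.write.format("org.elasticsearch.spark.sql").options(**opts).save("my-index")
```

If the pod addresses turn out not to be reachable from the executors, calling the helper with discovery=False restricts the connector to the declared Service address.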

Related

Elastic search cluster on Kubernetes Cluster vs VM

I want to set up the Elastic Stack (Elasticsearch, Logstash, Beats and Kibana) to monitor my Kubernetes cluster, which runs on on-prem bare metal. I need recommendations on the following two approaches: which one would be more robust, fault-tolerant and production-grade? Let's say I have a K8s cluster named K8-abc.
Approach 1: Would it be good to set up the Elastic Stack outside the Kubernetes cluster?
In this approach, all the logs from pods running in the kube-system namespace and user-defined namespaces would be fetched by Beats (running on K8-abc) and put into the ES cluster, which is configured on Linux bare metal, via Logstash (which is also running on VMs). To fetch the Kubernetes node logs, the Beats running on the respective VMs (the ones that form K8-abc) would fetch the logs and put them into the ES cluster configured on VMs. The thing to note here is that the VMs used to form the ES cluster are not part of K8-abc.
Approach 2: Would it be good to set up the Elastic Stack on the Kubernetes cluster K8-abc itself?
In this approach, all the logs from pods running in the kube-system namespace and user-defined namespaces would be sent to the Elasticsearch cluster configured on K8-abc via Logstash and Beats (both running on K8-abc). To fetch the K8-abc node logs, the Beats running on the VMs (the ones that form K8-abc) would put the logs into ES running on K8-abc via Logstash, which is also running on K8-abc.
Can someone help me evaluate the pros and cons of these two approaches? Links to relevant blogs and case studies would also be helpful.
I would be more inclined to the second solution. It has many advantages over the first one, although it may seem more complex when it comes to the initial setup. You can ask a similar question about migrating any other type of workload to Kubernetes. It has many advantages over VMs. To name just a few:
self-healing cluster,
service discovery and integrated load balancing,
easier scaling (HPA) in comparison with VMs,
storage orchestration: Kubernetes allows you to automatically mount a storage system of your choice, such as local storage, public cloud providers, and many more, including the Dynamic Volume Provisioning mechanism.
All the above points apply to any other workload and may be seen as Kubernetes advantages in general, so let's look at why to use it for implementing the Elastic Stack:
It looks like Elastic is actively promoting the use of Kubernetes on their website. See also this article.
They also provide an official Elasticsearch Helm chart, so it is already quite well supported by Elastic.
There are probably many other reasons in favour of the Kubernetes solution that I didn't mention here. Here you can find a hands-on article about setting up a highly available and scalable Elasticsearch on Kubernetes.

How to monitor Hadoop cluster with ELK

I'm looking into the possibilities of monitoring a Hadoop cluster with the ELK/EFK stack. I have searched publicly available sources but couldn't find anything relevant.
Any help in this regard will be highly appreciated.
It's not clear what you're trying to monitor.
Nearly everything in Hadoop is a Java process, so adding a JMX exporter such as the Prometheus JMX exporter or Jolokia would expose metrics over HTTP, and from there you would have to periodically poll those into Elasticsearch.
To enable JMX, you'd have to edit the hadoop-env.sh scripts, I believe, for YARN and HDFS, to control any JVM options. Hive, Spark, HBase, etc. all have similar scripts.
General example here on Jolokia: https://www.elastic.co/blog/monitoring-java-applications-with-metricbeat-and-jolokia
Other than that, Filebeat and Metricbeat operate the same as on any other system.
If you used Cloudera Manager or Ambari to control your cluster, then monitoring would be provided for you by those tools.
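
As a rough illustration of the polling step described above, the sketch below turns one Jolokia "read" response into a flat document that could then be indexed into Elasticsearch. The response shape (request, value, timestamp in epoch seconds, status) follows the Jolokia protocol. The function name, the host label and the sample values are made up for the example, and the actual HTTP polling and indexing calls are left out.

```python
from datetime import datetime, timezone


def jolokia_to_document(response, host):
    """Convert one Jolokia 'read' response into a flat document for indexing."""
    if response.get("status") != 200:
        raise ValueError(f"Jolokia request failed: {response.get('error')}")
    request = response["request"]
    doc = {
        # Jolokia reports the timestamp in seconds since the epoch.
        "@timestamp": datetime.fromtimestamp(
            response["timestamp"], tz=timezone.utc
        ).isoformat(),
        "host": host,
        "mbean": request["mbean"],
    }
    attribute = request.get("attribute", "value")
    value = response["value"]
    if isinstance(value, dict):
        # Composite attributes (e.g. heap usage) become one field per key.
        for key, item in value.items():
            doc[f"{attribute}_{key}"] = item
    else:
        doc[attribute] = value
    return doc


# Made-up sample in the shape of a Jolokia read response.
sample = {
    "request": {
        "mbean": "java.lang:type=Memory",
        "attribute": "HeapMemoryUsage",
        "type": "read",
    },
    "value": {"init": 100, "committed": 200, "max": 400, "used": 123},
    "timestamp": 1285442161,
    "status": 200,
}
doc = jolokia_to_document(sample, host="namenode-1")
```

Metricbeat's Jolokia module, covered in the blog post linked above, does this kind of polling for you, so a hand-rolled script like this mostly makes sense for one-off checks.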

Ambari Hadoop/Spark and Elasticsearch SSL Integration

I have a Hadoop/Spark cluster set up via Ambari (HDP 2.6.2.0). Now that I have my cluster running, I want to feed some data into it. We have an Elasticsearch cluster on premise (version 5.6). I want to set up the ES-Hadoop Connector (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/doc-sections.html) that Elastic provides so I can dump some data from Elastic to HDFS.
I grabbed the ZIP file with the JARs and followed the directions in a blog post at CERN:
https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying
So far, this seems reasonable, but I have some questions:
We have SSL/TLS set up on our Elasticsearch cluster, so when I perform a query, I obviously get an error using the example on the blog. What do I need to do on my Hadoop/Spark side and on the Elastic side to make this communication work?
I read that I need to add those JARs to the Spark classpath. Is there a rule of thumb as to where I should put them on my cluster? I assume on one of my Spark client nodes, but I am not sure. Also, once I put them there, is there a way to add them to the classpath so that all of my nodes / client nodes have the same classpath? Maybe something in Ambari provides that?
Basically, what I am looking for is to be able to perform a query to ES from Spark that triggers a job that tells ES to push "X" amount of data to my HDFS. Based on what I can read on the Elastic site, this is how I think it should work, but I am really confused by the documentation. It's lacking and has confused both me and my Elastic team. Can someone provide some clear directions or some clarity around what I need to do to set this up?
For the project setup part of the question you can take a look at
https://github.com/zouzias/elasticsearch-spark-example
which is a project template integrating Elasticsearch with Spark.
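
For the SSL/TLS part of the question, here is a minimal sketch of the connector options that are usually involved. The es.net.ssl.* and es.net.http.auth.* keys are documented elasticsearch-hadoop settings. The host name, truststore path, credentials, index name and HDFS path are all placeholders, and the helper function is only for illustration. It assumes the truststore contains the CA that signed the Elasticsearch certificates and is readable at the same path on every node that runs Spark tasks.

```python
def es_hadoop_tls_options(nodes, truststore, truststore_pass,
                          user=None, password=None, port=9200):
    """Build elasticsearch-hadoop options for an SSL/TLS-protected cluster."""
    opts = {
        "es.nodes": nodes,
        "es.port": str(port),
        # Switch the connector's REST client to HTTPS.
        "es.net.ssl": "true",
        # Location is a URL or classpath entry, e.g. file:///path/to/store.jks
        "es.net.ssl.truststore.location": truststore,
        "es.net.ssl.truststore.pass": truststore_pass,
    }
    if user is not None:
        # HTTP basic authentication, only if the cluster requires it.
        opts["es.net.http.auth.user"] = user
        opts["es.net.http.auth.pass"] = password
    return opts


# Placeholder values, not taken from the question.
opts = es_hadoop_tls_options(
    "es.example.internal",
    "file:///etc/security/es-truststore.jks",
    "changeit",
    user="spark_reader",
    password="secret",
)

# With PySpark, a read from ES followed by a write to HDFS could look like this
# (not executed here; requires the connector jar, e.g. passed via
# `spark-submit --jars <path to the elasticsearch-spark jar>`):
#   df = spark.read.format("org.elasticsearch.spark.sql").options(**opts).load("my-index")
#   df.write.parquet("hdfs:///data/es-export/my-index")
```

Passing the jar with --jars ships it to the executors for that job, which avoids having to copy it onto every node by hand.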

How to setup an elasticsearch cluster

I am trying to set up a multi-node Elasticsearch cluster. Is there any useful link I can follow to set up the cluster?
I am trying to run a MapReduce program on the cluster to find exact matches.
From my experience, if you just run the executable in two or more machines connected via a network, elasticsearch will somehow figure it out and all nodes will be added to the same cluster. I don't think you have to do anything.
This is the tutorial I've used: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup.html
Here you have a step by step guide on how to setup an EMR cluster with Elasticsearch and Kibana installed using the bootstrap actions mentioned before.
http://blogs.aws.amazon.com/bigdata/post/Tx1E8WC98K4TB7T/Getting-Started-with-Elasticsearch-and-Kibana-on-Amazon-EMR
The article also provides basic Elasticsearch tests on the installed cluster.
The bootstrap actions also provide the Elasticsearch-Hadoop plugin that will allow you to run Mapreduce or other Hadoop applications.
The latest version of the Elasticsearch bootstrap actions is available here:
https://github.com/awslabs/emr-bootstrap-actions/tree/master/elasticsearch
The only requirement for clustering two Elasticsearch nodes is an identical cluster name on both nodes. You can find the cluster name in the elasticsearch.yml file (in the config folder of Elasticsearch). The default cluster name is elasticsearch.
To change the name, edit the property in elasticsearch.yml:
cluster.name: "custom cluster name"
Elasticsearch uses zen discovery to find the nodes in the cluster during startup. If the cluster name is identical, Elasticsearch will automatically form the cluster.
Check out this link. You need to install the Amazon PowerShell tools, but replace the variables in the script with what you want and it should launch an EMR cluster with Elasticsearch.
https://github.com/awslabs/emr-bootstrap-actions/tree/master/elasticsearch
You can use Kubernetes to create a cluster of Elasticsearch nodes running inside Docker containers.
Take a look at
https://github.com/kubernetes/kubernetes/tree/master/examples/elasticsearch

Sharing elasticsearch between Logstash/graylog2 and my own application

Would it be safe to share an Elasticsearch cluster (or a single-node Elasticsearch cluster) between Logstash or Graylog2 and my own application? What configuration changes/additions should be made to accommodate that? What kind of namespacing would the application require to store its own data separately from Graylog/Logstash?
I'd rather avoid maintaining separate clusters, especially on dev boxes but also in general - if the architecture allows.
It is technically possible but not recommended. You will experience load on the logging cluster that you want to decouple from the other applications using ES.
Graylog2 supports defining an index prefix for having multiple setups running in one ES cluster.
We have both (Kibana and Graylog) running with a shared Elasticsearch. The only differences are that the index patterns are different and that we had to add circuit breakers in Elasticsearch so that Kibana search queries for logs would not grow beyond a certain size.
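
One simple way to do the namespacing mentioned in the answers above is to give the application its own index prefix and refuse names that would collide with the logging tools. This is only a sketch: the reserved prefixes below are assumptions and should be replaced with whatever your Graylog and Logstash instances are actually configured to use, and app_index is a made-up helper.

```python
# Assumed prefixes used by the logging tools; check your own configuration.
RESERVED_PREFIXES = ("graylog", "logstash-")


def app_index(app_prefix, name):
    """Return a namespaced index name for the application's own data."""
    # Elasticsearch index names must be lowercase.
    index = f"{app_prefix}-{name}".lower()
    if index.startswith(RESERVED_PREFIXES):
        raise ValueError(f"index {index!r} collides with a logging index prefix")
    return index


orders_index = app_index("myapp", "Orders")
```

With a dedicated prefix, the application's indices can also be matched as a group (myapp-*) for index templates or cleanup without touching the logging indices.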
