Ambari Hadoop/Spark and Elasticsearch SSL Integration - hadoop

I have a Hadoop/Spark cluster set up via Ambari (HDP 2.6.2.0). Now that the cluster is running, I want to feed some data into it. We have an on-premises Elasticsearch cluster (version 5.6). I want to set up the ES-Hadoop Connector (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/doc-sections.html) that Elastic provides so I can dump some data from Elasticsearch to HDFS.
I grabbed the ZIP file with the JARS and followed the directions on a blog post at CERN:
https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying
So far, this seems reasonable, but I have some questions:
We have SSL/TLS set up on our Elasticsearch cluster, so when I run the query from the blog example, I get an error. What do I need to do on the Hadoop/Spark side and on the Elasticsearch side to make this communication work?
I read that I need to add those JARs to the Spark classpath. Is there a rule of thumb as to where I should put them on my cluster? I assume on one of my Spark client nodes, but I am not sure. Also, once I put them there, is there a way to add them to the classpath so that all of my nodes and client nodes have the same classpath? Maybe Ambari provides something for that?
Basically, what I am looking for is to be able to perform a query against ES from Spark that triggers a job that tells ES to push "X" amount of data to my HDFS. Based on what I can read on the Elastic site, this is how I think it should work, but I am really confused by the documentation. It is lacking and has confused both me and my Elastic team. Can someone provide some clear directions or some clarity around what I need to do to set this up?

For the project setup part of the question, you can take a look at
https://github.com/zouzias/elasticsearch-spark-example
which is a project template integrating Elasticsearch with Spark.
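On the SSL part of the question, here is a minimal sketch, assuming the es.net.ssl.* and es.net.http.auth.* settings from the elasticsearch-hadoop configuration reference apply to your setup. It only builds the settings and the matching spark-submit arguments; the host name, truststore path, passwords, user name and JAR path are placeholders I made up, not values from the question. Passing the connector JAR with --jars is one way to get it onto the driver and executor classpaths without copying it to every node by hand.

```python
# Sketch only: builds elasticsearch-hadoop settings for a TLS-enabled cluster
# and the matching spark-submit arguments. All concrete values passed in by the
# caller (hosts, paths, credentials) are placeholders, not known-good values.

def es_ssl_options(nodes, truststore, truststore_pass, user=None, password=None):
    """Return es-hadoop settings for talking to Elasticsearch over HTTPS."""
    opts = {
        "es.nodes": nodes,
        "es.port": "9200",
        "es.net.ssl": "true",  # use HTTPS instead of plain HTTP
        # Truststore holding the CA that signed the Elasticsearch certificates;
        # it has to be readable wherever the Spark tasks run.
        "es.net.ssl.truststore.location": truststore,
        "es.net.ssl.truststore.pass": truststore_pass,
    }
    if user is not None:
        # Only needed if the cluster also requires HTTP basic authentication.
        opts["es.net.http.auth.user"] = user
        opts["es.net.http.auth.pass"] = password
    return opts


def spark_submit_args(connector_jar, options, script):
    """Turn the settings into a spark-submit argument list."""
    args = ["spark-submit", "--jars", connector_jar]
    for key in sorted(options):
        # es-hadoop picks up its settings from SparkConf when they carry
        # the "spark." prefix.
        args += ["--conf", "spark.{}={}".format(key, options[key])]
    args.append(script)
    return args


if __name__ == "__main__":
    opts = es_ssl_options(
        "es1.example.com",                          # placeholder host
        "file:///etc/security/es-truststore.jks",   # placeholder path
        "changeit",                                 # placeholder password
    )
    print(" ".join(spark_submit_args("/path/to/elasticsearch-spark.jar", opts, "dump_to_hdfs.py")))
```

Note that with this connector Spark pulls the data from Elasticsearch rather than Elasticsearch pushing it. Inside the job, the same options can be passed to spark.read.format("org.elasticsearch.spark.sql").options(**opts).load("index/type"), and the resulting DataFrame can then be written to HDFS.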

Related

Can Kafka be used as a messaging service between oracle and elasticsearch

Can Kafka be used as a messaging service between Oracle and Elasticsearch? Are there any downsides to this approach?
Kafka Connect provides you a JDBC Source and an Elasticsearch Sink.
No downsides that I am aware of, other than service maintenance.
Feel free to use Logstash instead, but Kafka provides better resiliency and scalability.
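As a rough illustration of the Kafka Connect route, the sketch below builds the two JSON payloads you would POST to the Kafka Connect REST API (the /connectors endpoint). It assumes Confluent's JDBC source and Elasticsearch sink connectors are installed on the Connect workers; the connector names, JDBC URL, credentials, table, column and Elasticsearch URL are all made up for the example, and depending on the connector version further properties (for example type.name on older Elasticsearch sinks) may be required.

```python
import json

# Sketch only: builds Kafka Connect connector definitions as plain dicts.
# Every concrete value passed in below is a placeholder.

def jdbc_source_config(name, jdbc_url, user, password, table, id_column):
    """Definition for a JDBC source that polls one table for new rows."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": jdbc_url,
            "connection.user": user,
            "connection.password": password,
            "table.whitelist": table,
            "mode": "incrementing",               # pick up rows with a growing id
            "incrementing.column.name": id_column,
            "topic.prefix": "oracle-",            # topic becomes prefix + table name
        },
    }


def source_topic(source):
    """Topic name the JDBC source above writes to."""
    cfg = source["config"]
    return cfg["topic.prefix"] + cfg["table.whitelist"]


def elasticsearch_sink_config(name, es_url, topic):
    """Definition for an Elasticsearch sink reading from one topic."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "connection.url": es_url,
            "topics": topic,
            "key.ignore": "true",  # let the connector derive document ids
        },
    }


if __name__ == "__main__":
    src = jdbc_source_config(
        "oracle-orders-source",
        "jdbc:oracle:thin:@//db.example.com:1521/ORCL",
        "connect_user", "connect_password", "ORDERS", "ID",
    )
    sink = elasticsearch_sink_config("orders-es-sink", "http://es.example.com:9200", source_topic(src))
    print(json.dumps(src, indent=2))
    print(json.dumps(sink, indent=2))
```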
I have tried this in the past with SQL Server instead of Oracle and it works great. I am sure you could use the same approach with Oracle, since the Logstash JDBC plugin that I describe below supports Oracle DB.
So basically you would need the Logstash JDBC input plugin https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html pointing to your Oracle DB instance, which pushes the rows over to Kafka using the Kafka output plugin https://www.elastic.co/guide/en/logstash/current/plugins-outputs-kafka.html.
To read the contents from Kafka, you would need another Logstash instance (this is the indexer) that uses the Kafka input plugin https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html. Finally, use the Elasticsearch output plugin in the Logstash indexer configuration file to push the events to Elasticsearch.
So the pipeline would look like this:
Oracle -> Logstash Shipper -> Kafka -> Logstash Indexer -> Elasticsearch.
Overall, I think this is a pretty scalable way to push events from your DB to Elasticsearch. As for downsides, at times it can feel like there are one too many components in the pipeline, which can be frustrating, especially when you have failures. So you need to put appropriate controls and monitoring in place at every level to make sure the data aggregation pipeline described above keeps functioning. Give it a try and good luck!

Deploy Elasticsearch for Apache Spark on Kubernetes

I'm wondering if anyone has experience configuring a Kubernetes cluster using the Elasticsearch for Hadoop library. I'm running into issues with node discovery timing out when trying to write from Spark to Elasticsearch. I have Elasticsearch up and running thanks to the elasticsearch-cloud-kubernetes plugin for ES, which handles discovery, but I'm not sure how best to configure elasticsearch-hadoop to be aware of the nodes (pods) within the Kubernetes cluster. I've tried setting spark.es.nodes to an es-client service, but that doesn't seem to work. I'm also aware that I could enable es.nodes.wan.only, but as noted in the documentation, this would severely impact performance, which defeats the purpose of having them running on the same cluster. Any help would be appreciated.
I'm not that well versed in elasticsearch-hadoop, but have you tried pointing elasticsearch-hadoop at your Elasticsearch service instead of at specific nodes? Your master nodes will normally take care of everything in your ES cluster.
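To make the two options from this thread concrete, here is a small sketch that builds the Spark properties for each. It assumes the Elasticsearch service is reachable under the usual Kubernetes DNS form service.namespace.svc.cluster.local; the service name es-client comes from the question, while the namespace "default" and port 9200 are my assumptions. It does not by itself fix the discovery timeout. It only shows which settings differ between the two modes.

```python
# Sketch only: Spark properties for pointing elasticsearch-hadoop at a
# Kubernetes service. Namespace and port are assumptions.

def es_service_options(service, namespace="default", port=9200, wan_only=False):
    """Return spark.es.* properties that use a Kubernetes service as es.nodes."""
    host = "{}.{}.svc.cluster.local".format(service, namespace)
    opts = {
        "spark.es.nodes": host,
        "spark.es.port": str(port),
    }
    if wan_only:
        # Documented to disable node discovery and send all traffic through
        # the declared es.nodes, at the performance cost noted in the question.
        opts["spark.es.nodes.wan.only"] = "true"
    # With wan_only=False the connector keeps discovering nodes, so the
    # addresses the Elasticsearch pods publish must be reachable from Spark.
    return opts


if __name__ == "__main__":
    for key, value in sorted(es_service_options("es-client").items()):
        print("--conf {}={}".format(key, value))
```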

How to connect elasticsearch to apache spark streaming or storm?

We are building a real-time big data tool with open source tools. Our main goal is to supervise and analyze a network by getting logs from a Kafka server in real time. We saw in tutorials that we have to divide our tool into two sections, Analytics and Supervision, as shown below.
For the supervision section we chose Elasticsearch and Logstash.
Regarding the analytics section, my team and I are comparing Apache Spark Streaming and Apache Storm in order to use one of them with Elasticsearch. Although Apache Storm is a true real-time data processing tool and faster than Apache Spark Streaming, it does not provide machine learning libraries the way Apache Spark does. That's why we are leaning towards Apache Spark. The Elastic website indicates that there is a connector, ES-Hadoop, that connects an Elasticsearch database to the Hadoop ecosystem. We can see that in the figure below.
However, we are a little confused by this picture because it shows only Spark SQL and not all the Spark frameworks (MLlib, Spark Streaming, ...). We made some assumptions and came up with two possible final architectures. We only want to know whether they are technically correct and whether we are heading in the wrong direction.
With Apache Spark streaming:
With Apache Storm:
Both of your architectural diagrams are OK. Keep in mind that Spark Streaming will not work in this scenario. ES-Hadoop provides easy-to-use APIs to get data from and put data into Elasticsearch. It also provides methods to load the data into Spark as RDDs, or as DataFrames in the case of Spark SQL. Once the data is in Spark, all the ML libraries can be applied to it for machine learning or analytics. Elasticsearch is not capable of streaming data, so Spark Streaming in the strict sense is not possible. So in the diagram, the arrow that goes to HDFS (optional) and then on to Spark Streaming can be removed, and the arrow just points to HDFS. My concern, however, would be running MLlib algorithms on the data in real time and expecting real-time performance. A typical use case might be to generate the model offline and use the model in real time for analysis.

crawler + elasticsearch integration

I wasn't able to find out how to crawl a website and index the data into Elasticsearch. I managed to do that with the combination Nutch + Solr, and since Nutch should be able to export data directly to Elasticsearch as of version 1.8 (source), I tried to use Nutch again. Nevertheless, I didn't succeed. After trying to invoke
$ bin/nutch elasticindex
I get:
Error: Could not find or load main class elasticindex
I don't insist on using Nutch. I just need the simplest way to crawl websites and index them into Elasticsearch. The problem is that I wasn't able to find any step-by-step tutorial, and I'm quite new to these technologies.
So the question is: what would be the simplest way to integrate a crawler with Elasticsearch? If possible, I would be grateful for a step-by-step solution.
Did you have a look at the River Web plugin? https://github.com/codelibs/elasticsearch-river-web
It provides a good How To section, including creating the required indexes, scheduling (based on Quartz), authentication (basic and NTLM are supported), meta data extraction, ...
Might be worth having a look at the elasticsearch river plugins overview as well: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#river
Since the River plugins have been deprecated, it may be worth having a look at ManifoldCF or Norconex Collectors.
You can evaluate indexing Common Crawl metadata into Elasticsearch using Hadoop:
When working with big volumes of data, Hadoop provides all the power to parallelize the data ingestion.
Here is an example that uses Cascading to index directly into Elasticsearch:
http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch
The process uses a Hadoop cluster (EMR in this example) running the Cascading application, which indexes the JSON metadata directly into Elasticsearch.
Cascading source code is also available to understand how to handle the data ingestion in Elasticsearch.
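For a feel of what "indexing JSON directly into Elasticsearch" means at the wire level, here is a small sketch that formats documents for the Elasticsearch bulk API (newline-delimited JSON: one action line followed by one source line per document). This is not the Cascading code from the linked post, just a plain illustration of the request body; the index name, type name and sample documents are made up, and the doc_type argument is only relevant for older Elasticsearch versions that still expect a type.

```python
import json

# Sketch only: builds a body for Elasticsearch's _bulk endpoint.
# Index name, type name and sample documents are placeholders.

def bulk_body(index, docs, id_field=None, doc_type=None):
    """Return newline-delimited JSON suitable for POSTing to /_bulk."""
    lines = []
    for doc in docs:
        meta = {"_index": index}
        if doc_type is not None:
            meta["_type"] = doc_type          # only for versions that use types
        if id_field is not None and id_field in doc:
            meta["_id"] = doc[id_field]       # reuse a field as the document id
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pages = [
        {"url": "http://example.com/", "title": "Example"},
        {"url": "http://example.com/about", "title": "About"},
    ]
    print(bulk_body("crawl-metadata", pages, id_field="url"), end="")
```

The resulting string would be sent with the Content-Type header application/x-ndjson; batching a few hundred to a few thousand documents per request is a common starting point, but the right size depends on the documents and the cluster.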

How to integrate Cassandra with Hadoop

I am trying to set up clustered Hadoop and Cassandra. Many sites I've read use a lot of words and concepts I am slowly grasping but I still need some help.
I have 3 nodes. I want to set up Hadoop and Cassandra on all 3. I am familiar with Hadoop and Cassandra individually, but how do they work together and how do I configure them to work together? Also, how do I set up one node dedicated to, for example, analytics?
So far I have modified my hadoop-env.sh to point to Cassandra libs. I have put this on all of my nodes. Is that correct? What more do I need to do and how do I run it - start Hadoop cluster or Cassandra first?
Last little question: do I connect directly to Cassandra or to Hadoop from within my Java client?
Rather than connecting them via your Java client, you need to install Cassandra on top of Hadoop. Please follow the article for step-by-step assistance.
BR

Resources