Combining Hadoop and Elasticsearch - hadoop

I'm trying to find an easy way to integrate my Hadoop cluster (I'm using Cloudera Manager, CDH 5.3) with Elasticsearch.
From what I've found, I have to install Elasticsearch in a separate cluster and upload the Elasticsearch-Hadoop library (I'll be using Hive and Pig) to connect ES with my cluster.
Well, it sounds simple. But hey, I'm no CentOS 6.6 expert, and I have no clue how that can be done:
How do I import the required JARs via the CentOS 6.6 command line, and where do I put them?
How do I make my cluster see Elasticsearch?
How do I connect the two clusters (Hadoop and ES) using the imported JARs?
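For orientation, the Hive side of the integration boils down to two statements: adding the es-hadoop JAR to the session and declaring an external table backed by an ES index. The sketch below builds those HiveQL statements as strings for illustration; the jar path, host name, and index name are assumptions, not values from the question.

```python
# Sketch: the HiveQL needed to expose an Elasticsearch index as a Hive table
# via elasticsearch-hadoop. Jar path, host, and index name are placeholders.
ES_NODES = "es-master-01"  # hostname of an ES node (assumption)
JAR_PATH = "/usr/lib/hive/lib/elasticsearch-hadoop.jar"  # where you copied the jar

add_jar = f"ADD JAR {JAR_PATH};"

create_table = f"""
CREATE EXTERNAL TABLE logs_es (ts STRING, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'logs/entry',
  'es.nodes'    = '{ES_NODES}'
);
""".strip()

print(add_jar)
print(create_table)
```

With a table like this in place, a plain `INSERT OVERWRITE TABLE logs_es SELECT ...` writes Hive data into the index; Pig uses the analogous `org.elasticsearch.hadoop.pig.EsStorage` function.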

Related

How to import data from HDFS (Hadoop) into ElasticSearch?

We have a big Hadoop cluster and recently installed Elastic Search for evaluation.
Now we want to bring data from HDFS to ElasticSearch.
Elasticsearch is installed in a different cluster, and so far we could run a Beeline or HDFS script to extract data from Hadoop into a file and then bulk-load it into Elasticsearch from that local file.
Wondering if there is a direct connection from HDFS to ElasticSearch.
I start reading about it here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
But since our team is not DevOps (we neither configure nor manage the Hadoop cluster) and can only access Hadoop via Kerberos/user/password, I'm wondering whether this is possible to configure (and how) without involving the whole DevOps team that manages the Hadoop cluster to install and set up all these libraries before a direct connection.
How would I do it from the client side?
Thanks.
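If installing es-hadoop cluster-side is off the table, the client-side fallback the question already describes can be automated: extract rows with Beeline, then push them through Elasticsearch's plain REST `_bulk` API, which needs no Hadoop-side libraries at all. Below is a minimal sketch of building the NDJSON bulk body from extracted rows; the index name and field names are made up for illustration, and the actual HTTP POST to `http://<es-host>:9200/_bulk` is left out.

```python
import json

def to_bulk_payload(records, index):
    """Build an Elasticsearch _bulk NDJSON body: one action line plus
    one document line per record, terminated by a trailing newline."""
    lines = []
    for doc in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Example rows as they might come out of a Beeline extract (made up):
rows = [{"user": "alice", "clicks": 3}, {"user": "bob", "clicks": 7}]
payload = to_bulk_payload(rows, "hdfs-export")
print(payload)
```

This keeps everything on the client, at the cost of staging the data locally; for a direct HDFS-to-ES path, the es-hadoop library still has to be on the classpath of a job running inside the cluster.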

Ambari Hadoop/Spark and Elasticsearch SSL Integration

I have a Hadoop/Spark cluster set up via Ambari (HDP 2.6.2.0). Now that I have my cluster running, I want to feed some data into it. We have an Elasticsearch cluster on premise (version 5.6). I want to set up the ES-Hadoop Connector (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/doc-sections.html) that Elastic provides so I can dump some data from Elastic to HDFS.
I grabbed the ZIP file with the JARS and followed the directions on a blog post at CERN:
https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying
So far, this seems reasonable, but I have some questions:
We have SSL/TLS set up on our Elasticsearch cluster, so when I perform a query I obviously get an error using the example on the blog. What do I need to do on my Hadoop/Spark side and on the Elastic side to make this communication work?
I read that I need to add those JARs to the Spark classpath. Is there a rule of thumb as to where I should put them on my cluster? I assume one of my Spark client nodes, but I am not sure. Also, once I put them there, is there a way to add them to the classpath so that all of my nodes / client nodes have the same classpath? Maybe something in Ambari provides that?
Basically, what I am looking for is to be able to perform a query to ES from Spark that triggers a job telling ES to push "X" amount of data to my HDFS. Based on what I can read on the Elastic site, this is how I think it should work, but I am really confused by the documentation. It's lacking and has confused both me and my Elastic team. Can someone provide some clear directions or some clarity around what I need to do to set this up?
For the project setup part of the question, you can take a look at
https://github.com/zouzias/elasticsearch-spark-example
which is a project template integrating Elasticsearch with Spark.
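On the SSL question: es-hadoop reads its TLS settings from the same `es.*` configuration namespace as everything else, so the usual approach is to point it at a JKS truststore containing the ES cluster's CA certificate. Below is a sketch of the relevant settings, expressed as a plain dict as they would be passed to SparkConf or spark-defaults.conf. All hostnames, paths, and passwords are placeholders; check the es-hadoop Configuration reference for your version, since option names can differ slightly across releases.

```python
# es-hadoop connection settings for a TLS-enabled Elasticsearch cluster.
# These would typically be passed to SparkConf or put in spark-defaults.conf.
# All values below are placeholders, not real credentials.
es_conf = {
    "es.nodes": "es.example.internal",        # ES host (assumption)
    "es.port": "9200",
    "es.net.ssl": "true",                     # enable TLS
    "es.net.ssl.truststore.location": "file:///etc/spark/es-truststore.jks",
    "es.net.ssl.truststore.pass": "changeit",
    "es.net.http.auth.user": "spark_reader",  # if ES authentication is enabled
    "es.net.http.auth.pass": "secret",
}

for key, value in es_conf.items():
    print(f"{key}={value}")
```

Note that the truststore file has to be readable at the same path on every node that runs executors, which is also why distributing the JARs and keystores consistently (e.g. via Ambari-managed config groups) matters.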

How to integrate Cassandra with Hadoop

I am trying to set up clustered Hadoop and Cassandra. Many sites I've read use a lot of words and concepts I am slowly grasping but I still need some help.
I have 3 nodes. I want to set up Hadoop and Cassandra on all 3. I am familiar with Hadoop and Cassandra individually, but how do they work together and how do I configure them to work together? Also, how do I set up one node dedicated to, for example, analytics?
So far I have modified my hadoop-env.sh to point to the Cassandra libs. I have put this on all of my nodes. Is that correct? What more do I need to do, and how do I run it: do I start the Hadoop cluster first, or Cassandra?
Last little question: do I connect directly to Cassandra or to Hadoop from within my Java client?
Rather than connecting them via your Java client, you need to install Cassandra on top of Hadoop. Please follow the article for step-by-step assistance.
BR

Elasticsearch Hadoop

I have set up a Hadoop cluster with 3 DataNodes and 1 NameNode. I have also installed Elasticsearch on one of the DataNodes, but I'm not able to access HDFS using Elasticsearch (the Hadoop cluster and Elasticsearch are working fine independently). Now I want to integrate my Hadoop cluster with Elasticsearch. I found there is a separate plugin for that, but I'm not able to download it (the bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/1.3.0.M3 command is not working; it fails every time I execute it). Can anyone suggest which plugin I should download, the path to place that plugin in, and how to access it using the URL?
Thanks in advance
I suggest you try this repo.
It provides Elasticsearch real-time search and analytics natively integrated with Hadoop, and you can follow the documentation provided here to use it.
The repo is provided by Elasticsearch.
Try this:
1) Download the JARs from this link.
2) Unzip the archive and place the JARs in the plugins folder of Elasticsearch.
3) Restart the server and start using it!
The Elasticsearch-Hadoop library is not a plugin. You need to download or build it and put it on the classpath of the Hadoop/Spark application you will use.
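Since the library is a plain JAR rather than an ES plugin, the simplest way to get it onto an application's classpath is to hand it to the launcher, e.g. `spark-submit --jars`. A small sketch assembling such an invocation as a command list; the jar path and script name are placeholders.

```python
# Sketch: putting the elasticsearch-hadoop jar on a Spark job's classpath
# via --jars. The jar path and script name are placeholders.
ES_HADOOP_JAR = "/opt/jars/elasticsearch-hadoop-5.6.0.jar"

cmd = [
    "spark-submit",
    "--jars", ES_HADOOP_JAR,  # ships the jar to the driver and executors
    "my_es_job.py",
]
print(" ".join(cmd))
```

For Hive the equivalent is `ADD JAR` in the session (or `hive.aux.jars.path`); either way the jar travels with the job, so nothing needs to be installed inside Elasticsearch itself.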

Running mahout using hadoop on Amazon's EMR/EC2

I want to migrate my current local Hadoop cluster to Amazon. In this Hadoop cluster I am using services like Mahout, HBase, and Hive. I now have two options on Amazon: either go for pure EC2 instances or an Elastic MapReduce (EMR) cluster. I would like some suggestions on which is the better option for moving a cluster with these kinds of requirements.
I always suggest going with EMR: it is managed and a bit more costly than pure EC2, but the headache and time you would spend configuring and then managing the clusters yourself can be saved by running a managed service like EMR.
Mahout can easily be run as a custom JAR.
Hive cluster can also be launched within minutes.
Similarly for HBase, Amazon has recently added support for creating an HBase cluster on EMR.
See other views here.
