Connect Tableau with Elasticsearch via Spark SQL

I found a post that discusses connecting Tableau to Elasticsearch via Hive SQL. I was wondering if there is a way to connect to Elasticsearch via Spark SQL instead, as I am not very familiar with Hive.
Thanks.

@busybug91,
The right driver is here; please try this one. It could solve your issue.

@NicholasY I got it resolved after a couple of tries. Two steps that I took:
I wasn't using the right driver for the connection. I was using the DataStax Enterprise driver; however, they have a driver for Spark SQL as well, and I used the Windows 64-bit version of it. Using the MapR Hadoop Hive and Hortonworks Hadoop Hive drivers didn't work, as I have Apache Hive.
Once I used the right driver (from DataStax), I realized that my Hive metastore and Spark thrift server were running on the same port. I changed the Spark thrift server's port to 10001 and a successful connection was established.
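For reference, the port can be overridden when launching the Spark thrift server, something like this (a sketch; run from spark/sbin):
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001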
A new problem: I've created an external table in Hive and I am able to query the data. I start the Hive metastore as a service. However, as mentioned in this link, I am not able to see my Hive tables in Spark SQL. My connection from Tableau to Spark SQL is of no use unless I can see the tables from the Hive metastore! When I run show tables; in Spark SQL (via the spark-sql shell, with the Hive metastore running as a service at the same time), it runs a job that reports a completion time but no table names. Monitoring it via localhost:4040, I see that the input and output sizes are 0.0. I believe Spark SQL is not picking up the tables from Hive, which is why I don't see any tables after the connection is established from Tableau to Spark SQL.
EDIT
I changed the metastore from Derby to MySQL for both Hive and Spark SQL.
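For anyone in the same spot, these are the hive-site.xml properties that point the metastore at MySQL (a minimal sketch; the database name, user, and password are placeholders):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
The same file must be visible to both Hive and Spark (spark/conf) so they share one metastore.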

I'm trying to do the same thing, so maybe I can help warn you about a few things.
First, compile a Spark SQL build with Hive and Thrift server support (Hive 0.13):
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
You need a hive-site.xml properly configured to work with Hive, and you have to copy it to the spark/conf folder.
Then, you have to add the elasticsearch-hadoop jar path to $CLASSPATH.
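For example (assuming the jar lives where the launch command below expects it):
export CLASSPATH=$CLASSPATH:/root/spark-sql/spark-1.2.0/lib/elasticsearch-hadoop-2.1.0.Beta4.jar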
Be careful! Spark SQL 1.2.0 does not work with elasticsearch-hadoop-2.0.x; you have to use elasticsearch-hadoop-2.1.0.Beta4 or a BUILD-SNAPSHOT, available here.
To finish, you have to run the thrift server with something like this:
./start-thriftserver.sh --master spark://master:7077 --driver-class-path $CLASSPATH --jars /root/spark-sql/spark-1.2.0/lib/elasticsearch-hadoop-2.1.0.Beta4.jar --hiveconf hive.server2.thrift.bind.host=0.0.0.0 --hiveconf hive.server2.thrift.port=10000
It works for me, but only on a small docType (5000 rows); the data colocation does not seem to be working. I'm looking for a way to put elasticsearch-hadoop.jar on each Spark worker, as ryrobes did for Hadoop.
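One option I'm considering (just a sketch, assuming the jar has been copied to the same path on every worker) is adding it to the executors' classpath via spark/conf/spark-defaults.conf:
spark.executor.extraClassPath /opt/jars/elasticsearch-hadoop-2.1.0.Beta4.jar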
If you find a way to get data-local access to Elasticsearch, let me know ;)
HTH,

Related

Connecting Apache Superset with Hive

I have my Hadoop cluster running in an AWS environment, where the schema is mapped with Hive, and I can see the complete data in Hive.
Now, here is the problem: I am trying to connect Hive to Superset, but I cannot connect.
This is how I have provided my URI:
jdbc+hive://MYIP:PORT
Also tried:
hive://username:password@MYIP:PORT
Make sure HiveServer2 is up and running.
You can also try this one:
hive://hostname:10000/default?auth=NOSASL
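If that still fails, it can help to confirm that HiveServer2 is reachable with NOSASL outside of Superset first, e.g. with Beeline (a sketch; hostname is a placeholder):
beeline -u "jdbc:hive2://hostname:10000/default;auth=noSasl" -e "SHOW DATABASES;"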

How to configure HUE to connect to a remote Hive server?

I'm trying to use HUE Beeswax to connect to my company's Hive database. First, is it possible to use HUE installed on my Mac to connect to a remote Hive server? If so, how am I supposed to find the address of the Hive server running on our private server? The only thing I can do is type 'hive' and run some SQL queries in the Hive shell. I have already installed HUE but can't figure out how to connect it to the remote Hive server. Any tips would be much appreciated.
If all you want is a desktop connection to Hive, you only need a JDBC client, not a full web app like Hue.
In any case, the Hive CLI is deprecated and Beeline is preferred. To use Beeline or Hue, you need HiveServer2 running.
To find the address of HiveServer2, if you have one, you need to find your hive-site.xml file on the Hadoop cluster and export it. Other ways to get this information are Ambari or Cloudera Manager (but if you're on a Cloudera CDH cluster, you already have Hue). The Thrift interface is what you want; the default port is 10000.
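If you have shell access to the cluster, a quick sanity check (a sketch, assuming default settings) is:
hive --service hiveserver2 &
netstat -ln | grep 10000    # HiveServer2's Thrift port should be listening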
When you set up Hue, you will need to find the hue.ini file and edit the section that starts with [beeswax], filling in the necessary values. Personally, I find that section fairly straightforward.
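As a sketch, the relevant part of hue.ini looks roughly like this (the host name is a placeholder):
[beeswax]
  # Host where HiveServer2 is running
  hive_server_host=your-hiveserver2-host
  # Thrift port of HiveServer2
  hive_server_port=10000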
You can read the Hue GitHub page to find the requirements for running it on a Mac.

Connect Apache Zeppelin to Hive

I am trying to connect my Apache Zeppelin to my Hive metastore. I use Zeppelin 0.7.3, so there is no Hive interpreter, only JDBC. I have copied my hive-site.xml to Zeppelin's conf folder, but I don't know how to create a new Hive interpreter.
I also tried to access Hive tables through Spark's HiveContext, but when I try it that way I cannot see my Hive databases; only a default database is shown.
Can someone explain either how to create a Hive interpreter or how to access my Hive metastore through Spark correctly?
Any answer is appreciated.
I solved it by following this documentation. After adding these parameters to the JDBC interpreter, you should be able to run the Hive interpreter with
%jdbc(hive)
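For reference, the parameters in question are the hive-prefixed properties of the JDBC interpreter, roughly like this (host and credentials are placeholders):
hive.driver    org.apache.hive.jdbc.HiveDriver
hive.url       jdbc:hive2://your-hive-host:10000
hive.user      hiveUser
hive.password  hivePassword
The Hive JDBC driver also has to be added as an artifact in the interpreter's dependencies.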
In my case it was a little trickier, because I use Cloudera Hadoop and the standard JDBC Hive connector was not working. So I changed the external hive-jdbc.jar to the one suitable for my CDH version (for CDH 5.9, for example, it is located here).
I also found out that you can change hive.url to the Impala port and connect to Impala over JDBC instead, if you prefer.

Is it possible to integrate Apache Spark with Jasper using Spark's JDBC driver?

We want to use Apache Spark for real-time analytics. We currently use Hive/MR for data crunching, MySQL to store the aggregated results, and Jasper Reports for analytics. This approach is far from ideal because of scalability issues with MySQL. We are in the process of exploring Apache Spark running on top of HDFS or Cassandra; the only problem is whether there is a way for Spark to integrate with JasperReports Server. If not, what are the other UI options to use with Spark?
I figured out the answer and thought of sharing it: if you use a Hive metastore with Spark, you can persist RDDs as Hive tables. Once you have done that, any client that speaks the hive2 JDBC protocol can run Hive or SQL-like queries using Spark's execution engine.
These are the steps:
1) Configure Spark to use MySQL as the metastore database.
2) Copy hive-site.xml into Spark's conf directory, pointing to the MySQL database.
3) Start the thrift service. You can do this using $SPARK_HOME/sbin/start-thriftserver.sh; if it starts successfully, it listens on port 10000.
4) Test this using a client like Beeline, which is under the $SPARK_HOME/bin directory.
5) From Beeline, use this URL: !connect jdbc:hive2://localhost:10000 (no username or password); a full session is sketched after this list.
6) Run any Hive CREATE or SELECT query.
7) If it runs, congrats!! Use the same URL as above from Jasper (!connect jdbc:hive2://localhost:10000, replacing localhost with the IP), via the hive2 JDBC driver.
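A rough illustration of steps 4 to 6 (the table name is made up, and the thrift server is assumed to be on localhost:10000):
$SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> CREATE TABLE test_spark (id INT, name STRING);
0: jdbc:hive2://localhost:10000> SELECT * FROM test_spark;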

How to connect to Hive via CLI on Cloudera

We are running CDH 4.1.1. From HUE/Beeswax, Hive is running fine and /beeswax/tables shows all tables.
I want to use the hive CLI to list all tables:
overlord#overlord-datanode1:~$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/overlord/hive_job_log_overlord_201211280646_1426149164.txt
hive> SHOW TABLES;
OK
Time taken: 0.071 seconds
This appears to be empty, which leads me to believe that I may be connecting to the wrong Hive metastore?
How can I access the same hive data as from HUE/beeswax?
One possible reason is that the Hive CLI and Beeswax are using two different users (with different privileges), so when you switch users a new metastore is created automatically (if it does not already exist).
If you are using Derby as your metastore, I would suggest migrating it to MySQL or PostgreSQL, as Derby is not suitable for production.
To migrate, follow these guides:
http://www.mazsoft.com/blog/post/2010/02/01/Setting-up-HadoopHive-to-use-MySQL-as-metastore.aspx
https://ccp.cloudera.com/display/CDHDOC/Hive+Installation
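The MySQL side of those guides boils down to creating a database and a user for the metastore (a sketch; the database name, user, and password are placeholders):
mysql -u root -p
mysql> CREATE DATABASE metastore;
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivepassword';
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;
Then point the javax.jdo.option.* properties in hive-site.xml at that database, as described in the guides.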
