Apache Drill - Slow Queries

I have the following storage plugin set-up in Drill:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://hivemetastore.hostname.com:9083",
    "hive.metastore.sasl.enabled": "false"
  }
}
However, a simple
SELECT * FROM hive.table LIMIT 5;
...
5 rows selected (35.383 seconds)
0: jdbc:drill:>
is taking over 30 seconds to respond. What am I missing, and where should I begin troubleshooting?
The Hive metastore server is on the same host as Drill right now, and there are fewer than 20,000 records in the table.

Only MapR Drill on the MapR sandbox should use the sparse storage plugin config you are using. In the sandbox, things are configured under the covers.
EMBEDDED METASTORE SERVICE
Assuming you are using a Drill installation, not the sandbox, and you're using the embedded metastore service (the default), configProps needs to look something like this (per the docs):
"configProps": {
"hive.metastore.uris": "",
"javax.jdo.option.ConnectionURL": "jdbc:<database>://<host:port>/<metastore database>",
"hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
"fs.default.name": "file:///",
"hive.metastore.sasl.enabled": "false"
}
Remove the "hive.metastore.uris": "thrift://<host>:<port>" entry from your storage plugin config. That setting is for use with a remote Hive metastore service.
The "javax.jdo.option.ConnectionURL" might be a MySQL database. The Hive metastore service provides access to the physical DB like MySQL. MySQL stores the metadata. The "fs.default.name" is the file system location where the data is located.
The embedded metastore configuration is for testing only, not for production use, per the docs. For better performance, configure the remote metastore. Also check that your Hive version is compatible: open source Apache Drill 1.0 supports Hive 0.13, and Drill 1.1 and later support Hive 1.0.
REMOTE METASTORE SERVICE
If you're using the remote metastore, "fs.default.name" should point to the main control node (a NameNode, for example). If you're using MapR Drill, "fs.default.name" should be maprfs:///; the MapR FileClient figures out CLDB locations from mapr-clusters.conf. Start the metastore service, which is installed on top of Hive as a separate package:
hive --service metastore
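A minimal sketch of running it in the background and confirming it is listening (9083 is the default metastore port; the log path is arbitrary):
nohup hive --service metastore > metastore.log 2>&1 &
netstat -lnt | grep 9083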
The remote metastore config should look something like this if you're using open source Apache Drill:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://mfs41.mystore:9083",
    "hive.metastore.sasl.enabled": "false",
    "fs.default.name": "maprfs://10.10.10.41/"
  }
}
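Once the plugin is saved, a quick sanity check from the Drill shell confirms the metastore is reachable (assuming the plugin is still named hive):
0: jdbc:drill:> USE hive;
0: jdbc:drill:> SHOW TABLES;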

Related

Migrating from one cluster to another

I have a source EMR cluster with the Hive metastore in an external MySQL database (EMR version 3.1).
We are planning to upgrade the cluster to 5.11.1.
Does anyone know how to migrate Hive and HDFS from one cluster to another with a remote MySQL metastore?
None of your data should be on HDFS persistently; copy any important files to S3.
Hive provides metastore upgrade scripts for all versions.
Use the schematool command, available under /usr/lib/hive/bin, to perform the upgrade migration.
https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool
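A sketch of the typical sequence, assuming a MySQL-backed metastore (the dbType and from-version are examples; check yours with -info first):
# Show the schema version currently recorded in the metastore DB
/usr/lib/hive/bin/schematool -dbType mysql -info
# Upgrade the schema from that version to the target Hive version
/usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 0.13.0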

How to configure HUE to be connected to remote Hive server?

I'm trying to use Hue's Beeswax app to connect to my company's Hive database. First, is it possible to use Hue installed on my Mac to connect to a remote Hive server? If so, how am I supposed to find the address of the Hive server running on our private server? The only thing I can do now is type 'hive' and run some SQL queries in the Hive shell. I already installed Hue but can't figure out how to connect it to the remote Hive server. Any tips would be much appreciated.
If all you want is a desktop connection to Hive, you only need a JDBC client, not a full web app like Hue.
In any case, the Hive CLI is deprecated; Beeline is preferred. To use either Beeline or Hue, you need a HiveServer2 running.
To find the address of your HiveServer2, if you have one, locate the hive-site.xml file on the Hadoop cluster and copy it out. You can also get this information from Ambari or Cloudera Manager (though if you're using a Cloudera CDH cluster, you already have Hue). The Thrift interface is what you want; the default port is 10000.
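As a quick connectivity test, a Beeline connection to that Thrift port looks like this (hostname and username are placeholders):
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n myuser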
When you set up Hue, you will need to find the hue.ini file and edit the section that starts with [beeswax], filling in the necessary values. Personally, I find that section fairly straightforward.
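A minimal sketch of that section, with placeholder values for the HiveServer2 host:
[beeswax]
  # Host and Thrift port where HiveServer2 is listening
  hive_server_host=hs2-host.example.com
  hive_server_port=10000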
You can read the Hue GitHub page to find the requirements for running it on a Mac.

Connect Apache Zeppelin to Hive

I'm trying to connect Apache Zeppelin to my Hive metastore. I use Zeppelin 0.7.3, so there is no Hive interpreter, only JDBC. I have copied my hive-site.xml to Zeppelin's conf folder, but I don't know how to create a new Hive interpreter.
I also tried to access Hive tables through Spark's HiveContext, but when I try that, I cannot see my Hive databases; only a default database is shown.
Can someone explain either how to create a Hive interpreter or how to access my Hive metastore through Spark correctly?
Any answer is appreciated.
I solved it by following this documentation. After adding these parameters to the JDBC connector, you should be able to run the Hive interpreter with
%jdbc(hive)
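For reference, the Hive entries in the JDBC interpreter settings typically look like this (host and credentials are placeholders; the org.apache.hive:hive-jdbc and org.apache.hadoop:hadoop-common artifacts also have to be added as interpreter dependencies):
hive.driver    org.apache.hive.jdbc.HiveDriver
hive.url       jdbc:hive2://hive-host.example.com:10000
hive.user      hiveuser
hive.password  hivepassword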
In my case it was a little trickier because I use Cloudera Hadoop, so the standard JDBC Hive connector was not working. I replaced the external hive-jdbc.jar with the one suitable for my CDH version (for CDH 5.9, for example, it is located here).
I also found out that you can change hive.url to point at the Impala port and connect to Impala over JDBC instead, if you prefer.

Browsing Hbase data in Hue through Phoenix

I am using CDH 5.4.4 and installed the Phoenix parcel to be able to run SQL on HBase tables. Has anyone tried to browse that data using Hue?
I know that since we can connect to Phoenix over JDBC, there must be a way for Hue to connect to it too.
The current status is that we would need to add HUE-2745, and then it would show up in the DB Query / Notebook apps.
The latest https://phoenix.apache.org/server.html is brand new and JDBC only.
If there were a HiveServer2 Thrift API or an ODBC driver for Phoenix, it would work almost out of the box in the SQL or DB Query apps. Hue could work over JDBC, but that would require a JDBC connector that is GPL-licensed (so it has to be installed separately).
The Hue jira for integrating Phoenix is https://issues.cloudera.org/browse/HUE-2121.
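In the meantime, the Phoenix Query Server can be exercised directly through its thin JDBC client; a sketch, with a placeholder install path and host (8765 is the query server's default port):
/path/to/phoenix/bin/sqlline-thin.py http://pqs-host.example.com:8765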

Connect tableau with Elastic search via Spark SQL

I found a post that discusses connecting Tableau to Elasticsearch via Hive SQL. I was wondering if there is a way to connect to Elasticsearch via Spark SQL instead, as I am not very familiar with Hive.
Thanks.
@busybug91,
The right driver is here; please try this one. It could solve your issue.
@NicholasY I got it resolved after a couple of trials. Two steps that I took:
1. I wasn't using the right driver for the connection. I was using the DataStax Enterprise driver, but they have a driver for Spark SQL as well. I used the Windows 64-bit version of that driver. The MapR Hadoop Hive and Hortonworks Hadoop Hive drivers didn't work, as I have Apache Hive.
2. Once I used the right driver (from DataStax), I realized that my Hive metastore and the Spark Thrift server were running on the same port. I changed the Spark Thrift server's port to 10001, and a successful connection was established.
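For reference, the port can be overridden when launching the Spark Thrift server so it no longer collides with the metastore listener (a sketch; SPARK_HOME is a placeholder):
$SPARK_HOME/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001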
A new problem: I've created an external table in Hive, and I am able to query the data. I start hive-metastore as a service. However, as mentioned in this link, I am not able to see my Hive tables in Spark SQL. My connection from Tableau to Spark SQL is of no use unless I can see the tables from the Hive metastore! When I run show tables; in Spark SQL (via the spark-sql shell, with the Hive metastore running as a service at the same time), it runs a job that reports a completion time but no table names. Monitoring it via localhost:4040, I see that the input and output sizes are 0.0. I believe Spark SQL is not picking up the tables from Hive, which is why I don't see any tables after the connection is established from Tableau to Spark SQL.
EDIT
I changed the metastore from Derby to MySQL for both Hive and Spark SQL.
I'm trying to do the same thing, so maybe I can warn you about a few things.
First, compile a Spark SQL version with Hive and the Thrift server (Hive 0.13):
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
You need a hive-site.xml properly configured to work with Hive, and you need to copy it to the spark/conf folder.
Then, you have to add the elasticsearch-hadoop jar path to $CLASSPATH.
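For example, using the same jar path as in the launch command below:
export CLASSPATH=$CLASSPATH:/root/spark-sql/spark-1.2.0/lib/elasticsearch-hadoop-2.1.0.Beta4.jar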
Be careful! Spark SQL 1.2.0 does not work with elasticsearch-hadoop-2.0.x. You have to use elasticsearch-hadoop-2.1.0.Beta4 or a BUILD-SNAPSHOT available here.
To finish, run the Thrift server with something like this:
./start-thriftserver.sh --master spark://master:7077 --driver-class-path $CLASSPATH --jars /root/spark-sql/spark-1.2.0/lib/elasticsearch-hadoop-2.1.0.Beta4.jar --hiveconf hive.server2.thrift.bind.host 0.0.0.0 --hiveconf hive.server2.thrift.port 10000
It works for me, but only on a small docType (5,000 rows); the data colocation doesn't seem to be working. I am looking for a way to place the elasticsearch-hadoop jar on each Spark worker, as ryrobes did for Hadoop.
If you find a way to get local access to Elasticsearch, let me know ;)
HTH,
