Migrating from one cluster to another - hadoop

I have a source emr cluster with hive metastore as a external mysql ( emr version 3.1 ).
We are planning to upgrade a cluster to 5.11.1.
Does anyone know how to migrate hive and hdfs from one cluster to another with a remote mysql metastore?

None of your data should be on HDFS persistently. Copy any important files to S3.
HIve provides metastore upgrade scripts for all versions.
Use schematool command available under /usr/lib/hive/bin to perform an upgrade migration.
https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool

Related

How to query kerberos enabled hbase using apache drill?

We have a kerberoized hadoop cluster, where HBase is running. Apache drill is running in distributed mode, in another cluster. Now, we need to query the Kerberos enabled HBase from the apache drill, using web UI. The Hadoop cluster is actually running in AWS, the HBase uses s3 as storage. Please help me with steps to achieve successful queries to HBase.
Apache_Drill_version:1.16.0 Hadoop version: 2
Usually, to query HBase, we run kinit with the keytab manually, then get into HBase shell in Hadoop cluster. We wanted to make use of drill, to query in a SQL fashion easily, better readability.

Creating s3 external table in amazon EMR with remote metastore

We started using Amazon EMR recently for one new project (Version emr-5.11.0).We made some architecture changes in the EMR cluster
1) We moved metastore to another Postgres instance instead of default mysql/derby
2) Running metastore service in a different instance (which is not part of amazon EMR cluster) and made the necessary changes in hive-site.xml.
In EMR
stop hive-hcatalog-server
In new instance
hive --service metastore
Everything is working as expected except 's3 external tables' .When I try to create an external s3 table it is giving us an error like below
message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
We tried with s3/s3n/s3a with credentials also for creating the external table .If we are running the metastore service inside the EMR master node and ran the same query .It is working without issues .
Do we need to do any configuration / adding additional libraries in metastore instance this to work ?
Note: The metastore instance has both Apache hadoop and hive latest binaries .We are going with HDFS filesystem .Able to perform all operations other than external s3 tables .Tried everything from beeline and hive CLI

In a hadoop cluster, should hive be installed on all nodes? Install Pig

I am new to Hadoop / Pig and I have just started reading the docs.
There are lots of blogs on installing Hadoop in cluster mode.
I know that Pig runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes.
Should I also install Pig on all the cluster nodes or only on the master node?
You would want to install Hive Metastore and Hive Server on 2 different nodes. By default, hive uses derby database, but most of the people choose to go with MySQL so there will be a MYSQL server daemon also.
So not to confuse you anymore :
Install HiveServer and WebHcat Server on one node
Install Hive Metastore and MySQL server on another node.
This is the best practice. If you have any other doubt you can ask!
I cannot tell if the question is about Hive or Pig, but there's a difference between clients and servers.
For Hive, the master services are the Metastore and HiveServer2. You can install these daemons on the same server to improve network traffic between the metastore and the Hive query compiler. You only need one client to communicate with those masters.
For Pig, it communicates directly to YARN and HDFS (optionally Hive, if you use Hcatalog). Again, it's only a client, so only one hosts needs it.
It is generally preferred to have a dedicated set of machines for Hive and the backing RDBMS for the metastore (Mysql or Postgres being the more popular options)
You also don't need to "install Pig in the cluster". For example, I could grab the Hadoop XML configs and run some Pig code against the YARN cluster from any outside computer after downloading Pig locally (same applies to Spark)

Connect Apache Zeppelin to Hive

I try to connect my apache zeppelin with my hive metastore. I use zeppelin 0.7.3 so there is not a hive interpreter only jdbc. I have copied my hive-site.xml to zeppelin conf folder but I don't know how to create a new hive interpreter.
I also tried to access hive tables through spark's hive context but when I try this way, I can not see my hive databases only a default database is shown.
Can someone explain either how to create a hive interpreter or how to access my hive metastore through spark correctly?
Any answer is appreciated.
I solved it by following this documentation. After adding this parameters in jdbc connector you should be able to run hive interpreter with
%jdbc(hive)
In my case it was a little trickier because I used Cloudera Hadoop so the standard jdbc hive connector was not working. So I changed the external hive-jdbc.jar with the one suitable for my cdh version (for cdh 5.9.- for example it located here).
I also find out that you can change hive.url with the one for impala port and connect with jdbc to impala if you prefer.

Hive Server doesn't see old hdfs tables

I'm having a problem about hive server that I don't understand. I've just set up a hadoop cluster and want to access to it from a hive service. First try I did was running the hive server in one of the cluster machines.
Everything worked nicely but I wanted to move the hive service to another machine outside the hadoop cluster.
So I just started a new machine outside this hadoop cluster. I've just install hive (+ hadoop libraries) and copied the hadoop config from the cluster. When I run the hiveserver almost everything goes ok. I can connect with the hive cli from a different machine to my hiveserver, create new tables in the hive warehouse within the hdfs filesystem in the hadoop cluster, query then and so on.
The thing I don't understand is that hiveserver seems to not recognize old tables which were created in my first try.
Some notes about my config are that all tables are handled by Hive and stored in HDFS. Hive configuration is the default one. I suppose that it has to do with my hive metastore but it couldn't say what.
Thank you!!

Resources