Can we use NoSQL databases as the Hive metastore? - hadoop

I understand that the Hadoop community recommends using an RDBMS as the Hive metastore. But can we use NoSQL databases like MongoDB or HBase for the Hive metastore?
If not, why not? What are the criteria for choosing a database for the Hive metastore?

If you are using the Cloudera distribution, Cloudera strongly encourages you to use MySQL because it is the most popular choice in the Hive user community and, hence, receives more testing than the other options.
The metastore stores metadata for Hive tables (such as their schema and location) and for partitions in a relational database, so only a relational database can be used.
If you want something NoSQL-flavoured, you can go with MariaDB for the metastore: MariaDB is a relational database (a MySQL fork) that also offers some NoSQL-style features, and the metastore talks to it through its ordinary relational interface.
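The metastore persists this metadata through JDO (DataNucleus) over JDBC, which is why a relational database is required. As a sketch of what the wiring looks like, pointing the metastore at MySQL takes a handful of properties in hive-site.xml; the host, database name, and credentials below are placeholders:

<!-- hive-site.xml (sketch; host, database name, and credentials are placeholders) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>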

Related

Is it possible to use/query data using Pig/Tableau or some other tool from HDFS which was inserted/loaded using a Hive managed table?

Is it possible to use or query data from HDFS using Pig, Drill, Tableau, or some other tool if that data was inserted/loaded via a Hive managed table, or is that only possible for data in HDFS that was loaded via a Hive external table?
Edit 1: Is the data associated with Managed Hive Tables locked to Hive?
Managed and external tables differ only in how the underlying files are managed, not in their visibility to clients. Both can be accessed with Hive clients.
Hive has HiveServer2 (which uses Thrift), a service that lets clients execute queries against Hive; it provides JDBC/ODBC interfaces.
So you can query data in Hive whether it sits in a managed table or an external table.
DBeaver or Tableau can query Hive once connected to HiveServer2.
For Pig, you can use HCatalog:
pig -useHCatalog
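For illustration, here is a minimal sketch of querying a table (managed or external, it makes no difference to the client) through HiveServer2 with the Hive JDBC driver; the host, port, credentials, and table name are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver for HiveServer2.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 usually listens on port 10000; host and credentials depend on your setup.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver-host:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement();
             // 'my_table' is a hypothetical table; it can be managed or external.
             ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

DBeaver and Tableau do essentially the same thing under the hood via the JDBC or ODBC driver.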

Does Hive depend on/require Hadoop?

The Hive installation guide says that Hive can work with an RDBMS, so my question is: it sounds like Hive can exist without Hadoop, right? Is it an independent HQL engine that could work with any data source?
You can run Hive in local mode to use it without a Hadoop cluster, for debugging purposes. See the URL below:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode
Hive provides a JDBC driver so you can query it like any JDBC source; however, if you are planning to run Hive queries on a production system, you need Hadoop infrastructure to be available. Hive queries are eventually converted into MapReduce jobs, and HDFS is used as the data storage for Hive tables.

What is Hive best suited for

I need daily snapshots from all of the enterprise's databases and to update Hive with them.
If that is the best approach, how do I go about it? I have used Sqoop to manually import data into Hive, but what do I connect PHP to: Hive or Sqoop?
I understand Hive is used for OLAP and not OLTP, but is taking a snapshot once a day something Hive would support nicely, or should I consider other options like HBase?
I am open to other suggestions, considering that the data is structured for the most part.

Using Hive metastore for client application performance

I am new to Hadoop. Please help me with the concept below.
It is considered good practice to keep the Hive metastore in an external database (such as MySQL) for production use.
What is the exact role of, and need for, storing metadata in an RDBMS?
If we create a client application to show Hive data in a UI, will this metadata store help improve the performance of fetching data?
If yes, what would the architecture of such a client application look like? Would it hit the RDBMS metastore first? How would that differ from querying Hive directly in some other way, such as using Thrift?
Hadoop experts, please help.
Thanks
You can use prestodb, which allows you to run/translate SQL queries against Hive. It has a MySQL connector that you can use to exploit your stored Hive schema.
Thus, from your client application, you just need a JDBC driver, as with any other RDBMS.
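To make the metastore's role concrete: it is the catalog that lets Hive (and engines such as Presto) look up a table's schema, storage location, and partitions without scanning any data; the engine then reads only the relevant files from HDFS. Below is a minimal sketch of reading that metadata directly through the metastore's Thrift service; the metastore URI, database, and table names are assumptions:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreLookupExample {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // Hypothetical metastore Thrift endpoint; 9083 is the usual default port.
        conf.set("hive.metastore.uris", "thrift://metastore-host:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            // Fetch only metadata: schema and HDFS location; no table data is read.
            Table table = client.getTable("default", "my_table"); // hypothetical table
            System.out.println("Location: " + table.getSd().getLocation());
            for (FieldSchema col : table.getSd().getCols()) {
                System.out.println(col.getName() + " : " + col.getType());
            }
        } finally {
            client.close();
        }
    }
}

A client UI would typically not hit the metastore (or the RDBMS behind it) directly; it would submit queries through HiveServer2 or Presto via JDBC, and those engines consult the metastore on its behalf. The metastore speeds up query planning (finding schemas, partitions, and file locations), but the actual data still comes from HDFS.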

Is there a way to use JDBC as an input source for Hadoop's MapReduce?

I have data in a PostgreSQL DB and I'd like to fetch it, process it, and save it to an HBase DB. Is it possible to somehow distribute the JDBC operation across map tasks?
Yes, you can do that with DBInputFormat:
DBInputFormat uses JDBC to connect to data sources. Because JDBC is widely implemented, DBInputFormat can work with MySQL, PostgreSQL, and several other database systems. Individual database vendors provide JDBC drivers to allow third-party applications (like Hadoop) to connect to their databases.
The DBInputFormat is an InputFormat class that allows you to read data from a database. An InputFormat is Hadoop’s formalization of a data source; it can mean files formatted in a particular way, data read from a database, etc. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database.
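A rough sketch of a job wired up with DBInputFormat might look like the following; the JDBC URL, credentials, table name ("users"), and column names are assumptions, and the map phase simply re-emits each row as text:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PostgresImportJob {

    // One row of the hypothetical "users" table, readable by DBInputFormat.
    public static class UserRecord implements Writable, DBWritable {
        long id;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            name = rs.getString("name");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, name);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            name = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(name);
        }
    }

    public static class UserMapper
            extends Mapper<LongWritable, UserRecord, Text, NullWritable> {
        protected void map(LongWritable key, UserRecord row, Context ctx)
                throws IOException, InterruptedException {
            // Process the row here; this sketch just re-emits it as text.
            ctx.write(new Text(row.id + "\t" + row.name), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JDBC connection details are placeholders.
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://dbhost:5432/mydb", "dbuser", "dbpassword");

        Job job = Job.getInstance(conf, "postgres-import");
        job.setJarByClass(PostgresImportJob.class);
        job.setMapperClass(UserMapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(DBInputFormat.class);
        // Table "users" with columns "id" and "name" is hypothetical.
        DBInputFormat.setInput(job, UserRecord.class, "users", null, "id", "id", "name");

        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/users_import"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

To write the processed rows into HBase instead of text files, the same job could swap the output format for HBase's TableOutputFormat.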
I think you're looking for Sqoop, which is designed to import data from SQL databases into the HDFS stack. It puts the data it gets over the JDBC connection into HDFS, splitting it across your Hadoop DataNodes. I believe this is what you are looking for.
SQl to hadOOP = SQOOP, get it?
Sqoop can also import directly into HBase.
