Is it possible to query data in HDFS using Pig, Drill, Tableau, or some other tool when that data was inserted/loaded through a Hive managed table, or is that only possible when the data was inserted/loaded through a Hive external table?
Edit 1: Is the data associated with Managed Hive Tables locked to Hive?
Managed vs. external only determines how Hive manages the files on the file system (for example, whether dropping the table also deletes the data); it has no effect on visibility to clients. Both kinds of table can be accessed with Hive clients.
Hive provides HiveServer2, a Thrift-based service that lets clients execute queries against Hive; it exposes JDBC and ODBC interfaces.
So you can query data in Hive regardless of whether it sits in a managed table or an external table.
DBeaver/Tableau can query Hive once connected to HiveServer2.
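For example, Beeline (the CLI that ships with Hive) connects over the same JDBC interface; the host and username below are placeholders, and 10000 is HiveServer2's default port:
beeline -u jdbc:hive2://your-hiveserver2-host:10000/default -n your_user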
For Pig, you can use HCatalog:
pig -useHCatalog
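As a minimal sketch, a Pig script launched that way can read any Hive table (managed or external) through HCatLoader; mydb.mytable is a hypothetical table name:
-- load the table using the Hive metastore's metadata
A = LOAD 'mydb.mytable' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP A;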
I have freshly deployed Hive 2.4.3, but there are a few existing partitioned tables from an older Hive 1.2 installation. I am using Derby as the metastore.
What is the best way to migrate them to new installation of Hive?
Create external tables in the new Hive installation, then use this command to create the partition metadata:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add the Hive partition metadata. See the manual on both commands here: RECOVER PARTITIONS
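For example, a sketch with an assumed schema (the table name, column, partition key, and location are placeholders; point LOCATION at the directory holding the existing data):
CREATE EXTERNAL TABLE logs (msg STRING)
PARTITIONED BY (dt STRING)
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/logs';

MSCK REPAIR TABLE logs;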
I understand that the Hadoop community promotes using an RDBMS as the Hive metastore. But can we use NoSQL databases like MongoDB or HBase for the metastore?
If not, why not? What are the criteria for choosing a database for the Hive metastore?
If you are using the Cloudera Distribution,
Cloudera strongly encourages you to use MySQL because it is the most popular with the rest of the Hive user community, and, hence, receives more testing than the other options.
The metastore keeps metadata for Hive tables (such as their schema and location) and partitions in a relational database accessed over JDBC, so only relational databases can be used.
MariaDB is sometimes suggested in this context, but it is not a NoSQL database; it is a relational, MySQL-compatible fork of MySQL, which is exactly why it works as a metastore backend while MongoDB or HBase do not.
I am considering HDInsight with Hive and data loaded on Azure Blob Storage.
There is a combination of both historic and changing data.
Does the solution mentioned in "Update, SET option in Hive" work with blob storage too?
Will the Hive statement below change the data in blob storage, which is my requirement?
INSERT OVERWRITE TABLE _tableName_ PARTITION ...
INSERT OVERWRITE will write new file(s) into the cluster file system. In HDInsight, the file system is backed by Azure blobs, addressed by the wasb://... and wasb:///... names. Everything Hive does to the cluster file system, such as overwriting files, will accordingly be reflected in the Azure storage blobs. See "Use Hive with HDInsight" for more details.
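As a hedged sketch (the table names and partition value are hypothetical), overwriting one partition, and therefore its files in the wasb-backed store, looks like:
INSERT OVERWRITE TABLE sales PARTITION (dt = '2015-01-01')
SELECT id, amount FROM sales_staging WHERE dt = '2015-01-01';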
I'm trying to create tables programmatically using JDBC. However, I can't see the tables I created from the Hive shell. What's worse, when I launch the Hive shell from different directories, I see different sets of databases.
Is there any setting I need to configure?
Thanks in advance.
Make sure you run Hive from the same directory every time: when you launch the Hive CLI, it creates a Derby metastore database in the current working directory, and that Derby database contains the metadata for your Hive tables. If you change directories, you end up with scattered metadata for your tables. Also, Derby cannot handle multiple sessions. To allow concurrent Hive access you would need a real database to manage the metastore rather than the wimpy little Derby DB that comes with it. You can download MySQL for this and point the Hive JDBC connection properties at the MySQL Type 4 pure-Java driver.
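A minimal hive-site.xml sketch for a MySQL-backed metastore; the property names are the standard ones, while the host, database, and credentials are placeholders:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>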
Try emailing the Hive userlist or the IRC channel.
You probably need to set up the central Hive metastore (Derby by default, but it can be MySQL/Oracle/Postgres). The metastore is the "glue" between Hive and HDFS: it tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, and so on.
For more information, see http://wiki.apache.org/hadoop/HiveDerbyServerMode
Examine your Hadoop logs. For me this happened when my Hadoop system was not set up properly and the NameNode was unable to contact the DataNodes on the other machines.
Yeah, it's due to the metastore not being set up properly. The metastore stores the metadata associated with your Hive tables (e.g. the table name, table location, column names, column types, bucketing/sorting information, partitioning information, SerDe information, etc.).
The default metastore is an embedded Derby database which can only be used by one client at any given time. That is obviously not good enough for most practical purposes. Like most users, you should configure your Hive installation to use a different metastore; MySQL seems to be a popular choice. I used this link from Cloudera's website to successfully configure my MySQL metastore.
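Once the MySQL connection properties are in hive-site.xml, newer Hive releases also include a schematool utility to initialize the metastore schema (the path assumes a standard install):
$HIVE_HOME/bin/schematool -dbType mysql -initSchema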