I have seen some conflicting posts across the web about whether or not Hive uses HCatalog to access the metastore, and I was hoping someone could clear this up.
Does Hive use the actual HCatalog APIs to access the metastore, or does it have its own mechanism for retrieving metadata, with HCatalog only used by non-Hive tools to access the metadata?
No, Hive does not use the HCatalog API to access the metastore.
HCatalog opens up the Hive metadata to other MapReduce tools. Every MapReduce tool has its own notion of HDFS data (for example, Pig sees HDFS data as a set of files, while Hive sees it as tables). With the table-based abstraction, HCatalog-supported MapReduce tools do not need to care about where the data is stored, in which format, or in which storage layer (HBase or HDFS).
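To make the distinction concrete, here is a minimal sketch of talking to the metastore with Hive's own Thrift client (HiveMetaStoreClient), which is roughly what Hive does internally rather than going through HCatalog. The metastore URI and table name are placeholders for illustration only:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class MetastoreDirectAccess {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf();
            // Placeholder URI -- point this at your metastore Thrift service
            conf.set("hive.metastore.uris", "thrift://metastore-host:9083");

            // Hive itself reaches the metastore through this Thrift client,
            // not through the HCatalog APIs
            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            try {
                Table t = client.getTable("default", "my_table"); // hypothetical table
                System.out.println("Location: " + t.getSd().getLocation());
            } finally {
                client.close();
            }
        }
    }

HCatalog wraps this same metastore so that Pig and plain MapReduce jobs can see the table definitions too.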
Related
I was doing a case study on Spotify. I found out that Spotify uses Cassandra as a DB and also Hadoop. My question is: how is Hadoop different from a database? What type of files does a Hadoop datanode store? Why does every corporation have a DB as well as Hadoop? I know Hadoop is not a DB, but what is it used for if there is a DB cluster to save data?
Hadoop is not a database at all. Hadoop is a set of tools for distributed storage and processing, such as the distributed filesystem (HDFS), the MapReduce framework libraries, and the YARN resource manager.
Other tools like Hive, Spark, Pig, Giraph, Sqoop, etc. can use Hadoop or its components. For example, Hive is a database: it uses HDFS for storing its data and MapReduce framework primitives for building its query execution graph.
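As a small illustration of "Hive stores its data in HDFS": a Hive table is ultimately just a directory of ordinary files under the warehouse path. A sketch using the plain HDFS API; the warehouse path below is the common default and the table name is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHiveTableFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Default warehouse location; adjust for your cluster
            Path tableDir = new Path("/user/hive/warehouse/my_table");
            for (FileStatus status : fs.listStatus(tableDir)) {
                // These are ordinary HDFS files -- Hive adds table metadata on top
                System.out.println(status.getPath() + " " + status.getLen());
            }
        }
    }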
I'm looking for a way to get HBase data available/queryable in Vertica. I have seen that Vertica has good integration with Hive's metastore - the HCatalog Connector.
The connector can read a table definition out of Hive Metastore and use the description to read the data directly.
The question is whether the connector supports reading Hive external tables configured with a non-standard StorageHandler, HBaseStorageHandler in particular.
I tried this a long time ago and I was able to read Hive external tables using the HiveHBaseStorageHandler (I think the name of the jar is hive-hbase-handler.jar). Please give it a try and let us know. You need to place this jar in /opt/vertica/packages/hcat/lib/.
The Hive installation guide says that Hive can be applied to an RDBMS. My question is: it sounds like Hive can exist without Hadoop, right? Is it an independent HQL engine that could work with any data source?
You can run Hive in local mode to use it without Hadoop for debugging purposes. See the URL below:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode
Hive provides a JDBC driver so you can query it like any other JDBC data source; however, if you are planning to run Hive queries on a production system, you need the Hadoop infrastructure to be available. Hive queries are eventually converted into map-reduce jobs, and HDFS is used as the data storage for Hive tables.
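A minimal sketch of the JDBC route mentioned above, assuming HiveServer2 is running on its default port (10000) and the hive-jdbc driver is on the classpath; the host, credentials, and table name are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC URL; default port is 10000
            String url = "jdbc:hive2://hive-server:10000/default";
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }

On a real cluster the query behind this call is still compiled down to map-reduce (or Tez/Spark) jobs, which is why the Hadoop infrastructure has to be there.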
I'm new to HCatalog (HCAT). We would like to know in what use cases/scenarios we would use HCAT, the benefits of making use of HCAT, and whether any performance improvement can be gained from HCatalog. Can anyone provide information on when to use HCatalog?
Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache Map/Reduce, and Apache Hive – to more easily read and write data on the grid.
HCatalog creates a table abstraction layer over data stored on an HDFS cluster. This table abstraction layer presents the data in a familiar relational format and makes it easier to read and write data using familiar query language concepts.
HCatalog data structures are defined using Hive's data definition language (DDL), and the Hive metastore stores the HCatalog data structures. Using the command-line interface (CLI), users can create, alter, and drop tables. Tables are organized into databases, or are placed in the default database if no database is specified for the table. Once tables are created, you can explore their metadata using commands such as Show Tables and Describe Table.
HCatalog commands are the same as Hive's DDL commands.
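The same metadata can also be reached programmatically. Below is a sketch using the HCatalog Java API (HCatClient); exact class availability depends on your Hive/HCatalog version, and the database/table names are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hive.hcatalog.api.HCatClient;
    import org.apache.hive.hcatalog.api.HCatTable;

    public class HCatMetadataExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes hive-site.xml (with the metastore URI) is on the classpath
            HCatClient client = HCatClient.create(conf);
            try {
                // Roughly the programmatic equivalent of Show Tables / Describe Table
                for (String name : client.listTableNamesByPattern("default", "*")) {
                    System.out.println(name);
                }
                HCatTable table = client.getTable("default", "my_table");
                System.out.println(table.getCols());
            } finally {
                client.close();
            }
        }
    }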
HCatalog ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables' metadata.
HCatalog opens up the Hive metadata to other Map/Reduce tools. Every Map/Reduce tool has its own notion of HDFS data (for example, Pig sees HDFS data as a set of files, while Hive sees it as tables). HCatalog-supported Map/Reduce tools do not need to care about where the data is stored, in which format, or in which storage location.
It assists integration with other tools and supplies read and write interfaces for Pig, Hive, and Map/Reduce.
It provides a shared schema and data types for Hadoop tools. You do not have to explicitly type the data structures in each program.
It exposes the information as a REST interface for external data access.
It also integrates with Sqoop, which is a tool designed to transfer data back and forth between Hadoop and relational databases such as SQL Server and Oracle.
It provides APIs and a web service wrapper for accessing metadata in the Hive metastore.
HCatalog also exposes a REST interface so that you can create custom tools and applications to interact with Hadoop data structures.
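For example, an external tool could fetch a table's description over the WebHCat (Templeton) REST endpoint. A sketch assuming WebHCat runs on its default port 50111; the host, user, and table names are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHCatExample {
        public static void main(String[] args) throws Exception {
            // WebHCat (Templeton) default port is 50111; user.name is required
            URL url = new URL("http://webhcat-host:50111/templeton/v1/ddl/database/default/table/my_table?user.name=hive");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            try (BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // JSON description of the table
                }
            }
        }
    }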
This allows us to use the right tool for the right job. For example, we can load data into Hadoop using HCatalog, perform some ETL on the data using Pig, and then aggregate the data using Hive. After the processing, you could then send the data to your data warehouse housed in SQL Server using Sqoop. You can even automate the process using Oozie.
How it works:
Pig - HCatLoader and HCatStorer interfaces
Map/Reduce - HCatInputFormat and HCatOutputFormat interfaces (see the sketch below)
Hive - no interface necessary; direct access to the metadata
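A minimal driver sketch for the Map/Reduce interface mentioned above, assuming the hive-hcatalog-core jar is on the job classpath; the database, table, mapper, and output configuration are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    public class HCatMapReduceDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "read-hcatalog-table");
            job.setJarByClass(HCatMapReduceDriver.class);

            // Read records from the Hive/HCatalog table instead of raw HDFS paths
            HCatInputFormat.setInput(job, "default", "my_table");
            job.setInputFormatClass(HCatInputFormat.class);

            // Mapper and output configuration omitted in this sketch:
            // job.setMapperClass(MyMapper.class);  // hypothetical mapper receiving HCatRecord values
            // job.setOutputFormatClass(...);       // e.g. HCatOutputFormat to write back to a table

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }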
References:
Microsoft Big Data Solution
http://hortonworks.com/hadoop/hcatalog/
Answer to your question:
As I described earlier, HCatalog provides a shared schema and data types for Hadoop tools, which simplifies your work during data processing. If you have created a table using HCatalog, you can directly access that Hive table through Pig or Map/Reduce (without HCatalog you cannot simply access a Hive table through Pig or Map/Reduce). You don't need to create a schema for every tool.
If you are working with shared data that is used by multiple users (some teams using Hive, some using Pig, some using Map/Reduce), then HCatalog will be useful, as they only need the table to access the data for processing.
It is not a replacement for any tool; it is a facility that provides a single point of access for many tools.
Performance depends on your Hadoop cluster. You should do some performance benchmarking on your Hadoop cluster to measure it.
I am looking into using Hive on our Hadoop cluster and then using Presto to do some analytics on the data stored in Hadoop, but I am still confused about some things:
Files are stored in Hadoop (some kind of file manager)
Hive needs tables to store data from Hadoop (data manager)
Do both Hadoop and Hive store their data separately, or does Hive just use the files from Hadoop (in terms of hard disk space and so on)?
-> So does Hive import data from Hadoop into tables and leave Hadoop alone, or how should I see this?
Can Presto be used without Hive and just on Hadoop directly?
Thanks in advance for answering my questions :)
First things first: files are stored in the Hadoop Distributed File System (HDFS). Is that what you call a data manager?
Actually, Hive can use both - "regular" files in HDFS, or tables, which are once again "regular" files plus additional metadata stored in a special datastore (the metastore); the table data itself lives under an HDFS directory called the warehouse.
Concerning Presto - it has built-in support for the Hive metastore, but you can also write your own connector plugin for any data source.
Please read more info about Hive connector configuration here and about connector plugins here.
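As an illustration of querying Hive tables through Presto's Hive connector, here is a sketch using the Presto JDBC driver; the coordinator host/port, catalog, schema, and table names are placeholders, and the driver artifact name differs across Presto/Trino versions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;

    public class PrestoHiveQuery {
        public static void main(String[] args) throws Exception {
            // "hive" is the catalog backed by the Hive connector, "default" the schema
            String url = "jdbc:presto://presto-coordinator:8080/hive/default";
            Properties props = new Properties();
            props.setProperty("user", "analyst"); // Presto requires a user name

            try (Connection conn = DriverManager.getConnection(url, props);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT count(*) FROM my_table")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }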