When to use Hcatalog and what are its benefits - hadoop

I'm new to Hcatlog (HCAT), we would like to know in what usecases/scenario's we use HCAT, Benefits of making use of HCAT, Is there any Performance Improvement can be gain from HCatlog. Can any one just provide information on when to use Hcatlog

Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache Map/Reduce, and Apache Hive – to more easily read and write data on the grid.
HCatalog creates a table abstraction layer over data stored on an HDFS cluster. This table abstraction layer presents the data in a familiar relational format and makes it easier to read and write data using familiar query language concepts.
HCatalog data structures are defined using Hive's data definition language (DDL) and the Hive metastore stores the HCatalog data structures. Using the command-line interface (CLI), users can create, alter, and drop tables. Tables are organized into databases or are placed in the default database if none are defined for the table. Once tables are created, you can explore the metadata of the tables using commands such as Show Table and Describe Table.
HCatalog commands are the same as Hive's DDL commands.
HCatalog’s ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.
HCatalog opens up the hive metadata to other Map/Reduce tools. Every Map/Reduce tools has its own notion about HDFS data (example Pig sees the HDFS data as set of files, Hive sees it as tables) HCatalog supported Map/Reduce tools do not need to care about where the data is stored, in which format and storage location.
It assist integration with other tools and supplies read and write interfaces for Pig, Hive and Map/Reduce.
It provide shared schema and data types for Hadoop tools.You do not have to explicitly type the data structures in each program.
It Expose the information as Rest Interface for external data access.
It also integrates with Sqoop, which is a tool designed to transfer data back and forth between Hadoop and relational databases such as SQL Server and Oracle
It provide APIs and webservice wrapper for accessing metadata in hive metastore.
HCatalog also exposes a REST interface so that you can create custom tools and applications to interact with Hadoop data structures.
This allows us to use the right tool for the right job. For example, we can load data into Hadoop using HCatalog, perform some ETL on the data using Pig, and then aggregate the data using Hive. After the processing, you could then send the data to your data warehouse housed in SQL Server using Sqoop. You can even automate the process using Oozie.
How it works:
Pig- HCatLoader and HCatStore interface
Map/Reduce- HCatInputFormat and HCatOutputFormat interface
Hive- No Interface Necessary. Direct access to metadata
References:
Microsoft Big Data Solution
http://hortonworks.com/hadoop/hcatalog/
Answer to your question:
As I described earlier HCatalog provides shared schema and data types for hadoop tools It simplifies your work during data processing. If you have created a table using HCatalog, you can directly access that hive table through pig or Map/Reduce (you cannot simply access a hive table through pig or Map Reduce).You don't need to create schema for every tool.
If you are working with the shared data that can be used from multiple
users(some team using Hive, some team using pig, some team using Map/Reduce) then HCatalog will be useful as they just need to table only to access the data for processing.
It is not replacement of any tool It a facility to provide single access to many tools.
Performance depends on your hadoop cluster. You should do some performance benchmarking in your Hadoop cluster to major performance.

Related

Purpose of using HBase in Hadoop instead of Hive [duplicate]

This question already has an answer here:
What is the difference between hbase and hive? (Hadoop)
(1 answer)
Closed 5 years ago.
In my project, we are using Hadoop 2, Spark, Scala. Scala is the programming language and Spark is using here for analysing. we are using Hive and HBase both. I can access all details like file etc. of HDFS using Hive.
But my confusions are -
When I can able to performed all jobs using Hive, Then why HBase is required to store the data. Is it not an overhead?
What are the functionality of HIVE and HBase?
If we only used Hive, Then what should be the problem?
Can anyone please let me know.
When I can able to performed all jobs using Hive, Then why HBASE is required to store the data. Is it not a overhead?
What are the functionality of Hive and Hbase
HBase is No Sql database which stores the data in key value pair. Hive has integration with Hbase.Hbase HIve Integration
Advantage :- Hive queries over HBase. Think joins and a easy way to do aggregates and simple operations on your Hbase data.
Hbase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses Hbase for their live website. Hive is not a real-time query engine, so its data store could not be used for similar purposes. Hive over HBase gives you the benefit of both worlds.
If we only used Hive, Then what should be the problem?
If we will use Hive There is no problem . But in project there so many scenarios we have to consider .
Performance
Storage
Stability of used technology
Compatibility (Hive ware house is easily accessible for most of the Tools in Hadoop)
When I can able to performed all jobs using Hive, Then why HBase is
required to store the data. Is it not an overhead?
I can't say it's overhead or not. But HBase responds to requests in real-time as its database when it comes to Hive it runs jobs on MapReduce/Spark/Tez engines.
What are the functionality of Hive and HBase?
Hive:
It's a SQL-like language that gets translated into MapReduce/Spark/Tez jobs. it only runs batch processes on Hadoop. for more check this how Hive queries run on MapReduce engine
HBase:
It's key/value store database which runs on top of HDFS/S3(on AWS). It does real-time operations for requests.
If we only used Hive, Then what should be the problem?
As discussed If the query needs to process in real-time then HBase is the choice over Hive.

HDFS into Cassandra

is it possible to migrate/replicate/copy/move processed files (using PIG) from local HDFS (lets say 192.168.0.10) to a cassandra (192.168.0.20)?
What I have in mind is that I literally create a java application to parse the file and re-insert them into cassandra.
Is there any other way in doing so?
thanks alot!
Writing a Java program to migrate Hadoop data to Cassandra tables is actually a overkill. It would become more worse if you happen to perform the same periodically.
Instead , we can utilize a very useful feature of Hive which helps us to integrate Hive tables with external data sources. Its Storage Handler Api of hive, which integrates with external data sources like Cassandra/Oracle/Mysql etc.
There is already an Hive-Cassandra Storage Handler API implementation available , which we can very well reuse, kindly find the same in below url.
https://github.com/tuplejump/cash/tree/master/cassandra-handler
The idea is to create Hive external table which is configured with storage handler specs about the remote Cassandra host/table details.
Any write/read performed to this external table , will be handled by Hive through mapreduce jobs which talks with the Cassandra.
I hope this is the ideal way to integrate Hive and Cassandra which takes very less efforts from us and very efficient too.
Hope this helps.
There are several ways to move the data from Hadoop to Cassandra.
Using Java HDFS API and Cassandra API (inefficient).
Using Java MapReduce program (Parallel loading).
Using Pig (Parallel loading).
Using Hive (Parallel loading).
Using Spark (Parallel loading).
Out of all Pig is easier way to load the data from HDFS to Cassandra.
Pig has a storage type called CassandraStorage. It allows us to load the data into Cassandra in parallel.
Please see this link for more information:
https://wiki.apache.org/cassandra/HadoopSupport#Pig

Does Hive really use HCatalog?

I have seen some conflicting posts across the web about whether or not Hive uses HCatalog to access the metastore and I was hoping someone could help me out here?
Does Hive use the actual HCatalog api's to access the metastore, or does it have its own mechanism of retrieving metadata and is HCatalog only used by non-hive tools to access the metadata?
No ,hive doesn't uses Hcatalog Api to access metastore.
HCatalog opens up the hive metadata to other mapreduce tools. Every mapreduce tools has its own notion about HDFS data (example Pig sees the HDFS data as set of files, Hive sees it as tables). With having table based abstraction, HCatalog supported mapreduce tools do not need to care about where the data is stored, in which format and storage location (HBase or HDFS).

What's the difference between Flume and Sqoop?

Both Flume and Sqoop are meant for data movement, then what is the difference between them? Under what condition should I use Flume or Sqoop?
From http://flume.apache.org/
Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of log
data.
Flume helps to collect data from a variety of sources, like logs, jms, Directory etc. Multiple flume agents can be configured to collect high volume of data.
It scales horizontally.
From http://sqoop.apache.org/
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk
data between Apache Hadoop and structured datastores such as
relational databases.
Sqoop helps to move data between hadoop and other databases and it can transfer data in parallel for performance.
Both Sqoop and Flume, pull the data from the source and push it to the sink. The main difference is Flume is event driven, while Sqoop is not.
Flume:
Flume is a framework for populating Hadoop with data. Agents are populated
throughout ones IT infrastructure – inside web servers, application servers
and mobile devices, for example – to collect data and integrate it into Hadoop.
Sqoop:
Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such
as relational databases and data warehouses – into Hadoop. It allows users to
specify the target location inside of Hadoop and instruct Sqoop to move data
from Oracle,Teradata or other relational databases to the target.
You can see the full Post
Flume:
A very common use case is collecting log data from one system- a bank of web servers(aggregating it in HDFS for later analysis).
Sqoop:
On the other hand is designed for performing bulk imports of data into HDFS from structured data stores. simple use case will be an organization that runs a nightly sqoop import to load the day's data from a production DB into a Hive data ware house for analysis.
--From the definitive guide.
Apache Sqoop and Apache Flume work with various kinds of data sources. Flume functions well in streaming data sources which are generated continuously in hadoop environment such as log files from multiple servers.
whereas Apache Sqoop is designed to work well with any kind of relational database system that has JDBC connectivity.
Sqoop can also import data from NoSQL databases like MongoDB or Cassandra and also allows direct data transfer or Hive or HDFS. For transferring data to Hive using Apache Sqoop tool, a table has to be created for which the schema is taken from the database itself.
In Apache Flume data loading is event driven whereas in Apache Sqoop data load is not driven by events.
4.Flume is a better choice when moving bulk streaming data from various sources like JMS or Spooling directory whereas Sqoop is an ideal fit if the data is sitting in databases like Teradata, Oracle, MySQL Server, Postgres or any other JDBC compatible database then it is best to use Apache Sqoop.
5.In Apache Flume, data flows to HDFS through multiple channels whereas in Apache Sqoop HDFS is the destination for importing data.
6.Apache Flume has agent based architecture i.e. the code written in flume is known as agent which is responsible for fetching data whereas in Apache Sqoop the architecture is based on connectors. The connectors in Sqoop know how to connect with the various data sources and fetch data accordingly.
Lastly, Sqoop and Flume cannot be used achieve the same tasks as they are developed specifically to serve different purposes. Apache Flume agents are designed to fetch streaming data like tweets from Twitter or log file from the web server whereas Sqoop connectors are designed to work only with structured data sources and fetch data from them.
Apache Sqoop is mainly used for parallel data transfers, for data imports as it copies data quickly where Apache Flume is used for collecting and aggregating data because of its distributed, reliable nature and highly available backup routes.
Sqoop and Flume both are meant to fulfill data ingestion needs but they serve different purposes. Apache Flume works well for streaming data sources that are generated continuously in hadoop environment such as log files from multiple servers whereas whereas Apache Sqoop works well with any RDBMS has JDBC connectivity.
Sqoop is actually meant for bulk data transfers between hadoop and any other structured data stores. Flume collects log data from many sources, aggregating it, and writing it to HDFS.
I came across this interesting infographic that explains the differences between the two apache projects Sqoop and Flume -
Difference between Sqoop and Flume
Sqoop
Sqoop can perform import/export from RDBMS to HDFS/HIVE/HBASE
sqoop only import/export structured data not unstructured or semi
structured.
Flume
import stream data from multiple sources mostly semi-structured and
unstructured in nature. Now Kafka is better alternative for flume.

Is there a way to use a JDBC as a input resource for Hadoop's MapReduce?

I have data in a PostgreSQL DB and I'd like to get it, treat it and save it to a HBase DB. Is it possible to distribute somehow the JDBC operation in a Map operation?
Yes you can do that by DBInputFormat:
DBInputFormat uses JDBC to connect to data sources. Because JDBC is widely implemented, DBInputFormat can work with MySQL, PostgreSQL, and several other database systems. Individual database vendors provide JDBC drivers to allow third-party applications (like Hadoop) to connect to their databases.
The DBInputFormat is an InputFormat class that allows you to read data from a database. An InputFormat is Hadoop’s formalization of a data source; it can mean files formatted in a particular way, data read from a database, etc. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database.
LINK
I think you're looking for Sqoop, which is designed to import from SQL servers to HDFS stack technologies. It puts the data it gets from a JDBC connection into HDFS, thereby splitting it across your Hadoop NameNodes. I believe this is what you are looking for.
SQl to hadOOP = SQOOP, get it?
Sqoop can import into HBase. See this link.

Resources