Why does an HBase-backed Hive table use MapReduce? - hadoop

I am using HBase-backed Hive tables in my project; the reason we opted for HBase-backed Hive was to be able to perform updates.
Apart from that, what are the other advantages of HBase-backed Hive tables? Hive still uses MapReduce when the table is queried, so even when we want only a small set of data, it takes a long time to return the result because the table is huge.
But if we perform a scan with a range, or just a get, in the HBase shell, the results come back in a fraction of a second. So what are the other advantages of using an HBase-backed Hive table, apart from updates (which are now available in Hive as well) and the ease of SQL?
How does Hive evaluate and run a query if the table is backed by HBase?
Why does it use MapReduce to scan and return the result, instead of the HBase engine, which is much faster?
And does HBase have its own engine to perform scan and get operations to fetch data from its HFiles?

I would advise you not to use HBase-backed Hive.
As you can see, a scan with a filter runs in a fraction of the time that the Hive query takes.
That is because HBase filters the data at the storage level, while Hive loads all of the table's data and then filters it.
There is supposed to be predicate pushdown from Hive to HBase, but there are a lot of open issues on the matter, and much of the predicate pushdown is disabled.
For more, you can check this page: Hive HBase Integration
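To make the difference concrete, here is a minimal sketch (all table and column names are made up) of mapping an existing HBase table into Hive with the HBase storage handler, and the kind of point lookup that suffers without pushdown. The same lookup in the HBase shell, get 'users', 'user123', is answered directly by the region holding that key.

    -- Hypothetical mapping of an existing HBase table 'users' into Hive;
    -- ':key' binds the HBase row key to the first Hive column.
    CREATE EXTERNAL TABLE hbase_users (rowkey STRING, name STRING, email STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:email')
    TBLPROPERTIES ('hbase.table.name' = 'users');

    -- With pushdown disabled, even this single-key lookup launches a
    -- MapReduce job that scans the whole HBase table and filters afterwards:
    SELECT name, email FROM hbase_users WHERE rowkey = 'user123';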

Related

Impala vs Hive. How Impala circumvents MapReduce?

How is Impala able to achieve lower latency than Hive in query processing?
I was going through http://impala.apache.org/overview.html, where it is stated:
To avoid latency, Impala circumvents MapReduce to directly access the
data through a specialized distributed query engine that is very
similar to those found in commercial parallel RDBMSs. The result is
order-of-magnitude faster performance than Hive, depending on the type
of query and configuration.
How does Impala fetch the data without MapReduce (as used in Hive)?
Can we say that Impala is closer to HBase, and that it should be compared with HBase instead of with Hive?
Edit:
Or can we say that, classically, Hive sits on top of MapReduce and requires less memory to work with, while Impala does everything in memory, and hence requires more memory, because the data is already cached in memory and acted upon on request?
Just read Impala Architecture and Components
Impala is a massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts.... Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries.
It circumvents MapReduce containers by having a long-running daemon on every node that is able to accept query requests. There is no single point of failure that handles requests, like HiveServer2; all Impala engines are able to respond to query requests immediately rather than queueing up MapReduce YARN containers.
Impala does, however, rely on the Hive Metastore service, because it is simply a useful service for mapping the metadata stored in the RDBMS onto the Hadoop filesystem. Pig, Spark, PrestoDB, and other query engines also share the Hive Metastore without communicating through HiveServer.
Data is not "already cached" in Impala. Similar to Spark, you must read the data into a large portion of memory in order for operations to be quick. Unlike Spark, the daemons and statestore services remain active to handle subsequent queries.
Impala can query HBase, but its architecture is not similar, and in my experience a well-designed HBase table is faster to query than Impala. Impala is probably closer to Kudu.
Also worth mentioning: it's not really recommended to use Hive on MapReduce anymore. Tez is far better, and Hortonworks states that Hive LLAP is better than Impala, although, as you quoted, it largely "depends on the type of query and configuration."
Impala uses the "Impala Daemon" service to read data directly from the DataNode (the daemons must be installed on the same hosts as the DataNodes). It caches only the location of files and some statistics in memory, not the data itself.
That is why Impala can't read new files created within a table: you must run an invalidate or a refresh (depending on your case) to tell Impala to cache the new files and be able to read them directly.
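For example (the table name is illustrative), these are the two Impala statements in question; REFRESH is the lighter option for picking up new data files in a known table, while INVALIDATE METADATA rebuilds the cached metadata entirely:

    -- Pick up new data files added to an existing table's directory:
    REFRESH sales;

    -- Heavier: discard and reload all cached metadata for the table,
    -- e.g. after it was created or altered outside Impala (say, via Hive):
    INVALIDATE METADATA sales;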
Since Impala works in memory, you need enough memory for the data read by the query. If your query will use more data than you have memory (a complex query with aggregations on huge tables), use Hive with the Spark engine instead of the default MapReduce: issue set hive.execution.engine=spark; just before the query.
You can then run the same query in Hive on the Spark engine, as sketched below.
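A minimal sketch of such a session (the query itself is illustrative; Hive on Spark also has to be configured on the cluster):

    -- Switch this Hive session from the default MapReduce engine to Spark,
    -- then run the same HiveQL unchanged.
    SET hive.execution.engine=spark;

    SELECT region, COUNT(*) AS cnt
    FROM huge_table
    GROUP BY region;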
Impala is a Cloudera product; you won't find it on Hortonworks or MapR (or other distributions). Tez, for example, is not included with Cloudera.
It all depends on the platform you are using.

Hive or Hbase when we need to pull more number of columns?

I have a data structure in Hadoop with 100 columns and a few hundred rows. Most of the time I need to query 65% of the columns. In this case, which is better to use, HBase or Hive? Please advise.
The number of columns you are accessing is NOT, by itself, the criterion for deciding between HBase and Hive.
HIVE (SQL) :
Use Hive when you have warehousing needs, you are good at SQL, and you don't want to write MapReduce jobs. One important point, though: a Hive query gets converted into a corresponding MapReduce job under the hood, which runs on your cluster and gives you the result. Hive does the trick for you. But not every problem can be solved using HiveQL. Sometimes, if you need really fine-grained and complex processing, you might have to take shelter in MapReduce.
HBase (NoSQL database):
You can use HBase to serve that purpose. If you have data that you want to access in real time, you can store it in HBase.
An HBase get by rowkey is powerful when you know your access pattern; see the sketch below.
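For instance (names illustrative), the HBase shell lookup and its Hive-side analogue on a table mapped through the HBase storage handler look like this:

    -- HBase shell shape of this access pattern:  get 'users', 'user123'
    -- In Hive, against a table whose first column is mapped to the HBase
    -- row key, the same known-key lookup is:
    SELECT * FROM hbase_users WHERE rowkey = 'user123';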
HBase follows CP in the CAP theorem:
Consistency:
Every node in the system contains the same data (e.g., replicas are never out of date).
Availability:
Every request to a non-failing node in the system returns a response.
Partition tolerance:
The system's properties (consistency and/or availability) hold even when the system is partitioned (communication is lost) and data is lost (a node is lost).
Also have a look at this.
It's very difficult to answer the question in one line.
HBase is a NoSQL database: your data needs to be stored denormalized, because HBase is very bad at joining tables.
Hive: you can store data in a similar (normalized) format in Hive, but you would only see benefits when doing batch processing.

Performance Issue in Hadoop, HBase & Hive

I am working on migrating data from a SQL database to Hadoop, for which I have used HBase and Hadoop as well. I have successfully imported my data from the SQL db into Hadoop, HBase, and Hive. But the problem is the performance of the system. I was getting results over millions of entries within 5-10 minutes in the SQL db, but it takes around 1 hour to fetch 10 million rows from HBase and Hive. Can anyone help me improve the performance of my Hadoop system?
Data in HBase is only 'indexed' by rowkey. If you're querying in Hive on anything other than rowkey prefixes, you will generally be performing a full table scan.
There are some optimizations that can be made with HBase filters (e.g., when using a FamilyFilter, you may be able to skip entire regions), but I doubt Hive is doing that.
How to improve performance depends on how your data is shaped and what analysis you need to perform on it. When performing frequent ad-hoc analysis, you may be better served by exporting the data from HBase into something like Parquet files on HDFS and running your analysis against those with Hive (or Drill, Spark, Impala, etc.), as sketched below.
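A hedged sketch of that export path, assuming the HBase table is already mapped into Hive (all names here are made up): materialize a columnar copy once, then point the ad-hoc queries at the copy.

    -- One-off export: snapshot the HBase-backed table into Parquet on HDFS.
    CREATE TABLE events_parquet STORED AS PARQUET AS
    SELECT * FROM hbase_events;

    -- Ad-hoc analysis now scans columnar Parquet files instead of
    -- performing full HBase region scans:
    SELECT event_type, COUNT(*) FROM events_parquet GROUP BY event_type;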

What is the difference between HBase and Hive? (Hadoop)

From my understanding, HBase is the Hadoop database and Hive is the data warehouse.
Hive allows you to create tables and store data in them; you can also map your existing HBase tables to Hive and operate on them.
Why should we use HBase if Hive does all that? Can we use Hive by itself?
I'm confused :(
So in simple terms: with Hive you can fire SQL-like queries (with some exceptions) at your tables, and it is used for batch operations. With HBase, you can do real-time querying, and it is based on key-value pairs.
"Why should we use HBase if Hive does all that? Can we use Hive by itself?" Because Hive traditionally didn't support updating your data set (recent versions do, on ACID tables; see the sketch below). So if you have a large analytical-processing application, use Hive, and if you have real-time get/set/update request processing, use HBase.

HBase in comparison with Hive

I'm trying to get a clear understanding of HBase.
Hive: it just creates a tabular structure over the underlying files in HDFS, so that we can give users the ability to query the HDFS files. Correct me if I'm wrong here.
HBase: again, we create a similar table structure, but a bit more structured (column-oriented), again over the HDFS file system.
Aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce?
Also, is it true that we can't create an HBase table over an already existing HDFS file?
Hive shares a very similar structure with a traditional RDBMS (though not entirely); HQL syntax is almost identical to SQL, which is good for a database programmer from a learning perspective, whereas HBase is completely different in the sense that it can be queried efficiently only on the basis of its row key.
If you design a table in an RDBMS, you follow a structured approach, defining columns and concentrating on attributes, while in HBase the whole design is centered around the data. So, depending on the type of query to be run, we design the HBase table accordingly; the columns are also dynamic and can change at runtime (a core feature of NoSQL).
You said: aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce? It is not that simple. When a Hive query is executed, a MapReduce job is created and triggered. Depending on the data size and complexity, it may take time, since for each MapReduce job there are a number of steps for the JobTracker to perform, initializing tasks like map, combine, shuffle/sort, reduce, etc.
But when we access HBase, it directly looks up the data it has indexed, based on the specified Scan or Get parameters. In that sense it simply acts as a database.
Hive and HBase are completely different things:
Hive is a way to create MapReduce jobs for data that resides on HDFS (which can be plain files or HBase).
HBase is an OLTP-oriented key-value store that resides on HDFS and can be used in MapReduce jobs.
In order for Hive to work, it holds metadata that maps the HDFS data onto tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive has evolved beyond being just a SQL-like way to write MapReduce jobs; with what Hortonworks calls the "Stinger initiative", they have added a dedicated file format (ORC) and improved Hive's performance (e.g., with the upcoming Tez execution engine) to deliver SQL on Hadoop, i.e., a relatively fast way to run analytic queries on data stored in Hadoop.
Hive:
It just creates a tabular structure over the underlying files in HDFS, so that users get SQL-like querying abilities on existing HDFS files, with typical latencies of up to minutes.
However, for best performance it's recommended to ETL the data into Hive's ORC format.
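For example (table names illustrative), the usual ETL step is a one-time create-table-as-select from the raw table into ORC:

    -- The raw data stays where it is; queries run against the ORC copy.
    CREATE TABLE logs_orc STORED AS ORC AS
    SELECT * FROM logs_raw;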
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages stored while crawling the Internet.
The row key is the reversed URL of the page, for example, org.hbase.www. There is a column family storing the actual HTML code, the contents family, as well as others like anchor, which is used to store outgoing links, another one to store inbound links, and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies of the HTML, and is helpful when you want to analyze how often a page changes, for example. The timestamps used are the actual times when they were fetched from the crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows HBase to run on an existing Hadoop cluster, and it guarantees redundant storage of the data; but it is not a feature in any other sense.
Also, is it true that we can't create an HBase table over an already existing HDFS file?
No, you can't do that. Internally, HBase stores data in its own HFile format, so an HBase table can't simply be laid over an arbitrary pre-existing HDFS file.
