Performance Issue in Hadoop, HBase & Hive

I am migrating data from a SQL database to Hadoop, using HBase and Hive as well. I have successfully imported my data from the SQL database into Hadoop, HBase and Hive, but the problem is the performance of the system. In the SQL database I was getting results for millions of entries within 5-10 minutes, but it takes around 1 hour to fetch 10 million entries from HBase and Hive. Can anyone help me improve the performance of my Hadoop system?

Data in HBase is only 'indexed' by rowkey. If you're querying in Hive on anything other than rowkey prefixes, you will generally be performing a full table scan.
There are some optimizations that can be made with HBase filters; for example, when using a FamilyFilter, you may be able to skip entire regions, but I doubt Hive is doing that.
How to improve performance depends on how your data is shaped and what analysis you need to perform on it. If you perform frequent ad-hoc analysis, you may be better served by exporting data from HBase into something like Parquet files on HDFS and running your analysis against those with Hive (or Drill, Spark, Impala, etc.).
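To make the rowkey point concrete, here is a minimal Python sketch. It is a toy model, not the HBase API: the key layout `userNNN#DD`, the `amount` field, and both function names are assumptions made up for illustration. It only shows why a rowkey-prefix query touches a handful of rows while any other predicate has to read everything.

```python
from bisect import bisect_left

# Toy stand-in for an HBase table: rows kept sorted by rowkey.
# Key layout and values are hypothetical.
rows = sorted(
    (f"user{u:03d}#{d:02d}", {"amount": u * d})
    for u in range(100)
    for d in range(1, 11)
)
keys = [k for k, _ in rows]

def prefix_scan(prefix):
    """Seek to the first key >= prefix, then read while the prefix still matches."""
    i = bisect_left(keys, prefix)
    hits = []
    while i < len(rows) and keys[i].startswith(prefix):
        hits.append(rows[i])
        i += 1
    return hits, len(hits)  # (matching rows, rows read)

def full_scan(predicate):
    """No usable rowkey prefix: every row is read and tested."""
    hits = [r for r in rows if predicate(r[1])]
    return hits, len(rows)  # (matching rows, rows read)

by_key, read_by_key = prefix_scan("user042#")
by_value, read_by_value = full_scan(lambda v: v["amount"] == 420)
print(read_by_key, read_by_value)
```

The prefix query reads only the rows it returns, whereas the value-based query reads the whole table regardless of how few rows match.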

Related

Why does an HBase-backed Hive table use MapReduce?

I am using HBase-backed Hive tables in my project; the reason we opted for them is to perform updates.
Apart from that, what are the other advantages of HBase-backed Hive tables, given that they still use MapReduce when queried from Hive?
Even if we want only a small set of data, the query takes a long time to return because the table is huge.
But if we perform a scan with a range, or just a get, in the HBase shell, results come back in a fraction of a second. So what are the other advantages of using an HBase-backed Hive table apart from updates (which are now available in Hive as well) and the ease of SQL?
How does Hive evaluate and run a query if the table is backed by HBase?
Why does it use MapReduce to scan and return results instead of the HBase engine, which is much faster?
And does HBase have its own engine to perform scan and get operations to fetch data from its HFiles?
I would advise you not to use HBase-backed Hive tables.
As you can see, the scan with a filter runs in a fraction of the time that the Hive query takes.
That is because HBase filters the data at the storage level, while Hive loads all of the table data and then filters it.
There was supposed to be predicate pushdown from Hive to HBase, but there are a lot of open issues on the matter, and much of the predicate pushdown is disabled.
For more, you can check the page: Hive HBase Integration
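The difference between filtering at the storage level and filtering after loading can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Hive or HBase code: the table contents, the `country` field, and the function names are all hypothetical. It counts how many rows cross from "storage" to the "query engine" in each case.

```python
# Toy "storage layer": 10,000 made-up rows.
TABLE = [{"id": i, "country": "IN" if i % 50 == 0 else "US"} for i in range(10_000)]

def storage_scan(row_filter=None):
    """Yield rows from storage; a filter passed here runs before rows leave storage."""
    for row in TABLE:
        if row_filter is None or row_filter(row):
            yield row

def query(predicate, pushdown):
    """Run a filtered query and report how many rows storage had to ship."""
    shipped = 0
    result = []
    for row in storage_scan(predicate if pushdown else None):
        shipped += 1
        if predicate(row):  # engine-side check; redundant when pushdown already filtered
            result.append(row)
    return result, shipped

def is_india(row):
    return row["country"] == "IN"

pushed, shipped_pushed = query(is_india, pushdown=True)
loaded, shipped_loaded = query(is_india, pushdown=False)
print(shipped_pushed, shipped_loaded)
```

Both calls return the same rows; only the number of rows shipped differs, which is the cost the answer above attributes to disabled predicate pushdown.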

Hive or HBase when we need to pull a large number of columns?

I have a data structure in Hadoop with 100 columns and a few hundred rows. Most of the time I need to query 65% of the columns. In this case, which is better to use: HBase or Hive? Please advise.
The number of columns you access is not, by itself, the criterion for choosing between HBase and Hive.
Hive (SQL):
Use Hive when you have warehousing needs and you are good at SQL and don't want to write MapReduce jobs. One important point though, Hive queries get converted into a corresponding MapReduce job under the hood which runs on your cluster and gives you the result. Hive does the trick for you. But each and every problem cannot be solved using HiveQL. Sometimes, if you need really fine grained and complex processing you might have to take MapReduce's shelter.
HBase (NoSQL database):
You can use HBase when you have data that you want to access in real time.
hbase get 'rowkey' is powerful when you know your access pattern.
HBase follows CP in the CAP theorem.
Consistency:
Every node in the system contains the same data (e.g. replicas are never out of date)
Availability:
Every request to a non-failing node in the system returns a response
Partition Tolerance:
System properties (consistency and/or availability) hold even when the system is partitioned (communication lost) and data is lost (node lost)
Also have a look at this.
It's very difficult to answer the question in one line.
HBase is a NoSQL database: your data needs to be stored denormalized because HBase is very bad at joining tables.
Hive: You can store data in a similar (normalized) format in Hive, but you would only see benefits when doing batch processing.

Loading data into HIVE to support front end application

We have a data warehousing application that we are planning to convert to Hadoop.
Currently, we receive 20 feeds on a daily basis and load this data into a MySQL database.
As the data is getting large, we are planning to move to Hadoop for faster query processing.
As the first step, we are planning to load the data into Hive on a daily basis instead of MySQL.
Questions:
1. Can I set up Hadoop like a DWH application to process files on a daily basis?
2. When I load the data on the master node, will it be synced automatically?
It really depends on the size of your data. The question is a bit complex, but in general you will have to design your own pipeline.
If you are analyzing raw logs, HDFS will be a good place to start. You can use Java, Python or Scala to schedule the Hive jobs on a daily basis, and use Sqoop if you still need some MySQL data.
In Hive you will have to create a partitioned table so that newly loaded data is synced and available at query time. Partition creation can also be scheduled.
I would suggest going with Impala instead of Hive, as it is more tunable, fault tolerant and easier to use.
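As a rough illustration of the scheduled partition creation mentioned above, a daily job could generate one HiveQL statement per load date. The sketch below only builds the statement text; the table name `feeds`, the partition column `dt`, and the path `/data/feeds` are hypothetical, and actually running the statement would need a Hive client of your choice.

```python
from datetime import date, timedelta

def add_partition_ddl(table, day, base_path):
    """Build the HiveQL that registers one daily partition (names are hypothetical)."""
    part = day.isoformat()
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{part}') LOCATION '{base_path}/dt={part}'"
    )

def ddl_for_range(table, start, days, base_path):
    """One statement per day, e.g. to backfill a week of feeds."""
    return [add_partition_ddl(table, start + timedelta(days=n), base_path) for n in range(days)]

statement = add_partition_ddl("feeds", date(2024, 1, 15), "/data/feeds")
backfill = ddl_for_range("feeds", date(2024, 1, 30), 3, "/data/feeds")
print(statement)
```

Because the statement uses IF NOT EXISTS, rerunning the job for the same day should be harmless, which is convenient when the scheduler retries.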

Tableau, Hadoop & Birt

I was trying to migrate data from a SQL database to Hadoop, and I have successfully done this by configuring Hive, HBase and Hadoop.
My problem is that with Birt and Tableau on my SQL database I was able to load 10 million entries in 5-10 minutes, but my newly configured Hadoop, Hive and HBase system takes around 50 minutes to fetch 10 million entries.
How can I improve this performance?
As Hadoop is specially developed for processing tons of data, why am I not able to do so?
Is there any special configuration for performance?
After a lot of research to answer this question, I went through HDP as well. I then came to the conclusion that we cannot compare the performance of a SQL database with Hadoop, as the two are used for different purposes.
Also, Hadoop shows its performance only once the data grows to several terabytes, i.e. the point at which a SQL database fails. So it is better to first check what the application requires. If the requirement is query performance, Hadoop is not a good option; go for a SQL database. But if the application will have a huge amount of data, and one has to analyze that data where a SQL database fails, then Hadoop is the better choice.

How is the data moved or reflected between Hive and HBase in Hive-HBase integration?

As per my understanding, both Hive and HBase use HDFS to store data. When we integrate Hive and HBase:
How is the data moved between them? Or does the data not move at all and simply get reflected? I am interested in two scenarios.
One: Table_1 has data and is in Hive; Table_2 has data and is in HBase. Now the integration happens (is this scenario even possible?).
How does the data movement happen? Is it from HBase to Hive or from Hive to HBase?
Two: Same setup as scenario One. Where would newly inserted records go?
I am new to HBase and interested in understanding the data movement in detail, with an example.
Please improve the question if needed. Thanks in advance.
HDFS is a distributed file system that is well suited for the storage of large files but does not provide fast individual record lookups.
Hive is simply a SQL-like abstraction for interacting with the data in HDFS.
HBase is also built on top of HDFS. It provides fast reads and writes for large tables. HBase accomplishes this by storing your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
So in both cases, the data resides in HDFS. That's "where it goes."
As for the details of how they work, that's a huge topic where you have to familiarize yourself with such topics as the Hive metastore and storage handlers and the HBase API. I believe this tutorial (Part 1 and Part 2) can help you.
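To illustrate the "reflects rather than moves" idea, here is a small Python sketch of how a column mapping lets one system present another system's rows without copying them. It is a toy model, not the Hive storage handler: the mapping string follows the shape of Hive's `hbase.columns.mapping` property (`:key` for the rowkey, `family:qualifier` for cells), while the table contents, column names, and function names are assumptions for illustration.

```python
def parse_columns_mapping(mapping, hive_columns):
    """Pair each Hive column with its HBase target (':key' or 'family:qualifier')."""
    targets = [t.strip() for t in mapping.split(",")]
    if len(targets) != len(hive_columns):
        raise ValueError("mapping must have one entry per Hive column")
    return dict(zip(hive_columns, targets))

# Toy stand-in for an HBase table: rowkey -> {"family:qualifier": value}.
hbase_table = {"row1": {"cf:name": "alice", "cf:city": "Pune"}}

def hive_select_all(column_map):
    """Read through the mapping at query time; nothing is copied or cached."""
    for rowkey, cells in hbase_table.items():
        yield {
            col: rowkey if target == ":key" else cells.get(target)
            for col, target in column_map.items()
        }

column_map = parse_columns_mapping(":key,cf:name,cf:city", ["id", "name", "city"])

# A record written on the HBase side after the mapping was defined...
hbase_table["row2"] = {"cf:name": "bob", "cf:city": "Oslo"}

# ...is visible to the next query, because there is only one copy of the data.
rows = list(hive_select_all(column_map))
print(rows)
```

In this model, the answer to "where do new records go" is that they go into the single underlying table, and the mapped view picks them up on the next read.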

Resources