Hive over HBase vs Hive over HDFS

My data does not need to be loaded in real time, so I don't have to use HBase. But I was wondering: are there any performance benefits to using HBase in MR jobs? Shouldn't joins be faster due to the indexed data?
Does anybody have any benchmarks?

Generally speaking, Hive/HDFS will be significantly faster than HBase. HBase sits on top of HDFS, so it adds another layer. HBase would be faster if you are looking up individual records, but you wouldn't use an MR job for that.
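For the single-record case, you would go through the HBase client API rather than an MR job. A minimal sketch of such a point lookup (the table, column family, and row key names here are hypothetical):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// Fetch a single row by key, the access pattern HBase is built for.
// Table, family, and key names are made up for illustration.
val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("users"))

val get = new Get(Bytes.toBytes("user#42"))
val result = table.get(get)
// getValue returns null when the cell is absent; guard before decoding.
val name = Option(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))
  .map(bytes => Bytes.toString(bytes))

println(name.getOrElse("<not found>"))
table.close()
connection.close()
```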

Performance of HBase vs. Hive:
Based on the results reported in "Hive on HBase Performance" (which compares HBase, Hive, and Hive over HBase), the performance of the two approaches appears to be comparable.

Respectfully :) I want to tell you that if your data is not real-time and you are also planning on MapReduce jobs, then just go with Hive over HDFS. Weblogs, for example, can be processed by a Hadoop MapReduce program and stored in HDFS, and Hive supports fast reading of the data at the HDFS location, basic SQL, joins, and batch data loads into the Hive database.
Hive also provides:
Bulk processing (and near-real-time processing where possible) with a SQL-like interface.
Built-in, optimized MapReduce.
Partitioning of large data sets, which is a natural fit for HDFS and lets you avoid the extra HBase layer; adding HBase here would just give you redundant features :)
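As a minimal sketch of that Hive-over-HDFS approach, here is how a partitioned weblog table might be created and batch-loaded through Spark's Hive support (the table name, columns, and HDFS path are all assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: a partitioned Hive table over HDFS for weblog data.
val spark = SparkSession.builder()
  .appName("weblogs-hive")
  .enableHiveSupport() // requires a configured Hive metastore
  .getOrCreate()

// Partitioning by date keeps scans limited to the days a query touches.
spark.sql("""
  CREATE TABLE IF NOT EXISTS weblogs (
    ip     STRING,
    url    STRING,
    status INT
  )
  PARTITIONED BY (log_date STRING)
  STORED AS PARQUET
""")

// Batch-load one day's logs from a raw HDFS path (path is an assumption).
spark.read.json("hdfs:///raw/weblogs/2016-01-01")
  .createOrReplaceTempView("raw_logs")

spark.sql("""
  INSERT OVERWRITE TABLE weblogs PARTITION (log_date = '2016-01-01')
  SELECT ip, url, status FROM raw_logs
""")
```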

Spark with HBASE vs Spark with HDFS

I know that HBase is a columnar database that stores structured table data in HDFS by column instead of by row. I know that Spark can read/write from HDFS and that there is now an HBase connector for Spark that can also read/write HBase tables.
Questions:
1) What are the added capabilities brought by layering Spark on top of HBase, instead of using HBase alone? Does it depend only on programmer capabilities, or is there a performance reason to do it? Are there things Spark can do that HBase alone can't?
2) Stemming from the previous question, when should you add HBase between HDFS and Spark instead of using HDFS directly?
1) What are the added capabilities brought by layering Spark on top of HBase, instead of using HBase alone? Does it depend only on programmer capabilities, or is there a performance reason to do it? Are there things Spark can do that HBase alone can't?
At Splice Machine, we use Spark for our analytics on top of HBase. HBase does not have an execution engine, and Spark provides a competent execution engine on top of it (intermediate results, relational algebra, etc.). HBase is an MVCC storage structure and Spark is an execution engine; they are natural complements to one another.
2) Stemming from the previous question, when should you add HBase between HDFS and Spark instead of using HDFS directly?
Small reads, concurrent write/read patterns, and incremental updates (most ETL).
Good luck...
I'd say that using distributed computing engines like Apache Hadoop or Apache Spark basically implies a full scan of the data source. That's the whole point of processing the data all at once.
HBase is good at cherry-picking particular records, while HDFS is certainly much more performant on full scans.
When you write to HBase from Hadoop or Spark, you shouldn't write to the database the usual way - it's hugely slow! Instead, you want to write the data to HFiles directly and then bulk-import them into HBase.
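A minimal sketch of that bulk-import pattern from Spark, assuming the HBase 1.x client API (the table name, column family, staging path, and toy data are all assumptions, not anything from the original posts):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-bulkload").getOrCreate()
val conf = HBaseConfiguration.create()

// Rows must be sorted by key before HFiles can be written.
val kvs = spark.sparkContext
  .parallelize(Seq("user#1" -> "alice", "user#2" -> "bob")) // toy data
  .sortByKey()
  .map { case (rowKey, name) =>
    val kv = new KeyValue(
      Bytes.toBytes(rowKey),
      Bytes.toBytes("info"), // column family, an assumption
      Bytes.toBytes("name"),
      Bytes.toBytes(name))
    (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
  }

// Write HFiles to a staging directory instead of issuing Puts.
kvs.saveAsNewAPIHadoopFile(
  "hdfs:///tmp/hbase-bulk", // staging path, an assumption
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat2],
  conf)

// Hand the finished HFiles to HBase; this skips the write path entirely.
val connection = ConnectionFactory.createConnection(conf)
val table = TableName.valueOf("users")
new LoadIncrementalHFiles(conf).doBulkLoad(
  new Path("hdfs:///tmp/hbase-bulk"),
  connection.getAdmin,
  connection.getTable(table),
  connection.getRegionLocator(table))
```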
The reason people invented SQL databases is that HDDs were very, very slow at the time. It took the cleverest people decades to invent different kinds of indexes to make clever use of the bottleneck resource (the disk). Then people invented NoSQL - we like associative arrays and we need them to be distributed (that's essentially what NoSQL is) - they're very simple and very convenient. But in today's world, with SSDs being cheap, hardly anyone needs a database - a file system is good enough in most cases. The one thing, though, is that it has to be distributed to keep up with the distributed computation.
Answering original questions:
These are two different tools for completely different problems.
I think if you use Apache Spark for data analysis, you should avoid HBase (Cassandra or any other database). These can be useful for keeping aggregated data to build reports, or for picking specific records about users or items, but that happens after the processing.
HBase is a NoSQL database that works well for fetching your data in a fast fashion. Though it is a DB, it uses a large number of HFiles (similar to HDFS files) to store your data and give low-latency access.
So use HBase when the requirement is that your data be accessed quickly by other big data applications.
Spark, on the other hand, is an in-memory distributed computing engine with connectivity to HDFS, HBase, Hive, PostgreSQL, JSON files, Parquet files, etc.
There is no considerable performance difference between reading from an HDFS file and from HBase up to some GBs. Beyond that, HBase connectivity becomes faster...
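To illustrate that connectivity, a brief sketch of Spark reading from several of those sources (the paths, table name, and JDBC settings are all hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("connectivity-demo")
  .enableHiveSupport()
  .getOrCreate()

// Files on HDFS in different formats.
val parquetDf = spark.read.parquet("hdfs:///data/events.parquet")
val jsonDf    = spark.read.json("hdfs:///data/events.json")

// A Hive table (requires a configured metastore).
val hiveDf = spark.sql("SELECT * FROM warehouse.events")

// PostgreSQL over JDBC (the driver must be on the classpath).
val pgDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/analytics")
  .option("dbtable", "public.events")
  .option("user", "reader")
  .load()
```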

Real-Time Interactive Queries in Hadoop

Is it possible to do real-time interactive queries in Hadoop?
When I use Hive over YARN/Tez, the latency is still too high, even over Parquet/ORC.
Any suggestions?
Thanks in advance
Hadoop is not a good choice for real-time or near-real-time queries; the latency overhead of running anything in Hadoop is high. Consider using Apache Spark (since, as a Hadoop user, I expect you have a batch processing system). Spark provides interactive queries through the Spark shell. You can also use Impala to query data stored in HDFS; Impala, I believe, provides faster queries than Hive.
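For instance, an interactive session in the Spark shell might look like this (the HDFS path and the schema are assumptions):

```scala
// Inside spark-shell; `spark` is the pre-built SparkSession.
val events = spark.read.parquet("hdfs:///warehouse/events")
events.createOrReplaceTempView("events")

// Ad hoc query, answered interactively without launching a MapReduce job.
spark.sql("""
  SELECT url, COUNT(*) AS hits
  FROM events
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10
""").show()
```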

If you store something in HBase, can it be accessed directly from HDFS?

I was told HBase is a DB that sits on top of HDFS.
But let's say you use Hadoop after you put some information into HBase.
Can you still access that information with MapReduce?
You can read data from HBase tables using MapReduce programs, Hive queries, or Pig scripts.
With Hive, once you create a Hive table backed by an HBase table, you can run SELECT queries on top of it, and they will process the data using MapReduce.
You can just as easily integrate HBase tables with other Hadoop ecosystem tools such as Pig. A sketch of reading an HBase table follows below.
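For example, the same TableInputFormat that a MapReduce job would use (via TableMapReduceUtil) can also be driven from Spark to scan an HBase table; a minimal sketch, with hypothetical table and column names:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

// Scan an HBase table as an RDD, using the same InputFormat that a
// plain MapReduce job would use.
val spark = SparkSession.builder().appName("hbase-scan").getOrCreate()
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "users") // table name is an assumption

val rows = spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Pull one column out of each row; family/qualifier are assumptions.
val names = rows.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))
}
println(names.count())
```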
Yes, HBase is a column-oriented database that sits on top of HDFS.
HBase is a database that stores its data in a distributed filesystem. The filesystem of choice is typically HDFS, owing to the tight integration between HBase and HDFS. Having said that, it doesn't mean that HBase can't work with other filesystems; it's just not proven in production, at scale, with anything except HDFS.
HBase provides you with the following:
Low-latency access to small amounts of data from within a large data set; you can access single rows quickly from a billion-row table.
A flexible data model to work with, and data that is indexed by the row key.
Fast scans across tables.
Scale in terms of writes as well as total volume of data.
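As a sketch of those access patterns, here is a row-key range scan through the HBase 2.x-style client API (the table, column family, and key layout are assumptions):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Because rows are sorted by key, a bounded scan touches only the
// regions covering that key range; table and key layout are made up.
val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("events"))

val scan = new Scan()
  .withStartRow(Bytes.toBytes("user#42#2016-01-01"))
  .withStopRow(Bytes.toBytes("user#42#2016-02-01"))

val scanner = table.getScanner(scan)
for (result <- scanner.asScala) {
  println(Bytes.toString(result.getRow))
}
scanner.close()
table.close()
connection.close()
```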

Why are there Pig and Hive

I understand what the components of Hadoop are, but my question is:
As an end user, how can I access a file in Hadoop without worrying about the data storage?
So when using Pig/Hive commands, should I worry about whether the data is stored in HDFS or in HBase?
Thank you
First of all, HDFS is a file system and HBase is a database, so yes, you should take that into consideration, since you don't access them the same way.
Knowing that, Pig and Hive let you access the data much more easily than in pure Java. For instance, Hive lets you query HBase in a close-to-SQL way.
In the same way, you can browse and manage files with Pig almost as you would with a shell on a standard machine.
To conclude: you should not worry about how files are stored with Hadoop, but about where they are stored (HDFS or HBase).
HDFS is a distributed file system, just as fxm said.
Almost all Hadoop components are built on HDFS.
HBase is a DB that stores its data on a distributed file system (HDFS, but it can be another FS).
Pig is a kind of programming language whose scripts are compiled into MapReduce jobs.
Hive is a kind of DB built on HDFS, and its SQL is compiled into MapReduce jobs.
Using the UDFs of Hive or Pig, you can access almost any data format on HDFS.
Excuse my poor English. :D
Data in the Hadoop ecosystem needs to be stored in a distributed filesystem. HDFS is the most popular such filesystem.
But HDFS' value proposition is in offering very high sequential read and write (scan) throughput. What if you want fast random reads and writes?
That's where HBase comes in. HBase sits on top of HDFS and enables fast random reads and writes.
But you store data in order to ask interesting questions about it. That is where MapReduce comes in: you express your question in the MapReduce programming paradigm, and it gets you the answer you need. But it's low-level, and you need to be a programmer. Spark is an alternative to MapReduce, much better optimized for when you need to ask more sophisticated questions. Hive and Pig are higher-level abstractions over MapReduce. Hive lets you ask your question in SQL and converts the SQL to a MapReduce (or Spark) job. Although, with the growing popularity of Spark, you can skip Hive and use Spark SQL (Spark's DataFrame/Dataset APIs), which can also interpret SQL.
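A small sketch of that last point: the same question asked through Spark's DataFrame API and through plain SQL, with no Hive installation required (the data and column names are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("sql-vs-api").getOrCreate()
import spark.implicits._

val orders = Seq(("alice", 30.0), ("bob", 12.5), ("alice", 7.5))
  .toDF("customer", "amount")

// DataFrame API form of the question.
val byApi = orders.groupBy("customer").agg(sum("amount").as("total"))

// Equivalent SQL form; Spark interprets it without Hive.
orders.createOrReplaceTempView("orders")
val bySql = spark.sql(
  "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer")

byApi.show()
bySql.show()
```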
The difference between Hive and Pig is explained in this excellent post by Alan Gates (Pig project PMC member and author of Programming Pig).
Pig is used when the data is unstructured and has no schema. Recommended storage: HDFS.
Hive is used when the data is structured and a schema is available. Recommended storage: HBase.

Query regarding HBase

As we have learnt, Hadoop is meant for batch processing of data. If we want to do some trending based on the results produced by Hadoop MapReduce jobs, what is the best way? How can we retrieve MapReduce results for trending?
Can HBase be used here? If so, does HBase have all the filtering and aggregate-function capabilities for the data stored in it?
Thanks
MRK
While there is no perfect solution in the Hadoop world for this problem, there are a few approaches to solving this kind of problem:
a) Produce an "on demand" data mart using MR, load it into an RDBMS, and run your queries there in real time. This can work if the data subset is much smaller than the whole data set.
b) Use an MPP database integrated with Hadoop. For example, Greenplum HD has an MPP database pre-integrated with Hadoop.
c) Use a more lightweight MR framework: Spark. It will have much lower latency, but expect your data sets to be comparable to the available RAM.
You probably want to look at Hive.
