What is the future of Apache HBase? - hadoop

I wanted to know what the future of HBase is. As we know, it is used for real-time scenarios, but so are Cassandra and MongoDB. The main advantage HBase has is that it comes packaged with the Cloudera / HDP distributions.
So how useful or effective is it to get deep into HBase?

Per the CAP (Consistency, Availability and Partition tolerance) theorem, a distributed NoSQL database can guarantee only 2 of the 3 properties. Depending on the use case, the following two groups form.
HBase and MongoDB are CP systems (consistency and partition tolerance).
Cassandra is an AP system (availability and partition tolerance).
HBase and Cassandra are column-oriented, whereas MongoDB is a document store.
HBase's architecture is master-worker, while Cassandra is masterless. Cassandra offers robust support for clusters spanning multiple datacenters.
A key functional difference is that HBase still offers lower-latency reads and writes, whereas Cassandra is geared toward high write throughput.
https://en.wikipedia.org/wiki/Apache_HBase
https://en.wikipedia.org/wiki/Apache_Cassandra

Related

Is it possible to convert from HBase to a Spark RDD efficiently?

I have a large dataset of items in HBase that I want to load into a Spark RDD for processing. My understanding is that HBase is optimized for low-latency single-item lookups on Hadoop, so I am wondering if it's possible to efficiently query for 100 million items in HBase (~10 TB in size)?
Here is some general advice on making Spark and HBase work together.
Data colocation and partitioning
Spark avoids shuffling: if your Spark workers and HBase region servers are located on the same machines, Spark will create its partitions according to the regions.
A good region split in HBase will map to a good partitioning in Spark.
If possible, consider working on your rowkeys and region splits.
Operations in Spark vs operations in HBase
Rule of thumb: use HBase scans only, and do everything else with Spark.
To avoid shuffling in your Spark operations, consider working on your partitions. For example, you can join two Spark RDDs built from HBase scans on their rowkey or rowkey prefix without any shuffling.
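For illustration, here is a minimal Scala sketch of loading an HBase table into a Spark RDD keyed by rowkey using the standard TableInputFormat; the table name is hypothetical and you would add scan settings as needed:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    object HBaseScanToRdd {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hbase-scan-to-rdd"))

        // TableInputFormat creates one Spark partition per HBase region,
        // so a good region split maps directly to a good Spark partitioning.
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "items") // hypothetical table name

        val rows = sc.newAPIHadoopRDD(
          hbaseConf,
          classOf[TableInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        // Key the RDD by rowkey so downstream joins/aggregations can exploit the region split.
        val byRowkey = rows.map { case (key, result) => (Bytes.toString(key.get()), result) }

        println(s"Loaded ${byRowkey.count()} rows in ${byRowkey.getNumPartitions} partitions")
        sc.stop()
      }
    }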
HBase configuration tweaks
This discussion is a bit old (some configurations are not up to date) but still interesting : http://community.cloudera.com/t5/Storage-Random-Access-HDFS/How-to-optimise-Full-Table-Scan-FTS-in-HBase/td-p/97
And the link below has also some leads:
http://blog.asquareb.com/blog/2015/01/01/configuration-parameters-that-can-influence-hbase-performance/
You might find multiple sources (including the ones above) suggesting that you change the scanner cache config, but note that this only applies to HBase < 1.x.
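As a rough sketch of what such tuning can look like through the TableInputFormat configuration keys (the table name and values are hypothetical placeholders, and the exact effect depends on your HBase version and hardware):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // Illustrative full-table-scan tuning; numbers are placeholders, not recommendations.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "items")       // hypothetical table name
    conf.set(TableInputFormat.SCAN_CACHEDROWS, "1000")    // scanner cache: rows fetched per RPC
    conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false")  // keep a long scan from evicting the block cache
    conf.set(TableInputFormat.SCAN_BATCHSIZE, "100")      // columns per call, useful for very wide rows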
We had this exact question at Splice Machine. We found the following based on our tests.
HBase had performance challenges if you attempted to perform remote scans from Spark/MapReduce.
The large scans hurt performance of ongoing small scans by forcing garbage collection.
There was not a clear resource management dividing line between OLTP and OLAP queries and resources.
We ended up writing a custom reader that reads the HFiles directly from HDFS and performs incremental deltas with the memstore during scans. With this, Spark could perform quickly enough for most OLAP applications. We also separated the resource management so that the OLAP resources were allocated via YARN (on premise) or Mesos (cloud) and would not disturb normal OLTP apps.
I wish you luck on your endeavor. Splice Machine is open source and you are welcome to check out our code and approach.

Cassandra vs HDFS to store analytics data

We have an Apache Spark cluster that analyses data stored in HDFS (.parquet).
The solution is optimal in terms of performance, but it's not as disaster-safe as we would like: the HDFS architecture has a single point of failure (the NameNode), and even with two NameNodes you just have two points of failure, which is not enough.
To improve our cluster fault tolerance we would like to move to another data store solution like Cassandra.
Questions are:
With Cassandra as the datastore, is Spark able to leverage data locality as it does with HDFS?
How can this change affect performance?
Thanks
Matteo
There's an article about data locality with Spark and Cassandra, so yes, it is possible (see the sketch at the end of this answer):
https://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1
I haven't done any performance checks of Spark on HDFS vs Cassandra, and I believe it will vary across workflows, but since Netflix and Microsoft use Cassandra with Spark, I believe performance is acceptable in most cases. It is probably a trade-off between data ingestion speed, the presence or absence of an ETL step, and the speed of the analytical process.
About the Hadoop single point of failure: if you run Cassandra with replication factor 3 and QUORUM consistency level, the same 2 nodes going down will make data unavailable (QUORUM needs 2 of the 3 replicas, and only 1 would be left) :) keep it in mind.
And maybe consider the MapR Hadoop distribution; they have tried to solve the NameNode problem.
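As a rough sketch of the data-locality point, assuming the DataStax spark-cassandra-connector with Spark executors running on the Cassandra nodes (the keyspace, table and contact point below are made up):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object CassandraLocality {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("cassandra-locality")
          .set("spark.cassandra.connection.host", "10.0.0.1") // any contact node
        val sc = new SparkContext(conf)

        // The connector builds Spark partitions from Cassandra token ranges; when the
        // executors are co-located with the Cassandra nodes, it prefers local ranges.
        val events = sc.cassandraTable("analytics", "events") // hypothetical keyspace/table

        println(s"rows: ${events.count()}")
        sc.stop()
      }
    }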

Is Cassandra for OLAP or OLTP or both?

Cassandra does not comply with ACID like an RDBMS; it is described in terms of CAP. Cassandra picks AP out of CAP and leaves consistency tuning to the user.
I definitely cannot use Cassandra for core banking transactions because C* is not strictly consistent.
But Cassandra writes are extremely fast, which is good for OLTP.
I can use C* for OLAP because reads are extremely fast, which is good for reporting too.
So my understanding is that C* is good only when your application can tolerate data being inconsistent for some amount of time, but reads and writes need to be quick?
If my understanding is right, kindly list some such applications.
ACID describes properties of relational databases, whereas BASE describes properties of most NoSQL databases, and Cassandra is one of them. The CAP theorem just explains the trade-off between consistency, availability and partition tolerance in distributed systems. The good thing about Cassandra is that it has tunable consistency, so you can be pretty much consistent (at the price of availability), and OLTP is doable. As phact said, there are even some banks that have built their transaction software on top of Cassandra. OLAP is also doable, but not with Cassandra alone, since its partitioned row storage limits its capabilities; you need something like Spark to run the complex queries required.
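For illustration, a minimal sketch of tunable consistency using the DataStax Java driver (3.x-style API); the contact point, keyspace and queries are hypothetical:

    import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

    object TunableConsistency {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("10.0.0.1").build()
        val session = cluster.connect("bank") // hypothetical keyspace

        // QUORUM: stronger consistency, higher latency, fewer node failures tolerated.
        val strongRead = new SimpleStatement("SELECT balance FROM accounts WHERE id = 42")
          .setConsistencyLevel(ConsistencyLevel.QUORUM)

        // ONE: fastest, weakest guarantee; fine when stale data is acceptable for a while.
        val fastWrite = new SimpleStatement("UPDATE accounts SET balance = 100 WHERE id = 42")
          .setConsistencyLevel(ConsistencyLevel.ONE)

        session.execute(strongRead)
        session.execute(fastWrite)
        cluster.close()
      }
    }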
Cassandra should be avoided for OLTP applications; even they state that it might not be the perfect use case for OLTP. Even though you can achieve a fully consistent model by setting the write consistency to ALL, this makes writing a rather heavy process, since the coordinator node has to write the data to all replicas on all replicated nodes. And if your Cassandra system is massively replicated across different data centers, maybe across different continents, then the write latency will increase dramatically.

MySQL Cluster vs. Hadoop for handling big data

I want to know the advantages/disadvantages of using a MySQL Cluster and using the Hadoop framework.
Which is the better solution? I would like to read your opinions.
I think the advantages of using a MySQL Cluster are:
high availability
good scalability
high performance / real time data access
you can use commodity hardware
And I don't see a disadvantage! Are there any disadvantages that Hadoop does not have?
The advantages of Hadoop with Hive on top of it are:
also good scalability
you can also use commodity hardware
the ability to run in heterogeneous environments
parallel computing with the MapReduce framework
Hive with HiveQL
and the disadvantage is:
no real-time data access. It may take minutes or hours to analyze the data.
So in my opinion, for handling big data, a MySQL Cluster is the better solution. Why is Hadoop considered the holy grail of handling big data? What is your opinion?
Both of the above answers miss a huge differentiation between MySQL and Hadoop. MySQL requires you to store data in a certain format. It likes heavily structured data: you declare the data type of each column in a table, etc. Hadoop doesn't care about this at all.
Example: if you have a billion text log files, to make analysis even possible for MySQL you'd need to parse and load the data first into a MySQL table, typing each column along the way. With Hadoop and MapReduce, you define the function that will scan/analyze/return the data from its raw source; you don't need a pre-processing ETL step to get it pre-structured.
If the data is already structured and in MySQL, then (hopefully) it's well structured, so why export it for Hadoop to analyze? If it isn't, why spend the time to ETL the data?
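A small sketch of that "schema on read" point, using Spark over raw files (the path and log format are hypothetical): the parsing logic lives in the job itself, so no table definition or load step happens beforehand.

    import org.apache.spark.{SparkConf, SparkContext}

    object RawLogScan {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("raw-log-scan"))

        // Raw, unparsed log files straight from HDFS; no schema or ETL load step first.
        val logs = sc.textFile("hdfs:///logs/app/*.log") // hypothetical path

        // Decide at read time what structure to extract: here, errors per hour.
        val errorsPerHour = logs
          .filter(_.contains(" ERROR "))
          .map(line => (line.take(13), 1)) // assumes a "yyyy-MM-dd HH" timestamp prefix
          .reduceByKey(_ + _)

        errorsPerHour.take(20).foreach(println)
        sc.stop()
      }
    }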
Hadoop is not a replacement for MySQL, so I think they each have their own scenarios.
Everyone knows Hadoop is better for batch jobs or offline compute, but there are also many related real-time products, such as HBase.
If you want to choose an offline compute & storage architecture,
I suggest Hadoop rather than MySQL Cluster for offline compute & storage, because of:
Cost: obviously, a Hadoop cluster is cheaper than a MySQL cluster
Scalability: Hadoop supports more than ten thousand machines in a cluster
Ecosystem: MapReduce, Hive, Pig, Sqoop, etc.
So you can choose Hadoop for offline compute & storage and MySQL for online compute & storage; you can also learn more from the lambda architecture.
The other answer is good, but doesn't really explain why Hadoop is more scalable for offline data crunching than MySQL Cluster. Hadoop is more efficient for large datasets that must be distributed across many machines because it gives you full control over the sharding of the data.
MySQL Cluster uses auto-sharding, and it's designed to randomly distribute the data so that no one machine gets hit with more of the load. Hadoop, on the other hand, allows you to explicitly define the data partitioning, so that multiple data points that require simultaneous access end up on the same machine, minimizing the amount of communication among machines needed to get the job done. This makes Hadoop better for processing massive datasets in many cases (a sketch follows at the end of this answer).
The answer to this question has a good explanation of this distinction.
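As an illustration of that explicit control (in Spark rather than raw MapReduce, but the idea is the same; the key format is hypothetical), a custom partitioner can pin all records for one customer to the same partition, so per-customer work needs no cross-machine traffic:

    import org.apache.spark.Partitioner

    // Hypothetical partitioner: keys look like "customerId|eventId", and all records
    // for one customer land in the same partition (and thus on the same machine).
    class CustomerPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val customerId = key.toString.split('|')(0)
        math.abs(customerId.hashCode % partitions)
      }
    }

    // Usage: records.partitionBy(new CustomerPartitioner(64)).reduceByKey(_ + _)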

Hadoop on a Cassandra database

I am using Cassandra to store my data and Hive to process it.
I have 5 machines on which I have set up Cassandra, and 2 machines I use as analytics nodes (where Hive runs).
So I want to ask: does Hive run MapReduce on just the two machines (analytics nodes) and bring the data there, or does it move the processing/computation to the 5 Cassandra nodes as well and process/compute the data on those machines? (What I know is that in Hadoop, the process moves to the data, not the data to the process.)
If you are interested in marrying Hadoop and Cassandra, the first link should be the DataStax company, which is built around this concept: http://www.datastax.com/
They built and support Hadoop with HDFS replaced by Cassandra.
To the best of my understanding, they do have data locality: http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/
There is a good answer about Hadoop & Cassandra data locality if you run MapReduce against Cassandra:
Cassandra and MapReduce - minimal setup requirements
Regarding your question, there is a trade-off:
a) If you run Hadoop / Hive on separate nodes, you lose data locality and therefore your data throughput is limited by your network bandwidth.
b) If you run Hadoop / Hive on the same nodes where Cassandra runs, you get data locality, but the MapReduce processing behind Hive queries might clog your network (and other resources) and thereby affect your quality of service from Cassandra.
My suggestion would be to have separate Hive nodes if the performance of your Cassandra cluster is critical.
If your Cassandra cluster is mostly used as a data store and does not handle real-time requests, then running Hive on each node will improve performance and hardware utilization.
