How are reads made efficient in Cassandra when using JanusGraph for OLAP jobs? - janusgraph

I am new to JanusGraph and have some hands-on experience with Cassandra. I know Cassandra is not good at full-scan queries and is only good at point queries, since it maintains an index on the primary key. The question is: how are queries optimized when running OLAP jobs in Spark using JanusGraph?
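For context, a JanusGraph OLAP job is normally expressed as a Gremlin traversal routed through TinkerPop's SparkGraphComputer rather than run against the online graph. A minimal Java sketch, assuming a properties file that wires a HadoopGraph to the Cassandra-backed JanusGraph data (the path and file contents are placeholders):

    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
    import org.apache.tinkerpop.gremlin.structure.Graph;
    import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

    public class OlapCount {
        public static void main(String[] args) throws Exception {
            // Hypothetical properties file configuring a HadoopGraph whose
            // input format scans JanusGraph's Cassandra tables by token
            // range, instead of issuing per-key point queries.
            Graph graph = GraphFactory.open("conf/hadoop-graph/read-cql.properties");

            // Route the traversal through Spark (OLAP) instead of the
            // online (OLTP) graph.
            GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);

            // A full-graph job: each Spark partition scans its slice of
            // the token range, which is the sequential-read pattern
            // Cassandra handles well.
            long vertices = g.V().count().next();
            System.out.println("vertex count: " + vertices);

            graph.close();
        }
    }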

Related

H2 & Ignite query performance

I am trying to compare performance of application queries on H2 database & Ignite with an Oracle baseline.
I created a test including:
A set of tables and indexes.
A data set of randomly generated data with 50k records per table.
A query with 1 INNER & 10 LEFT OUTER joins (query returned around 188k records).
I noticed significant differences in terms of performance.
Running the query, on my machine (i5 dual core, 16Gb RAM):
Oracle manages to run this query in around 350ms.
H2 takes 4.5s (regardless of the mode - server & in-memory).
Ignite takes 9s.
Iterating over the JDBC result set:
Less than 50ms for H2 in-memory mode
Around 2s for the H2 server mode
Around 5s for Oracle
Around 1s for Ignite
Couple of questions:
Do these figures make sense? Did I just miss the basics of H2 query optimization?
Looking at H2 explain plans, what is the exact meaning of scanCount? Is this something constant for a given query & data set or a performance indicator?
Is there a way to improve H2 performance by tuning indexing or hinting queries?
How to explain the difference between Ignite & H2?
Is the order of joins important? I ask because on Oracle, with up-to-date statistics, the CBO changes the order of joins; I didn't notice such behavior with H2.
Queries & data I used for this test are available here on Github.
Thanks,
L.
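For what it's worth, both the plan and scanCount can be inspected programmatically. In H2's EXPLAIN ANALYZE output, scanCount is the actual number of rows scanned per table or index, so for a fixed query and data set it is deterministic, but comparing it across plans makes it a useful performance indicator. A minimal sketch against an in-memory H2 database (the schema and query are made up, not the ones from the benchmark on GitHub):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class H2Explain {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:test");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE t(id INT PRIMARY KEY, parent_id INT, val VARCHAR(64))");
                st.execute("CREATE INDEX idx_t_parent ON t(parent_id)");

                // EXPLAIN ANALYZE executes the statement and annotates the
                // plan with the actual scanCount per table/index.
                try (ResultSet rs = st.executeQuery(
                        "EXPLAIN ANALYZE SELECT * FROM t a LEFT JOIN t b ON b.parent_id = a.id")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }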
Let me share some basic facts related to Ignite vs. RDBMS performance benchmarking. I am copy-pasting this from a new GridGain doc that will be released this month; just replace occurrences of GridGain with Ignite. Please double-check that these principles are followed, and let me know if you still don't see a difference.
GridGain and Ignite are frequently compared to relational databases for their SQL capabilities, with the expectation that existing SQL queries, created for an RDBMS, will work out of the box and perform faster in GridGain without any changes. Usually, such a faulty assumption is based on the fact that GridGain stores and processes data in memory. However, it's not enough just to put data in RAM and expect an order-of-magnitude performance increase. As a distributed platform, GridGain requires extra changes for the sake of performance, and below is a standard checklist of best practices to consider before you benchmark GridGain against an RDBMS or do any performance testing:
Ignite/GridGain is optimized for multi-node deployments with RAM as the primary storage. Don't try to compare a single-node GridGain cluster to a relational database that was optimized for such single-node configurations. You should deploy a multi-node GridGain cluster with a whole copy of the data in RAM.
Be ready to adjust your data model and existing SQL queries, if any.
Use the affinity collocation concept during the data modelling phase for proper data distribution. Remember, it's not enough just to put data in RAM. If your data is properly collocated, you can run SQL queries with JOINs at massive scale and expect significant performance benefits (a sketch follows after this checklist).
Define secondary indexes and use other standard, and GridGain-specific, tuning techniques described below.
Keep in mind that relational databases leverage local caching techniques and, depending on the total data size, an RDBMS can complete some queries even faster than GridGain, even in a multi-node configuration. If your data set is around 10-100 GB and the RDBMS has enough RAM for caching data locally, then it can, for instance, outperform a multi-node GridGain cluster because the latter will be utilizing the network. Store much more data in GridGain to see the difference.
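To make the collocation point concrete, a minimal Java sketch; the key/value classes and SQL schema are invented, and registration of the query entities needed for SQL is omitted for brevity:

    import java.util.List;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.affinity.AffinityKeyMapped;
    import org.apache.ignite.cache.query.SqlFieldsQuery;

    public class CollocationSketch {
        /** Composite key: orders land on the same node as their customer. */
        static class OrderKey {
            long orderId;
            @AffinityKeyMapped
            long customerId; // the affinity field drives partition assignment

            OrderKey(long orderId, long customerId) {
                this.orderId = orderId;
                this.customerId = customerId;
            }
        }

        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                IgniteCache<OrderKey, String> orders = ignite.getOrCreateCache("orders");
                orders.put(new OrderKey(1, 42), "first order");

                // Because rows with the same customerId live on the same
                // node, this JOIN runs node-locally, with no broadcast or
                // repartition step over the network.
                SqlFieldsQuery join = new SqlFieldsQuery(
                    "SELECT c.name, COUNT(*) FROM Customer c " +
                    "JOIN CustomerOrder o ON o.customerId = c.id GROUP BY c.name");
                List<List<?>> rows = orders.query(join).getAll();
                rows.forEach(System.out::println);
            }
        }
    }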

is it possible to convert from hbase to spark rdd efficiently?

I have a large dataset of items in HBase that I want to load into a Spark RDD for processing. My understanding is that HBase is optimized for low-latency single-item lookups on Hadoop, so I am wondering whether it's possible to efficiently query for 100 million items in HBase (~10 TB in size)?
Here is some general advice on making Spark and HBase work together.
Data colocation and partitioning
Spark avoids shuffling: if your Spark workers and HBase regions are located on the same machines, Spark will create partitions according to regions.
A good region split in HBase will map to a good partitioning in Spark.
If possible, consider working on your rowkeys and region splits.
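For instance, pre-splitting a table at creation time so regions (and thus future Spark partitions) are balanced; a sketch with the HBase 1.x-era admin API, where the table name and split points are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplit {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));
                desc.addFamily(new HColumnDescriptor("d"));

                // Three split keys give four regions up front; assumes row
                // keys are distributed evenly across these prefixes.
                byte[][] splits = {
                    Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c")
                };
                admin.createTable(desc, splits);
            }
        }
    }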
Operations in Spark vs operations in HBase
Rule of thumb: use HBase scans only, and do everything else with Spark.
To avoid shuffling in your Spark operations, you can consider working on your partitions. For example: you can join two Spark RDDs built from HBase scans on their rowkey or rowkey prefix without any shuffling.
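A common way to get that scan-only pattern is the stock TableInputFormat, which yields one Spark partition per HBase region; a minimal Java sketch (the table name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HBaseToRdd {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hbase-scan"));

            Configuration conf = HBaseConfiguration.create();
            conf.set(TableInputFormat.INPUT_TABLE, "events"); // placeholder table

            // One partition per region; with workers co-located on the
            // region servers, each partition is a local scan.
            JavaPairRDD<ImmutableBytesWritable, Result> rdd = sc.newAPIHadoopRDD(
                conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

            System.out.println("rows scanned: " + rdd.count());
            sc.stop();
        }
    }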
HBase configuration tweaks
This discussion is a bit old (some configurations are not up to date) but still interesting : http://community.cloudera.com/t5/Storage-Random-Access-HDFS/How-to-optimise-Full-Table-Scan-FTS-in-HBase/td-p/97
And the link below has also some leads:
http://blog.asquareb.com/blog/2015/01/01/configuration-parameters-that-can-influence-hbase-performance/
You might find multiple sources (including the ones above) suggesting changes to the scanner cache config, but this holds only for HBase < 1.x
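For reference, those scanner settings live on the client-side Scan object; the values below are illustrative, and since HBase 1.x the defaults changed, so measure before tuning:

    import org.apache.hadoop.hbase.client.Scan;

    public class ScanTuning {
        /** Scan settings often suggested for full-table scans. */
        static Scan fullTableScan() {
            Scan scan = new Scan();
            scan.setCaching(1000);      // rows fetched per RPC round-trip
            scan.setCacheBlocks(false); // don't evict the block cache for a one-off scan
            return scan;
        }
    }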
We had this exact question at Splice Machine. We found the following based on our tests.
HBase had performance challenges if you attempted to perform remote scans from spark/mapreduce.
The large scans hurt performance of ongoing small scans by forcing garbage collection.
There was not a clear resource management dividing line between OLTP and OLAP queries and resources.
We ended up writing a custom reader that reads the HFiles directly from HDFS and performs incremental deltas with the memstore during scans. With this, Spark could perform quick enough for most OLAP applications. We also separated the resource management so the OLAP resources were allocated via YARN (On Premise) or Mesos (Cloud) so they would not disturb normal OLTP apps.
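Splice Machine's reader is custom, but stock HBase has a similar escape hatch: TableSnapshotInputFormat reads a snapshot's HFiles directly from HDFS, bypassing the region servers (though, unlike the approach above, a snapshot is a point-in-time view and does not see the live memstore). A sketch, with the snapshot name and restore path made up, and assuming the snapshot was taken beforehand:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
    import org.apache.hadoop.mapreduce.Job;

    public class SnapshotScan {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(HBaseConfiguration.create(), "snapshot-scan");

            // Reads the snapshot's HFiles straight off HDFS: no
            // region-server RPCs, so the big scan can't trigger
            // server-side GC pauses.
            TableSnapshotInputFormat.setInput(job, "events-snapshot",
                new Path("/tmp/snapshot-restore"));
            job.setInputFormatClass(TableSnapshotInputFormat.class);

            // ... set mapper/reducer and submit as usual ...
        }
    }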
I wish you luck on your endeavor. Splice Machine is open source and you are welcome to check out our code and approach.

What is the future of Apache HBase?

Well, I wanted to know what the future of HBase is. As we know, it is used in real-time scenarios, but so are Cassandra and MongoDB. The only advantage HBase gets is that it comes packaged with the Cloudera/HDP distributions.
So how useful or effective is it to get deep into HBase?
Per the CAP (Consistency, Availability, Partition tolerance) theorem, NoSQL databases can guarantee only 2 of the 3 properties. Depending on the use case, the following 2 groups are formed:
HBase and MongoDB are CP systems (consistency and partition tolerance).
Cassandra is an AP system (availability and partition tolerance).
HBase and Cassandra are column-oriented, whereas MongoDB is document-oriented.
HBase's architecture is master-worker, while Cassandra's is masterless. Cassandra offers robust support for clusters spanning multiple datacenters.
A key functional difference is that HBase offers low-latency reads and writes, while Cassandra is geared toward high write throughput.
https://en.wikipedia.org/wiki/Apache_HBase
https://en.wikipedia.org/wiki/Apache_Cassandra

Hadoop on a Cassandra database

I am using Cassandra to store my data and Hive to process it.
I have 5 machines on which I have set up Cassandra, and 2 machines I use as analytics nodes (where Hive runs).
So I want to ask: does Hive run MapReduce on just the two analytics nodes and bring the data there, or does it move the processing/computation to the 5 Cassandra nodes as well and compute the data on those machines? (What I know is that in Hadoop, the process moves to the data, not the data to the process.)
If you are interested in marrying Hadoop and Cassandra, the first link should be DataStax, a company built around this concept. http://www.datastax.com/
They built and support a Hadoop distribution with HDFS replaced by Cassandra.
To the best of my understanding, they do have data locality: http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/
There is a good answer about Hadoop & Cassandra data locality if you run MapReduce against Cassandra:
Cassandra and MapReduce - minimal setup requirements
Regarding your question, there is a tradeoff:
a) If you run Hadoop/Hive on separate nodes, you lose data locality, and therefore your data throughput is limited by your network bandwidth.
b) If you run Hadoop/Hive on the same nodes where Cassandra runs, you can get data locality, but the MapReduce processing behind Hive queries might clog your network (and other resources) and thereby affect your quality of service from Cassandra.
My suggestion would be to have separate Hive nodes if the performance of your Cassandra cluster is critical.
If your Cassandra cluster is mostly used as a data store and does not handle real-time requests, then running Hive on each node will improve performance and hardware utilization.

How does Hive compare to HBase?

I'm interested in finding out how the recently-released (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) Hive compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.
It's hard to find much about Hive, but I found this snippet on the Hive site that leans heavily in favor of HBase (bold added):
Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.
Since HBase and HyperTable are all about performance (being modeled on Google's BigTable), they sound like they would certainly be much faster than Hive, at the cost of functionality and a higher learning curve (e.g., they don't have joins or the SQL-like syntax).
From one perspective, Hive consists of five main components: a SQL-like grammar and parser, a query planner, a query execution engine, a metadata repository, and a columnar storage layout. Its primary focus is data warehouse-style analytical workloads, so low latency retrieval of values by key is not necessary.
HBase has its own metadata repository and columnar storage layout. It is possible to author HiveQL queries over HBase tables, allowing HBase to take advantage of Hive's grammar and parser, query planner, and query execution engine. See http://wiki.apache.org/hadoop/Hive/HBaseIntegration for more details.
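As an illustration of that integration, the HiveQL side looks roughly like this, issued here over Hive's JDBC driver; the HiveServer2 endpoint, table, and column names are invented:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveOverHBase {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement st = conn.createStatement()) {

                // Map an existing HBase table into Hive's metastore; the
                // storage handler routes HiveQL reads/writes to HBase.
                st.execute(
                    "CREATE EXTERNAL TABLE hbase_users(key string, name string) " +
                    "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' " +
                    "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name') " +
                    "TBLPROPERTIES ('hbase.table.name' = 'users')");
            }
        }
    }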
Hive is an analytics tool. Just like Pig, it was designed for ad hoc batch processing of potentially enormous amounts of data by leveraging MapReduce. Think terabytes. Imagine trying to do that in a relational database...
HBase is a column-based key-value store based on BigTable. You can't do queries per se, though you can run MapReduce jobs over HBase. Its primary use case is fetching rows by key, or scanning ranges of rows. A major feature is having data locality when scanning across ranges of row keys for a 'family' of columns.
To my humble knowledge, Hive is more comparable to Pig. Hive is SQL-like and Pig is script-based.
Hive seems to be more complicated, with query optimization and an execution engine, and it requires the end user to specify schema parameters (partitions, etc.).
Both are intended to process text files or SequenceFiles.
HBase is for key-value data storage and retrieval: you can scan or filter on those key-value pairs (rows), but you cannot run ad hoc queries against them.
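For illustration, here is what "scanning" rather than querying looks like with the standard Java client; the table, family, and key names are invented:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // A "query" in HBase is a key range, not a predicate:
                // scan all rows with keys in [user_1000, user_2000).
                Scan scan = new Scan(Bytes.toBytes("user_1000"), Bytes.toBytes("user_2000"));
                try (ResultScanner rs = table.getScanner(scan)) {
                    for (Result row : rs) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }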
Hive and HBase are used for different purposes.
Hive:
Pros:
Apache Hive is a data warehouse infrastructure built on top of Hadoop.
It allows for querying data stored on HDFS for analysis via HQL, an SQL-like language, which is converted into a series of MapReduce jobs
It only runs batch processes on Hadoop.
It is JDBC compliant, so it also integrates with existing SQL-based tools
Hive supports partitions
It supports analytical querying of data collected over a period of time
Cons:
It does not currently support update statements
It should be provided with a predefined schema to map files and directories into columns
HBase:
Pros:
A scalable, distributed database that supports structured data storage for large tables
It provides random, real time read/write access to your Big Data. HBase operations run in real-time on its database rather than MapReduce jobs
It supports partitioning of tables, and tables are further split into column families
Scales horizontally with huge amounts of data by using Hadoop
Provides key-based access to data when storing or retrieving it; it supports adding and updating rows (see the sketch after this list)
Supports versioning of data.
Cons:
HBase queries are written in a custom language that needs to be learned
HBase isn’t fully ACID compliant
It can't be used with complicated access patterns (such as joins)
It is also not a complete substitute for HDFS when doing large batch MapReduce
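A small illustration of the key-based add/update and versioning points in the list above, using the standard Java client; the table, family, and qualifier names are invented:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class KeyAccess {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Add or update: a Put with an existing row key overwrites,
                // keeping the old cell as an older version.
                Put put = new Put(Bytes.toBytes("user_42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);

                // Versioned read: ask for up to 3 stored versions per cell.
                Get get = new Get(Bytes.toBytes("user_42"));
                get.setMaxVersions(3);
                Result r = table.get(get);
                System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }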
Summary:
Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from Hive to HBase and back again.
As of the most recent Hive releases, a lot has changed that requires a small update, as Hive and HBase are now integrated. What this means is that Hive can be used as a query layer to an HBase datastore. Now if people are looking for alternative HBase interfaces, Pig also offers a really nice way of loading and storing HBase data. Additionally, it looks like Cloudera Impala may offer substantially faster Hive-style queries on top of HBase; they claim up to 45x faster queries over traditional Hive setups.
To compare Hive with HBase, I'd like to recall the definition below:
A database designed to handle transactions isn't designed to handle analytics. It isn't structured to do analytics well. A data warehouse, on the other hand, is structured to make analytics fast and easy.
Hive is a data warehouse infrastructure built on top of Hadoop, suitable for long-running ETL jobs.
HBase is a database designed to handle real-time transactions.
