H2 & Ignite query performance

I am trying to compare performance of application queries on H2 database & Ignite with an Oracle baseline.
I created a test including:
A set of tables and indexes.
A data set of randomly generated data with 50k records per table.
A query with 1 INNER & 10 LEFT OUTER joins (query returned around 188k records).
I noticed significant differences in terms of performance.
Running the query on my machine (i5 dual core, 16 GB RAM):
Oracle manages to run this query in around 350ms.
H2 takes 4.5s (regardless of the mode - server & in-memory).
Ignite takes 9s.
Iterating over the JDBC result set:
Less than 50ms for H2 in-memory mode
Around 2s for the H2 server mode
Around 5s for Oracle
Around 1s for Ignite
Couple of questions:
Do these figures make sense? Did I just miss the basics of H2 query optimization?
Looking at H2 explain plans, what is the exact meaning of scanCount? Is this something constant for a given query & data set or a performance indicator?
Is there a way to improve H2 performance by tuning indexes or hinting queries?
How can the difference between Ignite & H2 be explained?
Is the order of joins important? Asking because on Oracle, having up-to-date statistics, the CBO changes the order of joins. I didn't notice such behavior with H2.
Queries & data I used for this test are available here on GitHub.
Thanks,
L.

Let me share some basic facts related to Ignite vs. RDBMS performance benchmarking. Copy-pasting this from a new GridGain doc that will be released this month. Just replace GridGain occurrences with Ignite. Please double-check these principles are followed. Let me know if you don't see a difference.
GridGain and Ignite are frequently compared to relational databases for their SQL capabilities with an expectation that existing SQL queries, created for an RDBMS, will work out of the box and perform faster in GridGain without any changes. Usually, such a faulty assumption is based on the fact that GridGain stores and processes data in-memory. However, it’s not enough just to put data in RAM and expect an order of magnitude performance increase. GridGain as a distributed platform requires extra changes for the sake of performance and below you can see a standard checklist of best practices to consider before you benchmark GridGain against an RDBMS or do any performance testing:
Ignite/GridGain is optimized for multi-node deployments with RAM as the primary storage. Don’t try to compare a single-node GridGain cluster to a relational database that was optimized for such single-node configurations. You should deploy a multi-node GridGain cluster with the whole copy of data in RAM.
Be ready to adjust your data model and existing SQL queries, if any.
Use the affinity collocation concept during the data modelling phase for proper data distribution. Remember, it’s not enough just to put data in RAM. If your data is properly collocated, you can run SQL queries with JOINs at massive scale and expect significant performance benefits.
Define secondary indexes and use other standard and GridGain-specific tuning techniques described below.
Keep in mind that relational databases leverage local caching techniques and, depending on the total data size, an RDBMS can complete some queries faster than GridGain, even in a multi-node configuration. If your data set is around 10-100GB and an RDBMS has enough RAM for caching data locally, then it can, for instance, outperform a multi-node GridGain cluster because the latter will be utilizing the network. Store much more data in GridGain to see the difference.
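To make the affinity collocation point concrete, here is a minimal Python sketch (it is not Ignite or GridGain code). The four-node cluster, the modulo-based `node_for` function and the customers/orders tables are all assumptions made up for illustration; real affinity functions are more elaborate. It only shows why a join needs no network hops when both tables are partitioned by the same key.

```python
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(affinity_key, num_nodes=NUM_NODES):
    """Toy affinity function: map an integer key to a node index."""
    return affinity_key % num_nodes


def distribute(rows, key_of):
    """Place each row on the node chosen by its affinity key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(key_of(row))].append(row)
    return nodes


# Hypothetical tables: customers are (customer_id, name),
# orders are (order_id, customer_id, amount).
customers = [(cid, f"customer-{cid}") for cid in range(7)]
orders = [(oid, oid % 7, 10.0 * oid) for oid in range(21)]

# Collocated: orders are partitioned by customer_id, like the customers table.
orders_collocated = distribute(orders, lambda o: o[1])
# Not collocated: orders are partitioned by their own order_id.
orders_scattered = distribute(orders, lambda o: o[0])


def remote_lookups(orders_by_node):
    """Count orders whose customer row lives on a different node."""
    remote = 0
    for node, rows in orders_by_node.items():
        for _order_id, customer_id, _amount in rows:
            if node_for(customer_id) != node:
                remote += 1
    return remote


print("remote lookups, collocated:", remote_lookups(orders_collocated))
print("remote lookups, scattered:", remote_lookups(orders_scattered))
```

With collocation every order finds its customer on the same node, so the join can run locally on each node. Without it, some of the lookups have to cross the network.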

Related

Distributed Spark and HDFS Cluster with 6 to 7 Nodes hardware configuration

I am planning to spin up a development cluster for trend analysis for an infrastructure monitoring application, which I am planning to build using Spark for analysing failure trends and Cassandra for storing incoming and analysed data.
Consider collecting performance metrics from around 25000 machines/servers (probably a set of the same applications on different servers). I am expecting performance metrics at a rate of 2MB/sec from each machine, which I am planning to push into a Cassandra table having timestamp and server as the primary key, and application along with some important metrics as the clustering key. I will be running Spark jobs on top of this stored information for failure trend analysis of the performance metrics.
Coming to the question: how many nodes (machines), and of what configuration in terms of CPU and memory, do I need to kick-start my cluster considering the above scenario?
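For a sense of scale, the figures in the question can be multiplied out. This is only back-of-the-envelope arithmetic in Python, taking the stated 25000 machines and 2MB/sec at face value. The replication factor of 3 is an assumption (the question does not give one), and decimal units (1 GB = 1000 MB) are used.

```python
MACHINES = 25_000                # from the question
MB_PER_SEC_PER_MACHINE = 2       # from the question
SECONDS_PER_DAY = 86_400
REPLICATION_FACTOR = 3           # assumption, not stated in the question

# Aggregate ingest rate across all machines.
ingest_mb_per_sec = MACHINES * MB_PER_SEC_PER_MACHINE
ingest_gb_per_sec = ingest_mb_per_sec / 1_000

# Raw volume written per day, before and after replication.
raw_tb_per_day = ingest_gb_per_sec * SECONDS_PER_DAY / 1_000
replicated_tb_per_day = raw_tb_per_day * REPLICATION_FACTOR

print(f"ingest rate: {ingest_gb_per_sec} GB/s")
print(f"raw volume per day: {raw_tb_per_day} TB")
print(f"volume per day at RF={REPLICATION_FACTOR}: {replicated_tb_per_day} TB")
```

If the rate is really 2MB/sec per machine, that is tens of gigabytes per second in total, so it is worth double-checking the per-machine figure before sizing anything.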
Cassandra needs a well planned out data model for things to run well. It is very much worth spending time planning things out at this stage before you have a large data set and find out you probably would have done better re-arranging the data model!
The "general" rule of thumb is you shape your model to the queries, while paying attention to avoiding things like really large rows, large deletes, batches and the like, which can have big performance penalties.
The docs give a good start on planning and testing you would probably find useful. I would also recommend using the Cassandra stress tool. You can use it to push performance tests into your Cassandra cluster to check latencies and any performance problems. You can use your own schema too which I personally think is super-useful!
If you are using cloud based hardware like AWS then it's relatively easy to scale up / down and see what works best for you. You don't need to throw big hardware at Cassandra; it's easier to scale horizontally than vertically.
I'm assuming you are pulling the data back into a separate Spark cluster for the analytics side too, so these nodes would be running plain Cassandra (lower hardware specs). If, however, you are using the DataStax Enterprise version (where you can run nodes in Spark "mode"), then you will need beefier hardware for the additional load of Spark driver programs, executors and the like. Another good docs link is the DSE hardware recommendations.

Is Cassandra for OLAP or OLTP or both?

Cassandra does not comply with ACID like an RDBMS does; it is described in terms of CAP instead. Cassandra picks AP out of CAP and leaves it to the user to tune consistency.
I definitely cannot use Cassandra for core banking transactions because C* is slightly inconsistent.
But Cassandra writes are extremely fast, which is good for OLTP.
I can use C* for OLAP because reads are extremely fast, which is good for reporting too.
So is my understanding correct that C* is good only when your application does not need its data to be consistent for some amount of time, but reads and writes should be quick?
If my understanding is right, could you kindly list some such applications?
ACID are properties of relational databases, whereas BASE are properties of most NoSQL databases, and Cassandra is one of them. The CAP theorem just explains the trade-off between consistency, availability and partition tolerance in distributed systems. The good thing about Cassandra is that it has tunable consistency, so you can be pretty much consistent (at the price of availability), which makes OLTP doable. As phact said, there are even some banks that built their transaction software on top of Cassandra. OLAP is also doable, but not with Cassandra alone, since its partitioned row storage limits its capabilities. You need something like Spark to be able to run the complex queries required.
Cassandra should be avoided for OLTP applications; even they state that it might not be the perfect use case for OLTP. Even though you can achieve a fully consistent model by setting the write consistency level to ALL, this would make writing rather a tough process, since the coordinator node has to write that data to all replica nodes. And if your Cassandra system is massively replicated across different data centers, maybe across different continents, then the time taken to write will increase dramatically.
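The tunable consistency mentioned in both answers is usually summed up by a simple rule: a read is guaranteed to see the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. Here is a minimal Python sketch of that rule. The helper names are made up for illustration, and it only covers the single-datacenter levels ONE, TWO, THREE, QUORUM and ALL.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas in a quorum: a strict majority."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must acknowledge an operation at a given level."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def is_strongly_consistent(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"),
                                ("ALL", "ONE")]:
    print(write_level, read_level,
          is_strongly_consistent(write_level, read_level, 3))
```

Note that writing at ALL is only one way to get overlapping reads and writes. QUORUM for both satisfies the same rule while tolerating one replica being down at a replication factor of 3.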

Hadoop comparison to RDBMS

I really do not understand the actual reason behind Hadoop scaling better than an RDBMS. Can anyone please explain at a granular level? Has this got something to do with the underlying data structures & algorithms?
RDBMSs have challenges in handling huge data volumes of terabytes & petabytes. Even if you have a Redundant Array of Independent/Inexpensive Disks (RAID) & data sharding, an RDBMS does not scale well for huge volumes of data. You require very expensive hardware.
EDIT:
To answer why an RDBMS cannot scale, have a look at the overheads of an RDBMS.
Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
How does Hadoop handle this?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment, which can run on commodity hardware. It is useful for storing & retrieval of huge volumes of data.
This scalability & efficiency are possible thanks to Hadoop's storage mechanism (HDFS) & its job processing (MapReduce jobs on YARN). Apart from scalability, Hadoop provides high availability of stored data.
Scalability, high availability and flexible processing of huge volumes of data (structured, unstructured and semi-structured) are key to the success of Hadoop.
Data is stored on thousands of nodes & processing is done on the node where the data is stored (most of the time) through MapReduce jobs. Data locality on the processing front is one key area of Hadoop's success.
This has been achieved with Name Node, Data Node & Resource Manager.
To understand how Hadoop achieves this, you should visit these links: HDFS Architecture, YARN Architecture and HDFS Federation
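The data-locality idea can be sketched in a few lines of plain Python (this is not Hadoop code). The three "nodes" and their log lines are made-up sample data. Each node summarises only the block it holds, and only the small per-node summaries are merged at the end.

```python
from collections import Counter
from functools import reduce

# Hypothetical data blocks, each "stored" on a different node.
blocks_by_node = {
    "node-1": ["error disk", "ok"],
    "node-2": ["error net", "error disk"],
    "node-3": ["ok", "ok"],
}


def map_phase(lines):
    """Runs where the block lives: count the words of the local lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_phase(partials):
    """Merge the per-node summaries into one global result."""
    return reduce(lambda left, right: left + right, partials, Counter())


# Only these small partial counts would need to travel over the network.
partials = [map_phase(lines) for lines in blocks_by_node.values()]
totals = reduce_phase(partials)
print(dict(totals))
```

Adding a node adds both storage and the compute that processes it, which is the horizontal scaling described above.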
Still, an RDBMS is good for frequent reads/writes/updates and consistent ACID transactions on gigabytes of data, but not for processing terabytes & petabytes of data. NoSQL, with two of the three CAP attributes (consistency, availability, partition tolerance), is good for some use cases.
But Hadoop is not meant for real time transaction support with ACID properties. It is good for Business intelligence reporting with batch processing - "Write once, multiple read" paradigm.
From slideshare.net
Have a look at one more related SE question:
NoSql vs Relational database
First, Hadoop IS NOT a DB replacement.
RDBMSs scale vertically and Hadoop scales horizontally.
This means that to double the capacity of an RDBMS you need hardware with double the memory, double the storage and double the CPU. That is very expensive and has limits. There isn't a server with 10TB of RAM, for example. With Hadoop it is different: you don't need expensive cutting-edge technology; instead, you can use several commodity servers working together to simulate a bigger server (with some limitations). You can have a cluster with 10TB of RAM distributed across several nodes.
Another advantage is that, instead of having to buy a new, more powerful server and drop the old one, scaling a distributed system only requires adding new nodes to the cluster.
The one issue I have with the description above is the claim that parallel RDBMSs require expensive hardware. Teradata and Netezza need special hardware. Greenplum and Vertica can be put on commodity hardware. (Now I will admit I am biased, like everyone else.) I have seen Greenplum scan petabytes of information daily. (Walmart was up to 2.5 petabytes last I heard.) I dealt with both HAWQ and Impala. They both require about 30% more hardware to do the same job on structured data. HBase is less efficient.
There is no magic silver bullet. It has been my experience that both structured and unstructured have their place. Hadoop is great for ingesting large amounts of data and scanning through it a small number of times. We use it as part of our load procedures. An RDBMS is great at scanning the same data over and over with highly complex queries.
You always have to structure the data to make use of it. That structuring takes time somewhere. You either structure it before you put it into an RDBMS or at query time.
In an RDBMS, data is structured, or rather, it is indexed.
In a row-oriented RDBMS, retrieving any particular nth column means reading entire rows and then selecting the nth column.
Whereas in Hadoop, say in Hive with a columnar file format, we load only that particular column from the entire data set.
Moreover, the data loading is also done by MapReduce programs running in a distributed fashion, which reduces the overall time.
Hence, there are two advantages of using Hadoop and its tools.
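The row-versus-column point can be illustrated with a toy model in Python. The three-row table is invented sample data and the model counts values touched rather than bytes read from disk, so treat it as a sketch of the idea, not of how any particular engine stores data.

```python
# Hypothetical table: (id, name, age)
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
]


def read_column_row_store(table_rows, n):
    """Row layout: every field of every row is touched to extract column n."""
    values_touched = 0
    result = []
    for row in table_rows:
        values_touched += len(row)
        result.append(row[n])
    return result, values_touched


# Column layout: the same data, stored as one list per column.
columns = [list(col) for col in zip(*rows)]


def read_column_column_store(table_columns, n):
    """Column layout: only the requested column is touched."""
    column = table_columns[n]
    return list(column), len(column)


ages_from_rows, touched_rows = read_column_row_store(rows, 2)
ages_from_cols, touched_cols = read_column_column_store(columns, 2)
print(ages_from_rows, touched_rows)
print(ages_from_cols, touched_cols)
```

Both layouts return the same column, but the columnar one touches a third of the values here, and the gap grows with the number of columns in the table.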

Analytics and Mining of data sitting on Cassandra

We have a lot of user interaction data from various websites stored in Cassandra, such as cookies, page-visits, ads-viewed, ads-clicked, etc., that we would like to do reporting on. Our current Cassandra schema supports basic reporting and querying. However, we would also like to build large queries that would typically involve joins on large Column Families (containing millions of rows).
What approach is best suited for this? One possibility is to extract the data out to a relational database such as MySQL and do data mining there. An alternative could be to use Hadoop with Hive or Pig to run MapReduce queries for this purpose. I must admit I have zero experience with the latter.
Does anyone have experience of the performance differences of one vs. the other? Would you run MapReduce queries on a live Cassandra production instance or on a backup copy to prevent query load from affecting write performance?
In my experience Cassandra is better suited to processes where you need real-time access to your data, fast random reads and just generally handle large traffic loads. However, if you start doing complex analytics, the availability of your Cassandra cluster will probably suffer noticeably. In general from what I've seen it's in your best interest to leave the Cassandra cluster alone, otherwise the availability starts suffering.
Sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra to use in an offline data-warehouse system.
If you can afford it, having a real data-warehouse would allow you to do complex queries with complex joins on multiple tables. These data-warehouse systems are widely used for reporting; here is a list of what are, in my opinion, the key players:
Netezza
Aster/TeraData
Vertica
A recent one which is gaining a lot of momentum is Amazon Redshift. It is currently in beta, but if you can get your hands on it you could give it a try, since it looks like a solid analytics platform with pricing much more attractive than the above solutions.
Alternatives like using Hadoop MapReduce/Hive/Pig are also interesting to look at, but probably not a replacement for data-warehouse technologies. I would recommend Hive if you have a SQL background because it will be very easy to understand what you're doing and you can scale easily. There are actually already libraries integrated with Hadoop, like Apache Mahout, which allow you to do data-mining on a Hadoop cluster; you should definitely give this a try and see if it fits your needs.
To give you an idea, an approach that I've used and that has been working well so far is pre-aggregating the results in Hive and then having the reports themselves generated in a data-warehouse like Netezza to compute complex joins.
Disclosure: I'm an engineer at DataStax.
In addition to Charles' suggestions, you might want to look into DataStax Enterprise (DSE), which offers a nice integration of Cassandra with Hadoop, Hive, Pig, and Mahout.
As Charles mentioned, you don't want to run your analytics directly against Cassandra nodes that are handling your real-time application needs because they can have a substantial impact on performance. To avoid this, DSE allows you to devote a portion of your cluster strictly to analytics by using multiple virtual "datacenters" (in the NetworkTopologyStrategy sense of the term). Queries performed as part of a Hadoop job will only impact those nodes, essentially leaving your normal Cassandra nodes unaffected. Additionally, you can scale each portion of the cluster up or down separately based on your performance needs.
There are a couple of upsides to the DSE approach. The first is that you don't need to perform any ETL prior to processing your data; Cassandra's normal replication mechanisms keep the nodes devoted to analytics up to date. Second, you don't need an external Hadoop cluster. DSE includes a drop-in replacement for HDFS called CFS (CassandraFS), so all source data, intermediate results, and final results from a Hadoop job can be stored in the Cassandra cluster.

What is the difference between Cassandra vs Oracle Coherence?

Assume that Oracle Coherence is free :)
Which one do you prefer?
What are the architectural and feature capability differences between Oracle Coherence(Tangosol) and Cassandra?
Best Regards
Oracle Coherence is a pure in-memory cache which can be distributed across nodes. Depending on its configuration it can have strong consistency, or eventual consistency for inserts and updates. Coherence is object based - consistent data model.
Since you buy Coherence from Oracle, you can get commercial support from Oracle.
Cassandra is a Bigtable-style data store that is distributed across nodes, with no single point of failure. It uses some caching to improve performance before committing the data to disk in its implementation of Bigtable. Cassandra requires some structure in its tuple (key/value/timestamp) but otherwise can support flexible data structures.
Preferences should be determined by your use case. They are both pretty cool in their own right.
You might also want to check out
- Terracotta in the in-memory space
- CouchDB and HBase as other players in the big table space.
Let's not forget Gemfire from Gemstone Systems, now owned by VMware (http://www.vmware.com/products/vfabric-gemfire/overview.html). Gemfire is an in-memory distributed data fabric similar to Coherence and Terracotta but different in certain key ways. Each one has its pros and cons, but Gemfire has lately been getting more support through a Spring sub-project called spring-gemfire.
Both are NoSQL databases. Currently there are 3 types of NoSQL databases: key-value stores, tabular and document-oriented. Coherence is a key-value store, Cassandra is more like a tabular store and MongoDB is a document-oriented NoSQL DB.

Resources