Is Cassandra for OLAP or OLTP or both? - hadoop

Cassandra does not comply with ACID the way an RDBMS does; it is characterized instead by the CAP theorem. Cassandra picks AP out of CAP and leaves consistency tuning to the user.
I definitely cannot use Cassandra for core banking transactions because C* is only eventually consistent by default.
But Cassandra writes are extremely fast, which is good for OLTP.
I can use C* for OLAP because reads are extremely fast too, which is good for reporting.
So my understanding is that C* is a good fit only when your application can tolerate data being inconsistent for some amount of time, but reads and writes need to be quick?
If my understanding is right, kindly list some example applications.

ACID is a set of properties of relational databases, whereas BASE describes most NoSQL databases, and Cassandra is one of them. The CAP theorem just explains the trade-off between consistency, availability and partition tolerance in distributed systems. The good thing about Cassandra is that it has tunable consistency, so you can be pretty much consistent (at the price of availability), so OLTP is doable. As phact said, there are even some banks that built their transaction software on top of Cassandra. OLAP is also doable, but not with just Cassandra, since its partitioned row storage limits its capabilities; you need something like Spark on top to run the complex queries required.
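To make the "tunable consistency" point concrete, here is a minimal sketch, assuming the DataStax Java driver 3.x and hypothetical accounts / page_views tables in a hypothetical demo keyspace; the consistency level is chosen per statement, trading latency for consistency.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TunableConsistencySketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {   // keyspace name is hypothetical

            // Read that should see the latest acknowledged write: a QUORUM of replicas must answer.
            SimpleStatement read = new SimpleStatement(
                "SELECT balance FROM accounts WHERE id = ?", 42L);
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);
            Row row = session.execute(read).one();
            System.out.println(row == null ? "no row" : row.getDecimal("balance"));

            // Fast, eventually consistent write: one replica acknowledgement is enough,
            // the other replicas catch up asynchronously.
            SimpleStatement write = new SimpleStatement(
                "UPDATE page_views SET hits = hits + 1 WHERE page = ?", "/home");
            write.setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(write);
        }
    }
}
```

With a replication factor of 3, QUORUM reads plus QUORUM writes give read-your-writes behaviour (R + W > RF), while ONE/ONE keeps both paths fast but only eventually consistent.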

Cassandra should be avoided for OLTP applications; even the project itself states that it might not be the perfect fit for OLTP. Although you can achieve a fully consistent model by setting the write consistency level to ALL, that makes writing a rather expensive operation, because the coordinator node has to wait for the write to reach every replica. And if your Cassandra system is massively replicated across different data centers, perhaps across different continents, the time taken to write will increase dramatically.

Related

H2 & Ignite query performance

I am trying to compare performance of application queries on H2 database & Ignite with an Oracle baseline.
I created a test including:
A set of tables and indexes.
A data set of randomly generated data with 50k records per table.
A query with 1 INNER & 10 LEFT OUTER joins (query returned around 188k records).
I noticed significant differences in terms of performance.
Running the query, on my machine (i5 dual core, 16Gb RAM):
Oracle manages to run this query in around 350ms.
H2 takes 4.5s (regardless of the mode - server & in-memory).
Ignite takes 9s.
Iterating over the JDBC result set:
Less than 50ms for H2 in-memory mode
Around 2s for the H2 server mode
Around 5s for Oracle
Around 1s for Ignite
Couple of questions:
Do these figures make sense? Did I just miss the basics of H2 query optimization?
Looking at H2 explain plans, what is the exact meaning of scanCount? Is this something constant for a given query & data set or a performance indicator?
Is there a way to improve H2 performances by tuning indexing or hinting queries?
How to explain the difference between Ignite & H2?
Is the order of joins important? Asking because on Oracle, having up-to-date statistics, the CBO changes the order of joins. I didn't notice such behavior with H2.
Queries & data I used for this test are available here on Github.
Thanks,
L.
Let me share some basic facts related to Ignite vs. RDBMS performance benchmarking. I'm copy-pasting this from a new GridGain doc that will be released this month; just replace GridGain occurrences with Ignite. Please double-check these principles are followed, and let me know if you still don't see a difference.
GridGain and Ignite are frequently compared to relational databases for their SQL capabilities, with the expectation that existing SQL queries, created for an RDBMS, will work out of the box and perform faster in GridGain without any changes. Usually, such a faulty assumption is based on the fact that GridGain stores and processes data in-memory. However, it's not enough just to put data in RAM and expect an order of magnitude performance increase. GridGain, as a distributed platform, requires extra changes for the sake of performance, and below is a standard checklist of best practices to consider before you benchmark GridGain against an RDBMS or do any performance testing:
- Ignite/GridGain is optimized for multi-node deployments with RAM as the primary storage. Don't try to compare a single-node GridGain cluster to a relational database that was optimized for such single-node configurations. You should deploy a multi-node GridGain cluster with the whole copy of data in RAM.
- Be ready to adjust your data model and existing SQL queries, if any. Use the affinity collocation concept during the data modelling phase for proper data distribution. Remember, it's not enough just to put data in RAM: if your data is properly collocated, you can run SQL queries with JOINs at massive scale and expect significant performance benefits (see the sketch after this list).
- Define secondary indexes and use other standard, and GridGain-specific, tuning techniques described below.
- Keep in mind that relational databases leverage local caching techniques and, depending on the total data size, an RDBMS can complete some queries even faster than GridGain, even in a multi-node configuration. If your data set is around 10-100GB and the RDBMS has enough RAM for caching data locally, then it can, for instance, outperform a multi-node GridGain cluster because the latter will be utilizing the network. Store much more data in GridGain to see the difference.
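As a rough illustration of the affinity-collocation point above, here is a minimal sketch against the Apache Ignite Java API; the Purchase model, the cache name and the field names are made up for the example.

```java
import java.io.Serializable;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.AffinityKeyMapped;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

public class CollocationSketch {
    /** Composite key: purchases are stored on the same node as their customer. */
    static class PurchaseKey implements Serializable {
        @QuerySqlField
        long purchaseId;
        @AffinityKeyMapped
        @QuerySqlField
        long customerId;
        PurchaseKey(long purchaseId, long customerId) {
            this.purchaseId = purchaseId;
            this.customerId = customerId;
        }
    }

    static class Purchase implements Serializable {
        @QuerySqlField
        double amount;
        Purchase(double amount) {
            this.amount = amount;
        }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<PurchaseKey, Purchase> cfg = new CacheConfiguration<>("purchases");
            cfg.setIndexedTypes(PurchaseKey.class, Purchase.class);
            IgniteCache<PurchaseKey, Purchase> purchases = ignite.getOrCreateCache(cfg);

            purchases.put(new PurchaseKey(1, 100), new Purchase(25.0));
            purchases.put(new PurchaseKey(2, 100), new Purchase(75.0));

            // Because all rows of a given customer live on one node, this aggregation
            // (and any JOIN on customerId) can run node-locally, without shuffling
            // data across the network.
            SqlFieldsQuery qry = new SqlFieldsQuery(
                "SELECT customerId, SUM(amount) FROM Purchase GROUP BY customerId");
            System.out.println(purchases.query(qry).getAll());
        }
    }
}
```

In a real benchmark this would of course run against a multi-node cluster with the whole data set in RAM, as the checklist above says.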

What is the difference between a Big Data Warehouse and a traditional Data Warehouse

Usually, data warehouses in the context of big data are managed and implemented on the basis of Hadoop-based systems, like Apache Hive (right?).
On the other hand, my question regards the methodological process.
How do big data affect the design process of a data warehouse?
Is the process similar or new tasks must be considered?
Hadoop is similar in architecture to MPP data warehouses, but with some significant differences. Instead of being rigidly defined by a parallel architecture, processors are loosely coupled across a Hadoop cluster and each can work on different data sources.
The data manipulation engine, data catalog, and storage engine can work independently of each other with Hadoop serving as a collection point. Also critical is that Hadoop can easily accommodate both structured and unstructured data. This makes it an ideal environment for iterative inquiry. Instead of having to define analytics outputs according to narrow constructs defined by the schema, business users can experiment to find what queries matter to them most. Relevant data can then be extracted and loaded into a data warehouse for fast queries.
The Hadoop ecosystem starts from the same aim of wanting to collect together as much interesting data as possible from different systems, but approaches it in a radically better way. With this approach, you dump all data of interest into a big data store (usually HDFS – Hadoop Distributed File System). This is often in cloud storage – cloud storage is good for the task, because it’s cheap and flexible, and because it puts the data close to cheap cloud computing power. You can still then do ETL and create a data warehouse using tools like Hive if you want, but more importantly you also still have all of the raw data available so you can also define new questions and do complex analyses over all of the raw historical data if you wish. The Hadoop toolset allows great flexibility and power of analysis, since it does big computation by splitting a task over large numbers of cheap commodity machines, letting you perform much more powerful, speculative, and rapid analyses than is possible in a traditional warehouse.
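As a small illustration of the "dump everything into the big data store first, apply structure later" approach, here is a sketch using the Hadoop FileSystem API; the cluster address, directory layout and the JSON payload are invented for the example.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawEventDump {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS is configured, e.g. hdfs://namenode:8020 (hypothetical address).
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Land raw, schema-less events in a date-partitioned directory;
            // Hive/Spark can impose structure later, while the raw copy stays available.
            Path target = new Path("/data/raw/clickstream/dt=2024-01-01/events.json");
            try (OutputStream out = fs.create(target)) {
                out.write("{\"user\":\"u1\",\"page\":\"/home\",\"ts\":1704067200}\n"
                        .getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```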

What should be considered before choosing HBase?

I am very new to the big data space.
We got a suggestion from the team that we should use HBase instead of an RDBMS for high performance. We have no idea what should/must be considered before switching from an RDBMS to HBase. Any ideas?
One of my favourite books describes this well.
Coming to #Whitefret's last point: there is something called the CAP theorem, based on which the decision can be taken.
Consistency (all nodes see the same data at the same time)
Availability (every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
In this context HBase supports CP.
However, for migrating data from an RDBMS to HBase you can use Sqoop.
It's a difficult question, there are many things to consider.
Can you optimize your RDBMS? Adding indexes, denormalizing joins that cost too much... There are many paths to consider, and I am no expert.
Is your data big? This is very vague, and there is a space between RDBMS and Big Data where you can't be sure which one to use. Millions of rows can still be handled efficiently by an RDBMS.
Do you need relations in your data? NoSQL databases don't use relations, which can be hard for people coming from a SQL background. There are frameworks that put SQL on top of HBase, but it is generally a bad idea to keep an RDBMS model when using Big Data.
If you can answer those questions and you think NoSQL is the way to go, ask your team how they feel about it. NoSQL databases come with problems you would never meet in the SQL world. They should build a prototype first to understand how all this works, and maybe make some training available for them.
In Summary:
- Find if you need non relational database
- Choose the right one (is HBase really what you need? Why not consider Cassandra or MongoDB?)
HBase, like all NoSQL DBs, comes with great new features, but sadly nothing is free (not even mentioning the monetary cost).
In HBase, you really should check whether all the queries you might want to run can be fulfilled by the HBase data model. An important thing to consider is schema design (the modelling of the row key first and foremost); a small code sketch follows after the paper link below.
I advise you to read this really good paper:
http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf
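To give the row-key point some shape, here is a minimal sketch with the HBase Java client (2.x API); the user_events table, the "e" column family and the userId#reversedTimestamp key layout are assumptions for illustration only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyDesignSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table events = conn.getTable(TableName.valueOf("user_events"))) {

            // Composite row key <userId>#<reversedTimestamp>: all events of a user are
            // contiguous, newest first, so the main access pattern becomes a cheap prefix scan.
            long reversedTs = Long.MAX_VALUE - System.currentTimeMillis();
            byte[] rowKey = Bytes.toBytes("user42#" + reversedTs);
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("page"), Bytes.toBytes("/checkout"));
            events.put(put);

            // Read back the latest events for that user with a row-prefix scan.
            Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("user42#")).setLimit(10);
            try (ResultScanner scanner = events.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```

The key idea: design the row key around your dominant read pattern, because HBase only gives you fast access by row key and row-key ranges.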
I think that a really good answer to your question can be found on the HBase official site.
"HBase isn’t suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn’t do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only. "
https://hbase.apache.org/book.html

Hadoop comparison to RDBMS

I really do not understand the actual reason behind Hadoop scaling better than an RDBMS. Can anyone please explain at a granular level? Does this have something to do with the underlying data structures & algorithms?
An RDBMS has challenges in handling huge data volumes of terabytes & petabytes. Even if you have a Redundant Array of Independent/Inexpensive Disks (RAID) & data sharding, it does not scale well for huge volumes of data. You require very expensive hardware.
EDIT:
To answer why an RDBMS cannot scale, have a look at the overheads of an RDBMS.
- Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
- Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
- Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
- Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
How does Hadoop handle this?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment and can run on commodity hardware. It is useful for storing & retrieving huge volumes of data.
This scalability & efficiency are possible because of Hadoop's implementation of the storage mechanism (HDFS) & of processing jobs (YARN MapReduce jobs). Apart from scalability, Hadoop provides high availability of stored data.
Scalability, high availability and processing of huge volumes of data (structured, unstructured and semi-structured) with flexibility are key to the success of Hadoop.
Data is stored on thousands of nodes & processing is done on the node where the data is stored (most of the time) through MapReduce jobs. Data locality on the processing front is one key area of Hadoop's success (a minimal MapReduce sketch follows below).
This has been achieved with the Name Node, Data Node & Resource Manager.
To understand how Hadoop achieves this, you should visit these links: HDFS Architecture, YARN Architecture and HDFS Federation.
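As a concrete, minimal illustration of the MapReduce model mentioned above, here is the classic word-count job written against the Hadoop mapreduce API; input and output paths are passed as arguments.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map tasks are scheduled on the DataNodes that hold the input blocks (data locality).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reducers aggregate the partial counts shuffled from all mappers.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

YARN schedules the map tasks on the nodes that already hold the input blocks, which is the data-locality advantage described above.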
Still, an RDBMS is good for multiple writes/reads/updates and consistent ACID transactions on gigabytes of data, but not for processing terabytes & petabytes of data. NoSQL databases, offering two of the Consistency, Availability and Partition tolerance attributes of the CAP theorem, are good in some use cases.
Hadoop, however, is not meant for real-time transaction support with ACID properties. It is good for business intelligence reporting with batch processing - the "write once, read many" paradigm.
Have a look at one more related SE question:
NoSql vs Relational database
First, Hadoop IS NOT a DB replacement.
An RDBMS scales vertically and Hadoop scales horizontally.
This means that to double the capacity of an RDBMS you need hardware with double the memory, double the storage and double the CPU. That is very expensive and has its limits: there isn't a server with 10 TB of RAM, for example. With Hadoop it is different; you don't need expensive cutting-edge hardware, instead you can use several commodity servers working together to simulate a bigger server (with some limitations). You can have a cluster with 10 TB of RAM distributed across several nodes.
Another advantage is that, instead of having to buy a new, more powerful server and drop the old one, scaling a distributed system only requires adding new nodes to the cluster.
The one issue I have with the description above is that parallel RDBMSs require expensive hardware. Teradata and Netezza need special hardware. Greenplum and Vertica can be put on commodity hardware. (Now, I will admit I am biased, like everyone else.) I have seen Greenplum scan petabytes of information daily. (Walmart was up to 2.5 petabytes last I heard.) I have dealt with both HAWQ and Impala; they both require about 30% more hardware to do the same job on structured data. HBase is less efficient.
There is no magic silver bullet. In my experience both structured and unstructured approaches have their place. Hadoop is great for ingesting large amounts of data and scanning through it a small number of times. We use it as part of our load procedures. An RDBMS is great at scanning the same data over and over with highly complex queries.
You always have to structure the data to make use of it. That structuring takes time somewhere: you either structure it before you put it into an RDBMS or at query time.
In an RDBMS, data is structured, and typically row-oriented and indexed. Retrieving any particular 'nth' column means reading through entire rows and then selecting that column, whereas in Hadoop, say with Hive over a columnar format, only that particular column is read from the entire data set.
Moreover, the data loading is also done by MapReduce programs running in a distributed fashion, which reduces the overall time.
Hence, two advantages of using Hadoop and its tools.
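For illustration, here is a small sketch querying a single column through the Hive JDBC driver; the HiveServer2 address and the clicks table/column names are assumptions, and the column-pruning benefit applies when the table is stored in a columnar format such as ORC or Parquet.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveColumnProjection {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint; host, port, database and credentials are hypothetical.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Only the 'page' column is requested; with ORC/Parquet, Hive reads just that
            // column's data from HDFS instead of whole rows.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}
```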

Analytics and Mining of data sitting on Cassandra

We have a lot of user interaction data from various websites stored in Cassandra, such as cookies, page visits, ads viewed, ads clicked, etc., that we would like to do reporting on. Our current Cassandra schema supports basic reporting and querying. However, we would also like to build large queries that would typically involve joins on large column families (containing millions of rows).
What approach is best suited for this? One possibility is to extract the data out to a relational database such as MySQL and do the data mining there. An alternative could be to use Hadoop with Hive or Pig to run MapReduce queries for this purpose. I must admit I have zero experience with the latter.
Does anyone have experience of the performance differences of one vs. the other? Would you run MapReduce queries on a live Cassandra production instance or on a backup copy, to prevent the query load from affecting write performance?
In my experience Cassandra is better suited to processes where you need real-time access to your data, fast random reads, and generally handling large traffic loads. However, if you start doing complex analytics on it, the availability of your Cassandra cluster will probably suffer noticeably. In general, from what I've seen, it's in your best interest to leave the live Cassandra cluster alone and run analytics elsewhere.
Sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra to use in an offline data-warehouse system.
If you can afford it, having a real data warehouse would allow you to do complex queries with complex joins on multiple tables. These data-warehouse systems are widely used for reporting; here is a list of what are, in my opinion, the key players:
Netezza
Aster/TeraData
Vertica
A recent one which is gaining a lot of momentum is Amazon Redshift. It is currently in beta, but if you can get your hands on it you could give it a try, since it looks like a solid analytics platform with pricing much more attractive than the above solutions.
Alternatives like Hadoop MapReduce/Hive/Pig are also interesting to look at, though probably not a full replacement for dedicated data-warehouse technologies. I would recommend Hive if you have a SQL background, because it will be very easy to understand what you're doing and it scales easily. There are also libraries already integrated with Hadoop, like Apache Mahout, which allow you to do data mining on a Hadoop cluster; you should definitely give this a try and see if it fits your needs.
To give you an idea, an approach that I've used and that has been working well so far is pre-aggregating the results in Hive and then generating the reports themselves in a data warehouse like Netezza, which computes the complex joins.
Disclosure: I'm an engineer at DataStax.
In addition to Charles' suggestions, you might want to look into DataStax Enterprise (DSE), which offers a nice integration of Cassandra with Hadoop, Hive, Pig, and Mahout.
As Charles mentioned, you don't want to run your analytics directly against the Cassandra nodes that are handling your real-time application needs, because they can have a substantial impact on performance. To avoid this, DSE allows you to devote a portion of your cluster strictly to analytics by using multiple virtual "datacenters" (in the NetworkTopologyStrategy sense of the term). Queries performed as part of a Hadoop job will only impact those nodes, essentially leaving your normal Cassandra nodes unaffected. Additionally, you can scale each portion of the cluster up or down separately based on your performance needs.
There are a couple of upsides to the DSE approach. The first is that you don't need to perform any ETL prior to processing your data; Cassandra's normal replication mechanisms keep the nodes devoted to analytics up to date. Second, you don't need an external Hadoop cluster. DSE includes a drop-in replacement for HDFS called CFS (the Cassandra File System), so all source data, intermediate results, and final results from a Hadoop job can be stored in the Cassandra cluster.
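As a rough sketch of the multi-datacenter idea (plain CQL issued through the DataStax Java driver 3.x, not a DSE-specific API), the keyspace below replicates data into a separate, hypothetical "Analytics" datacenter alongside a real-time "Cassandra" datacenter; the datacenter names and replica counts are assumptions.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class AnalyticsKeyspaceSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // NetworkTopologyStrategy: 3 replicas serve the real-time workload in the
            // "Cassandra" DC, 2 replicas in the "Analytics" DC absorb the Hadoop/Hive jobs.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS clickstream WITH replication = " +
                "{'class': 'NetworkTopologyStrategy', 'Cassandra': 3, 'Analytics': 2}");
        }
    }
}
```

Analytics jobs then connect only to nodes in the Analytics datacenter (for example via a datacenter-aware load-balancing policy), so their scans never touch the real-time replicas.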
