We are using Oracle RDBMS in our system. To reduce database load we plan to use a caching layer.
I am looking to see if we can use Apache Cassandra as a caching storage frontend to the Oracle database.
From what I have seen so far, Cassandra is more like a database with built-in caching features, so using it as a caching layer for Oracle would be more like running another database. I feel it would be better to use Cassandra itself as an alternative to Oracle and other RDBMSs rather than alongside Oracle.
Has anyone used Cassandra as a caching layer for an RDBMS? I have not found any resources or examples for this. If you have, can you help me with it?
I'm not sure what you mean by a caching storage frontend.
Cassandra might be useful if you are expecting a large volume of writes that arrive at a rate faster than Oracle could handle. Cassandra can handle a high volume of writes since it can scale by adding more nodes.
You could then do some kind of data analysis and reduction on the data in Cassandra before inserting the crunched data into Oracle. You might then use Oracle for the tasks that suit it better such as financial reporting, ad hoc queries, etc.
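As a rough illustration of the "reduce first, then load" idea above, here is a minimal sketch. It assumes the raw events have already been read out of Cassandra into a plain Python list, and it uses the standard-library `sqlite3` module purely as a stand-in for Oracle; the table name `account_summary` and the event shape are hypothetical.

```python
import sqlite3
from collections import defaultdict


def reduce_events(events):
    """Collapse raw (account, amount) events into one summary row per account."""
    totals = defaultdict(lambda: [0, 0.0])  # account -> [event_count, total_amount]
    for account, amount in events:
        totals[account][0] += 1
        totals[account][1] += amount
    return sorted((acct, n, total) for acct, (n, total) in totals.items())


def load_summary(conn, rows):
    """Insert the reduced rows into the relational store (sqlite3 stands in for Oracle)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS account_summary ("
        "account TEXT PRIMARY KEY, event_count INTEGER, total_amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO account_summary VALUES (?, ?, ?)", rows)
    conn.commit()


# Hypothetical raw events, as they might come out of the write-heavy store.
raw_events = [("acct-1", 10.0), ("acct-2", 5.0), ("acct-1", 2.5)]
summary = reduce_events(raw_events)

conn = sqlite3.connect(":memory:")
load_summary(conn, summary)
```

Only the small summary table reaches the relational side, which is where the ad hoc queries and reporting would then run.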
Related
My database is Cassandra (DataStax Enterprise on Linux). Since, by design, it doesn't support group-by, aggregates, etc. for reporting, using Cassandra on its own is not a good decision. I googled this deficiency and found some results such as this, and this, and also this one.
But I really became confused! Hive uses additional tables, individually. Solr is better for full-text searching and the like. And Spark... it's useful for analysis, but I didn't understand whether it eventually uses Hadoop or not.
I will have many reports, which need indexing and grouping at least. But I don't want to use additional tables, which would impose overhead. Also, I'm a .NET (not Java) developer, and my application is based on the .NET Framework too.
I am not exactly sure what your question is here and your confusion is understandable as with Cassandra and DSE there is a lot going on.
You are correct in stating that Cassandra does not support any aggregations or group by functionality that you would want to use for reporting.
Solr (DSE Search) is used for ad-hoc and full text searching of the data stored in Cassandra. This only works on a single table at a time.
Spark (DSE Analytics) provides analytics capabilities such as Map-Reduce as well as the ability to filter and join tables. This is not done in real-time though as the processing and shuffling of data can be expensive depending on the data load.
Spark does not use Hadoop. It performs many of the same jobs but is more efficient in many scenarios as it allows for in-memory distributed processing on the data.
Since you are using DataStax Enterprise the advantage is that you have built in connectors to both Solr (DSE Search) to provide ad-hoc queries and Spark (DSE Analytics) to provide analytics on your data.
Since I don't know your exact reporting requirements it is difficult to give you a specific recommendation. If you can provide some additional details about what sort of reporting (scheduled versus ad-hoc etc.) you will be running I may be able to help you more.
OLAP directly upon most of the noSQL databases is not possible, but from what I researched it's actually possible in HBase, so I was wondering what features does HBase have in particular that distinguishes it from the others allowing us to do this.
You will have to write lots of data processing logic in your application layer to accomplish this. HBase is a data store, not a DBMS. So yes, as long as the data goes in, you can get it out and process it in your application layer however you want.
If this proves inconvenient for you and a NoSQL platform that supports SQL for OLAP is desirable, you could try Amisa Server.
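To give a feel for what "processing it in your application layer" means for OLAP-style queries, here is a minimal sketch of a roll-up over rows that have already been scanned out of the store. The row shape and the `rollup` helper are hypothetical; a real application would stream rows from a scanner rather than hold them in a list.

```python
from collections import defaultdict


def rollup(rows, dimensions, measure):
    """Sum `measure` over rows, grouped by the given dimension columns."""
    cube = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cube[key] += row[measure]
    return dict(cube)


# Hypothetical rows, as an application might receive them from a table scan.
scanned_rows = [
    {"region": "EU", "product": "A", "sales": 3},
    {"region": "EU", "product": "B", "sales": 4},
    {"region": "US", "product": "A", "sales": 5},
]

by_region = rollup(scanned_rows, ["region"], "sales")
by_region_product = rollup(scanned_rows, ["region", "product"], "sales")
```

Every additional dimension, filter, or measure is more code of this kind, which is the cost the answer is pointing at.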
We currently have a very write-heavy web analytics application which collects a large number of real time events from a large number of websites and stores for subsequent analytics and reporting.
Our initial planned architecture involved a cluster of web servers handling requests and writing all of the data into a Cassandra cluster, while simultaneously updating a large number of counters for real-time aggregated reports. We also plan to run Hadoop directly on CassandraFS (a replacement for HDFS offered by DataStax) to natively run MapReduce jobs on the data residing in Cassandra for more involved analytics. The output of the MapReduce jobs would be written back to column families in Cassandra natively.
Hadoop MapReduce runs on a read-only replica of the main Cassandra cluster, which is write-heavy. The idea was to avoid multiple data hops and have all data for the analytics in one repository.
More recently we have heard about, and faced first hand, issues managing and growing a Cassandra cluster, with frequent node outages and bad response times. Couchbase seems to be much better with response times and with dynamically growing and managing the cluster. So we are considering replacing Cassandra with Couchbase.
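The "counters for real-time aggregated reports" part of this design can be sketched as below. This is only an in-memory model, assuming one counter per (site, event type, minute) bucket; in the architecture described, each increment would instead be a counter update in the datastore. The class and key layout are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def bucket_key(site_id, event_type, ts):
    """Build the counter key: one bucket per site, event type, and minute."""
    minute = ts.replace(second=0, microsecond=0)
    return (site_id, event_type, minute.isoformat())


class RealTimeCounters:
    """In-memory stand-in for the counter rows updated on every incoming event."""

    def __init__(self):
        self.counts = Counter()

    def record(self, site_id, event_type, ts):
        self.counts[bucket_key(site_id, event_type, ts)] += 1


counters = RealTimeCounters()
t = datetime(2024, 1, 1, 12, 30, 15, tzinfo=timezone.utc)
counters.record("site-1", "pageview", t)
counters.record("site-1", "pageview", t.replace(second=45))  # same minute bucket
counters.record("site-1", "click", t)
```

Reports then read a handful of small counter buckets instead of scanning the raw events, which matches the access pattern described: raw data is rarely read back, only aggregated metrics are.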
However this brings up a few questions.
Does Couchbase scale well in a mostly sequential, write-heavy scenario? I don't see our scenario making much use of the in-memory caching, as the raw data being written is rarely read back; only aggregated metrics are. Plus, I haven't been able to read much about what happens when Couchbase needs to hit the disk to write back data very frequently (or all the time?). Will it end up performing worse than Cassandra?
What happens to the Hadoop interface? Couchbase has its own map reduce capabilities, but I understand that they are limited in scope. Would I need to transfer data back and forth between CouchbaseDB and HDFS to be able to support all my analytics and reporting out of a single database?
I have recently evaluated Cassandra and Couchbase among other options for a client requirement, so I can shed some light on both datastores.
Couchbase is incredibly easy to manage, and once you have installed the server on a node, you can manage the cluster completely from the dashboard. However, Couchbase does not scale as well as Cassandra as the data size grows. I also did not find a way to integrate Couchbase and HDFS/Hadoop seamlessly.
Cassandra performs very well for super fast write throughput, but it does not have any server-side aggregation capabilities. Cluster management is slightly more difficult than with Couchbase, as you have to rebalance the cluster every time you add or remove a node. Apart from that, from a performance standpoint, Cassandra runs pretty much seamlessly, as long as you have designed the schema properly.
If you can afford the DataStax Enterprise solution, which offers Hive to do map-reduce for sophisticated analytics, I'd recommend that you stay with Cassandra, as Couchbase map-reduce support is not all that good, and benchmarks show Couchbase performance starts to deteriorate as the cluster size grows.
I am trying to build a data services layer using Cassandra as the backend store. I am new to Cassandra and not sure what client to use: Thrift or CQL 3? We have a lot of MapReduce jobs using Amazon Elastic MapReduce (EMR) that will be reading and writing data from Cassandra at high volume. The total data volume will be > 100 TB with billions of rows in Cassandra. The MapReduce jobs may be read or write heavy with high qps (>1000 qps). The requirements are as follows:
Simplicity of client code. It seems Thrift has built-in integration with Hadoop for bulk data loading using sstableloader (http://www.datastax.com/dev/blog/bulk-loading).
Ability to define new columns at run time. We may need to add more columns depending on application requirements. It seems CQL3 does not allow definition of columns dynamically at run time.
Performance of bulk read/write. Not sure which client is better. However, I found this post that claims the Thrift client has better performance for high data volume: http://jira.pentaho.com/browse/PDI-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
I could not find any authoritative source of information that answers this question. I would appreciate your help with this, since I am sure this is a common problem for most folks and an answer would benefit the overall community.
Many thanks in advance.
-Prateek
Hadoop and Cassandra are both written in Java, so definitely pick a Java-based driver. As far as simplicity of code goes, I'd go for Astyanax; their wiki page is really good and the documentation is solid all round. And yes, Astyanax does allow you to define columns at run time as you please, but be aware that Thrift-based APIs are being superseded by CQL APIs.
If, however, you want to go down the pure CQL3 route, DataStax's driver is what I'd advise you to use. It allows for asynchronous connections and is continuously updated (view the logs). The code is also very clean, although the documentation isn't quite there yet, but there are tests in the source that you can look at.
But to be honest, there are so many questions about the APIs that you should read through them and form an opinion for yourself:
Cassandra Client Java API's
About Java Cassandra Client, which one is better? How about CQL?
Advantages of using cql over thrift
Also, for performance, here are some benchmarks (they are, however, outdated!) showing that CQL is catching up with Thrift (and somewhat overtaking it when it comes to prepared statements):
compare string vs. binary prepared statement parameters
CQL benchmarking
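On the question of run-time columns under CQL3: one common modeling approach is to make the column name a clustering column, so each "dynamic column" becomes a row under the same partition key. The sketch below only prepares the bind values for such a table and does not talk to a cluster; the table `entity_attrs`, its columns, and the helper `to_cql_rows` are hypothetical and not taken from the question.

```python
# Hypothetical table (an assumption for illustration, not from the original post):
#   CREATE TABLE entity_attrs (
#       entity_id  text,
#       attr_name  text,
#       attr_value text,
#       PRIMARY KEY (entity_id, attr_name)
#   );
INSERT_CQL = (
    "INSERT INTO entity_attrs (entity_id, attr_name, attr_value) VALUES (?, ?, ?)"
)


def to_cql_rows(entity_id, attrs):
    """Turn one entity's run-time attributes into (entity_id, name, value) bind tuples."""
    return [(entity_id, name, str(value)) for name, value in sorted(attrs.items())]


# Attributes that were not known when the schema was created.
rows = to_cql_rows("user-42", {"email": "a@example.com", "age": 30})
```

Each tuple in `rows` would be bound to a prepared statement built from `INSERT_CQL`, so adding a new attribute needs no schema change, only another row.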
Is it possible to design a twitter like DB using SQL server? a DB that will ensure high scalability and fast queries.
I am building a .NET platform that requires a similar model like twitter (User, Follower, Tweet) and looking into what will fit best in terms of fast queries and scalability.
Will it be possible using a relational DB or is a graph db much better?
SQL Server will most certainly be able to handle any load that you have. SQL Azure supports databases up to 150GB (though I hear you can get more if you ask). With Azure SQL Federation, you can scale out multiple databases on hundreds of nodes around the world.
As for choosing between a relational database like SQL Server and "NoSQL" variants like Azure Table Storage, it depends on your needs and how structured your data is. Given you'll probably do a lot of joins (querying for followers of users, tweets that someone should see, etc.), your best bet is to go with a relational db. Even Facebook still uses MySQL, so you're not exactly in bad company using a relational db.
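To make the join-heavy access pattern concrete, here is a minimal sketch of the User / Follower / Tweet model and a timeline query. It uses the standard-library `sqlite3` module as a stand-in for SQL Server; the table names, columns, index, and sample data are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(id),
    followee_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (follower_id, followee_id)
);
CREATE TABLE tweets (
    id        INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES users(id),
    body      TEXT NOT NULL
);
-- Index so that fetching an author's recent tweets does not scan the table.
CREATE INDEX idx_tweets_author ON tweets(author_id, id);
""")

conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol")])
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)",
                 [(1, 2, "hello from bob"),
                  (2, 3, "hello from carol"),
                  (3, 1, "a tweet by alice")])


def timeline(conn, user_id, limit=10):
    """Return (author, body) for tweets by people `user_id` follows, newest first."""
    return conn.execute(
        """
        SELECT u.name, t.body
        FROM follows f
        JOIN tweets t ON t.author_id = f.followee_id
        JOIN users  u ON u.id = t.author_id
        WHERE f.follower_id = ?
        ORDER BY t.id DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()


alice_timeline = timeline(conn, 1)
```

The timeline is a two-join query over indexed keys, which is the kind of workload a relational engine handles directly.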