Oracle (RAC) vs NoSQL - oracle

Has anybody benchmarked data access in NoSQL databases against Oracle (in particular Oracle RAC)?
The project has to work with at least 10 million records and search among them (not necessarily in real time). Read speed is very important, and so is guaranteeing HA and reliability (we can't lose records!).
I can see for myself how, say, Cassandra/MongoDB might be a better fit (because key-value storage will provide faster reads than SQL once you go over 10 million records), but I find it difficult to articulate the arguments nicely. Any links? Suggestions? Bullet points?
Thanks!

10 million records. Assume 250 bytes per record. That is about 2.5 GB of data, which is well within the capacity of a basic desktop / laptop PC. The data volumes are insignificant (unless each record is sized in MB, such as a picture or audio).
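As a quick sanity check of the estimate above, here is a minimal sketch. The 250 bytes per record is the assumption stated in the answer; indexes and storage overhead are ignored, and the function name is purely illustrative.

```python
def dataset_size_bytes(record_count: int, bytes_per_record: int) -> int:
    """Raw payload size; ignores indexes, row headers and other overhead."""
    return record_count * bytes_per_record


size = dataset_size_bytes(10_000_000, 250)
print(f"{size / 1e9:.1f} GB")  # prints "2.5 GB"
```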
What you do need to talk about is transaction volumes (separated into read and write) and what you consider HA. Read-only HA is easy relative to "Read-write HA". It can be trivial to replicate a read-only data set off to multiple servers at different geographic locations and distribute a query workload on them.
It's much harder to scale out an update heavy workload, which is why you often hear about systems going into meltdown when tickets for a big concert are released. Quite simply there's a fixed number of seats and you can't have ten duplicated systems each selling what they think is available. There has to be a single source of truth, which means a bottleneck (and potentially a single point of failure).
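The ticket example can be sketched roughly as follows. This is only an in-process illustration, assuming a hypothetical `SeatInventory` class: the lock plays the role of the single source of truth, and it is exactly the serialization point that makes update-heavy workloads hard to scale out.

```python
import threading


class SeatInventory:
    """Single authoritative counter of unsold seats (hypothetical example class)."""

    def __init__(self, seats: int) -> None:
        self._remaining = seats
        self._lock = threading.Lock()  # the deliberate bottleneck

    def try_buy(self) -> bool:
        # Check-and-decrement must be atomic, otherwise two buyers
        # could both see the last seat as available.
        with self._lock:
            if self._remaining > 0:
                self._remaining -= 1
                return True
            return False


def simulate(seats: int, buyers: int) -> int:
    """Run `buyers` concurrent purchase attempts; return how many succeeded."""
    inventory = SeatInventory(seats)
    outcomes = [False] * buyers  # each thread writes only its own slot

    def attempt(slot: int) -> None:
        outcomes[slot] = inventory.try_buy()

    threads = [threading.Thread(target=attempt, args=(i,)) for i in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(outcomes)
```

However many buyers compete, the number of successful purchases never exceeds the number of seats, because every attempt goes through the same counter.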
On the HA aspect, RAC is a shared storage technology which generally means your RAC nodes are in close proximity. That can make them vulnerable to localized events such as a building fire or telecoms breakdown. Data Guard is the Oracle technology that relates to off-site replication and failover.

When you compare NoSQL with SQL, you have to understand a very important difference between them: data in NoSQL may be temporarily inconsistent, which is the cost of achieving HA.
What do I mean by inconsistent? It depends, but it usually takes around 3-5 seconds to propagate the data across nodes. NoSQL databases provide mechanisms to manage and eliminate that, but if you want all your data to be consistent in real time, then you simply use classic SQL, like Oracle RAC.
Coming back to the speed comparison: you simply can't say which one is faster, because it relies on factors like network infrastructure, computing power, database model, etc. But the important thing is that at some point SQL may become economically inefficient to maintain, and you have to switch to NoSQL.

Related

CQRS (Lagom) elasticsearch read-side

I've read that ElasticSearch isn't the most reliable in terms of durability, but I would like to use it to store data on the read-side for optimal searching.
If we store events (write-side) in a cassandra database, that means that data is never really lost.
I don't really understand what is meant by 'data durability'.
If we use ES on the read-side, does that mean that some data may not be properly imported? Does it mean that one day data may randomly be lost, or the risk that all data may one day just have disappeared?
The use case is a Twitter-like geolocation based app.
How reliable is it in the end to use ES exclusively on the read-side, without needing a more reliable datastore (write-side) to store the data?
Depending on what is meant by this "durability", I wonder what measures should be taken to replay events and keep ES consistent at all times.
Thanks
I don't have a huge amount of experience running ES in production, but essentially, ensuring that when you persist data, it stays persisted, especially in a distributed system, is hard. There are many, many edge cases that are very hard to get right, and it takes time for a database to mature and sort those edge cases out. A less durable database is one that probably hasn't ironed all these issues out.
Of course, ElasticSearch is a popular open source database with a thriving community maintaining it, so there are likely no well-defined cases where "your data will be lost in this circumstance". Rather, there are likely cases that either haven't been encountered yet, or that were encountered by users in the wild who didn't care enough to debug them, because they were only using ES as a secondary data store and were able to rebuild it from their primary data store. Whenever a case is identified where ES loses data under well-understood circumstances, the maintainers of ES would be quick to fix it.
The most typical use cases for ES are as a secondary data store, and in such a use case, durability isn't as important because the data store can be rebuilt from the primary. Accordingly, you'll find durability isn't as high a priority to the maintainers of ES because their users aren't asking for it - that's not to say it's not a high priority, just that relative to other databases, it's not as high.
So, if you use ES, you've got a higher chance of encountering bugs where you'll lose data, than with other databases that are either more mature or put more of a focus on durability in their development.
As to whether you should regularly drop your ES database and replay the events, it really depends on your use case and how important it is for your ES database to be consistent. A lot of the edge cases around ES's durability probably result in major corruptions with significant data loss, i.e. you'll know if it happens, so there's no need to drop and replay regularly in that case. Another thing to consider is that because of the way CQRS read sides work, you'll only have a limited number of writers to your ES store, and you can easily control that concurrency. This means that a spike in load won't result in a spike in concurrent writers; instead, your ES store might temporarily lag behind your primary store in consistency. Because of this, you're probably less likely to encounter the edge cases that might trigger ES to lose data.
So, you're probably fine not bothering to drop and rebuild unless something catastrophic happens. The exception is when the consequences of silently losing small amounts of data, in a way that you won't notice, are so high that even the incredibly small chance of that happening is unacceptable.
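To make the "rebuild from the primary" idea concrete, here is a minimal in-memory sketch. It assumes events carry a monotonically increasing offset; `Event`, `ReadModel` and `rebuild` are hypothetical names, and the dictionary merely stands in for an ES index. It is not Lagom's or Elasticsearch's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass(frozen=True)
class Event:
    offset: int   # position in the write-side event log (assumed monotonic)
    post_id: str
    text: str


@dataclass
class ReadModel:
    """Stand-in for the search index on the read side."""
    docs: Dict[str, str] = field(default_factory=dict)
    last_offset: int = -1

    def apply(self, event: Event) -> None:
        # Skip events that were already applied, so redelivery is harmless.
        if event.offset <= self.last_offset:
            return
        self.docs[event.post_id] = event.text
        self.last_offset = event.offset


def rebuild(event_log: Iterable[Event]) -> ReadModel:
    """Drop-and-replay: start from an empty model and apply every event."""
    model = ReadModel()
    for event in event_log:
        model.apply(event)
    return model
```

Because the event log is the source of truth, a lost or corrupted read model can always be recreated by calling `rebuild` on the full log, and tracking the last applied offset lets a surviving model resume without reprocessing everything.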
I know this topic is more than 3 years old, but I am also using Elasticsearch for the read side of CQRS. I think there are other platforms that fit the write side better, and it is not just a question of database technology: today's Event Sourced paradigm needs more. I am using Akka's Finite State Machine with Cassandra, which in my opinion fits that sort of extreme write load better than Elasticsearch.
I wrote a blog post about it, if anybody would like to see it: Write Side for Elasticsearch CQRS

How to distribute data and computation to maximize locality?

Please bear with me, this is a basic architectural question for my first attempt at a "big data" project, but I believe your answers will be of general interest to anyone who is starting out in this field.
I've googled and read the high-level descriptions of Kafka, Storm, Memcached, MongoDB, etc., but now that I'm ready to dig in to start designing my app, I still need some further insight on how in fact the data should be distributed and shared.
The performance of my app is critical, so one objective is to somehow maximize the locality of the data in the RAM of the machines doing the distributed calculations. I need advice for this part of the design.
If my app had some clear criteria for a priori sharding the data and distributing the calculations (such as geographical regions or company divisions) then the solution would be obvious. But unfortunately my app's data access patterns are dynamic and depend on the results of previous calculations.
My app is an analysis program with distinct stages. In the first stage, all the data is accessed once and a metric is calculated for each data object. In the second stage, a subset of the data objects may be accessed, with the probability of access being proportional to each data object's metric that was calculated in the previous stage. In the final stage, a relatively small subset of data objects will be accessed many times for many calculations.
At all stages, it is required that the calculations be distributed across several servers. The calculations are embarrassingly parallel, and each distributed calculation only needs to access a few data objects. It is also required that the number of servers can be specified before the app runs (for example, run on one server, or run on fifty servers).
It seems to me that I need some mechanism that distributes the appropriate data objects to the appropriate compute servers, as opposed to just blindly fetching the data from some database service (whether centralized or distributed). Also, it seems to me that some sort of smart caching system might be appropriate, since the data access pattern depends on the previous calculations and cannot be predicted a priori. But as far as I can tell, Memcached is not such a system because the sharding is determined a priori.
I've read many times that the operating system cache performs better than any monkeying around that we may try. I think the ideal solution is that each compute server's RAM cache somehow captures the data objects' dynamic access patterns, but it's not clear to me how this would work with a NoSQL or Memcached service.
Thanks for bearing with me this far. I realize this is a basic question, but the answer eludes me so far. I can't resolve the dynamic access patterns of my app with the a priori sharding of the NoSQL/Memcached packages. Any advice would be greatly appreciated.
I recommend taking a look at http://tarantool.org. Shard to maximize locality for the most common data access pattern, use Lua for local computations, and net.box to issue a remote RPC when a calculation needs to continue on another node. All data is stored in RAM, and if you write your computation code carefully it can take advantage of the just-in-time compiler.
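The general idea of sending the computation to the node that holds the data can be sketched as below. This is not Tarantool's API, only a generic hash-based routing illustration; `owner` and `group_by_owner` are hypothetical helpers, and the number of servers is a parameter so it can be fixed before a run.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def owner(key: str, n_servers: int) -> int:
    """Map a data object's key to the server assumed to hold it in RAM."""
    # A stable hash (unlike built-in hash()) gives the same answer on every machine.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_servers


def group_by_owner(keys: Iterable[str], n_servers: int) -> Dict[int, List[str]]:
    """Group the keys a calculation needs by the server that owns them."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        groups[owner(key, n_servers)].append(key)
    return dict(groups)
```

A scheduler could then dispatch each group to its owning server instead of pulling the objects across the network. It does not by itself address access patterns that shift between stages; that would need re-sharding or caching on top.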

Use Vertica Database for OLTP data?

Can Vertica Database be used for OLTP data?
And if so what are the pros and cons on doing this?
Looking for a Vertica vs Oracle fight :) Since the Oracle license is so costly, would Vertica do its job for a better price?
thx all
Using Vertica as a transactional database is a bad idea. It's designed to be a data warehousing tool. Essentially, it reads and writes data in an optimized fashion. Lots of transactions? That's not what it is designed to do.
I would recommend that you look into VoltDB. Michael Stonebraker, who is the force behind Vertica, founded that company as well. His basic philosophy is that Oracle, SQL Server, et al. do not do well for high performance since they are designed to do everything. The future is having databases designed for specific tasks.
So he had some concepts for data warehousing, which became Vertica. For transactional databases, there's VoltDB. Not owned by HP, for the record.
For the record, I haven't used VoltDB. From what I know, it isn't as mature as Vertica is as a solution but it looks like it has a ton of promise.
HP Vertica is a column store database. The nature of the way that data is organised within a column store does not lend itself to rapid writes.
HP Vertica gets around this by having a WOS (Write Optimised Store) and ROS (Read Optimised Store which is file based).
Data is moved out of the WOS into the ROS fairly rapidly and the ROS itself has a "merge up" process that takes small ROS files and merges them together to form larger and therefore more easily scanned files.
If you tried to use Vertica for OLTP, you'd get loads of ROS containers and possibly hit the default limit of 1024 ROS containers very quickly.
If you fronted the store with some form of queuing mechanism to pass records through in larger batches, this would result in fewer and larger ROS files. It would work, but if you wanted your OLTP system to read very close to its writing activity, it would not fit the use case.
The WOS/ROS mechanism is a neat workaround for the fundamental performance penalty of writes in a column store DB, but fundamentally Vertica is not an OLTP DB but rather a data mart technology that can ingest data in near real time.
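The queuing front end mentioned above could look roughly like this. It is a minimal sketch, assuming a hypothetical `Batcher` class and a caller-supplied `flush` callback that would perform the actual bulk load; no Vertica client calls are shown.

```python
from typing import Callable, List


class Batcher:
    """Buffers individual rows and hands them on in larger batches."""

    def __init__(self, flush: Callable[[List[dict]], None], max_rows: int) -> None:
        self._flush = flush          # e.g. a function that issues one bulk load
        self._max_rows = max_rows
        self._buffer: List[dict] = []

    def add(self, row: dict) -> None:
        self._buffer.append(row)
        if len(self._buffer) >= self._max_rows:
            self.flush()

    def flush(self) -> None:
        # Hand over whatever is buffered; do nothing for an empty buffer.
        if self._buffer:
            self._flush(self._buffer)
            self._buffer = []
```

A real deployment would usually also flush on a timer so that rows do not wait indefinitely during quiet periods. That delay is the trade-off the answer describes: fewer, larger files in exchange for reads that lag behind writes.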
I think there are different ways to read into this question.
Can you use Vertica as an OLTP database?
First I'll define this question a bit. An OLTP database means the database itself is responsible for the transaction processing, not simply receiving somewhat normalized data.
My answer here is absolutely not, unless perhaps it is a single user database. There is practically no RI, no RI locking, table locks on DELETE/UPDATE, and you're likely to accumulate a delete vector in normal OLTP type usage.
You can work around some of these with some extensive middleware programming (distributed locks, heavy avoidance of DELETE/UPDATE, etc). But why? There are tons of options out there that are not Oracle, don't carry a huge price tag but give you everything you need for OLTP.
Can you use Vertica to ingest and query OLTP data?
Yes, definitely. Best to use Vertica towards its strengths, though. Queries in Vertica tend to have a fair amount of overhead, but you can plow through large amounts of data with ease, even normalized. I would not use Vertica primarily to run point queries, grabbing a few rows here and there. It isn't that you can't, but you can't with the same concurrency as other databases that are meant for this purpose.
TL;DR Use the right tool for the right job. I really love using Vertica, but just because I like to swing a hammer doesn't mean that every problem is a nail.
This question is a little old now but I'll share my experience.
I would not suggest Vertica as OLTP unless you very carefully consider your workload.
As mentioned in other answers, Vertica has 2 types of storage. ROS is the Read Optimized Storage and WOS is the Write Optimized Storage. WOS is purely in memory, so it performs better for inserts, but queries are slower because all the small updates need to be queried and unioned. Vertica can handle small loads in theory, but in practice it didn't work out very well for us performance-wise. There are also drawbacks to WOS, namely that when the database fails, WOS is not necessarily preserved when it rolls back to the last good epoch. (ROS isn't either, but in practice you lose a lot less from ROS.)
ROS is a lot more reliable and gives better read performance but you will never be able to handle more than a certain number of queries without a careful design. Although vertica is horizontally scalable, in practice large tables get segmented across all nodes and therefore queries must run on all nodes. So adding more nodes doesn't mean handling more concurrent queries it just means less work per query. If your tables are small enough to be unsegmented then this might not be an issue for you.
Also worth noting is that OLTP typically implies lots of concurrent transactions, so you'll need to plan resource pools very carefully. By default, Vertica has a planned concurrency for the general resource pool of the minimum of the number of cores per server and RAM/2GB. Essentially, this value determines the default memory allocation PER NODE for a segmented query. Therefore, by default Vertica will not let you run more queries than cores. You can adjust this value, but once you hit a cap on memory there's not much you can do, because the memory is allocated per node, so adding more nodes doesn't even help. If you hit any errors at all for resource pool memory allocations, that is the first config you should look at.
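The default described above can be restated as a small calculation. This only mirrors the rule as given in this answer, not Vertica's actual implementation; the even split of pool memory per planned query is a simplifying assumption, and both function names are hypothetical.

```python
def planned_concurrency(cores_per_node: int, ram_gb_per_node: float) -> int:
    """Default as described in the answer: min(cores, RAM / 2 GB)."""
    return min(cores_per_node, int(ram_gb_per_node // 2))


def per_query_budget_gb(pool_memory_gb: float, concurrency: int) -> float:
    """Simplifying assumption: pool memory is split evenly per planned query."""
    return pool_memory_gb / concurrency


# A node with 32 cores but only 32 GB of RAM is limited by memory, not cores.
print(planned_concurrency(32, 32))  # prints 16
```

Since the budget is computed per node, adding nodes leaves both numbers unchanged, which is the point the answer makes about scale-out not helping concurrency.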
Additionally, Vertica is bad with deletes and updates (which resolve to a delete and an insert in the background) so if these are a regular part of your workload then Vertica is probably a bad choice. Personally we use MySQL for our dimension tables that require deletes/updates and then sync that data periodically into vertica to use for joins.
Personally I use Vertica as an OLTP-ish realtime-ish database. We batch our loads into 5 minute intervals which makes vertica happy in terms of how many/large the inserts are. These batches are inserted using COPY DIRECT so that they avoid WOS entirely (only do this if they are large batches as this forces ROS container creation and can be bad if you do it too often). As many projections as we can have are unsegmented to allow better scale out since this makes queries hit only 1 node and allocate memory on only 1 node. It has worked well for us so far and we load about 5 billion rows a day with realtime querying from our UI.
Up_one - considering the telecom use-case - are you doing CDR or something else?
To answer your original question yes Vertica may be a great fit but it depends on how you are loading the data, how you are doing updates, what your data size is and what your SLA is. I am really familiar in this space because I implemented Vertica at a telecom that I worked for at the time.

Oracle setup required for heavy-ish load

I am trying to make a comparison between a system setup using Hadoop and HBase and achieving the same using Oracle DB as the back end. I lack the knowledge on the Oracle side of things to come to a fair comparison.
The work load and non-functional requirements are roughly this:
A) 12M transactions on two tables with one simple relation and multiple (non-text) indexes within 4 hours. That amounts to 833 transactions per second (TPS), sustained. This needs to be done every 8 hours.
B) Make sure that all writes are durable (so a running transaction survives a machine failure in case of a clustered setup) and have a decent level of availability. By a decent level of availability, I mean that regular failures such as a disk failure or a single network interface / TCP connection drop should not require human intervention. Rare failures may require intervention, but should be solved by just firing up a cold standby that can take over quickly.
C) Additionally add another 300 TPS, but have these happen almost continuously 24/7 across many tables (but all in pairs of two with the same simple relation and multiple indexes).
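The throughput figures in A and C can be checked with a few lines. This is a back-of-the-envelope sketch; it assumes the batch window (A) and the continuous load (C) overlap, and that "every 8 hours" means three batch runs per day.

```python
SECONDS_PER_HOUR = 3600

# Workload A: 12M transactions inside a 4-hour window.
batch_tps = 12_000_000 / (4 * SECONDS_PER_HOUR)

# Workload C: continuous background load.
continuous_tps = 300

# Sustained rate while a batch window is open (assumes A and C overlap).
peak_tps = batch_tps + continuous_tps

# Assumption: one batch run every 8 hours, i.e. three runs per day.
runs_per_day = 24 // 8
daily_transactions = runs_per_day * 12_000_000 + continuous_tps * 24 * SECONDS_PER_HOUR

print(round(batch_tps), round(peak_tps), daily_transactions)
# prints: 833 1133 61920000
```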
Some context: this workload is 24/7 and the system needs to hold 10 years worth of historical data available for live querying. Query performance can be a bit worse than sub-second, but must be lively enough to consider for day-to-day usage. The ETL jobs are setup in such a way that there is little churn. Also in a relational setup, this workload would lead to little lock contention. I would expect index updates to be the major pain. To make a comparison as fair as possible, I would expect the loosest consistency level that Oracle provides.
I have no intention of bashing Oracle. I think it is a great database for many uses. I am trying to get a feeling for the tradeoff there is between going open source (and NoSQL) like we do and using a commercially supported, proven setup.
Nobody can answer this definitively.
When you go buy a car you can sensibly expect that its top speed, acceleration and fuel consumption will be within a few percent of values from independent testing. The same does not apply to software in general nor to databases in particular.
Even if you had provided exact details of the hardware, OS and data structures, along with full details of the amount of data stored as well as transactions, the performance could easily vary by a factor of 100 times depending on the pattern of usage (due to development of hot spots of record caching, disk fragmentation).
However, having said that there are a few pointers I can give:
1) invariably a NoSQL database will outperform a conventional DBMS - the raison d'être for NoSQL databases is performance and parallelization. That does not mean that conventional DBMSs are redundant - they provide much greater flexibility for interacting with data
2) for small to mid range data volumes, Oracle is relatively slow in my experience compared with other relational databases. I'm not overly impressed with Oracle RAC as a scalable solution either.
3) I suspect that the workload would require a mid-range server for consistent results (something in the region of $8k+) running Oracle
4) While having a hot standby is a quick way to cover all sorts of outages, in a lot of cases, the risk/cost/benefit favours approaches such as RAID, multiple network cards, UPS rather than the problems of maintaining a synchronized cluster.
5) Support - have you ever bothered to ask the developers of an open source software package if they'll provide paid for support? IME, the SLAs / EULAs for commercial software are more about protecting the vendor than the customer.
So if you think it's worthwhile considering, and cost is not a big issue, then the best answer would be to try it out for yourself.
No offense here, but if you have little Oracle knowledge there is really no way you can do a fair comparison. I've worked with teams of very experienced Oracle DBAs and sys admins who would argue about setups for comparison tests (the hardware/software setup variables are almost infinite). Usually these tests were justifications for foregone conclusions about infrastructure direction (money being a key issue as well).
Also, do you plan on hiring a team of Hadoop experts to manage your company's data infrastructure? Oracle isn't cheap, but you can find very seasoned Oracle professionals (from DBAs to developers to analysts), not too sure about hadoop admins/dbas...
Just food for thought (and no, I don't work for Oracle ;)

Recommendation for a large-scale data warehousing system

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this. Obviously it needs to be reliable, and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
Wow. You are opening up a huge topic.
A few things right off the top of my head...
think carefully about your schema for inserts in the transactional part and reads in the reporting part, you may be best off keeping them separate if you have really large data volumes
look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
prototype with some meaningful queries and some realistic data volumes
get yourself a real production quality, enterprise ready database, i.e. Oracle / MSSQL
think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos
as I say, huge topic. As I think of more I'll continue adding to my list.
HTH and good luck
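The periodic aggregation suggested in the list above might be sketched like this. SQLite is used only as a stand-in so the example is self-contained, and the `events` and `hourly_counts` tables are hypothetical; a production system would use whichever database is chosen.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Raw transactional table and a separate aggregated reporting table.
    conn.execute("CREATE TABLE events (ts TEXT NOT NULL, event_type TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE hourly_counts ("
        " hour TEXT NOT NULL, event_type TEXT NOT NULL, n INTEGER NOT NULL,"
        " PRIMARY KEY (hour, event_type))"
    )


def rollup_hourly(conn: sqlite3.Connection) -> None:
    """Recompute per-hour counts; safe to run repeatedly on a schedule."""
    conn.execute(
        "INSERT OR REPLACE INTO hourly_counts (hour, event_type, n) "
        "SELECT strftime('%Y-%m-%d %H:00', ts), event_type, COUNT(*) "
        "FROM events "
        "GROUP BY strftime('%Y-%m-%d %H:00', ts), event_type"
    )
    conn.commit()
```

Reports then read from `hourly_counts`, and old rows in `events` can be archived once they have been rolled up. For large volumes the rollup would normally be restricted to recent hours rather than rescanning the whole table.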
@Simon made a lot of excellent points, I'll just add a few and re-iterate/emphasize some others:
Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
Separate the schemas for your transactional and data warehouse
Seriously consider a periodic ETL from transactional db to the data warehouse.
Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
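The queueing suggestion above can be illustrated with the standard library. This is a minimal single-process sketch; `run_capture` is a hypothetical name and the list append stands in for the real database insert.

```python
import queue
import threading
from typing import Iterable, List


def run_capture(events: Iterable[object], n_workers: int) -> List[object]:
    """Push events onto a queue and let several workers store them."""
    pending: queue.Queue = queue.Queue()
    stored: List[object] = []
    store_lock = threading.Lock()

    def worker() -> None:
        while True:
            event = pending.get()
            if event is None:          # sentinel: no more work for this worker
                pending.task_done()
                break
            with store_lock:
                stored.append(event)   # stand-in for the actual insert
            pending.task_done()

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for event in events:
        pending.put(event)             # the capture path only enqueues
    for _ in workers:
        pending.put(None)              # one sentinel per worker
    pending.join()
    for w in workers:
        w.join()
    return stored
```

The capture side returns as soon as an event is queued, so a burst above the average rate fills the queue instead of blocking the website; the workers drain it at whatever pace the storage layer sustains.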
I'm surprised none of the answers here cover Hadoop and HDFS - I would suggest that is because SO is a programmers' Q&A site and your question is in fact a data science question.
If you're dealing with a large number of queries and long processing times, you would use HDFS (a distributed file system, here running on EC2) to store your data and run batch queries (i.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands depending on how big your data crunching requirements are) and run MapReduce queries against your data to produce reports.
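For readers new to the model, the shape of a MapReduce report can be shown in a single process. This is only an illustration of the programming model, not the Hadoop API; the `page` field and all function names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: dict) -> Iterator[Tuple[str, int]]:
    # Emit one (key, 1) pair per event; on a cluster this runs next to the data.
    yield record["page"], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all values belonging to the same key together.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> int:
    # Combine the values for one key into a single result.
    return sum(values)


def run_job(records: Iterable[dict]) -> Dict[str, int]:
    pairs = (pair for record in records for pair in map_phase(record))
    return {key: reduce_phase(key, values) for key, values in shuffle(pairs).items()}
```

On a real cluster the map and reduce functions are distributed across machines and the shuffle moves data over the network, but the report logic keeps this shape.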
Wow.. This is a huge topic.
Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.
Second, there is a definite difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this two ways
Throw money at the problem: Buy best in class software (databases, reporting software) and hire a few slick tech people to help
Take the homegrown approach: Build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As far as the EC2 approach goes, I'm not sure how this would fit into a data storage strategy. Your processing needs are limited, and processing is where EC2 is strong. Your primary goal is efficient storage and retrieval.

Resources