Hi all, my company and I are looking for a product that can help take some of the bigger loads off our OLTP databases, and we have stumbled on Vertica.
After two months of tests and studies we have decided to go to our customers with our new product (well, not really ours; it is Vertica's product).
All good and nice until the license price kicked in! And as everybody knows, there are not too many databases that run at more than 5 TB, or at least we don't have too many, or maybe none!
OK, so I will get to the point: at what point does Vertica pay off? I would like to compare it with Oracle.
We all know Oracle licensing is yearly and is on top of cores/processors/memory etc.; Vertica, on the other hand, licenses by the TB, and I will stress this: it is RAW DATA, not compressed data.
So the question is: at what database size does Vertica become the better choice?
For example: a 5 TB database run on Vertica for 5 years will bring you a benefit of XXX % compared with ORACLE or SQL SERVER.
Good to remember that Vertica prices its support service at 21 % of the total license purchase price. I mean no offense to any of the companies cited in this text, nor am I trying to promote any of them! I am just looking for opinions, and if any of my statements are wrong, please feel free to correct me.
I have no doubt that there are companies that implemented Vertica and did their market studies along with strong benchmarking tests, so feel free to share the knowledge.
I know this is a very general question with many possible answers and opinions; all of them are welcome!
Thank you all!!
In past evaluations, Vertica does well because you can compress data and use commodity hardware, so 5 TB on disk with Vertica is not equivalent to 5 TB on disk with Oracle.
First, it's not just cost but speed. If you need speed, Vertica is a better option than Oracle/SQL Server. But to figure out cost, you also need to figure out if your 10 TB Oracle database will actually be a 5-7 TB Vertica database.
Compression levels will vary depending on what your data looks like, so you'll want to run some tests, with Vertica's help, to figure out what those compression levels will look like.
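If you already have sample data loaded on a trial cluster, one quick sanity check is to compare the raw (licensed) size that the AUDIT() function reports against the compressed footprint recorded in the projection_storage system table. A rough sketch, using a hypothetical public.sales table:

```sql
-- Raw (uncompressed) size as Vertica counts it for licensing, in bytes.
SELECT AUDIT('public.sales');

-- Compressed on-disk footprint of the same table across all its projections.
SELECT anchor_table_name,
       SUM(used_bytes) AS compressed_bytes
FROM   v_monitor.projection_storage
WHERE  anchor_table_schema = 'public'
AND    anchor_table_name   = 'sales'
GROUP BY anchor_table_name;
```

The ratio between the two numbers gives you a first estimate of how much smaller (and cheaper to license) your Oracle data set might be once it lands in Vertica.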
Feel free to message me privately as well.
I am not sure Oracle is an alternative for your 5 TB use case unless you consider Oracle Exadata (same for SQL Server); Exadata should be much more expensive (and the same goes for the SQL Server big-data solution).
One comment about commodity servers: to get maximum performance, commodity servers will not be enough; Vertica really likes a big memory address space and many cores.
About 20% extra each year for support; it's the same story with Oracle.
In the end I think Vertica is competitive and probably cheaper.
Sorry, I cannot provide comparative figures, but if you insist you may get them from your Vertica salesperson.
Thanks
I have one table with 7,515,966 rows, and this table depends on other tables. We created a view on it for generating SSRS reports.
The view has now grown in size, so a performance issue occurs on the report.
We have started archiving data from the large table, but I can't figure out which methodology to use. Please guide us.
Thank you...
Table partitioning in SQL Server 2012 is only available in Enterprise Edition. See https://msdn.microsoft.com/en-us/library/cc645993(v=sql.110).aspx for details on what's available in each edition.
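For reference, if you do end up on Enterprise Edition, a date-based partitioning setup looks roughly like the sketch below. The table and column names are hypothetical; adjust the boundary values to your own retention scheme:

```sql
-- Partition by month on a date column (hypothetical table dbo.BigTable).
CREATE PARTITION FUNCTION pfByMonth (date)
    AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01');

CREATE PARTITION SCHEME psByMonth
    AS PARTITION pfByMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.BigTable
(
    FeedDate   date          NOT NULL,
    CategoryId int           NOT NULL,
    Amount     decimal(18,4) NULL
) ON psByMonth (FeedDate);

-- Old months can then be switched out to an archive table almost instantly,
-- e.g. ALTER TABLE dbo.BigTable SWITCH PARTITION 1 TO dbo.BigTable_Archive;
-- (the archive table must have an identical structure on the same filegroup).
```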
7 million rows is not a lot for SQL Server; we routinely deal with billions of rows. However, as your row counts get into the tens of millions, you'll probably expose various performance gaps in your system. E.g. are your queries written efficiently so they only touch the rows they need, do you have the right indexes, are statistics up to date, is tempdb optimized, etc.?
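For example, a covering index aligned to the most common report query, plus fresh statistics, often helps more than anything else. A minimal sketch with hypothetical names (dbo.BigTable and its columns are placeholders for your own table):

```sql
-- Index keyed on the columns the reports filter by, covering the value column,
-- so report queries touch only the rows and columns they need.
CREATE NONCLUSTERED INDEX IX_BigTable_FeedDate_CategoryId
    ON dbo.BigTable (FeedDate, CategoryId)
    INCLUDE (Amount);

-- Keep statistics current so the optimizer actually picks that index.
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;
```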
One common weak link in 9 out of 10 databases (regardless of make) I've worked with is the storage subsystem. Is yours able to keep up with the large data set you need to work with? Storage for databases should be designed and configured around throughput, concurrency and latency requirements first. Space is generally the last thing to worry about once the other requirements, including HA/DR, are met.
If you have deficiencies in your current system, you can pay for the expensive Enterprise Edition and implement table partitioning, but you will likely still suffer performance problems soon after, if not immediately.
Can Vertica Database be used for OLTP data?
And if so what are the pros and cons on doing this?
Looking for a Vertica vs Oracle fight :) Since Oracle licensing is so costly, would Vertica do the job for a better price?
thx all
Using Vertica as a transactional database is a bad idea. It's designed to be a data warehousing tool. Essentially, it reads and writes data in an optimized fashion. Lots of transactions? That's not what it is designed to do.
I would recommend that you look into VoltDB. Michael Stonebraker, who is the force behind Vertica, founded that company as well. His basic philosophy is that Oracle, SQL Server, et al. do not do well for high performance because they are designed to do everything. The future is databases designed for specific tasks.
So he had some concepts for data warehousing which became Vertica. For transactional databases, there's VoltDB. Not owned by HP, for the record.
For the record, I haven't used VoltDB. From what I know, it isn't as mature as Vertica is as a solution but it looks like it has a ton of promise.
HP Vertica is a column store database. The nature of the way that data is organised within a column store does not lend itself to rapid writes.
HP Vertica gets around this by having a WOS (Write Optimised Store) and a ROS (Read Optimised Store, which is file based).
Data is moved out of the WOS into the ROS fairly rapidly, and the ROS itself has a mergeout process that takes small ROS files and merges them together to form larger, and therefore more easily scanned, files.
If you tried to use Vertica for OLTP, you'd get loads of ROS containers and possibly hit the default limit of 1024 ROS containers very quickly.
If you fronted the store with some form of queuing mechanism that passed records through in larger batches, this would result in fewer, larger ROS files. It would work, but if your OLTP system needs to read data very soon after it is written, it would not fit the use case.
The WOS/ROS mechanism is a neat workaround for the fundamental performance penalty of writes in a column-store DB, but fundamentally Vertica is not an OLTP DB; it is a data mart technology that can ingest data in near real time.
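If you do experiment with small, frequent loads, one way to watch for the container problem (a sketch, assuming the v_monitor.storage_containers system table available in the Vertica versions I have used) is to count ROS containers per projection:

```sql
-- Count ROS containers per projection; counts creeping toward the
-- per-projection limit are a sign the load pattern is too trickle-like.
SELECT projection_name,
       node_name,
       COUNT(*) AS ros_container_count
FROM   v_monitor.storage_containers
WHERE  storage_type = 'ROS'
GROUP BY projection_name, node_name
ORDER BY ros_container_count DESC;
```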
I think there are different ways to read into this question.
Can you use Vertica as an OLTP database?
First I'll define this question a bit. An OLTP database means the database itself is responsible for the transaction processing, not simply receiving somewhat normalized data.
My answer here is absolutely not, unless perhaps it is a single-user database. There is practically no referential integrity enforcement, no RI locking, DELETE/UPDATE take table-level locks, and you're likely to accumulate delete vectors under normal OLTP-type usage.
You can work around some of these with some extensive middleware programming (distributed locks, heavy avoidance of DELETE/UPDATE, etc.). But why? There are tons of options out there that are not Oracle, don't carry a huge price tag, and give you everything you need for OLTP.
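As a sanity check on the delete-vector point, Vertica exposes accumulated delete vectors in a system table (v_monitor.delete_vectors in the versions I have worked with), so you can see the cost of an UPDATE/DELETE-heavy workload directly:

```sql
-- Deleted-but-not-yet-purged rows per projection; large numbers here mean
-- UPDATE/DELETE traffic is piling up work for the Tuple Mover.
SELECT schema_name,
       projection_name,
       SUM(deleted_row_count) AS deleted_rows,
       SUM(used_bytes)        AS delete_vector_bytes
FROM   v_monitor.delete_vectors
GROUP BY schema_name, projection_name
ORDER BY deleted_rows DESC;
```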
Can you use Vertica to ingest and query OLTP data?
Yes, definitely. Best to use Vertica towards its strengths, though. Queries in Vertica tend to have a fair amount of overhead, and you can plow through large amounts of data with ease, even normalized. I would not use Vertica primarily to run point queries, grabbing a few rows here and there. It isn't that you can't, but you can't do it with the same concurrency as other databases that are meant for this purpose.
TL;DR Use the right tool for the right job. I really love using Vertica, but just because I like to swing a hammer doesn't mean that every problem is a nail.
This question is a little old now, but I'll share my experience.
I would not suggest Vertica as an OLTP database unless you consider your workload very carefully.
As mentioned in other answers, Vertica has two types of storage. ROS is the Read Optimized Store and WOS is the Write Optimized Store. WOS is purely in memory, so it performs better for inserts but queries more slowly, as all the small updates need to be queried and unioned. Vertica can handle small loads in theory, but in practice it didn't work out very well for us performance-wise. There are also drawbacks to WOS, namely that when the database fails, WOS contents are not necessarily preserved when it rolls back to the last good epoch. (ROS isn't either, but in practice you lose a lot less from ROS.)
ROS is a lot more reliable and gives better read performance, but you will never be able to handle more than a certain number of concurrent queries without a careful design. Although Vertica is horizontally scalable, in practice large tables get segmented across all nodes, and therefore queries must run on all nodes. So adding more nodes doesn't mean handling more concurrent queries; it just means less work per query. If your tables are small enough to be unsegmented, then this might not be an issue for you.
Also worth noting is that OLTP typically implies lots of concurrent transactions, so you'll need to plan resource pools very carefully. By default, Vertica sets the planned concurrency of the general resource pool to the minimum of the number of cores per server and RAM/2GB. Essentially, this value determines the default memory allocation PER NODE for a segmented query. Therefore, by default Vertica will not let you run more queries than you have cores. You can adjust this value, but once you hit a cap on memory there's not much you can do, because the memory is allocated per node, so adding more nodes doesn't even help. If you hit any errors at all for resource pool memory allocations, that is the first config you should look at.
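To make that concrete, here is a rough sketch of carving out a dedicated pool for short, concurrent queries; the pool name and sizes are made up, so tune them to your own nodes:

```sql
-- A dedicated pool for short dashboard-style queries (hypothetical sizing).
CREATE RESOURCE POOL short_query_pool
    MEMORYSIZE '8G'            -- per-node memory reserved for this pool
    PLANNEDCONCURRENCY 16      -- target number of concurrent queries
    MAXCONCURRENCY 32          -- hard cap before new queries queue
    QUEUETIMEOUT 60;           -- seconds a query waits in the queue before erroring

-- Route an application login to that pool.
ALTER USER report_app RESOURCE POOL short_query_pool;
```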
Additionally, Vertica is bad with deletes and updates (which resolve to a delete and an insert in the background), so if these are a regular part of your workload, then Vertica is probably a bad choice. Personally, we use MySQL for our dimension tables that require deletes/updates and then sync that data periodically into Vertica to use for joins.
Personally, I use Vertica as an OLTP-ish, realtime-ish database. We batch our loads into 5-minute intervals, which keeps Vertica happy in terms of how many inserts there are and how large they are. These batches are inserted using COPY DIRECT so that they avoid WOS entirely (only do this with large batches, as it forces ROS container creation and can be bad if you do it too often). As many projections as possible are unsegmented to allow better scale-out, since this makes queries hit only one node and allocate memory on only one node. It has worked well for us so far, and we load about 5 billion rows a day with real-time querying from our UI.
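For what it's worth, the pattern looks roughly like this (file paths and table names are placeholders):

```sql
-- Bulk-load a 5-minute batch straight into ROS, bypassing WOS.
COPY events
FROM '/data/batches/events_201501011205.csv'
DELIMITER ','
DIRECT;

-- Keep small dimension tables unsegmented and replicated to every node so
-- joins and point lookups stay node-local.
CREATE PROJECTION dim_device_super AS
SELECT * FROM dim_device
UNSEGMENTED ALL NODES;
-- (a projection refresh, e.g. SELECT START_REFRESH();, is needed before the
--  new projection is used for queries)
```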
Up_one - considering the telecom use-case - are you doing CDR or something else?
To answer your original question: yes, Vertica may be a great fit, but it depends on how you are loading the data, how you are doing updates, what your data size is and what your SLA is. I am really familiar with this space because I implemented Vertica at a telecom I worked for at the time.
We have a SQL Server 2008 instance, and one of the tables, say table A, has the following characteristics:
Every day we get several heterogeneous feeds from other systems with numerical data.
Feeds are staged elsewhere, converted to a format compliant with A's schema.
Inserted into A.
Schema looks like:
<BusinessDate> <TypeId> <InsertDate> <AxisX> <AxisY> <Value>
The table has a variable number of rows. Essentially we have to purge it at the weekends, otherwise the size affects performance, so size ranges from 3m to 15m rows during the week. Due to some new requirements we expect this number to increase by 10m by the end of 2012, so we would be talking about 10m-25m rows.
Now, in addition:
Data in A never changes. The middle tier may use A's data, but it will be a read-only operation. Typically the middle tier doesn't even care about the contents; it typically (not always, but in 80% of cases) runs stored procs to generate reports and delivers the reports to other systems.
Clients of this table would typically want to do long sequential reads for one business date and type, i.e. "get me all type 1 values for today"; an index sketch for this pattern follows below.
Clients will want to join this table with 3-5 more tables and then deliver reports to other systems.
The above assumptions are not necessarily valid for all tables with which A is joined. For example we usually join A with a table B and do a computation like B.value*A.value. B.value is a volatile column.
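To make that read pattern concrete, the kind of index that serves it in SQL Server today would be something like the following (just a sketch; A's real definition may differ, and this assumes A has no conflicting clustered key):

```sql
-- Clustering A on (BusinessDate, TypeId) makes "all type 1 values for today"
-- a single sequential range scan instead of scattered lookups.
CREATE CLUSTERED INDEX CIX_A_BusinessDate_TypeId
    ON dbo.A (BusinessDate, TypeId);
```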
Question
A's characteristics sound very much like what HBase and other column-oriented stores can offer.
However some of the joins are with volatile data.
Would you recommend migrating A to an HBase schema?
And also, if we were to move A, I would assume we would also have to migrate B and other dependent tables, which (unlike A) are used in several other places by the middle tier. Wouldn't this complicate things a lot?
25 million rows doesn't sound big enough to justify using HBase, although the usage pattern fits. You need a name node, a job tracker, a master and then your region servers, so you'll need a minimum of maybe 5 nodes to run HBase in any reasonable way. Your rows are so small that I'm guessing it's maybe 10 GB of data, so storing this across 5 servers seems like overkill.
If you do go this route (perhaps you want to store more than a week's data at once) there are ways to integrate HBase with relational DBs. Hive, for example, provides ODBC/JDBC connectivity and can query HBase. Oracle and Teradata both provide integration between their relational DB software and non-relational storage. I know Microsoft has recently announced that they are dropping Dryad in favor of integrating with Hadoop, but I am not certain how far along that process is wrt SQL Server. And if all you need is "get a list of IDs to use in my SQL query" you can of course write something yourself easily enough.
I think HBase is very exciting, and there may be things you haven't mentioned which would drive you towards it (e.g. high availability). But my gut says you can probably scale out your relational db much more cheaply than switching to HBase.
I am trying to make a comparison between a system set up using Hadoop and HBase and achieving the same using Oracle DB as the back end. I lack knowledge on the Oracle side of things, so it is hard to come to a fair comparison.
The work load and non-functional requirements are roughly this:
A) 12M transactions on two tables with one simple relation and multiple (non-text) indexes within 4 hours. That amounts to 833 transactions per second (TPS), sustained. This needs to be done every 8 hours.
B) Make sure that all writes are durable (so a running transaction survives a machine failure in the case of a clustered setup) and have a decent level of availability. By a decent level of availability, I mean that regular failures such as a disk failure or a single network interface / TCP connection drop should not require human intervention. Rare failures may require intervention, but should be solved by just firing up a cold standby that can take over quickly.
C) Additionally, add another 300 TPS, but have these happen almost continuously 24/7 across many tables (all in pairs of two with the same simple relation and multiple indexes).
Some context: this workload is 24/7 and the system needs to hold 10 years' worth of historical data available for live querying. Query performance can be a bit worse than sub-second, but must be lively enough for day-to-day usage. The ETL jobs are set up in such a way that there is little churn. Also, in a relational setup this workload would lead to little lock contention; I would expect index updates to be the major pain. To make the comparison as fair as possible, I would expect to use the loosest consistency level that Oracle provides.
I have no intention of bashing Oracle. I think it is a great database for many uses. I am trying to get a feeling for the tradeoff there is between going open source (and NoSQL) like we do and using a commercially supported, proven setup.
Nobody can answer this definitively.
When you go buy a car you can sensibly expect that its top speed, acceleration and fuel consumption will be within a few percent of values from independent testing. The same does not apply to software in general nor to databases in particular.
Even if you had provided exact details of the hardware, OS and data structures, along with full details of the amount of data stored as well as the transactions, the performance could easily vary by a factor of 100 depending on the pattern of usage (due to the development of hot spots, record caching, disk fragmentation and so on).
However, having said that, there are a few pointers I can give:
1) Invariably, a NoSQL database will outperform a conventional DBMS; the raison d'être for NoSQL databases is performance and parallelization. That does not mean that conventional DBMSs are redundant; they provide much greater flexibility for interacting with data.
2) For small to mid-range data volumes, Oracle is relatively slow in my experience compared with other relational databases. I'm not overly impressed with Oracle RAC as a scalable solution either.
3) I suspect that the workload would require a mid-range server for consistent results (something in the region of $8k+) running Oracle
4) While having a hot standby is a quick way to cover all sorts of outages, in a lot of cases, the risk/cost/benefit favours approaches such as RAID, multiple network cards, UPS rather than the problems of maintaining a synchronized cluster.
5) Support - have you ever bothered to ask the developers of an open-source software package if they'll provide paid-for support? IME, the SLAs / EULAs for commercial software are more about protecting the vendor than the customer.
So if you think it's worth considering, and cost is not a big issue, then the best answer would be to try it out for yourself.
No offense here, but if you have little Oracle knowledge there is really no way you can do a fair comparison. I've worked with teams of very experienced Oracle DBAs and sys admins who would argue about setups for comparison tests (the hardware/software setup variables are almost infinite). Usually these tests were justifications for foregone conclusions about infrastructure direction (money being a key issue as well).
Also, do you plan on hiring a team of Hadoop experts to manage your company's data infrastructure? Oracle isn't cheap, but you can find very seasoned Oracle professionals (from DBAs to developers to analysts); I'm not so sure about Hadoop admins/DBAs...
Just food for thought (and no, I don't work for Oracle ;)
I am curious if anybody did benchmarks for accessing of data in NoSQL databases vs Oracle (particularly I am talking about Oracle RAC)?
The project requires working with at least 10 million records, with search among them (not necessarily in real time); read speed is very important, and it's also very important to guarantee HA and reliability (we can't lose records!).
I can see for myself how, say, Cassandra/MongoDB might be a better fit (because key-value storage will provide faster reads than SQL when you go over 10 million records), but I find it difficult to articulate the reasons nicely. Any links? Suggestions? Bullet points?
Thanks!
10 million records. Assume 250 bytes per record. That is about 2.5 GB of data, which is well within the capacity of a basic desktop/laptop PC. The data volumes are insignificant (unless each record is sized in MB, such as picture or audio data).
What you do need to talk about is transaction volumes (separated into read and write) and what you consider HA. Read-only HA is easy relative to "Read-write HA". It can be trivial to replicate a read-only data set off to multiple servers at different geographic locations and distribute a query workload on them.
It's much harder to scale out an update heavy workload, which is why you often hear about systems going into meltdown when tickets for a big concert are released. Quite simply there's a fixed number of seats and you can't have ten duplicated systems each selling what they think is available. There has to be a single source of truth, which means a bottleneck (and potentially a single point of failure).
On the HA aspect, RAC is a shared storage technology which generally means your RAC nodes are in close proximity. That can make them vulnerable to localized events such as a building fire or telecoms breakdown. Data Guard is the Oracle technology that relates to off-site replication and failover.
Mostly, when you come to a comparison of NoSQL vs SQL, you have to understand a very important difference between them: data in NoSQL may be inconsistent, and that is the cost of achieving HA.
What do I mean by inconsistent? It depends, but it usually takes around 3-5 seconds to propagate the data around the nodes. NoSQL databases provide mechanisms to manage and eliminate that, but if you want all your data to be consistent in real time, then you simply use classic SQL, like Oracle RAC.
Coming back to the speed comparison: it's simply not possible to say which one is faster, because it depends on factors like network infrastructure, computing power, database model and so on. The important thing is that at some point you may reach the moment where SQL is economically inefficient to maintain and you have to switch to NoSQL.