Performance impact of Aerospike UDFs? - user-defined-functions

I would like to know what performance impact I should expect when invoking a UDF (user-defined function) written in C every time a record is created or changed (under the assumption that the UDF code itself takes no time - I will optimize that on my own).
Let's say I have hardware capable of running an SSD-persisted namespace at 200k writes/s; can I expect at least 50k writes/s with the UDF run every time?
Subquestion: what might limit the UDFs' performance (context switching?)
The reason for asking is that Aerospike itself uses UDFs, e.g. for Large Data Types, but those are not highly performant according to Aerospike staff (compared to KVS ops). My use case is to use UDFs to keep a broad range of secondary indices within a Redis Cluster up to date, allowing for much richer real-time queries (e.g. intersections/unions of 5-10 secondary indices).

The best thing is to run the test yourself; it's hard to predict. But I believe you should be able to do 50k TPS.
UDF performance is mainly affected by the memory allocations that happen under the hood before the UDF is called. If you use simple data types like int/string/blob, you are better off. If you use list/map in the UDF, more memory allocations happen, which will impact performance.
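If you want to measure the gap yourself, a rough approach is to time a batch of plain puts against the same batch with a UDF execution on each record. Below is a minimal single-threaded sketch using the Aerospike Java client from Scala; the host, namespace, set, module and function names are placeholders for whatever you have registered, and this only gives a relative comparison, not peak TPS.

    import com.aerospike.client.{AerospikeClient, Bin, Key, Value}

    object UdfOverheadBench {
      def main(args: Array[String]): Unit = {
        val client = new AerospikeClient("127.0.0.1", 3000) // placeholder host/port
        val n = 100000

        def timed(label: String)(op: Int => Unit): Unit = {
          val start = System.nanoTime()
          (0 until n).foreach(op)
          val secs = (System.nanoTime() - start) / 1e9
          println(f"$label: ${n / secs}%.0f ops/s")
        }

        // plain key-value write
        timed("put only") { i =>
          client.put(null, new Key("test", "demo", i), new Bin("v", i))
        }

        // same write plus a record UDF call; "mymodule"/"on_write" stand in
        // for whatever UDF module and function you register on the server
        timed("put + UDF") { i =>
          val key = new Key("test", "demo", i)
          client.put(null, key, new Bin("v", i))
          client.execute(null, key, "mymodule", "on_write", Value.get(i))
        }

        client.close()
      }
    }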

Related

Performance benchmarking between Hive (on Tez) and Spark for my particular use case

I'm playing around with some data on a cluster and want to do some aggregations: nothing too complicated, but more complicated than a sum; there are a few joins and count distincts. I have implemented this aggregation in Hive and in Spark with Scala and want to compare the execution times.
When I submit the scripts from the gateway, the Linux time command gives me a real time smaller than the sys time, which I expected. But I'm not sure which one to pick for a proper comparison. Maybe just use sys time and run both queries several times? Is that acceptable, or am I a complete noob in this case?
Real time. From a performance benchmark perspective, you only care about how long (human time) it takes before your query is completed and you can look at the results, not how many processes are getting spun up by the application internally.
Note, I would be very careful with performance benchmarking, as both Spark and Hive have plenty of tunable configuration knobs that greatly affect performance. See here for a few examples to alter Hive performance with vectorization, data format choices, data bucketing and data sorting.
The "general consensus" is that Spark is faster than Hive on Tez, but that Hive can handle huge data sets that don't fit in memory better. (I'm not going to cite a source since I'm lazy, do some googling)

Spark: Tackle performance-intensive commands like collect(), groupByKey(), reduceByKey()

I know that some of Spark Actions like collect() cause performance issues.
It has been quoted in documentation
To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).
And from one more related SE question: Spark runs out of memory when grouping by key
I have come to know that groupByKey() and reduceByKey() may cause out-of-memory errors if parallelism is not set properly.
I did not find enough evidence about other transformations and action commands that have to be used with caution.
Are these three the only commands to be tackled? I have doubts about the commands below too:
aggregateByKey()
sortByKey()
persist() / cache()
It would be great if you could provide information on intensive commands (those that work globally across partitions instead of a single partition, or low-performance commands) that have to be handled with better guarding.
You have to consider three types of operations:
transformations implemented using only mapPartitions(WithIndex), like filter, map, flatMap, etc. Typically this is the safest group. Probably the biggest issue you can encounter is extensive spilling to disk.
transformations which require a shuffle. This includes obvious suspects like the different variants of combineByKey (groupByKey, reduceByKey, aggregateByKey) or join, and less obvious ones like sortBy, distinct or repartition. Without context (data distribution, exact function for reduction, partitioner, resources) it is hard to tell whether a particular transformation will be problematic. There are two main factors:
network traffic and disk IO - any operation which is not performed in memory will be at least an order of magnitude slower.
skewed data distribution - if the distribution is highly skewed, the shuffle can fail or subsequent operations may suffer from suboptimal resource allocation.
operations which require passing data to and from the driver. Typically it covers actions like collect or take and creating distributed data structure from a local one (parallelize).
Other members of this category are broadcasts (including automatic broadcast joins) and accumulators. Total cost depends of course on a particular operation and the amount of data.
While some of these operations can be expensive, none is particularly bad by itself (including the demonized groupByKey). Obviously it is better to avoid network traffic or additional disk IO, but in practice you cannot avoid them entirely in any complex application.
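As a small illustration of the shuffle point above, here is a word-count sketch contrasting groupByKey with reduceByKey; the input path is a placeholder. Both produce the same result, but reduceByKey pre-aggregates within each partition, so far less data crosses the network.

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-demo"))

        val pairs = sc.textFile("hdfs:///tmp/words.txt") // placeholder input
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))

        // ships every (word, 1) pair across the network, then sums on the reducer side
        val viaGroup = pairs.groupByKey().mapValues(_.sum)

        // shuffles only one partial sum per key and partition
        val viaReduce = pairs.reduceByKey(_ + _)

        println(viaGroup.count() + " keys via groupByKey")
        println(viaReduce.take(10).mkString(", "))
        sc.stop()
      }
    }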
Regarding cache, you may find Spark: Why do I have to explicitly tell what to cache? useful.

Use Vertica Database for OLTP data?

Can Vertica Database be used for OLTP data?
And if so, what are the pros and cons of doing this?
Looking for a Vertica vs Oracle fight :) Since an Oracle license is so costly, would Vertica do the job for a better price?
Thanks all.
Using Vertica as a transactional database is a bad idea. It's designed to be a data warehousing tool. Essentially, it reads and writes data in an optimized fashion. Lots of transactions? That's not what it is designed to do.
I would recommend that you look into VoltDB. Michael Stonebraker, who is the force behind Vertica, founded that company as well. His basic philosophy is that Oracle, SQL Server, et al. do not do well for high performance because they are designed to do everything. The future is having databases designed for specific tasks.
So he had some concepts for data warehousing, which became Vertica. For transactional databases, there's VoltDB. Not owned by HP, for the record.
For the record, I haven't used VoltDB. From what I know, it isn't as mature as Vertica is as a solution but it looks like it has a ton of promise.
HP Vertica is a column store database. The nature of the way that data is organised within a column store does not lend itself to rapid writes.
HP Vertica gets around this by having a WOS (Write Optimised Store) and ROS (Read Optimised Store which is file based).
Data is moved out of the WOS into the ROS fairly rapidly and the ROS itself has a "merge up" process that takes small ROS files and merges them together to form larger and therefore more easily scanned files.
If you tried to use Vertica for OLTP then what would happen would be that you'd get loads of ROS containers and possibly hit the default limit of 1024 ROS containers very quickly.
If you fronted the store with some form of queuing mechanism to pass through records in larger batches, this would result in fewer and larger ROS files. It would work, but if you needed your OLTP system to read data very soon after writing it, it would not fit the use case.
The WOS/ROS mechanism is a neat workaround for the fundamental performance penalty of writes in a column-store DB, but fundamentally Vertica is not an OLTP DB; rather, it is a data mart technology that can ingest data in near real time.
I think there are different ways to read into this question.
Can you use Vertica as an OLTP database?
First I'll define this question a bit. An OLTP database means the database itself is responsible for the transaction processing, not simply receiving somewhat normalized data.
My answer here is absolutely not, unless perhaps it is a single-user database. There is practically no referential integrity (RI), no RI locking, there are table locks on DELETE/UPDATE, and you're likely to accumulate delete vectors under normal OLTP-type usage.
You can work around some of these with some extensive middleware programming (distributed locks, heavy avoidance of DELETE/UPDATE, etc). But why? There are tons of options out there that are not Oracle, don't carry a huge price tag but give you everything you need for OLTP.
Can you use Vertica to ingest and query OLTP data?
Yes, definitely. Best to use Vertica towards its strengths, though. Queries in Vertica tend to have a fair amount of overhead, and you can plow through large amounts of data with ease, even normalized. I would not use Vertica primarily to run point queries, grabbing a few rows here and there. It isn't that you can't, but you can't do it with the same concurrency as other databases that are meant for this purpose.
TL;DR Use the right tool for the right job. I really love using Vertica, but just because I like to swing a hammer doesn't mean that every problem is a nail.
This question is a little old now, but I'll share my experience.
I would not suggest Vertica for OLTP unless you very carefully consider your workload.
As mentioned in other answers, Vertica has two types of storage. ROS is the Read Optimized Store and WOS is the Write Optimized Store. WOS is purely in memory, so it performs better for inserts but queries more slowly, as all the small updates need to be queried and unioned. Vertica can handle small loads in theory, but in practice it didn't work out very well for us performance-wise. There are also drawbacks to WOS, namely that when the database fails, WOS is not necessarily preserved when it rolls back to the last good epoch. (ROS isn't either, but in practice you lose a lot less from ROS.)
ROS is a lot more reliable and gives better read performance, but you will never be able to handle more than a certain number of queries without a careful design. Although Vertica is horizontally scalable, in practice large tables get segmented across all nodes, and therefore queries must run on all nodes. So adding more nodes doesn't mean handling more concurrent queries; it just means less work per query. If your tables are small enough to be unsegmented, this might not be an issue for you.
Also worth noting is that OLTP typically implies lots of concurrent transactions, so you'll need to plan resource pools very carefully. By default Vertica sets the planned concurrency for the general resource pool to the minimum of the number of cores per server and RAM/2GB. Essentially, this value determines the default memory allocation PER NODE for a segmented query. Therefore, by default Vertica will not let you run more queries than cores. You can adjust this value, but once you hit a cap on memory there's not much you can do, because the memory is allocated per node, so adding more nodes doesn't even help. If you hit any errors at all for resource pool memory allocations, that is the first config you should look at.
Additionally, Vertica is bad with deletes and updates (which resolve to a delete and an insert in the background), so if these are a regular part of your workload then Vertica is probably a bad choice. Personally, we use MySQL for our dimension tables that require deletes/updates and then sync that data periodically into Vertica to use for joins.
Personally, I use Vertica as an OLTP-ish, realtime-ish database. We batch our loads into 5-minute intervals, which makes Vertica happy in terms of how many inserts there are and how large they are. These batches are inserted using COPY DIRECT so that they avoid WOS entirely (only do this for large batches, as it forces ROS container creation and can be bad if you do it too often). As many projections as possible are unsegmented to allow better scale-out, since this makes queries hit only one node and allocate memory on only one node. It has worked well for us so far, and we load about 5 billion rows a day with realtime querying from our UI.
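For what it's worth, a batch load like the one described above can be as simple as firing a COPY ... DIRECT statement over JDBC whenever a batch file is ready. The sketch below assumes the Vertica JDBC driver is on the classpath; the host, credentials, table name and file path are placeholders.

    import java.sql.DriverManager

    object BatchLoader {
      def main(args: Array[String]): Unit = {
        // placeholder connection details
        val conn = DriverManager.getConnection(
          "jdbc:vertica://vertica-host:5433/analytics", "dbadmin", "secret")
        val stmt = conn.createStatement()
        try {
          // DIRECT bypasses the WOS and writes straight to ROS,
          // which is only a good idea for reasonably large batches
          stmt.execute(
            "COPY events FROM LOCAL '/data/batches/events.csv' DELIMITER ',' DIRECT")
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }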
Up_one - considering the telecom use-case - are you doing CDR or something else?
To answer your original question: yes, Vertica may be a great fit, but it depends on how you are loading the data, how you are doing updates, what your data size is and what your SLA is. I am really familiar with this space because I implemented Vertica at a telecom I worked for at the time.

Does it make sense to optimize queries for less i/o pressure?

I have a read-only database (product) that resides on its own SQL Server 2008 instance.
I have already optimized queries by looking at the most expensive queries in the Activity Monitor report, ordered by CPU cost. I now have something like 50 queries/second and no query takes longer than 300 ms.
CPU time is fine (30%) and only 20% of memory is used (out of 64 GB).
There is one issue: disk time is at a steady 100% (I looked at the idle time performance counter and used Idera's SQL Diagnostic Manager). I can see that the product DB behaves differently from my order DB, which is on a different machine and has smaller tables: if I look at a profiler trace, I have queries in the product DB that show a value in the "reads" column higher than 50,000. In my order DB these values are never higher than 1,000. The queries in the product DB use a lot of common table expressions and work on large tables (some have around 5 million rows).
I am not sure whether I should invest time in optimizing queries for I/O performance or just add a server. While optimizing for query duration I already added the missing indexes. Is optimizing for I/O something that is usually done?
In short, yes. Optimize for both CPU and IO.
Queries with high CPU tend to be doing unnecessary in-memory sorts, (sometimes inefficient) hash joins, or complex logic.
Queries with high IO (Page Reads) tend to be doing full table scans or working in other inefficient ways.
9 times out of 10, the same queries will be near the top of the list, but if you've worked on the high CPU and you still are unhappy with performance, then by all means, work on the high IO procs next.
"There's always a next bottleneck," they say.
Now that you've tuned CPU usage, it's only natural that I/O load emerges as dominant. Is your performance already acceptable? If yes, stop. If not, you have to estimate how many hours you will have to invest in further tuning, and whether buying another server or more hard disks might be cheaper.
Regarding the I/O tuning again, try to see what you can achieve with easy measures. Sometimes you can trade CPU for I/O and vice versa; compression is an example of this. You would then tune the component that is your current bottleneck.
Before you seek to make the I/O faster try to reduce the I/O that is generated.
Look for obvious IO performance improvements for your query, but more importantly, look at how you can improve your IO performance at the server level.
If your other resources (CPU and memory) aren't overloaded, you probably don't need a new server. Consider adding an SSD for logs and temp files, and/or consider if you can affordably fit your whole DB onto an array of SSDs.
Of course, clearing out your disk IO bottleneck is likely to raise CPU usage, but if your performance is close to acceptable, this will probably improve things to the point that you can stop optimizing for now.
Unless you are using SSDs or a DB-optimized SAN, IO is almost always the limit in database applications.
So yes, optimize to get rid of it as much as possible.
Table indexes are the first thing to do.
Then, add as much RAM as you possibly can, up to the complete size of your DB files.
Then partition your data tables (if that is a reasonable thing to do) so that any necessary table or index scans are done on only one or two table partitions.
Then I suppose you either buy bigger machines with even more RAM and/or buy SSDs or a SAN or a SAN with SSDs.
Alternatively you rebuild your entire database application to use something like NoSQL or database sharding, and implement all your relations, joins, constraints, etc in a middle interface layer.

Dealing with Gigabytes of Data

I am going to start on a new project. I need to deal with hundreds of gigabytes of data in a .NET application. It is too early to give much detail about this project, but here is an overview:
Lots of writes and lots of reads on the same tables, very real-time
Scaling is very important, as the client insists on expanding the database servers very frequently, and thus the application servers as well
Foreseeably, lots and lots of usage in terms of aggregate queries could be implemented
Each row of data may contain lots of attributes to deal with
I am suggesting the following as a solution:
Use a distributed-hash-table kind of persistence (not S3 but an in-house one)
Use something like Hadoop/Hive (any replacement in .NET?) for any analytical processing across the nodes
Implement the GUI in ASP.NET/Silverlight (with lots of Ajaxification, wherever required)
What do you guys think? Am I making any sense here?
Are your goals performance, maintainability, improving the odds of success, being cutting edge?
Don't give up on relational databases too early. With a $100 external hard drive and a sample data generator (Red Gate's is good), you can simulate that kind of workload quite easily.
To simulate that workload on a non-relational or cloud database, you might end up writing your own tooling.
"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"
This is the hallmark of a data warehouse.
Here's the trick with DW processing.
Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.
To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly so that every query has this form, performance can be very, very good.
Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.
Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.
You can trivially pre-assign all PKs and FKs using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
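The question is about a .NET project, but the pre-assignment idea is language-agnostic; here is a small illustrative sketch in Scala (file names and the delimiter are placeholders) that streams a raw fact extract and prepends a sequential surrogate key to each row, so the bulk loader never has to generate keys itself.

    import java.io.PrintWriter
    import scala.io.Source

    object AssignSurrogateKeys {
      def main(args: Array[String]): Unit = {
        val in  = Source.fromFile("facts_raw.txt")   // placeholder input extract
        val out = new PrintWriter("facts_keyed.txt") // placeholder output for the bulk load
        try {
          var nextKey = 1L
          for (line <- in.getLines()) {
            out.println(s"$nextKey|$line") // '|' is a placeholder delimiter
            nextKey += 1
          }
        } finally {
          in.close()
          out.close()
        }
      }
    }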
Get Ralph Kimball's Data Warehouse Toolkit books.
Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.
Scaling can be tricky, but it doesn't mean you have to go with cloud computing stuff. Replication in DBMS will usually do the trick, along with web application clusters, load balancers, etc.
Give the RDBMS the responsibility to keep the integrity. And treat this project as if it were a data warehouse.
Keep everything clean; you don't need to use a lot of third-party tools: use the RDBMS tools instead.
I mean, use all the tools that the RDBMS has, and write a GUI that extracts data from the DB using well-written stored procedures on top of a well-designed physical data model (indexes, partitions, etc.).
Teradata can handle a lot of data and is scalable.
