Cassandra scaling cheat-sheet

Cassandra scaling cheat-sheet - performance

Of course you can only know the performance of your system with your load with your use-cases by ... actually implementing it! That aside, before embarking on a prototype, I'm searching for some very rough estimates of how Cassandra performs.
For various configurations of nodes and data-centres, and for various read and write consistency levels, what are the chances of reading a stale value? What kind of key reads and writes per second would you expect to sustain, and what kind of latency would each read and write have?

Cassandra benchmarking presented at VLDB earlier this year: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
Performance/consistency tradeoffs: http://www.datastax.com/dev/blog/your-ideal-performance-consistency-tradeoff

We run an application with 500 datapoints posted per second per web node (we have 6 Cassandra nodes). We could probably get 1000 datapoints per second per node if we cached 100M of data in the client to avoid the read.
The profile is PlayOrm with one findAll(List keys) and one putAll(List entities) on each request, where each key in that list is a single datapoint; the clients send a batch of datapoints over HTTP so that we don't have as much HTTP overhead. Maybe that gives you some idea at least, though I'm not sure.
We have not yet tested the correct ratio of web nodes to Cassandra nodes, but I suspect it is near one to one on this project, as it was for my last client, though it changes with the profile.
We run 4 web nodes and get 2000 datapoints per second right now.

Related

Frequent Updates on Apache Ignite

I hope someone experienced with Apache Ignite can help guide my team towards the answer regarding a new setup with Apache Ignite.
Overall Setup
Data is continuously generated from many distributed sensors and streamed into our database. Each sensor may deliver many updates every second, but generally generates <10 updates/sec.
Daily the magnitude of the data is approx. 50 million records, per site.
Data Description
Each record consists of the following values
Sensor ID
Point ID
Timestamp
Proximity
where Sensor ID is our ID of the sensor, Point ID is an ID of some point on the site, and Proximity is a proximity measurement from the sensor to the point.
Each second there is approx. 1000 such new records. A record is never updated.
Query Workload
Queries are fairly complex with significant (and dynamic) look-back in time. A query may require data from several sensors in one site, but the required sensors are determined dynamically. Most continuous queries only require data from the last few hours, but frequently it is necessary to query over many days.
Generally, we therefore have a write-once query-many scenario.
Initial Strategy
If we load data into primitive integer arrays in, e.g., java, the space consumption for a week approaches 5 GB. Because that is "peanuts" in the platforms of today, we intend to load all data onto all nodes in the Ignite cluster/distributed cache. In other words, use a replicated cache.
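As a rough check on that figure, here is a back-of-envelope sketch. It assumes every one of the four fields, including the timestamp, fits in a 32-bit integer; that width is an assumption, not something stated above.

```python
# Back-of-envelope storage estimate for one site.
RECORDS_PER_DAY = 50_000_000   # daily volume quoted in the question
FIELDS_PER_RECORD = 4          # sensor ID, point ID, timestamp, proximity
BYTES_PER_FIELD = 4            # assumption: each field fits a 32-bit int


def raw_bytes(days: int = 7) -> int:
    """Raw array payload for `days` of data, with no per-object overhead."""
    return RECORDS_PER_DAY * days * FIELDS_PER_RECORD * BYTES_PER_FIELD


if __name__ == "__main__":
    print(f"{raw_bytes(7) / 1e9:.1f} GB for 7 days")
    print(f"{raw_bytes(10) / 1e9:.1f} GB for 10 days")
```

Under these assumptions a week comes to a few gigabytes of raw payload, the same order of magnitude as the estimate above.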
However, the continuous updates keep puzzling me. If I update the entire cache, I imagine quite substantial amounts of data need to be transferred across the network every second.
Creating chunks for, say, each minute/hour is not necessarily going to work (well) either as each sensor can be temporarily offline, which will make it deliver stale data at some later point in time.
My question is therefore how to efficiently handle this stream of updates, while maintaining a consistent view of the data for the last 7-10 days.
My current, local, implementation is chunking the data into 1-hour chunks. When a new record for a given chunk arrives, the chunk is replaced with an updated chunk. This works well on a single machine but is likely too expensive in terms of network overhead in a cluster. I do not have an Ignite implementation, yet, so I have not been able to test this.
Ideally, each node in the ignite cluster would maintain its own copy of all data within the last X days, and apply the small update workload continuously.
So my question is, how would fellow Igniters approach this problem?

It sounds like you want to scale the load across multiple servers, but that is not possible with replicated caches, because each update always updates all nodes, and the more nodes you have, the more network traffic you will get. I think you should use partitioned caches instead and try adding nodes until the system is capable of handling the load.
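To make the difference concrete, here is a deliberately simplified model of the write load each node sees. It is only a sketch: it ignores batching, rebalancing and query traffic, and the `backups` parameter (number of extra copies kept per entry in a partitioned cache) is an assumed setting, not one taken from the question.

```python
def per_node_writes(rate: float, nodes: int, mode: str, backups: int = 1) -> float:
    """Average number of updates per second that a single node must apply."""
    if nodes < 1:
        raise ValueError("need at least one node")
    if mode == "replicated":
        # Every node holds every entry, so every node applies every update.
        return float(rate)
    if mode == "partitioned":
        # Each entry lives on one primary plus `backups` copies.
        copies = min(nodes, 1 + backups)
        return rate * copies / nodes
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n,
              per_node_writes(1000, n, "replicated"),
              per_node_writes(1000, n, "partitioned"))
```

In this model, adding nodes leaves the per-node load of a replicated cache unchanged, while the per-node load of a partitioned cache falls as the cluster grows.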

Infinispan vs memcached for high concurrency need

My web application maintains an in-memory cache of domain entities which are read and written at high frequency. To make the application clustered, I need to synchronize or externalize this cache.
Which would be the better option between memcached and Infinispan, considering the following facts about the application:
the cache will be read and written at high frequency
if Infinispan, data needs to be replicated across nodes in near real time
highly concurrent writes should not create conflicts if replication is slow
I feel memcached will serve this purpose well since it's centralized and does not incur a replication delay like Infinispan. Can experts provide an opinion on this?

Unfortunately I'm not a Memcached expert but let me tell you more about some fundamental concepts so that you could pick the best option for your use case...
First, centralized vs decentralized - if you have only one node in your system, it will be faster (as you said there is no replication). However what will happen if the node is down? Or another scenario - what will happen if the node gets full (as you said you will perform a lot of read/writes per second)? One solution for that is to use master/slave replication where writes are propagated to the slave node asynchronously. This solution will save you in case the node is down but won't do any good if the node is full (if master node is full, slave will get full a couple of minutes later).
Data consistency - if you have more than 1 node in your system, your data might get out of sync. Imagine asynchronous replication between 2 nodes and a client connected to each of them. Both clients perform a write to the same key at the same exact moment. It might seem unlikely, but believe me, with highly concurrent reads and writes it will happen. The only way to solve this problem is to use synchronous replication with a majority of nodes up and running (or with so-called consensus).
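The race described above can be reproduced with a toy model. This is an illustrative sketch only: the `Replica` class is hypothetical and does not correspond to the API of Infinispan, memcached or any other product.

```python
class Replica:
    """A toy node that applies writes locally and ships them to a peer later."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.outbox = []  # local writes not yet shipped to the peer

    def write(self, key, value):
        # The client gets its acknowledgement as soon as the local write is done.
        self.data[key] = value
        self.outbox.append((key, value))

    def ship_to(self, peer):
        # Asynchronous replication: runs some time after the client was answered.
        for key, value in self.outbox:
            peer.data[key] = value
        self.outbox.clear()


def simulate():
    a, b = Replica("A"), Replica("B")
    a.write("k", "from-client-1")   # client 1 talks to node A
    b.write("k", "from-client-2")   # client 2 talks to node B at the same moment
    a.ship_to(b)                    # B overwrites its value with A's
    b.ship_to(a)                    # A overwrites its value with B's
    return a.data["k"], b.data["k"]


if __name__ == "__main__":
    print(simulate())
```

After both replication steps the two nodes have swapped values and disagree about the key, even though both clients were told their write succeeded.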
Back to your scenario - if a broken node is not a problem for you (for example, you can switch to some other data source automatically) and your data won't grow - go ahead for 1 node solution or master/slave replication. If your data need to be strongly consistent - make sure you're doing sync replication (and possibly with transactions but you need to refer to the user manual for guidance). Otherwise I would recommend picking a more versatile solution which will allow you to add/remove nodes without taking down whole system and will have an option for sync/async replication.
From my experience, people care too much about data consistency when they should care much more about scalability. And a final piece of advice: please define your performance criteria before evaluating any solution (something like: my writes need to take no longer than X and reads no longer than Y). Also define a confidence level for your criteria (for example, I need 99.5% of all reads to take less than X).
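A criterion of that form can be checked mechanically against measured latencies. The sketch below assumes you have already collected a list of per-request latencies in milliseconds; the function name and default values are illustrative.

```python
def meets_criterion(latencies_ms, limit_ms, fraction=0.995):
    """True if at least `fraction` of the samples finished within `limit_ms`."""
    if not latencies_ms:
        raise ValueError("no samples to evaluate")
    within = sum(1 for latency in latencies_ms if latency <= limit_ms)
    return within / len(latencies_ms) >= fraction


if __name__ == "__main__":
    # Made-up samples: 996 fast reads and 4 slow ones.
    samples = [1.2] * 996 + [40.0] * 4
    print(meets_criterion(samples, limit_ms=5.0))
```

Running the same check against every candidate solution, under the same load, gives a like-for-like comparison instead of an impression.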

Cassandra partition size and performance?

I was playing around with the cassandra-stress tool on my own laptop (8 cores, 16GB) with Cassandra 2.2.3 installed out of the box with its stock configuration. I was doing exactly what was described here:
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
And measuring its insert performance.
My observations were:
using the code from https://gist.github.com/tjake/fb166a659e8fe4c8d4a3 without any modifications I had ~7000 inserts/sec.
when modifying line 35 in the code above (cluster: fixed(1000)) to "cluster: fixed(100)", i.e. configuring my test data distribution to have 100 clustering keys instead of 1000, the performance jumped up to ~11000 inserts/sec
when configuring it to have 5000 clustering keys per partition, the performance dropped to just 700 inserts/sec
The documentation says, however, that Cassandra can support up to 2 billion rows per partition. I don't need that many, but I still don't get how just 5000 records per partition can slow the writes down 10 times. Or am I missing something?

Supporting is a little different from "best performing". You can have very wide partitions, but the rule of thumb is to try to keep them under 100 MB for miscellaneous performance reasons. Some operations can be performed more efficiently when the entirety of the partition can be stored in memory.
As an example (this is an old example; it is a complete non-issue post 2.0, where everything is single pass): in some versions, when the partition size is >64 MB, compaction uses a two-pass process that halves compaction throughput. It still worked with huge partitions. I've seen many multi-GB ones that worked just fine, but the systems with huge partitions were difficult to work with operationally (managing compactions/repairs/GCs).
I would say target the 100 MB rule of thumb initially and test from there to find your own optimum. Things will always behave differently based on use case; to get the most out of a node, the best you can do is run benchmarks as close as possible to what you're going to do (true of all systems). This seems like something you're already doing, so you're definitely on the right path.
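To relate the 100 MB rule of thumb to a row count, a rough estimate like the following can help. It is only a sketch: the average row size is something you would have to measure for your own schema, and the model ignores on-disk overhead and compression.

```python
PARTITION_BUDGET_BYTES = 100 * 1000 * 1000  # the 100 MB rule of thumb


def partition_bytes(rows: int, avg_row_bytes: int) -> int:
    """Approximate partition size, ignoring storage overhead and compression."""
    return rows * avg_row_bytes


def max_rows(avg_row_bytes: int, budget: int = PARTITION_BUDGET_BYTES) -> int:
    """Largest row count per partition that stays within the budget."""
    return budget // avg_row_bytes


if __name__ == "__main__":
    avg = 200  # hypothetical average row size in bytes
    print(partition_bytes(5000, avg), "bytes for 5000 rows")
    print(max_rows(avg), "rows fit the budget")
```

Treat the result as a starting point for benchmarking, not as a limit.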

Bulk insert vs Single insert

The primary dev managing our ES cluster has made the statement that single document loads to ES will only provide us with roughly 30 to 40 creations a second, whereas bulk operations will give us more in the range of 1,000+. I realize that bulk is always (or at least generally) faster and that there are hardware and environment constraints on any process. However, with other technologies you do not pay such a heavy price for single insertions. I am obviously ignorant when it comes to ES. Why do you pay such a heavy price for document writes in ES? Or are we just not properly informed?
Environment:
Apache Storm writes to our ES cluster
Currently all of the writes are processed in bulk operations.

What you have to take into account is the round-trip time between your loader and your cluster. Setting up an HTTP connection, transferring the data, and then waiting for a response can take a while -- in this case it seems it's taking you about 30 ms. Elasticsearch has to set up a parser for your request, hand it off to the node that is really going to do the work, and then generate the response back to you.
By using the bulk API, you remove a lot of back and forth -- ES can group together inserts going to the same node, doesn't have to instantiate a new parser for every request, etc.
HTTP Connection pooling for single requests would help, but doing bulk inserts/updates/deletes is always going to be faster in the long run.
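The effect of the round trip can be seen in a small model. This is a sketch for a single client sending requests one after another; the per-document processing cost used in the example is a made-up figure, not a measurement.

```python
def docs_per_second(round_trip_s: float, batch_size: int = 1,
                    per_doc_s: float = 0.0) -> float:
    """Throughput of one sequential client.

    round_trip_s: fixed cost per request (connection, parsing, response)
    batch_size:   documents sent in each request
    per_doc_s:    assumed extra processing time per document
    """
    return batch_size / (round_trip_s + batch_size * per_doc_s)


if __name__ == "__main__":
    # One document per request at a 30 ms round trip.
    print(round(docs_per_second(0.030)))
    # 1000 documents per request, with a hypothetical 0.5 ms cost per document.
    print(round(docs_per_second(0.030, batch_size=1000, per_doc_s=0.0005)))
```

With one document per request, the 30 ms round trip alone caps a single sequential client at roughly 33 documents per second, which matches the figure in the question; batching spreads that fixed cost over many documents.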

Bulk indexing is indeed way faster, but it is not as bad as your system admin suggests. Elasticsearch has gotten a lot better at this stuff over the past two years.
We're able to do hundreds of inserts/updates per second without bulking requests. Most inserts take around 1 ms (including sending the http request and receiving the response). If insert speed becomes an issue, you can back off on the cluster refresh (default 1s). Also, you can use multiple threads to insert. Bulk insert can get in the range of 10000s, depending on how complex your mappings are.
You definitely want http connection pooling (true when using any kind of webservice in anger) or even better, run an embedded elasticsearch node. Another alternative is to run an elasticsearch node on localhost if you don't want to do an embedded node. That way, all http traffic is on localhost.
Finally, if you need to support more concurrent writes, you can always increase the number of shards and nodes. These numbers are not set in stone. If you need tens of thousands of writes per second, it should be possible to engineer a cluster that can do it. It will probably require a lot of tuning and hardware, and you should probably not do this unless you have a really good reason to do so. However, the whole point of Elasticsearch is horizontal scalability.

Growing hash-of-queues beyond main memory limits

I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
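For reference, the structure described above could look roughly like this in Python. It is a minimal sketch, not the author's actual controller: the method names are invented, new jobs are assumed to arrive already tagged with their category, and a crashed job is simply appended to the back of its queue.

```python
from collections import defaultdict, deque


class Controller:
    def __init__(self):
        self.queues = defaultdict(deque)  # category -> pending job strings
        self.outstanding = {}             # worker id -> (category, job)

    def add_job(self, category, job):
        self.queues[category].append(job)

    def next_job(self, worker, preferred=None):
        """Hand out a job, favouring the worker's previous category."""
        if self.queues.get(preferred):
            category = preferred
        else:
            # Fall back to the first non-empty queue, if any.
            category = next((c for c, q in self.queues.items() if q), None)
        if category is None:
            return None
        job = self.queues[category].popleft()
        self.outstanding[worker] = (category, job)
        return job

    def worker_crashed(self, worker):
        # Put the unfinished job back on its queue.
        category, job = self.outstanding.pop(worker)
        self.queues[category].append(job)

    def worker_finished(self, worker, new_jobs):
        # new_jobs: iterable of (category, job) pairs reported by the worker.
        self.outstanding.pop(worker)
        for category, job in new_jobs:
            self.add_job(category, job)
```

Everything here lives in process memory, which is exactly the property the rest of the question is about.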
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears too RAM-centric for me to feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 2^34 – 2^37 bytes of storage.
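The arithmetic behind that estimate can be checked as follows. The sketch uses the upper bounds quoted above (100 entries per queue, 250 bytes per string) and counts only the queued strings, not the keys.

```python
import math


def payload_bytes(keys: int, entries_per_queue: int = 100,
                  bytes_per_entry: int = 250) -> int:
    """Upper bound on raw string payload, with zero structural overhead."""
    return keys * entries_per_queue * bytes_per_entry


if __name__ == "__main__":
    for keys in (1_000_000, 10_000_000):
        size = payload_bytes(keys)
        print(f"{keys} keys: {size / 1e9:.0f} GB, "
              f"about 2^{math.floor(math.log2(size))} bytes")
```

That works out to between 25 GB and 250 GB, i.e. on the order of 2^34 to 2^37 bytes.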

Ultimately, it all boils down to how you define the efficiency needed on the part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be
cached in RAM to some extent for efficiency, but jobs take tens of
seconds to complete, so it's okay if it's not that efficient,
Given the current memory-hog problem, some form of persistent storage seems a reasonable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again -- otherwise, caching is unlikely to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a
very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
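For comparison, one possible way to map the hash-of-queues onto a relational table is sketched below using Python's built-in sqlite3 module. The table and function names are made up for illustration, and the sketch assumes a single controller process accessing the database.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the job store; pass a file path to keep it on disk."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " category TEXT NOT NULL,"
        " job TEXT NOT NULL)"
    )
    # The index makes 'oldest job in a category' a cheap lookup.
    db.execute(
        "CREATE INDEX IF NOT EXISTS jobs_by_category ON jobs (category, id)"
    )
    return db


def push(db, category, job):
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO jobs (category, job) VALUES (?, ?)", (category, job)
        )


def pop(db, category):
    """Remove and return the oldest job in a category, or None if empty."""
    with db:
        row = db.execute(
            "SELECT id, job FROM jobs WHERE category = ? ORDER BY id LIMIT 1",
            (category,),
        ).fetchone()
        if row is None:
            return None
        db.execute("DELETE FROM jobs WHERE id = ?", (row[0],))
        return row[1]
```

Each queue becomes the set of rows sharing a category, ordered by insertion id, so only the rows being touched need to be in memory at any time.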
c) using a NoSQL database with persistency support, e.g. Redis (data
structure maps over trivially, but this still appears too RAM-centric
for me to feel confident that the memory-hog problem will actually go
away)
An in-memory database like Redis doesn't solve the memory-hog problem unless you set up a cluster of machines that each hold a part of the overall data. This makes sense only if keeping all data in memory is needed to meet low response-time requirements. Yet given the nature of your jobs, which take tens of seconds to complete, response times as seen by the workers hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
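Client-side consistent hashing, as mentioned above, can be sketched generically as follows. This is not the API of any particular Redis client; the class, the node names and the number of virtual nodes are illustrative assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to nodes so that adding a node moves only part of the keys."""

    def __init__(self, nodes, vnodes=100):
        # Place several virtual points per node on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["redis-a", "redis-b", "redis-c"])  # hypothetical node names
    print(ring.node_for("category-42"))
```

The controller would call `node_for(category)` to decide which server holds a given queue, so each server only stores its share of the keys.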
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
Requirements
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requests per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more data per job)?
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.

I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/
