Cassandra partition size and performance? - performance

I was playing around with cassandra-stress tool on my own laptop (8 cores, 16GB) with Cassandra 2.2.3 installed out of the box with having its stock configuration. I was doing exactly what was described here:
http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema
And measuring its insert performance.
My observations were:
using the code from https://gist.github.com/tjake/fb166a659e8fe4c8d4a3 without any modifications I had ~7000 inserts/sec.
when modifying line 35 in the code above (cluster: fixed(1000)) to "cluster: fixed(100)", i. e. configuring my test data distribution to have 100 clustering keys instead of 1000, the performance was jumping up to ~11000 inserts/sec
when configuring it to have 5000 clustering keys per partition, the performance was reducing to just 700 inserts/sec
The documentation says however Cassandra can support up to 2 billion rows per partition. I don't need that much still I don't get how just 5000 records per partition can slow the writes 10 times down or am I missing something?

Supporting is a little different from "best performaning". You can have very wide partitions, but the rule-of-thumb is to try to keep them under 100mb for misc performance reasons. Some operations can be performed more efficiently when the entirety of the partition can be stored in memory.
As an example (this is old example, this is a complete non issue post 2.0 where everything is single pass) but in some versions when the size is >64mb compaction has a two pass process, that halves compaction throughput. It still worked with huge partitions. I've seen many multi gb ones that worked just fine. but the systems with huge partitions were difficult to work with operationally (managing compactions/repairs/gcs).
I would say target the rule of thumb initially of 100mb and test from there to find own optimal. Things will always behave differently based on use case, to get the most out of a node the best you can do is some benchmarks closest to what your gonna do (true of all systems). This seems like something your already doing so your definitely on the right path.

Related

How to get 2+ million floats a second from a DB to a consumer?

A few days ago I was granted the very interesting project, of recoding the data input of a large simulation framework.
Before I accepted that task, it used an ... interesting ... file reader. That sadly didn't work.
The task is easy to describe: get up to 2+ million floats per second from a database (originally planned PostgresQL, now looking into ElasticSearch or Cassandra) into a Java proxy. Remap, buffer and feed them to a high-performance data bus.
The second part was easy enough, and the first part was standard, right? Erm, sadly wrong.
As it turned out, I somehow smashed into a brick wall of "not more than 1 Mb / s"... which is only 5% of the transfer rate needed. Yes, yes, the problem could surely be killed with more hardware, but why introduce a "built in brake", if there might be better solutions?
A tad background on the data:
A unit consists of a uuid, a type and a timeseries of values (floats). One float per minute for a whole year -> 60*24*365 = 525,600 floats.
The consumer asks 150k units, for 15 time-steps a second -> 2.25 mio floats per second.
We plan to pre-split the timeseries into day-batches, which seems to be a tad more manageable. Even at this rate that's about 9k*150k units = 1.35 GB of data for the buffer.
So, I tried several different ways to store and retrieve the data.
In PostgresQL the most promising approach till now is a table:
uuid uuid,
type text,
t1...t365 float4[]
PK is the combination of uuid and type.
Simple enough, I only need to SELECT tx FROM table.
Info on my testing setup: as I can't yet test with a full dataset (even faking them is painstackingly slow, at 2 sec per unit) I usually test with batches of 2k units.
Under such conditions the SELECT needs 12 seconds. According to my tests that's NOT due to transmission bottleneck (same result if I use a DB on the same machine). Which is roughly 5% of the speed I need, and much slower than I assumed it would be. 1 Mb/s???
I haven't found any apparent bottleneck yet, so throwing more hardware at it, might even not work at all.
Anyone knowledgable about PostgresQL an idea what's slowing it down so much?
OK, so I tried another system. ElasticSearch might be faster, and also solve some other questions, like a nasty cluster access API I might have to add later on.
mapping: {
_index: timeseries_ID
_type: unit
uuid: text(keyword)
type: text(keyword)
day: integer
data: array[float]
}
To my surprise, this was even slower. Only 100k floats per second tops. Only 2% of what I need. Yes yes I know, add more nodes, but isn't ElasticSearch supposed to be a tad faster than Postgres?
Anyone an idea for an ElasticSearch mapping or any configs that might be more promising?
Now, on monday I'll get to work with my tests with Cassandra. The planned table is simple:
day int PRIMARY KEY,
uuid text,
type text,
data array[float]
Is there a more sensible setup for this kind of data for Cassandra?
Or does anyone have a better idea alltogether to solve this problem?
Clarifications
If I do understand what you want I can't see why you are using a database at all.
Reason behind using a DB is, that the data comes from outside people, with varying IT-skills. So we need a central storage for the data, with an API as well as maybe a simple web frontend to check the integrity and clean up and reformate it into our own internal system. A database seems to work better on that front than some proprietory file system. That one could be more easily clogged and made unusable by inexperienced data contributors.
I'm not sure it's clear on what you're trying to do here or why you seem to be randomly trying different database servers.
It's not randomly. The Postgres was a typical case of someone going: "oh but we do use that one already for our data, so why should I have to learn something new?"
The Elasticsearch approach is trying to leverage the distributed cluster and replication features. Thus we can have a central permanent storage, and just order a temporary ES cluster on our vms with replicates of the needed data. This way ES would handle all the data transport into the computer cluster.
Cassandra is a suggestion by my boss, as it's supposed to be much more scalable than Postgres. Plus it could comfort those that prefer to have a more SQLy API.
But - can you show e.g. the query in PostgreSQL you are trying to run - is it multiple uuids but a short time-period or one uuid and a longer one?
Postgres: The simplest approach is to get all uuids and exactly one daybatch. So a simple:
SELECT t1 FROM table;
Where t1 is the column holding the data for dayset one.
For testing with my (till now) limited fake data (roughly 2% of the full set) I sadly have to go with: SELECT t1, t2 ... t50 FROM table
Depending on testing, I might also go with splitting that one large select into some/many smaller ones. I'm thinking about going by a uuid-hash based split, with indexes set accordingly of course. It's all a question of balancing overhead and reliability. Nothing is final yet.
Are we talking multiple consumers or just one?
At the start one consumer, planned is to have multiple instances of the simulation. Much smaller instances though. The 150k unit one is supposed to be the "single large consumer" assumption.
How often are the queries issued?
As needed, in the full fledged approach that would be every 96 seconds. More often if I switch towards smaller queries.
Is this all happening on the same machine or is the db networked to this Java proxy and if so by what.
At the moment I'm testing on one or two two machines only: preliminary tests are made solely on my workstation, with later tests moving the db to a second machine. They are networking over a standard gigabit LAN.
In the full fledged version the simulation(s) will run on vms on a cluster with the DB having a dedicated strong server for itself.
add the execution plan generated using explain (analyze, verbose)
Had to jury rigg something with a small batch (281 units) only:
Seq Scan on schema.fake_data (cost=0.00..283.81 rows=281 width=18) (actual time=0.033..3.335 rows=281 loops=1)
Output: t1, Planning time: 0.337 ms, Execution time: 1.493 ms
Executing the thing for real: 10 sec for a mere 1.6 Mb.
Now faking a 10k unit by calling t1-t36 (I know, not even close to the real thing):
Seq Scan on opsim.fake_data (cost=0.00..283.81 rows=281 width=681) (actual time=0.012..1.905 rows=281 loops=1)
Output: *, Planning time: 0.836 ms, Execution time: 2.040 ms
Executing the thing for real: 2 min for ~60 Mb.
The problem is definitely not the planning or execution. Neither is it the network, as I get the same slow read on my local system. But heck, even a slow HDD has at LEAST 30 Mb/s, a cheap network 12.5 MB/s ... I know I know, that's brutto, but how come that I get < 1Mb/s out of those dbs? Is there some bandwith limit per connection? Aunt google at least gave me no indication for anything like that.

Cassandra integration with hadoop for read performance

I am using Apache Cassandra for storing around 100 million records. There is one single node with the following specifications-
RAM-32GB, HDD-2TB, Intel quad core processor.
With cassandra there is a read performance problem. For some queries it takes around 40mins for giving the output. After searching for how to improve the read performance i came to know about the following factors-
Compaction strategy,compression techniques, key cache, increase the heap space, turning off the swap space for cassandra.
After doing these optimizations, the performance remains the same. After seraching, I came around for integrating Hadoop with cassandra.Is it the correct way to do the queries in cassandra or any other factors I am missing here??
Thanks.
It looks like you data model could be improved. 40 minutes is something impossible. I download all data from 6 million records (around 10gb) within few minutes. And think it because I convert data in the process of download and store them. Trivial selects must take milliseconds.
Did you build it on the base of queries that you must do ?

Duplicate Key Filtering

I am looking for a distributed solution to screen/filter a large volume of keys in real-time. My application generates over 100 billion records per day, and I need a way to filter duplicates out of the stream. I am looking for a system to store a rolling 10 days’ worth of keys, at approximately 100 bytes per key. I was wondering how this type of large scale problem has been solved before using Hadoop. Would HBase be the correct solution to use? Has anyone ever tried a partially in-memory solution like Zookeeper?
I can see a number of solutions to your problem, but the real-time requirement really narrows it down. By real-time do you mean you want to see if a key is a duplicate as its being created?
Let's talk about queries per second. You say 100B/day (that's a lot, congratulations!). That's 1.15 Million queries per second (100,000,000,000 / 24 / 60 / 60). I'm not sure if HBase can handle that. You may want to think about something like Redis (sharded perhaps) or Membase/memcached or something of that sort.
If you were to do it in HBase, I'd simply push the upwards of a trillion keys (10 days x 100B keys) as the keys in the table, and put some value in there to store it (because you have to). Then, you can just do a get to figure out if the key is in there. This is kind of hokey and doesn't fully utilize hbase as it is only fully utilizing the keyspace. So, effectively HBase is a b-tree service in this case. I don't think this is a good idea.
If you relax the restraint to not have to do real-time, you could use MapReduce in batch to dedup. That's pretty easy: it's just Word Count without the counting. You group by the key you have and then you'll see the dups in the reducer if multiple values come back. With enough nodes an enough latency, you can solve this problem efficiently. Here is some example code for this from the MapReduce Design Patterns book: https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch3/DistinctUserDriver.java
ZooKeeper is for distributed process communication and synchronization. You don't want to be storing trillions of records in zookeeper.
So, in my opinion, you're better served by a in-memory key/value store such as redis, but you'll be hard pressed to store that much data in memory.
I am afraid that it is impossible with traditional systems :|
Here is what U have mentioned:
100 billion per days means approximation 1 million per second!!!!
size of the key is 100 bytes.
U want to check for duplicates in a 10 day working set means 1 trillion items.
These assumptions results in look up in a set of 1 trillion objects that totally size in 90 TERABYTES!!!!!
Any solution to this real-time problem shall provide a system that can look up 1 million items per second in this volume of data.
I have some experience with HBase, Cassandra, Redis, and Memcached. I am sure that U cannot achieve this performance on any disk-based storage like HBase, Cassandra, or HyperTable (and add any RDBMSs like MySQL, PostgreSQl, and... to these). The best performance of redis and memcached that I have heard practically is around 100k operations per second on a single machine. This means that U must have 90 machines each having 1 TERABYTES of RAM!!!!!!!!
Even a batch processing system like Hadoop cannot do this job in less than an hour and I guess it will take hours and days on even a big cluster of 100 machines.
U R talking about very very very big numbers (90 TB, 1M per second). R U sure about this???

Cassandra running out of memory (heap space)

We are experimenting a bit with Cassandra lately (version 1.0.7) and we seem to have some problems with memory. We use EC2 as our test environment and we have three nodes with 3.7G of memory and 1 core # 2.4G, all running Ubuntu server 11.10.
The problem is that the node we hit from our thrift interface dies regularly (approximately after we store 2-2.5G of data). Error message: OutOfMemoryError: Java Heap Space and according to the log it in fact used all of the allocated memory.
The nodes are under relatively constant load and store about 2000-4000 row keys a minute, which are batched through the Trift interface in 10-30 row keys at once (with about 50 columns each). The number of reads is very low with around 1000-2000 a day and only requesting the data of a single row key. The is currently only one used column family.
The initial thought was that something was wrong in the cassandra-env.sh file. So, we specified the variables 'system_memory_in_mb' (3760) and the 'system_cpu_cores' (1) according to our nodes' specification. We also changed the 'MAX_HEAP_SIZE' to 2G and the 'HEAP_NEWSIZE' to 200M (we think the second is related to the Garbage Collection). Unfortunately, that did not solve the issue and the node we hit via thrift keeps on dying regularly.
In case you find this useful, swap is off and unevictable memory seems to be very high on all 3 servers (2.3GB, we usually observe the amount of unevictable memory on other Linux servers of around 0-16KB) (We are not quite sure how the unevictable memory ties into Cassandra, its just something we observed while looking into the problem). The CPU is pretty much idle the entire time. The heap memory is clearly being reduced once in a while according to nodetool, but obviously grows over the limit as time goes by.
Any ideas? Thanks in advance.
cassandra-env.sh defaults are perfect for almost all workloads, so until you know why this is happening best to put them back to their defaults or you may be making things worse without realizing.
I see concurrent reads and writes of 2k/sec/node on our cluster, so 2k-4k writes per minute is very little, although the fact that it's only the node accepting your connections that is dying is a little strange.
If you connect your app to the thrift endpoint on one of the other nodes is it then that one that dies?
Client connections use memory so might be worth double checking you're not connecting too many at a time. "netstat -A inet | grep 9160" on the dying cassandra node should tell you how many client connections you have. Depending heavily on your application you'd expect 10s or 100s rather than 1000s.
What do the writes look like?
Are you writing the same row keys repeatedly and if so are you appending new column names or overwriting the same ones?
How big is each write? Anything else you can tell me?
If you're overwriting the same column names in the same row keys constantly compaction may be struggling.
If you're appending new column names to the same row keys constantly you might be growing your rows too large to fit into memory.
the output of "nodetool -h localhost tpstats" on the dying node might also give some clues as to where you're falling down. Anything constantly pending is probably bad news, especially at such a low write rate.
If you're going to use cassandra in production you should get graphing of the internals to better understand what's going on. jmxtrans and graphite should be your new best friends.
There are some things you can try tweaking. First make sure you dont have row caching on your column family. Also worth while checking the log for errors and tpstats incase something died due to an error and something is getting backed up in a queue. The stack trace of the exception could be meaningful too since there are actually different types of OOMs that could just mean kernel tweaks.
If your just using too much memory per node then you want for the size of your data set try checking the cfstats, you can identify roughly how much space is spent on bloom filters. As you have more rows in a CF this can get linearly larger and is part of the base minimum memory your nodes are going to require.
nodetool cfstats | grep Bloom.*Used | awk '{ SUM += $5} END { print SUM " bytes" }'
Since you dont read very often you can probably increase the false positive rate on them. Each SSTable has a bloom filter it uses to check if a row exists in it or not. You can change with cqlsh
ALTER TABLE MyColumnFamily WITH bloom_filter_fp_chance = 0.1;
After that call an upgrade on that CF (this will be slow) per node
nodetool upgradesstables MyKeyspace MyColumnFamily
There are consequences to this where reads may take longer since there is a 10%-ish (the .1) chance it will check SSTables for rows that dont exist in it, resulting in extra disk seeks.
Another major memory sink if you have column families with large amount of rows is the sampling rate of the index. This can be modified per node level in the cassandra.yaml
http://www.datastax.com/docs/1.1/configuration/node_configuration#index-interval
If you have it set up to take heap dumps on OOM (-XX:+HeapDumpOnOutOfMemoryError on by default I believe) there should be some heap dumps available in the /var/lib/cassandra/data directory. You can open these up in visualvm or whatever tool you like to identify what part of the heaps is where.

Riak performance - unexpected results

In the last days I played a bit with riak. The initial setup was easier then I thought. Now I have a 3 node cluster, all nodes running on the same vm for the sake of testing.
I admit, the hardware settings of my virtual machine are very much downgraded (1 CPU, 512 MB RAM) but still I am a quite surprised by the slow performance of riak.
Map Reduce
Playing a bit with map reduce I had around 2000 objects in one bucket, each about 1k - 2k in size as json. I used this map function:
function(value, keyData, arg) {
var data = Riak.mapValuesJson(value)[0];
if (data.displayname.indexOf("max") !== -1) return [data];
return [];
}
And it took over 2 seconds just for performing the http request returning its result, not counting the time it took in my client code to deserialze the results from json. Removing 2 of 3 nodes seemed to slightly improve the performance to just below 2 seconds, but this still seems really slow to me.
Is this to be expected? The objects were not that large in bytesize and 2000 objects in one bucket isnt that much, either.
Insert
Batch inserting of around 60.000 objects in the same size as above took rather long and actually didnt really work.
My script which inserted the objects in riak died at around 40.000 or so and said it couldnt connect to the riak node anymore. In the riak logs I found an error message which indicated that the node ran out of memory and died.
Question
This is really my first shot at riak, so there is definately the chance that I screwed something up.
Are there any settings I could tweak?
Are the hardware settings too constrained?
Maybe the PHP client library I used for interacting with riak is the limiting factor here?
Running all nodes on the same physical machine is rather stupid, but if this is a problem - how can i better test the performance of riak?
Is map reduce really that slow? I read about the performance hit that map reduce has on the riak mailing list, but if Map Reduce is slow, how are you supposed to perform "queries" for data needed nearly in realtime? I know that riak is not as fast as redis.
It would really help me a lot if anyone with more experience in riak could help me out with some of these questions.
This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.
Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast — probably O(log n), but don't quote me on that — whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.
Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large you bucket is, enumerating the keys in a single bucket of size n will take m time, where m is the total number of keys in all buckets.
These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large amounts of keys. Cassandra also implements true buckets.
A recommendation I'd have now that some time has passed and several new versions of Riak have come about is this. Never rely on full bucket map/reduce, that's not an optimized operation, and chances are very good there are other ways to optimize your map/reduce so you don't have to look through so much data to pull out the singlets you need.
Secondary indices now available in newer versions of Riak are definitely the way to go in this regard. Put an index on the objects you want to find (perhaps named 'ismax_int' with a value of 0 or 1). You can map/reduce a secondary index with hundreds of thousands of keys in microseconds which a full bucket scan would have taken multiple seconds to consider.
I don't have direct experience of Riak, but have worked with Cassandra a little, which is similar.
Firstly, performance will probably depend a lot on the number of cores available, and the memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4GB+ of RAM would be a good starting point.
Secondly, MapReduce is designed for batch processing, not realtime queries.
Riak and all similar Key-Value stores are designed for high write performance, high read performance for simple lookups, no complex querying at all.
Just for comparison, Cassandra on a single node (6 core, 6GB) can do 20,000 individual inserts per second.

Resources