I was listening to this talk on Data modelling in Cassandra. The speakers makes the general statement that 'writes are faster than reads in Cassandra'.
Is this case always true? if so why?
That's still true even though is not a big difference like in past. A write in general perform better because it doesn't involve too much the I/O -- a write operation is completed when the data has been both written in the commit log (file) and in memory (memtable). When the memtable reach the max size then all table is flushed in a disk sstable. Differently a read may require more I/O for different reasons. A read operation first involve reading from a bloom filter (a filter associated to sstable that might save I/O time saying that a data is surely not present in the associated sstable) and then, if filter returns a positive value, Cassandra starts seeking the sstable to look for data. HTH, Carlo
Related
I have a file that could contain about 3 million records. Certain records of this file will need to be updated multiple times throughout the run of the program. If I need to pull specific records from this file, which of the following is more efficient:
Indexed VSAM search
Indexed flat file with a COBOL search all
Buffering all of the data into working storage and writing a loop to handle the search
Obviously, if you can buffer all of the data into memory (and if the host system can support a working-set of pages which is big enough to allow all of it to actually remain in RAM without paging, then this would probably be the fastest possible approach.
But, be very careful to consider "hidden disk-I/O" caused by the virtual-memory paging subsystem! If the requested "in-memory" data is, in fact, not "in memory," a page-fault will occur and your process will stop in its tracks until the page has been retrieved. (And if "page stealing" occurs, well, you're in trouble. Your "in-memory" strategy just turned into a possibly very-inefficient(!) disk-based one. If keys are distributed randomly, then your process has a gigantic working-set that it is accessing randomly. If all of that memory is not actually in memory, and will stay there, you're in trouble.
If you are making updates to a large file, consider sorting the updates-delta file before processing it, so that all occurrences of the same key will be adjacent. You can now write your COBOL program to take advantage of this (and, of course, to abend if an out-of-sequence record is ever detected!). If the key in "this" record is identical to the key of the "previous" one, then you do not need to re-read the record. (And, you do not actually need to write the old record, until the key does change.) As the indexed-file access method is presented with the succession of keys, each key is likely to be "close to" the one previously-requested, such that some of the necessary index-tree pages will already be in-memory. Obviously, you will need to benchmark this, but the amount of time spent sorting the file can be far less than the amount of time spent in index-lookups. (Which actually can be considerable.)
The answer of Mike has the important issue about "hidden I/O" in (depends on the machine, configuration, amount of data)...
If you very likely need to update many records the option Mike suggest is the most useful one.
If you very likely need to update not much records (I'd guess you're likely below 2%) another approach can be quite faster (needs a benchmark !):
read every key via indexed VSAM search
store the changed record in memory (big occurs table), if you will only change some values and the record is quite big then only store all possible changed values + key in the table without an actual REWRITE
before doing a VSAM search: look in your occurs table if you read the key
already, take the values either from there or get a new one
...
at program end: go through your occurs and REQRITE all records (if you have the complete record a REWRITE is enough, otherwise you'd need a READ first to get the complete record)
Performance is often: "know your data and possible program flow, then try the best 2-3 approach, benchmark and decide".
Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over? I believe row_cache_size_in_mb is supposed to cache frequently used records in memory, but setting it to say 10MB seems to make no difference.
I tried cassandra-stress of course, but the highest read throughput it achieves with 1KB records (-col size=UNIFORM\(1000..1000\)) is ~15K/s.
With low numbers like above, I can easily write an in-memory hashmap based cache that will give me at least a million reads per second for a small working set size. How do I make cassandra do this automatically for me? Or is it not supposed to achieve performance close to an in-memory map even for a tiny working set size?
Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over?
There are some solution for this scenario
One idea is to use row cache but be careful, any update/delete to a single column will invalidate the whole partition from the cache so you loose all the benefit. Row cache best usage is for small dataset and are frequently read but almost never modified.
Are you sure that your cassandra-stress scenario never update or write to the same partition over and over again ?
Here are my findings: when I enable row_cache, counter_cache, and key_cache all to sizable values, I am able to verify using "top" that cassandra does no disk I/O at all; all three seem necessary to ensure no disk activity. Yet, despite zero disk I/O, the throughput is <20K/s even for reading a single record over and over. This likely confirms (as also alluded to in my comment) that cassandra incurs the cost of serialization and deserialization even if its operations are completely in-memory, i.e., it is not designed to compete with native hashmap performance. So, if you want get native hashmap speeds for a small-working-set workload but expand to disk if the map grows big, you would need to write your own cache on top of cassandra (or any of the other key-value stores like mongo, redis, etc. for that matter).
For those interested, I also verified that redis is the fastest among cassandra, mongo, and redis for a simple get/put small-working-set workload, but even redis gets at best ~35K/s read throughput (largely independent, by design, of the request size), which hardly comes anywhere close to native hashmap performance that simply returns pointers and can do so comfortably at over 2 million/s.
We have an external service that continuously sends us data. For the sake of simplicity lets say this data has three strings in tab delimited fashion.
datapointA datapointB datapointC
This data is received by one of our servers and then is forwarded to a processing engine where something meaningful is done with this dataset.
One of the requirements of the processing engine is that duplicate results will not be processed by the processing engine. So for instance on day1, the processing engine received
A B C, and on day 243, the same A B C was received by the server. In this particular situation, the processing engine will spit out a warning,"record already processed" and not process that particular record.
There may be a few ways to solve this issue:
Store the incoming data in an in-memory HashSet, and set exculsion
will indicate the processing status of the particular record.
Problems will arise when we have this service running with zero
downtime and depending on the surge of data, this collection can
exceed the bounds of memory. Also, in case of system outages, this
data needs to be persisted someplace.
Store the incoming data in the database and the next set of data will
only be processed if the data is not present in the database. This
helps with the durability of the history in case of some catastrophe
but there's the overhead of maintaing proper-indexes and aggressive
sharding in the case of performance related issues.
....or some other technique
Can somebody point out some case-studies or established patterns or practices to solve this particular issue?
Thanks
you need some kind of backing store, for persistence, whatever the solution. so no matter how much work that has to be implemented. but it doesn't have to be an sql database for something so simple - alternative to memcached that can persist to disk
in addition to that, you could consider bloom filters for reducing the in-memory footprint. these can give false positives, so then you would need to fall back to a second (slower but reliable) layer (which could be the disk store).
and finally, the need for idempotent behaviour is really common in messaging/enterprise systems, so a search like this turns up more papers/ideas (not sure if you're aware that "idempotent" is a useful search term).
You could create a hash of the data and store that in a backing store which would be smaller than the actual data (provided your data isn't smaller than a hash)
If I have large dataset and do random updates then I think updates are mostly disk bounded (in case append only databases there is not about seeks but about bandwidth I think). When I update record slightly one data page must be updated, so if my disk can write 10MB/s of data and page size is 16KB then i can have max 640 random updates per second. In append only databases apout 320 per second bacause one update can take two pages - index and data. In other databases bacause of ranom seeks to update page in place can be even worse like 100 updates per second.
I assume that one page in cache has only one update before write (random updates). Going forward the same will by for random inserts around all data pages (for examle not time ordered UUID) or even worst.
I refer to the situation when dirty pages (after update) must be flushed to disk and synced (can't longer stay in cache). So updates per second count is in this situation disk bandwidth bounded? Are my calculations like 320 updates per second likely? Maybe I am missing something?
"It depends."
To be complete, there are other things to consider.
First, the only thing distinguishing a random update from an append is the head seek involved. A random update will have the head dancing all over the platter, whereas an append will ideally just track like record player. This also assumes that each disk write is the full write and completely independent of all other writes.
Of course, that's in a perfect world.
With most modern databases, each update will typically involve, at a minimum, 2 writes. One for the actual data, the other for the log.
In a typical scenario, if you update a row, the database will make the change in memory. If you commit that row, the database will acknowledge that by making a note in the log, while keeping the actual dirty page in memory. Later, when the database checkpoints it will right the dirty pages to the disk. But when it does this, it will sort the blocks and write them as sequentially as it can. Then it will write a checkpoint to the log.
During recovery when the DB crashed and could not checkpoint, the database reads the log up to the last checkpoint, "rolls it forward" and applies those changes to actual disk page, marks the final checkpoint, then makes the system available for service.
The log write is sequential, the data writes are mostly sequential.
Now, if the log is part of a normal file (typical today) then you write the log record, which appends to the disk file. The FILE SYSTEM will then (likely) append to ITS log that change you just made so that it can update it's local file system structures. Later, the file system will, also, commit its dirty pages and make it's meta data changes permanent.
So, you can see that even a simple append can invoke multiple writes to the disk.
Now consider an "append only" design like CouchDB. What Couch will do, is when you make a simple write, it does not have a log. The file is its own log. Couch DB files grow without end, and need compaction during maintenance. But when it does the write, it writes not just the data page, but any indexes affected. And when indexes are affected, then Couch will rewrite the entire BRANCH of the index change from root to leaf. So, a simple write in this case can be more expensive than you would first think.
Now, of course, you throw in all of the random reads to disrupt your random writes and it all get quite complicated quite quickly. What I've learned though is that while streaming bandwidth is an important aspect of IO operations, overall operations per second are even more important. You can have 2 disks with the same bandwidth, but the one with the slower platter and/or head speed will have fewer ops/sec, just from head travel time and platter seek time.
Ideally, your DB uses dedicated raw storage vs a file system for storage, but most do not do that today. The advantages of file systems based stores operationally typically outweigh the performance benefits.
If you're on a file system, then preallocated, sequential files are a benefit so that your "append only" isn't simply skipping around other files on the file system, thus becoming similar to random updates. Also, by using preallocated files, your updates are simply updating DB data structures during writes rather than DB AND file system data structures as the file expands.
Putting logs, indexes, and data on separate disks allow multiple drives to work simultaneously with less interference. Your log can truly be append only for example compared to fighting with the random data reads or index updates.
So, all of those things factor in to throughput on DBs.
I need to load over 1 billion keys into Berkley DB and therefore I want to tune it in advance to get better performance. With standard configuration it takes me now about 15min to load 1'000'000 keys which is too slow.
Is there a proper way to tune for example the B+Tree of Berkley DB (node size etc...)?
(As an comparision, after tuning tokyo cabinet, it loads 1 billion keys in 25min).
P.S.
I'm looking for tuning tips as a code and not parameters to set for a running system (like jvm size etc...)
I'm curious, when TokyoCabinet loads 1B keys in 25 minutes what are the sizes of the keys/values being stored? What's the I/O systems and the storage system you're using? Are you using the term "load" to mean 1B transactional commits to permanent stable storage? That would be ~666,666 inserts/second, which is physically impossible given any I/O system I'm aware of. Multiply that number times the key and value size and now you're hopelessly beyond physical limits.
Please take a look at Gustavo Duarte's blog, read a bit about I/O systems and how things work in hardware and then review your statement. I'm very interested in finding out what exactly TokyoCabinet is doing and what it isn't doing. If I had to guess I'd say that either it's committing to file-system cache in the operating system, but not flushing (fdsync()-ing) those buffers to disk.
Full Disclosure: I'm a product manager at Oracle for Oracle Berkeley DB (a direct competitor of TokyoCabinet), I've been playing with these databases and the best hardware around for them for about ten years so I'm both biased and skeptical.
Berkeley DB has flags you can set on the transaction handle which mimic this and other similar methods of trading off durability (the "D" in ACID) for speed.
As far as how to make Berkeley DB Java Edition (BDB-JE) faster you can try the following:
Deferred writes: this delays writing
to the transaction log for as long as
possible (when buffers are full, it
flushes the data)
Sort your keys in advance: most
B-Trees (ours included) do much
better with in-order insertions for
fast load times-
Increasing the size of the log
files from the default of 10MiB to
something larger, like 100MiB, this
reduces I/O cost-
It's very important to be clear about claims of performance with databases. They seems simple, but it turns out to be very very tricky to get them right so that they don't ever corrupt data or lose committed transactions.
I hope this helps you a bit.
Bulk inserts on BDB-JE are an order of magnitude faster if you group them into a single transaction. The reason is that each single commit causes (by default) a sync write to disk while a transaction is synchronized on commit. In my application writing 100'000 small keys as single commits tooks more than a minute while in a transaction it takes just a few seconds.