Why does LevelDB needs more than two levels? - leveldb

I think only two levels(level-0 and level-1) is ok, why does LevelDB need level-2, level-3, and more?

I'll point you in the direction of some articles on LevelDB and it's underlying storage structure.
So in the documentation for LevelDB
it discusses merges among levels.
These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
LevelDB is similar in structure to Log Structured Merge Trees. The paper discusses the different levels if you're interested in the analysis of it. If you can get through the mathematics it seems to be your best bet to understanding the data structure.
A much easier to read analysis of levelDB talks about the datastore's relation to LSM Trees but in terms of your questions about the levels all it says is:
Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we will run a process to merge the on-disk SSTables.
Probably the LevelDB documentation provides the best answer: (maximizing the size of the writes and reads, since LevelDB is on-disk(slow seek) data storage).
Good Luck!

I think it is mostly to do with easy and quick merging of levels.
In Leveldb, level-(i+1) has approx. 10 times the data compared to level-i. This is more analogous to a multi-level cache structure where in if the database has 1000 records between keys x1 to x2, then 10 of the most frequently accessed ones in that range would be in level-1 and 100 in the same range would be in level-2 and rest in level-3 (this is not exact but just to give an intuitive idea of levels). In this set up, to merge a file in level-i we need to look at at most 10 files in level-(i+1) and it can all be brought into memory, a quick merge done and written back. These results in reading relatively small chunks of data for each compaction/merging operation.
On the other hand if you had just 2 levels, the key range in one level-0 file could potentially match 1000's of files in level-1 and all of them need to be opened up for merging which is going to be pretty slow. Note that an important assumption here is we have fixed size files (say 2MB). With variable length files in level-1, your idea could still work and I think a variant of that is used in systems like HBase and Cassandra.
Now if you are concern is about look up delay with many levels, again this is like a multi-level cache structure, most recently written data would be in higher levels to help with typical locality of reference.

Level 0 is data in memory other levels are disk data. The important part is that data in levels is sorted. If level1 consists of 3 2Mb files then in file1 it's the keys 0..50 (sorted) in file2 150..200 and in file3 300..400 (as an example). So when memory level is full we need to insert it's data to disk in the most efficient manner, which is sequential writing (using as few disk seeks as possible). Imagine in memory we have keys 60-120, cool, we just write them sequentially as file which becomes file2 in level1. Very efficient!
But now imagine that level1 is much larger then level0 (which is reasonable as level0 is memory). In this case there are many files in level1. And now our keys in memory (60-120) belong to many files as the key range in level1 is very fine grained. Now to merge level0 with level1 we need to read many files and make a lot of random seeks, make new files in memory and write them. So this is where many levels idea kicks in, we'll have many layers, each somewhat larger than the previous (x10), but not much larger so when we have to migrate data from i-1 to i-th layer we have a good chance of having to read least amount of files.
Now, since data might change there may be no need to propagate it to higher more expensive layers (it might be changed or deleted) and so we avoid expensive merges altogether. The data that does end up in the last level is statistically least likely to change so is the best fit for most-expensive-to-merge-with last layer.

Related

Balanced trees and space and time trade-offs

I was trying to solve problem 3-1 for large input sizes given in the following link http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/assignments/MIT6_006F11_ps3_sol.pdf. The solution uses an AVL tree for range queries and that got me thinking.
I was wondering about scalability issues when the input size increases from a million to a billion and beyond. For instance consider a stream of integers (size: 4 bytes) and input of size 1 billion, the space required to store the integers in memory would be ~3GB!! The problem gets worse when you consider other data types such as floats and strings with the input size the order of magnitude under consideration.
Thus, I reached the conclusion that I would require the assistance of secondary storage to store all those numbers and pointers to child nodes of the AVL tree. I was considering storing the left and right child nodes as separate files but then I realized that that would be too many files and opening and closing the files would require expensive system calls and time consuming disk access and thus at this point I realized that AVL trees would not work.
I next thought about B-Trees and the advantage they provide as each node can have 'n' children, thereby reducing the number of files on disk and at the same time packing in more keys at every level. I am considering creating separate files for the nodes and inserting the keys in the files as and when they are generated.
1) I wanted to ask if my approach and thought-process is correct and
2) Whether I am using the right data structure and if B-Trees are the right data structure what should the order be to make the application efficient? What flavour of B Trees would yield maximum efficiency. Sorry for the long post! Thanks in advance for your replies!
Yes, you're reasoning is correct, although there are probably smarter schemes than to store one node per file. In fact, a B(+)-Tree often outperforms a binary search tree in practice (especially for very large collections) for numerous reasons and that's why just about every major database system uses it as its main index structure. Some reasons why binary search trees don't perform too well are:
Relatively large tree height (1 billion elements ~ height of 30 (if perfectly balanced)).
Every comparison is completely unpredictable (50/50 choice), so the hardware can't pre-fetch memory and fill the cpu pipeline with instructions.
After the upper few levels, you jump far away and to unpredictable locations in memory, each possibly requiring accessing the hard drive.
A B(+)-Tree with a high order will always be relatively shallow (height of 3-5) which reduces number of disk accesses. For range queries, you can read consecutively from memory while in binary trees you jump around a lot. Searching in a node may take a bit longer, but practically speaking you are limited by memory accesses not CPU time anyway.
So, the question remains what order to use? Usually, the node size is chosen to be equal to the page size (4-64KB) as optimizing for disk accesses is paramount. The page size is the minimal consecutive chunk of memory your computer may load from disk to main memory. Depending on the size of your key, this will result in a different number of elements per node.
For some help for the implementation, just look at how B+-Trees are implemented in database systems.

Why skiplist memory locality is poor but balanced tree is good?

A guy once challenged antirez(author of Redis) why Redis use skip list for the implementation sorted sets in ycombinator:
I was looking at Redis yesterday and noticed this. Is there any
particular reason you chose skip list instead of btrees except for
simplicity? Skip lists consume more memory in pointers and are
generally slower than btrees because of poor memory locality so
traversing them means lots of cache misses. I also suggested a way to
improve throughput when you guarantee each command's durability (at
the end of the wiki page):
http://code.google.com/p/redis/wiki/AppendOnlyFileHowto Also, have you
thought about accommodating read-only traffic in an additional thread
as a way to utilize at least two cores efficiently while sharing the
same memory?
Then antirez answered:
There are a few reasons: 1) They are not very memory intensive. It's
up to you basically. Changing parameters about the probability of a
node to have a given number of levels will make then less memory
intensive than btrees. 2) A sorted set is often target of many ZRANGE
or ZREVRANGE operations, that is, traversing the skip list as a linked
list. With this operation the cache locality of skip lists is at least
as good as with other kind of balanced trees. 3) They are simpler to
implement, debug, and so forth. For instance thanks to the skip list
simplicity I received a patch (already in Redis master) with augmented
skip lists implementing ZRANK in O(log(N)). It required little changes
to the code. About the Append Only durability & speed, I don't think
it is a good idea to optimize Redis at cost of more code and more
complexity for a use case that IMHO should be rare for the Redis
target (fsync() at every command). Almost no one is using this feature
even with ACID SQL databases, as the performance hint is big anyway.
About threads: our experience shows that Redis is mostly I/O bound.
I'm using threads to serve things from Virtual Memory. The long term
solution to exploit all the cores, assuming your link is so fast that
you can saturate a single core, is running multiple instances of Redis
(no locks, almost fully scalable linearly with number of cores), and
using the "Redis Cluster" solution that I plan to develop in the
future.
I read that carefully, but I can't understand why skip list comes with poor memory locality? And why balanced tree will lead a good memory locality?
In my opinion, memory locality is about storing data in a continuous memory. I think it's true when read data in address x, CPU will load the data in address x+1 into cache(Based on some experiments by C, years ago). So traversal an array will result a high possibility cache hit and we can say array has good memory locality.
But when comes to skip list and balanced tree, both aren't arrays and don't store data continuously. So I think their memory locality are both poor. So could anyone explain a little for me?
Maybe the guy meant that there is only one key value at skip list node (in case of default implementation) and N keys at b-tree node with linear layout. So we can load a bunch of b-tree keys from node into the cache.
you've said:
both aren't arrays and don't store data continuously
but we do. We store data continiously at b-tree node.

max number of couchbase views per bucket

How many views per bucket is too much, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into couchbase buckets, and the performance implications of the views required on the data. My data consists of more than a dozen relational DBs, with at least half with hundreds of millions of rows in a number of tables.
The http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html doc section "using document types" seems to imply having multiple document types in the same bucket is not ideal because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on and allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDMS world. There will be multiple tables/schemas within that "database" and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc) and you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons are related to very different workloads to different datasets or the desire to be able to track the workload to those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had an fairly large amount of views (around 8 design documents and some dd's with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries can make a single index that may appear crowded at first, but is actually very flexible.
The less design documents and views you have, the less disk space, IO and CPU resources you will need. There's never going to be a magic bullet or hard-and-fast guideline number unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide a "universal" rules to give you an exact member.
But based on the best practice document that you have used and some discussion(here) you should be able to design your database/views properly.
Let's start with the last question:
YES the reason why Couchbase advice to have a small number of bucket is for performance - and more importantly resources consumption- reason. I am inviting you to read these blog posts that help to understand what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done by bucket.
So let's now look at the original question:
yes most the time your will organize the design document/and views by type of document.
It is NOT a problem to have all the document "types" in a single(few) buckets, this is in fact the way your work with Couchbase
The most important part to look is, the size of your doc (to see how "long" will be the parsing of the JSON) and how often the document will be created/updated, and also deleted, since the JS code of the view is ONLY executed when you create/change the document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how any views in each document you will have?
In fact the most expensive part is not during the indexing or quering it is more when you have to rebalance the data and indices between nodes (add, remove , failure of nodes)
Finally, but it looks like you already know it, this chapter is quite good to understand how views works (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.

Riak performance - unexpected results

In the last days I played a bit with riak. The initial setup was easier then I thought. Now I have a 3 node cluster, all nodes running on the same vm for the sake of testing.
I admit, the hardware settings of my virtual machine are very much downgraded (1 CPU, 512 MB RAM) but still I am a quite surprised by the slow performance of riak.
Map Reduce
Playing a bit with map reduce I had around 2000 objects in one bucket, each about 1k - 2k in size as json. I used this map function:
function(value, keyData, arg) {
var data = Riak.mapValuesJson(value)[0];
if (data.displayname.indexOf("max") !== -1) return [data];
return [];
}
And it took over 2 seconds just for performing the http request returning its result, not counting the time it took in my client code to deserialze the results from json. Removing 2 of 3 nodes seemed to slightly improve the performance to just below 2 seconds, but this still seems really slow to me.
Is this to be expected? The objects were not that large in bytesize and 2000 objects in one bucket isnt that much, either.
Insert
Batch inserting of around 60.000 objects in the same size as above took rather long and actually didnt really work.
My script which inserted the objects in riak died at around 40.000 or so and said it couldnt connect to the riak node anymore. In the riak logs I found an error message which indicated that the node ran out of memory and died.
Question
This is really my first shot at riak, so there is definately the chance that I screwed something up.
Are there any settings I could tweak?
Are the hardware settings too constrained?
Maybe the PHP client library I used for interacting with riak is the limiting factor here?
Running all nodes on the same physical machine is rather stupid, but if this is a problem - how can i better test the performance of riak?
Is map reduce really that slow? I read about the performance hit that map reduce has on the riak mailing list, but if Map Reduce is slow, how are you supposed to perform "queries" for data needed nearly in realtime? I know that riak is not as fast as redis.
It would really help me a lot if anyone with more experience in riak could help me out with some of these questions.
This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.
Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast — probably O(log n), but don't quote me on that — whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.
Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large you bucket is, enumerating the keys in a single bucket of size n will take m time, where m is the total number of keys in all buckets.
These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large amounts of keys. Cassandra also implements true buckets.
A recommendation I'd have now that some time has passed and several new versions of Riak have come about is this. Never rely on full bucket map/reduce, that's not an optimized operation, and chances are very good there are other ways to optimize your map/reduce so you don't have to look through so much data to pull out the singlets you need.
Secondary indices now available in newer versions of Riak are definitely the way to go in this regard. Put an index on the objects you want to find (perhaps named 'ismax_int' with a value of 0 or 1). You can map/reduce a secondary index with hundreds of thousands of keys in microseconds which a full bucket scan would have taken multiple seconds to consider.
I don't have direct experience of Riak, but have worked with Cassandra a little, which is similar.
Firstly, performance will probably depend a lot on the number of cores available, and the memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4GB+ of RAM would be a good starting point.
Secondly, MapReduce is designed for batch processing, not realtime queries.
Riak and all similar Key-Value stores are designed for high write performance, high read performance for simple lookups, no complex querying at all.
Just for comparison, Cassandra on a single node (6 core, 6GB) can do 20,000 individual inserts per second.

Are random updates disk bound mostly in standard and append only databases?

If I have large dataset and do random updates then I think updates are mostly disk bounded (in case append only databases there is not about seeks but about bandwidth I think). When I update record slightly one data page must be updated, so if my disk can write 10MB/s of data and page size is 16KB then i can have max 640 random updates per second. In append only databases apout 320 per second bacause one update can take two pages - index and data. In other databases bacause of ranom seeks to update page in place can be even worse like 100 updates per second.
I assume that one page in cache has only one update before write (random updates). Going forward the same will by for random inserts around all data pages (for examle not time ordered UUID) or even worst.
I refer to the situation when dirty pages (after update) must be flushed to disk and synced (can't longer stay in cache). So updates per second count is in this situation disk bandwidth bounded? Are my calculations like 320 updates per second likely? Maybe I am missing something?
"It depends."
To be complete, there are other things to consider.
First, the only thing distinguishing a random update from an append is the head seek involved. A random update will have the head dancing all over the platter, whereas an append will ideally just track like record player. This also assumes that each disk write is the full write and completely independent of all other writes.
Of course, that's in a perfect world.
With most modern databases, each update will typically involve, at a minimum, 2 writes. One for the actual data, the other for the log.
In a typical scenario, if you update a row, the database will make the change in memory. If you commit that row, the database will acknowledge that by making a note in the log, while keeping the actual dirty page in memory. Later, when the database checkpoints it will right the dirty pages to the disk. But when it does this, it will sort the blocks and write them as sequentially as it can. Then it will write a checkpoint to the log.
During recovery when the DB crashed and could not checkpoint, the database reads the log up to the last checkpoint, "rolls it forward" and applies those changes to actual disk page, marks the final checkpoint, then makes the system available for service.
The log write is sequential, the data writes are mostly sequential.
Now, if the log is part of a normal file (typical today) then you write the log record, which appends to the disk file. The FILE SYSTEM will then (likely) append to ITS log that change you just made so that it can update it's local file system structures. Later, the file system will, also, commit its dirty pages and make it's meta data changes permanent.
So, you can see that even a simple append can invoke multiple writes to the disk.
Now consider an "append only" design like CouchDB. What Couch will do, is when you make a simple write, it does not have a log. The file is its own log. Couch DB files grow without end, and need compaction during maintenance. But when it does the write, it writes not just the data page, but any indexes affected. And when indexes are affected, then Couch will rewrite the entire BRANCH of the index change from root to leaf. So, a simple write in this case can be more expensive than you would first think.
Now, of course, you throw in all of the random reads to disrupt your random writes and it all get quite complicated quite quickly. What I've learned though is that while streaming bandwidth is an important aspect of IO operations, overall operations per second are even more important. You can have 2 disks with the same bandwidth, but the one with the slower platter and/or head speed will have fewer ops/sec, just from head travel time and platter seek time.
Ideally, your DB uses dedicated raw storage vs a file system for storage, but most do not do that today. The advantages of file systems based stores operationally typically outweigh the performance benefits.
If you're on a file system, then preallocated, sequential files are a benefit so that your "append only" isn't simply skipping around other files on the file system, thus becoming similar to random updates. Also, by using preallocated files, your updates are simply updating DB data structures during writes rather than DB AND file system data structures as the file expands.
Putting logs, indexes, and data on separate disks allow multiple drives to work simultaneously with less interference. Your log can truly be append only for example compared to fighting with the random data reads or index updates.
So, all of those things factor in to throughput on DBs.

Resources