What is Fractal Index (by Tokutek) exactly? - data-structures

There are 3 kinds of IO-optimized write-optimized data structures (not including LSM) mentioned in connection to Fractal Index (by Tokutek):
1) Buffered repository trees of any kind. Related publications with the same idea:
http://www.cc.gatech.edu/~bader/COURSES/GATECH/CSE-Algs-Fall2013/papers/Arg03.pdf
http://cs.slu.edu/~goldwasser/publications/SODA2000.pdf
2) COLA (cache-oblivious lookahead (forward pointers) array).
http://supertech.csail.mit.edu/papers/sbtree.pdf
3) Shuttle trees:
http://supertech.csail.mit.edu/papers/sbtree.pdf
What data structure is actually called a "Fractal Tree Index"?
How exactly are COLAs used in real software? Is a COLA used as a small buffer for a buffered tree, or does it handle terabytes of data in real applications, similar to LSM? Why would someone prefer a COLA over a buffered tree? How does it differ from LSM at terabyte scale?
Speaking of the buffered tree by Lars Arge: as far as I understand, the "buffers" may be stored in external memory and may be as large as the entire RAM; the only requirement is that a buffer fits into memory for sorting before it is pushed one level down. Is that right?
Why would someone prefer such large external-memory "buffers" over smaller buffers of size B on every internal node?

Related

How to determine the optimal capacity for Quadtree subdivision?

I've created a flocking simulation using the Boids algorithm and have integrated a quadtree for optimization. Boids are inserted into a quadtree node as long as it has not yet reached its boid capacity. Once the node has reached its capacity, it subdivides into four smaller quadtrees, and the remaining boids are inserted into those, recursively.
The performance seems to get better if I increase the capacity from its default 4 to one that is capable of holding more boids like 20, and I was just wondering if there is any sort of rule or methodology that goes into picking the optimal capacity formulaically.
You can view the site live here or the source code here if relevant.
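The insertion scheme described above can be sketched as follows. This is a minimal illustration and not the asker's actual code: the class name `Quadtree`, the `capacity` parameter, the `count` and `depth` helpers, and the `build` function are all assumptions made here, and points are plain `(x, y)` tuples rather than boids.

```python
class Quadtree:
    """Hypothetical point quadtree covering the square [x, x+size) x [y, y+size)."""

    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity
        self.points = []      # points stored directly in this node
        self.children = None  # four sub-quadrants, created on first overflow

    def contains(self, px, py):
        return (self.x <= px < self.x + self.size
                and self.y <= py < self.y + self.size)

    def insert(self, px, py):
        if not self.contains(px, py):
            return False
        if len(self.points) < self.capacity:
            self.points.append((px, py))
            return True
        if self.children is None:
            # Node is full: subdivide into four equal quadrants.
            half = self.size / 2
            self.children = [
                Quadtree(self.x + dx * half, self.y + dy * half, half, self.capacity)
                for dy in (0, 1) for dx in (0, 1)
            ]
        # Remaining points are handed down to whichever child contains them.
        return any(child.insert(px, py) for child in self.children)

    def count(self):
        below = sum(c.count() for c in self.children) if self.children else 0
        return len(self.points) + below

    def depth(self):
        if self.children is None:
            return 1
        return 1 + max(c.depth() for c in self.children)


def build(points, capacity):
    """Build a tree over a 64 x 64 area (an arbitrary size for this sketch)."""
    tree = Quadtree(0, 0, 64, capacity)
    for px, py in points:
        tree.insert(px, py)
    return tree
```

Building the tree over the same points with different `capacity` values and comparing `depth()` (or, better, measured query time) is one empirical way to choose the capacity for a given data set.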
I'd assume it very much depends on your implementation, hardware, and the data characteristics.
Implementation:
An extreme case would be using GPU processing to compare entries. If you support that, having very large nodes, potentially just a single node containing all entries, may be faster than any other solution.
Hardware:
Cache size and Bus speed will play a big role, also depending on how much memory every node and every entry consumes. Accessing a sub-node that is not cached is obviously expensive, so you may want to increase the size of nodes in order to reduce sub-node traversal.
-> Coming back to implementation, storing the whole quadtree on a contiguous segment of memory can be very beneficial.
Data characteristics:
Clustered data: Having strongly clustered data can have adverse effect on performance because it may cause the tree to become very deep. In this case, increasing node size may help.
Large amounts of data: with enough data you may cross the threshold beyond which not everything fits into a cache. In this case, making nodes larger will save memory because you will have fewer nodes, and everything may fit into the cache again.
In my experience I found that 10-50 entries per node gives the best performance across different datasets.
If you update your tree a lot, you may want to define separate split and merge thresholds to avoid 'flickering', i.e. frequent merging and splitting of nodes. For example, split nodes with more than 25 entries but merge them only when they drop below 15 entries.
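The gap between the two thresholds can be sketched as a small decision function. The names `SPLIT_ABOVE`, `MERGE_BELOW`, and `rebalance_action` are hypothetical, and the values 25 and 15 are just the example numbers from the text, not tuned constants.

```python
SPLIT_ABOVE = 25  # split a leaf once it holds more entries than this
MERGE_BELOW = 15  # merge children back once their total drops below this


def rebalance_action(entry_count, is_leaf):
    """Return 'split', 'merge', or 'keep' for a node (hypothetical helper).

    For a leaf, entry_count is the number of entries it stores; for an
    inner node it is the total number of entries in its subtree.
    """
    if is_leaf and entry_count > SPLIT_ABOVE:
        return "split"
    if not is_leaf and entry_count < MERGE_BELOW:
        return "merge"
    return "keep"
```

A node whose entry count hovers between 15 and 25 is left alone in either state, which is what prevents repeated splitting and merging.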
If you are interested in a quadtree-like structure that avoids degenerated 'deep' quadtrees, have a look at my PH-Tree. It is structured like a quadtree but operates on bit-level, so maximum depth is strictly limited to 64 or 32, depending on how many bits your data has. In practice the depth will rarely exceed 10 levels or so, even for very dense data. Note: A plain PH-Tree is a key-value 'map' in the sense that every coordinate (=key) can only have one entry (=value). That means you need to store lists or sets of entries in case you expect more than one entry for any given coordinate.

Check for duplicate input items in a data-intensive application

I have to build a server-side application that receives a stream of integers of up to nine decimal digits as input and has to write each of them to a log file. The input data is totally random. One of the requirements is that the application must not write duplicate items to the log file, and it should periodically report the number of duplicate items found.
Performance is a critical aspect of this application, as it should be able to handle high loads (and parallel work), so I would like to find a proper way to keep track of duplicate entries; checking the whole log (text) file on every write is certainly not suitable. I can think of a solution that maintains some sort of in-memory data structure tracking all the data processed so far, but as the volume of input data can be really high, I don't think that is the best way to do it either...
Any idea?
Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a huge bitmap in memory with one bit per possible value. For integers of up to nine decimal digits that is 1 billion bits, or about 125 MB (roughly 119 MiB) of RAM. Since this data structure is much larger than the CPU caches, memory accesses may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of memory latency. Parallel accesses can be done safely using atomic bitwise operations.
To check whether a value has been seen before, read its bit in the bitmap and then set it (atomically, as a single fetch-or, if done in parallel).
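A minimal single-threaded sketch of the bitmap check, assuming the values are non-negative integers below `n_values`. The names `Bitmap`, `test_and_set`, and `filter_duplicates` are made up for this illustration; the default of `10**9` corresponds to integers of at most nine decimal digits. Python has no atomic bit operations, so this version is not safe for parallel use as written.

```python
class Bitmap:
    """One bit per possible value; hypothetical helper, single-threaded only."""

    def __init__(self, n_values):
        self.bits = bytearray((n_values + 7) // 8)

    def test_and_set(self, value):
        """Mark value as seen; return True if it had been seen before."""
        index, mask = value >> 3, 1 << (value & 7)
        was_set = bool(self.bits[index] & mask)
        self.bits[index] |= mask
        return was_set


def filter_duplicates(values, n_values=10**9):
    """Return (unique values in arrival order, number of duplicates)."""
    seen = Bitmap(n_values)
    unique, duplicates = [], 0
    for v in values:
        if seen.test_and_set(v):
            duplicates += 1
        else:
            unique.append(v)
    return unique, duplicates
```

In a real server the unique values would be written to the log instead of collected in a list, and the duplicate counter would be reported periodically.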
If you know that your stream contains fewer than one million integers, or the stream of random integers is not uniformly distributed, you can use a hash set instead, as it stores the data more compactly (in sequential code).
Bloom filters can help speed up the filtering when the number of values in the stream is quite large and there are very few duplicates (this method has to be combined with another approach if you want deterministic results).
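For illustration, a Bloom filter can be sketched in a few lines. This is only a rough sketch under assumptions: the class name `BloomFilter`, the choice of SHA-256 with a seed prefix as the hash family, and the parameters `n_bits` and `n_hashes` are all picked here for simplicity, not for speed.

```python
import hashlib


class BloomFilter:
    """Hypothetical Bloom filter: no false negatives, some false positives."""

    def __init__(self, n_bits, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, value):
        # Derive n_hashes bit positions by hashing the value with different seeds.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, value):
        # False means "definitely new"; True means "possibly seen before".
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(value))
```

When `might_contain` returns False the value is certainly new and can be logged right away; only a True answer needs to be confirmed against an exact structure, which is why the filter has to be combined with another approach.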
Here is an example using hash-sets in Python:
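seen = set()                     # Set of values seen so far
duplicates = 0                   # Number of duplicates, for the periodic report
for value in inputStream:        # Iterate over the values of the stream
    if value not in seen:        # O(1) average-case lookup
        log.write(f"{value}\n")  # First occurrence: write it to the log
        seen.add(value)          # O(1) average-case insertion
    else:
        duplicates += 1          # Already logged: only count it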

Balanced trees and space and time trade-offs

I was trying to solve problem 3-1 for large input sizes given in the following link http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/assignments/MIT6_006F11_ps3_sol.pdf. The solution uses an AVL tree for range queries and that got me thinking.
I was wondering about scalability issues when the input size increases from a million to a billion and beyond. For instance, consider a stream of integers (4 bytes each) and an input of 1 billion values: the space required to store the integers in memory would be ~4 GB!! The problem gets worse when you consider other data types such as floats and strings at input sizes of this order of magnitude.
Thus, I reached the conclusion that I would need secondary storage to hold all those numbers and the pointers to the child nodes of the AVL tree. I considered storing the left and right child nodes as separate files, but then I realized that this would mean too many files, and that opening and closing them would require expensive system calls and time-consuming disk accesses. At this point I realized that AVL trees would not work.
I next thought about B-Trees and the advantage they provide as each node can have 'n' children, thereby reducing the number of files on disk and at the same time packing in more keys at every level. I am considering creating separate files for the nodes and inserting the keys in the files as and when they are generated.
1) I wanted to ask whether my approach and thought process are correct, and
2) whether I am using the right data structure. If B-Trees are the right data structure, what should the order be to make the application efficient? Which flavour of B-Tree would yield maximum efficiency? Sorry for the long post! Thanks in advance for your replies!
Yes, your reasoning is correct, although there are probably smarter schemes than storing one node per file. In fact, a B(+)-Tree often outperforms a binary search tree in practice (especially for very large collections) for numerous reasons, and that's why just about every major database system uses it as its main index structure. Some reasons why binary search trees don't perform too well are:
Relatively large tree height (1 billion elements ~ height of 30 (if perfectly balanced)).
Every comparison is completely unpredictable (50/50 choice), so the hardware can't pre-fetch memory and fill the cpu pipeline with instructions.
After the upper few levels, you jump far away and to unpredictable locations in memory, each possibly requiring accessing the hard drive.
A B(+)-Tree with a high order will always be relatively shallow (height of 3-5) which reduces number of disk accesses. For range queries, you can read consecutively from memory while in binary trees you jump around a lot. Searching in a node may take a bit longer, but practically speaking you are limited by memory accesses not CPU time anyway.
So, the question remains what order to use? Usually, the node size is chosen to be equal to the page size (4-64KB) as optimizing for disk accesses is paramount. The page size is the minimal consecutive chunk of memory your computer may load from disk to main memory. Depending on the size of your key, this will result in a different number of elements per node.
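As a rough back-of-the-envelope sketch, the number of keys per node and the resulting tree height can be estimated as follows. The function names and the 8-byte child pointer are assumptions made here, and node headers and fill factors are ignored, so real numbers will be somewhat lower.

```python
def keys_per_node(page_bytes, key_bytes, pointer_bytes=8):
    """Approximate fanout: how many (key, child pointer) pairs fit in one page."""
    return page_bytes // (key_bytes + pointer_bytes)


def min_height(n_keys, fanout):
    """Smallest height h such that fanout**h >= n_keys."""
    height, reach = 0, 1
    while reach < n_keys:
        reach *= fanout
        height += 1
    return height


fanout = keys_per_node(4096, 4)             # 4 KB page, 4-byte integer keys
binary_height = min_height(10**9, 2)        # perfectly balanced binary tree
btree_height = min_height(10**9, fanout)    # B-tree with one page per node
```

With these assumptions a 4 KB page holds 341 keys, so one billion keys need a height of 4 instead of the 30 levels of a perfectly balanced binary tree.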
For some help for the implementation, just look at how B+-Trees are implemented in database systems.

Why skiplist memory locality is poor but balanced tree is good?

Someone once challenged antirez (the author of Redis) on ycombinator about why Redis uses a skip list to implement sorted sets:
I was looking at Redis yesterday and noticed this. Is there any
particular reason you chose skip list instead of btrees except for
simplicity? Skip lists consume more memory in pointers and are
generally slower than btrees because of poor memory locality so
traversing them means lots of cache misses. I also suggested a way to
improve throughput when you guarantee each command's durability (at
the end of the wiki page):
http://code.google.com/p/redis/wiki/AppendOnlyFileHowto Also, have you
thought about accommodating read-only traffic in an additional thread
as a way to utilize at least two cores efficiently while sharing the
same memory?
Then antirez answered:
There are a few reasons: 1) They are not very memory intensive. It's
up to you basically. Changing parameters about the probability of a
node to have a given number of levels will make then less memory
intensive than btrees. 2) A sorted set is often target of many ZRANGE
or ZREVRANGE operations, that is, traversing the skip list as a linked
list. With this operation the cache locality of skip lists is at least
as good as with other kind of balanced trees. 3) They are simpler to
implement, debug, and so forth. For instance thanks to the skip list
simplicity I received a patch (already in Redis master) with augmented
skip lists implementing ZRANK in O(log(N)). It required little changes
to the code. About the Append Only durability & speed, I don't think
it is a good idea to optimize Redis at cost of more code and more
complexity for a use case that IMHO should be rare for the Redis
target (fsync() at every command). Almost no one is using this feature
even with ACID SQL databases, as the performance hint is big anyway.
About threads: our experience shows that Redis is mostly I/O bound.
I'm using threads to serve things from Virtual Memory. The long term
solution to exploit all the cores, assuming your link is so fast that
you can saturate a single core, is running multiple instances of Redis
(no locks, almost fully scalable linearly with number of cores), and
using the "Redis Cluster" solution that I plan to develop in the
future.
I read that carefully, but I can't understand why a skip list has poor memory locality, and why a balanced tree leads to good memory locality.
In my opinion, memory locality is about storing data in contiguous memory. I believe that when the CPU reads data at address x, it also loads the data at address x+1 into the cache (based on some experiments I did in C years ago). So traversing an array results in a high probability of cache hits, and we can say that an array has good memory locality.
But when it comes to skip lists and balanced trees, both aren't arrays and don't store data continuously. So I think the memory locality of both is poor. Could anyone explain this to me?
Maybe the commenter meant that a skip list node holds only one key (in the default implementation), whereas a B-tree node holds N keys in a linear layout. So a bunch of B-tree keys from one node can be loaded into the cache at once.
You said:
both aren't arrays and don't store data continuously
but a B-tree does: the keys within a B-tree node are stored contiguously.
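The difference can be illustrated by counting how many separately allocated nodes a range scan has to visit. This is a simplified model rather than a benchmark: `Node`, `build_chain`, and `scan` are hypothetical names, one key per node stands in for the bottom level of a skip list, and 64 keys per node stands in for B-tree leaves.

```python
class Node:
    """A separately allocated node holding one or more keys."""

    def __init__(self, keys):
        self.keys = keys  # keys stored together inside this node
        self.next = None  # pointer to the next node in key order


def build_chain(sorted_keys, keys_per_node):
    """Link nodes in key order, packing keys_per_node keys into each."""
    head = tail = None
    for i in range(0, len(sorted_keys), keys_per_node):
        node = Node(sorted_keys[i:i + keys_per_node])
        if head is None:
            head = node
        else:
            tail.next = node
        tail = node
    return head


def scan(head):
    """Return (keys in order, number of nodes dereferenced)."""
    keys, hops, node = [], 0, head
    while node is not None:
        keys.extend(node.keys)
        hops += 1
        node = node.next
    return keys, hops
```

Scanning 1000 keys visits 1000 nodes with one key per node but only 16 nodes with 64 keys per node; each node visit is a pointer dereference that may miss the cache, while keys inside a node sit next to each other.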

Why does LevelDB need more than two levels?

I think only two levels (level-0 and level-1) would be enough. Why does LevelDB need level-2, level-3, and more?
I'll point you in the direction of some articles on LevelDB and its underlying storage structure.
The LevelDB documentation discusses merges among levels:
These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
LevelDB is similar in structure to Log-Structured Merge Trees. The paper discusses the different levels if you're interested in the analysis. If you can get through the mathematics, it seems to be your best bet for understanding the data structure.
A much easier-to-read analysis of LevelDB talks about the datastore's relation to LSM Trees, but regarding your question about the levels all it says is:
Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we will run a process to merge the on-disk SSTables.
Probably the LevelDB documentation provides the best answer: maximizing the size of the writes and reads, since LevelDB is an on-disk (slow seek) data store.
Good Luck!
I think it is mostly to do with easy and quick merging of levels.
In LevelDB, level-(i+1) has approx. 10 times the data of level-i. This is analogous to a multi-level cache structure: if the database has 1000 records between keys x1 and x2, then 10 of the most frequently accessed ones in that range would be in level-1, 100 in the same range would be in level-2, and the rest in level-3 (this is not exact but just gives an intuitive idea of levels). In this setup, to merge a file in level-i we need to look at at most about 10 files in level-(i+1), and it can all be brought into memory, quickly merged, and written back. This results in reading relatively small chunks of data for each compaction/merging operation.
On the other hand if you had just 2 levels, the key range in one level-0 file could potentially match 1000's of files in level-1 and all of them need to be opened up for merging which is going to be pretty slow. Note that an important assumption here is we have fixed size files (say 2MB). With variable length files in level-1, your idea could still work and I think a variant of that is used in systems like HBase and Cassandra.
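The difference in merge cost can be sketched by counting how many files of the next level a given key range touches. This is a simplified model: `overlapping_files` is a hypothetical helper, and it assumes integer keys and fixed-size files that split the key space `[0, key_space)` evenly, with `key_space` divisible by `n_files`.

```python
def overlapping_files(lo, hi, n_files, key_space):
    """Count files in a level whose key range intersects [lo, hi).

    Assumes the level splits [0, key_space) evenly into n_files sorted files.
    """
    width = key_space // n_files   # number of keys covered by each file
    first = lo // width            # index of the file containing lo
    last = (hi - 1) // width       # index of the file containing hi - 1
    return last - first + 1


key_space = 1_000_000
# Multi-level: the next level has 10x the files, so one of the 10 files in
# this level (100_000 keys wide) overlaps only a handful of files below it.
next_level = overlapping_files(0, 100_000, 100, key_space)
# Two levels only: the same range merged straight into a level of 10_000 files.
two_levels = overlapping_files(0, 100_000, 10_000, key_space)
```

Under these assumptions the multi-level merge reads 10 files, while the two-level merge has to open 1000.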
Now, if your concern is lookup delay with many levels: again, this is like a multi-level cache structure. The most recently written data sits in the upper (lowest-numbered) levels, which helps with typical locality of reference.
Level 0 is data in memory; the other levels are data on disk. The important part is that the data in each level is sorted. If level 1 consists of three 2 MB files, then file1 might hold the keys 0..50 (sorted), file2 150..200, and file3 300..400 (as an example). So when the memory level is full, we need to move its data to disk in the most efficient manner, which is sequential writing (using as few disk seeks as possible). Imagine we have keys 60-120 in memory: cool, we just write them sequentially as a file, which becomes file2 in level 1. Very efficient!
But now imagine that level 1 is much larger than level 0 (which is reasonable, as level 0 is memory). In this case there are many files in level 1, and our keys in memory (60-120) now belong to many files, as the key ranges in level 1 are very fine-grained. To merge level 0 with level 1 we need to read many files with a lot of random seeks, build new files in memory, and write them. This is where the idea of many levels kicks in: we have many layers, each somewhat larger than the previous one (x10) but not much larger, so when we have to migrate data from layer i-1 to layer i we have a good chance of having to read only a small number of files.
Now, since data might change, there may be no need to propagate it to the higher-numbered, more expensive layers (it might be changed or deleted before it gets there), and so we avoid expensive merges altogether. The data that does end up in the last level is statistically the least likely to change, so it is the best fit for the last, most-expensive-to-merge-with layer.
