I have a tree of depth 3 with a high branching factor. Say, for example, that the first level contains all the taxonomic genera, the second level all the species, and the third level the data regarding each species.
Or graphically:
   genus1              genus2        ...        genus70000
  /   |   \           /   |   \                /     |     \
sp1  sp2  sp3       sp4  sp5  sp6          sp330k sp330k+1 sp330k+2
 |    |    |         |    |    |              |       |        |
data1 data2 data3  data4 data5 data6    data330k data330k+1 data330k+2
In reality there are about five species per genus on average, not 3, but it doesn't really matter. I want to store this data in such a manner as to support the following operations in O(1) (assuming that the number of species in each genus is constant):
Get data related to species s
Get data for all species in genus g
Insert to species s whose genus is g with data d
My current implementation is a hash map from each genus to a list of pairs, where each pair holds a species belonging to that genus and the data associated with that species. In this scheme operations 2 and 3 run in O(1), but operation 1 must iterate over all the genera in order to find the one that contains species s.
I was wondering what would be a better data structure for this.
Edit
Solving this problem while doubling the memory required is easy. I could just store a separate hash map from the species to their data. It would be nice if I could do this without storing the tree twice.
P.S.
I am writing in Java 7, if it makes any difference.
You can have an extra HashMap that maps from s to g.
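For illustration, here is a minimal sketch of that layout (hypothetical class and method names, Java 7 compatible). The second map stores only the species-to-genus edge, so the per-species data itself is not duplicated:

import java.util.HashMap;
import java.util.Map;

// Sketch: one map per lookup direction; species and genus names are assumed to be unique strings.
class TaxonomyIndex<D> {
    private final Map<String, Map<String, D>> dataByGenus = new HashMap<>(); // g -> (s -> d)
    private final Map<String, String> genusOfSpecies = new HashMap<>();      // s -> g

    // Operation 3: insert species s of genus g with data d -- expected O(1).
    void insert(String genus, String species, D data) {
        Map<String, D> members = dataByGenus.get(genus);
        if (members == null) {                         // Java 7: no computeIfAbsent yet
            members = new HashMap<String, D>();
            dataByGenus.put(genus, members);
        }
        members.put(species, data);
        genusOfSpecies.put(species, genus);
    }

    // Operation 1: data of species s -- expected O(1) via the extra map.
    D dataOf(String species) {
        String genus = genusOfSpecies.get(species);
        return genus == null ? null : dataByGenus.get(genus).get(species);
    }

    // Operation 2: all species of genus g with their data -- O(size of the genus).
    Map<String, D> speciesIn(String genus) {
        return dataByGenus.get(genus);
    }
}

The extra map costs one string-to-string entry per species, which is much cheaper than duplicating the whole species-to-data mapping.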
You could try a multi-dimensional data structure with 3 dimensions for data, genus and species.
For example: kd-Tree, R-tree, or PH-Tree.
Not sure how well these work, though.
The PH-Tree works best with larger datasets of 10^6 entries or more. But it is largely insensitive to the number of dimensions, because they are internally processed as 'transposed' 64-bit strings.
Anyway, you would have to try it out.
Disclaimer: PH-Tree is my own data structure.
Related
I'm a bit confused. I cannot find any information about how to execute a range query against a sorted string table.
LevelDB and RocksDB support a range iterator which allows you to query over a range of keys, which is perfect for NoSQL. What I don't understand is how it is implemented to be efficient.
The tables are sorted in memory (and on disk) - what algorithm or data structure allows one to query a Sorted String Table efficiently in a range query? Do you just loop through the entries and rely on the cache lines being full of data?
Usually I would put a prefix tree in front, and this gives me indexing of keys. But I am guessing Sorted String Tables do something different and take advantage of sorting in some way.
Each layer of the LSM (except for the first one) is internally sorted by the key, so you can just keep an iterator into each layer and use the one pointing to the lexicographically smallest element. The files of a layer look something like this on disk:
Layer N
---------------------------------------
File1 | File2 | File3 | ... | FileN <- filename
n:File2 |n:File3|n:File4| ... | <- next file
a-af | af-b | b-f | ... | w-z <- key range
---------------------------------------
aaron | alex | brian | ... | walter <- value omitted for brevity, but these are key:value records
abe | amy | emily | ... | xena
... | ... | ... | ... | ...
aezz | azir | erza | ... | zoidberg
---------------------------------------
First Layer (either 0 or 1)
---------------------------------------
File1 | File2 | ... | FileK
alex | amy | ... | andy
ben | chad | ... | dick
... | ... | ... | ...
xena | yen | ... | zane
---------------------------------------
...
Assume that you are looking for everything in the range ag-d (exclusive). A "range scan" is just to find the first matching element and then iterate the files of the layer. So you find that File2 is the first to contain any matching elements, and scan up to the first element starting with 'ag'. You iterate over File2, then look at the next file for File2 (n:File3). You check the key-range it contains and find that it contains more elements from the range you are interested in, so you iterate it until you hit the first entry starting with 'd'. You do the same thing in every layer, except the first. The first layer has files which are not sorted among each other, but they are internally sorted, so you can just keep an iterator per file. You also keep one more for the current memtables (in-memory data, only persisted in a log).
This never becomes too expensive, because the first layer is typically compacted on a small constant threshold. As the files in every layer are sorted and the files are internally sorted by the key too, you can just advance the smallest iterator until all iterators are exhausted. Apart from the initial search, every step has to look at a fixed number of iterators (assuming a naive approach) and is thus O(1). Most LSMs employ a block cache, and thus the sequential reads typically hit the cache most of the time.
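To make the merge step concrete, here is a rough Java sketch of a range scan over several already-sorted runs (files or the memtable); the run and entry types are simplified assumptions, and newest-version-wins deduplication across layers is left out:

import java.util.*;

// Sketch: k-way merge of sorted runs restricted to [startInclusive, endExclusive).
final class MergingRangeScan implements Iterator<Map.Entry<String, String>> {

    private final PriorityQueue<Run> heap =
            new PriorityQueue<>(Comparator.comparing((Run r) -> r.head.getKey()));
    private final String endExclusive;

    MergingRangeScan(List<Iterator<Map.Entry<String, String>>> sortedRuns,
                     String startInclusive, String endExclusive) {
        this.endExclusive = endExclusive;
        for (Iterator<Map.Entry<String, String>> it : sortedRuns) {
            Run r = new Run(it);
            // Position each run at the first key >= startInclusive (a real LSM seeks via the file index).
            while (r.head != null && r.head.getKey().compareTo(startInclusive) < 0) r.advance();
            if (r.head != null) heap.add(r);
        }
    }

    @Override public boolean hasNext() {
        return !heap.isEmpty() && heap.peek().head.getKey().compareTo(endExclusive) < 0;
    }

    @Override public Map.Entry<String, String> next() {
        if (!hasNext()) throw new NoSuchElementException();
        Run r = heap.poll();                    // run with the lexicographically smallest head
        Map.Entry<String, String> out = r.head;
        r.advance();
        if (r.head != null) heap.add(r);        // re-insert with its new head
        return out;
    }

    private static final class Run {
        final Iterator<Map.Entry<String, String>> it;
        Map.Entry<String, String> head;
        Run(Iterator<Map.Entry<String, String>> it) { this.it = it; advance(); }
        void advance() { head = it.hasNext() ? it.next() : null; }
    }
}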
Last but not least, be aware that this is mostly a conceptual explanation, because most implementations have a few extra tricks up their sleeves that make these things more efficient. You have to know which data is contained in which file-ranges anyway when you do a major compaction, i.e., merge layer N into layer N + 1. Even the file-level operation may look quite different: RocksDB, e.g., maintains a coarse index with the key offsets at the beginning of each file to avoid scanning over the often much larger key/value pair portion of the file.
I have 50M different texts as input from which the top (up to) 10 most relevant tags have been extracted.
There are ~100K distinct tags
I would like to develop an algorithm that, given a text id T1 as input (present in the original input data set), computes the most related text id T2, where T2 is the text that has the most tags in common with T1.
id | tags
-------------
1 | A,B,C,D
2 | B,D,E,F
3 | A,B,D,E
4 | B,C,E
In the example above, the most similar id to 1 is 3, as they have 3 tags in common (A, B and D).
This seems to be the same algorithm that shows the most related questions in StackOverflow.
My first idea was to map both texts and tags to integers to build a big (50M * 100K) binary matrix that is very sparse.
This matrix fits in memory, but I do not know how to use it.
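To make that idea concrete, here is one possible in-memory form of the sparse matrix (hypothetical names): tags are interned to integer ids and each row is kept as a sorted int array, so the overlap of two texts is a linear merge of two short arrays. This only sketches the representation, not the full nearest-neighbour search:

import java.util.*;

// Sketch: sparse rows of the 50M x 100K binary matrix, one sorted int[] of tag ids per text.
class TagMatrix {
    private final Map<String, Integer> tagIds = new HashMap<>();
    private final List<int[]> rows = new ArrayList<>();          // rows.get(textId) = sorted tag ids

    int addText(Collection<String> tags) {
        int[] row = new int[tags.size()];
        int i = 0;
        for (String t : tags) {
            Integer id = tagIds.get(t);
            if (id == null) { id = tagIds.size(); tagIds.put(t, id); }
            row[i++] = id;
        }
        Arrays.sort(row);
        rows.add(row);
        return rows.size() - 1;                                   // the new text id
    }

    // Number of tags two texts share: merge-walk of two sorted arrays, ~10 + 10 steps here.
    int commonTags(int textA, int textB) {
        int[] a = rows.get(textA), b = rows.get(textB);
        int i = 0, j = 0, common = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { common++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return common;
    }
}

With about 10 tags per text, the rows hold roughly 50M * 10 ints, i.e. on the order of 2 GB of tag ids.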
As this is for a web application, I would like to deliver the result in real time conditions (at most a few ms, with possible multi-threading).
My main languages are Scala and Java.
Thanks for your help
Background:
I have read that many DBMSs use write-ahead logging to preserve the atomicity and durability of transactions by storing updates as a group of write operations. What I'm trying to accomplish is to create a DBMS model with improved concurrency by allowing reads to proceed on 'old' data while writes are pending.
Question:
Is there a data structure that allows me to efficiently (ideally O(1) amortized, at most O(log(n))) look up array elements (or memory locations, if you like), which may or may not have been overwritten by write actions, in reference to some point in time? This would be for about 1TB of data total.
Here is some ascii art to make this a little clearer. The dashes are data, with version 0 being the oldest version. The arrows indicate write operations.
^ ___________________________________Snapshot 2
| V | | V
| -- --- | | -------- Version 2
| | | __________________Snapshot 1
| V | | V
T| -------- | | --------- Version 1
I| | | ___________Snapshot 0
M| V V V V
E|------------------------------------- Version 0
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~>
SPACE/ADDRESS
Attempts at solution:
Let N be the data size, M be the number of versions, and P be the average number of updates per version.
The naive algorithm (searching each update) is O(M*P).
Dividing the data into buckets, updating only entire buckets, and searching a bitmask of buckets would be O(N/B*M), where B is bucket size, which isn't much better.
A Bloom filter seems like a good candidate at first glance, except that it requires more data than a simple bitmask of each memory location (which would be bad anyway, since it requires M*N/8 bytes to store.)
A standard hash table also comes to mind, but what would the key be?
Actually, now that I've gone to the trouble of writing this all up, I've thought of a solution that uses a binary search tree. I'll submit it as an answer in a bit, but it's still O(M*log2(P)) in space and time which is not ideal. See below.
The following is the best solution I could come up with, though it is still suboptimal.
The idea is to place each region into a binary search tree, one tree per version, where each inner node contains a memory location, and each leaf node is either Hit or Miss (and possibly lookup information), depending on if updated data exists there. This is O(P*log(P)) to construct for each version, and O(M*log(P)) to look up in.
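Concretely, the lookup could look like this sketch (hypothetical names; Java's TreeMap stands in for the balanced search tree, and a region that floorEntry fails to cover corresponds to a Miss leaf):

import java.util.*;

// Sketch: one search tree per version, mapping region start address -> bytes written there.
class VersionedStore {
    private final byte[] base;                                   // "version 0"
    private final List<NavigableMap<Long, byte[]>> versions = new ArrayList<>();

    VersionedStore(byte[] base) { this.base = base; }

    // Record the regions written by one version/transaction: O(P log P) to build the tree.
    void commit(Map<Long, byte[]> writes) {
        versions.add(new TreeMap<>(writes));
    }

    // Read one byte at address addr as of a snapshot that sees the first 'snapshot' versions.
    // O(M * log P): newest-first, one floorEntry per version.
    byte read(long addr, int snapshot) {
        for (int v = snapshot - 1; v >= 0; v--) {
            Map.Entry<Long, byte[]> region = versions.get(v).floorEntry(addr);
            if (region != null && addr < region.getKey() + region.getValue().length) {
                return region.getValue()[(int) (addr - region.getKey())];   // Hit
            }
            // Miss: fall through to the next older version.
        }
        return base[(int) addr];                                 // no version overwrote addr
    }
}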
This is suboptimal for two reasons:
The tree is balanced, but Misses are much more likely than Hits in practice, so it would make sense to put Miss nodes higher in the tree, or arrange nodes by their size. Some kind of Huffman coding comes to mind, but Huffman's algorithm does not preserve the search tree invariants.
It requires M trees (hence O(M*log(P)) lookup). Maybe there is some way to combine the trees.
We have two offline systems that normally can not communicate with each other. Both systems maintain the same ordered list of items. Only rarely will they be able to communicate with each other to synchronize the list.
Items are marked with a modification timestamp to detect edits. Items are identified by UUIDs to avoid conflicts when inserting new items (as opposed to using auto-incrementing integers). When synchronizing new UUIDs are detected and copied to the other system. Likewise for deletions.
The above data structure is fine for an unordered list, but how can we handle ordering? If we added an integer "rank", that would need renumbering when inserting a new item (thus requiring synchronizing all successor items due to only 1 insertion). Alternatively, we could use fractional ranks (use the average of the ranks of the predecessor and successor item), but that doesn't seem like a robust solution as it will quickly run into accuracy problems when many new items are inserted.
We also considered implementing this as a doubly linked list with each item holding the UUID of its predecessor and successor item. However, that would still require synchronizing 3 items when 1 new item was inserted (or synchronizing the 2 remaining items when 1 item was deleted).
Preferably, we would like to use a data structure or algorithm where only the newly inserted item needs to be synchronized. Does such a data structure exist?
Edit: we need to be able to handle moving an existing item to a different position too!
There is really no problem with the interpolated rank approach. Just define your own numbering system based on variable length bit vectors representing binary fractions between 0 and 1 with no trailing zeros. The binary point is to the left of the first digit.
The only inconvenience of this system is that the minimum possible key is 0 given by the empty bit vector. Therefore you use this only if you're positive the associated item will forever be the first list element. Normally, just give the first item the key 1. That's equivalent to 1/2, so random insertions in the range (0..1) will tend to minimize bit usage. To interpolate an item before and after,
01 < newly interpolated = 1/4
1
11 < newly interpolated = 3/4
To interpolate again:
001 < newly interpolated = 1/8
01
011 < newly interpolated = 3/8
1
101 < newly interpolated = 5/8
11
111 < newly interpolated = 7/8
Note that if you wish you can omit storing the final 1! All keys (except 0, which you won't normally use) end in 1, so storing it is superfluous.
Comparison of binary fractions is a lot like lexical comparison: 0<1 and the first bit difference in a left-to-right scan tells you which is less. If no differences occur, i.e. one vector is a strict prefix of the other, then the shorter one is smaller.
With these rules it's pretty simple to come up with an algorithm that accepts two bit vectors and computes a result that's roughly (or exactly in some cases) between them. Just add the bit strings, and shift right 1, dropping unnecessary trailing bits, i.e. take the average of the two to split the range between.
In the example above, if deletions had left us with:
01
111
and we need to interpolate these, add 01(0) and 111 to obtain 1.001, then shift to get 1001. This works fine as an interpolant. But note the final 1 unnecessarily makes it longer than either of the operands. An easy optimization is to drop the final 1 bit along with trailing zeros to get simply 1. Sure enough, 1 is about half way between, as we'd hope.
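Here is a sketch of that add-and-shift interpolation (a hypothetical helper; keys are plain '0'/'1' strings and BigInteger does the arithmetic). It pads both keys with one extra fraction bit, averages them, and drops trailing zeros, but it does not apply the extra truncate-to-a-shorter-prefix optimization described above:

import java.math.BigInteger;

// Sketch: keys are '0'/'1' strings representing binary fractions in (0,1),
// most significant bit first, with no trailing zeros.
final class RankKeys {

    // Returns a key that sorts strictly between lo and hi (assumes lo < hi as fractions).
    static String between(String lo, String hi) {
        int len = Math.max(lo.length(), hi.length()) + 1;    // one extra fraction bit for the average
        BigInteger a = new BigInteger(padRight(lo, len), 2); // lo scaled by 2^len
        BigInteger b = new BigInteger(padRight(hi, len), 2); // hi scaled by 2^len
        BigInteger mid = a.add(b).shiftRight(1);             // (lo + hi) / 2
        String bits = toBits(mid, len);
        return bits.replaceFirst("0+$", "");                 // trailing zeros do not change the value
    }

    // Fractions are padded on the right: "01" == "010" == 1/4.
    private static String padRight(String bits, int len) {
        StringBuilder sb = new StringBuilder(bits);
        while (sb.length() < len) sb.append('0');
        return sb.toString();
    }

    // Render with leading zeros preserved, since they are significant fraction bits.
    private static String toBits(BigInteger v, int len) {
        StringBuilder sb = new StringBuilder(v.toString(2));
        while (sb.length() < len) sb.insert(0, '0');
        return sb.toString();
    }
}

// Examples: between("01", "111") == "1001" (9/16), as in the worked example above;
// between("01", "1") == "011" (3/8); between("1", "11") == "101" (5/8).

Note that for keys produced this way (final 1 retained), plain lexicographic string comparison already gives the order described above: the first differing bit decides, and a strict prefix sorts before its extension.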
Of course if you do many inserts in the same location (think e.g. of successive inserts at the start of the list), the bit vectors will get long. This is exactly the same phenomenon as inserting at the same point in a binary tree. It grows long and stringy. To fix this, you must "rebalance" during a synchronization by renumbering with the shortest possible bit vectors, e.g. for 14 you'd use the sequence above.
Addition
Though I haven't tried it, the Postgres bit string type seems to suffice for the keys I've described. The thing I'd need to verify is that the collation order is correct.
Also, the same reasoning works just fine with base-k digits for any k>=2. The first item gets key k/2. There is also a simple optimization that prevents the very common cases of appending and prepending elements at the end and front respectively from causing keys of length O(n). It maintains O(log n) for those cases (though inserting at the same place internally can still produce O(p) keys after p insertions). I'll let you work that out. With k=256, you can use indefinite length byte strings. In SQL, I believe you'd want varbinary(max). SQL provides the correct lexicographic sort order. Implementation of the interpolation ops is easy if you have a BigInteger package similar to Java's. If you like human-readable data, you can convert the byte strings to e.g. hex strings (0-9a-f) and store those. Then normal UTF8 string sort order is correct.
You can add two fields to each item - 'creation timestamp' and 'inserted after' (containing the id of the item after which the new item was inserted). Once you synchronize a list, send all the new items. That information is enough for you to be able to construct the list on the other side.
With the list of newly added items received, do this (on the receiving end): sort by creation timestamp, then go one by one, and use the 'inserted after' field to add the new item in the appropriate place.
You may face trouble if an item A is added, then B is added after A, then A is removed. If this can happen, you will need to sync A as well (basically syncing the operations that took place on the list since the last sync, and not just the content of the current list). It's basically a form of log-shipping.
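A rough sketch of the receiving side (the Item fields and class names are made up for illustration): sort the received items by creation timestamp, then splice each one in after the item named by its 'inserted after' reference, falling back to the head of the list if that anchor cannot be found (e.g. it was deleted, as discussed above):

import java.util.*;

// Sketch: apply newly received items to the local ordered list.
class SyncReceiver {
    static void apply(List<Item> localList, List<Item> received) {
        received.sort(Comparator.comparingLong((Item it) -> it.createdAtMillis));
        for (Item item : received) {
            int pos = 0;                                   // null insertedAfter -> head of list
            if (item.insertedAfter != null) {
                for (int i = 0; i < localList.size(); i++) {
                    if (localList.get(i).uuid.equals(item.insertedAfter)) { pos = i + 1; break; }
                }
            }
            localList.add(pos, item);                      // anchor missing -> falls back to the head
        }
    }

    static class Item {
        UUID uuid;
        UUID insertedAfter;      // null if the item was inserted at the front
        long createdAtMillis;
    }
}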
You could have a look at "lenses", a concept from bidirectional programming.
For instance, your problem seems to be solved by "matching lenses", described in this paper.
I think the data structure that is appropriate here is an order statistic tree. In an order statistic tree you also maintain the size of each subtree along with the other data; the size field makes it easy to find an element by rank, which is what you need. All operations like rank, delete, change position, and insert are O(log n).
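For example, selecting the element at a given list position only needs the subtree size fields; a minimal sketch (node layout assumed, balancing and the other operations omitted):

// Sketch of rank selection in an order statistic tree.
class OrderStatisticTree<T> {
    static class Node<T> {
        T value;
        Node<T> left, right;
        int size = 1;                       // number of nodes in this subtree
    }

    // Return the item with the given zero-based rank (list position) in O(height) = O(log n).
    static <T> T select(Node<T> root, int rank) {
        Node<T> node = root;
        while (node != null) {
            int leftSize = node.left == null ? 0 : node.left.size;
            if (rank < leftSize)       node = node.left;
            else if (rank == leftSize) return node.value;
            else { rank -= leftSize + 1; node = node.right; }
        }
        throw new IndexOutOfBoundsException("rank out of range");
    }
}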
I think you can try a kind of transactional approach here. For example, you do not delete items physically but mark them for deletion, and you commit changes only during synchronization. I'm not absolutely sure which data type you should choose; it depends on which operations you want to be more efficient (insertions, deletions, search or iteration).
Say we have the following initial state on both systems:
|1| |2|
--- ---
|A| |A|
|B| |B|
|C| |C|
|D| |D|
After that the first system marks element B for deletion and the second system inserts element BC between B and C:
|1         |   |2           |
------------   --------------
|A         |   |A           |
|B[deleted]|   |B           |
|C         |   |BC[inserted]|
|D         |   |C           |
               |D           |
Both systems continue processing, taking their local changes into account: System 1 ignores element B and System 2 treats element BC as a normal element.
When synchronization occurs:
As I understand it, each system receives the list snapshot from the other system and both systems freeze processing until synchronization is finished.
Each system then iterates sequentially through the received snapshot and the local list, writing changes to the local list (resolving possible conflicts according to the modification timestamp). After that the 'transaction is committed': all local changes are finally applied and the information about them is erased.
For example for system one:
|1 pre-sync|                |2-SNAPSHOT  |   |1 result|
------------                --------------   ----------
|A         |  <the same>    |A           |   |A       |
|B[deleted]|  <delete B>    |B           |
              <insert BC>   |BC[inserted]|   |BC      |
|C         |  <same>        |C           |   |C       |
|D         |  <same>        |D           |   |D       |
Systems wake up and continue processing.
Items are sorted by insertion order; moving can be implemented as a simultaneous deletion and insertion. Also, I think it would be possible to transfer not the whole list snapshot but only the list of items that were actually modified.
I think, broadly, Operational Transformation could be related to the problem you are describing here. For instance, consider the problem of Real-Time Collaborative text editing.
We essentially have a sorted list of items (words) which needs to be kept synchronized, and which could be added, modified, or deleted at random within the list. The only major difference I see is in the periodicity of modifications to the list (you say it does not happen often).
Operational Transformation happens to be a well-studied field. I could find this blog article giving pointers and an introduction. Plus, for all the problems Google Wave had, they actually made significant advancements to the domain of Operational Transformation. Check this out. There is quite a bit of literature available on this subject. Look at this Stack Overflow thread, and at Differential Synchronisation.
Another parallel that struck me was the data structure used in Text Editors - Ropes.
So if you have a log of operations, let's say "Index 5 deleted", "Index 6 modified to ABC", "Index 8 inserted", what you might have to do is transmit the log of changes from System A to System B and then reconstruct the operations sequentially on the other side.
The other "pragmatic Engineer" 's choice would be to simply reconstruct the entire list on System B when System A changes. Depending on actual frequency and size of changes, this might not be as bad as it sounds.
I have tentatively solved a similar problem by including a PrecedingItemID (which can be null if the item is the top/root of the ordered list) on each item, and then having a sort of local cache that keeps a list of all items in sorted order (this is purely for efficiency—so you don't have to recursively query for or build the list based on PrecedingItemIDs every time there is a re-ordering on the local client). Then when it comes time to sync I do the slightly more expensive operation of looking for cases where two items are requesting the same PrecedingItemID. In those cases, I simply order by creation time (or however you want to reconcile which one wins and comes first), put the second (or others) behind it, and move on ordering the list. I then store this new ordering in the local ordering cache and go on using that until the next sync (just making sure to keep the PrecedingItemID updated as you go).
I haven't unit tested this approach yet—so I'm not 100% sure I'm not missing some problematic conflict scenario—but it appears at least conceptually to handle my needs, which sound similar to those of the OP.
Here's the problem I'm trying to solve:
I need to be able to display a paged, sorted table of data that is stored across several database shards.
Paging and sorting are well known problems that most of us can solve in any number of ways when the data comes from a single source. But if you're splitting your data across shards or using a DHT or distributed document database or whatever flavor of NoSQL you prefer, things get more complicated.
Here's a simple picture of a really small data set:
Shard | Data
1 | A
1 | D
1 | G
2 | B
2 | E
2 | H
3 | C
3 | F
3 | I
Sorted into pages (Page Size = 3):
Page | Data
1 | A
1 | B
1 | C
2 | D
2 | E
2 | F
3 | G
3 | H
3 | I
And if we wanted to show the user page 2, we'd return:
D
E
F
If the size of the table in question is something like 10 million rows, or 100 million, you can't just pull down all the data onto a web/application server to sort it and return the correct page. And you obviously can't let each individual shard sort and page its own slice of the data because the shards don't know about each other.
To complicate matters, the data I need to present can't be too far out of date, so pre-calculating a set of useful sorts ahead of time and storing the results for later retrieval isn't practical.
There are several solutions, some of which may not be feasible for you, but maybe one of them will stick:
Do the sharding by input ranges for this value (e.g., shard 1 contains A-C, shard 2 D-F, etc.). Alternately, use another table with foreign keys to this table as an index, and shard the index table using this system. That way you can easily locate and fetch specified ranges. This solution is probably the best in terms of performance, if you can do it (it assumes that the number of shards is static and the shards are reliable).
Identify the page items by binary search. For example, say you want items 100 to 109 (a 10-item page). For each shard, count the number of values lexicographically below "M". If the sum of the counts is above 100, reduce the pivot point, otherwise increase it (using binary search). After you identify the 100th item (the first item on your page), take the top 9 (10 - 1) items larger than that item from every shard, fetch them, sort the combined list, take the top 9 from that list, prepend the first item, and there's your page! This approach is more difficult to implement and will require O(log(n)) queries, so it is slower than (1), but it may still be reasonably fast if the load is not very heavy. (A sketch of the page-assembly step appears after the last option below.)
Store the page number with each value. This would give you blazingly fast reads, but horribly slow writes, so it only works in the scenario where there are very few writes (or only appends in terms of the ordered variable).
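To make option 2 above a bit more concrete, here is a sketch of its page-assembly step against a hypothetical Shard interface; locating the first key of the page via the binary search over per-shard counts is assumed to have already happened:

import java.util.*;

// Hypothetical per-shard query interface.
interface Shard {
    long countLessThan(String key);                        // used by the (omitted) binary-search phase
    List<String> firstKeysAfter(String key, int limit);    // up to 'limit' smallest keys > key
}

// Sketch: assemble one page once its first key is known.
class ShardedPager {
    static List<String> page(List<Shard> shards, String firstKeyOfPage, int pageSize) {
        List<String> candidates = new ArrayList<>();
        for (Shard s : shards) {
            candidates.addAll(s.firstKeysAfter(firstKeyOfPage, pageSize - 1));
        }
        Collections.sort(candidates);                      // at most shards * (pageSize - 1) keys
        List<String> page = new ArrayList<>();
        page.add(firstKeyOfPage);
        page.addAll(candidates.subList(0, Math.min(pageSize - 1, candidates.size())));
        return page;
    }
}

Each shard returns at most pageSize - 1 candidates, so the final in-memory sort covers shards * (pageSize - 1) items regardless of the table size.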