Merkle tree for finding data inconsistencies - optimizing number of queries - algorithm

I understand the idea behind using Merkle tree to identify inconsistencies in data, as suggested by articles like
Key Concepts: Using Merkle trees to detect inconsistencies in data
Merkle Tree | Brilliant Math & Science Wiki
Essentially, we use a recursive algorithm to traverse down from the root we want to verify, following the nodes whose stored hash values differ from the server's (trusted) hash values, all the way to the inconsistent leaf/data block.
If only one such block (leaf) is corrupted, we follow a single path down to that leaf, which takes O(log n) queries.
However, in the case of multiple inconsistent data blocks/leaves, we need up to O(n) queries. In the extreme case, all data blocks are corrupted, and our algorithm will need to send every single node to server (authenticator). In the real world this becomes costly due to the network.
So my question is, is there any known improvement to the basic traverse-from-root algorithm? A possible improvement I could think of is to query the level of nodes in the middle. For example, in the tree below, we send the server the two nodes in the second level ('64' and '192'), and for any node that returns inconsistency, we recursively go to the middle level of that sub-tree - something like a binary search based on height.
This increases our best case time from O(1) to O(sqrt(n)), and probably reduces our worst case time to some extent (I have not calculated how much).
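For reference, the basic traverse-from-root algorithm could be sketched roughly as follows. This is only a minimal sketch under some assumptions that are not part of the question: SHA-256 as the hash, a power-of-two number of blocks, and both trees held locally, so that each node comparison stands in for one query to the server. The names `build_tree` and `find_bad_leaves` are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash helper (SHA-256 is an assumption; any cryptographic hash works)."""
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels, leaves first, root last.

    Assumes len(blocks) is a power of two to keep the sketch short.
    """
    levels = [[h(b) for b in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def find_bad_leaves(local, trusted):
    """Top-down traversal that only descends into mismatching nodes.

    Returns (sorted indices of inconsistent leaves, number of comparisons).
    Each comparison models one query to the trusted server.
    """
    queries = 0
    bad = []
    stack = [(len(local) - 1, 0)]  # (level, index), starting at the root
    while stack:
        level, idx = stack.pop()
        queries += 1
        if local[level][idx] == trusted[level][idx]:
            continue  # the whole subtree is consistent, prune it
        if level == 0:
            bad.append(idx)  # reached an inconsistent data block
        else:
            stack.append((level - 1, 2 * idx))
            stack.append((level - 1, 2 * idx + 1))
    return sorted(bad), queries


blocks = [bytes([i]) * 4 for i in range(8)]
trusted = build_tree(blocks)

corrupted = list(blocks)
corrupted[5] = b"oops"
print(find_bad_leaves(build_tree(corrupted), trusted))
```

With 8 blocks and one corrupted block, this version compares 7 nodes (the root plus two children per level on the way down). With all 8 blocks corrupted it compares all 15 nodes, which is the O(n) worst case described above.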
I wonder if there's any better approach than this? I've tried to search for relevant articles on Google Scholar, but looks like most of the algorithm-focused papers are concerned with the merkle-tree traversal problem, which is different from the problem above.
Thanks in advance!

Related

Splay tree real life applications

Where would you use splay-tree in production. I mean a REAL LIFE example.
I was thinking about implementing autocomplete using tries and splay trees. For a large dataset it's not a good idea to traverse the trie from node x down to the leaves to return results, so the idea was to keep a splay tree inside each trie node. When the user enters 'sta', we go s-t-a to the 'a' node and then return the top 5 elements of its splay tree (by BFS/level-order traversal, which doesn't necessarily mutate the tree).
Of course after the autocomplete variant was picked, we should traverse up the trie and update all splay trees inside those nodes.
Since splay trees are sensitive in concurrent environments, I was questioning their use in production.
Your ideas?
Splay trees are not a good match for data which rarely or never changes, particularly in a threaded environment. The extra mutations during read operations defeat memory caches and can create unnecessary lock contention. In any case, for read-only data structures, you can do a one-time computation of an optimal tree. Even if that computation is slow, it will have no impact on the long-term execution time.
I'm not entirely persuaded by the claim that large tries are slow, and certainly not in the case of autocompleters. On even not-so-modern hardware, the cost of a trie traversal is trivial compared to the time it takes for the user to type a character, or even the time it takes for the underlying keyboard driver and input processor to deliver the keypress to your application.
If you really need to optimise a trie, there is good reason to believe a hybrid data structure works well: a trie at the root, combined with a linear (or binary) search once the alternatives can fit in a cache line. This maximizes the benefit of the trie's large fan-out while avoiding the poor caching behaviour and excessive storage overhead at the end of the lines.
Splay trees are most useful (if they are useful at all) on data structures which are modified frequently. The classic example is a "rope" data structure (a tree of string segments), which is one way to attempt to optimise a text editor by avoiding large string copies. Compared with a deterministic tree-balancing algorithm such as RB-trees, the splay tree algorithm has the benefit of simplicity, as well as only touching nodes which form part of the tree traversal.
However, the ready availability of self-balancing tree libraries (part of the standard libraries of many modern programming languages) combined with often-disappointing empirical results make the splay algorithm a niche product at best, although it is certainly a fascinating idea.
I found a quite interesting use of splay trees in network load optimisation; it's called SplayNet. An autonomous system (I think under Facebook) implemented this, maybe around 2015, and they somehow managed to lower their internal communication load by around 40%(?).
So there is a good use for splay trees!
A few weeks ago I was also reading that the usefulness of splay trees depends on the spread in the sequence of searches. If there is none, you could just as well use, for example, binary trees or some static trees. But the moment there is one, splay trees perform better (given enough time).
In my thesis I use splay trees as a pre-processed data collection for the actual searching. So the splay tree only stores the results of the most common search requests. In the next step the search starts from the node given by the splay tree ... I think this is useful for big datasets, especially if the data is stored on different computers/storage devices, so your program has a better guess where to start.
To put it simply - my splay trees store the FAQ of the given data structure/dataset :)

How do I balance a BK-Tree and is it necessary?

I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.
I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.
If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?
If it is possible or likely for me to have a balance problem with BK-Trees, is there any way to balance such a tree after it has been constructed?
What would the algorithm look like to properly balance a BK-tree?
My thinking so far:
It seems that child nodes are distinct on distance, so I can't simply rotate a given node in the tree without re-calibrating the entire tree under it. However, if I can find an optimal new root node this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node though.
I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.
Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly and might be probabilistically guaranteed to be sub-optimal).
Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).
FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.
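To make the structure under discussion concrete, here is a minimal BK-tree sketch. It assumes Levenshtein edit distance as the metric and represents each node as a hypothetical `(word, children)` pair, where `children` maps a distance to the single child at that distance. The class and function names are illustrative only and are not taken from any particular library.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, {distance: child}); children are keyed by their
    distance to the parent word, which is why a node cannot simply be rotated."""

    def __init__(self, distance=edit_distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of word."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            w, children = stack.pop()
            d = self.distance(word, w)
            if d <= max_dist:
                results.append((d, w))
            # Triangle inequality: only children whose key lies in
            # [d - max_dist, d + max_dist] can contain a match.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for name in ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]:
    tree.add(name)
print(tree.search("bok", 1))
```

The shape of the tree depends entirely on insertion order and on which word ends up as the root, which is what the ordering strategies listed above are trying to influence.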
There is a Lisp example in the article: http://cliki.net/bk-tree. As for an unbalanced tree, I think the data structure and the method are complicated enough already, and the author doesn't say anything about unbalanced trees. If you do run into an unbalanced tree, maybe this structure is not for you?

How to calculate that a B+ tree is O(log(n)) for lookups

I'm studying B+trees for indexing and I'm trying to understand more than just memorizing the structure. As far as I understand, the inner nodes of a B+tree form an index on the leaves, and the leaves contain pointers to where the data is stored on disk. Correct? Then how are lookups made? If a B+tree is so much better than a binary tree, why don't we use B+trees instead of binary trees everywhere?
I read the wikipedia article on B+ trees and I understand the structure but not how an actual lookup is performed. Could you guide me perhaps with some link to reading material?
What are some other uses of B+ trees besides database indexing?
I'm studying B+trees for indexing and I'm trying to understand more than just memorizing the structure. As far as I understand, the inner nodes of a B+tree form an index on the leaves, and the leaves contain pointers to where the data is stored on disk. Correct?
No, the index is formed by the inner nodes (non-leaves). Depending on the implementation, the leaves may contain either key/value pairs or key/pointer-to-value pairs. For example, a database index uses the latter, unless it is an IOT (Index Organized Table), in which case the values are inlined in the leaves. This depends mainly on whether the value is very large relative to the key.
Then how are lookups made?
In the general case where the root node is not a leaf (it can be one at first, while the tree is still small), the root node contains a sorted array of N keys and N+1 pointers. You binary search for the two keys S0 and S1 such that S0 <= K < S1 (where K is what you are looking for), and this gives you the pointer to the next node.
You repeat the process until you (finally) hit a leaf node, which contains a sorted list of key/value pairs, and make a last binary search pass on those.
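The lookup just described might be sketched like this. It is a minimal in-memory sketch, not how any particular database lays out its pages: the `Inner` and `Leaf` classes and the `lookup` function are hypothetical names, and `bisect` from the Python standard library does the binary searches.

```python
from bisect import bisect_left, bisect_right


class Inner:
    def __init__(self, keys, children):
        self.keys = keys          # N sorted separator keys
        self.children = children  # N + 1 child nodes


class Leaf:
    def __init__(self, keys, values):
        self.keys = keys      # sorted keys
        self.values = values  # values (or pointers to values), same order


def lookup(node, key):
    """Descend from the root to a leaf, then binary search the leaf."""
    while isinstance(node, Inner):
        # bisect_right returns i such that keys[i-1] <= key < keys[i],
        # i.e. the S0 <= K < S1 condition; i is the child to follow.
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None  # key not present


root = Inner(
    keys=[7, 15],
    children=[
        Leaf([1, 4], ["a", "b"]),
        Leaf([7, 10], ["c", "d"]),
        Leaf([15, 20], ["e", "f"]),
    ],
)
print(lookup(root, 10), lookup(root, 8))
```

Here a key equal to a separator goes to the right-hand child, which matches the S0 <= K < S1 convention; some implementations use the opposite convention.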
If a B+tree is so much better than a binary tree, why don't we use B+trees instead of binary trees everywhere?
Binary trees are simpler to implement. One tough cookie with B+Trees is sizing the number of keys/pointers in inner nodes and the number of key/value pairs in leaf nodes. Another tough cookie is deciding on the low and high watermarks that lead to merging two nodes or splitting one.
Binary trees also offer memory stability: an element, once inserted, is not moved in memory at all. On the other hand, inserting an element into a B+Tree or removing one is likely to shuffle elements around.
B+Trees are tailored for small keys/large values cases. They also require that keys can be duplicated (hopefully cheaply).
Could you guide me perhaps with some link to reading material?
I hope the rough algorithm I explained helped out, otherwise feel free to ask in the comments.
What are some other uses of B+ trees besides database indexing?
In the same vein: file-system indexing also benefits.
The idea is always the same: a B+Tree is really great with small keys/large values and caching. The idea is to have all the keys (inner nodes) in your fast memory (CPU Cache >> RAM >> Disk), and the B+Tree achieves that for large collections by pushing the values down to the bottom. With all inner nodes in fast memory, you only have one slow memory access per search (to fetch the value).
B+ trees are better than binary trees; all the DBMSs use them.
A lookup in a B+Tree takes log_F(N) steps, where F, the base of the logarithm, is the fan-out. The lookup is performed exactly like in a binary tree, but with a bigger fan-out and therefore a lower height; that's why it is way better.
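To get a feel for what the fan-out does to the height, here is a small back-of-the-envelope sketch. It assumes an idealized, completely full tree in which every node has exactly `fanout` children; `levels_needed` is a hypothetical helper, and real B+ trees with partially filled nodes will be somewhat taller.

```python
def levels_needed(n_keys: int, fanout: int) -> int:
    """Smallest number of levels L such that fanout ** L >= n_keys.

    Uses integer arithmetic to avoid floating-point log rounding.
    """
    levels, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


for fanout in (2, 100):
    print(fanout, levels_needed(1_000_000, fanout))
```

For a million keys this gives 20 levels with a fan-out of 2 and 3 levels with a fan-out of 100, which is the difference between about 20 and about 3 node visits per lookup.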
B+Trees are usually known for having the data in the leaves (though probably not if the index is unclustered). This means you don't have to make another jump to the disk to get the data; you just take it from the leaf.
B+Trees are used almost everywhere: operating systems use them, data warehouses (not so much here, but still), lots of applications.
B+Trees are perfect for range queries, and are used whenever you have unique values, like a primary key, or any field with high cardinality.
If you can get this book, http://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638, it's one of the best. It's basically the bible for any database guy.

When to choose RB tree, B-Tree or AVL tree?

As a programmer, when should I consider using an RB tree, B-tree or AVL tree?
What are the key points that need to be considered before deciding on the choice?
Can someone please explain with a scenario for each tree structure why it is chosen over others with reference to the key points?
Take this with a pinch of salt:
B-tree when you're managing more than thousands of items and you're paging them from a disk or some slow storage medium.
RB tree when you're doing fairly frequent inserts, deletes and retrievals on the tree.
AVL tree when your inserts and deletes are infrequent relative to your retrievals.
I think B+ trees are a good general-purpose ordered container data structure, even in main memory. Even when virtual memory isn't an issue, cache-friendliness often is, and B+ trees are particularly good for sequential access - the same asymptotic performance as a linked list, but with cache-friendliness close to a simple array. All this and O(log n) search, insert and delete.
B+ trees do have problems, though - such as the items moving around within nodes when you do inserts/deletes, invalidating pointers to those items. I have a container library that does "cursor maintenance" - cursors attach themselves to the leaf node they currently reference in a linked list, so they can be fixed or invalidated automatically. Since there's rarely more than one or two cursors, it works well - but it's an extra bit of work all the same.
Another thing is that the B+ tree is essentially just that. I guess you can strip off or recreate the non-leaf nodes depending on whether you need them or not, but with binary tree nodes you get a lot more flexibility. A binary tree can be converted to a linked list and back without copying nodes - you just change the pointers then remember that you're treating it as a different data structure now. Among other things, this means you get fairly easy O(n) merging of trees - convert both trees to lists, merge them, then convert back to a tree.
Yet another thing is memory allocation and freeing. In a binary tree, this can be separated out from the algorithms - the user can create a node then call the insert algorithm, and deletes can extract nodes (detach them from the tree, but don't free the memory). In a B-tree or B+-tree, that obviously doesn't work - the data will live in a multi-item node. Writing insert methods that "plan" the operation without modifying nodes until they know how many new nodes are needed and that they can be allocated is a challenge.
Red black vs. AVL? I'm not sure it makes any big difference. My own library has a policy-based "tool" class to manipulate nodes, with methods for double-linked lists, simple binary trees, splay trees, red-black trees and treaps, including various conversions. Some of those methods were only implemented because I was bored at one time or another. I'm not sure I've even tested the treap methods. The reason I chose red-black trees rather than AVL is because I personally understand the algorithms better - which doesn't mean they're simpler, it's just a fluke of history that I'm more familiar with them.
One last thing - I only originally developed my B+ tree containers as an experiment. It's one of those experiments that never ended really, but it's not something I'd encourage others to repeat. If all you need is an ordered container, the best answer is to use the one that your existing library provides - e.g. std::map etc in C++. My library evolved over years, it took quite a while to get it stable, and I just relatively recently discovered it's technically non-portable (dependent on a bit of undefined behaviour WRT offsetof).
In memory B-Tree has the advantage when the number of items is more than 32000... Look at speedtest.pdf from stx-btree.
When choosing data structures you are trading off factors such as
speed of retrieval v speed of update
how well the structure copes with worst case operations, for example insertion of records that arrive in a sorted order
space wasted
I would start by reading the Wikipedia articles referenced by Robert Harvey.
Pragmatically, when working in languages such as Java the average programmer tends to use the collection classes provided. If in a performance tuning activity one discovers that the collection performance is problematic then one can seek alternative implementations. It's rarely the first thing a business-led development has to consider. It's extremely rare that one needs to implement such data structures by hand, there are usually libraries that can be used.

Self-sorted data structure with random access

I need to implement self-sorted data structure with random access. Any ideas?
A self-sorted data structure can be a binary search tree. If you want one that is both self-sorted and self-balanced, an AVL tree is the way to go. Retrieval time will be O(lg n) for random access.
Maintaining a sorted list and accessing it arbitrarily requires at least O(lg N) per operation. So, look for AVL trees, red-black trees, treaps or any other similar data structure and enrich them to support random indexing. I suggest treaps since they are the easiest to understand/implement.
One way to enrich the treap is to keep in each node the count of nodes in the subtree rooted at that node. You'll have to update the count when you modify the tree (e.g. on insertion/deletion).
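A size-augmented treap along these lines might look as follows. This is a minimal sketch with insertion and index-based access only (no deletion); the names `Node`, `insert` and `select` are hypothetical, and priorities come from Python's `random` module.

```python
import random


class Node:
    __slots__ = ("key", "prio", "left", "right", "size")

    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # random heap priority
        self.left = self.right = None
        self.size = 1                # number of nodes in this subtree


def size(n):
    return n.size if n else 0


def update(n):
    n.size = 1 + size(n.left) + size(n.right)


def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    update(n)  # n is now the child, so fix its count first
    update(l)
    return l


def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    update(n)
    update(r)
    return r


def insert(n, key):
    """Insert key into the treap rooted at n and return the new root."""
    if n is None:
        return Node(key)
    if key < n.key:
        n.left = insert(n.left, key)
        if n.left.prio > n.prio:   # restore the heap property
            n = rotate_right(n)
    else:
        n.right = insert(n.right, key)
        if n.right.prio > n.prio:
            n = rotate_left(n)
    update(n)
    return n


def select(n, index):
    """Return the key at position index (0-based) in sorted order."""
    while n:
        left = size(n.left)
        if index < left:
            n = n.left
        elif index == left:
            return n.key
        else:
            index -= left + 1  # skip the left subtree and this node
            n = n.right
    raise IndexError(index)


root = None
for x in [5, 3, 8, 1, 9, 2, 7]:
    root = insert(root, x)
print([select(root, i) for i in range(root.size)])
```

Both `insert` and `select` take expected O(lg N) time, because the random priorities keep the expected depth logarithmic.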
I'm not too much involved lately with data structures implementation. Probably this answer is not an answer at all... you should see "Introduction to algorithms" written by Thomas Cormen. That book has many "recipes" with explanations about the inner workings of many data structures.
On the other hand, you have to take into account how much time you want to spend writing an algorithm, the size of the input, and whether there is an actual need for a special kind of data structure.
I see one thing missing from the answers here, the Skiplist
https://en.wikipedia.org/wiki/Skip_list
You get order automatically, there is a probabilistic element to search and creation.
Fits the question no worse than binary trees.
Self sorting is a little bit too ambiguous. First of all
What kind of data structure?
There are a lot of different data structures out there, such as:
Linked list
Double linked list
Binary tree
Hash set / map
Stack
Heap
And many more and each of them behave differently than others and have their benefits of course.
Now, not all of them could or should be self-sorting, such as the Stack, it would be weird if that one were self-sorting.
However, the Linked List and the Binary Tree could be self sorting, and for this you could sort it in different ways and on different times.
For Linked Lists
I would prefer insertion sort for this; you can read various good articles about it on wikis and other places. I like the pasted link though. Look at it and try to understand the concept.
If you want to sort after items are inserted, i.e. at arbitrary times, then you can implement a sorting algorithm other than insertion sort, maybe bubblesort or quicksort. I would avoid bubblesort though, it's a lot slower! But it is easier to wrap your mind around.
Random Access
Randomness is something that is always being discussed, so have a read about how to perform good randomization and you will be on your way. If you have a linked list with a "getAt" method, you could just pick a random index between 0 and n and get the item at that index.

Resources