Lucene index modeling - Why are skiplists used instead of btree? - data-structures

I have recently started learning Lucene and came to know how it stores and queries indices. Lucene seems to use a skip list as an underlying data structure. However, I did not find any reason to use a skip list over a binary tree.
The advantage of skip lists is that they perform well under concurrent access. But Lucene allows a single writer thread per index, and readers read from immutable segments, so skip lists do not help there either. Beyond that, a self-balancing binary tree beats a skip list: it provides worst-case O(log n) complexity for reads and writes, whereas a skip list gives the same complexity only in the average case. A binary tree would also serve range queries in better time than a skip list. For serving a conjunction query, Lucene uses the skip lists of multiple postings lists to find their intersection; a binary tree would have been enough for this case too.
Is there any specific reason skip lists are used in Lucene for indexing purposes that I have missed?

Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). See this SO answer: How does Lucene index documents?
In that answer, it also indicates that the primary benefit of using Skip-Lists is that it avoids ever having to rebalance a B-Tree. If you'd like to dig deeper, that answer cites another one that provides a lot more detail: Skip List vs. Binary Search Tree, which in turn references additional whitepapers.
Researching this a bit more, there is one other advantage to using Skip-Lists rather than a B-Tree. It's not just the rebalancing that is avoided, but also the locking of a portion of the tree while the rebalancing takes place. This aspect is discussed further here. This latter advantage improves concurrency.
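To see why skip-list-style advancing suits the conjunction queries mentioned in the question, here is a minimal sketch of postings-list intersection. The `advance` helper (a bisect here) stands in for the skip-pointer jump a real skip list provides; the names and structure are hypothetical, not Lucene's actual API:

```python
import bisect

def advance(postings, target, start=0):
    """Index of the first doc id >= target. This bisect stands in for
    the skip-pointer jump a real skip list performs in O(log n)."""
    return bisect.bisect_left(postings, target, start)

def intersect(lists):
    """Leapfrog intersection of sorted postings lists (conjunction query)."""
    if not lists or not all(lists):
        return []
    result = []
    positions = [0] * len(lists)
    target = lists[0][0]
    while True:
        matched = True
        for i, plist in enumerate(lists):
            positions[i] = advance(plist, target, positions[i])
            if positions[i] == len(plist):
                return result                 # one list exhausted: done
            if plist[positions[i]] != target:
                target = plist[positions[i]]  # jump all lists to new candidate
                matched = False
                break
        if matched:
            result.append(target)
            target += 1

print(intersect([[1, 3, 5, 8, 13], [2, 3, 8, 9, 13], [3, 4, 8, 13, 20]]))
# [3, 8, 13]
```

The point is that each list is advanced forward-only toward a moving candidate doc id, which is exactly the access pattern skip pointers accelerate; no rebalancing or random descent from a tree root is ever needed.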

Related

Why Redis SortedSet uses Skip List instead of Balanced Tree?

The Redis documentation says:
ZSETs are ordered sets using two data structures to hold the same elements in order to get O(log(N)) INSERT and REMOVE operations into a sorted data structure.
The elements are added to a hash table mapping Redis objects to scores. At the same time the elements are added to a skip list mapping scores to Redis objects (so objects are sorted by scores in this "view").
I cannot quite understand this. Could someone give me a detailed explanation?
Antirez answered (see https://news.ycombinator.com/item?id=1171423):
There are a few reasons:
They are not very memory intensive. It's up to you, basically. Changing parameters about the probability of a node having a given number of levels will make them less memory intensive than btrees.
A sorted set is often the target of many ZRANGE or ZREVRANGE operations, that is, traversing the skip list as a linked list. With this operation the cache locality of skip lists is at least as good as with other kinds of balanced trees.
They are simpler to implement, debug, and so forth. For instance, thanks to the skip list's simplicity I received a patch (already in Redis master) with augmented skip lists implementing ZRANK in O(log(N)). It required few changes to the code.
About the append-only durability & speed: I don't think it is a good idea to optimize Redis at the cost of more code and more complexity for a use case that IMHO should be rare for the Redis target (fsync() at every command). Almost no one is using this feature even with ACID SQL databases, as the performance hit is big anyway.
About threads: our experience shows that Redis is mostly I/O bound. I'm using threads to serve things from Virtual Memory. The long term solution to exploit all the cores, assuming your link is so fast that you can saturate a single core, is running multiple instances of Redis (no locks, almost fully scalable linearly with number of cores), and using the "Redis Cluster" solution that I plan to develop in the future.
First of all, I think I got the idea of what the Redis documentation says. Redis's ordered set maintains the order of elements by each element's score, specified by the user. But when the user calls some Redis ZSET APIs, they pass only the member arguments. For example:
ZREM key member [member ...]
ZINCRBY key increment member
...
Redis needs to know the score associated with this member (element), so it uses a hash table to maintain a mapping, just like the documentation says:
The elements are added to a hash table mapping Redis objects to scores.
When it receives a member, it finds its score through the hash table and then performs the operation on the skip list to maintain the order of the set. Redis uses two data structures to maintain a double mapping and satisfy the needs of the different APIs.
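The double mapping can be sketched as follows. This is a toy illustration: a sorted Python list (via `bisect`) stands in for the skip list, and `ZSet`, `zadd`, `zincrby`, and `zrange` are simplified stand-ins for the real Redis commands:

```python
import bisect

class ZSet:
    """Sketch of Redis's dual structure: a dict (hash table) maps member
    to score, while a sorted list of (score, member) pairs stands in for
    the skip list that keeps the scored 'view' ordered."""
    def __init__(self):
        self.scores = {}       # member -> score  (the hash table)
        self.ordered = []      # sorted [(score, member), ...]  (the "skip list")

    def zadd(self, score, member):
        if member in self.scores:  # remove the stale ordered entry first
            old = (self.scores[member], member)
            self.ordered.pop(bisect.bisect_left(self.ordered, old))
        self.scores[member] = score
        bisect.insort(self.ordered, (score, member))

    def zincrby(self, increment, member):
        # the hash lookup yields the current score without scanning the order
        self.zadd(self.scores.get(member, 0) + increment, member)

    def zrange(self, start, stop):
        # inclusive stop, members returned in score order
        return [m for _, m in self.ordered[start:stop + 1]]

z = ZSet()
z.zadd(3, "a")
z.zadd(1, "b")
z.zincrby(5, "b")
print(z.zrange(0, 1))  # ['a', 'b']
```

Note how `zincrby` needs the hash table (member to score) before it can touch the ordered structure (score to member); that is the "double mapping" the documentation describes.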
I read the paper by William Pugh, Skip Lists: A Probabilistic Alternative to Balanced Trees, and found the skip list very elegant and easier to implement than tree rotations.
Also, I think a general balanced binary tree could do this work at the same time cost. In case I've missed something, please point it out.

LMDB variant offering finger B-Tree?

A finger B-Tree is a B-Tree that tracks a user-specified associative "summarizing" operation on the leaves. When nodes are merged, the operation is used to combine summaries; when nodes are split the summary is recalculated using the node's grandchildren (but no deeper nodes).
By updating the summary data with each split/merge, a finger B-Tree is able to answer a query for the summary over any arbitrary range of keys in at most O(log n) page lookups (i.e. along the paths from the root down to the floorkey and the ceilkey of the range).
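To make the cached-summary idea concrete, here is a small sketch using a segment tree rather than a B-Tree (purely for illustration; the class and method names are hypothetical). Each internal node caches `op(left, right)`, so any range summary touches only O(log n) nodes:

```python
class SummaryTree:
    """Minimal segment-tree sketch of the 'range summary' idea behind a
    finger B-Tree: each internal node caches op(left, right), where op
    is the user-specified associative operation."""
    def __init__(self, leaves, op):
        self.op = op
        self.n = len(leaves)
        self.tree = [None] * self.n + list(leaves)
        for i in range(self.n - 1, 0, -1):  # build parent summaries bottom-up
            self.tree[i] = op(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        """Summary over leaves[lo:hi] in O(log n) node visits."""
        res = None
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:  # lo is a right child: take it and move right
                res = self.tree[lo] if res is None else self.op(res, self.tree[lo])
                lo += 1
            if hi & 1:  # hi is a right boundary: step left and take it
                hi -= 1
                res = self.tree[hi] if res is None else self.op(self.tree[hi], res)
            lo >>= 1
            hi >>= 1
        return res

t = SummaryTree([5, 2, 7, 1, 9, 3, 8, 6], max)
print(t.query(2, 6))  # 9 (max over [7, 1, 9, 3])
```

A finger B-Tree applies the same principle to wide disk pages, recomputing a node's cached summary only when that node splits or merges.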
I don't think LMDB supports this out of the box, but I'd be happy to be wrong. Is anybody aware of an LMDB fork or variant which adds it? If not, is there another lightweight persistent (not necessarily transactional) on-disk BTree library that does?
RocksDB offers custom compaction filters and merge operators, which could be used to implement such summaries in a fairly efficient way, I think. Of course, its architecture is very different from LMDB's.

Is a linked list in a B-tree node superior to an array?

I want to implement a B-tree index for my database.
I have read many data structure and algorithm books to learn how to do it. All implementations use an array to save data and child indexes.
Now I want to know: is a linked list in B-tree node superior to an array?
There are some ideas I've thought about:
when splitting a node, the copy operation will be faster than with an array.
when inserting data into the middle or at the head of an array, it is slower than inserting into a linked list.
The linked list is not better; in fact, a simple array is not better either (except for its simplicity, which is a good argument for it, and its search speed if sorted).
You have to realize that the "array" implementation is more a "reference" implementation than a true full-power implementation. For example, commercial implementations of the data/key pairs inside a B-Tree node use many strategies to solve two problems: storage efficiency and efficient search of keys in the node.
With regard to efficient search, an array of key/value pairs with an internal balanced tree structure on top of it can make insertion/deletion/search O(log N); for large B-Tree nodes this makes sense.
With regard to memory efficiency, the nature of the data in the key and value is very important. For example, lexicographical keys can be shortened by a common prefix (e.g. "good" and "great" have "g" in common), and the data might be compressed as well using any scheme relevant to its nature. Compressing keys is more complex, as you will want to preserve the lexicographical property. Remember that the more data and keys you fit in a node, the faster the disk accesses are.
The time to split a node is only partially relevant, as it will be less than the time to read or write a node on typical media by several orders of magnitude. On SSDs and extremely fast disks (within 10 to 20 years, disks as fast as RAM are expected), much research is being conducted to find a successor to B-Trees; stratified B-Trees are an example.
If the BTree is itself stored on the disk then a linked list will make it very complicated to maintain.
Keep the B-Tree structure compact. This allows more nodes per page, better data locality, caching of more nodes, and fewer disk reads/cache misses.
Use an array.
The perceived in-memory computational benefits are inconsequential.
So, in short: no, a linked list is not superior.
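As a sketch of the array-based layout being recommended, the search inside a single node can itself be a binary search over the sorted key array. The field names here are hypothetical, not from any particular implementation:

```python
import bisect

def search_node(keys, children, values, key):
    """Search one B-Tree node whose keys live in a sorted array.
    Binary search (bisect) finds either the value stored in this node
    or the child to descend into next."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return ("found", values[i])
    return ("descend", children[i])

# A toy node: 3 keys, 4 child slots (child i covers keys below keys[i]).
keys = [10, 20, 30]
values = ["a", "b", "c"]
children = ["c0", "c1", "c2", "c3"]
print(search_node(keys, children, values, 20))  # ('found', 'b')
print(search_node(keys, children, values, 25))  # ('descend', 'c2')
```

With a linked list inside the node this O(log n) in-node search is impossible; every probe degenerates to a linear scan, which is the computational cost on top of the cache-locality loss discussed above.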
B-Trees are typically used in DBs where the data is stored on disk and you want to minimize the number of blocks you read. I do not think your proposal would be efficient in that case (although it might be beneficial if you can load all the data into RAM).
If you want to perform those two operations effectively you should use a Skip List (http://en.wikipedia.org/wiki/Skip_list). Performance-wise it will be similar to what you have outlined.

Lucene's algorithm

I read the paper by Doug Cutting; "Space optimizations for total ranking".
Since it was written a long time ago, I wonder what algorithms Lucene uses now (regarding postings-list traversal, score calculation, and ranking).
In particular, the total ranking algorithm described there involves traversing the entire postings list for each query term, so in the case of very common query terms like "yellow dog", either of the two terms may have a very long postings list in web search. Are they all really traversed in current Lucene/Solr? Or are there heuristics employed to truncate the lists?
In the case when only the top k results are returned, I can understand that distributing the postings list across multiple machines and then combining the top k from each would work, but if we are required to return "the 100th result page", i.e. results ranked 990th-1000th, then each partition would still have to find the top 1000, so partitioning would not help much.
Overall, is there any up-to-date detailed documentation on the internal algorithms used by Lucene?
I am not aware of such documentation, but since Lucene is open source, I encourage you to read the source code. In particular, the current trunk version includes flexible indexing, meaning that storage and postings-list traversal have been decoupled from the rest of the code, making it possible to write custom codecs.
Your assumptions are correct regarding postings-list traversal: by default (it depends on your Scorer implementation) Lucene traverses the entire postings list for every term present in the query and puts matching documents in a heap of size k to compute the top-k docs (see TopDocsCollector). So returning results 990 to 1000 makes Lucene instantiate a heap of size 1000. And if you partition your index by document (another approach is to split by term), every shard will need to send its top 1000 results to the server responsible for merging the results (see Solr's QueryComponent, for example, which translates a query for results N to P into several shard requests for results 0 to P: sreq.params.set(CommonParams.START, "0");). This is why Solr might be slower in distributed mode than in standalone mode in the case of extreme paging.
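The heap-of-size-k idea can be sketched like this; it is a simplified stand-in for what TopDocsCollector does, not Lucene's actual code:

```python
import heapq

def top_k(scored_docs, k):
    """Stream over every scored match and keep only the k best in a
    min-heap, instead of sorting the whole traversal. The heap root is
    always the weakest hit kept so far."""
    heap = []  # min-heap of (score, doc_id)
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))  # evict weakest hit
    return sorted(heap, reverse=True)  # best first

hits = [(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7), (5, 0.1)]
print(top_k(hits, 3))  # [(0.9, 2), (0.7, 4), (0.5, 3)]
```

Each document costs O(log k) at worst, so traversing a long postings list is cheap per hit, but the heap must grow to size P for deep paging, which is exactly the "extreme paging" cost described above.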
I don't know how Google manages to score results efficiently, but Twitter published a paper on their retrieval engine Earlybird where they explain how they patched Lucene in order to do efficient reverse chronological order traversal of the posting lists, which allows them to return the most recent tweets matching a query without traversing the entire posting list for every term.
Update:
I found this presentation from Googler Jeff Dean, which explains how Google built its large scale information retrieval system. In particular, it talks about sharding strategies and posting list encoding.

Where is binary search used in practice?

Every programmer is taught that binary search is a good, fast way to search an ordered list of data. There are many toy textbook examples of using binary search, but what about in real programming: where is binary search actually used in real-life programs?
Binary search is used everywhere. Take any sorted collection from any language library (Java, .NET, C++ STL and so on) and it will use (or offer the option to use) binary search to find values. While it's true that you rarely have to implement it yourself, you still have to understand the principles behind it to take advantage of it.
Binary search can be used to access ordered data quickly when memory space is tight. Suppose you want to store a set of 100,000 32-bit integers in a searchable, ordered data structure but you are not going to change the set often. You can trivially store the integers in a sorted array of 400,000 bytes, and you can use binary search to access it fast. But if you put them into, e.g., a B-tree, RB-tree or whatever "more dynamic" data structure, you start to incur memory overhead. To illustrate, storing the integers in any kind of tree with left-child and right-child pointers would make you consume at least 1,200,000 bytes of memory (assuming a 32-bit memory architecture). Sure, there are optimizations you can do, but that's how it works in general.
Because it is very slow to update an ordered array (doing insertions or deletions), binary search is not useful when the array changes often.
Here some practical examples where I have used binary search:
Implementing a "switch() ... case:" construct in a virtual machine where the case labels are individual integers. If you have 100 cases, you can find the correct entry in 6 to 7 steps using binary search, whereas a sequence of conditional branches takes on average 50 comparisons.
Doing fast substring lookup using suffix arrays, which contain all the suffixes of the set of searchable strings in lexicographic order (I wanted to conserve memory and keep the implementation simple)
Finding numerical solutions to an equation (when you are lazy and do not mind to implement Newton's method)
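The first item above, VM `switch` dispatch over sorted case labels, can be sketched as follows; the labels and handlers here are made up for illustration:

```python
import bisect

def dispatch(handlers, labels, label):
    """Hypothetical VM 'switch' dispatch: case labels are kept sorted,
    so the handler is found in O(log n) comparisons instead of a
    linear chain of conditional branches."""
    i = bisect.bisect_left(labels, label)
    if i < len(labels) and labels[i] == label:
        return handlers[i]()
    return None  # no matching case: fall through to default

labels = [3, 7, 42, 99]  # must stay sorted, parallel to handlers
handlers = [lambda: "three", lambda: "seven", lambda: "answer", lambda: "ninety-nine"]
print(dispatch(handlers, labels, 42))  # answer
```

For 100 cases this is the 6-to-7-comparison lookup mentioned above, versus about 50 comparisons on average for a linear if/else chain.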
Every programmer needs to know how to use binary search when debugging.
When you have a program and you know that a bug is visible at a particular point during the execution of the program, you can use binary search to pinpoint the place where it actually happens. This can be much faster than single-stepping through large parts of the code.
Binary search is a good and fast way!
Before the arrival of the STL, the .NET framework, etc., you could rather often bump into situations where you needed to roll your own customized collection classes. Whenever a sorted array was a feasible place to store the data, binary search was the way to locate entries in that array.
I'm quite sure binary search is in widespread use today as well, although it is taken care of "under the hood" by the library for your convenience.
I've implemented binary searches in BTree implementations.
The BTree search algorithms were used for finding the next node block to read but, within the 4K block itself (which contained a number of keys based on the key size), binary search was used to find either the record number (for a leaf node) or the next block (for a non-leaf node).
Blindingly fast compared to sequential search since, like balanced binary trees, you remove half the remaining search space with every check.
I once implemented it (without even knowing that this was indeed binary search) for a GUI control showing two-dimensional data in a graph. Clicking with the mouse should set the data cursor to the point with the closest x value. When dealing with large numbers of points (several thousand; this was way back when x86 CPUs were only beginning to exceed 100 MHz) this was not really usable interactively - I was doing a linear search from the start. After some thinking it occurred to me that I could approach this in a divide-and-conquer fashion. Took me some time to get it working under all edge cases.
It was only some time later that I learned that this is indeed a fundamental CS algorithm...
One example is the STL set. The underlying data structure is a balanced binary search tree, which supports lookup, insertion, and deletion in O(log n) due to binary search.
Another example is an integer division algorithm that runs in log time.
We still use it heavily in our code to search thousands of ACLs many thousands of times a second. It's useful because the ACLs are static once they come in from the file, and we can suffer the expense of growing the array as we add to it at bootup. Blazingly fast once it's running, too.
When you can search a 255-element array in at most 8 compares/jumps (511 in 9, 1023 in 10, etc.) you can see that binary search is about as fast as you can get.
Well, binary search is now used in 99% of 3D games and applications. Space is divided into a tree structure, and a binary search is used to retrieve which subdivisions to display according to the 3D position and camera.
One of its first great showcases was Doom. Binary trees and the associated search enhanced the rendering.
Answering your question with hands-on example.
In the R programming language there is a package, data.table. It is known for its C implementation, short syntax, and high performance for data transformation. It uses binary search. Even without binary search it scales better than its competitors.
You can find benchmarks vs. Python pandas and vs. R dplyr in the project wiki: grouping 2E9 - random order data.
There is also a nice benchmark against databases and big-data tools: benchm-databases.
In a recent data.table version (1.9.6) binary search was extended and can now be used as an index on any atomic column.
I just found a nice summary with which I totally agree - see:
Anyone doing R comparisons should use data.table instead of data.frame. More so for benchmarks. data.table is the best data structure/query language I have found in my career. It's leading the way in the R world, and in my view, in all the data-focused languages.
So yes, binary search is being used and world is much better place thanks to it.
Binary search can be used to debug with Git. It's called git bisect.
Amongst other places, I have an interpreter with a table of command names and a pointer to the function to interpret that command. There are about 60 commands. It would not be incredibly onerous to use a linear search - but I use a binary search.
Semiconductor test programs used for measuring digital timing or analog levels make extensive use of binary search. Automatic Test Equipment (ATE) from Advantest, Teradyne, Verigy and the like can be thought of as truth table blasters, applying input logic and verifying output states of a digital part.
Think of a simple gate, with the input logic changing at time = 0 of each cycle, and transitioning X ns after the input logic changes. If you strobe the output before T=X, the logic does not match the expected value. Strobe later than T=X, and the logic does match the expected value. Binary search is used to find the threshold between the latest time at which the logic does not match and the earliest time at which it does. (A Teradyne FLEX system resolves timing to 39 ps resolution; other testers are comparable.) That's a simple way to measure transition time. The same technique can be used to solve for setup time, hold time, operable power supply levels, power supply vs. delay, etc.
Any kind of microprocessor, memory, FPGA, logic, and many analog mixed signal circuits use binary search in test and characterization.
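The pass/fail threshold search described above reduces to bisecting a monotonic predicate. Here is a hedged sketch with a made-up device model standing in for the real measurement:

```python
def find_threshold(passes, lo, hi, resolution=1e-12):
    """Sketch of the ATE timing search: passes(t) is a monotonic
    pass/fail measurement (hypothetical), with lo failing and hi
    passing. Bisect the strobe time until the pass/fail boundary
    is bracketed to 'resolution' (here, 1 ps)."""
    while hi - lo > resolution:
        mid = (lo + hi) / 2
        if passes(mid):
            hi = mid   # output already valid: boundary is earlier
        else:
            lo = mid   # output not yet valid: boundary is later
    return hi

# Toy device whose output becomes valid 3.2 ns after the cycle start.
transition = 3.2e-9
t = find_threshold(lambda strobe: strobe >= transition, 0.0, 10e-9)
print(abs(t - transition) < 1e-11)  # True
```

Each iteration halves the uncertainty window, so a 10 ns window reaches picosecond resolution in roughly 14 strobes, which is why testers can characterize timing this quickly.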
-- mike
I had a program that iterated through a collection to perform some calculations. I thought that this was inefficient so I sorted the collection and then used a single binary search to find an item of interest. I returned this item and its matching neighbours. I had in-effect filtered the collection.
Doing this was actually slower than iterating the entire collection and fishing out matching items.
I continued to add items to the collection, knowing that the sorting-and-searching performance would eventually catch up with the iteration. It took a collection of about 600 objects until the speeds were identical; 1000 objects showed a clear performance benefit.
I would also consider the type of data you are working with, the duplicates and spread. This will have an effect on the sorting and searching.
My answer is to try both methods and time them.
It's the basis for hg bisect.
Binary search is useful for adjusting fonts to the size of text within the constant dimensions of a textbox.
Finding roots of an equation is probably one of those very easy things you want to do with a very easy algorithm like binary search.
Delphi users can enjoy binary search while searching for a string in a sorted TStringList.
I believe that the .NET SortedDictionary uses a binary tree behind the scenes (much like the STL map)... so a binary search is used to access elements in the SortedDictionary
Python's list.sort() method uses Timsort, which (AFAIK) uses binary search to locate the positions of elements.
Binary search offers a feature that many readymade map/dictionary implementations don't: finding non-exact matches.
For example, I've used binary search to implement geotagging of photos based on GPS logs: put all GPS waypoints in an array sorted by timestamp, and use binary search to identify the waypoint that lies closest in time to each photo's timestamp.
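That closest-timestamp lookup is a small variant of standard binary search; here is a sketch (function and variable names are mine, not from any library):

```python
import bisect

def closest_waypoint(times, t):
    """Find the GPS waypoint timestamp closest to photo time t.
    times must be sorted; bisect gives the insertion point, and the
    answer is one of its two neighbours."""
    i = bisect.bisect_left(times, t)
    candidates = times[max(i - 1, 0):i + 1]  # neighbour on each side
    return min(candidates, key=lambda w: abs(w - t))

track = [100, 160, 220, 300, 420]  # waypoint timestamps, sorted
print(closest_waypoint(track, 210))  # 220
print(closest_waypoint(track, 95))   # 100
```

This is the non-exact-match capability mentioned above: a plain hash map cannot answer "nearest key" at all, while a sorted array answers it in O(log n).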
If you have a set of elements to find in an array you can either search for each of them linearly or sort the array and then use binary search with the same comparison predicate. The latter is much faster.