Using B-Tree instead of Trie - algorithm

This is an interview question, not homework.
"You have N documents, where N is very large. Each document has a set of words, let's say w1,w2..wm, where m might differ for each document. Now you are given a list of K words, let's say q1,q2…qk.
Write an algorithm to print the list of documents which have the K words in them."
Now, I could figure out solutions using hashing and a trie. But the guy who posted the question had also written that the interviewer wanted a solution using a B-tree.
I am not really able to figure out how to use a B-Tree for this and how efficient that would be. Can somebody please help?

B-Tree is preferred over Trie if our dataset is stored on media with slow random access, for example on conventional hard drives. The interviewer's note that N is very large might imply that it's simply large enough to not fit in memory and should be placed on disk.
As noted in comments: when the data is really huge and it is stored on a disk, the efficiency of a data structure depends more on the number of disk block accesses, not the total amount of all operations. B-Tree contains many records in one node (which can be considered a "data block"), thus requires significantly fewer block accesses than Trie does.
That is exactly the same reason why most DBs store their indexes in a B-Tree. They need fast search through an index located on a conventional hard drive.
Actually, your problem can be solved by putting your (word, documentId) pairs in a DB table and creating an index on the word column or on the entire pair.
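Here is a minimal in-memory sketch of that idea. It assumes a sorted Python list of (word, document id) pairs stands in for the B-tree index; a real index would keep the same sorted pairs in disk pages. The names build_index, docs_with_word and docs_with_all_words are made up for illustration.

```python
from bisect import bisect_left


def build_index(documents):
    """Return a sorted list of unique (word, doc_id) pairs.

    `documents` maps a document id to an iterable of words. The sorted
    list plays the role of the index on the (word, documentId) pair.
    """
    pairs = {(word, doc_id) for doc_id, words in documents.items() for word in words}
    return sorted(pairs)


def docs_with_word(index, word):
    """Collect the ids of all documents containing `word` with a range scan."""
    # (word,) sorts before every (word, doc_id), so this finds the first pair for `word`.
    i = bisect_left(index, (word,))
    result = set()
    while i < len(index) and index[i][0] == word:
        result.add(index[i][1])
        i += 1
    return result


def docs_with_all_words(index, query_words):
    """Intersect the per-word document sets for all query words."""
    sets = [docs_with_word(index, word) for word in query_words]
    return set.intersection(*sets) if sets else set()
```

Each query word costs one search plus a scan over its matching pairs, which is the access pattern a B-tree index is good at.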

You can try a ternary trie. It doesn't take so much space. You can also look for a Kart-trie. It uses a key and 2 leaves: http://code.dogmap.org/kart/.

Related

Disadvantages of tries

I've been studying tries and checking out their advantages and disadvantages. They're quite useful in many practical applications like dictionary, spell checkers etc due to their constant O(m) look-ups (where m is length of the string) and other advantages like providing ordered retrieval of strings, and getting common prefixes. So, the advantages are pretty clear to me, but the limitations are a bit confusing.
I'm following this link : https://en.wikipedia.org/wiki/Trie
Drawbacks listed here are:
Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.
Follow-up question: Why is there a scenario involving secondary storage? Aren't tries also supposed to be stored in main memory? If they're stored in secondary storage, then there's no point in using a trie anyway, as disk access will always take longer.
Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.
Follow-up question : Is it due to the fact that tries would contain more references/pointers for connecting each character to next one, and that'd consume more bytes than if it was stored as a whole string? (I got this reason from one of the answers here). Can anyone elaborate this too?
I'd really appreciate some help here. Thanks.
First, "constant O(m) look-ups" is meaningless. Lookup time in a trie is O(m): it depends on the length of the string you're looking up.
A well constructed hash table (i.e. a good hash function and a reasonable load factor) has O(1) lookup time.
Assuming competent construction, looking up a string in a hash table will be much faster than looking it up in a trie.
Tries and hash tables are used for different things. If all you want is the ability to lookup a word, then a hash table will be faster. If you want to find common prefixes, ordered retrieval, or do similar things, then you want a trie.
A hash table can look up individual strings very quickly. It's like a thoroughbred racehorse. That's all it can do. A trie, on the other hand, is a workhorse that can do a lot of things. It'll never be as fast at lookups as a hash table, but it can do lots of things that the hash table can't do.
For example, finding all the words that start with "pre" will take O(n) time with a dictionary because you have to search all of the words. With a trie, it takes three probes to find the subtree that contains all of those words, and then all you have to do is traverse that subtree. Sure, the worst case is O(n), but that's only if all the words in your trie start with "pre".
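To make the prefix example concrete, here is a small sketch of a naive trie (one node per character, children kept in a dict). The class and method names are my own choice for illustration, not from any library.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def words_with_prefix(self, prefix):
        """Return all stored words starting with `prefix`, sorted."""
        # Walk down one node per prefix character (three probes for "pre").
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Traverse only the subtree below the prefix node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)
```

The work done is proportional to the prefix length plus the size of the matching subtree, not to the total number of stored words.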
Whereas it's true that going to disk will be slower than if the entire trie were in memory, it's wrong to say that a disk-based trie offers no advantage over alternatives. If the data won't fit in memory, then no matter what data structure you use, you'll need some external (i.e. non-memory) storage. The fact that your data access is slower when it's on the disk does not fundamentally change the advantages or disadvantages of trie vs. hash table. For example, a disk-based trie will still be faster than a disk-based hash table when it comes to finding all the words with a particular prefix.
A hash table's overhead is typically a constant multiple of the number of words it contains. That is, in addition to the memory required to store the strings, there is per-string overhead to store the mapping between hash code and string.
Memory for a trie is a little more involved. In the worst case, there is one node per character. All those little node allocations start adding up. Imagine a dictionary that contains 200,000 words, and the average word length is five characters. That's a million nodes of overhead.
Fortunately, there are ways to greatly compress a trie, without losing much, if any, performance. The resulting data structure is much smaller and more cache-friendly than a naively constructed trie.
It's been a while since this was asked, but I'd like to add, if anyone is wondering, that a good hashing function should take O(1) time for fixed memory values such as primitive types or fixed-length lists of primitive types. The same logical operations are often applied on all values to be hashed (logical shift left and right, bitwise operations, etc.). These operations take the same time regardless of what value they're used on. This makes hash tables far quicker, and relatively reliable, at storing values that use up a predictable amount of space. Hashing a string can also be done in O(1) time if you traverse the underlying character array and only pick out characters at intervals to ensure that you're always hashing the same amount of memory.
For example, for a string of length 10, you may hash the 10 characters in the underlying character array, whereas for a string of length 100, you hash based on every tenth character.
So, to answer your question, hashing is usually completed in constant time, whereas insertion or retrieval from a trie is O(n) time, where n is the length of the value to be inserted or retrieved. Even if there is little difference in practice, constant has the advantage of being predictable. All operations on a hash table will take the same time each time, give or take. But with a trie (representing a dictionary of Welsh place names), searching for Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch with one character at the end changed will take far more time than searching for "a". The system will eat through the whole string before realising that it is not in the dictionary. Google and other tech companies tend to prefer nice, predictable (but evenly distributed) hashing to avoid security concerns.
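A rough sketch of the sampling idea described above, assuming Python's built-in hash as the underlying hash function. The function names and the default of 10 samples are illustrative. Note the trade-off: strings of equal length that differ only in characters that are not sampled will get the same hash.

```python
def sampled_chars(text, samples=10):
    """Pick at most `samples` evenly spaced characters from `text`."""
    step = max(1, len(text) // samples)
    return text[::step][:samples]


def sampled_hash(text, samples=10):
    """Hash a bounded number of characters, so the cost does not grow with len(text)."""
    # The length is mixed in so that strings of different lengths are less likely to collide.
    return hash((len(text), sampled_chars(text, samples)))
```

For a string of length 10 every character is used; for a string of length 100 only every tenth character is.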

Is a linked list in a B-tree node superior to an array?

I want to implement a B-tree index for my database.
I have read many data structure and algorithm books to learn how to do it. All implementations use an array to save data and child indexes.
Now I want to know: is a linked list in B-tree node superior to an array?
There are some ideas I've thought about:
When splitting a node, the copy operation will be quicker than with an array.
When inserting data into the middle or at the head of an array, the insertion is slower than inserting into a linked list.
A linked list is not better; in fact, a simple array is not better either (apart from its simplicity, which is a good argument for it, and its search speed if sorted).
You have to realize that the "array" implementation is more a "reference" implementation than a true full power implementation. For example, the implementation of the data/key pairs inside a B-Tree node in commercial implementations uses many strategies to solve two problems: storage efficiency and efficient search of keys in the node.
With regard to efficient search, an array of key/value pairs with an internal balanced tree structure on top of it allows insertion, deletion and search in O(log N); for large B-tree nodes this makes sense.
With regard to memory efficiency, the nature of the data in the key and value is very important. For example, lexicographical keys can be shortened by factoring out a common start (e.g. "good" and "great" have "g" in common), and the data might be compressed as well, using any scheme relevant to the nature of the data. The compression of keys is more complex, as you will want to keep the lexicographical ordering. Remember that the more data and keys you fit in a node, the fewer disk accesses you need.
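One possible key-shortening scheme is front coding, where each key in a sorted node stores only how many leading characters it shares with the previous key, plus the remaining suffix. The sketch below is just an illustration of that scheme, not what any particular commercial implementation does, and the function names are made up.

```python
def front_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, remaining_suffix)."""
    encoded = []
    previous = ""
    for key in sorted_keys:
        shared = 0
        limit = min(len(previous), len(key))
        while shared < limit and previous[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def front_decode(encoded):
    """Rebuild the original sorted keys from the front-coded form."""
    keys = []
    previous = ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys
```

Because decoding walks the keys in order, the sorted order is preserved, but finding a key inside the node requires decoding from the start (or from a periodic uncompressed key).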
The time to split a node is only partially relevant, as it will be several orders of magnitude less than the time to read or write a node on typical media. For SSDs and extremely fast disks (within 10 to 20 years, disks are expected to be as fast as RAM), a lot of research is being conducted to find a successor to B-Trees; stratified B-Trees are one example.
If the BTree is itself stored on the disk then a linked list will make it very complicated to maintain.
Keep the B-Tree structure compact. This will allow more nodes per page, locality of data and allowing caching of more nodes, and fewer disk reads/cache misses.
Use an array.
The perceived in-memory computational benefits are inconsequential.
So, in short, no, linked list is not superior.
A B-tree is typically used in DBs where the data is stored on disk and you want to minimize the number of blocks you need to read. I do not think your proposal would be efficient in that case (although it might be beneficial if you can load all the data into RAM).
If you want to perform those two operations effectively you should use a Skip List (http://en.wikipedia.org/wiki/Skip_list). Performance-wise it will be similar to what you have outlined.

Is hash the best for application requesting high lookup speed?

I have always kept in mind that a hash is the first thing I should resort to if I want to write an application that needs high lookup speed, and that no other data structure would guarantee that.
But I got confused when I saw many posts saying otherwise and suggesting, for example, suffix trees or tries, to name a few.
So I wonder is hash always the best thing for high speed lookup? What if I want both high lookup speed and less space cost?
Is there any material (books or papers) lecturing about the data structures or algorithms on high speed lookup and space efficiency? Any of this kind is highly appreciated.
So I wonder is hash always the best thing for high speed lookup?
No. As stated in comments:
There is never such a thing as the "best data structure for [some generic issue]". Everything is case dependent. Tries and radix trees might be great for strings, since you need to read the string anyway. Arrays allow simplicity and great cache efficiency, and are usually the best for small-scale static information.
I once answered a related question about cases where a tree might be better than a hash table: Hash Table v/s Trees
What if I want both high lookup speed and less space cost?
The two goals can contradict each other. Take the simple example of a hash table of size X vs a hash table of size 2*X: the bigger hash table is less likely to encounter collisions, and thus is expected to be faster than the smaller one.
Is there any material (books or papers) lecturing about the data
structures or algorithms on high speed lookup and space efficiency?
Introduction to Algorithms provides a good walk-through of the main data structures in use. Any algorithm is developed to provide good space and time efficiency, but as said, there is a trade-off, and some algorithms might be better for specific cases than others.
Choosing the right algorithm/data structure/design for the specific problem is what engineering is about, isn't it?
I assume you are talking about strings here, and the answer is "no", hashes are not the fastest or most space efficient way to look up strings, tries are. Of course, writing a hashing algorithm is much, much easier than writing a trie.
One thing you won't find in wikipedia or books about tries is that if you naively implement them with one node per letter, you end up with large numbers of inefficient, one-child nodes. To make a trie that really burns up the CPU you have to implement nodes so that they can have a variable number of characters. This, of course, is even harder than writing a plain trie.
I have written trie implementations that handle over a billion entries and I can tell you that if done properly it is insanely fast, nothing else compares.
One other issue with tries is that you have to write a custom heap, because if you just use some kind of generic memory management it will be slow. So in addition to implementing the trie, you have to implement the heap that the trie runs on. Pretty freakin complicated, but if you do it, you get batshit crazy speed.
Only a good hash implementation will give you good performance, and you cannot compare a hash with a trie in all situations. Where a trie is applicable it is fast, but it can be costly in terms of memory (again, depending on the implementation).
But have you measured performance? Or is this an unnecessary optimization you are looking for? Did the map fail you?
That might also depend on the actual number of elements.
In complexity theory a hash is not bad, but complexity theory is only good if the actual number of elements is bigger than some threshold.
I.e. if you have only 2 elements, there is a faster method than a hash ;-)
Hash tables are a good general purpose structure but they can fail spectacularly if the hash function doesn't suit the input data. Worst case lookup is O(n). They also waste some space as you mentioned. Other general-purpose structures like balanced binary search trees have worse average case but better worst case performance than a hash table. This is important for real-time applications. A trie is a more special-purpose structure tailored to string lookup.

find repeated word in infinite stream of words

You are given an infinite supply of words, which arrive one by one. The words can be very long, and their maximum length is unknown. How will you find out whether a new word is a repeat, and what data structure will you use to store the words? This question was asked to me in an interview. Please help me verify my answer.
Normally use a hash-table to keep track of the count of each word. Since you only have to answer whether the words are duplicated, you can reduce the word count to a bitmask, so that you only store a single bit for each hash index.
If the question is related to big data, like how to write a search engine for Google, your answer may need to involve MapReduce or similar distributed techniques (which are rooted in much the same hash table techniques as described above).
As with most sequential data, a trie would be a good choice here. Using a trie you can store new words very cost-efficiently and still reliably tell whether a word is new. Tries can actually be seen as a form of multiple hashing of the words. If this still leads to problems because the words are too big, you can make it more efficient by producing a directed acyclic word graph (DAWG) from the words in order to share common suffixes as well as prefixes.
If all you need to do is efficiently detect if each word is one you've seen before, a Bloom filter is one nice option. It's kind of like a set and a hash table combined in one, and therefore can result in false positives -- for this reason they are sometimes adapted to use additional techniques to reduce that risk. The advantage of Bloom filters is that they are very space efficient (important if you really don't know how large the list will be). They are also fast. On the downside, you can't get the words out again, you can only tell whether you've seen them or not.
There's a nice description at: http://en.wikipedia.org/wiki/Bloom_filter.
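A minimal Bloom filter sketch for the "have I seen this word?" check. The bit count, the number of hash positions and the way the positions are derived from a single SHA-256 digest (double hashing) are all assumptions chosen for illustration; a real deployment would size them from the expected number of words and the acceptable false-positive rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def seen_before(self, word):
        """Record `word`; return True if it was possibly seen before.

        False means the word is definitely new. True may be a false positive.
        """
        already = True
        for pos in self._positions(word):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                already = False
                self.bits[byte] |= 1 << bit
        return already
```

Memory use is fixed by num_bits regardless of how long the words are, which is the property that matters for an unbounded stream.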

Where is binary search used in practice?

Every programmer is taught that binary search is a good, fast way to search an ordered list of data. There are many toy textbook examples of using binary search, but what about in real programming: where is binary search actually used in real-life programs?
Binary search is used everywhere. Take any sorted collection from any language library (Java, .NET, C++ STL and so on) and they all will use (or have the option to use) binary search to find values. While true that you have to implement it rarely, you still have to understand the principles behind it to take advantage of it.
Binary search can be used to access ordered data quickly when memory space is tight. Suppose you want to store a set of 100.000 32-bit integers in a searchable, ordered data structure but you are not going to change the set often. You can trivially store the integers in a sorted array of 400.000 bytes, and you can use binary search to access it fast. But if you put them e.g. into a B-tree, RB-tree or whatever "more dynamic" data structure, you start to incur memory overhead. To illustrate, storing the integers in any kind of tree where you have left child and right child pointers would make you consume at least 1.200.000 bytes of memory (assuming 32-bit memory architecture). Sure, there are optimizations you can do, but that's how it works in general.
Because it is very slow to update an ordered array (doing insertions or deletions), binary search is not useful when the array changes often.
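As a sketch of the compact sorted-array approach, the following uses Python's array module for packed storage and bisect for the search. The type code "I" is an unsigned int, which is typically (but not guaranteed to be) 4 bytes; the function names are illustrative.

```python
from array import array
from bisect import bisect_left


def make_sorted_set(values):
    """Store the distinct values in a packed, sorted array of unsigned ints."""
    return array("I", sorted(set(values)))


def contains(sorted_values, target):
    """Binary search for `target` in the sorted array."""
    i = bisect_left(sorted_values, target)
    return i < len(sorted_values) and sorted_values[i] == target
```

There are no per-element pointers, so the storage cost is just the element size times the number of elements.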
Here are some practical examples where I have used binary search:
Implementing a "switch() ... case:" construct in a virtual machine where the case labels are individual integers. If you have 100 cases, you can find the correct entry in 6 to 7 steps using binary search, whereas a sequence of conditional branches takes on average 50 comparisons.
Doing fast substring lookup using suffix arrays, which contain all the suffixes of the set of searchable strings in lexicographic order (I wanted to conserve memory and keep the implementation simple).
Finding numerical solutions to an equation (when you are lazy and do not want to implement Newton's method).
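The last item in the list above is usually called the bisection method. A minimal sketch, assuming f is continuous on [lo, hi] with lo < hi and that f(lo) and f(hi) have opposite signs; the function name and tolerance are my own choices.

```python
def bisect_root(f, lo, hi, tolerance=1e-9):
    """Find x in [lo, hi] with f(x) close to 0 by repeatedly halving the interval."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if (f_lo > 0) == (f_hi > 0):
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_mid == 0:
            return mid
        # Keep the half of the interval where the sign change happens.
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For example, bisect_root(lambda x: x * x - 2, 0, 2) converges to the square root of 2.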
Every programmer needs to know how to use binary search when debugging.
When you have a program, and you know that a bug is visible at a particular point
during the execution of the program, you can use binary search to pin-point the
place where it actually happens. This can be much faster than single-stepping through
large parts of the code.
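The same idea in code form: a sketch that finds the first "bad" step, assuming every step before it is good and every step from it onward is bad, and that the last step is known to be bad. The names first_bad and is_bad are hypothetical; is_bad stands for whatever check you run (a test, an assertion, a manual inspection).

```python
def first_bad(num_steps, is_bad):
    """Return the smallest index in range(num_steps) for which is_bad(index) is True.

    Assumes is_bad is False for all earlier indexes, True from that index on,
    and True for the last index.
    """
    lo, hi = 0, num_steps - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid       # the first bad step is at mid or earlier
        else:
            lo = mid + 1   # the first bad step is after mid
    return lo
```

With 1000 steps this needs at most 10 checks instead of up to 1000.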
Binary search is a good and fast way!
Before the arrival of STL and .NET framework, etc, you rather often could bump into situations where you needed to roll your own customized collection classes. Whenever a sorted array would be a feasible place of storing the data, binary search would be the way of locating entries in that array.
I'm quite sure binary search is in widespread use today as well, although it is taken care of "under the hood" by the library for your convenience.
I've implemented binary searches in BTree implementations.
The BTree search algorithms were used for finding the next node block to read but, within the 4K block itself (which contained a number of keys based on the key size), binary search was used to find either the record number (for a leaf node) or the next block (for a non-leaf node).
Blindingly fast compared to sequential search since, like balanced binary trees, you remove half the remaining search space with every check.
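A sketch of that in-node search, with a node modelled as parallel Python lists rather than a 4K block. The layout and the convention that a key equal to a separator goes to the right-hand child are assumptions for illustration; real implementations differ in both.

```python
from bisect import bisect_left, bisect_right


def find_in_leaf(keys, record_numbers, key):
    """Leaf node: return the record number stored for `key`, or None."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return record_numbers[i]
    return None


def child_for_key(separator_keys, child_blocks, key):
    """Non-leaf node: pick the child block to read next.

    child_blocks has len(separator_keys) + 1 entries. Child i covers keys
    below separator_keys[i]; keys equal to a separator go to the right.
    """
    return child_blocks[bisect_right(separator_keys, key)]
```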
I once implemented it (without even knowing that this was indeed binary search) for a GUI control showing two-dimensional data in a graph. Clicking with the mouse should set the data cursor to the point with the closest x value. When dealing with large numbers of points (several 1000, this was way back when x86 was only beginning to get over 100 MHz CPU frequency) this was not really usable interactively - I was doing a linear search from the start. After some thinking it occurred to me that I could approach this in a divide and conquer fashion. Took me some time to get it working under all edge cases.
It was only some time later that I learned that this is indeed a fundamental CS algorithm...
One example is the STL set. The underlying data structure is a balanced binary search tree, which supports look-up, insertion, and deletion in O(log n) thanks to binary search.
Another example is an integer division algorithm that runs in log time.
We still use it heavily in our code to search thousands of ACLs many thousands of times a second. It's useful because the ACLs are static once they come in from file, and we can suffer the expense of growing the array as we add to it at bootup. Blazingly fast once it's running, too.
When you can search a 255 element array in at most 8 compare/jumps (511 in 9, 1023 in 10, etc) you can see that binary search is about as fast as you can get.
Well, binary search is now used in many 3D games and applications. Space is divided into a tree structure (binary space partitioning), and a binary search through it is used to retrieve which subdivisions to display for a given 3D position and camera.
One of its first great showcases was Doom, where binary trees and the associated search sped up the rendering.
Answering your question with hands-on example.
In the R programming language there is a package called data.table. It is known as a C-implemented, high-performance extension for data transformation with a short syntax. It uses binary search. Even without binary search it scales better than competitors.
You can find a benchmark against Python pandas and R dplyr in the project wiki: grouping 2E9 - random order data.
There is also a nice benchmark against databases and big data tools: benchm-databases.
In recent data.table version (1.9.6) binary search was extended and now can be used as index on any atomic column.
I just found a nice summary with which I totally agree - see.
Anyone doing R comparisons should use data.table instead of data.frame. More so for benchmarks. data.table is the best data structure/query language I have found in my career. It's leading the way in The R world, and in my way, in all the data-focused languages.
So yes, binary search is being used, and the world is a much better place thanks to it.
Binary search can be used to debug with Git. It's called git bisect.
Amongst other places, I have an interpreter with a table of command names and a pointer to the function to interpret that command. There are about 60 commands. It would not be incredibly onerous to use a linear search - but I use a binary search.
Semiconductor test programs used for measuring digital timing or analog levels make extensive use of binary search. Automatic Test Equipment (ATE) from Advantest, Teradyne, Verigy and the like can be thought of as truth table blasters, applying input logic and verifying output states of a digital part.
Think of a simple gate, with the input logic changing at time = 0 of each cycle and the output transitioning X ns after the input logic changes. If you strobe the output before T=X, the logic does not match the expected value. Strobe later than T=X, and the logic does match the expected value. Binary search is used to find the threshold between the latest time at which the logic does not match and the earliest time at which it does. (A Teradyne FLEX system resolves timing to 39 ps; other testers are comparable.) That's a simple way to measure transition time. The same technique can be used to solve for setup time, hold time, operable power supply levels, power supply vs. delay, etc.
Any kind of microprocessor, memory, FPGA, logic, and many analog mixed signal circuits use binary search in test and characterization.
-- mike
I had a program that iterated through a collection to perform some calculations. I thought that this was inefficient so I sorted the collection and then used a single binary search to find an item of interest. I returned this item and its matching neighbours. I had in-effect filtered the collection.
Doing this was actually slower than iterating the entire collection and fishing out matching items.
I continued to add items to the collection knowing that the sorting and searching performance would eventually catch up with the iteration. It took a collection of about 600 objects until the speed was identical. 1000 objects had a clear performance benefit.
I would also consider the type of data you are working with, the duplicates and spread. This will have an effect on the sorting and searching.
My answer is to try both methods and time them.
It's the basis for hg bisect
Binary search is useful for fitting a font size to a piece of text when the textbox has fixed dimensions.
Finding roots of an equation is probably one of those very easy things you want to do with a very easy algorithm like binary search.
Delphi users can enjoy binary search when searching for a string in a sorted TStringList.
I believe that the .NET SortedDictionary uses a binary tree behind the scenes (much like the STL map)... so a binary search is used to access elements in the SortedDictionary
Python's list.sort() method uses Timsort which (AFAIK) uses binary search to locate the positions of elements.
Binary search offers a feature that many readymade map/dictionary implementations don't: finding non-exact matches.
For example, I've used binary search to implement geotagging of photos based on GPS logs: put all GPS waypoints in an array sorted by timestamp, and use binary search to identify the waypoint that lies closest in time to each photo's timestamp.
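A sketch of that closest-in-time lookup, assuming the waypoints are held in two parallel lists (timestamps sorted ascending, and the matching waypoint objects). The function name and the tie-breaking rule (the earlier waypoint wins on an exact tie) are my own choices.

```python
from bisect import bisect_left


def closest_waypoint(timestamps, waypoints, photo_time):
    """Return the waypoint whose timestamp is nearest to `photo_time`."""
    if not timestamps:
        raise ValueError("no waypoints")
    # Index of the first waypoint at or after the photo's timestamp.
    i = bisect_left(timestamps, photo_time)
    if i == 0:
        return waypoints[0]
    if i == len(timestamps):
        return waypoints[-1]
    before, after = timestamps[i - 1], timestamps[i]
    if after - photo_time < photo_time - before:
        return waypoints[i]
    return waypoints[i - 1]
```

A hash-based dictionary cannot answer this query directly, because the photo's timestamp is almost never an exact key.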
If you have a set of elements to find in an array you can either search for each of them linearly or sort the array and then use binary search with the same comparison predicate. The latter is much faster.

Resources