Are Tries still a good idea on modern architectures?

One of my favourite data structures in college was the Trie. It's a great data structure for holding a large set of strings when prefixes are shared. Lookups are also nice, since they take O(|length|) time for the string being looked up, regardless of how many strings are in the set.
By comparison, a balanced tree would be O(log N) in the number of set items, plus whatever you pay for comparisons. A hash table would involve the hash calculation, comparison, etc.
It is therefore surprising to me that there is no Trie implementation in the standard library of most languages.
The only reason I could come up with is that memory access costs might make it too expensive. Rather than visiting O(log N) locations as in a tree lookup, you are now visiting O(|length|) different locations, with all the cache consequences that implies. If the strings are long, this could end up being far too much.
So I'm wondering: how much is what I just described a concern? What do you do when you need to store a large set or map of strings?
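For concreteness, here is a minimal trie as a hypothetical Python sketch (the names are mine, not from any library): contains does one pointer chase per character, which is both the O(|length|) guarantee and, on modern hardware, the potential cache-miss problem.

    class TrieNode:
        __slots__ = ("children", "terminal")
        def __init__(self):
            self.children = {}      # char -> TrieNode
            self.terminal = False   # True if a stored string ends here

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.terminal = True

        def contains(self, word):
            node = self.root
            for ch in word:                   # one node visit per character...
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.terminal              # ...so O(|word|), independent of set size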

I hadn't thought of this as an area of concern before, but now that you mention it, there are times when a standard Trie implementation might be handy. On the other hand, as far as I know, Tries are used by Python, Perl, and other string-savvy languages that I use now.
Last I checked, which was ages ago, the BSD kernel code used Tries (Patricia Tries) in the code to select the best interface for sending packets. Looks like Wikipedia has some info.

You could just build two sample apps and see which one performs better. Memory access is cheap, assuming you don't page fault; then it's very expensive. For client application development, it's almost always better to compute than to access memory, for this very reason. Modern processors are ridiculously fast, but cache misses still hurt.

I did some performance testing in C# with a Trie and a Dictionary (a strongly typed hash table). I found that the Dictionary was 5-10 times faster than the Trie. Perhaps my implementation of the Trie could be optimized a bit, but hardly enough to be much faster than (or perhaps even as fast as) the Dictionary.
The ContainsKey method in the dictionary is close to an O(1) operation (depending on how good the hashing algorithm is), so it's not easy to make a collection that beats that as long as the hashing algorithm is reasonably fast.
With a custom IEqualityComparer you can use almost anything as a key in a Dictionary, which makes it rather flexible. A Trie is a bit more limited in what you can use as a key, which limits its usefulness somewhat.
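The original test was in C#; as a hedged Python analogue (reusing the hypothetical Trie sketch from the question above), the shape of the experiment looks like this:

    import random, string, timeit

    words = ["".join(random.choices(string.ascii_lowercase, k=12))
             for _ in range(100_000)]
    probes = random.sample(words, 1_000)

    trie = Trie()                    # the sketch from the question above
    for w in words:
        trie.insert(w)
    table = dict.fromkeys(words)     # stand-in for a hash-based Dictionary

    t_trie = timeit.timeit(lambda: [trie.contains(w) for w in probes], number=100)
    t_dict = timeit.timeit(lambda: [w in table for w in probes], number=100)
    print(f"trie: {t_trie:.3f}s  dict: {t_dict:.3f}s")

In CPython the dict side wins by an even larger margin, since the per-character loop is interpreted while the hash lookup is a single C call; the underlying reason is the same, though: one hash plus a probe or two beats a dozen dependent pointer chases.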

Tries are particularly useful for IP address and subnet lookups, and always will be, on any architecture. You can find a trie implementation for addresses in the open-source IPAddress Java library. Disclaimer: I am the project manager of that library.
Tries organize addresses by prefix, as in this example:
● 0.0.0.0/0 (10)
└─○ 0.0.0.0/1 (9)
  ├─○ 8.0.0.0/6 (8)
  │ ├─● 8.9.8.0/24 (7)
  │ │ └─○ 8.9.8.0/28 (6)
  │ │   ├─● 8.9.8.0/29 (1)
  │ │   └─● 8.9.8.8/29 (5)
  │ │     ├─○ 8.9.8.8/30 (2)
  │ │     │ ├─● 8.9.8.9 (1)
  │ │     │ └─● 8.9.8.10 (1)
  │ │     └─● 8.9.8.12/30 (2)
  │ │       └─● 8.9.8.12/31 (1)
  │ └─● 10.0.2.15 (1)
  └─● 127.0.0.1 (1)
The structure allows for efficient lookup, routing, and containment checks (● marks an explicitly added address or prefix; ○ marks an internal node).
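The IPAddress library has its own API; purely to illustrate the idea, here is a hypothetical Python sketch of a bitwise trie doing longest-prefix matching (all names are mine):

    import ipaddress

    class PrefixTrie:
        # Each node is a dict: 0/1 -> child node, "net" -> stored prefix.
        def __init__(self):
            self.root = {}

        def insert(self, cidr):
            net = ipaddress.ip_network(cidr)
            bits, node = int(net.network_address), self.root
            for i in range(net.prefixlen):             # one bit per level
                b = (bits >> (net.max_prefixlen - 1 - i)) & 1
                node = node.setdefault(b, {})
            node["net"] = net

        def longest_match(self, addr):
            ip = ipaddress.ip_address(addr)
            bits, best, node = int(ip), None, self.root
            for i in range(ip.max_prefixlen):
                if "net" in node:
                    best = node["net"]                 # deepest covering prefix so far
                b = (bits >> (ip.max_prefixlen - 1 - i)) & 1
                if b not in node:
                    break
                node = node[b]
            if "net" in node:
                best = node["net"]                     # handle full-length (host) routes
            return best

    t = PrefixTrie()
    for p in ("0.0.0.0/0", "8.9.8.0/24", "8.9.8.8/29"):
        t.insert(p)
    print(t.longest_match("8.9.8.10"))   # 8.9.8.8/29

Lookup cost is bounded by the address width (32 or 128 bits), not by the number of stored routes, which is why routing tables like this structure.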

It's hard to make a standard Trie class broadly useful enough to justify putting one in a standard library.
Of course it would require anything you put in it to implement a common string-like interface.
And since the Trie doesn't actually store the strings, iterating through it is either slow, because it has to reconstitute all the elements, or awkward, because you don't actually get the strings.
And, you know, the average and maximum key lengths are both at least O(log N) for a set of N distinct strings. If you know the elements are strings, then you can search a BST in O(length + log N), which is pretty much the same as O(length), and you retain the advantage of actually having the strings in the structure.
Really the only thing Tries win at is memory-efficient storage of strings with long common prefixes. It happens, but it's not that common. I guess none of the languages have deemed it worth the trouble to include one.

Related

What is the utility of the treap data structure?

I am currently studying advanced data structures and I came across a curious data structure called a Treap. I understand what a Treap is, but I can't seem to find its utility in a valid use case.
Why should you use such a data structure, and in what types of problems/conditions are treaps best used?
I find myself reaching for hash maps, min/max heaps, binary search trees, or balanced binary search trees, but I can't tell why you should use a treap.
They are easier to implement and, more importantly, that makes them easier to modify and maintain in the future if you want to make slight variations or change them in some way. They also allow for efficient parallel versions of the set operations union/intersect/difference, which is extremely valuable.
Using them simultaneously as a heap and a binary tree isn't really very handy, unless the values you use for priorities happen to be really nicely randomly distributed/permuted. I suppose there might be a case where that would be handy, but it seems really unlikely. Data that randomly distributed is usually more like a hash key, and hash keys typically aren't useful as ordered data. How often do you want to pull people out in order of their SSNs? It's possible, but unlikely.
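To make "easier to implement" concrete, here is a minimal treap insert as a hypothetical Python sketch (names and structure are mine): it is an ordinary BST insertion plus rotations driven by random priorities, with no color rules or height bookkeeping.

    import random

    class Node:
        def __init__(self, key):
            self.key = key
            self.priority = random.random()   # random priority balances the tree in expectation
            self.left = self.right = None

    def insert(root, key):
        # Plain BST insert, then rotate the new node up while its
        # priority beats its parent's (restoring the heap property).
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
            if root.left.priority > root.priority:        # rotate right
                pivot, root.left = root.left, root.left.right
                pivot.right = root
                return pivot
        else:
            root.right = insert(root.right, key)
            if root.right.priority > root.priority:       # rotate left
                pivot, root.right = root.right, root.right.left
                pivot.left = root
                return pivot
        return root

Deletion is just as short: rotate the doomed node down, always lifting the higher-priority child, until it is a leaf, then drop it.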

How to deal with duplicates in red-black trees?

So I've been trying (so far unsuccessfully) to make my red-black tree implementation work consistently with duplicates, but it always seems to be missing that small something, so here I am.
I tried making the tree lean to one side, but it didn't seem to balance properly (from the color perspective). I'd like to ask how one should go about adding duplicates to a red-black tree (apart, obviously, from making the node fat, i.e. holding or pointing to duplicate key values).
Not really looking for a code review, more interested in suggestions. The methods I use for insert and balancing are the standard RB-Insert and its fixup, taken from Introduction to Algorithms, Third Edition (while rotations are pretty obvious).
If you look at that pseudo-code, it is completely agnostic to the question of whether keys are duplicated. The code only looks at the result of comparing keys, and doesn't care if they are identical or not. In fact, unique-key implementations need to go out of their way to make RB-Insert detect duplicate keys. The data structure doesn't naturally care about this, and the algorithms and proofs hold whether there are duplicate keys or not. If you implemented these functions correctly, it should work as is.
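To see why, here is a hypothetical Python sketch of just the descent in RB-Insert (the node fields are assumptions): the walk tests only key < node.key, so an equal key simply falls through to the right branch, and nothing else in the algorithm needs to change.

    def descend(root, key):
        # The downward walk from CLRS's RB-Insert: only a less-than test
        # is made, so a duplicate key just follows the right branch.
        parent, node = None, root
        while node is not None:
            parent = node
            node = node.left if key < node.key else node.right
        return parent   # the node the new key will hang under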
I also disagree with the comments advising you to hold what you call "fat nodes". Storing duplicate keys as separate nodes is the common implementation of C++'s std::multimap, for example. Note that, from a computational-complexity point of view, if you have n keys altogether but each distinct key appears k times, then with the "efficient" fat-node version the basic find operation costs Θ(log(n / k)) = Θ(log(n) - log(k)), while with the multiple-key version it costs Θ(log(n)). In real-life cases, probably k << n, which means that the relative difference is negligible.

Why are Haskell Maps implemented as balanced binary trees instead of traditional hashtables?

From my limited knowledge of Haskell, it seems that Maps (from Data.Map) are supposed to be used much like a dictionary or hashtable in other languages, and yet are implemented as self-balancing binary search trees.
Why is this? Using a binary tree makes lookups O(log(n)) as opposed to O(1), and requires that the elements be in Ord. Certainly there is a good reason, so what are the advantages of using a binary tree?
Also:
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Hash tables can't be implemented efficiently without mutable state, because they're based on array lookup. The key is hashed and the hash determines the index into an array of buckets. Without mutable state, inserting elements into the hashtable becomes O(n) because the entire array must be copied (alternative non-copying implementations, like DiffArray, introduce a significant performance penalty). Binary-tree implementations can share most of their structure so only a couple pointers need to be copied on inserts.
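That structural sharing is easy to see in a sketch (hypothetical Python mimicking the functional style; balancing is omitted for brevity): an insert copies only the O(log n) nodes on the search path and shares every other subtree with the old version.

    from collections import namedtuple

    Tree = namedtuple("Tree", "key left right")   # immutable node

    def insert(t, key):
        # Path copying: allocate new nodes only along the search path;
        # untouched subtrees are shared with the previous version.
        if t is None:
            return Tree(key, None, None)
        if key < t.key:
            return Tree(t.key, insert(t.left, key), t.right)
        if key > t.key:
            return Tree(t.key, t.left, insert(t.right, key))
        return t   # key already present: reuse the old tree as-is

    v1 = insert(insert(insert(None, 2), 1), 3)
    v2 = insert(v1, 0)            # v1 is still intact and fully usable
    assert v2.right is v1.right   # the right subtree is shared, not copied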
Haskell certainly can support traditional hash tables, provided that the updates are in a suitable monad. The hashtables package is probably the most widely used implementation.
One advantage of binary trees and other non-mutating structures is that they're persistent: it's possible to keep older copies of data around with no extra book-keeping. This might be useful in some sort of transaction algorithm for example. They're also automatically thread-safe (although updates won't be visible in other threads).
Traditional hashtables rely on memory mutation in their implementation. Mutable memory and referential transparency are at odds, so that relegates hashtable implementations to either the IO or ST monads. Trees can be implemented persistently and efficiently by leaving old leaves in memory and returning new root nodes which point to the updated trees. This lets us have pure Maps.
The quintessential reference is Chris Okasaki's Purely Functional Data Structures.
Why is this? Using a binary tree makes lookups O(log(n)) as opposed to O(1)
Lookup is only one of the operations; insertion/modification may be more important in many cases; there are also memory considerations. The main reason the tree representation was chosen is probably that it is more suited for a pure functional language. As "Real World Haskell" puts it:
Maps give us the same capabilities as hash tables do in other languages. Internally, a map is implemented as a balanced binary tree. Compared to a hash table, this is a much more efficient representation in a language with immutable data. This is the most visible example of how deeply pure functional programming affects how we write code: we choose data structures and algorithms that we can express cleanly and that perform efficiently, but our choices for specific tasks are often different from their counterparts in imperative languages.
This:
and requires that the elements be in Ord.
does not seem like a big disadvantage. After all, with a hash map you need keys to be Hashable, which seems to be more restrictive.
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Unfortunately, I cannot provide an extensive comparative analysis, but there is a hash map package, and you can check out its implementation details and performance figures in this blog post and decide for yourself.
My answer to what the advantage of using binary trees is would be: range queries. They require, semantically, a total preorder, and profit algorithmically from a balanced search-tree organization.
For simple lookup, I'm afraid there may only be good Haskell-specific answers, not good answers per se: lookup (and indeed hashing) requires only a setoid (equality/equivalence on its key type), which supports efficient hashing on pointers (which, for good reasons, are not ordered in Haskell). Like the various forms of tries (e.g. ternary tries for elementwise update, others for bulk updates), hashing into arrays (open or closed) is typically considerably more efficient than elementwise searching in binary trees, both space- and time-wise.
Hashing and tries can be defined generically, though that has to be done by hand -- GHC doesn't derive it (yet?). Data structures such as Data.Map tend to be fine for prototyping and for code outside of hotspots, but where they are hot they easily become a performance bottleneck. Luckily, Haskell programmers need not be concerned about performance, only their managers. (For some reason I presently can't find a way to access the key redeeming feature of search trees amongst the 80+ Data.Map functions: a range query interface. Am I looking in the wrong place?)

A Haskell hash implementation that does not live in the IO monad

I am looking for a data structure that works a bit like Data.HashTable but that is not encumbered by the IO monad. At the moment, I am using [(key,val)]. I would like a structure that is O(log n) where n is the number of key value pairs.
The structure gets built infrequently compared to how often it must be read, and when it is built, I have all the key value pairs available at the same time. The keys are Strings if that makes a difference.
It would also be nice to know at what size it is worth moving away from [(key,val)].
You might consider:
Data.Map
or alternatively,
Data.HashMap
The former is the standard container for storing and looking up elements by keys in Haskell. The latter is a new library specifically optimized for hashing keys.
Johan Tibell's recent talk, Faster persistent data structures through hashing gives an overview, while Milan Straka's recent Haskell Symposium paper specifically outlines the Data.Map structure and the hashmap package.
If you have all the key-value pairs up front you might want to consider a perfect hash function.
Benchmarking will tell you when to switch from a simple list.

Where is binary search used in practice?

Every programmer is taught that binary search is a good, fast way to search an ordered list of data. There are many toy textbook examples of using binary search, but what about in real programming: where is binary search actually used in real-life programs?
Binary search is used everywhere. Take any sorted collection from any language library (Java, .NET, the C++ STL and so on) and they will all use (or offer the option to use) binary search to find values. While it's true that you rarely have to implement it yourself, you still have to understand the principles behind it to take advantage of it.
Binary search can be used to access ordered data quickly when memory space is tight. Suppose you want to store a set of 100,000 32-bit integers in a searchable, ordered data structure, but you are not going to change the set often. You can trivially store the integers in a sorted array of 400,000 bytes, and you can use binary search to access it fast. But if you put them e.g. into a B-tree, RB-tree or whatever "more dynamic" data structure, you start to incur memory overhead. To illustrate, storing the integers in any kind of tree where you have left-child and right-child pointers would make you consume at least 1,200,000 bytes of memory (assuming a 32-bit memory architecture). Sure, there are optimizations you can do, but that's how it works in general.
Because it is very slow to update an ordered array (doing insertions or deletions), binary search is not useful when the array changes often.
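In Python terms, a hedged sketch of the same trade-off: the array module gives the compact four-bytes-per-value layout, and bisect does the O(log n) search.

    import random
    from array import array
    from bisect import bisect_left

    # 100,000 32-bit ints stored contiguously: about 400,000 bytes in total.
    values = array("i", sorted(random.sample(range(10**9), 100_000)))

    def contains(x):
        i = bisect_left(values, x)      # O(log n) probes into the flat array
        return i < len(values) and values[i] == x

    print(contains(values[12345]), contains(-1))   # True False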
Here are some practical examples where I have used binary search:
Implementing a "switch() ... case:" construct in a virtual machine where the case labels are individual integers. If you have 100 cases, you can find the correct entry in 6 to 7 steps using binary search, where as sequence of conditional branches takes on average 50 comparisons.
Doing fast substring lookup using suffix arrays, which contain all the suffixes of the set of searchable strings in lexicographic ordering (I wanted to conserve memory and keep the implementation simple).
Finding numerical solutions to an equation (when you are lazy and do not want to implement Newton's method).
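For the switch() case above, a hypothetical Python sketch of the idea (labels and handlers are made up; a real VM would work over compiled dispatch tables, not lambdas):

    from bisect import bisect_left

    labels   = [3, 17, 42, 99, 250, 1024]                  # sorted case labels
    handlers = [lambda tag=t: f"case {tag}" for t in labels]

    def dispatch(value):
        i = bisect_left(labels, value)     # ~log2(len(labels)) comparisons
        if i < len(labels) and labels[i] == value:
            return handlers[i]()
        return "default"

    print(dispatch(42), dispatch(5))       # case 42 default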
Every programmer needs to know how to use binary search when debugging.
When you have a program, and you know that a bug is visible at a particular point during the execution of the program, you can use binary search to pinpoint the place where it actually happens. This can be much faster than single-stepping through large parts of the code.
Binary search is a good and fast way!
Before the arrival of the STL, the .NET framework, etc., you could rather often bump into situations where you needed to roll your own customized collection classes. Whenever a sorted array was a feasible place to store the data, binary search was the way to locate entries in that array.
I'm quite sure binary search is in widespread use today as well, although it is taken care of "under the hood" by the library for your convenience.
I've implemented binary searches in BTree implementations.
The BTree search algorithms were used for finding the next node block to read, but within the 4K block itself (which contained a number of keys based on the key size), binary search was used to find either the record number (for a leaf node) or the next block (for a non-leaf node).
Blindingly fast compared to sequential search since, like balanced binary trees, you remove half the remaining search space with every check.
I once implemented it (without even knowing that this was indeed binary search) for a GUI control showing two-dimensional data in a graph. Clicking with the mouse should set the data cursor to the point with the closest x value. When dealing with large numbers of points (several thousand; this was way back when x86 CPUs were only beginning to exceed 100 MHz), this was not really usable interactively - I had been doing a linear search from the start. After some thinking it occurred to me that I could approach this in a divide-and-conquer fashion. It took me some time to get it working in all the edge cases.
It was only some time later that I learned that this is indeed a fundamental CS algorithm...
One example is the STL set. The underlying data structure is a balanced binary search tree which supports look-up, insertion, and deletion in O(log n) due to binary search.
Another example is an integer division algorithm that runs in log time.
We still use it heavily in our code to search thousands of ACLs many thousands of times a second. It's useful because the ACLs are static once they come in from file, and we can suffer the expense of growing the array as we add to it at bootup. Blazingly fast once it's running, too.
When you can search a 255 element array in at most 7 compare/jumps (511 in 8, 1023 in 9, etc) you can see that binary search is about as fast as you can get.
Well, binary search is now used in most 3D games and applications. Space is divided into a tree structure (such as a BSP tree) and a binary search is used to retrieve which subdivisions to display according to a 3D position and the camera.
One of its first great showcases was Doom. Binary trees and the associated search enhanced the rendering.
Answering your question with a hands-on example.
In the R programming language there is a package called data.table. It is known for its C implementation, short syntax, and high performance for data transformation. It uses binary search; even without binary search it scales better than its competitors.
You can find benchmarks vs. Python pandas and vs. R dplyr in the project wiki: grouping 2E9 rows of random-order data.
There is also a nice benchmark against databases and big-data tools: benchm-databases.
In a recent data.table version (1.9.6), binary search was extended and can now be used as an index on any atomic column.
I just found a nice summary with which I totally agree - see.
Anyone doing R comparisons should use data.table instead of data.frame, even more so for benchmarks. data.table is the best data structure/query language I have found in my career. It's leading the way in the R world and, in my view, among all data-focused languages.
So yes, binary search is being used, and the world is a much better place thanks to it.
Binary search can be used to debug with Git. It's called git bisect.
Amongst other places, I have an interpreter with a table of command names and a pointer to the function to interpret that command. There are about 60 commands. It would not be incredibly onerous to use a linear search - but I use a binary search.
Semiconductor test programs used for measuring digital timing or analog levels make extensive use of binary search. Automatic Test Equipment (ATE) from Advantest, Teradyne, Verigy and the like can be thought of as truth table blasters, applying input logic and verifying output states of a digital part.
Think of a simple gate, with the input logic changing at time = 0 of each cycle, and the output transitioning X ns after the input logic changes. If you strobe the output before T=X, the logic does not match the expected value; strobe later than T=X, and it does. Binary search is used to find the threshold between the latest strobe at which the logic does not match and the earliest at which it does. (A Teradyne FLEX system resolves timing to 39 ps; other testers are comparable.) That's a simple way to measure transition time. The same technique can be used to solve for setup time, hold time, operable power supply levels, power supply vs. delay, etc.
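Stripped of the hardware, the measurement is a binary search for the flip point of a monotone predicate. A hypothetical Python sketch, where passes_at stands in for one strobe of the tester:

    def find_threshold(lo, hi, passes_at, resolution=1e-12):
        # Invariant: passes_at(lo) is False and passes_at(hi) is True.
        # Halve the bracket until it is tighter than the tester resolution.
        while hi - lo > resolution:
            mid = (lo + hi) / 2
            if passes_at(mid):
                hi = mid    # earliest passing strobe so far
            else:
                lo = mid    # latest failing strobe so far
        return hi

    # Toy stand-in: the output settles at t = 3.2 ns.
    print(find_threshold(0.0, 10e-9, lambda t: t >= 3.2e-9))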
Any kind of microprocessor, memory, FPGA, logic, and many analog mixed signal circuits use binary search in test and characterization.
-- mike
I had a program that iterated through a collection to perform some calculations. I thought this was inefficient, so I sorted the collection and then used a single binary search to find an item of interest. I returned this item and its matching neighbours. I had, in effect, filtered the collection.
Doing this was actually slower than iterating the entire collection and fishing out matching items.
I continued to add items to the collection knowing that the sorting and searching performance would eventually catch up with the iteration. It took a collection of about 600 objects until the speed was identical. 1000 objects had a clear performance benefit.
I would also consider the type of data you are working with, the duplicates and spread. This will have an effect on the sorting and searching.
My answer is to try both methods and time them.
It's the basis for hg bisect.
Binary search is useful for adjusting font size to fit text into a text box of fixed dimensions.
Finding roots of an equation is probably one of those very easy things you want to do with a very easy algorithm like binary search.
Delphi users can enjoy binary search when searching for a string in a sorted TStringList.
I believe that the .NET SortedDictionary uses a binary tree behind the scenes (much like the STL map)... so a binary search is used to access elements in the SortedDictionary
Python's list.sort() method uses Timsort, which (AFAIK) uses binary search to locate the positions of elements.
Binary search offers a feature that many readymade map/dictionary implementations don't: finding non-exact matches.
For example, I've used binary search to implement geotagging of photos based on GPS logs: put all GPS waypoints in an array sorted by timestamp, and use binary search to identify the waypoint that lies closest in time to each photo's timestamp.
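With the waypoints sorted by timestamp, the non-exact match is a few lines with Python's bisect module (a sketch under the same assumptions):

    from bisect import bisect_left

    def closest_waypoint(timestamps, t):
        # timestamps is sorted; return the index of the entry nearest to t.
        i = bisect_left(timestamps, t)
        if i == 0:
            return 0
        if i == len(timestamps):
            return len(timestamps) - 1
        # The insertion point's two neighbours are the only candidates.
        return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

    log = [100, 160, 220, 300]          # waypoint timestamps from the GPS log
    print(closest_waypoint(log, 215))   # 2  (220 is closer than 160)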
If you have a set of elements to find in an array, you can either search for each of them linearly, or sort the array once and then use binary search with the same comparison predicate. The latter is much faster.
