Is hash the best for application requesting high lookup speed? - algorithm

I keep in mind that hash would be first thing I should resort to if I want to write an application which requests high lookup speed, and any other data structure wouldn't guarantee that.
But I got confused when saw some many post saying different, such as suffix tree, trie, to name a few.
So I wonder is hash always the best thing for high speed lookup? What if I want both high lookup speed and less space cost?
Is there any material (books or papers) lecturing about the data structures or algorithms **on high speed lookup and space efficiency? Any of this kind is highly appreciated.

So I wonder is hash always the best thing for high speed lookup?
No. As stated in comments:
There is never such a thing Best data structure for [some generic issue]. Everything is case dependent. Tries and radix trees might be great for strings, since you need to read the string anyway. arrays allows simplicity and great cache efficiency - and are usually the best for small scale static information
I once answered a related question of cases where a tree might be better then a hash table: Hash Table v/s Trees
What if I want both high lookup speed and less space cost?
The two might be self-contradicting. Even for the simple example of a hash table of size X vs a hash table of size 2*X. The bigger hash table is less likely to encounter collisions, and thus is expected to be faster then the smaller one.
Is there any material (books or papers) lecturing about the data
structures or algorithms on high speed lookup and space efficiency?
Introduction to Algorithms provide a good walk through on the main data structure used. Any algorithm developed is trying to provide a good space and time efficiency, but like said, there is a trade off, and some algorithms might be better for specific cases then others.
Choosing the right algorithm/data structure/design for the specific problem is what engineering is about, isn't it?

I assume you are talking about strings here, and the answer is "no", hashes are not the fastest or most space efficient way to look up strings, tries are. Of course, writing a hashing algorithm is much, much easier than writing a trie.
One thing you won't find in wikipedia or books about tries is that if you naively implement them with one node per letter, you end up with large numbers of inefficient, one-child nodes. To make a trie that really burns up the CPU you have to implement nodes so that they can have a variable number of characters. This, of course, is even harder than writing a plain trie.
I have written trie implementations that handle over a billion entries and I can tell you that if done properly it is insanely fast, nothing else compares.
One other issue with tries is that you have to write a custom heap, because if you just use some kind of generic memory management it will be slow. So in addition to implementing the trie, you have to implement the heap that the trie runs on. Pretty freakin complicated, but if you do it, you get batshit crazy speed.

Only a good implementation of hash will give you good performance. And you cannot compare hash with Trie for all situations. Situations where Trie is applicable, is fast, but it can be costly in terms of memory, (again dependent on implementation).
But have you measured performance? Or it is unnecessary optimization you are looking for. Did the map fail you?

That might also depend on the actual number of elements.
In complexity theory a hash is not bad, but complexity theory is only good if the actual number of elements is bigger than some threshold.
I.e. if you have only 2 elements, there is a faster method than a hash ;-)

Hash tables are a good general purpose structure but they can fail spectacularly if the hash function doesn't suit the input data. Worst case lookup is O(n). They also waste some space as you mentioned. Other general-purpose structures like balanced binary search trees have worse average case but better worst case performance than a hash table. This is important for real-time applications. A trie is a more special-purpose structure tailored to string lookup.

Related

What's a performant data-structure to retrieve the most-commonly-retrieved value, from a small set of values, very fast?

I need to cache a map of "seen version IDs" -> "MD5 of that version-id."
For instance,
{
"version/20220531-0200-1822-g296fa0290a3": "933cbfc50909025f57d6434ec593461c",
"version/20211215-0200-1900-99046b102fdb": "2aa036d04e42086e9f7d7a7f0bdfe812"
}
This list should only ever contain a few entries; a very small data-structure — but retrieving the most-commonly-accessed entry should be absolutely blazing-fast.
Obviously the past of least resistance is using a standard-library Hashtbl or just an array and string-comparison; but I'm hoping I can do better than that — and learn a little about data-structures in the process.
Is there some sort of self-sorting-by-access-frequency data-structure that would be ideal for this?
Try using a heap keyed on the number of times an element has been looked up. It's designed for O(constant) lookup time, O(log n) deletion and insertion. It has great memory locality and is about the fastest you can get.
In general though, with small N the asymptotic runtime of a data structure becomes less relevant compared to various overheads of the specific implementations of the data structures or its use case.

Why are Haskell Maps implemented as balanced binary trees instead of traditional hashtables?

From my limited knowledge of Haskell, it seems that Maps (from Data.Map) are supposed to be used much like a dictionary or hashtable in other languages, and yet are implemented as self-balancing binary search trees.
Why is this? Using a binary tree reduces lookup time to O(log(n)) as opposed to O(1) and requires that the elements be in Ord. Certainly there is a good reason, so what are the advantages of using a binary tree?
Also:
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Hash tables can't be implemented efficiently without mutable state, because they're based on array lookup. The key is hashed and the hash determines the index into an array of buckets. Without mutable state, inserting elements into the hashtable becomes O(n) because the entire array must be copied (alternative non-copying implementations, like DiffArray, introduce a significant performance penalty). Binary-tree implementations can share most of their structure so only a couple pointers need to be copied on inserts.
Haskell certainly can support traditional hash tables, provided that the updates are in a suitable monad. The hashtables package is probably the most widely used implementation.
One advantage of binary trees and other non-mutating structures is that they're persistent: it's possible to keep older copies of data around with no extra book-keeping. This might be useful in some sort of transaction algorithm for example. They're also automatically thread-safe (although updates won't be visible in other threads).
Traditional hashtables rely on memory mutation in their implementation. Mutable memory and referential transparency are at ends, so that relegates hashtable implementations to either the IO or ST monads. Trees can be implemented persistently and efficiently by leaving old leaves in memory and returning new root nodes which point to the updated trees. This lets us have pure Maps.
The quintessential reference is Chris Okasaki's Purely Functional Data Structures.
Why is this? Using a binary tree reduces lookup time to O(log(n)) as opposed to O(1)
Lookup is only one of the operations; insertion/modification may be more important in many cases; there are also memory considerations. The main reason the tree representation was chosen is probably that it is more suited for a pure functional language. As "Real World Haskell" puts it:
Maps give us the same capabilities as hash tables do in other languages. Internally, a map is implemented as a balanced binary tree. Compared to a hash table, this is a much more efficient representation in a language with immutable data. This is the most visible example of how deeply pure functional programming affects how we write code: we choose data structures and algorithms that we can express cleanly and that perform efficiently, but our choices for specific tasks are often different their counterparts in imperative languages.
This:
and requires that the elements be in Ord.
does not seem like a big disadvantage. After all, with a hash map you need keys to be Hashable, which seems to be more restrictive.
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Unfortunately, I cannot provide an extensive comparative analysis, but there is a hash map package, and you can check out its implementation details and performance figures in this blog post and decide for yourself.
My answer to what the advantage of using binary trees is, would be: range queries. They require, semantically, a total preorder, and profit from a balanced search tree organization algorithmically. For simple lookup, I'm afraid there may only be good Haskell-specific answers, but not good answers per se: Lookup (and indeed hashing) requires only a setoid (equality/equivalence on its key type), which supports efficient hashing on pointers (which, for good reasons, are not ordered in Haskell). Like various forms of tries (e.g. ternary tries for elementwise update, others for bulk updates) hashing into arrays (open or closed) is typically considerably more efficient than elementwise searching in binary trees, both space and timewise. Hashing and Tries can be defined generically, though that has to be done by hand -- GHC doesn't derive it (yet?). Data structures such as Data.Map tend to be fine for prototyping and for code outside of hotspots, but where they are hot they easily become a performance bottleneck. Luckily, Haskell programmers need not be concerned about performance, only their managers. (For some reason I presently can't find a way to access the key redeeming feature of search trees amongst the 80+ Data.Map functions: a range query interface. Am I looking the wrong place?)

If consistent hash is efficient,why don't people use it everywhere?

I was asked some shortcommings of consistent hash. But I think it just costs a little more than a traditional hash%N hash. As the title mentioned, if consistent hash is very good, why not we just use it?
Do you know more? Who can tell me some?
Implementing consistent hashing is not trivial and in many cases you have a hash table that rarely or never needs remapping or which can remap rather fast.
The only substantial shortcoming of consistent hashing I'm aware of is that implementing it is more complicated than simple hashing. More code means more places to introduce a bug, but there are freely available options out there now.
Technically, consistent hashing consumes a bit more CPU; consulting a sorted list to determine which server to map an object to is an O(log n) operation, where n is the number of servers X the number of slots per server, while simple hashing is O(1).
In practice, though, O(log n) is so fast it doesn't matter. (E.g., 8 servers X 1024 slots per server = 8192 items, log2(8192) = 13 comparisons at most in the worst case.) The original authors tested it and found that computing the cache server using consistent hashing took only 20 microseconds in their setup. Likewise, consistent hashing consumes space to store the sorted list of server slots, while simple hashing takes no space, but the amount required is minuscule, on the order of Kb.
Why is it not better known? If I had to guess, I would say it's only because it can take time for academic ideas to propagate out into industry. (The original paper was written in 1997.)
I assume you're talking about hash tables specifically, since you mention mod N. Please correct me if I'm wrong in that assumption, as hashes are used for all sorts of different things.
The reason is that consistent hashing doesn't really solve a problem that hash tables pressingly need to solve. On a rehash, a hash table probably needs to reassign a very large fraction of its elements no matter what, possibly a majority of them. This is because we're probably rehashing to increase the size of our table, which is usually done quadratically; it's very typical, for instance, to double the amount of nodes, once the table starts to get too full.
So in consistent hashing terms, we're not just adding a node; we're doubling the amount of nodes. That means, one way or another, best case, we're moving half of the elements. Sure, a consistent hashing technique could cut down on the moves, and try to approach this ideal, but the best case improvement is only a constant factor of 2x, which doesn't change our overall complexity.
Approaching from the other end, hash tables are all about cache performance, in most applications. All interest in making them go fast is on computing stuff as quickly as possible, touching as little memory as possible. Adding consistent hashing is probably going to be more than a 2x slowdown, no matter how you look at this; ultimately, consistent hashing is going to be worse.
Finally, this entire issue is sort of unimportant from another angle. We want rehashing to be fast, but it's much more important that we don't rehash at all. In any normal practical scenario, when a programmer sees he's having a problem due to rehashing, the correct answer is nearly always to find a way to avoid (or at least limit) the rehashing, by choosing an appropriate size to begin with. Given that this is the typical scenario, maintaining a fairly substantial side-structure for something that shouldn't even be happening is obviously not a win, and again, makes us overall slower.
Nearly all of the optimization effort on hash tables is either in how to calculate the hash faster, or how to perform collision resolution faster. These are things that happen on a much smaller time scale than we're talking about for consistent hashing, which is usually used where we're talking about time scales measured in microseconds or even milliseconds because we have to do I/O operations.
The reason is because Consistent Hashing tends to cause more work on the Read side for range scan queries.
For example, if you want to search for entries that are sorted by a particular column then you'd need to send the query to EVERY node because consistent hashing will place even "adjacent" items in separate nodes.
It's often preferred to instead use a partitioning that is going to match the usage patterns. Better yet replicate the same data in a host of different partitions/formats

Optimal physical orderings of nodes

First, read this:
TPT paper
I was wondering what other options might exist for arranging nodes to boost performance. Anything from post-parent order in a byte array, like TPT's, to something more like a k-order b-tree; I'm wondering what good options are known at the moment?
A bit more on the problem:
I have an extremely fast way of finding elements within a sparse set, given some concept of adjacency to a given pointer. I was wondering how I could best take advantage of this in storing a patricia trie.
You can make assumptions about whether the trie will be random-access, read only, write-seldom, or add-only. Please note them if you do, but I've actually used a TPT and the gains were pretty significant so I'm willing to consider certain constraints.
Update
I guess in some senses this was a little unclear. What I'm looking for here is ways of arranging things in memory that optimize one performance metric or another. The TPTs, through some tricks, use node order to optimize disk reads and space-per-node. I'm curious about:
Total deletion, where the structure is removed from memory entirely.
Inserts, particularly in densely populated structures.
Deletes, again, particularly in densely populated structures.
A DAWG or a minimal DFA (see this question or the paper "How to squeeze a lexicon") may be even better than a TPT because the totel size is smaller.

Is there any reason to implement my own sorting algorithm?

Sorting has been studied for decades, so surely the sorting algorithms provide by any programming platform (java, .NET, etc.) must be good by now, right? Is there any reason to override something like System.Collections.SortedList?
There are absolutely times where your intimate understanding of your data can result in much, much more efficient sorting algorithms than any general purpose algorithm available. I shared an example of such a situation in another post at SO, but I'll share it hear just to provide a case-in-point:
Back in the days of COBOL, FORTRAN, etc... a developer working for a phone company had to take a relatively large chunk of data that consisted of active phone numbers (I believe it was in the New York City area), and sort that list. The original implementation used a heap sort (these were 7 digit phone numbers, and a lot of disk swapping was taking place during the sort, so heap sort made sense).
Eventually, the developer stumbled on a different approach: By realizing that one, and only one of each phone number could exist in his data set, he realized that he didn't have to store the actual phone numbers themselves in memory. Instead, he treated the entire 7 digit phone number space as a very long bit array (at 8 phone numbers per byte, 10 million phone numbers requires just over a meg to capture the entire space). He then did a single pass through his source data, and set the bit for each phone number he found to 1. He then did a final pass through the bit array looking for high bits and output the sorted list of phone numbers.
This new algorithm was much, much faster (at least 1000x faster) than the heap sort algorithm, and consumed about the same amount of memory.
I would say that, in this case, it absolutely made sense for the developer to develop his own sorting algorithm.
If your application is all about sorting, and you really know your problem space, then it's quite possible for you to come up with an application specific algorithm that beats any general purpose algorithm.
However, if sorting is an ancillary part of your application, or you are just implementing a general purpose algorithm, chances are very, very good that some extremely smart university types have already provided an algorithm that is better than anything you will be able to come up with. Quick Sort is really hard to beat if you can hold things in memory, and heap sort is quite effective for massive data set ordering (although I personally prefer to use B+Tree type implementations for the heap b/c they are tuned to disk paging performance).
Generally no.
However, you know your data better than the people who wrote those sorting algorithms. Perhaps you could come up with an algorithm that is better than a generic algorithm for your specific set of data.
Implementing you own sorting algorithm is akin to optimization and as Sir Charles Antony Richard Hoare said, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil".
Certain libraries (such as Java's very own Collections.sort) implement a sort based on criteria that may or may not apply to you. For example, Collections.sort uses a merge sort for it's O(n log(n)) efficiency as well as the fact that it's an in-place sort. If two different elements have the same value, the first element in the original collection stays in front (good for multi-pass sorting to different criteria (first scan for date, then for name, the collection stays name (then date) sorted)) However, if you want slightly better constants or have a special data-set, it might make more sense to implement your own quick sort or radix sort specific exactly to what you want to do.
That said, all operations are fast on sufficiently small n
Short answer; no, except for academic interest.
You might want to multi-thread the sorting implementation.
You might need better performance characteristics than Quicksorts O(n log n), think bucketsort for example.
You might need a stable sort while the default algorithm uses quicksort. Especially for user interfaces you'll want to have the sorting order be consistent.
More efficient algorithms might be available for the data structures you're using.
You might need an iterative implementation of the default sorting algorithm because of stack overflows (eg. you're sorting large sets of data).
Ad infinitum.
A few months ago the Coding Horror blog reported on some platform with an atrociously bad sorting algorithm. If you have to use that platform then you sure do want to implement your own instead.
The problem of general purpose sorting has been researched to hell and back, so worrying about that outside of academic interest is pointless. However, most sorting isn't done on generalized input, and often you can use properties of the data to increase the speed of your sorting.
A common example is the counting sort. It is proven that for general purpose comparison sorting, O(n lg n) is the best that we can ever hope to do.
However, suppose that we know the range that the values to be sorted are in a fixed range, say [a,b]. If we create an array of size b - a + 1 (defaulting everything to zero), we can linearly scan the array, using this array to store the count of each element - resulting in a linear time sort (on the range of the data) - breaking the n lg n bound, but only because we are exploiting a special property of our data. For more detail, see here.
So yes, it is useful to write your own sorting algorithms. Pay attention to what you are sorting, and you will sometimes be able to come up with remarkable improvements.
If you have experience at implementing sorting algorithms and understand the way the data characteristics influence their performance, then you would already know the answer to your question. In other words, you would already know things like a QuickSort has pedestrian performance against an almost sorted list. :-) And that if you have your data in certain structures, some sorts of sorting are (almost) free. Etc.
Otherwise, no.

Resources