I've been studying tries and checking out their advantages and disadvantages. They're quite useful in many practical applications like dictionary, spell checkers etc due to their constant O(m) look-ups (where m is length of the string) and other advantages like providing ordered retrieval of strings, and getting common prefixes. So, the advantages are pretty clear to me, but the limitations are a bit confusing.
I'm following this link : https://en.wikipedia.org/wiki/Trie
Drawbacks listed here are:
Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.
Follow up question - Why is there a scenario involving secondary storage? Aren't tries also supposed to be stored in main memory. If they're stored in secondary storage, then there's no use of using trie anyways as disk access will always cause greater times.
Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.
Follow-up question : Is it due to the fact that tries would contain more references/pointers for connecting each character to next one, and that'd consume more bytes than if it was stored as a whole string? (I got this reason from one of the answers here). Can anyone elaborate this too?
I'd really appreciate some help here. Thanks.
First, "constant O(m) look-ups" is meaningless. Lookup time in a trie is O(m): it depends on the length of the string you're looking up.
A well constructed hash table (i.e. a good hash function and a reasonable load factor) has O(1) lookup time.
Assuming competent construction, looking up a string in a hash table will be much faster than looking it up in a trie.
Tries and hash tables are used for different things. If all you want is the ability to lookup a word, then a hash table will be faster. If you want to find common prefixes, ordered retrieval, or do similar things, then you want a trie.
A hash table can look up individual strings very quickly. It's like a thoroughbred racehorse. That's all it can do. A trie, on the other hand, is a workhorse that can do a lot of things. It'll never be as fast at lookups as a hash table, but it can do lots of things that the hash table can't do.
For example, finding all the words that start with "pre" will take O(n) time with a dictionary because you have to search all of the words. With a trie, it takes three probes to find the subtree that contains all of those words, and then all you have to do is traverse that subtree. Sure, the worst case is O(n), but that's only if all the words in your trie start with "pre".
Whereas it's true that going to disk will be slower than if the entire trie were in memory, it's wrong to say that a disk-based trie offers no advantage over alternatives. If the data won't fit in memory, then no matter what data structure you use, you'll need some external (i.e. non-memory) storage. The fact that your data access is slower when it's on the disk does not fundamentally change the advantages or disadvantages of trie vs. hash table. For example, a disk-based trie will still be faster than a disk-based hash table when it comes to finding all the words with a particular prefix.
A hash table's overhead is typically a constant multiple of the number of words it contains. That is, in addition to the memory required to store the strings, there is per-string overhead to store the mapping between hash code and string.
Memory for a trie is a little more involved. In the worst case, there is one node per character. All those little node allocations start adding up. Imagine a dictionary that contains 200,000 words, and the average word length is five characters. That's a million nodes of overhead.
Fortunately, there are ways to greatly compress a trie, without losing much, if any, performance. The resulting data structure is much smaller and more cache-friendly than a naively constructed trie.
It's been a while since this was asked, but I'd like to add, if anyone is wondering, that a good hashing function should take O(1) time for fixed memory values such as primitive types or fixed-length lists of primitive types. The same logical operations are often applied on all values to be hashed (logical shift left and right, bitwise operations, etc.). These operations take the same time regardless of what value they're used on. This makes hash tables far quicker, and relatively reliable, at storing values that use up a predictable amount of space. Hashing a string can also be done in O(1) time if you traverse the underlying character array and only pick out characters at intervals to ensure that you're always hashing the same amount of memory.
For example, for a string of length 10, you may hash the 10 characters in the underlying character array, whereas for a string of length 100, you hash based on every tenth character.
So, to answer your question, hashing is usually completed in constant time, whereas insertion or retrieval from a trie is O(n) time, where n is the length of the value to be inserted or retrieved. Even if there is little difference in practice, constant has the advantage of being predictable. All operations on a hash table will take the same time each time, give or take. But with a trie (representing a dictionary of Welsh place names), searching for Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch with one character at the end changed will take far more time than searching for "a". The system will eat through the whole string before realising that it is not in the dictionary. Google and other tech companies tend to prefer nice, predictable (but evenly distributed) hashing to avoid security concerns.
Related
This is an interview question and not a homework.
"You have N documents, where N is very large. Each document has a set of words lets say w1,w2..wm where m might differ for each document. Now you are given a list to K words lets say q1,q2…qk.
Write an algorithm to print the list of document which have the K words in them."
Now, I could figure out solutions using Hashing and trie. But the guy who posted the question had also written that the interviewer wanted a solution using B-tree.
I am not really able to figure out how to use a B-Tree for this and how efficient would that be. Can somebody please help?
B-Tree is preferred over Trie if our dataset is stored on media with slow random access, for example on conventional hard drives. The interviewer's note that N is very large might imply that it's simply large enough to not fit in memory and should be placed on disk.
As noted in comments: when the data is really huge and it is stored on a disk, the efficiency of a data structure depends more on the number of disk block accesses, not the total amount of all operations. B-Tree contains many records in one node (which can be considered a "data block"), thus requires significantly fewer block accesses than Trie does.
That is exactly the same reason why most DBs store their indexes in a B-Tree. They need fast search through index located on conventional hard drive.
Actually, your problem can be solved by putting your (word-documentId) pairs in DB table and creating an index on word column or the entire pair.
You can try a ternary trie. It doesn't take so much space. You can also look for a Kart-trie. It uses a key and 2 leafs:http://code.dogmap.org/kart/.
I'm receiving "order update" from stock exchange. Each order id is between 1 and 100 000 000, so I can use 100 million array to store 100 million orders and when update is received I can look-up order from array very fast just accessing it by index arrray[orderId]. I will spent several gigabytes of memory but this is OK.
Alternatively I can use hashmap, and because at any moment the number of "active" orders is limited (to, very roughly, 100 000), look-up will be pretty fast too, but probaly a little bit slower then array.
The question is - will hashmap be actually slower? Is it reasonably to create 100 millions array?
I need latency and nothing else, I completely don't care about memory, what should I choose?
Whenever considering performance issues, one experiment is worth a thousand expert opinions. Test it!
That said, I'll take a wild stab in the dark: it's likely that if you can convince your OS to keep your multi-gigabyte array resident in physical memory (this isn't necessarily easy - consider looking at the mlock and munlock syscalls), you'll have relatively better performance. Any such performance gain you notice (should one exist) will likely be by virtue of bypassing the cost of the hashing function, and avoiding the overheads associated with whichever collision-resolution and memory allocation strategies your hashmap implementation uses.
It's also worth cautioning that many hash table implementations have non-constant complexity for some operations (e.g., separate chaining could degrade to O(n) in the worst case). Given that you are attempting to optimize for latency, an array with very aggressive signaling to the OS memory manager (e.g., madvise and mlock) are likely to result in the closest to constant-latency lookups that you can get on a microprocessor easily.
While the only way to objectively answer this question is with performance tests, I will argue for using a Hashtable Map. (Caching and memory access can be so full of surprises; I do not have the expertise to speculate on which one will be faster, and when. Also consider that localized performance differences may be marginalized by other code.)
My first reason for "initially choosing" a hash is based off of the observation that there are 100M distinct keys but only 0.1M active records. This means that if using an array, index utilization will only be 0.1% - this is a very sparse array.
If the data is stored as values in the array then it needs to be relatively small or the array size will balloon. If the data is not stored in the array (e.g. array is of pointers) then the argument for locality of data in the array is partially mitigated. Either way, the simple array approach requires lots of unused space.
Since all the keys are already integers, the distribution (hash) function and can be efficiently implemented - there is no need to create a hash of a complex type/sequence so the "cost" of this function should approach zero.
So, my simple proposed hash:
Use linear probing backed by contiguous memory. It is simple, has good locality (especially during the probe), and avoids needing to do any form of dynamic allocation.
Pick a suitable initial bucket size; say, 2x (or 0.2M buckets, primed). Don't even give the hash a chance of resize. Note that this suggested bucket array size is only 0.2% the size of the simple array approach and could be reduced further as the size vs. collision rate can be tuned.
Create a good distribution function for the hash. It can also exploit knowledge of the ID range.
While I've presented specialized hashtable rules "optimized" for the given case, I would start with a normal Map implementation (be it a hashtable or tree) and test it .. if a standard implementation works suitably well, why not use it?
Now, test different candidates under expected and extreme loads - and pick the winner.
This seems to depend on the clustering of the IDs.
If the active IDs are clustered suitably already then, without hashing, the OS and/or L2 cache have a fair shot at holding on to the good data and keeping it low-latency.
If they're completely random then you're going to suffer just as soon as the number of active transactions exceeds the number of available cache lines or the size of those transactions exceeds the size of the cache (it's not clear which is likely to happen first in your case).
However, if the active IDs work out to have some unfortunate pattern which causes a high rate of contention (eg., it's a bit-pack of different attributes, and the frequently-varying attribute hits the hardware where it hurts), then you might benefit from using a 1:1 hash of the index to get back to the random case, even though that's usually considered a pretty bad case on its own.
As far as hashing for compaction goes; noting that some people are concerned about worst-case fallback behaviour for a hash collision, you might simply implement a cache of the full-sized table in contiguous memory, since that has a reasonably constrained worst case. Simply keep the busiest entry in the map, and fall back to the full table on collisions. Move the other entry into the map if it's more active (if you can find a suitable algorithm to decide this).
Even so, it's not clear that the necessary hash table size is sufficient to reduce the working set to being cacheable. How big are your orders?
The overhead of a hashmap vs. an array is almost none. I would bet on a hashmap of 100,000 records over an array of 100,000,000, without a doubt.
Remember also that, while you "don't care about memory", this also means you'd better have the memory to back it up - an array of 100,000,000 integers will take up 400mb, even if all of them are empty. You run the risk of your data being swapped out. If your data gets swapped out, you will get a performance hit of several orders of magnitude.
You should test and profile, as others have said. My random stab in the dark, though: A high-load-factor hash table will be the way to go here. One huge array is going to cost you a TLB miss and then a last-level cache miss per access. This is expensive. A hash table, given the working set size you mentioned, is probably only going to cost some arithmetic and an L1 miss.
Again, test both alternatives on representative examples. We're all just stabbing in the dark.
I'm not talking about distributed key/value systems, such as typically used with memcached, which use consistent hashing to make adding/removing nodes a relatively cheap procedure.
I'm talking about your standard in-memory hashtable like python's dict or perl's hash.
It would seem like the benefits of using consistent hashing would also apply to these standard data structures, by lowering the cost of resizing the hashtable. Real-time systems (and other latency-sensitive systems) would benefit from / require hashtables optimized for low-cost growth, even if overall throughput declines slightly.
Wikipedia alludes to "incremental resizing" but basically talks about a hot/cold replacement approach to resizing; there is a separate article about "extendible hashing" that uses a trie for bucket lookup to accomplish cheap rehashing.
Just curious if anyone's heard of in-core, single-node hashtables that use consistent hashing to lower growth cost. Or is this requirement better met using something other approach (ala the two wikipedia bits listed above)?
or ... is my whole question misguided? Do memory paging considerations make the complexity not worth it? That is, the extra indirection of consistent hashing lets you rehash only a fraction of the total keys, but perhaps that doesn't matter because you'll probably have to read from each existing page, so memory latency is your primary factor, and whether you rehash some or all of the keys doesn't matter compared to the cost of the memory access.... but on the other hand, with consistent hashing, all of your key remaps have the same destination page, so there's going to be less memory thrashing than if your keys remap to any of the existing pages.
EDIT: added "data-structures" tag, clarified final sentence to say "page" instead of "bucket".
I haven't heard of this in the wild, but it may be a good idea if you choose the right consistent hash implementation. Specifically, Jump Consistent Hashing by Google et al. First I'll go into why Jump, then I'll go into how it can be useful in a local data structure.
Jump Consistent Hashing
Jump Consistent Hashing (which I'll shorten to Jump) is great for this space for a few reasons. Jump assumes that nodes don't fail, which is great for local data structures because they, well, don't fail! This allows Jump to merely be a mapping to a range of numbers [0, numBuckets), requiring only 2-4 bytes of space.
Further the implementation is simple and fast. And it is even faster if we remove the reference implementation's floating point divides and replace them with half as many integer divides. (Which we can, by the way.)
All this can be used for a variation on...
ConcurrentHashMap
But first, Java's Concurrent Hash Map at a high-level.
Java's ConcurrentHashMap is parameterized by a number of buckets. This sharding factor is constant through the life of the map. Each of these buckets is itself a hash map with its own lock.
When inserting a key-value pair into the map, the key is hashed into one of the buckets. The lock for that key is taken, and the item is inserted into the bucket's hash map before releasing the lock. Whilst inserting into bucket x another thread can be inserting concurrently into bucket y, but it will wait for the lock if inserting into bucket x. Thus Java's ConcurrentHashMap has n-way concurrency, where n is the bucket parameter of the constructor.
Just like any hash map, a bucket in ConcurrentHashMap can fill up and need to grow. Just like the regular hash map it does this by doubling its size and rehashing everything in the bucket back into its bigger self. Except that 'its bigger self' is only the bucket's 'self'. If a bucket is a hot spot and gets more than its fair share of keys, the bucket will grow disproportionately compared to the other buckets. And each time a bucket grows it takes longer and longer to rehash into itself. This last point is not only a problem for hot spots, but when the hash table plain old gets more keys.
Imagine if we could grow the number of buckets as the number of keys grows. With this we could dampen the amount of growth each individual bucket grows.
Enter consistent hashing, which allows us to add more buckets!
ConcurrentHashMap take 2: Consistent Hashing Style
We can get ConcurrentHashMap to grow its number of buckets in a two easy steps.
First replace the function that maps to each bucket with the jump consistent hash function. So far everything should work the the same.
Second split off a new bucket when a bucket is filled; also grow the filled bucket. Actually, only split off a new bucket if the filled bucket becomes the largest biggest in terms of capacity. That can be calculated without iterating the buckets.
With consistent hashing the split will only direct keys into the new bucket and not backwards into any of the old buckets.
End notes
I'm sure there can be improvements on this scheme. To wit, splitting off a bucket requires a full table scan to move keys into the new bucket. This is surely no worse than a vanilla hash map, and likely better, but it is at a disadvantage to the ConcurrentHashMap implementation which likely doesn't have to do a full scan.
I keep in mind that hash would be first thing I should resort to if I want to write an application which requests high lookup speed, and any other data structure wouldn't guarantee that.
But I got confused when saw some many post saying different, such as suffix tree, trie, to name a few.
So I wonder is hash always the best thing for high speed lookup? What if I want both high lookup speed and less space cost?
Is there any material (books or papers) lecturing about the data structures or algorithms **on high speed lookup and space efficiency? Any of this kind is highly appreciated.
So I wonder is hash always the best thing for high speed lookup?
No. As stated in comments:
There is never such a thing Best data structure for [some generic issue]. Everything is case dependent. Tries and radix trees might be great for strings, since you need to read the string anyway. arrays allows simplicity and great cache efficiency - and are usually the best for small scale static information
I once answered a related question of cases where a tree might be better then a hash table: Hash Table v/s Trees
What if I want both high lookup speed and less space cost?
The two might be self-contradicting. Even for the simple example of a hash table of size X vs a hash table of size 2*X. The bigger hash table is less likely to encounter collisions, and thus is expected to be faster then the smaller one.
Is there any material (books or papers) lecturing about the data
structures or algorithms on high speed lookup and space efficiency?
Introduction to Algorithms provide a good walk through on the main data structure used. Any algorithm developed is trying to provide a good space and time efficiency, but like said, there is a trade off, and some algorithms might be better for specific cases then others.
Choosing the right algorithm/data structure/design for the specific problem is what engineering is about, isn't it?
I assume you are talking about strings here, and the answer is "no", hashes are not the fastest or most space efficient way to look up strings, tries are. Of course, writing a hashing algorithm is much, much easier than writing a trie.
One thing you won't find in wikipedia or books about tries is that if you naively implement them with one node per letter, you end up with large numbers of inefficient, one-child nodes. To make a trie that really burns up the CPU you have to implement nodes so that they can have a variable number of characters. This, of course, is even harder than writing a plain trie.
I have written trie implementations that handle over a billion entries and I can tell you that if done properly it is insanely fast, nothing else compares.
One other issue with tries is that you have to write a custom heap, because if you just use some kind of generic memory management it will be slow. So in addition to implementing the trie, you have to implement the heap that the trie runs on. Pretty freakin complicated, but if you do it, you get batshit crazy speed.
Only a good implementation of hash will give you good performance. And you cannot compare hash with Trie for all situations. Situations where Trie is applicable, is fast, but it can be costly in terms of memory, (again dependent on implementation).
But have you measured performance? Or it is unnecessary optimization you are looking for. Did the map fail you?
That might also depend on the actual number of elements.
In complexity theory a hash is not bad, but complexity theory is only good if the actual number of elements is bigger than some threshold.
I.e. if you have only 2 elements, there is a faster method than a hash ;-)
Hash tables are a good general purpose structure but they can fail spectacularly if the hash function doesn't suit the input data. Worst case lookup is O(n). They also waste some space as you mentioned. Other general-purpose structures like balanced binary search trees have worse average case but better worst case performance than a hash table. This is important for real-time applications. A trie is a more special-purpose structure tailored to string lookup.
I want a representation for strings with fast concatenation and editing operations. I have read the paper "Ropes: an Alternative to Strings", but have there been any significant improvements in this area since 1995?
EDIT: One possibility I've considered before is using a 2-3 finger tree with strings as leaves, but I have not done a detailed analysis of this; this gives amortized constant-time addition/deletion on ends and logarithmic (in the number of chunks of the smaller string) concatenation, as opposed to vice versa for ropes.
This is an old question! I wonder if anyone reads this. But still it's intrigueing.
In your comments, you say you look for:
Faster asymptotics, or constant
factors, or less memory use
Well, ropes have O(1) insertion, and O(n) iteration. You can't do better than that. Substrings and indexing is obviously going to be more costly. But most use cases for large documents don't require editing or random access. If you only concatenate at the end, a 1D vector/list of strings could improve the insertion time constant. I used to use this in JavaScript because it had such slow string concatentation.
It is said that memory representation is less efficient than using strings.
I doubt that: If you work in a language that has garbage collection, the rope allows you to use the same string fragment instance in multiple places. In a rope that represents a HTML document, there will be many DIV's, SPAN's and LINK elements. This might even happen automatically assuming these tags are compile time constants, and you add them to the rope directly. Even for such short phrases, the rope document will reduce in size significantly, to the same order of magnitude as the original string. Longer strings will produce a net gain.
If you also make the tree elemenst read only, you can create subropes (longer phrases expressed as ropes), that occur multiple times or are shared across rope based strings. The downside of this sharing is that such shard rope sections can't be changed: to edit them, or to balance the tree you need to copy the object graph. But that does not matter if you mostly concatenate and iterate. In a web server, you can keep a subrope that repesents the CSS stylesheet declaration that is shared across all HTML documents served by that server.