A Haskell hash implementation that does not live in the IO monad

I am looking for a data structure that works a bit like Data.HashTable but that is not encumbered by the IO monad. At the moment, I am using [(key,val)]. I would like a structure that is O(log n), where n is the number of key-value pairs.
The structure gets built infrequently compared to how often it must be read, and when it is built, I have all the key-value pairs available at the same time. The keys are Strings, if that makes a difference.
It would also be nice to know at what size it is worth moving away from [(key,val)].

You might consider:
Data.Map
or alternatively,
Data.HashMap
The former is the standard container for storing and looking up elements by key in Haskell. The latter is a newer library specifically optimized for hashable keys.
Johan Tibell's recent talk, Faster persistent data structures through hashing, gives an overview, while Milan Straka's recent Haskell Symposium paper specifically analyzes Data.Map and the hashmap package.
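For the build-once, read-many pattern described in the question, construction from a list followed by repeated lookups might look like the following minimal sketch (the keys and values here are made up):

import qualified Data.Map as Map

-- Build once from all the key-value pairs (O(n log n)),
-- then read as often as needed (O(log n) per lookup).
table :: Map.Map String Int
table = Map.fromList [("alpha", 1), ("beta", 2), ("gamma", 3)]

main :: IO ()
main = print (Map.lookup "beta" table)  -- Just 2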

If you have all the key-value pairs up front you might want to consider a perfect hash function.
Benchmarking will tell you when to switch from a simple list.
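One way to find that crossover point is to benchmark both representations at a few sizes, for example with the criterion package; the sizes and keys below are arbitrary placeholders:

import qualified Data.Map as Map
import Criterion.Main (bench, bgroup, defaultMain, whnf)

-- Compare association-list lookup against Data.Map lookup at
-- increasing sizes; the crossover is whatever the numbers say.
main :: IO ()
main = defaultMain
  [ bgroup ("n=" ++ show n)
      [ bench "assoc list" (whnf (lookup key) pairs)
      , bench "Data.Map"   (whnf (Map.lookup key) m)
      ]
  | n <- [10, 100, 1000 :: Int]
  , let pairs = [(show i, i) | i <- [1 .. n]]
  , let m = Map.fromList pairs
  , let key = show n  -- worst case for the list: the last key
  ]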

Related

Static functional data structure with O(1) amortised associative lookup

I'm looking for a static data structure with amortised constant time associative lookup. The only operations I want to perform are lookup and construction. Also being functional is a must. I've had a look at finger trees, but I can't seem to wrap my head round them. Are there any good docs on them or, better yet, a simpler static functional data structure?
I assume that by "functional" and "static" you mean an immutable structure, which cannot be modified after its construction; by "lookup" you mean a dictionary-like, key-value lookup; and by "construction" you mean the initial construction of the data structure from a given set of elements.
In that case an immutable, hashtable-based dictionary would work. The disadvantage is that insertions and removals are O(n), but you state that this is acceptable in your case.
Depending on the programming language you are using, a datatype suitable for implementing this may or may not be available. In Erlang, a tuple could be used; Haskell has an immutable array in Data.Array.IArray.
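A minimal Haskell sketch of such an immutable, array-backed dictionary (the bucket count and toy hash function below are arbitrary choices; a real implementation would size the array from the input and use a proper hash):

import Data.Array (Array, accumArray, (!))
import Data.Char (ord)

-- A fixed-size bucket array built once; a lookup scans one bucket.
newtype HashDict v = HashDict (Array Int [(String, v)])

buckets :: Int
buckets = 1024

-- Toy hash for illustration only.
hash :: String -> Int
hash = foldl (\h c -> (h * 31 + ord c) `mod` buckets) 0

fromPairs :: [(String, v)] -> HashDict v
fromPairs kvs = HashDict $
  accumArray (flip (:)) [] (0, buckets - 1)
             [ (hash k, (k, v)) | (k, v) <- kvs ]

lookupDict :: String -> HashDict v -> Maybe v
lookupDict k (HashDict arr) = lookup k (arr ! hash k)

Construction is O(n) plus hashing; lookup is expected O(1) for a reasonably sized array, degrading as buckets collide.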
You'd have to look at this from an information-theoretic point of view: the more key-value pairs you store, the more keys an associative lookup has to distinguish between. Thus, no matter what you do, the more keys you have, the more work a lookup will involve.
Constant time lookup is only possible when your key directly gives you the address (or something equivalent) of the element you want to look up.

A data structure with certain properties

I want to implement a data structure myself in C++11. What I'm planning to do is having a data structure with the following properties:
search: O(log n)
insert: O(log n)
delete: O(log n)
iterate: O(n)
What I have been thinking about after research was implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jumpstart.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations) or B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are definitely the data structures that are usually used to accomplish efficient insertion and lookup. If you want to go wild you can combine the two by using a tree instead of a linked list to handle hash collisions (although this has a good potential to actually make your implementation slower if you don't make massive mistakes in sizing your hash table or in choosing an adequate hash function).
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function), although this only has uses in special applications (I had the opportunity to use a minimal perfect hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope, so you can also go wild on that front.
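As a starting point for the exercise, here is the skeleton of a plain (unbalanced) binary search tree, written in Haskell for brevity rather than C++11; note that without a balancing scheme (AVL or red-black rotations) the O(log n) bounds hold only on average, and delete is omitted:

-- Unbalanced BST: insert/search are O(log n) only on typical input.
-- A production version would rebalance (AVL, red-black, ...).
data Tree k v = Leaf | Node (Tree k v) k v (Tree k v)

insert :: Ord k => k -> v -> Tree k v -> Tree k v
insert k v Leaf = Node Leaf k v Leaf
insert k v (Node l k' v' r)
  | k < k'    = Node (insert k v l) k' v' r
  | k > k'    = Node l k' v' (insert k v r)
  | otherwise = Node l k v r  -- overwrite an existing key

search :: Ord k => k -> Tree k v -> Maybe v
search _ Leaf = Nothing
search k (Node l k' v r)
  | k < k'    = search k l
  | k > k'    = search k r
  | otherwise = Just v

-- In-order traversal: iterate all pairs in key order.
toList :: Tree k v -> [(k, v)]
toList Leaf = []
toList (Node l k v r) = toList l ++ [(k, v)] ++ toList r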

Why are Haskell Maps implemented as balanced binary trees instead of traditional hashtables?

From my limited knowledge of Haskell, it seems that Maps (from Data.Map) are supposed to be used much like a dictionary or hashtable in other languages, and yet are implemented as self-balancing binary search trees.
Why is this? Using a binary tree makes lookup O(log n) as opposed to O(1), and requires that the keys have an Ord instance. Certainly there is a good reason, so what are the advantages of using a binary tree?
Also:
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Hash tables can't be implemented efficiently without mutable state, because they're based on array lookup. The key is hashed and the hash determines the index into an array of buckets. Without mutable state, inserting elements into the hashtable becomes O(n) because the entire array must be copied (alternative non-copying implementations, like DiffArray, introduce a significant performance penalty). Binary-tree implementations can share most of their structure so only a couple pointers need to be copied on inserts.
Haskell certainly can support traditional hash tables, provided that the updates are in a suitable monad. The hashtables package is probably the most widely used implementation.
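A sketch of that package in use, with all mutation encapsulated in the ST monad (this assumes the hashtables and hashable packages, with module and function names as in hashtables' Data.HashTable.ST.Basic):

import Control.Monad (forM_)
import Control.Monad.ST (runST)
import qualified Data.HashTable.ST.Basic as H

-- Build a mutable hash table, query it, and return a pure result;
-- runST guarantees the mutation cannot leak out.
buildAndLookup :: [(String, Int)] -> String -> Maybe Int
buildAndLookup pairs key = runST $ do
  ht <- H.new
  forM_ pairs $ \(k, v) -> H.insert ht k v
  H.lookup ht key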
One advantage of binary trees and other non-mutating structures is that they're persistent: it's possible to keep older copies of data around with no extra book-keeping. This might be useful in some sort of transaction algorithm for example. They're also automatically thread-safe (although updates won't be visible in other threads).
Traditional hashtables rely on memory mutation in their implementation. Mutable memory and referential transparency are at odds, so that relegates hashtable implementations to either the IO or ST monads. Trees can be implemented persistently and efficiently by leaving old nodes in memory and returning new root nodes which point to the updated trees. This lets us have pure Maps.
The quintessential reference is Chris Okasaki's Purely Functional Data Structures.
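The persistence point is easy to see with Data.Map itself: insert returns a new map while the old one stays valid, and the two share most of their spine internally. A small illustrative sketch:

import qualified Data.Map as Map

main :: IO ()
main = do
  let m1 = Map.fromList [("a", 1), ("b", 2)]
      m2 = Map.insert "c" 3 m1  -- m1 is untouched; m2 shares structure with it
  print (Map.toList m1)  -- [("a",1),("b",2)]
  print (Map.toList m2)  -- [("a",1),("b",2),("c",3)]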
Why is this? Using a binary tree makes lookup O(log n) as opposed to O(1)
Lookup is only one of the operations; insertion/modification may be more important in many cases; there are also memory considerations. The main reason the tree representation was chosen is probably that it is more suited for a pure functional language. As "Real World Haskell" puts it:
Maps give us the same capabilities as hash tables do in other languages. Internally, a map is implemented as a balanced binary tree. Compared to a hash table, this is a much more efficient representation in a language with immutable data. This is the most visible example of how deeply pure functional programming affects how we write code: we choose data structures and algorithms that we can express cleanly and that perform efficiently, but our choices for specific tasks are often different from their counterparts in imperative languages.
This:
and requires that the keys have an Ord instance.
does not seem like a big disadvantage. After all, with a hash map you need keys to be Hashable, which seems to be more restrictive.
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Unfortunately, I cannot provide an extensive comparative analysis, but there is a hash map package, and you can check out its implementation details and performance figures in this blog post and decide for yourself.
My answer to what the advantage of using binary trees is would be: range queries. They require, semantically, a total preorder, and profit algorithmically from a balanced search-tree organization.
For simple lookup, I'm afraid there may only be good Haskell-specific answers, not good answers per se: lookup (and indeed hashing) requires only a setoid (an equality/equivalence on the key type), which permits efficient hashing even on pointers (which, for good reasons, are not ordered in Haskell). Like the various forms of tries (e.g. ternary tries for elementwise update, others for bulk operations), hashing into arrays (open or closed addressing) is typically considerably more efficient than elementwise searching in binary trees, in both space and time. Hashing and tries can be defined generically, though that has to be done by hand; GHC doesn't derive it (yet?).
Data structures such as Data.Map tend to be fine for prototyping and for code outside of hotspots, but where they are hot they easily become a performance bottleneck. Luckily, Haskell programmers need not be concerned about performance, only their managers.
(For some reason I presently can't find a way to access the key redeeming feature of search trees amongst the 80+ Data.Map functions: a range-query interface. Am I looking in the wrong place?)
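For what it's worth, containers does expose range-style access on Data.Map, just not under that name: split (and, in newer versions, lookupGE and friends) can carve out a key interval in O(log n), as in this sketch:

import qualified Data.Map as Map

-- All entries with lo <= k <= hi, via two O(log n) splits.
range :: Ord k => k -> k -> Map.Map k v -> Map.Map k v
range lo hi m =
  let (_, gtLo)    = Map.split lo m     -- keys strictly greater than lo
      (between, _) = Map.split hi gtLo  -- of those, keys strictly less than hi
      withLo = maybe between (\v -> Map.insert lo v between) (Map.lookup lo m)
  in  maybe withLo (\v -> Map.insert hi v withLo) (Map.lookup hi m)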

An efficient Javascript set structure

After reading many similar questions:
JavaScript implementation of a set data structure
Mimicking sets in JavaScript?
Node JS, traditional data structures? (such as Set, etc), anything like Java.util for node?
Efficient Javascript Array Lookup
Best way to find if an item is in a JavaScript array?
How do I check if an array includes an object in JavaScript?
I still have a question: suppose I have a large array of strings (several thousands), and I have to make many lookups (i.e. check many times whether a given string is contained in this array). What is the most efficient way to do this in Node.js ?
A. Sort the array of strings, then use binary search? Or:
B. Convert the strings to keys of an object, then use the "in" operator?
I know that the complexity of A is O(log N), where N is the number of strings.
But I don't know the complexity of B.
If a JavaScript object is implemented as a hash table, then the complexity of B is, on average, O(1), which is better than A. However, I don't know whether a JavaScript object really is implemented as a hash table!
Update for 2016
Since you're asking about node.js and it is 2016, you can now use either the Set or Map object from ES6, as both are built in. Both allow you to use any string as a key. The Set object is appropriate when you just want to see if the key exists, as in:
if (mySet.has(someString)) {
//code here
}
And, Map is appropriate when you want to store a value for that key as in:
if (myMap.has(someString)) {
let val = myMap.get(someString);
// do something with val here
}
Both ES6 features are now built into node.js as of node V4 (the current version of node.js as of this edit is v6).
See this performance comparison to see how much faster the Set operations are than many other choices.
Older Answer
All important performance questions should be settled with actual performance tests in a tool like jsperf.com. In your case, a JavaScript object uses a hash-table-like implementation internally; without fast property lookup the whole language would be slow, since so much of JavaScript relies on objects.
String keys on an object would be the first thing I'd test, and would be my guess for the best performer. Since the internals of an object are implemented in native code, I'd expect this to be faster than your own hashtable or binary search implemented in JavaScript.
But, as I said at the start, you should really test your specific circumstances, with the number and length of strings you are most concerned about, in a tool like jsperf.
For a fixed, large array of strings, I suggest using some form of radix search.
Also, take a look at the various data structures and algorithms (AVL trees, queues/heaps, etc.) in this package.
I'm pretty sure that storing strings as the keys of a JS object will put that object into 'hash mode'; depending on the implementation, lookup could be anywhere from O(log n) to O(1). Look at some jsperf benchmarks to compare property lookup vs. binary search on a sorted array.
In practice, especially if the code is not going to run in a browser, I would offload this functionality to something like Redis or memcached.

Which data structure to add/look up/keep count of strings?

I'm trying to figure out which data structure would quickly support the following operations:
Add a string (if it's not there, add it; if it is there, increment a counter for the word)
Count a given string (look up by string, then read the counter)
I'm debating between a hash table and a trie. From my understanding, a hash table is fast for lookups and additions as long as you avoid collisions. If I don't know my inputs ahead of time, would a trie be a better way to go?
It really depends on the types of strings you're going to be using as "keys". If you're using highly variable strings, plus you do not have a good hash algorithm for your strings, then a trie can outperform a hash.
However, given a good hash, the lookup will be faster than in a trie. (Given a very bad hash, the opposite is true, though.) If you don't know your inputs, but do have a decent hashing algorithm, I personally prefer using a hash.
Also, most modern languages/frameworks have very good hashing algorithms, so chances are, you'll be able to implement a good lookup using a hash with very little work, that will perform quite well.
A trie won't buy you much; tries are only interesting when prefixes are important. Hash tables are simpler, and usually part of your language's standard library, if not directly part of the language itself (Ruby, Python, etc.). Here's a dead-simple way to do this in Ruby:
strings = %w(some words that may be repeated repeated)
counts = Hash.new(0)
strings.each { |s| counts[s] += 1 }
#counts => {"words"=>1, "be"=>1, "repeated"=>2, "may"=>1, "that"=>1, "some"=>1}
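In Haskell, the same counting pattern is a one-liner with Data.Map; fromListWith does the add-or-increment in one pass (a sketch mirroring the Ruby above):

import qualified Data.Map as Map

-- Map each word to its number of occurrences.
wordCounts :: [String] -> Map.Map String Int
wordCounts ws = Map.fromListWith (+) [(w, 1) | w <- ws]

-- wordCounts ["some","words","repeated","repeated"]
--   == Map.fromList [("repeated",2),("some",1),("words",1)]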
Addenda:
For C++, you can probably use Boost's hash implementation.
Either one is reasonably fast.
It isn't necessary to completely avoid collisions.
Looking at performance a little more closely: hash tables are usually faster than trees, but I doubt that a real-life program ever ran too slow simply because it used a tree instead of a hash table, and some trees are faster than some hash tables.
What else can we say? Well, hash tables are more common than trees.
One advantage of the more complex trees is that they have predictable access times. With hash tables and simple binary trees, the performance you see depends on the data, and with a hash table it depends strongly on the quality of the implementation and its configuration relative to the data set size.
