I'm looking for a static data structure with amortised constant time associative lookup. The only operations I want to perform are lookup and construction. Also being functional is a must. I've had a look at finger trees, but I can't seem to wrap my head round them. Are there any good docs on them or, better yet, a simpler static functional data structure?
I assume that by "functional" and "static" you mean an immutable structure, which cannot be modified after its construction; by "lookup" you mean a dictionary-like, key-value lookup; and by "construction" you mean the initial construction of the data structure from a given set of elements.
In that case an immutable, hashtable based dictionary would work. The disadvantage with that is that insertions and removals are O(N), but you state that this is acceptable in your case.
Depending on the programming language you are using, a suitable datatype may or may not be readily available. In Erlang a tuple could be used. Haskell has an immutable array in Data.Array.IArray.
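To make the idea concrete, here is a minimal Haskell sketch of such an immutable, hash-table-backed dictionary: built once, then only queried, with expected constant-time lookup given a reasonable bucket count and hash function. It assumes the hashable package; FrozenMap, fromList and lookupFM are names invented for this example.

import Data.Array (Array, accumArray, bounds, (!))
import Data.Hashable (Hashable, hash)

-- An immutable open-hashing table: an array of buckets, fixed at construction.
newtype FrozenMap k v = FrozenMap (Array Int [(k, v)])

-- Construction: O(n). Each pair lands in the bucket chosen by its hash.
fromList :: Hashable k => Int -> [(k, v)] -> FrozenMap k v
fromList nBuckets kvs =
  FrozenMap $ accumArray (flip (:)) [] (0, nBuckets - 1)
    [ (hash k `mod` nBuckets, (k, v)) | (k, v) <- kvs ]

-- Lookup: expected O(1) with a sensible bucket count and hash function.
lookupFM :: (Eq k, Hashable k) => k -> FrozenMap k v -> Maybe v
lookupFM k (FrozenMap buckets) = lookup k (buckets ! i)
  where
    (lo, hi) = bounds buckets
    i        = lo + hash k `mod` (hi - lo + 1)

main :: IO ()
main = do
  let m = fromList 64 [("alpha", 1 :: Int), ("beta", 2), ("gamma", 3)]
  print (lookupFM "beta" m)   -- Just 2
  print (lookupFM "delta" m)  -- Nothing

The bucket count (64 here) is a made-up tuning parameter; since the structure is never modified after construction, you can size it to your data up front.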
You'd have to look at this from an information-theoretical point of view: The more key/value pairs you store, the more keys you'd have to recognize on an associative lookup. Thus, no matter what you do, the more keys you have, the more complex your lookup will be.
Constant time lookup is only possible when your key directly gives you the address (or something equivalent) of the element you want to look up.
So sometimes I have a data structure whose membership check is O(N), like a queue, stack, heap, etc. I use one of these structures in a program that just needs to check whether a certain element is in it, but because that check is O(N), it becomes the pitfall in my algorithm.
If memory isn't much of a worry, would it be poor design to have a hashmap which keeps track of the elements in the restricted data structure? Doing this would essentially remove the O(N) restriction and allow it to be O(1).
Having a supplemental hash table is warranted in many situations. However, maintaining a "parallel hash" could become a liability: every update now has to touch both structures and keep them in sync.
The situation that you describe, when you need to check membership quickly, is often modeled with a hash-based set (HashSet<T>, std::unordered_set<T>, and so on, depending on the language). The disadvantage of these structures is that the order of elements is not specified, and that they cannot have duplicates.
Depending on the library, you may have access to data structures that fix these shortcomings. For example, Java offers LinkedHashSet<T> which provides a predictable order of enumeration, and C++ provides std::unordered_multiset<T>, which allows duplicates.
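To make the "parallel hash" idea above concrete, here is a rough Haskell sketch (the type and function names are invented for illustration): a FIFO queue paired with a count map, so membership tests never walk the queue and duplicates are still tracked. It assumes Data.Sequence from containers and Data.HashMap.Strict from unordered-containers; note how every update must touch both structures, which is exactly the synchronisation burden mentioned above.

import qualified Data.HashMap.Strict as HM
import           Data.Hashable (Hashable)
import qualified Data.Sequence as Seq
import           Data.Sequence (Seq, ViewL(..), viewl, (|>))

data TrackedQueue a = TrackedQueue
  { tqItems  :: Seq a              -- the queue itself, in FIFO order
  , tqCounts :: HM.HashMap a Int   -- how many copies of each element it holds
  }

empty :: TrackedQueue a
empty = TrackedQueue Seq.empty HM.empty

-- Enqueue: both structures are updated so they stay consistent.
push :: (Eq a, Hashable a) => a -> TrackedQueue a -> TrackedQueue a
push x (TrackedQueue xs counts) =
  TrackedQueue (xs |> x) (HM.insertWith (+) x 1 counts)

-- Dequeue: again both structures must be touched.
pop :: (Eq a, Hashable a) => TrackedQueue a -> Maybe (a, TrackedQueue a)
pop (TrackedQueue xs counts) = case viewl xs of
  EmptyL    -> Nothing
  x :< rest -> Just (x, TrackedQueue rest (HM.update dec x counts))
  where dec n = if n <= 1 then Nothing else Just (n - 1)

-- The whole point: membership without walking the queue.
member :: (Eq a, Hashable a) => a -> TrackedQueue a -> Bool
member x = HM.member x . tqCounts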
I want to implement a data structure myself in C++11. What I'm planning is a data structure with the following properties:
search: O(log(n))
insert: O(log(n))
delete: O(log(n))
iterate: O(n)
What I have been thinking about after some research is implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jumpstart.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations) or B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are definitely the data structures usually used to accomplish efficient insertion and lookup. If you want to go wild, you can combine the two by using a tree instead of a linked list to handle hash collisions (although, unless you make massive mistakes in sizing your hash table or in choosing an adequate hash function, this has a good chance of actually making your implementation slower).
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function), although this only has uses in special applications (I had the opportunity to use a minimal perfect hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope, so you can also go wild on that front.
I believe that in some languages other than Ruby, an Array lookup is O(1) because you know where the data starts, and you multiply the index by the size of the data the array is holding, and then access that memory location.
However, in Ruby, an Array can have objects from different classes, so how does it manage to do a lookup of O(1) complexity?
What @Neil Slater said, with a little more detail…
There are basically two plausible approaches to storing an array of heterogeneous objects of differing sizes:
1. Store the objects as a singly- or doubly-linked list, with the storage space for each individual object preceded by pointer(s) to the preceding and/or following objects. This structure has the advantage of making it very easy to insert new objects at arbitrary points without shifting around the rest of the array, but the huge downside is that looking up an object by its position is generally O(N), since you have to start from one end of the list and jump through it node-by-node until you arrive at the n-th one.
2. Store a table or array of constant-sized pointers to the individual objects. Since this lookup table contains constant-sized items in a contiguous ordered layout, looking up the address of an individual object is O(1); the table is just a C-style array, in which lookup only takes one-to-a-few machine instructions, even on RISC CPU architectures.
(The allocation strategies for storing the individual objects are also interesting and complex, but not immediately relevant to your question.)
Dynamic languages like Perl/Python/Ruby pretty much all opt for #2 for their general-purpose list/array types. In other words, they make lookup more efficient than inserting objects at random locations in the list, which is the better choice for many applications.
I'm not familiar with the implementation details for Ruby, but they are likely quite similar to those of Python's list type, whose performance and design is explained in wonderful detail at effbot.org.
Its implementation probably contains an array of memory addresses pointing to the actual objects, so it can still perform a lookup without looping through the array.
From my limited knowledge of Haskell, it seems that Maps (from Data.Map) are supposed to be used much like a dictionary or hashtable in other languages, and yet are implemented as self-balancing binary search trees.
Why is this? Using a binary tree reduces lookup time to O(log(n)) as opposed to O(1) and requires that the elements be in Ord. Certainly there is a good reason, so what are the advantages of using a binary tree?
Also:
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Hash tables can't be implemented efficiently without mutable state, because they're based on array lookup. The key is hashed and the hash determines the index into an array of buckets. Without mutable state, inserting elements into the hashtable becomes O(n) because the entire array must be copied (alternative non-copying implementations, like DiffArray, introduce a significant performance penalty). Binary-tree implementations can share most of their structure, so only a couple of pointers need to be copied on insert.
Haskell certainly can support traditional hash tables, provided that the updates are in a suitable monad. The hashtables package is probably the most widely used implementation.
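For instance, here is a small sketch using that package, assuming its Data.HashTable.ST.Basic module: all mutation stays inside runST, so the caller only ever sees a pure function.

import           Control.Monad.ST (runST)
import qualified Data.HashTable.ST.Basic as H

-- Build a mutable hash table from the given pairs, then look up every key.
-- The mutation is local to the ST computation and cannot leak out.
lookupAll :: [(String, Int)] -> [String] -> [Maybe Int]
lookupAll pairs keys = runST $ do
  table <- H.new
  mapM_ (uncurry (H.insert table)) pairs
  mapM (H.lookup table) keys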
One advantage of binary trees and other non-mutating structures is that they're persistent: it's possible to keep older copies of data around with no extra book-keeping. This might be useful in some sort of transaction algorithm for example. They're also automatically thread-safe (although updates won't be visible in other threads).
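A small demonstration of that persistence with Data.Map: insert returns a new map, while the old version stays valid and shares most of its nodes with the new one.

import qualified Data.Map as Map

main :: IO ()
main = do
  let m0 = Map.fromList [(1 :: Int, "one"), (2, "two")]
      m1 = Map.insert 3 "three" m0   -- O(log n); only the path to the new node is rebuilt
  print (Map.lookup 3 m0)            -- Nothing: the old version is untouched
  print (Map.lookup 3 m1)            -- Just "three"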
Traditional hashtables rely on memory mutation in their implementation. Mutable memory and referential transparency are at odds, so that relegates hashtable implementations to either the IO or ST monads. Trees can be implemented persistently and efficiently by leaving the old nodes in memory and returning new root nodes that point to the updated trees. This lets us have pure Maps.
The quintessential reference is Chris Okasaki's Purely Functional Data Structures.
Why is this? Using a binary tree reduces lookup time to O(log(n)) as opposed to O(1)
Lookup is only one of the operations; insertion/modification may be more important in many cases; there are also memory considerations. The main reason the tree representation was chosen is probably that it is more suited for a pure functional language. As "Real World Haskell" puts it:
Maps give us the same capabilities as hash tables do in other languages. Internally, a map is implemented as a balanced binary tree. Compared to a hash table, this is a much more efficient representation in a language with immutable data. This is the most visible example of how deeply pure functional programming affects how we write code: we choose data structures and algorithms that we can express cleanly and that perform efficiently, but our choices for specific tasks are often different from their counterparts in imperative languages.
This:
and requires that the elements be in Ord.
does not seem like a big disadvantage. After all, with a hash map you need keys to be Hashable, which seems to be more restrictive.
In what applications would a binary tree be much worse than a hashtable? What about the other way around? Are there many cases in which one would be vastly preferable to the other? Is there a traditional hashtable in Haskell?
Unfortunately, I cannot provide an extensive comparative analysis, but there is a hash map package, and you can check out its implementation details and performance figures in this blog post and decide for yourself.
My answer to what the advantage of using binary trees is would be: range queries. They require, semantically, a total preorder, and profit from a balanced search tree organization algorithmically.
For simple lookup, I'm afraid there may only be good Haskell-specific answers, but not good answers per se: lookup (and indeed hashing) requires only a setoid (equality/equivalence on its key type), which supports efficient hashing on pointers (which, for good reasons, are not ordered in Haskell). Like various forms of tries (e.g. ternary tries for elementwise update, others for bulk updates), hashing into arrays (open or closed) is typically considerably more efficient than elementwise searching in binary trees, in both space and time. Hashing and tries can be defined generically, though that has to be done by hand -- GHC doesn't derive it (yet?).
Data structures such as Data.Map tend to be fine for prototyping and for code outside of hotspots, but where they are hot they easily become a performance bottleneck. Luckily, Haskell programmers need not be concerned about performance, only their managers.
(For some reason I presently can't find a way to access the key redeeming feature of search trees amongst the 80+ Data.Map functions: a range query interface. Am I looking in the wrong place?)
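For what it's worth, one way to express a range query over Data.Map yourself is sketched below, assuming a containers version that provides takeWhileAntitone and dropWhileAntitone (each O(log n) plus the size of the result); the function name is made up.

import qualified Data.Map as Map
import           Data.Map (Map)

-- All entries whose keys fall in the half-open range [lo, hi).
rangeQuery :: Ord k => k -> k -> Map k v -> Map k v
rangeQuery lo hi = Map.takeWhileAntitone (< hi) . Map.dropWhileAntitone (< lo)

main :: IO ()
main = do
  let m = Map.fromList [(k, k * k) | k <- [1 .. 20 :: Int]]
  print (Map.toAscList (rangeQuery 5 9 m))  -- keys 5,6,7,8 with their squares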
After reading many similar questions:
JavaScript implementation of a set data structure
Mimicking sets in JavaScript?
Node JS, traditional data structures? (such as Set, etc), anything like Java.util for node?
Efficient Javascript Array Lookup
Best way to find if an item is in a JavaScript array?
How do I check if an array includes an object in JavaScript?
I still have a question: suppose I have a large array of strings (several thousands), and I have to make many lookups (i.e. check many times whether a given string is contained in this array). What is the most efficient way to do this in Node.js?
A. Sort the array of strings, then use binary search? Or:
B. Convert the strings to keys of an object, then use the "in" operator?
I know that the complexity of A is O(log N), where N is the number of strings.
But I don't know the complexity of B.
If a JavaScript object is implemented as a hash table, then the complexity of B is, on average, O(1), which is better than A. However, I don't know if a JavaScript object is really implemented as a hash table!
Update for 2016
Since you're asking about node.js and it is 2016, you can now use either the Set or Map object, as these are built into ES6. Both allow you to use any string as a key. The Set object is appropriate when you just want to see if the key exists, as in:
if (mySet.has(someString)) {
    // code here
}
And, Map is appropriate when you want to store a value for that key as in:
if (myMap.has(someString)) {
    let val = myMap.get(someString);
    // do something with val here
}
Both ES6 features are now built into node.js as of node V4 (the current version of node.js as of this edit is v6).
See this performance comparison to see how much faster the Set operations are than many other choices.
Older Answer
All important performance questions should be tested with actual performance tests in a tool like jsperf.com. In your case, a JavaScript object uses a hash-table-like implementation, because without something that performs pretty well, the whole language would be slow, since so much of JavaScript is built on objects.
String keys on an object would be the first thing I'd test and would be my guess for the best performer. Since the internals of an object are implemented in native code, I'd expect this to be faster than your own hashtable or binary search implemented in JavaScript.
But, as I started my answer with, you should really test your specific circumstance with the number and length of strings you are most concerned about in a tool like jsperf.
For a fixed, large array of strings I suggest using some form of radix search.
Also, take a look at different data structures and algorithms (AVL trees, queues/heaps, etc.) in this package.
I'm pretty sure that using a JS object as storage for strings will result in "hash mode" for that object. Depending on the implementation, this could be O(log n) to O(1) time. Look at some jsperf benchmarks to compare property lookup vs. binary search on a sorted array.
In practice, especially if I'm not going to use the code in the browser, I would offload this functionality to something like redis or memcached.