An efficient JavaScript set structure - performance

After reading many similar questions:
JavaScript implementation of a set data structure
Mimicking sets in JavaScript?
Node JS, traditional data structures? (such as Set, etc), anything like Java.util for node?
Efficient Javascript Array Lookup
Best way to find if an item is in a JavaScript array?
How do I check if an array includes an object in JavaScript?
I still have a question: suppose I have a large array of strings (several thousand), and I have to make many lookups (i.e. check many times whether a given string is contained in this array). What is the most efficient way to do this in Node.js?
A. Sort the array of strings, then use binary search? Or:
B. Convert the strings to keys of an object, then use the "in" operator?
I know that the complexity of A is O(log N) per lookup, where N is the number of strings.
But I don't know the complexity of B.
If a JavaScript object is implemented as a hash table, then the complexity of B is, on average, O(1), which is better than A. However, I don't know whether a JavaScript object is really implemented as a hash table!

Update for 2016
Since you're asking about node.js and it is 2016, you can now use the built-in ES6 Set or Map objects. Both allow you to use any string as a key. The Set object is appropriate when you just want to see whether the key exists, as in:
if (mySet.has(someString)) {
    // code here
}
And Map is appropriate when you want to store a value for that key, as in:
if (myMap.has(someString)) {
    let val = myMap.get(someString); // Map values are read with .get(), not bracket notation
    // do something with val here
}
Both ES6 features have been built into node.js since node v4 (the current version of node.js as of this edit is v6).
See this performance comparison to see how much faster the Set operations are than many other choices.
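For the build-once, query-many scenario in the question, a Set can be constructed directly from the existing array. A minimal sketch (the array contents here are placeholders):
const largeArrayOfStrings = ['alpha', 'beta', 'gamma']; // several thousand entries in practice
const mySet = new Set(largeArrayOfStrings);             // one-time O(N) construction

console.log(mySet.has('beta'));  // true  (average O(1) per lookup)
console.log(mySet.has('delta')); // false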
Older Answer
All important performance questions should be settled with actual performance tests in a tool like jsperf.com. In your case, a JavaScript object uses a hash-table-like implementation, because without something that performs well there, the whole language would be slow: so much of JavaScript depends on object property lookup.
String keys on an object would be the first thing I'd test, and they would be my guess for the best performer. Since the internals of an object are implemented in native code, I'd expect this to be faster than your own hash table or binary search implemented in JavaScript.
But, as I said at the start, you should really test your specific circumstance, with the number and length of strings you care about, in a tool like jsperf.
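For reference, option B from the question would look like the sketch below. Using Object.create(null) is a detail worth adding: it gives the object no prototype, so inherited property names like "toString" can't show up as false positives with the "in" operator.
const lookup = Object.create(null); // no prototype, so no inherited keys
for (const s of largeArrayOfStrings) {
    lookup[s] = true;
}

if (someString in lookup) {
    // the string is present in the original array
}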

For a fixed, large array of strings I suggest using some form of radix search (for example, a trie).
Also, take a look at the different data structures and algorithms (AVL trees, queues/heaps, etc.) in this package.
I'm pretty sure that using a JS object as storage for strings will put that object into 'hash mode'. Depending on the implementation, lookup could be anywhere from O(log n) to O(1) time. Look at some jsperf benchmarks to compare property lookup vs. binary search on a sorted array.
In practice, especially if I'm not going to use the code in a browser, I would offload this functionality to something like Redis or memcached.
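To make the radix-search suggestion concrete, here is a minimal (uncompressed) trie sketch; a true radix tree would additionally merge single-child chains, but the lookup cost is the same idea: O(length of the query string), independent of how many strings are stored.
class TrieNode {
    constructor() {
        this.children = new Map(); // character -> TrieNode
        this.isWord = false;
    }
}

class Trie {
    constructor(words = []) {
        this.root = new TrieNode();
        for (const w of words) this.insert(w);
    }
    insert(word) {
        let node = this.root;
        for (const ch of word) {
            if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
            node = node.children.get(ch);
        }
        node.isWord = true;
    }
    has(word) {
        let node = this.root;
        for (const ch of word) {
            node = node.children.get(ch);
            if (!node) return false;
        }
        return node.isWord;
    }
}

const trie = new Trie(['cat', 'car', 'dog']);
console.log(trie.has('car')); // true
console.log(trie.has('ca'));  // false (prefix, but not an inserted word)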

Related

What is the best data structure to lookup if something exists?

I want to keep a list of things I've already encountered in my recursive looping so I can avoid recalculating the same things over and over again. What is the most efficient data structure for this?
Looking at complexity analysis, hash tables seem to be the most efficient for lookups. However, it feels inefficient to hold a data structure of key-value pairs when all I'm interested in is whether a specified key exists.
I thought that maybe a set would be a good idea, but its complexity doesn't seem to indicate so.
A set probably is the correct abstract data type for this, since really the only information it encodes is whether an item is inside or not.
Using a set doesn't specify a particular implementation, though. In most contexts you should be able to find a set type implemented using hashing or some kind of tree structure. Which one you choose would depend on whether the data you're inserting can be efficiently hashed or ordered.
In some libraries you will find set types that are implemented using the library's map types, but with the value ignored. For example, in Rust, the standard library's HashSet type is implemented using the HashMap type.
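Applied to the question's memoization use case, a hash-based set gives the "seen before" check directly. A minimal JavaScript sketch; keyOf is a hypothetical helper standing in for however you derive a string key from an item:
const seen = new Set();

function process(item) {
    const key = keyOf(item);   // hypothetical: turn the item into a string key
    if (seen.has(key)) return; // already encountered, skip the recomputation
    seen.add(key);
    // ... expensive work here, possibly recursing via process(subItem) ...
}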

Static functional data structure with O(1) amortised associative lookup

I'm looking for a static data structure with amortised constant time associative lookup. The only operations I want to perform are lookup and construction. Also being functional is a must. I've had a look at finger trees, but I can't seem to wrap my head round them. Are there any good docs on them or, better yet, a simpler static functional data structure?
I assume that by "functional" and "static" you mean an immutable structure, which cannot be modified after its construction; by "lookup" you mean a dictionary-like, key-value lookup; and by "construction" you mean the initial construction of the data structure from a given set of elements.
In that case an immutable, hash-table-based dictionary would work. The disadvantage is that insertions and removals are O(N), but you state that this is acceptable in your case.
Depending on what programming language you are using, a suitable datatype may or may not be available. In Erlang a tuple could be used. Haskell has an immutable array in Data.Array.IArray.
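To illustrate the idea (in JavaScript, to match the bulk of this page): hash every pair into a bucket array once at construction, freeze the result, and each later lookup is amortised O(1). The hash function here is a simple illustrative one:
function hashString(s, nBuckets) {
    let h = 5381;
    for (let i = 0; i < s.length; i++) {
        h = ((h * 33) ^ s.charCodeAt(i)) >>> 0; // djb2-style, kept in uint32 range
    }
    return h % nBuckets;
}

function makeDict(pairs) { // construction: O(N), done exactly once
    const buckets = new Array(2 * pairs.length).fill(null);
    for (const [key, value] of pairs) {
        const i = hashString(key, buckets.length);
        buckets[i] = { key, value, next: buckets[i] }; // chain collisions
    }
    Object.freeze(buckets); // never modified after construction
    return function lookup(key) { // amortised O(1) with a reasonable hash
        for (let e = buckets[hashString(key, buckets.length)]; e; e = e.next) {
            if (e.key === key) return e.value;
        }
        return undefined;
    };
}

const lookup = makeDict([['one', 1], ['two', 2]]);
console.log(lookup('two')); // 2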
You'd have to look at this from an information-theoretical point of view: The more key/value pairs you store, the more keys you'd have to recognize on an associative lookup. Thus, no matter what you do, the more keys you have, the more complex your lookup will be.
Constant time lookup is only possible when your key directly gives you the address (or something equivalent) of the element you want to look up.

A data structure with certain properties

I want to implement a data structure myself in C++11. What I'm planning to do is have a data structure with the following properties:
search: O(log n)
insert: O(log n)
delete: O(log n)
iterate: O(n)
What I have been thinking about, after some research, is implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jumpstart.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations) or B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are definitely the data structures that are usually used to accomplish efficient insertion and lookup. If you want to go wild, you can combine the two by using a tree instead of a linked list to handle hash collisions (although this has a good chance of actually making your implementation slower, unless you make massive mistakes in sizing your hash table or in choosing an adequate hash function).
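As a starting point, here is a minimal binary search tree sketch (in JavaScript, for consistency with the rest of this page; translating it to C++11 is mechanical). Rebalancing and deletion are omitted for brevity: without a balancing scheme such as AVL or red-black, the O(log n) bounds only hold on average for random insertion orders.
class Node {
    constructor(key) {
        this.key = key;
        this.left = null;
        this.right = null;
    }
}

class BST {
    constructor() { this.root = null; }

    insert(key) { // O(height); O(log n) only while the tree stays balanced
        const node = new Node(key);
        if (!this.root) { this.root = node; return; }
        let cur = this.root;
        for (;;) {
            if (key < cur.key) {
                if (!cur.left) { cur.left = node; return; }
                cur = cur.left;
            } else {
                if (!cur.right) { cur.right = node; return; }
                cur = cur.right;
            }
        }
    }

    search(key) { // O(height)
        let cur = this.root;
        while (cur && cur.key !== key) cur = key < cur.key ? cur.left : cur.right;
        return cur !== null;
    }

    *inorder(node = this.root) { // O(n) iteration, in sorted order
        if (!node) return;
        yield* this.inorder(node.left);
        yield node.key;
        yield* this.inorder(node.right);
    }
}

const t = new BST();
for (const k of [5, 2, 8, 1]) t.insert(k);
console.log(t.search(8));      // true
console.log([...t.inorder()]); // [1, 2, 5, 8]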
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function), although this only has uses in special applications (I had the opportunity to use a minimal perfect hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope, so you can also go wild on that front.

External store for complex collections that can be accessed by Key-Value

Problem
I need a key-value store that can store values of the following form:
DS<DS<E>>
where the data structure DS can be
either a List, SortedSet or an Array
and E can be either a String or byte-array.
It is very expensive to generate this data and so once I put it into the store, I will only perform read queries on it. Essentially it is a complex object cache with no eviction.
Example Application
A (possibly bad, but sufficient to clarify) example of an application is storing tokenized sentences from a document, where you need to be able to quickly access the qth word of the pth sentence given a documentID. In this case, I would be storing it as a K-V pair as follows:
K - docID
V - List<List<String>>
String word = map.get(docID).get(p).get(q);
I prefer to avoid app-integrated Map solutions (such as EhCache within Java).
I have worked with Redis but it doesn't appear to support the second layer of data-structure complexity. Any other K-V solutions that can help my use case?
Update:
I know that I could serialize/deserialize my object but I was wondering if there is any other solution.
In terms of platform choice you have two options. A full document database will support arbitrarily complex objects, but won't have built-in commands for working with specific data structures. Something like Redis, which does have optimised code for specific data structures, can't support all possible data structures.
You can actually get pretty close with Redis by using ids instead of the nested data structure: DS1<DS2<E>> becomes DS1<int> plus one DS2<E> per id, with the int from DS1 and a prefix giving you the key holding that DS2.
With this structure you can access any E with only two operations. In some cases you will be able to get that down to a single operation by knowing what the id of DS2 will be for a given query.
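A sketch of that layout for the document example, using ioredis-style promise commands (the key scheme is illustrative, not prescribed):
// Layout: each sentence is its own Redis list of words, keyed by
//   doc:{docID}:sent:{p}
// Since that key is derivable from docID and p alone, fetching the
// qth word of the pth sentence is a single LINDEX round trip.
async function getWord(redis, docID, p, q) {
    return redis.lindex(`doc:${docID}:sent:${p}`, q);
}

// If the inner id were NOT derivable, you'd first read it from the outer
// structure (one extra operation), e.g. an id list stored at doc:{docID}.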
I hesitate to "recommend" it, but one of the only storage engines I know of which handles multi-dimensional data of this sort efficiently is InterSystems Caché. I had to use it at my last job, mostly coding against it in its built-in MUMPS-based language. I would not recommend the native approach unless you hate yourself or your developers. However, they do have decent Java adapters, which appears to be what you're using. I've seen it handle billions of records, efficiently stored in nested binary tree tables. There is no practical limit to the depth (number of dimensions) you can use. The catch is that this is very much a proprietary solution. There is an open-source alternative called GT.M, but I don't know how compatible it is with languages other than M or C.
Any key-value store supports complex values; you just need to serialize/deserialize the data.
If you want fast retrieval only for specific parts of the data, you could use a more complex Key. In your example this would be:
K - tuple(docID, p, q)

A Haskell hash implementation that does not live in the IO monad

I am looking for a data structure that works a bit like Data.HashTable but is not encumbered by the IO monad. At the moment, I am using [(key,val)]. I would like a structure that is O(log n), where n is the number of key-value pairs.
The structure gets built infrequently compared to how often it must be read, and when it is built, I have all the key value pairs available at the same time. The keys are Strings if that makes a difference.
It would also be nice to know at what size it is worth moving away from [(key,val)].
You might consider Data.Map or, alternatively, Data.HashMap. The former is the standard container for storing and looking up elements by key in Haskell. The latter is a new library specifically optimized for hashing keys.
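Basic usage of Data.Map fits the build-once, read-many pattern directly; this sketch is in Haskell, since that is what this question is about:
import qualified Data.Map as Map

table :: Map.Map String Int
table = Map.fromList [("one", 1), ("two", 2)] -- built once from all pairs

main :: IO ()
main = print (Map.lookup "two" table) -- pure O(log n) lookup: Just 2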
Johan Tibell's recent talk, Faster persistent data structures through hashing gives an overview, while Milan Straka's recent Haskell Symposium paper specifically outlines the Data.Map structure and the hashmap package.
If you have all the key-value pairs up front you might want to consider a perfect hash function.
Benchmarking will tell you when to switch from a simple list.
