Hashes: Tables, Lists and Maps, Oh My? - data-structures

I've been trying to find some concrete (layman's, non-super-academic) definitions for the various types of hash data structures, specifically hash tables, hash lists and hash maps. Online searches provide many useful links to all of these, but never give clear definitions of when it is appropriate to use each over the others.
(1) From a practical standpoint, what's the difference between these 3?
(2) How do their operations' run times differ? Are there clear instances when one should be used or avoided over the other types of hashes?
(3) How do each of these relate back to the Map ADT? Are they all just different implementations of it, or different beasts altogether?
Thanks for any insight here!

There's an abstract data structure that represents a mapping between keys and values. It has several different names, including Map, Dictionary, Table, Association Table, and more.
The most basic operations this data structure should support are adding, removing and retrieving a value, given its associated key. There are variations and additions around this basic concept - for instance, some structures support iterating over all the key-value pairs, some support multiple values per key, etc. There's also a difference in time and space complexity between the various implementations.
Of the multiple implementations available for this data structure, some of the most popular ones use hash functions for fast access times. Those implementations are usually called Hash Table or Hash Map; you can read more about them on Wikipedia. Performance also varies between hash table implementations, with some reaching amortized O(1) insertion and access complexity (at the price of considerable extra space).
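To make those basic operations concrete, here is a minimal sketch using Java's HashMap, one hash-based implementation of this ADT (the keys and values are purely illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class MapAdtDemo {
        public static void main(String[] args) {
            // A hash-based implementation of the Map ADT.
            Map<String, Integer> ages = new HashMap<>();

            // Add (or overwrite) a value under a key - amortized O(1) on average.
            ages.put("alice", 30);
            ages.put("bob", 25);

            // Retrieve a value by its key - also O(1) on average.
            System.out.println("alice -> " + ages.get("alice")); // 30

            // Remove a key-value pair.
            ages.remove("bob");

            // Iterate over all remaining key-value pairs (order is unspecified).
            for (Map.Entry<String, Integer> entry : ages.entrySet()) {
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
        }
    }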
A hash list, on the other hand, is a different thing, and is more about how a data structure is used than about its actual structure. A hash list is usually just a regular list of hash values; there is nothing special about it. It's used when verifying the integrity of a large piece of data: it allows the individual data chunks to be verified independently, so only the bad chunks need to be fixed or re-fetched. This is in contrast to using a single hash value over the entire piece of data, in which case a mismatch means all the data has to be fixed or retrieved again.
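For contrast, here is a small sketch of the hash-list idea using Java's MessageDigest; the chunk size and the simulated corruption are purely illustrative:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class HashListDemo {
        // Hash each fixed-size chunk of the data separately: that list of hashes is the hash list.
        static List<byte[]> buildHashList(byte[] data, int chunkSize) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            List<byte[]> hashes = new ArrayList<>();
            for (int offset = 0; offset < data.length; offset += chunkSize) {
                int end = Math.min(offset + chunkSize, data.length);
                hashes.add(digest.digest(Arrays.copyOfRange(data, offset, end)));
            }
            return hashes;
        }

        public static void main(String[] args) throws Exception {
            byte[] original = "a large piece of data, transferred in chunks".getBytes(StandardCharsets.UTF_8);
            List<byte[]> hashList = buildHashList(original, 8);

            // A received copy can be verified chunk by chunk:
            // only chunks whose hash doesn't match need to be re-fetched.
            byte[] received = original.clone();
            received[10] ^= 0x01; // simulate corruption in the second chunk
            List<byte[]> receivedHashes = buildHashList(received, 8);
            for (int i = 0; i < hashList.size(); i++) {
                boolean ok = Arrays.equals(hashList.get(i), receivedHashes.get(i));
                System.out.println("chunk " + i + (ok ? ": OK" : ": corrupted, re-fetch just this chunk"));
            }
        }
    }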

Related

Checking whether a given hash is in a very, very long list

I have a list of hashes. A long list. A very long list. I need to check whether a given hash is in that list.
The easiest way is to store the hashes in memory (in a map or a simple array) and check there. But that would require a lot of RAM/SSD/HDD storage - more than a server (or several) can handle.
I'm wondering whether there is a trick to do this with reasonable memory usage. Maybe there's an algorithm I'm not familiar with, or a special collection?
Three thoughts:
Depending on the structure of these hashes, you may be able to borrow some ideas from the concept of a Rainbow Table to implicitly store some of them.
You could use a trie to compress storage for shared prefixes if you have enough hashes; however, given their length and (likely) uniform distribution, you won't see terrific savings.
You could split the hash into multiple smaller hashes and use these to implement a Bloom filter. This is a probabilistic test, so you'll still need the hashes stored somewhere else (or be able to calculate/derive them) whenever there's a perceived "hit", but it may filter out enough "misses" that a slower, less RAM-hungry data structure becomes feasible (see the sketch below).
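Here is a minimal Bloom filter sketch in Java along those lines. The bit count, the number of slices, and the assumption that each stored item is already a hash of at least numSlices * 4 bytes are illustrative choices, not a tuned implementation:

    import java.nio.ByteBuffer;
    import java.util.BitSet;

    // Minimal Bloom filter for pre-filtering membership queries on a huge list of hashes.
    public class HashBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numSlices;

        public HashBloomFilter(int numBits, int numSlices) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numSlices = numSlices;
        }

        // The items are already hashes, so instead of hashing them again we slice the
        // digest bytes into several smaller indexes ("multiple smaller hashes").
        // Assumes hash.length >= numSlices * 4.
        private int sliceIndex(byte[] hash, int slice) {
            int value = ByteBuffer.wrap(hash, slice * 4, 4).getInt();
            return Math.floorMod(value, numBits);
        }

        public void add(byte[] hash) {
            for (int i = 0; i < numSlices; i++) {
                bits.set(sliceIndex(hash, i));
            }
        }

        // false -> definitely NOT in the set; true -> probably in the set,
        // so confirm against the slower, authoritative storage.
        public boolean mightContain(byte[] hash) {
            for (int i = 0; i < numSlices; i++) {
                if (!bits.get(sliceIndex(hash, i))) {
                    return false;
                }
            }
            return true;
        }

        public static void main(String[] args) throws Exception {
            HashBloomFilter filter = new HashBloomFilter(1 << 24, 4); // ~2 MB of bits, 4 slices
            java.security.MessageDigest sha = java.security.MessageDigest.getInstance("SHA-256");
            byte[] known = sha.digest("known item".getBytes("UTF-8"));
            filter.add(known);
            System.out.println(filter.mightContain(known));                                // true
            System.out.println(filter.mightContain(sha.digest("other".getBytes("UTF-8")))); // almost certainly false
        }
    }

Size the bit array and slice count for the expected number of items and an acceptable false-positive rate; any "probably present" answer still has to be confirmed against the full list.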

Why is lookup in an Array O(1)?

I believe that in some languages other than Ruby, an Array lookup is O(1) because you know where the data starts, and you multiply the index by the size of the data the array is holding, and then access that memory location.
However, in Ruby, an Array can have objects from different classes, so how does it manage to do a lookup of O(1) complexity?
What @Neil Slater said, with a little more detail…
There are basically two plausible approaches to storing an array of heterogeneous objects of differing sizes:
Store the objects as a singly- or doubly-linked list, with the storage space for each individual object preceded by pointer(s) to the preceding and/or following objects. This structure has the advantage of making it very easy to insert new objects at arbitrary points without shifting around the rest of the array, but the huge downside is that looking up an object by its position is generally O(N), since you have to start from one end of the list and jump through it node-by-node until you arrive at the n-th one.
Store a table or array of constant-sized pointers to the individual objects. Since this lookup table contains constant-sized items in a contiguous ordered layout, looking up the address of an individual object is O(1); the table is just a C-style array, in which a lookup takes only one to a few machine instructions, even on RISC CPU architectures.
(The allocation strategies for storing the individual objects are also interesting and complex, but not immediately relevant to your question.)
Dynamic languages like Perl/Python/Ruby pretty much all opt for #2 for their general-purpose list/array types. In other words, they make lookup more efficient than inserting objects at random locations in the list, which is the better choice for many applications.
I'm not familiar with the implementation details for Ruby, but they are likely quite similar to those of Python's list type, whose performance and design is explained in wonderful detail at effbot.org.
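As a rough Java analogy of approach #2 (not Ruby's actual implementation): an Object[] stores only constant-size references, so indexing is a single address computation no matter how large or heterogeneous the referenced objects are.

    public class ReferenceArrayDemo {
        public static void main(String[] args) {
            // The array holds constant-size references, not the objects themselves,
            // so element i lives at (base address + i * reference size).
            Object[] mixed = new Object[] {
                "a string",
                42,
                new int[] { 1, 2, 3 },
                new StringBuilder("a larger, growable object")
            };

            // O(1): a bounds check plus one address computation, regardless of
            // what kind of object each slot refers to.
            Object third = mixed[2];
            System.out.println(((int[]) third).length); // 3
        }
    }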
Its implementation probably contains an array of memory addresses pointing to the actual objects, so it can still perform a lookup without looping through the array.

Why aren't cryptographic hash functions used in data structures?

While some algorithms like MD5 haven't quite stood the test of time with regards to the security industry, others like the SHA family of functions have (thus far). Yet despite the discovery, or theoretical existence of collisions within their domains, cryptographic hash functions still provide an incredibly well distributed range of fixed length output mappings for data of arbitrary length and type – why aren’t they used in data structures more often? Isn’t the goal of a hash table (provided a good function) to map every input to a unique key, such that chaining, nested tables and other collision handling techniques become entirely moot? It’s certainly convenient being able to feed almost anything to a function, and know the exact length of the key you will receive! Seems like an ideal use for retired security protocols to me.
Cryptographic hash functions can be, and are, used as the hash function in hash tables - just not very often. Their drawback is that they are very 'expensive' in terms of processing power compared to the more traditional hash functions used in hash tables.
Traditional hash functions have all the characteristics that you need for a hash table, but require far fewer CPU cycles. This has changed a bit now that many chipsets include hardware acceleration for these cryptographic hashes, though.
Also, the 'index' generated with a cryptographic hash function is too large, so you need to trim it down by either a reduction or masking. (You don't need 16 bytes of hash table index ;))
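A minimal Java sketch of that trimming step, assuming SHA-256 and a power-of-two table size (both choices are illustrative):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class CryptoHashIndex {
        public static void main(String[] args) throws Exception {
            int tableSize = 1 << 16; // 65,536 buckets; a power of two makes masking easy
            String key = "some map key";

            // A 32-byte SHA-256 digest is far wider than any realistic table index...
            byte[] digest = MessageDigest.getInstance("SHA-256")
                                         .digest(key.getBytes(StandardCharsets.UTF_8));

            // ...so take the first 4 bytes and mask them down to the table size.
            int raw = ByteBuffer.wrap(digest).getInt();
            int bucket = raw & (tableSize - 1);
            System.out.println("bucket = " + bucket);
        }
    }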
All in all, they are often not worth the hassle.

How to implement an efficient Excel-like app

I need to implement an efficient Excel-like app.
I'm looking for a data structure that will:
Store the data in an efficient manner (for example, I don't want to pre-allocate memory for unused cells).
Allow efficient updates when the user changes a formula in one of the cells.
Any ideas?
Thanks,
Li
In this case, you're looking for an online dictionary structure. This is a category of structures which allow you to associate one morsel of data (in this case, the coordinates that represent the cell) with another (in this case, the cell contents or formula). The "online" adjective means dictionary entries can be added, removed, or changed in real time.
There are many such structures. To name some of the more common ones: hash tables, binary trees, skip lists, linked lists, and even plain lists stored in arrays.
Of course, some of these are more efficient than others (depending on implementation and the number of entries). Typically I use hash tables for this sort of problem.
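For example, a sparse sheet can be modeled as a hash table keyed by cell coordinates. This is only a sketch with made-up names, not a full spreadsheet engine:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    public class SparseSheetDemo {
        // Cell coordinates used as the dictionary key.
        static final class Coord {
            final int row, col;
            Coord(int row, int col) { this.row = row; this.col = col; }

            @Override public boolean equals(Object o) {
                return o instanceof Coord && ((Coord) o).row == row && ((Coord) o).col == col;
            }
            @Override public int hashCode() { return Objects.hash(row, col); }
        }

        public static void main(String[] args) {
            // Only cells that actually hold something consume memory.
            Map<Coord, String> cells = new HashMap<>();
            cells.put(new Coord(0, 0), "=A2*2");
            cells.put(new Coord(1, 0), "21");

            // Updating a formula is an O(1) overwrite of a single entry.
            cells.put(new Coord(0, 0), "=A2*3");

            System.out.println(cells.get(new Coord(0, 0))); // "=A2*3"
            System.out.println(cells.get(new Coord(5, 5))); // null - unused cell, no storage
        }
    }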
However, if you need to do range queries ("modify all of the cells in this range"), you may be better off with a binary tree or a more complicated spatial structure - but that's not likely given the simple requirements of the problem.

External store for complex collections that can be accessed by Key-Value

Problem
I need a key-value store that can store values of the following form:
DS<DS<E>>
where the data structure DS can be either a List, a SortedSet or an Array, and E can be either a String or a byte array.
It is very expensive to generate this data and so once I put it into the store, I will only perform read queries on it. Essentially it is a complex object cache with no eviction.
Example Application
A (possibly bad, but sufficient to clarify) example of an application is storing tokenized sentences from a document where you need to be able to quickly access the qth word of the pth sentence given documentID. In this case, I would be storing it as a K-V pair as follows:
K - docID
V - List<List<String>>
String word = map.get(docID).get(p).get(q);
I prefer to avoid app-integrated Map solutions (such as EhCache within Java).
I have worked with Redis but it doesn't appear to support the second layer of data-structure complexity. Any other K-V solutions that can help my use case?
Update:
I know that I could serialize/deserialize my object but I was wondering if there is any other solution.
In terms of platform choice you have two options: a full document database will support arbitrarily complex objects, but won't have built-in commands for working with specific data structures, while something like Redis, which does have optimised code for specific data structures, can't support all possible data structures.
You can actually get pretty close with Redis by using ids instead of the nested data structure. DS1<DS2<E>> becomes DS1<int> and DS2<E>, with the int from DS1 and a prefix giving you the key holding DS2.
With this structure you can access any E with only two operations. In some cases you will be able to get that down to a single operation by knowing what the id of DS2 will be for a given query.
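A rough sketch of that id-based layout, assuming the Jedis client and illustrative key names (doc:<id> holds a list of sentence keys, and each sentence key holds a list of words):

    import redis.clients.jedis.Jedis;

    public class NestedListDemo {
        public static void main(String[] args) {
            try (Jedis redis = new Jedis("localhost", 6379)) {
                String docId = "doc42";

                // Outer structure: doc:<docId> is a list of sentence ids.
                redis.rpush("doc:" + docId, "sent:" + docId + ":0", "sent:" + docId + ":1");

                // Inner structures: each sentence id is itself the key of a word list.
                redis.rpush("sent:" + docId + ":0", "the", "quick", "brown", "fox");
                redis.rpush("sent:" + docId + ":1", "jumps", "over", "the", "lazy", "dog");

                // Two operations to reach any word: look up the sentence key, then the word.
                int p = 1, q = 2;
                String sentenceKey = redis.lindex("doc:" + docId, p);
                System.out.println(redis.lindex(sentenceKey, q)); // "the"

                // One operation when the sentence key can be derived from the query itself.
                System.out.println(redis.lindex("sent:" + docId + ":" + p, q)); // "the"
            }
        }
    }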
I hesitate to "recommend" it, but one of the only storage engines I know of which handles multi-dimensional data of this sort efficiently is Intersystems Cache. I had to use it at my last job, mostly coding against it using its built-in MUMPS-based language. I would not recommend the native approach, unless you hate yourself or your developers. However, they do have decent Java adapters, which appears to be what you're using. I've seen it handle billions of records, efficiently stored in nested binary tree tables. There is no practical limit to the depth (number of dimensions) you can use. However, this is very much a proprietary solution. There is an open-source alternative called GT.M, but I don't know how compatible it is with languages that aren't M or C.
Any key-value store supports complex values; you just need to serialize/deserialize the data.
If you want fast retrieval only for specific parts of the data, you could use a more complex Key. In your example this would be:
K - tuple(docID, p, q)
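A minimal sketch of that composite-key idea, with an in-memory map standing in for the external store (the key format is just an illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class CompositeKeyDemo {
        // Flatten (docID, p, q) into a single string key so any plain K-V store
        // can serve individual words directly, without deserializing whole documents.
        static String key(String docId, int p, int q) {
            return docId + ":" + p + ":" + q; // e.g. "doc42:1:2"
        }

        public static void main(String[] args) {
            Map<String, String> store = new HashMap<>(); // stand-in for the external store
            store.put(key("doc42", 1, 2), "the");

            System.out.println(store.get(key("doc42", 1, 2))); // "the"
        }
    }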
