HashMap vs TreeMap in MapDB: how to compare them?

Recently I started playing with MapDB and learning about its interesting properties. As I understand it, it has three major data types: BTree, HashMap and HashSet. What is still a little obscure to me is when it is better to use HashMap (and HashSet) rather than BTree. What are the pros and cons of each data structure compared to the other?

In 1.0, HashMap is better for larger keys; it also has entry expiration based on TTL or maximal size. TreeMap is sorted and has a data pump.
I would recommend HashMap in general.

Related

Hashtable with both values as key

Is there a hashing-based data structure in which I can look up an item in O(1) time by either key or value?
This can be achieved by adding a duplicate entry for each key-value pair with the key and value reversed, but that takes double the space.
This kind of data structure might be useful in some scenarios. For example, I want to store opening and closing parentheses in a map, and while parsing a string I want to check whether a character is present in the map without worrying about whether it is an opening-to-closing or a closing-to-opening map, and without storing duplicates.
I hope I am clear enough!!
The data structure that fulfills your needs is called a bidirectional map.
I suppose that you are looking for an existing implementation rather than pointers on how to implement it :) Since you didn't specify the programming language, here is the current situation for Java: there is no such data structure in the Java API. However, Google Guava has a bidirectional map interface (BiMap) with several implementations. From the docs:
A bimap (or "bidirectional map") is a map that preserves the
uniqueness of its values as well as that of its keys. This constraint
enables bimaps to support an "inverse view", which is another bimap
containing the same entries as this bimap but with reversed keys and
values.
Alternatively, there is BidiMap from Apache Commons Collections.
For C++, have a look at Boost.Bimap.
For Python, have a look at bidict.
In C#, as well as in other languages, there does not exist an official implementation, but that's where Jon Skeet comes in.
You're searching for a bidirectional map. Here is an article describing the implementation in C++. Note, though, that a bidirectional map is basically two maps merged into a single object. There isn't any more efficient solution than this, for a simple reason:
A map is basically an unconnected directed graph of (key, value) pairs, where each pair is represented by an edge. If you want the map to be bidirectional, you wind up with twice as many edges, which doubles the amount of required memory.
Neither the C++ standard library nor the Java standard library provides a class for this purpose. In Java you can use Google's Guava library; in C++ the Boost library provides bidirectional maps.
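The "two maps merged into a single object" idea can be sketched as follows. This is a minimal illustration in Python, not the Guava or Boost implementation; the class name `BiMap` and its methods `put`, `get` and `inverse_get` are hypothetical names chosen for this example.

```python
class BiMap:
    """Hypothetical bidirectional map built from two plain dicts.

    Keys and values must both be hashable, and both stay unique.
    """

    def __init__(self):
        self._forward = {}   # key -> value
        self._backward = {}  # value -> key

    def put(self, key, value):
        # Remove any stale pair that would break uniqueness on either side.
        if key in self._forward:
            del self._backward[self._forward[key]]
        if value in self._backward:
            del self._forward[self._backward[value]]
        self._forward[key] = value
        self._backward[value] = key

    def get(self, key):
        """Look up by key; average O(1)."""
        return self._forward[key]

    def inverse_get(self, value):
        """Look up by value; average O(1)."""
        return self._backward[value]

    def __contains__(self, item):
        # True if the item appears as a key or as a value.
        return item in self._forward or item in self._backward

    def __len__(self):
        return len(self._forward)


# Usage: the parenthesis example from the question.
brackets = BiMap()
brackets.put("(", ")")
brackets.put("[", "]")
```

Both directions are average O(1), at the cost of storing every pair twice, which is the memory trade-off described above.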

A data structure with certain properties

I want to implement a data structure myself in C++11. What I'm planning to do is having a data structure with the following properties:
search. O(log(n))
insert. O(log(n))
delete. O(log(n))
iterate. O(n)
What I have been thinking about after research was implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jumpstart.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations), B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are the data structures usually used to accomplish efficient insertion and lookup. If you want to go wild, you can combine the two by using a tree instead of a linked list to handle hash collisions (although this is likely to make your implementation slower unless you have made massive mistakes in sizing your hash table or in choosing an adequate hash function).
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function), although this only has uses in special applications (I have had the opportunity to use a minimal perfect hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope, so you can also go wild on that front.
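As one concrete member of the self-balancing family, here is a minimal treap sketch. It is written in Python for brevity rather than C++11, and it is a randomized structure, so search, insert and delete are O(log n) in expectation rather than in the worst case; iteration is O(n). All names (`Node`, `split`, `merge`, `insert`, `delete`, `contains`, `in_order`) are chosen for this example.

```python
import random


class Node:
    __slots__ = ("key", "prio", "left", "right")

    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # random heap priority keeps the tree balanced in expectation
        self.left = None
        self.right = None


def split(node, key):
    """Split a treap into (keys < key, keys >= key)."""
    if node is None:
        return None, None
    if node.key < key:
        left, right = split(node.right, key)
        node.right = left
        return node, right
    left, right = split(node.left, key)
    node.left = right
    return left, node


def merge(a, b):
    """Merge two treaps where every key in a is smaller than every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio > b.prio:
        a.right = merge(a.right, b)
        return a
    b.left = merge(a, b.left)
    return b


def contains(node, key):
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


def insert(root, key):
    """Return the new root; duplicates are ignored."""
    if contains(root, key):
        return root
    left, right = split(root, key)
    return merge(merge(left, Node(key)), right)


def delete(root, key):
    """Return the new root; a missing key leaves the tree unchanged."""
    if root is None:
        return None
    if key == root.key:
        return merge(root.left, root.right)
    if key < root.key:
        root.left = delete(root.left, key)
    else:
        root.right = delete(root.right, key)
    return root


def in_order(node):
    """Yield all keys in sorted order using an explicit stack, O(n) overall."""
    stack = []
    while stack or node is not None:
        while node is not None:
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.key
        node = node.right
```

A red-black or AVL tree would give worst-case rather than expected bounds, at the price of more rebalancing cases to implement.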

Computing percentiles

I'm writing a program that's going to generate a bunch of data. I'd like to find various percentiles over that data.
The obvious way to do this is to store the data in some kind of sorted container. Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?
The alternative is to use an unordered container and perform sorting at the end. I don't know if that's going to be any faster. Either way, we're still left with needing a container which offers fast random access. (An array, perhaps...)
Suggestions?
(Another alternative is to build a histogram, rather than keep the entire data set in memory. But since the objective is to compute percentiles extremely accurately, I'm reluctant to go down that route. I also don't know the range of my data until I generate it...)
Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?
Yes, it's your good old Data.Map. See elemAt and other functions under the «Indexed» category.
Data.Set didn't offer these at the time of writing (newer versions of the containers package do), but you can emulate it with Data.Map YourType ().
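For comparison, the "sort at the end" alternative from the question can be sketched as below. This is in Python for illustration, not Haskell, and it assumes the nearest-rank definition of a percentile (the question does not say which definition is wanted); the function name `percentile` is hypothetical.

```python
import math


def percentile(sorted_data, p):
    """Nearest-rank percentile of an already sorted, non-empty sequence.

    Returns the smallest element such that at least p percent of the
    data is less than or equal to it. Assumes 0 < p <= 100.
    """
    if not sorted_data:
        raise ValueError("percentile of empty data is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    rank = math.ceil(p * len(sorted_data) / 100)  # 1-based rank
    return sorted_data[rank - 1]


# Usage: collect unordered, sort once, then index as often as needed.
samples = [40, 15, 50, 20, 35]
ordered = sorted(samples)
median = percentile(ordered, 50)
```

Sorting once costs O(n log n), after which each percentile query is a single O(1) index into the sorted array.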

External store for complex collections that can be accessed by Key-Value

Problem
I need a key-value store that can store values of the following form:
DS<DS<E>>
where the data structure DS can be
either a List, SortedSet or an Array
and E can be either a String or byte-array.
It is very expensive to generate this data and so once I put it into the store, I will only perform read queries on it. Essentially it is a complex object cache with no eviction.
Example Application
A (possibly bad, but sufficient to clarify) example of an application is storing tokenized sentences from a document where you need to be able to quickly access the qth word of the pth sentence given documentID. In this case, I would be storing it as a K-V pair as follows:
K - docID
V - List<List<String>>
String word = map.get(docID).get(p).get(q);
I prefer to avoid app-integrated Map solutions (such as EhCache within Java).
I have worked with Redis but it doesn't appear to support the second layer of data-structure complexity. Any other K-V solutions that can help my use case?
Update:
I know that I could serialize/deserialize my object but I was wondering if there is any other solution.
In terms of platform choice you have two options - A full document database will support arbitrarily complex objects, but won't have built in commands for working with specific data structures. Something like Redis which does have optimised code for specific data structures can't support all possible data structures.
You can actually get pretty close with Redis by using ids instead of the nested data structure. DS1<DS2<E>> becomes DS1<int> and DS2<E>, with the int from DS1 and a prefix giving you the key holding DS2.
With this structure you can access any E with only two operations. In some cases you will be able to get that down to a single operation by knowing what the id of DS2 will be for a given query.
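The id-based layout can be sketched like this. A plain Python dict stands in for the key-value store here, so none of this is Redis API; the key prefixes `doc:` and `sent:` and the function names are assumptions made for the example, and the ids are derived strings rather than the ints mentioned above.

```python
store = {}  # stand-in for the external key-value store


def put_document(doc_id, sentences):
    """Store the outer list as a list of ids and each inner list under its own key."""
    sentence_ids = []
    for index, words in enumerate(sentences):
        sentence_id = f"{doc_id}:{index}"
        store[f"sent:{sentence_id}"] = list(words)
        sentence_ids.append(sentence_id)
    store[f"doc:{doc_id}"] = sentence_ids


def get_word(doc_id, p, q):
    """Two lookups: outer structure first, then the inner one."""
    sentence_id = store[f"doc:{doc_id}"][p]
    return store[f"sent:{sentence_id}"][q]


def get_word_direct(doc_id, p, q):
    """One lookup, possible because the inner id is predictable from doc_id and p."""
    return store[f"sent:{doc_id}:{p}"][q]


put_document("d1", [["hello", "world"], ["key", "value", "store"]])
```

With a real store, each dict access would become one read command against the corresponding list or sorted-set key.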
I hesitate to "recommend" it, but one of the only storage engines I know of that handles multi-dimensional data of this sort efficiently is InterSystems Caché. I had to use it at my last job, mostly coding against it in its built-in MUMPS-based language. I would not recommend the native approach, unless you hate yourself or your developers. However, they do have decent adapters for Java, which appears to be what you're using. I've seen it handle billions of records, efficiently stored in nested binary tree tables. There is no practical limit to the depth (number of dimensions) you can use. However, this is very much a proprietary solution. There is an open-source alternative called GT.M, but I don't know how compatible it is with languages that aren't M or C.
Any Key-Value store supports complex values, you just need to serialize/deserialize the data.
If you want fast retrieval only for specific parts of the data, you could use a more complex Key. In your example this would be:
K - tuple(docID, p, q)
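A small sketch of the composite-key idea, again with a Python dict standing in for the store. The function name `flatten_document` is hypothetical, and a real store would need the tuple encoded as a string or byte key.

```python
def flatten_document(doc_id, sentences):
    """Map every (doc_id, p, q) position to the word stored there."""
    return {
        (doc_id, p, q): word
        for p, words in enumerate(sentences)
        for q, word in enumerate(words)
    }


index = flatten_document("d1", [["hello", "world"], ["key", "value", "store"]])
word = index[("d1", 1, 0)]
```

Each word is then a single lookup, at the cost of one stored entry per word instead of one per document.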

A Haskell hash implementation that does not live in the IO monad

I am looking for a data structure that works a bit like Data.HashTable but that is not encumbered by the IO monad. At the moment, I am using [(key,val)]. I would like a structure that is O(log n) where n is the number of key value pairs.
The structure gets built infrequently compared to how often it must be read, and when it is built, I have all the key value pairs available at the same time. The keys are Strings if that makes a difference.
It would also be nice to know at what size it is worth moving away from [(key,val)].
You might consider:
Data.Map
or alternatively,
Data.HashMap
The former is the standard container for storing and looking up elements by keys in Haskell. The latter is a new library specifically optimized for hashing keys.
Johan Tibell's recent talk, Faster persistent data structures through hashing, gives an overview, while Milan Straka's recent Haskell Symposium paper specifically outlines the Data.Map structure and the hashmap package.
If you have all the key-value pairs up front you might want to consider a perfect hash function.
Benchmarking will tell you when to switch from a simple list.
