Using boost::unordered_map - performance

I am using a dynamic programming approach to solve a problem. Here is a brief overview of the approach:
Each value generated is identified by 25 unique keys.
I use boost::hash_combine to generate the seed for the hash table from these 25 keys.
I store the values in a hash table declared as
boost::unordered_map<Key_Object, Data_Object, HashFunction> hashState;
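For reference, the hashing code looks roughly like this (a sketch; the int fields are illustrative, my actual Key_Object is more involved):

#include <algorithm>
#include <cstddef>
#include <boost/functional/hash.hpp>

struct Key_Object {
    int k[25];                                     // the 25 keys identifying a value
    bool operator==(const Key_Object& other) const {
        return std::equal(k, k + 25, other.k);
    }
};

struct HashFunction {
    std::size_t operator()(const Key_Object& key) const {
        std::size_t seed = 0;
        for (int i = 0; i < 25; ++i)
            boost::hash_combine(seed, key.k[i]);   // fold each key into the seed
        return seed;
    }
};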
I profiled my algorithm and found that nearly 95% of the run time is spent retrieving data from and inserting data into the hash table.
Here are the details of my hash table:
hashState.size() 1880
hashState.load_factor() 0.610588
hashState.bucket_count() 3079
hashState.max_size() 805306456
hashState.max_load_factor() 1
hashState.max_bucket_count() 805306457
I have the following two questions:
Is there anything I can do to improve the performance of the hash table's insert/retrieve operations?
The C++ STL has hash_multimap, which would also suit my requirements. How does Boost's unordered_map compare with hash_multimap in terms of insert/retrieve performance?

If your hash function is not the culprit, the best you can do is probably to use a different map implementation. Since your keys are quite large, using unordered_map from the Boost.Intrusive library might be the best option. Alternatively, you could try closed hashing: Google SparseHash or MCT, though profiling is certainly needed, because closed hashing is recommended when elements are small. (SparseHash is more established and well tested, but MCT doesn't need the set_empty_key()/set_deleted_key() methods.)
EDIT:
I just noticed there is no intrusive map in the Boost library I mentioned, only set and multiset. Still, you can try the two closed hashing libraries.
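For illustration, here is a minimal sketch of what trying SparseHash's dense_hash_map could look like (assuming SparseHash is installed; dense_hash_map requires you to designate a key value that will never be inserted, via set_empty_key, plus a second one via set_deleted_key if you erase):

#include <string>
#include <sparsehash/dense_hash_map>

int main() {
    // Closed hashing / open addressing: the table is one flat array,
    // so collision probes stay cache-friendly, unlike a node-based map.
    google::dense_hash_map<int, std::string> table;
    table.set_empty_key(-1);     // sentinel key that will never be inserted
    table.set_deleted_key(-2);   // needed only if you erase elements
    table[42] = "some value";
    return 0;
}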
EDIT 2:
STL hash_map is not standard; it is a compiler-specific extension and is not portable across compilers.

Are you sure that the hash function you are using is not the bottleneck?
What percentage of the time does the hash function take?
Can you run the same test, but replace the inserts/retrievals with a simple call to the hash function?
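Something along these lines would isolate the cost of the hash itself (a sketch; std::hash stands in for the HashFunction from the question):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>

int main() {
    std::hash<long long> hash;        // stand-in for the real HashFunction
    volatile std::size_t sink = 0;    // keeps the compiler from removing the loop
    auto start = std::chrono::steady_clock::now();
    for (long long i = 0; i < 10000000; ++i)
        sink = sink + hash(i);        // hash only: no bucket lookup, no insert
    auto stop = std::chrono::steady_clock::now();
    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::printf("hash-only time: %lld ms\n", ms);
    return 0;
}

Comparing this figure against the full insert/retrieve timing tells you how much of the 95% is the hash function itself.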

Related

Millions of searches in unordered_map, runtime hogger

I have around 5000 strings (mostly 50-80 characters long). Currently I create an unordered_map, push these keys into it, and during execution I access them (using the map's find function) 10-100 million times. I did some profiling around this search, and it seems to be the runtime hogger.
I searched for better and faster search options but did not find anything substantial.
Does anyone have an idea how to make this faster? I am open to custom-made containers as well. I did try std::map, but it did not help. Please share a link if you have one.
One more point: I also modify the values of some keys at runtime, but not very often. It's mostly searches.
Having considered a question similar to yours, C++ ~ 1M look-ups in unordered_map with string key works much slower than .NET code, I would guess you have run into an issue caused by the hash function used by std::unordered_map. For strings of 50-80 characters, that hash can produce a lot of collisions, and collisions significantly degrade look-up performance.
I would suggest using a custom hash function with std::unordered_map. Or you could give A fast, memory efficient hash map for C++ a try.
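For example, a custom FNV-1a string hash can be plugged in like this (a sketch; whether it actually beats your standard library's string hash is something only profiling on your 50-80 character keys can tell):

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

struct Fnv1aHash {
    std::size_t operator()(const std::string& s) const {
        std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
        for (unsigned char c : s) {
            h ^= c;                                  // mix in one byte
            h *= 1099511628211ull;                   // FNV prime
        }
        return static_cast<std::size_t>(h);
    }
};

std::unordered_map<std::string, int, Fnv1aHash> table;   // drop-in custom hash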

How to achieve time travel with Clojure

Is there a way to achieve time travel in Clojure? For example, if I have a vector (which is internally a tree implemented as a persistent data structure), is there a way to travel back and get previous versions of that vector? This is the kind of thing Datomic does at the database level. Since Clojure and Datomic share many concepts, including immutable facts implemented as persistent data structures, technically the older versions of the vector are still there. So I was wondering whether time travel, i.e. getting previous versions, is possible in plain Clojure, similar to what Datomic does at the database level.
Yes, but you need to keep a reference to it in order to access it and to prevent it from being garbage collected. Clojurists often implement undo/redo this way: all you need to do is maintain a list of the historical states of your data, and then you can trivially step backward.
David Nolen has described this approach here, and you can find a more detailed example and explanation here.
Datomic is plain Clojure. You can use Datomic as a Clojure library either with an in-memory database (for version tracking) or with no database at all.

Is a universal family of hash functions only to prevent enemy attacks?

If my intention is only to have a good hash function that spreads data evenly across all of the buckets, then I need not come up with a family of hash functions; I could just do with one good hash function. Is that correct?
The purpose of having a family of hash functions is only to make it harder for an enemy to build a pathological data set: when we pick a hash function at random, he or she has no information about which hash function is employed. Is my understanding right?
EDIT:
Since someone is trying to close this as unclear: this question asks about the real purpose of employing a universal family of hash functions.
I could just do with one good hash function, is that correct?
As you note later in your question, an "enemy" who knows which hash function you're using could prepare a pathological data set.
Further, hashing is just the first stage in storing data into your table's buckets. If you're implementing open addressing / closed hashing, you also need to select alternative buckets to probe after collisions. Simple approaches like linear and quadratic probing generally provide adequate collision avoidance, and are mathematically simpler and therefore likely faster than rehashing, but because of clustering they don't keep the probability of each successive probe finding an unused bucket at roughly (1 - load factor). Rehashing with another good hash function (including another member of a family of such functions) does, so if that is important to you, you may prefer to use a family of hash functions.
Note too that an in-memory hash table is sometimes used to record at which offsets/sectors data is stored on disk, so extra rehashing calculations on already-in-memory data may be far more appealing than a higher probability (with linear/quadratic probing) of waiting on disk I/O only to find another collision.
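To make the "family" idea concrete, here is a sketch of the classic Carter-Wegman universal family for integer keys, h(x) = ((a*x + b) mod p) mod m, with p prime and a, b drawn at random per table:

#include <cstdint>
#include <random>

// One randomly chosen member of the family h(x) = ((a*x + b) mod p) mod m.
struct UniversalHash {
    static constexpr std::uint64_t p = 4294967291ull;  // largest prime below 2^32; keys must be < p

    std::uint64_t a, b, m;

    UniversalHash(std::uint64_t num_buckets, std::mt19937_64& rng)
        : a(rng() % (p - 1) + 1),   // a uniform in [1, p-1]
          b(rng() % p),             // b uniform in [0, p-1]
          m(num_buckets) {}

    std::uint64_t operator()(std::uint32_t x) const {
        return ((a * x + b) % p) % m;   // (p-1)*p fits in 64 bits, so no overflow
    }
};

An enemy who controls the input but not the random draw of (a, b) cannot construct a data set that is pathological for whichever member of the family you end up using.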

Efficient Data Structures in Maple

I'm working with a large amount of data in Maple and I need to know the most efficient way to store it. I started with lists, but I quickly learned how inefficient those are, so I have since replaced them. Now I'm using a mixture of Arrays (for structures with a fixed length) and tables (for structures with variable length), but my code actually runs significantly slower than it did when I was using only lists.
So here are my questions:
What is the most efficient data structure to use in Maple for a static-length set of data? For a variable-length set?
Are there any "gotchas" I need to be aware of when using these structures as parameters in a recursive proc? If using Arrays or tables, does each one need to be copied for each iteration to avoid clobbering data?
I think I can wrap this one up now. I made a few performance improvements, mostly small tweaks that helped only a bit, but I did manage a big improvement by removing as many instances of the copy command as I could (I had used it on Arrays and tables). It turns out this was what was causing my Array/table implementation to be slower than my list-only implementation. But the code still didn't run as fast as I needed, so I rewrote it in C#. That's probably not the best solution for "how to improve Maple efficiency", but it sure does run a lot faster now.

Computing percentiles

I'm writing a program that's going to generate a bunch of data. I'd like to find various percentiles over that data.
The obvious way to do this is to store the data in some kind of sorted container. Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?
The alternative is to use an unordered container and perform the sorting at the end. I don't know whether that would be any faster. Either way, we're still left needing a container that offers fast random access. (An array, perhaps...)
Suggestions?
(Another alternative is to build a histogram, rather than keep the entire data set in memory. But since the objective is to compute percentiles extremely accurately, I'm reluctant to go down that route. I also don't know the range of my data until I generate it...)
Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?
Yes, it's your good old Data.Map. See elemAt and other functions under the «Indexed» category.
Data.Set doesn't offer these, but you can emulate it with Data.Map YourType ().
