I am writing some color management code, and I am dealing with LUTs (lookup tables).
I can read the color profile LUT and convert my values... but how can I do the inverse operation? Is there a good algorithm to generate the 'inverse' of a LUT?
If your LUT is a given, the simplest method is to find the closest entry to any given color value. You can accelerate this computation by a variety of methods; for example, you can build a k-d tree out of your LUT entries and use it to eliminate most of the comparisons an exhaustive check would require.
However, this will tend to result in a "posterized" image, since smooth areas in your image will shift abruptly from one entry to the next. You can avoid this by taking your pixels in (quasi-)random order, picking the best fit from your LUT, and pushing the difference between the pixel value and the chosen entry back onto the nearby pixels which haven't already been chosen.
There are a variety of ways to do this last step, but they all result in a dithering effect that generally makes better use (for imaging purposes) of the available LUT entries than the simple, per-pixel operation can.
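A minimal sketch of the closest-entry approach with the simplest possible error diffusion, assuming the LUT is a flat list of RGB entries and processing pixels in scan order rather than the quasi-random order suggested above, for brevity; the names Rgb, findNearest, and applyInverseLut are illustrative, and a k-d tree query can replace the exhaustive search:

#include <cstddef>
#include <limits>
#include <vector>

struct Rgb { float r, g, b; };

// Exhaustive nearest-entry search by squared distance; swap in a k-d tree
// query here to avoid comparing against every LUT entry.
std::size_t findNearest(const std::vector<Rgb>& lut, const Rgb& c) {
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < lut.size(); ++i) {
        float dr = lut[i].r - c.r, dg = lut[i].g - c.g, db = lut[i].b - c.b;
        float d = dr * dr + dg * dg + db * db;
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}

// Map each pixel to its closest LUT entry, pushing the quantization error
// onto the next pixel in scan order (a very simple form of error diffusion).
void applyInverseLut(std::vector<Rgb>& pixels, const std::vector<Rgb>& lut) {
    Rgb err{0, 0, 0};
    for (Rgb& p : pixels) {
        Rgb target{p.r + err.r, p.g + err.g, p.b + err.b};
        const Rgb& chosen = lut[findNearest(lut, target)];
        err = {target.r - chosen.r, target.g - chosen.g, target.b - chosen.b};
        p = chosen;
    }
}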
Yes, you can usually invert a lookup table efficiently (in linear time), assuming that the function is a bijection. If your lookup table maps two different keys to the same value, then there is no direct way to invert the table, because you would end up with a value that has to map to two different keys. If you're okay with this, that's fine, though it may call into question why you're trying to build the reverse map.
If you know that every value is unique, you can build an inverse lookup table as follows. First, create a data structure to hold the mapping from values to keys - perhaps a hash table, or a balanced binary tree, or a raw array if the values are small integers. Next, iterate over each key/value pair from the lookup table, then insert the mapping value → key into the new lookup table. This can be done in linear time plus the time required to insert the values into the new container.
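A minimal sketch in C++, assuming the forward table is a std::unordered_map and that the value type is itself hashable; the duplicate-value check mirrors the bijection caveat above:

#include <optional>
#include <unordered_map>

// Build the inverse of `forward` in linear time.
// Returns std::nullopt if two keys map to the same value (not a bijection).
template <typename K, typename V>
std::optional<std::unordered_map<V, K>> invert(const std::unordered_map<K, V>& forward) {
    std::unordered_map<V, K> inverse;
    inverse.reserve(forward.size());
    for (const auto& [key, value] : forward) {
        // emplace() fails if this value was already seen, i.e. the map is not invertible.
        if (!inverse.emplace(value, key).second) {
            return std::nullopt;
        }
    }
    return inverse;
}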
I am looking for suggestions in improving the query time access for unordered maps. My code essentially just consists of 2 steps. In the first step, I populate the unordered map. After the first step, no more entries are ever added to the map. In the second step, the unordered map is only queried. Since the map is essentially unchanging, is there something that can be done to speed up the query time?
For instance, does the STL provide any function that can adjust the internal allocations in the map to improve query time? In other words, it is possible that more than one key is mapped to the same bucket in the unordered map. If more memory were allocated to the map, the chances of such a collision occurring would be reduced. In that sense, I am curious whether there is anything that can be done, knowing that the unordered map will remain unchanged.
If measurements show this is important for you, then I'd suggest taking measurements for other hash table implementations outside the Standard Library, e.g. Google's. Using closed hashing (aka open addressing) may well work better for you, especially if your hash table entries are small enough to store directly in the hash table buckets.
More generally, Marshall suggests finding a good hash function. Be careful though - sometimes a generally "bad" hash function performs better than a "good" one, if it meshes nicely with some of the properties of your keys. For example, if you tend to have incrementing numbers, perhaps with a few gaps, then an identity (aka trivial) hash function that just returns the key can select hash buckets with far fewer collisions than a cryptographic hash that pseudo-randomly (but repeatably) scatters keys differing in as little as a single bit into uncorrelated buckets. Identity hashing can also help if you're looking up several nearby key values, as their buckets are probably nearby too and you'll get better cache utilisation. But you've told us nothing about your keys, values, number of entries etc., so I'll leave the rest with you.
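To illustrate the identity-hash idea, here is a minimal sketch of plugging a trivial hash into std::unordered_map; it assumes roughly sequential 64-bit integer keys, and the IdentityHash name is illustrative:

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Trivial hash: the key is its own hash value. Works well for roughly
// sequential integer keys; a poor choice if keys share their low-order bits.
struct IdentityHash {
    std::size_t operator()(std::uint64_t key) const noexcept {
        return static_cast<std::size_t>(key);
    }
};

// Drop-in replacement for the default std::hash<std::uint64_t>.
std::unordered_map<std::uint64_t, std::string, IdentityHash> table;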
You have two knobs that you can twist: the hash function and the number of buckets in the map. One is fixed at compile time (the hash function), and the other you can modify (somewhat) at run time.
A good hash function will give you very few collisions (non-equal keys that have the same hash value). If you have many collisions, then there's not really much you can do to improve your lookup times. The worst case (all inputs hash to the same value) gives you O(N) lookup times. So that's where you want to focus your effort.
Once you have a good hash function, then you can play games with the number of buckets (via rehash) which can reduce collisions further.
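For instance, with std::unordered_map you can lower the maximum load factor and rehash once after the map has been fully populated; a minimal sketch under those assumptions (the function name tuneForLookups is illustrative):

#include <cstddef>
#include <string>
#include <unordered_map>

void tuneForLookups(std::unordered_map<int, std::string>& map) {
    // A lower maximum load factor means more buckets per element,
    // so fewer collisions at the cost of extra memory.
    map.max_load_factor(0.5f);
    // Rehash so the table respects the new load factor immediately;
    // after this, lookups scan fewer colliding entries on average.
    map.rehash(static_cast<std::size_t>(map.size() / map.max_load_factor()) + 1);
}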
Hash maps are usually implemented with an internal array (table) of buckets. When accessing a hash map by key, we get the key's hash code using a key-type-specific hash function. Then we need to map the hash code to an actual index in the internal bucket table.
key -> (hash function) -> hashcode -> (???) -> index in internal table
Sometimes the internal table can shrink or expand, depending on the hash map's fill ratio. Then the hashcode -> index conversion method could probably change a bit.
For example, our hash function returns a 32-bit unsigned integer value and:
moment A: the internal table has capacity 10000
moment B: the internal table has capacity 100000
What algorithms or approaches are usually used to perform the hashcode -> internal table index conversion? How is the table-resizing issue solved for them?
Usually, a simple modulo will do the job.
To take a quick example from Wikipedia, it's as simple as that:
hash = hashfunc(key)
index = hash % array_size
As you said, the resizing happens depending on the hash map's fill ratio. The array is reallocated (see realloc()), then the indices are recalculated for the new array size, and the values are copied to their new indices.
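A sketch of what that recalculation looks like for a chained table stored as a plain vector of buckets; the names Entry, Bucket, and rehashAll are illustrative, not taken from any particular library:

#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

using Entry  = std::pair<std::string, int>;
using Bucket = std::list<Entry>;

// Move every entry into a larger bucket array. Each index is simply
// hash % new size, so most entries land in a different bucket.
void rehashAll(std::vector<Bucket>& buckets, std::size_t newSize) {
    std::vector<Bucket> bigger(newSize);
    for (Bucket& bucket : buckets) {
        for (Entry& e : bucket) {
            std::size_t index = std::hash<std::string>{}(e.first) % newSize;
            bigger[index].push_back(std::move(e));
        }
    }
    buckets.swap(bigger);
}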
I wrote about this here and here.
When you increase the size of your vector of indices you can be sure that the algorithm that worked well on the shorter vector will work less well on the longer one. It is possible to test beforehand and have new algorithms ready to put in place when you make the vector longer. Or, as the number of occupied indices in the current vector increases, have a background, lower-priority thread that tests different algorithms on the data.
As the example in one of my answers shows, a "new algorithm" need be nothing more than a different pair of matched prime numbers.
Relevant link: http://en.wikipedia.org/wiki/Hopscotch_hashing
Hopscotch hash tables seem great, but I haven't found an answer to this question in the literature: what happens if my neighborhood size is N and (due to malfeasance or extremely bad luck) I insert N+1 elements which all hash to the same exact value?
In the original article it is written that the table needs to be resized:
Finally, notice that if more than a constant number of items are hashed by h into a given bucket, the table needs to be resized. Luckily, as we show, for a universal hash function h, the probability of this type of resize happening given H = 32 is 1/32!.
There are two cases where we need to resize a hopscotch hash table:
you have H collisions for the given bucket
the load factor is really too big to find a free bucket. In practice, you should set an upper limit on the search for a free bucket.
Given a universal hash function, you only have a 1/32! chance of hitting case #1; in other words, if you continuously insert 2^35 elements, then you have one chance of resizing due to collisions.
Case #2 is the more common reason to resize in practice. You could refer to some quadratic-probing implementations for how they decide to resize (the C# hashmap and Google's sparse hashmap); there is no real implementation for linear probing due to its clustering drawback, i.e. it can't guarantee constant-time lookup.
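To make the two resize triggers concrete, here is a hypothetical helper in the spirit of that description; the names and parameters are illustrative and not taken from any real hopscotch implementation:

// Decide whether a hopscotch insert must fall back to resizing the table.
// H is the neighborhood size (e.g. 32); probeLimit bounds the linear search
// for a free slot before we give up (case #2).
bool mustResize(unsigned occupiedInNeighborhood, unsigned H,
                unsigned probesWithoutFreeSlot, unsigned probeLimit) {
    bool neighborhoodFull = occupiedInNeighborhood >= H;          // case #1
    bool probeLimitHit    = probesWithoutFreeSlot >= probeLimit;  // case #2
    return neighborhoodFull || probeLimitHit;
}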
I want to make an infinite tiled map, from (-max_int, -max_int) to (max_int, max_int), so I'm going to make a basic structure: a chunk. Each chunk contains char tiles[w][h] and also its int x, y coordinates. For example, with h = w = 10, tile(15,5) is in chunk(1,0) at (5,5), and tile(-25,-17) is in chunk(-3,-2) at (5,3), and so on. Now there can be any number of chunks, and I need to store them and access them easily in O(log n) or better (O(1) if possible... but it's not). It should be easy to add, find, and maybe remove them. So what data structure should I use?
Read up on k-d trees or quadtrees (the 2D variant of an octree). Both of these might be a big help here.
So all your space is split into chunks (rectangular clusters). The general problem, then, is storing data in a sparse matrix (since the clustering is already handled). Why not use two-level dictionary-like containers? I.e. an rb-tree keyed by row index whose values are rb-trees keyed by column index. Or, if you are lucky, you can use hashes to get your O(1). In both cases, if you can't find a row, you allocate it in the container and create a new container as its value, initially holding only a single chunk. Of course, allocating a new chunk on an existing row will be a bit faster than on a new one, and I guess that's the only issue with this approach.
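A minimal sketch of the hash-based variant, assuming C++ and a single std::unordered_map keyed by the chunk coordinates (ChunkKey, ChunkKeyHash, and tileAt are illustrative names). Note the floor division, so that negative tile coordinates land in the right chunk, matching the tile(-25,-17) -> chunk(-3,-2) at (5,3) example from the question:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

constexpr int W = 10, H = 10;

struct Chunk { char tiles[W][H] = {}; };

struct ChunkKey {
    int x, y;
    bool operator==(const ChunkKey& o) const { return x == o.x && y == o.y; }
};

struct ChunkKeyHash {
    std::size_t operator()(const ChunkKey& k) const {
        // Pack the two 32-bit coordinates into one 64-bit value and hash that.
        std::uint64_t packed =
            (std::uint64_t(std::uint32_t(k.x)) << 32) | std::uint32_t(k.y);
        return std::hash<std::uint64_t>{}(packed);
    }
};

// Floor division, so e.g. -25 / 10 becomes -3 instead of -2.
int floorDiv(int a, int b) { return (a >= 0) ? a / b : -((-a + b - 1) / b); }

std::unordered_map<ChunkKey, Chunk, ChunkKeyHash> world;

char& tileAt(int tx, int ty) {
    ChunkKey key{floorDiv(tx, W), floorDiv(ty, H)};
    Chunk& c = world[key];                        // creates the chunk on first access
    return c.tiles[tx - key.x * W][ty - key.y * H];
}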
Can somebody explain the main differences (advantages / disadvantages) between the two implementations?
For a library, what implementation is recommended?
Wikipedia's article on hash tables gives a distinctly better explanation and overview of different hash table schemes that people have used than I'm able to off the top of my head. In fact you're probably better off reading that article than asking the question here. :)
That said...
A chained hash table indexes into an array of pointers to the heads of linked lists. Each linked list cell has the key for which it was allocated and the value which was inserted for that key. When you want to look up a particular element from its key, the key's hash is used to work out which linked list to follow, and then that particular list is traversed to find the element that you're after. If more than one key in the hash table has the same hash, then you'll have linked lists with more than one element.
The downside of chained hashing is having to follow pointers in order to search linked lists. The upside is that chained hash tables only get linearly slower as the load factor (the ratio of elements in the hash table to the length of the bucket array) increases, even if it rises above 1.
An open-addressing hash table indexes into a single array that holds the (key, value) pairs directly. You use the key's hash value to work out which slot in the array to look at first. If more than one key in the hash table has the same hash, then you use some scheme to decide on another slot to look in instead. For example, linear probing is where you look at the next slot after the one chosen, and then the next slot after that, and so on until you either find a slot that matches the key you're looking for, or you hit an empty slot (in which case the key must not be there).
Open-addressing is usually faster than chained hashing when the load factor is low because you don't have to follow pointers between list nodes. It gets very, very slow if the load factor approaches 1, because you end up usually having to search through many of the slots in the bucket array before you find either the key that you were looking for or an empty slot. Also, you can never have more elements in the hash table than there are entries in the bucket array.
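A sketch of the linear-probing lookup just described, assuming the slots are stored inline in the bucket array and an occupied flag marks used slots; all names here are illustrative:

#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

struct Slot {
    bool occupied = false;
    std::string key;
    int value = 0;
};

// Look up key with linear probing: start at hash % size and walk forward
// until we find the key or hit an empty slot (meaning the key is absent).
std::optional<int> find(const std::vector<Slot>& table, const std::string& key) {
    if (table.empty()) return std::nullopt;
    const std::size_t n = table.size();
    const std::size_t start = std::hash<std::string>{}(key) % n;
    for (std::size_t probes = 0; probes < n; ++probes) {
        const Slot& s = table[(start + probes) % n];
        if (!s.occupied) return std::nullopt;      // empty slot: key not present
        if (s.key == key) return s.value;          // found it
    }
    return std::nullopt;                           // table is completely full
}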
To deal with the fact that all hash tables at least get slower (and in some cases actually break completely) when their load factor approaches 1, practical hash table implementations make the bucket array larger (by allocating a new bucket array, and copying elements from the old one into the new one, then freeing the old one) when the load factor gets above a certain value (typically about 0.7).
There are lots of variations on all of the above. Again, please see the wikipedia article, it really is quite good.
For a library that is meant to be used by other people, I would strongly recommend experimenting. Since they're generally quite performance-crucial, you're usually best off using somebody else's implementation of a hash table which has already been carefully tuned. There are lots of open-source BSD, LGPL and GPL licensed hash table implementations.
If you're working with GTK, for example, then you'll find that there's a good hash table in GLib.
My understanding (in simple terms) is that both methods have pros and cons, though most libraries use the chaining strategy.
Chaining Method:
Here each slot of the hash table's array points to a linked list of items. This is efficient if the number of collisions is fairly small. The worst-case scenario is O(n), where n is the number of elements in the table.
Open Addressing with Linear Probe:
Here, when a collision occurs, we move on to the next index until we find an open slot. So, if the number of collisions is low, this is very fast and space-efficient. The limitation here is that the total number of entries in the table is limited by the size of the array. This is not the case with chaining.
There is another approach: chaining with binary search trees. In this approach, when collisions occur, the colliding items are stored in a binary search tree instead of a linked list. Hence, the worst-case scenario here is O(log n). In practice, this approach is best suited when there is an extremely nonuniform distribution.
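A tiny sketch of that tree-bucket variant, using std::map (typically a red-black tree) as the per-bucket container; the function name findInTreeChained is illustrative:

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Each bucket is a balanced BST, so a bucket holding k colliding keys
// answers a lookup in O(log k) instead of O(k).
using TreeBucket = std::map<std::string, int>;

int* findInTreeChained(std::vector<TreeBucket>& buckets, const std::string& key) {
    if (buckets.empty()) return nullptr;
    std::size_t index = std::hash<std::string>{}(key) % buckets.size();
    auto it = buckets[index].find(key);
    return it == buckets[index].end() ? nullptr : &it->second;
}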
Since an excellent explanation has already been given, I'd just add the visualizations from CLRS for further illustration:
[CLRS figure: open addressing]
[CLRS figure: chaining]
Open addressing vs. separate chaining
Linear probing, double and random hashing are appropriate if the keys are kept as entries in the hashtable itself...
doing that is called "open addressing"
it is also called "closed hashing"
Another idea: Entries in the hashtable are just pointers to the head of a linked list (“chain”); elements of the linked list contain the keys...
this is called "separate chaining"
it is also called "open hashing"
Collision resolution becomes easy with separate chaining: just insert a key in its linked list if it is not already there
(It is possible to use fancier data structures than linked lists for this; but linked lists work very well in the average case, as we will see)
Let’s look at analyzing time costs of these strategies
Source: http://cseweb.ucsd.edu/~kube/cls/100/Lectures/lec16/lec16-25.html
If the number of items that will be inserted in a hash table isn't known when the table is created, a chained hash table is preferable to open addressing.
Increasing the load factor (number of items / table size) causes major performance penalties in open-addressed hash tables, but performance degrades only linearly in chained hash tables.
If you are dealing with low memory and want to reduce memory usage, go for open addressing. If you are not worried about memory and want speed, go for chained hash tables.
When in doubt, use chained hash tables. Adding more data than you anticipated won’t cause performance to slow to a crawl.