Efficient mapping from 2^24 values to a 2^7 index - algorithm

I have a data structure that stores, among other things, a 24-bit-wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index into that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search for which one of the 2^7 values fits best. Obviously, this takes some time for a large number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation about the 16 MB array: on average, 31 consecutive values are the same. Unfortunately, there are also many places where consecutive values differ.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?

Hard to say without knowing what the definition of "best fit" is. Perhaps a k-d tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem an image processor has when reducing an image to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and they might help.
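Not a k-d tree, but as a baseline, here is a minimal Python sketch of the "test only the 2^7 candidates" conversion, with a cache so each distinct input value is searched at most once (the palette contents and the distance metric are made-up placeholders; substitute whatever "best fit" means in your case):

# Hypothetical sketch: brute-force best-fit over 128 candidates, memoized.
palette = [i * 131072 for i in range(128)]  # stand-in for the 2^7 important values

def distance(a, b):
    return abs(a - b)  # placeholder for the real fitness metric

cache = {}  # maps a 24-bit value to its best 7-bit index, filled lazily

def to_index(value24):
    if value24 not in cache:
        # a linear scan over only 128 candidates, done once per distinct input
        cache[value24] = min(range(128), key=lambda i: distance(palette[i], value24))
    return cache[value24]

The cache trades memory for speed; in the worst case it grows toward the full 16 MB table, so it may need bounding.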

As an idea...
Up the index table to 8 bits, then XOR all 3 bytes of the 24-bit word into it.
Then your table would consist of this 8-bit hash value, plus the index back to the original 24-bit value.
Since your data is RGB-like, a more sophisticated hashing method may be needed.
bit24var & 0xff gives you the rightmost byte.
(bit24var >> 8) & 0xff gives you the byte beside it.
(bit24var >> 16) & 0xff gives you the byte beside that.
Yes, you are thinking correctly: it is quite likely that two or more of the 24-bit values will hash to the same index, due to the pigeonhole principle.
One method of resolving a hash clash is to use some sort of chaining.
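A rough Python sketch of the hash-plus-chaining idea (the palette here is a stand-in for the 2^7 important values; note that this finds exact members only, so a best-fit fallback would still be needed for values not in the table):

palette = [0x000003, 0x011170, 0x12D687]  # stand-in for the 2^7 important values

def byte_xor_hash(v):
    # fold the three bytes of the 24-bit value into 8 bits
    return (v ^ (v >> 8) ^ (v >> 16)) & 0xFF

# a 256-slot table of chains; each slot holds (value, index) pairs
table = [[] for _ in range(256)]
for index, value in enumerate(palette):
    table[byte_xor_hash(value)].append((value, index))

def lookup(value24):
    # resolve clashes by scanning the (short) chain in this slot
    for value, index in table[byte_xor_hash(value24)]:
        if value == value24:
            return index
    return None  # not one of the important values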

Another idea would be to put your important values in a separate array, then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.

How many of the 2^24 values do you actually have? Can you sort these values and count them by counting the runs of consecutive values?

Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply filter incoming data and assign a value, from 0 up to 2^7-1, to each of these values as we encounter them. Of course, we need some way of keeping track of which of the important values have already been seen and assigned a label in [0,2^7). For that we can use some sort of tree- or hashtable-based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random

def make_mapping(data, important):
    mapping = dict()  # dictionary to hold the final mapping
    next_index = 0    # the next free label that can be assigned to an incoming value
    for elem in data:
        if elem in important:        # check that the element is important
            if elem not in mapping:  # check that this element hasn't been assigned a label yet
                mapping[elem] = next_index
                next_index += 1      # this label is assigned; the next new important value gets the next label
    return mapping

if __name__ == '__main__':
    important_values = [1, 5, 200000, 6, 24, 33]
    data = list(range(0, 300000))
    random.shuffle(data)
    answer = make_mapping(data, important_values)
    print(answer)
You can make the search much faster by using a hash- or tree-based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if it is a hashtable), where n is the size of the input and k is the number of important values.

Another idea is to represent the 24BitValue array in a bitmap. An unsigned char can hold 8 bits, so one would need 2^21 array elements. That's 2,097,152 bytes (2 MB). If the corresponding bit is set, then you know that that specific 24BitValue is present in the array, and needs to be checked.
One would need an iterator to walk through the array and find the next set bit. Some machines actually provide a "find first set bit" operation in their instruction set.
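A minimal sketch of the bitmap in Python (2^24 bits packed into a 2 MB bytearray):

present = bytearray(1 << 21)  # 2^24 bits, one per possible 24-bit value

def mark(v):
    present[v >> 3] |= 1 << (v & 7)

def is_present(v):
    return present[v >> 3] & (1 << (v & 7)) != 0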
Good luck on your quest.
Let us know how things turn out.
Evil.

Related

Search data from a data set without reading each element

I have just started learning algorithms and data structures, and I came across an interesting problem.
I need some help in solving it.
There is a data set given to me. Within the data set are characters, each with a number associated with it. I have to evaluate the sum of the largest numbers associated with each of the characters present. The list is not sorted by character; however, all occurrences of a character appear as one contiguous group, with no further instance of that character elsewhere in the data set.
Moreover, the largest number associated with each character always appears at the last position of that character's group. We know the length of the entire data set, and we can retrieve an entry by specifying its line number.
For example:
C-7
C-9
C-12
D-1
D-8
A-3
M-67
M-78
M-90
M-91
M-92
K-4
K-7
K-10
L-13
length=15
get(3) = D-1 (returns an object with character D and value 1)
The answer for the above should be 13+10+92+3+8+12 = 138, as those are the highest numbers associated with L, K, M, A, D, and C respectively.
The simplest solution is, of course, to go through all of the elements, but what is the most efficient algorithm (one that reads fewer elements than the length of the data set)?
You'll have to go through them one by one, since you can't be certain what the next key is.
Just for the sake of easy manipulation, I would loop over the data set and check whether the key at index i is equal to the key at index i+1; if it's not, you have a local max.
Then store that value in a hash or dictionary if there isn't already a key:value pair for that key; if there is, check whether the existing value is less than the current value, and overwrite it if so.
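A minimal sketch of that single pass in Python, using the example data from the question (names are illustrative):

def sum_of_group_maxima(pairs):
    # pairs is an iterable of (character, number) records in file order;
    # keeping a running max per key also handles the group boundaries
    best = {}
    for key, value in pairs:
        if key not in best or value > best[key]:
            best[key] = value
    return sum(best.values())

data = [('C', 7), ('C', 9), ('C', 12), ('D', 1), ('D', 8), ('A', 3),
        ('M', 67), ('M', 78), ('M', 90), ('M', 91), ('M', 92),
        ('K', 4), ('K', 7), ('K', 10), ('L', 13)]
print(sum_of_group_maxima(data))  # 12 + 8 + 3 + 92 + 10 + 13 = 138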
While you could use statistics to optimistically skip some entries (say you read A 1, skip 5 entries, and read A 10: good; you skip 5 more and see B 3, so you have to go back and read what is in between), in reality it won't work. Not on text.
Because I/O happens in blocks. Data is stored in chunks of usually around 8 KB, so that is the minimum read size (even if your programming language provides other-sized reads, they will eventually be translated into reading and buffering whole blocks).
And how do you find the next line? You read until you find a \n...
So you don't save anything on this kind of data. It would be different if you had much larger records (several KB, like files) and an index; but building that index would require reading everything at least once.
So, as presented, the fastest approach is likely a single linear scan of the entire data.

What algorithm can save one bit of storage space for each arbitrary 32-bit number in a LUT?

A lookup table has a total of 4G entries; each entry is an arbitrary 32-bit number, and they never repeat.
Is there any algorithm that can use the index of each entry together with its value (the 32-bit number) to make a fixed bit position of the value always zero (so I can use that bit as a flag to log something)? I would retrieve the 32-bit number by reversing the calculation.
Or, stepping back: can I make a fixed bit position of every two consecutive entries always zero?
My question is whether there is any universal code that can save 1 bit for each arbitrary 32-bit number, so I can use that bit as a lock flag. Alternatively, is there a way to leverage the index and value of a lookup-table entry, via some calculation, to save 1 bit of storage per value?
It is not at all clear what you are asking. However, I can perhaps find one thing in there that can be addressed, if I am reading it correctly, which is that you have a permutation of all of the integers in 0..2^32-1. Such a permutation can be represented in fewer bits than the direct representation, which takes 32*2^32 bits. A perfect representation of the permutations would take ceiling(log2(2^32!)) bits each, since there are 2^32! possible permutations. That length turns out to be about 95.5% of the bits in the direct representation. So each permutation could be represented in about 30.6*2^32 bits, effectively taking off more than one bit per word.
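For the curious, the figure follows from Stirling's approximation: log2(2^32!) ≈ 2^32 * log2(2^32) - 2^32 * log2(e) = 2^32 * (32 - 1.44) ≈ 30.56 * 2^32 bits, and 30.56/32 ≈ 0.955, which is the 95.5% above.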

What data structure should I use where the key falls within a range?

TL;DR: What data structure should I use for looking up key-value pairs where the key needs to fall within a range?
I'm looking for something like a Dictionary but with a twist.
I have a HexEditor with lines, say 8 bytes per line (this can and does differ though).
Any byte within the memblock displayed by the hexeditor can have a comment.
One or zero Comments are associated with one byte-address.
Obviously a range of bytes can have multiple comments, and if so, all of them will be displayed on the line.
I thought about storing the comments in a Dictionary<Int, String>; however, that will not work, because I need to look up whether a comment falls within a range, and a Dictionary only matches exact keys.
The range can change dynamically so I can't link to that either.
It is possible to change the number of bytes per line on the fly and I don't want to have to reconstitute the data store/recalculate all my hashes, so using a dictionary with start-end values as the key is out.
I don't want to do a query to the Dict for every byte in a line.
I suspect the answer is "binary tree" but I'm hoping for something a bit more O(1)ish.
Beware of O(1) when there is a high constant factor involved, as is the case for hashed dictionaries: the cost of hashing is never negligible.
Binary search (as in a binary tree, or over an ordered list) is only O(log n), and log is a function that grows very slowly.
When looking up an integer key, odds are you can perform a score of comparisons in the time it takes to compute a single hash, and a score of comparisons is enough to binary-search among a million elements.
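For what it's worth, here is a minimal sketch of the ordered-lookup approach in Python (one binary search per line rather than one dictionary probe per byte; the comment addresses are illustrative):

import bisect

comments = {0x10: "header starts", 0x24: "checksum", 0x25: "flags"}
addresses = sorted(comments)  # kept sorted as comments are added/removed

def comments_in_line(line_start, bytes_per_line):
    # binary-search for the first commented address >= line_start,
    # then take every address before the start of the next line
    lo = bisect.bisect_left(addresses, line_start)
    hi = bisect.bisect_left(addresses, line_start + bytes_per_line)
    return [(a, comments[a]) for a in addresses[lo:hi]]

print(comments_in_line(0x20, 8))  # the comments at 0x24 and 0x25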

Hashing - What Does It Do?

So I've been reading up on Hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain Hashing to me the best way they understand it?
Sorry for the vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start, and if anyone knows any helpful ways to understand it, that would be helpful also.
Hashing is a fast heuristic for finding an object's equivalence class.
In smaller words:
Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time
An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class
If I want to know whether "042" and "84/2" are probably equivalent, I can compute hash codes for each (a cheap operation), and only if the hash codes are equal do I try the more expensive check. If I want to divide representations of numbers into buckets, so that representations of the same number land in the same bucket, I can choose the bucket by hash code.
Hashing is a heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code, though usually they don't; but two objects in the same equivalence class must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science
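In code, the cheap-check-first pattern looks something like this sketch (Record and its fields are illustrative; the point is that the cached hash almost always rejects unequal objects before the expensive comparison runs):

class Record:
    def __init__(self, fields):
        self.fields = fields
        self._hash = hash(tuple(fields))  # computed once, then reused

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if self._hash != other._hash:  # cheap: definitely not equivalent
            return False
        return self.fields == other.fields  # expensive check, rarely reached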
Hashing is summarizing.
The hash of the sequence of numbers (2,3,4,5,6) is a summary of those numbers. 20, for example, is one kind of summary; it doesn't preserve all the available bits in the original data very well. It isn't a very good summary, but it's a summary.
When the value involves more than a few bytes of data, some bits must get discarded. If you use sum-and-mod (to keep the sum under 2 billion, for example), you tend to keep a lot of the rightmost bits and lose all the leftmost bits.
So a good hash is fair: it keeps and discards bits equitably. That tends to prevent collisions.
Our simplistic "sum hash", for example, will collide with any other sequence of numbers that happens to have the same sum.
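To make that concrete, here are two different sequences our toy sum hash cannot tell apart:

def sum_hash(seq):
    return sum(seq)  # deliberately poor: order and grouping are lost

print(sum_hash((2, 3, 4, 5, 6)))  # 20
print(sum_hash((10, 10)))         # also 20: a collision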
First, we should describe the problem that a hashing algorithm solves.
Suppose you have some data (maybe an array, or a tree, or database entries). You want to find a concrete element in this data store (for example, in the array) as fast as possible. How do you do it?
When you build this data store, you can calculate a special value (called the hash value) for every item you put in. The way to calculate this value may differ, but all methods should ideally satisfy one condition: the calculated value should be unique for every item.
So, now you have an array of items, and for every item you have its hash value. How do you use it? Consider an array of N elements. Let's put your items into this array according to their hash values.
Suppose you have to answer this question: does the item "it1" exist in this array? To answer it, you can simply compute the hash value for "it1" (let's call it f("it1")) and look at the array at position f("it1"). If the element at this position is not null (and equals our item "it1"), our answer is true. Otherwise, the answer is false.
There is also the collision problem: how do we find a function that gives unique hash values for all different elements? Actually, such a function doesn't exist in general, but there are a lot of good functions that give well-spread values.
An example for better understanding:
Suppose you have an array of strings: A = {"aaa","bgb","eccc","dddsp",...}, and you have to answer the question: does this array contain the string S?
First, we choose a function for calculating hash values. Let's take the function f which, for a given string, returns the length of that string (actually, it's a very bad hash function, but I took it for easy understanding).
So, f("aaa") = 3, f("qwerty") = 6, and so on...
Now we calculate the hash values of every element in array A: f("aaa") = 3, f("eccc") = 4, ...
Let's take an array to hold these items (it is also called a hash table); call it H (an array of strings). Now we put our elements into this array according to their hash values:
H[3] = "aaa", H[4] = "eccc", ...
And finally, how do we find a given string in this array?
Suppose you are given the string s = "eccc". f("eccc") = 4. So, if H[4] == "eccc", our answer is true; otherwise it is false.
But how do we avoid the situation where two elements have the same hash value? There are many ways; one of them: each element of the hash table holds a list of items, so H[4] contains all items whose hash value equals 4. How do we then find a concrete element? Easy: calculate the hash value for the item and look at the list in HashTable[HashValue]. If one of those items equals the element we're searching for, the answer is true; otherwise it is false.
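The example above, written out as a short Python sketch (string length as the deliberately bad hash function, chaining for collisions):

def f(s):
    return len(s)  # the toy hash function from the example

A = ["aaa", "bgb", "eccc", "dddsp"]

# H maps each hash value to the chain of items that share it
H = {}
for item in A:
    H.setdefault(f(item), []).append(item)

def contains(s):
    # compute the hash, then scan only the short chain in that slot
    return s in H.get(f(s), [])

print(contains("eccc"))  # True
print(contains("zzzz"))  # False: same length as "eccc", but not in its chain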
You take some data and deterministically, one-way, calculate some fixed-length data from it that changes completely when the input changes even a little bit.
A hash function applied to some data generates some new data.
It is always the same for the same data.
That's about it.
Another constraint that is often put on it, which I think is not strictly part of the definition, is that you cannot reconstruct the original data from the hash.
For me this is its own category, called cryptographic or one-way hashing.
There are a lot of demands on certain kinds of hash functions,
for example that the hash is always the same length,
or that hashes are distributed randomly for any given sequence of input data.
The only essential property is that it's deterministic (always the same hash for the same data).
So you can use it, for example, to verify data integrity, validate passwords, etc.
Read all about it here:
http://en.wikipedia.org/wiki/Hash_function
You should read the wikipedia article first. Then come with questions on the topics you don't understand.
To put it short, quoting the article, to hash means:
to chop and mix
That is, given a value, you get another (usually shorter) value from it (chop), but the obtained value should change even if only a small part of the original value changes (mix).
Let's take x % 9 as an example hashing function.
345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5
You can see that this hashing method takes into account all parts of the input and changes if any of the digits changes. That makes it a good hashing function.
On the other hand, if we take x % 10, we get:
345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5
As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.
Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.
I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.
Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:
if (inputstring == thefirststring) return 0;
if (inputstring == thesecondstring) return 1;
if (inputstring == thethirdstring) return 2;
... and so on, up to the 193,372nd string)
in many cases, a 'perfect' function would take so much effort to compute that it wouldn't be worth the effort.
What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:
Linear probing -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search in the indicated slot, then the next one, then the next, etc. until the item is found or an empty slot is hit. Linear probing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more than that will contain only two, etc. so no slot will hold very many items.
When dealing with a fixed data set (e.g. a compiler's set of keywords), linear probing is often good; in cases where it works badly, one can tweak the hash function so it works well. When dealing with an unknown data set, chain-bucket hashing is often the best approach. The overhead of dealing with the extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.
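As a sketch, the linear-probing variant might look like this in Python (a fixed-size table kept sparse, as recommended above; deletion is omitted because of the displacement problem already noted):

SIZE = 257  # keep the table much larger than the number of items

slots = [None] * SIZE  # each slot holds a (key, value) pair or None

def probe_insert(key, value):
    i = hash(key) % SIZE
    while slots[i] is not None and slots[i][0] != key:
        i = (i + 1) % SIZE  # step to the next slot on a collision
    slots[i] = (key, value)

def probe_lookup(key):
    i = hash(key) % SIZE
    while slots[i] is not None:
        if slots[i][0] == key:
            return slots[i][1]
        i = (i + 1) % SIZE  # keep scanning until an empty slot
    return None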

How to test a hash function?

Is there a way to test the quality of a hash function? I want a good spread when it is used in a hash table, and it would be great if this were verifiable in a unit test.
EDIT: For clarification, my problem was that I have used long values in Java in such a way that the first 32 bits encode one ID and the second 32 bits encode another ID. Unfortunately, Java's hash of long values just XORs the first 32 bits with the second 32 bits, which in my case led to very poor performance when used in a HashMap. So I need a different hash, and would like to have a unit test so that this problem cannot creep in any more.
You have to test your hash function using data drawn from the same (or similar) distribution that you expect it to work on. When looking at hash functions on 64-bit longs, the default Java hash function is excellent if the input values are drawn uniformly from all possible long values.
However, you've mentioned that your application uses the long to store essentially two independent 32-bit values. Try to generate a sample of values similar to the ones you expect to actually use, and then test with that.
For the test itself, take your sample input values, hash each one and put the results into a set. Count the size of the resulting set and compare it to the size of the input set, and this will tell you the number of collisions your hash function is generating.
For your particular application, instead of simply XORing them together, try combining the two 32-bit values the way a typical good hash function would combine two independent ints: i.e. multiply by a prime and add.
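A sketch of such a test in Python (the correlated sample, where the two IDs are frequently equal, mimics the situation described in the question; the multiply-and-add combiner is the kind of fix suggested above):

import random

def xor_hash(hi, lo):
    return hi ^ lo  # what Java's Long.hashCode effectively does

def combine_hash(hi, lo):
    return (hi * 31 + lo) & 0xFFFFFFFF  # multiply by a prime, then add

# correlated sample: the two 32-bit IDs are often equal
sample = [(n, n if random.random() < 0.5 else random.getrandbits(32))
          for n in range(100000)]

for fn in (xor_hash, combine_hash):
    collisions = len(sample) - len({fn(hi, lo) for hi, lo in sample})
    print(fn.__name__, "collisions:", collisions)

On data like this, xor_hash collapses every (n, n) pair to zero, while combine_hash keeps them spread out.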
First, I think you have to define for yourself what you mean by a good spread. Do you mean a good spread for all possible input, or just a good spread for likely input?
For example, if you're hashing strings that represent proper full (first + last) names, you're not likely to care about how strings full of numeric ASCII characters hash.
As for testing, your best bet is probably to get a huge or random input set of the data you expect, push it through the hash function, and see how the spread ends up. There's not likely to be a magic program that can say "Yes, this is a good hash function for your use case." However, if you can programmatically generate the input data, you should easily be able to create a unit test that generates a significant amount of it and then verifies that the spread is within your definition of good.
Edit: In your case with a 64 bit long, is there even really a reason to use a hash map? Why not just use a balanced tree directly, and use the long as the key directly rather than rehashing it? You pay a little penalty in overall node size (2x the size for the key value), but may end up saving it in performance.
If you're using a chaining hash table, what you really care about is the number of collisions. This would be trivial to implement as a simple counter on your hash table: every time an item is inserted and the table has to chain, increment a chain counter. A better hashing algorithm will result in a lower number of collisions. A good general-purpose table hashing function to check out is djb2.
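For reference, djb2 is small enough to quote in full; here is a Python rendering, masked to 32 bits:

def djb2(data):
    h = 5381
    for byte in data.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF  # hash * 33 + c, kept to 32 bits
    return h

print(djb2("hello"))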
Based on your clarification:
I have used long values in Java in such a way that the first 32 bit encoded an ID and the second 32 bit encoded another ID. Unfortunately Java's hash of long values just XORs the first 32 bit with the second 32 bits, which in my case led to very poor performance when used in a HashMap.
it appears you have some unhappy "resonances" between the way you assign the two ID values and the sizes of your HashMap instances.
Are you explicitly sizing your maps, or using the defaults? A quick-and-dirty check seems to indicate that a HashMap<Long,String> starts with a 16-bucket structure and doubles on overflow. That would mean that only the low-order bits of the ID values actually participate in hash bucket selection. You could try using one of the constructors that takes an initial-size parameter, and create your maps with a prime initial size.
Alternatively, Dave L's suggestion of defining your own hashing of long keys would allow you to avoid the low-bit-dependency problem.
Another way to look at this is that you're using a primitive type (long) as a way to avoid defining a real class. I'd suggest looking at the benefits you could achieve by defining the business classes and then implementing hash-coding, equality, and other methods as appropriate on your own classes to manage this issue.
