Hashing - What Does It Do?

Hashing - What Does It Do? - algorithm

So I've been reading up on Hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain Hashing to me the best way they understand it?
Sorry for the vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start, and if anyone knows any helpful ways to understand it, that would be helpful also.

Hashing is a fast heuristic for finding an object's equivalence class.
In smaller words:
Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time
An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class
If I want to know whether "042" and "84/2" are probably equivalent, I can compute hashcodes for each (a cheap operation) and only if the hash codes are equal, then I try the more expensive check. If I want to divide representations of numbers into buckets, so that representations of the same number are in the buckets, I can choose bucket by hash code.
Hashing is heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated for by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code but usually don't, but two objects in the same equivalence class must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science

Hashing is summarizing.
The hash of the sequence of numbers (2,3,4,5,6) is a summary of those numbers. 20, for example, is one kind of summary that doesn't include all available bits in the original data very well. It isn't a very good summary, but it's a summary.
When the value involves more than a few bytes of data, some bits must get rejected. If you use sum and mod (to keep the sum under 2billion, for example) you tend to keep a lot of right-most bits and lose all the left-most bits.
So a good hash is fair -- it keeps and rejects bits equitably. That tends to prevent collisions.
Our simplistic "sum hash", for example, will have collisions between other sequences of numbers that also happen to have the same sum.

Firstly we should say about the problem to be solved with Hashing algorithm.
Suppose you have some data (maybe an array, or tree, or database entries). You want to find concrete element in this datastore (for example in array) as much as faster. How to do it?
When you are built this datastore, you can calculate for every item you put special value (it named HashValue). The way to calculate this value may be different. But all methods should satisfy special condition: calculated value should be unique for every item.
So, now you have an array of items and for every item you have this HashValue. How to use it? Consider you have an array of N elements. Let's put your items to this array according to their HashHalues.
Suppose, you are to answer for this question: Is the item "it1" exists in this array? To answer to it you can simply find the HashValue for "it1" (let's call it f("it1")) and look to the Array at the f("it1") position. If the element at this position is not null (and equals to our "it1" item), our answer is true. Otherwise answer is false.
Also there exist collisions problem: how to find such coolest function, which will give unique HashValues for all different elements. Actually, such function doesn't exist. There are a lot of good functions, which can give you good values.
Some example for better understanding:
Suppose, you have an array of Strings: A = {"aaa","bgb","eccc","dddsp",...}. And you are to answer for the question: does this array contain String S?
Firstle, we are to choose function for calculating HashValues. Let's take the function f, which has this meaning - for a given string it returns the length of this string (actually, it's very bad function. But I took it for easy understanding).
So, f("aaa") = 3, f("qwerty") = 6, and so on...
So now we are to calculate HashValues for every element in array A: f("aaa")=3, f("eccc")=4,...
Let's take an array for holding this items (it also named HashTable) - let's call it H (an array of strings). So, now we put our elements to this array according to their HashValues:
H[3] = "aaa", H[4] = "eccc",...
And finally, how to find given String in this array?
Suppose, you are given a String s = "eccc". f("eccc") = 4. So, if H[4] == "eccc", our answer will be true, otherwise it fill be false.
But how to avoid situations, when to elements has same HashValues? There are a lot of ways to it. One of this: each element in HashTable will contain a list of items. So, H[4] will contain all items, which HashValue equals to 4. And How to find concrete element? It's very easy: calculate fo this item HashValue and look to the list of items in HashTable[HashValue]. If one of this items equals to our searching element, answer is true, owherwise answer is false.

You take some data and deterministically, one-way calculate some fixed-length data from it that totally changes when you change the input a little bit.

a hash function applied to some data generates some new data.
it is always the same for the same data.
thats about it.
another constraint that is often put on it, which i think is not really true, is that the hash function requires that you cannot conclude to the original data from the hash.
for me this is an own category called cryptographic or one way hashing.
there are a lot of demands on certain kinds of hash f unctions
for example that the hash is always the same length.
or that hashes are distributet randomly for any given sequence of input data.
the only important point is that its deterministic (always the same hash for the same data).
so you can use it for eample verify data integrity, validate passwords, etc.
read all about it here
http://en.wikipedia.org/wiki/Hash_function

You should read the wikipedia article first. Then come with questions on the topics you don't understand.
To put it short, quoting the article, to hash means:
to chop and mix
That is, given a value, you get another (usually) shorter value from it (chop), but that obtained value should change even if a small part of the original value changes (mix).
Lets take x % 9 as an example hashing function.
345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5
You can see that this hashing method takes into account all parts of the input and changes if any of the digits change. That makes it a good hashing function.
On the other hand if we would take x%10. We would get
345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5
As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.
Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.

I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.
Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:
if (inputstring == thefirststring) return 0;
if (inputstring == thesecondstring) return 1;
if (inputstring == thethirdstring) return 1;
... up to the the193371ndstring
in many cases, a 'perfect' function would take so much effort to compute that it wouldn't be worth the effort.
What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:
Linear hashing -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search in the indicated slot, and then next one, then the next, etc. until the item is found or one hits an empty slot. Linear hashing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more than that will contain only two, etc. so no slot will hold very many items.
When dealing with a fixed data set (e.g. a compiler's set of keywords), linear hashing is often good; in cases where it works badly, one can tweak the hash function so it will work well. When dealing with an unknown data set, chain bucket hashing is often the best approach. The overhead of dealing with extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.

Related

Is there a specific scenario of a hash table that isn't full yet an insertion can't occur?

What I mean to ask is for a hash-table following the standard size of a prime number, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there's some empty slots? What kind of hash-function would achieve that?

So, most hash functions allow for collisions ("Hash Collisions" is the phrase you should google to understand this better, by the way.) Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").

There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding in each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values to make the table fail, or it may simply be so full that the odds are no longer good, or - while emptier - a statistical freak event)
the programmer might have done something like picked a prime number they liked - say 13903 - to add to the last-probed bucket each time until a free one is found, but if the table size happens to be 13903 too it'll keep checking the same bucket.
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.

If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing merely simplifies the search algorithm. Because alternatively, the search algorithm can remember the starting index i, and halt the search if the entire table has been scanned and it lands back at index i.

How does a hash (in a language like Ruby) work "under the hood"?

I've read here and there about hash maps/tables, and can kind of understand the concept that a hash table is essentially a finite-sized array. The function could use the modulus operator to determine which index in the array corresponds to a particular key. If collisions occur, then a linked-list can be implemented to store all the collided values. This is my very-novice understanding, and I hope someone can expound on it/correct it in the context of a Ruby hash. In Ruby, all you really have to do is
hash = {}
hash[key] = value
and this creates a key with the corresponding value. Say that you're just storing a bunch of symbols as keys and numbers as values:
hash[:a] = 1
hash[:b] = 2
...
What exactly is happening under the hood in terms of storing the values in arrays and linked-lists? What would be an example of a collision?

The Ruby Language Specification does not prescribe any particular implementation strategy for the Hash class. Every implementation is allowed to implement it however they want, provided they honor the contract.
For example, here is Rubinius's implementation, which, being written in Ruby, is pretty easy to follow: kernel/common/hash.rb This is a fairly traditional hashtable. (One other cool thing to note about this implementation is that it actually happens to be as fast as YARV's, which proves that Ruby code can be as efficient as hand-optimized C.)
Rubinius also alternatively implements the Hash class with a Hash Array Mapped Trie: kernel/common/hash_hamt.rb [Note: this implementation uses three VM primitives written in C++.]
You can switch between those two implementations using a configuration option. So, not only is the Hash implementation different between different Ruby implementations, it might even be different between two runs of the exact same program on the exact same version of the exact same Ruby implementation!
In IronRuby, Ruby's Hash class simply delegates to a .NET System.Collections.Generic.Dictionary<object, object>: Ruby/Builtins/Hash.cs
In previous versions, it didn't even delegate, it was just simply a subclass: Ruby/Builtins/Hash.cs

If you are hardcore about this you could look at the implementation directly. This is what the hash ends up using:
https://github.com/ruby/ruby/blob/c8b3f1b470e343e7408ab5883f046b1056d94ccc/st.c
The hash itself is here:
https://github.com/ruby/ruby/blob/trunk/hash.c
Most of the times, the article diego provided in comments will be more than enough

In ruby 2.4, Hash table was moved to open addressing model, so I will describe only how Hash-tables structure works, but not how it is implemented in 2.4 and above.
Let's imagine that we store all entries in an array. When we want to find something, we have to go through all the elements to match one. This can take a long time if we have a lot of elements and using a hash table lets us go directly to the cell with the required value by computing the hash function for that key.
The hash table stores all values in the store (bins) groups, in a data structure similar to an array.
How does hash table work
When we add a new key-value pair, we need to calculate to which "storage" this pair will be inserted and we do this using the .hash method (hash function). The resulting value from the hash function is a pseudo-random number as it always produces the same number for the same value.
Roughly speaking, hash returns the equivalent of the link to the memory location where
the current object is stored. However, for strings, the calculation is relative to the value.
Having received a pseudo-random number, we have to calculate the number of the "storage" where the key-value pair will be stored.
'a'.hash % 16 =>9
a - key
16 - amount of storage
9 - the storage number
So, in Ruby the insertion works in the following way:
How insertion works
It takes the hash of the key using the internal hash function.
:c.hash #=> 2782
After getting the hash value, with the help of modulo operation (2782 % 16) we will get the storage number where to keep our key-value pair :d.hash % 16
Add key-value to a linked list of the proper bin
The search works as follows:
The search works quite the same way:
Determine "hash" function;
Find "storage";
Then iterate through the list and retrieve a hash element.
In ruby, the average number of elements per bin is 5. With the increase in the number of records, the density of elements will grow in each repository (in fact, that size of hash-table is only 16 storages).
If the density of the elements is large, for example 10_000 elements in one "storage", we will have to go through all the elements of this linked-list to find the corresponding record. And we'll go back to O(n) time, which is pretty bad.
To avoid this, table rehash is applied. This means that hash-table size will be increased (up to the next number of - 16, 32, 64, 128, ...) and for all current elements the position in the "storages" will be recalculated.
"Rehash" occurs when the number of all elements is greater than the maximum density multiplied by the current table size.
81 > 5 * 16 - rehash will be called when we add 81 elements to the table.
num_entries > ST_DEFAULT_MAX_DENSITY * table->num_bins
When the number of entries reaches the maximum possible value for current hash-table, the number of "storages" in that hash-table increases (it takes next size-number from 16, 32, 64, 128), and it re-calculates and corrects positions for all entries in that hash.
Check this article for a more in-depth explanation: Do You Know How Hash Table Works? (Ruby Examples)

list-of-list vs. hash-of-hashes

Setup:
I need to store feature vectors associated with string-string pairs. The string-string pairs encode an input-output relationship. There will be a relatively small number of inputs X (e.g. 5), and for each input x, there will be a relatively small number outputs Y|x (e.g. 10).
The question is, what data structure is fastest?
Additional relevant information:
The outputs are generally different for each input, and it cannot be assumed that each X has the same number of outputs.
Lookup will be done "many" times (perhaps 1000).
Inputs will be sampled equally frequently, but for each input, usually one or 2 outputs will be accessed frequently, and the remainder will be accessed infrequently or not at all.
At present, I am considering three possibilities:
list-of-lists: access outer list with index (representing input X[i]), access inner list with index (representing output Y[i][j]).
hash-of-hashes: same as above.
flat hash: key = (input,output).

If you have strings, it's unclear how you would look up the index to use a list of lists efficiently without utilizing hashing anyway. If you can pass around something that keeps the reference to the index (e.g. if the set of outputs is fixed, and you can define an enumeration of them), instead of the string a list of lists would be faster (assuming you mean list in the 'not necessarily linked list' sense, with O(1) element access). Otherwise you may as well just hash directly and save yourself the effort.
If not, that leaves hash of hashes v. flat hash. What's your access pattern like? Are you always going to ask for X,Y, or would you ever need to access all outputs for X? Hash(X+Y) is likely roughly equivalent to hash(X) + hash(Y) (both are going to generally walk over all the letters to generate the hash. So individual hashes is more flexible, at a slight (almost certainly negligible) overhead. From 3, it sounds like you might need the hash of hashes, anyhow.

Efficient mapping from 2^24 values to a 2^7 index

I have a data structure that stores amongst others a 24-bit wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index to that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search which one out of 2^7 values fits best. Obviously, this takes some time for a big number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation of the 16 MB array: On average 31 consecutive values are the same. Unfortunately there are also a number of consecutive values that are different.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?

Hard to say without knowing what the definition is of "best fit". Perhaps a kd-tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates, and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem that an image processor has when reducing to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and might help.

As an idea...
Up the index table to 8 bits, then xor all 3 bytes of the 24 bit word into it.
then your table would consist of this 8 bit hash value, plus the index back to the original 24 bit value.
Since your data is RGB like, a more sophisticated hashing method may be needed.
bit24var & 0x000f gives you the right hand most char.
(bit24var >> 8) & 0x000f gives you the one beside it.
(bit24var >> 16) & 0x000f gives you the one beside that.
Yes, you are thinking correctly. It is quite likely that one or more of the 24 bit values will hash to the same index, due to the pigeon hole principal.
One method of resolving a hash clash is to use some sort of chaining.

Another idea would be to put your important values is a different array, then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.

How many 2^24 haves do you have? Can you sort these values and count them by counting the number of consecutive values.

Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply just filter incoming data and assign a value, starting from 0 and up to 2^7-1, to these values as we encounter them. Of course, we would need some way of keeping track of which of the important values we have already seen and assigned a label in [0,2^7) already. For that we can use some sort of tree or hashtable based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random
def make_mapping(data, important):
mapping=dict() # dictionary to hold the final mapping
next_index=0 # the next free label that can be assigned to an incoming value
for elem in data:
if elem in important: #check that the element is important
if elem not in mapping: # check that this element hasn't been assigned a label yet
mapping[elem]=next_index
next_index+=1 # this label is assigned, the next new important value will get the next label
return mapping
if __name__=='__main__':
important_values=[1,5,200000,6,24,33]
data=range(0,300000)
random.shuffle(data)
answer=make_mapping(data,important_values)
print answer
You can make the search much faster by using hash/tree based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if its is a hashtable) where n is the size of input and k is the set of important values.

Another idea is to represent the 24BitValue array in a bit map. A nice unsigned char can hold 8 bits, so one would need 2^16 array elements. Thats 65536. If the corresponding bit is set, then you know that that specific 24BitValue is present in the array, and needs to be checked.
One would need an iterator, to walk through the array and find the next set bit. Some machines actually provide a "find first bit" operation in their instruction set.
Good luck on your quest.
Let us know how things turn out.
Evil.

Is there a method to generate a single key that remembers all the string that we have come across

I am dealing with hundreds of thousands of files,
I have to process those files 1-by-1,
In doing so, I need to remember the files that are already processed.
All I can think of is strong the file path of each file in a lo----ong array, and then checking it every time for duplication.
But, I think that there should be some better way,
Is it possible for me to generate a KEY (which is a number) or something, that just remembers all the files that have been processed?

You could use some kind of hash function (MD5, SHA1).
Pseudocode:
for each F in filelist
hash = md5(F name)
if not hash in storage
process file F
store hash in storage to remember
see https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5

There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course you can get by without doing a linear search of the strings you've seen so far using all sorts of different methods like hash tables, binary trees, etc.

If I understand your question correctly, you want to create a SINGLE key that should take on a specific value, and from that value you should be able to deduce which files have been processed already? I don't know if you are going to be able to do that, simply from the point that your space is quite big and generating unique key presentations in such a huge space requires a lot of memory.
As mentioned, what you can do is simply to store each path URL in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant time O(1), so it will be quite fast.

Bloom filter can solve your problem.
Idea of bloom filter is simple. It begins with having an empty array of some length, with all its members having zero value. We shall have K number of hash functions.
When ever we need to insert an item to the bloom filter, we has the item with all K hash functions. These hash functions would get K indexes on the bloom filter. For these indexes, we need to change the member value as 1.
To check if an item exists in the bloom filter, simply hash it with all of the K hashes and check the corresponding array indexes. If all of them are 1's , the item is present in the bloom filter.
Kindly note that bloom filter can provide false positive results. But this would never give false negative results. You need to tweak the bloom filter algorithm to address these false positive case.

What you need, IMHO, is a some sort of tree or hash based set implementation. It is basically a data structure that supports very fast add, remove and query operations and keeps only one instance of each elements (i.e. no duplicates). A few hundred thousand strings (assuming they are themselves not hundreds of thousands characters long) should not be problem for such a data structure.
You programming language of choice probably already has one, so you don't need to write one yourself. C++ has std::set. Java has the Set implementations TreeSet and HashSet. Python has a Set. They all allow you to add elements and check for the presence of an element very fast (O(1) for hashtable based sets, O(log(n)) for tree based sets). Other than those, there are lots of free implementations of sets as well as general purpose binary search trees and hashtables that you can use.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio