Learn Ruby The Hard Way Ex39 - Understanding Buckets - ruby

Can somebody please explain the concept of buckets simply to me? I understand a Dict is an array of arrays, but I cannot for the life of me make sense of this first block of code, and I can't find anything online that explains num_buckets. If you could explain it line by line, that would be great.
module Dict
  def Dict.new(num_buckets=256)
    # Initializes a Dict with the given number of buckets.
    aDict = []
    # Push one empty array (a bucket) for each index 0...num_buckets.
    (0...num_buckets).each do |i|
      aDict.push([])
    end
    return aDict
  end
end

The code is meant to implement a data structure called a hash table. It is the same kind of data structure that backs Ruby's built-in Hash class.
Hash tables use the hash of a key as an index. Because there is a limited number of possible indexes, collisions (i.e., different keys having the same hash) happen. Separate chaining is one common method of collision resolution: keys are inserted into buckets, and different keys with the same hash end up in the same bucket. num_buckets here is the number of buckets.
[Image from Wikipedia illustrating separate chaining]
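To make the role of the buckets concrete, here is a minimal sketch of how a table built this way might be used. The module name BucketDemo and the helpers new_table, bucket_for, set and get are hypothetical names chosen for illustration; they are not taken from the book's code.

```ruby
# Sketch only: BucketDemo and its helper names are hypothetical.
module BucketDemo
  # Build the table: an array holding num_buckets empty arrays (the buckets).
  def self.new_table(num_buckets = 256)
    Array.new(num_buckets) { [] }
  end

  # Pick a bucket: hash the key, then reduce the hash modulo the bucket count.
  def self.bucket_for(table, key)
    table[key.hash % table.length]
  end

  # Store a pair, replacing the value if the key is already in its bucket.
  def self.set(table, key, value)
    bucket = bucket_for(table, key)
    pair = bucket.find { |k, _| k == key }
    if pair
      pair[1] = value
    else
      bucket.push([key, value])
    end
  end

  # Look a key up by scanning only the one bucket it can live in.
  def self.get(table, key, default = nil)
    pair = bucket_for(table, key).find { |k, _| k == key }
    pair ? pair[1] : default
  end
end

table = BucketDemo.new_table(4)
BucketDemo.set(table, "Oregon", "OR")
BucketDemo.set(table, "Florida", "FL")
BucketDemo.set(table, "Oregon", "ORE")   # same key: updates in place
puts BucketDemo.get(table, "Oregon")
```

With only 4 buckets, several keys will often share a bucket, which is exactly the collision case that separate chaining handles: each bucket is just a small list of key/value pairs.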

Related

hashing vs hash function, don't know the difference

For example, take "consistent hashing" and "perfect hash function". In Wikipedia, when I click "hashing", the link directs to "hash function", so it seems that they have the same meaning, but then why do both terms exist? Is there any difference between using "hashing" and "hash function"? And is it OK to call "consistent hashing" a "consistent hash function"? Thanks!
A hash function takes some input data (typically a bunch of binary bytes, but could be anything - whatever you make it to) and calculates a hash value, which is typically an integer number (but, again, can be anything). The process of doing this is called hashing.
The hash value is always the same size, no matter what the input looks like. Well, I suppose you could make a hash function that has a variable-size output, but I haven't seen one in the wild yet. It wouldn't be very practical. Thus, by its very nature, hashing is usually a one-way calculation. You can't normally get the original data back from the hash value, because there are many more possible input data combinations than there are possible hash values.
The main advantages are:
The hash value is always the same size
The same input will always generate the same output.
If it's a good hash function, different inputs will usually generate different outputs, but it's still possible that two different inputs generate the same output (this is called a hash collision).
If you have a cryptographic hash function you also get one more advantage:
Given only the hash value, it is infeasible to come up with input data that would hash to this value. Never mind that it's not the original input data; any kind of input data that would hash to the given output value is impossible to find in a useful timeframe.
The results of a hash function can be used in various ways. As mentioned in other answers, hash tables are one common use-case. Verifying data integrity is another case - for example, you download a file, then hash it, then check the hash value against the value that was specified in the webpage where you downloaded the file from. If they don't match, the file was not downloaded correctly. If you combine hash values with public-key cryptography you can get digital signatures. And I'm sure there are other uses to which the principle can be put.
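The properties above can be checked with a short sketch using SHA-256 from Ruby's standard digest library. The helper intact? and the sample strings are made up for illustration; a real integrity check would compare against a digest published by the download site.

```ruby
require "digest"

short = Digest::SHA256.hexdigest("a")
long  = Digest::SHA256.hexdigest("a" * 10_000)

# Fixed size: SHA-256 always yields 256 bits, i.e. 64 hex characters.
puts short.length
puts long.length

# Deterministic: the same input always produces the same digest.
puts Digest::SHA256.hexdigest("a") == short

# Integrity check: compare the digest of the data with the expected digest.
# intact? is a hypothetical helper name.
def intact?(data, expected_hex)
  Digest::SHA256.hexdigest(data) == expected_hex
end

# Stand-in for a digest that would normally be copied from a web page.
published = Digest::SHA256.hexdigest("file contents")
puts intact?("file contents", published)
puts intact?("file c0ntents", published)
```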
You can write a hash function, and what it does is hash keys to bins.
In other words, the hash function is what does the hashing.
I hope that clarifies it.
A hash table is a data structure in which each value is associated with a particular key for faster access to elements. The process of populating this data structure is known as hashing.
To do hashing, you need a function that provides the logic for mapping keys to positions in the table. This function is the hash function.
I hope this clarifies your doubt.

How does a hash (in a language like Ruby) work "under the hood"?

I've read here and there about hash maps/tables, and can kind of understand the concept that a hash table is essentially a finite-sized array. The function could use the modulus operator to determine which index in the array corresponds to a particular key. If collisions occur, then a linked-list can be implemented to store all the collided values. This is my very-novice understanding, and I hope someone can expound on it/correct it in the context of a Ruby hash. In Ruby, all you really have to do is
hash = {}
hash[key] = value
and this creates a key with the corresponding value. Say that you're just storing a bunch of symbols as keys and numbers as values:
hash[:a] = 1
hash[:b] = 2
...
What exactly is happening under the hood in terms of storing the values in arrays and linked-lists? What would be an example of a collision?
The Ruby Language Specification does not prescribe any particular implementation strategy for the Hash class. Every implementation is allowed to implement it however they want, provided they honor the contract.
For example, here is Rubinius's implementation, which, being written in Ruby, is pretty easy to follow: kernel/common/hash.rb This is a fairly traditional hashtable. (One other cool thing to note about this implementation is that it actually happens to be as fast as YARV's, which proves that Ruby code can be as efficient as hand-optimized C.)
Rubinius also alternatively implements the Hash class with a Hash Array Mapped Trie: kernel/common/hash_hamt.rb [Note: this implementation uses three VM primitives written in C++.]
You can switch between those two implementations using a configuration option. So, not only is the Hash implementation different between different Ruby implementations, it might even be different between two runs of the exact same program on the exact same version of the exact same Ruby implementation!
In IronRuby, Ruby's Hash class simply delegates to a .NET System.Collections.Generic.Dictionary<object, object>: Ruby/Builtins/Hash.cs
In previous versions, it didn't even delegate, it was just simply a subclass: Ruby/Builtins/Hash.cs
If you are hardcore about this you could look at the implementation directly. This is what the hash ends up using:
https://github.com/ruby/ruby/blob/c8b3f1b470e343e7408ab5883f046b1056d94ccc/st.c
The hash itself is here:
https://github.com/ruby/ruby/blob/trunk/hash.c
Most of the time, the article diego provided in the comments will be more than enough.
In Ruby 2.4, the hash table was moved to an open addressing model, so I will describe only how the hash table structure works in general, not how it is implemented in 2.4 and above.
Let's imagine that we store all entries in an array. When we want to find something, we have to go through all the elements to match one. This can take a long time if we have a lot of elements and using a hash table lets us go directly to the cell with the required value by computing the hash function for that key.
The hash table stores all values in groups called storages (bins), which are kept in a data structure similar to an array.
How does hash table work
When we add a new key-value pair, we need to calculate which "storage" the pair will be inserted into, and we do this using the .hash method (the hash function). The value returned by the hash function looks like a random number, but it is only pseudo-random: it always produces the same number for the same value.
Roughly speaking, hash returns the equivalent of a reference to the memory location where the current object is stored. For strings, however, the calculation is based on the string's value.
Having received a pseudo-random number, we have to calculate the number of the "storage" where the key-value pair will be stored.
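'a'.hash % 16 # => 9 (illustrative; string hashes are seeded per Ruby process, so the value varies between runs)
'a' - the key
16 - the number of storages
9 - the storage number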
So, in Ruby the insertion works in the following way:
How insertion works
It takes the hash of the key using the internal hash function.
:c.hash #=> 2782
After getting the hash value, we use the modulo operation to get the number of the storage where the key-value pair will be kept: :c.hash % 16 #=> 14 (since 2782 % 16 is 14)
Add key-value to a linked list of the proper bin
The search works in much the same way:
Compute the hash of the key;
Find "storage";
Then iterate through the list and retrieve a hash element.
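The insertion and search steps above can be sketched as follows. ChainedTable is a hypothetical class written for illustration, not Ruby's actual implementation; it uses plain arrays for the chains where the description above speaks of linked lists, and the fixed count of 16 bins is taken from the example.

```ruby
# Sketch only: ChainedTable is hypothetical and is not Ruby's real Hash.
class ChainedTable
  def initialize(num_bins = 16)
    @bins = Array.new(num_bins) { [] }
  end

  # Hash the key, then reduce the hash to a bin number.
  def bin_index(key)
    key.hash % @bins.length
  end

  # Insertion: add the pair to the chain of its bin (or update it in place).
  def insert(key, value)
    chain = @bins[bin_index(key)]
    entry = chain.find { |pair| pair[0].eql?(key) }
    if entry
      entry[1] = value
    else
      chain << [key, value]
    end
  end

  # Search: same bin computation, then walk the chain comparing keys.
  def search(key)
    entry = @bins[bin_index(key)].find { |pair| pair[0].eql?(key) }
    entry && entry[1]
  end
end

t = ChainedTable.new
t.insert(:c, 3)
t.insert(:d, 4)
t.insert(:c, 30)                 # same key: value is replaced
puts t.search(:c)
puts t.search(:missing).inspect
```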
In Ruby, the maximum average number of elements per bin (the density) is 5. As the number of records grows, the density of elements in each storage grows too (initially, the hash table has only 16 storages).
If the density of the elements is large, for example 10_000 elements in one "storage", we will have to go through all the elements of this linked-list to find the corresponding record. And we'll go back to O(n) time, which is pretty bad.
To avoid this, a table rehash is applied. This means that the size of the hash table is increased (to the next size in the sequence 16, 32, 64, 128, ...) and the position in the "storages" is recalculated for all current elements.
"Rehash" occurs when the number of all elements is greater than the maximum density multiplied by the current table size.
81 > 5 * 16 - rehash will be called when we add 81 elements to the table.
num_entries > ST_DEFAULT_MAX_DENSITY * table->num_bins
When the number of entries reaches the maximum possible value for current hash-table, the number of "storages" in that hash-table increases (it takes next size-number from 16, 32, 64, 128), and it re-calculates and corrects positions for all entries in that hash.
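The rehash condition can be written out as a small sketch. The module StRehashSketch and its method names are hypothetical; the density of 5 and the doubling size sequence are the numbers quoted above, not values read from the current Ruby source.

```ruby
# Sketch only: hypothetical helpers mirroring the condition
# num_entries > ST_DEFAULT_MAX_DENSITY * table->num_bins.
module StRehashSketch
  MAX_DENSITY = 5   # assumed, as quoted in the text

  def self.rehash_needed?(num_entries, num_bins)
    num_entries > MAX_DENSITY * num_bins
  end

  # Next table size in the sequence 16, 32, 64, 128, ...
  def self.next_size(num_bins)
    num_bins * 2
  end
end

puts StRehashSketch.rehash_needed?(80, 16)   # 80 > 80 is false
puts StRehashSketch.rehash_needed?(81, 16)   # 81 > 80 is true
puts StRehashSketch.next_size(16)
```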
Check this article for a more in-depth explanation: Do You Know How Hash Table Works? (Ruby Examples)

Perfect Hash Building

Why don't we use SHA-1, md5Sum and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions and are also not reversible. So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
The only reason I can think of is that they require a large key, say 32 bits. But they still avoid collisions, so the lookup will definitely be O(1).
Because they are very slow, for two reasons:
They aim to be cryptographically secure, not only collision-resistant in general
They produce a much larger hash value than what you actually need in a hash table
Also, they handle unstructured data (octet / byte streams), but the objects you need to hash are often structured and would require linearization first
Why don't we use SHA-1, md5Sum and other standard cryptography hashes for hashing. They are smart enough to avoid collisions...
Wrong because:
Two inputs can still happen to have the same hash value. Say the hash value is 32 bits: a great general-purpose hash routine (i.e. one that doesn't utilise insights into the set of actual keys) still has at least a 1/2^32 chance of returning the same hash value for any 2 keys, then a 2/2^32 chance of colliding with one of those as a third key is hashed, 3/2^32 for the fourth, etc.
Having distinct hash values is a very different thing from having the hash values map to distinct hash buckets in a hash table. Hash values are generally modded into the table size to select a bucket, so at best - and again for general-purpose hashing - the chance of a collision when adding an element to a hash table is #preexisting-elements / table-size.
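To get a feel for these numbers, here is a small sketch that multiplies out the per-key chances described above. It assumes an ideal hash that picks uniformly among the possible values; collision_probability is a hypothetical helper name.

```ruby
# Probability that at least two of n keys share a hash value, assuming an
# ideal hash that picks uniformly among `space` possible values.
def collision_probability(n, space)
  no_collision = 1.0
  # The (i + 1)-th key must avoid the i values already taken.
  (1...n).each { |i| no_collision *= 1.0 - i.to_f / space }
  1.0 - no_collision
end

puts collision_probability(2, 2**32)        # a single pair: 1 / 2^32
puts collision_probability(77_163, 2**32)   # roughly one half

# Bucket collisions are far more likely than hash-value collisions:
# the same formula applied to a table with only 1024 buckets.
puts collision_probability(100, 1024)
```

Even with a full 32-bit hash value, a collision somewhere among the keys becomes about as likely as not once there are several tens of thousands of keys, and once the values are reduced modulo the table size, collisions between buckets become routine.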
So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
Because speed is often the programmer's goal when choosing to use a hash table over say a binary tree. If the hash values are mathematically complicated to calculate, they may take a lot longer than using a slightly more (but still not particularly) collision prone but faster-to-calculate hash function. That said, there are times when more effort on the hashing can pay off - for example, when the hash table exists on magnetic disk and the I/O costs of seeking & reading records dwarfs hash calculation effort.
antti makes an interesting point about data too... general purpose hashing routines often work on blocks of binary data with a specific starting address and a number of bytes (they may even require that number of bytes to be a multiple of 2 or 4). In many applications, data that needs to be hashed will be intermingled with data that must not be included in the hash - such as cached values, file handles, pointers/references to other data or virtual dispatch tables etc.. A common solution is to hash the desired fields separately and combine the hash keys - perhaps using exclusive-or. As there can be bit fields that should be hashed in the same byte of memory as other data that should not be hashed, you sometimes need custom code to extract those values. Still, even if some copying and padding was required beforehand, each individual field could eventually be hashed using md5, SHA-1 or whatever and those hash values could be similarly combined, so this complication doesn't really categorically rule out the approach you're interested in.
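A minimal sketch of hashing only the desired fields and combining the results with exclusive-or might look like this in Ruby. FileRecord, its fields and cached_handle are hypothetical names invented for the example.

```ruby
# Sketch only: FileRecord and its fields are hypothetical.
class FileRecord
  attr_reader :name, :size
  attr_accessor :cached_handle   # must NOT take part in hashing

  def initialize(name, size)
    @name = name
    @size = size
    @cached_handle = nil
  end

  # Hash only the identifying fields and combine them with exclusive-or.
  def hash
    name.hash ^ size.hash
  end

  # Hash-based lookups also need eql? to agree with hash.
  def eql?(other)
    other.is_a?(FileRecord) && name == other.name && size == other.size
  end
  alias_method :==, :eql?
end

a = FileRecord.new("a.txt", 10)
b = FileRecord.new("a.txt", 10)
b.cached_handle = Object.new      # differs from a, but is ignored by hash

lookup = { a => "first" }
puts a.hash == b.hash
puts lookup[b]
```

Exclusive-or is symmetric, so two records whose field values are swapped would collide; in Ruby a common alternative is to hash an array of the fields, as in [name, size].hash.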
Only reason I am able to think is they require say large key say 32bit.
All other things being equal, the larger the key the better, though if the hash function is mathematically ideal then any N of its bits - where 2^N >= # hash buckets - will produce minimal collisions.
But still avoiding collision so the look up will definitely be O(1).
Again, wrong as mentioned above.
(BTW... I stress general-purpose in a couple places above. That's just because there are trivial cases where you might have some insight into the keys you'll need to hash that allows you to position them perfectly within the available hash buckets. For example, if you knew the keys were the numbers 1000, 2000, 3000 etc. up to 100000 and that you had at least 100 hash buckets, you could trivially define your hash function as x/1000 and know you'd have perfect hashing sans collisions. This situation of knowing that all your keys map to distinct hash table buckets is known as "perfect hashing" - as per your question title - a good general-purpose hash like md5 is not a perfect hash, and indeed it makes no sense to talk about perfect hashing without knowing the complete set of possible keys).
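The perfect-hashing example with keys 1000, 2000, ..., 100000 can be sketched as below. The method name perfect_hash is hypothetical, and subtracting 1 is an adjustment made here so that the 100 keys land on zero-based bucket indexes 0 to 99.

```ruby
# Sketch only: valid just for the known key set 1000, 2000, ..., 100000.
def perfect_hash(key, num_buckets = 100)
  (key / 1000 - 1) % num_buckets
end

keys = (1..100).map { |i| i * 1000 }
buckets = keys.map { |key| perfect_hash(key) }

puts buckets.uniq.length == keys.length   # every key gets its own bucket
puts buckets.minmax.inspect
```

The function is only "perfect" for this key set: a key such as 1500 would share a bucket with 1000.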

How to implement a dynamic-size hash table?

I know the basic principle of the hash table data structure. If I have a hash table of size N, I have to distribute my data into these N buckets as evenly as possible.
But in reality, most languages have their built-in hash table types. When I use them, I don't need to know the size of hash table beforehand. I just put anything I want into it. For example, in Ruby:
h = {}
10000000.times{ |i| h[i]=rand(10000) }
How can it do this?
See the Dynamic resizing section of the Hash table article on Wikipedia.
The usual approach is to use the same logic as a dynamic array: have some number of buckets, and when there are too many items in the hash table, create a new hash table with a larger size and move all the items to the new hash table.
Also, depending on the type of hash table, this resizing might not be necessary for correctness (i.e. it would still work even without resizing), but it is certainly necessary for performance.
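As a rough illustration of that approach, here is a sketch of a chained table that doubles its bucket array when it gets too full. GrowingTable, MAX_LOAD and the starting size of 16 are assumptions made for the example; this is not how Ruby's built-in Hash is written.

```ruby
# Sketch only: GrowingTable is hypothetical, not Ruby's real Hash.
class GrowingTable
  MAX_LOAD = 5   # assumed threshold: average entries per bucket before growing

  attr_reader :size

  def initialize(num_buckets = 16)
    @buckets = Array.new(num_buckets) { [] }
    @size = 0
  end

  def num_buckets
    @buckets.length
  end

  def []=(key, value)
    bucket = bucket_for(key)
    pair = bucket.find { |k, _| k.eql?(key) }
    if pair
      pair[1] = value
    else
      bucket << [key, value]
      @size += 1
      grow if @size > MAX_LOAD * @buckets.length
    end
  end

  def [](key)
    pair = bucket_for(key).find { |k, _| k.eql?(key) }
    pair && pair[1]
  end

  private

  def bucket_for(key)
    @buckets[key.hash % @buckets.length]
  end

  # Allocate twice as many buckets and re-insert every pair, because each
  # pair's bucket index depends on the bucket count.
  def grow
    old = @buckets
    @buckets = Array.new(old.length * 2) { [] }
    old.each do |bucket|
      bucket.each { |key, value| bucket_for(key) << [key, value] }
    end
  end
end

table = GrowingTable.new
1000.times { |i| table[i] = i * 2 }
puts table.size
puts table.num_buckets
puts table[999]
```

The caller never specifies a size: the table starts small and grows on its own, and the occasional cost of moving every item is spread over many cheap insertions, as with a dynamic array.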

Is there a method to generate a single key that remembers all the string that we have come across

I am dealing with hundreds of thousands of files,
I have to process those files 1-by-1,
In doing so, I need to remember the files that are already processed.
All I can think of is storing the file path of each file in a long array, and then checking it every time for duplication.
But, I think that there should be some better way,
Is it possible for me to generate a KEY (which is a number) or something, that just remembers all the files that have been processed?
You could use some kind of hash function (MD5, SHA1).
Pseudocode:
for each F in filelist
    hash = md5(F name)
    if not hash in storage
        process file F
        store hash in storage to remember
see https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5
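In Ruby, the pseudocode above could be sketched with the standard digest and set libraries instead of a hand-written MD5. The helper process_once and the sample file names are hypothetical; the block passed to it stands in for the real processing step.

```ruby
require "digest"
require "set"

# Sketch only: process_once is a hypothetical helper.
def process_once(paths, seen = Set.new)
  paths.each do |path|
    digest = Digest::MD5.hexdigest(path)   # hash of the file name
    next if seen.include?(digest)          # already processed: skip it
    yield path                             # process file
    seen << digest                         # remember it
  end
  seen
end

processed = []
process_once(["a.txt", "b.txt", "a.txt"]) { |path| processed << path }
p processed
```

Note that this hashes the file name rather than the file contents, as the pseudocode does, so two different files with the same name would be treated as one.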
There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course you can get by without doing a linear search of the strings you've seen so far using all sorts of different methods like hash tables, binary trees, etc.
If I understand your question correctly, you want to create a SINGLE key that should take on a specific value, and from that value you should be able to deduce which files have been processed already? I don't know if you are going to be able to do that, simply from the point that your space is quite big and generating unique key presentations in such a huge space requires a lot of memory.
As mentioned, what you can do is simply to store each path URL in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant time O(1), so it will be quite fast.
Bloom filter can solve your problem.
The idea of a Bloom filter is simple. It begins with an empty array of some length, with all of its members set to zero. We also have K hash functions.
Whenever we need to insert an item into the Bloom filter, we hash the item with all K hash functions. These hash functions give K indexes into the array. For each of these indexes, we set the member value to 1.
To check whether an item exists in the Bloom filter, simply hash it with all K hash functions and check the corresponding array indexes. If all of them are 1, the item is probably present in the Bloom filter.
Kindly note that a Bloom filter can give false positive results, but it never gives false negative results. You need to tune the Bloom filter (its array length and number of hash functions) to keep the false positive rate acceptable.
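A minimal sketch of the idea in Ruby is shown below. The class, its default sizes, and the way the K hash functions are simulated (by salting SHA-256 with the function's number) are all assumptions made for illustration, not a tuned implementation.

```ruby
require "digest"

# Sketch only: a hypothetical minimal Bloom filter.
class BloomFilter
  def initialize(num_bits = 1024, num_hashes = 3)
    @bits = Array.new(num_bits, 0)   # all members start at zero
    @num_hashes = num_hashes         # K
  end

  # Insert: set the member at each of the K indexes to 1.
  def add(item)
    indexes(item).each { |i| @bits[i] = 1 }
  end

  # false means "definitely never added"; true means "probably added".
  def include?(item)
    indexes(item).all? { |i| @bits[i] == 1 }
  end

  private

  # Simulate K different hash functions by salting one strong hash.
  def indexes(item)
    (0...@num_hashes).map do |salt|
      Digest::SHA256.hexdigest("#{salt}:#{item}").to_i(16) % @bits.length
    end
  end
end

filter = BloomFilter.new
filter.add("/data/a.txt")
puts filter.include?("/data/a.txt")   # added items are always reported
puts filter.include?("/data/b.txt")   # almost certainly false at this load
```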
What you need, IMHO, is a some sort of tree or hash based set implementation. It is basically a data structure that supports very fast add, remove and query operations and keeps only one instance of each elements (i.e. no duplicates). A few hundred thousand strings (assuming they are themselves not hundreds of thousands characters long) should not be problem for such a data structure.
Your programming language of choice probably already has one, so you don't need to write one yourself. C++ has std::set. Java has the Set implementations TreeSet and HashSet. Python has set. They all allow you to add elements and check for the presence of an element very fast (O(1) on average for hash-table-based sets, O(log(n)) for tree-based sets). Other than those, there are lots of free implementations of sets, as well as general-purpose binary search trees and hash tables, that you can use.
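Ruby's standard library has a hash-based Set as well. A short sketch of the "remember what was processed" loop follows; the file paths are made up for the example.

```ruby
require "set"

seen = Set.new
paths = ["/data/a.txt", "/data/b.txt", "/data/a.txt"]

paths.each do |path|
  # Set#add? returns nil when the element is already in the set.
  if seen.add?(path)
    puts "processing #{path}"
  else
    puts "skipping #{path} (already processed)"
  end
end
```

Here the check and the insertion are a single call, so there is no separate list to search.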

Resources