Suppose I have an Array: ['a', 'b', 'c']. I want to record whether I have seen a particular array before.
I can put the array in a Set, but that is wasteful if I don't need to store the contents of the array, only that I have seen it before.
In Python, I could hash a tuple (i.e. hash(('a', 'b', 'c'))) and store the result in a set to achieve this. What is the way to do this in Ruby?
Ruby has #hash on most objects, including Array, but these values are not unique and will eventually collide.
For any serious use I'd strongly suggest using something like SHA2-256 or stronger as these are cryptographic hashes designed to minimize collisions.
For example:
require 'digest/sha2'
array = %w[ a b c ]
array.hash
# => 3218529217224510043
Digest::SHA2.hexdigest(array.inspect)
# => a 64-character hex string: the 256-bit digest of '["a", "b", "c"]'
That value is, for practical purposes, unique. SHA2-256 collisions are vanishingly rare because of the sheer size of the hash: 256 bits vs. the 64-bit #hash value. That's not 4x stronger, it's a factor of 2^192, roughly 6.2 octodecillion times stronger. That number may as well be a "zillion" given that it has 57 zeroes in it.
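To tie this back to the original question, here is a minimal sketch of recording "have I seen this array?" by storing only digests in a Set. The class name SeenArrays and its seen? method are made up for illustration, and Digest::SHA256 is the same 256-bit SHA-2 function as the Digest::SHA2 default used above. It assumes inspect is a faithful serialization of the elements, which holds for arrays of strings and numbers.

```ruby
require 'digest'
require 'set'

# Hypothetical helper: remembers fingerprints of arrays, not the arrays.
class SeenArrays
  def initialize
    @digests = Set.new
  end

  # Returns true if an equal array was recorded earlier; otherwise
  # records its digest and returns false.
  def seen?(array)
    # 32 raw bytes per entry, regardless of how large the array is.
    fingerprint = Digest::SHA256.digest(array.inspect)
    # Set#add? returns nil when the element was already present.
    @digests.add?(fingerprint).nil?
  end
end

seen = SeenArrays.new
puts seen.seen?(%w[a b c])  # first sighting
puts seen.seen?(%w[a b c])  # an equal array, seen again
puts seen.seen?(%w[ab c])   # different contents
```

Using inspect rather than join keeps ['ab', 'c'] and ['a', 'bc'] distinct, since join would flatten both to the same string.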
I am still relatively new to Ruby as a language, but I know there are a lot of convenience methods built into the language. I am trying to generate a "hash" to check against in a low-level blockchain verifier, and I am wondering if there are any "convenience methods" that I could use to make this hashing algorithm more efficient. I think I can make this more efficient by utilizing Ruby's max integer size, but I'm not sure.
Below is the current code, which takes in a string to hash, unpacks it into an array of Unicode code points, does computationally intensive math on each of those values, adds up the results, takes that sum modulo 65,536, and then returns the hex representation of that value.
def generate_hash(string)
  unpacked_string = string.unpack('U*')
  sum = 0
  unpacked_string.each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  new_val = sum % 65_536 # Gives a number from 0 to 65,535
  new_val.to_s(16)
end
On very large block-chains there is a very large performance hit which I am trying to get around. Any help would be great!
First and foremost, it is extremely unlikely that you are going to create anything that is more efficient than simply using String#hash. This is a case of you trying to build a better mousetrap.
Honestly, your hashing algorithm is very inefficient. The entire point of a hash is to be a fast, low-overhead way of quickly getting a "unique" (as unique as possible) integer to represent any object to avoid comparing by values.
Using that as a premise, if you start doing any type of intense computation in a hash algorithm, it is already counter-productive. Once you start implementing modulo and pow functions, it is inefficient.
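If the formula itself has to stay as it is (say, because the verifier must reproduce hashes that already exist), most of the cost comes from building enormous intermediate integers with x**2000. A sketch of one way around that, assuming Ruby 2.5 or later, is the two-argument Integer#pow, which does modular exponentiation and never lets a value grow past the modulus. The names generate_hash_slow and generate_hash_fast are made up here; the slow version is the code from the question, kept only for comparison.

```ruby
MODULUS = 65_536

# The formula from the question, unchanged.
def generate_hash_slow(string)
  sum = 0
  string.unpack('U*').each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  (sum % MODULUS).to_s(16)
end

# Same result, but every power is reduced modulo 65,536 as it is
# computed, so no huge integers are ever built.
def generate_hash_fast(string)
  sum = string.unpack('U*').sum do |x|
    x.pow(2000, MODULUS) * (x + 2).pow(21, MODULUS) - (x + 5).pow(3, MODULUS)
  end
  # Ruby's % returns a non-negative result for a positive modulus,
  # so a negative sum still maps into 0..65_535.
  (sum % MODULUS).to_s(16)
end

puts generate_hash_slow('hello') == generate_hash_fast('hello')
```

This works because addition, subtraction, and multiplication all commute with taking the remainder, so reducing early does not change the final value.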
Usually best practice involves taking a value(s) of the object that can be represented as integers, and performing bit operations on them, typically with prime numbers to help reduce hash collisions.
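def hash
  # value1 and value2 stand for integer attributes of the object
  h = value1 ^ 397   # ^ is XOR in Ruby; 397 and 17 are primes
  h += value2 ^ 17
  h
end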
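For a fuller picture of that pattern, here is a minimal sketch of a class that defines its own hash alongside eql?, which is what Hash and Set need in order to treat two equal objects as the same key. The Point class, the multiplier 31, and the 32-bit mask are illustrative choices, not something taken from the question.

```ruby
# Hypothetical value object with two integer attributes.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  def ==(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
  alias eql? ==

  # Cheap mixing: multiply by a small prime, XOR in the next value,
  # then mask so the result stays a small integer.
  def hash
    h = x.hash
    h = (h * 31) ^ y.hash
    h & 0xFFFF_FFFF
  end
end

lookup = { Point.new(1, 2) => 'found' }
puts lookup[Point.new(1, 2)]  # equal points land on the same key
```

In everyday Ruby code, `[x, y].hash` does the same job with less effort; the explicit version is only there to show the bit operations.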
In your example, you are for some reason forcing the hash into the range of a 16-bit unsigned integer, when typically 32 bits are used, although if you are comparing on the Ruby side this would be 31 bits on a 32-bit build (63 bits on a 64-bit build) due to how Ruby tags Fixnum values. Fixnum was deprecated on the Ruby side, as it should have been, but internally the same threshold still exists between how a Bignum and a Fixnum are handled. The Integer class simply provides one interface on the Ruby side, as those two really should never have been exposed outside of the C code.
In your specific example using strings, I would simply symbolize them. This gives you a quick and efficient way to determine whether two strings are equal with hardly any overhead, since comparing two symbols is the same as comparing two integers. There is a caveat to this method if you are comparing a vast number of strings. On older Rubies, once a symbol is created it stays alive for the life of the program: any additional string equal to it returns the same symbol, but the memory of the symbol (just a few bytes) cannot be reclaimed for as long as the program runs. Ruby 2.2 and later can garbage-collect symbols created dynamically (for example with to_sym), which softens this, but it is still not ideal for comparing thousands and thousands of unique strings.
I am new to Bloom filters. I understand how to implement a Bloom filter with a bit array: we hash a value x with k hash functions and set each resulting bit array index to 1.
But I am wondering how we would implement a Bloom filter with a char array, especially if the input is a string. One way I can think of is adding up the ASCII value of each char of the string, hashing that sum, and then setting that index of the char array to some value (I am also not sure what value to set in the char array with this method, because it can't be just 0 or 1 since we are not using a bit array), but the probability of a false positive is going to be very high. Could someone give me some ideas to get started? (I do not need actual code, but I would really appreciate some insight on what hash function to use and how to map the results into the char array.)
You can use some hashing algorithm which will convert that to an integer hash and then consider each bit of it as part of the bit array or char array.
$\mathrm{hash}(S) = \sum_{i=0}^{n-1} S[i] \cdot p^{i}$
You can use this hash twice (for example with two different values of p) to reduce the chance of false positives. That will give you reasonable behavior.
Also, p must be prime, and it should be greater than the number of characters in the alphabet.
This will give you a better result than simple ascii value addition.
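As a rough illustration of that formula, here is a small Ruby sketch. The method name polynomial_hash, the default p = 257 (a prime just above the 256 possible byte values), and the modulus m (added so the numbers stay small; the formula above has no modulus) are all assumptions made for the example.

```ruby
# Polynomial hash: the sum of S[i] * p^i over the bytes of the string,
# reduced modulo m at each step.
def polynomial_hash(string, p: 257, m: 1_000_000_007)
  hash = 0
  power = 1 # holds p^i mod m for the current position i
  string.each_byte do |byte|
    hash = (hash + byte * power) % m
    power = (power * p) % m
  end
  hash
end

puts polynomial_hash('ab')  # 97 + 98 * 257
puts polynomial_hash('ba')  # 98 + 97 * 257
puts 'ab'.sum == 'ba'.sum   # plain byte addition cannot tell them apart
```

Because each character is weighted by its position, reordering the characters changes the result, which is exactly what simple ASCII addition misses.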
Another requirement is that the hash functions used should be independent and uniformly distributed.
Being fast is another criterion, which is why standard cryptographic hashes (like SHA-1) are not a good choice.
One standard hashing method that I have heard of is MurmurHash, which you can try and compare against the results you expect.
To be clear on how you would go about implementing it:
You can take multiple hash functions such as murmur, fnv1a, or even the simple one I presented, and get one value from each hash (three values in total). Mark the positions those values point to, and that will work as your Bloom filter.
Because you are using several different hash functions, a false positive requires every one of them to land on a position that is already set, which gives a better result.
For example:
You want to hash stackoverflow. You use 3 hash functions, which give you the numbers 11, 45 and 17. You keep a map in which you record these values.
{
11: 1,
45: 1,
17: 1
}
Next you hash another string the same way and get the values 11, 15 and 97.
Then the map becomes
{
11: 1,
15: 1,
17: 1,
45: 1,
97: 1
}
Note: I have used a map here, but it can just as well be a bit array where you set the bits. For example, in the case of stackoverflow the 11th, 17th and 45th bits will be set to 1.
This map lets you answer the query of whether an element is present or not.
For a query, you do the same thing: compute the hash values and check whether all of them exist in the map. If they do, there is a high chance the element is present (not a certainty, as it may be a false positive); if any is missing, the element is definitely not present.
Suppose you now check whether the string "abcd" is present. You apply the 3 hash functions used earlier and get 11, 99 and 55. You check whether all 3 of them exist. You can see 55 is not there, so the string "abcd" is not present.
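The walkthrough above can be sketched in Ruby with a byte string standing in for the char array: each slot is one byte that holds 0 or 1, which answers the "what value do I store" question. The class name BloomFilter, the slot count, and the three primes are assumptions for the example, and three polynomial hashes with different primes are used only as a simple stand-in; they are not truly independent, so a real filter would mix in something like murmur or fnv1a.

```ruby
# A Bloom filter backed by a byte string (one byte per slot).
class BloomFilter
  PRIMES = [257, 263, 269].freeze # one polynomial hash per prime
  MODULUS = 1_000_000_007

  def initialize(size = 1024)
    @size = size
    @slots = "\0".b * size # binary string used as a char array
  end

  # Mark every position the item hashes to.
  def add(item)
    positions(item).each { |index| @slots.setbyte(index, 1) }
    self
  end

  # true means "possibly present"; false means "definitely absent".
  def include?(item)
    positions(item).all? { |index| @slots.getbyte(index) == 1 }
  end

  private

  # One slot index per hash function.
  def positions(item)
    PRIMES.map do |prime|
      hash = 0
      power = 1
      item.each_byte do |byte|
        hash = (hash + byte * power) % MODULUS
        power = (power * prime) % MODULUS
      end
      hash % @size
    end
  end
end

filter = BloomFilter.new
filter.add('stackoverflow')
puts filter.include?('stackoverflow') # an added item is always reported
```

A byte per slot wastes seven bits compared with a bit array, but it keeps the indexing trivial; the same structure also extends naturally to a counting filter if the bytes are incremented instead of set to 1.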
I know I will look like a total noob, but there's something I can't wrap my head around. Let me emphasize that I DID google this, but I didn't find what I was looking for.
I'm going through the learnrubythehardway course, and for ex39 this is one of the functions we have defined:
def Dict.hash_key(aDict, key)
  return key.hash % aDict.length
end
The author gives this explanation:
hash_key
This deceptively simple function is the core of how a hash works. What it does is uses the built-in Ruby hash function to convert a string to a number. Ruby uses this function for its own hash data structure, and I'm just reusing it. You should fire up a Ruby console to see how it works. Once I have a number for the key, I then use the % (modulus) operator and the aDict.length to get a bucket where this key can go. As you should know, the % (modulus) operator will divide any number and give me the remainder. I can also use this as a way of limiting giant numbers to a fixed smaller set of other numbers. If you don't get this then use Ruby to explore it.
I like this course, but the above paragraph was no help.
Ok, you call the function passing it two arguments (aDict is an array) and it returns something.
(My questions are not totally independent of one another.)
1. What and how does it do that? (ok, it returns a bucket index, but how do we "get there"?)
2. What does the key.hash do/what is it?
3. How does using the % help me get what I need? (What is the use of "modding" the key.hash by the aDict.length?)
4. "Use Ruby to explore it." - ok, but my question No.2. kinda already suggests that I wouldn't know how to go about doing that.
Thanks in advance.
key.hash is calling Object#hash (for a string key, the String#hash override of it), which is not to be confused with the Hash class.
Object#hash converts a string into a number consistently (the same string will always result in the same number, in the same running instance of Ruby).
pry(main)> "abc".hash
=> -1672853150
So now we have a number, but it's way too large for the number of buckets in our Dict structure, which defaults to 256 buckets. So we take it modulo the bucket count to get a number within our bucket range.
pry(main)> "abc".hash % 256
=> 98
This essentially allows us to translate Dict["abc"] into aDict[98].
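Here is a small stand-alone sketch of the same idea that can be pasted into irb. It uses a plain array of 256 empty buckets instead of the course's Dict module, so buckets and this free-standing hash_key are stand-ins, not the book's code. The actual value of "abc".hash changes from one Ruby process to the next, so only the arithmetic on the fixed number from the pry session is predictable.

```ruby
# 256 empty buckets, like the default Dict in the exercise.
buckets = Array.new(256) { [] }

# Same arithmetic as Dict.hash_key: a big number in, a bucket index out.
def hash_key(buckets, key)
  key.hash % buckets.length
end

index = hash_key(buckets, 'abc')
buckets[index] << ['abc', 'some value'] # store the pair in its bucket

# Looking the key up again lands on the same bucket.
puts hash_key(buckets, 'abc') == index

# The number from the pry session above, reduced the same way.
puts(-1672853150 % 256) # prints 98
```

Note that Ruby's % always returns a result between 0 and the divisor minus one when the divisor is positive, even for negative hash values, which is why the result is always a valid array index.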
RE: This example in particular
I'm going to change the order of things in a way that I hope makes more sense:
#2. You can think of a hash as a sort of 'fingerprint' of something. The .hash method will create a (generally) unique output for any given input.
#3. In this case, we know that the hash is a number, so we take the modulo of the generated number by the backing array's length in order to find a (hopefully empty) index that is within our storage's bounds.
#1. That's how. A hashing algorithm will return the same output for any given input. The modulo takes this output and turns it into something we can actually use in an array to find something reliably.
#4. Call hash on something. Call it on a string and then modulo it by the length of an array. Try again on another string. Do that again, and use your result to assign something to that array. Do it again to see that the hash and modulo thing will find that value again.
Further Notes:
By itself, the modulo function is not a good way to pick unique indexes for keys. This example is the first step, but especially in a small array, there is still a relatively large chance for the hashes of different keys to modulo into the same number. That's called a collision, and handling those seems to be outside the scope of this question.
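To see a collision happen, here is a quick sketch with a deliberately tiny backing array. The key list and the bucket count of 4 are arbitrary; with five keys and only four buckets, at least two keys must share a bucket no matter what the hash values are.

```ruby
bucket_count = 4
keys = %w[apple banana cherry date elderberry]

# Group the keys by the bucket index each one maps to.
by_bucket = keys.group_by { |key| key.hash % bucket_count }

by_bucket.each do |index, names|
  puts "bucket #{index}: #{names.join(', ')}"
end

# Any bucket holding more than one key is a collision.
collisions = by_bucket.select { |_index, names| names.size > 1 }
puts "colliding buckets: #{collisions.keys.inspect}"
```

Which keys collide will differ between runs, because Ruby seeds String#hash differently for each process.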
Are these methods of creating an empty Ruby Hash different? If so how?
myHash = Hash.new
myHash = {}
I'd just like a solid understanding of memory management in Ruby.
There are many ways you can create a Hash object in Ruby, though the end result is the same sort of object:
hash = { }
hash = Hash.new
hash = Hash[]
hash = some_object.to_h
hash = YAML.load("--- {}\n\n") # needs require 'yaml' first
As far as memory considerations go, an empty Hash is significantly smaller than one with even a singular value in it. Arrays tend to be smaller than Hashes at small sizes, but will be more efficient at larger scales.
In practice, though, the important thing to remember in Ruby is that every time you create an object it costs you something, even if it's only an infinitesimal amount. These little hits add up if you're needlessly creating billions of objects.
Generally you should avoid creating structures that will not be used, and instead create them on demand if that wouldn't complicate things needlessly. For example, a typical pattern is:
def cache
  @cache ||= { }
end
Until this method is called, the cache Hash is never defined. The memory savings in this instance is nearly insignificant, but if that was loading a large configuration file or importing several hundred MB of data from a database you can imagine the savings would be significant in those instances where that data is not exercised.
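A slightly fuller sketch of that pattern, with the expensive work made visible: the Settings class, its load_data method, and the load_count counter are invented for the illustration, with load_data standing in for reading a big file or querying a database.

```ruby
class Settings
  attr_reader :load_count

  def initialize
    @load_count = 0
  end

  # Builds the data on first use and reuses it afterwards.
  def cache
    @cache ||= load_data
  end

  private

  # Stand-in for an expensive load.
  def load_data
    @load_count += 1
    { 'mode' => 'test' }
  end
end

settings = Settings.new
puts settings.load_count # nothing loaded yet
settings.cache
settings.cache
puts settings.load_count # loaded exactly once
```

One thing to keep in mind with ||= is that it re-runs the right-hand side whenever the stored value is nil or false, so it suits caches whose values are always truthy, like a Hash.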
The two methods are exactly equivalent.
As is mentioned above, the two are operationally equivalent. If you're referring to the standard MRI / YARV; perhaps this thread would help: http://www.ruby-forum.com/topic/215163#new.
With the Hash.new syntax you can specify what to do when some key is absent from the hash (the default behaviour). With the {} syntax it takes another step.
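A short sketch of that difference, using only core Hash methods; the variable names are arbitrary.

```ruby
# Hash.new takes the default value directly.
counts = Hash.new(0)
counts[:apples] += 1
puts counts[:pears]      # missing key falls back to 0

# With a literal, the default has to be set in a second step.
literal = {}
literal.default = 0
puts literal[:pears]

# Hash.new also accepts a block, which runs for each missing key;
# here it stores a fresh array under that key.
lists = Hash.new { |hash, key| hash[key] = [] }
lists[:fruit] << 'apple'
puts lists.inspect
```

Reading a missing key through a plain default does not store it, so `counts.key?(:pears)` stays false; the block form above stores the key because the block assigns it explicitly.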
my_hash = Hash[]
is another way of creating a hash; the [] method takes an even number of arguments, read as alternating keys and values.
my_hash = Hash[:a, 1, :b, 2]
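For reference, a few forms of Hash[] side by side, all using core Ruby only:

```ruby
# Alternating keys and values.
flat = Hash[:a, 1, :b, 2]

# A single array of [key, value] pairs works too.
pairs = Hash[[[:a, 1], [:b, 2]]]

# The more common spelling for pairs these days.
converted = [[:a, 1], [:b, 2]].to_h

puts flat == pairs && pairs == converted

# An odd number of arguments is rejected.
begin
  Hash[:a, 1, :b]
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```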
This has nothing to do with memory management.
1. In Ruby, can I do something C-like, like this (with my made-up operator '&'): a = [1,2,3,4] and b = &a[2], b => [3,4], and if I set b[0] = 99, a => [1,2,99,4]?
2. If the elements of an array are integers, does Ruby necessarily store them consecutively in a contiguous part of memory? I'm guessing "no", that only addresses are stored, integers being objects, like everything else in Ruby.
3. If the answer to #2 is "yes" (which I doubt), is there a way to efficiently shift blocks of memory, as one can do in C, for example?
There is no such functionality built into Ruby (Ruby arrays are not built of cons cells, and taking the address is much lower level than Ruby operates), though honestly it would not be hard to write something like that.
To answer the second question: It wouldn't necessarily be a contiguous array of integers. MRI treats integers as immediate values (with the least significant bit as a flag indicating whether a word represents an integer or an object address), so it would probably store it that way. Other implementations do it their own way.
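On the first question, a rough sketch of what "writing something like that" could look like: a slice such as a[2..-1] behaves as an independent copy, while a small wrapper object can forward reads and writes to the original array at an offset. The ArrayView class is made up for this example and handles only non-negative indices.

```ruby
a = [1, 2, 3, 4]

# A slice is a separate array: writing to it leaves a alone.
b = a[2..-1]
b[0] = 99
puts a.inspect # a is unchanged

# Hypothetical view that redirects indexing into the original array.
class ArrayView
  def initialize(array, offset)
    @array = array
    @offset = offset
  end

  def [](index)
    @array[@offset + index]
  end

  def []=(index, value)
    @array[@offset + index] = value
  end

  # Snapshot of the elements the view currently covers.
  def to_a
    @array[@offset..-1]
  end
end

view = ArrayView.new(a, 2)
view[0] = 99
puts a.inspect # the write went through to a
```

This gives the aliasing behaviour from the question without any pointer arithmetic, at the cost of one extra method call per access.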