If I run the following one-line ruby script multiple times, it produces a different output value each time.
puts "This is a string".hash
What's going on here? How should I change it to get a consistent, reproducible value from .hash for any given input string?
Edit: The "possible duplicate" suggests other hashing methods. I'm trying to reproduce the behavior of another script I have no control over that uses .hash and gets consistent results. Changing hashing methods is not an option.
Edit #2: As noted in another comment below, the other script whose behavior I want to reproduce is inside an .exe wrapper. It dates from 2006, which means the Ruby version must be 1.8.5 or earlier. Did the #hash method work differently in earlier versions of Ruby, and if so, has anyone produced a script that replicates the behavior of those earlier versions? (It can be by a different name.)
What's going on here?
#hash should be different for different objects and the same for equal objects during the lifetime of the program. There is absolutely no guarantee whatsoever about what the value is across different invocations of the program.
The documentation is very explicit here (bold emphasis mine):
The hash value for an object may not be identical across invocations or implementations of Ruby. If you need a stable identifier across Ruby invocations and implementations you will need to generate one with a custom method.
[Note: for some reason, the documentation for current versions of Ruby isn't rendered correctly on ruby-doc.org. It is identical in the current master branch, though.]
How should I change it to get a consistent, reproducible value from .hash for any given input string?
Not use it.
I think it might be helpful to understand what #hash is for. It is used to sort a Ruby object into a specific bucket of a Hash data structure - or, alternatively, to include it in a Set, but that is an implementation detail, because Ruby Sets are implemented "on top" of a Hash. It is not meant to digest a value. Once you know that, it becomes apparent that #hash does not need to satisfy the following constraints:
Minimizing collisions - occasional collisions are fine, since a bucket in a Hash can fall back to a linear search when it holds multiple items
Stability across lifetimes of the virtual machine - not required, because hashes are recomputed anew on every run, even when you marshal data
It should, however, satisfy the following constraints:
Stable within the same lifetime of the VM - otherwise an item already stored in a Hash could end up in the wrong bucket and become unreachable, and a Hash has no way to "migrate" it. This is why strings get frozen when used as Hash keys
Fast to compute
Fit into the arbitrary "key size" used by the Ruby Hash buckets (in MRI it is the size of st_index_t I believe)
The second requirement (being fast to compute) can be satisfied in multiple ways. For example, by using a faster hashing function, or by caching "arbitrarily" computed hash values for, say, Strings and reusing the cached value when a given String is a duplicate of one already seen. Another approach - which is also sometimes applied - is to derive the hash value from the Ruby object ID, which by definition changes across runs of the virtual machine.
So indeed, what Jörg said: for your purpose the hash() function is not a good fit, because it is made for a different use case. There are plenty of alternatives though - the usual SHAs, MurmurHash, xxHash and so on - which might satisfy your requirements and are guaranteed to be content-derived.
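If what you actually need is a stable, content-derived value, a digest from the standard library is the usual route. A minimal sketch (truncating the SHA-256 digest to 64 bits is just one arbitrary choice here, not part of any standard):

require 'digest'

# Stable, content-derived replacement for String#hash: same value on every run,
# every machine, every Ruby implementation.
def stable_hash(str)
  Digest::SHA256.digest(str).unpack1("Q>")  # first 64 bits of the digest as an integer
end

puts stable_hash("This is a string")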
Related
For example, "Consistent hashing" and "Perfect hash function", in wikipedia, I click "hashing" and the link direct to "hash function", so it seems that they have the same meaning, but why does another exist? And is there any difference when using "hashing" or "hash function"? And is it ok to call "consistent hashing" as "consistent hash function"? Thanks!
A hash function takes some input data (typically a bunch of binary bytes, but could be anything - whatever you make it to) and calculates a hash value, which is typically an integer number (but, again, can be anything). The process of doing this is called hashing.
The hash value is always the same size, no matter what the input looks like. Well, I suppose you could make a hash function that has a variable-size output, but I haven't seen one in the wild yet; it wouldn't be very practical. Thus, by its very nature, hashing is usually a one-way calculation. You can't normally get the original data back from the hash value, because there are many more possible input data combinations than there are possible hash values.
The main advantages are:
The hash value is always the same size
The same input will always generate the same output.
If it's a good hash function, different inputs will usually generate different outputs, but it's still possible that two different inputs generate the same output (this is called a hash collision).
If you have a cryptographic hash function you also get one more advantage:
From having only the hash value, it is computationally infeasible to come up with input data that would hash to this value. Not just the original input data - any input data at all that hashes to the given output value is impossible to find in a useful timeframe.
The results of a hash function can be used in various ways. As mentioned in other answers, hash tables are one common use-case. Verifying data integrity is another case - for example, you download a file, then hash it, then check the hash value against the value that was specified in the webpage where you downloaded the file from. If they don't match, the file was not downloaded correctly. If you combine hash values with public-key cryptography you can get digital signatures. And I'm sure there are other uses to which the principle can be put.
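For instance, the download-verification case might look like this in Ruby (the file names are made up for the example, and the published checksum is assumed to be in the usual sha256sum output format):

require 'digest'

expected = File.read("release.tar.gz.sha256").strip.split.first  # checksum published with the file
actual   = Digest::SHA256.file("release.tar.gz").hexdigest       # hash the downloaded file
puts(actual == expected ? "download OK" : "download corrupted")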
You can write a hash function, and what it does is hash keys to bins.
In other words, the hash function does the hashing.
I hope that clarifies it.
A hash table is a data structure in which values are mapped to particular keys for faster access. The process of populating this data structure is known as hashing.
To do the hashing, you need a function that provides the logic for mapping keys to positions in the table. That function is the hash function.
I hope this clarifies your doubt.
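As a toy illustration (deliberately crude, not a real hash table implementation), the key-to-bin mapping and the act of populating the bins might look like this in Ruby:

NUM_BINS = 8
bins = Array.new(NUM_BINS) { [] }

hash_function = ->(key) { key.each_byte.sum % NUM_BINS }  # maps a key to a bin index

%w[alice bob carol].each do |key|
  bins[hash_function.call(key)] << key  # populating the bins is the "hashing"
end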
I'm trying to create a game, which I want to always run the same given the same seed. That means that random events - be them what they may - will always be the same for two players using the same seed.
However, given the user's ability to save and load the game, Ruby's Random would reset every time the save loaded, making the whole principle void if two players save and load at different points.
The only solution I have imagined for this is, whenever a save file is loaded, to generate the same count of random numbers as had been generated before the save, thus getting Ruby's Random back to the state it was in. However, to do that I'd need to extend it so a counter is updated every time a random number is generated.
Does anyone know how to do that or has a better way to restore the state of Ruby's Random?
PS: I cannot use an instance of Random (Random.new) and Marshal it. I have to use Ruby's default.
Sounds like Marshal.dump/Marshal.load may be exactly what you want. The Random class documentation explicitly states "Random objects can be marshaled, allowing sequences to be saved and resumed."
You may still have problems with synchronization across games, since different user-based decisions can take you through different logic paths and thus use the sequence of random numbers in entirely different ways.
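As a sketch of what the documentation describes - note it uses a Random instance, which the asker has ruled out, so take it purely as an illustration of the mechanism:

rng = Random.new(1234)
3.times { rng.rand(100) }            # advance the generator

saved = Marshal.dump(rng)            # this blob could go into the save file
restored = Marshal.load(saved)

restored.rand(100) == rng.rand(100)  # => true, the two sequences continue in lockstep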
I'd suggest maybe saving the 'current' data to a file when the user decides to save (or when the program closes) depending on what you prefer.
This can be done using the File class in Ruby.
This would mean you'd need to keep track of turns and pass that along with the save data. Or you could loop through the data in the file and find out how many turns have occurred that way, I suppose.
So you'd have something like:
def load_game(load_path)
  data = File.read(load_path)
  # What you do with `data` here depends on how you decide to store it in save_game.
  data
end

def save_game(save_path, data)
  File.write(save_path, data)
end
Haven't really tried the above code, so it could have bad syntax or such. It's mainly just the concept I'm trying to get across.
Hopefully that helps?
There are many generators that compute each random number in the sequence from the previous value alone, so if you used one of those you need only save the last random number as part of the state of the game. An example is a basic linear congruential generator, which has the form:
z(n+1) = (az(n) + b) mod c
where a, b and c are typically large (known) constants, and z(0) is the seed.
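For illustration, a toy Ruby version might look like the following (the constants below are the widely quoted ANSI C rand() parameters; any well-known (a, b, c) triple would do). Since the whole state is the last value produced, saving that one number with the game is enough to resume the sequence:

class TinyLCG
  def initialize(seed)
    @state = seed
  end

  def next
    @state = (1_103_515_245 * @state + 12_345) % 2_147_483_648  # z(n+1) = (a*z(n) + b) mod c
  end
end

rng = TinyLCG.new(42)
first_run = 5.times.map { rng.next }

resumed = TinyLCG.new(first_run.last)  # "load" the game from the saved last value
resumed.next == rng.next               # => true, the sequence picks up where it left off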
An arguably better one is the so-called "multiply-with-carry" method.
I am reading about cuckoo hashing from Pagh and Rodler and I can't understand the meaning of this paragraph:
It may happen that this process loops, as shown in Fig. 1(b).
Therefore the number of iterations is bounded by a value “MaxLoop” to
be specified in Section 2.3. If this number of iterations is reached,
we rehash the keys in the tables using new hash functions, and try
once again to accommodate the nestless key. There is no need to
allocate new tables for the rehashing: We may simply run through the
tables to delete and perform the usual insertion procedure on all keys
found not to be at their intended position in the table.
What does it mean by using new hash functions?
In the insert algorithm the table is resized. Are we supposed to have a "pool" of hash functions to use somehow? How do we create this pool?
Yes, they're expecting new hash functions, just like they say. Fortunately, they don't require a pile of new algorithms, just slightly different hashing behavior on your current data set.
Take a look at section 2.1 of the paper, and then Appendix A. It discusses the construction of random universal hash functions.
A simple, hopefully illustrative example is to suppose you've got some normal hash function you like, that operates on blocks of bytes. We'll call it H(x). You want to use it to produce a family of new, slightly different hash functions H_n(x). Well, if H(x) is good, and your requirements are weak, you can just define H_n(x) = H(concat(n,x)). You don't have nice strong guarantees about the behaviors of H_n(x), but you'd expect most of them to be different.
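In Ruby, a hedged sketch of that H_n(x) = H(concat(n, x)) idea - with SHA-256 standing in for "some hash function you like" - could be:

require 'digest'

# Each value of n selects a different member of the family; "rehashing with new
# hash functions" then just means picking a new n.
def h_n(n, key)
  Digest::SHA256.digest("#{n}:#{key}").unpack1("Q>")  # keep 64 bits of the digest
end

h_n(0, "apple")  # one hash function for the key...
h_n(1, "apple")  # ...and a different one for the same key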
My question may sound a bit strange:
I know that hash functions used for cryptography must change their output greatly even when a single bit of the input changes, due to the avalanche effect.
Does a (perhaps less "efficient") hash exist with the opposite characteristic, that is, one that produces similar outputs for similar strings?
If the answer is yes, can you tell me whether there is a way to control this aspect (the definition of "similar") using known relations between the characters of a text?
These kinds of hash functions are called locality-sensitive.
Yes, such hashes exist. As you expect, the definition of "similar" must be given, and it depends on your application - I can just give an example.
Suppose your strings are domains and you want to hash all subdomains into one bucket. Then you can reverse strings like:
finance.yahoo.com => com.yahoo.finance
sport.yahoo.com => com.yahoo.sport
user.mail.yahoo.com => com.yahoo.mail.user
and hash only the first two parts: com.yahoo, dropping the rest. Your hash function can look like this (in python):
def hash(url):
    return any_other_hash_function(".".join(url.split(".")[::-1][:2]))
Your question is not strange at all. You can find a similar approach in Google's MapReduce or BigTable (and in many, many other systems) for keeping similar things together, so that computations can be sped up.
The example I gave is for strings, but you can use similar approach for other objects. Here the idea is just to separate items into groups and hash the group id (high-level domain name).
Is there a way to test the quality of a hash function? I want to have a good spread when used in the hash table, and it would be great if this were verifiable in a unit test.
EDIT: For clarification, my problem was that I have used long values in Java in such a way that the first 32 bits encoded an ID and the second 32 bits encoded another ID. Unfortunately Java's hash of long values just XORs the first 32 bits with the second 32 bits, which in my case led to very poor performance when used in a HashMap. So I need a different hash, and would like to have a unit test so that this problem cannot creep in any more.
You have to test your hash function using data drawn from the same (or similar) distribution that you expect it to work on. When looking at hash functions on 64-bit longs, the default Java hash function is excellent if the input values are drawn uniformly from all possible long values.
However, you've mentioned that your application uses the long to store essentially two independent 32-bit values. Try to generate a sample of values similar to the ones you expect to actually use, and then test with that.
For the test itself, take your sample input values, hash each one and put the results into a set. Count the size of the resulting set and compare it to the size of the input set, and this will tell you the number of collisions your hash function is generating.
For your particular application, instead of simply XORing them together, try combining the 32-bit values in ways a typical good hash function would combine two independent ints, i.e. multiply by a prime and add.
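Counting collisions on a representative sample is easy to automate. A language-agnostic sketch, written here in Ruby for brevity (the two lambdas are stand-ins that mimic the XOR folding described in the question and a multiply-by-a-prime combiner, not Java's actual implementations):

xor_hash   = ->(hi, lo) { hi ^ lo }                       # the XOR fold that caused trouble
mixed_hash = ->(hi, lo) { (31 * hi + lo) & 0xFFFFFFFF }   # multiply by a prime, then add

# Sample pairs drawn from a distribution similar to the real IDs
# (the small ranges here are just for the example).
sample = Array.new(10_000) { [rand(1_000), rand(1_000)] }.uniq

{ "xor" => xor_hash, "mixed" => mixed_hash }.each do |name, fn|
  hashes = sample.map { |hi, lo| fn.call(hi, lo) }
  puts "#{name}: #{hashes.size - hashes.uniq.size} collisions out of #{hashes.size}"
end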
First I think you have to define what you mean by a good spread to yourself. Do you mean a good spread for all possible input, or just a good spread for likely input?
For example, if you're hashing strings that represent proper full (first+last) names, you're not going to likely care about how things with the numerical ASCII characters hash.
As for testing, your best bet is probably to get a huge or random input set of data you expect, push it through the hash function, and see how the spread ends up. There's not likely going to be a magic program that can say "Yes, this is a good hash function for your use case." However, if you can programmatically generate the input data, you should easily be able to create a unit test that generates a significant amount of it and then verifies that the spread is within your definition of good.
Edit: In your case with a 64 bit long, is there even really a reason to use a hash map? Why not just use a balanced tree directly, and use the long as the key directly rather than rehashing it? You pay a little penalty in overall node size (2x the size for the key value), but may end up saving it in performance.
If you're using a chaining hash table, what you really care about is the number of collisions. This would be trivial to implement as a simple counter on your hash table. Every time an item is inserted and the table has to chain, increment a chain counter. A better hashing algorithm will result in a lower number of collisions. A good general-purpose table hashing function to check out is djb2.
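For reference, a quick Ruby rendition of djb2 (one common formulation: start at 5381, then hash = hash * 33 + byte for each byte), masked to 32 bits, might look like:

def djb2(str)
  str.each_byte.inject(5381) { |h, byte| ((h * 33) + byte) & 0xFFFFFFFF }
end

djb2("hello")  # => a stable 32-bit value for the string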
Based on your clarification:
I have used long values in Java in such a way that the first 32 bits encoded an ID and the second 32 bits encoded another ID. Unfortunately Java's hash of long values just XORs the first 32 bits with the second 32 bits, which in my case led to very poor performance when used in a HashMap.
it appears you have some unhappy "resonances" between the way you assign the two ID values and the sizes of your HashMap instances.
Are you explicitly sizing your maps, or using the defaults? A quick-and-dirty check seems to indicate that a HashMap<Long,String> starts with a 16-bucket structure and doubles on overflow. That would mean that only the low-order bits of the ID values actually participate in hash bucket selection. You could try using one of the constructors that takes an initial-size parameter and create your maps with a prime initial size.
Alternately, Dave L's suggestion of defining your own hashing of long keys would allow you to avoid the low-bit-dependency problem.
Another way to look at this is that you're using a primitive type (long) as a way to avoid defining a real class. I'd suggest looking at the benefits you could achieve by defining the business classes and then implementing hash-coding, equality, and other methods as appropriate on your own classes to manage this issue.