My question may sound a bit strange:
I know that cryptographic hash functions must have the characteristic that even a one-bit change in the input greatly changes the output, due to the avalanche effect.
Does an inefficient hash exist with the characteristic that it produces similar outputs for similar strings?
If the answer is yes, can you tell me whether there is a way to control this aspect (the definition of "similar") using known, underlying relations between the characters of a text?
This kind of hash function is called locality-sensitive.
Yes, such hashes exist. As you expect, the definition of "similar" must be given, and it depends on your application - I can only give an example.
Suppose your strings are domain names and you want to hash all subdomains into one bucket. Then you can reverse the strings, like:
finance.yahoo.com => com.yahoo.finance
sport.yahoo.com => com.yahoo.sport
user.mail.yahoo.com => com.yahoo.mail.user
and hash only the first two parts: com.yahoo, dropping the rest. Your hash function can look like this (in Python):
def hash(url):
    # reverse the dot-separated parts and keep only the top two, e.g. "com.yahoo"
    return any_other_hash_function(".".join(url.split(".")[::-1][:2]))
Your question is not strange at all. You can find a similar approach in Google's MapReduce or BigTable (and in many, many other systems) for keeping similar things together, so that computations can be sped up.
The example I gave is for strings, but you can use a similar approach for other objects. The idea is simply to separate items into groups and hash the group ID (here, the com.yahoo part of the domain name).
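To make the grouping idea concrete beyond domain names, here is a small Python sketch; the function names group_hash and domain_key are just illustrative, not from any library. It hashes only a caller-supplied group identifier, so anything in the same group collides on purpose:
import hashlib

def group_hash(item, group_key):
    # hash only the group identifier, so every item in the same group
    # lands in the same bucket
    return hashlib.sha256(group_key(item).encode("utf-8")).hexdigest()

def domain_key(url):
    # keep only the top two labels, e.g. "finance.yahoo.com" -> "com.yahoo"
    return ".".join(url.split(".")[::-1][:2])

print(group_hash("finance.yahoo.com", domain_key) ==
      group_hash("sport.yahoo.com", domain_key))   # True
Anything that varies only within a group (the subdomain, in this example) does not affect the hash at all.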
A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user-friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32-bit ID from the database. We then inserted it into the center bits of a 64-bit random integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9, skipping easily confused characters such as L, l, 1, O, 0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a GUID and looked random, even though it absolutely wasn't.
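For illustration, here is a rough Python sketch of the scheme as I understand it. The exact bit layout (16 random high bits, 32 sequential bits, 16 random low bits) and the 54-character alphabet are my guesses at what the description above implies, not the original code:
import secrets

ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"  # 54 chars, confusables removed

def encode(n, alphabet=ALPHABET):
    # base-N encode a non-negative integer using the given alphabet
    base, out = len(alphabet), ""
    while True:
        n, r = divmod(n, base)
        out = alphabet[r] + out
        if n == 0:
            return out

def make_code(sequential_id):
    # place the sequential 32-bit ID in the middle bits of a 64-bit value
    # whose high and low 16 bits are random (the split is an assumption)
    hi = secrets.randbits(16)
    lo = secrets.randbits(16)
    value = (hi << 48) | ((sequential_id & 0xFFFFFFFF) << 16) | lo
    return encode(value)

print(make_code(12345))  # prints a short, random-looking code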
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like: is it a noun or a verb?). I think you can look around the intertubes for a copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so that obscure words are removed and words that are too long are excluded.
Then my generation algorithm would pick two words from the list, concatenate them, and append a random 3-digit number.
I could also randomize the word-selection pattern between verbs and nouns, like:
eatCake778
pickBasket524
rideFlyer113
etc..
The case needn't be camel case; you can randomize that as well. You can also randomize the placement of the number and of the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to measure how often the algorithm actually collides. If the collision rate was high, I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing, etc.).
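A minimal Python sketch of such a generator, with tiny placeholder word lists standing in for a real filtered dictionary:
import secrets

# Tiny placeholder lists; in practice these would come from a filtered
# frequency list of common English verbs and nouns.
VERBS = ["eat", "pick", "ride", "throw", "carry"]
NOUNS = ["cake", "basket", "flyer", "stone", "lamp"]

def generate_code():
    # pick a verb and a noun at random and append a 3-digit number
    verb = secrets.choice(VERBS)
    noun = secrets.choice(NOUNS)
    number = secrets.randbelow(900) + 100   # 100..999
    return f"{verb}{noun.capitalize()}{number}"

print(generate_code())  # e.g. 'eatCake778'
With, say, 1,000 verbs, 1,000 nouns, and a 3-digit number there are on the order of 9 x 10^8 combinations, which is the kind of figure to compare against your expected code volume when testing the collision rate.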
In .NET you can use the RNGCryptoServiceProvider method GetBytes(), which will "fill an array of bytes with a cryptographically strong sequence of random values" (from the MS documentation).
using System.Security.Cryptography; // namespace containing RNGCryptoServiceProvider
byte[] randomBytes = new byte[4];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes); // fills randomBytes with cryptographically strong random values
You can increase the length of the byte array and pluck out the character values you want to allow.
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (the following questions come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
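As a sketch of the "random key plus explicit uniqueness check" route, here it is in Python; existing_keys is a stand-in for whatever database lookup your application would actually use:
import secrets

ALPHABET = "abcdefghjkmnpqrstuvwxyz23456789"   # lowercase, confusables removed

def new_key(existing_keys, length=6):
    # generate a random key and retry until it is not already in use;
    # existing_keys stands in for a real database uniqueness check
    while True:
        key = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if key not in existing_keys:
            existing_keys.add(key)
            return key

used = set()
print(new_key(used), new_key(used))
Six characters from a 31-symbol alphabet give roughly 31^6, about 9 x 10^8, possibilities; whether that is enough guessing resistance depends on the access-control questions above.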
If by user-friendly you mean that a user could type the answer in, then I think you would want to look in a different direction. I've seen and done implementations of initial random passwords that pick random words and numbers to build an easier, less error-prone string.
If, though, you're looking for a way to encode a random code in a URL string (an issue I've dealt with for a while), then what I have done is use Base64-encoded GUIDs.
You could load your list of words, as chakrit suggested, into a data table or XML file with a unique sequential key. When getting your random word, use a random number generator to determine which words to fetch by their key. If you concatenate two of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.
If I run the following one-line Ruby script multiple times, it produces a different output value each time.
puts "This is a string".hash
What's going on here? How should I change it to get a consistent, reproducible value from .hash for any given input string?
Edit: The "possible duplicate" suggests other hashing methods. I'm trying to reproduce the behavior of another script I have no control over that uses .hash and gets consistent results. Changing hashing methods is not an option.
Edit #2: As noted in another comment below, the other script whose behavior I want to reproduce is inside an .exe wrapper. It dates from 2006, which means the Ruby version must be 1.8.5 or earlier. Did the #hash method work differently in earlier versions of Ruby, and if so, has anyone produced a script that replicates the behavior of those earlier versions? (It can be by a different name.)
What's going on here?
#hash should be different for different objects and the same for equal objects during the lifetime of the program. There is absolutely no guarantee whatsoever about what the value is across different invocations of the program.
The documentation is very explicit here (bold emphasis mine):
The hash value for an object may not be identical across invocations or implementations of Ruby. If you need a stable identifier across Ruby invocations and implementations you will need to generate one with a custom method.
[Note: for some reason, the documentation for current versions of Ruby isn't rendered correctly on ruby-doc.org. It is identical in the current master branch, though.]
How should I change it to get a consistent, reproducible value from .hash for any given input string?
Not use it.
I think it might be helpful to understand what #hash is for. It is used to place a Ruby object into a specific bucket of a Hash data structure - or, alternatively, to include it in a Set, but that is an implementation detail, because Ruby Sets are implemented "on top" of a Hash. It is not used to digest a value. Once you know that, it becomes apparent that #hash need not satisfy the following constraints:
Minimize collisions - it is fine to have collisions sometimes, since a bucket in a Hash can degrade into a search when it holds multiple items
Stable across lifetimes of the virtual machine - not required, because hashes are "reconstructed" anew every time, even when you do marshaling
It should satisfy the following constraints:
Stable within the same lifetime of a VM - otherwise the item might have to be "migrated" to a different bucket in a Hash, which is impossible to achieve. This is why strings get frozen when used as Hash keys
Fast to compute
Fit into the arbitrary "key size" used by the Ruby Hash buckets (in MRI it is the size of st_index_t I believe)
The second requirement (being fast to compute) can be satisfied in multiple ways. For example, it can be satisfied by using a faster hashing function. But it can also be satisfied by looking up an "arbitrary" already-computed hash value for, say, a String and - if this specific String is a duplicate of another - reusing that value. Another approach, which is also sometimes applied, is to derive the hash value from the Ruby object ID, which by definition changes across runs of the virtual machine.
So indeed, as Jörg said: for your purpose the hash() function is not a good fit, because it is made for a different use case. There is a whole range of alternatives, though - the usual SHAs, MurmurHash, xxHash and so on - which might satisfy your requirements and are guaranteed to be content-derived.
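For comparison, here is a minimal sketch of the content-derived approach; it is in Python only because that is convenient for illustration here, and Ruby's standard Digest library offers the same primitives:
import hashlib

def stable_hash(s):
    # content-derived value: identical across runs, machines and versions,
    # unlike the built-in object #hash
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")   # truncate to a 64-bit integer

print(stable_hash("This is a string"))  # the same number every run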
Say I have two sets of items which are similar except for their logical purpose in the program. Is it better programming practice to assign two hashes to them, or should I use only one hash for both purposes?
If you store them in the same hash table, you run the (perhaps small or non-existent) risk of overwriting one with another. Say, for example, you are storing first names and last names (both strings). There could conceivably be one person with the first name "Jones" and another with the last name "Jones".
If the above is not possible, there's no technical reason why you could not use a single hash table. Items that hash to the same value will be stored in the same bucket, along with other items whose different hash values map to the same bucket, but as long as you check for actual equality after a hash collision, you're okay.
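To make the first-name/last-name example concrete, here is a small Python sketch of both options (the values are placeholders):
# Option 1: a single table, with the logical purpose folded into the key,
# so "Jones" the first name can never overwrite "Jones" the last name.
people = {}
people[("first", "Jones")] = "person A"
people[("last", "Jones")] = "person B"

# Option 2: one table per logical purpose.
by_first_name = {"Jones": "person A"}
by_last_name = {"Jones": "person B"}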
That being said, I would still prefer to separate logical items into their own hash tables without a very strong reason to combine them.
The code dealing with them will probably be easier to write and maintain.
It will be easier to debug issues.
Smaller hash tables will likely have fewer items per bucket and improve performance slightly.
If the sets of items are the same, the hashes should be the same as well.
It is like saying that because you can use a wrench to tighten a bolt or to break open a window, it should behave like two different objects. That isn't true, because it is the way you use it that differs, not the object itself.
I am looking for an algorithm, function, or technique that can take a string and convert it to a number. I would like the algorithm or function to have the following properties:
Identical string yields the same calculated value
Similar strings would yield similar values (similar can be defined as similar in meaning or similar in composition)
Capable of handling strings of variable length
I read an article several years ago that gives me hope that this can be achieved. Unfortunately, I have been unable to recall the source of the article.
Similar in composition is pretty easy, I'll let somebody else tackle that.
Similar in meaning is a lot harder, but fun :). I remember reading an article about how a neural network was trained to construct a 2D "semantic meaning graph" of a whole bunch of English words, where the distance between two words represented how "similar" they are in meaning, just by training it on Wikipedia articles.
You could do the same thing but make it one-dimensional; that would give you a single continuous number, where similar words are close to each other.
Non-serious answer: Map everything to 0
Property 1: check. Property 2: check. Property 3: check.
But I figure you want dissimilar strings to get different values, too. The question then is, what is similar and what is not.
Essentially, you are looking for a hash function.
There are a lot of hash functions designed with different objectives. Cryptographic hashes, for example, are pretty expensive to compute, because you want to make it really hard to go backwards or even to predict how a change to the input affects the output. So they try really hard to violate your condition 2. There are also simpler hash functions that mostly try to spread the data; they try to ensure that close input values do not end up close to each other afterwards (but it is okay if the mapping is predictable).
You may want to read up on Wikipedia:
https://en.wikipedia.org/wiki/Hash_function#Finding_similar_substrings
(Yes, it has a section on "Finding similar substrings" via Hashing)
Wikipedia also has a list of hash functions:
https://en.wikipedia.org/wiki/List_of_hash_functions
There are a couple of related techniques for you. For example, MinHash could be used. Here is a MinHash-inspired approach: define a few random permutations of all the letters in your alphabet. Say I have only the letters "abcde" for this example, and I'll use only two lists. Then my lists are:
p1 = "abcde"
p2 = "edcba"
Let f1(str) be the smallest index in p1 of any letter occurring in the test word, and f2(str) the same for p2. So the word "bababa" would map to 0,3; the word "ababab" would too. The word "dada" would map to 0,1, while "ce" maps to 2,0. Note that this mapping is invariant under permutations of the word (because it treats the word as a set of letters), and for long texts it will converge to "0,0". Yet with some fine-tuning it can give you a pretty fast way of finding candidates for closer inspection.
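A short Python sketch of that signature function (the names are illustrative):
P1 = "abcde"
P2 = "edcba"

def signature(word, perms=(P1, P2)):
    # for each permutation, return the smallest index of any letter
    # that occurs in the word (treating the word as a set of letters)
    letters = set(word)
    return tuple(min(p.index(ch) for ch in letters) for p in perms)

print(signature("bababa"))  # (0, 3): 'a' is index 0 in P1, 'b' is index 3 in P2
print(signature("ababab"))  # (0, 3)
print(signature("dada"))    # (0, 1)
print(signature("ce"))      # (2, 0)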
Fuzzy hashing (context triggered piecewise hashing) may be what you are looking for.
Implementation: ssdeep
Explanation of the algorithm: Identifying almost identical files using context triggered piecewise hashing
I think you're probably after a hash function, as numerous posters have said. However, similar in meaning is also possible, after a fashion: use something like Latent Dirichlet Allocation or Latent Semantic Analysis to map your word into a multidimensional space, relative to a model trained on a large collection of text (these pre-trained models can be downloaded if you don't have access to a representative sample of the kind of text you're interested in). If you need a scalar value rather than a multi-dimensional vector (it's hard to tell, since you don't say what you want it for), you could try a number of things, like the probability of the most probable topic, the mean across the dimensions, the index of the most probable topic, and so on.
num = 0
for byte in s.encode("utf-8"):   # s is the input string; iterate over its bytes
    num += byte                  # each byte is already an unsigned integer in Python
This would meet all 3 properties (for #2, it works on the string's binary composition).
Is there a way to test the quality of a hash function? I want to have a good spread when it is used in a hash table, and it would be great if this were verifiable in a unit test.
EDIT: For clarification, my problem was that I have used long values in Java in such a way that the first 32 bits encoded an ID and the second 32 bits encoded another ID. Unfortunately, Java's hash of long values just XORs the first 32 bits with the second 32 bits, which in my case led to very poor performance when used in a HashMap. So I need a different hash, and would like to have a unit test so that this problem cannot creep in any more.
You have to test your hash function using data drawn from the same (or similar) distribution that you expect it to work on. When looking at hash functions on 64-bit longs, the default Java hash function is excellent if the input values are drawn uniformly from all possible long values.
However, you've mentioned that your application uses the long to store essentially two independent 32-bit values. Try to generate a sample of values similar to the ones you expect to actually use, and then test with that.
For the test itself, take your sample input values, hash each one and put the results into a set. Count the size of the resulting set and compare it to the size of the input set, and this will tell you the number of collisions your hash function is generating.
For your particular application, instead of simply XORing them together, try combining the 32-bit values the way a typical good hash function would combine two independent ints, i.e. multiply by a prime and add.
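The collision test described above is only a few lines. Here it is sketched in Python for illustration (it translates directly into a Java unit test), comparing the plain XOR hash against a multiply-by-a-prime-and-add combiner on data that mimics two small IDs packed into one value:
import random

def xor_hash(a, b):
    # what folding a long into an int by XORing its halves amounts to
    return a ^ b

def combined_hash(a, b):
    # multiply one part by a prime, then add the other; mask to 32 bits
    return (a * 1000003 + b) & 0xFFFFFFFF

def collision_count(hash_fn, samples):
    # number of inputs minus number of distinct hash values
    # (duplicate input pairs count as collisions here, for both functions alike)
    return len(samples) - len({hash_fn(a, b) for a, b in samples})

# Data resembling the problem case: both IDs drawn from a small range.
samples = [(random.randrange(1000), random.randrange(1000)) for _ in range(10000)]
print("XOR collisions:     ", collision_count(xor_hash, samples))
print("combined collisions:", collision_count(combined_hash, samples))
The XOR of two numbers below 1,000 is always below 1,024, so the XOR version can produce at most 1,024 distinct values for these 10,000 inputs, while the multiply-and-add version keeps them essentially all distinct apart from duplicate input pairs. That difference is exactly what the set-size comparison exposes.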
First, I think you have to define for yourself what you mean by a good spread. Do you mean a good spread for all possible input, or just a good spread for likely input?
For example, if you're hashing strings that represent proper full (first + last) names, you're not likely to care about how strings of numerical ASCII characters hash.
As for testing, your best bet is probably to get a huge or random set of the input data you expect, push it through the hash function, and see how the spread ends up. There isn't likely to be a magic program that can say "Yes, this is a good hash function for your use case." However, if you can programmatically generate the input data, you should easily be able to create a unit test that generates a significant amount of it and then verifies that the spread is within your definition of good.
Edit: In your case with a 64-bit long, is there even really a reason to use a hash map? Why not just use a balanced tree directly, with the long as the key rather than rehashing it? You pay a small penalty in overall node size (2x the size for the key value), but may end up gaining it back in performance.
If you're using a chaining hash table, what you really care about is the number of collisions. This would be trivial to implement as a simple counter on your hash table: every time an item is inserted and the table has to chain, increment a chain counter. A better hashing algorithm will result in a lower number of collisions. A good general-purpose table hashing function to check out is djb2.
Based on your clarification:
I have used long values in Java in such a way that the first 32 bits encoded an ID and the second 32 bits encoded another ID. Unfortunately, Java's hash of long values just XORs the first 32 bits with the second 32 bits, which in my case led to very poor performance when used in a HashMap.
it appears you have some unhappy "resonances" between the way you assign the two ID values and the sizes of your HashMap instances.
Are you explicitly sizing your maps, or using the defaults? A quick-and-dirty check seems to indicate that a HashMap<Long,String> starts with a 16-bucket structure and doubles on overflow. That would mean that only the low-order bits of the ID values actually participate in the hash-bucket selection. You could try using one of the constructors that takes an initial-size parameter and create your maps with a prime initial size.
Alternatively, Dave L's suggestion of defining your own hashing of long keys would allow you to avoid the low-bit-dependency problem.
Another way to look at this is that you're using a primitive type (long) as a way to avoid defining a real class. I'd suggest looking at the benefits you could achieve by defining the business classes and then implementing hash-coding, equality, and other methods as appropriate on your own classes to manage this issue.