How wide should be random numbers so it is virtually impossible that you repeat two of them? - algorithm

A certain system is supposed to spawn objects with unique IDs. That system will run on different computers with no connection between them, yet no ID collision can happen. The only way to implement this is generating random numbers. How wide should those be so that you can consider it virtually impossible for a collision to ever happen?

This is basically a generalization of the birthday problem.
This probability table can help you to figure out how many bits you are going to need in order to achieve the probability you desire - based on p - desired probability, and #elements that are going to be "hashed" (generated).
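As a rough sketch of the arithmetic behind such a table, the usual birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n uniformly random b-bit values can be evaluated and inverted directly. The function names below are illustrative only, and the result is an approximation that assumes an ideal uniform generator.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n uniform random values of the given width."""
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits)).
    exponent = n * (n - 1) / (2.0 * 2.0 ** bits)
    return -math.expm1(-exponent)


def bits_needed(n: int, p: float) -> int:
    """Smallest width (in bits) that keeps the approximate collision probability at or below p."""
    # Invert the approximation: 2^bits >= n(n-1) / (2 * -ln(1 - p)).
    space = n * (n - 1) / (2.0 * -math.log1p(-p))
    return max(1, math.ceil(math.log2(space)))
```

For example, `bits_needed(10**9, 1e-9)` estimates the width needed to generate a billion ids with a one-in-a-billion overall collision risk.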

In your question you mentioned:
The only way to implement this is generating random numbers
No, this is NOT the only way to do this. In fact this is one of the ways NOT to do it.
There is already a well known and widely used method for doing something like this that you yourself are using right now: adding a prefix (or postfix, doesn't matter). The prefix is called many things by many systems: Ethernet and WiFi call it vendor id. In TCP/IP it's called a subnet (technically it's called a "network").
The idea is simple. Say for example you want to use a 32 bit number for your global id. Reserve something like 8 bits to identify which system it's on and the rest can simply be sequential numbers within each system.
Stealing syntax from IPv4 for a moment: say system 1 has an id of 1 and system 2 has an id of 2. Therefore ids from system 1 will be in the range 1.0.0.0 - 1.255.255.255 and ids from system 2 will be in the range 2.0.0.0 - 2.255.255.255.
That's just an example. There's nothing that forces you to waste so many bits for system id. In fact, IPv4 is itself no longer organized by byte boundaries. You can instead use 4 bits as system id and 28 bits for individual ids. You can also use 64 bits if you need more ids or go the IPv6 route and use 128 bits (in which case you can definitely afford to waste a byte or two for system id).
Because each system cannot generate an id that's generated by another system no collision will ever occur before the ids overflow.
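The prefix scheme described above can be sketched as follows. This is a minimal illustration, assuming an 8-bit system id and a 24-bit local counter as in the example; the class and method names are made up for this sketch, and a real system would also need to persist the counter across restarts.

```python
import itertools


class PrefixedIdGenerator:
    """Hands out ids whose high bits identify the system and whose low bits count up locally."""

    def __init__(self, system_id: int, system_bits: int = 8, total_bits: int = 32):
        if not 0 <= system_id < (1 << system_bits):
            raise ValueError("system_id does not fit in system_bits")
        self.counter_bits = total_bits - system_bits
        # The prefix occupies the top system_bits of every id this system produces.
        self.prefix = system_id << self.counter_bits
        self._counter = itertools.count()

    def next_id(self) -> int:
        n = next(self._counter)
        if n >= (1 << self.counter_bits):
            # Past this point ids would spill into another system's range.
            raise OverflowError("local counter exhausted")
        return self.prefix | n
```

Two generators created with different system ids can never return the same value, because their ids differ in the prefix bits regardless of the counter.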
If you need the ids to look "random", use a hashing algorithm. A CRC is guaranteed never to collide if your data is of a fixed size no larger than the CRC itself, because for such inputs it is a one-to-one mapping. SHA1 carries no such mathematical guarantee, but it is 160 bits wide, so if your id generation system is less than 160 bits internally, a collision between the SHA1 hashes of two ids is not a practical concern. The caveat is that you must use all 160 bits: truncating the SHA1 makes collisions far more likely. For 32 bit ids CRC32 is a perfect fit, while there's also CRC64 if you want to generate 64 bit ids. Keep in mind that a CRC only makes ids look random; it is easy to invert, so it does not make them hard to guess.
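As a small-scale illustration of the CRC32 suggestion, the sketch below scrambles a 32-bit sequential id with `zlib.crc32` from the Python standard library. The helper name `scramble_id` is an assumption for this example. Checking all 2^32 inputs is impractical here, so the usage note only checks a small range.

```python
import zlib


def scramble_id(seq_id: int) -> int:
    """Map a 32-bit sequential id to a random-looking 32-bit value using CRC-32."""
    # Always hash exactly 4 bytes: the no-collision property relies on a fixed input size.
    return zlib.crc32(seq_id.to_bytes(4, "big"))
```

Scrambling the first 65536 sequential ids this way yields 65536 distinct outputs, consistent with CRC-32 being one-to-one on fixed 4-byte inputs.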

GUIDs are 128 bits wide (a value space of 2^128), and the likelihood of collision is negligible.

Related

How can I generate a unique identifier that is apparently not progressive [duplicate]

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to less than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32bit id from the database. We then inserted it into the center bits of a 64bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9 skipping easily confused characters such as L,l,1,O,0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
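The scheme described above can be sketched roughly as follows. The exact 54-character table is not given in the post, so the alphabet here (letters without I, L, O in either case, plus digits 2-9) is an assumption, as are the function names and the 16/32/16 bit split.

```python
import secrets

# Hypothetical 54-character table: no I/L/O in either case, no 0/1.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ" + "abcdefghjkmnpqrstuvwxyz" + "23456789"
assert len(ALPHABET) == 54


def encode_base54(value: int) -> str:
    """Encode a non-negative integer using the lookup table, most significant digit first."""
    digits = []
    while True:
        value, rem = divmod(value, 54)
        digits.append(ALPHABET[rem])
        if value == 0:
            break
    return "".join(reversed(digits))


def make_code(seq_id: int) -> str:
    """Place a sequential 32-bit id in the center bits of a 64-bit value with random outer bits."""
    if not 0 <= seq_id < (1 << 32):
        raise ValueError("seq_id must fit in 32 bits")
    high = secrets.randbits(16)  # random top 16 bits
    low = secrets.randbits(16)   # random bottom 16 bits
    return encode_base54((high << 48) | (seq_id << 16) | low)


def decode_id(code: str) -> int:
    """Recover the sequential id from a code produced by make_code."""
    value = 0
    for ch in code:
        value = value * 54 + ALPHABET.index(ch)
    return (value >> 16) & 0xFFFFFFFF
```

Because the sequential id is embedded unchanged, two codes can only be equal if they carry the same id, so uniqueness comes from the database sequence rather than from the random bits.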
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so obscure words are removed and that words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list, concatenate them, and add a random 3-digit number.
I can also randomize word selection pattern between verb/nouns like
eatCake778
pickBasket524
rideFlyer113
etc..
the case needn't be camel casing, you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
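A minimal sketch of this generator, assuming tiny placeholder word lists (a real list would come from a filtered dictionary as described above) and using the standard `secrets` module for the random choices:

```python
import secrets

# Placeholder word lists for illustration only.
VERBS = ["eat", "pick", "ride", "find", "make"]
NOUNS = ["Cake", "Basket", "Flyer", "River", "Stone"]


def word_code() -> str:
    """Build a code of the form verb + Noun + three digits."""
    verb = secrets.choice(VERBS)
    noun = secrets.choice(NOUNS)
    number = secrets.randbelow(1000)  # 0..999, zero-padded below
    return f"{verb}{noun}{number:03d}"
```

With these lists there are only 5 * 5 * 1000 = 25000 possible codes, so collisions would be frequent; the size of the word lists and the number of digits determine how many codes can be issued before duplicates become likely.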
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to check how often my algorithm collides. If the collision rate was high, then I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing, etc.).
In .NET you can use the RNGCryptoServiceProvider method GetBytes() which will "fill an array of bytes with a cryptographically strong sequence of random values" (from ms documentation).
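using System.Security.Cryptography;
byte[] randomBytes = new byte[4];
// On .NET 6 and later this class is marked obsolete; RandomNumberGenerator.Create() is the replacement.
using (var rng = new RNGCryptoServiceProvider())
    rng.GetBytes(randomBytes); // fill the array with cryptographically strong random bytes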
You can increase the length of the byte array and pluck out the character values you want to allow.
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly, you mean that a user could type the answer in then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use Base64-encoded GUIDs.
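A minimal sketch of packing a GUID into a URL-friendly token, assuming the encoding meant above is Base64 and using its URL-safe variant. The function names are illustrative.

```python
import base64
import uuid


def guid_to_url_token(g: uuid.UUID) -> str:
    """Encode the 16 raw bytes of a GUID as 22 URL-safe Base64 characters."""
    # 16 bytes encode to 24 characters ending in "=="; the padding carries no information.
    return base64.urlsafe_b64encode(g.bytes).rstrip(b"=").decode("ascii")


def url_token_to_guid(token: str) -> uuid.UUID:
    """Reverse guid_to_url_token by restoring the stripped padding."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(token + "=="))
```

The token is 22 characters instead of the 36 of the usual hyphenated hexadecimal form, and it round-trips back to the same GUID.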
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.

How does StackOverflow/StackExchange generate unique integer IDs?

I'm looking for a way to generate unique integer IDs for my customers, very similar to how StackOverflow/StackExchange generates one for each question. The challenge is making this number unique in a distributed system since multiple databases are used, which means that the auto increment feature cannot work.
I have to assume that a huge site like StackOverflow/StackExchange is distributed, so I would very much like to know how it's able to generate a unique integer without any collisions.
From what I've seen, the two kinds of implementations around are Flickr's ticket server approach, but that creates a single point of failure, or Twitter's Snowflake, but that generates a 64 bit number, whereas StackOverflow's seems to be much smaller.
Your operating system most likely has some way to do this. For example, on iOS or MacOS X you just call a method in NSUUID; others will have something similar. This gives you 122 random bits. If you insist on integers, I'd split this into two 64-bit integers. Uniqueness is practically guaranteed by having 122 random bits.
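As a sketch of the split suggested above, here is the same idea using Python's standard `uuid` module in place of NSUUID. The helper names are assumptions for this example.

```python
import uuid

MASK64 = (1 << 64) - 1


def split_uuid(u: uuid.UUID) -> tuple:
    """Split a 128-bit UUID into (high, low) unsigned 64-bit integers."""
    return (u.int >> 64, u.int & MASK64)


def join_uuid(high: int, low: int) -> uuid.UUID:
    """Rebuild the UUID from the two halves."""
    return uuid.UUID(int=(high << 64) | low)
```

Both halves are needed to keep the full 122 random bits of a version 4 UUID; storing only one half reduces the id to at most 64 bits and raises the collision risk accordingly.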

How can I generate a set of (short) unique identifiers which are easy to check but hard to spoof?

For example, such as the license keys much software uses. I had thought of cryptographically signing a sequence, so I could have maybe 4 bytes for the ID and say 8 bytes for the signature, but I can't find a suitable algorithm.
What I need is something that an attacker can't readily generate, but which is stored in less than approx 20 ASCII bytes. I also need to be confident of uniqueness. This doesn't need to be completely secure, only secure against a casual attack.
Note: I'm doing this in java on appengine.
Just generate a GUID for each ID and keep track of the ones you've generated in a database. The universe of GUIDs is so large that each will be unique. It's not cryptographic so there's a possibility that anyone who has a large enough population of your generated ones could produce a match, but I think the odds are still minuscule.
A GUID is 128 bits (16 bytes), which can be encoded in 22 characters using Base64 (24 with padding).
Sounds like HMAC. You will probably need to ensure uniqueness manually though.
Combine the values, including the id, into a string and compute a byte-based HMAC over it with a secret key, truncated to a maximum length. Just make sure that you have a unique part in the values being signed. This could be server time or some other ID. The length will need to be tested to confirm that it remains within your 20 character requirement.
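A minimal sketch of the HMAC idea, following the 4-byte id plus 8-byte signature layout from the question. The question targets Java; Python is used here only for illustration, and the function names, the choice of HMAC-SHA256, and the Base64 encoding are assumptions. Truncating the tag to 8 bytes weakens it, which seems acceptable only because the question asks for protection against casual attacks.

```python
import base64
import hashlib
import hmac
import struct


def make_key(key_id: int, secret: bytes) -> str:
    """Return a 16-character token: 4-byte id followed by an 8-byte truncated HMAC tag."""
    body = struct.pack(">I", key_id)
    tag = hmac.new(secret, body, hashlib.sha256).digest()[:8]
    # 12 bytes encode to exactly 16 Base64 characters with no padding.
    return base64.urlsafe_b64encode(body + tag).decode("ascii")


def verify_key(token: str, secret: bytes):
    """Return the embedded id if the tag is valid, otherwise None."""
    try:
        raw = base64.urlsafe_b64decode(token)
    except ValueError:
        return None
    if len(raw) != 12:
        return None
    body, tag = raw[:4], raw[4:]
    expected = hmac.new(secret, body, hashlib.sha256).digest()[:8]
    if not hmac.compare_digest(tag, expected):
        return None
    return struct.unpack(">I", body)[0]
```

Uniqueness comes from issuing each id only once; the tag only makes valid tokens hard to produce without the secret.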
Encryption is reversible, so the output is guaranteed unique for unique inputs. Just encrypt 0, 1, 2, 3, 4, 5 etc. using the same key every time. For 128 bit output use AES and 128 bit numbers in ECB mode. Other modes will need identical IV/Nonces as well. For 64 bit numbers use DES. For other size numbers either use Hasty Pudding cypher or roll your own simple Feistel cypher for the size you want.
ECB is not the most secure mode, but I do not get the impression that you are looking for very high levels of security here.
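For the "roll your own simple Feistel cypher" option, a minimal sketch might look like the following. The round function (HMAC-SHA256 keyed with the secret), the round count, and all names are assumptions made for illustration; like the rest of this answer it aims at casual attackers only and is not a vetted cipher.

```python
import hashlib
import hmac


def _round(key: bytes, round_no: int, half: int, half_bits: int) -> int:
    """Keyed round function: any deterministic function works for invertibility."""
    msg = bytes([round_no]) + half.to_bytes(8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def feistel_encrypt(value: int, key: bytes, bits: int = 32, rounds: int = 4) -> int:
    """Permute a bits-wide integer (bits must be even and at most 128)."""
    half_bits = bits // 2
    mask = (1 << half_bits) - 1
    left, right = value >> half_bits, value & mask
    for r in range(rounds):
        left, right = right, left ^ _round(key, r, right, half_bits)
    return (left << half_bits) | right


def feistel_decrypt(value: int, key: bytes, bits: int = 32, rounds: int = 4) -> int:
    """Undo feistel_encrypt by running the rounds in reverse order."""
    half_bits = bits // 2
    mask = (1 << half_bits) - 1
    left, right = value >> half_bits, value & mask
    for r in reversed(range(rounds)):
        left, right = right ^ _round(key, r, left, half_bits), left
    return (left << half_bits) | right
```

Encrypting the counter values 0, 1, 2, ... with a fixed key gives distinct outputs of the same width, because the construction is a permutation of the value space whatever the round function is.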

Guid vs random string

If I randomly generate a string of 32 characters-long can I use this string as a GUID for all intents and purposes?
Will the "GUID" I generate have more or less likelihood of collision than a "real" GUID?
Any more specific info on GUIDs and how they compare to random strings is appreciated.
Some GUID-generation algorithms take into account the date and time as well as generating random numbers to create the final 128-bit value.
If you simply generate random strings without any other algorithmics thrown in then yes, you will run a much greater risk of collision. (Computers cannot create truly random numbers, so other data has to be folded into the GUID generation algorithms to lower the risk of collision. GUID v1, for example, used a computer's MAC address, though that approach has been deprecated since it identifies the generating computer.)
You could create your own GUID value but why reinvent something that already works well?
Also, see Eric Lippert's answer as to why using a GUID is superior to using your own, home-brewed random ID generator.
A GUID is not a 32-character long string. So no, you cannot use it in place of a GUID.
Depending on the encoding, a char can be either one or two bytes, so 32 chars can be 32 bytes or 64 bytes. A GUID is 16 bytes. If you have an equivalent amount of randomness in your generator, your string will produce less chance of collision. Saying that, the chance of collision in 16 bytes is pretty unlikely as it is.
The clinch is that you have to have at least as good a generator as the Guid generator to make it worthwhile. When you do that, patent it.
Depends on the GUID you're comparing it against: nowadays most GUIDs are "Version 4", which is really just a big random number with some wasted bits. So as long as your random number generator is as good as the one used to generate the GUID, your solution is more unique.
If it's a Version 1 GUID, then it's probably more unique than a random number (assuming it's being used as expected: the system clock isn't being reset very often, the system has a network card, and the MAC address hasn't been tampered with) but most people don't use version 1 anymore because it leaks your MAC address.
It depends on the algorithm you use. If you have a good generator, the result would be the same.
The likelihood depends on how good both the generators are (yours vs. GUID one).
I would suggest using actual GUIDs. The chance that the output of your random string generator would be unique is far lower than that of a GUID.
Social MSDN gives little info, but doesn't answer your question whether a collision is more likely or not.
The Guid Structure documentation says a GUID is not a string: "A GUID is a 128-bit integer (16 bytes) that can be used across all computers and networks wherever a unique identifier is required. Such an identifier has a very low probability of being duplicated."

Algorithm for assigning a unique series of bits for each user?

The problem seems simple at first: just assign an id and represent that in binary.
The issue arises because the user is capable of changing any number of 0 bits to 1 bits. To clarify, the hash could go from 0011 to 0111 or 1111 but never 1010. Each bit has an equal chance of being changed and is independent of other changes.
What would you have to store in order to go from hash -> user assuming a low percentage of bit tampering by the user? I also assume failure in some cases so the correct solution should have an acceptable error rate.
I would estimate that the maximum number of bits tampered with would be about 30% of the total set.
I guess the acceptable error rate would depend on the number of hashes needed and the number of bits being set per hash.
I'm worried that with enough manipulation the id cannot be reconstructed from the hash. The question I am asking, I guess, is what safeguards or unique positioning systems I can use to ensure the id can still be reconstructed.
Your question isn't entirely clear to me.
Are you saying that you want to validate a user based on a hash of the user ID, but are concerned that the user might change some of the bits in the hash?
If that is the question, then as long as you are using a well-established hash algorithm (such as MD5 or, preferably today, SHA-256), there is very low risk of a user manipulating the bits of their hash to get another user's ID.
If that's not what you are after, could you clarify your question?
EDIT
After reading your clarification, it looks like you might be after Forward Error Correction, a family of algorithms that allow you to reconstruct altered data.
Essentially, with the simplest form of FEC (a repetition code), you encode each bit as a series of 3 bits and apply the "majority wins" principle when decoding again. When encoding you represent "1" as "111" and "0" as "000". When decoding, if most of the encoded 3 bits are zero, you decode that to mean zero. If most of the encoded 3 bits are 1, you decode that to mean 1.
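The 3-bit "majority wins" scheme just described can be sketched as follows. The function names are illustrative; bits are represented as lists of 0/1 integers.

```python
def encode_repetition(bits):
    """Repeat every bit three times: 1 -> 1,1,1 and 0 -> 0,0,0."""
    return [b for bit in bits for b in (bit, bit, bit)]


def decode_repetition(coded):
    """Decode each group of three by majority vote."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]
```

This corrects any single altered bit within a group of three. Two altered bits in the same group are decoded incorrectly, and the encoded value is three times as long as the original.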
Assign each user an ID with the same number of bits set.
This way you can detect immediately if any tampering has occurred. If you additionally make the Hamming distance between any two IDs at least 2n, then you'll be able to reconstruct the original ID in cases where fewer than n bits have been set.
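A toy sketch of this idea, assuming three hand-picked 8-bit IDs that each have exactly 4 bits set and differ pairwise in 4 positions (so 2n = 4 and n = 2). The ID table and function names are made up for illustration. Since tampering can only turn 0 bits into 1 bits, the original ID must be a subset of the bits in the tampered word.

```python
from typing import Optional

WEIGHT = 4
# Hypothetical ID table: every entry has 4 bits set, pairwise Hamming distance 4.
IDS = [0b00001111, 0b00110011, 0b11000011]


def popcount(x: int) -> int:
    return bin(x).count("1")


def is_tampered(word: int) -> bool:
    """Any 0 -> 1 change raises the number of set bits above WEIGHT."""
    return popcount(word) != WEIGHT


def recover(word: int) -> Optional[int]:
    """Return the unique ID whose set bits are all present in word, if there is exactly one."""
    candidates = [c for c in IDS if (c & word) == c]
    return candidates[0] if len(candidates) == 1 else None
```

With one added bit the original ID is still the only table entry contained in the word; with two added bits a second entry can fit as well, and recovery becomes ambiguous.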
So you're trying to assign a "unique id" that will still remain a unique id even if it's changed to something else?
If the only "tampering" is changing 0's to 1's (but not vice-versa) (which seems fairly contrived), then you could get an effective 'ID' by assigning each user a particular bit position, set that bit to zero in that user's id, and to one in every other user's id.
Thus any fiddling by the user will result in corrupting their own id, but not allow impersonation of anyone else.
The distance between two IDs (the number of bits you have to change to get from one word to the other) is called the Hamming distance. Error correcting codes can correct up to half this distance and still give you the original word. If you assume that 30% of the bits can be tampered with, this means that the distance between 2 words should be 60% of the bits. This leaves 40% of that space to be used for IDs. As long as you randomly generate up to 40% of the IDs you could for a given number of bits (also including the error correcting part), you should be able to recover the original ID.

Resources