Collision probability with random number plus timestamp as a username

Say I need to generate a random username for a user. In the first case I set it to a random number generated by a function. In the second case I set the username to random number + timestamp. Which case has the higher probability of a collision? Is it even decidable?
Edit
Please note that I am not generating usernames to be used by people. It's just a placeholder.
And most importantly, I'm not appending the timestamp; I'm performing an addition.
To further elaborate, take the random number to be 105 and the timestamp to be 106. So the username will be 211. Later the timestamp will be a bigger number but there is still a probability that during the timestamp 201 the generated random number might be 10, making it 211 again.

Obviously, the more characters you have, the less likely you are to have a collision. The chance that two randomly generated usernames match is 1/(#ofPossibleCharacters ^ #ofCharsInUsername). For example, a 5-character username using only the lowercase alphabet has a 1/(26^5) chance of matching another one. Since you are appending a timestamp to the username, it will be hard to have a collision occur.
I see 3 potential issues, however. The first is that you need to make sure you use a time-seeded random function, or you will generate the same usernames on every run. The second is that if you use a time-seeded random number generator, then two usernames generated at the same time will be exactly the same (even if you append the time to the end). The last one is that if you append the timestamp to the end of the username, it's going to be quite annoying for people entering the username, as formatted timestamps contain colons and other punctuation.
I would suggest looking for a different solution that produces a mix of letters and numbers. One option is to draw a set number of random values and use them as ASCII codes to build the username. Going from 33 to 126 gives you all of the lowercase and uppercase letters, along with the digits and some symbols. In addition, this would make a collision very unlikely, because two generations would have to line up exactly in time. The likelihood of two such usernames matching would be 1/(94^n), where n is the number of characters.
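For the additive scheme described in the question (username = random number + timestamp, as in 105 + 106 = 211), the chance that two particular usernames collide can be worked out exactly. The sketch below assumes the random number is drawn uniformly from 0..n-1 and that timestamps are plain integers; neither is stated in the question, and the function name is made up for illustration.

```python
from fractions import Fraction


def pair_collision_probability(n: int, dt: int) -> Fraction:
    """Chance that r1 + t1 == r2 + t2 for two usernames generated dt ticks apart.

    Assumes r1 and r2 are independent and uniform on 0..n-1 (an assumption,
    not something the question states). A collision needs r1 - r2 == dt,
    and n - |dt| of the n * n equally likely (r1, r2) pairs satisfy that.
    """
    matching_pairs = max(0, n - abs(dt))
    return Fraction(matching_pairs, n * n)


n = 1000
print(pair_collision_probability(n, 0))     # same timestamp: 1/n, as for the bare random number
print(pair_collision_probability(n, 95))    # different timestamps: slightly below 1/n
print(pair_collision_probability(n, 5000))  # timestamps further apart than n: 0
```

Under these assumptions, adding the timestamp never makes a given pair more likely to collide than the bare random number, and it rules out collisions only once the timestamps are at least n apart.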

Related

How to generate a 4-digit validation code according to latitude & longitude information?

My application needs this feature:
User A can upload his location information and get ADD CODE which is
generated on server.
User B can input ADD CODE and also has to upload his location
information. Only when userA and userB are close enough and ADD CODE
is matched can they finally be friends.
Before calculating distance and comparing ADD CODE, I will check
whether their city number (unique for each city) is the same. In other
words, I have to make sure that in each city, an ADD CODE won't
conflict with another at the same time (or within a few minutes).
Of course, a 4-digit number won't satisfy all possibilities, but is there a way to generate this 4-digit number to satisfy this feature as much as possible?
Assuming this sequence happens within a limited period of time (and not that an Add code is valid for all-time):
Your 4-digit number doesn't need to be globally unique, it only needs to be unique within this window of time. So, with that observation, maintain a table of Add codes, when they were issued and for what location. Generate them randomly ensuring they aren't already in the table. Periodically remove any Add codes that have expired.
Provided you never have more than 10,000 users simultaneously trying to connect with each other this will work.
If you need more than that consider allowing duplicates in the table but using the lat/long to ensure that the same Add code is never allocated to any point within 2x the max distance allowed for pairing.
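The table-based approach above could be sketched roughly as follows. This is an in-memory illustration only: the class name `AddCodeTable`, the 5-minute validity window, and the injectable clock are all assumptions, and a real server would keep the table in shared storage (and possibly one table per city).

```python
import random
import time


class AddCodeTable:
    """In-memory table of live 4-digit Add codes (a sketch, not production code)."""

    def __init__(self, ttl_seconds=300, rng=None, clock=time.time):
        self.ttl = ttl_seconds            # assumed validity window
        self.rng = rng or random.Random()
        self.clock = clock                # injectable so expiry can be tested
        self.live = {}                    # code -> (issued_at, location)

    def purge_expired(self):
        """Drop every code whose validity window has passed."""
        now = self.clock()
        self.live = {c: v for c, v in self.live.items() if now - v[0] < self.ttl}

    def issue(self, location):
        """Return a random code that is not currently in the table."""
        self.purge_expired()
        if len(self.live) >= 10_000:
            raise RuntimeError("all 10,000 codes are in use")
        while True:
            code = f"{self.rng.randrange(10_000):04d}"
            if code not in self.live:
                self.live[code] = (self.clock(), location)
                return code

    def lookup(self, code):
        """Return the location stored for a live code, or None."""
        self.purge_expired()
        entry = self.live.get(code)
        return entry[1] if entry else None


table = AddCodeTable(ttl_seconds=300)
code = table.issue((31.23, 121.47))   # hypothetical latitude/longitude of user A
print(code, table.lookup(code))
```

User B's request would then call `lookup` with the typed code and compare the stored location with B's own before pairing.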
Is there a way to generate this 4-digit number to satisfy this feature as much as possible?
Yes. There are probably millions of possible ways to generate a 4-digit number (where almost all of them are awful and don't satisfy most of the requirements); and if you sort them in order of how much they satisfy, then one of them must satisfy the feature as much as possible.
The real question is, how awful and unsatisfactory would "as much as possible" be?
If you assume 4 decimal digits; then you're limited to 10000 locations or 10000 unique users. That's unlikely to be enough for anything good.
If you assume 4 hexadecimal digits; then you're limited to 65536 locations or 65536 unique users. That's better but still not enough.
So... what if you used "base 1234567"? In this case a 4-digit number has 2323050529221952581345121 possible values. The surface of the Earth is about 510.1 million square kilometers, so this would be enough to encode a location very precisely (probably within a few meters?).

How can I generate an order number with similar results as Amazon when they do it?

Note: I have already read through older questions like What is the best format for a customer number, order number? , however my question is a little more specific.
Generating pseudo-random numbers runs into the "birthday problem" before long. For example, if I am using a 27-bit field for my order number, after about 15,000 entries the chance of a collision is already over 50%.
I am wondering whether large ecommerce businesses like Amazon generate their order numbers in some other way, for example:
pre-generate the entire set and pick from them randomly (a few hundred GB of database)
Use lexicographical "next_permutation" starting from a particular seed number
MD5 or SHA-1 hash of the date, user-id, etc parameters, truncated to 14 digits
etc
All I want is a non-repeating integer (it doesn't need to be very random, except to obfuscate the total number of orders) of a certain width. Any ideas on how this can be achieved?
I suggest starting with the date in reverse format (year first), followed by a sequence number starting at 1 each day, followed by a check (or random) digit. If you are never likely to reach 100 orders per day, two sequence digits plus a check/random digit are enough.
The year need include only the final two digits, possibly only the final digit, depending on how long you keep records of orders: 7 years or so is usually enough, meaning the records from 2009 (beginning with 9) could be deleted during 2018 in preparation to use the order numbers again in 2019. You could use mmdd for the next 4 digits, or simply number the days through the year and use just 3 digits - it depends how human-friendly you want the number to be. It's also possible just to omit the day of the month and restart the sequential numbers at the start of each month, rather than every day.
Today is 2 Nov 2017; supposing this is order no. 16 today, your order number would be 71102168 (where the 8 is a check digit or random digit). If you're likely to have up to, but not exceeding, a thousand orders a day, you'll need an extra digit, thus: 711020168. To avoid limiting the number of digits, you might prefer to use a hyphen: 71102-168 … you could include another hyphen before the check/random digit if you wish: 71102-16-8.
If you have several areas dealing with orders, you may wish to include a depot number, perhaps at the beginning or after the date, allowing you to use the sequence numbers at each depot - eg depot 5 might be: 5-71102-168, 71102-5-168 or 711025168. Again, if you don't use hyphens, you'll need to assess whether you need up to ten, a hundred or a thousand (etc) possible depot numbers. I hope this helps!
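The date-plus-sequence layout described in this answer could be generated along these lines. The answer does not say which check digit scheme to use, so the Luhn algorithm below is an assumption (which is why the last digit differs from the 8 in the example above); the function names are also made up.

```python
from datetime import date


def luhn_check_digit(payload: str) -> int:
    """Luhn check digit for a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:          # double every second digit, starting at the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def order_number(day: date, seq: int, seq_width: int = 2) -> str:
    """Last digit of the year + mmdd + zero-padded daily sequence + check digit."""
    if not 0 < seq < 10 ** seq_width:
        raise ValueError("sequence number does not fit in seq_width digits")
    payload = f"{day.year % 10}{day:%m%d}{seq:0{seq_width}d}"
    return payload + str(luhn_check_digit(payload))


print(order_number(date(2017, 11, 2), 16))               # 71102164
print(order_number(date(2017, 11, 2), 16, seq_width=3))  # 711020164
```

The daily sequence number still has to come from somewhere shared (a database counter, for instance), which this sketch leaves out.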
This problem has been solved: why not use a UUID? See RFC 4122. These are close enough to globally unique that you can easily combine many systems and never have a duplicate, simply because the number space is so massive.
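As a minimal illustration of the UUID suggestion, Python's standard `uuid` module can produce a random (version 4) UUID. Note that this does not by itself meet the question's wish for a short integer of a fixed width, since the integer form can run to 39 decimal digits.

```python
import uuid

order_id = uuid.uuid4()   # random (version 4) UUID with 122 random bits
print(order_id)           # 36-character text form with hyphens
print(order_id.int)       # the same value as a 128-bit integer
```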

How much can you truncate a SHA1 hash and be reasonably sure of having an unique ID?

I am making an application that stores documents and gives each one a UID based on a SHA1 digest of a few things including the timestamp. The digest has a lot of characters, and I want to allow users to identify the documents by using the first x characters of the full digest. What's a good value for x if the number of documents is maybe around 10K - 100K?
Adapting the formulas on Wikipedia for the birthday problem, you can approximate the probability of collision as 1 - e^(-n^2/(2^(b+1))), where n is the document count and b is the number of bits. Graphing this formula with n=100,000, it looks like you'll want b > 45 at least. I'd be more inclined to go with 64 to make it a nice round number. That said, do have a plan to deal with collisions if they occur (maybe alter the timestamp slightly, or add a nonce?).
For that matter, if the sha1 is based on more than just the content of the document, why not simply make it a random ID? In this case collisions are less of a problem, as you can always generate a new random number and try again (the probability of a collision with a single try is the same, however).
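The approximation quoted in this answer, 1 - e^(-n^2/(2^(b+1))), is easy to evaluate directly instead of graphing it. The helper names below are made up, and the 1-in-a-million target is only an example threshold.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday approximation: 1 - exp(-n^2 / 2^(bits + 1))."""
    return -math.expm1(-(n * n) / 2 ** (bits + 1))


def hex_chars_needed(n: int, max_probability: float) -> int:
    """Smallest number of hex characters (4 bits each) that keeps the risk at or below the target."""
    chars = 1
    while collision_probability(n, 4 * chars) > max_probability:
        chars += 1
    return chars


for x in (8, 12, 16):
    print(x, "hex chars:", collision_probability(100_000, 4 * x))
print(hex_chars_needed(100_000, 1e-6))   # 14
```

The same function also covers the 27-bit order number case from the earlier question: `collision_probability(15_000, 27)` comes out a little above one half.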
Be careful with truncation, as there is no reduction proof that the smaller hash is secure. See Kelsey's http://csrc.nist.gov/groups/ST/hash/documents/Kelsey_Truncation.pdf. Kelsey gives two heuristic arguments stating the same ("Related Hash Outputs" and "Near Collisions"). Biham/Chen offer examples of Near Collisions; and Knudsen demonstrates Truncated Differentials.
In the end, you probably want to feed your data into an HMAC with the truncated size (the size is digested by the HMAC, too) and then use the truncated HMAC.
There really isn't a value for this; part of what makes SHA a good general-purpose hashing algorithm is that similar data does not necessarily produce similar hashed values. Your best bet (without knowing anything else about your system) would just be to search the list of documents whose hashes start with the value supplied by the user, then either present them with a list of documents to select from or go directly to the document if there's only one.
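The prefix search suggested here could look something like the sketch below. The fields fed into the digest and the function names are assumptions; a real application would query its document store (for example with a prefix index) rather than scan a list.

```python
import hashlib


def document_id(content: bytes, timestamp: str) -> str:
    """Full SHA-1 hex digest over the fields that identify a document (assumed fields)."""
    return hashlib.sha1(timestamp.encode() + b"\0" + content).hexdigest()


def resolve_prefix(prefix: str, known_ids: list[str]) -> list[str]:
    """All stored IDs that start with the user-supplied prefix."""
    prefix = prefix.lower()
    return [doc_id for doc_id in known_ids if doc_id.startswith(prefix)]


ids = [document_id(f"doc {i}".encode(), "2009-12-05T10:00:00") for i in range(1000)]
matches = resolve_prefix(ids[42][:8], ids)
if len(matches) == 1:
    print("open", matches[0])
else:
    print("ask the user to choose between", matches)
```

With this design no fixed prefix length is needed: the user types as many characters as it takes to narrow the list to one document.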
It's a generalization of the birthday problem. In your case n is the number of documents, and instead of the constant 365 you'd have the number of possibilities the cutoff gives you (so for k bits it's 2^k).
Of course exact calculation is out of the question, but you might use approximation.
Well, here's a possibly too simplistic of an answer..
If with full sha1 you get about a 1 in 2^160 chance of collision, then by truncating one character you increase the chances of collision by a factor of 16 (all possible values of the truncated character), which is 2^4. So, if you truncate x characters you get a 1 in 2^(160 - 4*x) chance of collision, right?

A good algorithm for generating an order number

As much as I like using GUIDs as the unique identifiers in my system, it is not very user-friendly for fields like an order number where a customer may have to repeat that to a customer service representative.
What's a good algorithm to use to generate order number so that it is:
Unique
Not sequential (purely for optics)
Numeric values only (so it can be easily read to a CSR over phone or keyed in)
< 10 digits
Can be generated in the middle tier without doing a round trip to the database.
UPDATE (12/05/2009)
After carefully reviewing each of the answers posted, we decided to randomize a 9-digit number in the middle tier to be saved in the DB. In the case of a collision, we'll regenerate a new number.
If the middle tier cannot check what "order numbers" already exist in the database, the best it can do will be the equivalent of generating a random number. However, if you generate a random number that's constrained to be less than 1 billion, you should start worrying about accidental collisions at around sqrt(1 billion), i.e., after a few tens of thousands of entries generated this way, the risk of collisions is material. What if the order number is sequential but in a disguised way, i.e. the next multiple of some large prime number modulo 1 billion -- would that meet your requirements?
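The "multiple of a large prime modulo 1 billion" idea might be sketched as below. The specific multiplier (the prime 2**31 - 1) is an arbitrary choice for illustration; any multiplier that shares no factor with 10**9 gives a one-to-one mapping. A sequential counter is still needed from somewhere, and this only obscures the sequence rather than securing it.

```python
MODULUS = 10 ** 9           # order numbers are 9 digits: 000000000..999999999
MULTIPLIER = 2_147_483_647  # the prime 2**31 - 1; coprime to 10**9
INVERSE = pow(MULTIPLIER, -1, MODULUS)  # modular inverse (Python 3.8+)


def disguise(seq: int) -> str:
    """Map a sequential counter to a scattered but unique 9-digit order number."""
    if not 0 <= seq < MODULUS:
        raise ValueError("counter out of range")
    return f"{seq * MULTIPLIER % MODULUS:09d}"


def recover(order_number: str) -> int:
    """Invert disguise() to get the counter back."""
    return int(order_number) * INVERSE % MODULUS


for seq in (1, 2, 3):
    print(seq, disguise(seq))
```

Because the mapping is a bijection on 0..999,999,999, no two counters produce the same order number, so no duplicate check is required as long as the counter itself never repeats.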
<Moan>OK sounds like a classic case of premature optimisation. You imagine a performance problem (Oh my god I have to access the - horror - database to get an order number! My that might be slow) and end up with a convoluted mess of psuedo random generators and a ton of duplicate handling code.</moan>
One simple practical answer is to run a sequence per customer, the real order number being a composite of customer number and order number. You can easily retrieve the last sequence used when retrieving other stuff about your customer.
One simple option is to use the date and time, eg. 0912012359, and if two orders are received in the same minute, simply increment the second order by a minute (it doesn't matter if the time is out, it's just an order number).
If you don't want the date to be visible, then calculate it as the number of minutes since a fixed point in time, e.g. when you started taking orders or some other arbitrary date. Again, with the duplicate check/increment.
Your competitors will glean nothing from this, and it's easy to implement.
Maybe you could try generating some unique text using a Markov chain - see here for an example implementation in Python. Maybe use sequential numbers (rather than random ones) to generate the chain, so that (hopefully) each order number is unique.
Just a warning, though - see here for what can possibly happen if you aren't careful with your settings.
One solution would be to take the hash of some field of the order. This will not guarantee that it is unique from the order numbers of all of the other orders, but the likelihood of a collision is very low. I would imagine that without "doing a round trip to the database" it would be challenging to make sure that the order number is unique.
In case you are not familiar with hash functions, the wikipedia page is pretty good.
You could base64-encode a guid. This will meet all your criteria except the "numeric values only" requirement.
Really, though, the correct thing to do here is let the database generate the order number. That may mean creating an order template record that doesn't actually have an order number until the user saves it, or it might be adding the ability to create empty (but perhaps uncommitted) orders.
Use primitive polynomials as finite field generator.
Your 10 digit requirement is a huge limitation. Consider a two stage approach.
Use a GUID
Prefix the GUID with a 10 digit (or 5 or 4 digit) hash of the GUID.
You will have multiple hits on the hash value. But not that many. The customer service people will very easily be able to figure out which order is in question based on additional information from the customer.
The straightforward answer to most of your bullet points:
Make the first six digits a sequentially-increasing field, and append three digits of hash to the end. Or seven and two, or eight and one, depending on how many orders you envision having to support.
However, you'll still have to call a function on the back-end to reserve a new order number; otherwise, it's impossible to guarantee a non-collision, since there are so few digits.
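One possible reading of the "sequential digits plus a few hash digits" layout is sketched below. The use of SHA-256 and of a secret prefix are assumptions made for the illustration; the answer does not name a hash function.

```python
import hashlib

SECRET = b"example-secret"   # hypothetical key so the suffix is not guessable from the counter


def sequential_with_hash(seq: int, seq_digits: int = 6, hash_digits: int = 3) -> str:
    """Zero-padded sequential part followed by a few digits derived from a hash."""
    if not 0 <= seq < 10 ** seq_digits:
        raise ValueError("sequence number does not fit")
    digest = hashlib.sha256(SECRET + str(seq).encode()).digest()
    suffix = int.from_bytes(digest[:8], "big") % 10 ** hash_digits
    return f"{seq:0{seq_digits}d}{suffix:0{hash_digits}d}"


print(sequential_with_hash(42))
print(sequential_with_hash(43))
```

Uniqueness comes entirely from the sequential part; the hash digits only make a mistyped or guessed number unlikely to be valid.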
We do TTT-CCCCCC-1A-N1.
T = Circuit type (D1E=DS1 EEL, D1U=DS1 UNE, etc.)
C = 6 Digit Customer ID
1 = The customer's first location
A = The first circuit (A=1, B=2, etc) at this location
N = Order type (N=New, X=Disconnect, etc)
1 = The first order of this kind for this circuit

Algorithm for assigning a unique series of bits for each user?

The problem seems simple at first: just assign an id and represent that in binary.
The issue arises because the user is capable of changing any number of 0 bits to 1 bits. To clarify, the hash could go from 0011 to 0111 or 1111 but never to 1010. Each bit has an equal chance of being changed, independently of the other changes.
What would you have to store in order to go from hash -> user assuming a low percentage of bit tampering by the user? I also assume failure in some cases so the correct solution should have an acceptable error rate.
I would estimate that the maximum number of bits tampered with would be about 30% of the total set.
I guess the acceptable error rate would depend on the number of hashes needed and the number of bits being set per hash.
I'm worried that with enough manipulation the id cannot be reconstructed from the hash. I guess the question I am asking is what safeguards or unique positioning systems I can use to ensure that it can.
Your question isn't entirely clear to me.
Are you saying that you want to validate a user based on a hash of the user ID, but are concerned that the user might change some of the bits in the hash?
If that is the question, then as long as you are using a proven hash algorithm (such as MD5), there is very low risk of a user manipulating the bits of their hash to get another user's ID.
If that's not what you are after, could you clarify your question?
EDIT
After reading your clarification, it looks like you might be after Forward Error Correction, a family of algorithms that allow you to reconstruct altered data.
In its simplest form (a repetition code), FEC encodes each bit as a series of 3 bits and applies the "majority wins" principle when decoding again. When encoding you represent "1" as "111" and "0" as "000". When decoding, if most of the encoded 3 bits are zero, you decode that to mean zero. If most of the encoded 3 bits are 1, you decode that to mean 1.
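The three-bit majority scheme just described can be written in a few lines. Bit strings are used here for readability; the function names are made up.

```python
def fec_encode(bits: str) -> str:
    """Repeat every bit three times: '1' -> '111', '0' -> '000'."""
    return "".join(b * 3 for b in bits)


def fec_decode(coded: str) -> str:
    """Majority vote over each group of three received bits."""
    groups = (coded[i:i + 3] for i in range(0, len(coded), 3))
    return "".join("1" if g.count("1") >= 2 else "0" for g in groups)


sent = fec_encode("0011")          # '000000111111'
tampered = "010001111111"          # two 0 bits flipped to 1, in different groups
print(fec_decode(tampered))        # '0011'
```

The vote fails as soon as two bits of the same group are flipped, so with heavy tampering a longer repetition or a stronger code would be needed.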
Assign each user an ID with the same number of bits set.
This way you can detect immediately if any tampering has occurred. If you additionally make the Hamming distance between any two IDs at least 2n, then you'll be able to reconstruct the original ID in cases where fewer than n bits have been set.
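The detection half of this idea (every ID has the same number of bits set) could be sketched as follows. The 8-bit width and weight of 3 are arbitrary example values, and the sketch does not enforce the minimum Hamming distance needed for reconstruction.

```python
from itertools import combinations


def constant_weight_ids(n_bits: int, weight: int):
    """Yield every n_bits-wide integer with exactly `weight` bits set."""
    for positions in combinations(range(n_bits), weight):
        yield sum(1 << p for p in positions)


def is_tampered(received: int, weight: int) -> bool:
    """Flipping any 0 to 1 raises the number of set bits, so a mismatch means tampering."""
    return bin(received).count("1") != weight


ids = list(constant_weight_ids(8, 3))     # 56 possible IDs (8 choose 3)
user_id = ids[0]                          # 0b00000111
print(is_tampered(user_id, 3))            # False
print(is_tampered(user_id | 0b1000, 3))   # True
```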
So you're trying to assign a "unique id" that will still remain a unique id even if it's changed to something else?
If the only "tampering" is changing 0's to 1's (but not vice-versa) (which seems fairly contrived), then you could get an effective 'ID' by assigning each user a particular bit position, set that bit to zero in that user's id, and to one in every other user's id.
Thus any fiddling by the user will result in corrupting their own id, but not allow impersonation of anyone else.
The distance between two IDs (the number of bits you have to change to get from one word to the other) is called the Hamming distance. Error-correcting codes can correct up to half this distance and still give you the original word. If you assume that 30% of the bits can be tampered with, this means that the distance between any 2 words should be at least 60% of the bits. This leaves 40% of that space to be used for IDs. As long as you randomly generate no more than 40% of the IDs possible for a given number of bits (including the error-correcting part), you should be able to recover the original ID.
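Decoding by Hamming distance amounts to picking the registered ID closest to what was received. The tiny 8-bit codebook below is hypothetical and hand-picked so that every pair of words differs in at least 5 bits, which allows up to 2 flipped bits to be corrected.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def nearest_id(received: int, codebook: list[int]) -> int:
    """Decode to the registered ID closest to the received word."""
    return min(codebook, key=lambda word: hamming_distance(received, word))


# Hypothetical codebook: pairwise Hamming distance is at least 5.
codebook = [0b00000000, 0b00011111, 0b11100011, 0b11111100]
received = 0b00011111 | 0b01100000   # two 0 bits tampered to 1
print(bin(nearest_id(received, codebook)))   # 0b11111
```

A linear scan is fine for a handful of IDs; a large user base would call for a structured error-correcting code with an efficient decoder instead.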

Resources