How can I create a unique 7-digit code for an entity? - random

When a user adds a new item in my system, I want to produce a unique non-incrementing pseudo-random 7-digit code for that item. The number of items created will only number in the thousands (<10,000).
Because it needs to be unique and no two items will have the same information, I could use a hash, but it needs to be a code they can share with other people - hence the 7 digits.
My original thought was just to loop the generation of a random number, check that it wasn't already used, and if it was, rinse and repeat. I think this is a reasonable if distasteful solution given the low likelihood of collisions.
Responses to this question suggest generating a list of all unused numbers and shuffling them. I could probably keep a list like this in a database, but we're talking 10,000,000 entries for something relatively infrequent.
Does anyone have a better way?

Pick a 7-digit prime number A, and a big prime number B, and
int nth_unique_7_digit_code(int n) {
    /* widen to 64 bits so n * B cannot overflow a 32-bit int */
    return (int)(((long long)n * B) % A);
}
The count of all unique codes generated by this will be A.
If you want to be more "secure", do pow(some_prime_number, n) % A, i.e.
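static int current_code = B;
int get_next_unique_code() {
    /* 64-bit product avoids overflow; the sequence repeats after at most A - 1 codes */
    current_code = (int)(((long long)B * current_code) % A);
    return current_code;
}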
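A minimal Python sketch of the first variant above. The constants are assumptions chosen for illustration, not values from the answer: A = 9999991 as the 7-digit prime and B = 2147483647 (2^31 - 1) as the large prime. Python integers do not overflow, so no wider type is needed here.

```python
A = 9_999_991        # assumed 7-digit prime modulus
B = 2_147_483_647    # assumed large prime multiplier (2**31 - 1)

def nth_unique_code(n: int) -> str:
    """Return the n-th code, zero-padded to 7 digits, for 0 <= n < A."""
    if not 0 <= n < A:
        raise ValueError("n must be in [0, A)")
    return f"{(n * B) % A:07d}"

# The first 10,000 codes, one per item counter value.
codes = [nth_unique_code(n) for n in range(10_000)]
```

Because B is not a multiple of A, distinct values of n below A give distinct codes. The item counter n still has to be stored somewhere, and anyone who learns A and B can recover n from a code.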

You could use an incrementing ID and then XOR it on some fixed key.
const int XORCode = 12345;
private int Encode(int id)
{
return id^XORCode;
}
private int Decode(int code)
{
return code^XORCode;
}

Honestly, if you want to generate only a couple of thousand 7-digit codes, while 10 million different codes will be available, I think just generating a random one and checking for a collision is good enough.
The chance of a collision on the first hit will be, in the worst case scenario, about 1 in a thousand, and the computational effort to just generate a new 7-digit code and check for a collision again will be much smaller than keeping a dictionary, or similar solutions.
Using a GUID instead of a 7-digit code as harryovers suggested will also certainly work, but of course a GUID will be slightly harder to remember for your users.
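A rough Python sketch of the generate-and-check loop. The in-memory set is a stand-in for whatever store holds the issued codes (an assumption; in a real system a unique index on the code column would do this job), and the names `new_code` and `used_codes` are made up for the example.

```python
import secrets

def new_code(used: set, max_attempts: int = 10) -> str:
    """Draw random 7-digit codes until one is not already in `used`."""
    for _ in range(max_attempts):
        code = f"{secrets.randbelow(10_000_000):07d}"
        if code not in used:
            used.add(code)
            return code
    raise RuntimeError("no free code found; the code space may be nearly full")

used_codes = set()
batch = [new_code(used_codes) for _ in range(1_000)]
```

With at most 10,000 of 10,000,000 codes taken, each draw collides with probability of about 0.001 or less, so the loop almost always finishes on the first attempt.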

I would suggest using a GUID instead of a 7-digit code, as collisions are far less likely and you don't have to worry about generating them: .NET will do this for you.

All solutions for a "unique" ID must have a database somewhere: Either one which contains the used IDs or one with the free IDs. As you noticed, the database with free IDs will be pretty big so most often, people use a "used IDs" database and check for collisions.
That said, some databases offer a "random ID" generator/sequence which already returns IDs in a range in random order.
This works by using a random number generator which can produce every number in a range without repeating itself, plus the ability to save its state somewhere. So what you do is run the generator once, use the ID and save the new state. For the next run, you load the state and reset the generator to the last state to get the next random ID.

I assume you'll have a table of the generated ones. In that case, I don't see a problem with picking random numbers and checking them against the database, but I wouldn't do it individually. Generating them is cheap, doing the DB query is expensive relative to that. I'd generate 100 or 1,000 at a time and then ask the DB which of those exists. Bet you won't have to do it twice most of the time.
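One way the batch check could look, sketched in Python with the standard `sqlite3` module. The table name `items`, the column `code`, and the helper `fresh_codes` are assumptions for the example; the in-memory database only stands in for the real one.

```python
import secrets
import sqlite3

def fresh_codes(conn: sqlite3.Connection, batch_size: int = 100) -> list:
    """Generate a batch of candidates and return those not yet in the table."""
    candidates = {f"{secrets.randbelow(10_000_000):07d}" for _ in range(batch_size)}
    marks = ",".join("?" for _ in candidates)
    rows = conn.execute(
        f"SELECT code FROM items WHERE code IN ({marks})", list(candidates)
    )
    taken = {row[0] for row in rows}
    return sorted(candidates - taken)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (code TEXT PRIMARY KEY)")
conn.execute("INSERT INTO items VALUES ('0000001')")
available = fresh_codes(conn)  # one query covers the whole batch
```

A single query answers for the whole batch, so the remaining candidates can be handed out one by one. The primary key on `code` still guards against two processes claiming the same value.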

You have <10,000 items, so you need only 4 digits to store a unique number for all items.
Since you have 7 digits, you have 3 digits extra.
If you combine a unique sequence number of 4 digits with a random number of 3 digits, you will be unique and random. You increment the sequence number with every new ID you generate.
You can just append them in any order, or mix them.
seq = abcd,
rnd = ABC
You can create the following ID's:
abcdABC
ABCabcd
aAbBcCd
If you use only one mixing algorithm, you will have unique numbers, that look random.
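A short Python sketch of two of the layouts above, abcdABC and aAbBcCd. The function names are invented for the example, and the sequence number is assumed to come from a counter kept elsewhere.

```python
import secrets

def make_id(seq: int) -> str:
    """abcdABC layout: 4-digit sequence number followed by 3 random digits."""
    if not 0 <= seq <= 9_999:
        raise ValueError("sequence number must fit in 4 digits")
    return f"{seq:04d}{secrets.randbelow(1_000):03d}"

def make_id_mixed(seq: int) -> str:
    """aAbBcCd layout: digits of the two parts interleaved."""
    if not 0 <= seq <= 9_999:
        raise ValueError("sequence number must fit in 4 digits")
    s, r = f"{seq:04d}", f"{secrets.randbelow(1_000):03d}"
    return s[0] + r[0] + s[1] + r[1] + s[2] + r[2] + s[3]

plain = [make_id(i) for i in range(10_000)]
mixed = [make_id_mixed(i) for i in range(10_000)]
```

Uniqueness comes entirely from the sequence digits; the random digits only make neighbouring IDs look unrelated, so the sequence number remains readable to anyone who knows the layout.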

I would try to use an LFSR (linear-feedback shift register). The code is really simple, and you can find examples everywhere, e.g. on Wikipedia. Even though it's not cryptographically secure, the output looks very random. The implementation will also be very fast, since it mainly uses shift operations.
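To illustrate, here is a Python sketch of the widely published 16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11 and the customary seed 0xACE1. It is only a demonstration of the idea: 16 bits give 65,535 distinct states, so covering a 7-digit range would need a wider register with a suitable tap set and a rule for skipping out-of-range states, neither of which is shown here.

```python
def lfsr16(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr_sequence(seed: int, count: int):
    """Yield `count` successive states, starting with the one after `seed`."""
    state = seed
    for _ in range(count):
        state = lfsr16(state)
        yield state

first_codes = list(lfsr_sequence(0xACE1, 5))
```

Only the current state has to be saved between calls; the next code is always `lfsr16(state)`. The state must never be zero, because zero maps to itself.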

With only thousands of items in the database, your original idea seems sound. Checking the existence of a value in a sorted (indexed) list of a few tens of thousands of items would only require a few data fetches and comparisons.
Pre-generating the list doesn't sound like a good idea, because you will either store way more numbers than are necessary, or you will have to deal with running out of them.

The probability of a collision is very low.
For instance, say you have 10^4 users and 10^7 possible IDs.
The probability of picking an already-used ID 10 times in a row is then at most (10^-3)^10 = 10^-30.
That is far less likely than a once-in-a-lifetime event.

Well, you could ask the user to pick their own 7-digit number and validate it against the population of existing numbers (which you would have stored as they were used up), but I suspect you would be filtering a lot of 1234567, 7654321, 9999999, 7777777 type responses and might need a few RegExs to achieve the filtering, plus you'd have to warn the user against such sequences in order not to have a bad, repetitive, user input experience.

Related

How to select nth random integer from a range of integers without repetition or storage? [duplicate]

This question already has answers here:
Unique (non-repeating) random numbers in O(1)?
Let's say my system needs to provide a unique integer id regularly, between 1 and 10^20, from a function like --
function getNextRandomUniqueId(index:BigInt, min:BigInt, max:BigInt, seed:BigInt): BigInt { ? }
id = getNextRandomUniqueId(index=42, min=1, max=10^20, seed=0)
These ids need to be provided in random order as the index increases, not sequentially. Once an id has been provided, it cannot be provided again, as long as the index increases. My system cannot store a random list of all the numbers to be issued, or all the numbers issued, there's too many. I also don't want to rely on something like a random UUID, which is exceedingly unlikely to have a collision, but not guaranteed to.
How can this be done? To have a deterministic mathematical way to iterate randomly through a set of sequential integers without repetition and without storage?
EDIT: Fixed 1^20 to 10^20
This can be done, assuming you are allowed to store an encryption key and counter. Encryption is a one-to-one mapping so by encrypting all the numbers in a given range you will get back all those same numbers in a randomized order. Different keys will give a different order. Encrypt the numbers 0, 1, 2, 3, ... in order, using the key and keeping track of how far you have got.
Depending on the range of numbers, you may need to use some form of Format Preserving encryption to keep the outputs within the required range.
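The idea can be sketched in Python with a small Feistel network plus "cycle walking" to stay inside the range. This is an illustrative construction, not a vetted format-preserving encryption scheme: the round function built from keyed BLAKE2b, the round count, and the name `permute` are all assumptions made for the example.

```python
import hashlib

def _round(key: bytes, rnd: int, half: int, half_bits: int) -> int:
    """Keyed round function: hash (round number, half) and keep half_bits bits."""
    data = bytes([rnd]) + half.to_bytes((half_bits + 7) // 8, "big")
    digest = hashlib.blake2b(data, key=key, digest_size=16).digest()
    return int.from_bytes(digest, "big") & ((1 << half_bits) - 1)

def permute(index: int, size: int, key: bytes, rounds: int = 4) -> int:
    """Map index in [0, size) to a unique value in [0, size)."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    # Two equal halves whose combined width covers the whole range.
    half_bits = max(1, ((size - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = index
    while True:
        left, right = x >> half_bits, x & mask
        for r in range(rounds):
            left, right = right, left ^ _round(key, r, right, half_bits)
        x = (left << half_bits) | right
        if x < size:          # cycle walking: re-encrypt until back in range
            return x

key = b"demo key"
sample = permute(42, 10**20, key)
```

A Feistel network is a one-to-one mapping on its full bit width, and re-encrypting out-of-range results keeps it one-to-one on [0, size). Only the key and the current index need to be stored. For a range starting at `min`, the result can be offset by `min`.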
You cannot guarantee that the same id does not also appear in the sequence for another seed.
Most languages seed the generator from the current time when you do not provide a seed yourself. You have set your seed to zero, so each time you restart your program you will get the same ids. This is most likely not your intent :-)
But even if you did seed from the time, there is still a chance of hitting the same id:
1 in 100,000,000,000,000,000,000.
The reason you can get the same id is that it is RANDOM.
I would go with a GUID:
1 in 340,280,000,000,000,000,000,000,000,000,000,000,000.

Random number generation guaranteeing uniqueness over time

I created a counter that goes up from 0 to 9999 and then resets. I use the output of this counter as a value to make unique entries. However, the application needs to find its last created number each time the application is restarted. Therefore I am looking for a method which avoids any sort of object storage and relies solely on random number generation.
Something like:
import java.util.Random;
int randomTimeBasedGenerator() {
    Random r = new Random(System.currentTimeMillis());
    // nextInt(10000) yields 0..9999; nextInt() % 9999 could be negative
    int num = r.nextInt(10000);
    return num;
}
But what guarantee do I have that this method generates unique numbers? And, if not, how long would it remain unique? Are there any study papers I can look into for this sort of scenario?
Random number generation would be an elegant solution for my situation, if I can at least guarantee it won't repeat within a couple of weeks or months. But random number generation would be useless in my case if no such guarantee exists.
You have no guarantee that the return value of a random number generator remains unique. Random number generators generate unique sequences of numbers, not unique numbers. Random numbers will always repeat themselves, sooner or later.
As suggested by @Thilo, UUIDs are unique numbers. But an even better approach in your case might be to set up a lightweight database (sqlite will do) and add a record to a table with incremental ids. It is not possible to keep track of a process without storing values somewhere.
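A minimal sketch of the SQLite suggestion, written in Python with the standard `sqlite3` module. The table name `entries` and the helper `next_entry_id` are assumptions for the example, and the in-memory database would be replaced by a file path so the counter survives restarts.

```python
import sqlite3

def next_entry_id(conn: sqlite3.Connection) -> int:
    """Insert a row and return its automatically assigned integer id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO entries DEFAULT VALUES")
        return cur.lastrowid

conn = sqlite3.connect(":memory:")  # use a file path to persist across restarts
conn.execute(
    "CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY AUTOINCREMENT)"
)
ids = [next_entry_id(conn) for _ in range(3)]
```

The database remembers the last id, so the application no longer has to search for it at startup.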

Hashing - What Does It Do?

So I've been reading up on Hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain Hashing to me the best way they understand it?
Sorry for the vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start, and if anyone knows any helpful ways to understand it, that would be helpful also.
Hashing is a fast heuristic for finding an object's equivalence class.
In smaller words:
Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time
An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class
If I want to know whether "042" and "84/2" are probably equivalent, I can compute hash codes for each (a cheap operation), and only if the hash codes are equal do I try the more expensive check. If I want to divide representations of numbers into buckets, so that representations of the same number end up in the same bucket, I can choose the bucket by hash code.
Hashing is heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code but usually don't, whereas two objects in the same equivalence class must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science
Hashing is summarizing.
The hash of the sequence of numbers (2,3,4,5,6) is a summary of those numbers. Their sum, 20, for example, is one kind of summary, though it doesn't reflect all available bits in the original data very well. It isn't a very good summary, but it's a summary.
When the value involves more than a few bytes of data, some bits must get rejected. If you use sum and mod (to keep the sum under 2 billion, for example) you tend to keep a lot of right-most bits and lose all the left-most bits.
So a good hash is fair -- it keeps and rejects bits equitably. That tends to prevent collisions.
Our simplistic "sum hash", for example, will have collisions between other sequences of numbers that also happen to have the same sum.
First, we should state the problem that hashing solves.
Suppose you have some data (maybe an array, a tree, or database entries). You want to find a particular element in this datastore (for example, in an array) as fast as possible. How do you do it?
When you build this datastore, you can calculate a special value for every item you put in (it is called the HashValue). There are different ways to calculate this value, but all of them should aim to satisfy one condition: the calculated value should be unique for every item.
So now you have an array of items, and for every item you have its HashValue. How do you use it? Say you have an array of N elements. Place your items into this array according to their HashValues.
Suppose you have to answer this question: does the item "it1" exist in this array? To answer it, you can simply compute the HashValue for "it1" (let's call it f("it1")) and look in the array at position f("it1"). If the element at this position is not null (and equals our "it1" item), the answer is true. Otherwise the answer is false.
There is also the problem of collisions: how do you find a function that gives unique HashValues for all different elements? In general, such a function doesn't exist. There are, however, a lot of good functions that give well-distributed values.
An example for better understanding:
Suppose you have an array of Strings: A = {"aaa","bgb","eccc","dddsp",...}, and you have to answer the question: does this array contain String S?
First, we have to choose a function for calculating HashValues. Let's take the function f which, for a given string, returns the length of that string (it's actually a very bad function, but I chose it because it is easy to understand).
So, f("aaa") = 3, f("qwerty") = 6, and so on...
Now we calculate the HashValue for every element in array A: f("aaa")=3, f("eccc")=4,...
Let's take an array for holding these items (it is called a HashTable) and call it H (an array of strings). Now we put our elements into this array according to their HashValues:
H[3] = "aaa", H[4] = "eccc",...
And finally, how do we find a given String in this array?
Suppose you are given a String s = "eccc". f("eccc") = 4. So, if H[4] == "eccc", the answer is true; otherwise it is false.
But how do we handle the situation where two elements have the same HashValue? There are a lot of ways to do it. One of them: each slot in the HashTable contains a list of items. So H[4] will contain all items whose HashValue equals 4. And how do we find a particular element? It's very easy: calculate the HashValue for this item and look through the list of items in HashTable[HashValue]. If one of these items equals the element we are searching for, the answer is true; otherwise the answer is false.
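The list-per-slot scheme just described might look like this in Python. The class name `ChainedTable` and the slot count are assumptions for the example; the hash function f is the deliberately weak string-length function from the answer.

```python
def f(s: str) -> int:
    """Deliberately weak hash from the example: the string's length."""
    return len(s)

class ChainedTable:
    """Hash table in which every slot holds a list of colliding items."""

    def __init__(self, slots: int = 16):
        self.slots = [[] for _ in range(slots)]

    def add(self, item: str) -> None:
        bucket = self.slots[f(item) % len(self.slots)]
        if item not in bucket:
            bucket.append(item)

    def contains(self, item: str) -> bool:
        # Only the one bucket chosen by the hash value is searched.
        return item in self.slots[f(item) % len(self.slots)]

H = ChainedTable()
for s in ["aaa", "bgb", "eccc", "dddsp"]:
    H.add(s)
```

Here "aaa" and "bgb" both hash to 3 and share a bucket, which is exactly the collision case the list handles.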
You take some data and deterministically, one-way calculate some fixed-length data from it that totally changes when you change the input a little bit.
A hash function applied to some data generates some new data.
It is always the same for the same data.
That's about it.
Another constraint that is often put on it, which I think is not really required, is that you cannot work back to the original data from the hash.
To me, this is its own category, called cryptographic or one-way hashing.
There are a lot of demands on certain kinds of hash functions,
for example that the hash is always the same length,
or that hashes are distributed randomly for any given sequence of input data.
The only essential point is that it's deterministic (always the same hash for the same data),
so you can use it, for example, to verify data integrity, validate passwords, etc.
read all about it here
http://en.wikipedia.org/wiki/Hash_function
You should read the wikipedia article first. Then come with questions on the topics you don't understand.
To put it short, quoting the article, to hash means:
to chop and mix
That is, given a value, you get another (usually) shorter value from it (chop), but that obtained value should change even if a small part of the original value changes (mix).
Let's take x % 9 as an example hashing function.
345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5
You can see that this hashing method takes into account all parts of the input and changes if any of the digits change. That makes it a good hashing function.
On the other hand, if we took x % 10, we would get:
345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5
As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.
Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.
I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.
Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:
if (inputstring == thefirststring) return 0;
if (inputstring == thesecondstring) return 1;
if (inputstring == thethirdstring) return 2;
... up to the193371ndstring)
in many cases, a 'perfect' function would take so much effort to compute that it wouldn't be worth the effort.
What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:
Linear hashing (more commonly called linear probing) -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search in the indicated slot, and then next one, then the next, etc. until the item is found or one hits an empty slot. Linear hashing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more than that will contain only two, etc. so no slot will hold very many items.
When dealing with a fixed data set (e.g. a compiler's set of keywords), linear hashing is often good; in cases where it works badly, one can tweak the hash function so it will work well. When dealing with an unknown data set, chain bucket hashing is often the best approach. The overhead of dealing with extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.

A good algorithm for generating an order number

As much as I like using GUIDs as the unique identifiers in my system, it is not very user-friendly for fields like an order number where a customer may have to repeat that to a customer service representative.
What's a good algorithm to use to generate order number so that it is:
Unique
Not sequential (purely for optics)
Numeric values only (so it can be easily read to a CSR over phone or keyed in)
< 10 digits
Can be generated in the middle tier without doing a round trip to the database.
UPDATE (12/05/2009)
After carefully reviewing each of the answers posted, we decided to randomize a 9-digit number in the middle tier to be saved in the DB. In the case of a collision, we'll regenerate a new number.
If the middle tier cannot check what "order numbers" already exists in the database, the best it can do will be the equivalent of generating a random number. However, if you generate a random number that's constrained to be less than 1 billion, you should start worrying about accidental collisions at around sqrt(1 billion), i.e., after a few tens of thousand entries generated this way, the risk of collisions is material. What if the order number is sequential but in a disguised way, i.e. the next multiple of some large prime number modulo 1 billion -- would that meet your requirements?
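A small Python sketch of the disguised-sequence idea. The multiplier 2147483647 (2^31 - 1) is an assumed choice of large prime; any multiplier that shares no factor with 10^9 works the same way. The function names are invented for the example, and `pow(x, -1, m)` needs Python 3.8 or later.

```python
MODULUS = 1_000_000_000      # order numbers stay below 10 digits
MULTIPLIER = 2_147_483_647   # assumed prime (2**31 - 1), coprime to MODULUS

def disguise(seq: int) -> int:
    """Turn a sequential number into a scattered-looking order number."""
    return (seq * MULTIPLIER) % MODULUS

# The modular inverse undoes the multiplication.
INVERSE = pow(MULTIPLIER, -1, MODULUS)

def recover(order_number: int) -> int:
    """Recover the sequential number from an order number."""
    return (order_number * INVERSE) % MODULUS

numbers = [disguise(n) for n in range(1, 10_001)]
```

Because the multiplier is coprime to the modulus, the mapping is one-to-one over all 10^9 values, so there are no collisions to check for. The sequence counter itself still has to come from somewhere, and the disguise is easy to reverse for anyone who guesses the multiplier.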
<Moan>OK sounds like a classic case of premature optimisation. You imagine a performance problem (Oh my god I have to access the - horror - database to get an order number! My that might be slow) and end up with a convoluted mess of psuedo random generators and a ton of duplicate handling code.</moan>
One simple practical answer is to run a sequence per customer, with the real order number being a composite of customer number and order number. You can easily retrieve the last sequence used when retrieving other stuff about your customer.
One simple option is to use the date and time, eg. 0912012359, and if two orders are received in the same minute, simply increment the second order by a minute (it doesn't matter if the time is out, it's just an order number).
If you don't want the date to be visible, then calculate it as the number of minutes since a fixed point in time, e.g. when you started taking orders or some other arbitrary date. Again, with the duplicate check/increment.
Your competitors will glean nothing from this, and it's easy to implement.
Maybe you could try generating some unique text using a Markov chain - see here for an example implementation in Python. Maybe use sequential numbers (rather than random ones) to generate the chain, so that (hopefully) each order number is unique.
Just a warning, though - see here for what can possibly happen if you aren't careful with your settings.
One solution would be to take the hash of some field of the order. This will not guarantee that it is unique from the order numbers of all of the other orders, but the likelihood of a collision is very low. I would imagine that without "doing a round trip to the database" it would be challenging to make sure that the order number is unique.
In case you are not familiar with hash functions, the wikipedia page is pretty good.
You could base64-encode a guid. This will meet all your criteria except the "numeric values only" requirement.
Really, though, the correct thing to do here is let the database generate the order number. That may mean creating an order template record that doesn't actually have an order number until the user saves it, or it might be adding the ability to create empty (but perhaps uncommitted) orders.
Use primitive polynomials as finite field generator.
Your 10 digit requirement is a huge limitation. Consider a two stage approach.
Use a GUID
Prefix the GUID with a 10 digit (or 5 or 4 digit) hash of the GUID.
You will have multiple hits on the hash value. But not that many. The customer service people will very easily be able to figure out which order is in question based on additional information from the customer.
The straightforward answer to most of your bullet points:
Make the first six digits a sequentially-increasing field, and append three digits of hash to the end. Or seven and two, or eight and one, depending on how many orders you envision having to support.
However, you'll still have to call a function on the back-end to reserve a new order number; otherwise, it's impossible to guarantee a non-collision, since there are so few digits.
We do TTT-CCCCCC-1A-N1.
T = Circuit type (D1E=DS1 EEL, D1U=DS1 UNE, etc.)
C = 6 Digit Customer ID
1 = The customer's first location
A = The first circuit (A=1, B=2, etc) at this location
N = Order type (N=New, X=Disconnect, etc)
1 = The first order of this kind for this circuit

Best way to generate order numbers for an online store?

Every order in my online store has a user-facing order number. I'm wondering the best way to generate them. Criteria include:
Short
Easy to say over the phone (e.g., "m" and "n" are ambiguous)
Unique
Checksum (overkill? Useful?)
Edit: Doesn't reveal how many total orders there have been (a customer might find it unnerving to make your 3rd order)
Right now I'm using the following method (no checksum):
def generate_number
  possible_values = 'abfhijlqrstuxy'.upcase.split('') | '123456789'.split('')
  record = true
  while record
    random = Array.new(5){possible_values[rand(possible_values.size)]}.join
    record = Order.find(:first, :conditions => ["number = ?", random])
  end
  self.number = random
end
As a customer I would be happy with:
year-month-day/short_uid
for example:
2009-07-27/KT1E
It gives room for about 33^4 ~ 1mln orders a day.
Here is an implementation for a system I proposed in an earlier question:
MAGIC = [];
29.downto(0) {|i| MAGIC << 839712541[i]}
def convert(num)
order = 0
0.upto(MAGIC.length - 1) {|i| order = order << 1 | (num[i] ^ MAGIC[i]) }
order
end
It's just a cheap hash function, but it makes it difficult for an average user to determine how many orders have been processed, or a number for another order. It won't run out of space until you've done 2^30 orders, which you won't hit any time soon.
Here are the results of convert(x) for 1 through 10:
1: 302841629
2: 571277085
3: 34406173
4: 973930269
5: 437059357
6: 705494813
7: 168623901
8: 906821405
9: 369950493
10: 638385949
At my old place it was the following:
The customer ID (which started at 1001), the sequence of the order they made then the unique ID from the Orders table. That gave us a nice long number of at least 6 digits and it was unique because of the two primary keys.
I suppose if you put dashes or spaces in, you could even get a little insight into the customer's purchasing habits. It isn't mind-bogglingly secure, and I guess an order ID would be guessable, but I am not sure whether there is a security risk in that or not.
Ok, how about this one?
Sequentially, starting at some number (2468) and add some other number to it, say the day of the month that the order was placed.
The number always increases (until you exceed the capacity of the integer type, but by then you probably don't care, as you will be incredibly successful and will be sipping margaritas in some far-off island paradise). It's simple enough to implement, and it mixes things up enough to throw off any guessing as to how many orders you have.
I'd rather submit the number 347 and get great customer service at a smaller personable website than: G-84e38wRD-45OM at the mega-site and be ignored for a week.
You would not want this as a system id or part of a link, but as a user-friendly number it works.
Douglas Crockford's Base32 Encoding works superbly well for this.
http://www.crockford.com/wrmg/base32.html
Store the ID itself in your database as an auto-incrementing integer, starting at something suitably large like 100000, and simply expose the encoded value to the customer/interface.
5 characters will see you through your first ~32 million orders, whilst performing very well and satisfying most of these requirements. It doesn't allow for the exclusion of similar sounding characters though.
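A compact Python sketch of encoding an integer ID with Crockford's Base32 alphabet. It covers only the basic symbol set (no check symbol), and the function names are made up for the example.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford: no I, L, O, U

def encode(n: int) -> str:
    """Encode a non-negative integer as Crockford Base32 text."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, rem = divmod(n, 32)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def decode(text: str) -> int:
    """Decode, accepting lowercase, hyphens, and the look-alikes O, I and L."""
    cleaned = text.upper().replace("-", "")
    cleaned = cleaned.replace("O", "0").replace("I", "1").replace("L", "1")
    n = 0
    for ch in cleaned:
        n = n * 32 + ALPHABET.index(ch)
    return n

order_code = encode(100000)
```

The database keeps the plain integer; only the encoded form is shown to the customer, and `decode` turns whatever they read back into the key.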
Rather than generating and storing a number, you might try creating an encrypted version that would not reveal the number of orders in the system. Here's an article on exactly that.
Something like this:
Get a sequential order number. Or, maybe, a UNIX timestamp plus two or three random digits (for when two orders are placed at the same moment) is fine too.
Bitwise-XOR it with some semi-secret value to make number appear "pseudo-random". This is primitive and won't stop those who really want to investigate how many orders you have, but for true "randomness" you need to keep a (large) permutation table. Or you'll need to have large random numbers, so you won't be hit by the birthday paradox.
Add a check digit using the Verhoeff algorithm (I'm not sure it will have such good properties for base 33, but it shouldn't be bad).
Convert the number to, for example, base 33 ("0-9A-Z", except for "O", "Q" and "L", which can be mistaken for "0" and "1") or something like that. Ease of pronunciation means excluding more letters.
Group the result in some visually readable pattern, like XXX-XXX-XX, so users won't have to track the position with their fingers or mouse pointers.
Sequentially, starting at 1? What's wrong with that?
(Note: This answer was given before the OP edited the question.)
Just one Rube Goldberg-style idea:
You could generate a table with a random set of numbers that is tied to a random period of time:
Time Period Interleaver
next 2 weeks: 442
following 8 days: 142
following 3 weeks: 580
and so on... this gives you an unlimited number of Interleavers, and doesn't let anyone know your rate of orders because your time periods could be on the order of days and your interleaver is doing a lot of low-tech "mashing" for you.
You can generate this table once, and simply ensure that all Interleavers are unique.
You can ensure you don't run out of Interleavers by simply adding more characters into the set, or start by defining longer Interleavers.
So you generate an order ID by getting a sequential number, and using today's Interleaver value, interleave its digits (hence, the name) in between each sequential number's digits. Guaranteed unique - guaranteed confusing.
Example:
Today I have a sequential number 1, so I will generate the order ID: 4412
The next order will be 4422
The next order will be 4432
The 10th order will be 41402
In two weeks my interleaver will change to 142,
The 200th order will be 210402
The 201st order will be 210412
Eight days later, my interleaver changes to 580:
The 292nd order will be 259820
This will be completely confusing but completely deterministic. You can just remove every other digit starting at the 1's place. (except when your order id is only one digit longer than your interleaver)
I didn't say this was the best way - just a Friday idea.
You could do it like a postal code:
2b2 b2b
That way it has some kind of checksum (not really, but at least you know it's wrong if there are 2 consecutive numbers or letters). It's easy to read out over the phone, and it doesn't give an indication of how many orders are in the system.
http://blog.logeek.fr/2009/7/2/creating-small-unique-tokens-in-ruby
>> rand(36**8).to_s(36)
=> "uur0cj2h"
How about getting the current time in milliseconds and using that as your order ID?

Resources