I'm working on an application where I need to generate unique, non-sequential IDs. One of the constraints I have is that they must consist of 3 digits followed by 2 letters (only 676,000 possible IDs). Given my relatively small pool of IDs, I was considering simply generating all possible IDs, shuffling them and putting them into a database. Since, internally, I'll have a simple, sequential ID to use, it'll be easy to pluck them out one at a time and be sure I don't have any repeats.
This doesn't feel like a very satisfying solution. Does anyone out there have a more interesting method of generating unique IDs from a limited pool than this 'lottery' method?
This can be done a lot of different ways, depending on what you are trying to optimize (speed, memory usage, etc.).
ID pattern = ddd c[1]c[0]
Option 1 (essentially like hashing, similar to Zak's):
1- Generate a random number between 0 and the number of possibilities (676k).
2- Convert number to combination
ddd = random / (26^2)
c[0] = random % (26)
c[1] = (random / 26) % 26
3- Query DB for existence of ID and increment until a free one is found.
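As a rough illustration of Option 1, here is a minimal Python sketch. The names `POOL_SIZE`, `to_id` and `allocate` are made up for this example, and a plain in-memory set stands in for the database lookup in step 3; a real version would run that check (and the insert) against the table instead.

```python
import random

POOL_SIZE = 1000 * 26 * 26  # 676,000 possible IDs


def to_id(n):
    """Convert a number in [0, POOL_SIZE) to the 'ddd c[1]c[0]' format."""
    ddd = n // (26 * 26)
    c0 = n % 26
    c1 = (n // 26) % 26
    return f"{ddd:03d}{chr(65 + c1)}{chr(65 + c0)}"


def allocate(used, rng=random):
    """Pick a random slot, then increment until a free one is found.

    `used` is a set standing in for the database existence check.
    """
    if len(used) >= POOL_SIZE:
        raise RuntimeError("ID pool exhausted")
    n = rng.randrange(POOL_SIZE)
    while n in used:
        n = (n + 1) % POOL_SIZE  # wrap around at the end of the pool
    used.add(n)
    return to_id(n)
```

Probing forward keeps the worst case bounded, but as the pool fills up the probes get longer, which is the same trade-off the other answers mention for random-and-retry schemes.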
Option 2 (Linear feedback shift register, see wikipedia):
1- Seed with a random number in range (0,676k). (See below why you can't seed with '0')
2- Generate subsequent random numbers by applying the following to the current ID number
num = (num >> 1) ^ (-(num & 1u) & 0x90000u);
3- Skip IDs larger than range (ie 0xA50A0+)
4- Convert number into ID format (as above)
*You will need to save the last number generated that was used for an ID, but you won't need to query the DB to see if it is used. This solution will enumerate all possible IDs except [000 AA] due to the way the LFSR works.
[edit] Since your range is actually larger than you need, you can get back [000 AA] by subtracting 1 before you convert to the ID and have your valid range be (0,0xA50A0]
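A small Python sketch of Option 2, assuming the 20-bit Galois LFSR step shown above with the mask 0x90000. The function names are invented for the example, and persisting `state` between calls (the "last number generated") is left to the caller.

```python
POOL_SIZE = 1000 * 26 * 26  # 676,000 == 0xA50A0


def lfsr_step(num):
    """One step of the LFSR: shift right, XOR in the mask if a 1 fell off."""
    lsb = num & 1
    num >>= 1
    if lsb:
        num ^= 0x90000
    return num


def next_id_number(state):
    """Advance to the next state in (0, POOL_SIZE] and return (state, index).

    The index is state - 1, so index 0 ('000 AA') stays reachable even
    though the LFSR itself never produces 0 from a nonzero seed.
    """
    state = lfsr_step(state)
    while state > POOL_SIZE:  # skip values outside the ID range
        state = lfsr_step(state)
    return state, state - 1
```

The returned index can then be converted to the ddd/letters format exactly as in Option 1. The seed must be nonzero, since 0 maps to itself.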
Use a finite group. Basically, take a 32- or 64-bit integer, and find a large number that is coprime to the modulus of that integer type (2^32 or 2^64, so any odd number qualifies); call this number M. Then, for all integers n in range, n * M (with wrap-around on overflow) will result in a unique number that has lots of digits.
This has the advantage that you don't need to pre-fill the database, or run a separate select query -- you can do this all from within one insert statement, by having your n just be an auto-increment, and have a separate ID column that defaults to the n * M.
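A minimal sketch of the idea in Python, assuming a 32-bit modulus. The constant `M` is an arbitrary odd number picked for illustration, and `scramble`/`unscramble` are hypothetical names. Note that the result is a 32-bit number, not a value in the 676,000-ID pool; fitting the ddd/letters format needs the pool size as the modulus, as in the modular-arithmetic answer further down.

```python
MODULUS = 2 ** 32
M = 2654435761  # arbitrary odd constant; any odd M is coprime to 2**32

# Modular inverse of M, so the mapping can be undone (Python 3.8+).
M_INV = pow(M, -1, MODULUS)


def scramble(n):
    """Map a sequential n to a scattered but unique 32-bit value."""
    return (n * M) % MODULUS


def unscramble(code):
    """Recover the original n from a scrambled value."""
    return (code * M_INV) % MODULUS
```

Because multiplication by an invertible element is a bijection on the integers modulo 2^32, distinct values of n below 2^32 cannot collide.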
You could generate a random ID conforming to that standard, do a DB select to see if it exists already, then insert it into a DB to note it has been "used". For the first 25% of the life of that scheme (or about 150k entries), it should be relatively fast to generate new random ID's. After that though, it will take longer and longer, and you might as well pre-fill the table to look for free IDs.
Depending on what you define as sequential, you could just pick a certain starting point on the letters, such as 'aa', and just loop through the three digits, so it would be:
001aa
002aa
003aa
Once you get to 999, increment the letter part (ab, ac, and so on up to zz).
You could use modular arithmetic to generate ids. Pick a number that is coprime with 676,000 and use it as a seed. id is the standard incrementing id of the table. Then the following pseudocode is what you need:
uidNo = (id * seed) % 676000
digits = uidNo // 676
char1 = uidNo % 26
char2 = (uidNo // 26) % 26
uidCode = str(digits).zfill(3) + chr(char1+65) + chr(char2+65)
If a user has more than one consecutively issued id, they could guess the algorithm and the seed and generate all the ids in order. This may mean the algorithm is not secure enough for your use case.
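For reference, a runnable Python version of that pseudocode might look like the sketch below. The value of `SEED` is just an example that happens to be coprime with 676,000 (it shares none of the factors 2, 5 or 13), and the function name is made up.

```python
from math import gcd

POOL_SIZE = 676_000  # 1000 * 26 * 26
SEED = 123_457       # hypothetical; must be coprime with POOL_SIZE

assert gcd(SEED, POOL_SIZE) == 1


def uid_code(row_id):
    """Map an incrementing table id to a 3-digit, 2-letter code."""
    uid_no = (row_id * SEED) % POOL_SIZE
    digits = uid_no // 676
    char1 = uid_no % 26
    char2 = (uid_no // 26) % 26
    return f"{digits:03d}{chr(char1 + 65)}{chr(char2 + 65)}"
```

Since the seed is coprime with the pool size, every id from 0 to 675,999 lands on a different code; after that the sequence repeats.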
I created a counter that goes up from 0 to 9999 until it resets again. I use the output of this counter as a value to make unique entries. However, the application needs to find its last created number each time the application is restarted. Therefore I am looking for a method which avoids any sort of object storage and relies solely on random number generation.
Something like:
int randomTimeBasedGenerator() {
    Random r = new Random(System.currentTimeMillis());
    // nextInt(10000) stays within 0..9999; nextInt() % 9999 could be negative
    int num = r.nextInt(10000);
    return num;
}
But what guarantee do I have that this method generates unique numbers? And, if not, how long would it remain unique? Are there any study papers I can look into for this sort of scenario?
Random number generation would be an elegant solution for my situation, if I can at least guarantee it won't repeat within a couple of weeks or months. But random number generation would be useless in my case if no such guarantee exists.
You have no guarantee that the return value of a random number generator remains unique. Random number generators generate unique sequences of numbers, not unique numbers. Random numbers will always repeat themselves, sooner or later.
As suggested by @Thilo, UUIDs are unique numbers. But an even better approach in your case might be to set up a lightweight database (SQLite will do) and add a record to a table with incremental IDs. It is not possible to keep track of a process without storing values somewhere.
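To put a number on "sooner or later", the birthday problem gives the chance of at least one repeat. The sketch below assumes uniform, independent draws from 10,000 possible values; the function name is made up for illustration.

```python
def collision_probability(draws, pool=10_000):
    """Probability that `draws` uniform random picks from `pool` contain a repeat."""
    p_unique = 1.0
    for k in range(draws):
        # The (k+1)-th pick must avoid the k values already taken.
        p_unique *= (pool - k) / pool
    return 1.0 - p_unique
```

With a pool of 10,000 the chance of a repeat passes one half after roughly 120 draws, so a purely random scheme would likely repeat long before the weeks or months asked for, unless very few numbers are generated.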
I have 4 integers which I want to convert into a seed in order to generate a random number. I understand this is arbitrary for the most part; I do, however, want to make sure what I am currently doing is not overkill (and that it generates enough spread in seed values).
I have roughly 1000 objects which I want to have random properties based on some of their variables.
Two variables are constant and are of the 0 - 1000 range and are random for each object, duplicates can occur but this is not likely at all (constant1 and constant2). The other two variables change with deltas of 1 over long time periods through the running of the program, start at 0, can be anywhere within the signed int32 range but will tend to be between -100 and 100 (variable1 and variable2).
How do you suitably generate a seed from these 4 values?
1. You should probably initialize the Random generator only once, when the class instance is initialized, so you should use only 2 of the properties (the other 2 are set to 0 by default, aren't they?) to get a seed.
2. Because of 1., and assuming that constant1 and constant2 are random by default within 0-1000, you can use constant1 * 1000 + constant2 to get a random number between 0 and 1000000. I'm not sure about the randomness distribution, but it should be enough to get a seed.
Update
If you really need to get the seed depend on other two variables, you can follow the pattern and do it as follows:
var seed = ((variable1 * 200 + variable2) * 1000 + constant1) * 1000 + constant2;
but because it can exceed the Int32 range, you have to do that in an unchecked context to prevent an OverflowException being thrown.
And the last thing: I'm not 100% sure it will give you normalized distribution of generated values.
If I start using a HiLo generator to assign ID's for a table, and then decide to increase or decrease the capacity (i.e. the maximum 'lo' value), will this cause collisions with the already-assigned ID's?
I'm just wondering if I need to put a big red flag around the number saying 'Don't ever change this!'
Note - not NHibernate specific, I'm just curious about the HiLo algorithm in general.
HiLo algorithms in general basically map two integers to one integer ID. It guarantees that the pair of numbers will be unique per database. Typically, the next step is to guarantee that a unique pair of numbers maps to a unique integer ID.
A nice explanation of how HiLo conceptually works is given in this previous SO answer
Changing the max_lo will preserve the property that your pair of numbers will be unique. However, will it make sure that the mapped ID is unique and collision-free?
Let's look at Hibernate's implementation of HiLo. The algorithm they appear to use (as from what I've gathered) is: (and I might be off on a technicality)
h = high sequence (starting at 0)
l_size = size of low block
l = low sequence (starting at 1)
ID = h*l_size + l
So, if your low block is, say, 100, your reserved ID blocks would go 1-100, 101-200, 201-300, 301-400...
Your High sequence is now 3. Now what would happen if you all of a sudden changed your l_size to 10? Your next block, your High is incremented, and you'd get 4*10+1 = 41
Oops. This new value definitely falls within the "reserved block" of 1-100. Someone with a high sequence of 0 would think, "Well, I have the range 1-100 reserved just for me, so I'll just put down one at 41, because I know it's safe."
There is definitely a very, very high chance of collision when lowering your l_size.
What about the opposite case, raising it?
Back to our example, let's raise our l_size to 500, turning the next key into 4*500+1 = 2001, reserving the range 2001-2500.
It looks like collision will be avoided, in this particular implementation of HiLo, when raising your l_size.
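The two cases can be checked with a few lines of Python. This is only a sketch of the formula quoted above (ID = h*l_size + l, with l starting at 1), not Hibernate's actual code, and `hilo_block` is a made-up helper.

```python
def hilo_block(high, l_size):
    """IDs reserved by one high value under ID = h * l_size + l, l = 1..l_size."""
    first = high * l_size + 1
    return range(first, first + l_size)


# Blocks already handed out with a block size of 100 (high = 0..3): IDs 1-400.
old_ids = {i for h in range(4) for i in hilo_block(h, 100)}

# Lowering the block size to 10: the next block (high = 4) is 41-50,
# which sits inside the old 1-100 block.
lowered = hilo_block(4, 10)

# Raising the block size to 500: the next block (high = 4) is 2001-2500,
# which is clear of everything handed out so far.
raised = hilo_block(4, 500)
```

Here `old_ids` overlaps `lowered` but not `raised`, matching the reasoning above for this particular formula.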
Of course, you should run some tests of your own to make sure that this is the actual implementation, or close to it. One way would be to set l_size to 100 and find the first few keys, then set it to 500 and find the next. If there is a huge jump like the one mentioned here, you might be safe.
However, I am not by any means suggesting that it is best practice to raise your l_size on an existing database.
Use your own discretion; the HiLo algorithm isn't exactly one made with a varying l_size in mind, and your results may in the end be unpredictable depending on your exact implementation. Maybe someone who has had experience with raising their l_size and running into trouble can confirm this account.
So in conclusion, even though, in theory, Hibernate's HiLo implementation will most likely avoid collisions when l_size is raised, it probably still isn't good practice. You should code as if l_size were not going to change over time.
But if you're feeling lucky...
See the Linear Chunk table allocator -- this is logically a more simple & correct approach to the same problem.
What's the Hi/Lo algorithm?
By allocating ranges from the number space & representing the NEXT directly, rather than complicating the logic with high words or multiplied numbers, you can directly see what keys are going to be generated.
Essentially, "Linear Chunk allocator" uses addition rather than multiplication. If the NEXT is 1000 & we've configured range-size of 20, NEXT will advance to 1020 and we'll hold keys 1000-1019 for allocation.
Range size can be tuned or reconfigured at any time, without loss of integrity. There is a direct relationship between the NEXT field of the allocator, the generated keys & MAX(ID) existing in the table.
(By comparison, "Hi-Lo" uses multiplication. If the next is 50 & the multiplier is 20, then you're allocating keys around 1000-1019. There is no direct correlation between NEXT, generated keys & MAX(ID) in the table, it is difficult to adjust NEXT safely, and the multiplier can't be changed without disturbing the current allocation point.)
With "Linear Chunk", you can configure how large each range/chunk is -- a size of 1 is equivalent to the traditional table-based "single allocator" & hits the database to generate each key, a size of 10 is 10x faster as it allocates a range of 10 at once, and a size of 50 or 100 is faster still.
A size of 65536 generates ugly-looking keys, wastes vast numbers of keys on server restart, and is equivalent to Scott Ambler's original HI-LO algorithm.
In short, Hi-Lo is an erroneously complex & flawed approach to what should have been conceptually trivially simple -- allocating ranges along a number line.
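A minimal in-memory sketch of the "Linear Chunk" idea in Python. The class and attribute names are made up; in a real system NEXT would live in an allocator table row and be read and advanced inside a transaction.

```python
class LinearChunkAllocator:
    """Stand-in for the allocator row; next_key is the first unallocated key."""

    def __init__(self, next_key=1000):
        self.next_key = next_key

    def allocate(self, range_size):
        """Reserve `range_size` keys and advance NEXT by simple addition."""
        start = self.next_key
        self.next_key += range_size
        return range(start, start + range_size)
```

With NEXT at 1000 and a range size of 20, `allocate(20)` hands out 1000-1019 and leaves NEXT at 1020. Calling it again with a different size just continues from 1020, which is why the range size can change between calls without any overlap.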
I tried to unearth the behaviour of the HiLo algorithm through a simple hello-world-ish Hibernate application.
I tried a hibernate example with
<generator class="hilo">
<param name="table">HILO_TABLE</param>
<param name="column">TEST_HILO</param>
<param name="max_lo">40</param>
</generator>
Table named "HILO_TABLE" created with single column "TEST_HILO"
Initially I set the value of the TEST_HILO column to 8.
update HILO_TABLE set TEST_HILO=8;
I observed that pattern to create ID is
hivalue * lowvalue + hivalue
hivalue is column value in DB (i.e. select TEST_HILO from HILO_TABLE )
lowvalue is from config xml (40 )
so in this case IDs started from 8*40 + 8 = 328
In my hibernate example i added 200 rows in one session. so rows were created with IDs 328 to 527
And in DB hivalue was incremented till 13.
The increment logic seems to be :-
new hivalue in DB = initial value in DB + ceil(rows_inserted / (lowvalue + 1))
= 8 + ceil(200/41) = 8 + 5 = 13
Now if I run same hibernate program to insert rows, the IDs should start from
13*40 + 13 = 533
When I ran the program, this was confirmed.
Just by experience I'd say: yes, decreasing will cause collisions. When you have a lower max low, you get lower numbers, independent of the high value in the database (which is handled the same way, e.g. incremented with each session factory instance in the case of NH).
There is a chance that increasing will not cause collisions. But you either need to try it or ask someone who knows better than I do to be sure.
Old question, I know, but worth answering with a 'yes, you can'
You can increase or decrease your next_hi at any point as long as you recompute your hibernate_unique_key table based on the current Id numbers of your tables.
In our case, we have a per-entity hibernate_unique_key table with two columns:
next_hi
EntityName.
The next_hi for any given Id is calculated as
SELECT MAX(Id) FROM TableName/(@max_lo + 1) + 1
The script below runs through every table with an Id column and updates our next_hi values
DECLARE @scripts TABLE(Script VARCHAR(MAX))
DECLARE @max_lo VARCHAR(MAX) = '100';
INSERT INTO @scripts
SELECT '
INSERT INTO hibernate_unique_key (next_hi, EntityName)
SELECT
(SELECT ISNULL(Max(Id), 0) FROM ' + name + ')/(' + @max_lo + ' + 1) + 1, ''' + name + '''
'
FROM sys.tables WHERE type_desc = 'USER_TABLE'
AND COL_LENGTH(name, 'Id') IS NOT NULL
AND NOT EXISTS (select next_hi from hibernate_unique_key k where name = k.EntityName)
DECLARE curs CURSOR FOR SELECT * FROM @scripts
DECLARE @script VARCHAR(MAX)
OPEN curs
FETCH NEXT FROM curs INTO @script
WHILE @@FETCH_STATUS = 0
BEGIN
--PRINT @script
EXEC(@script)
FETCH NEXT FROM curs INTO @script
END
CLOSE curs
DEALLOCATE curs
When a user adds a new item in my system, I want to produce a unique non-incrementing pseudo-random 7-digit code for that item. The number of items created will only number in the thousands (<10,000).
Because it needs to be unique and no two items will have the same information, I could use a hash, but it needs to be a code they can share with other people - hence the 7 digits.
My original thought was just to loop the generation of a random number, check that it wasn't already used, and if it was, rinse and repeat. I think this is a reasonable if distasteful solution given the low likelihood of collisions.
Responses to this question suggest generating a list of all unused numbers and shuffling them. I could probably keep a list like this in a database, but we're talking 10,000,000 entries for something relatively infrequent.
Does anyone have a better way?
Pick a 7-digit prime number A, and a big prime number B, and
int nth_unique_7_digit_code(int n) {
    /* widen to 64 bits so that n * B cannot overflow */
    return (int)(((long long)n * B) % A);
}
The count of all unique codes generated by this will be A.
If you want to be more "secure", do pow(some_prime_number, n) % A, i.e.
static int current_code = B;
int get_next_unique_code() {
    /* widen to 64 bits so that the product cannot overflow */
    current_code = (int)(((long long)B * current_code) % A);
    return current_code;
}
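The first variant is easy to try out in Python, where integers do not overflow. In the sketch below the values of `A` and `B` are illustrative picks rather than anything from the answer: 9,999,991 is assumed to be a 7-digit prime, and the only property the code actually relies on is gcd(A, B) == 1, which it asserts.

```python
from math import gcd

A = 9_999_991      # modulus; assumed here to be a 7-digit prime
B = 1_000_000_007  # multiplier; a large prime chosen for illustration

# Distinct codes for n in 0..A-1 only need B to be invertible modulo A.
assert gcd(A, B) == 1


def nth_unique_7_digit_code(n):
    """Return the n-th code, zero-padded to 7 digits."""
    return f"{(n * B) % A:07d}"
```

Results below 1,000,000 need the zero padding to stay 7 characters long, and n = 0 maps to 0000000, so starting the counter at 1 may be preferable.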
You could use an incrementing ID and then XOR it with some fixed key.
const int XORCode = 12345;
private int Encode(int id)
{
return id^XORCode;
}
private int Decode(int code)
{
return code^XORCode;
}
Honestly, if you want to generate only a couple of thousand 7-digit codes, while 10 million different codes will be available, I think just generating a random one and checking for a collision is good enough.
The chance of a collision on the first hit will be, in the worst case scenario, about 1 in a thousand, and the computational effort to just generate a new 7-digit code and check for a collision again will be much smaller than keeping a dictionary, or similar solutions.
Using a GUID instead of a 7-digit code as harryovers suggested will also certainly work, but of course a GUID will be slightly harder to remember for your users.
I would suggest using a GUID instead of a 7-digit code, as it will be more unique and you don't have to worry about generating them, as .NET will do this for you.
All solutions for a "unique" ID must have a database somewhere: Either one which contains the used IDs or one with the free IDs. As you noticed, the database with free IDs will be pretty big so most often, people use a "used IDs" database and check for collisions.
That said, some databases offer a "random ID" generator/sequence which already returns IDs in a range in random order.
This works by using a random number generator which can create all numbers in a range without repeating itself, plus the feature that you can save its state somewhere. So what you do is run the generator once, use the ID and save the new state. For the next run, you load the state and reset the generator to the last state to get the next random ID.
I assume you'll have a table of the generated ones. In that case, I don't see a problem with picking random numbers and checking them against the database, but I wouldn't do it individually. Generating them is cheap, doing the DB query is expensive relative to that. I'd generate 100 or 1,000 at a time and then ask the DB which of those exists. Bet you won't have to do it twice most of the time.
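A sketch of the batching idea in Python. The set `existing` stands in for a single "which of these codes already exist?" query against the table, and the function name is made up for the example.

```python
import random


def fresh_codes(existing, batch_size=100, rng=random):
    """Generate a batch of 7-digit candidates and drop the ones already taken.

    `existing` is a set standing in for one database round trip.
    """
    candidates = {f"{rng.randrange(10_000_000):07d}" for _ in range(batch_size)}
    return candidates - existing
```

The caller would take one code from the result, insert it, and keep the rest around for later requests, re-checking them at insert time in case another process claimed one in between.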
You have fewer than 10,000 items, so you need only 4 digits to store a unique number for all items.
Since you have 7 digits, you have 3 digits extra.
If you combine a unique sequence number of 4 digits with a random number of 3 digits, you will be unique and random. You increment the sequence number with every new ID you generate.
You can just append them in any order, or mix them.
seq = abcd,
rnd = ABC
You can create the following ID's:
abcdABC
ABCabcd
aAbBcCd
If you use only one mixing algorithm, you will have unique numbers, that look random.
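As an illustration, the aAbBcCd layout could be produced like this in Python. The function names are invented, and uniqueness comes entirely from the 4-digit sequence number; the random digits only disguise it.

```python
import random


def mixed_id(seq, rng=random):
    """Interleave a 4-digit sequence 'abcd' with 3 random digits 'ABC' as aAbBcCd."""
    if not 0 <= seq <= 9999:
        raise ValueError("sequence number must fit in 4 digits")
    s = f"{seq:04d}"
    r = f"{rng.randrange(1000):03d}"
    return s[0] + r[0] + s[1] + r[1] + s[2] + r[2] + s[3]


def sequence_of(code):
    """Recover the sequence number from positions 0, 2, 4 and 6."""
    return int(code[0::2])
```

Anyone who knows the layout can read the sequence number back out, so this hides the order count only from casual inspection.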
I would try to use an LFSR (linear feedback shift register). The code is really simple and you can find examples everywhere (e.g. Wikipedia). Even though it's not cryptographically secure, it looks very random. The implementation will also be very fast, since it mainly uses shift operations.
With only thousands of items in the database, your original idea seems sound. Checking the existence of a value in a sorted (indexed) list of a few tens of thousands of items would only require a few data fetches and comparisons.
Pre-generating the list doesn't sound like a good idea, because you will either store way more numbers than are necessary, or you will have to deal with running out of them.
The probability of a collision is very low.
For instance, say you have 10^4 users and 10^7 possible IDs.
The probability that you pick a used ID 10 times in a row is then (10^-3)^10 = 10^-30.
That chance is lower than once in a lifetime of any person.
Well, you could ask the user to pick their own 7-digit number and validate it against the population of existing numbers (which you would have stored as they were used up), but I suspect you would be filtering a lot of 1234567, 7654321, 9999999, 7777777 type responses and might need a few RegExs to achieve the filtering, plus you'd have to warn the user against such sequences in order not to have a bad, repetitive, user input experience.
Every order in my online store has a user-facing order number. I'm wondering the best way to generate them. Criteria include:
Short
Easy to say over the phone (e.g., "m" and "n" are ambiguous)
Unique
Checksum (overkill? Useful?)
Edit: Doesn't reveal how many total orders there have been (a customer might find it unnerving to make your 3rd order)
Right now I'm using the following method (no checksum):
def generate_number
  possible_values = 'abfhijlqrstuxy'.upcase.split('') | '123456789'.split('')
  record = true
  while record
    random = Array.new(5){possible_values[rand(possible_values.size)]}.join
    record = Order.find(:first, :conditions => ["number = ?", random])
  end
  self.number = random
end
As a customer I would be happy with:
year-month-day/short_uid
for example:
2009-07-27/KT1E
It gives room for about 33^4 ≈ 1.2 million orders a day.
Here is an implementation for a system I proposed in an earlier question:
MAGIC = [];
29.downto(0) {|i| MAGIC << 839712541[i]}
def convert(num)
  order = 0
  0.upto(MAGIC.length - 1) {|i| order = order << 1 | (num[i] ^ MAGIC[i]) }
  order
end
It's just a cheap hash function, but it makes it difficult for an average user to determine how many orders have been processed, or a number for another order. It won't run out of space until you've done 2^30 orders, which you won't hit any time soon.
Here are the results of convert(x) for 1 through 10:
1: 302841629
2: 571277085
3: 34406173
4: 973930269
5: 437059357
6: 705494813
7: 168623901
8: 906821405
9: 369950493
10: 638385949
At my old place it was the following:
The customer ID (which started at 1001), the sequence of the order they made then the unique ID from the Orders table. That gave us a nice long number of at least 6 digits and it was unique because of the two primary keys.
I suppose if you put dashes or spaces in, you could even get a little insight into the customer's purchasing habits. It isn't mind-bogglingly secure and I guess an order ID would be guessable, but I am not sure whether there is a security risk in that or not.
Ok, how about this one?
Sequentially, starting at some number (2468) and add some other number to it, say the day of the month that the order was placed.
The number always increases (until you exceed the capacity of the integer type, but by then you probably don't care, as you will be incredibly successful and will be sipping margaritas in some far-off island paradise). It's simple enough to implement, and it mixes things up enough to throw off any guessing as to how many orders you have.
I'd rather submit the number 347 and get great customer service at a smaller personable website than: G-84e38wRD-45OM at the mega-site and be ignored for a week.
You would not want this as a system id or part of a link, but as a user-friendly number it works.
Douglas Crockford's Base32 Encoding works superbly well for this.
http://www.crockford.com/wrmg/base32.html
Store the ID itself in your database as an auto-incrementing integer, starting at something suitably large like 100000, and simply expose the encoded value to the customer/interface.
5 characters will see you through your first ~32 million orders, whilst performing very well and satisfying most of these requirements. It doesn't allow for the exclusion of similar sounding characters though.
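A compact Python sketch of encoding and decoding with Crockford's alphabet (digits plus letters, without I, L, O and U). This covers only the basic symbol set, not the optional check symbol from the spec, and the function names are made up.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode(n):
    """Encode a non-negative integer using Crockford's Base32 symbols."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0"
    out = []
    while n:
        n, rem = divmod(n, 32)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


def decode(text):
    """Decode, accepting lower case and the look-alikes O, I and L."""
    text = text.upper().replace("O", "0").replace("I", "1").replace("L", "1")
    n = 0
    for ch in text:
        n = n * 32 + ALPHABET.index(ch)
    return n
```

Starting the underlying counter at 100000 gives `encode(100000) == "31N0"`, and the decoder's tolerance for O/0 and I/L/1 is what makes the codes forgiving when typed or read aloud.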
Rather than generating and storing a number, you might try creating an encrypted version that would not reveal the number of orders in the system. Here's an article on exactly that.
Something like this:
Get a sequential order number. Or, maybe, a UNIX timestamp plus two or three random digits (for when two orders are placed at the same moment) is fine too.
Bitwise-XOR it with some semi-secret value to make number appear "pseudo-random". This is primitive and won't stop those who really want to investigate how many orders you have, but for true "randomness" you need to keep a (large) permutation table. Or you'll need to have large random numbers, so you won't be hit by the birthday paradox.
Add a check digit using the Verhoeff algorithm (I'm not sure it will have such good properties for base 33, but it shouldn't be bad).
Convert the number to - for example - base 33 ("0-9A-Z", except for "O", "Q" and "L" which can be mistaken for "0" and "1") or something like that. Ease of pronunciation means excluding more letters.
Group the result in some visually readable pattern, like XXX-XXX-XX, so users won't have to track the position with their fingers or mouse pointers.
Sequentially, starting at 1? What's wrong with that?
(Note: This answer was given before the OP edited the question.)
Just one Rube Goldberg-style idea:
You could generate a table with a random set of numbers that is tied to a random period of time:
Time Period Interleaver
next 2 weeks: 442
following 8 days: 142
following 3 weeks: 580
and so on... this gives you an unlimited number of Interleavers, and doesn't let anyone know your rate of orders because your time periods could be on the order of days and your interleaver is doing a lot of low-tech "mashing" for you.
You can generate this table once, and simply ensure that all Interleavers are unique.
You can ensure you don't run out of Interleavers by simply adding more characters into the set, or start by defining longer Interleavers.
So you generate an order ID by getting a sequential number, and using today's Interleaver value, interleave its digits (hence, the name) in between each sequential number's digits. Guaranteed unique - guaranteed confusing.
Example:
Today I have a sequential number 1, so I will generate the order ID: 4412
The next order will be 4422
The next order will be 4432
The 10th order will be 41402
In two weeks my interleaver will change to 142,
The 200th order will be 210402
The 201st order will be 210412
Eight days later, my interleaver changes to 580:
The 292nd order will be 259820
This will be completely confusing but completely deterministic. You can just remove every other digit starting at the 1's place. (except when your order id is only one digit longer than your interleaver)
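The interleaving in the examples above can be reproduced with a short Python function. It works from the 1's place upward, placing an interleaver digit and then a sequence digit; the function name is made up, and this is just one reading of the scheme that matches the sample order IDs.

```python
def interleave(seq, interleaver):
    """Right-aligned interleave of the digits of `interleaver` and `seq`."""
    s = str(seq)[::-1]          # sequence digits, 1's place first
    k = str(interleaver)[::-1]  # interleaver digits, 1's place first
    out = []
    for i in range(max(len(s), len(k))):
        if i < len(k):
            out.append(k[i])
        if i < len(s):
            out.append(s[i])
    return "".join(out)[::-1]
```

For instance, `interleave(1, 442)` gives "4412", `interleave(10, 442)` gives "41402" and `interleave(292, 580)` gives "259820", in line with the worked examples.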
I didn't say this was the best way - just a Friday idea.
You could do it like a postal code:
2b2 b2b
That way it has some kind of checksum (not really, but at least you know it's wrong if there are 2 consecutive numbers or letters). It's easy to read out over the phone, and it doesn't give an indication of how many orders are in the system.
http://blog.logeek.fr/2009/7/2/creating-small-unique-tokens-in-ruby
>> rand(36**8).to_s(36)
=> "uur0cj2h"
How about getting the current time in milliseconds and using that as your order ID?