I have to write oracle procedure registering 16 digit security numbers in registered_security_numbers.
I know that first 6 digits of security number are either 1234 11 or 1234 12 , rest 10 digits are generated randomly.
I have 2 possible solutions :
Write second procedure,which generates all possible security numbers and inserts them in possible_security_numbers table, setting property free=1 .
Then when i get a request to register a new security number, I query possible_security_numbers table for a random security number,which is free and insert it in registered_security_numbers.
Every time I get a request to register security number, I generate random number from 1234 1100 0000 0000 - 1234 1299 9999 9999 range until I get security number, which does not exist in registered_security_numbers table and insert it in registered_security_numbers.
(1) approach i don't like because possible_security_numbers table will contain several billion entries and I am not sure how good it is or how fast select/update can be run.
(2) approach I don't like because if I have many records in registered_security_numbers table, generating random number from a range might be repeated many times.
I'd like to know if anyone has other solution or can comment on my solutions, which seem bad to me …
How many numbers are you actually going to generate?
Imagine that, at most, you're going to generate 1 million (10^6) numbers. If so, the odds that you're going to need to generate a second random number is roughly 5 in 10^-5 (0.00005 or 0.005%). If that's the case, it makes little sense to worry about the expense of occasionally generating a second number or the near impossibility of generating a third. The second approach will be much more efficient.
On the other hand, imagine that you intend to generate 1 billion numbers over time. If that's the case, then by the end, the odds that you're going to need to generate a second number is 5% and you'll need to generate 3 or 4 numbers reasonably often. Here, the trade-offs are much harder to figure out. Depending on the business, the performance impact of catching the unique constraint violation exception and generating multiple numbers on some calls may cause a service to violate the SLA often enough to matter while enumerating the valid numbers may be more efficient on average.
On the third hand, imagine that you intend to generate all 20 billion numbers over time. If that's the case, by the end, you'd expect to have to generate 10 billion random numbers before you found the one remaining valid number. If that's the case, the clear advantage will be with the first option of enumerating all possible numbers and tracking which ones have been used.
Related
Note: I have already read through older questions like What is the best format for a customer number, order number? , however my question is a little more specific.
Generating pseudo-random numbers encounter the "birthday problem" before long. For example, if I am using a 27-bit field for my order number, after 15000 entries, the chances of collision increase to 50%.
I am wondering whether large ecommerce businesses like Amazon generates its order number in any other way - for example :
pre-generate the entire set and pick from them randomly (a few hundred GB of database)
Use lexicographical "next_permutation" starting from a particular seed number
MD5 or SHA-1 hash of the date, user-id, etc parameters, truncated to 14 digits
etc
All I want is a non-repeating integer (doesnt need to be very random except to obfuscate total number of orders) of a certain width. Any ideas on how this can be achieved ?
Suggest starting with the date in reverse format then starting at 1, followed by a check (or random) digit. If you are likely to never exceed 100 orders per day you need add two digits plus a check/random digit.
The year need include only the final two digits, possibly only the final digit, depending on how long you keep records of orders: 7 years or so is usually enough, meaning the records from 2009 (beginning with 9) could be deleted during 2018 in preparation to use the order numbers again in 2019. You could use mmdd for the next 4 digits, or simply number the days through the year and use just 3 digits - it depends how human-friendly you want the number to be. It's also possible just to omit the day of the month and restart the sequential numbers at the start of each month, rather than every day.
Today is 2 Nov 2017, let's suppose this is order no 16 today, your order no would be 71102168 (where the 8 is a check digit or random digit). If you're likely to have up to, but not exceeding a thousand, you'll need an extra digit, thus: 711020168. To avoid limiting yourself the number of digits, you might prefer to use a hyphen: 71102-168 … you could include another hyphen before the check/random digit if you wish: 71102-16-8.
If you have several areas dealing with orders, you may wish to include a depot number, perhaps at the beginning or after the date, allowing you to use the sequence numbers at each depot - eg depot 5 might be: 5-71102-168, 71102-5-168 or 711025168. Again, if you don't use hyphens, you'll need to assess whether you need up to ten, a hundred or a thousand (etc) possible depot numbers. I hope this helps!
This problem has been solved, why
not use the UUID. See RFC 4122. These are close enough to globally unique you can easily combine many systems and never ever have a duplicate just because the number space is so massive.
I am generating about 100 million random numbers to pick from 300 things. I need to set it up so that I have 10 million independent instances (different seed) that picks 10 times each. The goal is for the aggregate results to have very low discrepancy, as in, each item gets picked about the same number of times.
The problem is with a regular prng, some numbers get chosen more than others. (tried lcg and mersenne twister) The difference between the most picked and least picked can be several thousand, to ten thousands) With linear congruity generators and mersenne twister, I also tried picking 100 million times with 1 instance and that also didn't yield uniform results. I'm guessing this is because the period is very long, and perhaps 100 million isn't big enough. Theoretically, if I pick enough numbers, the results should reach uniformity. (should settle at the expected value)
I switched to Sobol, a quasirandom generator and got much better results with the 100 million from 1 instance test. (difference between most picked and least picked is about 5) But splitting them up to 10 million instances at 10 times each, the uniformity was lost and I got similar results as with the prng. Sobol seem very sensitive to sequence - skipping ahead randomly diminishes uniformity.
Is there a class of random generators that can maintain quasirandom-like low discrepancy even when combining 10 million independent instances? Or is that theoretically impossible? One solution I can think of now is to use 1 Sobol generator that is shared across 10 million instances, so effectively it is the same as the 100 million from 1 instance test.
Both the shuffling and proper use of Sobol should give you uniformity as desired. Shuffling needs to be done at the aggregate level (start with a global 100M sample having the desired aggregate frequencies, then shuffle it to introduce randomness, and finally split into the 10 values instances; shuffling within each instance wouldnt help globally, as you noted).
But that's an additional level of uniformity, you might not really need that: randomness might be enough.
First of all I would check the check itself, because it sounds strange that with enough samples you're really getting significant deviations (check "chi square test" to qualify such significance, or equivalently how many are "enough" samples). So for a first safety check: if you're picking independent values, then simplify differently to 10M instances picking 10 out 2 categories: do you get approximately a binomial distribution? For exclusive picking it's a different distribution (hypergeometric iirc, but need to check). Then generalize to more categories (multinomial distribution) and only later it's safe to proceed with your problem.
When a user adds a new item in my system, I want to produce a unique non-incrementing pseudo-random 7-digit code for that item. The number of items created will only number in the thousands (<10,000).
Because it needs to be unique and no two items will have the same information, I could use a hash, but it needs to be a code they can share with other people - hence the 7 digits.
My original thought was just to loop the generation of a random number, check that it wasn't already used, and if it was, rinse and repeat. I think this is a reasonable if distasteful solution given the low likelihood of collisions.
Responses to this question suggest generating a list of all unused numbers and shuffling them. I could probably keep a list like this in a database, but we're talking 10,000,000 entries for something relatively infrequent.
Does anyone have a better way?
Pick a 7-digit prime number A, and a big prime number B, and
int nth_unique_7_digit_code(int n) {
return (n * B) % A;
}
The count of all unique codes generated by this will be A.
If you want to be more "secure", do pow(some_prime_number, n) % A, i.e.
static int current_code = B;
int get_next_unique_code() {
current_code = (B * current_code) % A;
return current_code;
}
You could use an incrementing ID and then XOR it on some fixed key.
const int XORCode = 12345;
private int Encode(int id)
{
return id^XORCode;
}
private int Decode(int code)
{
return code^XORCode;
}
Honestly, if you want to generate only a couple of thousand 7-digit codes, while 10 million different codes will be available, I think just generating a random one and checking for a collision is good enough.
The chance of a collision on the first hit will be, in the worst case scenario, about 1 in a thousand, and the computational effort to just generate a new 7-digit code and check for a collision again will be much smaller than keeping a dictionary, or similar solutions.
Using a GUID instead of a 7-digit code as harryovers suggested will also certainly work, but of course a GUID will be slightly harder to remember for your users.
i would suggest using a guid instead of a 7 digit code as it will be more unique and you don't have to worry about generateing them as .NET will do this for you.
All solutions for a "unique" ID must have a database somewhere: Either one which contains the used IDs or one with the free IDs. As you noticed, the database with free IDs will be pretty big so most often, people use a "used IDs" database and check for collisions.
That said, some databases offer a "random ID" generator/sequence which already returns IDs in a range in random order.
This works by using a random number generator which can create all numbers in a range without repeating itself plus the feature that you can save it's state somewhere. So what you do is run the generator once, use the ID and save the new state. For the next run, you load the state and reset the generator to the last state to get the next random ID.
I assume you'll have a table of the generated ones. In that case, I don't see a problem with picking random numbers and checking them against the database, but I wouldn't do it individually. Generating them is cheap, doing the DB query is expensive relative to that. I'd generate 100 or 1,000 at a time and then ask the DB which of those exists. Bet you won't have to do it twice most of the time.
You have <10.000 items, so you need only 4 digits to store a unique number for all items.
Since you have 7 digits, you have 3 digits extra.
If you combine a unique sequence number of 4 digits with a random number of 3 digits, you will be unique and random. You increment the sequence number with every new ID you generate.
You can just append them in any order, or mix them.
seq = abcd,
rnd = ABC
You can create the following ID's:
abcdABC
ABCabcd
aAbBcCd
If you use only one mixing algorithm, you will have unique numbers, that look random.
I would try to use an LFSR (Linear feedback shift register) the code is really simple you can find examples everywhere ie Wikipedia and even though it's not cryptographically secure it looks very random. Also the implementation will be very fast since it's using mainly shift operations.
With only thousands of items in the database, your original idea seems sound. Checking the existance of a value in a sorted (indexed) list of a few tens of thousands of items would only require a few data fetches and comparisons.
Pre-generating the list doesn't sound like a good idea, because you will either store way more numbers than are necessary, or you will have to deal with running out of them.
Probability of having hits is very low.
For instance - you have 10^4 users and 10^7 possible IDs.
Probability that you pick used ID 10 times in row is now 10^-30.
This chance is lower than once in a lifetime of any person.
Well, you could ask the user to pick their own 7-digit number and validate it against the population of existing numbers (which you would have stored as they were used up), but I suspect you would be filtering a lot of 1234567, 7654321, 9999999, 7777777 type responses and might need a few RegExs to achieve the filtering, plus you'd have to warn the user against such sequences in order not to have a bad, repetitive, user input experience.
As much as I like using GUIDs as the unique identifiers in my system, it is not very user-friendly for fields like an order number where a customer may have to repeat that to a customer service representative.
What's a good algorithm to use to generate order number so that it is:
Unique
Not sequential (purely for optics)
Numeric values only (so it can be easily read to a CSR over phone or keyed in)
< 10 digits
Can be generated in the middle tier without doing a round trip to the database.
UPDATE (12/05/2009)
After carefully reviewing each of the answers posted, we decided to randomize a 9-digit number in the middle tier to be saved in the DB. In the case of a collision, we'll regenerate a new number.
If the middle tier cannot check what "order numbers" already exists in the database, the best it can do will be the equivalent of generating a random number. However, if you generate a random number that's constrained to be less than 1 billion, you should start worrying about accidental collisions at around sqrt(1 billion), i.e., after a few tens of thousand entries generated this way, the risk of collisions is material. What if the order number is sequential but in a disguised way, i.e. the next multiple of some large prime number modulo 1 billion -- would that meet your requirements?
<Moan>OK sounds like a classic case of premature optimisation. You imagine a performance problem (Oh my god I have to access the - horror - database to get an order number! My that might be slow) and end up with a convoluted mess of psuedo random generators and a ton of duplicate handling code.</moan>
One simple practical answer is to run a sequence per customer. The real order number being a composite of customer number and order number. You can easily retrieve the last sequence used when retriving other stuff about your customer.
One simple option is to use the date and time, eg. 0912012359, and if two orders are received in the same minute, simply increment the second order by a minute (it doesn't matter if the time is out, it's just an order number).
If you don't want the date to be visible, then calculate it as the number of minutes since a fixed point in time, eg. when you started taking orders or some other arbitary date. Again, with the duplicate check/increment.
Your competitors will glean nothing from this, and it's easy to implement.
Maybe you could try generating some unique text using a markov chain - see here for an example implementation in Python. Maybe use sequential numbers (rather than random ones) to generate the chain, so that (hopefully) the each order number is unique.
Just a warning, though - see here for what can possibly happen if you aren't careful with your settings.
One solution would be to take the hash of some field of the order. This will not guarantee that it is unique from the order numbers of all of the other orders, but the likelihood of a collision is very low. I would imagine that without "doing a round trip to the database" it would be challenging to make sure that the order number is unique.
In case you are not familiar with hash functions, the wikipedia page is pretty good.
You could base64-encode a guid. This will meet all your criteria except the "numeric values only" requirement.
Really, though, the correct thing to do here is let the database generate the order number. That may mean creating an order template record that doesn't actually have an order number until the user saves it, or it might be adding the ability to create empty (but perhaps uncommitted) orders.
Use primitive polynomials as finite field generator.
Your 10 digit requirement is a huge limitation. Consider a two stage approach.
Use a GUID
Prefix the GUID with a 10 digit (or 5 or 4 digit) hash of the GUID.
You will have multiple hits on the hash value. But not that many. The customer service people will very easily be able to figure out which order is in question based on additional information from the customer.
The straightforward answer to most of your bullet points:
Make the first six digits a sequentially-increasing field, and append three digits of hash to the end. Or seven and two, or eight and one, depending on how many orders you envision having to support.
However, you'll still have to call a function on the back-end to reserve a new order number; otherwise, it's impossible to guarantee a non-collision, since there are so few digits.
We do TTT-CCCCCC-1A-N1.
T = Circuit type (D1E=DS1 EEL, D1U=DS1 UNE, etc.)
C = 6 Digit Customer ID
1 = The customer's first location
A = The first circuit (A=1, B=2, etc) at this location
N = Order type (N=New, X=Disconnect, etc)
1 = The first order of this kind for this circuit
Every order in my online store has a user-facing order number. I'm wondering the best way to generate them. Criteria include:
Short
Easy to say over the phone (e.g., "m" and "n" are ambiguous)
Unique
Checksum (overkill? Useful?)
Edit: Doesn't reveal how many total orders there have been (a customer might find it unnerving to make your 3rd order)
Right now I'm using the following method (no checksum):
def generate_number
possible_values = 'abfhijlqrstuxy'.upcase.split('') | '123456789'.split('')
record = true
while record
random = Array.new(5){possible_values[rand(possible_values.size)]}.join
record = Order.find(:first, :conditions => ["number = ?", random])
end
self.number = random
end
As a customer I would be happy with:
year-month-day/short_uid
for example:
2009-07-27/KT1E
It gives room for about 33^4 ~ 1mln orders a day.
Here is an implementation for a system I proposed in an earlier question:
MAGIC = [];
29.downto(0) {|i| MAGIC << 839712541[i]}
def convert(num)
order = 0
0.upto(MAGIC.length - 1) {|i| order = order << 1 | (num[i] ^ MAGIC[i]) }
order
end
It's just a cheap hash function, but it makes it difficult for an average user to determine how many orders have been processed, or a number for another order. It won't run out of space until you've done 230 orders, which you won't hit any time soon.
Here are the results of convert(x) for 1 through 10:
1: 302841629
2: 571277085
3: 34406173
4: 973930269
5: 437059357
6: 705494813
7: 168623901
8: 906821405
9: 369950493
10: 638385949
At my old place it was the following:
The customer ID (which started at 1001), the sequence of the order they made then the unique ID from the Orders table. That gave us a nice long number of at least 6 digits and it was unique because of the two primary keys.
I suppose if you put dashes or spaces in you could even get us a little insight into the customer's purchasing habits. It isn't mind boggling secure and I guess a order ID would be guessable but I am not sure if there is security risk in that or not.
Ok, how about this one?
Sequentially, starting at some number (2468) and add some other number to it, say the day of the month that the order was placed.
The number always increases (until you exceed the capacity of the integer type, but by then you probably don't care, as you will be incredibly successful and will be sipping margaritas in some far-off island paradise). It's simple enough to implement, and it mixes things up enough to throw off any guessing as to how many orders you have.
I'd rather submit the number 347 and get great customer service at a smaller personable website than: G-84e38wRD-45OM at the mega-site and be ignored for a week.
You would not want this as a system id or part of a link, but as a user-friendly number it works.
Douglas Crockford's Base32 Encoding works superbly well for this.
http://www.crockford.com/wrmg/base32.html
Store the ID itself in your database as an auto-incrementing integer, starting at something suitably large like 100000, and simply expose the encoded value to the customer/interface.
5 characters will see you through your first ~32 million orders, whilst performing very well and satisfying most of these requirements. It doesn't allow for the exclusion of similar sounding characters though.
Rather than generating and storing a number, you might try creating an encrypted version that would not reveal the number of orders in the system. Here's an article on exactly that.
Something like this:
Get sequential order number. Or, maybe, an UNIX timestamp plus two or three random digits (when two orders are placed at the same moment) is fine too.
Bitwise-XOR it with some semi-secret value to make number appear "pseudo-random". This is primitive and won't stop those who really want to investigate how many orders you have, but for true "randomness" you need to keep a (large) permutation table. Or you'll need to have large random numbers, so you won't be hit by the birthday paradox.
Add checkdigit using Verhoeff algorithm (I'm not sure it will have such a good properties for base33, but it shouldn't be bad).
Convert the number to - for example - base 33 ("0-9A-Z", except for "O", "Q" and "L" which can be mistaken with "0" and "1") or something like that. Ease of pronouncation means excluding more letters.
Group the result in some visually readable pattern, like XXX-XXX-XX, so users won't have to track the position with their fingers or mouse pointers.
Sequentially, starting at 1? What's wrong with that?
(Note: This answer was given before the OP edited the question.)
Just one Rube Goldberg-style idea:
You could generate a table with a random set of numbers that is tied to a random period of time:
Time Period Interleaver
next 2 weeks: 442
following 8 days: 142
following 3 weeks: 580
and so on... this gives you an unlimited number of Interleavers, and doesn't let anyone know your rate of orders because your time periods could be on the order of days and your interleaver is doing a lot of low-tech "mashing" for you.
You can generate this table once, and simply ensure that all Interleavers are unique.
You can ensure you don't run out of Interleavers by simply adding more characters into the set, or start by defining longer Interleavers.
So you generate an order ID by getting a sequential number, and using today's Interleaver value, interleave its digits (hence, the name) in between each sequential number's digits. Guaranteed unique - guaranteed confusing.
Example:
Today I have a sequential number 1, so I will generate the order ID: 4412
The next order will be 4422
The next order will be 4432
The 10th order will be 41402
In two weeks my interleaver will change to 142,
The 200th order will be 210402
The 201th order will be 210412
Eight days later, my interleaver changes to 580:
The 292th order will be 259820
This will be completely confusing but completely deterministic. You can just remove every other digit starting at the 1's place. (except when your order id is only one digit longer than your interleaver)
I didn't say this was the best way - just a Friday idea.
You could do it like a postal code:
2b2 b2b
That way it has some kind of checksum (not really, but at least you know it's wrong if there are 2 consecutive numbers or letters). It's easy to read out over the phone, and it doesn't give an indication of how many orders are in the system.
http://blog.logeek.fr/2009/7/2/creating-small-unique-tokens-in-ruby
>> rand(36**8).to_s(36)
=> "uur0cj2h"
How about getting the current time in miliseconds and using that as your order ID?