How to select nth random integer from a range of integers without repetition or storage? [duplicate] - algorithm

This question already has answers here:
Unique (non-repeating) random numbers in O(1)?
(22 answers)
Closed 1 year ago.
Let's say my system needs to provide a unique integer id regularly, between 1 and 10^20, from a function like --
function getNextRandomUniqueId(index:BigInt, min:BigInt, max:BigInt, seed:BigInt): BigInt { ? }
id = getNextRandomUniqueId(index=42, min=1, max=10^20, seed=0)
These ids need to be provided in random order as the index increases, not sequentially. Once an id has been provided, it cannot be provided again, as long as the index increases. My system cannot store a random list of all the numbers to be issued, or all the numbers issued, there's too many. I also don't want to rely on something like a random UUID, which is exceedingly unlikely to have a collision, but not guaranteed to.
How can this be done? To have a deterministic mathematical way to iterate randomly through a set of sequential integers without repetition and without storage?
EDIT: Fixed 1^20 to 10^20

This can be done, assuming you are allowed to store an encryption key and counter. Encryption is a one-to-one mapping so by encrypting all the numbers in a given range you will get back all those same numbers in a randomized order. Different keys will give a different order. Encrypt the numbers 0, 1, 2, 3, ... in order, using the key and keeping track of how far you have got.
Depending on the range of numbers, you may need to use some form of Format Preserving encryption to keep the outputs within the required range.

You cannot guarantee that your same id is not in another seed sequence.
Most languages use the time to generate the sequence when you are not providing a seed yourself. You have set your seed to zero so each time you restart your program, you will get your same ids. This is most likely not your intent :-)
But even when you would do this, the chance that you hit the same id is there.
1 in the 100,000,000,000,000,000,000.
The reason you can get the same id is because it is RANDOM
I would go with a GUID.
1 in the 340.280.000.000.000.000.000.000.000.000.000.000.000

Related

Generate random looking numbers deterministically from a random-access lookup key in O(1) space and time

I want to output random looking numbers based on an input. If the same input is put in, the same output is given.
I don't want to pregenerate and store a bunch of random data, and I don't want it to take an O(n) amount of time to recover the nth index.
It does not need to be secure, cryptographically or otherwise, just enough to look random.
If you want a deterministic random-access function from an (index,length) pair to a random looking string of bytes you could use SHA3-N(index)[:length] where N is the first convenient number greater than length.
This would not behave identically to an actual array as reading indexes 1 (with length 10) and 5 (with length 10) would not have any overlap (which you'd expect from an array).
This is going to be slow and very inconvenient for N>512, so if you need longer strings you'll want to do multiple rounds. Something like SHA3-512(SHA3-512(index)[0:256])++SHA3-512(SHA3-512(index)[256:512]) to get something 1024bytes long.
Armed with the multiple rounds part you could use any hash function (e.g. SHA256, MD5) which might be more convenient.
I should note that this is definitely not secure and the output could easily be predicted by an adversary.
Typically, a random number generator will generate the same sequence of pseudo-random numbers given the same seed. For example, such python code might be like so:
random.seed(1)
for i in range(1, 10):
print(random.randint(1,100)
Will print the same list no matter how many times you invoke that code. Similarly, so will this:
random.seed(42)
for i in range(1, 10):
print(random.randint(1,100)
If somehow you then describe the sections of your array as a seed (you could use a hash function to do this indeed) you can seed the generator with that value and reliably allow dynamic sizing of the list requested.

Random number generation guaranteeing uniqueness over time

I created a counter that goes up from 0 to 9999 until it resets again. I use the output of this counter as a value to make unique entries. However, the application needs to find its last created number each time the application is restarted. Therfore I am looking for a method which avoids any sort of object storage and relies solely on random number generation.
Something like:
int randomTimeBasedGenerator() {
Random r = new Random(System.currentTimeMillis())
int num = r.nextInt() % 9999
return num
}
But what guarantee do I have that this method generates unique numbers? And, if not, how long would it remain unique? Are there any study papers I can look into for this sort of scenario?
Random number generation would be an elegant solution for my situation, if I can at least guarantee it won't repeat within a couple of weeks or months. But random number generation would be useless in my case if no such guarantee exists.
You have no guarantee that the return value of a random number generator remains unique. Random number generators generate unique sequences of numbers, not unique numbers. Random numbers will always repeat themselves, sooner or later.
As suggested by #Thilo, UUIDs are unique numbers. But an even better approach in your case might be to set up a lightweight database (sqlite will do) and add a record to a table with incremental id's. It is not possible to keep track of a process without storing values somewhere.

Iterative Hash Algorithm for Fast File Check

I want to create a representation of the state of all files in a folder (ignoring order), so that I can send this state to another computer to check if we are in sync. This "state representation" is 3 numbers concatenated by . which are:
sum . product . number of items
The "sum" is the numerical addition all of the file's md5 numerical representations.
The product is the multiplication of all of the file's md5 numerical representations.
The number of items is just the number of files.
The main reason for doing this is that this allows me to create unique states iteratively/quickly when I add or delete a file (a modification being a combination of delete then add). Also, one should end up with the same "state" even if the same set of operations are performed in any random order.
Adding A File
Generate the file's md5
Calculate the md5's numerical value (x).
Add x to the sum
Multiply the product by x
Increment the number of items.
Removing A File
Generate the file's md5
Calculate the md5's numerical value (x).
Subtract x from the sum
Divide the product by x
Decrement the number of items.
Problems
Since the numerical representations of hashes can be quite large, I may have to use a library to generate results using strings rather than integers which may be quite slow.
With the limited testing I have done, I have not been able to create "collisions" where a collision is where two different sets of file hashes could produce the same state (remember that we are ignoring the order of the file hashes).
Question
I'm sure that I can't be the first person to want to achieve such a thing. Is there an algorithm or iterative hash function that aims to do the same thing already, preferably in PHP, Java, or Python? Is there a term for this type of thing, all I could think of was "iterative hash"? Is there a flaw with this algorithm that I haven't spotted, such as with "collisions" making generated state representations non-unique?
How many states can your file system take ? infinity for all practical purposes.
How long is your hash length ? short enough to be efficient, finite in any case.
Will I get collisions ? Yes.
So, your hash approach seems fine, particularly if it spreads correctly points that are close, i.e. the state of the fs differing by content of just one file hashes to very different values.
However, you should depend on your hash to produce collisions in the long run, it's a mathematical certainty that probability goes to one that someday you get a collision, given that collision chance is not 0.
So to be really safe, you probably need a full MD5 exchange, if speed and fast updates are the goal your scheme sounds good, but I would back it up with more infrequent exchanges of longer keys, just to be on the safe side if sync is mission critical.

How can I create a unique 7-digit code for an entity?

When a user adds a new item in my system, I want to produce a unique non-incrementing pseudo-random 7-digit code for that item. The number of items created will only number in the thousands (<10,000).
Because it needs to be unique and no two items will have the same information, I could use a hash, but it needs to be a code they can share with other people - hence the 7 digits.
My original thought was just to loop the generation of a random number, check that it wasn't already used, and if it was, rinse and repeat. I think this is a reasonable if distasteful solution given the low likelihood of collisions.
Responses to this question suggest generating a list of all unused numbers and shuffling them. I could probably keep a list like this in a database, but we're talking 10,000,000 entries for something relatively infrequent.
Does anyone have a better way?
Pick a 7-digit prime number A, and a big prime number B, and
int nth_unique_7_digit_code(int n) {
return (n * B) % A;
}
The count of all unique codes generated by this will be A.
If you want to be more "secure", do pow(some_prime_number, n) % A, i.e.
static int current_code = B;
int get_next_unique_code() {
current_code = (B * current_code) % A;
return current_code;
}
You could use an incrementing ID and then XOR it on some fixed key.
const int XORCode = 12345;
private int Encode(int id)
{
return id^XORCode;
}
private int Decode(int code)
{
return code^XORCode;
}
Honestly, if you want to generate only a couple of thousand 7-digit codes, while 10 million different codes will be available, I think just generating a random one and checking for a collision is good enough.
The chance of a collision on the first hit will be, in the worst case scenario, about 1 in a thousand, and the computational effort to just generate a new 7-digit code and check for a collision again will be much smaller than keeping a dictionary, or similar solutions.
Using a GUID instead of a 7-digit code as harryovers suggested will also certainly work, but of course a GUID will be slightly harder to remember for your users.
i would suggest using a guid instead of a 7 digit code as it will be more unique and you don't have to worry about generateing them as .NET will do this for you.
All solutions for a "unique" ID must have a database somewhere: Either one which contains the used IDs or one with the free IDs. As you noticed, the database with free IDs will be pretty big so most often, people use a "used IDs" database and check for collisions.
That said, some databases offer a "random ID" generator/sequence which already returns IDs in a range in random order.
This works by using a random number generator which can create all numbers in a range without repeating itself plus the feature that you can save it's state somewhere. So what you do is run the generator once, use the ID and save the new state. For the next run, you load the state and reset the generator to the last state to get the next random ID.
I assume you'll have a table of the generated ones. In that case, I don't see a problem with picking random numbers and checking them against the database, but I wouldn't do it individually. Generating them is cheap, doing the DB query is expensive relative to that. I'd generate 100 or 1,000 at a time and then ask the DB which of those exists. Bet you won't have to do it twice most of the time.
You have <10.000 items, so you need only 4 digits to store a unique number for all items.
Since you have 7 digits, you have 3 digits extra.
If you combine a unique sequence number of 4 digits with a random number of 3 digits, you will be unique and random. You increment the sequence number with every new ID you generate.
You can just append them in any order, or mix them.
seq = abcd,
rnd = ABC
You can create the following ID's:
abcdABC
ABCabcd
aAbBcCd
If you use only one mixing algorithm, you will have unique numbers, that look random.
I would try to use an LFSR (Linear feedback shift register) the code is really simple you can find examples everywhere ie Wikipedia and even though it's not cryptographically secure it looks very random. Also the implementation will be very fast since it's using mainly shift operations.
With only thousands of items in the database, your original idea seems sound. Checking the existance of a value in a sorted (indexed) list of a few tens of thousands of items would only require a few data fetches and comparisons.
Pre-generating the list doesn't sound like a good idea, because you will either store way more numbers than are necessary, or you will have to deal with running out of them.
Probability of having hits is very low.
For instance - you have 10^4 users and 10^7 possible IDs.
Probability that you pick used ID 10 times in row is now 10^-30.
This chance is lower than once in a lifetime of any person.
Well, you could ask the user to pick their own 7-digit number and validate it against the population of existing numbers (which you would have stored as they were used up), but I suspect you would be filtering a lot of 1234567, 7654321, 9999999, 7777777 type responses and might need a few RegExs to achieve the filtering, plus you'd have to warn the user against such sequences in order not to have a bad, repetitive, user input experience.

A good algorithm for generating an order number

As much as I like using GUIDs as the unique identifiers in my system, it is not very user-friendly for fields like an order number where a customer may have to repeat that to a customer service representative.
What's a good algorithm to use to generate order number so that it is:
Unique
Not sequential (purely for optics)
Numeric values only (so it can be easily read to a CSR over phone or keyed in)
< 10 digits
Can be generated in the middle tier without doing a round trip to the database.
UPDATE (12/05/2009)
After carefully reviewing each of the answers posted, we decided to randomize a 9-digit number in the middle tier to be saved in the DB. In the case of a collision, we'll regenerate a new number.
If the middle tier cannot check what "order numbers" already exists in the database, the best it can do will be the equivalent of generating a random number. However, if you generate a random number that's constrained to be less than 1 billion, you should start worrying about accidental collisions at around sqrt(1 billion), i.e., after a few tens of thousand entries generated this way, the risk of collisions is material. What if the order number is sequential but in a disguised way, i.e. the next multiple of some large prime number modulo 1 billion -- would that meet your requirements?
<Moan>OK sounds like a classic case of premature optimisation. You imagine a performance problem (Oh my god I have to access the - horror - database to get an order number! My that might be slow) and end up with a convoluted mess of psuedo random generators and a ton of duplicate handling code.</moan>
One simple practical answer is to run a sequence per customer. The real order number being a composite of customer number and order number. You can easily retrieve the last sequence used when retriving other stuff about your customer.
One simple option is to use the date and time, eg. 0912012359, and if two orders are received in the same minute, simply increment the second order by a minute (it doesn't matter if the time is out, it's just an order number).
If you don't want the date to be visible, then calculate it as the number of minutes since a fixed point in time, eg. when you started taking orders or some other arbitary date. Again, with the duplicate check/increment.
Your competitors will glean nothing from this, and it's easy to implement.
Maybe you could try generating some unique text using a markov chain - see here for an example implementation in Python. Maybe use sequential numbers (rather than random ones) to generate the chain, so that (hopefully) the each order number is unique.
Just a warning, though - see here for what can possibly happen if you aren't careful with your settings.
One solution would be to take the hash of some field of the order. This will not guarantee that it is unique from the order numbers of all of the other orders, but the likelihood of a collision is very low. I would imagine that without "doing a round trip to the database" it would be challenging to make sure that the order number is unique.
In case you are not familiar with hash functions, the wikipedia page is pretty good.
You could base64-encode a guid. This will meet all your criteria except the "numeric values only" requirement.
Really, though, the correct thing to do here is let the database generate the order number. That may mean creating an order template record that doesn't actually have an order number until the user saves it, or it might be adding the ability to create empty (but perhaps uncommitted) orders.
Use primitive polynomials as finite field generator.
Your 10 digit requirement is a huge limitation. Consider a two stage approach.
Use a GUID
Prefix the GUID with a 10 digit (or 5 or 4 digit) hash of the GUID.
You will have multiple hits on the hash value. But not that many. The customer service people will very easily be able to figure out which order is in question based on additional information from the customer.
The straightforward answer to most of your bullet points:
Make the first six digits a sequentially-increasing field, and append three digits of hash to the end. Or seven and two, or eight and one, depending on how many orders you envision having to support.
However, you'll still have to call a function on the back-end to reserve a new order number; otherwise, it's impossible to guarantee a non-collision, since there are so few digits.
We do TTT-CCCCCC-1A-N1.
T = Circuit type (D1E=DS1 EEL, D1U=DS1 UNE, etc.)
C = 6 Digit Customer ID
1 = The customer's first location
A = The first circuit (A=1, B=2, etc) at this location
N = Order type (N=New, X=Disconnect, etc)
1 = The first order of this kind for this circuit

Resources