Tinyurl-style unique code: potential algorithm to prevent collisions - algorithm

I have a system that requires a unique 6-digit code to represent an object, and I'm trying to think of a good algorithm for generating them. Here are the pre-reqs:
I'm using a base-20 system (no caps, numbers, vowels, or l to prevent confusion and naughty words)
The base-20 allows 64 million combinations
I'll be inserting potentially 5-10 thousand entries at once, so in theory I'd use bulk inserts, which means using a unique key probably won't be efficient or pretty (especially if there starts being lots of collisions)
It's not out of the question to fill up 10% of the combinations so there's a high potential for lots of collisions
I want to make sure the codes are non-consecutive
I had an idea that sounded like it would work, but I'm not good enough at math to figure out how to implement it: if I start at 0 and increment by N, then convert to base-20, it seems like there should be some value for N that lets me count each value from 0-63,999,999 before repeating any.
For example, going from 0 through 9 using N=3 (so 10 mod 3): 0, 3, 6, 9, 2, 5, 8, 1, 4, 7.
Is there some magic math method for figuring out values of N for some larger number that is able to count through the whole range without repeating? Ideally, the number I choose would sort of jump around the set such that it wasn't obvious that there was a pattern, but I'm not sure how possible that is.
Alternatively, a hashing algorithm that guaranteed uniqueness for values 0-64 million would work, but I'm way too dumb to know if that's possible.

All you need is a number that shares no factors with your key space. Easiest value is to use a prime number. You can google for large primes, or use http://primes.utm.edu/lists/small/10000.txt

Any prime number which is not a factor of the length of the sequence should be able to span the sequence without repeating. For 64000000, that means you shouldn't use 2 or 5. Of course, if you don't want them to be generated consecutively, generating them 2 or 5 apart is probably also not very good. I personally like the number 73973!

There is another method to get a similar result (jumping over the entire set of the values without repeating, nonconsequtively), without using the primes - by using maximum length sequences, which you can generate using specially constructed shift registers.

My math is a bit rusty, but I think you just need to ensure that the GCF of N and 64 million is 1. I'd go with a prime number (that doesn't divide evenly into 64 million) just in case though.

#Nick Lewis:
Well, only if the prime number doesn't divide 64 million. So, for the questioner's purposes, numbers like 2 or 5 would probably not be advisable.

Don't reinvent the wheel:
http://en.wikipedia.org/wiki/Universally_Unique_Identifier

Related

Shuffle sequential numbers without a buffer

I am looking for a shuffle algorithm to shuffle a set of sequential numbers without buffering. Another way to state this is that I’m looking for a random sequence of unique numbers that have a given period.
Your typical Fisher–Yates shuffle needs to have each element all of the elements it is going to shuffle, so that isn’t going to work.
A Linear-Feedback Shift Register (LFSR) does what I want, but only works for periods that are powers-of-two less two. Here is an example of using a 4-bit LFSR to shuffle the numbers 1-14:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
8
12
14
7
4
10
5
11
6
3
2
1
9
13
The first two is the input, and the second row the output. What’s nice is that the state is very small—just the current index. You can start of any index and get a difference set of numbers (starting at 1 yields: 8, 12, 14; starting at 9: 6, 3, 2), although the sequence is always the same (5 is always followed by 11). If I want a different sequence, I can pick a different generator polynomial.
The limitations to the LFSR are that the periods are always power-of-two less two (the min and max are always the same, thus unshuffled) and there not enough enough generator polynomials to allow every possible random sequence.
A block cipher algorithm would work. Every key produces a uniquely shuffled set of numbers. However all block ciphers (that I know about) have power-of-two block sizes, and usually a fixed or limited number of block sizes. A block cipher with a arbitrary non-binary block size would be perfect if such a thing exists.
There are a couple of projects I have that could benefit from such an algorithm. One is for small embedded micros that need to produce a shuffled sequence of numbers with a period larger than the memory they have available (think Arduino Uno needing to shuffle 1 to 100,000).
Does such an algorithm exist? If not, what things might I search for to help me develop such an algorithm? Or is this simply not possible?
Edit 2022-01-30
I have received a lot of good feedback and I need to better explain what I am searching for.
In addition to the Arduino example, where memory is an issue, there is also the shuffle of a large number of records (billions to trillions). The desire is to have a shuffle applied to these records without needing a buffer to hold the shuffle order array, or the time needed to build that array.
I do not need an algorithm that could produce every possible permutation, but a large number of permutations. Something like a typical block cipher in counter mode where each key produces a unique sequence of values.
A Linear Congruential Generator using coefficients to produce the desired sequence period will only produce a single sequence. This is the same problem for a Linear Feedback Shift Register.
Format-Preserving Encryption (FPE), such as AES FFX, shows promise and is where I am currently focusing my attention. Additional feedback welcome.
It is certainly not possible to produce an algorithm which could potentially generate every possible sequence of length N with less than N (log2N - 1.45) bits of state, because there are N! possible sequence and each state can generate exactly one sequence. If your hypothetical Arduino application could produce every possible sequence of 100,000 numbers, it would require at least 1,516,705 bits of state, a bit more than 185Kib, which is probably more memory than you want to devote to the problem [Note 1].
That's also a lot more memory than you would need for the shuffle buffer; that's because the PRNG driving the shuffle algorithm also doesn't have enough state to come close to being able to generate every possible sequence. It can't generate more different sequences than the number of different possible states that it has.
So you have to make some compromise :-)
One simple algorithm is to start with some parametrisable generator which can produce non-repeating sequences for a large variety of block sizes. Then you just choose a block size which is as least as large as your target range but not "too much larger"; say, less than twice as large. Then you just select a subrange of the block size and start generating numbers. If the generated number is inside the subrange, you return its offset; if not, you throw it away and generate another number. If the generator's range is less than twice the desired range, then you will throw away less than half of the generated values and producing the next element in the sequence will be amortised O(1). In theory, it might take a long time to generate an individual value, but that's not very likely, and if you use a not-very-good PRNG like a linear congruential generator, you can make it very unlikely indeed by restricting the possible generator parameters.
For LCGs you have a couple of possibilities. You could use a power-of-two modulus, with an odd offset and a multiplier which is 5 mod 8 (and not too far from the square root of the block size), or you could use a prime modulus with almost arbitrary offset and multiplier. Using a prime modulus is computationally more expensive but the deficiencies of LCG are less apparent. Since you don't need to handle arbitrary primes, you can preselect a geometrically-spaced sample and compute the efficient division-by-multiplication algorithm for each one.
Since you're free to use any subrange of the generator's range, you have an additional potential parameter: the offset of the start of the subrange. (Or even offsets, since the subrange doesn't need to be contiguous.) You can also increase the apparent randomness by doing any bijective transformation (XOR/rotates are good, if you're using a power-of-two block size.)
Depending on your application, there are known algorithms to produce block ciphers for subword bit lengths [Note 2], which gives you another possible way to increase randomness and/or add some more bits to the generator state.
Notes
The approximation for the minimum number of states comes directly from Stirling's approximation for N!, but I computed the number of bits by using the commonly available lgamma function.
With about 30 seconds of googling, I found this paper on researchgate.net; I'm far from knowledgable enough in crypto to offer an opinion, but it looks credible; also, there are references to other algorithms in its footnotes.

Return the count of all prime numbers in range [a,b] such that all the digits are from set {1,5,9} . 1<=a<=b<=10⁹

Return the count of all prime numbers in range [a,b] such that all the digits are from set {1,5,9} . 1<=a<=b<=10⁹.
My approach -
I was trying to generate all the numbers which are from set {1,5,9}. which comes out to be 3^9(19683) and after that I am checking for is it prime or not.
Can I do this in a better time complexity?
Never generate a large set and after check all elements of the set, ruling out most. That requires a lot of memory to store things you'll be discarding. Instead, find a single number with "valid" digits, check for primeness, and only then store it in a set. Accessing large arrays of memory is very time-intense on modern computers compared to doing math.
"I produced all the numbers": I hope you're doing this smartly! You never have to check a number with a last digit being 5 for primeness (there's only a single prime that ends in 5; that's 5 itself!), for example. Also, you hopefully don't just build all combinations of digits "manually". Say, you find a number 19551, then 19559 is also a candidate, you never have to manually "combine" digits to try out the last digit.
Of course, your prime-checking algorithm needs to be matching your kind of problem: You can remove the initial check for divisibility by 2 (you never produce even numbers), for example. You never need to check for divisibility by 5, because you never use 5 or 0 as last digit. Depending on your prime checking algorithm, you also would want to save the factor that "killed" the xxxx1 – that's one factor you don't have to check xxxx9 against. Do your 3-factor-checking based on the count of 1,5 and 9 in your number; you can directly infer cross-sum and hence 3-divisibility from that.

Algorithm to calculate all possible subset

This should be a quite simple problem, but I don't have proper algorithmic training and find myself stuck trying to solve this.
I need to calculate the possible combinations to reach a number by adding a limited set of smaller numbers together.
Imagine that we are playing with LEGO and I have a brick that is 12 units long and I need to list the possible substitutions I can make with shorter bricks. For this example we may say that the available bricks are 2, 4, 6 and 12 units long.
What might be a good approach to building an algorithm that can calculate the substitions? There are no bounds on how many bricks I can use at a time, so it could be 6x2 as well as 1x12, the important thing is I need to list all of the options.
So the inputs are the target length (in this case 12) and available bricks (an array of numbers (arbitrary length), in this case [2, 4, 6, 12]).
My approach was to start with the low number and add it up until I reach the target, then take the next lowest and so on. But that way I miss out on the combinations of multiple numbers and when I try to factor that in, it gets really messy.
I suggest a recursive approach: given a function f(target,permissibles) to list all representations of target as a combination of permissibles, you can do this:
def f(target,permissibles):
for x in permissibles:
collect f(target - x, permissibles)
if you do not want to differentiate between 12 = 4+4+2+2 and 12=2+4+2+4, you need to sort permissibles in the descending order and do
def f(target,permissibles):
for x in permissibles:
collect f(target - x, permissibles.remove(larger than x))

non-repeating random numbers

I need to generate around 9-100 million non-repeating random numbers, ranging from zero to the amount of numbers generated, and I need them to be generated very quickly. Several answers to similar questions proposed simply shuffling an array in order to get the random numbers, and others proposed using a bloom filter. The question is, which one is more efficient, and in case of it being the bloom filter, how do I use it?
You don't want random numbers at all. You want exactly the numbers 0 to N-1, in random order.
Simply filling the array and shuffling should be very quick. A proper Fisher-Yates shuffle is O(n), so an array of 100 million should take well under a second in C or even Java, slightly slower in a higher-level language like Python.
You only have to generate N-1 random numbers to do the shuffle (maybe up to 1.3N if you use rejection sampling to get perfect uniformity), so the speed will depend largely on how fast your RNG is.
You'll never need to look up whether a number has already be generated; that will deadly be slow no matter which algorithm you use, especially toward the end of the run.
If you need slightly fewer than N total numbers, fill the array from 0 to N-1, then just abort the shuffle early and take the partial result. Only if the amount of numbers you need is very small compared to their range should you consider the generate-and-check-for-dups approach. In that case Bob Floyd's algorithm might be good.
As an alternative you could use an appropriately sized block cypher. Use the block cypher to encrypt the numbers 0, 1, 2, ... and you will get a series of non-repeating random numbers out. Exactly what series will depend on the key you use. They are guaranteed not to repeat, because a block cypher is a reversible permutation.
For 64 bit numbers use DES, for 32 bit use Hasty Pudding (which allows a large range of block sizes) or write your own simple Feistel cypher. Assuming that security is not a big issue for this, then writing your own is possible.
For sure its better create an algorithm to shuffle the numbers, if you use a seed, as for example, the server microtime or timestamp, you can have one different random string for each milisecond .
Start creating an array using range function, set number of numbers as you like .
Than, you need to use a seed to make the pseudo-randomness better .
So, instead of rand, you gotta use SHUFFLE,
so you set array on range as 1 to 90, set the seed, than use shuffle to shuffle the array.. than you got all numbers in a random order (corresponding to the seed) .
You gotta change the seed to have another result .
The order of the numbers is the result .
as .. ball 1 : 42 ... ball 2: 10.... ball 3: 50.... ball 1 is 0 in the array. ;)
You can also use slice function and create a for / each loop, incrementing the slice factor, so you loop
slice array 0,1 the the result .. ball 1...
slice array 0.2 ball 2...
slice array 0.3
Thats the logic, i hope you understand, if so .. it ill help you a lot .

Subset calculation of list of integers

I'm currently implementing an algorithm where one particular step requires me to calculate subsets in the following way.
Imagine I have sets (possibly millions of them) of integers. Where each set could potentially contain around a 1000 elements:
Set1: [1, 3, 7]
Set2: [1, 5, 8, 10]
Set3: [1, 3, 11, 14, 15]
...,
Set1000000: [1, 7, 10, 19]
Imagine a particular input set:
InputSet: [1, 7]
I now want to quickly calculate to which this InputSet is a subset. In this particular case, it should return Set1 and Set1000000.
Now, brute-forcing it takes too much time. I could also parallelise via Map/Reduce, but I'm looking for a more intelligent solution. Also, to a certain extend, it should be memory-efficient. I already optimised the calculation by making use of BloomFilters to quickly eliminate sets to which the input set could never be a subset.
Any smart technique I'm missing out on?
Thanks!
Well - it seems that the bottle neck is the number of sets, so instead of finding a set by iterating all of them, you could enhance performance by mapping from elements to all sets containing them, and return the sets containing all the elements you searched for.
This is very similar to what is done in AND query when searching the inverted index in the field of information retrieval.
In your example, you will have:
1 -> [set1, set2, set3, ..., set1000000]
3 -> [set1, set3]
5 -> [set2]
7 -> [set1, set7]
8 -> [set2]
...
EDIT:
In inverted index in IR, to save space we sometimes use d-gaps - meaning we store the offset between documents and not the actual number. For example, [2,5,10] will become [2,3,5]. Doing so and using delta encoding to represent the numbers tends to help a lot when it comes to space.
(Of course there is also a downside: you need to read the entire list in order to find if a specific set/document is in it, and cannot use binary search, but it sometimes worths it, especially if it is the difference between fitting the index into RAM or not).
How about storing a list of the sets which contain each number?
1 -- 1, 2, 3, 1000000
3 -- 1, 3
5 -- 2
etc.
Extending amit's solution, instead of storing the actual numbers, you could just store intervals and their associated sets.
For example using a interval size of 5:
(1-5): [1,2,3,1000000]
(6-10): [2,1000000]
(11-15): [3]
(16-20): [1000000]
In the case of (1,7) you should consider intervals (1-5) and (5-10) (which can be determined simply by knowing the size of the interval). Intersecting those ranges gives you [2,1000000]. Binary search of the sets shows that indeed, (1,7) exists in both sets.
Though you'll want to check the min and max values for each set to get a better idea of what the interval size should be. For example, 5 is probably a bad choice if the min and max values go from 1 to a million.
You should probably keep it so that a binary search can be used to check for values, so the subset range should be something like (min + max)/N, where 2N is the max number of values that will need to be binary searched in each set. For example, "does set 3 contain any values from 5 to 10?" this is done by finding the closest values to 5 (3) and 10 (11), in this case, no it does not. You would have to go through each set and do binary searches for the interval values that could be within the set. This means ensuring that you don't go searching for 100 when the set only goes up to 10.
You could also just store the range (min and max). However, the issue is that I suspect your numbers are going be be clustered, thus not providing much use. Although as mentioned, it'll probably be useful for determining how to set up the intervals.
It'll still be troublesome to pick what range to use, too large and it'll take a long time to build the data structure (1000 * million * log(N)). Too small, and you'll start to run into space issues. The ideal size of the range is probably such that it ensures that the number of set's related to each range is approximately equal, while also ensuring that the total number of ranges isn't too high.
Edit:
One benefit is that you don't actually need to store all intervals, just the ones you need. Although, if you have too many unused intervals, it might be wise to increase the interval and split the current intervals to ensure that the search is fast. This is especially true if processioning time isn't a major issue.
Start searching from biggest number (7) of input set and
eliminate other subsets (Set1 and Set1000000 will returned).
Search other input elements (1) in remaining sets.

Resources