Obfuscating an ID - algorithm

I'm looking for a way to encrypt/obfuscate an integer ID into another integer. More precisely, I need a function int F(int x), so that
x<->F(x) is one-to-one correspondence (if x != y, F(x) != F(y))
given F(x), it's easy to find out x - so F is not a hash function
given x and F(x) it's hard/impossible to find out F(y), something like x ^ 0x1234 won't work
For clarity, I'm not looking for a strong encryption solution, it's only obfuscation. Imagine a web application with urls like example.com/profile/1, example.com/profile/2 etc. The profiles themselves are not secret, but I'd like to prevent casual voyeurs from viewing/fetching all profiles one after another, so I'd rather hide them behind something like example.com/profile/23423, example.com/profile/80980234 etc. Although database-stored tokens can do the job quite easily, I'm curious if there's some simple math available for this.
One important requirement I wasn't clear about is that results should look "random", that is, given a sequence x,x+1,...,x+n , F(x),F(x+1)...F(x+n) shouldn't form a progression of any kind.

Obfuscate it with some combination of 2 or 3 simple methods:
XOR
shuffle individual bits
convert to modular representation (D.Knuth, Vol. 2, Chapter 4.3.2)
choose 32 (or 64) overlapping subsets of bits and XOR bits in each subset (parity bits of subsets)
represent it in a variable-length numeric system and shuffle digits
choose a pair of odd integers x and y that are multiplicative inverses of each other (modulo 2^32), then multiply by x to obfuscate and multiply by y to restore; all multiplications are modulo 2^32 (source: "A practical use of multiplicative inverses" by Eric Lippert)
The variable-length numeric system method does not satisfy your "progression" requirement on its own: it always produces short arithmetic progressions. But when combined with some other method, it gives good results.
The same is true for the modular representation method.
Here is C++ code example for 3 of these methods. Shuffle bits example may use some different masks and distances to be more unpredictable. Other 2 examples are good for small numbers (just to give the idea). They should be extended to obfuscate all integer values properly.
// *** Numeric system base: (4, 3, 5) -> (5, 3, 4)
// In real life all the bases multiplied should be near 2^32
unsigned y = x/15 + ((x/5)%3)*4 + (x%5)*12; // obfuscate
unsigned z = y/12 + ((y/4)%3)*5 + (y%4)*15; // restore
// *** Shuffle bits (method used here is described in D.Knuth's vol.4a chapter 7.1.3)
const unsigned mask1 = 0x00550055; const unsigned d1 = 7;
const unsigned mask2 = 0x0000cccc; const unsigned d2 = 14;
// Obfuscate
unsigned t = (x ^ (x >> d1)) & mask1;
unsigned u = x ^ t ^ (t << d1);
t = (u ^ (u >> d2)) & mask2;
y = u ^ t ^ (t << d2);
// Restore
t = (y ^ (y >> d2)) & mask2;
u = y ^ t ^ (t << d2);
t = (u ^ (u >> d1)) & mask1;
z = u ^ t ^ (t << d1);
// *** Subset parity
t = (x ^ (x >> 1)) & 0x44444444;
u = (x ^ (x << 2)) & 0xcccccccc;
y = ((x & 0x88888888) >> 3) | (t >> 1) | u; // obfuscate
t = ((y & 0x11111111) << 3) | (((y & 0x11111111) << 2) ^ ((y & 0x22222222) << 1));
z = t | ((t >> 2) ^ ((y >> 2) & 0x33333333)); // restore
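The multiplicative-inverse method from the list above is not covered by the C++ examples, so here is a minimal Python sketch of it. The multiplier MULT is a hypothetical placeholder (any odd value is invertible modulo 2^32), and the three-argument pow with a negative exponent assumes Python 3.8 or later.

```python
MASK32 = 0xFFFFFFFF                  # keep results modulo 2**32

MULT = 0x9E3779B1                    # hypothetical odd multiplier, not from the answer
MULT_INV = pow(MULT, -1, 1 << 32)    # its multiplicative inverse modulo 2**32


def obfuscate(x: int) -> int:
    """Multiply by the odd constant modulo 2**32."""
    return (x * MULT) & MASK32


def restore(y: int) -> int:
    """Multiply by the inverse constant modulo 2**32 to undo obfuscate()."""
    return (y * MULT_INV) & MASK32
```

On its own this maps 0 to 0 and leaves the lowest bit unchanged, which is one reason to combine it with XOR or a bit shuffle as suggested above.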

You want the transformation to be reversible, and not obvious. That sounds like an encryption that takes a number in a given range and produces a different number in the same range. If your range is 64 bit numbers, then use DES. If your range is 128 bit numbers then use AES. If you want a different range, then your best bet is probably Hasty Pudding cipher, which is designed to cope with different block sizes and with number ranges that do not fit neatly into a block, such as 100,000 to 999,999.
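To illustrate the idea of a cipher over an arbitrary number range, here is a toy Python sketch. It is not DES, AES, or Hasty Pudding: it is a small balanced Feistel network with a SHA-256-based round function, plus "cycle walking" (re-encrypting until the value falls back inside the range). KEY, ROUNDS, and all function names are assumptions made for illustration, and the construction has not been vetted as secure.

```python
import hashlib

KEY = b"hypothetical-secret-key"  # assumption: any secret byte string
ROUNDS = 4                        # assumption: a small fixed round count


def _round(half, rnd, half_bits):
    """Keyed round function: hash (key, round, half) and keep half_bits bits."""
    data = KEY + bytes([rnd]) + half.to_bytes(8, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def _permute(v, half_bits, inverse=False):
    """Balanced Feistel network over 2 * half_bits bits (always a bijection)."""
    mask = (1 << half_bits) - 1
    left, right = v >> half_bits, v & mask
    if not inverse:
        for rnd in range(ROUNDS):
            left, right = right, left ^ _round(right, rnd, half_bits)
    else:
        for rnd in reversed(range(ROUNDS)):
            left, right = right ^ _round(left, rnd, half_bits), left
    return (left << half_bits) | right


def _walk(x, lo, hi, inverse):
    n = hi - lo + 1
    half_bits = (max(2, (n - 1).bit_length()) + 1) // 2
    v = x - lo
    while True:
        v = _permute(v, half_bits, inverse)
        if v < n:                 # cycle walking: retry until inside the range
            return lo + v


def encrypt(x, lo, hi):
    """Map x in [lo, hi] to another value in [lo, hi]."""
    return _walk(x, lo, hi, inverse=False)


def decrypt(y, lo, hi):
    """Undo encrypt() for the same range."""
    return _walk(y, lo, hi, inverse=True)
```

With lo = 100000 and hi = 999999 this works on a 20-bit block, so most values need only one or two passes before landing inside the range.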

Obfuscation is not really sufficient in terms of security.
However, if you are trying to thwart the casual onlooker, I'd recommend a combination of two methods:
A private key that you combine with the id by xor'ing them together
Rotating the bits by a certain amount both before and after the key has been applied
Here is an example (using pseudo code):
def F(x)
    x = x XOR 31415927       # XOR x with a secret key
    x = rotl(x, 5)           # rotate the bits left 5 times
    x = x XOR 31415927       # XOR x with a secret key again
    x = rotr(x, 5)           # rotate the bits right 5 times
    x = x XOR 31415927       # XOR x with a secret key again
    return x                 # return the value
end
I haven't tested it, but I think this is reversible, should be fast, and not too easy to tease out the method.
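Here is a runnable Python version of the pseudocode above, assuming a 32-bit word (the width is my assumption). Two things fall out when it is worked through. Undoing the steps in reverse order gives the same sequence again, so F is its own inverse. And because rotation distributes over XOR and the two rotations cancel, F collapses to x XOR rotr(key, 5), a single XOR with a constant, which the question already rules out. The functions G and G_inv are a hypothetical variant, not part of the answer, that swaps one XOR for an addition modulo 2^32 so the steps no longer cancel.

```python
KEY = 31415927               # the secret key from the pseudocode
BITS = 32                    # assumption: 32-bit values
MASK = (1 << BITS) - 1


def rotl(x, n):
    """Rotate a BITS-wide value left by n bits."""
    return ((x << n) | (x >> (BITS - n))) & MASK


def rotr(x, n):
    """Rotate a BITS-wide value right by n bits."""
    return ((x >> n) | (x << (BITS - n))) & MASK


def F(x):
    """Direct translation of the pseudocode; F(F(x)) == x."""
    x ^= KEY
    x = rotl(x, 5)
    x ^= KEY
    x = rotr(x, 5)
    x ^= KEY
    return x


def G(x):
    """Hypothetical variant: addition does not commute with rotation."""
    x = (x + KEY) & MASK
    x = rotl(x, 5)
    return x ^ KEY


def G_inv(y):
    """Inverse of G: undo each step in reverse order."""
    y = rotr(y ^ KEY, 5)
    return (y - KEY) & MASK
```

Neither function is more than light obfuscation; the point of the sketch is only to check reversibility and to show why mixing different kinds of operations matters.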

I found this particular piece of Python/PHP code very useful:
https://github.com/marekweb/opaque-id

I wrote some JS code using some of the ideas in this thread:
const BITS = 32n;
const MAX = 4294967295n;
const COPRIME = 65521n;
const INVERSE = 2166657316n;
const ROT = 6n;
const XOR1 = 10296065n;
const XOR2 = 2426476569n;
function rotRight(n, bits, size) {
    const mask = (1n << bits) - 1n;
    // console.log('mask',mask.toString(2).padStart(Number(size),'0'));
    const left = n & mask;
    const right = n >> bits;
    return (left << (size - bits)) | right;
}
const pipe = fns => fns.reduce((f, g) => (...args) => g(f(...args)));
function build(...fns) {
    const enc = fns.map(f => Array.isArray(f) ? f[0] : f);
    const dec = fns.map(f => Array.isArray(f) ? f[1] : f).reverse();
    return [
        pipe(enc),
        pipe(dec),
    ]
}
[exports.encode, exports.decode] = build(
    [BigInt, Number],
    [i => (i * COPRIME) % MAX, i => (i * INVERSE) % MAX],
    x => x ^ XOR1,
    [x => rotRight(x, ROT, BITS), x => rotRight(x, BITS-ROT, BITS)],
    x => x ^ XOR2,
);
It produces some nice results like:
1 1352888202n 1 'mdh37u'
2 480471946n 2 '7y26iy'
3 3634587530n 3 '1o3xtoq'
4 2225300362n 4 '10svwqy'
5 1084456843n 5 'hxno97'
6 212040587n 6 '3i8rkb'
7 3366156171n 7 '1jo4eq3'
8 3030610827n 8 '1e4cia3'
9 1889750920n 9 'v93x54'
10 1017334664n 10 'gtp0g8'
11 4171450248n 11 '1wzknm0'
12 2762163080n 12 '19oiqo8'
13 1621319561n 13 'qtai6h'
14 748903305n 14 'cdvlhl'
15 3903018889n 15 '1sjr8nd'
16 3567473545n 16 '1mzzc7d'
17 2426613641n 17 '144qr2h'
18 1554197390n 18 'ppbudq'
19 413345678n 19 '6u3fke'
20 3299025806n 20 '1ik5klq'
21 2158182286n 21 'zoxc3y'
22 1285766031n 22 'l9iff3'
23 144914319n 23 '2ea0lr'
24 4104336271n 24 '1vvm64v'
25 2963476367n 25 '1d0dkzz'
26 2091060108n 26 'ykyob0'
27 950208396n 27 'fpq9ho'
28 3835888524n 28 '1rfsej0'
29 2695045004n 29 '18kk618'
30 1822628749n 30 'u559cd'
31 681777037n 31 'b9wuj1'
32 346231693n 32 '5q4y31'
Testing with:
const {encode,decode} = require('./obfuscate')
for(let i = 1; i <= 1000; ++i) {
const j = encode(i);
const k = decode(j);
console.log(i, j, k, j.toString(36));
}
XOR1 and XOR2 are just random numbers between 0 and MAX. MAX is 2**32-1; you should set this to whatever you think your highest ID will be.
COPRIME is a number that's coprime w/ MAX. I think prime numbers themselves are coprime with every other number (except multiples of themselves).
INVERSE is the tricky one to figure out. These blog posts don't give a straight answer, but WolframAlpha can figure it out for you. Basically, just solve the equation (COPRIME * x) % MAX = 1 for x.
The build function is something I created to make it easier to create these encode/decode pipelines. You can feed it as many operations as you want as [encode, decode] pairs. The two functions in a pair have to be exact inverses of each other. The XOR functions are their own inverses, so you don't need a pair there.
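If you would rather compute INVERSE yourself than ask WolframAlpha, the equation (COPRIME * x) % MAX = 1 can be solved with the extended Euclidean algorithm. The sketch below is in Python; mod_inverse is a name chosen for illustration.

```python
def mod_inverse(a: int, m: int) -> int:
    """Return x with (a * x) % m == 1, using the extended Euclidean algorithm."""
    old_r, r = a, m          # remainders
    old_s, s = 1, 0          # coefficients of a, so that old_r == old_s * a (mod m)
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    if old_r != 1:
        raise ValueError("a and m are not coprime, so no inverse exists")
    return old_s % m


MAX = 4294967295
COPRIME = 65521
INVERSE = mod_inverse(COPRIME, MAX)   # 2166657316, the constant used above
```

The ValueError branch is what you hit if the multiplier shares a factor with MAX, for example 3, since 2**32-1 is divisible by 3.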
Here's another fun involution:
function mixHalves(n) {
const mask = 2n**12n-1n;
const right = n & mask;
const left = n >> 12n;
const mix = left ^ right;
return (mix << 12n) | right;
}
(assumes 24-bit integers -- just change the numbers for any other size)

Do anything with the bits of the ID that won't destroy them. For example:
rotate the value
use lookup to replace certain parts of the value
xor with some value
swap bits
swap bytes
mirror the whole value
mirror a part of the value
... use your imagination
For decryption, do all that in reverse order.
Create a program that will 'encrypt' some interesting values for you and put them in a table you can examine. Have the same program TEST your encryption/decryption routine with the full set of values that you want to have in your system.
Keep adding steps from the list above to the routines until your numbers look properly mangled to you.
For anything else, get a copy of The Book.

I wrote an article on secure permutations with block ciphers, which ought to fulfil your requirements as stated.
I'd suggest, though, that if you want hard to guess identifiers, you should just use them in the first place: generate UUIDs, and use those as the primary key for your records in the first place - there's no need to be able to convert to and from a 'real' ID.

Not sure how "hard" you need it to be, how fast, or how little memory to use. If you have no memory constraints you could make a list of all integers, shuffle them and use that list as a mapping. However, even for a 4 byte integer you would need a lot of memory.
However, this could be made smaller: instead of mapping whole integers, you would map only 2 bytes (or, in the worst case, 1 byte) at a time and apply the map to each group in the integer. So, using 2-byte groups, an integer would be (group1)(group2), and you would map each group through the random map. But that means that if you only change group2, the mapping for group1 stays the same. This could be "fixed" by assigning different bits to each group.
So, (group2) could be (bit 14,12,10,8,6,4,2,0), so adding 1 would change both group1 and group2.
Still, this is only security by obscurity, anyone that can feed numbers into your function (even if you keep the function secret) could fairly easily figure it out.

Generate a private symmetric key for use in your application, and encrypt your integer with it. This will satisfy all three requirements, including the hardest #3: one would need to guess your key in order to break your scheme.

What you're describing here seems to be the opposite of a one-way function: it's easy to invert but super difficult to apply. One option would be to use a standard, off-the-shelf public-key encryption algorithm where you fix a (secret, randomly-chosen) public key that you keep a secret and a private key that you share with the world. That way, your function F(x) would be the encryption of x using the public key. You could then easily decrypt F(x) back to x by using the private decryption key. Notice that the roles of the public and private key are reversed here - you give out the private key to everyone so that they can decrypt the function, but keep the public key secret on your server. That way:
The function is a bijection, so it's invertible.
Given F(x), x is efficiently computable.
Given x and F(x), it is extremely difficult to compute F(y) from y, since without the public key (assuming you use a cryptographically strong encryption scheme) there is no feasible way to encrypt the data, even if the private decryption key is known.
This has many advantages. First, you can rest assured that the crypto system is safe, since if you use a well-established algorithm like RSA then you don't need to worry about accidental insecurity. Second, there are already libraries out there to do this, so you don't need to code much up and can be immune to side-channel attacks. Finally, you can make it possible for anyone to go and invert F(x) without anyone actually being able to compute F(x).
One detail- you should definitely not just be using the standard int type here. Even with 64-bit integers, there are so few combinations possible that an attacker could just brute-force try inverting everything until they find the encryption F(y) for some y even if they don't have the key. I would suggest using something like a 512-bit value, since even a science fiction attack would not be able to brute-force this.
Hope this helps!

If xor is acceptable for everything but inferring F(y) given x and F(x) then I think you can do that with a salt. First choose a secret one-way function. For example S(s) = MD5(secret ^ s). Then F(x) = (s, S(s) ^ x) where s is chosen randomly. I wrote that as a tuple but you can combine the two parts into an integer, e.g. F(x) = 10000 * s + S(s) ^ x. The decryption extracts the salt s again and uses F'(F(x)) = S(extract s) ^ (extract S(s)^x). Given x and F(x) you can see s (though it is slightly obfuscated) and you can infer S(s) but for some other user y with a different random salt t the user knowing F(x) can't find S(t).

Related

Generate random numbers without repetition (or vanishing probability of repetition) without storing full list of past generated numbers?

I need to generate random numbers in a very large range (128-bit integers), and I will generate a great many of them. I'll generate so many that I cannot fit a list of the generated numbers into memory.
I also have the requirement that the generated numbers do not repeat, or at least that the probability of repetition is vanishingly small.
Is there an algorithm that does this?
Build a 128 bit linear congruential generator or linear feedback shift register generator. With properly chosen coefficients either of those will achieve full cycle, meaning no repeats until you've exhausted all outcomes.
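As a rough Python sketch of the linear congruential option: for a power-of-two modulus, the Hull-Dobell conditions for a full cycle reduce to an odd increment c and a multiplier a with a % 4 == 1. The constants A and C below are placeholders chosen only to meet those conditions, not recommended values.

```python
from itertools import islice

A = 6364136223846793005      # placeholder multiplier, A % 4 == 1
C = 1442695040888963407      # placeholder increment, odd


def lcg(bits, a=A, c=C, seed=0):
    """Yield successive states of x -> (a*x + c) mod 2**bits."""
    if c % 2 != 1 or a % 4 != 1:
        raise ValueError("full period needs c odd and a % 4 == 1")
    mask = (1 << bits) - 1
    state = seed & mask
    while True:
        state = (a * state + c) & mask
        yield state


# A 10-bit instance visits all 1024 states before repeating.
small_cycle = list(islice(lcg(10), 1 << 10))
```

For the real use case you would call lcg(128) and keep only the current state between calls; nothing else needs to be stored.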
Any full-period PRNG with a 128-bit state will do what you need in principle. Unfortunately many of these generators tend to produce only 32 or 64 bits per iteration while the rest of the state goes through a predictable permutation (LFSRs being the worst case, producing only 1 bit per iteration). Each 128-bit state is unique, but many of its bits would show a trivial relation to the previous state.
This can be overcome with tempering -- taking your questionable-quality PRNG state with a known-good period, and permuting it through a 1:1 transform to hide the not-so-random factors.
For example, borrowing from the example xorshift+ shown on Wikipedia:
static uint64_t s[2] = { 1, 0 };
void random128(uint64_t result[]) {
uint64_t x = s[0];
uint64_t y = s[1];
x ^= x << 23;
x ^= y ^ (x >> 17) ^ (y >> 26);
s[0] = y;
s[1] = x;
At this point we know that s[0] is just the old value of s[1], which would be a terrible PRNG if all 128 bits were exposed (normally only s[1] is exposed). To overcome this we permute the result to disguise that relationship (following the same principle as a feistel network to ensure that the transform is 1:1).
y += x * 1630144151483159999;
x ^= y >> 3;
result[0] = x;
result[1] = y;
}
This seems to be sufficient to pass diehard. So long as the original generator has full(ish) period, the whole generator should be full period too.
The logical conclusion to tempering a low-quality generator is to use AES-128 in counter mode. Simply run a counter from 0 to 2**128-1 (an extremely low-quality generator), and encrypt each value using AES-128 and a consistent key (an ideal temper) for your final output.
If you do this, don't get distracted by full cryptographic RNG requirements. Those involve re-seeding and consequently can produce the same number more than once (which is more random, but it's what you want to avoid).

Truncate integer using bit twiddling

Is there a way to "truncate" an integer using bit twiddling, as if it floor-divided and then multiplied back, as in:
z = floor(x / y) * y
I know it is possible to do so if y is of power of two, for example:
z = floor(x / 4) * 4 == x & ~3
But what trick does one use when y is some general positive integer?
For each individual y, there is a sequence of operations (addition, subtraction, and binary shift) which divides x by y faster than the (x86) division instruction.
Finding that sequence however is not straightforward, and must be done in advance (feasible when you divide by the same y a lot).
A simple example: to divide an arbitrary uint32 x by 3, we can instead calculate x * M in uint64 type and shift it to the right by 33 bits, where M is a magic constant equal to 2^33 / 3 rounded up.
The following code (C) tries 20 random uint32 values with the above algorithm and checks that the result is equal to just dividing by 3:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main ()
{
    int step;
    unsigned x, y1, y2;
    unsigned const M = (1ULL << 33) / 3 + 1;
    srand (time (NULL));
    for (step = 0; step < 20; step++)
    {
        /* cast before shifting: left-shifting a signed int into the sign bit is undefined */
        x = ((unsigned) rand () << 30) | ((unsigned) rand () << 15) | (unsigned) rand ();
        y1 = x / 3;
        y2 = (x * 1ULL * M) >> 33;
        printf ("%10u %10u %10u %s\n", x, y1, y2, y1 == y2 ? "true" : "false");
    }
    return 0;
}
For further information, see Hacker's Delight book in general, and the freely available addition - chapter 10 here: hackersdelight.org/divcMore.pdf.
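To connect this back to the original z = floor(x / y) * y, here is a small Python sketch for y = 3 that uses the same magic constant and then multiplies back. It assumes 0 <= x < 2^32; Python integers do not overflow, so the 64-bit intermediate is implicit. The function names are illustrative.

```python
M = (1 << 33) // 3 + 1       # 2**33 / 3 rounded up


def div3(x: int) -> int:
    """floor(x / 3) for 0 <= x < 2**32, without a division instruction."""
    return (x * M) >> 33


def truncate3(x: int) -> int:
    """floor(x / 3) * 3: x rounded down to a multiple of 3."""
    return div3(x) * 3
```

For example, truncate3(16) gives 15, matching the binary walk-through in the next answer.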
The reason this works for powers of 2 is the way binary representations work. Dividing by 2 (or powers of 2) is identical to bit shifting. Shifting right and then back left the same amount is identical to floor-division as you put it.
Consider an arbitrary binary number: 110101010111. If you'd bit shift it 3 times to the right (division by 8), and then back again it would turn to 110101010000 which is identical to ANDing it with 111111111000. Now let's consider division by 3 of the (decimal) number 16: start with 10000. Division (not shifting!) by 3 would be 5 (101) and multiply by 3 again is 15 (1111). No bit shifting can do that.
The obvious thing to do is to convert to whatever base you are trying to work with, and then basically make the last digit 0. (Or if you are working with a kth power, then make the last k digits 0). However you asked about bit (base-2) operations. It turns out that for any desired base B (at least, that is odd), you can come up with a number in binary so that the first M digits in base B are anything you want, for any M. Thus, how could you possibly have a general method for what you want (with an odd base), that just works on bits (binary)? At the very least it would probably be a lot more complicated than simply converting your number to your desired base and setting however many last digits to 0 and then converting back to natural base-2 integer representation.

Finding if a random number has occurred before or not

Let me be clear at start that this is a contrived example and not a real world problem.
Suppose I have the problem of creating a random number between 0 and 10. I do this 11 times, making sure that a previously drawn number is not drawn again: if I get a repeated number, I generate another random number until I get one that has not been seen earlier. So essentially I get a sequence of unique numbers from 0 to 10 in a random order,
e.g. 3 1 2 0 5 9 4 8 10 6 7 and so on
Now to come up with logic to make sure that the random numbers are unique and not one which we have drawn before, we could use many approaches
Use C++ std::bitset and set the bit corresponding to the index equal to value of each random no. and check it next time when a new random number is drawn.
Or
Use a std::map<int,int> to count the number of times or even simple C array with some sentinel values stored in that array to indicate if that number has occurred or not.
If I have to avoid these methods above and use some mathematical/logical/bitwise operation to find whether a random number has been drawn before or not, is there a way?
You don't want to do it the way you suggest. Consider what happens when you have already selected 10 of the 11 items; your random number generator will cycle until it finds the missing number, which might be never, depending on your random number generator.
A better solution is to create a list of numbers 0 to 10 in order, then shuffle the list into a random order. The usual algorithm for this is the Fisher-Yates shuffle (popularized by Knuth): starting at the last element and working backward, swap each element with a randomly chosen element at or before its own position.
function shuffle(a, n)
    for i from n-1 to 1 step -1
        j = randint(i)
        swap(a[i], a[j])
We assume an array with indices 0 to n-1, and a randint function that sets j to the range 0 <= j <= i.
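The pseudocode above translates almost line for line into Python. This is a minimal sketch; random.randint(0, i) plays the role of randint(i) and includes both endpoints.

```python
import random


def shuffle(a):
    """Fisher-Yates shuffle of list a, in place; also returns a."""
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)      # 0 <= j <= i, inclusive
        a[i], a[j] = a[j], a[i]
    return a


numbers = shuffle(list(range(11)))    # 0..10 in random order, no repeats
```

Each call does exactly n-1 swaps, so unlike the retry approach it cannot loop indefinitely while hunting for the last missing number.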
Use an array and add all possible values to it. Then pick one out of the array and remove it. Next time, pick again until the array is empty.
Yes, there is a mathematical way to do it, but it is a bit expensive.
Have an array primes[] where primes[i] is the i'th prime number, so it begins [2,3,5,7,11,...].
Also store a number mult, initially 1. Once you draw a number (let it be i), check whether mult % primes[i] == 0. If it is, the number was drawn before; if it isn't, the number was not drawn, so choose it and do mult = mult * primes[i].
However, it is expensive because it might require a lot of space for large ranges (the possible values of mult increase exponentially).
(This is a nice mathematical approach, because we actually look at a set of primes p_i, the array of primes is only the implementation to the abstract set of primes).
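A short Python sketch of the prime-product idea for the values 0 to 10. PRIMES and draw_all are names chosen for illustration; Python's unbounded integers hide the growth of mult that would overflow a fixed-width type.

```python
import random

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]   # PRIMES[i] stands for value i


def draw_all(n_values=11):
    """Draw every value in 0..n_values-1 once, in random order."""
    mult = 1                          # product of the primes of drawn values
    order = []
    while len(order) < n_values:
        i = random.randrange(n_values)
        if mult % PRIMES[i] == 0:     # divisible: i was drawn before
            continue
        mult *= PRIMES[i]             # record i as drawn
        order.append(i)
    return order, mult
```

By the time all eleven values are drawn, mult is the product of the first eleven primes, which already needs 38 bits.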
A bit manipulation alternative for small values is using an int or long as a bitset.
With this approach, to check a candidate i is not in the set you only need to check:
if ((set & (1 << i)) == 0) // not in the set
else // already in the set
To enter an element i to the set:
set = set | (1 << i)
A better approach will be to populate a list with all the numbers, shuffle it with fisher-yates shuffle, and iterate it for generating new random numbers.
If I have to avoid these methods above and use some
mathematical/logical/bitwise operation to find whether a random number
has been draw before or not, is there a way?
Subject to your contrived constraints yes, you can imitate a small bitset using bitwise operations:
You can choose different integer types on the right according to what size you need.
bitset code            bitwise code
std::bitset<32> x;     unsigned long x = 0;
if (x[i]) { ... }      if (x & (1UL << i)) { ... }
// assuming v is 0 or 1
x[i] = v;              x = (x & ~(1UL << i)) | ((unsigned long)v << i);
x[i] = true;           x |= (1UL << i);
x[i] = false;          x &= ~(1UL << i);
For a larger set (beyond the size in bits of unsigned long long), you will need an array of your chosen integer type. Divide the index by the width of each value to know what index to look up in the array, and use the modulus for the bit shifts. This is basically what bitset does.
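The array-of-words scheme just described might look like this in Python. It is only a sketch: WORD_BITS and the BitSet class are illustrative names, and Python integers would not actually need the split.

```python
WORD_BITS = 64  # assumption: width of each stored word


class BitSet:
    def __init__(self, size):
        # Enough words to hold `size` bits, rounded up.
        self.words = [0] * ((size + WORD_BITS - 1) // WORD_BITS)

    def test(self, i):
        word, bit = divmod(i, WORD_BITS)   # array index, shift amount
        return ((self.words[word] >> bit) & 1) == 1

    def set(self, i, value=True):
        word, bit = divmod(i, WORD_BITS)
        if value:
            self.words[word] |= 1 << bit
        else:
            self.words[word] &= ~(1 << bit)
```

Here divmod does the "divide for the array index, modulus for the shift" step in one call.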
I'm assuming that the various answers that tell you how best to shuffle 10 numbers are missing the point entirely: that your contrived constraints are there because you do not in fact want or need to know how best to shuffle 10 numbers :-)
Keep a variable to map the drawn numbers. The i'th bit of that variable will be 1 if the number was drawn before:
int mapNumbers = 0;
int generateRand() {
    // return -1 if all 11 numbers have already been generated
    if ((mapNumbers & ((1 << 11) - 1)) == ((1 << 11) - 1)) return -1;
    int x;
    do {
        x = newVal();
    } while (mapNumbers & (1 << x)); // retry while x was drawn before
    mapNumbers |= (1 << x);
    return x;
}

How to compute the integer absolute value

How to compute the integer absolute value without using if condition.
I guess we need to use some bitwise operation.
Can anybody help?
Same as existing answers, but with more explanations:
Let's assume a twos-complement number (as it's the usual case and you don't say otherwise) and let's assume 32-bit:
First, we perform an arithmetic right-shift by 31 bits. This shifts in all 1s for a negative number or all 0s for a positive one (but note that the actual >>-operator's behaviour in C or C++ is implementation defined for negative numbers, but will usually also perform an arithmetic shift, but let's just assume pseudocode or actual hardware instructions, since it sounds like homework anyway):
mask = x >> 31;
So what we get is 111...111 (-1) for negative numbers and 000...000 (0) for positives
Now we XOR this with x, getting the behaviour of a NOT for mask=111...111 (negative) and a no-op for mask=000...000 (positive):
x = x XOR mask;
And finally subtract our mask, which means +1 for negatives and +0/no-op for positives:
x = x - mask;
So for positives we perform an XOR with 0 and a subtraction of 0 and thus get the same number. And for negatives, we got (NOT x) + 1, which is exactly -x when using twos-complement representation.
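The three steps can be tried out in Python with a little care: Python integers are unbounded and >> on a negative value already shifts arithmetically, so the sketch below wraps the result back to 32 bits to imitate a fixed-width int. to_int32 and abs32 are illustrative names.

```python
def to_int32(v: int) -> int:
    """Wrap v to a signed 32-bit two's-complement value."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v


def abs32(x: int) -> int:
    """Branch-free absolute value for -2**31 <= x < 2**31."""
    mask = x >> 31                    # -1 for negatives, 0 otherwise
    return to_int32((x ^ mask) - mask)
```

The one input this cannot fix is -2**31: its absolute value does not fit in 32 bits, so the result wraps back to -2**31.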
1. Set the mask as right shift of integer by 31 (assuming integers are stored as two's-complement 32-bit values and that the right-shift operator does sign extension).
   mask = n>>31
2. XOR the mask with number
   mask ^ n
3. Subtract mask from result of step 2 and return the result.
   (mask^n) - mask
Assume int is of 32-bit.
int my_abs(int x)
{
int y = (x >> 31);
return (x ^ y) - y;
}
One can also perform the above operation as:
return n*(((n>0)<<1)-1);
where n is the number whose absolute need to be calculated.
In C, you can use unions to perform bit manipulations on doubles. The following will work in C and can be used for both integers, floats, and doubles.
/**
* Calculates the absolute value of a double.
* @param x An 8-byte floating-point double
* @return A positive double
* #note Uses bit manipulation and does not care about NaNs
*/
double abs(double x)
{
union{
uint64_t bits;
double dub;
} b;
b.dub = x;
//Sets the sign bit to 0
b.bits &= 0x7FFFFFFFFFFFFFFF;
return b.dub;
}
Note that this assumes that doubles are 8 bytes.
I wrote my own, before discovering this question.
My answer is probably slower, but still valid:
int abs_of_x = ((x*(x >> 31)) | ((~x + 1) * ((~x + 1) >> 31)));
If you are not allowed to use the minus sign you could do something like this:
int absVal(int x) {
return ((x >> 31) + x) ^ (x >> 31);
}
For assembly the most efficient would be to initialize a value to 0, subtract the integer, and then take the max:
pxor mm1, mm1 ; set mm1 to all zeros
psubw mm1, mm0 ; make each mm1 word contain the negative of each mm0 word
pmaxsw mm1, mm0 ; mm1 will contain only the positive (larger) values - the absolute value
In C#, you can implement abs() without using any local variables:
public static long abs(long d) => (d + (d >>= 63)) ^ d;
public static int abs(int d) => (d + (d >>= 31)) ^ d;
Note: regarding 0x80000000 (int.MinValue) and 0x8000000000000000 (long.MinValue):
As with all of the other bitwise/non-branching methods shown on this page, this gives the single non-mathematical result abs(int.MinValue) == int.MinValue (likewise for long.MinValue). These represent the only cases where result value is negative, that is, where the MSB of the two's-complement result is 1 -- and are also the only cases where the input value is returned unchanged. I don't believe this important point was mentioned elsewhere on this page.
The code shown above depends on the value of d used on the right side of the xor being the value of d updated during the computation of the left side. To C# programmers this will seem obvious: the C# language specification guarantees that operands are evaluated left to right, so the updated d is what the right-hand operand reads. The reason I mention this is that in C or C++ one needs to be more cautious. There the evaluation order of operands is not fixed, and modifying d while also reading it elsewhere in the same expression is undefined behavior, so this one-liner cannot be ported as is.
If you don't want to rely on implementation of sign extension while right bit shifting, you can modify the way you calculate the mask:
mask = ~((n >> 31) & 1) + 1
then proceed as was already demonstrated in the previous answers:
(n ^ mask) - mask
What is the programming language you're using? In C# you can use the Math.Abs method:
int value1 = -1000;
int value2 = 20;
int abs1 = Math.Abs(value1);
int abs2 = Math.Abs(value2);

Expressing an integer as a series of multipliers

Scroll down to see latest edit, I left all this text here just so that I don't invalidate the replies this question has received so far!
I have the following brain teaser I'd like to get a solution for. I have tried to solve this, but since I'm not mathematically that much above average (that is, I think I'm very close to average) I can't seem to wrap my head around it.
The problem: Given number x should be split into a series of multipliers, where each multiplier <= y, y being a constant like 10 or 16 or whatever. In the series (technically an array of integers) the last number should be added instead of multiplied, to be able to convert the multipliers back to the original number.
As an example, lets assume x=29 and y=10. In this case the expected array would be {10,2,9} meaning 10*2+9. However if y=5, it'd be {5,5,4} meaning 5*5+4 or if y=3, it'd be {3,3,3,2} which would then be 3*3*3+2.
I tried to solve this by doing something like this:
while x >= y, store y to multipliers, then x = x - y
when x < y, store x to multipliers
Obviously this didn't work, I also tried to store the "leftover" part separately and add that after everything else but that didn't work either. I believe my main problem is that I try to think this in a way too complex manner while the solution is blatantly obvious and simple.
To reiterate, these are the limits this algorithm should have:
has to work with 64bit longs
has to return an array of 32bit integers (...well, shorts are OK too)
while support for signed numbers (both + and -) would be nice, if it helps the task only unsigned numbers is a must
And while I'm doing this using Java, I'd rather take any possible code examples as pseudocode, I specifically do NOT want readily made answers, I just need a nudge (well, more of a strong kick) so that I can solve this at least partly myself. Thanks in advance.
Edit: Further clarification
To avoid some confusion, I think I should reword this a bit:
Every integer in the result array should be less or equal to y, including the last number.
Yes, the last number is just a magic number.
No, this isn't modulus since then the second number would be larger than y in most cases.
Yes, there are multiple answers for most numbers, however I'm looking for the one with the least amount of math ops. As far as my logic goes, that means finding the maximum amount of as big multipliers as possible, for example x=1 000 000,y=100 is 100*100*100 even though 10*10*10*10*10*10 is an equally correct answer math-wise.
I need to go through the given answers so far with some thought but if you have anything to add, please do! I do appreciate the interest you've already shown on this, thank you all for that.
Edit 2: More explanations + bounty
Okay, seems like what I was aiming for in here just can't be done the way I thought it could be. I was too ambiguous with my goal and after giving it a bit of a thought I decided to just tell you in its entirety what I'd want to do and see what you can come up with.
My goal originally was to come up with a specific method to pack 1..n large integers (aka longs) together so that their String representation is notably shorter than writing the actual number. Think multiples of ten, 10^6 and 1 000 000 are the same, however the representation's length in characters isn't.
For this I wanted to somehow combine the numbers since it is expected that the numbers are somewhat close to each other. I first thought that representing 100, 121, 282 as 100+21+161 could be the way to go, but the saving in string length is negligible at best and really doesn't work that well if the numbers aren't very close to each other. Basically I wanted more than ~10%.
So I came up with the idea that what if I'd group the numbers by common property such as a multiplier and divide the rest of the number to individual components which I can then represent as a string. This is where this problem steps in, I thought that for example 1 000 000 and 100 000 can be expressed as 10^(5|6) but due to the context of my aimed usage this was a bit too flaky:
The context is Web. RESTful URLs to be specific. That's why I mentioned thinking of using 64 characters (web-safe alphanumeric non-reserved characters and then some) since then I could create seemingly random URLs which could be unpacked to a list of integers expressing a set of id numbers. At this point I thought of creating a base 64-like number system for expressing base 10/2 numbers but since I'm not a math genius I have no idea beyond this point how to do it.
The bounty
Now that I have written the whole story (sorry that it's a long one), I'm opening a bounty to this question. Everything regarding requirements for the preferred algorithm specified earlier is still valid. I also want to say that I'm already grateful for all the answers I've received so far, I enjoy being proven wrong if it's done in such a manner as you people have done.
The conclusion
Well, bounty is now given. I spread a few comments to responses mostly for future reference and myself, you can also check out my SO Uservoice suggestion about spreading bounty which is related to this question if you think we should be able to spread it among multiple answers.
Thank you all for taking time and answering!
Update
I couldn't resist trying to come up with my own solution for the first question even though it doesn't do compression. Here is a Python solution using a third party factorization algorithm called pyecm.
This solution is probably several orders of magnitude more efficient than Yevgeny's. Computations take seconds instead of hours, or maybe even weeks or years, for reasonable values of y. For x = 2^32-1 and y = 256, it took 1.68 seconds on my 1.2 GHz Core Duo.
>>> import time
>>> def test():
...     before = time.time()
...     print factor(2**32-1, 256)
...     print time.time()-before
...
>>> test()
[254, 232, 215, 113, 3, 15]
1.68499994278
>>> 254*232*215*113*3+15
4294967295L
And here is the code:
def factor(x, y):
    # y should be smaller than x. If x=y then {y, 1, 0} is the best solution
    assert(x > y)
    best_output = []
    # try all possible remainders from 0 to y
    for remainder in xrange(y+1):
        output = []
        composite = x - remainder
        factors = getFactors(composite)
        # check if any factor is larger than y
        bad_remainder = False
        for n in factors.iterkeys():
            if n > y:
                bad_remainder = True
                break
        if bad_remainder: continue
        # make the best factors
        while True:
            results = largestFactors(factors, y)
            if results == None: break
            output += [results[0]]
            factors = results[1]
        # store the best output
        output = output + [remainder]
        if len(best_output) == 0 or len(output) < len(best_output):
            best_output = output
    return best_output
# Heuristic
# The bigger the number the better. 8 is more compact than 2,2,2 etc...
# Find the most factors you can have below or equal to y
# output the number and unused factors that can be reinserted in this function
def largestFactors(factors, y):
    assert(y > 1)
    # iterate from y to 2 and see if the factors are present.
    for i in xrange(y, 1, -1):
        try_another_number = False
        factors_below_y = getFactors(i)
        for number, copies in factors_below_y.iteritems():
            if number in factors:
                if factors[number] < copies:
                    try_another_number = True
                    continue # not enough factors
            else:
                try_another_number = True
                continue # a factor is not present
        # Do we want to try another number, or was a solution found?
        if try_another_number == True:
            continue
        else:
            output = 1
            for number, copies in factors_below_y.items():
                remaining = factors[number] - copies
                if remaining > 0:
                    factors[number] = remaining
                else:
                    del factors[number]
                output *= number ** copies
            return (output, factors)
    return None # failed
# Find prime factors. You can use any formula you want for this.
# I am using elliptic curve factorization from http://sourceforge.net/projects/pyecm
import pyecm, collections, copy
getFactors_cache = {}
def getFactors(n):
    assert(n != 0)
    # attempt to retrieve from cache. Returns a copy
    try:
        return copy.copy(getFactors_cache[n])
    except KeyError:
        pass
    output = collections.defaultdict(int)
    for factor in pyecm.factors(n, False, True, 10, 1):
        output[factor] += 1
    # cache result
    getFactors_cache[n] = output
    return copy.copy(output)
Answer to first question
You say you want compression of numbers, but from your examples, those sequences are longer than the undecomposed numbers. It is not possible to compress these numbers without more details to the system you left out (probability of sequences/is there a programmable client?). Could you elaborate more?
Here is a mathematical explanation as to why current answers to the first part of your problem will never solve your second problem. It has nothing to do with the knapsack problem.
This is Shannon's entropy formula: H = -sum over i of p(Xi) * log2(p(Xi)). It tells you the theoretical minimum number of bits you need per token to represent a sequence {X0, X1, X2, ..., Xn-1, Xn}, where p(Xi) is the probability of seeing token Xi.
Let's say that X0 to Xn is the span of 0 to 4294967295 (the range of an integer). From what you have described, each number is as likely as another to appear. Therefore the probability of each element is 1/4294967296.
Plugging this into Shannon's formula tells us the minimum number of bits required to represent the stream.
import math
def entropy():
    num = 2**32
    probability = 1./num
    # the (num) * probability cancels out
    return -(num) * probability * math.log(probability, 2)
The entropy unsurprisingly is 32. We require 32 bits to represent an integer where each number is equally likely. The only way to reduce this number, is to increase the probability of some numbers, and decrease the probability of others. You should explain the stream in more detail.
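To make that last point concrete, here is a small sketch in Python 3 that evaluates the same formula for an arbitrary distribution. The function name `entropy_bits` and the two example distributions are my own illustration, not something from the question.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits per token, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Four equally likely values: 2 bits per value, no way around it.
uniform = [0.25, 0.25, 0.25, 0.25]
# One value dominates: the average cost drops below 2 bits.
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy_bits(uniform))
print(entropy_bits(skewed))
```

The uniform case needs the full 2 bits; only the skewed case leaves room for compression, which is why the distribution of your IDs matters.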
Answer to second question
The right way to do this is to use base64, when communicating with HTTP. Apparently Java does not have this in the standard library, but I found a link to a free implementation:
http://iharder.sourceforge.net/current/java/base64/
Here is the "pseudo-code" which works perfectly in Python and should not be difficult to convert to Java (my Java is rusty):
def longTo64(num):
    mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
    output = ""
    # special case for 0
    if num == 0:
        return mapping[0]
    while num != 0:
        output = mapping[num % 64] + output
        num //= 64  # floor division, so this also works on Python 3
    return output
If you have control over your web server and web client, and can parse the entire HTTP requests without problem, you can upgrade to base85. According to wikipedia, url encoding allows for up to 85 characters. Otherwise, you may need to remove a few characters from the mapping.
Here is another code example in Python
def longTo85(num):
    # every symbol must be unique, otherwise decoding is ambiguous
    mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:@&=+$,/?%#[]"
    output = ""
    base = len(mapping)
    # special case for 0
    if num == 0:
        return mapping[0]
    while num != 0:
        output = mapping[num % base] + output
        num //= base  # floor division, so this also works on Python 3
    return output
And here is the inverse operation:
def stringToLong(string):
    # must be the same mapping as in longTo85
    mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:@&=+$,/?%#[]"
    output = 0
    base = len(mapping)
    place = 0
    # check each digit from the lowest place
    for digit in reversed(string):
        # look up the value of the symbol, then multiply by base^place
        output += mapping.find(digit) * (base ** place)
        place += 1
    return output
Here is a graph of Shannon's algorithm in different bases.
As you can see, the higher the radix, the fewer symbols are needed to represent a number. At base64, ~11 symbols are required to represent a long. At base85, it becomes ~10 symbols.
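If you want to check those symbol counts yourself, a sketch like the following (Python 3; `symbols_needed` is a name I made up for illustration) counts the digits of the largest 64-bit value in a given base using integer arithmetic only, so there are no floating-point rounding surprises.

```python
def symbols_needed(bits, base):
    """Digits needed in `base` to write any unsigned value of `bits` bits."""
    largest = (1 << bits) - 1
    count = 1
    while largest >= base:
        largest //= base
        count += 1
    return count

for base in (10, 16, 64, 85):
    print(base, symbols_needed(64, base))
```

For a 64-bit value this gives 20 symbols in base 10, 16 in base 16, 11 in base 64 and 10 in base 85.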
Edit after final explanation:
I would think base64 is the best solution, since there are standard functions that deal with it, and variants of this idea don't give much improvement. This was answered with much more detail by others here.
Regarding the original question, although the code works, it is not guaranteed to run in any reasonable time, as was answered as well as commented on this question by LFSR Consulting.
Original Answer:
You mean something like this?
Edit - corrected after a comment.
shortest_output = {}
for (int R = 0; R <= X; R++) {
    // iteration over possible remainders
    // check if the rest of X can be decomposed into multipliers
    newX = X - R;
    output = {};
    while (newX > 1) {
        int i;
        for (i = Y; i > 1; i--) {
            if (newX % i == 0) { // found a divider
                output.append(i);
                newX = newX / i;
                break;
            }
        }
        if (i == 1) { // no dividers <= Y
            break;
        }
    }
    if (newX != 1) {
        // couldn't find dividers with no remainder
        output.clear();
    }
    else {
        output.append(R);
        // an empty shortest_output means no solution has been recorded yet
        if (shortest_output.length() == 0 || output.length() < shortest_output.length()) {
            shortest_output = output;
        }
    }
}
It sounds as though you want to compress random data -- this is impossible for information theoretic reasons. (See http://www.faqs.org/faqs/compression-faq/part1/preamble.html question 9.) Use Base64 on the concatenated binary representations of your numbers and be done with it.
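As a rough sketch of what that could look like in Python 3 (the helper names `pack_ids` and `unpack_ids` are mine, and I'm assuming every ID fits in an unsigned 64-bit integer):

```python
import base64
import struct

def pack_ids(ids):
    """Concatenate unsigned 64-bit IDs and encode them as URL-safe base64."""
    raw = struct.pack(">%dQ" % len(ids), *ids)
    # '=' padding carries no information, so drop it to shorten the URL
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def unpack_ids(text):
    """Inverse of pack_ids."""
    padded = text + "=" * (-len(text) % 4)  # restore the stripped padding
    raw = base64.urlsafe_b64decode(padded)
    return list(struct.unpack(">%dQ" % (len(raw) // 8), raw))

ids = [854183415, 1270335149, 2**64 - 1]
token = pack_ids(ids)
print(token, unpack_ids(token))
```

Each ID costs a fixed 8 bytes here, so small IDs are not rewarded; that is the price of keeping the code this short.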
The problem you're attempting to solve (you're dealing with a subset of the problem, given your restriction on y) is called integer factorization, and no known algorithm solves it efficiently:
In number theory, integer factorization is the breaking down of a composite number into smaller non-trivial divisors, which when multiplied together equal the original integer.
This problem is what makes a number of cryptographic functions possible (namely RSA, whose moduli are typically 1024 bits or longer; a long is only 64 bits). The wiki page contains some good resources that should move you in the right direction with your problem.
So, your brain teaser is indeed a brain teaser... and if you solve it efficiently we can elevate your math skills to above average!
Updated after the full story
Base64 is most likely your best option. If you want a custom solution you can try implementing a base 65+ system. Just remember that the fact that 10000 can be written as "10^4" doesn't mean that everything can be written as 10^n where n is an integer. Different base systems are the simplest way to write numbers, and the higher the base, the fewer digits the number requires. Plus, most framework libraries contain algorithms for Base64 encoding. (What language are you using?)
One way to further pack the urls is the one you mentioned but in Base64.
int[] IDs;
IDs.sort() // Descending, so IDs[i] is always smaller than or equal to IDs[i-1].
string url = Base64Encode(IDs[0]);
for (int i = 1; i < IDs.length; i++) {
    url += "," + Base64Encode(IDs[i-1] - IDs[i]);
}
Note that you require some separator as the initial ID can be arbitrarily large and the difference between two IDs CAN be more than 63 in which case one Base64 digit is not enough.
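Here is a runnable sketch of the same idea in Python 3, including the decoding side. The function names and the 64-character alphabet are my own choices for illustration, and unlike the snippet above I sort ascending and add the differences back up; either direction works as long as encoder and decoder agree. It assumes a non-empty list of non-negative IDs.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

def to_base64_digits(num):
    """Write a non-negative integer using the 64 symbols above."""
    if num == 0:
        return ALPHABET[0]
    digits = ""
    while num:
        digits = ALPHABET[num % 64] + digits
        num //= 64
    return digits

def from_base64_digits(text):
    value = 0
    for ch in text:
        value = value * 64 + ALPHABET.index(ch)
    return value

def pack_sorted_ids(ids):
    """First ID in full, then the gap to each following ID, comma-separated."""
    ordered = sorted(ids)
    parts = [to_base64_digits(ordered[0])]
    for prev, cur in zip(ordered, ordered[1:]):
        parts.append(to_base64_digits(cur - prev))
    return ",".join(parts)

def unpack_sorted_ids(url_part):
    ids, total = [], 0
    for part in url_part.split(","):
        total += from_base64_digits(part)  # running sum rebuilds each ID
        ids.append(total)
    return ids

print(pack_sorted_ids([282, 100, 121]))
```

The comma is safe as a separator because it is not part of the alphabet. Note that the original order of the IDs is lost, which is fine for a set of IDs but not for a sequence.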
Updated
Just restating that the problem is unsolvable. For Y = 64 you can't write 87681 in multipliers + remainder where each of these is below 64. In other words, you cannot write any of the numbers 87617..87681 with multipliers that are below 64. Each of these numbers has an elementary term over 64. 87616 can be written in elementary terms below 64 but then you'd need those + 65 and so the remainder will be over 64.
So if this was just a brainteaser, it's unsolvable. Was there some practical purpose for this which could be achieved in some way other than using multiplication and a remainder?
And yes, this really should be a comment but I lost my ability to comment at some point. :p
I believe the solution which comes closest is Yevgeny's. It is also easy to extend Yevgeny's solution to remove the limit for the remainder in which case it would be able to find solution where multipliers are smaller than Y and remainder as small as possible, even if greater than Y.
Old answer:
If you limit that every number in the array must be below the y then there is no solution for this. Given large enough x and small enough y, you'll end up in an impossible situation. As an example with y of 2, x of 12 you'll get 2 * 2 * 2 + 4 as 2 * 2 * 2 * 2 would be 16. Even if you allow negative numbers with abs(n) below y that wouldn't work as you'd need 2 * 2 * 2 * 2 - 4 in the above example.
And I think the problem is NP-complete even if you limit it to inputs which are known to have an answer where the last term is less than y. It sounds quite a lot like the Knapsack problem. Of course I could be wrong there.
Edit:
Without more accurate problem description it is hard to solve the problem, but one variant could work in the following way:
1. Set current = x.
2. Break current into its terms (prime factors).
3. If one of the terms is greater than y, the current number cannot be described in terms no greater than y. Subtract one from current and repeat from step 2.
4. The current number can now be expressed in terms no greater than y.
5. Calculate the remainder (x - current).
6. Combine as many of the terms as possible.
(Yevgeny Doctor has a more concise (and working) implementation of this, so to prevent confusion I've skipped the implementation.)
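For readers who want something concrete anyway, the steps above could be sketched roughly like this in Python 3. This is my own reading of the steps, not Yevgeny's code: `prime_factors` and `decompose` are illustrative names, plain trial division stands in for a real factoring routine, and I'm assuming x >= 1 and y >= 2. The remainder is not bounded by y.

```python
def prime_factors(n):
    """Prime factors of n >= 1 with multiplicity, by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def decompose(x, y):
    """Return (multipliers, remainder); every multiplier is <= y."""
    current = x
    # Walk down from x until no prime factor exceeds the limit.
    while current > 1 and max(prime_factors(current)) > y:
        current -= 1
    remainder = x - current
    # Greedily merge primes into products that stay within the limit.
    multipliers = []
    for p in sorted(prime_factors(current), reverse=True):
        for i, m in enumerate(multipliers):
            if m * p <= y:
                multipliers[i] = m * p
                break
        else:
            multipliers.append(p)
    return multipliers, remainder

print(decompose(12, 4))
```

The greedy merge is only a heuristic; it does not promise the shortest possible list of multipliers.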
OP Wrote:
My goal originally was to come up with a specific method to pack 1..n large integers (aka longs) together so that their String representation is notably shorter than writing the actual number. Think multiples of ten, 10^6 and 1 000 000 are the same, however the representation's length in characters isn't.
I have been down that path before, and as fun as it was to learn all the math, to save you time I will just point you to: http://en.wikipedia.org/wiki/Kolmogorov_complexity
In a nutshell some strings can be easily compressed by changing your notation:
10^9 (4 characters) = 1000000000 (10 characters)
Others cannot:
7829203478 = some random number...
This is a gross simplification of the article I linked to above, so I recommend that you read it instead of taking my explanation at face value.
Edit:
If you are trying to make RESTful urls for some set of unique data, why wouldn't you use a hash, such as MD5? Then include the hash as part of the URL, then look up the data based on the hash. Or am I missing something obvious?
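A minimal sketch of what I mean, in Python 3. The in-memory dict is a stand-in for whatever database table you would really use, and `register` / `resolve` are hypothetical names:

```python
import hashlib

_lookup = {}  # stand-in for a database table: token -> list of IDs

def register(ids):
    """Store a set of IDs and return the token to put in the URL."""
    ordered = sorted(ids)
    key = ",".join(str(i) for i in ordered)
    token = hashlib.md5(key.encode("ascii")).hexdigest()
    _lookup[token] = ordered
    return token

def resolve(token):
    """Return the IDs for a token, or None if it is unknown."""
    return _lookup.get(token)

token = register([282, 100, 121])
print(token, resolve(token))
```

The token is always 32 hex characters no matter how many IDs it stands for, but the mapping has to be stored on the server since a hash cannot be reversed.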
The original method you chose (a * b + c * d + e) would be very difficult to find optimal solutions for simply due to the large search space of possibilities. You could factorize the number but it's that "+ e" that complicates things since you need to factorize not just that number but quite a few immediately below it.
Two methods for compression spring immediately to mind, both of which give you a much-better-than-10% saving on space from the numeric representation.
A 64-bit number ranges from (unsigned):
0 to
18,446,744,073,709,551,615
or (signed):
-9,223,372,036,854,775,808 to
9,223,372,036,854,775,807
In both cases, you need to reduce the 20 characters taken (without commas) to something a little smaller.
The first is to simply BCD-ify the number, then base64-encode it (actually a slightly modified base64, since "/" would not be kosher in a URL; you should use one of the acceptable characters such as "_").
Converting it to BCD will store two digits (or a sign and a digit) into one byte, giving you an immediate 50% reduction in space (10 bytes). Encoding it base 64 (which turns every 3 bytes into 4 base64 characters) will turn the first 9 bytes into 12 characters and that tenth byte into 2 characters, for a total of 14 characters - that's a 30% saving.
The only better method is to just base64 encode the binary representation. This is better because BCD has a small amount of wastage (each digit only needs about 3.32 bits to store [log2(10)], but BCD uses 4).
Working on the binary representation, we only need to base64 encode the 64-bit number (8 bytes). That needs 8 characters for the first 6 bytes and 3 characters for the final 2 bytes. That's 11 characters of base64 for a saving of 45%.
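The two character counts can be checked with a short Python 3 sketch. The helpers `bcd_bytes` and `b64_len` are names I invented for this; I'm assuming an unsigned value padded to 20 decimal digits and base64 without the trailing "=" padding.

```python
import base64

def bcd_bytes(n, digits=20):
    """Pack the zero-padded decimal digits of n two per byte (BCD)."""
    text = str(n).zfill(digits)
    return bytes((int(text[i]) << 4) | int(text[i + 1])
                 for i in range(0, digits, 2))

def b64_len(raw):
    """Length of the URL-safe base64 form with '=' padding removed."""
    return len(base64.urlsafe_b64encode(raw).rstrip(b"="))

n = 2**64 - 1
print(b64_len(bcd_bytes(n)))          # BCD: 10 bytes
print(b64_len(n.to_bytes(8, "big")))  # plain binary: 8 bytes
```

This prints 14 for the BCD route and 11 for plain binary, matching the figures above.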
If you wanted maximum compression, there are 73 characters available for URL encoding:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789$-_.+!*'(),
so technically you could probably encode base-73 which, from rough calculations, would still take up 11 characters, but with more complex code which isn't worth it in my opinion.
Of course, that's the maximum compression due to the maximum values. At the other end of the scale (1-digit) this encoding actually results in more data (expansion rather than compression). You can see the improvements only start for numbers over 999, where 4 digits can be turned into 3 base64 characters:
Range    (bytes)  Chars  Base64 chars  Compression ratio
-------  -------  -----  ------------  -----------------
< 10       (1)       1         2            -100%
< 100      (1)       2         2               0%
< 1000     (2)       3         3               0%
< 10^4     (2)       4         3              25%
< 10^5     (3)       5         4              20%
< 10^6     (3)       6         4              33%
< 10^7     (3)       7         4              42%
< 10^8     (4)       8         6              25%
< 10^9     (4)       9         6              33%
< 10^10    (5)      10         7              30%
< 10^11    (5)      11         7              36%
< 10^12    (5)      12         7              41%
< 10^13    (6)      13         8              38%
< 10^14    (6)      14         8              42%
< 10^15    (7)      15        10              33%
< 10^16    (7)      16        10              37%
< 10^17    (8)      17        11              35%
< 10^18    (8)      18        11              38%
< 10^19    (8)      19        11              42%
< 2^64     (8)      20        11              45%
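If you want to reproduce the figures for other ranges, something along these lines should do it (Python 3; `row` is just an illustrative name, and I'm assuming the smallest whole number of bytes is used and base64 padding is dropped):

```python
def row(limit):
    """(bytes, decimal chars, base64 chars) for the largest value below limit."""
    largest = limit - 1
    n_bytes = max(1, (largest.bit_length() + 7) // 8)
    decimal_chars = len(str(largest))
    b64_chars = (n_bytes * 4 + 2) // 3  # ceil(4 * n_bytes / 3)
    return n_bytes, decimal_chars, b64_chars

for exp in (1, 4, 7, 19):
    print("< 10^%d" % exp, row(10 ** exp))
print("< 2^64", row(2 ** 64))
```

The compression ratio is then simply 1 minus base64 chars divided by decimal chars.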
Update: I didn't get everything, thus I rewrote the whole thing in a more Java-Style fashion. I didn't think of the prime number case that is bigger than the divisor. This is fixed now. I leave the original code in order to get the idea.
Update 2: I now handle the case of the big prime number in another fashion . This way a result is obtained either way.
public final class PrimeNumberException extends Exception {
private final long primeNumber;
public PrimeNumberException(long x) {
primeNumber = x;
}
public long getPrimeNumber() {
return primeNumber;
}
}
public static Long[] decompose(long x, long y) {
    // Declared outside the try block so the catch clause and the code after it can use them
    final ArrayList<Long> operands = new ArrayList<Long>(1000);
    final long rest = x % y;
    try {
        // Extract the rest so the remaining value is divisible by y
        final long newX = x - rest;
        // Break newX down into operands no larger than y
        recDivide(newX, y, operands);
    } catch (PrimeNumberException e) {
        // return new Long[0];
        // or do whatever you like, for example
        operands.add(e.getPrimeNumber());
    }
    // Add the remainder to the array
    operands.add(rest);
    return operands.toArray(new Long[operands.size()]);
}
// The recursive method
private static void recDivide(long x, long y, ArrayList<Long> operands)
throws PrimeNumberException {
while ((x > y) && (y != 1)) {
if (x % y == 0) {
final long rest = x / y;
// Since y is a divisor add it to the list of operands
operands.add(y);
if (rest <= y) {
// the rest is smaller than y, we're finished
operands.add(rest);
}
// go in recursion
x = rest;
} else {
// if the value x isn't divisible by y decrement y so you'll find a
// divisor eventually
if (--y == 1) {
throw new PrimeNumberException(x);
}
}
}
}
Original: Here is some recursive code I came up with. I would have preferred to code it in some functional language, but it was required in Java. I didn't bother converting the numbers to integer, but that shouldn't be that hard (yes, I'm lazy ;)
public static Long[] decompose(long x, long y) {
final ArrayList<Long> operands = new ArrayList<Long>();
final long rest = x % y;
// Extract the rest so the reminder is divisible by y
final long newX = x - rest;
// Go into recursion, actually it's a tail recursion
recDivide(newX, y, operands);
// Add the reminder to the array
operands.add(rest);
return operands.toArray(new Long[operands.size()]);
}
// The recursive method
private static void recDivide(long newX, long y, ArrayList<Long> operands) {
long x = newX;
if (x % y == 0) {
final long rest = x / y;
// Since y is a divisor add it to the list of operands
operands.add(y);
if (rest <= y) {
// the rest is smaller than y, we're finished
operands.add(rest);
} else {
// the rest can still be divided, go one level deeper in recursion
recDivide(rest, y, operands);
}
} else {
// if the value x isn't divisible by y decrement y so you'll find a divisor
// eventually
recDivide(x, y-1, operands);
}
}
Are you married to using Java? Python has an entire package dedicated just for this exact purpose. It'll even sanitize the encoding for you to be URL-safe.
Native Python solution
The standard module I'm recommending is base64, which converts arbitrary strings of chars into sanitized base64 format. You can use it in conjunction with the pickle module, which handles conversion from lists of longs (actually arbitrary size) to a compressed string representation.
The following code should work on any vanilla installation of Python:
import base64
import pickle
# get some long list of numbers
a = (854183415,1270335149,228790978,1610119503,1785730631,2084495271,
1180819741,1200564070,1594464081,1312769708,491733762,243961400,
655643948,1950847733,492757139,1373886707,336679529,591953597,
2007045617,1653638786)
# this gets you the url-safe string
str64 = base64.urlsafe_b64encode(pickle.dumps(a,-1))
print str64
>>> gAIoSvfN6TJKrca3S0rCEqMNSk95-F9KRxZwakqn3z58Sh3hYUZKZiePR0pRlwlfSqxGP05KAkNPHUo4jooOSixVFCdK9ZJHdEqT4F4dSvPY41FKaVIRFEq9fkgjSvEVoXdKgoaQYnRxAC4=
# this unwinds it
a64 = pickle.loads(base64.urlsafe_b64decode(str64))
print a64
>>> (854183415, 1270335149, 228790978, 1610119503, 1785730631, 2084495271, 1180819741, 1200564070, 1594464081, 1312769708, 491733762, 243961400, 655643948, 1950847733, 492757139, 1373886707, 336679529, 591953597, 2007045617, 1653638786)
Hope that helps. Using Python is probably the closest you'll get from a 1-line solution.
Wrt the original algorithm request: Is there a limit on the size of the last number (beyond that it must be stored in a 32b int)?
(The original request is all I'm able to tackle lol.)
The one that produces the shortest list is:
bool negative=(n<1)?true:false;
int j=n%y;
if(n==0 || n==1)
{
list.append(n);
return;
}
while((long64)(n-j*y)>MAX_INT && y>1) //R has to be stored in int32
{
y--;
j=n%y;
}
if(y<=1)
fail //Number has no suitable candidate factors. This shouldn't happen
int i=0;
for(;i<j;i++)
{
list.append(y);
}
list.append(n-y*j);
if(negative)
list[0]*=-1;
return;
A little simplistic compared to most answers given so far but it achieves the desired functionality of the original post... It's a little dirty but hopefully useful :)
Isn't this modulus?
Let / be integer division (whole numbers) and % be modulo.
int result[3];
result[0] = y;
result[1] = x / y;
result[2] = x % y;
Just set x := x/n, where n is the largest number that is less than both x and y. When you end up with x <= y, this is your last number in the sequence.
Like in my comment above, I'm not sure I understand exactly the question. But assuming integers (n and a given y), this should work for the cases you stated:
multipliers[0] = n / y;
multipliers[1] = y;
addedNumber = n % y;

Resources