How many digits will be after converting from one numeral system to another - algorithm

The main question: How many digits?
Let me explain. I have a number in binary system: 11000000 and in decimal is 192.
After converting to decimal, how many digits it will have (in dicimal)? In my example, it's 3 digits. But, it isn't a problem. I've searched over internet and found one algorithm for integral part and one for fractional part. I'm not quite understand them, but (I think) they works.
When converting from binary to octal, it's more easy: each 3 bits give you 1 digit in octal. Same for hex: each 4 bits = 1 hex digit.
But, I'm very curious, what to do, if I have a number in P numeral system and want to convert it to the Q numeral system? I know how to do it (I think, I know :)), but, 1st of all, I want to know how many digits in Q system it will take (u no, I must preallocate space).

Writing n in base b takes ceiling(log base b (n)) digits.
The ratio you noticed (octal/binary) is log base 8 (n) / log base 2 (n) = 3.
(From memory, will it stick?)

There was an error in my previous answer: look at the comment by Ben Schwehn.
Sorry for the confusion, I found and explain the error I made in my previous answer below.
Please use the answer provided by Paul Tomblin. (rewritten to use P, Q and n)
Y = ln(P^n) / ln(Q)
Y = n * ln(P) / ln(Q)
So Y (rounded up) is the number of characters you need in system Q to express the highest number you can encode in n characters in system P.
I have no answer (that wouldn't convert the number already and take up that many space in a temporary variable) to get the bare minimum for a given number 1000(bin) = 8(dec) while you would reserve 2 decimal positions using this formula.
If a temporary memory usage isn't a problem, you might cheat and use (Python):
len(str(int(otherBaseStr,P)))
This will give you the number of decimals needed to convert a number in base P, cast as a string (otherBaseStr), into decimals.
Old WRONG answer:
If you have a number in P numeral system of length n
Then you can calculate the highest number that is possible in n characters:
P^(n-1)
To express this highest number in number system Q you need to use logarithms (because they are the inverse to exponentiation):
log((P^(n-1))/log(Q)
(n-1)*log(P) / log(Q)
For example
11000000 in binary is 8 characters.
To get it in Decimal you would need:
(8-1)*log(2) / log(10) = 2.1 digits (round up to 3)
Reason it was wrong:
The highest number that is possible in n characters is
(P^n) - 1
not
P^(n-1)

If you have a number that's X digits long in base B, then the maximum value that can be represented is B^X - 1. So if you want to know how many digits it might take in base C, then you have to find the number Y that C^Y - 1 is at least as big as B^X - 1. The way to do that is to take the logarithm in base C of B^X-1. And since the logarithm (log) of a number in base C is the same as the natural log (ln) of that number divided by the natural log of C, that becomes:
Y = ln((B^X)-1) / ln(C) + 1
and since ln(B^X) is X * ln(B), and that's probably faster to calculate than ln(B^X-1) and close enough to the right answer, rewrite that as
Y = X * ln(B) / ln(C) + 1
Covert that to your favourite language. Because we dropped the "-1", we might end up with one digit more than you need in some cases. But even better, you can pre-calculate ln(B)/ln(C) and just multiply it by new "X"s and the length of the number you are trying to convert changes.

Calculating the number of digit can be done using the formulas given by the other answers, however, it might actually be faster to allocate a buffer of maximum size first and then return the relevant part of that buffer instead of calculating a logarithm.
Note that the worst case for the buffer size happens when you convert to binary, which gives you a buffer size of 32 characters for 32-bit integers.
Converting a number to an arbitrary base could be done using the C# function below (The code would look very similar in other languages like C or Java):
public static string IntToString(int value, char[] baseChars)
{
// 32 is the worst cast buffer size for base 2 and int.MaxValue
int i = 32;
char[] buffer = new char[i];
int targetBase= baseChars.Length;
do
{
buffer[--i] = baseChars[value % targetBase];
value = value / targetBase;
}
while (value > 0);
char[] result = new char[32 - i];
Array.Copy(buffer, i, result, 0, 32 - i);
return new string(result);
}

The keyword here is "logarithm", here are some suggestive links:
http://www.adug.org.au/MathsCorner/MathsCornerLogs2.htm
http://staff.spd.dcu.ie/johnbcos/download/Fermat%20material/Fermat_Record_Number/HOW_MANY.html

look at the logarithms base P and base Q. Round down to nearest integer.
The logarithm base P can be computed using your favorite base (10 or e): log_P(x) = log_10(x)/log_10(P)

You need to compute the length of the fractional part separately.
For binary to decimal, there are as many decimal digits as there are bits. For example, binary 0.11001101001001 is decimal 0.80133056640625, both 14 digits after the radix point.
For decimal to binary, there are two cases. If the decimal fraction is dyadic, then there are as many bits as decimal digits (same as for binary to decimal above). If the fraction is not dyadic, then the number of bits is infinite.
(You can use my decimal/binary converter to experiment with this.)

Related

Splitting a floating point number as sums of floating point of fixed precision

Suppose i have an algorithm by which i can compute an infinitely precise floating point number (depending from a parameter N) lets say in pseudocode:
arbitrary_precision_float f = computeValue(n); //it could be a function which compute a specific value, like PI for instance.
I guess i can implement computeValue(int) with the library mpf of the gnump library for example...
Anyway how can i split such number in sums of floating point number where each number has L Mantissa digits?
//example
f = x1 + x2 + ... + xn;
/*
for i = 1:n
xi = 2^ei * Mi
Mi has exactly p digits.
*/
I don't know if i'm clear but i'm looking for something "simple".
You can use a very simple algorithm. Assume without loss of generality that the exponent of your original number is zero; if it's not, then you just add that exponent to all the exponents of the answer.
Split your number f into groups of L digits and treat each group as a separate xi. Any such group can be represented in the form you need: the mantissa will be exactly that group, and the exponent will be negated start position of the group in the original number (that is, i*L, where i is the group number).
If any of the resulting xis starts from zero, you just shift its mantissa correcting the exponent correspondingly.
For example, for L=4
f = 10010011100
1001
0011
100
-> x1=1.001 *2^0
x2=0.011 *2^{-4} = 1.1*2^{-6}
x3=1.00 *2^{-8}
Another question arises if you want to minimize the amount of numbers you get. In the example above, two numbers are sufficient: 1.001*2^0+1.11*2^{-6}. This is a separate question, and in fact is a simple problem for dynamic programming.

Generate a unique number out of the combination of 'n' different numbers?

To clarify, as input I have 'n' (n1, n2, n3,...) numbers (integers) such as each number is unique within this set.
I would like to generate a number out of this set (lets call the generated number big 'N') that is also unique, and that allows me to verify that a number 'n1' belongs to the set 'n' just by using 'N'.
is that possible?
Edit:
Thanks for the answers guys, I am looking into them atm. For those requesting an example, here is a simple one:
imagine i have those paths (bi-directional graph) with a random unique value (let's call it identifier):
P1 (N1): A----1----B----2----C----3----D
P2 (N2): A----4----E----5----D
So I want to get the full path (unique path, not all paths) from A knowing N1 and this path as a result should be P1.
Mind you that 1,2,...are just unique numbers in this graph, not weights or distances, I just use them for my heuristic.
If you are dealing with small numbers, no problem. You are doing the same thing with digits every time you compose a number: a digit is a number from 0 to 9 and a full number is a combination of them that:
is itself a number
is unique for given digits
allows you to easily verify if a digit is inside
The gotcha is that the numbers must have an upper limit, like 10 is for digits. Let's say 1000 here for simplicity, the similar composed number could be:
n1*1000^k + n2*1000^(k-1) + n3*1000^(k-2) ... + nk*1000^(0)
So if you have numbers 33, 44 and 27 you will get:
33*1000000 + 44*1000 + 27, and that is number N: 33044027
Of course you can do the same with bigger limits, and binary like 256,1024 or 65535, but it grows big fast.
A better idea, if possible is to convert it into a string (a string is still a number!) with some separator (a number in base 11, that is 10 normal digits + 1 separator digit). This is more flexible as there are no upper limits. Imagine to use digits 0-9 + a separator digit 'a'. You can obtain number 33a44a27 in base 11. By translating this to base 10 or base 16 you can get an ordinary computer number (65451833 if I got it right). Then converting 65451833 to undecimal (base11) 33a44a27, and splitting by digit 'a' you can get the original numbers back to test.
EDIT: A VARIABLE BASE NUMBER?
Of course this would work better digitally in base 17 (16 digits+separator). But I suspect there are more optimal ways, for example if the numbers are unique in the path, the more numbers you add, the less are remaining, the shorter the base could shrink. Can you imagine a number in which the first digit is in base 20, the second in base 19, the third in base 18, and so on? Can this be done? Meh?
In this variating base world (in a 10 nodes graph), path n0-n1-n2-n3-n4-n5-n6-n7-n8-n9 would be
n0*10^0 + (n1*9^1)+(offset:1) + n2*8^2+(offset:18) + n3*7^3+(offset:170)+...
offset1: 10-9=1
offset2: 9*9^1-1*8^2+1=81-64+1=18
offset3: 8*8^2-1*7^3+1=343-512+1=170
If I got it right, in this fiddle: http://jsfiddle.net/Hx5Aq/ the biggest number path would be: 102411
var path="9-8-7-6-5-4-3-2-1-0"; // biggest number
o2=(Math.pow(10,1)-Math.pow(9,1)+1); // offsets so digits do not overlap
o3=(Math.pow(9,2)-Math.pow(8,2)+1);
o4=(Math.pow(8,3)-Math.pow(7,3)+1);
o5=(Math.pow(7,4)-Math.pow(6,4)+1);
o6=(Math.pow(6,5)-Math.pow(5,5)+1);
o7=(Math.pow(5,6)-Math.pow(4,6)+1);
o8=(Math.pow(4,7)-Math.pow(3,7)+1);
o9=(Math.pow(3,8)-Math.pow(2,8)+1);
o10=(Math.pow(2,9)-Math.pow(1,9)+1);
o11=(Math.pow(1,10)-Math.pow(0,10)+1);
var n=path.split("-");
var res;
res=
n[9]*Math.pow(10,0) +
n[8]*Math.pow(9,1) + o2 +
n[7]*Math.pow(8,2) + o3 +
n[6]*Math.pow(7,3) + o4 +
n[5]*Math.pow(6,4) + o5 +
n[4]*Math.pow(5,5) + o6 +
n[3]*Math.pow(4,6) + o7 +
n[2]*Math.pow(3,7) + o8 +
n[1]*Math.pow(2,8) + o9 +
n[0]*Math.pow(1,9) + o10;
alert(res);
So N<=102411 would represent any path of ten nodes? Just a trial. You have to find a way of naming them, for instance if they are 1,2,3,4,5,6... and you use 5 you will have to compact the remaining 1,2,3,4,6->5,7->6... => 1,2,3,4,5,6... (that is revertable and unique if you start from the first)
Theoretically, yes it is.
By defining p_i as the i'th prime number, you can generate N=p_(n1)*p_(n2)*..... Now, all you have to do is to check if N%p_(n) == 0 or not.
However, note that N will grow to huge numbers very fast, so I am not sure this is a very practical solution.
One very practical probabilistic solution is using bloom filters. Note that bloom filters is a set of bits, that can be translated easily to any number N.
Bloom filters have no false negatives (if you said a number is not in the set, it really isn't), but do suffer from false positives with an expected given probability (that is dependent on the size of the sets, number of functions used and number of bits used).
As a side note, to get a result that is 100% accurate, you are going to need at the very least 2^k bits (where k is the range of the elements) to represent the number N by looking at this number as a bitset, where each bit indicates existence or non-existence of a number in the set. You can show that there is no 100% accurate solution that uses less bits (peigeon hole principle). Note that for integers for example with 32 bits, it means you are going to need N with 2^32 bits, which is unpractical.

Encode number to a result

In my app I need to run a 5 digits number through an algorithm and return a number between the given interval, ie:
The function encode, gets 3 parameters, 5 digits initial number, interval lower limit and interval superior limit, for example:
int res=encode(12879,10,100) returns 83.
The function starts from 12879 and does something with the numbers and returns a number between 10 and 100. This mustn't be random, every time I pass the number 12879 to the encode function must always return the same number.
Any ideas?
Thanks,
Direz
One possible approach:
compute the range of your interval R = (100 - 10) + 1
compute a hash modulo R of the input H = hash(12879) % R
add the lower bound to the modular hash V = 10 + H
Here the thing though - you haven't defined any constraints or requirements on the "algorithm" that produces the result. If all you want is to map a value into a given range (without any knowledge of the distribution of the input, or how input values may cluster, etc), you could just as easily just take the range modulo of the input without hashing (as Foo Bah demonstrates).
If there are certain constraints, requirements, or distributions of the input or output of your encode method, then the approach may need to be quite different. However, you are the only one who knows what additional requirements you have.
You can do something simple like
encode(x,y,z) --> y + (x mod (z-y))
You don't have an upper limit for this function?
Assume it is 99999 because it is 5 digits. For your case, the simplest way is:
int encode (double N,double H,double L)
{
return (int)(((H - L) / (99999 - 10000)) * (N - 10000) + 10);
}

Number base conversion as a stream operation

Is there a way in constant working space to do arbitrary size and arbitrary base conversions. That is, to convert a sequence of n numbers in the range [1,m] to a sequence of ceiling(n*log(m)/log(p)) numbers in the range [1,p] using a 1-to-1 mapping that (preferably but not necessarily) preservers lexigraphical order and gives sequential results?
I'm particularly interested in solutions that are viable as a pipe function, e.i. are able to handle larger dataset than can be stored in RAM.
I have found a number of solutions that require "working space" proportional to the size of the input but none yet that can get away with constant "working space".
Does dropping the sequential constraint make any difference? That is: allow lexicographically sequential inputs to result in non lexicographically sequential outputs:
F(1,2,6,4,3,7,8) -> (5,6,3,2,1,3,5,2,4,3)
F(1,2,6,4,3,7,9) -> (5,6,3,2,1,3,5,2,4,5)
some thoughts:
might this work?
streamBasen -> convert(n, lcm(n,p)) -> convert(lcm(n,p), p) -> streamBasep
(where lcm is least common multiple)
I don't think it's possible in the general case. If m is a power of p (or vice-versa), or if they're both powers of a common base, you can do it, since each group of logm(p) is then independent. However, in the general case, suppose you're converting the number a1 a2 a3 ... an. The equivalent number in base p is
sum(ai * mi-1 for i in 1..n)
If we've processed the first i digits, then we have the ith partial sum. To compute the i+1'th partial sum, we need to add ai+1 * mi. In the general case, this number is going have non-zero digits in most places, so we'll need to modify all of the digits we've processed so far. In other words, we'll have to process all of the input digits before we'll know what the final output digits will be.
In the special case where m are both powers of a common base, or equivalently if logm(p) is a rational number, then mi will only have a few non-zero digits in base p near the front, so we can safely output most of the digits we've computed so far.
I think there is a way of doing radix conversion in a stream-oriented fashion in lexicographic order. However, what I've come up with isn't sufficient for actually doing it, and it has a couple of assumptions:
The length of the positional numbers are already known.
The numbers described are integers. I've not considered what happens with the maths and -ive indices.
We have a sequence of values a of length p, where each value is in the range [0,m-1]. We want a sequence of values b of length q in the range [0,n-1]. We can work out the kth digit of our output sequence b from a as follows:
bk = floor[ sum(ai * mi for i in 0 to p-1) / nk ] mod n
Lets rearrange that sum into two parts, splitting it at an arbitrary point z
bk = floor[ ( sum(ai * mi for i in z to p-1) + sum(ai * mi for i in 0 to z-1) ) / nk ] mod n
Suppose that we don't yet know the values of a between [0,z-1] and can't compute the second sum term. We're left with having to deal with ranges. But that still gives us information about bk.
The minimum value bk can be is:
bk >= floor[ sum(ai * mi for i in z to p-1) / nk ] mod n
and the maximum value bk can be is:
bk <= floor[ ( sum(ai * mi for i in z to p-1) + mz - 1 ) / nk ] mod n
We should be able to do a process like this:
Initialise z to be p. We will count down from p as we receive each character of a.
Initialise k to the index of the most significant value in b. If my brain is still working, ceil[ logn(mp) ].
Read a value of a. Decrement z.
Compute the min and max value for bk.
If the min and max are the same, output bk, and decrement k. Goto 4. (It may be possible that we already have enough values for several consecutive values of bk)
If z!=0 then we expect more values of a. Goto 3.
Hopefully, at this point we're done.
I've not considered how to efficiently compute the range values as yet, but I'm reasonably confident that computing the sum from the incoming characters of a can be done much more reasonably than storing all of a. Without doing the maths though, I won't make any hard claims about it though!
Yes, it is possible
For every I character(s) you read in, you will write out O character(s)
based on Ceiling(Length * log(In) / log(Out)).
Allocate enough space
Set x to 1
Loop over digits from end to beginning # Horner's method
Set a to x * digit
Set t to O - 1
Loop while a > 0 and t >= 0
Set a to a + out digit
Set out digit at position t to a mod to base
Set a to a / to base
Set x to x * from base
Return converted digit(s)
Thus, for base 16 to 2 (which is easy), using "192FE" we read '1' and convert it, then repeat on '9', then '2' and so on giving us '0001', '1001', '0010', '1111', and '1110'.
Note that for bases that are not common powers, such as base 17 to base 2 would mean reading 1 characters and writing 5.

Expressing an integer as a series of multipliers

Scroll down to see latest edit, I left all this text here just so that I don't invalidate the replies this question has received so far!
I have the following brain teaser I'd like to get a solution for, I have tried to solve this but since I'm not mathematically that much above average (that is, I think I'm very close to average) I can't seem wrap my head around this.
The problem: Given number x should be split to a serie of multipliers, where each multiplier <= y, y being a constant like 10 or 16 or whatever. In the serie (technically an array of integers) the last number should be added instead of multiplied to be able to convert the multipliers back to original number.
As an example, lets assume x=29 and y=10. In this case the expected array would be {10,2,9} meaning 10*2+9. However if y=5, it'd be {5,5,4} meaning 5*5+4 or if y=3, it'd be {3,3,3,2} which would then be 3*3*3+2.
I tried to solve this by doing something like this:
while x >= y, store y to multipliers, then x = x - y
when x < y, store x to multipliers
Obviously this didn't work, I also tried to store the "leftover" part separately and add that after everything else but that didn't work either. I believe my main problem is that I try to think this in a way too complex manner while the solution is blatantly obvious and simple.
To reiterate, these are the limits this algorithm should have:
has to work with 64bit longs
has to return an array of 32bit integers (...well, shorts are OK too)
while support for signed numbers (both + and -) would be nice, if it helps the task only unsigned numbers is a must
And while I'm doing this using Java, I'd rather take any possible code examples as pseudocode, I specifically do NOT want readily made answers, I just need a nudge (well, more of a strong kick) so that I can solve this at least partly myself. Thanks in advance.
Edit: Further clarification
To avoid some confusion, I think I should reword this a bit:
Every integer in the result array should be less or equal to y, including the last number.
Yes, the last number is just a magic number.
No, this is isn't modulus since then the second number would be larger than y in most cases.
Yes, there is multiple answers to most of the numbers available, however I'm looking for the one with least amount of math ops. As far as my logic goes, that means finding the maximum amount of as big multipliers as possible, for example x=1 000 000,y=100 is 100*100*100 even though 10*10*10*10*10*10 is equally correct answer math-wise.
I need to go through the given answers so far with some thought but if you have anything to add, please do! I do appreciate the interest you've already shown on this, thank you all for that.
Edit 2: More explanations + bounty
Okay, seems like what I was aiming for in here just can't be done the way I thought it could be. I was too ambiguous with my goal and after giving it a bit of a thought I decided to just tell you in its entirety what I'd want to do and see what you can come up with.
My goal originally was to come up with a specific method to pack 1..n large integers (aka longs) together so that their String representation is notably shorter than writing the actual number. Think multiples of ten, 10^6 and 1 000 000 are the same, however the representation's length in characters isn't.
For this I wanted to somehow combine the numbers since it is expected that the numbers are somewhat close to each other. I firsth thought that representing 100, 121, 282 as 100+21+161 could be the way to go but the saving in string length is neglible at best and really doesn't work that well if the numbers aren't very close to each other. Basically I wanted more than ~10%.
So I came up with the idea that what if I'd group the numbers by common property such as a multiplier and divide the rest of the number to individual components which I can then represent as a string. This is where this problem steps in, I thought that for example 1 000 000 and 100 000 can be expressed as 10^(5|6) but due to the context of my aimed usage this was a bit too flaky:
The context is Web. RESTful URL:s to be specific. That's why I mentioned of thinking of using 64 characters (web-safe alphanumberic non-reserved characters and then some) since then I could create seemingly random URLs which could be unpacked to a list of integers expressing a set of id numbers. At this point I thought of creating a base 64-like number system for expressing base 10/2 numbers but since I'm not a math genius I have no idea beyond this point how to do it.
The bounty
Now that I have written the whole story (sorry that it's a long one), I'm opening a bounty to this question. Everything regarding requirements for the preferred algorithm specified earlier is still valid. I also want to say that I'm already grateful for all the answers I've received so far, I enjoy being proven wrong if it's done in such a manner as you people have done.
The conclusion
Well, bounty is now given. I spread a few comments to responses mostly for future reference and myself, you can also check out my SO Uservoice suggestion about spreading bounty which is related to this question if you think we should be able to spread it among multiple answers.
Thank you all for taking time and answering!
Update
I couldn't resist trying to come up with my own solution for the first question even though it doesn't do compression. Here is a Python solution using a third party factorization algorithm called pyecm.
This solution is probably several magnitudes more efficient than Yevgeny's one. Computations take seconds instead of hours or maybe even weeks/years for reasonable values of y. For x = 2^32-1 and y = 256, it took 1.68 seconds on my core duo 1.2 ghz.
>>> import time
>>> def test():
... before = time.time()
... print factor(2**32-1, 256)
... print time.time()-before
...
>>> test()
[254, 232, 215, 113, 3, 15]
1.68499994278
>>> 254*232*215*113*3+15
4294967295L
And here is the code:
def factor(x, y):
# y should be smaller than x. If x=y then {y, 1, 0} is the best solution
assert(x > y)
best_output = []
# try all possible remainders from 0 to y
for remainder in xrange(y+1):
output = []
composite = x - remainder
factors = getFactors(composite)
# check if any factor is larger than y
bad_remainder = False
for n in factors.iterkeys():
if n > y:
bad_remainder = True
break
if bad_remainder: continue
# make the best factors
while True:
results = largestFactors(factors, y)
if results == None: break
output += [results[0]]
factors = results[1]
# store the best output
output = output + [remainder]
if len(best_output) == 0 or len(output) < len(best_output):
best_output = output
return best_output
# Heuristic
# The bigger the number the better. 8 is more compact than 2,2,2 etc...
# Find the most factors you can have below or equal to y
# output the number and unused factors that can be reinserted in this function
def largestFactors(factors, y):
assert(y > 1)
# iterate from y to 2 and see if the factors are present.
for i in xrange(y, 1, -1):
try_another_number = False
factors_below_y = getFactors(i)
for number, copies in factors_below_y.iteritems():
if number in factors:
if factors[number] < copies:
try_another_number = True
continue # not enough factors
else:
try_another_number = True
continue # a factor is not present
# Do we want to try another number, or was a solution found?
if try_another_number == True:
continue
else:
output = 1
for number, copies in factors_below_y.items():
remaining = factors[number] - copies
if remaining > 0:
factors[number] = remaining
else:
del factors[number]
output *= number ** copies
return (output, factors)
return None # failed
# Find prime factors. You can use any formula you want for this.
# I am using elliptic curve factorization from http://sourceforge.net/projects/pyecm
import pyecm, collections, copy
getFactors_cache = {}
def getFactors(n):
assert(n != 0)
# attempt to retrieve from cache. Returns a copy
try:
return copy.copy(getFactors_cache[n])
except KeyError:
pass
output = collections.defaultdict(int)
for factor in pyecm.factors(n, False, True, 10, 1):
output[factor] += 1
# cache result
getFactors_cache[n] = output
return copy.copy(output)
Answer to first question
You say you want compression of numbers, but from your examples, those sequences are longer than the undecomposed numbers. It is not possible to compress these numbers without more details to the system you left out (probability of sequences/is there a programmable client?). Could you elaborate more?
Here is a mathematical explanation as to why current answers to the first part of your problem will never solve your second problem. It has nothing to do with the knapsack problem.
This is Shannon's entropy algorithm. It tells you the theoretical minimum amount of bits you need to represent a sequence {X0, X1, X2, ..., Xn-1, Xn} where p(Xi) is the probability of seeing token Xi.
Let's say that X0 to Xn is the span of 0 to 4294967295 (the range of an integer). From what you have described, each number is as likely as another to appear. Therefore the probability of each element is 1/4294967296.
When we plug it into Shannon's algorithm, it will tell us what the minimum number of bits are required to represent the stream.
import math
def entropy():
num = 2**32
probability = 1./num
return -(num) * probability * math.log(probability, 2)
# the (num) * probability cancels out
The entropy unsurprisingly is 32. We require 32 bits to represent an integer where each number is equally likely. The only way to reduce this number, is to increase the probability of some numbers, and decrease the probability of others. You should explain the stream in more detail.
Answer to second question
The right way to do this is to use base64, when communicating with HTTP. Apparently Java does not have this in the standard library, but I found a link to a free implementation:
http://iharder.sourceforge.net/current/java/base64/
Here is the "pseudo-code" which works perfectly in Python and should not be difficult to convert to Java (my Java is rusty):
def longTo64(num):
mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
output = ""
# special case for 0
if num == 0:
return mapping[0]
while num != 0:
output = mapping[num % 64] + output
num /= 64
return output
If you have control over your web server and web client, and can parse the entire HTTP requests without problem, you can upgrade to base85. According to wikipedia, url encoding allows for up to 85 characters. Otherwise, you may need to remove a few characters from the mapping.
Here is another code example in Python
def longTo85(num):
mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:#&=+$,/?%#[]"
output = ""
base = len(mapping)
# special case for 0
if num == 0:
return mapping[0]
while num != 0:
output = mapping[num % base] + output
num /= base
return output
And here is the inverse operation:
def stringToLong(string):
mapping = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*'();:#&=+$,/?%#[]"
output = 0
base = len(mapping)
place = 0
# check each digit from the lowest place
for digit in reversed(string):
# find the number the mapping of symbol to number, then multiply by base^place
output += mapping.find(digit) * (base ** place)
place += 1
return output
Here is a graph of Shannon's algorithm in different bases.
As you can see, the higher the radix, the less symbols are needed to represent a number. At base64, ~11 symbols are required to represent a long. At base85, it becomes ~10 symbols.
Edit after final explanation:
I would think base64 is the best solution, since there are standard functions that deal with it, and variants of this idea don't give much improvement. This was answered with much more detail by others here.
Regarding the original question, although the code works, it is not guaranteed to run in any reasonable time, as was answered as well as commented on this question by LFSR Consulting.
Original Answer:
You mean something like this?
Edit - corrected after a comment.
shortest_output = {}
foreach (int R = 0; R <= X; R++) {
// iteration over possible remainders
// check if the rest of X can be decomposed into multipliers
newX = X - R;
output = {};
while (newX > Y) {
int i;
for (i = Y; i > 1; i--) {
if ( newX % i == 0) { // found a divider
output.append(i);
newX = newX /i;
break;
}
}
if (i == 1) { // no dividers <= Y
break;
}
}
if (newX != 1) {
// couldn't find dividers with no remainder
output.clear();
}
else {
output.append(R);
if (output.length() < shortest_output.length()) {
shortest_output = output;
}
}
}
It sounds as though you want to compress random data -- this is impossible for information theoretic reasons. (See http://www.faqs.org/faqs/compression-faq/part1/preamble.html question 9.) Use Base64 on the concatenated binary representations of your numbers and be done with it.
The problem you're attempting to solve (you're dealing with a subset of the problem, given you're restriction of y) is called Integer Factorization and it cannot be done efficiently given any known algorithm:
In number theory, integer factorization is the breaking down of a composite number into smaller non-trivial divisors, which when multiplied together equal the original integer.
This problem is what makes a number of cryptographic functions possible (namely RSA which uses 128 bit keys - long is half of that.) The wiki page contains some good resources that should move you in the right direction with your problem.
So, your brain teaser is indeed a brain teaser... and if you solve it efficiently we can elevate your math skills to above average!
Updated after the full story
Base64 is most likely your best option. If you want a custom solution you can try implementing a Base 65+ system. Just remember that just because 10000 can be written as "10^4" doesn't mean that everything can be written as 10^n where n is an integer. Different base systems are the simplest way to write numbers and the higher the base the less digits the number requires. Plus most framework libraries contain algorithms for Base64 encoding. (What language you are using?).
One way to further pack the urls is the one you mentioned but in Base64.
int[] IDs;
IDs.sort() // So IDs[i] is always smaller or equal to IDs[i-1].
string url = Base64Encode(IDs[0]);
for (int i = 1; i < IDs.length; i++) {
url += "," + Base64Encode(IDs[i-1] - IDs[i]);
}
Note that you require some separator as the initial ID can be arbitrarily large and the difference between two IDs CAN be more than 63 in which case one Base64 digit is not enough.
Updated
Just restating that the problem is unsolvable. For Y = 64 you can't write 87681 in multipliers + remainder where each of these is below 64. In other words, you cannot write any of the numbers 87617..87681 with multipliers that are below 64. Each of these numbers has an elementary term over 64. 87616 can be written in elementary terms below 64 but then you'd need those + 65 and so the remainder will be over 64.
So if this was just a brainteaser, it's unsolvable. Was there some practical purpose for this which could be achieved in some way other than using multiplication and a remainder?
And yes, this really should be a comment but I lost my ability to comment at some point. :p
I believe the solution which comes closest is Yevgeny's. It is also easy to extend Yevgeny's solution to remove the limit for the remainder in which case it would be able to find solution where multipliers are smaller than Y and remainder as small as possible, even if greater than Y.
Old answer:
If you limit that every number in the array must be below the y then there is no solution for this. Given large enough x and small enough y, you'll end up in an impossible situation. As an example with y of 2, x of 12 you'll get 2 * 2 * 2 + 4 as 2 * 2 * 2 * 2 would be 16. Even if you allow negative numbers with abs(n) below y that wouldn't work as you'd need 2 * 2 * 2 * 2 - 4 in the above example.
And I think the problem is NP-Complete even if you limit the problem to inputs which are known to have an answer where the last term is less than y. It sounds quite much like the [Knapsack problem][1]. Of course I could be wrong there.
Edit:
Without more accurate problem description it is hard to solve the problem, but one variant could work in the following way:
set current = x
Break current to its terms
If one of the terms is greater than y the current number cannot be described in terms greater than y. Reduce one from current and repeat from 2.
Current number can be expressed in terms less than y.
Calculate remainder
Combine as many of the terms as possible.
(Yevgeny Doctor has more conscise (and working) implementation of this so to prevent confusion I've skipped the implementation.)
OP Wrote:
My goal originally was to come up with
a specific method to pack 1..n large
integers (aka longs) together so that
their String representation is notably
shorter than writing the actual
number. Think multiples of ten, 10^6
and 1 000 000 are the same, however
the representation's length in
characters isn't.
I have been down that path before, and as fun as it was to learn all the math, to save you time I will just point you to: http://en.wikipedia.org/wiki/Kolmogorov_complexity
In a nutshell some strings can be easily compressed by changing your notation:
10^9 (4 characters) = 1000000000 (10 characters)
Others cannot:
7829203478 = some random number...
This is a great great simplification of the article I linked to above, so I recommend that you read it instead of taking my explanation at face value.
Edit:
If you are trying to make RESTful urls for some set of unique data, why wouldn't you use a hash, such as MD5? Then include the hash as part of the URL, then look up the data based on the hash. Or am I missing something obvious?
The original method you chose (a * b + c * d + e) would be very difficult to find optimal solutions for simply due to the large search space of possibilities. You could factorize the number but it's that "+ e" that complicates things since you need to factorize not just that number but quite a few immediately below it.
Two methods for compression spring immediately to mind, both of which give you a much-better-than-10% saving on space from the numeric representation.
A 64-bit number ranges from (unsigned):
0 to
18,446,744,073,709,551,616
or (signed):
-9,223,372,036,854,775,808 to
9,223,372,036,854,775,807
In both cases, you need to reduce the 20-characters taken (without commas) to something a little smaller.
The first is to simply BCD-ify the number the base64 encode it (actually a slightly modified base64 since "/" would not be kosher in a URL - you should use one of the acceptable characters such as "_").
Converting it to BCD will store two digits (or a sign and a digit) into one byte, giving you an immediate 50% reduction in space (10 bytes). Encoding it base 64 (which turns every 3 bytes into 4 base64 characters) will turn the first 9 bytes into 12 characters and that tenth byte into 2 characters, for a total of 14 characters - that's a 30% saving.
The only better method is to just base64 encode the binary representation. This is better because BCD has a small amount of wastage (each digit only needs about 3.32 bits to store [log210], but BCD uses 4).
Working on the binary representation, we only need to base64 encode the 64-bit number (8 bytes). That needs 8 characters for the first 6 bytes and 3 characters for the final 2 bytes. That's 11 characters of base64 for a saving of 45%.
If you wanted maximum compression, there are 73 characters available for URL encoding:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789$-_.+!*'(),
so technically you could probably encode base-73 which, from rough calculations, would still take up 11 characters, but with more complex code which isn't worth it in my opinion.
Of course, that's the maximum compression due to the maximum values. At the other end of the scale (1-digit) this encoding actually results in more data (expansion rather than compression). You can see the improvements only start for numbers over 999, where 4 digits can be turned into 3 base64 characters:
Range (bytes) Chars Base64 chars Compression ratio
------------- ----- ------------ -----------------
< 10 (1) 1 2 -100%
< 100 (1) 2 2 0%
< 1000 (2) 3 3 0%
< 10^4 (2) 4 3 25%
< 10^5 (3) 5 4 20%
< 10^6 (3) 6 4 33%
< 10^7 (3) 7 4 42%
< 10^8 (4) 8 6 25%
< 10^9 (4) 9 6 33%
< 10^10 (5) 10 7 30%
< 10^11 (5) 11 7 36%
< 10^12 (5) 12 7 41%
< 10^13 (6) 13 8 38%
< 10^14 (6) 14 8 42%
< 10^15 (7) 15 10 33%
< 10^16 (7) 16 10 37%
< 10^17 (8) 17 11 35%
< 10^18 (8) 18 11 38%
< 10^19 (8) 19 11 42%
< 2^64 (8) 20 11 45%
Update: I didn't get everything, thus I rewrote the whole thing in a more Java-Style fashion. I didn't think of the prime number case that is bigger than the divisor. This is fixed now. I leave the original code in order to get the idea.
Update 2: I now handle the case of the big prime number in another fashion . This way a result is obtained either way.
public final class PrimeNumberException extends Exception {
private final long primeNumber;
public PrimeNumberException(long x) {
primeNumber = x;
}
public long getPrimeNumber() {
return primeNumber;
}
}
public static Long[] decompose(long x, long y) {
try {
final ArrayList<Long> operands = new ArrayList<Long>(1000);
final long rest = x % y;
// Extract the rest so the reminder is divisible by y
final long newX = x - rest;
// Go into recursion, actually it's a tail recursion
recDivide(newX, y, operands);
} catch (PrimeNumberException e) {
// return new Long[0];
// or do whatever you like, for example
operands.add(e.getPrimeNumber());
} finally {
// Add the reminder to the array
operands.add(rest);
return operands.toArray(new Long[operands.size()]);
}
}
// The recursive method
private static void recDivide(long x, long y, ArrayList<Long> operands)
throws PrimeNumberException {
while ((x > y) && (y != 1)) {
if (x % y == 0) {
final long rest = x / y;
// Since y is a divisor add it to the list of operands
operands.add(y);
if (rest <= y) {
// the rest is smaller than y, we're finished
operands.add(rest);
}
// go in recursion
x = rest;
} else {
// if the value x isn't divisible by y decrement y so you'll find a
// divisor eventually
if (--y == 1) {
throw new PrimeNumberException(x);
}
}
}
}
Original: Here some recursive code I came up with. I would have preferred to code it in some functional language but it was required in Java. I didn't bother converting the numbers to integer but that shouldn't be that hard (yes, I'm lazy ;)
public static Long[] decompose(long x, long y) {
final ArrayList<Long> operands = new ArrayList<Long>();
final long rest = x % y;
// Extract the rest so the reminder is divisible by y
final long newX = x - rest;
// Go into recursion, actually it's a tail recursion
recDivide(newX, y, operands);
// Add the reminder to the array
operands.add(rest);
return operands.toArray(new Long[operands.size()]);
}
// The recursive method
private static void recDivide(long newX, long y, ArrayList<Long> operands) {
long x = newX;
if (x % y == 0) {
final long rest = x / y;
// Since y is a divisor add it to the list of operands
operands.add(y);
if (rest <= y) {
// the rest is smaller than y, we're finished
operands.add(rest);
} else {
// the rest can still be divided, go one level deeper in recursion
recDivide(rest, y, operands);
}
} else {
// if the value x isn't divisible by y decrement y so you'll find a divisor
// eventually
recDivide(x, y-1, operands);
}
}
Are you married to using Java? Python has an entire package dedicated just for this exact purpose. It'll even sanitize the encoding for you to be URL-safe.
Native Python solution
The standard module I'm recommending is base64, which converts arbitrary stings of chars into sanitized base64 format. You can use it in conjunction with the pickle module, which handles conversion from lists of longs (actually arbitrary size) to a compressed string representation.
The following code should work on any vanilla installation of Python:
import base64
import pickle
# get some long list of numbers
a = (854183415,1270335149,228790978,1610119503,1785730631,2084495271,
1180819741,1200564070,1594464081,1312769708,491733762,243961400,
655643948,1950847733,492757139,1373886707,336679529,591953597,
2007045617,1653638786)
# this gets you the url-safe string
str64 = base64.urlsafe_b64encode(pickle.dumps(a,-1))
print str64
>>> gAIoSvfN6TJKrca3S0rCEqMNSk95-F9KRxZwakqn3z58Sh3hYUZKZiePR0pRlwlfSqxGP05KAkNPHUo4jooOSixVFCdK9ZJHdEqT4F4dSvPY41FKaVIRFEq9fkgjSvEVoXdKgoaQYnRxAC4=
# this unwinds it
a64 = pickle.loads(base64.urlsafe_b64decode(str64))
print a64
>>> (854183415, 1270335149, 228790978, 1610119503, 1785730631, 2084495271, 1180819741, 1200564070, 1594464081, 1312769708, 491733762, 243961400, 655643948, 1950847733, 492757139, 1373886707, 336679529, 591953597, 2007045617, 1653638786)
Hope that helps. Using Python is probably the closest you'll get from a 1-line solution.
Wrt the original algorithm request: Is there a limit on the size of the last number (beyond that it must be stored in a 32b int)?
(The original request is all I'm able to tackle lol.)
The one that produces the shortest list is:
bool negative=(n<1)?true:false;
int j=n%y;
if(n==0 || n==1)
{
list.append(n);
return;
}
while((long64)(n-j*y)>MAX_INT && y>1) //R has to be stored in int32
{
y--;
j=n%y;
}
if(y<=1)
fail //Number has no suitable candidate factors. This shouldn't happen
int i=0;
for(;i<j;i++)
{
list.append(y);
}
list.append(n-y*j);
if(negative)
list[0]*=-1;
return;
A little simplistic compared to most answers given so far but it achieves the desired functionality of the original post... It's a little dirty but hopefully useful :)
Isn't this modulus?
Let / be integer division (whole numbers) and % be modulo.
int result[3];
result[0] = y;
result[1] = x / y;
result[2] = x % y;
Just set x:=x/n where n is the largest number that is less both than x and y. When you end up with x<=y, this is your last number in the sequence.
Like in my comment above, I'm not sure I understand exactly the question. But assuming integers (n and a given y), this should work for the cases you stated:
multipliers[0] = n / y;
multipliers[1] = y;
addedNumber = n % y;

Resources