Number base conversion as a stream operation - algorithm

Is there a way, in constant working space, to do arbitrary-size and arbitrary-base conversions? That is, to convert a sequence of n numbers in the range [1,m] to a sequence of ceiling(n*log(m)/log(p)) numbers in the range [1,p] using a 1-to-1 mapping that (preferably but not necessarily) preserves lexicographical order and gives sequential results?
I'm particularly interested in solutions that are viable as a pipe function, i.e. are able to handle larger datasets than can be stored in RAM.
I have found a number of solutions that require "working space" proportional to the size of the input but none yet that can get away with constant "working space".
Does dropping the sequential constraint make any difference? That is: allow lexicographically sequential inputs to result in non lexicographically sequential outputs:
F(1,2,6,4,3,7,8) -> (5,6,3,2,1,3,5,2,4,3)
F(1,2,6,4,3,7,9) -> (5,6,3,2,1,3,5,2,4,5)
some thoughts:
might this work?
streamBasen -> convert(n, lcm(n,p)) -> convert(lcm(n,p), p) -> streamBasep
(where lcm is least common multiple)

I don't think it's possible in the general case. If m is a power of p (or vice versa), or if they're both powers of a common base, you can do it, since each group of log_m(p) digits is then independent. However, in the general case, suppose you're converting the number a_1 a_2 a_3 ... a_n. The equivalent number in base p is
sum(a_i * m^(i-1) for i in 1..n)
If we've processed the first i digits, then we have the ith partial sum. To compute the (i+1)th partial sum, we need to add a_(i+1) * m^i. In the general case, this number is going to have non-zero digits in most places, so we'll need to modify all of the digits we've processed so far. In other words, we'll have to process all of the input digits before we'll know what the final output digits will be.
In the special case where m and p are both powers of a common base, or equivalently if log_m(p) is a rational number, m^i will only have a few non-zero digits in base p near the front, so we can safely output most of the digits we've computed so far.
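As a minimal sketch of the easy case, assume the input base is exactly p**k, so that every input digit expands into k independent base-p digits. The function name `expand_digits` and its parameters are made up for this illustration.

```python
def expand_digits(digits, p, k):
    """Stream digits in base p**k out as digits in base p.

    digits: iterable of ints in [0, p**k), most significant first.
    Each input digit is handled on its own, so the working space is constant.
    """
    for d in digits:
        chunk = []
        for _ in range(k):
            chunk.append(d % p)   # peel off the least significant base-p digit
            d //= p
        yield from reversed(chunk)  # emit most significant digit first


# Hex "192FE" to binary, one hex digit at a time.
bits = list(expand_digits([1, 9, 2, 15, 14], p=2, k=4))
```

Because it is a generator, it can sit in a pipeline and never needs more than one input digit at a time.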

I think there is a way of doing radix conversion in a stream-oriented fashion in lexicographic order. However, what I've come up with isn't sufficient for actually doing it, and it has a couple of assumptions:
The length of the positional numbers are already known.
The numbers described are integers. I've not considered what happens with the maths and negative indices.
We have a sequence of values a of length p, where each value is in the range [0,m-1]. We want a sequence of values b of length q in the range [0,n-1]. We can work out the kth digit of our output sequence b from a as follows:
b_k = floor[ sum(a_i * m^i for i in 0 to p-1) / n^k ] mod n
Let's rearrange that sum into two parts, splitting it at an arbitrary point z
b_k = floor[ ( sum(a_i * m^i for i in z to p-1) + sum(a_i * m^i for i in 0 to z-1) ) / n^k ] mod n
Suppose that we don't yet know the values of a between [0,z-1] and can't compute the second sum term. We're left with having to deal with ranges. But that still gives us information about bk.
The minimum value bk can be is:
b_k >= floor[ sum(a_i * m^i for i in z to p-1) / n^k ] mod n
and the maximum value bk can be is:
b_k <= floor[ ( sum(a_i * m^i for i in z to p-1) + m^z - 1 ) / n^k ] mod n
We should be able to do a process like this:
1. Initialise z to be p. We will count down from p as we receive each character of a.
2. Initialise k to the index of the most significant value in b. If my brain is still working, ceil[ log_n(m^p) ].
3. Read a value of a. Decrement z.
4. Compute the min and max value for b_k.
5. If the min and max are the same, output b_k, and decrement k. Goto 4. (It may be possible that we already have enough values for several consecutive values of b_k)
6. If z != 0 then we expect more values of a. Goto 3.
7. Hopefully, at this point we're done.
I've not considered how to efficiently compute the range values as yet, but I'm reasonably confident that computing the sum from the incoming characters of a can be done much more reasonably than storing all of a. Without doing the maths though, I won't make any hard claims about it though!
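The process above can be sketched roughly as follows. This is only an illustration under stated assumptions: the name `stream_convert` and its parameters are invented here, the input length must be known up front, and Python's arbitrary-size integers hold the running sum, so it is not constant space. It also compares the whole quotients of the lower and upper bounds rather than only their values mod n, which avoids being fooled when the two bounds differ by a multiple of n.

```python
def stream_convert(digits, length, m, n):
    """Yield base-n digits (most significant first) of a base-m number.

    digits: iterable of `length` ints in [0, m), most significant first.
    An output digit is emitted as soon as the digits read so far fix it.
    """
    # Number of output digits: smallest q with n**q >= m**length.
    q = 1
    while n ** q < m ** length:
        q += 1

    prefix = 0    # value of the input digits read so far
    z = length    # number of input digits still unread
    emitted = 0   # number of output digits already produced
    for d in digits:
        prefix = prefix * m + d
        z -= 1
        low = prefix * m ** z      # bound if all unread digits are 0
        high = low + m ** z - 1    # bound if all unread digits are m - 1
        while emitted < q:
            k = q - 1 - emitted    # index of the next output digit
            if low // n ** k != high // n ** k:
                break              # digit k is not determined yet
            yield (low // n ** k) % n
            emitted += 1


# Decimal 99 to base 3: the first two output digits are known after reading "9".
out = list(stream_convert([9, 9], length=2, m=10, n=3))
```

When z reaches 0 the two bounds coincide, so every remaining output digit is flushed.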

Yes, it is possible
For every I character(s) you read in, you will write out O character(s)
based on Ceiling(Length * log(In) / log(Out)).
Allocate enough space
Set x to 1
Loop over digits from end to beginning # Horner's method
    Set a to x * digit
    Set t to O - 1
    Loop while a > 0 and t >= 0
        Set a to a + out digit at position t
        Set out digit at position t to a mod to base
        Set a to a / to base
        Set t to t - 1
    Set x to x * from base
Return converted digit(s)
Thus, for base 16 to 2 (which is easy), using "192FE" we read '1' and convert it, then repeat on '9', then '2' and so on giving us '0001', '1001', '0010', '1111', and '1110'.
Note that converting between bases that are not powers of a common base, such as base 17 to base 2, would mean reading 1 character and writing 5.
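A rough Python rendering of the pseudocode above, for experimenting with it. The function name `convert_digits` is invented for this sketch, and two details are assumptions about the intended behaviour: the position t is decremented inside the inner loop, and the output length is computed with integers instead of logarithms to avoid rounding surprises. Note that it keeps the whole output buffer and a growing multiplier x, so as written it processes a complete number rather than a stream.

```python
def convert_digits(digits, from_base, to_base):
    """Convert a digit list (most significant first) between bases."""
    # Output length: smallest count with to_base**count >= from_base**len(digits).
    out_len = 1
    while to_base ** out_len < from_base ** len(digits):
        out_len += 1
    out = [0] * out_len

    x = 1  # weight of the current input digit (from_base ** position)
    for digit in reversed(digits):
        a = x * digit
        t = out_len - 1
        # Add a into the output buffer, carrying towards the front.
        while a > 0 and t >= 0:
            a += out[t]
            out[t] = a % to_base
            a //= to_base
            t -= 1
        x *= from_base
    return out


bits = convert_digits([1, 9, 2, 15, 14], 16, 2)  # hex 192FE
```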

Related

Generating a perfect hash function given known list of strings?

Suppose I have a list of N strings, known at compile-time.
I want to generate (at compile-time) a function that will map each string to a distinct integer between 1 and N inclusive. The function should take very little time or space to execute.
For example, suppose my strings are:
{"apple", "orange", "banana"}
Such a function may return:
f("apple") -> 2
f("orange") -> 1
f("banana") -> 3
What's a strategy to generate this function?
I was thinking to analyze the strings at compile time and look for a couple of constants I could mod or add by or something?
The compile-time generation time/space can be quite expensive (but obviously not ridiculously so).
Say you have m distinct strings, and let a_(i,j) be the jth character of the ith string. In the following, I'll assume that they all have the same length. This can be easily translated into any reasonable programming language by treating a_(i,j) as the null character if j ≥ |a_i|.
The idea I suggest is composed of two parts:
Find (at most) m - 1 positions differentiating the strings, and store these positions.
Create a perfect hash function by considering the strings as length-m vectors, and storing the parameters of the perfect hash function.
Obviously, in general, the hash function must check at least m - 1 positions. It's easy to see this by induction. For 2 strings, at least 1 character must be checked. Assume it's true for i strings: i - 1 positions must be checked. Create a new set of strings by appending 0 to the end of each of the i strings, and add a new string that is identical to one of the strings, except it has a 1 at the end.
Conversely, it's obvious that it's possible to find at most m - 1 positions sufficient for differentiating the strings (for some sets the number of course might be lower, as low as log to the base of the alphabet size of m). Again, it's easy to see so by induction. Two distinct strings must differ at some position. Placing the strings in a matrix with m rows, there must be some column where not all characters are the same. Partitioning the matrix into two or more parts, and applying the argument recursively to each part with more than 2 rows, shows this.
Say the m - 1 positions are p_1, ..., p_(m-1). In the following, recall the meaning above for a_(i,p_j) for p_j ≥ |a_i|: it is the null character.
Let us define h(a_i) = sum_{j=1}^{m-1} [q_j * a_(i,p_j) % n], for random q_j and some n. Then h is known to be a universal hash function: the probability of pair-collision P(x ≠ y ∧ h(x) = h(y)) ≤ 1/n.
Given a universal hash function, there are known constructions for creating a perfect hash function from it. Perhaps the simplest is creating a vector of size m^2 and successively trying the above h with n = m^2 with randomized coefficients, until there are no collisions. The expected number of attempts needed until this is achieved is 2, and the probability that more attempts are needed decreases exponentially.
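A small sketch of the retry construction, with a few simplifications that are assumptions of this example rather than part of the answer: it hashes every character position instead of first selecting m - 1 distinguishing positions, it reduces modulo a fixed prime before reducing modulo n = m^2 so that distinct strings cannot collide for every choice of coefficients, and the names `build_perfect_hash` and `PRIME` are invented here.

```python
import random

PRIME = 2**31 - 1  # a Mersenne prime, larger than any Unicode code point


def build_perfect_hash(strings, seed=0):
    """Return a function mapping each given string to a distinct slot in [0, m*m)."""
    rng = random.Random(seed)
    m = len(strings)
    n = m * m
    width = max(len(s) for s in strings)

    def make_hash(coeffs):
        def h(s):
            padded = s.ljust(width, "\0")  # shorter strings are padded with nulls
            total = sum(q * ord(c) for q, c in zip(coeffs, padded))
            return (total % PRIME) % n
        return h

    while True:
        # Draw fresh random coefficients until no two strings collide.
        coeffs = [rng.randrange(1, PRIME) for _ in range(width)]
        h = make_hash(coeffs)
        if len({h(s) for s in strings}) == m:
            return h


words = ["apple", "orange", "banana"]
h = build_perfect_hash(words)
slots = {w: h(w) for w in words}
```

The slots land in a table of size m^2, so mapping them to 1..N as the question asks would still need a small lookup table indexed by slot.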
It is simple. Make a dictionary and assign 1 to the first word, 2 to the second, ... No need to make things complicated, just number your words.
To make the lookup effective, use trie or binary search or whatever tool your language provides.

Splitting a floating point number as sums of floating point of fixed precision

Suppose I have an algorithm by which I can compute an arbitrary-precision floating-point number (depending on a parameter n), let's say in pseudocode:
arbitrary_precision_float f = computeValue(n); // it could be a function which computes a specific value, like PI for instance.
I guess I can implement computeValue(int) with the mpf functions of the GNU MP (GMP) library, for example...
Anyway, how can I split such a number into a sum of floating-point numbers where each number has L mantissa digits?
//example
f = x1 + x2 + ... + xn;
/*
for i = 1:n
xi = 2^ei * Mi
Mi has exactly p digits.
*/
I don't know if i'm clear but i'm looking for something "simple".
You can use a very simple algorithm. Assume without loss of generality that the exponent of your original number is zero; if it's not, then you just add that exponent to all the exponents of the answer.
Split your number f into groups of L digits and treat each group as a separate xi. Any such group can be represented in the form you need: the mantissa will be exactly that group, and the exponent will be negated start position of the group in the original number (that is, i*L, where i is the group number).
If any of the resulting x_i starts with a zero, you just shift its mantissa, correcting the exponent correspondingly.
For example, for L=4
f = 10010011100
1001
0011
100
-> x1=1.001 *2^0
x2=0.011 *2^{-4} = 1.1*2^{-6}
x3=1.00 *2^{-8}
Another question arises if you want to minimize the amount of numbers you get. In the example above, two numbers are sufficient: 1.001*2^0+1.11*2^{-6}. This is a separate question, and in fact is a simple problem for dynamic programming.
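The grouping step can be sketched in Python like this. The function name `split_mantissa` is made up for the example, the input is assumed to be the bit string of the mantissa with the first bit having weight 2^0 (as in the example above), and each term is returned as an integer mantissa with an exponent rather than in normalised 1.xxx form.

```python
from fractions import Fraction


def split_mantissa(bits, L):
    """Split a bit string into (mantissa, exponent) terms of at most L bits.

    bits[0] has weight 2**0, bits[1] has weight 2**-1, and so on.
    Each term stands for mantissa * 2**exponent; all-zero groups are skipped.
    """
    terms = []
    for start in range(0, len(bits), L):
        group = bits[start:start + L]
        mantissa = int(group, 2)
        if mantissa == 0:
            continue
        exponent = -(start + len(group) - 1)  # weight of the group's last bit
        terms.append((mantissa, exponent))
    return terms


terms = split_mantissa("10010011100", 4)
# Exact check that the terms add up to the original value.
total = sum(Fraction(m) * Fraction(2) ** e for m, e in terms)
```

Here the three terms are 1001b * 2^-3, 11b * 2^-7 and 100b * 2^-10, which are the same values as x1, x2 and x3 in the example.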

Generate a unique number out of the combination of 'n' different numbers?

To clarify, as input I have 'n' (n1, n2, n3,...) numbers (integers) such that each number is unique within this set.
I would like to generate a number out of this set (let's call the generated number big 'N') that is also unique, and that allows me to verify that a number 'n1' belongs to the set 'n' just by using 'N'.
is that possible?
Edit:
Thanks for the answers guys, I am looking into them atm. For those requesting an example, here is a simple one:
imagine i have those paths (bi-directional graph) with a random unique value (let's call it identifier):
P1 (N1): A----1----B----2----C----3----D
P2 (N2): A----4----E----5----D
So I want to get the full path (unique path, not all paths) from A knowing N1 and this path as a result should be P1.
Mind you that 1,2,...are just unique numbers in this graph, not weights or distances, I just use them for my heuristic.
If you are dealing with small numbers, no problem. You are doing the same thing with digits every time you compose a number: a digit is a number from 0 to 9 and a full number is a combination of them that:
is itself a number
is unique for given digits
allows you to easily verify if a digit is inside
The gotcha is that the numbers must have an upper limit, like 10 is for digits. Let's say 1000 here for simplicity, the similar composed number could be:
n1*1000^k + n2*1000^(k-1) + n3*1000^(k-2) ... + nk*1000^(0)
So if you have numbers 33, 44 and 27 you will get:
33*1000000 + 44*1000 + 27, and that is number N: 33044027
Of course you can do the same with bigger limits, and binary like 256,1024 or 65535, but it grows big fast.
A better idea, if possible is to convert it into a string (a string is still a number!) with some separator (a number in base 11, that is 10 normal digits + 1 separator digit). This is more flexible as there are no upper limits. Imagine to use digits 0-9 + a separator digit 'a'. You can obtain number 33a44a27 in base 11. By translating this to base 10 or base 16 you can get an ordinary computer number (65451833 if I got it right). Then converting 65451833 to undecimal (base11) 33a44a27, and splitting by digit 'a' you can get the original numbers back to test.
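A quick way to try the separator idea in Python, relying on the fact that `int(text, 11)` accepts `a` as the digit ten. The helper names `encode` and `decode` are invented for this sketch, and it assumes non-negative integers whose first element is not 0 (a leading zero would be lost in N).

```python
DIGITS = "0123456789a"  # 'a' is the separator digit in base 11


def encode(numbers):
    """Join the decimal numbers with 'a' and read the result as base 11."""
    return int("a".join(str(n) for n in numbers), 11)


def decode(N):
    """Write N in base 11 and split on the separator digit."""
    text = ""
    while N > 0:
        N, r = divmod(N, 11)
        text = DIGITS[r] + text
    return [int(part) for part in text.split("a")]


N = encode([33, 44, 27])      # 65451833, as in the answer above
members = decode(N)           # [33, 44, 27]
is_member = 44 in members
```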
EDIT: A VARIABLE BASE NUMBER?
Of course this would work better digitally in base 17 (16 digits+separator). But I suspect there are more optimal ways, for example if the numbers are unique in the path, the more numbers you add, the less are remaining, the shorter the base could shrink. Can you imagine a number in which the first digit is in base 20, the second in base 19, the third in base 18, and so on? Can this be done? Meh?
In this variating base world (in a 10 nodes graph), path n0-n1-n2-n3-n4-n5-n6-n7-n8-n9 would be
n0*10^0 + (n1*9^1)+(offset:1) + n2*8^2+(offset:18) + n3*7^3+(offset:170)+...
offset1: 10-9=1
offset2: 9*9^1-1*8^2+1=81-64+1=18
offset3: 8*8^2-1*7^3+1=512-343+1=170
If I got it right, in this fiddle: http://jsfiddle.net/Hx5Aq/ the biggest number path would be: 102411
var path="9-8-7-6-5-4-3-2-1-0"; // biggest number
o2=(Math.pow(10,1)-Math.pow(9,1)+1); // offsets so digits do not overlap
o3=(Math.pow(9,2)-Math.pow(8,2)+1);
o4=(Math.pow(8,3)-Math.pow(7,3)+1);
o5=(Math.pow(7,4)-Math.pow(6,4)+1);
o6=(Math.pow(6,5)-Math.pow(5,5)+1);
o7=(Math.pow(5,6)-Math.pow(4,6)+1);
o8=(Math.pow(4,7)-Math.pow(3,7)+1);
o9=(Math.pow(3,8)-Math.pow(2,8)+1);
o10=(Math.pow(2,9)-Math.pow(1,9)+1);
o11=(Math.pow(1,10)-Math.pow(0,10)+1);
var n=path.split("-");
var res;
res=
n[9]*Math.pow(10,0) +
n[8]*Math.pow(9,1) + o2 +
n[7]*Math.pow(8,2) + o3 +
n[6]*Math.pow(7,3) + o4 +
n[5]*Math.pow(6,4) + o5 +
n[4]*Math.pow(5,5) + o6 +
n[3]*Math.pow(4,6) + o7 +
n[2]*Math.pow(3,7) + o8 +
n[1]*Math.pow(2,8) + o9 +
n[0]*Math.pow(1,9) + o10;
alert(res);
So N<=102411 would represent any path of ten nodes? Just a trial. You have to find a way of naming them, for instance if they are 1,2,3,4,5,6... and you use 5 you will have to compact the remaining 1,2,3,4,6->5,7->6... => 1,2,3,4,5,6... (that is revertable and unique if you start from the first)
Theoretically, yes it is.
By defining p_i as the i'th prime number, you can generate N=p_(n1)*p_(n2)*..... Now, all you have to do is to check if N%p_(n) == 0 or not.
However, note that N will grow to huge numbers very fast, so I am not sure this is a very practical solution.
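To see how this behaves, here is a small sketch with a naive trial-division prime finder. The names `nth_prime`, `encode_set` and `contains` are invented for the example, and the elements are assumed to be positive integers.

```python
from itertools import count


def nth_prime(i):
    """Return the i-th prime (1-indexed) using plain trial division."""
    found = 0
    for candidate in count(2):
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            found += 1
            if found == i:
                return candidate


def encode_set(numbers):
    """N is the product of the primes indexed by the set's elements."""
    N = 1
    for n in numbers:
        N *= nth_prime(n)
    return N


def contains(N, n):
    """n is in the set exactly when its prime divides N."""
    return N % nth_prime(n) == 0


N = encode_set([1, 3, 4])  # 2 * 5 * 7 = 70
```

Even for this three-element set N already needs 7 bits, and each further element multiplies N by another prime.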
One very practical probabilistic solution is using Bloom filters. Note that a Bloom filter is a set of bits, which can easily be translated to a number N.
Bloom filters have no false negatives (if the filter says a number is not in the set, it really isn't), but they do suffer from false positives with a given expected probability (which depends on the size of the set, the number of hash functions used and the number of bits used).
As a side note, to get a result that is 100% accurate, you are going to need at the very least 2^k bits (where k is the number of bits per element) to represent the number N, by looking at this number as a bitset in which each bit indicates the existence or non-existence of a number in the set. You can show that there is no 100% accurate solution that uses fewer bits (pigeonhole principle). Note that for 32-bit integers, for example, it means you are going to need N with 2^32 bits, which is impractical.
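For small ranges the bitset view is easy to try, since Python integers can serve directly as N. The helper names `to_bitset` and `in_bitset` are made up for this sketch, and the elements are assumed to be non-negative.

```python
def to_bitset(numbers):
    """Build N with bit n set for every element n of the set."""
    N = 0
    for n in numbers:
        N |= 1 << n
    return N


def in_bitset(N, n):
    """Check whether bit n of N is set."""
    return (N >> n) & 1 == 1


N = to_bitset([1, 3, 4])  # binary 11010
```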

Algorithm - Partition two numbers about a power-of-two

Given two floating point numbers, p and q where 0 < p < q I am interested in writing a function partition(p,q) that finds the 'simplest' number r that is between p and q. For example:
partition(3.0, 4.1) = 4.0 (2^2)
partition(4.2, 7.0) = 6.0 (2^2 + 2^1)
partition(2.0, 4.0) = 3.0 (2^1 + 2^0)
partition(0.3, 0.6) = 0.5 (2^-1)
partition(1.0, 10.0) = 8.0 (2^3)
In the last instance I am interested in the largest number (so 8 as opposed to 4 or 2).
Let us assume that p and q are both normalized and positive, and p < q.
If p and q have differing exponents, it appears that the number you are looking for is the number obtained by zeroing the mantissa of q after the leading (and often implicit) 1. The corner cases are left as an exercise, especially the case where q's mantissa is already made of zeroes after the leading, possibly implicit, 1.
If p and q have the same exponent, then we have to look at their mantissas. These mantissas have some bits in common (starting from the most significant end). Let us call c_1 c_2 .. c_k p_(k+1) ... p_n the bits of p's mantissa, c_1 c_2 .. c_k q_(k+1) ... q_n the bits of q's mantissa, where c_1 .. c_k are common bits and p_(k+1), q_(k+1) differ. Then p_(k+1) is zero and q_(k+1) is one (because of the hypotheses). The number with the same exponent and mantissa c_1 .. c_k 1 0 .. 0 is in the interval p .. q and is the number you are looking for (again, corner cases left as an exercise).
Write the numbers in binary (terminating if possible, so 1 is written as 1.0000..., not 0.1111...),
Scan from left to right, "keeping" all digits at which the two numbers are equal
At the first digit where the two numbers differ, p must be 0 and q must be 1 since p < q:
If q has any more 1 digits after this point, then put a 1 at this point and you're done.
If q has no more 1 digits after this point, then doing that would result in r == q, which is forbidden, so instead append a 0 digit. Follow that by a 1 digit unless doing so would result in r == p, in which case append another 0 and then a 1.
Basically, we truncate q down to the first place at which p and q differ, then jigger it a bit if necessary to avoid r == p or r == q. The result is certainly less than q and greater than p. It is "simplest" (has the least possible number of 1 digits) since any number between p and q must share their common initial sequence. We have added only one 1 digit to that sequence, which is necessary since the initial sequence alone is <= p, so no value in range (p,q) has fewer 1 digits. We've chosen the "largest" solution because we always place our extra 1 at the first (biggest) possible place.
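One way to experiment with this is the sketch below. It does not scan digits; instead it uses a different formulation that I believe gives the same results on the question's examples: look for the largest power of two 2^k that has a multiple strictly between p and q, and return that multiple. The function is called `partition` after the question, it assumes Python 3.9+ for `math.nextafter`, and it rejects inputs with no representable float strictly between them.

```python
import math


def partition(p, q):
    """Return the multiple of the largest 2**k lying strictly between p and q."""
    if not 0 < p < q:
        raise ValueError("expected 0 < p < q")
    if math.nextafter(p, math.inf) >= q:
        raise ValueError("no float lies strictly between p and q")

    _, k = math.frexp(q)  # q = mantissa * 2**k with mantissa in [0.5, 1), so q < 2**k
    while True:
        step = math.ldexp(1.0, k)                # 2**k
        r = (math.floor(p / step) + 1) * step    # smallest multiple of step above p
        if r < q:
            return r
        k -= 1                                   # try a finer power of two


examples = [partition(3.0, 4.1), partition(4.2, 7.0), partition(1.0, 10.0)]
```

Scaling by powers of two is exact in binary floating point, which is why the comparison `r < q` can be trusted here.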
It sounds like you just want to convert the binary representation of the largest integer strictly less than your largest argument to the corresponding sum of powers of two.

How many digits will be after converting from one numeral system to another

The main question: How many digits?
Let me explain. I have a number in binary system: 11000000 and in decimal is 192.
After converting to decimal, how many digits will it have (in decimal)? In my example, it's 3 digits. But that isn't the problem. I've searched the internet and found one algorithm for the integral part and one for the fractional part. I don't quite understand them, but (I think) they work.
When converting from binary to octal, it's easier: each 3 bits give you 1 digit in octal. Same for hex: each 4 bits = 1 hex digit.
But I'm very curious what to do if I have a number in the P numeral system and want to convert it to the Q numeral system. I know how to do it (I think I know :)), but first of all I want to know how many digits it will take in the Q system (you know, I must preallocate space).
Writing n in base b takes floor(log base b (n)) + 1 digits (ceiling(log base b (n)) falls one short when n is an exact power of b).
The ratio you noticed (octal/binary) is log base 2 (n) / log base 8 (n) = 3.
(From memory, will it stick?)
There was an error in my previous answer: look at the comment by Ben Schwehn.
Sorry for the confusion, I found and explain the error I made in my previous answer below.
Please use the answer provided by Paul Tomblin. (rewritten to use P, Q and n)
Y = ln(P^n) / ln(Q)
Y = n * ln(P) / ln(Q)
So Y (rounded up) is the number of characters you need in system Q to express the highest number you can encode in n characters in system P.
I have no answer (that wouldn't already convert the number and take up that much space in a temporary variable) for getting the bare minimum for a given number: 1000 (bin) = 8 (dec) needs one decimal digit, while you would reserve 2 decimal positions using this formula.
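As a small illustration of the formula, the sketch below computes the estimate with logarithms and, for comparison, an exact count using integer arithmetic only. Both function names are invented here; the exact version is an addition of this example and sidesteps floating-point rounding when the ratio of the logarithms is a whole number.

```python
import math


def digits_estimate(n, P, Q):
    """Y = n * ln(P) / ln(Q), rounded up (may be off by one due to rounding)."""
    return math.ceil(n * math.log(P) / math.log(Q))


def digits_exact(n, P, Q):
    """Digits needed in base Q for the largest n-digit number in base P."""
    limit = P ** n - 1   # largest value expressible in n base-P digits
    y = 1
    while Q ** y <= limit:
        y += 1
    return y


needed = digits_exact(8, 2, 10)  # 8 bits hold at most 255, which has 3 decimal digits
```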
If a temporary memory usage isn't a problem, you might cheat and use (Python):
len(str(int(otherBaseStr,P)))
This will give you the number of decimal digits needed to convert a number in base P, given as a string (otherBaseStr), into decimal.
Old WRONG answer:
If you have a number in P numeral system of length n
Then you can calculate the highest number that is possible in n characters:
P^(n-1)
To express this highest number in number system Q you need to use logarithms (because they are the inverse to exponentiation):
log(P^(n-1)) / log(Q)
(n-1)*log(P) / log(Q)
For example
11000000 in binary is 8 characters.
To get it in Decimal you would need:
(8-1)*log(2) / log(10) = 2.1 digits (round up to 3)
Reason it was wrong:
The highest number that is possible in n characters is
(P^n) - 1
not
P^(n-1)
If you have a number that's X digits long in base B, then the maximum value that can be represented is B^X - 1. So if you want to know how many digits it might take in base C, then you have to find the number Y that C^Y - 1 is at least as big as B^X - 1. The way to do that is to take the logarithm in base C of B^X-1. And since the logarithm (log) of a number in base C is the same as the natural log (ln) of that number divided by the natural log of C, that becomes:
Y = ln((B^X)-1) / ln(C) + 1
and since ln(B^X) is X * ln(B), and that's probably faster to calculate than ln(B^X-1) and close enough to the right answer, rewrite that as
Y = X * ln(B) / ln(C) + 1
Convert that to your favourite language. Because we dropped the "-1", we might end up with one digit more than you need in some cases. But even better, you can pre-calculate ln(B)/ln(C) and just multiply it by new "X"s as the length of the number you are trying to convert changes.
Calculating the number of digit can be done using the formulas given by the other answers, however, it might actually be faster to allocate a buffer of maximum size first and then return the relevant part of that buffer instead of calculating a logarithm.
Note that the worst case for the buffer size happens when you convert to binary, which gives you a buffer size of 32 characters for 32-bit integers.
Converting a number to an arbitrary base could be done using the C# function below (The code would look very similar in other languages like C or Java):
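using System; // for Array and ArgumentOutOfRangeException

public static string IntToString(int value, char[] baseChars)
{
    // value % targetBase would be negative for negative input
    if (value < 0)
        throw new ArgumentOutOfRangeException(nameof(value));

    // 32 is the worst case buffer size for base 2 and int.MaxValue
    int i = 32;
    char[] buffer = new char[i];
    int targetBase = baseChars.Length;

    // Fill the buffer from the end with the least significant digit first
    do
    {
        buffer[--i] = baseChars[value % targetBase];
        value = value / targetBase;
    }
    while (value > 0);

    // Copy only the used part of the buffer
    char[] result = new char[32 - i];
    Array.Copy(buffer, i, result, 0, 32 - i);
    return new string(result);
}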
The keyword here is "logarithm", here are some suggestive links:
http://www.adug.org.au/MathsCorner/MathsCornerLogs2.htm
http://staff.spd.dcu.ie/johnbcos/download/Fermat%20material/Fermat_Record_Number/HOW_MANY.html
look at the logarithms base P and base Q. Round down to nearest integer.
The logarithm base P can be computed using your favorite base (10 or e): log_P(x) = log_10(x)/log_10(P)
You need to compute the length of the fractional part separately.
For binary to decimal, there are as many decimal digits as there are bits. For example, binary 0.11001101001001 is decimal 0.80133056640625, both 14 digits after the radix point.
For decimal to binary, there are two cases. If the decimal fraction is dyadic, then there are as many bits as decimal digits (same as for binary to decimal above). If the fraction is not dyadic, then the number of bits is infinite.
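The two cases can be checked with exact rational arithmetic. In this sketch the function name `binary_fraction_bits` is made up, and the input is assumed to be a decimal string; it returns the number of bits after the radix point for a dyadic fraction and `None` when the binary expansion is infinite.

```python
from fractions import Fraction


def binary_fraction_bits(decimal_str):
    """Count the bits needed for the fractional part, or None if infinite."""
    frac = Fraction(decimal_str) % 1   # keep only the fractional part, exactly
    d = frac.denominator               # already reduced to lowest terms
    if d & (d - 1):
        return None                    # denominator is not a power of two
    return d.bit_length() - 1          # d == 2**bits


bits = binary_fraction_bits("0.80133056640625")  # the 14-bit example above
```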
(You can use my decimal/binary converter to experiment with this.)

Resources