Finding nth largest number among three numbers - algorithm

Given numbers k, a, b, and c, how do you find the kth largest number among a, b, and c without using if, arrays, or loops? The min and max functions are provided.

You need something like this C++ fragment; in other languages it might look different.
auto n1 = max(a, max(b, c));
auto n3 = min(a, min(b, c));
auto n2 = a + b + c - n1 - n3;
return (2-k)*(3-k)*n1/2 + (k-1)*(3-k)*n2 + (k-1)*(k-2)*n3/2;
It is assumed that k equals one of the numbers 1, 2, 3.
Incorporating two comments results in the more elegant version:
#define maybe(x) ((x) * ((x) != n1 && (x) != n3))
auto n1 = max(a, max(b, c));
auto n3 = min(a, min(b, c));
auto n2 = maybe(a) + maybe(b) + maybe(c);
return n1*(k==1) + n2*(k==2) + n3*(k==3);
Also, note that names like 1st, 2nd, and 3rd cannot be used, because identifiers can't start with a digit.
In some languages, (k==1) cannot be used as an integer, or is not guaranteed to be 0 or 1. In these languages, the first version may work better.
Regarding overflow: that depends on the type.
For integer types, a+b+c-n1-n3 may overflow, but the result is still correct. Here's why: the result of a+b+c-n1-n3 is correct in its lower bits. For example, if we use 32-bit numbers, a, b, and c are 32-bit values and the result is correct in the lowest 32 bits; that is, the result is exactly a, b, or c, and thus correct. Given that ^ won't work for floating-point numbers and the question did not specify the type of the numbers, I went back from ^ to + and -.
For non-integer types, an overflow may introduce a rounding error. To avoid that, I chose an implementation that does not use + or - at all, except when adding 0.
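For completeness, here is a compilable sketch that wraps the second version in a self-contained function; the function name and the small test in main are mine, and it assumes a, b, and c are distinct, as in the fragments above.
#include <algorithm>
#include <iostream>
using std::max;
using std::min;

// kth largest of three distinct values, k in {1, 2, 3}, with no if, array, or loop.
int kthOfThree(int k, int a, int b, int c) {
    int n1 = max(a, max(b, c));
    int n3 = min(a, min(b, c));
    int n2 = a * (a != n1 && a != n3)
           + b * (b != n1 && b != n3)
           + c * (c != n1 && c != n3);
    return n1 * (k == 1) + n2 * (k == 2) + n3 * (k == 3);
}

int main() {
    std::cout << kthOfThree(1, 4, 9, 7) << ' '    // 9
              << kthOfThree(2, 4, 9, 7) << ' '    // 7
              << kthOfThree(3, 4, 9, 7) << '\n';  // 4
}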

Related

Why do we have double hashing function as [(hash1(key) + i * hash2(key)) % TABLE_SIZE] but not simply as [(i * hash2(key)) % TABLE_SIZE]?

I learned the notation of double hashing [(hash1(key) + i * hash2(key)) % TABLE_SIZE] a couple of days ago. There is a part I couldn't understand after thinking about it and searching for an answer for days.
Why don't we discard the [hash1(key)] part from the double hashing function and simply make it as [(i * hash2(key)) % TABLE_SIZE]?
I couldn't find any downside of doing this, except all the hashcodes would start from 0 (when i = 0).
The main purpose of using double hashing, avoiding clusters, can still be achieved.
I'd be super thankful if anyone can help. :D
A quick summary of this answer:
There is a practical performance hit to your modified version, though it's not very large.
I think this is due to there not being as many different probe sequences as in regular double hashing, leading to some extra collisions due to the birthday paradox.
Now, the actual answer. :-)
Let's start off with some empirical analysis. What happens if you switch from the "standard" version of double hashing to the variant of double hashing that you've proposed?
I wrote a C++ program that generates uniformly-random choices of h1 and h2 values for each of the elements. It then inserts them into two different double-hashing tables, one using the normal approach and one using the variant. It repeats this process multiple times and reports the average number of probes required across each insertion. Here's what I found:
#include <iostream>
#include <vector>
#include <random>
#include <utility>
using namespace std;

/* Table size is picked to be a prime number. */
const size_t kTableSize = 5003;

/* Load factor for the hash table. */
const double kLoadFactor = 0.9;

/* Number of rounds to use. */
const size_t kNumRounds = 100000;

/* Random generator. */
static mt19937 generator;

/* Creates and returns an empty double hashing table. */
auto emptyTable(const size_t numSlots) {
    return vector<bool>(numSlots, false);
}

/* Simulation of double hashing. Each vector entry represents an item to store.
 * The first element of the pair is the value of h1(x), and the second element
 * of the pair is the value of h2(x).
 */
auto hashCodes(const size_t numItems) {
    vector<pair<size_t, size_t>> result;
    uniform_int_distribution<size_t> first(0, kTableSize - 1), second(1, kTableSize - 1);
    for (size_t i = 0; i < numItems; i++) {
        result.push_back({ first(generator), second(generator) });
    }
    return result;
}

/* Returns the probe location to use given a number of steps taken so far.
 * If modified is true, we ignore h1.
 */
size_t locationOf(size_t tableSize, size_t numProbes, size_t h1, size_t h2, bool modified) {
    size_t result = (numProbes == 0 || !modified)? h1 : 0;
    result += h2 * numProbes;
    return result % tableSize;
}

/* Performs a double-hashing insert, returning the number of probes required to
 * settle the element into its place.
 */
size_t insert(vector<bool>& table, size_t h1, size_t h2, bool modified) {
    size_t numProbes = 0;
    while (table[locationOf(table.size(), numProbes, h1, h2, modified)]) {
        numProbes++;
    }
    table[locationOf(table.size(), numProbes, h1, h2, modified)] = true;
    return numProbes + 1; // Count the original location as one probe
}

int main() {
    size_t normalProbes = 0, variantProbes = 0;
    for (size_t round = 0; round < kNumRounds; round++) {
        auto normalTable = emptyTable(kTableSize);
        auto variantTable = emptyTable(kTableSize);

        /* Insert a collection of items into the table. */
        for (auto [h1, h2]: hashCodes(kTableSize * kLoadFactor)) {
            normalProbes += insert(normalTable, h1, h2, false);
            variantProbes += insert(variantTable, h1, h2, true);
        }
    }
    cout << "Normal probes: " << normalProbes << endl;
    cout << "Variant probes: " << variantProbes << endl;
}
Output:
Normal probes: 1150241942
Variant probes: 1214644088
So, empirically, it looks like the modified approach leads to about 5% more probes being needed to place all the elements. The question, then, is why this is.
I do not have a fully-developed theoretical explanation as to why the modified version is slower, but I do have a reasonable guess as to what's going on. Intuitively, double hashing works by assigning each element that's inserted a random probe sequence, which is some permutation of the table slots. It's not a truly-random permutation, since not all permutations can be achieved, but it's random enough for some definition of "random enough" (see, for example, Guibas and Szemerédi's "The Analysis of Double Hashing").
Let's think about what happens when we do an insertion. How many times, on expectation, will we need to look at the probe sequence beyond just h1? The first item has 0 probability of needing to look at h2. The second item has 1/T probability, since it hits the first element with probability 1/T. The third item has 2/T probability, since it has a 2/T chance of hitting the first two items. More generally, using linearity of expectation, we can show that the expected number of times an item will be in a spot that's already taken is given by
1/T + 2/T + 3/T + 4/T + ... + (n-1)/T
= (1 + 2 + 3 + ... + (n-1)) / T
= n(n-1) / 2T
Now, let's imagine that the load factor on our hash table is α, meaning that αT = n. Then the expected number of collisions works out to
αT(αT - 1) / 2T
≈ α²T / 2.
In other words, we should expect to see a pretty decent number of times where we need to inspect h2 when using double hashing.
Now, what happens when we look at the probe sequence? The number of different probe sequences using traditional double hashing is T(T-1), where T is the number of slots in the table. This is because there are T possible choices of h1(x) and T-1 choices for h2(x).
The math behind the birthday paradox says that once approximately √(2T(T-1)) ≈ T√2 items have been inserted into the table, we have a 50% chance that two of them will end up having the same probe sequence assigned. The good news here is that it's not possible to insert T√2 items into a T-slot hash table - that's more elements than slots! - and so there's a fairly low probability that we see elements that get assigned the same probe sequences. That means that the conflicts we get in the table are mostly due to collisions between elements that have different probe sequences, but coincidentally end up landing smack on top of one another.
On the other hand, let's think about your variation. Technically speaking, there are still T(T-1) distinct probe sequences. However, I'm going to argue that there are "effectively" more like only T-1 distinct probe sequences. The reason for this is that
probe sequences don't really matter unless you have a collision when you do an insertion, and
once there's a collision after an insertion, the probe sequence for a given element is determined purely by its h2 value.
This is not a rigorous argument - it's more of an intuition - but it shows that we have less variation in how our permutations get chosen.
Because there are only T-1 different probe sequences to pick from once we've had a collision, the birthday paradox says that we need to see about √(2T) collisions before we find two items with identical values of h2. And indeed, given that we see, on expectation, α²T/2 items that need to have h2 inspected, this means that we have a very good chance of finding items whose positions will be determined by the exact same sequence of h2 values. This means that we have a new source of collisions compared with "classical" double hashing: collisions from h2 values overlapping one another.
Now, even if we do have collisions with h2 values, it's not a huge deal. After all, it'll only take a few extra probes to skip past items placed with the same h2 values before we get to new slots. But I think that this might be the source of the extra probes seen during the modified version.
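To put rough numbers on that intuition, here is a quick back-of-the-envelope check of my own, using the same parameters as the simulation above (T = 5003, α = 0.9):
#include <cmath>
#include <cstdio>

int main() {
    const double T = 5003, alpha = 0.9;
    // Expected number of insertions that have to look beyond h1 (≈ α²T/2).
    std::printf("expected h2 inspections: %.0f\n", alpha * alpha * T / 2);   // ≈ 2026
    // Birthday threshold for the T-1 "effective" probe sequences (≈ √(2T)).
    std::printf("birthday threshold:      %.0f\n", std::sqrt(2 * (T - 1))); // ≈ 100
}
With roughly two thousand insertions consulting h2 but only about a hundred needed for a likely repeat among the T-1 effective probe sequences, repeated h2 sequences are all but guaranteed.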
Hope this helps!
The proof of double hashing goes through under some weak assumptions about h1 and h2, namely, they're drawn from universal families. The result you get is that every operation is expected constant time (for every access that doesn't depend on the choice of h1 and h2).
With h2 only, you need to either strengthen the condition on h2 or give up the time bound as stated. Pick a prime p congruent to 1 mod 4, let P* = {1, …, p−1} be the set of units mod p, and consider the following mediocre but universal family of hash functions from {L, R} × P* to P*. Draw a random scalar c ← P* and a random function f ← (P* → P*) and define
h2((L, x)) = cx mod p
h2((R, x)) = f(x).
This family is universal because the (L, x) keys never collide with each other, and every other pair collides with probability exactly 1/|P*|. It's a bad choice for the double-hashing-with-only-h2 algorithm because it's linear on half of its range, and linearity preserves arithmetic sequences.
Consider the following sequence of operations. Fill half of the hash table at random by inserting (R, 1), …, (R, (p−1)/2), then insert half again as many elements (L, (p−1)/4), …, (L, 1). The table load is at most 3/4, so everything should run in expected constant time. Consider what happens, however, when we insert (L, 1). With probability 1/2, the location h2((L, 1)) is occupied by one of the R keys. The ith probe, i·h2((L, 1)), hits the same location as h2((L, i)), which for i ≤ (p−1)/4 is guaranteed to be full by earlier operations. Therefore the expected cost of this operation is linear in p even though the sequence of keys didn't depend on the hash function, which is unacceptable.
Putting h1 back in the mix smashes this structure.
(Ugh this didn't quite work because the proof of expected constant time assumes strong universality, not universality as stated in the abstract.)
Taking another bite at the apple, this time with strong universality. Leaving my other answer up because this one uses a result by Patrascu and Thorup as a black box (and your choice of some deep number theory or some handwaving), which is perhaps less satisfying. The result is about linear probing, namely, for every table size m that's a power of 4, there exists a 2-universal (i.e., strongly universal) hash family and a sequence of operations such that, in expectation over the random hash function, one operation (referred to as The Query) probes Θ(√m) cells.
In order to use this result, we'd really like a table of size p−1 where p is a prime, so fixing m and the bad-for-linear-probing hash family H_m (whose functions have codomain {0, …, m−1}), choose p to be the least prime greater than m. (Alternatively, the requirement that m be a power of 4 is basically for convenience writing up the proof; it seems tedious but possible to generalize Patrascu and Thorup's result to other table sizes.) Define the distribution H_p by drawing a function h'' ← H_m and then defining each value of h' independently according to the distribution
h'(x) = | h''(x) + 1   with probability m/(p−1)
        | m + 1        with probability 1/(p−1)
        ...
        | p − 1        with probability 1/(p−1).
Letting K be the field mod p, the functions h' have codomain K* = {1, …, p−1}, the units of K. Unless I botched the definition, it's straightforward to verify that H_p is strongly universal. We need to pull in some heavy-duty number theory to show that p − m is O(m^(2/3)) (this follows from the existence of primes between sufficiently large successive cubes), which means that our linear probe sequence of length O(√m) for The Query remains intact with constant probability for Ω(m^(1/3)) steps, well more than constant.
Now, in order to change this family from a linear probing wrecker to a "double" hash wrecker, we need to give the key involved in The Query a name, let's say q. (We know for sure which one it is because the operation sequence doesn't depend on the hash function.) We define a new distribution of hash functions h by drawing h' as before, drawing c ← K*, and then defining
h(x) = | c h'(x)   if x ≠ q
       | c          if x = q.
Let's verify that this is still 2-universal. Given keys x and y both not equal to q, it's clear that (h(x), h(y)) = (c h'(x), c h'(y)) has uniform distribution over K* × K*. Given q and some other key x, we examine the distribution of (h(q), h(x)) = (c, c h'(x)), which is as needed because c is uniform, and h'(x) is uniform and independent of c, hence so is c h'(x).
OK, the point of this exercise at last. The probe sequence for The Query will be c, 2c, 3c, etc. Which keys hash to (e.g.) 2c? They are the x's that satisfy the equation
h(x) = c h'(x) = 2c
from which we derive
h'(x) = 2,
i.e., the keys whose preferred slot is right after The Query's in linear probe order. Generalizing from 2 to i, we conclude that the bad linear probe sequence for The Query for h' becomes a bad "double" hashing probe sequence for The Query for h, QED.

Google Code Jam 2008 Round 1A Q 3

For the problem statement in google codejam 2008: Round 1A Question 3
In this problem, you have to find the last three digits before the
decimal point for the number (3 + √5)^n.
For example, when n = 5, (3 + √5)^5 = 3935.73982... The
answer is 935.
For n = 2, (3 + √5)^2 = 27.4164079... The answer is 027.
My solution, based on the idea that T(i) = 6*T(i-1) - 4*T(i-2) + 1, where T(i) is the integer part for n = i, is as below:
#include<stdio.h>
int a[5000];
main(){
    unsigned long l,n;
    int i,t;
    a[0]=1;
    a[1]=5;
    freopen("C-small-practice.in","r",stdin);
    scanf("%d",&t);
    for(i=2;i<5000;i++)
        a[i]=(6*a[i-1]-4*a[i-2]+10001)%1000;
    i=t;
    for(i=1;i<=t;i++){
        scanf("%ld",&n);
        printf("Case #%d: %.3d\n",i,a[(int)n]);
    }
    fclose(stdin);
}
In the line a[i]=(6*a[i-1]-4*a[i-2]+10001)%1000; I know there will be integer overflow, but I don't know why, by adding 10,000, I am getting the right answer.
I am using the GCC compiler, where sizeof(int) = 4.
Can anyone explain what is happening?
First off, the line
a[i]=(6*a[i-1]-4*a[i-2]+10001)%1000;
shouldn't actually cause any overflow, since you're keeping all previous values below 1000.
Second, did you consider what happens if 6*a[i-1]-4*a[i-2]+1 is negative? The modulus operator doesn't always return a positive value; it can return negative values as well (when the thing you are dividing is itself negative).
By adding 10000, you've ensured that no matter what the previous values were, the value of that expression is positive, and hence the mod will give a positive integer result.
Expanding on that second point, here's 6.5.5.6 of the C99 specification:
When integers are divided, the result of the / operator is the algebraic
quotient with any fractional part discarded. If the quotient a/b is
representable, the expression (a/b)*b + a%b shall equal a.
A note beside the word "discarded" states that / "truncates toward zero". Hence, for the second sentence to be true, the result of a % b when a is negative must itself be negative.
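Here is a quick demonstration of that behavior (my example, not part of the original answer):
#include <cstdio>

int main() {
    int x = 6 * 2 - 4 * 999 + 1;              // a possible intermediate value: -3983
    std::printf("%d\n", x % 1000);            // prints -983: % keeps the sign of the dividend
    std::printf("%d\n", (x + 10000) % 1000);  // prints 17: adding a multiple of 1000 makes the dividend positive
}
Adding 10000 (a multiple of 1000) doesn't change the residue; it only shifts the dividend into positive territory, so % then returns the mathematically expected remainder.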

Check whether a point is inside a rectangle by bit operator

A few days ago, my teacher told me it was possible to check whether a given point is inside a given rectangle using only bit operators. Is it true? If so, how can I do that?
This might not answer your question but what you are looking for could be this.
These are the bit-twiddling tricks compiled by Sean Eron Anderson, and he even put a bounty of $10 on finding a single bug. The closest thing I found here is a macro that determines whether a word x contains a byte whose value is between m and n:
Determine if a word has a byte between m and n
When m < n, this technique tests if a word x contains an unsigned byte value, such that m < value < n. It uses 7 arithmetic/logical operations when n and m are constant.
Note: Bytes that equal n can be reported by likelyhasbetween as false positives, so this should be checked by character if a certain result is needed.
Requirements: x>=0; 0<=m<=127; 0<=n<=128
#define likelyhasbetween(x,m,n) \
((((x)-~0UL/255*(n))&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
This technique would be suitable for a fast pretest. A variation that takes one more operation (8 total for constant m and n) but provides the exact answer is:
#define hasbetween(x,m,n) \
((~0UL/255*(127+(n))-((x)&~0UL/255*127)&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
It is possible if the numbers are finite positive integers.
Suppose we have a rectangle represented by the corner points (a1,b1) and (a2,b2). Given a point (x,y), we only need to evaluate the expression (a1<x) & (x<a2) & (b1<y) & (y<b2). So the problem now is to find a corresponding bit operation for a comparison expression such as c < d.
Let c_i be the i-th bit of the number c (which can be obtained by masking and bit shifting). We prove that for numbers with at most n bits, c < d is equivalent to r_(n-1), where
r_i = ((c_i ^ d_i) & ((!c_i) & d_i)) | (!(c_i ^ d_i) & r_(i-1))
Proof: when c_i and d_i are different, the left expression decides the result (it depends on ((!c_i) & d_i)); otherwise the right expression decides it (it depends on r_(i-1), the comparison of the next bit).
The expression ((!c_i) & d_i) is exactly the single-bit comparison c_i < d_i. Hence, this recurrence compares the bits from the most significant to the least significant until it can decide whether c is smaller than d.
Hence there is a purely bitwise expression corresponding to the comparison operator, and so it is possible to test whether a point is inside a rectangle with pure bit operations.
Edit: There is actually no need for a conditional statement; just expand r_(n-1) and you are done.
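A minimal sketch of that recurrence (my code, not the answerer's). The for loop is used only to iterate the recurrence over the bit positions; as the Edit notes, it can be fully unrolled so that no loop or conditional remains.
#include <cstdint>
#include <cstdio>

// 1 if c < d, 0 otherwise, using only bitwise operations on the individual bits.
unsigned lessThan(uint32_t c, uint32_t d) {
    unsigned r = 0;                          // "all bits equal so far" means not less
    for (int i = 0; i < 32; ++i) {           // build r_i from r_(i-1)
        unsigned ci = (c >> i) & 1u;
        unsigned di = (d >> i) & 1u;
        r = (((ci ^ di) & (~ci & di)) | (~(ci ^ di) & r)) & 1u;
    }
    return r;                                // r_(n-1)
}

// Is (x,y) strictly inside the rectangle with corners (a1,b1) and (a2,b2)?
unsigned inside(uint32_t x, uint32_t y,
                uint32_t a1, uint32_t b1, uint32_t a2, uint32_t b2) {
    return lessThan(a1, x) & lessThan(x, a2) & lessThan(b1, y) & lessThan(y, b2);
}

int main() {
    std::printf("%u %u\n", inside(3, 4, 1, 1, 10, 10),    // 1
                           inside(12, 4, 1, 1, 10, 10));  // 0
}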
x,y is in the rectangle {x0<x<x1 and y0<y<y1} if {x0<x and x<x1 and y0<y and y<y1}
If we can simulate < with bit operators, then we're good to go.
What does it mean to say something is < in binary? Consider
a: 0 0 0 0 1 1 0 1
b: 0 0 0 0 1 0 1 1
In the above, a > b, because a contains the first 1 whose counterpart in b is 0. We are thus seeking the leftmost bit at which myBit != otherBit. (== or equiv is a bitwise operation which can be represented with and/or/not.)
However, we need some way to propagate information from one bit to many bits. So we ask ourselves: can we "code" a function using only bit operators which is equivalent to if(q,k,a,b) = if q[k] then a else b? The answer is yes:
We create a bit-word consisting of replicating q[k] onto every bit. There are two ways I can think of to do this:
1) Left-shift by k, then right-shift by wordsize (efficient, but only works if you have shift operators which duplicate the last bit)
2) Inefficient but theoretically correct way:
We left-shift q by k bits
We take this result and and it with 10000...0
We right-shift this by 1 bit, and or it with the non-right-shifted version. This copies the bit in the first place to the second place. We repeat this process until the entire word is the same as the first bit (e.g. 64 times)
Calling this result mask, our function is (mask and a) or (!mask and b): the result will be a if the kth bit of q is true, otherwise the result will be b.
Taking the bit-vector c = a != b (the positions where a and b differ, with true == 1111..1 and false == 0000..0), we use our if function to successively test whether the first bit is 1, then the second bit, and so on:
a<b :=
  if(c,0,
     if(a,0, B_LESSTHAN_A, A_LESSTHAN_B),
     if(c,1,
        if(a,1, B_LESSTHAN_A, A_LESSTHAN_B),
        if(c,2,
           if(a,2, B_LESSTHAN_A, A_LESSTHAN_B),
           if(c,3,
              if(a,3, B_LESSTHAN_A, A_LESSTHAN_B),
              if(...
                 if(c,64,
                    if(a,64, B_LESSTHAN_A, A_LESSTHAN_B),
                    A_EQUAL_B)
                 ...)))))
This takes wordsize steps. It can however be written in 3 lines by using a recursively-defined function, or a fixed-point combinator if recursion is not allowed.
Then we just turn that into an even larger function: xMin<x and x<xMax and yMin<y and y<yMax
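A small sketch of the replicate-and-select building block described above (my code, not the answerer's). The loop merely repeats the shift-and-or smearing step a fixed number of times; it too can be unrolled.
#include <cstdint>
#include <cstdio>

// Replicate bit k of q across the whole word: all ones if the bit is set, all zeros otherwise.
uint64_t replicate(uint64_t q, int k) {
    uint64_t m = (q << (63 - k)) & 0x8000000000000000ULL;  // move bit k to the top
    for (int s = 1; s < 64; s <<= 1)
        m |= m >> s;                                       // smear it downwards
    return m;
}

// if(q,k,a,b): a if bit k of q is set, else b, with no branches.
uint64_t bitIf(uint64_t q, int k, uint64_t a, uint64_t b) {
    uint64_t mask = replicate(q, k);
    return (mask & a) | (~mask & b);
}

int main() {
    std::printf("%llu %llu\n",
                (unsigned long long)bitIf(4, 2, 11, 22),   // bit 2 of 4 is set   -> 11
                (unsigned long long)bitIf(4, 1, 11, 22));  // bit 1 of 4 is clear -> 22
}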

Lists Hash function

I'm trying to make a hash function so I can tell if two lists with the same size contain the same elements.
For example, this is what I want:
f((1 2 3)) = f((1 3 2)) = f((2 1 3)) = f((2 3 1)) = f((3 1 2)) = f((3 2 1)).
Any idea how I can approach this problem? I've tried the sum of squares of all elements, but it turned out that there are collisions; for example f((2 2 5)) = 33 = f((1 4 4)), which is wrong, as the lists are not the same.
I'm looking for a simple approach if there is any.
Sort the list and then:
list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end
You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.
Possibly you have N, k such that you could still make this work. Good luck.
If you allow occasional collisions, you have lots of options, from simple things like your suggestion (add the squares of the elements) or xor-ing the elements, to complicated things like sorting them, printing them to a string, and computing MD5 on that. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).
So you are looking for something that provides these properties:
1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.
Knuth's Multiplicative Method
In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write a hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.
hash(i) = i * 2654435761 mod 2^32
Since 2654435761 and 2^32 have no common factors, the multiplication produces a complete mapping of the key to hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.
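As a minimal illustration (my sketch, not from the quoted article), the mod 2^32 happens for free when the multiplication is done in a 32-bit unsigned type, because unsigned arithmetic wraps:
#include <cstdint>
#include <cstdio>

// Knuth's multiplicative hash: uint32_t arithmetic wraps, giving the mod 2^32 implicitly.
uint32_t knuthHash(uint32_t key) {
    return key * 2654435761u;
}

int main() {
    std::printf("%u\n", knuthHash(42));
}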
Robert Jenkins' 96 bit Mix Function
Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.
All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.
The following source is the mixing part of the hash function.
int mix(int a, int b, int c)
{
    a=a-b;  a=a-c;  a=a^(c >>> 13);
    b=b-c;  b=b-a;  b=b^(a << 8);
    c=c-a;  c=c-b;  c=c^(b >>> 13);
    a=a-b;  a=a-c;  a=a^(c >>> 12);
    b=b-c;  b=b-a;  b=b^(a << 16);
    c=c-a;  c=c-b;  c=c^(b >>> 5);
    a=a-b;  a=a-c;  a=a^(c >>> 3);
    b=b-c;  b=b-a;  b=b^(a << 10);
    c=c-a;  c=c-b;  c=c^(b >>> 15);
    return c;
}
You can read details from here
If all the elements are numbers and they have a maximum, this is not too complicated: you sort the elements and then put them one after the other, treating them as digits in base maximum+1.
It's hard to describe in words...
For example, if your maximum is 9 (that makes it easy to understand), you'd have :
f(2 3 9 8) = f(3 8 9 2) = 2389
If your maximum was 99, you'd have:
f(16 2 76 8) = (0)2081676
In your example with 2,2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be :
f(2 2 5) = 2*6^2 + 2*6 + 5 = 89
f(1 4 4) = 1*6^2 + 4*6 + 4 = 64
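A small sketch of that idea (my code, not the answerer's): sort, then treat the elements as digits in base maximum+1. It stays collision-free as long as the composed value fits in the integer type.
#include <algorithm>
#include <cstdint>
#include <vector>

uint64_t baseCompose(std::vector<uint64_t> values, uint64_t maximum) {
    std::sort(values.begin(), values.end());
    uint64_t result = 0;
    for (uint64_t v : values)
        result = result * (maximum + 1) + v;   // append the next "digit"
    return result;                             // e.g. {2,2,5} with maximum 5 -> 89
}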
Combining hash values is hard. I've found this approach (no explanation given, though perhaps someone will recognize it) within Boost:
template <class T>
void hash_combine(size_t& seed, T const& v)
{
    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
It should be fast since there is only shifting, additions and xor taking place (apart from the actual hashing).
However, the requirement that the order of the list does not influence the end result means that you first have to sort it, which is an O(N log N) operation, so it may not fit.
Also, since it's impossible without more stringent boundaries to provide a collision-free hash function, you'll still have to actually compare the sorted lists whenever the hashes are equal...
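A usage sketch of that combination (my code; std::hash stands in for Boost's hash_value): sort the list first, then fold the elements into the seed with hash_combine.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

void hash_combine(std::size_t& seed, int v) {
    seed ^= std::hash<int>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

std::size_t hashList(std::vector<int> list) {
    std::sort(list.begin(), list.end());   // make the result order-insensitive
    std::size_t seed = 0;
    for (int v : list)
        hash_combine(seed, v);
    return seed;
}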
I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
[...] but it turned out that there are collisions
These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the general case it cannot, because there are more potential inputs than potential outputs.
Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.
As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as the exponents of successive primes in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is your hash value.
For example:
2 1 3
sorted becomes
1 2 3
Treating these as the exponents of successive primes gives
2^1 · 3^2 · 5^3
which works out to
2 · 9 · 125 = 2250
giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.
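A sketch of that scheme (my code, not the answerer's; the modulus and the fixed prime table are assumptions for illustration):
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

const uint64_t kMaxHash = 1000000007ULL;  // assumed maximum hash value (a prime)

// Modular exponentiation: base^exp mod m.
uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

// Order-insensitive hash: sorted values become exponents of successive primes.
uint64_t primePowerHash(std::vector<uint64_t> values) {
    static const uint64_t primes[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};
    std::sort(values.begin(), values.end());
    uint64_t h = 1;
    for (std::size_t i = 0; i < values.size() && i < 10; ++i)
        h = h * powmod(primes[i], values[i], kMaxHash) % kMaxHash;
    return h;   // {2,1,3} -> 2^1 * 3^2 * 5^3 = 2250
}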
A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). This is more effective than making a hash function since a perfect hash seems essential to your problem. For almost any other approach collisions are inevitable depending on input.

Number base conversion as a stream operation

Is there a way, in constant working space, to do arbitrary-size and arbitrary-base conversions? That is, to convert a sequence of n numbers in the range [1,m] to a sequence of ceiling(n*log(m)/log(p)) numbers in the range [1,p] using a 1-to-1 mapping that (preferably but not necessarily) preserves lexicographical order and gives sequential results?
I'm particularly interested in solutions that are viable as a pipe function, i.e. able to handle larger datasets than can be stored in RAM.
I have found a number of solutions that require "working space" proportional to the size of the input but none yet that can get away with constant "working space".
Does dropping the sequential constraint make any difference? That is: allow lexicographically sequential inputs to result in non lexicographically sequential outputs:
F(1,2,6,4,3,7,8) -> (5,6,3,2,1,3,5,2,4,3)
F(1,2,6,4,3,7,9) -> (5,6,3,2,1,3,5,2,4,5)
Some thoughts: might this work?
streamBase_n -> convert(n, lcm(n,p)) -> convert(lcm(n,p), p) -> streamBase_p
(where lcm is the least common multiple)
I don't think it's possible in the general case. If m is a power of p (or vice versa), or if they're both powers of a common base, you can do it, since each group of log_m(p) digits is then independent. However, in the general case, suppose you're converting the number a_1 a_2 a_3 ... a_n. The equivalent number in base p is
sum(a_i * m^(i-1) for i in 1..n)
If we've processed the first i digits, then we have the ith partial sum. To compute the (i+1)th partial sum, we need to add a_(i+1) * m^i. In the general case, this number is going to have non-zero digits in most places, so we'll need to modify all of the digits we've processed so far. In other words, we'll have to process all of the input digits before we'll know what the final output digits will be.
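As a concrete illustration (my example, not from the original answer): converting from base 10 to base 7, the partial sum 99 is 201 in base 7; the next digit contributes 9·10² = 900 = 2424 in base 7, and adding it turns 201 into 2625, changing every base-7 digit computed so far.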
In the special case where m and p are both powers of a common base, or equivalently if log_m(p) is a rational number, then m^i will only have a few non-zero digits in base p near the front, so we can safely output most of the digits we've computed so far.
I think there is a way of doing radix conversion in a stream-oriented fashion in lexicographic order. However, what I've come up with isn't sufficient for actually doing it, and it has a couple of assumptions:
The lengths of the positional numbers are already known.
The numbers described are integers. I've not considered what happens with the maths and negative indices.
We have a sequence of values a of length p, where each value is in the range [0,m-1]. We want a sequence of values b of length q in the range [0,n-1]. We can work out the kth digit of our output sequence b from a as follows:
b_k = floor[ sum(a_i * m^i for i in 0 to p-1) / n^k ] mod n
Let's rearrange that sum into two parts, splitting it at an arbitrary point z:
b_k = floor[ ( sum(a_i * m^i for i in z to p-1) + sum(a_i * m^i for i in 0 to z-1) ) / n^k ] mod n
Suppose that we don't yet know the values of a between [0,z-1] and so can't compute the second sum term. We're left with having to deal with ranges. But that still gives us information about b_k.
The minimum value b_k can be is:
b_k >= floor[ sum(a_i * m^i for i in z to p-1) / n^k ] mod n
and the maximum value b_k can be is:
b_k <= floor[ ( sum(a_i * m^i for i in z to p-1) + m^z - 1 ) / n^k ] mod n
We should be able to do a process like this:
1. Initialise z to be p. We will count down from p as we receive each character of a.
2. Initialise k to the index of the most significant value in b. If my brain is still working, ceil[ log_n(m^p) ].
3. Read a value of a. Decrement z.
4. Compute the min and max value for b_k.
5. If the min and max are the same, output b_k, and decrement k. Goto 4. (It may be possible that we already have enough values for several consecutive values of b_k.)
6. If z != 0 then we expect more values of a. Goto 3.
Hopefully, at this point we're done.
I've not considered how to efficiently compute the range values as yet, but I'm reasonably confident that computing the sum from the incoming characters of a can be done much more reasonably than storing all of a. Without doing the maths, though, I won't make any hard claims about it!
Yes, it is possible.
For every I character(s) you read in, you will write out O character(s),
based on Ceiling(Length * log(In) / log(Out)).
Allocate enough space
Set x to 1
Loop over digits from end to beginning  # Horner's method
    Set a to x * digit
    Set t to O - 1
    Loop while a > 0 and t >= 0
        Set a to a + out digit at position t
        Set out digit at position t to a mod to-base
        Set a to a / to-base
        Decrement t
    Set x to x * from-base
Return converted digit(s)
Thus, for base 16 to 2 (which is easy), using "192FE" we read '1' and convert it, then repeat on '9', then '2' and so on giving us '0001', '1001', '0010', '1111', and '1110'.
Note that for bases that are not common powers, such as base 17 to base 2, this would mean reading 1 character and writing 5.
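A small sketch of the easy streaming case from the base 16 to base 2 example (my code, not the answerer's): when the input base is a power of the output base, each input digit maps to a fixed-width group of output digits, so the conversion needs only constant working space.
#include <cstdio>

// Stream a hexadecimal string out as binary, one digit at a time.
void hexToBinaryStream(const char* hex) {
    for (const char* p = hex; *p; ++p) {
        int v = (*p >= '0' && *p <= '9') ? *p - '0' : (*p - 'A' + 10);
        for (int bit = 3; bit >= 0; --bit)
            std::putchar(((v >> bit) & 1) ? '1' : '0');
        std::putchar(' ');
    }
    std::putchar('\n');
}

int main() {
    hexToBinaryStream("192FE");   // prints 0001 1001 0010 1111 1110
}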

Resources