Compress two or more numbers into one byte - algorithm

I think this is not really possible, but worth asking anyway. Say I have two small numbers (each ranges from 0 to 11). Is there a way to compress them into one byte and get them back later? How about with four numbers of similar size?
What I need is something like: a1 + a2 = x. I only know x and from that get a1, a2
For the second part: a1 + a2 + a3 + a4 = x. I only know x and from that get a1, a2, a3, a4
Note: I know you cannot unadd, just illustrating my question.
x must be one byte. a1, a2, a3, a4 range [0, 11].

That's trivial with bit masks. The idea is to divide the byte into smaller units and dedicate them to different elements.
For 2 numbers, it can be like this: the first 4 bits are number1, the rest are number2. You would use number1 = (x & 0b11110000) >> 4 and number2 = x & 0b00001111 to retrieve the values, and x = (number1 << 4) | number2 to compress them.
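A quick sketch of that mask-and-shift idea, assuming Python; it works for any two values in the 0-15 range, so 0-11 fits comfortably:

def pack(number1, number2):
    # number1 goes in the high nibble, number2 in the low nibble
    return (number1 << 4) | number2

def unpack(x):
    return (x & 0b11110000) >> 4, x & 0b00001111

x = pack(11, 7)
print(x, unpack(x))   # 183 (11, 7)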

For two numbers, sure. Each one has 12 possible values, so the pair has a total of 12^2 = 144 possible values, and that's less than the 256 possible values of a byte. So you could do e.g.
x = 12*a1 + a2
a1 = x / 12
a2 = x % 12
(If you only have signed bytes, e.g. in Java, it's a little trickier)
For four numbers from 0 to 11, there are 12^4 = 20736 values, so you couldn't fit them in one byte, but you could do it with two.
x = 12^3*a1 + 12^2*a2 + 12*a3 + a4
a1 = x / 12^3
a2 = (x / 12^2) % 12
a3 = (x / 12) % 12
a4 = x % 12
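A short sketch of this arithmetic approach, assuming Python; pack2 fits in one byte, while pack4 needs two (its maximum is 20735):

def pack2(a1, a2):                  # both in 0..11
    return 12 * a1 + a2

def unpack2(x):
    return x // 12, x % 12

def pack4(a1, a2, a3, a4):          # all in 0..11
    return 12**3 * a1 + 12**2 * a2 + 12 * a3 + a4

def unpack4(x):
    return x // 12**3, (x // 12**2) % 12, (x // 12) % 12, x % 12

print(unpack2(pack2(5, 11)))        # (5, 11)
print(unpack4(pack4(1, 2, 3, 4)))   # (1, 2, 3, 4)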
EDIT: the other answers talk about storing one number per four bits and using bit-shifting. That's faster.

The 0-11 example is pretty easy -- you can store each number in four bits, so putting them into a single byte is just a matter of shifting one 4 bits to the left and ORing the two together.
Four numbers of similar sizes won't fit -- four bits apiece times four gives a minimum of 16 bits to hold them.

Let's state it in general: suppose you want to mix N numbers a1, a2, ... aN, with a1 ranging over 0..k1-1, a2 over 0..k2-1, ... and aN over 0..kN-1.
Then the encoded number is:
encoded = a1 + k1*a2 + k1*k2*a3 + ... + k1*k2*...*k(N-1)*aN
Decoding is a bit trickier and works stepwise:
rest = encoded
a1 = rest mod k1
rest = rest div k1
a2 = rest mod k2
rest = rest div k2
...
a(N-1) = rest mod k(N-1)
rest = rest div k(N-1)
aN = rest # rest is already < kN
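A general mixed-radix pack/unpack following this scheme, sketched in Python (the function names are made up):

def encode(values, ranges):
    # values[i] must lie in 0 .. ranges[i]-1
    encoded, factor = 0, 1
    for a, k in zip(values, ranges):
        encoded += factor * a
        factor *= k
    return encoded

def decode(encoded, ranges):
    values, rest = [], encoded
    for k in ranges[:-1]:
        values.append(rest % k)
        rest //= k
    values.append(rest)        # rest is already < ranges[-1]
    return values

print(decode(encode([3, 11, 0, 7], [12, 12, 12, 12]), [12, 12, 12, 12]))   # [3, 11, 0, 7]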

If the numbers 0-11 aren't evenly distributed you can do even better by using shorter bit sequences for common values and longer ones for rarer values. It costs at least one bit to code which length you are using so there is a whole branch of CS devoted to proving when it's worth doing.
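For illustration, here is a minimal Huffman-style sketch in Python, assuming a made-up frequency table; frequent values come out with short bit strings, rare ones with long ones:

import heapq

def huffman_codes(freqs):
    # build a prefix code from a {symbol: frequency} table
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # prefix the codes of the two cheapest subtrees with 0 and 1
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# hypothetical distribution: 0 is very common, 10 and 11 are rare
print(huffman_codes({0: 50, 1: 20, 2: 15, 10: 3, 11: 2}))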

A byte can hold up to 256 values (0x00 to 0xFF in hex), so you can encode two numbers from 0-15 in a byte.
byte a1 = 0xf;
byte a2 = 0x9;
byte compress = a1 << 4 | (0x0F & a2); // should yield 0xf9 in one byte.
Four numbers you can do if you reduce them to the 0-3 range (two bits each).

Since a single byte is 8 bits, you can easily subdivide it, with smaller ranges of values. The extreme limit of this is when you have 8 single bit integers, which is called a bit field.
If you want to store two 4-bit integers (which gives you 0-15 for each), you simply have to do this:
value = a * 16 + b;
As long as you do proper bounds checking, you will never lose any information here.
To get the two values back, you just have to do this:
a = floor(value / 16)
b = value MOD 16
MOD is modulus, it's the "remainder" of a division.
If you want to store four 2-bit integers (0-3), you can do this:
value = a * 64 + b * 16 + c * 4 + d
And, to get them back:
a = floor(value / 64)
b = floor(value / 16) MOD 4
c = floor(value / 4) MOD 4
d = value MOD 4
I leave the last division as an exercise for the reader ;)

@Mike Caron
your last example (4 integers between 0-3) is much faster with bit-shifting. No need for floor().
value = (a << 6) | (b << 4) | (c << 2) | d;
a = (value >> 6);
b = (value >> 4) % 4;
c = (value >> 2) % 4;
d = (value) % 4;

Use bit masking or bit shifting. The latter is faster.
Test out binary trees for some fun (they will be handy later on in dev life regarding data and all sorts of dev voodoo, lol).

Packing four values into one number will require at least 15 bits. This doesn't fit in a single byte, but in two.
What you need to do is a conversion from base 12 to base 65536 and conversely.
B = A1 + 12*(A2 + 12*(A3 + 12*A4))
A1 = B % 12
A2 = (B / 12) % 12
A3 = (B / 144) % 12
A4 = B / 1728
As this takes 2 bytes anyway, conversion from base 12 to (packed) base 16 is by far preferable.
B1 = A1 + 16*A2
B2 = A3 + 16*A4
A1 = B1 % 16
A2 = B1 / 16
A3 = B2 % 16
A4 = B2 / 16
The modulos and divisions are implemented by masking and shifting.

0-9 works much more easily. You can easily store 11 random-order decimal digits in 4 1/2 bytes, which is tighter compression than log(256)/log(10), just by creative mapping. Remember, not all compression has to do with dictionaries, redundancies, or sequences.
If you are talking about random numbers 0-9, you can fit 4 digits into 14 bits, not 15.

Related

Generate number with equal probability

You are given a function, let's say bin(), which generates 0 or 1 with equal probability. Now you are given a range of contiguous integers, say [a,b] (a and b inclusive).
Write a function, say rand(), that uses bin() to generate numbers within the range [a,b] with equal probability.
The insight you need is that your bin() function returns a single binary digit, or "bit". Invoking it once gives you 0 or 1. If you invoke it twice you get two bits b0 and b1 which can be combined as b1 * 2 + b0, giving you one of 0, 1, 2 or 3 with equal probability. If you invoke it thrice you get three bits b0, b1 and b2. Put them together and you get b2 * 2^2 + b1 * 2 + b0, giving you a member of {0, 1, 2, 3, 4, 5, 6, 7} with equal probability. And so on, as many as you want.
Your range [a, b] has m = b-a+1 values. You just need enough bits to generate a number between 0 and 2^n-1, where n is the smallest value that makes 2^n greater than or equal to m. Then just scale that set to start at a and you're good.
So let's say you are given the range [20, 30]. There are 11 numbers there from 20 to 30 inclusive. 11 is greater than 8 (2^3), but less than 16 (2^4), so you'll need 4 bits. Use bin() to generate four bits b0, b1, b2, and b3. Put them together as x = b3 * 2^3 + b2 * 2^2 + b1 * 2 + b0. You'll get a result, x, between 0 and 15. If x > 10 then generate another four bits. When x <= 10, your answer is x + 20.
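A minimal sketch of that rejection approach, assuming Python and using random.getrandbits(1) to stand in for the given bin():

import random

def bin():                      # stand-in for the given generator (shadows the builtin, fine for this sketch)
    return random.getrandbits(1)

def rand(a, b):
    m = b - a + 1               # number of values in the range
    n = (m - 1).bit_length()    # smallest n with 2**n >= m
    while True:
        x = 0
        for i in range(n):      # assemble n random bits into x
            x |= bin() << i
        if x < m:               # reject out-of-range values and retry
            return a + x

print(rand(20, 30))             # some value in [20, 30], all equally likely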
Help, but no code:
You can easily shift the range [0, 2**n - 1] to [a, a + 2**n - 1]
You can easily produce an equal probability from [0, 2**n - 1]
If the number of values you need isn't a power of 2, just generate a number up to 2**n and re-roll if it exceeds the largest number you need
Subtract the numbers to work out your range:
Decimal: 20 - 10 = 10
Binary : 10100 - 01010 = 1010
Work out how many bits you need to represent this: 4.
For each of these, generate a random 1 or 0:
num_bits = 4
rand[num_bits]
for (x = 0; x < num_bits; ++x)
    rand[x] = bin()
Let's say rand[] = [0,1,0,0] after this. Add this number back to the start of your range.
Binary: 1010 + 0100 = 1110
Decimal: 10 + 4 = 14
You can always change the range [a,b] to [0, b-a]; denote X = b - a and add a back to the result at the end. Then you can define a function rand(X) as follows:
function int rand(X){
    int i = 1;
    // determine how many bits you need (see the answer above for why)
    while (2^i <= X) {
        i++;
    }
    // generate the random numbers
    Boolean cont = true;
    int num = 0;
    while (cont == true) {
        num = 0;
        for (j = 0 to i-1) {
            // this generates num in range [0, 2^i - 1] with equal probability,
            // but we need to discard it if num is larger than X
            num = num + bin() * 2^j;
        }
        if (num <= X) { cont = false; }
    }
    return num;
}

Algorithm for sequence calculation

I'm looking for a hint to an algorithm or pseudo code which helps me calculate sequences.
It's kind of like permutations, but not exactly, as it's not fixed length.
The output sequence should look something like this:
A
B
C
D
AA
BA
CA
DA
AB
BB
CB
DB
AC
BC
CC
DC
AD
BD
CD
DD
AAA
BAA
CAA
DAA
...
Every character above actually represents an integer, which gets incremented from a minimum to a maximum.
I do not know the depth when I start, so just using multiple nested for loops won't work.
It's late here in Germany and I just can't wrap my head around this. Pretty sure that it can be done with for loops and recursion, but I have currently no clue on how to get started.
Any ideas?
EDIT: B-typo corrected.
It looks like you're taking all combinations of four distinct digits of length 1, 2, 3, etc., allowing repeats.
So start with length 1: { A, B, C, D }
To get length 2, prepend A, B, C, D in turn to every member of length 1. (16 elements)
To get length 3, prepend A, B, C, D in turn to every member of length 2. (64 elements)
To get length 4, prepend A, B, C, D in turn to every member of length 3. (256 elements)
And so on.
If you have more or fewer digits, the same method will work. It gets a little trickier if you allow, say, A to equal B, but that doesn't look like what you're doing now.
Based on the comments from the OP, here's a way to do the sequence without storing the list.
Use an odometer analogy. This only requires keeping track of indices. Each time the first member of the sequence cycles around, increment the one to the right. If this is the first time that that member of the sequence has cycled around, then add a member to the sequence.
The increments will need to be cascaded. This is the equivalent of going from 99,999 to 100,000 miles (the comma is the thousands marker).
If you have a thousand integers that you need to cycle through, then pretend you're looking at an odometer in base 1000 rather than base 10 as above.
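A sketch of that odometer idea, assuming Python (the generator name and symbol set are just for illustration); the leftmost position cycles fastest, and a new position is appended on the first full roll-over:

def sequence(symbols, max_len):
    digits = [0]                              # indices into symbols, leftmost varies fastest
    while len(digits) <= max_len:
        yield "".join(symbols[d] for d in digits)
        i = 0
        while i < len(digits) and digits[i] == len(symbols) - 1:
            digits[i] = 0                     # this wheel rolled over
            i += 1
        if i == len(digits):
            digits.append(0)                  # first full roll-over: grow the odometer
        else:
            digits[i] += 1

for s in sequence("ABCD", 3):
    print(s)   # A, B, C, D, AA, BA, CA, DA, AB, ..., DDD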
Your sequence looks more like A^(n-1) × A^T, where A is a matrix and A^T is its transpose.
A = [A, B, C, D]
A^T × A^(n-1) for n = 1:
sequence = A, B, C, D
A^T × A^(n-1) for n = 2:
sequence = AA, BA, CA, DA, AB, BB, CB, DB, AC, BC, CC, DC, AD, BD, CD, DD
You can go for any matrix multiplication code like this and implement what you wish.
You have 4 elements, you are simply looping the numbers in a reversed base 4 notation. Say A=0,B=1,C=2,D=3 :
first loop from 0 to 3 on 1 digit
second loop from 00 to 33 on 2 digits
and so on
i | reversed i | output using A,B,C,D digits
loop on 1 digit
0 0 A
1 1 B
2 2 C
3 3 D
loop on 2 digits
00 00 AA
01 10 BA
02 20 CA
03 30 DA
10 01 AB
11 11 BB
12 21 CB
13 31 DB
20 02 AC
21 12 BC
22 22 CC
...
The algorithm is pretty obvious. You could also take a look at Algorithm L (lexicographic t-combination generation) in Fascicle 3a of TAOCP by D. Knuth.
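A short sketch of that reversed-base counting, assuming Python: count i from 0 to 4^length - 1 and read off its base-4 digits least-significant first:

def reversed_base(symbols, length):
    base = len(symbols)
    for i in range(base ** length):
        s, n = "", i
        for _ in range(length):       # emit digits least-significant first
            s += symbols[n % base]
            n //= base
        yield s

for length in (1, 2, 3):
    for word in reversed_base("ABCD", length):
        print(word)    # A..D, then AA, BA, CA, DA, AB, ..., then AAA, BAA, ...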
How about:
Private Sub DoIt(minVal As Integer, maxVal As Integer, maxDepth As Integer)
    If maxVal < minVal OrElse maxDepth <= 0 Then
        Debug.WriteLine("no results!")
        Return
    End If
    Debug.WriteLine("results:")
    Dim resultList As New List(Of Integer)(maxDepth)
    ' initialize with the 1st result: this makes processing the remainder easy to write.
    resultList.Add(minVal)
    Dim depthIndex As Integer = 0
    Debug.WriteLine(CStr(minVal))
    Do
        ' find the term to be increased
        Dim indexOfTermToIncrease As Integer = 0
        While resultList(indexOfTermToIncrease) = maxVal
            resultList(indexOfTermToIncrease) = minVal
            indexOfTermToIncrease += 1
            If indexOfTermToIncrease > depthIndex Then
                depthIndex += 1
                If depthIndex = maxDepth Then
                    Return
                End If
                resultList.Add(minVal - 1)
                Exit While
            End If
        End While
        ' increase the term that was identified
        resultList(indexOfTermToIncrease) += 1
        ' output
        For d As Integer = 0 To depthIndex
            Debug.Write(CStr(resultList(d)) + " ")
        Next
        Debug.WriteLine("")
    Loop
End Sub
Would that be adequate? It doesn't take much memory and is relatively fast (apart from the writing to output...).

Xnary (like binary but different) counting

I'm making a function that converts a number into a string with predefined characters. Original, I know. I started it because it seemed like fun at the time, something to do on my own. Well, it's frustrating and not fun.
I want it to be like binary in that any left character is worth more than its right neighbour. Binary is inefficient because every bit has only 1 positive value. Xnary is efficient, because a 'bit' is never 0.
The character set (in this case): A - Z.
A = 1 ..
Z = 26
AA = 27 ..
AZ = 52
BA = 53 ..
BZ = 2 * 26 (B) + 26 * 1 (Z) = 78... Right?
ZZ = 26 * 26 (Z) + 26 * 1 (Z) = 702?? Right??
I found this here, but there AA is the same as A and AAA. The result of the function is never AA or AAA.
The string A is different from AA and AAA however, so the number should be too. (Unlike binary 1, 01, 001 etc.) And since a longer string is always more valuable than a shorter... A < AA < AAA.
Does this make sense? I've tried to explain it before and have failed. I've also tried to make it before. =)
The most important thing: since A < AA < AAA, the value of 'my' ABC is higher than the value of the other script. Another difference: my script doesn't exist, because I keep failing.
I've tried with this algorithm:
N = 1000, Size = 3 (because log_26(1000) = 2.x), so use 676, 26 and 1 for the positions:
N = 1000
P0 = 1000 / 676 = 1.x = 1 = A
N = 1000 - 1 * 676 = 324
P1 = 324 / 26 = 12.x = 12 = L
N = 324 - 12 * 26 = 12
P1 = 12 / 1 = 12 = L
1000 => ALL
Sounds fair? Apparently it's crap. Because:
N = 158760, Size = 4, so use 17576, 676, 26 and 1
P0 = 158760 / 17576 = 9.x = 9 = I
N = 158760 - 9 * 17576 = 576
P1 = 576 / 676 = 0.x = 0 <<< OOPS
If 1 is A (the very first of the xnary), what's 0? Impossible is what it is.
So this one is a bust. The other one (on jsFiddle) is also a bust, because A != AA != AAA and that's a fact.
So what have I been missing for a few long nights?
Oh BTW: if you don't like numbers, don't read this.
PS. I've tried searching for similar questions but none are similar enough. The one referenced is most similar, but 'faulty' IMO.
Also known as Excel column numbering. It's easier if we shift by one, A = 0, ..., Z = 25, AA = 26, ..., at least for the calculations. For your scheme, all that's then needed is a subtraction of 1 before converting to Xnary, and correspondingly an addition of 1 after converting back.
So, with that modification, let's start finding the conversion. First, how many symbols do we need to encode n? Well, there are 26 one-digit numbers, 26^2 two-digit numbers, 26^3 three-digit numbers etc. So the total of numbers using at most d digits is 26^1 + 26^2 + ... + 26^d. That is the start of a geometric series, we know a closed form for the sum, 26*(26^d - 1)/(26-1). So to encode n, we need d digits if
26*(26^(d-1)-1)/25 <= n < 26*(26^d-1)/25 // remember, A = 0 takes one 'digit'
or
26^(d-1) <= (25*n)/26 + 1 < 26^d
That is, we need d(n) = floor(log_26(25*n/26+1)) + 1 digits to encode n >= 0. Now we must subtract the total of numbers needing at most d(n) - 1 digits to find the position of n in the d(n)-digit numbers, let's call it p(n) = n - 26*(26^(d(n)-1)-1)/25. And the encoding of n is then simply a d(n)-digit base-26 encoding of p(n).
The conversion in the other direction is then a base-26 expansion followed by an addition of 26*(26^(d-1) - 1)/25.
So for N = 1000, we encode n = 999, log_26(25*999/26+1) = log_26(961.5769...) = 2.x, we need 3 digits.
p(999) = 999 - 702 = 297
297 = 0*26^2 + 11*26 + 11
999 = ALL
For N = 158760, n = 158759 and log_26(25*158759/26+1) = 3.66..., we need four digits
p(158759) = 158759 - 18278 = 140481
140481 = 7*26^3 + 25*26^2 + 21*26 + 3
158759 = H Z V D
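A sketch of this scheme, assuming Python (the function names are made up); it reproduces the two worked examples above:

def encode(N):
    n = N - 1                               # shift so A = 0, Z = 25, AA = 26, ...
    d = 1
    while 26 * (26**d - 1) // 25 <= n:      # number of digits needed
        d += 1
    p = n - 26 * (26**(d - 1) - 1) // 25    # position among the d-digit numbers
    out = ""
    for _ in range(d):                      # plain base-26 expansion of p, d digits
        out = chr(ord('A') + p % 26) + out
        p //= 26
    return out

def decode(s):
    p = 0
    for c in s:
        p = 26 * p + ord(c) - ord('A')
    return p + 26 * (26**(len(s) - 1) - 1) // 25 + 1

print(encode(1000), decode("ALL"))          # ALL 1000
print(encode(158760), decode("HZVD"))       # HZVD 158760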
This appears to be a very standard "implement conversion from base 10 to base N" where N happens to be 26, and you're using letters to represent all digits.
If you have A-Z as a 26-ary value, you can represent 0 through (26 - 1), like binary can represent 0 through (2 - 1).
BZ = 1 * 26 + 25 * 1 = 51
The analogue would be:
19 = 1 * 10 + 9 * 1 (1/B being the first non-zero character, and 9/Z being the largest digit possible).
You basically have the right idea, but you need to shift it so A = 0, not A = 1. Then everything should work relatively sanely.
In the lengthy answer by @Daniel I see a call to log() which is a red flag for performance. Here is a simple way without much complex math:
function excelize(colNum) {
    // count how many symbols ("digits") the result needs (order) and how many
    // numbers are used up by all shorter strings, plus one (sub = 1 + 26 + 676 + ...)
    var order = 1, sub = 1;
    while (colNum >= sub + 26**order) {
        sub += 26**order;
        order++;
    }
    var symbols = "0123456789abcdefghijklmnopqrstuvwxyz";
    var tr = c => symbols[symbols.indexOf(c)+10];
    return Number(colNum - sub).toString(26).padStart(order, '0').split('').map(c => tr(c)).join('');
}
Explanation:
Since this is not plain base 26, for each additional symbol ("digit") we have to subtract the count of all shorter strings. So first we find the order of the resulting number and accumulate that subtrahend at the same time; then we subtract it, convert the rest to base 26, and shift the symbols to A-Z instead of 0-P.

Create smallest unique number from two bytes

As we all know, a byte's range is 0 to 255.
I have two bytes and I want to create another number that combines these two and is reversible, but with minimum length. The simplest solution is
a = b1 * 1000 + b2
the reverse would be
b1 = a / 1000
b2 = a % 1000
The length of the above solution varies from 1 to 6 digits; I want a formula with a FIXED and minimum length.
Encode:
x = b1 * 256 + b2;
x = x + 10000;
Decode:
x = x - 10000;
b1 = x >> 8;
b2 = x & 255;
Encoded result always has length 5 (10000 through 75535 inclusive). And since there are 65536 different pairs (b1, b2) you can't encode them into numbers of length < 5 (because there are at most 10000 such numbers).
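A tiny sketch of this encoding in Python, just to make the round trip concrete:

def encode(b1, b2):                     # b1, b2 in 0..255
    return b1 * 256 + b2 + 10000        # always 5 decimal digits: 10000..75535

def decode(a):
    a -= 10000
    return a >> 8, a & 255

print(encode(0, 0), encode(255, 255))   # 10000 75535
print(decode(encode(200, 17)))          # (200, 17)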
If you mean a fixed-length number when expressed in decimal, you can use:
a = 100,000 + b1 * 256 + b2
That will give you a number from 100,000 to 165,535 inclusive.
To reverse the operation:
b1 = (a - 100,000) / 256
b2 = (a - 100,000) % 256
Well the simplest general purpose solution here would be
a = b1 * 256 + b2;
aka
a = (b1 << 8) | b2;
Then to get them back (assuming you've got unsigned bytes available):
b1 = (a >> 8) & 0xff;
b2 = a & 0xff;
That will produce a 2-byte value for all inputs, unless you deem the results of b1=0, b2=* to be one byte (as every such value is less than 256). You could potentially interleave things, such that the low four bits of each input byte end up in the low eight bits of the output, so that you'd end up with a value of less than 256 for (b1 < 16, b2 < 16). Frankly it's easiest to always just consider it a 2-byte value, using the above shifting.
Using 2 bytes is clearly a minimum, and it's easily achievable.
If that's not what you're looking for, please give more information.
EDIT: I've been assuming you're looking for a fixed binary length. If you want a fixed decimal length, use falagar's solution.

Fast modulo 3 or division algorithm?

Is there a fast algorithm for n % 3, similar to the bit-mask trick that works for powers of 2?
Perhaps something that uses the fact that if the sum of the digits is divisible by three, then the number is also divisible by three.
This leads to the next question: what is a fast way to add the digits of a number? E.g. 37 -> 3 + 7 -> 10.
I am looking for something that does not have conditionals, as those tend to inhibit vectorization.
Thanks
4 % 3 == 1, so (4^k * a + b) % 3 == (a + b) % 3. You can use this fact to evaluate x%3 for a 32-bit x:
x = (x >> 16) + (x & 0xffff);
x = (x >> 10) + (x & 0x3ff);
x = (x >> 6) + (x & 0x3f);
x = (x >> 4) + (x & 0xf);
x = (x >> 2) + (x & 0x3);
x = (x >> 2) + (x & 0x3);
x = (x >> 2) + (x & 0x3);
if (x == 3) x = 0;
(Untested - you might need a few more reductions.) Is this faster than your hardware can do x%3? If it is, it probably isn't by much.
This comp.compilers item has a specific recommendation for computing modulo 3.
An alternative, especially if the maximum size of the dividend is modest, is to multiply by the reciprocal of 3 as a fixed-point value, with enough bits of precision to handle the maximum size dividend, to compute the quotient, and then subtract 3*quotient from the dividend to get the remainder. All of these multiplies can be implemented with a fixed sequence of shifts-and-adds. The number of instructions will depend on the bit pattern of the reciprocal. This works pretty well when the dividend max is modest in size.
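A small sketch of that fixed-point reciprocal idea, assuming Python and 16-bit dividends; 0xAAAB is ceil(2^17 / 3), i.e. 1/3 in 17-bit fixed point:

def divmod3(x):                     # valid for 0 <= x <= 65535
    q = (x * 0xAAAB) >> 17          # quotient: multiply by the reciprocal, drop the fraction
    return q, x - 3 * q             # remainder: subtract 3*quotient from the dividend

for x in (0, 1, 2, 3, 100, 65535):
    assert divmod3(x) == divmod(x, 3)
print(divmod3(65535))               # (21845, 0)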
Regarding adding digits in the number... if you want to add the decimal digits, you're going to end up doing what amounts to a number conversion to decimal, which involves a divide by 10 somewhere. If you're willing to settle for adding up the digits in base 2, you can do it with an easy shift-right-and-add loop. Various clever tricks can be used to do this in chunks of N bits to speed it up further.
Not sure for your first question, but for your second, you can take advantage of the % operator and integer division:
int num = 12345;
int sum = 0;
while (num) {
sum += num % 10;
num /= 10;
}
This works because 12345 % 10 = 5 and 12345 / 10 = 1234, and you keep going until num == 0.
If you are happy with 1 byte integer division, here's a trick. You could extend it to 2 bytes, 4 bytes, etc.
Division by 3 is essentially multiplication by 0.3333. If you want to simulate that floating-point arithmetic with integers, you need the closest approximation at the 256 (decimal) boundary. This is 85, because 85 / 256 = 0.332. So if you multiply your value by 85, you should get a value close to the result in the high 8 bits.
Multiplying a value by 85 quickly is easy: n * 85 = n * 64 + n * 16 + n * 4 + n. Now all these factors are powers of 2, so you can calculate n * 4 by shifting, then use this value to calculate n * 16, etc. So you have at most 5 shifts and 4 additions.
As said, this will give you an approximation. To know how good it is, you'll need to check the lower byte of the next value using this rule:
n ... is the 8-bit number you want to divide
approx = HI(n*85)
if LO(n*85)>LO((n+1)*85)THEN approx++
And that should do the trick.
Example 1:
3 / 3 =?
3 * 85 = 00000000 11111111 (approx=0)
4 * 85 = 00000001 01010100 (LO(3*85)>LO(4*85)=>approx=1)
result approx=1
Example 2:
254 / 3
254 * 85 = 01010100 01010110 (approx=84)
255 * 85 = 01010100 10101011 (LO(254*85)<LO(255*85), don't increase)
result approx=84
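A sketch of this trick in Python, assuming 8-bit inputs; the assert just checks it against ordinary integer division:

def div3(n):                            # n in 0..255
    approx = (n * 85) >> 8              # high byte of n*85 is roughly n/3
    # correction rule from above: compare the low bytes of n*85 and (n+1)*85
    if ((n * 85) & 0xFF) > (((n + 1) * 85) & 0xFF):
        approx += 1
    return approx

for n in range(256):
    assert div3(n) == n // 3
print(div3(3), div3(254))               # 1 84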
If you're dealing with big integers, one very fast method is to use the fact that this works for all bases that are 10 +/- a multiple of 3, i.e.
4, 7, 10, 13, 16, 19, 22, ... etc.
All you have to do is count the digits, then % 3. Something like:
** note: x ^ y here is power, not bitwise XOR;
x ** y is the Python equivalent
function mod3(__,_) {
#
# can handle bases
# { 4, 7,10,13,16,19,
# 22,25,28,31,34 } w/o conversion
#
# assuming base digits :
#
# 0-9A-X for any base,
# or 0-9a-f for base-16
return \
(length(__)<=+((_+=++_+_)+_^_)\
&& (__~"^[0-9]+$") )\
? (substr(__,_~_,_+_*_+_)+\
substr(__,++_*_--))%+_\
:\
(substr("","",gsub(\
"[_\3-0369-=CFILORUXcf-~]+","",__))\
+ length(__) \
+ gsub("[258BbEeHKNQTW]","",__))%+_
}
This isn't the fastest method possible, but it's one of the more agile methods.
