Compression by quasi-logarithmic scale - algorithm

I need to compress a large set of (unsigned) integer values, where the goal is to keep their relative accuracy. Simply put, the difference between 1, 2 and 3 matters a lot, but the difference between 1001, 1002 and 1003 is minor.
So I need a sort of lossy transformation. The natural choice is a logarithmic scale, but the drawback is that conversion to/from it requires floating-point operations, log/exp calculations and so on.
On the other hand, I don't need a truly logarithmic scale; I just need something that resembles one.
I came up with the idea of encoding numbers in a floating-point manner. That is, I allocate N bits for each compressed number, of which some represent the mantissa and the rest the order (exponent). The sizes of the mantissa and the order depend on the needed range and accuracy.
My question is: is this a good idea? Or is there a better encoding scheme with respect to computational complexity vs. quality (i.e. similarity to a logarithmic scale)?
What I implemented, in detail:
As I said, some bits hold the mantissa and some the order. The order bits are the leading ones, so that the greater the encoded number, the greater the raw one.
A number is decoded by prepending an extra leading bit to the mantissa (the implicit bit) and left-shifting the result by the encoded order. The smallest decoded number is then 1 << M, where M is the mantissa size. If the range should start from 0 (as in my case), this constant can be subtracted.
Encoding is also simple: add 1 << M, then find the order, i.e. how far the value must be right-shifted until it fits into the mantissa plus the implicit leading bit; the rest is trivial. Finding the order is done by binary search, which amounts to just a few ifs (for example, with 4 order bits the maximum order is 15, and it is found within 4 ifs).
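For illustration, here is a minimal sketch in Go of what I mean (the names and the particular M/E split are arbitrary; range checks and overflow handling are omitted):

package main

import "fmt"

const (
	M = 4 // mantissa bits
	E = 4 // order bits, so orders 0..15
)

// encode adds the implicit bit, finds the order by shifting (the real code
// would use a handful of ifs instead), and packs order<<M | mantissa.
func encode(x uint32) uint32 {
	x += 1 << M
	order := uint32(0)
	for x>>(order+M+1) != 0 {
		order++
	}
	mantissa := (x >> order) & ((1 << M) - 1)
	return order<<M | mantissa
}

// decode re-attaches the implicit bit, shifts by the order, and subtracts
// 1<<M so the range starts at 0.
func decode(c uint32) uint32 {
	order := c >> M
	mantissa := c & ((1 << M) - 1)
	return ((1<<M)|mantissa)<<order - (1 << M)
}

func main() {
	for _, x := range []uint32{1, 2, 3, 1001, 1002, 1003} {
		fmt.Println(x, "->", encode(x), "->", decode(encode(x)))
	}
}

Small numbers survive exactly, while 1001, 1002 and 1003 all collapse to the same decoded value, which is the relative-accuracy behaviour I'm after.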
I call this a "quasi-logarithmic" scale. Absolute precision decreases as the number grows, but unlike a true logarithmic scale, where the granularity grows continuously, here it jumps by a factor of 2 after each fixed-size range.
The advantages of this encoding:
Fast encoding and decoding
No floating-point numbers, so no implicit precision loss when manipulating them, no boundary cases, and so on.
Not dependent on standard libraries, complex math functions, etc.
Encoding and decoding may be implemented with C++ templates, so that the conversion can even happen at compile time. This is convenient for defining compile-time constants in a human-readable way.

In your compression algorithm, every group of numbers that produce the same output after compression is decompressed to the lowest number in that group. If you changed that to the number in the middle of the group, the average error would be reduced.
E.g. for an 8-bit mantissa and a 5-bit exponent, all numbers in the range [0x1340, 0x1350) are translated into 0x1340 by decompress(compress(x)). If the entire range were first compressed and then decompressed, the total error would be 120. If the output were 0x1348 instead, the total error would only be 64, which reduces the error by a solid 46.7%. So simply adding half the group width, 1 << (exponent - 1), to the output will significantly reduce the error of the compression scheme.
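In terms of the Go sketch from the question, that correction would just be a variant of decode (again only a sketch; it reuses M from the code above):

// decodeMid returns the middle of each group instead of its lower bound by
// adding half of the step size at the given order.
func decodeMid(c uint32) uint32 {
	order := c >> M
	mantissa := c & ((1 << M) - 1)
	v := ((1<<M)|mantissa)<<order - (1 << M)
	if order > 0 {
		v += 1 << (order - 1) // half of the 1<<order group width; order 0 is exact
	}
	return v
}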
Apart from that I don't see much of an issue with this scheme. Just keep in mind that you'll need a specific encoding for 0. There are alternative encodings, but without knowing anything specific about the input, this one is about the best you can get.
EDIT:
While it is possible to move the correction of the result from the decompression step to the compression step, this comes at the expense of enlarging the exponent range by one. This is due to the fact that, for numbers with the MSB set, only half of them will use the corresponding exponent (the other half of that exponent's codes will be occupied by numbers with the second-most significant bit set). The upper half of the numbers with the MSB set will be placed in the next-higher order.
So, e.g., for 32-bit numbers encoded with a 15-bit mantissa, only numbers up to 0x8FFF FFFF will have order 15 (mantissa = 0x1FFF and exponent = 15). All higher values will have order 16 (mantissa = 0x?FFF and exponent = 16). While increasing the exponent by 1 doesn't seem like much in itself, in this example it already costs an additional bit for the exponent.
In addition, the decompression step for the above example can produce an integer overflow, which may be problematic under certain circumstances (e.g. C# will throw an exception if the decompression is done in checked mode). The same applies to the compression step: unless handled properly, adding 2^(order(n) - 1) to the input n can overflow, placing the number in order 0.
I would recommend keeping the correction in the decompression step (as shown above), to remove potential integer overflows as a source of problems/bugs and to keep the number of exponents that need to be encoded minimal.
EDIT2:
Another issue with this approach is the fact that half of the numbers (excluding the lowest order) wind up in a larger "group" when the correction is done on compression, thus reducing precision.

Related

Shuffle sequential numbers without a buffer

I am looking for a shuffle algorithm to shuffle a set of sequential numbers without buffering. Another way to state this is that I’m looking for a random sequence of unique numbers that have a given period.
Your typical Fisher–Yates shuffle needs to hold all of the elements it is going to shuffle, so that isn't going to work.
A Linear-Feedback Shift Register (LFSR) does what I want, but only works for periods that are powers-of-two less two. Here is an example of using a 4-bit LFSR to shuffle the numbers 1-14:
Input:   1  2  3  4  5  6  7  8  9 10 11 12 13 14
Output:  8 12 14  7  4 10  5 11  6  3  2  1  9 13
The first row is the input, and the second row the output. What’s nice is that the state is very small—just the current index. You can start at any index and get a different set of numbers (starting at 1 yields: 8, 12, 14; starting at 9: 6, 3, 2), although the sequence is always the same (5 is always followed by 11). If I want a different sequence, I can pick a different generator polynomial.
The limitations of the LFSR are that the periods are always a power of two less two (the min and max are always the same, thus unshuffled), and there are not enough generator polynomials to allow every possible random sequence.
A block cipher algorithm would work. Every key produces a uniquely shuffled set of numbers. However, all block ciphers (that I know of) have power-of-two block sizes, and usually a fixed or limited choice of block sizes. A block cipher with an arbitrary non-binary block size would be perfect, if such a thing exists.
There are a couple of projects I have that could benefit from such an algorithm. One is for small embedded micros that need to produce a shuffled sequence of numbers with a period larger than the memory they have available (think Arduino Uno needing to shuffle 1 to 100,000).
Does such an algorithm exist? If not, what things might I search for to help me develop such an algorithm? Or is this simply not possible?
Edit 2022-01-30
I have received a lot of good feedback and I need to better explain what I am searching for.
In addition to the Arduino example, where memory is an issue, there is also the shuffle of a large number of records (billions to trillions). The desire is to have a shuffle applied to these records without needing a buffer to hold the shuffle order array, or the time needed to build that array.
I do not need an algorithm that could produce every possible permutation, but a large number of permutations. Something like a typical block cipher in counter mode where each key produces a unique sequence of values.
A Linear Congruential Generator using coefficients to produce the desired sequence period will only produce a single sequence. This is the same problem for a Linear Feedback Shift Register.
Format-Preserving Encryption (FPE), such as AES FFX, shows promise and is where I am currently focusing my attention. Additional feedback welcome.
It is certainly not possible to produce an algorithm which could potentially generate every possible sequence of length N with less than N(log2 N - 1.45) bits of state, because there are N! possible sequences and each state can generate exactly one sequence. If your hypothetical Arduino application could produce every possible sequence of 100,000 numbers, it would require at least 1,516,705 bits of state, a bit more than 185 KiB, which is probably more memory than you want to devote to the problem [Note 1].
That's also a lot more memory than you would need for the shuffle buffer; that's because the PRNG driving the shuffle algorithm also doesn't have enough state to come close to being able to generate every possible sequence. It can't generate more different sequences than the number of different possible states that it has.
So you have to make some compromise :-)
One simple algorithm is to start with some parametrisable generator which can produce non-repeating sequences for a large variety of block sizes. Then you choose a block size which is at least as large as your target range but not "too much larger"; say, less than twice as large. Then you select a subrange of that block and start generating numbers. If a generated number is inside the subrange, you return its offset; if not, you throw it away and generate another number. If the generator's range is less than twice the desired range, you will throw away less than half of the generated values, and producing the next element in the sequence is amortised O(1). In theory it might take a long time to generate an individual value, but that's not very likely, and if you use a not-very-good PRNG like a linear congruential generator, you can make it very unlikely indeed by restricting the possible generator parameters.
For LCGs you have a couple of possibilities. You could use a power-of-two modulus, with an odd offset and a multiplier which is 5 mod 8 (and not too far from the square root of the block size), or you could use a prime modulus with almost arbitrary offset and multiplier. Using a prime modulus is computationally more expensive but the deficiencies of LCG are less apparent. Since you don't need to handle arbitrary primes, you can preselect a geometrically-spaced sample and compute the efficient division-by-multiplication algorithm for each one.
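As a concrete, purely illustrative sketch of the power-of-two variant in Go — the modulus, multiplier and offset below are just sample values chosen to satisfy the rules above, not recommended constants:

package main

import "fmt"

// shuffler walks a full-period LCG over a power-of-two modulus and rejects
// values outside the desired range, as described above.
type shuffler struct {
	mod   uint64 // power-of-two block size, at least the desired period
	mult  uint64 // multiplier: 5 mod 8, near sqrt(mod)
	off   uint64 // odd offset
	limit uint64 // desired period; values >= limit are rejected
	state uint64
}

func (s *shuffler) next() uint64 {
	for {
		s.state = (s.state*s.mult + s.off) & (s.mod - 1)
		if s.state < s.limit {
			return s.state
		}
	}
}

func main() {
	// "Shuffle" 0..99,999: the smallest suitable power of two is 131072,
	// so fewer than a quarter of the draws get rejected.
	s := &shuffler{mod: 1 << 17, mult: 389, off: 12345, limit: 100000}
	for i := 0; i < 10; i++ {
		fmt.Print(s.next(), " ")
	}
	fmt.Println()
}

Each value in 0..99,999 comes out exactly once per full period, and the only state carried between calls is the single state word.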
Since you're free to use any subrange of the generator's range, you have an additional potential parameter: the offset of the start of the subrange. (Or even offsets, since the subrange doesn't need to be contiguous.) You can also increase the apparent randomness by doing any bijective transformation (XOR/rotates are good, if you're using a power-of-two block size.)
Depending on your application, there are known algorithms to produce block ciphers for subword bit lengths [Note 2], which gives you another possible way to increase randomness and/or add some more bits to the generator state.
Notes
The approximation for the minimum number of states comes directly from Stirling's approximation for N!, but I computed the number of bits by using the commonly available lgamma function.
With about 30 seconds of googling, I found this paper on researchgate.net; I'm far from knowledgeable enough in crypto to offer an opinion, but it looks credible; also, there are references to other algorithms in its footnotes.

Generate random 128 bit decimal in given range in go

Let's say that we have a random number generator that can generate random 32 or 64 bit integers (like rand.Rand in the standard library)
Generating a random int64 in a given range [a,b] is fairly easy:
rand.Seed(time.Now().UnixNano())
n := rand.Int63n(b-a+1) + a // +1 so that the upper bound b is included
Is it possible to generate random 128 bit decimal (as defined in specification IEEE 754-2008) in a given range from a combination of 32 or 64 bit random integers?
It is possible, but the solution is far from trivial. For a correct solution, there are several things to consider.
For one thing, values with exponent E are 10 times more likely than values with exponent E - 1.
Other issues include subnormal numbers and ranges that straddle zero.
I am aware of the Rademacher Floating-Point Library, which tackled this problem for binary floating-point numbers, but the solution there is complicated and its author has not yet written up how his algorithm works.
EDIT (May 11):
I have now specified an algorithm for generating random "uniform" floating-point numbers—
In any range,
with full coverage, and
regardless of the digit base (such as binary or decimal).
Possible, but by no means easy. Here is a sketch of a solution that might be acceptable — writing and debugging it would probably be at least a day of concerted effort.
Let min and max be primitive.Decimal128 objects from go.mongodb.org/mongo-driver/bson. Let MAXBITS be a multiple of 32; 128 is likely to be adequate.
Get the significand (as big.Int) and exponent (as int) of min and max using the BigInt method.
Align min and max so that they have the same exponent. As far as possible, left-justify the value with the larger exponent by decreasing its exponent and adding a corresponding number of zeroes to the right side of its significand. If this would cause the absolute value of the significand to become >= 2**(MAXBITS-1), then either
(a) Right-shift the value with the smaller exponent by dropping digits from the right side of its significand and increasing its exponent, causing precision loss.
(b) Dynamically increase MAXBITS.
(c) Throw an error.
At this point both exponents will be the same, and both significands will be aligned big integers. Set aside the exponents for now, and let range (a new big.Int) be maxSignificand - minSignificand. It will be between 0 and 2**MAXBITS.
Turn range into MAXBITS/32 uint32s using the Bytes or DivMod methods, whatever is easier.
If the highest word of range is equal to math.MaxUint32 then set a flag limit to false, otherwise true.
For n from 0 to MAXBITS/32:
if limit is true, use rand.Int63n (!, not rand.Int31n or rand.Uint32) to generate a value between 0 and the nth word of range, inclusive, cast it to uint32, and store it as the nth word of the output. If the value generated is equal to the nth word of range (i.e. if we generated the maximum possible random value for this word) then let limit remain true, otherwise set it false.
If limit is false, use rand.Uint32 to generate the nth word of the output. limit remains false regardless of the generated value.
Combine the generated words into a big.Int by building a []byte and using big/Int.SetBytes or multiplication and addition, as convenient.
Add the generated value to minSignificand to obtain the significand of the result.
Use ParseDecimal128FromBigInt with the result significand and the exponent from steps 2-3 to obtain the result.
The heart of the algorithm is step 6, which generates a uniform random unsigned integer of arbitrary length 32 bits at a time. The alignment in step 2 reduces the problem from a floating-point to an integer one, and the subtraction in step 3 reduces it to an unsigned one, so that we only have to think about one bound instead of 2. The limit flag records whether we're still dealing with that bound, or whether we've already narrowed the result down to an interval that doesn't include it.
Caveats:
I haven't written this, let alone tested it. I may have gotten it quite wrong. A sanity check by someone who does more numerical computation work than me would be welcome.
Generating numbers across a large dynamic range (including crossing zero) will lose some precision and omit some possible output values with smaller exponents unless a ludicrously large MAXBITS is used; however, 128 bits should give a result at least as good as a naive algorithm implemented in terms of decimal128.
The performance is probably pretty bad.
Go has a large number package that can do arbitrary length integers: https://golang.org/pkg/math/big/
It has a pseudo random number generator https://golang.org/pkg/math/big/#Int.Rand, and the crypto package also has https://golang.org/pkg/crypto/rand/#Int
You'd want to specify the max using https://golang.org/pkg/math/big/#Int.Exp as 2^128.
Can't speak to performance, though, or whether this is compliant with the IEEE standard, but large random numbers like what you'd use for UUIDs are possible.
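A minimal sketch of that suggestion (this only draws a uniform random 128-bit integer; mapping it onto a decimal128 in an arbitrary range is the hard part discussed above):

package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

func main() {
	// upper = 2^128, built with big.Int.Exp as suggested.
	upper := new(big.Int).Exp(big.NewInt(2), big.NewInt(128), nil)
	// crypto/rand.Int returns a uniform random value in [0, upper).
	n, err := rand.Int(rand.Reader, upper)
	if err != nil {
		panic(err)
	}
	fmt.Println(n)
}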
It depends on how many values you want to generate. If it's enough to have no more than 10^34 values in a specified range, it's quite simple.
As I see the problem, a random value in the range min..max can be calculated as random(0..1)*(max-min)+min.
It looks like we only need to generate a decimal128 value in the range 0..1. That is a random integer in the range 0..10^34-1 with an exponent of -34. This value can be generated with the standard Go random package.
To multiply, add and subtract these 128-bit decimal values, the Go math/big package can be used, with normalization of the values.
This is definitely what you are looking for.

How to compress an array of random positive integers in a certain range?

I want to compress an array consisting of about 10^5 random integers in the range 0 to 2^15. The integers are unsorted and I need to compress them losslessly.
I don't care much about the amount of computation and time needed to run the algorithm, just want to have better compression ratio.
Are there any suggested algorithms for this?
Assuming you don't need to preserve the original order, instead of passing the numbers themselves, pass the counts. If they are roughly uniformly distributed, you can expect each number to be repeated 3 or 4 times. With 3 bits per number we can count up to 7, so you can make an array of 2^15 * 3 bits and store in each 3-bit slot the count of that number. To handle the extreme cases that occur more than 7 times, we can also send a list of those numbers and their counts. Then you read the 3-bit array and overwrite it with the additional info for counts higher than 7; a rough sketch follows below.
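A rough sketch of that layout in Go (the packing details are my own choice, not part of the answer):

package main

import "fmt"

const numValues = 1 << 15

// encodeCounts packs one 3-bit saturating counter per possible value and
// returns an overflow map for values that occur more than 7 times.
func encodeCounts(data []uint16) (counters []byte, overflow map[uint16]int) {
	counts := make([]int, numValues)
	for _, v := range data {
		counts[v]++
	}
	counters = make([]byte, numValues*3/8) // 12 KiB of 3-bit slots
	overflow = make(map[uint16]int)
	for v, c := range counts {
		stored := c
		if stored > 7 {
			stored = 7 // saturate; the real count goes into the overflow list
			overflow[uint16(v)] = c
		}
		for i := 0; i < 3; i++ { // write the 3-bit counter, LSB first
			if stored&(1<<i) != 0 {
				bit := v*3 + i
				counters[bit/8] |= byte(1) << (bit % 8)
			}
		}
	}
	return counters, overflow
}

func main() {
	data := []uint16{1, 1, 1, 1, 1, 1, 1, 1, 5, 7, 7} // value 1 occurs 8 times
	counters, overflow := encodeCounts(data)
	fmt.Println(len(counters), "bytes of counters, overflow:", overflow)
}

Note that 2^15 * 3 bits is 12 KiB regardless of the input, versus roughly 10^5 * 15 bits ≈ 183 KiB for the raw data, at the cost of losing the original order.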
For your exact example: just encode each number as a 15-bit unsigned int and apply bit packing. This is optimal since you have stated each integer is uniformly random in [0, 2^15), and the Shannon entropy of this distribution is 15 bits.
For a more general solution, apply Quantile Compression (https://github.com/mwlon/quantile-compression/). It takes advantage of any smooth-ish data and compresses near-optimally on shuffled data. It works by encoding each integer with a Huffman code for its coarse range in the distribution, then an exact offset within that range.
These approaches are both computationally cheap, but more compute won't get you further in this case.

How to quickly determine if two sets of checksums are equal, with the same "strength" as the individual checksums

Say you have two unordered sets of checksums, one of size N and one of size M. Depending on the algorithm to compare them, you may not even know the sizes but can compare N != M for a quick abort if you do.
The hashing function used for a checksum has some chance of collision, which as a layman I'm foolishly referring to as "strength". Is there a way to take two sets of checksums, all made from the same hashing function, and quickly compare them (so comparing element to element is right out) with the same basic chance of collision between two sets as there is between two individual checksums?
For instance, one method would be to compute a "set checksum" by XORing all of the checksums in the set. This new single hash is used for comparing with other sets' hashes, meaning storage of size is no longer necessary. Especially since it can be modified for the addition/removal of an element checksum by XORing with the set's checksum without having to recompute the whole thing. But does that reduce the "strength" of the set's checksum compared to a brute force comparison of all the original ones? Is there a way to conglomerate the checksums of a set that doesn't reduce the "strength" (as much?) but still is less complex than a straight comparison of the set elements' checksums?
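For reference, the XOR "set checksum" idea looks roughly like this in Go (types and names are mine):

package main

import "fmt"

type setChecksum uint64

// add folds an element's checksum into the set checksum; order doesn't matter.
func (s *setChecksum) add(h uint64) { *s ^= setChecksum(h) }

// remove is identical to add, because XORing the same value twice cancels it.
func (s *setChecksum) remove(h uint64) { *s ^= setChecksum(h) }

func main() {
	var a, b setChecksum
	for _, h := range []uint64{0xdead, 0xbeef, 0xcafe} {
		a.add(h)
	}
	for _, h := range []uint64{0xcafe, 0xdead, 0xbeef} { // same set, different order
		b.add(h)
	}
	fmt.Println(a == b) // true: comparison is a single integer compare
	a.remove(0xbeef)
	fmt.Println(a == b) // false once the sets differ
}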
After my initial comment, I got to thinking about the math behind it. Here's what I came up with. I'm no expert so feel free to jump in with corrections. Note: This all assumes your hash function is uniformly distributed, as it should be.
Basically, the more bits in your checksum, the lower the chance of collision. The more files, the higher.
First, let's find the odds of a collision with a single pair of checksums XOR'd together. We'll work with small numbers at first, so let's assume our checksum is 4 bits (0-15), and we'll call the bit count n.
With two sums, the total number of bits is 2n (8), so there are 2^(2n) (256) possibilities total. However, we're only interested in the collisions. To collide an XOR, you need to flip the same bits in both sums. There are only 2^n (16) ways to do that, since we're using n bits.
So, the overall probability of a collision is 16/256, which is (2^n) / (2^(2n)), or simply 1/(2^n). That means the probability of a non-collision is 1 - 1/(2^n). So, for our sample n, that means it's only 15/16 secure, or 93.75%. Of course, for bigger checksums it's better. Even for a puny n=16, you get 99.998%.
That's for a single comparison, of course. Since you're rolling them all together, you're doing f-1 such combinations, where f is the number of files. To get the total odds of avoiding a collision, you raise the per-pair non-collision odds from the first step to the power f-1.
So, for ten files with a 4-bit checksum, we get pretty terrible results:
(15/16) ^ 9 = 55.92% chance of non-collision
This rapidly gets better as we add bits, even when we increase the number of files.
For 10 files with a 8-bit checksum:
(255/256) ^ 9 = 96.54%
For 100/1000 files with 16 bits:
(65535/65536) ^ 99 = 99.85%
(65535/65536) ^ 999 = 98.49%
As you can see, we're still working with small checksums. If you're using anything >= 32 bits, my calculator runs into floating-point rounding errors when I try to do the math on it.
TL;DR:
Where n is the number of checksum bits and f is the number of files in each set:
nonCollisionChance = ( ((2^n)-1) / (2^n) ) ^ (f-1)
collisionChance = 1 - ( ((2^n)-1) / (2^n) ) ^ (f-1)
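If you want to evaluate these for realistic checksum sizes without the rounding trouble mentioned above, a log1p/expm1 formulation avoids raising a number very close to 1 to a huge power directly (a small illustrative helper, not from the answer):

package main

import (
	"fmt"
	"math"
)

// collisionChance computes 1 - (((2^n)-1) / (2^n))^(f-1) for n checksum bits
// and f files, rewritten as -expm1((f-1) * log1p(-2^-n)) for numerical stability.
func collisionChance(n, f int) float64 {
	p := math.Pow(2, -float64(n)) // chance that one extra pair collides
	return -math.Expm1(float64(f-1) * math.Log1p(-p))
}

func main() {
	fmt.Println(collisionChance(4, 10))    // ~0.44, i.e. ~56% non-collision, as in the example above
	fmt.Println(collisionChance(16, 1000)) // ~0.0151, matching the 98.49% figure
	fmt.Println(collisionChance(64, 1e9))  // ~5.4e-11: tiny even for a billion files
}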
Your method of XOR'ing a bunch of checksums together is probably just fine.

Best way to represent numbers of unbounded length?

What's the most optimal (space efficient) way to represent integers of unbounded length?
(The numbers range from zero to positive-infinity)
Some sample number inputs can be found here (each number is shown on its own line).
Is there a compression algorithm that is specialized in compressing numbers?
You've basically got two alternatives for variable-length integers:
Use 1 bit of every k as an end terminator. That's the way Google protobuf does it, for example (in their case, one bit from every byte, so there are 7 useful bits in every byte).
Output the bit-length first, and then the bits. That's how ASN.1 works, except for OIDs which are represented in form 1.
If the numbers can be really big, Option 2 is better, although it's more complicated and you have to apply it recursively, since you may have to output the length of the length, and then the length, and then the number. A common technique is to use Option 1 (bit markers) for the length field.
For smallish numbers, option 1 is better. Consider the case where most numbers would fit in 64 bits. The overhead of storing them 7 bits per byte is 1/7; with eight bytes, you'd represent 56 bits. Using even the 7/8 representation for length would also represent 56 bits in eight bytes: one length byte and seven data bytes. Any number shorter than 48 bits would benefit from the self-terminating code.
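Go's encoding/binary package happens to implement exactly the Option 1 scheme (protobuf-style varints: 7 payload bits per byte, top bit as a continuation marker), which makes the size trade-off easy to see:

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	buf := make([]byte, binary.MaxVarintLen64)
	for _, x := range []uint64{5, 300, 1 << 20, 1 << 47, 1<<63 - 1} {
		n := binary.PutUvarint(buf, x) // encode x, returns the number of bytes written
		decoded, _ := binary.Uvarint(buf[:n])
		fmt.Printf("%d -> %d byte(s), round-trips to %d\n", x, n, decoded)
	}
}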
"Truly random numbers" of unbounded length are, on average, infinitely long, so that's probably not what you've got. More likely, you have some idea of the probability distribution of number sizes, and could choose between the above options.
Note that none of these "compress" (except relative to the bloated ascii-decimal format). The asymptote of log n/n is 0, so as the numbers get bigger the size of the size of the numbers tends to occupy no (relative) space. But it still needs to be represented somehow, so the total representation will always be a bit bigger than log2 of the number.
You cannot compress per se, but you can encode, which may be what you're looking for. You have files with sequences of ASCII decimal digits separated by line feeds. You should simply Huffman encode the characters. You won't do much better than about 3.5 bits per character.
