What's the most space-efficient way to represent integers of unbounded length?
(The numbers range from zero to positive-infinity)
Some sample number inputs can be found here (each number is shown on its own line).
Is there a compression algorithm that is specialized in compressing numbers?
You've basically got two alternatives for variable-length integers:
Use 1 bit of every k as an end terminator. That's the way Google protobuf does it, for example (in their case, one bit from every byte, so there are 7 useful bits in every byte).
Output the bit-length first, and then the bits. That's how ASN.1 works, except for OIDs which are represented in form 1.
If the numbers can be really big, Option 2 is better, although it's more complicated and you have to apply it recursively, since you may have to output the length of the length, and then the length, and then the number. A common technique is to use Option 1 (bit markers) for the length field.
For smallish numbers, Option 1 is better. Consider the case where most numbers would fit in 64 bits. The overhead of storing them 7 bits per byte is 1/7; with eight bytes, you'd represent 56 bits. Using even the 7/8 representation for the length would also represent 56 bits in eight bytes: one length byte and seven data bytes. Numbers shorter than about 48 bits do at least as well with the self-terminating code, and often strictly better.
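For concreteness, here is a minimal sketch of Option 1 in C++, along the lines of protobuf's base-128 varints (the function names and the use of std::vector are illustrative assumptions, not protobuf's actual API):

#include <cstdint>
#include <vector>

// Each byte carries 7 payload bits; the high bit means "more bytes follow".
std::vector<uint8_t> encode_varint(uint64_t n) {
    std::vector<uint8_t> out;
    while (n >= 0x80) {
        out.push_back((uint8_t)((n & 0x7F) | 0x80));  // 7 bits plus continuation flag
        n >>= 7;
    }
    out.push_back((uint8_t)n);                        // final byte, high bit clear
    return out;
}

uint64_t decode_varint(const std::vector<uint8_t>& in) {
    uint64_t n = 0;
    int shift = 0;
    for (uint8_t b : in) {
        n |= (uint64_t)(b & 0x7F) << shift;
        if (!(b & 0x80)) break;                       // terminator byte reached
        shift += 7;
    }
    return n;
}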
"Truly random numbers" of unbounded length are, on average, infinitely long, so that's probably not what you've got. More likely, you have some idea of the probability distribution of number sizes, and could choose between the above options.
Note that none of these "compress" anything (except relative to the bloated ASCII-decimal format). The asymptote of log n/n is 0, so as the numbers get bigger, the size of the size of a number tends to occupy no (relative) space. But the length still needs to be represented somehow, so the total representation will always be a bit bigger than log2 of the number.
You cannot compress per se, but you can encode, which may be what you're looking for. You have files with sequences of ASCII decimal digits separated by line feeds. You should simply Huffman encode the characters. You won't do much better than about 3.5 bits per character.
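As a rough check of that figure, here is a small C++ sketch (the file name numbers.txt is a placeholder) that computes the zero-order entropy of the characters in such a file; a per-character Huffman code cannot beat this bound and typically lands within a fraction of a bit of it:

#include <cmath>
#include <cstdio>
#include <fstream>
#include <map>

int main() {
    std::ifstream in("numbers.txt", std::ios::binary);  // ASCII digits and line feeds
    std::map<char, long long> freq;
    long long total = 0;
    char c;
    while (in.get(c)) { ++freq[c]; ++total; }
    double entropy = 0.0;
    for (const auto& [ch, count] : freq) {               // sum of -p*log2(p)
        double p = (double)count / total;
        entropy -= p * std::log2(p);
    }
    std::printf("%.3f bits per character over %lld characters\n", entropy, total);
    return 0;
}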
I'm having difficulty storing and displaying numbers greater than 32767 in LC-3, since a register can only hold values from -32768 to 32767. My apologies for not being able to come up with any idea for the algorithm. Please give me some suggestions. Thanks!
You'll need a representation that stores the larger number across two or more words.
There are several approaches to how big integers are stored: in a fixed number of words, or in a variable number of words or bytes. The critical part is being able to detect the presence and amount of overflow/carry on mathematical operations like *10.
For that reason, one simple approach is to use a variable number of words/bytes (for a single number) and store only one decimal digit in each word/byte. That way, multiplication by 10 simply means adding a digit on the end (which has the effect of moving each existing digit to the next higher power-of-ten position). Adding numbers in this form is fairly easy as well: we line up the digits and add them up, and whenever a digit sum is >= 10 there is a carry (of 1) to be added to the next higher-order digit of the sum. (If adding two such variable-length numbers is desired, I would store the decimal digits in reverse order, because then the low-order digits are already lined up for addition.) See also https://en.wikipedia.org/wiki/Binary-coded_decimal . (In some sense, this is like storing numbers in a string-like form, but using binary values instead of ASCII characters.)
To simplify this approach for your needs, you can fix the number of words to use, e.g. at 7, for 7 digits.
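To illustrate the one-digit-per-element idea, here is a sketch in C++ rather than LC-3 assembly (the type and function names are made up), storing the digits least-significant first as suggested above:

#include <cstdint>
#include <vector>

using BigDec = std::vector<uint8_t>;   // one decimal digit (0..9) per element, least significant first

// Multiply by 10: every digit moves to the next higher power of ten,
// which with this storage order is just inserting a 0 at the low end.
void mul10(BigDec& n) {
    n.insert(n.begin(), 0);
}

// Add two numbers digit by digit, carrying 1 whenever a digit sum reaches 10.
BigDec add(const BigDec& a, const BigDec& b) {
    BigDec sum;
    uint8_t carry = 0;
    for (size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
        uint8_t d = carry;
        if (i < a.size()) d += a[i];
        if (i < b.size()) d += b[i];
        sum.push_back(d % 10);
        carry = d / 10;
    }
    return sum;
}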
A variation on (unpacked) binary-coded decimal is to pack two decimal digits per byte. It's a bit more complicated but saves some storage.
Another approach is to store as many decimal digits as will fit fully in a word, minus one; the spare headroom is what lets you handle the overflow/carry on *10 within the word. For example, 16 bits can hold values up to 65535, which is only 4 full decimal digits, so you would put 3 digits at a time into each word; you'd need 3 words for 9 digits. Multiplication by 10 then means multiplying each word by 10 numerically and checking whether the result exceeds 999; if it does, the thousands part (result / 1000) is carried to the next higher-order word and only result mod 1000 is kept in the overflowing word.
This approach requires actual multiplication by 10 and division (to split off the carry) on each of the individual words.
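Here is a sketch of that idea in C++ (again illustrative, not LC-3 code): each 16-bit word holds a value 0..999, so multiplying a word by 10 still fits in 16 bits and the part at or above 1000 is the carry for the next word:

#include <cstdint>
#include <vector>

using Big1000 = std::vector<uint16_t>;  // three decimal digits (0..999) per word, least significant first

void mul10(Big1000& n) {
    uint16_t carry = 0;
    for (uint16_t& word : n) {
        uint16_t v = word * 10 + carry;  // at most 999*10 + 9 = 9999, fits in 16 bits
        word  = v % 1000;                // keep the low three decimal digits
        carry = v / 1000;                // 0..9, carried into the next higher word
    }
    if (carry) n.push_back(carry);
}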
There are other approaches, such as using all 16-bits in a word as magnitude, but the difficulty there is determining the amount of overflow/carry on *10 operations. It is not a monumental task but will require work. See https://stackoverflow.com/a/1815371/471129, for example.
(If you also want to store negative numbers, the representation of the sign is an issue as well. We can either store the sign separately, known as sign-magnitude form (in its own word/byte, or packed into the highest byte), or store the number in a complement form. The former is better for variable-length implementations, and the latter can be made to work for fixed-length implementations.)
I want to compress an array consisting of about 10^5 random integers in the range 0 to 2^15. The integers are unsorted and I need to compress them losslessly.
I don't care much about the amount of computation and time needed to run the algorithm, just want to have better compression ratio.
Are there any suggested algorithms for this?
Assuming you don't need to preserve the original order, instead of passing the numbers themselves, pass the counts. If the values are roughly uniformly distributed, you can expect each one to appear about 3 times on average. With 3 bits per value we can count up to 7, so you can make an array of 2^15 * 3 bits and set every 3-bit field to the count of that value. To handle the extreme cases that occur more than 7 times, also send a list of those values and their exact counts. The reader can then decode the 3-bit array and overwrite the entries with the additional info for counts higher than 7.
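A sketch of that counting scheme in C++ (the struct and names are made up; for clarity the 3-bit counts are kept one per byte here, whereas a real encoder would bit-pack them into 2^15 * 3 bits):

#include <cstdint>
#include <map>
#include <vector>

struct CountEncoding {
    std::vector<uint8_t> counts3;          // saturating 3-bit count per possible value
    std::map<uint16_t, uint32_t> overflow; // exact counts for values appearing more than 7 times
};

CountEncoding encode_counts(const std::vector<uint16_t>& data) {
    std::vector<uint32_t> exact(1 << 15, 0);
    for (uint16_t v : data) ++exact[v];

    CountEncoding enc;
    enc.counts3.assign(1 << 15, 0);
    for (uint32_t v = 0; v < (1u << 15); ++v) {
        if (exact[v] > 7) {
            enc.counts3[v] = 7;                    // saturate the 3-bit field
            enc.overflow[(uint16_t)v] = exact[v];  // send the exact count separately
        } else {
            enc.counts3[v] = (uint8_t)exact[v];
        }
    }
    return enc;
}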
For your exact example: just encode each number as a 15-bit unsigned int and apply bit packing. This is optimal, since you have stated each integer is uniformly random in [0, 2^15), and the Shannon entropy of this distribution is 15 bits.
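A minimal sketch of that bit packing in C++ (the function name is just for illustration); 10^5 values at 15 bits each come to about 187,500 bytes:

#include <cstdint>
#include <vector>

std::vector<uint8_t> pack15(const std::vector<uint16_t>& values) {
    std::vector<uint8_t> out;
    uint32_t buffer = 0;   // bit accumulator
    int bits = 0;          // number of valid bits currently in the accumulator
    for (uint16_t v : values) {
        buffer |= (uint32_t)(v & 0x7FFF) << bits;   // append the 15-bit value
        bits += 15;
        while (bits >= 8) {                         // flush whole bytes
            out.push_back((uint8_t)(buffer & 0xFF));
            buffer >>= 8;
            bits -= 8;
        }
    }
    if (bits > 0) out.push_back((uint8_t)(buffer & 0xFF));  // final partial byte
    return out;
}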
For a more general solution, apply Quantile Compression (https://github.com/mwlon/quantile-compression/). It takes advantage of any smooth-ish data and compresses near-optimally on shuffled data. It works by encoding each integer with a Huffman code for its coarse range in the distribution, then an exact offset within that range.
These approaches are both computationally cheap, but more compute won't get you further in this case.
The difference between little endian and big endian was explained to me like this: "In little endian the least significant byte goes into the low address, and in big endian the most significant byte goes into the low address". What exactly makes one byte more significant than the other?
In the number 222, you could regard the first 2 as most significant because it represents the value 200; the second 2 is less significant because it represents the value 20; and the third 2 is the least significant digit because it represents the value 2.
So, although the digits are the same, the magnitude of the number they represent is used to determine the significance of a digit.
It is the same as when a value is rounded to a number of significant figures ("S.F." or "SF"): 123.321 to 3SF is 123, to 2SF it is 120, to 4SF it is 123.3. That terminology has been used since before electronics were invented.
In any positional numeric system, each digit has a different weight in creating the overall value of the number.
Consider the number 51354 (in base 10): the first 5 is more significant than the second 5, as it stands for 5 multiplied by 10000, while the second 5 is just 5 multiplied by 10.
In computers, numbers are generally fixed-width: for example, a 16-bit unsigned integer can be thought of as a sequence of exactly 16 binary digits, with the leftmost one being unambiguously the most significant (it is worth exactly 32768, more than any other bit in the number), and the rightmost the least significant (it is worth just one).
As long as integers are in the CPU registers we don't really need to care about their representation - the CPU will happily perform operations on them as required. But when they are saved to memory (which generally is some random-access byte store), they need to be represented as bytes in some way.
If we consider "normal" computers, representing a number (bigger than one byte) as a sequence of bytes means essentially representing it in base 256, each byte being a digit in base 256, and each base-256 digit being more or less significant.
Let's see an example: take the value 54321 as a 16-bit integer. If you write it in base 256, it'll be two base-256 digits: the digit 0xD4 (which is worth 0xD4 multiplied by 256) and the digit 0x31 (which is worth 0x31 multiplied by 1). It's clear that the first one is more significant than the second one, as indeed the leftmost "digit" is worth 256 times more than the one at its right.
Now, little endian machines will write in memory the least significant digit first, big endian ones will do the opposite.
Incidentally, there's a nice relationship between binary, hexadecimal and base-256: 4 bits map straight to a hexadecimal digit, and 2 hexadecimal digits map straight to a byte. For this reason you can also see that 54321 in binary is
1101010000110001 = 0xD431
which can be split straight into two groups of 8 bits
11010100 00110001
which are the 0xD4 and 0x31 above. So you can see as well that the most significant byte is the one that contains the most significant bits.
Here I'm using the corresponding hexadecimal values to represent each base-256 digit, as there's no good way to represent them symbolically. I could use their ASCII character values, but 0xD4 is outside ASCII, and 0x31 is 1, which would only add confusion.
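If you want to see which order your own machine uses, here is a small C++ sketch that prints the two bytes of 54321 (0xD431) as they sit in memory:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    uint16_t value = 54321;                 // 0xD431
    uint8_t bytes[2];
    std::memcpy(bytes, &value, sizeof value);
    // A little endian machine prints "31 D4" (0x31 at the lower address);
    // a big endian machine prints "D4 31" (0xD4 at the lower address).
    std::printf("%02X %02X\n", bytes[0], bytes[1]);
    return 0;
}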
I need to compress a large set of (unsigned) integer values, where the goal is to keep their relative accuracy. Simply speaking, there is a big difference between 1, 2, 3, but the difference between 1001, 1002, 1003 is minor.
So I need a sort of lossy transformation. The natural choice is to build a logarithmic scale, but the drawback is that conversion to/from it requires floating-point operations, log/exp calculations, and so on.
OTOH I don't need a truly logarithmic scale, I just need it to resemble it in some sense.
I came up with the idea of encoding numbers in a floating-point manner. That is, I allocate N bits for each compressed number, of which some represent the mantissa and the remaining ones the order. The choice of mantissa and order sizes would depend on the needed range and accuracy.
My question is: is it a good idea? Or perhaps there exists a better encoding scheme w.r.t. computation complexity vs quality (similarity to logarithmic scale).
What I implemented, in detail:
As I said, there are bits for the mantissa and bits for the order. Order bits are leading, so that the greater the encoded number, the greater the raw one.
The actual number is decoded by prepending an extra leading bit to the mantissa (a.k.a. the implicit bit) and left-shifting it by the encoded order. The smallest decoded number would be 1 << M, where M is the size of the mantissa. If the needed range should start from 0 (as in my case), then this number can be subtracted.
Encoding the number is also simple. Add 1 << M, then find the order, i.e. how far it must be right-shifted until it fits the mantissa with the implicit leading bit, and then the encoding is trivial. Finding the order is done via a median search (essentially a binary search), which comes down to just a few ifs (for example, if there are 4 order bits, the max order is 15, and it's found within 4 ifs).
I call this a "quasi-logarithmic" scale. The absolute precision decreases as the number grows. But unlike a true logarithmic scale, where the granularity increases continuously, in our case it jumps by a factor of 2 after each fixed-size range.
The advantages of this encoding:
Fast encoding and decoding
No floating-point numbers, no implicit precision loss during manipulations with them, no boundary cases, and so on
Not dependent on standard libraries, complex math functions, etc.
Encoding and decoding may be implemented via C++ template machinery, so that conversion may even be done at compile time. This is convenient for defining some compile-time constants in a human-readable way.
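For illustration, here is a minimal C++ sketch of the scheme as described (the constants M and E and the function names are arbitrary, and the shift loop stands in for the few-ifs median search):

#include <cstdint>

constexpr unsigned M = 8;   // mantissa bits (example value)
constexpr unsigned E = 5;   // order bits; must be enough to hold the largest order

// Encode: order goes into the high E bits, mantissa into the low M bits.
uint32_t encode(uint32_t n) {
    uint64_t x = (uint64_t)n + (1u << M);            // offset so the range starts at 0
    unsigned order = 0;
    while ((x >> order) >= (2u << M)) ++order;       // shift until it fits M+1 bits (implicit bit included)
    uint32_t mantissa = (uint32_t)(x >> order) & ((1u << M) - 1);
    return (order << M) | mantissa;
}

// Decode: re-attach the implicit leading bit, shift back, remove the offset.
// Returns the smallest raw number in the code's group (truncation).
uint32_t decode(uint32_t code) {
    unsigned order = code >> M;
    uint32_t mantissa = code & ((1u << M) - 1);
    return (uint32_t)((((uint64_t)(1u << M) | mantissa) << order) - (1u << M));
}

With M = 8 this matches the example in the answer below: all raw numbers in [0x1340, 0x1350) get the same code and decode back to 0x1340.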
In your compression algorithm, every group of numbers that results in the same output after compression will be decompressed to the lowest number in that group. If you changed that to the number in the middle of the group, the average error would be reduced.
E.g. for an 8-bit mantissa and a 5-bit exponent, the numbers in the range [0x1340, 0x1350) will be translated into 0x1340 by decompress(compress(x)). If the entire range were first compressed and afterwards decompressed, the total difference would be 120. If the output were 0x1348 instead, the total error would only be 64, which reduces the error by a solid 46.7%. So simply adding 1 << (exponent - 1) to the decompressed output will significantly reduce the error of the compression scheme.
Apart from that I don't see much of an issue with this scheme. Just keep in mind that you'll need a specific encoding for 0. There are alternative encodings, but without knowing anything specific about the input, this one is about the best you can get.
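As a sketch of that correction, building on the encode/decode sketch in the question above (M and decode() are assumed from there):

#include <algorithm>
#include <cstdint>

// Decode to the middle of the group instead of its lowest member.
uint32_t decode_mid(uint32_t code) {
    unsigned order = code >> M;
    uint64_t mid = (uint64_t)decode(code) + (order ? (1u << (order - 1)) : 0);
    // For the topmost group the midpoint can exceed the 32-bit input range
    // (the overflow issue discussed in the EDIT below), so clamp it.
    return (uint32_t)std::min<uint64_t>(mid, UINT32_MAX);
}

For the example above, this maps the codes for the whole range [0x1340, 0x1350) to 0x1348.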
EDIT:
While it is possible to move the correction of the result from the decompression step to the compression step, this comes at the expense of enlarging the exponent range by one. This is due to the fact that, for the numbers with the MSB set, only half of them will use the corresponding exponent (the other half will be occupied by numbers with the second-most-significant bit set). The upper half of the numbers with the MSB set will be placed in the next-higher order.
So, e.g., for 32-bit numbers encoded with a 15-bit mantissa, only numbers up to 0x8FFF FFFF will have order 15 (mantissa = 0x1FFF and exponent = 15). All higher values will have order 16 (mantissa = 0x?FFF and exponent = 16). While the increase of the exponent by 1 doesn't in itself seem like much, in this example it already costs an additional bit for the exponent.
In addition, the decompression step for the above example will produce an integer overflow, which may be problematic under certain circumstances (e.g. C# will throw an exception if the decompression is done in checked mode). The same applies to the compression step: unless properly handled, adding 2^(order(n) - 1) to the input n can overflow, thus placing the number in order 0.
I would recommend doing the correction in the decompression step (as shown above), to remove potential integer overflows as a source of problems/bugs and to keep the number of exponents that need to be encoded minimal.
EDIT2:
Another issue with doing the correction at compression is that half of the numbers (excluding those in the lowest order) wind up in a larger "group", thus reducing precision.
There is one question, and I have the solution to it also, but I couldn't understand the solution. Kindly help with a set of examples and share some of your experience.
Question
Given a file containing roughly 300 million social security numbers (9-digit numbers), find a 9-digit number that is not in the file. You have unlimited drive space but only 2MB of RAM at your disposal.
Answer
In the first step, we build an array of 2^16 integers, initialized to 0, and for every number in the file we take its 16 most significant bits to index into this array and increment that entry.
Since there are fewer than 2^32 numbers in the file, there is bound to be (at least) one entry in the array that is less than 2^16. This tells us that there is at least one number missing among the possible numbers with those upper bits.
In the second pass, we can focus only on the numbers that match this criterion and use a bit-vector of size 2^16 to identify one of the missing numbers.
To make the explanation simpler, let's say you have a list of two-digit numbers where each digit is between 0 and 3, but you can't spare the 16 bits needed to remember, for each of the 16 possible numbers, whether you have already encountered it. What you do is create an array a of 4 three-bit integers, and in a[i] you store how many numbers with first digit i you have encountered. (Two-bit integers wouldn't be enough, because you need the values 0 through 4.)
If you had the file
00, 12, 03, 31, 01, 32, 02
your array would look like this:
4, 1, 0, 2
Now you know that all numbers starting with 0 are in the file, but for each of the remaining first digits, there is at least one number missing. Let's pick 1. We know there is at least one number starting with 1 that is not in the file. So, create an array of 4 bits; for each number starting with 1, set the appropriate bit; and at the end, pick one of the bits that wasn't set. In our example it could be 0. Now we have the solution: 10.
In this case, using this method is the difference between 12 bits and 16 bits. With your numbers, it's the difference between 32 kB and 119 MB.
In round terms, you have about 1/3 of the numbers that could exist in the file, assuming no duplicates.
The idea is to make two passes through the data. Treat each number as a 32-bit (unsigned) number. In the first pass, keep track of how many numbers share each value of the most significant 16 bits. In practice, a number of those prefixes will have a zero count (all those corresponding to 10-digit values, for example; quite likely, all those with a zero for the first digit are missing too). But of the ranges with a non-zero count, most will not have 65536 entries, which is how many would appear if there were no gaps in the range. So, with a bit of care, you can choose one of the ranges to concentrate on in the second pass.
If you're lucky, you can find a range within 100,000,000..999,999,999 with zero entries; you can then choose any number from that range as missing.
Assuming you aren't quite that lucky, choose the range with the lowest count (or any of them with fewer than 65536 entries); call it the target range. Reset the array to all zeroes. Reread the data. If the number you read is not in your target range, ignore it. If it is in the range, record the number by setting the array value to 1 for the low-order 16 bits of the number. When you've read the whole file, any of the numbers with a zero in the array represents a missing SSN.
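Here is a C++ sketch of the whole two-pass method under some assumptions of my own: the file is read as one decimal number per line, the counters are 32-bit (256 KB; they could be narrowed, as the 32 kB estimate above suggests), and the prefix search is limited to 1526..15257, the 16-bit prefixes that lie entirely inside 100,000,000..999,999,999, so the result is a valid 9-digit number:

#include <cstdint>
#include <fstream>
#include <vector>

uint32_t find_missing_ssn(const char* path) {
    // Pass 1: count how many numbers share each value of the upper 16 bits.
    std::vector<uint32_t> counts(1 << 16, 0);
    {
        std::ifstream in(path);
        uint32_t ssn;
        while (in >> ssn) ++counts[ssn >> 16];
    }

    // Pick a prefix whose bucket cannot be full (fewer than 65536 entries).
    // One must exist: buckets 1526..15257 could hold about 900 million
    // distinct numbers, and the file contains only about 300 million.
    uint32_t prefix = 1526;
    while (prefix < 15257 && counts[prefix] >= (1u << 16)) ++prefix;

    // Pass 2: bit-vector over the low 16 bits, restricted to that bucket (8 KB).
    std::vector<bool> seen(1 << 16, false);
    {
        std::ifstream in(path);
        uint32_t ssn;
        while (in >> ssn)
            if ((ssn >> 16) == prefix) seen[ssn & 0xFFFF] = true;
    }
    for (uint32_t low = 0; low < (1u << 16); ++low)
        if (!seen[low]) return (prefix << 16) | low;
    return 0;  // unreachable: the chosen bucket has at least one gap
}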