Encode an array of integers to a short string - algorithm

Problem:
I want to compress an array of non-negative integers of non-fixed length (but it should be 300 to 400), containing mostly 0's, some 1's, a few 2's. Although unlikely, it is also possible to have bigger numbers.
For example, here is an array of 360 elements:
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.
Goal:
The goal is to compress an array like this into the shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y
What I've tried:
I tried to use "delta encoding", and use zeroes to denote any value higher than 1. For example: {0,0,1,0,0,0,2,0,1} would be denoted as: 2,3,0,1. To decode it, one would read from left to right, and write down "2 zeroes, one, 3 zeroes, one, 0 zeroes, one (this would add to the previous one, and thus have a two), 1 zero, one".
To eliminate the need for delimiters (commas) and thus save more space, I tried to use a single alphanumeric character to denote delta values of 0 to 34 (using 0 to y), while leaving the letter z to mean "35 PLUS the next character". I think this is called "variable bit" or something like that. For example, if there are 40 zeroes in a row, I'd encode them as "z5".
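A minimal sketch of just the character step described above, assuming the run lengths have already been computed. The names DIGITS, ESCAPE, encode_runs and decode_runs are made up for illustration, and repeated z's are assumed to stack (so 70 would be "zz0"):

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxy"  # 35 symbols for the values 0..34
ESCAPE = "z"                                     # means "add 35 and keep reading"

def encode_runs(runs):
    """Write each run length as one character, prefixed by a 'z' per 35."""
    out = []
    for run in runs:
        while run >= 35:
            out.append(ESCAPE)
            run -= 35
        out.append(DIGITS[run])
    return "".join(out)

def decode_runs(text):
    """Inverse of encode_runs: accumulate 35 per 'z', then add the digit."""
    runs, carry = [], 0
    for ch in text:
        if ch == ESCAPE:
            carry += 35
        else:
            runs.append(carry + DIGITS.index(ch))
            carry = 0
    return runs
```

With this, encode_runs([2, 3, 0, 1]) gives "2301" and encode_runs([40]) gives "z5", matching the examples in the question.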
That's as far as I got... the resultant string is still very long (it would be about 20 characters long in the above example). I would ideally want something like, 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!

Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-length encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes, then alternating between that and the non-zero values. (A zero-run-length of 0 will indicate successive non-zero values...)
Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using a smaller number of bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.
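As a rough sketch of these two steps together, here is RLE followed by Elias gamma coding, which is one example of a universal code (the answer does not say which one to use, so that choice is an assumption, as are all the function names). Run lengths can be 0 and gamma codes only cover n >= 1, so runs are stored as run + 1:

```python
def rle(values):
    """Alternate zero-run lengths and non-zero values, starting and ending with a run."""
    out, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            out += [run, v]
            run = 0
    out.append(run)  # trailing zeros (possibly none)
    return out

def gamma(n):
    """Elias gamma code for n >= 1: (bit length - 1) zeros, then n in binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def encode(values):
    # even positions are run lengths (shifted by one), odd positions are values >= 1
    return "".join(gamma(x + 1 if i % 2 == 0 else x) for i, x in enumerate(rle(values)))

def decode(bits):
    values, pos, i = [], 0, 0
    while pos < len(bits):
        zeros = 0
        while bits[pos + zeros] == "0":
            zeros += 1
        n = int(bits[pos + zeros : pos + 2 * zeros + 1], 2)
        pos += 2 * zeros + 1
        if i % 2 == 0:
            values += [0] * (n - 1)   # undo the +1 shift on run lengths
        else:
            values.append(n)
        i += 1
    return values
```

Small numbers get short codes (gamma(1) is a single bit), which suits data that is mostly short values and occasional long runs.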
You may also want to look into how JPEG-style encoding works. After DCT and quantization, the JPEG entropy encoding problem seems similar to yours.
Finally, if you want to go for maximum compression, you might want to look up arithmetic encoding, which can compress your data arbitrarily close to the statistical minimum entropy.
The above links explain how to compress to a stream of raw bits. In order to convert them to a string of letters and numbers, you will need to add another encoding step, which converts the raw bits to such a string. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse".
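One simple way to do that last step, sketched here with plain base-36 conversion of the whole bit string as one big integer (simpler than "reverse" arithmetic coding, and unlike standard base64 it stays within letters and digits). The function names are made up, and the leading 1 bit is a sentinel so that leading zero bits survive the round trip:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def bits_to_text(bits):
    """Turn a string of '0'/'1' characters into a base-36 string."""
    n = int("1" + bits, 2)          # sentinel 1 keeps leading zero bits
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def text_to_bits(text):
    """Inverse of bits_to_text."""
    n = int(text, 36)
    return bin(n)[3:]               # drop the '0b' prefix and the sentinel bit
```

Each base-36 character carries about 5.17 bits, so this is close to the best you can do with a 36-character alphabet.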
Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best.
Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike the kind you expect, the result may be an expansion, rather than a compression. You can limit this expansion by providing an alternative, uncompressed format, to be used in such cases...

In your data you have:
14 1s (3.89% of data)
4 2s (1.11%)
1 each of 3, 4 and 5 (0.28% each)
339 0s (94.17%)
Assuming that your numbers are independent of each other and you do not have any other information, the entropy of your data is 0.407 bits per number, that is 146.4212 bits overall (18.3 bytes). So it is impossible to encode in 8 bytes.
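The figure can be reproduced from the counts above with the usual Shannon entropy formula; this is only a check of the arithmetic, with variable names chosen for illustration:

```python
from math import log2

counts = {0: 339, 1: 14, 2: 4, 3: 1, 4: 1, 5: 1}   # counts listed above
total = sum(counts.values())                        # 360 numbers

# Shannon entropy in bits per number, treating the numbers as independent
entropy = -sum(c / total * log2(c / total) for c in counts.values())
total_bits = entropy * total
total_bytes = total_bits / 8
```

Note that 8 alphanumeric characters hold even less than 8 bytes, since each character from a 36-symbol alphabet carries only about 5.17 bits.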

Related

Bitmasking--when to use hex vs binary

I'm working on a problem out of Cracking The Coding Interview which requires that I swap odd and even bits in an integer with as few instructions as possible (e.g., bits 0 and 1 are swapped, bits 2 and 3 are swapped, etc.)
The author's solution revolves around using a mask to grab, in one number, the odd bits, and in another num the even bits, and then shifting them off by 1.
I get her solution, but I don't understand how she grabbed the even/odd bits. She creates two bit masks, both in hex, for a 32 bit integer. The two are: 0xaaaaaaaa and 0x55555555. I understand she's essentially creating the equivalent of 1010101010... for a 32 bit integer in hexadecimal and then ANDing it with the original num to grab the odd/even bits respectively.
What I don't understand is why she used hex? Why not just code in 10101010101010101010101010101010? Did she use hex to reduce verbosity? And when should you use one over the other?
It's to reduce verbosity. Binary 10101010101010101010101010101010, hexadecimal 0xaaaaaaaa, and decimal 2863311530 all represent exactly the same value; they just use different bases to do so. The only reason to use one or another is for perceived readability.
Most people would clearly not want to use decimal here; it looks like an arbitrary value.
The binary is clear: alternating 1s and 0s, but with so many, it's not obvious that this is a 32-bit value, or that there isn't an adjacent pair of 1s or 0s hiding in the middle somewhere.
The hexadecimal version takes advantage of chunking. Assuming you recognize that 0x0a == 0b1010, you can mentally picture the 8 groups of 1010 in the assumed value.
Another possibility would be octal 25252525252, since... well, maybe not. You can see that something is alternating, but unless you use octal a lot, it's not clear what that alternating pattern in binary is.
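For reference, here is the masking trick from the question written out in Python for 32-bit values (the constant and function names are illustrative, not the book's):

```python
ODD_MASK = 0xaaaaaaaa    # binary 1010...1010: bits 1, 3, 5, ...
EVEN_MASK = 0x55555555   # binary 0101...0101: bits 0, 2, 4, ...

def swap_odd_even(x):
    """Swap each odd bit with its even neighbour in a 32-bit value."""
    return ((x & ODD_MASK) >> 1) | ((x & EVEN_MASK) << 1)

# The same mask written in four bases; all four literals are equal.
same_value = (0b10101010101010101010101010101010
              == 0xaaaaaaaa
              == 0o25252525252
              == 2863311530)
```

Seeing the four literals side by side makes the readability argument fairly concrete: only the hex one is both short and recognizably a repeating bit pattern.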

Efficient storage of large matrix on HDD

I have many large 1GB+ matrices of doubles (floats), many of them 0.0, that need to be stored efficiently. I intend to keep the double type since some of the elements do require a double (but I can consider changing this if it could lead to a significant space saving). A string header is optional. The matrices have no missing elements, NaNs, NAs, nulls, etc: they are all doubles.
Some columns will be sparse, others will not be. The proportion of columns that are sparse will vary from file to file.
What is a space efficient alternative to CSV? For my use, I need to parse this matrix quickly into R, python and Java, so a file format specific to a single language is not appropriate. Access may need to be by row or column.
I am also not looking for a commercial solution.
My main objective is to save HDD space without blowing out io times. RAM usage once imported is not the primary consideration.
The most important question is if you always expand the whole matrix into memory or if you need a random access to the compacted form (and how). Expanding is way simpler, so I'm concentrating on this.
You could use a bitmap stating if a number is present or zero. This costs 1 bit per entry and thus can increase the file size by 1/64 in case of no zeros or shrink it to 1/64 in case of all zeros. If there are runs of zeros, you may store the number of following zeros and the number of non-zeros instead, e.g., by packing two 4-bit numbers into one byte.
As the double representation is standard, you can use the binary representation in all of these languages. If many of your numbers are actually ints, you may consider something like I did.
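A minimal sketch of the bitmap idea, assuming the whole matrix is flattened to a list and expanded in one go. The layout (an 8-byte count, then the bitmap, then the non-zero doubles, all little-endian) and the function names are made up for illustration, not an existing file format:

```python
import struct

def pack_sparse(values):
    """Count header, then 1 bit per entry (1 = non-zero), then the non-zero doubles."""
    bitmap = bytearray((len(values) + 7) // 8)
    nonzero = []
    for i, v in enumerate(values):
        if v != 0.0:                      # note: -0.0 is stored as plain zero
            bitmap[i // 8] |= 1 << (i % 8)
            nonzero.append(v)
    header = struct.pack("<Q", len(values))
    return header + bytes(bitmap) + struct.pack(f"<{len(nonzero)}d", *nonzero)

def unpack_sparse(blob):
    """Inverse of pack_sparse: expand back to the full list of doubles."""
    (n,) = struct.unpack_from("<Q", blob, 0)
    bitmap_len = (n + 7) // 8
    bitmap = blob[8 : 8 + bitmap_len]
    present = [bool(bitmap[i // 8] >> (i % 8) & 1) for i in range(n)]
    doubles = iter(struct.unpack_from(f"<{sum(present)}d", blob, 8 + bitmap_len))
    return [next(doubles) if p else 0.0 for p in present]
```

Because the layout is just fixed-width little-endian fields, it should be straightforward to read the same bytes from R or Java as well.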
If consecutive numbers are related, you could consider storing their differences.
I intend to keep the double type since some of the elements do require a double (but I can consider changing this if it could lead to a significant space saving).
Obviously, switching to float would trade about half the precision for half the memory. This is probably too imprecise, so you could instead omit a few bits from the mantissa and get e.g. 6 bytes per entry. Alternatively, you could reduce the exponent to a single byte instead, as the range 1e-38 to 3e38 should suffice.
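The 6-byte variant could look roughly like this: write the double big-endian and keep only the first 6 bytes, which drops the 16 lowest mantissa bits (the helper names are invented, and this sketch truncates rather than rounds):

```python
import struct

def to_6_bytes(x):
    """Keep sign, exponent and the top 36 mantissa bits of a double."""
    return struct.pack(">d", x)[:6]

def from_6_bytes(b):
    """Pad the missing low mantissa bits with zeros and read back a double."""
    return struct.unpack(">d", b + b"\x00\x00")[0]
```

The relative error stays below 2**-36 (roughly 10 to 11 significant decimal digits), and values with short mantissas such as 1.5 survive exactly.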

Why Huffman Coding is good?

I am not asking how Huffman coding is working, but instead, I want to know why it is good.
I have the following two questions:
Q1
I understand the ultimate purpose of Huffman coding is to give certain chars fewer bits, so space is saved. What I don't understand is why the number of bits chosen for a char should be related to the char's frequency.
Huffman Encoding Trees says
It is sometimes advantageous to use variable-length codes, in which
different symbols may be represented by different numbers of bits. For
example, Morse code does not use the same number of dots and dashes
for each letter of the alphabet. In particular, E, the most frequent
letter, is represented by a single dot.
So in Morse code, E can be represented by a single dot because it is the most frequent letter. But why? Why can it be a dot just because it is most frequent?
Q2
Why the probability / statistics of the chars are so important to Huffman coding?
What happens if the statistics table is wrong?
If you assign fewer bits, i.e. shorter code words, to the most frequently used symbols, you will save a lot of storage space.
Suppose you want to assign 26 unique codes to the English alphabet and store an English novel (letters only) in terms of these codes. You will require less memory if you assign short codes to the most frequently occurring characters.
You might have observed that postal code and STD codes for important cities are usually shorter ( as they are used very often ). This is very fundamental concept in Information theory.
Huffman encoding gives prefix codes.
Construction of Huffman tree:
A greedy approach to constructing a Huffman tree for n characters is as follows:
Place the n characters in n sub-trees.
Combine the two lowest-weight nodes into a tree whose root is assigned the sum of the two nodes' weights.
Repeat this until a single tree remains.
For example consider below binary tree where E and T have high weights ( as very high occurrence )
It is a prefix tree. To get the Huffman code for any character, start from the node corresponding to the character and backtrack until you reach the root node.
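The greedy construction above can be sketched with a priority queue. This version assumes symbols are plain strings and uses made-up weights in the usage note; it reads codes off by walking down from the root, which yields the same bits as backtracking from the leaf and reversing the path:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a prefix code from a {symbol: weight} mapping (symbols are strings)."""
    tiebreak = count()   # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                       # a lone symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two lowest-weight sub-trees
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes
```

For example, huffman_codes({"e": 12, "t": 9, "a": 8, "q": 1, "z": 1}) gives "e" a 1-bit code and "q" and "z" 4-bit codes.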
Indeed, an E could be, say, three dashes followed by two dots. When you make your own encoding, you get to decide. If your goal is to encode a certain text so that the result is as short as possible, you should choose short codes for the most frequent characters. The Huffman algorithm ensures that we get the optimal codes for a specific text.
If the frequency table is somehow wrong, the Huffman algorithm will still give you a valid encoding, but the encoded text would be longer than it could have been if you had used a correct frequency table. This is usually not a problem, because we usually create the frequency table based on the actual text that is to be encoded, so the frequency table will be "perfect" for the text that we are going to encode.
well.. you want assign shorter codes to the symbols which appear more frequently... huffman encoding works just by this simple assumption.. :-)
you compute the frequency of all symbols, sort them all, and start assigning bit codes to each one.. the more frequent a symbol is, the shorter the code you'll assign to it.. simple as this.
the big question is: how large should the window in which we compute such frequencies be? should it be as large as the entire file? or should it be smaller? and if the latter applies, how large? Most huffman encoders have some sort of "test-run" in which they estimate the best window size, a little bit like TCP/IP does with its window sizes.
Huffman codes provide two benefits:
they are space efficient given some corpus
they are prefix codes
Given some set of documents for instance, encoding those documents as Huffman codes is the most space efficient way of encoding them, thus saving space. This however only applies to that set of documents as the codes you end up are dependent on the probability of the tokens/symbols in the original set of documents. The statistics are important because the symbols with the highest probability (frequency) are given the shortest codes. Thus the symbols most likely to be in your data use the least amount of bits in the encoding, making the coding efficient.
The prefix code part is useful because it means that no code is the prefix of another. In Morse code, for instance, A = dot dash and J = dot dash dash dash, so how do you know where one code ends and the next begins? This makes transmitting data in Morse less efficient, as you need a special symbol (a pause) to signal the end of each code. Compare that to Huffman codes: as soon as you recognize the code for a symbol in the input, you know that is the transmitted symbol, because its code is guaranteed not to be the prefix of any other symbol's code.
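To illustrate why no separator is needed, here is a small decoder sketch. The code table is a hypothetical prefix code invented for this example, not one derived from real frequencies:

```python
def decode_prefix(bits, codes):
    """Decode a bit string using a prefix code given as {symbol: code}."""
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:      # a complete code can never be extended into another
            out.append(reverse[current])
            current = ""
    return "".join(out)

# hypothetical prefix code: no code is the start of another
CODES = {"e": "0", "t": "10", "a": "110", "n": "111"}
```

Here decode_prefix("0100110111", CODES) splits the stream as 0 | 10 | 0 | 110 | 111 without any pause symbol and yields "etean".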
It's the dual effect of having the most frequent characters using the shortest bit sequences that gives you the savings.
For a concrete example, let's say you have a piece of text that consists of 1024 e characters and 1024 of all other characters combined.
With 8 bits per character, that's a full 2048 bytes used in uncompressed form.
Now let's say we represent e as a single 1-bit and every other letter as a 0-bit followed by its original 8 bits (a very primitive form of Huffman).
You can see that half the characters have been expanded from 8 bits to 9, giving 9216 bits, or 1152 bytes. However, the e characters have been reduced from 8 bits to 1, meaning they take up 1024 bits, or 128 bytes.
The total bytes used is therefore 1152 + 128, or 1280 bytes, representing a compression ratio of 62.5%.
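The arithmetic above can be checked by actually running the primitive scheme. This sketch assumes 8-bit characters and uses "x" as a stand-in for "all other characters":

```python
def encode(text):
    # "e" becomes a single 1-bit; anything else is a 0-bit plus its 8-bit code
    return "".join("1" if c == "e" else "0" + format(ord(c), "08b") for c in text)

text = "e" * 1024 + "x" * 1024
original_bytes = len(text)                  # 8 bits per character
encoded_bytes = len(encode(text)) // 8      # bits packed into bytes
ratio = encoded_bytes / original_bytes
```

This reproduces the 2048 and 1280 byte figures and the 62.5% ratio.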
You can use a fixed encoding scheme based on the likely frequencies of characters (such as English text), or you can use adaptive Huffman encoding which changes the encoding scheme as characters are processed and frequencies are adjusted. While the former may be okay for input which has high probability of matching frequencies, the latter can adapt to any input.
The statistics table can't be wrong, because in general the Huffman algorithm analyzes the whole text at the beginning and builds the frequency statistics of the given text, while Morse has a static symbol-to-code map.
Huffman algorithm uses the advantage of a given text. As an example, if E is most frequent letter in English in general, that doesn't mean that E is most frequent in a given text for a given author.
Another advantage of the Huffman algorithm is that you can use it for any alphabet, from {0, 1} to Chinese characters, while Morse is defined only for English letters.
So in Morse code, "E" can be represented by a single dot, because it is the most frequent letter. But why? Why is it a dot because of its frequency?
"E" can be assigned any unique code within a given code dictionary, so it could be "0". We choose a short code for it to save memory, so that the average space used after encoding is minimized.
Why is the probability / statistics of the chars so important to Huffman coding? What happens if the statistics table is wrong?
Why do we encode? To save space, right? The space used after encoding is the sum over all words of freq(word_i) * length(word_i), which is what we should try to minimize, so we greedily assign short codes to high-probability words.
If the statistics table is wrong, then the encoding is not the best way to save space.
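A tiny illustration of that cost, using made-up frequencies and two hand-written prefix codes: one that gives the short code to the truly frequent symbol, and one built as if the rare "q" were the frequent one:

```python
def encoded_bits(freqs, codes):
    """Total size: sum of frequency * code length over all symbols."""
    return sum(freqs[s] * len(codes[s]) for s in freqs)

freqs = {"e": 12, "t": 9, "a": 8, "q": 1, "z": 1}          # hypothetical counts

good_codes = {"e": "0", "t": "10", "a": "111", "q": "1100", "z": "1101"}
bad_codes = {"q": "0", "t": "10", "a": "111", "e": "1100", "z": "1101"}  # e and q swapped

good_size = encoded_bits(freqs, good_codes)
bad_size = encoded_bits(freqs, bad_codes)
```

Both tables decode correctly, but the second one spends 95 bits on the same text instead of 62.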

encoding most efficient way 64 character sequence for lesser writing time to memory

The problem is as follows: given 64-character sequences built from the 26 characters of the English alphabet (so a single case only), the occurrence distribution is such that every character has an equal chance of occurring at any given position.
My computation on these sequences requires writing them to text files, since the number of sequences exceeds the available RAM. I thought of encoding each sequence so that fewer bytes have to be written to the file per sequence.
With that reasoning I thought of L-Z, which would allow me to go down to 40 bytes. Is there any way to encode a 64-character sequence in fewer bytes?
With a large(-ish) lookup table you could encode each of the possible 26^64 character sequences in 301 (actually 300.8281==log2(26^64)) bits. This is slightly less than the 320 bits your straightforward compression would use. It is also the theoretical minimum given that any of the 26 characters occurs with equal probability.
Since you could derive the lookup table at any time you don't even need to store it. I suppose the bits used to represent the functions to encode a character string into a 301-bit integer and vice-versa ought to be counted into your compression ratio.
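Such encode/decode functions could look like the sketch below, which treats the sequence as a 64-digit base-26 number. It assumes lowercase a-z, and the function names are made up; 301 bits round up to 38 bytes:

```python
import string

LETTERS = string.ascii_lowercase   # assumption: the 26 characters are a-z

def pack(seq):
    """Interpret the sequence as a base-26 number and store it in 38 bytes."""
    n = 0
    for ch in seq:
        n = n * 26 + LETTERS.index(ch)
    return n.to_bytes(38, "big")

def unpack(blob, length=64):
    """Inverse of pack: peel off base-26 digits, least significant first."""
    n = int.from_bytes(blob, "big")
    chars = []
    for _ in range(length):
        n, r = divmod(n, 26)
        chars.append(LETTERS[r])
    return "".join(reversed(chars))
```

That saves 2 bytes per sequence over the 40-byte packing, at the cost of big-integer arithmetic on every read and write.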
This is, of course, a long-winded restatement of @lhf's comment.

What's the name of this algorithm/routine?

I am writing a utility class which converts strings from one alphabet to another. This is useful in situations where you have a target alphabet you wish to use, with a restriction on the number of characters available. For example, if you can use lower case letters and numbers, but only 12 characters, it's possible to compress a timestamp from the alphabet 0123456789 -: into abcdefghijklmnopqrstuvwxyz0123456789, so 2010-10-29 13:14:00 might become 5hhyo9v8mk6avy (19 characters reduced to 14).
The class is designed to convert back and forth between alphabets, and also calculate the longest source string that can safely be stored in a target alphabet given a particular number of characters.
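The capacity calculation mentioned above comes down to comparing powers of the two alphabet sizes. A rough sketch with invented function names, assuming a plain base conversion with no extra length marker:

```python
def encoded_length(source_size, target_size, source_len):
    """Fewest target characters that can hold any source string of source_len characters."""
    combos = source_size ** source_len
    n = 0
    while target_size ** n < combos:
        n += 1
    return n

def max_source_length(source_size, target_size, target_len):
    """Longest source string guaranteed to fit in target_len target characters."""
    capacity = target_size ** target_len
    n = 0
    while source_size ** (n + 1) <= capacity:
        n += 1
    return n
```

For the timestamp example (13 source symbols, 36 target symbols), encoded_length(13, 36, 19) is 14 and max_source_length(13, 36, 12) is 16.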
Was thinking of publishing this through Google code, however I'd obviously like other people to find it and use it - hence the question on what this is called. I've had to use this approach in two separate projects, with Bloomberg and a proprietary system, when you need to generate unique file names of a certain length, but want to keep some plaintext, so GUIDs aren't appropriate.
Your examples bear some similarity to a Dictionary coder with a fixed target and source dictionaries. Also worthwhile to look at is Fibonacci coding, which has a fixed target dictionary (of variable-length bits), which is variably targeted.
I think it also depends whether it is very important that your target alphabet has fixed width entries - if you allow for a fixed alphabet with variable length codes, your compression ratio will approach your entropy that much more optimally! If the source alphabet distribution is known in advance, a static Huffman tree could easily be generated.
Here is a simple algorithm:
Consider that you don't have to transmit the alphabet used for encoding. Also, you don't use (and transmit) the probabilities of the input symbols, as in standard compressions, so we just re-encode somehow the data.
In this case we can consider that the input data are in number represented with base equal to the cardinality of the input alphabet. We just have to change its representation to another base, that is a simple task.
EDITED example:
input alphabet: ABC, output alphabet: 0123456789
message ABAC will translate to 0102 in base 3, that is 11 (9 + 2) in base 10.
11 to base 10: 11
We could have a problem decoding it, because we don't know how many 0s to put at the beginning of the decoded result, so we have to use one of these modifications:
1) encode somehow in the stream the size of compressed data.
2) use a dummy 1 at the start of the stream: in this way our example will become:
10102 (base 3) = 81 + 9 + 2 = 92 (base 10).
Now after decoding we just have to ignore the first 1 (this also provides a basic error detection).
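Option 2 can be sketched as follows, using Python's arbitrary-precision integers (the function names convert and restore are made up for illustration):

```python
def convert(message, source, target):
    """Re-encode message from the source alphabet into the target alphabet."""
    n = 1                                   # the dummy leading 1
    for ch in message:
        n = n * len(source) + source.index(ch)
    out = []
    while n:
        n, r = divmod(n, len(target))
        out.append(target[r])
    return "".join(reversed(out))

def restore(text, source, target):
    """Inverse of convert: recover the original message."""
    n = 0
    for ch in text:
        n = n * len(target) + target.index(ch)
    out = []
    while n > 1:                            # stop when only the dummy 1 is left
        n, r = divmod(n, len(source))
        out.append(source[r])
    return "".join(reversed(out))
```

With the alphabets from the example, convert("ABAC", "ABC", "0123456789") gives "92", and restore("92", "ABC", "0123456789") gives back "ABAC".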
The main problem with this approach is that in most cases (GCD == 1) each new encoded character will completely change the output. This makes it very inefficient and difficult to implement for long inputs. We end up with arithmetic coding as the best solution (actually a simplified version of it).
You probably know about Base64 which does the same thing just usually the other way around. Too bad there are way too many Google results on BaseX or BaseN...

Resources