Encoding numbers in RLE Compression - algorithm
I am working on implementing a dynamic version of RLE compression.
In this version I only insert the character count into the RLE code if it is greater than 1; otherwise I keep the character in its original state.
The problem I encounter is when trying to encode digits that appear in the original input text.
Is there any way to encode them like alphabetic characters without losing efficiency in the RLE compression ratio?
Example
text = "aabccc 123 ghe dd12 Goooooal"
I want to encode it like:
2ab3c 123 ghe 2d12 G5oal
But I need some kind of encoding for the 123 and 12 in the RLE code.
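One common workaround (my sketch, not part of the question) is to reserve an escape character that cannot appear in the input - here I assume '#' - and emit it before every literal digit, so the decoder can tell counts apart from data:

    # Hedged sketch of the idea: counts are emitted only for runs longer than 1,
    # and literal digits from the source are escaped with '#' (an assumption).
    def rle_encode(text)
      out = ""
      i = 0
      while i < text.length
        ch = text[i]
        run = 1
        run += 1 while i + run < text.length && text[i + run] == ch
        out << run.to_s if run > 1   # count only when the run is longer than 1
        out << "#" if ch =~ /\d/     # escape digits that belong to the original text
        out << ch
        i += run
      end
      out
    end

    puts rle_encode("aabccc 123 ghe dd12 Goooooal")
    # => "2ab3c #1#2#3 ghe 2d#1#2 G5oal"

The escape costs one extra byte per literal digit; any character (or byte value) guaranteed not to occur in the input could play the same role.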
Related
JPEG compression using DCT
I am a little confused about the Huffman code. As I read the books, they state that after the zigzag ordering comes run-length encoding, and then Huffman coding of the run lengths. I have 3 questions:
1) Is it necessary to do both run-length encoding and Huffman, or just Huffman for the whole image (which is grayscale)? I mean, could I just scan the 8x8 block, count the frequency of appearance of characters, then create the codewords?
2) If I use run-length coding for each block, is the Huffman also per 8x8 block, or do I have to scan through the whole image?
3) The book states that we could just use Table K.3 and Table K.5 in Annex K for the DC and AC coefficient encoding. Could I not use those tables and generate my own based on the theory in question 2, which I'm also confused about?
Thank you for helping me out. This is the link for Annex K: https://www.w3.org/Graphics/JPEG/itu-t81.pdf
You COULD compress like you are saying, but it would not be JPEG. The encoding process in JPEG is rather complicated. It is not really Huffman encoding of values; it is Huffman encoding of instructions about zero runs and the number of additional raw bits that have to be read.
1) Is it necessary to do both run-length encoding and Huffman, or just Huffman for the whole image (which is grayscale)?
For it to be a JPEG stream, you have to do both.
2) If I use run-length coding for each block, is the Huffman also per 8x8 block, or do I have to scan through the whole image?
Some encoders scan the whole image in order to generate optimum Huffman tables.
3) Could I not use the Annex K tables and generate my own?
Some encoders use the Annex K tables to avoid having to make two passes over the DCT data to generate Huffman tables.
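To make the "instructions on zero runs plus raw bits" idea concrete, here is a rough sketch (my own, with assumed names - not JPEG-conformant code) of how the 63 zigzag-ordered AC coefficients of one block are turned into (run, size) symbols; in a real encoder each symbol is then Huffman-coded and followed by "size" raw amplitude bits:

    # Hedged sketch: build (zero-run, size) symbols for one block's AC coefficients.
    def ac_run_size_symbols(zigzag_ac)          # array of 63 quantized AC coefficients
      symbols = []
      run = 0
      zigzag_ac.each do |coeff|
        if coeff.zero?
          run += 1
        else
          while run > 15                        # runs longer than 15 need the (15, 0) "ZRL" symbol
            symbols << [15, 0, nil]
            run -= 16
          end
          size = coeff.abs.bit_length           # number of raw bits needed for the amplitude
          symbols << [run, size, coeff]
          run = 0
        end
      end
      symbols << [0, 0, nil] if run > 0         # (0, 0) is the end-of-block (EOB) marker
      symbols
    end

It is these (run, size) pairs, not the coefficient values themselves, that receive Huffman codes from tables like K.5.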
How does the smaz compression library work?
I'm currently working on a short text compression project based on my language. As a beginner, I know some basic compression algorithms like LZW, but I still don't understand how smaz works. I have 2 questions: How does smaz work? How do you build the codebook and the reversed codebook? Can anyone explain it to me? Thank you very much.
Trying to answer your questions:
How does smaz work? According to [1], smaz has a hard-wired constant built-in codebook of 254 common English words, word fragments, bigrams, and the lowercase letters (except j, k, q). The inner loop of the smaz decoder is very simple:
Fetch the next byte X from the compressed file.
Is X == 254? Single-byte literal: fetch the next byte L, and pass it straight through to the decoded text.
Is X == 255? Literal string: fetch the next byte L, then pass the following L+1 bytes straight through to the decoded text.
Any other value of X: look up the X'th "word" in the codebook (that "word" can be from 1 to 5 letters), and copy that word to the decoded text.
Repeat until there are no more compressed bytes left in the compressed file.
Because the codebook is constant, the smaz decoder is unable to "learn" new words and compress them, no matter how often they appear in the original text. This page could be helpful to understand the code.
How to build the codebook and reversed codebook? The TODO file in the repository and the author's comments on Reddit point out that the dictionary was generated by an unreleased Ruby script. Also, the author explains: btw what the Ruby program does is to consider all the possible substrings, and even all the possible separated words, and build a table of frequencies, then adjust the weight based on the string length, and finally hand-tune the table to compress specific things very well. I added by hand the "http://" and ".com" tokens for example, removing the final two entries.
An alternative for your project could be the shoco library, which supports generating a custom compression model based on your language.
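For illustration, a minimal Ruby sketch of that decoder loop (the real smaz implementation is in C; the codebook parameter here is an assumed 254-entry array of the built-in terms):

    # Hedged sketch of the smaz decoding loop described above.
    def smaz_decode(bytes, codebook)            # bytes: array of integers 0..255
      out = ""
      i = 0
      while i < bytes.length
        x = bytes[i]
        if x == 254                             # single-byte literal follows
          out << bytes[i + 1].chr
          i += 2
        elsif x == 255                          # literal run: next byte is length - 1
          len = bytes[i + 1] + 1
          out << bytes[(i + 2)...(i + 2 + len)].pack("C*")
          i += 2 + len
        else                                    # codebook entry of 1 to 5 letters
          out << codebook[x]
          i += 1
        end
      end
      out
    end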
The smaz source is only 178 lines, and just 99 lines without comments and codebook tables, so you should look at it to see how it works. Smaz is a pretty simple codebook-based compression (like the LZW you already know). The library contains a table of the most popular terms in English (lines 5 - 51 for the compression table and 56 - 76 for decompression) and replaces these terms with their indexes in the compressed string, and does the reverse to decompress. For example, the string "the end" would be compressed to about 58% of its size because the term "the " would be a one-byte index into the compression table, so a 7-byte string becomes a 4-byte string.
Rubyist way to decode this encoded string assuming invariant ASCII encoding
My program is a decoder for a binary protocol. One of the fields in that binary protocol is an encoded String. Each character in the String is printable, and represents an integral value. According to the spec of the protocol I'm decoding, the integral value it represents is taken from the following table, where all possible characters are listed:

    Character    Value
    =========    =====
    0            0
    1            1
    2            2
    3            3
    [...]
    :            10
    ;            11
    <            12
    =            13
    [...]
    B            18

So for example, the character = represents an integral 13. My code was originally using ord to get the ASCII code for the character, and then subtracting 48 from that, like this:

    def Decode(val)
      val[0].ord - 48
    end

...which works perfectly, assuming that val consists only of characters listed in that table (this is verified elsewhere). However, in another question, I was told that:
You are asking for a Ruby way to use ord, where using it is against the Ruby way.
It seems to me that ord is exactly what I need here, so I don't understand why using ord here is not a Rubyist way to do what I'm trying to do. So my questions are:
First and foremost, what is the Rubyist way to write my function above?
Secondary, why is using ord here a non-Rubyist practice?
A note on encoding: this protocol which I'm decoding specifies precisely that these strings are ASCII encoded. No other encoding is possible here. Protocols like this are extremely common in my industry (stock & commodity markets).
I guess the Rubyist way, and a faster one, to decode the string into an array of integers is the unpack method:

    "=01:".unpack("C*").map { |v| v - 48 }
    # => [13, 0, 1, 10]

The unpack method, with the "C*" parameter, converts each character to an 8-bit unsigned integer.
Probably ord is entirely safe and appropriate in your case, as the source data should always be encoded the same way. Especially if, when reading the data, you set the encoding to 'US-ASCII' (although the format used looks safe for 'ASCII-8BIT', 'UTF-8' and 'ISO-8859', which may be the point of it - it seems resilient to many conversions, and does not use all possible byte values). However, ord is intended to be used with character semantics, and technically you want byte semantics. With basic ASCII and its variants there is no practical difference; all byte values below 128 are the same character code. I would suggest String#unpack as a general method for converting binary input to Ruby data types, but there is no unpack code for "use this byte with an offset", so that becomes a two-part process.
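As a small illustration of that two-part process (my sketch, reusing the sample string from the other answer):

    # Hedged sketch: force byte semantics, unpack to integers, then apply the offset.
    field = "=01:".force_encoding("US-ASCII")
    values = field.unpack("C*").map { |byte| byte - 48 }
    # => [13, 0, 1, 10]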
Encode an array of integers to a short string
Problem: I want to compress an array of non-negative integers of non-fixed length (but it should be 300 to 400), containing mostly 0's, some 1's, a few 2's. Although unlikely, it is also possible to have bigger numbers. For example, here is an array of 360 elements:

    0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
    0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
    0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Goal: The goal is to compress an array like this into the shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y
What I've tried: I tried to use "delta encoding", and use zeroes to denote any value higher than 1. For example: {0,0,1,0,0,0,2,0,1} would be denoted as: 2,3,0,1. To decode it, one would read from left to right, and write down "2 zeroes, one, 3 zeroes, one, 0 zeroes, one (this would add to the previous one, and thus give a two), 1 zero, one". To eliminate the need for delimiters (commas) and thus save more space, I tried to use only one alphanumerical character to denote delta values of 0 to 35 (using 0 to y), while leaving the letter z as "35 PLUS the next character". I think this is called "variable bit" or something like that. For example, if there are 40 zeroes in a row, I'd encode it as "z5". That's as far as I got... the resultant string is still very long (it would be about 20 characters long in the above example). I would ideally want something like 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!
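As I read the scheme described in the question, a hedged sketch of it (my code, assuming an alphabet of the 35 characters 0-9 and a-y, with z reserved as the continuation symbol) would be:

    # Hedged sketch of the asker's delta scheme: values 0..34 map to one character,
    # and each extra chunk of 35 is marked with a leading 'z'.
    ALPHANUM = [*"0".."9", *"a".."y"]       # 35 symbols for values 0..34

    def encode_delta(d)
      out = ""
      while d >= 35
        out << "z"
        d -= 35
      end
      out << ALPHANUM[d]
    end

    puts encode_delta(40)   # => "z5", matching the "40 zeroes in a row" example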
Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-length encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes, then alternating between that and the non-zero values (a zero-run-length of 0 indicates successive non-zero values).
Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using a smaller number of bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.
You may also want to look into how JPEG-style encoding works. After DCT and quantization, the JPEG entropy encoding problem seems similar to yours.
Finally, if you want to go for maximum compression, you might want to look up arithmetic encoding, which can compress your data arbitrarily close to the statistical minimum entropy.
The above links explain how to compress to a stream of raw bits. In order to convert them to a string of letters and numbers, you will need to add another encoding step, which converts the raw bits to such a string. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse".
Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best. Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike the kind you expect, the result may be an expansion rather than a compression. You can limit this expansion by providing an alternative, uncompressed format, to be used in such cases...
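As a concrete example of a universal code (my sketch, not part of the original answer), here is Elias gamma coding applied to the run-length list of the small example {0,0,1,0,0,0,2,0,1} from the question, i.e. the alternating list 2,1,3,2,1,1; values are shifted by one so that 0 stays encodable:

    # Hedged sketch: Elias gamma writes n >= 1 as (bit_length - 1) zeros followed
    # by n in binary, so small integers get short codes.
    def elias_gamma(n)
      bits = n.to_s(2)
      "0" * (bits.length - 1) + bits
    end

    def encode_runs(runs)
      runs.map { |n| elias_gamma(n + 1) }.join   # +1 so a run or value of 0 is encodable
    end

    puts encode_runs([2, 1, 3, 2, 1, 1])
    # => "01101000100011010010"  (20 bits for the 9-element example)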
In your data you have:
14 1s (3.89% of the data)
4 2s (1.11%)
1 each of 3, 4 and 5 (0.28% each)
339 0s (94.17%)
Assuming that your numbers are independent of each other and you do not have any other information, the total entropy of your data is 0.407 bits per number, that is about 146.4 bits overall (18.3 bytes). So it is impossible to encode it in 8 bytes.
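A quick check of that arithmetic (my snippet, not part of the original answer):

    # Empirical (zeroth-order) entropy of the 360-element example array.
    counts = { 0 => 339, 1 => 14, 2 => 4, 3 => 1, 4 => 1, 5 => 1 }
    total = counts.values.sum.to_f
    entropy = counts.values.sum { |c| -(c / total) * Math.log2(c / total) }
    puts entropy            # ~ 0.407 bits per number
    puts entropy * total    # ~ 146.4 bits, i.e. about 18.3 bytes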
What's the name of this algorithm/routine?
I am writing a utility class which converts strings from one alphabet to another. This is useful in situations where you have a target alphabet you wish to use, with a restriction on the number of characters available. For example, if you can use lower case letters and numbers, but only 12 characters, it's possible to compress a timestamp from the alphabet 0123456789 -: into abcdefghijklmnopqrstuvwxyz0123456789, so 2010-10-29 13:14:00 might become 5hhyo9v8mk6avy (19 characters reduced to 14). The class is designed to convert back and forth between alphabets, and also to calculate the longest source string that can safely be stored in a target alphabet given a particular number of characters. I was thinking of publishing this through Google Code; however, I'd obviously like other people to find it and use it - hence the question on what this is called. I've had to use this approach in two separate projects, with Bloomberg and a proprietary system, when you need to generate unique file names of a certain length but want to keep some plaintext, so GUIDs aren't appropriate.
Your example bears some similarity to a dictionary coder with fixed target and source dictionaries. Also worth looking at is Fibonacci coding, which has a fixed target dictionary of variable-length bit codes. I think it also depends on whether it is very important that your target alphabet has fixed-width entries - if you allow a fixed alphabet with variable-length codes, your compression ratio will approach the entropy that much more closely! If the source alphabet distribution is known in advance, a static Huffman tree could easily be generated.
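Since Fibonacci coding may be unfamiliar, here is a small sketch of it (my code, not from the answer): every positive integer is written as a sum of non-consecutive Fibonacci numbers, the bits are listed from the smallest term up, and a final 1 terminates the codeword:

    # Hedged sketch of Fibonacci (Zeckendorf) coding for n >= 1.
    def fibonacci_code(n)
      fibs = [1, 2]
      fibs << fibs[-1] + fibs[-2] while fibs[-1] + fibs[-2] <= n
      fibs.pop while fibs.last > n      # keep only Fibonacci numbers <= n
      bits = Array.new(fibs.length, 0)
      fibs.length.downto(1) do |i|      # greedy decomposition from the largest term
        next unless fibs[i - 1] <= n
        bits[i - 1] = 1
        n -= fibs[i - 1]
      end
      bits.join + "1"                   # the appended "1" creates the "11" pattern that ends every codeword
    end

    (1..5).each { |n| puts "#{n}: #{fibonacci_code(n)}" }
    # 1: 11, 2: 011, 3: 0011, 4: 1011, 5: 00011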
Here is a simple algorithm: consider that you don't have to transmit the alphabet used for encoding. Also, you don't use (and transmit) the probabilities of the input symbols, as in standard compression, so we just re-encode the data somehow. In this case we can consider the input data to be a number represented in a base equal to the cardinality of the input alphabet. We just have to change its representation to another base, which is a simple task.
EDITED example: input alphabet ABC, output alphabet 0123456789. The message ABAC translates to 0102 in base 3, that is 11 (9 + 2) in base 10, which written in the output alphabet is simply 11.
We could have a problem decoding it, because we don't know how many 0s to use at the beginning of the decoded result, so we have to use one of these modifications: 1) encode the size of the compressed data somewhere in the stream; 2) use a dummy 1 at the start of the stream. With the second option our example becomes: 10102 (base 3) = 81 + 9 + 2 = 92 (base 10). Now after decoding we just have to ignore the first 1 (this also provides basic error detection).
The main problem of this approach is that in most cases (GCD == 1) each new encoded character will completely change the output. This would be very inefficient and difficult to implement, and we end up with arithmetic coding as the best solution (actually a simplified version of it).
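A minimal sketch of that base-change idea (my code, with assumed names), including the "dummy 1" trick so leading symbols are not lost:

    # Hedged sketch: treat the message as a number in base |src_alphabet|,
    # prefix a dummy 1, and re-express it in base |dst_alphabet|.
    def transcode(message, src_alphabet, dst_alphabet)
      src_base = src_alphabet.length
      dst_base = dst_alphabet.length
      value = message.chars.reduce(1) { |acc, ch| acc * src_base + src_alphabet.index(ch) }
      out = ""
      while value > 0
        out.prepend(dst_alphabet[value % dst_base])
        value /= dst_base
      end
      out
    end

    puts transcode("ABAC", "ABC", "0123456789")   # => "92", as in the example above

Decoding runs the same conversion in the other direction and drops the leading dummy symbol.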
You probably know about Base64, which does the same thing, just usually the other way around. Too bad there are way too many Google results for BaseX or BaseN...