What is the name of this text compression scheme? - algorithm

A couple years ago I read about a very lightweight text compression algorithm, and now I can't find a reference or remember its name.
It used the difference between each successive pair of characters. Since, for example, a lowercase letter predicts that the next character will also be a lowercase letter, the differences tend to be small. (It might have thrown out the low-order bits of the preceding character before subtracting; I cannot recall.) Instant complexity reduction. And it's Unicode friendly.
Of course there were a few bells and whistles, and the details of producing a bitstream, but it was super lightweight and suitable for embedded systems. No hefty dictionary to store. I'm pretty sure that the summary I saw was on Wikipedia, but I cannot find anything.
I recall that it was invented at Google, but it was not Snappy.

I think what you're on about is BOCU, Binary-Ordered Compression for Unicode or one of its predecessors/successors. In particular,
The basic structure of BOCU is simple. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF to 10FFFF. The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.

Related

What for people sometimes convert numbers or strings to bytes?

Sometimes I encounter questions about converting sth to bytes. Are anything existing where it is vitally important to convert to bytes or what for could I convert sth to bytes?
In most languages the most common string functions come as part of the language or in a library/include/import that comes pre-made, often employing object code to take advantage of processor based strings functions, however, sometimes you need to do something with a string that isnt natively supported by the language so since 8-bit days, people have viewed strings as an array of 7 or 8-bit characters, which fit within a byte and use conventions like ASCII to determine which byte value represents which character.
While standard languages often have functions like "string.replaceChar(OFFSET,'a')" this methodology can be painstaking slow because each call to the replaceChar method results in processing overhead which may be greater than the processing needing to be done.
There is also the simplicity factor when designing your own string algorithms but like I said, most of the common algorithms come prebuilt in modern languages. (stringCompare, trimString, reverseString, etc).
Suppose you want to perform an operation on a string which doesnt come as standard.
Suppose you want to add two numbers which are represented in decimal digits in strings and the size of these numbers are greater than the 64-bit bus size of the processor? The RSA encryption/descryption behind the SSL browser padlocks employs the use of numbers which dont fit into the word size of a desktop computer but none the less the programs on a desktop which deal with RSA certificates and keys must be able to process these data which are actually strings.
There are many and varied reasons you would want to deal with string, as an array of bytes but each of these reasons would be fairly specialised.

What are the design decisions behind Google Maps encoded polyline algorithm format?

Several Google Maps products have the notion of polylines, which in terms of underlying data is basically just a sequence of lat/lng points that might for example manifest in a line drawn on a map. The Google Map developer libraries make use of an encoded polyline format that churns out an ASCII string representing the points making up the polyline. This encoded format is then typically decoded with a built in function of the Google libraries or a function written by a third party that implements the decoding algorithm.
The algorithm for encoding polyline points is described in the Encoded Polyline Algorithm Format document. What is not described is the rationale for implementing the algorithm this way, and the significance of each of the individual steps. I'm interested to know whether the thinking/purpose behind implementing the algorithm this way is publicly described anywhere. Two example questions:
Do some of the steps have a quantifiable impact on compression and how does this impact vary as a function of the delta between points?
Is the summing of values with ASCII 63 a compatibility hack of some sort?
But just in general, a description to go along with the algorithm explaining why the algorithm is implemented the way it is.
Update: This blog post from James Snook also has the 'valid ascii' range argument and reads logically for other steps I wondered. E.g. the left shifting before storing which makes place for the negative bit as the first bit.
Some explanations I found, not sure if everything is 100% correct.
One double value is stored in multiple 5 bits chunks and 0x20 (binary '0010 0000') is used as indication that the next 5 bit entry belongs to the current double.
0x1f (binary '0001 1111') is used as bit mask to throw away other bits
I expect that 5 bits are used because the delta of lat or lons are in this range. So that every double value takes only 5 bits on average when done for a lot of examples (but not verified yet).
Now, compression is done by assuming nearby double values are very close and creating the difference is nearly 0, so that the results fits in a few bytes. Then this result is stored in a dynamic fashion: store 5 bits and if the value is longer mark with 0x20 and store the next 5 bits and so on. So I guess you can tweak the compression if you try 6 or 4 bits but I guess 5 is a practically reasonable choice.
Now regarding the magic 63, this is 0x3f and binary 0011 1111. I'm not sure why they add it. I thought that adding 63 will give some 'better' asci characters (e.g. allowed in XML or in URL) as we skip e.g. 62 which is > but 63 which is ? is really better? At least the first ascii chars are not displayable and have to be avoided. Note that if one would use 64 then one would hit the ascii char 127 for the maximum value of 31 (31+64+32) and this char is not defined in html4. Or is because of a signed char is going from -128 to 127 and we need to store the negative numbers as positive, thus adding the maximum possible negative number?
Just for me: here is a link to an official Java implementation with Apache License

Why Huffman Coding is good?

I am not asking how Huffman coding is working, but instead, I want to know why it is good.
I have the following two questions:
Q1
I understand the ultimate purpose of Huffman coding is to give certain char a less bit number, so space is saved. What I don't understand is that why the decision of number of bits for a char can be related to the char's frequency?
Huffman Encoding Trees says
It is sometimes advantageous to use variable-length codes, in which
different symbols may be represented by different numbers of bits. For
example, Morse code does not use the same number of dots and dashes
for each letter of the alphabet. In particular, E, the most frequent
letter, is represented by a single dot.
So in Morse code, E can be represented by a single dot because it is the most frequent letter. But why? Why can it be a dot just because it is most frequent?
Q2
Why the probability / statistics of the chars are so important to Huffman coding?
What happen if the statistics table is wrong?
If you assign less number or bits or shorter code words for most frequently used symbols you will be saving a lot of storage space.
Suppose you want to assign 26 unique codes to English alphabet and want to store an english novel ( only letters ) in term of these code you will require less memory if you assign short length codes to most frequently occurring characters.
You might have observed that postal code and STD codes for important cities are usually shorter ( as they are used very often ). This is very fundamental concept in Information theory.
Huffman encoding gives prefix codes.
Construction of Huffman tree:
A greedy approach to construct Huffman tree for n characters is as follows:
places n characters in n sub-trees.
Starts by combining the two least weight nodes into a tree which is assigned the sum of the two leaf node weights as the weight for its root node.
Do this until you get a single tree.
For example consider below binary tree where E and T have high weights ( as very high occurrence )
It is a prefix tree. To get the Huffman code for any character, start from the node corresponding to the the character and backtrack till you get the root node.
Indeed, an E could be, say, three dashes followed by two dots. When you make your own encoding, you get to decide. If your goal is to encode a certain text so that the result is as short as possible, you should choose short codes for the most frequent characters. The Huffman algorithm ensures that we get the optimal codes for a specific text.
If the frequency table is somehow wrong, the Huffman algorithm will still give you a valid encoding, but the encoded text would be longer than it could have been if you had used a correct frequency table. This is usually not a problem, because we usually create the frequency table based on the actual text that is to be encoded, so the frequency table will be "perfect" for the text that we are going to encode.
well.. you want assign shorter codes to the symbols which appear more frequently... huffman encoding works just by this simple assumption.. :-)
you compute the frequency of all symbols, sort them all, and start assigning bit codes to each one.. the more frequent a symbol is, the shorter the code you'll assign to it.. simple as this.
the big question is: how large the window in which we compute such frequencies should be? should it be as large as the entire file? or should it be smaller? and if the latter apply, how large? Most huffman encoding have some sort of "test-run" in which they estimate the best window size a little bit like TCP/IP do with its windows frame sizes.
Huffman codes provide two benefits:
they are space efficient given some corpus
they are prefix codes
Given some set of documents for instance, encoding those documents as Huffman codes is the most space efficient way of encoding them, thus saving space. This however only applies to that set of documents as the codes you end up are dependent on the probability of the tokens/symbols in the original set of documents. The statistics are important because the symbols with the highest probability (frequency) are given the shortest codes. Thus the symbols most likely to be in your data use the least amount of bits in the encoding, making the coding efficient.
The prefix code part is useful because it means that no code is the prefix of another. In morse code for instance A = dot dash and J = dot dash dash dash, how do you know where to break reading the code. This increases the inefficiency of transmitting data using morse as you need a special symbol (pause) to signify the end of transmission of one code. Compare that to Huffman codes where each code is unique, as soon as you discover the encoding for a symbol in the input, you know that that is the transmitted symbol because it is guaranteed not to be the prefix of some other symbol.
It's the dual effect of having the most frequent characters using the shortest bit sequences that gives you the savings.
For a concrete example, let's say you have a piece of text that consists of 1024 e characters and 1024 of all other characters combined.
With 8 bits for code, that's a full 2048 bytes used in uncompressed form.
Now let's say we represent e as a single 1-bit and every other letter as a 0-bit followed by its original 8 bits (a very primitive form of Huffman).
You can see that half the characters have been expanded from 8 bits to 9, giving 9216 bits, or 1152 bytes. However, the e characters have been reduced from 8 bits to 1, meaning they take up 1024 bits, or 128 bytes.
The total bytes used is therefore 1152 + 128, or 1280 bytes, representing a compression ratio of 62.5%.
You can use a fixed encoding scheme based on the likely frequencies of characters (such as English text), or you can use adaptive Huffman encoding which changes the encoding scheme as characters are processed and frequencies are adjusted. While the former may be okay for input which has high probability of matching frequencies, the latter can adapt to any input.
Statistic table can't be wrong, because in general Huffman algorithm, analyze hole text at the beginning, and builds frequent-statistics of the given text, while Morse has a static symbol -code map.
Huffman algorithm uses the advantage of a given text. As an example, if E is most frequent letter in English in general, that doesn't mean that E is most frequent in a given text for a given author.
Another advantage of Huffman algorithm is that you can use it for any alphabet starting from [0, 1] finished Chinese hieroglyphs, while Morse is defined only for English letters
So in Morse code, "E" can be represented by a single dot, because it is the most frequent letter. But why? Why is it a dot because of its frequency?
"E" can be encoded to any unique code for a specific code dictionary, so it can be "0", we choose it to be short to save memory, so the average bytes used after encode is minimized.
Why is the probability / statistics of the chars so important to Huffman coding? What happens if the statistics table is wrong?
why do we encode? save space right? Space used after encode is freq(wordi)*Length(wordi), it is what we should try to minimize, so we choose to assign words with high prob short code greedly to save space.
If the statistics table is wrong, then the encoding is not the best way to save space.

encoding most efficient way 64 character sequence for lesser writing time to memory

The problem is as follows: Given a 64 charater sequences which is built from the english alphabet having 26 charcaters therefore just case characters, the occurrence distribution is such that any character has an equal chance of occurring at a given time.
Due to the fact that I have some computation which needs to be done with regards to the sequences, which requires writing to a text files, since the amount of sequences goes beyond a given ram. I thought of encoding a sequence such that I would be able to have lesser amount of bytes to write to a text file per given sequence.
With such reasoning I thought of the L-Z which would allow me to go down to 40 bytes. Is there any way by which i can go lower to encode a 64 character sequence?
With a large(-ish) lookup table you could encode each of the possible 26^64 character sequences in 301 (actually 300.8281==log2(26^64)) bits. This is slightly less than the 320 bits your straightforward compression would use. It is also the theoretical minimum given that any of the 26 characters occurs with equal probability.
Since you could derive the lookup table at any time you don't even need to store it. I suppose the bits used to represent the functions to encode a character string into a 301-bit integer and vice-versa ought to be counted into your compression ratio.
This is, of course, a long-winded restatement of #lhf's comment.

Decimal Presentation : different purpose between zoned decimal and packed decimal

I'm learning Computer Science course and when I read to these definition, I understand. But I don't know what different purpose of two presentations and why.
Here some short explanation of purpose that my book said:
Zone decimal : hightly compatible with text data.
Packed decimal : faster computing speed.
Something I want to know is:
1) in zone decimal presentation there is a zone section that duplicate every digit. Why ? I see this is no purpose :(
2) why they say zone decimal is compatible with text data and why packed decimaal is faster.
Thanks :)
Firstly - where are you learning CS? Those terms are from the 1960s, the more common name is BCD (Binary Coded Decimal)
Zone decimal uses an entire byte for each digit. This means you can just print a number as if it was text (each 'character' stores a digit 0-9) but since there are only 10 digits and a byte can hold 256 different values this is a bit wasteful.
Packed decimal uses the fact that 4bits can store 16different values. So you can store two digits in a byte (top 4bits and bottom 4bits). This is still a bit wasteful since you only use half the capacity. But it's pretty easy to extract the two digits with just shift and mask operations.
Pretty much the only place you would see BCD these days is in some low level hardware where you want to read/x-mit a digit without using a microprocessor at all. It's easy to make a BCD counter just in transistors
but if you want to do any maths you either have to do long multiplication on each digit like you would on paper - or convert into regular ints and back again
Both of these representations have fallen out of favor, perhaps because they are not directly supported by C, and hence all of the systems descended from Unix.
Packed decimal has an advantage in two respects: since takes up less space it can get off the bus and into the processor faster, and many CISC instruction sets have dedicated instructions for arithmetic. To quote from http://en.wikipedia.org/wiki/Packed_decimal#Packed_BCD:
Packed BCD [binary coded decimal] is supported in the COBOL programming language as the
"COMPUTATIONAL-3" (an IBM extension adopted by many other compiler
vendors) or "PACKED-DECIMAL" (part of the 1985 COBOL standard) data
type. Besides the IBM System/360 and later compatible mainframes,
packed BCD was implemented in the native instruction set of the
original VAX processors from Digital Equipment Corporation and was the
native format for the Burroughs Corporation Medium Systems line of
mainframes (descended from the 1950s Electrodata 200 series).
Zoned decimal (http://en.wikipedia.org/wiki/Zoned_decimal#Zoned_decimal) has an easy mapping between characters on punch cards and their representation in memory, which perhaps explains your textbook's claim that it is "highly compatible with text data." As the Wikipedia article suggests, it's a term more used in IBM mainframe circles. On minis, we tended to just call it plain old decimal, PIC 9 data.
"Zoned Decimal" in its natural environment is meant to be compatable with the EBCDIC char set .
ASCII represents numbers as x'3x' -- x'39' which display as character "0" to "9".
The EBCDIC character sets (which has its origins in Hollerith pucnched cards) uses a similar but different scheme where x'F0' is displayed as characer "0' and x'F9' is displayed as character '9'.
Punched cards had a fixed length of 80 characters in many cases 10 or 12 of these characters were eaten up with record type identifiers and sequence numbers (desperately important if you dropped a bunch of cards on the floor!). So space was at a premium. Rather than enter a "+" or "-" character next to each number an "overpunch" extra holes near the top bit of the card was used to represent a positive or negative numbers, so saving a byte.
These overpunched characters were encdoded in EBCDIC as x"D0' to x'D9" for -0 to -9 and x'C0' to x'C9' for +0 to +9 usually in the last digit of the number.
Hence the "Zoned Decimal" format. The first four bits of each byte are the Zone, the second four bits the "number" to -42 was encoded as x'F4D2'.
This is more of a convention than anything else as the computer could not do anything with this format. So it needed to be encoded into "packed" format before any calculations took place. This is pretty easy s 'X'F4D2' -> x'042D' is mostly a case a grabbing the last zone then extracting the "numeric" four bits from each byte, which, could then be converted to binary.
When IBM mainframes were designed the largest group of users were banks, insurance companies and utility companies. The bulk of there processing followed this pattern.
read punch card.
read tape record.
add monthly payment to balance
store new balance on tape
print new balance
Most of the calculations involved currency amounts and most of the results were displayed immediately. It became clear that if the machine could do the arithmetic directly on the packed decimal values you could avoid several expensive "convert to binary" and "convert to decimal" instructions. As a bonus it made it easy to place the decimal point at the correct position and perform any decimal rounding. So a great deal of work went into implementing native packed decimal instructions (zero, add, subtract, multiply, divide, shift and round etc.).
This has been the preferred currency format for IBM mainframes ever since.
For many years developers on other platforms poured scorn on the mainframers for using such an archaic format, and, only recently began to realize how difficult it was to do fixed point decimal arithmetic to the standards accountants and tax collectors expect. Thanks to the efforts of Mike_Cowlishaw and others the rest of the world has caught up with the venerable IBM 360 and Java programmers can now calculate sales tax correctly using the BigDecimal library which is based on a variation on the old packed decimal format.

Resources