I'm taking a Computer Science course, and I understand these definitions when I read them. But I don't know what the different purposes of the two representations are, or why.
Here is the short explanation of their purposes that my book gives:
Zone decimal: highly compatible with text data.
Packed decimal: faster computing speed.
What I want to know is:
1) In the zone decimal representation there is a zone section that is duplicated for every digit. Why? I can't see any purpose in this :(
2) Why do they say zone decimal is compatible with text data, and why is packed decimal faster?
Thanks :)
Firstly, where are you learning CS? Those terms are from the 1960s; the more common name is BCD (Binary Coded Decimal).
Zone decimal uses an entire byte for each digit. This means you can just print a number as if it was text (each 'character' stores a digit 0-9) but since there are only 10 digits and a byte can hold 256 different values this is a bit wasteful.
Packed decimal uses the fact that 4 bits can store 16 different values. So you can store two digits in a byte (top 4 bits and bottom 4 bits). This is still a bit wasteful, since each 4-bit half only uses 10 of its 16 possible values. But it's pretty easy to extract the two digits with just shift and mask operations.
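To make the shift-and-mask idea concrete, here is a minimal Python sketch. The function names pack_bcd and unpack_bcd are made up for illustration, and the sketch handles unsigned digits only (real packed decimal formats usually also carry a sign nibble).

```python
def pack_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits two per byte, high nibble first."""
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected a non-empty string of decimal digits")
    if len(digits) % 2:
        digits = "0" + digits  # pad to an even number of digits
    out = bytearray()
    for i in range(0, len(digits), 2):
        high = ord(digits[i]) - ord("0")
        low = ord(digits[i + 1]) - ord("0")
        out.append((high << 4) | low)  # top 4 bits and bottom 4 bits
    return bytes(out)


def unpack_bcd(data: bytes) -> str:
    """Extract the two digits of each byte with a shift and a mask."""
    digits = []
    for byte in data:
        digits.append(str(byte >> 4))    # top 4 bits
        digits.append(str(byte & 0x0F))  # bottom 4 bits
    return "".join(digits)
```

For example, pack_bcd("1234") gives the two bytes 0x12 and 0x34, and unpack_bcd turns them back into "1234".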
Pretty much the only place you would see BCD these days is in some low-level hardware where you want to read/transmit a digit without using a microprocessor at all. It's easy to make a BCD counter just in transistors.
But if you want to do any maths, you either have to do long multiplication on each digit like you would on paper, or convert into regular ints and back again.
Both of these representations have fallen out of favor, perhaps because they are not directly supported by C, and hence all of the systems descended from Unix.
Packed decimal has an advantage in two respects: since it takes up less space it can get off the bus and into the processor faster, and many CISC instruction sets have dedicated instructions for packed decimal arithmetic. To quote from http://en.wikipedia.org/wiki/Packed_decimal#Packed_BCD:
Packed BCD [binary coded decimal] is supported in the COBOL programming language as the
"COMPUTATIONAL-3" (an IBM extension adopted by many other compiler
vendors) or "PACKED-DECIMAL" (part of the 1985 COBOL standard) data
type. Besides the IBM System/360 and later compatible mainframes,
packed BCD was implemented in the native instruction set of the
original VAX processors from Digital Equipment Corporation and was the
native format for the Burroughs Corporation Medium Systems line of
mainframes (descended from the 1950s Electrodata 200 series).
Zoned decimal (http://en.wikipedia.org/wiki/Zoned_decimal#Zoned_decimal) has an easy mapping between characters on punch cards and their representation in memory, which perhaps explains your textbook's claim that it is "highly compatible with text data." As the Wikipedia article suggests, it's a term more used in IBM mainframe circles. On minis, we tended to just call it plain old decimal, PIC 9 data.
"Zoned Decimal" in its natural environment is meant to be compatible with the EBCDIC character set.
ASCII represents numbers as x'30' to x'39', which display as characters "0" to "9".
The EBCDIC character set (which has its origins in Hollerith punched cards) uses a similar but different scheme, where x'F0' is displayed as character "0" and x'F9' is displayed as character "9".
Punched cards had a fixed length of 80 characters. In many cases 10 or 12 of these characters were eaten up by record type identifiers and sequence numbers (desperately important if you dropped a bunch of cards on the floor!), so space was at a premium. Rather than entering a "+" or "-" character next to each number, an "overpunch" (extra holes near the top of the card) was used to mark a number as positive or negative, saving a byte.
These overpunched characters were encoded in EBCDIC as x'D0' to x'D9' for -0 to -9 and x'C0' to x'C9' for +0 to +9, usually in the last digit of the number.
Hence the "Zoned Decimal" format. The first four bits of each byte are the zone and the second four bits are the "number", so -42 was encoded as x'F4D2'.
This is more of a convention than anything else, as the computer could not do anything with this format. So it needed to be converted into "packed" format before any calculations took place. This is pretty easy, as x'F4D2' -> x'042D' is mostly a case of grabbing the last zone and then extracting the "numeric" four bits from each byte, which could then be converted to binary.
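The x'F4D2' -> x'042D' conversion can be sketched in a few lines of Python. The function names are hypothetical, and treating sign nibbles x'B' and x'D' as negative and everything else as positive is an assumption based on the usual IBM convention, not something stated above.

```python
def zoned_to_packed(zoned: bytes) -> bytes:
    """Convert zoned decimal bytes to packed decimal (sign nibble last)."""
    if not zoned:
        raise ValueError("expected at least one zoned digit")
    digits = [b & 0x0F for b in zoned]  # the "numeric" four bits of each byte
    sign = zoned[-1] >> 4               # the zone of the last byte carries the sign
    nibbles = digits + [sign]
    if len(nibbles) % 2:
        nibbles.insert(0, 0)            # pad with a leading zero nibble
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))


def packed_to_int(packed: bytes) -> int:
    """Convert packed decimal to a binary integer."""
    nibbles = []
    for b in packed:
        nibbles += [b >> 4, b & 0x0F]
    sign = nibbles.pop()                # the last nibble is the sign
    value = 0
    for d in nibbles:
        value = value * 10 + d
    # Assumption: x'B' and x'D' mean negative, anything else positive.
    return -value if sign in (0x0B, 0x0D) else value
```

With these, zoned_to_packed(bytes.fromhex("F4D2")) yields the bytes x'042D', and packed_to_int on that result yields -42.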
When IBM mainframes were designed, the largest group of users were banks, insurance companies and utility companies. The bulk of their processing followed this pattern:
read punch card.
read tape record.
add monthly payment to balance
store new balance on tape
print new balance
Most of the calculations involved currency amounts and most of the results were displayed immediately. It became clear that if the machine could do the arithmetic directly on the packed decimal values you could avoid several expensive "convert to binary" and "convert to decimal" instructions. As a bonus it made it easy to place the decimal point at the correct position and perform any decimal rounding. So a great deal of work went into implementing native packed decimal instructions (zero, add, subtract, multiply, divide, shift and round etc.).
This has been the preferred currency format for IBM mainframes ever since.
For many years developers on other platforms poured scorn on the mainframers for using such an archaic format, and only recently began to realize how difficult it is to do fixed-point decimal arithmetic to the standards accountants and tax collectors expect. Thanks to the efforts of Mike Cowlishaw and others, the rest of the world has caught up with the venerable IBM 360, and Java programmers can now calculate sales tax correctly using the BigDecimal library, which is based on a variation on the old packed decimal format.
I have an actually very easy question about the IEEE-754 standard in which numbers are coded and saved on the computer.
At uni (exams) I have come across the following definition for 16-bit IEEE-754-format (half precision): 1 sign bit, 6 exponent bits & 9 mantissa bits.
An internet search (or books) reveal another definition:
1 sign bit, 5 exponent bits & 10 mantissa bits
The reason why I’m asking is that I cannot believe the uni might have made such a simple mistake, so are there multiple definitions for numbers given in 16-bit IEEE-754 format?
Conforming to an IEEE standard is voluntary. People are free to use other formats. The IEEE-754 standard specifies a binary16 format that uses 1 bit for the sign, 5 bits for the exponent, and 10 bits for the primary significand encoding.
People may use other formats because they want more or less precision in the significand or range in the exponent.
Textbooks and academic exercises often use non-standard formats for the purpose of inducing students to reason about them on their own rather than looking up answers or learning existing formats by rote.
If the hardware you are using supports a 16-bit floating-point format, the binding specification for that format is in the hardware documentation, not in the IEEE-754 standard.
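For reference, here is a small Python sketch that decodes the standard binary16 layout mentioned above (1 sign bit, 5 exponent bits, 10 significand bits) by hand and cross-checks it against the "e" format code of Python's struct module. The function names are made up for illustration; a 1/6/9 layout like the exam's would need different shifts, masks and bias.

```python
import struct


def decode_binary16(bits: int) -> float:
    """Decode a 16-bit pattern using the IEEE-754 binary16 layout."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F  # 5 exponent bits, bias 15
    fraction = bits & 0x3FF         # 10 significand bits
    if exponent == 0x1F:
        value = float("inf") if fraction == 0 else float("nan")
    elif exponent == 0:
        value = fraction * 2.0 ** -24  # subnormal: (fraction / 1024) * 2**-14
    else:
        value = (1 + fraction / 1024) * 2.0 ** (exponent - 15)
    return -value if sign else value


def decode_with_struct(bits: int) -> float:
    """Same decoding, delegated to struct's half-precision 'e' format."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]
```

For example, the pattern 0x3C00 decodes to 1.0 and 0xC000 decodes to -2.0 under this layout.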
Sometimes I encounter questions about converting something to bytes. Is there anything for which it is vitally important to convert to bytes, or what could I convert something to bytes for?
In most languages, the most common string functions come as part of the language or in a pre-made library/include/import, often employing object code to take advantage of processor-based string functions. However, sometimes you need to do something with a string that isn't natively supported by the language. So since 8-bit days, people have viewed strings as arrays of 7- or 8-bit characters, each of which fits within a byte, and used conventions like ASCII to determine which byte value represents which character.
While standard languages often have functions like "string.replaceChar(OFFSET,'a')", this approach can be painstakingly slow, because each call to the replaceChar method incurs processing overhead that may be greater than the processing that actually needs to be done.
There is also the simplicity factor when designing your own string algorithms but, like I said, most of the common algorithms come prebuilt in modern languages (stringCompare, trimString, reverseString, etc.).
Suppose you want to perform an operation on a string which doesn't come as standard.
Suppose you want to add two numbers which are represented as decimal digits in strings, and the size of these numbers is greater than the 64-bit word size of the processor. The RSA encryption/decryption behind the SSL browser padlocks uses numbers which don't fit into the word size of a desktop computer, but nonetheless the programs on a desktop which deal with RSA certificates and keys must be able to process these data, which are actually strings.
There are many and varied reasons you would want to deal with a string as an array of bytes, but each of these reasons would be fairly specialised.
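As a rough illustration of the "add two numbers stored as decimal digit strings" case, here is a Python sketch that works directly on the ASCII bytes. The name add_decimal_strings is made up, and it assumes both inputs are non-negative and contain only the characters 0-9. (Python's own int is already arbitrary precision, so this is purely to show the byte-level technique.)

```python
def add_decimal_strings(a: str, b: str) -> str:
    """Add two non-negative decimal numbers given as digit strings."""
    x = a.encode("ascii")  # each digit becomes one byte, 0x30..0x39
    y = b.encode("ascii")
    result = bytearray()
    carry = 0
    i, j = len(x) - 1, len(y) - 1
    # Walk from the last (least significant) digit, as on paper.
    while i >= 0 or j >= 0 or carry:
        da = x[i] - 0x30 if i >= 0 else 0
        db = y[j] - 0x30 if j >= 0 else 0
        carry, digit = divmod(da + db + carry, 10)
        result.append(0x30 + digit)
        i -= 1
        j -= 1
    result.reverse()
    return result.decode("ascii") or "0"
```

For example, adding "18446744073709551615" (the largest unsigned 64-bit value) and "1" gives "18446744073709551616", which no longer fits in 64 bits.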
I'd like to know how I can compress a string into fewer characters using a shell script. The goal is to take a Mac's serial number and MAC address then compress those values into a 14 character string. I'm not sure if this is possible, but I'd like to hear if anyone has any suggestions.
Thank you
Your question is way too vague to result in a detailed answer.
Given your restriction of a 14 character string output, you won't be able to use "real" compression (like zip), due to the overhead. This leaves you with simple algorithms, like RLE or bit concatenation.
If by "string" you mean "printable string", i.e. only about 62 or so values are usable in a character (depending on the exact printable set you choose), then you have an additional space constraint.
A handy trick you could use with the MAC address part is, since it belongs to an Apple device, you already know that the first three values (AA:BB:CC) are one of 297 combinations, so you could save 6 characters (plus 2 for the colons) worth of information into 2+ characters (depending on your output character set, see above).
The remaining three MAC address values are base-16 (0-9, A-F), so you could "compress" this information slightly as well.
A similar analysis can be done for the Mac serial number (which values can it take? how much space can be saved?).
The effort to do this in bash would be disproportionate though. I'd highly recommend a C (or other programming language) approach.
Cheating answer
Get someone at Apple to give you access to the database I'm assuming they have which matches devices' serial numbers to MAC addresses. Then you can just store the MAC address and look it up in the database whenever you need the serial number. The 64-bit MAC address can easily be stored in 12 characters with standard base64 encoding.
Frustrating answer
You have to make some unreliable assumptions just to make this approachable. You can fix the assumptions later, but I don't know if it would still fit in 14 characters. Personally, I have no idea why you want to save space by reprocessing the serial and MAC numbers, but here's how I'd start.
Simplifying assumptions
Apple will never use MAC address prefixes beyond the 297 combinations mentioned in Sir Athos' answer.
The "new" Mac serial number format in this article from 2010 is the only format Apple has used or ever will use.
Core concepts of encoding
You're taking something which could have n possible values and you're converting it into something else with n possible values.
There may be gaps in the original's possible values, such as if Apple cancels building a manufacturing plant after already assigning it a location code.
There may be gaps in your encoded form's possible values, perhaps in anticipation of Apple doing things that would fill the gaps.
Abstract integer encoding
Break apart the serial number into groups as "PPP Y W SSS CCCC" (like the article describes)
Make groups for the first 3 bytes and last 5 bytes of the MAC address.
Translate each group into a number from 0 to n-1, where n is the number of possible values for something in the group. As far as I can tell from the article, the values are n_P=36^3, n_Y=20, n_W=27, n_S=3^3, and n_C=36^4. The first 3 MAC bytes have 297 values and the last 5 have 2^(8*5)=2^40 values.
Set a variable, i, to the value of the first group's number.
For each remaining group's number, multiply i by the number of values possible for the group, and then add the number to i.
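The multiply-and-add steps above amount to a mixed-radix number. A minimal Python sketch, with hypothetical function names and made-up group sizes in the example, might look like this:

```python
def combine_groups(values, sizes):
    """Fold per-group numbers into one integer (mixed-radix encoding).

    values[k] must be in range(sizes[k]); sizes[k] is the number of
    possible values for group k.
    """
    if len(values) != len(sizes):
        raise ValueError("values and sizes must have the same length")
    i = 0
    for value, size in zip(values, sizes):
        if not 0 <= value < size:
            raise ValueError("group value out of range")
        i = i * size + value  # multiply by the group's size, then add its number
    return i


def split_groups(i, sizes):
    """Inverse of combine_groups: recover the per-group numbers."""
    values = []
    for size in reversed(sizes):
        i, value = divmod(i, size)
        values.append(value)
    values.reverse()
    return values
```

For example, with three groups of sizes 20, 27 and 297 and group numbers 3, 26 and 100, the combined integer is (3 * 27 + 26) * 297 + 100 = 31879, and split_groups(31879, [20, 27, 297]) returns the original numbers.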
Base n encoding
Make a list of n characters that you want to use in your final output.
Print the character in your list at index i%n.
Subtract the remainder (i%n) from the integer encoding and divide by n.
Repeat the previous two steps until the integer becomes 0.
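A Python sketch of these base-n steps follows. The 64-character ALPHABET is just one possible choice (an assumption, not a standard base64 alphabet requirement), and the function names are made up. Note that the steps emit the least significant character first, and the sketch keeps that order.

```python
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789-_")


def encode_base_n(i, alphabet=ALPHABET):
    """Encode a non-negative integer, least significant character first."""
    if i < 0:
        raise ValueError("expected a non-negative integer")
    n = len(alphabet)
    if i == 0:
        return alphabet[0]
    chars = []
    while i > 0:
        i, remainder = divmod(i, n)  # remainder is i % n; i becomes (i - remainder) / n
        chars.append(alphabet[remainder])
    return "".join(chars)


def decode_base_n(text, alphabet=ALPHABET):
    """Inverse of encode_base_n."""
    n = len(alphabet)
    i = 0
    for ch in reversed(text):
        i = i * n + alphabet.index(ch)
    return i
```

With this alphabet, encode_base_n(64) gives "AB", and any integer around 2 * 10^24 comes out as 14 characters.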
Result
This results in a total of 36^3 * 20 * 27 * 36 * 7 * 297 * 2^40 ~= 2 * 10^24 combinations. If you let n=64 for a custom base64 encoding (without any padding characters), then you can barely fit that into ceiling(log(2 * 10^24) / log(64)) = 14 characters. If you use all 95 printable ASCII characters, then you can fit it into ceiling(log(2 * 10^24) / log(95)) = 13 characters.
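The character counts can be double-checked with exact integer arithmetic instead of floating-point logarithms. This Python sketch simply takes the product quoted above as given; chars_needed and TOTAL are illustrative names.

```python
def chars_needed(combinations, alphabet_size):
    """Smallest k such that alphabet_size**k >= combinations."""
    if combinations < 1 or alphabet_size < 2:
        raise ValueError("need combinations >= 1 and alphabet_size >= 2")
    k = 0
    capacity = 1
    while capacity < combinations:
        capacity *= alphabet_size
        k += 1
    return k


# The product of group sizes as quoted in the answer.
TOTAL = 36**3 * 20 * 27 * 36 * 7 * 297 * 2**40
```

Calling chars_needed(TOTAL, 64) and chars_needed(TOTAL, 95) reproduces the 14 and 13 character figures.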
Fixing the assumptions
If you're trying to build something that uses this and are determined to make it work, here's what you need to do to make it solid, along with some tips.
Do the same analysis on every other serial number format you may care about. You might want to see if there's any redundant information between the serial and MAC numbers.
Figure out a way to distinguish between serial number formats. Adding an extra field at the end of the abstract number encoding can enable you to track which version it uses.
Think long and carefully about the format you're making. It's a lot easier to make changes before you're stuck with backwards compatibility.
If you can, use a language that's well suited for mapping between values, doing a lot of arithmetic, and handling big numbers. You may be able to do it in Bash, but it'd probably be easier in, say, Python.
Several Google Maps products have the notion of polylines, which in terms of underlying data is basically just a sequence of lat/lng points that might for example manifest in a line drawn on a map. The Google Maps developer libraries make use of an encoded polyline format that churns out an ASCII string representing the points making up the polyline. This encoded format is then typically decoded with a built-in function of the Google libraries or a function written by a third party that implements the decoding algorithm.
The algorithm for encoding polyline points is described in the Encoded Polyline Algorithm Format document. What is not described is the rationale for implementing the algorithm this way, and the significance of each of the individual steps. I'm interested to know whether the thinking/purpose behind implementing the algorithm this way is publicly described anywhere. Two example questions:
Do some of the steps have a quantifiable impact on compression and how does this impact vary as a function of the delta between points?
Is the summing of values with ASCII 63 a compatibility hack of some sort?
But just in general, a description to go along with the algorithm explaining why the algorithm is implemented the way it is.
Update: This blog post from James Snook also has the 'valid ASCII' range argument and reads logically for the other steps I wondered about, e.g. the left shift before storing, which makes room for the sign bit as the first bit.
Some explanations I found, not sure if everything is 100% correct.
One double value is stored in multiple 5-bit chunks, and 0x20 (binary '0010 0000') is used as an indication that the next 5-bit entry belongs to the current double.
0x1f (binary '0001 1111') is used as a bit mask to throw away the other bits.
I expect that 5 bits are used because the deltas of lats or lons are in this range, so that every double value takes only 5 bits on average when done for a lot of examples (but I have not verified this yet).
Now, compression is done by assuming nearby double values are very close, so the difference is nearly 0 and the result fits in a few bytes. Then this result is stored in a dynamic fashion: store 5 bits, and if the value is longer, mark it with 0x20 and store the next 5 bits, and so on. So I guess you can tweak the compression if you try 6 or 4 bits, but I guess 5 is a practically reasonable choice.
Now regarding the magic 63: this is 0x3f, binary 0011 1111. I'm not sure why they add it. I thought that adding 63 would give some 'better' ASCII characters (e.g. ones allowed in XML or in a URL), as we skip e.g. 62, which is >, but is 63, which is ?, really better? At least the first ASCII characters are not displayable and have to be avoided. Note that if one used 64, then one would hit ASCII character 127 for the maximum value of 31 (31+64+32), and this character is not defined in HTML4. Or is it because a signed char goes from -128 to 127 and we need to store the negative numbers as positive, thus adding the maximum possible negative number?
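To see how the pieces discussed above (the left shift, the 5-bit chunks, the 0x20 continuation flag, the 0x1f mask and the +63 offset) fit together, here is a compact Python sketch of the encoder as I understand it from the Encoded Polyline Algorithm Format document. The function names are my own, and Python's round() is used for the 1e5 scaling, which may differ from other implementations on exact .5 ties.

```python
def encode_value(value: int) -> str:
    """Encode one signed integer (a lat or lng delta, already scaled by 1e5)."""
    v = value << 1                # left shift makes room for the sign in bit 0
    if value < 0:
        v = ~v                    # invert negative values so v is non-negative
    chars = []
    while v >= 0x20:
        # Take the low 5 bits, set 0x20 to say "more chunks follow", add 63.
        chars.append(chr((0x20 | (v & 0x1F)) + 63))
        v >>= 5
    chars.append(chr(v + 63))     # last chunk: no continuation flag
    return "".join(chars)


def encode_polyline(points) -> str:
    """Encode (lat, lng) pairs as deltas from the previous point."""
    out = []
    prev_lat = prev_lng = 0
    for lat, lng in points:
        ilat, ilng = round(lat * 1e5), round(lng * 1e5)
        out.append(encode_value(ilat - prev_lat))
        out.append(encode_value(ilng - prev_lng))
        prev_lat, prev_lng = ilat, ilng
    return "".join(out)
```

Each emitted character is 63 plus a value from 0 to 63, so the output stays in the printable ASCII range from '?' (63) to '~' (126). Small deltas between nearby points produce only one or two characters each, which is where the compression comes from.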
Just for me: here is a link to an official Java implementation with Apache License
A couple years ago I read about a very lightweight text compression algorithm, and now I can't find a reference or remember its name.
It used the difference between each successive pair of characters. Since, for example, a lowercase letter predicts that the next character will also be a lowercase letter, the differences tend to be small. (It might have thrown out the low-order bits of the preceding character before subtracting; I cannot recall.) Instant complexity reduction. And it's Unicode friendly.
Of course there were a few bells and whistles, and the details of producing a bitstream, but it was super lightweight and suitable for embedded systems. No hefty dictionary to store. I'm pretty sure that the summary I saw was on Wikipedia, but I cannot find anything.
I recall that it was invented at Google, but it was not Snappy.
I think what you're on about is BOCU, Binary-Ordered Compression for Unicode or one of its predecessors/successors. In particular,
The basic structure of BOCU is simple. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF to 10FFFF. The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.
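To illustrate just the delta step from the quote, here is a simplified Python sketch. It is not BOCU-1 itself: the byte-level encoding of each delta is left out, and starting from a previous code point of 0 is an assumption made purely for the example. The function names are made up.

```python
def codepoint_deltas(text: str) -> list:
    """Signed difference between each code point and the one before it."""
    deltas = []
    prev = 0                      # assumed starting state for this sketch
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - prev)  # small when neighbours are in the same script
        prev = cp
    return deltas


def from_deltas(deltas) -> str:
    """Rebuild the text by accumulating the deltas."""
    chars = []
    prev = 0
    for d in deltas:
        prev += d
        chars.append(chr(prev))
    return "".join(chars)
```

For text in a single script, most deltas after the first are small in magnitude, which is what lets a variable-length byte encoding spend fewer bytes on them.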