Most of the ASCII codes under \x20 appear to be entirely obsolete. Are they used at all today? Can they be considered "up for grabs", or is it best to avoid them?
I need a delimiter for grouping "lines" together and it would sure be nice to co-opt one of these for that purpose.
From man ascii:
Oct Dec Hex Char
----------------------------------------------
000 0 00 NUL '\0'
001 1 01 SOH (start of heading)
002 2 02 STX (start of text)
003 3 03 ETX (end of text)
004 4 04 EOT (end of transmission)
005 5 05 ENQ (enquiry)
006 6 06 ACK (acknowledge)
007 7 07 BEL '\a' (bell)
010 8 08 BS '\b' (backspace)
011 9 09 HT '\t' (horizontal tab)
012 10 0A LF '\n' (new line)
013 11 0B VT '\v' (vertical tab)
014 12 0C FF '\f' (form feed)
015 13 0D CR '\r' (carriage ret)
016 14 0E SO (shift out)
017 15 0F SI (shift in)
020 16 10 DLE (data link escape)
021 17 11 DC1 (device control 1)
022 18 12 DC2 (device control 2)
023 19 13 DC3 (device control 3)
024 20 14 DC4 (device control 4)
025 21 15 NAK (negative ack.)
026 22 16 SYN (synchronous idle)
027 23 17 ETB (end of trans. blk)
030 24 18 CAN (cancel)
031 25 19 EM (end of medium)
032 26 1A SUB (substitute)
033 27 1B ESC (escape)
034 28 1C FS (file separator)
035 29 1D GS (group separator)
036 30 1E RS (record separator)
037 31 1F US (unit separator)
040 32 20 SPACE
First the easy part: There are no network transmission concerns in most modern systems. Current protocols handle almost any data - whether 7-bit ASCII, 8-bit ASCII, Unicode characters, image data or compiled programs - as binary data. That has not always been the case. Many older systems had issues transferring control codes and other "unprintable" characters and especially problems with 8-bit data. But those days are, fortunately, behind us. The one big exception is if you want to be able to copy/paste data via an HTML form - for that you want to leave out all control codes and other funny stuff.
You can, of course, make the format anything you like. However, some characters are still used pretty frequently:
000 0 00 NUL '\0' - does "nothing" but is hard for some text editors to handle
003 3 03 ETX (end of text) - Control-C - "break" in a lot of systems
007 7 07 BEL '\a' (bell) - Still makes a bell sound.
011 9 09 HT '\t' (horizontal tab) - A lot of text editors and file formats use this to set a fixed number of spaces
012 10 0A LF '\n' (new line) - like it says
015 13 0D CR '\r' (carriage ret) - used instead of, or together with \n on many systems
021 17 11 DC1 (device control 1) - Control-Q - Resume transmission - XON
023 19 13 DC3 (device control 3) - Control-S - Pause transmission - XOFF
033 27 1B ESC (escape) - Used for PCL and other printer control codes and plenty of other things
Everything else is pretty much up for grabs. I would especially avoid NUL and XON/XOFF - they are sometimes hard to enter into a file - and BEL, because displaying a file that contains BEL characters can be noisy.
If you have a truly binary format then you can do anything you want. But if you want to have a mostly-human-readable format then limiting the control codes is a good idea.
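For example, if you only need to group ordinary LF-terminated lines, a minimal Ruby sketch (the group contents here are made up) could put RS (0x1E) between groups:
groups = [
  "alpha line 1\nalpha line 2",
  "beta line 1\nbeta line 2\nbeta line 3"
]
encoded = groups.join("\x1E")            # RS between groups, LF between lines within a group
decoded = encoded.split("\x1E")
p decoded.map { |g| g.lines.length }     # => [2, 3]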
The ASCII control codes aren't obsolete. They simply aren't used as much these days, because the technologies that made them so useful are no longer mainstream: improvements in communication technologies (USB, Ethernet, WiFi, cellular at 3G and beyond), in integrated circuit manufacturing (more components per square millimeter, better CPU architectures, miniaturization such as System on a Chip), and in protocols have largely replaced them.
However, in the world of the Internet of Things, the same technology considerations that influenced the design of these codes still apply:
small processors with limited RAM and storage
low bandwidth communication over slow speed paths
There are several ASCII control codes that are designed to structure text. The Wikipedia topic C0 and C1 control codes (section "Basic ASCII control codes") describes the separator control codes FS (File Separator), GS (Group Separator), RS (Record Separator), and US (Unit Separator):
Can be used as delimiters to mark fields of data structures. If used
for hierarchical levels, US is the lowest level (dividing plain-text
data items), while RS, GS, and FS are of increasing level to divide
groups made up of items of the level beneath it. The Unix info format
uses US, followed by an optional form-feed and a line break, to mark
the beginning of a node.[14]
MARC 21 uses US as a subfield delimiter, RS as a field terminator and
GS as a record terminator.[15]
In the current edition of IPTC 7901, if they are not used for other
purposes, US is recommended for use as a column separator in tables,
FS as a "Central Field Separator" in tables, and GS and RS
respectively for marking a following space or hyphen-minus as
non-breaking or soft respectively (in character sets not supplying
explicit NBSP and SHY characters).
See as well the description in RFC20, ASCII format for Network Interchange, which describes FS, GS, RS, and US as:
FS (File Separator), GS (Group Separator), RS (Record Separator), and US (Unit Separator): These information separators
may be used within data in optional fashion, except that their
hierarchical relationship shall be: FS is the most inclusive, then
GS, then RS, and US is least inclusive. (The content and length of
a File, Group, Record, or Unit are not specified.)
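A Ruby sketch of that hierarchy (the field values are made up): GS separates groups, RS separates records within a group, and US separates units within a record.
US = "\x1F"
RS = "\x1E"
GS = "\x1D"
record_a = ["Ada",   "Lovelace", "1815"].join(US)
record_b = ["Alan",  "Turing",   "1912"].join(US)
record_c = ["Grace", "Hopper",   "1906"].join(US)
data = [[record_a, record_b].join(RS), [record_c].join(RS)].join(GS)
parsed = data.split(GS).map { |group| group.split(RS).map { |record| record.split(US) } }
p parsed
# [[["Ada", "Lovelace", "1815"], ["Alan", "Turing", "1912"]], [["Grace", "Hopper", "1906"]]]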
The Wikipedia topic IPTC 7901 describes the use of control characters in news service messages, beginning with the formal approval of the protocol in 1979; the scheme sounds similar to an RSS feed. The actual specification is available from the IPTC web site as The IPTC Recommended Message Format, 1995.
Bit patterns -- that is, digitized numeric values -- do not become obsolete. The labels of the ASCII control codes reflect suggested uses in a wide variety of contexts -- serial comms, text display and printing, command-line editing, etc. The better word processors and text editors have used all of those codes in their keyboard command sets, and allowed all of them to be inserted into files, since the 1970s, maybe even earlier. Such programs are careful not to send these codes directly to the screen; they interpret newlines and tabs and sometimes others, and show everything else symbolically, in caret notation ("^A" for SOH, for example) or as underlined or bracketed characters. Certainly avoid ESC and a few others mentioned above if you are afraid users will cat your files to the screen. Otherwise, use them freely.
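To illustrate the caret-notation rendering mentioned above, a small Ruby sketch (the sample string and function name are made up):
def caret_notation(str)
  str.each_char.map do |c|
    code = c.ord
    if code == 0x7F
      "^?"                            # DEL
    elsif code < 0x20 && !"\n\t".include?(c)
      "^" + (code + 0x40).chr         # 0x01 -> "^A", 0x1B -> "^["
    else
      c
    end
  end.join
end
puts caret_notation("heading\x01 body\x1B[0m")
# => "heading^A body^[[0m"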
Long ago I patched WordStar to make it put my dot-matrix printer into graphics mode when desired. Using WordStar, any seven-bit code at all could be put into the graphics data. Worked like a charm.
Related
If ASCII uses 7 bits to represent characters, could someone explain what this means for the number of characters that can be supported? How would that change if ASCII used 12 bits per character?
A bit has two possible states. A group of n bits has 2^n possible states.
Therefore 7 bits can represent 2^7 = 128 possible characters and 12 bits can represent 2^12 = 4096 possible characters.
This abridged excerpt from Wikipedia's table of character sets provides historical perspective:
BCDIC          1928                   6 bits      Introduced with the IBM card
FIELDATA       1956                   6/7 bits    Battlefield information (USA)
EBCDIC         1963                   8 bits      IBM computers
ASCII          1963-06-17             7 bits      Teleprinters and computers;
               (ASA X3.4-1963)                    original definition of ASCII
ECMA-6         1965-04-30             7 bits      ASCII localization
ISO 646        1967 (ISO/R646-1967)   7 bits      ASCII localization
ASCII          1967 (USAS X3.4-1967)  7 bits      Close to "modern" definition of ASCII
Braille ASCII  1969                   6/7 bits    Tactile print for blind persons
ECMA-48        1972                   7 bits      Terminal text manipulation and colors
ISO/IEC 8859   1987                   8 bits      International codes
Unicode        1991                   16/32 bits  Unified encoding for most of the
                                                  world's writing systems
A 12-bit code can support 2^12 = 4096 characters, minus one or two for non-characters like NUL, perhaps ESC, and a few whitespace characters.
Now, you could construct a computer with 12-bit bytes, but it would be an expensive re-engineering operation. Most computers have 8-bit bytes, at least partly because of ASCII.
But the method chosen to extend ASCII was Unicode, and the encoding that has emerged as the standard is UTF-8. This is a superset of ASCII in a sense: an ASCII file is already valid UTF-8. For characters outside ASCII, the top bit is set and additional bytes are added, producing the extended non-Latin characters. So it is a variable-width encoding, the codes are always a multiple of 8 bits, and it is slightly open-ended in that codes could be added at the top of the range, although encodings currently never go wider than four bytes.
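For instance, a quick Ruby check of the variable-width behaviour (the sample characters are arbitrary):
["A", "é", "€", "😀"].each do |ch|
  puts "#{ch.inspect} -> #{ch.bytes.inspect} (#{ch.bytesize} bytes)"
end
# "A"  -> [65]                 (1 byte)
# "é"  -> [195, 169]           (2 bytes)
# "€"  -> [226, 130, 172]      (3 bytes)
# "😀" -> [240, 159, 152, 128] (4 bytes)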
I have a survey with 29 questions, each with a 5-point Likert scale (0=None of the time; 4=Most of the time). I'd like to compress the total set of responses to a small number of alpha or alphanumeric characters, adding a check digit to the end.
So, the set of responses 00101244231023110242231421211 would get turned into something like A2CR7HW4. This output would be part of a printout that a non-techie user would enter on a website as a shortcut to entering the entire string. I'd want to avoid ambiguous characters, such as 0,O,D,I,l,5,S, leaving me with 21 or 22 characters to use (uppercase only). Alternatively, I could just stick with capital alpha only and use all 26 characters.
I'm thinking to convert each pair of digits to a letter (5^2=25, so the whole alphabet is adequate). That would reduce the sequence to 15 characters, which is still longish to type without errors.
Any other suggestions on how to minimize the length of the output?
EDIT: BTW, for context, the survey asks 29 questions about mental health symptoms, generating a predictive risk for 4 psychiatric conditions. Need a code representing all responses.
If the five answers are all equally likely, then the best you can do is ceiling(29 * log(5) / log(n)) symbols, where n is the number of symbols in your alphabet. (The base of the logarithm doesn't matter, so long as they're both the same.)
So for your 22 symbols, the best you can do is 16. For 26 symbols, the best is 15, as you described for 25. If you use 49 characters (e.g. some subset of the upper and lower case characters and the digits), you can get down to 12. The best you'll be able to do with printable ASCII characters would be 11, using 70 of the 94 characters.
The only way to make it smaller would be if the responses are not all equally likely and are heavily skewed. Though if that's the case, then there's probably something wrong with the survey.
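A quick Ruby check of that bound (the helper name is my own):
def symbols_needed(alphabet_size)
  (29 * Math.log(5) / Math.log(alphabet_size)).ceil
end
[22, 26, 49, 70].each do |n|
  puts "#{n} symbols -> #{symbols_needed(n)} characters"
end
# 22 symbols -> 16 characters
# 26 symbols -> 15 characters
# 49 symbols -> 12 characters
# 70 symbols -> 11 characters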
First, choose a set of permissible characters, i.e.
characters = "ABC..."
Then, prefix the input digits with a 1 (so that leading zeros survive the conversion) and interpret the result as a quinary (base-5) number:
100101244231023110242231421211
Now, convert this quinary number to a number in base-"strlen(characters)", i.e. base26 if 26 characters are to be used:
02 23 18 12 10 24 04 19 00 15 14 20 00 03 17
Then, use these numbers as indexes into "characters" (A=0, B=1, ...), and you have your encoding:
CXSMKYETAPOUADR
For decoding, just reverse the steps.
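A Ruby sketch of the scheme (the constant and method names are my own):
CHARACTERS = ("A".."Z").to_a.join          # "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
def encode(digits)
  n   = ("1" + digits).to_i(5)             # the leading 1 preserves leading zeros
  out = ""
  while n > 0
    n, r = n.divmod(CHARACTERS.size)
    out.prepend(CHARACTERS[r])
  end
  out
end
def decode(code)
  n = code.chars.reduce(0) { |acc, c| acc * CHARACTERS.size + CHARACTERS.index(c) }
  n.to_s(5)[1..-1]                         # drop the leading 1 again
end
puts encode("00101244231023110242231421211")   # => "CXSMKYETAPOUADR"
puts decode("CXSMKYETAPOUADR")                  # => "00101244231023110242231421211"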
Are you doing this in a specific language?
If you want to be really thrifty about it you might want to consider encoding the data at bit level.
Since there are only 5 possible answers per question you could do this with only 3 bits:
000
001
010
011
100
Your end result would be a string of bits, at 3-bits per answer so a total of 87 bits or 10 and a bit bytes.
EDIT - misread the question slightly, there are 5 possible answers not 4, my mistake.
The only problem now is that for 4 of your 5 answers you're wasting a bit... I wouldn't say you'll benefit much from going to this much trouble, but it's worth considering.
EDIT:
I've been playing about with it and it's difficult to work out a mechanism that allows you to use both 2 and 3 bit values.
Since your output would be an 87-bit binary value, you'd need to be able to make the distinction between 2- and 3-bit values when converting back to the original values.
If you're working with a larger number of values there are some methods you could use, like having a reserved bit for each values that can be used to sort of type a value and give it some meaning. But working with so few bits as it is, it's hard to shave anything off.
Your output at 87 bits could be padded out to 128 bits, which would give you four 32-bit values if you wanted to simplify it. This 128-bit value would be like a unique fingerprint representing a specific set of answers. There are many ways you can represent 128 bits.
But in the end, working at bit level is about as good as it gets when it comes to actual compression and encoding of data... if you can express 5 unique values in fewer than 3 bits each, I'd be suitably impressed.
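A Ruby sketch of the 3-bits-per-answer packing, using the answer string from the question:
answers = "00101244231023110242231421211".chars.map(&:to_i)   # 29 answers, each 0-4
packed = answers.reduce(0) { |acc, a| (acc << 3) | a }        # at most 29 * 3 = 87 bits
puts packed.bit_length                                        # <= 87 (leading zero answers shorten it)
decoded = (0...answers.size).map { |i| (packed >> (3 * (answers.size - 1 - i))) & 0b111 }
puts decoded == answers                                       # => true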
I am comparing two strings and want to verify they are equal. Text-wise they look equal, but digging into the ASCII byte codes shows that the space character used in each string is different. Is there a way to fix this with a regex or a byte-level change?
I am using Ruby/Watir.
More details:
79 99 101 97 110 32 79 108 101 111 #Employee
79 99 101 97 110 194 160 79 108 101 111 #Emp
The two strings are "Ocean Oleo" and "Ocean Oleo". They look to be equal, but according to the byte codes they use different spaces: the first uses 32 (space), and the second uses 194, 160 (which also renders as a space).
assert((employee.include? emp), "Employee, #{employee}, from search result is NOT expected")
I want this code to evaluate to true, but it can't because of the space issue.
Thoughts?
You’ve got a non-breaking space in your string. The bytes 194, 160 (c2, a0 in hex) are the UTF-8 encoding of the Unicode character U+00A0 NO-BREAK SPACE.
The simple way to fix this would be to swap all non-breaking spaces with normal ones with gsub!, something like:
my_string.gsub! /\u00a0/, ' '
# now my_string will just have "normal" spaces
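As a quick check against the strings from the question (the variable values here are illustrative):
employee = "Ocean Oleo"          # plain space (0x20)
emp      = "Ocean\u00A0Oleo"     # non-breaking space (U+00A0, bytes 194 160 in UTF-8)
puts employee.include?(emp)      # => false
emp = emp.gsub(/\u00a0/, ' ')
puts employee.include?(emp)      # => true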
This may be enough for you, but a more complete way to do this would be to use a library to normalize your strings before comparing them. For example, using the UnicodeUtils gem:
# first install the gem, obviously
require 'unicode_utils'
# ...
my_string = UnicodeUtils.compatibility_decomposition(my_string)
This not only changes non-breaking spaces to normal spaces but also does a range of other things, like making sure any characters with diacritics (e.g. é) are represented the same way (they can be represented in two ways in Unicode), and changing ligatures like ﬃ to the separate characters ffi.
I have a COBOL "tape format" dump which has a mixture of text and number fields. I'm reading the file in C# as a binary array (array of byte). I have the copy book and the formats are lining up fine on the text fields. There are a number of COMP-3 fields as well. The data in those fields doesn't seem to match any BCD format. I know what the data should be and I have the raw bytes of the COMP-3. I tried converting to EBCDIC first which yielded no better results. Any thoughts on how a COMP-3 number can be otherwise internally stored? Below are three examples of the PIC, the raw data and the expected number. I know I have the field positions correct because there is alpha data on either side of the numbers and that all lines up correctly.
First Example:
The PIC of the field is 9(9) COMP-3
There are 5 bytes to the data, the hex values are 02 01 20 91 22
The resulting data should be a date (00CCYYMMDD). This particular date should be 3-17-14.
Second Example:
The PIC of the field is S9(3) COMP-3
There are 2 bytes to the data, the hex values are 0A 14
The resulting value should be between 900 and 999
My understanding is that the "S" means that the last nibble should be 0xC or 0xD to indicate + or -
Third Example:
The PIC of the field is S9(15)V99 COMP-3
There are 9 bytes to the data, the hex values are 00 00 00 00 00 00 01 80 0C
The resulting value should be 12.00
There is no such thing as a COBOL "tape format" although the phrase may mean something to the person who gave you the data.
The clue to your problem is that you can read the text. Connect that to the EBCDIC tag and your reference to C#.
So, you are reading data which was originally sourced from a Mainframe, most likely an IBM Mainframe, which uses EBCDIC instead of ASCII.
COBOL has native support for BCD (that is what COMP-3/PACKED-DECIMAL is); C# does not.
What some kind soul has done for you is "convert" the data from EBCDIC to ASCII. Otherwise you wouldn't even recognise the "text".
Unfortunately, what that means for any binary or packed-decimal or floating-point fields (you won't see much of the last, but they are COMP-1/COMP-2) is that "convert" means "potentially scrambled", because the conversion assumes individual bytes with simple character values, whereas all of those fields use non-character encodings, either spanning multiple bytes or holding values that are not valid EBCDIC characters, or both.
So: COMP-3 PIC 9(9). As you say, five bytes. It is unsigned, so the rightmost nybble will be F (all bits on). You are slightly out with your positions due to the sign position being occupied, even for an unsigned field.
On the Mainframe, it contains a value X'020140317F'. Only that field in its entirety can make any sense as to its value. However, the EBCDIC to ASCII conversion has made it X'0201209122'.
How?
Look up the EBCDIC value of X'02' and X'01'. They don't change. Look up the value of X'40', whoops, that's a space, change it to ASCII X'20'. Look up the value of X'31'. Actually nothing special there, and it has converted to something higher than X'7F', but if you look at the translation table used, I guess you'll see why it happens. The X'7F' is a double-quote, so gets changed to X'22'.
The other values you show suffer the same problem.
You should only ever take data from a Mainframe in character-only format. There are many answers here on this; you should look at the related questions to the right.
Have a look at this recent question: Convert COMP and COMP-3 Packed Decimal into readable value with C
OK, let's have a look at your first example. Given the format and value the original BCD-content should have been something like
02 01 40 31 7F
When transforming that from EBCDIC to ASCII we run into trouble with the first, second and fourth bytes because they are control characters - so here we would need some more details on how the EBCDIC->ASCII converter worked. Looking at the two remaining bytes, those would be changed as follows:
EBCDIC ASCII CHARACTER
40 -> 20 (blank)
7F -> 22 "
So assuming the first two bytes remain unchanged and the remaining control character gets converted like 31->91, we end up with
02 01 20 91 22
which is what you got. So it looks like some kind of EBCDIC->ASCII conversion took place. If that is the case, it might be that you can't repair the data, since the transformation may not be one-to-one and thus not reversible.
Looking at the second example and using
EBCDIC ASCII CHARACTER
25 -> 0A (LF)
3C -> 14 (DC4)
you would have started with 25 3C which would fit the format but not the range you gave.
In the third example the original 01 20 0C could be converted to 01 80 0C since 20 also is an EBCDIC control-character with no direct ASCII-equivalent.
But given all other examples I would assume there is some codepage-conversion issue.
If you used some kind of file transfer to move the data from the (supposed) mainframe, make sure it is set to binary mode and don't do any character conversion before you split the file into fields and know what's meant to be a character and what not.
EDIT: You can find a list of several EBCDIC and ASCII-based codepages here or look here for the same as one pdf.
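To make the mangling concrete, here is a Ruby sketch that applies just the substitutions discussed above to the first example's packed bytes. The table is deliberately partial and illustrative (0x31 -> 0x91 is inferred from the question's data), not a real code page:
EBCDIC_TO_ASCII = { 0x40 => 0x20, 0x7F => 0x22, 0x31 => 0x91 }   # partial, illustrative mapping
packed  = [0x02, 0x01, 0x40, 0x31, 0x7F]                         # COMP-3 bytes for 020140317 (unsigned)
mangled = packed.map { |b| EBCDIC_TO_ASCII.fetch(b, b) }
puts mangled.map { |b| format("%02X", b) }.join(" ")             # => "02 01 20 91 22"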
I'm coming to this a bit late, but have a couple of suggestions that might make your life easier...
First, see if you can get your mainframe counterparts to convert all non-character (i.e. binary numeric and packed decimal) data to display format (e.g. PIC X) before you download it. Then you only need to deal with the "printable" range of numeric characters representing 0 through 9. Printable-character-only code-page conversions are fairly standard and tend not to screw up as much. Reformatting data given a copybook is not a difficult prospect for anybody proficient in a mainframe environment. Unfortunately, sometimes you get the "runaround" and a claim is made that it is extremely costly, or takes special software, or any one of a hundred other bogus excuses.
If you get the "runaround", then the next best thing is to download the file in binary format and do your own code-page conversion for the character data (fairly straightforward). Then deal with the binary data based on your copybook definitions. With a few Googles you should be able to find enough information to get through converting the PACKED-DECIMAL (COMP-3) data to whatever you need.
Here are a couple of links to get you started:
Numeric Data Formats
Packed Decimal
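For what it's worth, once you do have the original packed bytes, decoding COMP-3 is straightforward. A minimal Ruby sketch (the function name and scale handling are my own, illustrative choices):
def unpack_comp3(bytes, decimal_places = 0)
  hex    = bytes.map { |b| format("%02X", b) }.join   # e.g. [0x01, 0x20, 0x0C] -> "01200C"
  digits = hex[0..-2]                                 # all nibbles except the last
  sign   = hex[-1]                                    # last nibble: C/F positive, D negative
  value  = digits.to_i
  value  = -value if sign == "D"
  decimal_places.zero? ? value : value / 10.0**decimal_places
end
puts unpack_comp3([0x01, 0x20, 0x0C], 2)   # => 12.0  (the S9(15)V99 example)
puts unpack_comp3([0x25, 0x3C])            # => 253   (the S9(3) example)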
I do not recommend trying to reverse engineer the code page conversions applied by your file transfer package in order to
decode the packed decimal and other binary data.
Ok so thanks to both people who responded as they pointed me in the right direction. This is indeed an ASCII/EBCDIC representation issue. The BCD is stored in EBCDIC. Using an ASCII to EBCDIC conversion table yields properly formatted BCD digits:
I used this link to map the data: http://shop.alterlinks.com/ascii-table/ascii-ebcdic-us.php
My data: 0A 14
Converted: 25 3C (turns out that 253 is a valid value, spec was wrong) C = +, all good
My data: 01 80 0C (excluding leading zeros)
Converted: 01 20 0C 12.00 C = +, implied 2 decimal places in the PIC, all good
My data: 02 01 20 91 22
Converted: 02 01 40 31 7F 2014/03/17 (F is the sign nibble for an unsigned field), all good
Thanks again for the two above answers which led me in the right direction.
You can avoid the above issues by having the data converted into a modern method for transferring data: XML.
So I've been thinking lately about how compression might be implemented, and what I've postulated so far is that it might be using a sort of HashTable of 'byte signature' keys with memory location values where that 'byte signature' should be replaced upon expansion of the compressed item in question.
Is this far from the truth?
How is compression typically implemented? No need for a page worth of answer, just in simple terms is fine.
Compressing algorithms try to find repeated subsequences to replace them with a shorter representation.
Let's take the 25 byte long string Blah blah blah blah blah! (200 bit) from An Explanation of the Deflate Algorithm for example.
Naive approach
A naive approach would be to encode every character with a code word of the same length. We have 7 different characters and thus need code words of length ceil(ld(7)) = 3. Our code words could then look like these:
000 → "B"
001 → "l"
010 → "a"
011 → "h"
100 → " "
101 → "b"
110 → "!"
111 → not used
Now we can encode our string as follows:
000 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 011 110
B   l   a   h   _   b   l   a   h   _   b   l   a   h   _   b   l   a   h   _   b   l   a   h   !
That would just need 25·3 bit = 75 bit for the encoded word plus 7·8 bit = 56 bit for the dictionary, thus 131 bit (65.5%)
Or for sequences:
00 → "lah b"
01 → "B"
10 → "lah!"
11 → not used
The encoded word:
01 00 00 00 00 10
B lah b lah b lah b lah b lah!
Now we just need 6·2 bit = 12 bit for the encoded word and 10·8 bit = 80 bit plus 3·8 bit = 24 bit for the length of each word, thus 116 bit (58.0%).
Huffman code approach
The Huffman code is used to encode more frequent characters/substrings with shorter code than less frequent ones:
5 × "l", "a", "h"
4 × " ", "b"
1 × "B", "!"
// or for sequences
4 × "lah b"
1 × "B", "lah!"
A possible Huffman code for that is:
0 → "l"
10 → "a"
110 → "h"
1110 → " "
11110 → "b"
111110 → "B"
111111 → "!"
Or for sequences:
0 → "lah b"
10 → "B"
11 → "lah!"
Now our Blah blah blah blah blah! can be encoded to:
111110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 111111
B l a h _ b l a h _ b l a h _ b l a h _ b l a h !
Or for sequences:
10 0 0 0 0 11
B lah b lah b lah b lah b lah!
Now our first code needs just 78 bits (or 8 bits for the sequence version) instead of the 25·8 = 200 bits of the initial string. But we still need to add the dictionary where our characters/sequences are stored. For the per-character example we would need 7 additional bytes (7·8 bit = 56 bit), and the per-sequence example would need 10 bytes for the sequence texts plus 3 bytes for the length of each sequence (thus 104 bit). That would result in:
56 + 78 = 134 bit (67.0%)
104 + 8 = 112 bit (56.0%)
The actual numbers may not be correct. Please feel free to edit/correct it.
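For comparison with a real implementation, Ruby's bundled zlib binding can Deflate-compress the same string. Note that for a 25-byte input the zlib container overhead may cancel out most of the savings, while longer repetitive input compresses dramatically:
require 'zlib'
text = "Blah blah blah blah blah!"
puts text.bytesize                              # => 25
puts Zlib::Deflate.deflate(text).bytesize       # overhead may make this about the same size as the input
long = text * 1000
puts long.bytesize                              # => 25000
puts Zlib::Deflate.deflate(long).bytesize       # a tiny fraction of that - repetition is what Deflate exploits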
Check this wiki page...
Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.
Another kind of compression, called lossy data compression or perceptual coding, is possible if some loss of fidelity is acceptable. Generally, a lossy data compression will be guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by "rounding off" some of this less-important information. Lossy data compression provides a way to obtain the best fidelity for a given amount of compression. In some cases, transparent (unnoticeable) compression is desired; in other cases, fidelity is sacrificed to reduce the amount of data as much as possible.
Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.
However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has already been compressed will therefore usually result in an expansion (though text files can sometimes be compressed a little further after a first pass, due to fewer distinct symbols), as will attempts to compress all but the most trivially encrypted data.
In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, like for example always removing the last byte of a file, will always compress a file up to the point where it is empty.
An example of lossless vs. lossy compression is the following string:
25.888888888
This string can be compressed as:
25.[9]8
Interpreted as "twenty five point 9 eights", the original string is perfectly recreated, just written in a smaller form. In a lossy system, using
26
instead, the original data is lost, at the benefit of a smaller file size.
Lossless compression algorithms translate each possible input into distinct outputs, in such a way that more common inputs translate to shorter outputs. It's mathematically impossible for all possible inputs to be compressed -- otherwise, you'd have multiple inputs A and B compressing to the same form, so when you decompress it, do you get back to A or back to B? In practice, most useful information has some redundancy and this redundancy fits certain patterns; hence the data can usefully be compressed because the cases that expand when you compress them don't naturally arise.
Lossy compression, for example, that used in JPEG or MP3 compression, works by approximating the input data by some signal that can be expressed in fewer bits than the original. When you decompress it, you don't get the original, but you usually get something close enough.
In VERY simple terms, a common form of compression is a dictionary coder (http://en.wikipedia.org/wiki/Dictionary_coder). This involves replacing longer repeated strings with shorter ones.
For example if you have a file that looks like this:
"Monday Night","Baseball","7:00pm"
"Tuesday Night","Baseball","7:00pm"
"Monday Night","Softball","8:00pm"
"Monday Night","Softball","8:00pm"
"Monday Night","Baseball","5:00pm"
It would be roughly 150 characters, but if you were to do a simple substitution as follows:
A="Monday Night", B="Tuesday Night", C="Baseball", D="Softball", E="7:00pm", F="8:00pm", G="5:00pm"
Then the same content could be encoded as:
A,C,E
B,C,E
A,D,F
A,D,F
A,C,G
Using only 25 characters! A clever observer could also see how to easily reduce this further to 15 characters if we assumed some more things about the format of the file. Obviously there is the overhead of the substitution key, but very large files often contain a lot of these repetitions. This can be a very efficient way to compress large files or data structures and still allow them to be "somewhat" human readable.
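A toy Ruby sketch of that substitution (the table is written out by hand for this example):
TABLE = {
  "Monday Night" => "A", "Tuesday Night" => "B",
  "Baseball" => "C", "Softball" => "D",
  "7:00pm" => "E", "8:00pm" => "F", "5:00pm" => "G"
}
lines = [
  '"Monday Night","Baseball","7:00pm"',
  '"Tuesday Night","Baseball","7:00pm"',
  '"Monday Night","Softball","8:00pm"',
  '"Monday Night","Softball","8:00pm"',
  '"Monday Night","Baseball","5:00pm"'
]
encoded = lines.map do |line|
  line.scan(/"([^"]*)"/).flatten.map { |field| TABLE.fetch(field, field) }.join(",")
end
puts encoded
# A,C,E
# B,C,E
# A,D,F
# A,D,F
# A,C,G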
Rosetta Code has an entry on Huffman Coding, as does an earlier blog entry of mine.