Why is the default terminal width 80 characters? - terminal

80 seems to be the default in many different environments and I'm looking for a technical or historical reason. It is common knowledge that lines of code shouldn't exceed 80 characters, but I'm hard pressed to find a reason why outside of "some people might get annoyed."

As per Wikipedia:
"80 chars per line is historically descended from punched cards and later broadly used in monitor text mode."
(source: http://en.wikipedia.org/wiki/Characters_per_line)
Shall I still use 80 CPL?
Many developers argue to use 80 CPL even if you could use more. Quoting from: http://richarddingwall.name/2008/05/31/is-the-80-character-line-limit-still-relevant/
Long lines that span too far across the monitor are hard to read. This is typography 101. The shorter your line lengths, the less your eye has to travel to see it.
If your code is narrow enough, you can fit two files on screen, side by side, at the same time. This can be very useful if you’re comparing files, or watching your application run side-by-side with a debugger in real time.
Plus, if you write code 80 columns wide, you can relax knowing that your code will be readable and maintainable on more-or-less any computer in the world.
Another nice side effect is that snippets of narrow code are much easier to embed into documents or blog posts.
As a Vim user, I keep set colorcolumn=80 in my ~/.vimrc. If I remember correctly, Eclipse's autoformat (Ctrl+Shift+F) breaks lines at 80 characters by default.

It is because IBM punch cards were 80 characters wide.

Your computer probably doesn't have a punch card reader, but it probably does have lpr(1) which follows the convention set by IBM for punch cards. The lpr(1) command defaults to Courier font with margins set for 80-columns and 8-space tabs for plain text files on 8.5"x11" paper. Try cat foo.c | lpr and if the author of foo.c used conventional line width and source code formatting rules, then the printed page will look mostly readable. Otherwise, best not to kill the trees.

One of the characteristics of good typography is a properly set measure, i.e. the length of a line of text.
There is an optimum width for a measure, and it is defined by the number of characters in the line. A good general rule of thumb is 2-3 alphabets in length, or 52-78 characters (including spaces).
It simply makes sense to make your text readable.
See Five simple steps to better typography for more info.

If I remember correctly, the old dot matrix printers were only able to print 80 characters across. My old Commodore 128 had an 80-column mode as well (the Commodore 64 was limited to 40 columns), and now that I think about it, I don't think the monitors of that era could display more than 80 characters per line either.
The LA30 was a 30 character/second dot matrix printer introduced in 1970 by Digital Equipment Corporation of Maynard, Massachusetts. It printed 80 columns of uppercase-only 5x7 dot matrix characters across a unique-sized paper.
http://en.wikipedia.org/wiki/Dot_matrix_printer

Possibly due to the common screen resolution of 640 pixels across: each character cell was (or is) 8 pixels wide, so 640 / 8 gives you 80 columns.

Related

ASCII - What's the point of it?

I've always wanted to ask this. I know that ASCII uses numbers to represent characters, like 65 = A.
What's the point? The computer understands that when I press A it is an A, so why do we need to convert it to 65?
You have it backwards: computers understand when you press an A because of codes like ASCII. Or rather, one part of the computer is able to tell another part of the computer that you pressed an A because they agree on conventions of binary signals like ASCII.
At its lowest level, each part of the computer "knows" that it is in one of two states - maybe off and on, maybe high voltage and low voltage, maybe two directions of magnetism, and so on. For convenience, we label these two states 0 and 1. We then build elaborate (and microscopic) sequences of machinery that each say "if this thing's a 1, then do this, if it's a 0 do this".
If we string a sequence of 1s and 0s together, we can write a number, like 1010; and we can make machinery that does maths with those numbers, like 1010 + 0001 = 1011. Alternatively, we can string a much longer sequence together to represent the brightness of pixels from the top left to bottom right of a screen, in order - a bitmap image. The computer doesn't "know" which sequences are numbers and which are images, we just tell it "draw the screen based on this sequence" and "calculate my wages based on this sequence".
If we want to represent not numbers or images, but text, we need to come up with a sequence of bits for each letter and symbol. It doesn't really matter what sequence we use, we just need to be consistent - we could say that 000001 is A, and as long as we remember that's what we chose, we can write programs that deal with text. ASCII is simply one of those mappings of sequences of bits to letters and symbols.
Note that A is not defined as "65" in ASCII; it's defined as the 7-bit sequence 1000001. It just happens that that's the same sequence of bits we generally use for the number 65. Note also that ASCII is a very old mapping, and almost never used directly in modern computers; it is however very influential, and a lot of more recent mappings are designed to use the same or similar sequences for the letters and symbols that it covers.
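A tiny illustration of that equivalence (a Python sketch; the formatting calls are just for display):

    # 'A' and 65 are the same bit pattern; ASCII is just the agreed-upon mapping.
    code = ord('A')                 # 65
    print(format(code, '07b'))      # 1000001 -- the 7-bit ASCII sequence
    print(chr(65))                  # 'A' -- reading the same number back as text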

how data is interpreted in computers

Something that troubles me when I think about it is how incoming data is interpreted in computers. I searched a lot but could not find an answer, so as a last resort I am asking here. What I mean is: you plug a USB stick into your computer and a data stream starts. Your computer receives ones and zeros from the USB device and interprets them correctly; for example, inside the USB stick there are pictures with different names, formats, and resolutions. What I do not understand is how the computer correctly puts them together so that the big picture emerges. This could be seen as a stupid question, but it has had me thinking for a while. How does this system work?
I am not a computer scientist, but I am studying electrical and electronics engineering and know some things.
It is all just streams of ones and zeros, which get grouped into bytes. As you probably know, one can multiplex them, but with modern hardware that isn't very necessary (the 'S' in USB stands for 'Serial').
A pure black and white image of an "A" would be a 2d array:
111
101
111
101
101
3x5 font
I would guess that "A" is stored in a font file as 111101111101101, with a known length of 3*5=15 bits.
When displayed in a window, that A would be broken down into lines, and inserted on the respective line of the window, becoming a stream which contains 320x256 pixels perhaps.
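A rough sketch of that idea in Python, using the toy 3x5 glyph above (real font formats are far more involved):

    # Unpack the 15-bit glyph string row by row and draw it.
    GLYPH_A = "111101111101101"     # 3x5 "A", rows top to bottom, 3 bits per row
    WIDTH, HEIGHT = 3, 5
    for row in range(HEIGHT):
        bits = GLYPH_A[row * WIDTH:(row + 1) * WIDTH]
        print("".join("#" if b == "1" else " " for b in bits))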
When the length of the data is not constant, there are a few options (a sketch of the second one follows below):
It can be padded out to a known maximum size (integers and other primitive data types do this: a 0 takes 32/64 bits, and so does 400123).
A length can be included somewhere, often in a sort of "header".
It can be chunked up into either constant- or variable-sized chunks with a "continue" bit (UTF-8 is a good simple example of constant chunks; some networking protocols, such as TCP/IP, are a good example of variable chunks).
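Here's a minimal Python sketch of the second option, a length header, using a made-up record format (a 4-byte big-endian length followed by the payload):

    import struct

    def pack_record(payload):
        # Prefix the payload with its length as a 4-byte big-endian integer.
        return struct.pack(">I", len(payload)) + payload

    def unpack_records(stream):
        # Read a 4-byte length header, then that many payload bytes, and repeat.
        offset = 0
        while offset < len(stream):
            (length,) = struct.unpack_from(">I", stream, offset)
            offset += 4
            yield stream[offset:offset + length]
            offset += length

    data = pack_record(b"hello") + pack_record(b"variable-length world")
    print(list(unpack_records(data)))   # [b'hello', b'variable-length world']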
Both sides need to know how to decode the data. In your example of a USB stick with an image on it: the operating system has a driver which recognizes the device as a storage device and attempts to read special sectors from it. If it detects a partition type it recognizes (for Windows that would be NTFS or FAT32), it loads the file tables, using drivers that understand how to decode those. It finds a filename and allows access via that filename. Then an image-viewing program is able to load the byte stream of that file and decode it, using its headers and installed codecs, into a raster image array. If any of those pieces are not available on your system, you cannot view the image, and it will just be random binary to you (for example, if you format the USB stick under Linux, or use an uncommon/old image format).
So it's all various levels of explicit or implicit handshakes to agree on what the data is once you get to the higher levels ("higher level" being anything above the point where you have at least agreed on the endianness and baud rate of the data transmission).

Bash string compression

I'd like to know how I can compress a string into fewer characters using a shell script. The goal is to take a Mac's serial number and MAC address then compress those values into a 14 character string. I'm not sure if this is possible, but I'd like to hear if anyone has any suggestions.
Thank you
Your question is way too vague to result in a detailed answer.
Given your restriction of a 14 character string output, you won't be able to use "real" compression (like zip), due to the overhead. This leaves you with simple algorithms, like RLE or bit concatenation.
If by "string" you mean "printable string", i.e. only about 62 or so values are usable in a character (depending on the exact printable set you choose), then you have an additional space constraint.
A handy trick you could use with the MAC address part: since it belongs to an Apple device, you already know that the first three octets (AA:BB:CC) are one of 297 combinations, so you can pack 6 characters' (plus 2 colons') worth of information into 2 or so characters (depending on your output character set, see above).
The remaining three MAC address values are base-16 (0-9, A-F), so you could "compress" this information slightly as well.
A similar analysis can be done for the Mac serial number (which values can it take? how much space can be saved?).
The effort to do this in bash would be disproportionate though. I'd highly recommend a C (or other programming language) approach.
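For what it's worth, here is a Python sketch of the MAC-prefix trick described above. The two OUI entries are placeholders; a real table would list all of the known Apple prefixes:

    # Hypothetical table of Apple OUI prefixes (first three MAC bytes).
    APPLE_OUIS = ["00:03:93", "00:0A:27"]          # placeholder entries only
    OUI_INDEX = {oui: i for i, oui in enumerate(APPLE_OUIS)}

    def compress_mac(mac):
        """Split a MAC into (prefix index 0..296, 24-bit suffix value)."""
        oui, suffix = mac[:8], mac[9:]
        prefix_id = OUI_INDEX[oui]                 # small integer instead of 8 chars
        suffix_value = int(suffix.replace(":", ""), 16)
        return prefix_id, suffix_value

    print(compress_mac("00:0A:27:12:34:56"))       # (1, 1193046)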
Cheating answer
Get someone at Apple to give you access to the database I'm assuming they have which matches devices' serial numbers to MAC addresses. Then you can just store the MAC address and look it up in the database whenever you need the serial number. The 48-bit MAC address already fits in the 12 hexadecimal characters of its usual notation, or in 8 characters with standard base64 encoding.
Frustrating answer
You have to make some unreliable assumptions just to make this approachable. You can fix the assumptions later, but I don't know if it would still fit in 14 characters. Personally, I have no idea why you want to save space by reprocessing the serial and MAC numbers, but here's how I'd start.
Simplifying assumptions
Apple will never use MAC address prefixes beyond the 297 combinations mentioned in Sir Athos' answer.
The "new" Mac serial number format in this article from
2010 is the only format Apple has used or ever will use.
Core concepts of encoding
You're taking something which could have n possible values and you're converting it into something else with n possible values.
There may be gaps in the original's possible values, such as if Apple cancels building a manufacturing plant after already assigning it a location code.
There may be gaps in your encoded form's possible values, perhaps in anticipation of Apple doing things that would fill the gaps.
Abstract integer encoding
Break apart the serial number into groups as "PPP Y W SSS CCCC" (like the article describes)
Make groups for the first 3 bytes and the last 3 bytes of the MAC address.
Translate each group into a number from 0 to n-1, where n is the number of possible values for something in the group. As far as I can tell from the article, the values are n_P=36^3, n_Y=20, n_W=27, n_S=3^3, and n_C=36^4. The first 3 MAC bytes have 297 possible values and the last 3 have 2^(8*3)=2^24 values.
Set a variable, i, to the value of the first group's number.
For each remaining group's number, multiply i by the number of values possible for the group, and then add the number to i.
Base n encoding
Make a list of n characters that you want to use in your final output.
Print the character in your list at index i%n.
Subtract the modulus from the integer encoding and divide by n.
Repeat 1 and 2 until the integer becomes 0.
Result
This results in a total of 36^3 * 20 * 27 * 3^3 * 36^4 * 297 * 2^24 ~= 6 * 10^24 combinations. If you let n=64 for a custom base64 encoding (without any padding characters), then you can barely fit that into ceiling(log(6 * 10^24) / log(64)) = 14 characters. If you use all 95 printable ASCII characters, then you can fit it into ceiling(log(6 * 10^24) / log(95)) = 13 characters.
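A minimal Python sketch of the whole scheme, mixed-radix packing followed by the base-n step. The group values and alphabet here are illustrative stand-ins, not a finished format:

    # Pack several small numbers, each with its own range, into one integer,
    # then emit that integer in base n using a chosen output alphabet.
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"  # 64 chars

    def pack(groups):
        # groups: list of (value, number_of_possible_values), first group first.
        i = 0
        for value, n in groups:
            i = i * n + value
        return i

    def encode(i, alphabet=ALPHABET):
        n = len(alphabet)
        out = []
        while i > 0:
            out.append(alphabet[i % n])
            i //= n
        return "".join(out) or alphabet[0]

    # Illustrative values: plant, year, week, unit, model, OUI index, MAC suffix.
    groups = [(1234, 36**3), (7, 20), (13, 27), (20, 3**3),
              (99999, 36**4), (150, 297), (0x123456, 2**24)]
    print(encode(pack(groups)))        # roughly 13-14 characters long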
Fixing the assumptions
If you're trying to build something that uses this and are determined to make it work, here's what you need to do to make it solid, along with some tips.
Do the same analysis on every other serial number format you may care about. You might want to see if there's any redundant information between the serial and MAC numbers.
Figure out a way to distinguish between serial number formats. Adding an extra group at the end of the abstract integer encoding can let you track which version of the scheme a given string uses.
Think long and carefully about the format you're making. It's a lot easier to make changes before you're stuck with backwards compatibility.
If you can, use a language that's well suited for mapping between values, doing a lot of arithmetic, and handling big numbers. You may be able to do it in Bash, but it'd probably be easier in, say, Python.

Why are non-printable ASCII characters actually printable?

Characters that are not alphanumeric or punctuation are termed non-printable:
Codes 20hex to 7Ehex, known as the printable characters
So why is, e.g., code 005 representable (and represented by a club symbol)?
Most of the original set of ASCII control characters are no longer useful, so many different vendors have recycled them as additional graphic characters, often dingbats as in your table. However, all such assignments are nonstandard, and usually incompatible with each other. If you can, it's better to use the official Unicode codepoints for these characters. (Similar things have been done with the additional block of control characters in the high half of the ISO 8859.x standards, which were already obsolete at the time they were specified. Again, use the official Unicode codepoints.)
The tiny print at the bottom of your table appears to say "Copyright 1982 Leading Edge Computer Products, Inc." That company was an early maker of IBM PC clones, and this is presumably their custom ASCII extension. You should only pay attention to the assignments for 000-031 and 127 in this table if you're writing software to convert files produced on those specific computers to a more modern format.
The representation of the "non-printable" chars depends on the character set in use (of the OS, of the browser, whatever); see ISO 8859 or Code Page 1252, for example.
In DOS, for example, you see funny symbols that were used for drawing very old-style window frames (ASCII-art style).
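For illustration, a few of the glyphs the original IBM PC displayed for low control codes, shown here via their Unicode equivalents (this little table is typed in by hand for the example; the assignments are vendor-specific, not part of ASCII itself):

    # Classic IBM PC screen glyphs for control codes 1-6 (code page 437 display set).
    PC_GLYPHS = {1: "☺", 2: "☻", 3: "♥", 4: "♦", 5: "♣", 6: "♠"}
    for code, glyph in PC_GLYPHS.items():
        print(f"code {code:03d} -> {glyph}")       # e.g. "code 005 -> ♣"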

Algorithm to estimate number of English translation words from Japanese source

I'm trying to come up with a way to estimate the number of English words a translation from Japanese will turn into. Japanese has three main scripts -- Kanji, Hiragana, and Katakana -- and each has a different average character-to-word ratio (Kanji being the lowest, Katakana the highest).
Examples:
computer: コンピュータ (Katakana: 6 characters); 計算機 (Kanji: 3 characters)
whale: くじら (Hiragana: 3 characters); 鯨 (Kanji: 1 character)
As data, I have a large glossary of Japanese words and their English translations, and a fairly large corpus of matched Japanese source documents and their English translations. I want to come up with a formula that will count numbers of Kanji, Hiragana, and Katakana characters in a source text, and estimate the number of English words this is likely to turn into.
Here's what Borland (now Embarcadero) thinks about English to non-English:
Length of English string (in characters)    Expected increase
1-5                                         100%
6-12                                         80%
13-20                                        60%
21-30                                        40%
31-50                                        20%
over 50                                      10%
I think you can sort of apply this (with some modification) for Japanese to non-Japanese.
Another element you might want to consider is the tone of the language. In English, instructions are phrased as an imperative as in "Press OK." But in Japanese language, imperatives are considered rude, and you must phrase instructions in honorific (or keigo) as in "OKボタンを押してください。"
Watch out for three-character kanji compounds. Many of the big words translate into three- or four-character kanji compounds, such as 国際化 (internationalization: 20 characters) and 高可用性 (high availability: 17 characters).
I would start with a linear approximation: approx_english_words = a1 * no_chars_in_script1 + a2 * no_chars_in_script2 + a3 * no_chars_in_script3, with the coefficients a1, a2, a3 fitted from your data using linear least squares.
If this doesn't approximate very well, then look at the worst cases for the reasons they don't fit (specialized words, etc.).
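A sketch of that fit in Python with NumPy. The Unicode ranges are rough approximations of the three scripts, and the two-sentence corpus is only a stand-in for your aligned documents:

    import numpy as np

    def script_counts(text):
        # Rough per-script character counts using Unicode block ranges.
        kanji    = sum('\u4e00' <= c <= '\u9fff' for c in text)
        hiragana = sum('\u3040' <= c <= '\u309f' for c in text)
        katakana = sum('\u30a0' <= c <= '\u30ff' for c in text)
        return kanji, hiragana, katakana

    # Toy stand-in for an aligned corpus of (Japanese source, English translation).
    pairs = [
        ("コンピュータを使う", "use a computer"),
        ("鯨を見た", "I saw a whale"),
    ]
    X = np.array([script_counts(jp) for jp, _ in pairs], dtype=float)
    y = np.array([len(en.split()) for _, en in pairs], dtype=float)

    # Fit approx_english_words = a1*kanji + a2*hiragana + a3*katakana.
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coeffs)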
In my experience as a translator and localization specialist, a good rule of thumb is 2 Japanese characters per English word.
As an experienced translator between Japanese and English, I can say that this is extremely difficult to quantify, but typically in my experience English text translated from Japanese is nearly 200% as many characters as the source text. In Japanese there are many culturally specific phrases and nouns that can't be translated literally and need to be explained in English.
When translating, it is not unusual for me to take a single Japanese sentence and make a single English paragraph out of it in order for the meaning to be communicated to the reader. Off the top of my head, here is an example:
「懐かしい」
This literally means nostalgic. However, in Japanese it can be used as a single phrase in an exclamation. Yet, in English in order to convey a feeling of nostalgia we require a lot more context. For instance, you may need to turn that single phrase into a sentence:
"As I walked by my old elementary school, I was flooded with memories of the past."
This is why machine translation between Japanese and English is so hard.
Well, it's a little more complex than just the number of characters in a noun compared to English. For instance, Japanese also has a different grammatical structure, so certain sentences would use MORE words in Japanese, and others would use FEWER. I don't really know Japanese, so please forgive me for using Korean as an example.
In Korean, a sentence is often shorter than an English sentence, due mainly to the fact that sentences are cut short by using context to fill in the missing words. For instance, saying "I love you" could be as short as 사랑해 ("sarang hae", simply the verb "love"), or as long as the fully qualified sentence 저는 당신을 사랑해요 (I [topic] you [object] love [verb + polite modifier]). How it is written in a given text depends on the context, which is usually set by earlier sentences in the paragraph.
Anyway, having an algorithm to actually KNOW this kind of thing would be very difficult, so you're probably much better off just using statistics. What you should do is use random samples of Japanese texts and English texts that are known to have the same meaning. The larger the sample (and the more random it is) the better... though if they are truly random, it won't make much difference once you have more than a few hundred.
Now, another thing: this ratio will change completely depending on the type of text being translated. For instance, a highly technical document is quite likely to have a much higher Japanese/English length ratio than a soppy novel.
As for simply using your dictionary of word-to-word translations: that probably won't work too well (and is probably wrong). The same word does not translate to the same word every time in a different language (although this is much more likely to hold in technical discussions). For instance, take the word "beautiful". Not only is there more than one word I could map it to in Korean (i.e. there is a choice), but sometimes I lose that choice, as in the sentence "that food is beautiful", where I don't mean the food looks good, I mean it tastes good, and my choice of translations for that word changes. And this is a VERY common circumstance.
Another big problem is optimal translation. Something that humans are really bad at, and something that computers are much, much worse at. Whenever I've proofread a document translated from another language into English, I can always see various ways to cut it much, much shorter.
So although, with statistics, you would be able to work out a pretty good average ratio in length between translations, this will be far different than it would be were all translations to be optimal.
It seems simple enough - you just need to find out the ratios.
For each script, count the number of script characters and English words in your glossary and work out the ratio.
This can be augmented with the Japanese source documents, assuming you can both detect which script a Japanese word is in and what the English equivalent phrase is in the translation. Otherwise you'll have to guesstimate the ratios or ignore this as source data.
Then, as you say, count the number of words in each script of your source text, do the multiplies, and you should have a rough estimate.
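As a sketch, once you have the per-script ratios, the estimate itself is just a few divisions. The ratios below are made-up placeholders for whatever your glossary actually yields:

    # Hypothetical characters-per-English-word ratios derived from a glossary.
    CHARS_PER_ENGLISH_WORD = {"kanji": 1.5, "hiragana": 3.5, "katakana": 5.0}

    def estimate_english_words(kanji, hiragana, katakana):
        return (kanji / CHARS_PER_ENGLISH_WORD["kanji"]
                + hiragana / CHARS_PER_ENGLISH_WORD["hiragana"]
                + katakana / CHARS_PER_ENGLISH_WORD["katakana"])

    print(round(estimate_english_words(kanji=30, hiragana=70, katakana=12), 1))  # 42.4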
My (albeit tiny) experience seems to indicate that, no matter what the language, blocks of text take the same amount of printed space to convey equivalent information. So, for a large-ish block of text, you could assign a width count to each character in English (grab this from a common font like Times New Roman), and likewise use a common Japanese font at the same point size to calculate the number of characters that would be required.

Resources