As I understand it, the escape character can be represented in the following ways.
octal: \033
hexadecimal: \x1B
decimal: 27
unicode: \u001B
in my terminal: ^[
The first four representations are just decimal 27 in different number systems. But the last representation, ^[, doesn't seem to have any link to decimal 27; it seems arbitrary. So I am wondering why ^[ was chosen as the way to represent escape in a terminal, and how that came about.
But the last representation ^[ doesn't seem to have any link to decimal 27
It may appear so at first glance, but in reality, there is a link.
First, you need to understand that the caret in notation like ^[ means that the control key is held while pressing [, so ^[ is ctrl-[. In other words, the escape key acts exactly the same (in a terminal) as ctrl-[. (As to why the escape key produces this particular character: see the second part of my answer.)
The character [ is encoded in ASCII as decimal 91, or 0x5b, but it's most useful to look at the binary representation: 0b01011011. ^[, or the escape key, is encoded as decimal 27, or 0b00011011. If we align these two binary numbers:
[ 0b01011011
^[ 0b00011011
we can see that ^[ is just [ with one bit (the bit worth 64) cleared. In fact, holding the control key essentially just clears the top three bits of the 8-bit character code.
So the link between ^[ and 27 is that 91 − 64 = 27 :)
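A minimal sketch in Python (my choice of language for illustration) showing that masking with 0b00011111 reproduces what Ctrl plus a key sends:

# Holding Ctrl clears the top three bits of the character code.
for ch in "[AM":
    code = ord(ch)
    ctrl = code & 0b00011111      # what Ctrl+<ch> sends in a terminal
    print(f"Ctrl-{ch}: {code} (0b{code:08b}) -> {ctrl} (0b{ctrl:08b})")

This prints 27 for Ctrl-[ (escape), 1 for Ctrl-A, and 13 for Ctrl-M (carriage return).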
why ^[ was chosen as the way to represent escape in a terminal
I have absolutely no idea!
Related
I have found out that 015 is an octal code. Then what is the term for "\r", and what kind of system is that? And why isn't there a system where I just enter the decimal value from the ASCII table, e.g. "\13" for a carriage return?
why isn't there a system where I just enter the decimal value from the ASCII table, e.g. "\13" for a carriage return?
For historical reasons, most systems allow direct representation of characters in hexadecimal and/or octal.
This is because, although humans used to base 10 find decimal easier, octal and hexadecimal are easier to understand at the bit level:
each octal digit is exactly 3 bits and each hexadecimal digit is exactly 4 bits, whereas a base-10 digit does not fit a fixed number of bits.
If you wish to produce arbitrary characters from decimal codes, there is usually a function for this. For example, in Python:
python3 -c 'print(chr(13))'
This will output the carriage return character you are interested in.
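(Incidentally, the notation "\r" itself is called an escape sequence.) As a quick sketch in Python 3, here is the same carriage return written four ways: escape sequence, octal escape, hex escape, and chr() with the decimal code:

# All four expressions denote the character with code 13:
assert "\r" == "\015" == "\x0d" == chr(13)
print(ord("\r"))   # 13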
Does anybody know the ASCII equivalent of 80(hexadecimal)? Does it even exist? I was just wondering, the table only goes up to 7F.
No.
ASCII is by definition a 7-bit character code, with encodings from 0 to 127 (0x7F). Anything outside that range is not ASCII.
There are a number of 8-bit and wider character codes based on ASCII (sometimes, with questionable accuracy, called "extended ASCII") that assign some meaning to 0x80. For example, both Latin-1 and Unicode treat 0x80 as a control character, while Windows-1252 uses it for the Euro symbol.
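A small sketch in Python 3 (my own illustration, not from the answer) showing the same byte 0x80 interpreted under different character sets:

raw = bytes([0x80])
try:
    raw.decode("ascii")             # outside ASCII's 0-127 range
except UnicodeDecodeError as err:
    print("not ASCII:", err)
print(repr(raw.decode("latin-1")))  # '\x80' -> a control character
print(raw.decode("cp1252"))         # '€' -> the Euro sign in Windows-1252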
I just have a question to make sure I get something well.
If I used my computer to sum 10+11, which is 21, it will store 21, usually in a byte, as 0001 0101. However, when it prints it on the screen, it will actually represent it as the two digits 2 (0110010) and 1 (0110001) appended to each other to form "21", using ASCII.
Is that right?
Thank you!
That is correct.
The representation of characters in a simple terminal is ASCII, where each character is represented by a (technically 7-bit) code.
Some terminals support more complex encodings like UTF-8, but since UTF-8 is backwards compatible with ASCII, you need not worry about it.
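A tiny Python 3 sketch (my own, for illustration) of the difference between the stored number and its printed digits:

n = 10 + 11                       # stored as the binary value 0b10101 (21)
print(bin(n))                     # 0b10101
s = str(n)                        # the characters '2' and '1'
print([bin(ord(c)) for c in s])   # ['0b110010', '0b110001'] -> ASCII codes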
I want to print the character '½' in a file. I searched for the ASCII value of it, since Alt+(ASCII value) should produce the same symbol. To my surprise, I found two ASCII values for this symbol on various sites: one is 171 and the other is 189.
I tried to write this symbol using both 171 and 189. Again to my surprise, on Windows, 171 gives me this symbol, but on UNIX, 189 gives me this symbol.
I was aware that there can't be two ASCII values for the same symbol, yet I got two valid codes for the same symbol on different operating systems. So can anyone tell me what the real ASCII code for the symbol ½ is?
½ is not a character in the ASCII character set.
The values you're finding online probably differ because they're using different character sets. For example, before Unicode was invented, localized versions of Windows all used different code pages, in which the basic ASCII set was extended with some additional characters.
Now, of course, everything is (or should be) fully Unicode. Detailed Unicode information for that character (vulgar fraction one half) can be found here. Note that there are also multiple representations for the same numerical value (e.g., base 10, hex, binary, etc.).
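For example, Python's standard unicodedata module can show that detail (a sketch of my own, not part of the original answer):

import unicodedata
ch = "\u00bd"                      # ½
print(unicodedata.name(ch))        # VULGAR FRACTION ONE HALF
print(unicodedata.numeric(ch))     # 0.5
print(hex(ord(ch)), ord(ch))       # 0xbd 189 -> same value, two notations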
In Windows, if you use the ALT codes,
3 digits will insert the equivalent "Code page 850" character,
so ALT + 171 will insert the ½ symbol.
189 (0xBD) is the ANSI/Windows-1252/ISO-8859-1 value for the ½ symbol, which is also its Unicode code point (U+00BD).
To use ALT codes for ANSI you MUST press 0 first,
so ALT + 0189 inserts the ½ symbol
Please read the ASCII Wikipedia page. You'll learn that ASCII has no "one half" character.
These days, most systems can be configured to use UTF-8 encoding (which is the "default" or at least the most commonly used encoding on the Web and on Unix systems).
UTF-8 is a variable-length encoding for Unicode, so many characters or glyphs are represented by several bytes. For ½ (officially the VULGAR FRACTION ONE HALF Unicode character), the UTF-8 encoding is the two hex bytes 0xC2 0xBD, i.e. "\302\275" in C octal notation.
I am using the Linux Gnome Character Map utility gucharmap to find all that.
You might be interested in UTF-32 (a fixed-length encoding using 32-bit units, in which ½ is represented by 0x000000BD), or even UTF-16, in which many characters fit in a single 16-bit unit (in particular, ½ is 0x00BD, i.e. one 16-bit unit in UTF-16), but not all. You may also be interested in wide characters, i.e. the wchar_t of recent C and C++ standards (which is 16-bit on Windows and 32-bit on many Unix systems).
FWIW, Qt uses QChar as UTF-16 (Java also has a UTF-16 char ...), but Gtk uses UTF-8, i.e. variable-length characters.
Notice that with variable-length encodings like UTF-8, getting the N-th character (which is not the N-th byte!) of a string requires scanning the string. Also, some byte combinations are not valid UTF-8.
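A short Python 3 sketch (my own illustration) comparing these encodings for ½:

s = "\u00bd"                      # ½
print(s.encode("utf-8"))          # b'\xc2\xbd' -> the two bytes 0xC2 0xBD
print(s.encode("utf-16-be"))      # b'\x00\xbd' -> one 16-bit unit, 0x00BD
print(s.encode("utf-32-be"))      # b'\x00\x00\x00\xbd' -> 0x000000BD
print(len(s), len(s.encode("utf-8")))   # 1 code point, 2 bytes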
As others have pointed out: it's not in the ASCII table (values 0..127).
But it has a Unicode code point:
½ U+00BD Vulgar Fraction One Half
It can also be put into text using the Unicode character U+2044 FRACTION SLASH:
where your text contains the three code points 1, ⁄, and 2,
and a capable renderer composes them into the fraction 1⁄2.
This has the virtue of working for any fractions:
1⁄2
3⁄5
22⁄7
355⁄113
355⁄113 - 1⁄3748629
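A minimal Python 3 sketch (the helper name is my own invention) composing such fractions with U+2044:

def vulgar(num, den):
    # num ⁄ den with the FRACTION SLASH between the code points
    return f"{num}\u2044{den}"

print(vulgar(1, 2))        # 1⁄2
print(vulgar(355, 113))    # 355⁄113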
I am quite sure that it is part of the extended ASCII (code page) table, at least:
In Windows, ensure 'NumLock' is on, then try [ALT + (NumPad)171] = ½.
For ¼ use [ALT + 172].
I was teaching C to my younger brother, who is studying engineering. I was explaining to him how different data types are actually stored in memory. I explained the reasoning behind signed/unsigned numbers and the representation of floating-point numbers. While telling him about the char type in C, I also took him through the ASCII code system and how a char is stored as a 1-byte number.
He asked me why 'A' was given ASCII code 65 and not anything else, and similarly why 'a' was given code 97 specifically. Why is there a gap of 6 ASCII codes between the range of capital letters and small letters? I had no idea. Can you help me understand this? It has created great curiosity in me as well, and I've never found any book that discusses this topic.
What is the reason behind this? Are ASCII codes logically organized?
There are historical reasons, mainly to make ASCII codes easy to convert:
Digits (0x30 to 0x39) share the binary prefix 011:
0 is 0110000
1 is 0110001
2 is 0110010
etc.
So if you wipe out the prefix (the leading 011), you end up with the digit in binary-coded decimal.
Capital letters have the binary prefix 100:
A is 1000001
B is 1000010
C is 1000011
etc.
Same thing: if you remove the prefix, you end up with alphabet-indexed characters (A is 1, Z is 26, etc.).
Lowercase letters have the binary prefix 110:
a is 1100001
b is 1100010
c is 1100011
etc.
Same as above. So if you add 32 (0100000) to a capital letter, you get the lowercase version.
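A short Python sketch (illustrative, not from the original answer) of the conversions these prefixes make possible:

print(ord('7') & 0b0001111)         # 7   -> strip the 011 prefix (BCD)
print(ord('C') & 0b0011111)         # 3   -> alphabet index of 'C'
print(chr(ord('A') | 0b0100000))    # 'a' -> set bit 5 to get lowercase
print(chr(ord('q') & ~0b0100000))   # 'Q' -> clear bit 5 to get uppercase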
The ASCII chart from Wikipedia shows this quite well: notice the two columns of control codes, two of upper case, two of lower case, and the gaps filled in with miscellaneous symbols.
Also bear in mind that ASCII was developed based on what had passed before. For more detail on the history of ASCII, see this superb article by Tom Jennings, which also includes the meaning and usage of some of the stranger control characters.
Here is a very detailed history and description of ASCII codes: http://en.wikipedia.org/wiki/ASCII
In short:
ASCII is based on teleprinter encoding standards
the first 32 characters are "non-printable" control characters, used for text formatting
then come the printable characters, roughly in the order they are placed on a keyboard. Check your keyboard:
space,
the shifted symbols on the number keys: !, ", #, ...,
numbers,
symbols usually placed at the end of the number row (shifted),
capital letters, alphabetically,
symbols usually placed at the ends of the letter rows (shifted),
small letters, alphabetically,
symbols usually placed at the ends of the letter rows (unshifted)
The distance between A and a is 32. That's quite a round number, isn't it?
The gap of 6 codes between the capital letters and the small letters is because 32 − 26 = 6 (there are 26 letters in the English alphabet, but the two cases sit 32 codes apart).
If you look at the binary representations of 'a' and 'A', you'll see that they differ by only one bit, which is pretty useful (turning upper case into lower case, or vice versa, is just a matter of flipping a bit). As to why the alphabet starts at that particular position, I have no idea.
'A' is 0x41 in hexadecimal.
'a' is 0x61 in hexadecimal.
'0' through '9' are 0x30 to 0x39 in hexadecimal.
So at least it is easy to remember the numbers for A, a, and 0-9. I have no idea about the symbols. See the Wikipedia article on ASCII ordering.
Wikipedia:
The code itself was structured so that most control codes were together, and all graphic codes were together. The first two columns (32 positions) were reserved for control characters.[14] The "space" character had to come before graphics to make sorting algorithms easy, so it became position 0x20.[15] The committee decided it was important to support upper case 64-character alphabets, and chose to structure ASCII so it could easily be reduced to a usable 64-character set of graphic codes.[16] Lower case letters were therefore not interleaved with upper case. To keep options open for lower case letters and other graphics, the special and numeric codes were placed before the letters, and the letter 'A' was placed in position 0x41 to match the draft of the corresponding British standard.[17] The digits 0–9 were placed so they correspond to values in binary prefixed with 011, making conversion with binary-coded decimal straightforward.