What is the ASCII Code of ½? - windows

I want to print the character '½' to a file. I searched for its ASCII value, since Alt+(ASCII value) should produce the character. To my surprise I found two ASCII values for this symbol on various sites: one is 171 and the other is 189.
I tried to write this symbol using both 171 and 189. Again to my surprise, on Windows 171 gives me this symbol, but on UNIX 189 gives me this symbol.
I was aware that there can't be two ASCII values for the same symbol, yet I got two valid codes for the same symbol on different operating systems. So can anyone tell me what the real ASCII code for the symbol ½ is?

½ is not a character in the ASCII character set.
The values you're finding online probably differ because they're using different character sets. For example, before Unicode was invented, localized versions of Windows all used different code pages, in which the basic ASCII set was extended with some additional characters.
Now, of course, everything is (or should be) fully Unicode. Detailed Unicode information for that character, VULGAR FRACTION ONE HALF (U+00BD), can be found in the Unicode code charts. Note that the same numerical value also has multiple representations (decimal, hex, binary, etc.).

In Windows, if you use the ALT codes:
a 3-digit code inserts the character from the OEM code page (437 in the US, 850 in much of Western Europe; both put ½ at 171),
so ALT + 171 inserts the ½ symbol.
189 (0xBD) is the value of ½ in the "ANSI" code page Windows-1252 (and in ISO 8859-1), which matches the Unicode code point U+00BD; in UTF-8 it becomes the two bytes 0xC2 0xBD.
To use ALT codes with the ANSI code page you MUST press 0 first,
so ALT + 0189 inserts the ½ symbol.
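To see where 171 and 189 come from, here is a minimal sketch (Python is used here purely as a convenient way to query the code page tables; it is not part of the original question) that prints the byte value of ½ in the relevant code pages:

    # The same ½ character has different byte values in different
    # single-byte code pages, which is why ALT+171 and ALT+0189 both work.
    for codec in ("cp437", "cp850", "cp1252", "latin-1"):
        byte = "½".encode(codec)[0]
        print(f"{codec:>8}: {byte} (0x{byte:02X})")

    # Expected output:
    #    cp437: 171 (0xAB)
    #    cp850: 171 (0xAB)
    #   cp1252: 189 (0xBD)
    #  latin-1: 189 (0xBD)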

Please read the ASCII wikipedia page. You'll learn that ASCII has no "one half" character.
These days, most systems can be configured to use UTF-8 encoding (which is the "default" or at least the most commonly used encoding on the Web and on Unix systems).
UTF-8 is a variable-length encoding for Unicode, so many characters or glyphs are represented by several bytes. For ½ (officially the VULGAR FRACTION ONE HALF Unicode character, U+00BD) the UTF-8 encoding is the two hex bytes 0xC2 0xBD, i.e. "\302\275" in C notation.
I am using the Linux GNOME Character Map utility gucharmap to find all of that.
You might be interested in UTF-32 (a fixed-length encoding using 32-bit code units, in which ½ is represented by 0x000000BD), or in UTF-16, in which many characters fit in a single 16-bit code unit (in particular ½ is 0x00BD, i.e. one 16-bit code unit in UTF-16), but not all of them do. You may also be interested in wide characters, i.e. the wchar_t of recent C and C++ standards (which holds UTF-16 code units on Windows and UTF-32 on many Unix systems).
FWIW, Qt's QChar is a UTF-16 code unit (Java's char is also UTF-16 ...), while GTK uses UTF-8, i.e. variable-length characters.
Notice that with a variable-length encoding like UTF-8, getting the N-th character (which is not the N-th byte!) of a string requires scanning the string. Also, some byte combinations are not valid UTF-8.
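As a quick illustration, here is a small Python sketch (only to show the byte sequences; any language with Unicode support would do):

    s = "½"                              # U+00BD
    print(s.encode("utf-8"))             # b'\xc2\xbd'  -> two bytes, \302\275 in C
    print(s.encode("utf-16-le"))         # b'\xbd\x00'  -> one 16-bit code unit
    print(s.encode("utf-32-le"))         # b'\xbd\x00\x00\x00'

    # With a variable-length encoding, byte index != character index:
    text = "a½b"
    print(len(text))                     # 3 characters
    print(len(text.encode("utf-8")))     # 4 bytes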

As others have pointed out: it's not in the ASCII table (values 0..127).
But it has a Unicode code of:
½ U+00BD Vulgar Fraction One Half
It can also be put into text using the U+2044 FRACTION SLASH character:
your text then contains the three code points 1, U+2044, 2 ("1⁄2"),
and a font with fraction support renders them as a composed fraction.
This has the virtue of working for any fraction (see the sketch after these examples):
1⁄2
3⁄5
22⁄7
355⁄113
355⁄113 - 1⁄3748629
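If you want to generate such fractions programmatically, a minimal sketch (Python; the helper name vulgar() is made up for illustration):

    # Compose arbitrary fractions with U+2044 FRACTION SLASH; a font with
    # fraction support may render them as composed fractions.
    def vulgar(numerator: int, denominator: int) -> str:
        return f"{numerator}\u2044{denominator}"

    print(vulgar(1, 2))       # 1⁄2  (three code points: '1', U+2044, '2')
    print(vulgar(355, 113))   # 355⁄113
    print("\u00bd")           # ½    (a single precomposed code point)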

As a practical shortcut (this works because of the OEM code page, not because ½ is part of the 7-bit ASCII table):
In Windows, ensure 'NumLock' is on, then try [ALT + (NumPad)171] = ½.
For ¼ use [ALT + 172].

Related

Does an ascii equivalent of 0x80 exist?

Does anybody know the ASCII equivalent of 80 (hexadecimal)? Does it even exist? I was just wondering, since the table only goes up to 0x7F.
No.
ASCII is by definition a 7-bit character code, with encodings from 0 to 127 (0x7F). Anything outside that range is not ASCII.
There are a number of 8-bit and wider character codes based on ASCII (sometimes, with questionable accuracy, called "extended ASCII") that assign some meaning to 0x80. For example, both Latin-1 and Unicode treat 0x80 as a control character, while Windows-1252 uses it for the Euro symbol.
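A small sketch of that difference (Python is used here only to exercise the codecs):

    # The byte 0x80 has no ASCII meaning, and "extended ASCII"
    # character sets disagree about what it is.
    b = bytes([0x80])
    print(b.decode("cp1252"))          # '€'    (Euro sign in Windows-1252)
    print(repr(b.decode("latin-1")))   # '\x80' (a C1 control character)
    try:
        b.decode("ascii")
    except UnicodeDecodeError as e:
        print(e)                       # ordinal not in range(128)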

Typing ALT + 251 and ALT + 0251 at the keyboard produce different character entries

In Windows:
when I press Alt + 251, I get a √ character
when I press Alt + 0251, I get a û character!
A leading zero shouldn't have any value.
Actually, I want to get a check mark (√) from the Chr(251) function in a Client Report Definition (RDLC), but it gives me û!
I think it interprets the four digits as hex, not decimal.
Using a leading zero forces Windows to interpret the code in the Windows-1252 set. Without the 0, the code is interpreted using the OEM set.
Alt+251:
You'll get √, because you'll use OEM 437, where 251 is for square root.
I'll get ¹, because I'll use OEM 850, where 251 is for superscript 1.
Alt+0251:
Both of us will get û, because we'll use Windows-1252, where 251 is for u-circumflex.
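You can reproduce those three interpretations of byte 251 outside the keyboard, for example with this small Python sketch (just a convenient way to query the code page tables):

    b = bytes([251])                 # 0xFB
    print(b.decode("cp437"))         # '√'  (OEM 437, US command line)
    print(b.decode("cp850"))         # '¹'  (OEM 850, Western Europe)
    print(b.decode("cp1252"))        # 'û'  (the Windows "ANSI" code page)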
This is historical.
From ASCII to Unicode
In the early days of DOS/Windows, characters were one byte wide and came from the American alphabet; the byte-to-character mapping was the ASCII encoding.
Additional characters were needed as soon as the PC was used outside the US (many languages use accents, for instance). So different codepages were designed, and different encoding tables were used for the conversion.
But a computer in the US wouldn't use the same codepage as one in Spain. This required the user and the programmer to know which codepage was currently active, and this was a great period in the history of computing...
Around the same time it was determined that one byte was not going to be enough: more than 256 characters needed to be available at the same time. A universal character set and several encoding systems were designed by a consortium and are collectively known as Unicode.
In Unicode encodings such as UTF-8, "characters" can be one to four bytes wide, and the number of bytes per character may vary within the same string.
Other notions have been introduced, such as codepoint and glyph to deal with the complexity of written language.
While Unicode was being adopted as a standard, Windows retained the old one-byte codepages for efficiency, simplicity and backward compatibility. Windows also added codepages to deal with glyphs found only in Unicode.
Windows has:
A default OEM codepage, which is usually 437 in the US -- your case -- or 850 in Europe -- my case --, used with the command line ("DOS"),
the Windows-1252 codepage (often called Latin-1 or ISO 8859-1, though that is a misuse), used to ease conversion to/from Unicode. The current tendency is to replace all such extended codepages with Unicode; Java's designers made the drastic decision to use only Unicode to represent strings.
When entering a character with the Alt method, you need to tell Windows which codepage you want to use for its interpretation:
No leading zero: You want the OEM codepage to be used.
Leading zero: You want the Windows codepage to be used.
Note on OEM codepages
OEM codepages are so called because on the first PC/PC-compatible computers the display of characters was hard-wired, not done in software. The computer had a character generator with a fixed encoding and graphical definitions stored in a ROM. The BIOS would send a byte and a position (line, column) to the generator, and the generator would draw the corresponding glyph at that position. This was called "text mode" at the time.
A computer sold in the US would have a different character ROM than one sold in Germany. This really depended on the manufacturer, and the BIOS was able to read the value of the installed codepage(s).
Later the generation of glyphs became software-based, to handle unlimited fonts, styles, and sizes. It became possible to define a set of glyphs and its corresponding encoding table at the OS level; this combination could be used on any computer, independently of the installed OEM character generator.
Software-generated glyphs started with VGA display adapters; the code required for drawing glyphs was part of the VGA driver.
As explained above, 0251 is not simply the number 251 with a meaningless leading zero; it is a different character code.
A leading zero has no value in arithmetic, but in an Alt code it selects a different codepage, which is why 251 and 0251 produce different characters.

ASCII characters set

I am reading a .txt file whose first line contains just the four letters "abcd".
When I display the character codes of these letters, I expect to find 97, 98, 99 and 100 for a, b, c and d. But I found two extra characters, with codes 255 and 254, for ÿ and þ.
Therefore the length of the line read is 6, not 4, because of "ÿþabcd". Must these special characters be inserted at the start of every text file, or is there a way to avoid them?
ASCII is only used in niche or archaic systems. Your data proved to you that your file is not ASCII. You must find out which character set and encoding the file was stored in.
Character Sets
All text is an encoding of elements of a character set. Elements of a character set are called codepoints. A character set consists of a list of codepoints and their descriptions. The description states how the codepoint is used semantically in text, such as LATIN CAPITAL LETTER A (A) or N-ARY PRODUCT (∏). (The style in which the codepoint is rendered is the purview of typefaces/fonts.)
Encodings
Codepoints are numbered with non-negative integers. The number is encoded into bytes. Most character sets have only one encoding: the number as an unsigned integer in the smallest size that can represent all of the codepoints. For example, Windows-1252 has 251 codepoints, with numbers between 0 and 255, so a single byte is big enough to represent any of them. The Unicode character set has about 1.1 million codepoints, numbered from 0 to 1,114,111 (0x10FFFF). A 32-bit integer is big enough to represent all of them; that's the UTF-32 encoding.
Byte order
Computer memory is usually byte-addressable and files are byte sequences, so for a larger integer the question becomes: in which order are the bytes stored? Most significant byte first (big endian) or least significant byte first (little endian)? Software adapts to, or assumes, one or the other. So UTF-32 actually identifies one of two encodings: UTF-32BE or UTF-32LE; plain "UTF-32" is shorthand for whichever endianness the software assumes. Typically, the OS assumes the endianness of the hardware it runs on, and programs follow suit.
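For example, a minimal sketch (Python, chosen only because its codecs make the bytes easy to see) of the same codepoint under both byte orders:

    import struct

    cp = ord("½")                          # 189 == 0xBD
    print(struct.pack(">I", cp).hex())     # '000000bd'  (big endian)
    print(struct.pack("<I", cp).hex())     # 'bd000000'  (little endian)

    # The UTF-32 codecs expose the same two layouts directly:
    print("½".encode("utf-32-be").hex())   # '000000bd'
    print("½".encode("utf-32-le").hex())   # 'bd000000'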
Unicode Encodings
UTF-32 takes a lot of space. The most commonly used codepoints are numbered below 65,536, so there can be savings if codepoints are represented by a variable number of smaller integers. The size of that integer is called a code unit. Each code unit carries some of the bits of the codepoint and indicates whether more code units follow that carry more of the bits. So there are UTF-16LE, UTF-16BE and UTF-8 (and more) encodings for Unicode. UTF-16 uses one or two 16-bit code units per codepoint, and UTF-8 uses one to four 8-bit code units per codepoint.
Files are data outside of programs, so for a program to read text, it has to know the character set and encoding. Often this metadata is not stored with the file (either within it or beside it). That's how you made the mistake of believing your file is ASCII. If you don't know the encoding of a file, you've effectively lost data; you might be able to recover it through guessing. It is notable that the CP437 character set has 256 codepoints, numbered 0 to 255 and encoded in one byte each, so every file can be read as CP437. The question is: is that right? Even if it looks right, it's probably not right unless the file comes from a Western culture circa 1990.
Unicode Byte-order Mark
A strong clue about which character set and encoding to guess is called the byte-order mark (BOM). Recall that encodings with code units larger than one byte have an endianness. Endianness is a hardware concern, so although a file can be passed between systems with agreement on which character set and encoding scheme is used, the endianness aspect of the encoding is critical to each system. It has become standard to indicate the byte order within the file itself, as its first bytes. Unicode specifies a codepoint (U+FEFF) to use for this purpose, as long as it appears at the very beginning of the file. (That means that programs reading Unicode from a file must separate this metadata from the data.) Many file-writing libraries write the BOM codepoint regardless of the code unit size, so you'll also see it at the beginning of UTF-8 files. Since the Unicode BOM looks different in each of the Unicode encodings, its byte sequence completely identifies which Unicode encoding is being used.
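For reference, the BOM byte sequences per encoding (a sketch that just prints the constants from Python's codecs module):

    import codecs

    print(codecs.BOM_UTF8.hex())       # 'efbbbf'
    print(codecs.BOM_UTF16_LE.hex())   # 'fffe'
    print(codecs.BOM_UTF16_BE.hex())   # 'feff'
    print(codecs.BOM_UTF32_LE.hex())   # 'fffe0000'
    print(codecs.BOM_UTF32_BE.hex())   # '0000feff'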
Guessing
Your file begins with the UTF-16LE BOM. Read it as UTF-16LE (and discard the BOM codepoint if your library doesn't already.)
Given the specificness of the Unicode BOM, its presence is a strong indicator that the file is encoded in Unicode and the actual bytes tell which Unicode encoding. However, as noted above, it's possible that this guess is wrong.
As @Lưu Vĩnh Phúc points out, it is unclear how you are reading "ÿþabcd" from what you describe as a 6-character line. Open the file in a hex editor: as UTF-16LE it should be FF FE 61 00 62 00 63 00 64 00.
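A minimal sketch of reading such a file (Python; the file name input.txt is hypothetical, and the 'utf-16' codec consumes the BOM and picks the byte order for you):

    with open("input.txt", "rb") as f:
        raw = f.read()
    print(raw.hex())                # expected: fffe6100620063006400

    text = raw.decode("utf-16")     # BOM-aware decode
    print(text, len(text))          # abcd 4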

How many characters can UTF-8 encode?

If UTF-8 is 8 bits, does it not mean that there can be only a maximum of 256 different characters?
The first 128 code points are the same as in ASCII. But it says UTF-8 can support up to a million characters?
How does this work?
UTF-8 does not use one byte all the time; it uses 1 to 4 bytes.
The first 128 characters (US-ASCII) need one byte.
The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.
Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean (CJK) characters.
Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
source: Wikipedia
UTF-8 uses 1-4 bytes per character: one byte for ASCII characters (the first 128 Unicode values are the same as ASCII), which only requires 7 bits. If the highest bit is set, this indicates the start of a multi-byte sequence; the number of consecutive high bits set indicates the number of bytes, then a 0 bit, and the remaining bits contribute to the value. For the continuation bytes, the highest two bits are 1 and 0, and the remaining 6 bits are for the value.
So a four-byte sequence begins with 11110... (with 3 bits for the value), followed by three bytes with 6 value bits each, yielding a 21-bit value. 2^21 exceeds the number of Unicode characters, so all of Unicode can be expressed in UTF-8.
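Here is a hand-rolled sketch of exactly that bit layout (Python, checked against the built-in codec; it ignores error cases such as surrogates, so it is illustrative rather than a real encoder):

    def utf8_encode(cp: int) -> bytes:
        """Encode one code point following the UTF-8 bit patterns above."""
        if cp < 0x80:                                   # 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                                  # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:                                # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        return bytes([0xF0 | (cp >> 18),                # 11110xxx + 3 continuation bytes
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

    for ch in "A½€😀":
        assert utf8_encode(ord(ch)) == ch.encode("utf-8")
        print(ch, utf8_encode(ord(ch)).hex())           # 41, c2bd, e282ac, f09f9880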
Unicode vs UTF-8
Unicode resolves code points to characters. UTF-8 is a storage mechanism for Unicode. Unicode has a spec. UTF-8 has a spec. They both have different limits. UTF-8 has a different upwards-bound.
Unicode
Unicode is organized into "planes". Each plane carries 2^16 code points, and there are 17 planes in Unicode, for a total of 17 * 2^16 code points. The first plane, plane 0 or the BMP, is special in the weight of what it carries.
Rather than explain all the nuances, let me just quote the Wikipedia article on Unicode planes.
The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment.
UTF-8
Now let's go back to the Wikipedia article on UTF-8:
The encoding scheme used by UTF-8 was designed with a much larger limit of 2^31 code points (32,768 planes), and can encode 2^21 code points (32 planes) even if limited to 4 bytes. Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32.
So you can see that you can put stuff into UTF-8 that isn't valid Unicode. Why? Because UTF-8 accommodates code points that Unicode doesn't even support.
UTF-8, even with the four-byte limitation, supports 2^21 code points, which is far more than 17 * 2^16.
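You can see the cap in practice; a quick sketch (Python, whose codecs implement the RFC 3629 limits):

    print(chr(0x10FFFF).encode("utf-8").hex())  # 'f48fbfbf' -- highest valid code point
    try:
        chr(0x110000)                           # beyond the Unicode code space
    except ValueError as e:
        print(e)                                # chr() arg not in range(0x110000)
    try:
        "\ud800".encode("utf-8")                # a lone surrogate
    except UnicodeEncodeError as e:
        print(e)                                # ... surrogates not allowed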
According to this table* UTF-8 should support:
2^31 = 2,147,483,648 characters
However, RFC 3629 restricted the possible values, so now we're capped at 4 bytes, which gives us
2^21 = 2,097,152 characters
Note that a good chunk of those characters are "reserved" for custom use, which is actually pretty handy for icon-fonts.
* Wikipedia used to show a table with 6 bytes -- they've since updated the article.
2017-07-11: Corrected for double-counting the same code point encoded with multiple bytes
2,164,864 “characters” can be potentially coded by UTF-8.
This number is 2^7 + 2^11 + 2^16 + 2^21, which comes from the way the encoding works:
1-byte chars have 7 bits for encoding
0xxxxxxx (0x00-0x7F)
2-byte chars have 11 bits for encoding
110xxxxx 10xxxxxx (0xC0-0xDF for the first byte; 0x80-0xBF for the second)
3-byte chars have 16 bits for encoding
1110xxxx 10xxxxxx 10xxxxxx (0xE0-0xEF for the first byte; 0x80-0xBF for continuation bytes)
4-byte chars have 21 bits for encoding
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (0xF0-0xF7 for the first byte; 0x80-0xBF for continuation bytes)
As you can see this is significantly larger than current Unicode (1,112,064 characters).
UPDATE
My initial calculation is wrong because it doesn't consider additional rules. See comments to this answer for more details.
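For reference, a tiny sketch (Python, nothing beyond the arithmetic above) reproducing the raw pattern count and the actual number of Unicode scalar values:

    raw_patterns = 2**7 + 2**11 + 2**16 + 2**21
    print(raw_patterns)                 # 2164864

    # Valid Unicode scalar values: U+0000..U+10FFFF minus 2048 surrogates.
    unicode_scalars = 0x110000 - 2048
    print(unicode_scalars)              # 1112064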
UTF-8 is a variable length encoding with a minimum of 8 bits per character.
Characters with higher code points will take up to 32 bits.
Quote from Wikipedia: "UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard)."
Some links:
http://www.utf-8.com/
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
http://en.wikipedia.org/wiki/UTF-8
Check out the Unicode Standard and related information, such as their FAQ entry, UTF-8 UTF-16, UTF-32 & BOM. It’s not that smooth sailing, but it’s authoritative information, and much of what you might read about UTF-8 elsewhere is questionable.
The “8” in “UTF-8” refers to the length of its code units in bits. Code units are the entities used to encode characters, not necessarily via a simple one-to-one mapping. UTF-8 uses a variable number of code units to encode a character.
The collection of characters that can be encoded in UTF-8 is exactly the same as for UTF-16 or UTF-32, namely all Unicode characters. They all encode the entire Unicode coding space, which even includes noncharacters and unassigned code points.
While I agree with mpen's figure for the current maximum number of UTF-8 codes (2,164,864; I couldn't comment on his answer directly), he is off by 2 levels if you remove the 2 major restrictions of UTF-8: the 4-byte limit and the rule that bytes 254 and 255 cannot be used (he only removed the 4-byte limit).
Starting code 254 follows the basic arrangement of starting bits (multi-bit flag set to 1, a count of 6 1's, and terminal 0, no spare bits) giving you 6 additional bytes to work with (6 10xxxxxx groups, an additional 2^36 codes).
Starting code 255 doesn't exactly follow the basic setup, no terminal 0 but all bits are used, giving you 7 additional bytes (multi-bit flag set to 1, a count of 7 1's, and no terminal 0 because all bits are used; 7 10xxxxxx groups, an additional 2^42 codes).
Adding these in gives a final maximum presentable character set of 4,468,982,745,216. This is more than all characters in current use, old or dead languages, and any believed lost languages. Angelic or Celestial script anyone?
Also there are single byte codes that are overlooked/ignored in the UTF-8 standard in addition to 254 and 255: 128-191, and a few others. Some are used locally by the keyboard, example code 128 is usually a deleting backspace. The other starting codes (and associated ranges) are invalid for one or more reasons (https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences).
Unicode is firmly tied to its transformation formats. Unicode defines a 21-bit code space, U+0000..U+10FFFF (1,114,112 code points), which is exactly the range that valid UTF-8 is restricted to. Both reserve the same surrogate range and other restricted zones of code points. ...As of June 2018 the most recent version, Unicode 11.0, contains a repertoire of 137,439 characters.
From the unicode standard. Unicode FAQ
The Unicode Standard encodes characters in the range U+0000..U+10FFFF,
which amounts to a 21-bit code space.
From the UTF-8 Wikipedia page. UTF-8 Description
Since the restriction of the Unicode code-space to 21-bit values in
2003, UTF-8 is defined to encode code points in one to four bytes, ...

What is a multibyte character set?

Does the term multibyte refer to a character set whose characters can - but don't have to be - wider than 1 byte (e.g. UTF-8), or does it refer to character sets whose characters are always wider than 1 byte (e.g. UTF-16)? In other words: what is meant when somebody talks about multibyte character sets?
The term is ambiguous, but in my internationalization work we typically avoided the term "multibyte character set" for Unicode-based encodings. Generally, we used the term only for legacy encoding schemes in which a character could take more than one byte (that is, excluding encodings that require exactly one byte per character).
Shift-JIS, JIS, EUC-JP and EUC-KR, along with various Chinese encodings, are typically included.
Most of the legacy encodings, with some exceptions, require a sort of state machine model (or, more simply, a page swapping model) to process, and moving backwards in a text stream is complicated and error-prone. UTF-8 and UTF-16 do not suffer from this problem, as UTF-8 can be tested with a bitmask and UTF-16 can be tested against a range of surrogate pairs, so moving backward and forward in a non-pathological document can be done safely without major complexity.
A few legacy encodings, for languages like Thai and Vietnamese, have some of the complexity of multibyte character sets but are really just built on combining characters, and aren't generally lumped in with the broad term "multibyte."
What is meant if anybody talks about multibyte character sets?
That, as usual, depends on who is doing the talking!
Logically, it should include UTF-8, Shift-JIS, GB etc.: the variable-length encodings. UTF-16 would often not be considered in this group (even though it kind of is, what with the surrogates; and certainly it's multiple bytes when encoded into bytes via UTF-16LE/UTF-16BE).
But in Microsoftland the term would more typically be used to mean a variable-length default system codepage (for legacy non-Unicode applications, of which there are sadly still plenty). In this usage, UTF-8 and UTF-16LE/UTF-16BE cannot be included because the system codepage on Windows cannot be set to either of these encodings.
Indeed, in some cases “mbcs” is no more than a synonym for the system codepage, otherwise known (even more misleadingly) as “ANSI”. In this case a “multibyte” character set could actually be something as trivial as cp1252 Western European, which only uses one byte per character!
My advice: use “variable-length” when you mean that, and avoid the ambiguous term “multibyte”; when someone else uses it you'll need to ask for clarification, but typically someone with a Windows background will be talking about a legacy East Asian codepage like cp932 (Shift-JIS) and not a UTF.
All character sets where you don't have a 1 byte = 1 character mapping. All Unicode variants, but also Asian character sets, are multibyte.
For more information, I suggest reading this Wikipedia article.
A multibyte character means a character whose encoding requires more than 1 byte. This does not imply, however, that all characters in that encoding have the same width in bytes. E.g., UTF-8 and UTF-16 encoded characters may use multiple bytes, whereas every UTF-32 encoded character always uses 32 bits.
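A short sketch of those widths (Python, used only to print the encoded lengths; UTF-16/UTF-32 lengths shown without a BOM):

    for ch in ("a", "é", "€", "😀"):
        print(ch,
              len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes
              len(ch.encode("utf-16-le")),  # 2, 2, 2, 4 (surrogate pair for the emoji)
              len(ch.encode("utf-32-le")))  # 4, 4, 4, 4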
References:
IBM: Multibyte Characters
Unicode and MultiByte Character Set (archived), Unicode and Multibyte Character Set (MBCS) Support | Microsoft Docs
Unicode Consortium Website
A multibyte character set may consist of both one-byte and two-byte
characters. Thus a multibyte-character string may contain a mixture of
single-byte and double-byte characters.
Ref: Single-Byte and Multibyte Character Sets
UTF-8 is multi-byte, which means that each English (ASCII) character is stored in 1 byte, while a non-English character such as Chinese or Thai is stored in 3 bytes. When you mix Chinese/Thai with English, as in "ทt", the first (Thai) character "ท" uses 3 bytes while the second (English) character "t" uses only 1 byte. The people who designed multi-byte encodings realized that an English character shouldn't be stored in 3 bytes when it fits in 1, because that wastes storage space.
UTF-16 stores each character, English or not, in a fixed 2-byte code unit (characters outside the Basic Multilingual Plane need two such units, a surrogate pair), so it is usually described as a wide-character rather than a multi-byte encoding. It suits Chinese/Thai text, where each character fits in 2 bytes, but printing to a UTF-8 console requires a conversion from the wide-character to the multi-byte format, for example with the C function wcstombs().
UTF-32 stores each character in a fixed 4 bytes, but it is rarely used for storage because of the wasted space.
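The byte counts for the mixed string above, as a minimal sketch (Python, used only to show the lengths):

    s = "ทt"
    print(len(s.encode("utf-8")))       # 4  (3 bytes for 'ท' + 1 for 't')
    print(len(s.encode("utf-16-le")))   # 4  (2 bytes per character)
    print(len(s.encode("utf-32-le")))   # 8  (4 bytes per character)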
Typically the former, i.e. UTF-8-like. For more info, see Variable-width encoding.
The former - although the term "variable-length encoding" would be more appropriate.
I generally use it to refer to any encoding in which a character can take more than one byte.
