What is the UTF-8 code page equivalent of ASCII 128 to 160?

The UTF-8 code page doesn't include the characters in the range 128 to 160 of Extended ASCII.
What are the equivalent codes in UTF-8 for these characters,
such as Bullet, En dash, Em dash, etc.?

Try this page. You can search by name, code, or almost anything else you want.
http://www.fileformat.info/info/unicode/char/search.htm
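As a rough illustration, the lookup can also be done in code. This Python sketch assumes that "Extended ASCII" here means Windows-1252 (the question does not say which code page is meant), and the helper name describe is made up for the example. It decodes one byte from that code page and reports the Unicode code point, the UTF-8 bytes, and the character name.

```python
import unicodedata

def describe(byte_value):
    """Return (code point, UTF-8 bytes as hex, Unicode name) for one cp1252 byte.

    Assumption: the byte comes from Windows-1252. A few values in this range
    (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned there and raise
    UnicodeDecodeError in Python.
    """
    char = bytes([byte_value]).decode("cp1252")
    code_point = f"U+{ord(char):04X}"
    utf8_hex = char.encode("utf-8").hex(" ").upper()
    return code_point, utf8_hex, unicodedata.name(char)

# Bullet, en dash, em dash in Windows-1252
for b in (0x95, 0x96, 0x97):
    print(hex(b), *describe(b))
```

These three characters sit above U+07FF, so each takes three bytes in UTF-8 rather than one.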

Related

Special character 'Â' inserted before copyright symbol

Our source code contains a copyright at the top of every CSS file...
/* Copyright © ... */
Every time CSS files are loaded by the Firefox Style Editor, a special character is inserted before the copyright symbol...
/* Copyright Â© ... */
It adds an additional special character each time the file is loaded. I do not believe this is limited to Firefox, but that's what I use at the moment for CSS dynamic styling. It's annoying to have to delete this char every time and occasionally it gets into commits and pushed.
Question: How can the special character insertion be prevented?
Instead of using the copyright symbol itself, try using its numeric character reference:
&#169;
My advice is to open the files in Notepad++ and check the detected encoding, as displayed under the Encoding menu. I expect that it will read:
Encode in UTF-8
If so, apply Convert to UTF-8-BOM. It will prepend 3 magic bytes to your text file, making the UTF-8 encoding explicit. Save the files and see if it works.
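To make the "3 magic bytes" concrete, here is a small Python sketch (Python is used only for illustration; it is not part of the Notepad++ workflow). It relies on the standard utf-8-sig codec, which writes and strips the byte order mark.

```python
import codecs

text = "/* Copyright \u00a9 ... */"

plain = text.encode("utf-8")         # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")  # the same bytes, prefixed with the BOM

# The three bytes that mark the file as UTF-8
print(codecs.BOM_UTF8.hex(" "))
print(with_bom == codecs.BOM_UTF8 + plain)

# Decoding with utf-8-sig removes the BOM again if one is present
print(with_bom.decode("utf-8-sig") == text)
```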
Explanation
The reason for this Â to appear is that some tool is not detecting the encoding correctly and assumes it is ANSI (a.k.a. Windows-1252) or ISO 8859-1. Those one-byte encodings and UTF-8 are very much alike for normal English texts and code files. The standard ASCII set is encoded in exactly the same way. Only special characters, like the copyright symbol in your case, are encoded differently, using two, three or four bytes rather than one.
Now, the copyright symbol has bytes 0xC2 0xA9 or 11000010 10101001 in UTF-8 encoding, and byte 0xA9 in ANSI encoding.
The Latin capital letter A with circumflex (Â) has byte 0xC2 or 11000010 in ANSI encoding.
When 11000010 10101001 is encountered and interpreted as UTF-8, the first three bits of the first byte, 110, announce a two-byte sequence, and the first two bits of the second byte, 10, mark it as a continuation byte. So this is the correct UTF-8 encoding of the copyright symbol.
If, however, 11000010 10101001 is encountered and interpreted as ANSI, two separate characters are seen, Â and ©.
I think it is no coincidence that the second byte of the UTF-8 encoding of © is the same as the one-byte ANSI encoding. It looks like the Latin-1 Supplement is laid out in UTF-8 in exactly the same order as in ANSI, so for U+00A0 through U+00BF the second byte equals the ANSI byte (from U+00C0 onward the lead byte becomes 0xC3 and the second byte is the ANSI byte minus 0x40). E.g. a UTF-8 encoded
µ
would show up as
Âµ
if wrongly interpreted as ANSI.
Maybe, this was done to preserve some information about the original characters, if an encoding error were made.
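The misreading described above can be reproduced in a few lines. This is a sketch in Python, assuming the tool in question decodes as Windows-1252; the function name misread_as_ansi is invented for the example.

```python
def misread_as_ansi(text):
    """Encode as UTF-8, then (wrongly) decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

print(misread_as_ansi("\u00a9"))  # copyright sign
print(misread_as_ansi("\u00b5"))  # micro sign

# If a tool misreads and re-saves the file on every load, the damage compounds.
once = misread_as_ansi("\u00a9")
twice = misread_as_ansi(once)
print(len(once), len(twice))

# As long as nothing else touched the text, one misreading can be undone.
repaired = once.encode("cp1252").decode("utf-8")
print(repaired)
```

The growth from one load to the next matches the symptom in the question, where an extra character appears each time the file is opened.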
Check that you have set the correct charset meta tag in your HTML head:
<meta charset="UTF-8"/>

Octal, Hex, Unicode

I have a character appearing over the wire whose hex and octal values are \xb1 and \261, respectively.
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the ASCII table the character in the picture is "±":
What I don't understand:
If I try to test the same by passing "±Central Station" in the header I see it converted to "\xC2\xB1". Why?
How can I have "\xB1" or "\261" appearing over the wire instead of "\xC2\xB1"?
If I try to print "\xB1" or "\261" I never see "±" being printed. But if I print "\u00b1" it prints the desired character, I'm assuming because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded as ISO 8859-1 or, as @muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xb1 on its own is a continuation byte, not a valid UTF-8 character.
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in header I see it get converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
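The difference between the two wire formats is easy to see outside Ruby as well. The following is a Python sketch for illustration only (the answer itself is about Ruby's Encoding classes), and the header value is a shortened, assumed version of the one in the question.

```python
# Shortened, assumed header value containing U+00B1 (plus-minus sign)
header = '"\u00b1Central Station"'

as_utf8 = header.encode("utf-8")         # two bytes for the plus-minus sign
as_latin1 = header.encode("iso-8859-1")  # one byte for the plus-minus sign

print(as_utf8)
print(as_latin1)

# \261 in octal and \xB1 in hex are the same byte value
print(0o261 == 0xB1)
```

Which of the two reaches the wire depends on the encoding applied when the string is turned into bytes, not on how the character was typed.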
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming because \u00b1 is the Unicode format, but I would love it if someone could explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
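The relationship between the code point, its UTF-8 bytes, and the lone byte 0xB1 can be checked directly. This sketch uses Python rather than Ruby, purely as an illustration of the same idea.

```python
plus_minus = "\u00b1"                 # the code point U+00B1
utf8_bytes = plus_minus.encode("utf-8")
print(utf8_bytes.hex(" "))            # the two bytes used on the wire

# Hex and octal escapes describe the same two bytes
print(b"\xc2\xb1" == b"\302\261")

# A lone 0xB1 is a continuation byte with no lead byte, so strict decoding fails
lone = b"\xb1"
try:
    lone.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)

# Lenient decoding substitutes U+FFFD, the replacement character
replaced = lone.decode("utf-8", errors="replace")
print(replaced == "\ufffd")
```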
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.

If my XML document instruction specifies encoding of UTF-8 do I still need to escape characters?

I know I need to escape these in all cases:
quot "
amp &
apos '
lt <
gt >
But what about international characters that have accents, or Russian characters to name a couple. Do I need to escape characters of this type when my encoding instruction is set to UTF-8?
What if I were to set the encoding instruction to ASCII? Would I need to escape all those characters as well?
This is a sample of the XML (from a legacy system) I am trying to reproduce using Nokogiri (libxml2):
<?xml version="1.0" encoding="UTF-8"?>
<DESCRIPTION lang="rus">
<SHORT_DESCRIPTION>МОДУЛЬ- ELECTRONIC OUTPUT 120 V DC 5 mA</SHORT_DESCRIPTION>
<LONG_DESCRIPTION>МОДУЛЬ- ТИП ELECTRONIC OUTPUT ВХОД 120 V DC ВЫХОД 5 mA ИСТОЧНИК ПИТАНИЯ 120 V DC ДОПОЛНИТЕЛЬНАЯ ДЕТАЛЬ 1 ANALOG SM322-8S TOR</LONG_DESCRIPTION>
</DESCRIPTION>
You can see that the instruction in the sample says UTF-8 but they have escaped a lot of characters, characters that Nokogiri only escapes when I specify an ASCII encoding instruction. This is what is confusing me.
EDIT 2 : If I do not pass an encoding instruction to Nokogiri, the resulting XML leaves all the Russian characters in their native Cyrillic alphabet, BUT that would not be consistent with the XML I need to replicate.
You only need to represent a character with a character reference if either:
It would have special meaning in the current context (so the five characters you listed only need encoding sometimes)
It does not exist in the character encoding the file is encoded in
ASCII doesn't have many characters in it, so if you encoded your XML in ASCII you would have to use character references for many characters.
Don't encode your XML in ASCII. The default encoding for XML is UTF-8, which is very well supported.
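The effect of the output encoding can be seen with any XML serializer. The sketch below uses Python's standard xml.etree.ElementTree rather than Nokogiri, so it shows the general rule and not Nokogiri's exact output; the element text is a shortened sample with an ampersand added for illustration.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("SHORT_DESCRIPTION")
# Shortened sample text; the ampersand is added only to show markup escaping
elem.text = "\u041c\u041e\u0414\u0423\u041b\u042c- ELECTRONIC OUTPUT & more"

# ASCII cannot hold Cyrillic, so the serializer falls back to character references
ascii_xml = ET.tostring(elem, encoding="us-ascii")
# UTF-8 can hold every character, so only the ampersand is escaped
utf8_xml = ET.tostring(elem, encoding="utf-8")

print(ascii_xml)
print(utf8_xml.decode("utf-8"))

# Both forms parse back to the same text
same = ET.fromstring(ascii_xml).text == ET.fromstring(utf8_xml).text
print(same)
```

A parser treats both documents as identical, which is why the choice mostly matters when the output has to match a legacy file byte for byte.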

&Atilde;&copy; and other codes

I got a file full of those codes, and I want to "translate" it into normal chars (a whole file, I mean). How can I do it?
Thank you very much in advance.
Looks like you originally had a UTF-8 file which has been interpreted as an 8 bit encoding (e.g. ISO-8859-15) and entity-encoded. I say this because the sequence C3A9 looks like a pretty plausible UTF-8 encoding sequence.
You will need to first entity-decode it, then you'll have a UTF-8 encoding again. You could then use something like iconv to convert to an encoding of your choosing.
To work through your example:
&Atilde;&copy; would be decoded as the byte sequence 0xC3A9
0xC3A9 = 11000011 10101001 in binary
The leading 110 in the first octet tells us this could be interpreted as a UTF-8 two-byte sequence. As the second octet starts with 10, we're looking at something we can interpret as UTF-8. To do that, we take the last 5 bits of the first octet, and the last 6 bits of the second octet...
So, interpreted as UTF-8 it's 00011 101001 = 00011101001 = 0xE9 = é (LATIN SMALL LETTER E WITH ACUTE)
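The bit arithmetic above can be written out as a short check. This is a Python sketch of the two-byte case only; it does not validate overlong forms or handle three- and four-byte sequences.

```python
first, second = 0xC3, 0xA9

# 110xxxxx marks the lead byte of a two-byte sequence, 10xxxxxx a continuation byte
assert first >> 5 == 0b110
assert second >> 6 == 0b10

# Keep the last 5 bits of the first octet and the last 6 bits of the second
code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)

print(f"{code_point:011b}", hex(code_point), chr(code_point))
```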
You mention wanting to handle this with PHP, something like this might do it for you:
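// To load from a file, use:
// $file = file_get_contents("/path/to/filename.txt");
// The example below uses a literal string to demonstrate the technique.
$file = "Pr&Atilde;&copy;c&Atilde;&copy;dent is a French word";
// Decode each entity to a single ISO-8859-1 byte; read together, those bytes are UTF-8.
// The charset is given explicitly because the default has been UTF-8 since PHP 5.4.
$utf8 = html_entity_decode($file, ENT_QUOTES, 'ISO-8859-1');
// utf8_decode() is deprecated as of PHP 8.2; mb_convert_encoding() (mbstring extension)
// performs the same UTF-8 to ISO-8859-1 conversion.
$iso8859 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
// $utf8 contains "Précédent is a French word" in UTF-8
// $iso8859 contains "Précédent is a French word" in ISO-8859-1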
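For comparison, the same two steps (entity-decode, then reinterpret the bytes as UTF-8) can be sketched in Python. The input string is an assumed sample of entity-encoded text in which &Atilde;&copy; stands for the bytes 0xC3 0xA9; real files may mix in entities that do not fit in one byte, which this sketch does not handle.

```python
import html

# Assumed sample: UTF-8 bytes that were misread as Latin-1 and then entity-encoded
raw = "Pr&Atilde;&copy;c&Atilde;&copy;dent is a French word"

mojibake = html.unescape(raw)            # entities become the characters U+00C3, U+00A9
utf8_bytes = mojibake.encode("latin-1")  # each character back to its single byte
text = utf8_bytes.decode("utf-8")        # read the bytes the way they were meant

print(text)

# Optional last step: convert to another encoding, as iconv would
iso_bytes = text.encode("iso-8859-1")
print(len(utf8_bytes), len(iso_bytes))
```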

Can UTF-8 encoded data misread as Latin-1 produce ASCII artifacts?

UTF-8 single-byte characters map perfectly to Latin-1 (ISO 8859-1) characters (those below the character code of 128); basically the default ASCII characters.
If I have a UTF-8 encoded string and pass it to a function that expects a Latin-1 string, is there any possibility that the Latin-1 function misinterprets parts of UTF-8 multibyte characters as ASCII characters?
I imagine something like this could happen:
(imaginary) UTF-8 multibyte character: 0xA330
(mis-)interpreted by Latin-1 function as two Latin-1 characters: 0xA3 0x30
The first of those characters does not lie within the ASCII set, but the second is the ASCII code for the 0 character. Is it possible that a multibyte UTF-8 character produces an artifact that looks like a single-byte UTF-8 / ASCII character, as in the example above?
From my understanding of UTF-8, only single-byte characters contain any bytes with the most significant bit unset, so multibyte characters never contain a byte that could be misinterpreted by a Latin-1 function as a valid ASCII character (because all ASCII characters have the most significant bit unset). But I want to make sure this is true and I don't screw up on this, because it may have security implications when dealing with data sanitization, which is what I am currently doing.
You are correct in your understanding that only single byte characters contain any bytes with the most significant bit unset. There is a nice table showing this at: http://en.wikipedia.org/wiki/UTF-8#Description
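The claim can also be checked by brute force. The Python sketch below encodes every non-ASCII code point (surrogates are skipped, since they cannot be encoded as UTF-8) and looks for any byte below 0x80. It only covers well-formed UTF-8 produced by an encoder; malformed input needs separate validation.

```python
def has_ascii_byte(char):
    """True if the UTF-8 encoding of char contains a byte with the top bit unset."""
    return min(char.encode("utf-8")) < 0x80

offenders = [
    cp
    for cp in range(0x80, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF   # surrogates have no UTF-8 encoding
    and has_ascii_byte(chr(cp))
]

print(len(offenders))  # 0: no multibyte sequence contains an ASCII-range byte
```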

Resources